# Git & GitHub Best Practices

## Lab GitHub Organization

Our lab's GitHub organization is hosted at: https://github.com/CellProfiling

To get added to the organization, contact Frede at fredbn@stanford.edu.
## Best Practices

### Repository per Project

We recommend creating a separate repository for each project. Understanding when and why to create repositories is crucial for maintaining organized, reproducible, and collaborative research.
### When to Create a New Repository

Create a new repository when you are:

- **Starting a New Research Project**: Any distinct research question, experiment series, or analysis pipeline should have its own repo
  - Example: "single-cell-rna-seq-analysis", "protein-localization-study", "drug-response-prediction"
- **Developing Reusable Tools or Pipelines**: Code that will be used across multiple projects
  - Example: Custom preprocessing scripts, analysis packages, or workflow automation tools
- **Writing Papers or Reports**: Computational work associated with a specific publication
  - Example: "nature-2024-cell-profiling-paper" containing all analysis code and figures
- **Creating Lab Resources**: Shared utilities, documentation, or datasets for the lab
  - Example: "lab-protocols", "imaging-analysis-toolkit"
- **Building Software Tools**: Standalone applications or packages
  - Example: A new visualization tool or data processing library
### When NOT to Create a Repository

Avoid creating repos for:

- **Exploratory notebooks that won't be reused**: Keep these in a personal "scratch" or "experiments" repo
- **Temporary analyses**: One-off data checks or quick visualizations
- **Personal notes**: Use a separate personal knowledge base or notes system
- **Data files**: Raw data should typically be stored in appropriate data repositories (not Git)
### Why Use Repositories for Projects

1. **Reproducibility**
   - Future you (or collaborators) can reproduce results exactly
   - Clear history of what changed and when
   - Pin specific versions for publications
2. **Development Workflow**
   - Develop and test on your local machine where you have your preferred IDE and tools
   - Debug interactively without dealing with cluster job queues
   - Instantly sync updates to Sherlock or HAI clusters with `git pull`
   - No need for `scp`, `rsync`, or manual file transfers
3. **Disaster Recovery**
   - Code is backed up automatically on GitHub
   - Accidentally deleted files? Just clone again
   - Cluster file system issues? Your code is safe
4. **Collaboration**
   - Team members can contribute and review code
   - Track who made what changes and why
   - Coordinate work without overwriting each other's files
5. **Environment Consistency**
   - Same codebase across your laptop, Sherlock, HAI, and lab workstations
   - Include environment files (`requirements.txt`, `environment.yml`) for reproducible dependencies
   - Avoid "it works on my machine" problems
6. **Documentation and Organization**
   - README files explain project structure and usage
   - Issues track TODOs and bugs
   - Releases mark important milestones
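The environment files mentioned under point 5 can stay small. A sketch of what a project's `environment.yml` might look like (the package names and version pins below are illustrative, not a lab standard):

```yaml
# environment.yml -- hypothetical example; pin the versions your project actually uses
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - pip
  - pip:
      - scanpy  # pip-only packages go in this sub-list
```

Recreate the environment anywhere with `conda env create -f environment.yml`.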
### Repository Structure Best Practices

A well-organized project repository typically includes:

```
project-name/
├── README.md            # Project overview and setup instructions
├── requirements.txt     # Python dependencies
├── environment.yml      # Conda environment (if applicable)
├── data/                # Small example data or data documentation
│   └── README.md        # Where to find full datasets
├── src/                 # Source code
│   ├── preprocessing.py
│   ├── analysis.py
│   └── visualization.py
├── notebooks/           # Jupyter notebooks for exploration
├── scripts/             # Command-line scripts for cluster runs
├── tests/               # Unit tests
├── results/             # Generated outputs (may be .gitignore'd)
└── docs/                # Additional documentation
```
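To keep generated outputs and raw data out of Git, a starting-point `.gitignore` for a layout like this might be (the patterns are suggestions to adapt per project):

```gitignore
# Generated outputs and caches
results/
__pycache__/
*.pyc
.ipynb_checkpoints/

# Large or raw data -- keep the pointer README tracked
data/*
!data/README.md

# Environments and editor/OS clutter
.venv/
.DS_Store
```

The `data/*` plus `!data/README.md` pair ignores everything inside `data/` while still tracking the README that documents where the full datasets live.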
### Typical Workflow

Here's how the repository-per-project approach works in practice:

**On Your Local Machine:**

1. Create and clone your project repository
2. Develop and test code in your preferred environment
3. Commit logical changes with descriptive messages
4. Push changes to GitHub

**On Sherlock/HAI Clusters:**

5. Clone the repository once: `git clone git@github.com:CellProfiling/your-project.git`
6. When you've pushed updates from your local machine, run `git pull` to sync
7. Run experiments or analyses with the latest code
8. Optionally commit and push results or analysis updates

**Key Benefits of This Workflow:**

- No manual file transfers needed
- Always know you're running the latest code
- Easy to switch between machines
- Changes are tracked and reversible
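The push/pull loop above can be simulated end to end with plain git, using a local bare repository in place of GitHub (all paths here are throwaway temp directories, not real lab machines):

```shell
set -e
work=$(mktemp -d)

# A bare repository stands in for the GitHub remote
git init --bare -q "$work/remote.git"

# "Laptop": clone, commit, push
git clone -q "$work/remote.git" "$work/laptop" 2>/dev/null
cd "$work/laptop"
git config user.email "you@example.com"
git config user.name "You"
echo 'print("hello")' > analysis.py
git add analysis.py
git commit -qm "Add analysis script"
git push -q origin HEAD

# "Cluster": clone once, then pull to pick up new commits
git clone -q "$work/remote.git" "$work/cluster" 2>/dev/null
cd "$work/cluster"
git pull -q
cat analysis.py
```

The same commit made on the "laptop" clone is immediately available on the "cluster" clone after `git pull`, with no file copying involved.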
## Setting Up SSH Keys for GitHub

Setting up SSH authentication makes working with GitHub repositories much easier, especially when using remote clusters. You won't need to enter credentials every time you push or pull (note that GitHub no longer accepts account passwords for Git operations over HTTPS; the alternative is a personal access token).
### Step 1: Generate an SSH Key

On your local machine or remote cluster, generate a new SSH key:

```bash
ssh-keygen -t ed25519 -C "your_email@stanford.edu"
```

When prompted:

- Press Enter to accept the default file location (`~/.ssh/id_ed25519`)
- Optionally, enter a passphrase for added security (or press Enter for no passphrase)
### Step 2: Add SSH Key to ssh-agent

Start the ssh-agent and add your key:

```bash
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
```
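If you set a passphrase, an entry like the following in `~/.ssh/config` (supported since OpenSSH 7.2) loads the key into the agent automatically on first use, so you only type the passphrase once per session:

```
# ~/.ssh/config
Host github.com
    User git
    IdentityFile ~/.ssh/id_ed25519
    AddKeysToAgent yes
```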
### Step 3: Copy Your Public Key

Display and copy your public key:

```bash
cat ~/.ssh/id_ed25519.pub
```

Copy the entire output (it should start with `ssh-ed25519`).
### Step 4: Add SSH Key to GitHub

1. Go to your GitHub SSH settings (https://github.com/settings/keys)
2. Click "New SSH key"
3. Give your key a descriptive title (e.g., "Sherlock Cluster" or "Local MacBook")
4. Paste your public key into the "Key" field
5. Click "Add SSH key"
### Step 5: Test Your Connection

Verify that your SSH key is working:

```bash
ssh -T git@github.com
```

You should see a message like:

```
Hi username! You've successfully authenticated, but GitHub does not provide shell access.
```

The "does not provide shell access" part is expected: GitHub allows Git operations over SSH but not interactive logins.
### Step 6: Clone Repositories Using SSH

When cloning repositories, use the SSH URL instead of the HTTPS one:

```bash
git clone git@github.com:CellProfiling/repository-name.git
```
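If you already cloned a repository over HTTPS, you don't need to re-clone; you can repoint the existing `origin` remote at the SSH URL. Shown here in a throwaway demo repository ("repository-name" is a placeholder):

```shell
# Stand-in repo so the snippet is self-contained; in practice,
# run only the set-url line inside your existing clone.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git remote add origin https://github.com/CellProfiling/repository-name.git

# Point the existing "origin" remote at the SSH URL instead
git remote set-url origin git@github.com:CellProfiling/repository-name.git

# Verify the change
git remote -v
```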
### Multiple Machines

If you work on multiple machines (local computer, Sherlock, HAI), you should:

- Generate a separate SSH key on each machine
- Add each public key to your GitHub account with a descriptive name
- This allows you to identify and revoke access from specific machines if needed
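One way to keep per-machine keys identifiable is to embed the hostname in the key's comment. A non-interactive sketch (the temp-dir path and `-N ''`, meaning no passphrase, are for demonstration only; in practice use the default `~/.ssh/id_ed25519` and a passphrase):

```shell
# Generate a key whose comment records which machine it belongs to,
# so it is easy to spot (and revoke) in GitHub's key list later.
keyfile=$(mktemp -d)/id_ed25519   # use ~/.ssh/id_ed25519 in practice
ssh-keygen -q -t ed25519 -N '' -C "your_email@stanford.edu ($(hostname))" -f "$keyfile"
cat "$keyfile.pub"
```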