# Git & GitHub Best Practices

## Lab GitHub Organization

Our lab's GitHub organization is hosted at: https://github.com/CellProfiling

To get added to the organization, contact Frede at fredbn@stanford.edu.
## Best Practices

### Repository per Project

We recommend creating a separate repository for each project. Understanding when and why to create repositories is crucial for maintaining organized, reproducible, and collaborative research.
### When to Create a New Repository

Create a new repository when you are:

- **Starting a New Research Project**: Any distinct research question, experiment series, or analysis pipeline should have its own repo
  - Example: "single-cell-rna-seq-analysis", "protein-localization-study", "drug-response-prediction"
- **Developing Reusable Tools or Pipelines**: Code that will be used across multiple projects
  - Example: Custom preprocessing scripts, analysis packages, or workflow automation tools
- **Writing Papers or Reports**: Computational work associated with a specific publication
  - Example: "nature-2024-cell-profiling-paper" containing all analysis code and figures
- **Creating Lab Resources**: Shared utilities, documentation, or datasets for the lab
  - Example: "lab-protocols", "imaging-analysis-toolkit"
- **Building Software Tools**: Standalone applications or packages
  - Example: A new visualization tool or data processing library
### When NOT to Create a Repository

Avoid creating repos for:

- **Exploratory notebooks that won't be reused**: Keep these in a personal "scratch" or "experiments" repo
- **Temporary analyses**: One-off data checks or quick visualizations
- **Personal notes**: Use a separate personal knowledge base or notes system
- **Data files**: Raw data should typically be stored in appropriate data repositories (not Git)
### Why Use Repositories for Projects

1. **Reproducibility**
   - Future you (or collaborators) can reproduce results exactly
   - Clear history of what changed and when
   - Pin specific versions for publications
2. **Development Workflow**
   - Develop and test on your local machine where you have your preferred IDE and tools
   - Debug interactively without dealing with cluster job queues
   - Instantly sync updates to Sherlock or HAI clusters with `git pull`
   - No need for `scp`, `rsync`, or manual file transfers
3. **Disaster Recovery**
   - Code is backed up automatically on GitHub
   - Accidentally deleted files? Just clone again
   - Cluster file system issues? Your code is safe
4. **Collaboration**
   - Team members can contribute and review code
   - Track who made what changes and why
   - Coordinate work without overwriting each other's files
5. **Environment Consistency**
   - Same codebase across your laptop, Sherlock, HAI, and lab workstations
   - Include environment files (`requirements.txt`, `environment.yml`) for reproducible dependencies
   - Avoid "it works on my machine" problems
6. **Documentation and Organization**
   - README files explain project structure and usage
   - Issues track TODOs and bugs
   - Releases mark important milestones
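The environment files mentioned under point 5 can stay small. A sketch of what a project's `environment.yml` might look like (the package names and version pins below are illustrative, not a lab standard):

```yaml
# environment.yml -- hypothetical example; pin the versions your project actually uses
name: my-project
channels:
  - conda-forge
dependencies:
  - python=3.11
  - numpy
  - pandas
  - pip
  - pip:
      - scanpy  # pip-only packages go in this sub-list
```

Recreate the environment anywhere with `conda env create -f environment.yml`.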
### Repository Structure Best Practices

A well-organized project repository typically includes:

```
project-name/
├── README.md            # Project overview and setup instructions
├── requirements.txt     # Python dependencies
├── environment.yml      # Conda environment (if applicable)
├── data/                # Small example data or data documentation
│   └── README.md        # Where to find full datasets
├── src/                 # Source code
│   ├── preprocessing.py
│   ├── analysis.py
│   └── visualization.py
├── notebooks/           # Jupyter notebooks for exploration
├── scripts/             # Command-line scripts for cluster runs
├── tests/               # Unit tests
├── results/             # Generated outputs (may be .gitignore'd)
└── docs/                # Additional documentation
```
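To keep generated outputs and raw data out of Git, a starting-point `.gitignore` for a layout like this might be (the patterns are suggestions to adapt per project):

```gitignore
# Generated outputs and caches
results/
__pycache__/
*.pyc
.ipynb_checkpoints/

# Large or raw data -- keep the pointer README tracked
data/*
!data/README.md

# Environments and editor/OS clutter
.venv/
.DS_Store
```

The `data/*` plus `!data/README.md` pair ignores everything inside `data/` while still tracking the README that documents where the full datasets live.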
### Typical Workflow

Here's how the repository-per-project approach works in practice:

**On Your Local Machine:**

1. Create and clone your project repository
2. Develop and test code in your preferred environment
3. Commit logical changes with descriptive messages
4. Push changes to GitHub

**On Sherlock/HAI Clusters:**

5. Clone the repository once: `git clone git@github.com:CellProfiling/your-project.git`
6. When you've pushed updates from your local machine, run `git pull` to sync
7. Run experiments or analyses with the latest code
8. Optionally commit and push results or analysis updates

**Key Benefits of This Workflow:**

- No manual file transfers needed
- Always know you're running the latest code
- Easy to switch between machines
- Changes are tracked and reversible
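The push/pull loop above can be simulated end to end with plain git, using a local bare repository in place of GitHub (all paths here are throwaway temp directories, not real lab machines):

```shell
set -e
work=$(mktemp -d)

# A bare repository stands in for the GitHub remote
git init --bare -q "$work/remote.git"

# "Laptop": clone, commit, push
git clone -q "$work/remote.git" "$work/laptop" 2>/dev/null
cd "$work/laptop"
git config user.email "you@example.com"
git config user.name "You"
echo 'print("hello")' > analysis.py
git add analysis.py
git commit -qm "Add analysis script"
git push -q origin HEAD

# "Cluster": clone once, then pull to pick up new commits
git clone -q "$work/remote.git" "$work/cluster" 2>/dev/null
cd "$work/cluster"
git pull -q
cat analysis.py
```

The same commit made on the "laptop" clone is immediately available on the "cluster" clone after `git pull`, with no file copying involved.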
## Setting Up SSH Keys for GitHub

Setting up SSH authentication makes working with GitHub repositories much easier, especially when using remote clusters. You won't need to enter credentials every time you push or pull (note that GitHub no longer accepts account passwords for Git operations over HTTPS; the alternative is a personal access token).
### Step 1: Generate an SSH Key

On your local machine or remote cluster, generate a new SSH key:

```bash
ssh-keygen -t ed25519 -C "your_email@stanford.edu"
```

When prompted:

- Press Enter to accept the default file location (`~/.ssh/id_ed25519`)
- Optionally, enter a passphrase for added security (or press Enter for no passphrase)
### Step 2: Add SSH Key to ssh-agent

Start the ssh-agent and add your key:

```bash
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
```
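If you set a passphrase, an entry like the following in `~/.ssh/config` (supported since OpenSSH 7.2) loads the key into the agent automatically on first use, so you only type the passphrase once per session:

```
# ~/.ssh/config
Host github.com
    User git
    IdentityFile ~/.ssh/id_ed25519
    AddKeysToAgent yes
```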
### Step 3: Copy Your Public Key

Display and copy your public key:

```bash
cat ~/.ssh/id_ed25519.pub
```

Copy the entire output (it should start with `ssh-ed25519`).
### Step 4: Add SSH Key to GitHub

1. Go to your GitHub SSH settings (https://github.com/settings/keys)
2. Click "New SSH key"
3. Give your key a descriptive title (e.g., "Sherlock Cluster" or "Local MacBook")
4. Paste your public key into the "Key" field
5. Click "Add SSH key"
### Step 5: Test Your Connection

Verify that your SSH key is working:

```bash
ssh -T git@github.com
```

You should see a message like:

```
Hi username! You've successfully authenticated, but GitHub does not provide shell access.
```

The "does not provide shell access" part is expected: GitHub allows Git operations over SSH but not interactive logins.
### Step 6: Clone Repositories Using SSH

When cloning repositories, use the SSH URL instead of the HTTPS one:

```bash
git clone git@github.com:CellProfiling/repository-name.git
```
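If you already cloned a repository over HTTPS, you don't need to re-clone; you can repoint the existing `origin` remote at the SSH URL. Shown here in a throwaway demo repository ("repository-name" is a placeholder):

```shell
# Stand-in repo so the snippet is self-contained; in practice,
# run only the set-url line inside your existing clone.
demo=$(mktemp -d) && cd "$demo"
git init -q .
git remote add origin https://github.com/CellProfiling/repository-name.git

# Point the existing "origin" remote at the SSH URL instead
git remote set-url origin git@github.com:CellProfiling/repository-name.git

# Verify the change
git remote -v
```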
### Multiple Machines

If you work on multiple machines (local computer, Sherlock, HAI), you should:

- Generate a separate SSH key on each machine
- Add each public key to your GitHub account with a descriptive name
- This allows you to identify and revoke access from specific machines if needed
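One way to keep per-machine keys identifiable is to embed the hostname in the key's comment. A non-interactive sketch (the temp-dir path and `-N ''`, meaning no passphrase, are for demonstration only; in practice use the default `~/.ssh/id_ed25519` and a passphrase):

```shell
# Generate a key whose comment records which machine it belongs to,
# so it is easy to spot (and revoke) in GitHub's key list later.
keyfile=$(mktemp -d)/id_ed25519   # use ~/.ssh/id_ed25519 in practice
ssh-keygen -q -t ed25519 -N '' -C "your_email@stanford.edu ($(hostname))" -f "$keyfile"
cat "$keyfile.pub"
```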