Sherlock
Setting up account
- Email srcc-support <srcc-support@stanford.edu> to request a Sherlock account. Include your SUNet ID.
- Email srcc-support <srcc-support@stanford.edu> to request to be added to the emmalu group. CC Emma, as she will need to approve your addition before it is processed.
Jobs
Running Basic Jobs
sh_dev: Starts an interactive shell as a job with the resources you request
-Usage: sh_dev -c 4 -g 1 -m 16 -p emmalu -t 2:00:00
Usage: sh_dev [OPTIONS]
Optional arguments:
-c number of CPU cores to request (OpenMP/pthreads, default: 1)
-g number of GPUs to request (default: none)
-n number of tasks to request (MPI ranks, default: 1)
-N number of nodes to request (default: 1)
-m memory amount to request (default: 4GB)
-p partition to run the job in (default: dev)
-t time limit (default: 01:00:00)
-r allocate resources from the named reservation (default: none)
-J job name (default: sh_dev)
-q quality of service to request for the job (default: normal)
sbatch: Requests resources and launches an asynchronous (batch) job
-Usage: sbatch ./test.sh where test.sh looks like the following...
#!/usr/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_job.%j.out
#SBATCH --error=test_job.%j.err
#SBATCH --time=10:00:00
#SBATCH -p emmalu
#SBATCH -c 16
#SBATCH --mem=32GB
module load python/3.6.1 #load any necessary modules
conda activate my_env #activate your environment
python3 mycode.py #run your script
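After writing the script, a typical submit-and-check cycle looks like this (the job ID is whatever sbatch prints back):
sbatch test.sh #prints "Submitted batch job <job_id>"
squeue -u $USER #watch the job go from PD (pending) to R (running)
cat test_job.<job_id>.out #stdout from the job (matches the --output pattern above)
cat test_job.<job_id>.err #stderr from the job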
wrap: You can also launch a job in one line using --wrap, which lets you avoid having a separate bash file for requesting resources
-ex: sbatch --mem=5G -c 10 --wrap="python my_script.py"
scancel: Cancels a job that is running or waiting in the queue
-Usage:
- scancel <job_id> to cancel a single job
- scancel -u <sunetid> to cancel all of your jobs
-Note: Time limits are formatted as <days>-<hours>:<minutes>:<seconds>
ex: 1-4:00:00 is one day and four hours
-Note: The time limit for jobs in the emmalu partition is 7 days
Running Multiple Jobs
Job Arrays
-Job arrays let you launch one job that spawns subjobs.
-Note: Claude Code and GitHub Copilot are very good at helping create bash files for launching Slurm job arrays.
-Ex 1: Maintain a separate file containing all of the files you would like to run a script on
- With --array, you can specify a range of lines within the given file (this does not always need to start from 1). Additionally, if you would like to be kind to your coworkers, you can limit the number of jobs that run at a single time using %. For example, --array=1-10%2 will only run 2 jobs at once; as soon as one finishes, the next task in the range starts.
- The job file has this format (a fuller example follows the snippet below):
…
#SBATCH --ntasks=1 # 1 task (default) for the whole job (you do not need one task per array element)
#SBATCH --cpus-per-task=1 # number of cores per array task (default)
#SBATCH --array=1-10 # number of jobs in array
…
job_file=`sed "${SLURM_ARRAY_TASK_ID}q;d" /path/to/list_of_job_files.txt` #read line number ${SLURM_ARRAY_TASK_ID} of the list
python ${job_file}
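Putting the pieces together, a complete Ex 1 array script might look like the following sketch; the job name, resources, and file paths are placeholders to adapt.
#!/usr/bin/bash
#SBATCH --job-name=array_test
#SBATCH --output=array_test.%A_%a.out #%A = array job ID, %a = array task ID
#SBATCH --error=array_test.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH -p emmalu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --array=1-10%2 #10 tasks, at most 2 running at once
job_file=`sed "${SLURM_ARRAY_TASK_ID}q;d" /path/to/list_of_job_files.txt` #pick this task's input file
python ${job_file}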
-Ex 2: Define the inputs (e.g., config files) directly in the script as a bash array
…
#SBATCH --ntasks=1 # 1 task (default)
#SBATCH --cpus-per-task=1 # number of cores per task (default)
#SBATCH --array=0-3
CONFIGS=(
../../configs/config_esm2.yaml
../../configs/config_esm3.yaml
../../configs/config_prott5.yaml
../../configs/config_protbert.yaml)
# Select config file for this task
CONFIG_FILE=${CONFIGS[$SLURM_ARRAY_TASK_ID]}
srun python ../../main.py --sweep_config $CONFIG_FILE
N-tasks
- Multiple tasks: lets you run multiple commands simultaneously, split across CPU/GPU resources, from a single sbatch script
- To use this, set the --ntasks parameter to the number of tasks you would like to run and list the tasks at the bottom of the file. It is important to separate each task with the "&" character (so the tasks run in parallel) and to add wait as the last line, to ensure that the sbatch script waits until all tasks are complete before terminating.
- The tasks will show up in the Slurm queue as a single job using ntasks * cpus-per-task resources (in the example shown below, 4 tasks * 1 cpu-per-task, i.e. 4 CPUs).
…
#SBATCH --ntasks=4 # 4 tasks
#SBATCH --cpus-per-task=1 # number of cores per task
…
python task1.py &
python task2.py &
python task3.py &
python task4.py &
wait
Multithreaded python script
TODO:
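Until this section is written, here is a minimal sketch (script name, worker logic, and resource numbers are placeholder assumptions): request several cores in the sbatch script and size the Python worker pool from SLURM_CPUS_PER_TASK.
…
#SBATCH --cpus-per-task=8 #cores available to the (single-task) job
…
python threaded_work.py
# threaded_work.py (hypothetical script)
import os
from concurrent.futures import ThreadPoolExecutor

n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1")) #match the cores Slurm allocated

def work(i):
    return i * i #placeholder for real (ideally I/O-bound) work

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(work, range(100)))
print(results[:5])
Note: Python threads only speed up I/O-bound work because of the GIL; for CPU-bound work, ProcessPoolExecutor has the same interface and gives true parallelism across the allocated cores.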
Job Dependencies
-You can launch jobs that are "dependent" on other jobs. They will only start once their dependencies have started/completed/failed (depending on the dependency type).
-Usage: sbatch --dependency=<type>:<job_id> <job_file>
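Common dependency types (the job IDs and script names below are placeholders):
sbatch --dependency=afterok:12345 analysis.sh #start only after job 12345 finishes successfully
sbatch --dependency=afternotok:12345 cleanup.sh #start only if job 12345 fails
sbatch --dependency=afterany:12345 next_step.sh #start after job 12345 terminates, regardless of exit status
sbatch --dependency=after:12345 monitor.sh #start once job 12345 has begun running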
Compute resources
How to see what resources are available and what jobs are running
- squeue -u {SUNet}: check what jobs you have queued and running (replace {SUNet} with your ID)
- sh_part: shows the resources (GPUs, CPUs, memory) in the partitions that are available to you and whether these are currently in use or available
- sh_part -p emmalu: shows the resources specific to the emmalu partition
- squeue -p emmalu -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %.4C %.6b %.10m": shows all jobs running/queued on our partition and how many CPUs, GPUs, and how much memory they are using
- echo "=== emmalu Partition CPU Summary ===" && sinfo -p emmalu -h -o "%C" | awk -F/ '{print "Total CPUs: " $4 "\nIdle CPUs: " $2 "\nAllocated CPUs: " $1 "\nOffline/Other CPUs: " $3}' && echo "---" && squeue -p emmalu -t R -h -o "%C" | awk '{sum+=$1} END {print "CPUs used by RUNNING jobs: " sum}' && squeue -p emmalu -t PD -h -o "%C" | awk '{sum+=$1} END {print "CPUs requested by PENDING jobs: " sum}': shows the number of CPUs being used, how many are idle, and how many are unavailable (for reasons like node drained or under maintenance)
- sinfo -p emmalu -N -l: shows the state of each node in our partition
- scontrol show node {node}, for example scontrol show node sh03-18n01: details about the state of one node
Partitions
- emmalu: Lundberg lab partition, where you will probably run most of your jobs; Lundberg lab members have exclusive/non-preemptible access to the resources in this partition (so queue times are expected to be lower)
- Shared partitions: you can use these resources but queue times are likely to be longer
- normal: general purpose jobs (max runtime 2 days)
- bigmem: for jobs requiring >256GB memory (max runtime 1 day)
- gpu: for jobs requiring a GPU (limit of 16 GPUs/user)
- dev: development and testing jobs (max runtime 2 hours, max 4 cores + 2 GPUs/user)
- service: lightweight, recurring administrative tasks
- owners: jobs submitted here use available resources from other partitions, but your jobs are preemptible (can be canceled if someone with priority for these resources submits a job)
- See more details: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/#available-resources
Resources available in emmalu
- CPUs: 192
- GPUs: 16
- 8x A100s with 80 GB GPU memory; node sh03-18n01
- 8x H100s with 80 GB GPU memory; node sh04-18n01
Specifying a GPU type
- Different partitions and nodes have different GPUs
- node_feat: shows all computational resource types available to you
- node_feat -p emmalu: shows all computational resource types in the emmalu partition; look for GPU models identified with GPU_SKU; change the partition name to check other partitions (e.g. gpu)
- Specific GPU types can be requested using the Slurm -C (constraint) flag and the GPU_SKU (GPU model)
- In an sbatch script, include the following lines to request, for example, an H100_SXM5 GPU:
#SBATCH -G 1
#SBATCH -C GPU_SKU:H100_SXM5
- In an interactive session, include the following flags to request, for example, an H100_SXM5 GPU: salloc -p emmalu -G 1 -C GPU_SKU:H100_SXM5
- See more details: https://www.sherlock.stanford.edu/docs/user-guide/gpu/#gpu-types and https://www.sherlock.stanford.edu/docs/advanced-topics/node-features/#listing-the-features-available-in-a-partition
- nvidia-smi: shows GPU information and status
Common Slurm errors
Out-of-Memory Error: If your job crashes and the .err file displays an error like the following, it means your script ran out of memory. To fix this, increase the amount of memory allocated to your job (e.g., with --mem).
slurmstepd: error: *** JOB 8197553 ON sh03-18n01 CANCELLED AT 2025-10-23T23:52:24 ***
slurmstepd: error: Detected 2 oom_kill events in StepId=8197553.batch. Some of the step tasks have been OOM Killed.
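To pick a better --mem value, it helps to check how much memory the job actually used once it has ended; the job ID below is the one from the example error.
seff 8197553 #summary of CPU and memory efficiency for a completed job
sacct -j 8197553 --format=JobID,ReqMem,MaxRSS,State #requested vs. peak memory per job step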
Environments
It is recommended to install Python packages in a #virtual environment to manage per-project dependencies.
System dependencies on Sherlock are typically very old. As a result, many packages on PyPI and other package repositories lack pre-built binaries for Sherlock's system configuration, so some packages need to be built from source (which can take a long time) or are not supported at all. This can be circumvented by loading more recent environment modules (see #Environment modules) or using a conda environment (see #Conda). Another option is a containerized development environment using #Apptainer (formerly Singularity) containers.
Python virtual environments
- Create: python -m venv .venv
- Activate: source .venv/bin/activate (or .venv\Scripts\activate on Windows)
- Deactivate: deactivate
- Once activated, python, pip, etc. use the virtual environment
- Specific Python version: use the desired Python binary: /path/to/python3.X -m venv .venv
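A typical workflow on Sherlock is to load a recent Python module first and build the venv from it (the module version below is only an example; check module avail python for what is actually installed):
module load python/3.12.1 #pick a recent Python from the module system
python3 -m venv .venv #create the environment next to your project
source .venv/bin/activate
pip install -r requirements.txt #or pip install <package>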
uv
- Modern dependency management tool that creates and manages #virtual environments with additional features:
  - Lockfile support: uv.lock pins exact versions for reproducibility
  - Python version management: automatically downloads and uses specified Python versions
  - Integration: works with pyproject.toml and requirements.txt
  - Speed: much faster than pip
- Common commands:
  - Create environment: uv venv (or uv venv -p 3.12 for a specific version)
  - Activate: source .venv/bin/activate (same as a regular venv)
  - Run without activating: uv run <command>
  - Add project dependency: uv add <package-name> (updates pyproject.toml + uv.lock)
  - Add dev-only dependency: uv add --dev <package-name> (or uv pip install <package-name> for an untracked, ad-hoc install)
  - Install from lockfile: uv sync
  - Reset to lockfile state: uv sync (removes any manual changes)
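A minimal uv project workflow, as a sketch (project and package names are just examples):
uv init myproject && cd myproject #creates pyproject.toml
uv venv -p 3.12 #create .venv with a managed Python 3.12
uv add numpy #record the dependency in pyproject.toml + uv.lock
uv run python -c "import numpy; print(numpy.__version__)" #run inside the env without activating it
uv sync #elsewhere (e.g. a fresh clone): recreate the exact locked environment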
Environment modules
- Check available modules: module avail (optionally with a substring search: module avail <keyword>)
- Load module: module load <module-name>
- Unload module: module unload <module-name>
- Unload all modules: module purge
- List currently loaded modules: module list
- Show module details: module show <module-name>
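For example (the exact version strings depend on what module avail shows on Sherlock):
module avail python #list available Python modules
module load python/3.12.1 #load one of them (version is an example)
which python3 #confirm the module's interpreter is now first on PATH
module list #see everything currently loaded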
Conda
- Installation
- Download Miniconda installer for Linux from https://www.anaconda.com/download/success
- Copy the installer to Sherlock
- Recommended to put it in your folder in $GROUP_HOME - you will not have enough space for all your environments in your $HOME folder, and $SCRATCH auto-deletes files every 90 days (see the sketch after this list)
- Install by running bash {installer_file.sh}, e.g. bash Miniconda3-latest-Linux-x86_64.sh
- Helpful conda commands
- List all environments: conda env list
- Create a new environment: conda create -n myenv; to specify the python version: conda create -n myenv python=3.11
- Activate an environment: conda activate myenv
- Deactivate current environment: conda deactivate
- Delete an environment: conda env remove -n myenv
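A sketch of installing Miniconda under $GROUP_HOME and keeping environments and the package cache there too (the directory layout is just a suggestion):
bash Miniconda3-latest-Linux-x86_64.sh -b -p $GROUP_HOME/$USER/miniconda3 #-b: batch mode, -p: install prefix
source $GROUP_HOME/$USER/miniconda3/etc/profile.d/conda.sh #make conda available in this shell
conda config --add envs_dirs $GROUP_HOME/$USER/conda_envs #store environments outside $HOME
conda config --add pkgs_dirs $GROUP_HOME/$USER/conda_pkgs #store the package cache outside $HOME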
Mamba
- Mamba is an alternative conda distribution that has been optimized for faster dependency resolution
- Most conda commands have been ported to mamba and are identically specified (also supporting the mamba command/entrypoint instead of conda)
Apptainer container
- Build container: srun --cpus-per-task=16 --mem=32G --time=1:00:00 apptainer build container.sif container.def
- Launch container:
  - Interactive shell: apptainer shell container.sif
  - Run command: apptainer exec container.sif <command>
  - Note: Use Slurm to allocate resources before launching the container (e.g., srun --pty apptainer shell container.sif), as Slurm commands are not available inside the container
  - By default, your home directory and current working directory are mounted and accessible inside the container
- Any of the above environment managers (conda, uv, etc.) can be used inside the container for project-specific configurations
Example definition file (container.def):
Bootstrap: docker
From: ubuntu:25.10
%post
# Install unminimize package for man pages etc
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y unminimize
yes | unminimize
# Install system packages of the container - these can be customized according to need
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
python3.13 \
python3.13-venv \
python3.13-dev \
zsh \
git \
curl \
wget \
vim \
tmux \
htop \
build-essential \
locales \
man-db \
manpages \
manpages-dev \
ca-certificates \
gnupg \
xsel \
xclip \
file
# Generate and set locale
locale-gen en_US.UTF-8
update-locale LANG=en_US.UTF-8
# Set python3.13 as the default python3
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.13 1
update-alternatives --install /usr/bin/python python /usr/bin/python3.13 1
# Install pip for Python 3.13
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.13 - --break-system-packages
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
mv /root/.local/bin/uv /usr/local/bin/uv
mv /root/.local/bin/uvx /usr/local/bin/uvx
%environment
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
%runscript
# Default to bash if no command specified
if [ $# -eq 0 ]; then
exec /bin/bash
else
exec "$@"
fi
Data storage locations (HOME/SCRATCH)
- $HOME, $GROUP_HOME are designed for small, important files (source code, executables, configuration files...); files there will not be deleted.
  - $HOME has 15 GB total and is for your own personal use.
  - $GROUP_HOME has 1 TB total. This directory is shared with the lab, so add a subdirectory with your name before using it.
- $SCRATCH, $GROUP_SCRATCH are designed for large, temporary files (checkpoints, raw application output...); files are automatically deleted 90 days after their last content edit. (ref: https://www.sherlock.stanford.edu/docs/storage/?h=purge#features-and-purpose)
  - $SCRATCH has 100 TB and is for your own personal use
  - $GROUP_SCRATCH has 100 TB and is shared with the group. If you have datasets that will be used with other people, you can put them here.
- 90 days deletion + work-arounds
  - Files are automatically purged from $SCRATCH after an inactivity period: files that are not modified within 90 days are automatically deleted. Contents need to change for a file to be considered modified; the touch command does not modify file contents and thus does not extend a file's lifetime on the filesystem. $SCRATCH is not meant to store permanent data, and should only be used for data associated with currently running jobs. It's not a target for backups, archived data, etc.
  - Backing up to ell vault with a script
    - The best practice for using Sherlock storage is to regularly back things up, and you may want to allow enough time for data transfer before files are deleted.
    - One way to transfer data is the WebDAV protocol; please see ($SOME_PATH) for an example script.
    - Another way is using rclone. See
- Transferring data (dtn nodes)
  - For smaller data transfers, use any SSH-based protocol, e.g. $ scp foo <sunetid>@login.sherlock.stanford.edu:SOME_PATH, or SFTP (Secure File Transfer Protocol).
  - Sherlock has a pool of dedicated Data Transfer Nodes (DTNs) that provide exclusive resources for large-scale data transfers (ref: https://www.sherlock.stanford.edu/docs/storage/data-transfer/#cli-and-api).
  - For example: $ scp foo <sunetid>@dtn.sherlock.stanford.edu:~/foo (see also the rsync example after this list)
- Check the space available with df -h
  - Filter to only see a folder, e.g. emmalu within $GROUP_HOME, with df -h | grep emmalu
- Check how much space you are using with du -sh
  - Filter to only check a directory with du -sh /path/to/folder/
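For larger directory transfers through the DTNs, rsync is a common choice because it can resume interrupted transfers and show progress; the destination path below is just an example:
rsync -avh --progress ./my_dataset/ <sunetid>@dtn.sherlock.stanford.edu:/scratch/groups/emmalu/<sunetid>/my_dataset/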
Visual studio code
- OnDemand Code Server connection - This is an online version of VS Code so you can edit files and run them on Sherlock.
- To access it, go here and log in: https://ondemand.sherlock.stanford.edu/pun/sys/dashboard.
- Then, go to the top menu -> Interactive Apps -> Code Server
- Here, you can specify what configurations you want for a dev node. Once done, hit "Launch".
- This will open up something that looks a lot like VS Code, but online. You can edit code, run jobs, etc.
- The only downside to this is that (to our knowledge) there is no way to use code complete or any AI tools like Copilot. If you want to be able to do that, look no further than the next section.
- VS Code Connection - You can't use a normal VS Code SSH remote connection into Sherlock, but you can use SSH-FS (you might have to add this extension as a first step) to edit files and access a remote terminal locally. This is especially nice if you want to use something like Cursor or Copilot, which you can't use in the OnDemand server. The steps to set this up are:
- Open VS Code
- On the vertical menu bar on the side, there should be an icon depicting a folder with a terminal symbol inside of it. Click this.
- The third icon by "Configurations" in the newly-opened menu should be two horizontal bars that look like sliders. Click this.
- Hit the "Add" button.
- Name it what you want and you can use default settings for location. Click "Save" and another configuration menu should show up.
- You can leave Label, Group, Merge, PuTTY blank. Set host to login.sherlock.stanford.edu. Port = 22. Root should be the directory you want to open to (for example "/scratch/groups/emmalu/[your_user]"). Agent can be left blank. Username should be your Sherlock username. The rest of the fields can be unchanged/left blank. Click save.
- The steps to connect to Sherlock once this is set up are:
- Click on the icon depicting a folder with a terminal symbol inside of it on the vertical menu bar. Hover over your configuration name and click the icon that looks like a folder with a plus sign in it.
- At the top center of the screen, it will prompt you to type in your password and do Duo 2FA. Once done, your Sherlock file system should open in the left-hand menu.
- To access the sherlock terminal, right click on any folder in the file system and click "Open Remote SSH Terminal". This will open you on a login node and acts just like any other remote connection into Sherlock.
- You can enter a dev node as you would usually (something like sh_dev -p emmalu -t 2:00:00 -m 16GB).
- If you want to keep this node open even if you close VS code, you can use tmux to run an infinite command (can just be "python").
- Steps to connect to a jupyter notebook when you are connected to Sherlock in VS Code.
- To use a jupyter notebook, you can open a tmux session (tmux new -s jupyter_session) and run the command "jupyter notebook" after activating the conda environment you want your kernel to run in. Two links will be generated, with the first containing the port number 8888. Copy this link. Once copied, you can detach and return to your non-tmux terminal by pressing Ctrl+b, then d.
- Then, in a local terminal on your computer, port forward using this command: ssh -t -L 7888:localhost:9696 [USERNAME]@login.sherlock.stanford.edu ssh -L 9696:localhost:8888 [NODE]. Note that the first port is 7888 instead of 8888 to make it easy to differentiate different ports. The NODE parameter will be something like sh03-17n11, or whichever dev node you are working on.
- Once these steps are done, open in VS Code the jupyter notebook that you want to work with. Click "Kernel" at the top -> Select another kernel -> Existing Jupyter server -> and paste in the link you copied from above. **Be sure to change the 8888 in the link to 7888 or this will not work.** Then, when selecting a kernel, pick the default Python3 kernel (since this was created in the context of the conda environment you activated in tmux).
- Annoying things to take note of:
- If you close your laptop, your connection will be lost and you have to connect again.
- If you are using code that is only compatible on Linux (like wrapped C code that isn't compatible with your local system), you cannot run it in a jupyter notebook because it is port forwarding to your local machine.
- If you use agentic features, keep a close eye on them so they do not edit/delete things accidentally within the Sherlock cluster, since this is all shared.
Alternative compute options
- HAI Cluster - TODO: Add details
- Marlowe: currently not used by the Lundberg lab; you can apply for usage and get some free credits to start