Sherlock
Setting up account
- Email srcc-support <srcc-support@stanford.edu> to request a Sherlock account. Include your SUNet ID.
- Email srcc-support <srcc-support@stanford.edu> to request to be added to the emmalu group. CC Emma, as she will need to approve your addition before it is processed.
Jobs
Running Basic Jobs
sh_dev: Starts an interactive shell as a job with the resources you request
-Usage: sh_dev -c 4 -g 1 -m 16 -p emmalu -t 2:00:00
Usage: sh_dev [OPTIONS]
Optional arguments:
-c number of CPU cores to request (OpenMP/pthreads, default: 1)
-g number of GPUs to request (default: none)
-n number of tasks to request (MPI ranks, default: 1)
-N number of nodes to request (default: 1)
-m memory amount to request (default: 4GB)
-p partition to run the job in (default: dev)
-t time limit (default: 01:00:00)
-r allocate resources from the named reservation (default: none)
-J job name (default: sh_dev)
-q quality of service to request for the job (default: normal)
sbatch: Requests resources and launches an asynchronous (batch) job
-Usage: sbatch ./test.sh where test.sh looks like the following...
#!/usr/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test_job.%j.out
#SBATCH --error=test_job.%j.err
#SBATCH --time=10:00:00
#SBATCH -p emmalu
#SBATCH -c 16
#SBATCH --mem=32GB
module load python/3.6.1 #load any necessary modules
conda activate my_env #activate your environment
python3 mycode.py #run your script
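After writing the script, a typical submit-and-check cycle looks like this (the job ID is whatever sbatch prints back):
sbatch test.sh #prints "Submitted batch job <job_id>"
squeue -u $USER #watch the job go from PD (pending) to R (running)
cat test_job.<job_id>.out #stdout from the job (matches the --output pattern above)
cat test_job.<job_id>.err #stderr from the job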
wrap: You can also launch a job in one line using --wrap, which lets you avoid having a separate bash file for requesting resources
-ex: sbatch --mem=5G -c 10 --wrap="python my_script.py"
scancel: Cancels a job that is running or waiting in the queue
-Usage:
- scancel <job_id> to cancel a single job
- scancel -u <sunetid> to cancel all of your jobs
-Note: Time limits are formatted as <days>-<hours>:<minutes>:<seconds>
ex: 1-4:00:00 is one day and four hours
-Note: The time limit for jobs in the emmalu partition is 7 days
Running Multiple Jobs
Job Arrays
-Job arrays let you launch one job that spawns subjobs.
-Note: Claude Code and GitHub Copilot are very good at helping create bash files for launching Slurm job arrays.
-Ex 1: Maintain a separate file containing all of the files you would like to run a script on
- With --array, you can specify a range of lines within the given file (this does not always need to start from 1). Additionally, if you would like to be kind to your coworkers, you can limit the number of jobs that run at a single time using %. For example, --array=1-10%2 will only run 2 jobs at once; as soon as one finishes, the next task in the range starts.
- The job file has this format (a fuller example follows the snippet below):
…
#SBATCH --ntasks=1 # 1 task (default) for the whole job (you do not need one task per array element)
#SBATCH --cpus-per-task=1 # number of cores per array task (default)
#SBATCH --array=1-10 # number of jobs in array
…
job_file=`sed "${SLURM_ARRAY_TASK_ID}q;d" /path/to/list_of_job_files.txt` #read line number ${SLURM_ARRAY_TASK_ID} of the list
python ${job_file}
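Putting the pieces together, a complete Ex 1 array script might look like the following sketch; the job name, resources, and file paths are placeholders to adapt.
#!/usr/bin/bash
#SBATCH --job-name=array_test
#SBATCH --output=array_test.%A_%a.out #%A = array job ID, %a = array task ID
#SBATCH --error=array_test.%A_%a.err
#SBATCH --time=2:00:00
#SBATCH -p emmalu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB
#SBATCH --array=1-10%2 #10 tasks, at most 2 running at once
job_file=`sed "${SLURM_ARRAY_TASK_ID}q;d" /path/to/list_of_job_files.txt` #pick this task's input file
python ${job_file}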
-Ex 2: Define the inputs (e.g., config files) directly in the script as a bash array
…
#SBATCH --ntasks=1 # 1 task (default)
#SBATCH --cpus-per-task=1 # number of cores per task (default)
#SBATCH --array=0-3
CONFIGS=(
../../configs/config_esm2.yaml
../../configs/config_esm3.yaml
../../configs/config_prott5.yaml
../../configs/config_protbert.yaml)
# Select config file for this task
CONFIG_FILE=${CONFIGS[$SLURM_ARRAY_TASK_ID]}
srun python ../../main.py --sweep_config $CONFIG_FILE
N-tasks
- Multiple tasks: lets you run multiple commands simultaneously, split across CPU/GPU resources, from a single sbatch script
- To use this, set the --ntasks parameter to the number of tasks you would like to run and list the tasks at the bottom of the file. It is important to separate each task with the "&" character (so the tasks run in parallel) and to add wait as the last line, to ensure that the sbatch script waits until all tasks are complete before terminating.
- The tasks will show up in the Slurm queue as a single job using ntasks * cpus-per-task resources (in the example shown below, 4 tasks * 1 cpu-per-task, i.e. 4 CPUs).
…
#SBATCH --ntasks=4 # 4 tasks
#SBATCH --cpus-per-task=1 # number of cores per task
…
python task1.py &
python task2.py &
python task3.py &
python task4.py &
wait
Multithreaded python script
TODO:
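Until this section is written, here is a minimal sketch (script name, worker logic, and resource numbers are placeholder assumptions): request several cores in the sbatch script and size the Python worker pool from SLURM_CPUS_PER_TASK.
…
#SBATCH --cpus-per-task=8 #cores available to the (single-task) job
…
python threaded_work.py
# threaded_work.py (hypothetical script)
import os
from concurrent.futures import ThreadPoolExecutor

n_workers = int(os.environ.get("SLURM_CPUS_PER_TASK", "1")) #match the cores Slurm allocated

def work(i):
    return i * i #placeholder for real (ideally I/O-bound) work

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(work, range(100)))
print(results[:5])
Note: Python threads only speed up I/O-bound work because of the GIL; for CPU-bound work, ProcessPoolExecutor has the same interface and gives true parallelism across the allocated cores.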
Job Dependencies
-You can launch jobs that are "dependent" on other jobs. They will only start once their dependencies have started/completed/failed (depending on the dependency type).
-Usage: sbatch --dependency=<type>:<job_id> <job_file>
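Common dependency types (the job IDs and script names below are placeholders):
sbatch --dependency=afterok:12345 analysis.sh #start only after job 12345 finishes successfully
sbatch --dependency=afternotok:12345 cleanup.sh #start only if job 12345 fails
sbatch --dependency=afterany:12345 next_step.sh #start after job 12345 terminates, regardless of exit status
sbatch --dependency=after:12345 monitor.sh #start once job 12345 has begun running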
Compute resources
How to see what resources are available and what jobs are running
- squeue -u {SUNet}: check what jobs you have queued and running (replace {SUNet} with your ID)
- sh_part: shows the resources (GPUs, CPUs, memory) in the partitions that are available to you and whether these are currently in use or available
- sh_part -p emmalu: shows the resources specific to the emmalu partition
- squeue -p emmalu -o "%.18i %.9P %.30j %.8u %.2t %.10M %.6D %.4C %.6b %.10m": shows all jobs running/queued on our partition and how many CPUs, GPUs, and how much memory they are using
- echo "=== emmalu Partition CPU Summary ===" && sinfo -p emmalu -h -o "%C" | awk -F/ '{print "Total CPUs: " $4 "\nIdle CPUs: " $2 "\nAllocated CPUs: " $1 "\nOffline/Other CPUs: " $3}' && echo "---" && squeue -p emmalu -t R -h -o "%C" | awk '{sum+=$1} END {print "CPUs used by RUNNING jobs: " sum}' && squeue -p emmalu -t PD -h -o "%C" | awk '{sum+=$1} END {print "CPUs requested by PENDING jobs: " sum}': shows the number of CPUs being used, how many are idle, and how many are unavailable (for reasons like node drained or under maintenance)
- sinfo -p emmalu -N -l: shows the state of each node in our partition
- scontrol show node {node}, for example scontrol show node sh03-18n01: details about the state of one node
Partitions
- emmalu: Lundberg lab partition, where you will probably run most of your jobs; Lundberg lab members have exclusive/non-preemptible access to the resources in this partition (so queue times are expected to be lower)
- Shared partitions: you can use these resources but queue times are likely to be longer
- normal: general purpose jobs (max runtime 2 days)
- bigmem: for jobs requiring >256GB memory (max runtime 1 day)
- gpu: for jobs requiring a GPU (limit of 16 GPUs/user)
- dev: development and testing jobs (max runtime 2 hours, max 4 cores + 2 GPUs/user)
- service: lightweight, recurring administrative tasks
- owners: jobs submitted here use available resources from other partitions, but your jobs are preemptible (can be canceled if someone with priority for these resources submits a job)
- See more details: https://www.sherlock.stanford.edu/docs/user-guide/running-jobs/#available-resources
Resources available in emmalu
- CPUs: 192
- GPUs: 16
- 8x A100s with 80 GB GPU memory; node sh03-18n01
- 8x H100s with 80 GB GPU memory; node sh04-18n01
Specifying a GPU type
- Different partitions and nodes have different GPUs
- node_feat: shows all computational resource types available to you
- node_feat -p emmalu: shows all computational resource types in the emmalu partition; look for GPU models identified with GPU_SKU; change the partition name to check other partitions (e.g. gpu)
- Specific GPU types can be requested using the Slurm -C (constraint) flag and the GPU_SKU (GPU model)
- In an sbatch script, include the following lines to request, for example, an H100_SXM5 GPU:
#SBATCH -G 1
#SBATCH -C GPU_SKU:H100_SXM5
- In an interactive session, include the following flags to request, for example, an H100_SXM5 GPU: salloc -p emmalu -G 1 -C GPU_SKU:H100_SXM5
- See more details: https://www.sherlock.stanford.edu/docs/user-guide/gpu/#gpu-types and https://www.sherlock.stanford.edu/docs/advanced-topics/node-features/#listing-the-features-available-in-a-partition
- nvidia-smi: shows GPU information and status
Common Slurm errors
Out-of-Memory Error: If your job crashes and the .err file displays an error like the following, it means your script ran out of memory. To fix this, increase the amount of memory allocated to your job (e.g., with --mem).
slurmstepd: error: *** JOB 8197553 ON sh03-18n01 CANCELLED AT 2025-10-23T23:52:24 ***
slurmstepd: error: Detected 2 oom_kill events in StepId=8197553.batch. Some of the step tasks have been OOM Killed.
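To pick a better --mem value, it helps to check how much memory the job actually used once it has ended; the job ID below is the one from the example error.
seff 8197553 #summary of CPU and memory efficiency for a completed job
sacct -j 8197553 --format=JobID,ReqMem,MaxRSS,State #requested vs. peak memory per job step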
Environments
It is recommended to install Python packages in a #virtual environment to manage per-project dependencies.
System dependencies on Sherlock are typically very old. As a result, many packages on PyPI and other package repositories lack pre-built binaries for Sherlock's system configuration, so some packages need to be built from source (which can take a long time) or are not supported at all. This can be circumvented by loading more recent environment modules (see #Environment modules) or using a conda environment (see #Conda). Another option is a containerized development environment using #Apptainer (formerly Singularity) containers.
Python virtual environments
- Create: python -m venv .venv
- Activate: source .venv/bin/activate (or .venv\Scripts\activate on Windows)
- Deactivate: deactivate
- Once activated, python, pip, etc. use the virtual environment
- Specific Python version: use the desired Python binary: /path/to/python3.X -m venv .venv
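A typical workflow on Sherlock is to load a recent Python module first and build the venv from it (the module version below is only an example; check module avail python for what is actually installed):
module load python/3.12.1 #pick a recent Python from the module system
python3 -m venv .venv #create the environment next to your project
source .venv/bin/activate
pip install -r requirements.txt #or pip install <package>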
uv
- Modern dependency management tool that creates and manages #virtual environments with additional features:
  - Lockfile support: uv.lock pins exact versions for reproducibility
  - Python version management: automatically downloads and uses specified Python versions
  - Integration: works with pyproject.toml and requirements.txt
  - Speed: much faster than pip
- Common commands:
  - Create environment: uv venv (or uv venv -p 3.12 for a specific version)
  - Activate: source .venv/bin/activate (same as a regular venv)
  - Run without activating: uv run <command>
  - Add project dependency: uv add <package-name> (updates pyproject.toml + uv.lock)
  - Add dev-only dependency: uv add --dev <package-name> (or uv pip install <package-name> for an untracked, ad-hoc install)
  - Install from lockfile: uv sync
  - Reset to lockfile state: uv sync (removes any manual changes)
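A minimal uv project workflow, as a sketch (project and package names are just examples):
uv init myproject && cd myproject #creates pyproject.toml
uv venv -p 3.12 #create .venv with a managed Python 3.12
uv add numpy #record the dependency in pyproject.toml + uv.lock
uv run python -c "import numpy; print(numpy.__version__)" #run inside the env without activating it
uv sync #elsewhere (e.g. a fresh clone): recreate the exact locked environment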
Environment modules
- Check available modules: module avail (optionally with a substring search: module avail <keyword>)
- Load module: module load <module-name>
- Unload module: module unload <module-name>
- Unload all modules: module purge
- List currently loaded modules: module list
- Show module details: module show <module-name>
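For example (the exact version strings depend on what module avail shows on Sherlock):
module avail python #list available Python modules
module load python/3.12.1 #load one of them (version is an example)
which python3 #confirm the module's interpreter is now first on PATH
module list #see everything currently loaded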
Conda
- Installation
- Download Miniconda installer for Linux from https://www.anaconda.com/download/success
- Copy the installer to Sherlock
- Recommended to put it in your folder in $GROUP_HOME - you will not have enough space for all your environments in your $HOME folder, and $SCRATCH auto-deletes files every 90 days (see the sketch after this list)
- Install by running bash {installer_file.sh}, e.g. bash Miniconda3-latest-Linux-x86_64.sh
- Helpful conda commands
- List all environments: conda env list
- Create a new environment: conda create -n myenv; to specify the python version: conda create -n myenv python=3.11
- Activate an environment: conda activate myenv
- Deactivate current environment: conda deactivate
- Delete an environment: conda env remove -n myenv
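A sketch of installing Miniconda under $GROUP_HOME and keeping environments and the package cache there too (the directory layout is just a suggestion):
bash Miniconda3-latest-Linux-x86_64.sh -b -p $GROUP_HOME/$USER/miniconda3 #-b: batch mode, -p: install prefix
source $GROUP_HOME/$USER/miniconda3/etc/profile.d/conda.sh #make conda available in this shell
conda config --add envs_dirs $GROUP_HOME/$USER/conda_envs #store environments outside $HOME
conda config --add pkgs_dirs $GROUP_HOME/$USER/conda_pkgs #store the package cache outside $HOME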
Mamba
- Mamba is an alternative conda distribution that has been optimized for faster dependency resolution
- Most conda commands have been ported to mamba and are identically specified (also supporting the mamba command/entrypoint instead of conda)
Apptainer container
- Build container: srun --cpus-per-task=16 --mem=32G --time=1:00:00 apptainer build container.sif container.def
- Launch container:
  - Interactive shell: apptainer shell container.sif
  - Run command: apptainer exec container.sif <command>
  - Note: Use Slurm to allocate resources before launching the container (e.g., srun --pty apptainer shell container.sif), as Slurm commands are not available inside the container
  - By default, your home directory and current working directory are mounted and accessible inside the container
- Any of the above environment managers (conda, uv, etc.) can be used inside the container for project-specific configurations
Example definition file (container.def):
Bootstrap: docker
From: ubuntu:25.10
%post
# Install unminimize package for man pages etc
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y unminimize
yes | unminimize
# Install system packages of the container - these can be customized according to need
apt-get update && DEBIAN_FRONTEND=noninteractive apt-get install -y \
python3.13 \
python3.13-venv \
python3.13-dev \
zsh \
git \
curl \
wget \
vim \
tmux \
htop \
build-essential \
locales \
man-db \
manpages \
manpages-dev \
ca-certificates \
gnupg \
xsel \
xclip \
file
# Generate and set locale
locale-gen en_US.UTF-8
update-locale LANG=en_US.UTF-8
# Set python3.13 as the default python3
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.13 1
update-alternatives --install /usr/bin/python python /usr/bin/python3.13 1
# Install pip for Python 3.13
curl -sS https://bootstrap.pypa.io/get-pip.py | python3.13 - --break-system-packages
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
mv /root/.local/bin/uv /usr/local/bin/uv
mv /root/.local/bin/uvx /usr/local/bin/uvx
%environment
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
%runscript
# Default to bash if no command specified
if [ $# -eq 0 ]; then
exec /bin/bash
else
exec "$@"
fi
Data storage locations (HOME/SCRATCH)
- $HOME, $GROUP_HOME are designed for small, important files (source code, executables, configuration files...); files there will not be deleted.
  - $HOME has 15 GB total and is for your own personal use.
  - $GROUP_HOME has 1 TB total. This directory is shared with the lab, so add a subdirectory with your name before using it.
- $SCRATCH, $GROUP_SCRATCH are designed for large, temporary files (checkpoints, raw application output...); files are automatically deleted 90 days after their last content edit. (ref: https://www.sherlock.stanford.edu/docs/storage/?h=purge#features-and-purpose)
  - $SCRATCH has 100 TB and is for your own personal use
  - $GROUP_SCRATCH has 100 TB and is shared with the group. If you have datasets that will be used with other people, you can put them here.
- 90 days deletion + work-arounds
  - Files are automatically purged from $SCRATCH after an inactivity period: files that are not modified within 90 days are automatically deleted. Contents need to change for a file to be considered modified; the touch command does not modify file contents and thus does not extend a file's lifetime on the filesystem. $SCRATCH is not meant to store permanent data, and should only be used for data associated with currently running jobs. It's not a target for backups, archived data, etc.
  - Backing up to ell vault with a script
    - The best practice for using Sherlock storage is to regularly back things up, and you may want to allow enough time for data transfer before files are deleted.
    - One way to transfer data is the WebDAV protocol; please see ($SOME_PATH) for an example script.
    - Another way is using rclone. See
- Transferring data (dtn nodes)
  - For smaller data transfers, use any SSH-based protocol, e.g. $ scp foo <sunetid>@login.sherlock.stanford.edu:SOME_PATH, or SFTP (Secure File Transfer Protocol).
  - Sherlock has a pool of dedicated Data Transfer Nodes (DTNs) that provide exclusive resources for large-scale data transfers (ref: https://www.sherlock.stanford.edu/docs/storage/data-transfer/#cli-and-api).
  - For example: $ scp foo <sunetid>@dtn.sherlock.stanford.edu:~/foo (see also the rsync example after this list)
- Check the space available with df -h
  - Filter to only see a folder, e.g. emmalu within $GROUP_HOME, with df -h | grep emmalu
- Check how much space you are using with du -sh
  - Filter to only check a directory with du -sh /path/to/folder/
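For larger directory transfers through the DTNs, rsync is a common choice because it can resume interrupted transfers and show progress; the destination path below is just an example:
rsync -avh --progress ./my_dataset/ <sunetid>@dtn.sherlock.stanford.edu:/scratch/groups/emmalu/<sunetid>/my_dataset/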
Visual studio code
- OnDemand Code Server connection - This is an online version of VS Code so you can edit files and run them on Sherlock.
- To access it, go here and log in: https://ondemand.sherlock.stanford.edu/pun/sys/dashboard.
- Then, go to the top menu -> Interactive Apps -> Code Server
- Here, you can specify what configurations you want for a dev node. Once done, hit "Launch".
- This will open up something that looks a lot like VS Code, but online. You can edit code, run jobs, etc.
- The only downside to this is that (to our knowledge) there is no way to use code complete or any AI tools like Copilot. If you want to be able to do that, look no further than the next section.
- VS Code Connection - You can't use a normal VS Code SSH remote connection into Sherlock, but you can use SSH-FS (you might have to add this extension as a first step) to edit files and access a remote terminal locally. This is especially nice if you want to use something like Cursor or Copilot, which you can't use in the OnDemand server. The steps to set this up are:
- Open VS Code
- On the vertical menu bar on the side, there should be an icon depicting a folder with a terminal symbol inside of it. Click this.
- The third icon by "Configurations" in the newly-opened menu should be two horizontal bars that look like sliders. Click this.
- Hit the "Add" button.
- Name it what you want and you can use default settings for location. Click "Save" and another configuration menu should show up.
- You can leave Label, Group, Merge, PuTTY blank. Set host to login.sherlock.stanford.edu. Port = 22. Root should be the directory you want to open to (for example "/scratch/groups/emmalu/[your_user]"). Agent can be left blank. Username should be your Sherlock username. The rest of the fields can be unchanged/left blank. Click save.
- The steps to connect to Sherlock once this is set up are:
- Click on the icon depicting a folder with a terminal symbol inside of it on the vertical menu bar. Hover over your configuration name and click the icon that looks like a folder with a plus sign in it.
- At the top center of the screen, it will prompt you to type in your password and do Duo 2FA. Once done, your Sherlock file system should open in the left-hand menu.
- To access the sherlock terminal, right click on any folder in the file system and click "Open Remote SSH Terminal". This will open you on a login node and acts just like any other remote connection into Sherlock.
- You can enter a dev node as you would usually (something like sh_dev -p emmalu -t 2:00:00 -m 16GB).
- If you want to keep this node open even if you close VS code, you can use tmux to run an infinite command (can just be "python").
- Steps to connect to a jupyter notebook when you are connected to Sherlock in VS Code.
- To use a jupyter notebook, you can open a tmux session (tmux new -s jupyter_session) and run the command "jupyter notebook" after activating the conda environment you want your kernel to run in. Two links will be generated, with the first containing the port number 8888. Copy this link. Once copied, you can detach and return to your non-tmux terminal by pressing Ctrl+b, then d.
- Then, in a local terminal on your computer, port forward using this command: ssh -t -L 7888:localhost:9696 [USERNAME]@login.sherlock.stanford.edu ssh -L 9696:localhost:8888 [NODE]. Note that the first port is 7888 instead of 8888 to make it easy to differentiate different ports. The NODE parameter will be something like sh03-17n11, or whichever dev node you are working on.
- Once these steps are done, open in VS Code the jupyter notebook that you want to work with. Click "Kernel" at the top -> Select another kernel -> Existing Jupyter server -> and paste in the link you copied from above. **Be sure to change the 8888 in the link to 7888 or this will not work.** Then, when selecting a kernel, pick the default Python3 kernel (since this was created in the context of the conda environment you activated in tmux).
- Annoying things to take note of:
- If you close your laptop, your connection will be lost and you have to connect again.
- If you are using code that is only compatible on Linux (like wrapped C code that isn't compatible with your local system), you cannot run it in a jupyter notebook because it is port forwarding to your local machine.
- If you use agentic features, keep a close eye on them so they do not edit/delete things accidentally within the Sherlock cluster, since this is all shared.
Alternative compute options
- HAI Cluster - TODO: Add details
- Marlowe: currently not used by the Lundberg lab; you can apply for usage and get some free credits to start