
ML Experiment Tracking: WandB, Neptune, and TensorBoard

Overview: Why Experiment Tracking Matters

Machine learning experiment tracking is essential for maintaining reproducibility, comparing model performance, and collaborating effectively on ML projects. Without proper tracking, it becomes nearly impossible to:

  • Reproduce successful experiments - Remember which hyperparameters, data versions, and code snapshots produced your best results
  • Compare models systematically - Track metrics across dozens or hundreds of training runs to identify patterns
  • Debug training issues - Visualize learning curves, gradients, and system metrics to diagnose problems
  • Collaborate with team members - Share experiment results and insights with colleagues working on the same project
  • Document your research - Create a historical record of what you've tried and why certain approaches worked or failed

When to Use Experiment Tracking

You should implement experiment tracking when:

  • Running multiple training experiments with different hyperparameters
  • Training models that take more than a few minutes to complete
  • Working on a project where you'll need to reproduce results weeks or months later
  • Collaborating with team members who need visibility into your experiments
  • Tuning models where subtle changes in metrics matter
  • Tracking system resources (GPU utilization, memory usage) during training

Team Cost Considerations

Important: Due to team license costs on platforms like WandB and Neptune, we recommend that each team member set up their own individual free account rather than using shared team accounts. Both platforms offer generous free tiers that are sufficient for most research and development work:

  • WandB Free: Unlimited runs, 100GB storage, up to 2 team members
  • Neptune Free: 200 hours of tracking per month, unlimited projects

This approach:

  • Keeps costs manageable for the team
  • Gives each member full control over their experiment organization
  • Allows for personal workspaces that can be shared selectively with team members
  • Avoids hitting team quotas during heavy experimentation periods

For critical shared projects requiring team-wide visibility, we can evaluate paid team accounts on a case-by-case basis.

Platform Comparison

Here's a detailed comparison of the three main experiment tracking platforms:

| Feature | WandB | Neptune | TensorBoard |
| --- | --- | --- | --- |
| Hosting | Cloud-based (can self-host) | Cloud-based (can self-host) | Local/HPC (requires setup for remote access) |
| Ease of Setup | Very easy (pip install + API key) | Very easy (pip install + API token) | Easy locally, complex for secure remote access |
| Real-time Tracking | Excellent (live updating) | Excellent (live updating) | Good (requires refresh) |
| Collaboration | Built-in sharing & teams | Built-in sharing & teams | Manual (share files or set up server) |
| Cost | Free tier, then $50+/user/month | Free tier, then $59+/user/month | Free (open-source) |
| Integration Ecosystem | Extensive (PyTorch, TensorFlow, Keras, HuggingFace, etc.) | Extensive (similar to WandB) | Good (primarily PyTorch & TensorFlow) |
| Visualization Quality | Excellent interactive plots | Excellent interactive plots | Good but more basic |
| Model Registry | Yes | Yes | No (requires separate tools) |
| Hyperparameter Sweeps | Built-in sweep functionality | Built-in optimization | Manual setup required |
| Data Versioning | Artifacts + dataset tracking | Dataset versioning built-in | Not supported |
| Security on HPC | API key authentication | API token authentication | Requires special setup (see TensorBoard section) |
| Mobile App | Yes | Yes | No |
| API & Custom Dashboards | Comprehensive API | Comprehensive API | Limited API |
| Storage Limits (Free) | 100GB | Based on tracking hours | Unlimited (local storage) |
| Learning Curve | Low | Low | Low for basics, steeper for advanced |

Which Platform to Choose?

Choose WandB if:

  • You want the most popular platform with the largest community
  • You need extensive integrations with modern ML frameworks (especially HuggingFace)
  • You value real-time collaboration features and team dashboards
  • You want powerful hyperparameter sweep functionality out of the box
  • You're working on research that might benefit from public project sharing

Choose Neptune if:

  • You prefer a cleaner, more structured approach to metadata logging
  • You need robust dataset versioning capabilities
  • You want more flexible organization with workspaces and projects
  • You prefer Neptune's UI/UX (subjective, but some find it more intuitive)
  • You need longer data retention on the free tier

Choose TensorBoard if:

  • You're working on an HPC cluster like Sherlock where data privacy is critical
  • You want to avoid external dependencies and cloud services
  • Your experiments are already generating TensorBoard logs
  • You don't need real-time collaboration features
  • You're comfortable with more technical setup for remote access
  • Budget is a hard constraint (completely free)

Best Practice: Many teams use TensorBoard during active development on HPC clusters (for immediate feedback), then log final results to WandB or Neptune for long-term tracking and team visibility.


WandB (Weights & Biases)

Overview

Weights & Biases is the most widely adopted ML experiment tracking platform, known for its ease of use, real-time collaboration features, and extensive integration ecosystem. It's particularly popular in the deep learning community and among researchers.

Getting Started

1. Installation

pip install wandb

2. Account Setup

  1. Create a free account at https://wandb.ai/signup
  2. Get your API key from https://wandb.ai/authorize
  3. Login from your terminal:
wandb login

Paste your API key when prompted. This stores your credentials in ~/.netrc for future use.
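
For non-interactive environments such as batch jobs, a minimal sketch of authenticating from the environment (assumes you have already exported WANDB_API_KEY, e.g., in your job script):

import os
import wandb

# Read the key from the environment instead of prompting interactively
wandb.login(key=os.environ["WANDB_API_KEY"])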

3. Basic Integration

Here's a minimal example with PyTorch:

import wandb
import torch
import torch.nn as nn

# Initialize a new run
wandb.init(
    project="my-project",
    name="experiment-1",
    config={
        "learning_rate": 0.001,
        "epochs": 10,
        "batch_size": 32,
        "architecture": "ResNet-50",
        "dataset": "CIFAR-10"
    }
)

# Access config
config = wandb.config

# Training loop (create_model, train_epoch, and validate are placeholders
# for your own model construction and training code)
model = create_model(config)
for epoch in range(config.epochs):
    train_loss = train_epoch(model)
    val_loss, val_acc = validate(model)

    # Log metrics
    wandb.log({
        "epoch": epoch,
        "train_loss": train_loss,
        "val_loss": val_loss,
        "val_accuracy": val_acc
    })

# Finish the run
wandb.finish()

Advanced Features

Logging Different Data Types

import wandb
import matplotlib.pyplot as plt
import numpy as np

# Log images
images = [wandb.Image(img, caption=f"Sample {i}") for i, img in enumerate(batch)]
wandb.log({"examples": images})

# Log matplotlib figures
fig, ax = plt.subplots()
ax.plot(x, y)
wandb.log({"chart": wandb.Image(fig)})
plt.close()

# Log histograms
wandb.log({"gradients": wandb.Histogram(gradient_values)})

# Log tables
table = wandb.Table(
    columns=["id", "prediction", "truth"],
    data=[[1, 0.9, 1], [2, 0.1, 0]]
)
wandb.log({"predictions": table})

# Log videos
wandb.log({"video": wandb.Video(video_array, fps=4, format="mp4")})

# Log audio
wandb.log({"audio": wandb.Audio(audio_array, sample_rate=16000)})

# Log 3D point clouds
wandb.log({"point_cloud": wandb.Object3D(points)})

Model Checkpointing and Artifacts

WandB Artifacts allow you to version datasets, models, and other files:

import wandb

run = wandb.init(project="my-project")

# Save a model checkpoint as an artifact
artifact = wandb.Artifact(
    name="model-checkpoint",
    type="model",
    description="ResNet-50 trained on CIFAR-10"
)
artifact.add_file("model.pth")
artifact.add_file("config.yaml")

# Log the artifact
run.log_artifact(artifact)

# Later, load an artifact
run = wandb.init(project="my-project")
artifact = run.use_artifact("model-checkpoint:latest")
artifact_dir = artifact.download()

Dataset Versioning

import wandb

run = wandb.init(project="my-project")

# Create a dataset artifact
dataset_artifact = wandb.Artifact(
    name="cifar10-preprocessed",
    type="dataset",
    description="CIFAR-10 with augmentation pipeline v2"
)

# Add files or directories
dataset_artifact.add_dir("data/processed/")
dataset_artifact.add_file("data/metadata.json")

# Log the artifact with aliases
run.log_artifact(dataset_artifact, aliases=["latest", "v2.0"])

# Use the dataset in another run
run = wandb.init(project="my-project")
dataset = run.use_artifact("cifar10-preprocessed:latest")
data_dir = dataset.download()

Hyperparameter Sweeps

WandB sweeps automate hyperparameter tuning:

# sweep_config.yaml or defined in code
sweep_config = {
    "method": "bayes",  # or "grid", "random"
    "metric": {
        "name": "val_accuracy",
        "goal": "maximize"
    },
    "parameters": {
        "learning_rate": {
            "distribution": "log_uniform_values",
            "min": 1e-5,
            "max": 1e-1
        },
        "batch_size": {
            "values": [16, 32, 64, 128]
        },
        "dropout": {
            "distribution": "uniform",
            "min": 0.1,
            "max": 0.5
        },
        "optimizer": {
            "values": ["adam", "sgd", "rmsprop"]
        }
    }
}

# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")

# Define training function
def train():
    run = wandb.init()
    config = wandb.config

    # Train with these hyperparameters
    model = create_model(config)
    accuracy = train_model(model, config)

    wandb.log({"val_accuracy": accuracy})

# Run sweep agent
wandb.agent(sweep_id, function=train, count=50)  # Run 50 trials

To run multiple agents in parallel (e.g., on a cluster):

# On each compute node
wandb agent <sweep_id>

Integration with Frameworks

PyTorch Lightning:

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger

wandb_logger = WandbLogger(project="my-project", log_model=True)
trainer = Trainer(logger=wandb_logger)
trainer.fit(model, datamodule=dm)

HuggingFace Transformers:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    report_to="wandb",  # Automatically logs to WandB
    run_name="bert-finetuning"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)

trainer.train()

Keras:

from wandb.keras import WandbCallback

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[WandbCallback(save_model=True)]
)

System Monitoring

WandB automatically tracks system metrics (GPU utilization, memory, CPU, disk I/O) but you can customize:

wandb.init(
    project="my-project",
    settings=wandb.Settings(
        # Sample system metrics every 30 seconds
        _stats_sample_rate_seconds=30,
        # Average 50 samples into each logged data point
        _stats_samples_to_average=50
    )
)

Alerts

Set up alerts for important events:

import wandb

run = wandb.init(project="my-project")

# Send alert if validation loss is too high
if val_loss > 1.0:
    wandb.alert(
        title="High validation loss",
        text=f"Validation loss is {val_loss:.4f}",
        level=wandb.AlertLevel.WARN
    )

Best Practices for WandB

  1. Use Meaningful Project and Run Names

    wandb.init(
        project="image-classification-cifar10",
        name=f"resnet50-lr{lr}-bs{batch_size}",
        tags=["baseline", "resnet", "augmentation-v2"]
    )
    

  2. Group Related Experiments

    wandb.init(
        project="my-project",
        group="experiment-1",  # Group related runs
        job_type="train"  # train, eval, preprocess, etc.
    )
    

  3. Log Configuration Comprehensively

    config = {
        # Model architecture
        "architecture": "ResNet-50",
        "num_layers": 50,
    
        # Training hyperparameters
        "learning_rate": 0.001,
        "batch_size": 32,
        "epochs": 100,
        "optimizer": "adam",
    
        # Data configuration
        "dataset": "CIFAR-10",
        "data_augmentation": True,
        "train_samples": 50000,
    
        # System info
        "gpu": torch.cuda.get_device_name(0),
        "random_seed": 42
    }
    wandb.init(project="my-project", config=config)
    

  4. Add Notes and Tags

    run = wandb.init(
        project="my-project",
        notes="Testing new data augmentation strategy",
        tags=["experimental", "data-aug", "high-priority"]
    )
    

  5. Resume Failed Runs

    # Resume a crashed run
    run = wandb.init(
        project="my-project",
        id="unique-run-id",
        resume="must"  # or "allow" for optional resume
    )
    

  6. Use Offline Mode for Debugging

    # Set environment variable
    export WANDB_MODE=offline
    

Or in code:

wandb.init(project="my-project", mode="offline")

Sync later:

wandb sync wandb/offline-run-xxx

Common Issues and Solutions

Issue: API key not found - Solution: Run wandb login or set WANDB_API_KEY environment variable

Issue: Runs are slow to sync - Solution: Reduce logging frequency with wandb.log(..., step=step, commit=False) and commit every N steps
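
A minimal sketch of this pattern (num_steps and train_step are placeholders for your own training loop):

import wandb

run = wandb.init(project="my-project")

for step in range(num_steps):
    loss = train_step()

    # commit=False buffers the value in the current pending row instead of
    # syncing immediately (repeated calls overwrite the pending value)
    wandb.log({"train_loss": loss}, commit=False)

    # Flush the pending row to the server every 100 steps
    if step % 100 == 99:
        wandb.log({"global_step": step})  # default commit=True syncs the row

wandb.finish()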

Issue: Large model files filling storage - Solution: Be selective about what you save with artifacts, use .wandb_ignore file

Issue: Can't access runs from cluster - Solution: Make projects public or add team members to your workspace


Neptune

Overview

Neptune.ai is a comprehensive MLOps platform with a focus on metadata organization and experiment tracking. It offers a cleaner, more structured approach compared to WandB, with strong support for dataset versioning and flexible workspace organization.

Getting Started

1. Installation

pip install neptune

2. Account Setup

  1. Create a free account at https://neptune.ai/register
  2. Get your API token from your profile settings
  3. Set up authentication:
export NEPTUNE_API_TOKEN="your-api-token"

Or in Python:

import neptune

run = neptune.init_run(
    project="your-workspace/your-project",
    api_token="your-api-token"  # Better to use environment variable
)

3. Basic Integration

import neptune
import torch

# Initialize run
run = neptune.init_run(
    project="your-workspace/project-name",
    name="experiment-1",
    tags=["baseline", "resnet"]
)

# Log hyperparameters
params = {
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 10,
    "optimizer": "adam"
}
run["parameters"] = params

# Training loop
for epoch in range(params["epochs"]):
    train_loss = train_epoch()
    val_loss, val_acc = validate()

    # Log metrics
    run["train/loss"].append(train_loss)
    run["validation/loss"].append(val_loss)
    run["validation/accuracy"].append(val_acc)

# Stop tracking
run.stop()

Advanced Features

Hierarchical Metadata Structure

Neptune uses a namespace-based system for organizing metadata:

import neptune

run = neptune.init_run(project="workspace/project")

# Organize metrics in namespaces
run["train/loss"].append(0.5)
run["train/accuracy"].append(0.85)
run["validation/loss"].append(0.6)
run["validation/accuracy"].append(0.82)

# Model configuration
run["model/architecture"] = "ResNet-50"
run["model/parameters/conv1/filters"] = 64
run["model/parameters/fc/units"] = 1000

# Dataset information
run["dataset/name"] = "CIFAR-10"
run["dataset/train_size"] = 50000
run["dataset/preprocessing"] = "standardization + augmentation"

# System information (use a custom namespace; Neptune reserves "sys/" for its own metadata)
run["environment/gpu"] = torch.cuda.get_device_name(0)
run["environment/cuda_version"] = torch.version.cuda

Logging Different Data Types

import neptune
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

run = neptune.init_run(project="workspace/project")

# Log single values
run["metrics/final_accuracy"] = 0.95

# Log series (for metrics over time)
for i in range(100):
    run["metrics/loss"].append(loss_value)

# Log images
run["images/sample"].upload("path/to/image.png")
# Or from PIL
run["images/prediction"].upload(Image.fromarray(img_array))

# Log files
run["model/checkpoint"].upload("model.pth")
run["config"].upload("config.yaml")

# Log text
run["notes"] = "Increased learning rate, improved convergence"

# Log matplotlib figures
fig, ax = plt.subplots()
ax.plot(x, y)
run["plots/learning_curve"].upload(fig)

# Log HTML
run["reports/summary"].upload(neptune.types.File.as_html(html_content))

# Log pickled objects
run["data/predictions"].upload(neptune.types.File.as_pickle(predictions))

# Log dataframes
import pandas as pd
df = pd.DataFrame({"prediction": preds, "truth": labels})
run["tables/predictions"].upload(neptune.types.File.as_html(df))

Model Versioning

import neptune
from neptune.types import File

# Initialize model version
model_version = neptune.init_model_version(
    model="workspace/project-models",
    name="resnet50-v1"
)

# Log model metadata
model_version["architecture"] = "ResNet-50"
model_version["framework"] = "PyTorch"
model_version["signature"] = "Input: (B, 3, 224, 224) -> Output: (B, 1000)"

# Upload model files
model_version["model"].upload("model.pth")
model_version["preprocessing"].upload("preprocessing.py")

# Link to training run
model_version["training/run_id"] = run["sys/id"].fetch()

# Log validation metrics
model_version["validation/accuracy"] = 0.92
model_version["validation/f1_score"] = 0.89

# Change stage
model_version.change_stage("staging")  # or "production", "archived"

model_version.stop()

Dataset Versioning

import neptune

# Neptune has no standalone dataset-version object; track dataset versions
# from a run using artifact tracking (track_files records content hashes)
run = neptune.init_run(project="workspace/project")

# Log dataset metadata
run["dataset/name"] = "cifar10-v2"
run["dataset/size"] = 60000
run["dataset/preprocessing"] = "augmentation + normalization"
run["dataset/split/train"] = 50000
run["dataset/split/val"] = 10000

# Track files or directories (stores hashes so versions can be compared)
run["dataset/train"].track_files("data/train/")
run["dataset/val"].track_files("data/val/")

# Log data statistics
run["dataset/statistics/mean"] = [0.485, 0.456, 0.406]
run["dataset/statistics/std"] = [0.229, 0.224, 0.225]

run.stop()

Integration with Frameworks

PyTorch Lightning:

import os
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger

neptune_logger = NeptuneLogger(
    project="workspace/project",
    api_token=os.getenv("NEPTUNE_API_TOKEN"),
    name="experiment-1",
    tags=["pytorch-lightning", "baseline"]
)

trainer = Trainer(logger=neptune_logger)
trainer.fit(model, datamodule)

HuggingFace Transformers:

import neptune
from transformers import Trainer, TrainingArguments
from transformers.integrations import NeptuneCallback

run = neptune.init_run(project="workspace/project")

training_args = TrainingArguments(
    output_dir="./results",
    report_to="none"  # We'll use Neptune callback instead
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    callbacks=[NeptuneCallback(run=run)]
)

trainer.train()

Keras:

import neptune
from neptune.integrations.tensorflow_keras import NeptuneCallback

run = neptune.init_run(project="workspace/project")

model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    callbacks=[NeptuneCallback(run=run, base_namespace="metrics")]
)

Scikit-learn:

import neptune
from neptune.integrations.sklearn import create_regressor_summary

run = neptune.init_run(project="workspace/project")

# Log model parameters
run["model/parameters"] = model.get_params()

# Log model summary
run["model/summary"] = create_regressor_summary(
    model, X_train, X_test, y_train, y_test
)

# Log pickled model
run["model/pickled"].upload(neptune.types.File.as_pickle(model))

Querying Runs

Neptune provides a powerful API to fetch and analyze runs:

import neptune

# Fetch specific run
run = neptune.init_run(
    with_id="PROJ-123",  # Run ID
    project="workspace/project",
    mode="read-only"
)

# Access metadata
accuracy = run["validation/accuracy"].fetch()
params = run["parameters"].fetch()

# Fetch multiple runs
project = neptune.init_project(
    project="workspace/project",
    mode="read-only"
)

# Query runs with filters
runs_df = project.fetch_runs_table(
    columns=["sys/id", "sys/name", "parameters/learning_rate", "metrics/accuracy"],
    query='(parameters/learning_rate: float > 0.001) AND (tags: string CONTAINS "baseline")'
).to_pandas()

print(runs_df)

Comparing Runs

import neptune

project = neptune.init_project(project="workspace/project", mode="read-only")

# Fetch runs to compare
runs_table = project.fetch_runs_table(
    tag=["experiment-1"],
    columns=[
        "sys/id",
        "parameters/learning_rate",
        "parameters/batch_size",
        "metrics/final_accuracy"
    ]
).to_pandas()

# Sort by accuracy
best_runs = runs_table.sort_values("metrics/final_accuracy", ascending=False)
print(best_runs.head(10))

Best Practices for Neptune

  1. Use Consistent Namespace Structure

    # Good organization
    run["parameters/model/learning_rate"] = 0.001
    run["parameters/training/batch_size"] = 32
    run["metrics/train/loss"].append(loss)
    run["metrics/validation/accuracy"].append(acc)
    run["artifacts/model/checkpoint"].upload("model.pth")
    

  2. Tag Runs Appropriately

    run = neptune.init_run(
        project="workspace/project",
        tags=["baseline", "resnet50", "aug-v2", "experiment-1"]
    )
    

  3. Use Model Registry for Production Models

    # Register model when it's production-ready
    model_version = neptune.init_model_version(
        model="workspace/production-models",
        name="text-classifier-v1.2"
    )
    model_version["model"].upload("model.pth")
    model_version.change_stage("production")
    

  4. Log Dependencies and Environment

    run["environment/requirements"].upload("requirements.txt")
    run["environment/python_version"] = sys.version
    run["environment/cuda_version"] = torch.version.cuda
    

  5. Use Read-Only Mode for Analysis

    # Don't modify runs during analysis
    run = neptune.init_run(
        with_id="PROJ-123",
        project="workspace/project",
        mode="read-only"
    )
    

  6. Handle Interruptions Gracefully

    import neptune
    import signal
    import sys
    
    run = neptune.init_run(project="workspace/project")
    
    def signal_handler(sig, frame):
        print("Stopping Neptune run...")
        run.stop()
        sys.exit(0)
    
    signal.signal(signal.SIGINT, signal_handler)
    

Common Issues and Solutions

Issue: API token not recognized - Solution: Ensure NEPTUNE_API_TOKEN is set correctly, check for extra spaces

Issue: Project not found - Solution: Verify project format is workspace/project-name, check workspace name in Neptune UI

Issue: Slow uploads - Solution: Use asynchronous mode with run["metric"].append(value, wait=False)

Issue: Quota exceeded on free tier - Solution: Archive old runs, delete unnecessary data, or upgrade plan
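
If your Neptune client version includes the management API, old runs can also be moved to trash programmatically; a minimal sketch (the run ID PROJ-123 is a placeholder):

from neptune import management

# Move the listed runs to the project trash to free up quota
# (assumes NEPTUNE_API_TOKEN is set in the environment)
management.trash_objects(project="workspace/project", ids=["PROJ-123"])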


TensorBoard

Overview

TensorBoard is TensorFlow's open-source visualization toolkit, now framework-agnostic and widely used for ML experiment tracking. Unlike WandB and Neptune, TensorBoard runs locally or requires manual setup for remote access, making it ideal for HPC environments where data privacy is critical.

Why TensorBoard on HPC?

On shared HPC clusters like Sherlock, TensorBoard has historically posed security challenges because it lacks built-in authentication mechanisms. As noted in Stanford's Sherlock documentation:

"There is no notion of user session, credentials, nor access control in TensorBoard."

This means any cluster user could potentially access another user's TensorBoard instance and their associated data. Sherlock's solution uses an authenticating reverse proxy through the OnDemand portal, which provides secure, authenticated access to TensorBoard sessions with browser cookie-based verification.

Getting Started

1. Installation

pip install tensorboard torch
# Or for TensorFlow
pip install tensorflow  # Includes TensorBoard

2. Basic Integration with PyTorch

import torch
from torch.utils.tensorboard import SummaryWriter

# Create a SummaryWriter (logs to ./runs by default)
writer = SummaryWriter(log_dir='runs/experiment_1')

# Log hyperparameters
hparams = {
    'learning_rate': 0.001,
    'batch_size': 32,
    'epochs': 10
}

# Training loop
for epoch in range(hparams['epochs']):
    train_loss = train_epoch()
    val_loss, val_acc = validate()

    # Log scalars
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/validation', val_loss, epoch)
    writer.add_scalar('Accuracy/validation', val_acc, epoch)

# Log hyperparameters with final metrics (here, from the last epoch)
writer.add_hparams(
    hparams,
    {'hparam/accuracy': val_acc, 'hparam/loss': val_loss}
)

# Close writer
writer.close()

3. Viewing TensorBoard Locally

tensorboard --logdir=runs

Then open http://localhost:6006 in your browser.

Secure TensorBoard on Sherlock (HPC)

For users on Stanford's Sherlock cluster, use the TensorBoard OnDemand app for secure access:

Setting Up TensorBoard on Sherlock OnDemand

  1. Access the Sherlock OnDemand Portal

    • Navigate to the Sherlock OnDemand portal
    • Log in with your SUNet credentials

  2. Launch a TensorBoard Session

    • Go to "Interactive Apps" or "My Interactive Sessions"
    • Select "TensorBoard"
    • Configure your session:
      • Log Directory: Path to your TensorBoard logs (e.g., $HOME/runs or $SCRATCH/experiments/logs)
      • Partition: Select appropriate compute partition
      • Time: Session duration (hours)
      • Memory: Required memory

  3. Connect to Your Session

    • Once the session starts, click "Connect to TensorBoard"
    • The OnDemand system automatically handles authentication via a security cookie
    • Your session remains private - only you can access it

How Security Works

The OnDemand implementation uses:

  • Authenticating reverse proxy: Verifies your identity before allowing access
  • Browser cookie-based verification: The proxy checks your security cookie on each request
  • Transparent authentication: No manual configuration required
  • Cookie regeneration: If you lose your cookie, revisit "My Interactive Sessions" and select "Connect"

Best Practices for Sherlock TensorBoard

# 1. Organize logs by experiment in your scratch space (shell)
export LOG_DIR=$SCRATCH/tensorboard_logs/experiment_name
mkdir -p $LOG_DIR

# 2. In your Python training script
import os
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir=os.getenv('LOG_DIR', './runs'))

# 3. Submit your training job (shell)
sbatch train_script.sh

# 4. Launch a TensorBoard session via OnDemand pointing to $SCRATCH/tensorboard_logs
#    and monitor training in real time through the secure portal

Advanced TensorBoard Features

Logging Different Data Types

from torch.utils.tensorboard import SummaryWriter
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt

writer = SummaryWriter('runs/advanced_logging')

# Scalars (single values)
writer.add_scalar('Loss/train', train_loss, global_step)
writer.add_scalar('Accuracy/test', test_acc, global_step)

# Scalars from multiple sources (creates comparison charts)
writer.add_scalars('Loss/comparison', {
    'train': train_loss,
    'validation': val_loss,
    'test': test_loss
}, global_step)

# Images (tensors of shape [B, C, H, W])
writer.add_image('Input/sample', img_tensor, global_step)
# Multiple images in a grid
img_grid = torchvision.utils.make_grid(img_batch)
writer.add_image('Batch/samples', img_grid, global_step)

# Histograms (for weights, gradients, activations)
writer.add_histogram('Model/fc1.weight', model.fc1.weight, global_step)
writer.add_histogram('Gradients/fc1', model.fc1.weight.grad, global_step)

# Model graph (network architecture)
writer.add_graph(model, input_tensor)

# Embeddings (for dimensionality reduction visualization)
writer.add_embedding(
    features,  # [N, D] tensor
    metadata=labels,  # [N] labels
    label_img=images,  # [N, C, H, W] thumbnail images
    global_step=global_step
)

# Text
writer.add_text('Notes', 'Experiment with increased dropout', global_step)

# Precision-Recall curves
writer.add_pr_curve('PR-Curve', labels, predictions, global_step)

# Custom matplotlib figures
fig = plt.figure()
plt.plot(x, y)
writer.add_figure('Custom/plot', fig, global_step)

# Audio
writer.add_audio('Audio/sample', audio_tensor, global_step, sample_rate=44100)

# Video
writer.add_video('Video/training', video_tensor, global_step, fps=30)

writer.close()

Hyperparameter Tuning Visualization

from torch.utils.tensorboard import SummaryWriter

# Run multiple experiments with different hyperparameters
for lr in [0.001, 0.01, 0.1]:
    for batch_size in [16, 32, 64]:
        writer = SummaryWriter(f'runs/lr_{lr}_bs_{batch_size}')

        # Train model (train_model is a placeholder returning final metrics)
        final_acc, final_loss = train_model(lr, batch_size)

        # Log hyperparameters and metrics
        writer.add_hparams(
            {'learning_rate': lr, 'batch_size': batch_size},
            {'accuracy': final_acc, 'loss': final_loss}
        )
        writer.close()

# View parallel coordinates plot in TensorBoard's HPARAMS tab

Profiling (Performance Analysis)

import torch
import torch.profiler

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ],
    schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./runs/profiler'),
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    for step, (inputs, target) in enumerate(dataloader):
        optimizer.zero_grad()
        output = model(inputs)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        prof.step()

# View in TensorBoard's PROFILE tab

Integration with PyTorch Lightning

from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger

logger = TensorBoardLogger(
    save_dir='logs/',
    name='my_experiment',
    version='v1',
    log_graph=True,
    default_hp_metric=False
)

trainer = Trainer(
    logger=logger,
    log_every_n_steps=50,
    max_epochs=10
)

trainer.fit(model, datamodule)

Integration with TensorFlow/Keras

import tensorflow as tf
from tensorflow import keras

# Create callback
tensorboard_callback = keras.callbacks.TensorBoard(
    log_dir='logs/fit',
    histogram_freq=1,  # Log histograms every epoch
    write_graph=True,
    write_images=True,
    update_freq='epoch',
    profile_batch='500,520',  # Profile batches 500-520
    embeddings_freq=1
)

# Train model
model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    callbacks=[tensorboard_callback]
)

Organizing TensorBoard Logs

Directory Structure

Here's a recommended directory structure for organizing your TensorBoard logs:

logs/
├── experiment_1/
│   ├── run_1/
│   │   └── events.out.tfevents.xxx
│   ├── run_2/
│   │   └── events.out.tfevents.xxx
│   └── run_3/
│       └── events.out.tfevents.xxx
└── experiment_2/
    ├── baseline/
    │   └── events.out.tfevents.xxx
    └── improved/
        └── events.out.tfevents.xxx

View all experiments:

tensorboard --logdir=logs

View specific experiment:

tensorboard --logdir=logs/experiment_1

Compare specific runs:

tensorboard --logdir_spec=baseline:logs/experiment_2/baseline,improved:logs/experiment_2/improved

Custom Naming with SummaryWriter

import datetime
from torch.utils.tensorboard import SummaryWriter

# Timestamped runs
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
writer = SummaryWriter(f'runs/experiment_{timestamp}')

# Parametrized runs
writer = SummaryWriter(f'runs/lr_{lr}_bs_{batch_size}_{timestamp}')

# Hierarchical organization
writer = SummaryWriter(f'runs/{project}/{experiment}/{run_name}')

Remote Access to TensorBoard (General HPC)

If you're not using Sherlock OnDemand or need alternative access methods:

SSH Tunneling (Basic Method)

# On HPC cluster
tensorboard --logdir=runs --port=6006 --bind_all

# On local machine
ssh -N -L 6006:compute-node:6006 username@hpc-cluster.edu

Then access http://localhost:6006

Security Warning: TensorBoard itself has no authentication, and --bind_all opens the port to other users on the cluster network. Only use on trusted networks.

Using ngrok (For Temporary Sharing)

# On compute node
tensorboard --logdir=runs --port=6006 &
ngrok http 6006

Security Warning: This exposes your TensorBoard publicly. Only use for temporary demos.

Best Practice for Secure HPC Access

  1. Use institutional solutions like Sherlock OnDemand when available
  2. Never expose TensorBoard directly to the internet without authentication
  3. Use VPN + SSH tunneling if institutional solutions aren't available
  4. Consider exporting logs and viewing locally for maximum security:
# On HPC cluster
rsync -avz username@hpc:/path/to/runs ./local_runs

# On local machine
tensorboard --logdir=./local_runs

TensorBoard.dev (Public Sharing)

For sharing results publicly (e.g., with papers):

# Upload logs to TensorBoard.dev
tensorboard dev upload --logdir runs/experiment_1 \
    --name "My Experiment" \
    --description "ResNet-50 on CIFAR-10"

# Returns a public URL like https://tensorboard.dev/experiment/xxx

Note: TensorBoard.dev has since been shut down and no longer accepts uploads; data uploaded there was public and permanent. For public sharing today, use a public WandB or Neptune project, or distribute the log directory directly.

Comparing Multiple Experiments

In TensorBoard UI

  1. Launch TensorBoard with multiple runs:

    tensorboard --logdir=runs
    

  2. Use the run selector (left sidebar) to toggle runs on/off

  3. Use regex filtering to select runs matching patterns:

    .*lr_0\.01.*  # All runs with lr=0.01
    

  4. Compare metrics in the SCALARS tab

Programmatic Comparison

from tensorboard.backend.event_processing import event_accumulator
import pandas as pd

def load_tensorboard_data(log_dir):
    ea = event_accumulator.EventAccumulator(log_dir)
    ea.Reload()

    # Extract scalar data
    data = {}
    for tag in ea.Tags()['scalars']:
        events = ea.Scalars(tag)
        data[tag] = [(e.step, e.value) for e in events]

    return data

# Load and compare
run1_data = load_tensorboard_data('runs/experiment_1')
run2_data = load_tensorboard_data('runs/experiment_2')

# Create comparison DataFrame
df = pd.DataFrame({
    'run1_loss': [v for _, v in run1_data['Loss/train']],
    'run2_loss': [v for _, v in run2_data['Loss/train']]
})

print(df.describe())

Best Practices for TensorBoard

  1. Log at Consistent Intervals

    if global_step % log_interval == 0:
        writer.add_scalar('Loss/train', loss, global_step)
    

  2. Use Hierarchical Tag Names

    writer.add_scalar('Loss/train', train_loss, step)
    writer.add_scalar('Loss/validation', val_loss, step)
    writer.add_scalar('Metrics/accuracy', accuracy, step)
    writer.add_scalar('Metrics/f1_score', f1, step)
    

  3. Log Model Graph Early

    # Log graph once at the beginning
    dummy_input = torch.randn(1, 3, 224, 224)
    writer.add_graph(model, dummy_input)
    

  4. Monitor Gradients and Weights

    for name, param in model.named_parameters():
        writer.add_histogram(f'Weights/{name}', param, step)
        if param.grad is not None:
            writer.add_histogram(f'Gradients/{name}', param.grad, step)
    

  5. Clean Up Old Runs

    # Delete runs older than 30 days
    find runs/ -type f -mtime +30 -delete
    

  6. Use Context Managers

    with SummaryWriter('runs/experiment_1') as writer:
        # Training code
        writer.add_scalar('Loss', loss, step)
    # Writer automatically closed
    

  7. Flush Regularly for Real-time Updates

    writer.add_scalar('Loss/train', loss, step)
    writer.flush()  # Ensure data is written immediately
    

Common Issues and Solutions

Issue: TensorBoard not updating in real-time - Solution: Call writer.flush() after logging, or reduce flush_secs in SummaryWriter
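
For example, a smaller flush interval (flush_secs defaults to 120 seconds):

from torch.utils.tensorboard import SummaryWriter

# Write pending events to disk every 10 seconds instead of the default 120
writer = SummaryWriter(log_dir='runs/experiment_1', flush_secs=10)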

Issue: "No dashboards are active" message - Solution: Ensure log directory contains valid event files, check file permissions

Issue: Port already in use - Solution: Use a different port: tensorboard --logdir=runs --port=6007

Issue: TensorBoard consuming too much disk space - Solution: Reduce logging frequency, delete old runs, or use --purge_orphaned_data flag

Issue: Cannot access TensorBoard on HPC - Solution: Use institutional solution (e.g., Sherlock OnDemand) or SSH tunneling

Issue: Graph not displaying - Solution: Ensure model and input are on the same device, try model.eval() before logging graph
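
A minimal sketch of logging the graph robustly (assumes an existing writer and model; adjust the dummy input shape to your architecture):

import torch

device = next(model.parameters()).device       # match the model's device
dummy_input = torch.randn(1, 3, 224, 224, device=device)

model.eval()                                   # avoid BatchNorm/Dropout tracing issues
with torch.no_grad():
    writer.add_graph(model, dummy_input)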


Hybrid Approach: Combining Tools

Many teams use multiple tools together for optimal workflow:

TensorBoard + WandB/Neptune

Use TensorBoard for immediate feedback during development, and WandB/Neptune for long-term tracking:

import wandb
from torch.utils.tensorboard import SummaryWriter

# Initialize both
writer = SummaryWriter('runs/experiment_1')
run = wandb.init(project="my-project", name="experiment-1")

# Log to both
for epoch in range(epochs):
    train_loss = train_epoch()
    val_acc = validate()

    # TensorBoard (for immediate feedback)
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Accuracy/val', val_acc, epoch)

    # WandB (for team sharing and long-term storage)
    wandb.log({
        "train_loss": train_loss,
        "val_accuracy": val_acc,
        "epoch": epoch
    })

# Sync TensorBoard logs to WandB
wandb.save("runs/experiment_1/*")

writer.close()
wandb.finish()
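
Alternatively, wandb.init(sync_tensorboard=True) can mirror TensorBoard logs to WandB automatically, so you only write to the SummaryWriter; a minimal sketch:

import wandb
from torch.utils.tensorboard import SummaryWriter

# sync_tensorboard=True patches the TensorBoard writer so everything it logs
# is also streamed to the WandB run
run = wandb.init(project="my-project", sync_tensorboard=True)
writer = SummaryWriter('runs/experiment_1')

writer.add_scalar('Loss/train', 0.42, 0)  # shows up in both TensorBoard and WandB

writer.close()
wandb.finish()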

Local Development + Cloud Backup

import os
from torch.utils.tensorboard import SummaryWriter

# Always use TensorBoard locally
writer = SummaryWriter('runs/experiment_1')

# Log to cloud if available (e.g., not on HPC)
if os.getenv('WANDB_API_KEY') and not os.getenv('SLURM_JOB_ID'):
    import wandb
    run = wandb.init(project="my-project")
    use_wandb = True
else:
    use_wandb = False

# Training loop
for step in range(num_steps):
    loss = train_step()

    writer.add_scalar('Loss', loss, step)

    if use_wandb:
        wandb.log({"loss": loss}, step=step)

Model Development Workflow

  1. Rapid iteration (HPC): Use TensorBoard for quick experiments
  2. Promising results: Log to WandB/Neptune for team visibility
  3. Model selection: Compare across all platforms
  4. Production deployment: Use Neptune Model Registry or WandB Artifacts

Conclusion

Each platform has its strengths:

  • WandB: Best for teams needing real-time collaboration, comprehensive integrations, and powerful sweeps
  • Neptune: Best for structured metadata organization, dataset versioning, and cleaner UI
  • TensorBoard: Best for HPC environments, privacy-sensitive work, and zero-cost tracking

Recommendation for most users:

  • Start with TensorBoard if working on Sherlock or similar HPC clusters
  • Set up a personal WandB or Neptune account for long-term experiment tracking and collaboration
  • Use institutional solutions like Sherlock OnDemand for secure TensorBoard access
  • Consider a hybrid approach: TensorBoard for immediate feedback + WandB/Neptune for team sharing

The most important thing is to start tracking your experiments systematically. Even basic tracking is infinitely better than no tracking when you need to reproduce results or understand what worked six months later.