ML Experiment Tracking: WandB, Neptune, and TensorBoard
Overview: Why Experiment Tracking Matters
Machine learning experiment tracking is essential for maintaining reproducibility, comparing model performance, and collaborating effectively on ML projects. Without proper tracking, it becomes nearly impossible to:
- Reproduce successful experiments - Remember which hyperparameters, data versions, and code snapshots produced your best results
- Compare models systematically - Track metrics across dozens or hundreds of training runs to identify patterns
- Debug training issues - Visualize learning curves, gradients, and system metrics to diagnose problems
- Collaborate with team members - Share experiment results and insights with colleagues working on the same project
- Document your research - Create a historical record of what you've tried and why certain approaches worked or failed
When to Use Experiment Tracking
You should implement experiment tracking when:
- Running multiple training experiments with different hyperparameters
- Training models that take more than a few minutes to complete
- Working on a project where you'll need to reproduce results weeks or months later
- Collaborating with team members who need visibility into your experiments
- Tuning models where subtle changes in metrics matter
- Tracking system resources (GPU utilization, memory usage) during training
Team Cost Considerations
Important: Due to team license costs on platforms like WandB and Neptune, we recommend that each team member set up their own individual free account rather than using shared team accounts. Both platforms offer generous free tiers that are sufficient for most research and development work:
- WandB Free: Unlimited runs, 100GB storage, up to 2 team members
- Neptune Free: 200 hours of tracking per month, unlimited projects
This approach:
- Keeps costs manageable for the team
- Gives each member full control over their experiment organization
- Allows personal workspaces that can be shared selectively with team members
- Avoids hitting team quotas during heavy experimentation periods
For critical shared projects requiring team-wide visibility, we can evaluate paid team accounts on a case-by-case basis.
Platform Comparison
Here's a detailed comparison of the three main experiment tracking platforms:
| Feature | WandB | Neptune | TensorBoard |
|---|---|---|---|
| Hosting | Cloud-based (can self-host) | Cloud-based (can self-host) | Local/HPC (requires setup for remote access) |
| Ease of Setup | Very easy (pip install + API key) | Very easy (pip install + API token) | Easy locally, complex for secure remote access |
| Real-time Tracking | Excellent (live updating) | Excellent (live updating) | Good (requires refresh) |
| Collaboration | Built-in sharing & teams | Built-in sharing & teams | Manual (share files or set up server) |
| Cost | Free tier, then $50+/user/month | Free tier, then $59+/user/month | Free (open-source) |
| Integration Ecosystem | Extensive (PyTorch, TensorFlow, Keras, HuggingFace, etc.) | Extensive (similar to WandB) | Good (primarily PyTorch & TensorFlow) |
| Visualization Quality | Excellent interactive plots | Excellent interactive plots | Good but more basic |
| Model Registry | Yes | Yes | No (requires separate tools) |
| Hyperparameter Sweeps | Built-in sweep functionality | Built-in optimization | Manual setup required |
| Data Versioning | Artifacts + dataset tracking | Dataset versioning built-in | Not supported |
| Security on HPC | API key authentication | API token authentication | Requires special setup (see TensorBoard section) |
| Mobile App | Yes | Yes | No |
| API & Custom Dashboards | Comprehensive API | Comprehensive API | Limited API |
| Storage Limits (Free) | 100GB | Based on tracking hours | Unlimited (local storage) |
| Learning Curve | Low | Low | Low for basics, steeper for advanced |
Which Platform to Choose?
Choose WandB if:
- You want the most popular platform with the largest community
- You need extensive integrations with modern ML frameworks (especially HuggingFace)
- You value real-time collaboration features and team dashboards
- You want powerful hyperparameter sweep functionality out of the box
- You're working on research that might benefit from public project sharing
Choose Neptune if:
- You prefer a cleaner, more structured approach to metadata logging
- You need robust dataset versioning capabilities
- You want more flexible organization with workspaces and projects
- You prefer Neptune's UI/UX (subjective, but some find it more intuitive)
- You need longer data retention on the free tier
Choose TensorBoard if:
- You're working on an HPC cluster like Sherlock where data privacy is critical
- You want to avoid external dependencies and cloud services
- Your experiments are already generating TensorBoard logs
- You don't need real-time collaboration features
- You're comfortable with more technical setup for remote access
- Budget is a hard constraint (completely free)
Best Practice: Many teams use TensorBoard during active development on HPC clusters (for immediate feedback), then log final results to WandB or Neptune for long-term tracking and team visibility.
WandB (Weights & Biases)
Overview
Weights & Biases is the most widely adopted ML experiment tracking platform, known for its ease of use, real-time collaboration features, and extensive integration ecosystem. It's particularly popular in the deep learning community and among researchers.
Getting Started
1. Installation
pip install wandb
2. Account Setup
- Create a free account at https://wandb.ai/signup
- Get your API key from https://wandb.ai/authorize
- Login from your terminal:
wandb login
Paste your API key when prompted. This stores your credentials in ~/.netrc for future use.
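For batch jobs or CI pipelines where the interactive prompt is impractical, you can authenticate through the environment instead. A minimal sketch (the key value is a placeholder; keep real keys out of version control):
# Non-interactive login, e.g. in a SLURM or CI script
export WANDB_API_KEY="xxxxxxxxxxxxxxxx"   # placeholder API key
wandb login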
3. Basic Integration
Here's a minimal example with PyTorch:
import wandb
import torch
import torch.nn as nn
# Initialize a new run
wandb.init(
project="my-project",
name="experiment-1",
config={
"learning_rate": 0.001,
"epochs": 10,
"batch_size": 32,
"architecture": "ResNet-50",
"dataset": "CIFAR-10"
}
)
# Access config
config = wandb.config
# Training loop
model = create_model(config)
for epoch in range(config.epochs):
train_loss = train_epoch(model)
val_loss, val_acc = validate(model)
# Log metrics
wandb.log({
"epoch": epoch,
"train_loss": train_loss,
"val_loss": val_loss,
"val_accuracy": val_acc
})
# Finish the run
wandb.finish()
Advanced Features
Logging Different Data Types
import wandb
import matplotlib.pyplot as plt
import numpy as np
# Log images
images = [wandb.Image(img, caption=f"Sample {i}") for i, img in enumerate(batch)]
wandb.log({"examples": images})
# Log matplotlib figures
fig, ax = plt.subplots()
ax.plot(x, y)
wandb.log({"chart": wandb.Image(fig)})
plt.close()
# Log histograms
wandb.log({"gradients": wandb.Histogram(gradient_values)})
# Log tables
table = wandb.Table(
columns=["id", "prediction", "truth"],
data=[[1, 0.9, 1], [2, 0.1, 0]]
)
wandb.log({"predictions": table})
# Log videos
wandb.log({"video": wandb.Video(video_array, fps=4, format="mp4")})
# Log audio
wandb.log({"audio": wandb.Audio(audio_array, sample_rate=16000)})
# Log 3D point clouds
wandb.log({"point_cloud": wandb.Object3D(points)})
Model Checkpointing and Artifacts
WandB Artifacts allow you to version datasets, models, and other files:
import wandb
run = wandb.init(project="my-project")
# Save a model checkpoint as an artifact
artifact = wandb.Artifact(
name="model-checkpoint",
type="model",
description="ResNet-50 trained on CIFAR-10"
)
artifact.add_file("model.pth")
artifact.add_file("config.yaml")
# Log the artifact
run.log_artifact(artifact)
# Later, load an artifact
run = wandb.init(project="my-project")
artifact = run.use_artifact("model-checkpoint:latest")
artifact_dir = artifact.download()
Dataset Versioning
import wandb
run = wandb.init(project="my-project")
# Create a dataset artifact
dataset_artifact = wandb.Artifact(
name="cifar10-preprocessed",
type="dataset",
description="CIFAR-10 with augmentation pipeline v2"
)
# Add files or directories
dataset_artifact.add_dir("data/processed/")
dataset_artifact.add_file("data/metadata.json")
# Log the artifact with aliases
run.log_artifact(dataset_artifact, aliases=["latest", "v2.0"])
# Use the dataset in another run
run = wandb.init(project="my-project")
dataset = run.use_artifact("cifar10-preprocessed:latest")
data_dir = dataset.download()
Hyperparameter Sweeps
WandB sweeps automate hyperparameter tuning:
# sweep_config.yaml or defined in code
sweep_config = {
"method": "bayes", # or "grid", "random"
"metric": {
"name": "val_accuracy",
"goal": "maximize"
},
"parameters": {
"learning_rate": {
"distribution": "log_uniform_values",
"min": 1e-5,
"max": 1e-1
},
"batch_size": {
"values": [16, 32, 64, 128]
},
"dropout": {
"distribution": "uniform",
"min": 0.1,
"max": 0.5
},
"optimizer": {
"values": ["adam", "sgd", "rmsprop"]
}
}
}
# Initialize sweep
sweep_id = wandb.sweep(sweep_config, project="my-project")
# Define training function
def train():
run = wandb.init()
config = wandb.config
# Train with these hyperparameters
model = create_model(config)
accuracy = train_model(model, config)
wandb.log({"val_accuracy": accuracy})
# Run sweep agent
wandb.agent(sweep_id, function=train, count=50) # Run 50 trials
To run multiple agents in parallel (e.g., on a cluster):
# On each compute node
wandb agent <sweep_id>
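On a SLURM cluster, one common pattern is to launch several agents as a job array so that each task pulls trials from the same sweep. A minimal sketch (job name, array size, resources, and environment activation are placeholders; <sweep_id> is the ID printed by wandb.sweep, typically in entity/project/sweep_id form):
#!/bin/bash
#SBATCH --job-name=wandb-sweep
#SBATCH --array=0-3              # 4 agents pulling trials from the same sweep
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00

source activate my-env           # placeholder environment
wandb agent <sweep_id>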
Integration with Popular Frameworks
PyTorch Lightning:
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import WandbLogger
wandb_logger = WandbLogger(project="my-project", log_model=True)
trainer = Trainer(logger=wandb_logger)
trainer.fit(model, datamodule=dm)
HuggingFace Transformers:
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir="./results",
report_to="wandb", # Automatically logs to WandB
run_name="bert-finetuning"
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset
)
trainer.train()
Keras:
from wandb.keras import WandbCallback
model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
callbacks=[WandbCallback(save_model=True)]
)
System Monitoring
WandB automatically tracks system metrics (GPU utilization, memory, CPU, disk I/O) but you can customize:
wandb.init(
project="my-project",
settings=wandb.Settings(
# Log system metrics every 30 seconds
_stats_sample_rate_seconds=30,
# Average 50 samples before reporting each data point
_stats_samples_to_average=50
)
)
Alerts
Set up alerts for important events:
import wandb
run = wandb.init(project="my-project")
# Send alert if validation loss is too high
if val_loss > 1.0:
wandb.alert(
title="High validation loss",
text=f"Validation loss is {val_loss:.4f}",
level=wandb.AlertLevel.WARN
)
Best Practices for WandB
1. Use Meaningful Project and Run Names
wandb.init(
    project="image-classification-cifar10",
    name=f"resnet50-lr{lr}-bs{batch_size}",
    tags=["baseline", "resnet", "augmentation-v2"]
)
2. Group Related Experiments
wandb.init(
    project="my-project",
    group="experiment-1",  # Group related runs
    job_type="train"       # train, eval, preprocess, etc.
)
3. Log Configuration Comprehensively
config = {
    # Model architecture
    "architecture": "ResNet-50",
    "num_layers": 50,
    # Training hyperparameters
    "learning_rate": 0.001,
    "batch_size": 32,
    "epochs": 100,
    "optimizer": "adam",
    # Data configuration
    "dataset": "CIFAR-10",
    "data_augmentation": True,
    "train_samples": 50000,
    # System info
    "gpu": torch.cuda.get_device_name(0),
    "random_seed": 42
}
wandb.init(project="my-project", config=config)
4. Add Notes and Tags
run = wandb.init(
    project="my-project",
    notes="Testing new data augmentation strategy",
    tags=["experimental", "data-aug", "high-priority"]
)
5. Resume Failed Runs
# Resume a crashed run
run = wandb.init(
    project="my-project",
    id="unique-run-id",
    resume="must"  # or "allow" for optional resume
)
6. Use Offline Mode for Debugging
# Set environment variable
export WANDB_MODE=offline
Or in code:
wandb.init(project="my-project", mode="offline")
Sync later:
wandb sync wandb/offline-run-xxx
Common Issues and Solutions
Issue: API key not found
- Solution: Run wandb login or set WANDB_API_KEY environment variable
Issue: Runs are slow to sync
- Solution: Reduce logging frequency with wandb.log(..., commit=False) and commit every N steps (see the sketch after this list)
Issue: Large model files filling storage
- Solution: Be selective about which files you add to artifacts and avoid saving full checkpoints every epoch
Issue: Collaborators can't access your runs
- Solution: Make the project public or add them as members of your workspace
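The reduced-sync pattern mentioned above can look like this; a minimal sketch where dataloader and train_step stand in for your own data loading and training code:
for step, batch in enumerate(dataloader):
    loss = train_step(batch)                    # placeholder training step
    # commit=False buffers the values locally; the call with commit=True
    # (every 100 steps here) writes one combined history row to the server
    wandb.log({"train_loss": loss}, commit=(step % 100 == 0))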
Neptune
Overview
Neptune.ai is a comprehensive MLOps platform with a focus on metadata organization and experiment tracking. It offers a cleaner, more structured approach compared to WandB, with strong support for dataset versioning and flexible workspace organization.
Getting Started
1. Installation
pip install neptune
2. Account Setup
- Create a free account at https://neptune.ai/register
- Get your API token from your profile settings
- Set up authentication:
export NEPTUNE_API_TOKEN="your-api-token"
Or in Python:
import neptune
run = neptune.init_run(
project="your-workspace/your-project",
api_token="your-api-token" # Better to use environment variable
)
3. Basic Integration
import neptune
import torch
# Initialize run
run = neptune.init_run(
project="your-workspace/project-name",
name="experiment-1",
tags=["baseline", "resnet"]
)
# Log hyperparameters
run["parameters"] = {
"learning_rate": 0.001,
"batch_size": 32,
"epochs": 10,
"optimizer": "adam"
}
# Training loop
for epoch in range(epochs):
train_loss = train_epoch()
val_loss, val_acc = validate()
# Log metrics
run["train/loss"].append(train_loss)
run["validation/loss"].append(val_loss)
run["validation/accuracy"].append(val_acc)
# Stop tracking
run.stop()
Advanced Features
Hierarchical Metadata Structure
Neptune uses a namespace-based system for organizing metadata:
import neptune
run = neptune.init_run(project="workspace/project")
# Organize metrics in namespaces
run["train/loss"].append(0.5)
run["train/accuracy"].append(0.85)
run["validation/loss"].append(0.6)
run["validation/accuracy"].append(0.82)
# Model configuration
run["model/architecture"] = "ResNet-50"
run["model/parameters/conv1/filters"] = 64
run["model/parameters/fc/units"] = 1000
# Dataset information
run["dataset/name"] = "CIFAR-10"
run["dataset/train_size"] = 50000
run["dataset/preprocessing"] = "standardization + augmentation"
# System information
run["sys/gpu"] = torch.cuda.get_device_name(0)
run["sys/cuda_version"] = torch.version.cuda
Logging Different Data Types
import neptune
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
run = neptune.init_run(project="workspace/project")
# Log single values
run["metrics/final_accuracy"] = 0.95
# Log series (for metrics over time)
for i in range(100):
run["metrics/loss"].append(loss_value)
# Log images
run["images/sample"].upload("path/to/image.png")
# Or from PIL
run["images/prediction"].upload(Image.fromarray(img_array))
# Log files
run["model/checkpoint"].upload("model.pth")
run["config"].upload("config.yaml")
# Log text
run["notes"] = "Increased learning rate, improved convergence"
# Log matplotlib figures
fig, ax = plt.subplots()
ax.plot(x, y)
run["plots/learning_curve"].upload(fig)
# Log HTML
run["reports/summary"].upload(neptune.types.File.as_html(html_content))
# Log pickled objects
run["data/predictions"].upload(neptune.types.File.as_pickle(predictions))
# Log dataframes
import pandas as pd
df = pd.DataFrame({"prediction": preds, "truth": labels})
run["tables/predictions"].upload(neptune.types.File.as_html(df))
Model Versioning
import neptune
from neptune.types import File
# Initialize model version
model_version = neptune.init_model_version(
model="workspace/project-models",
name="resnet50-v1"
)
# Log model metadata
model_version["architecture"] = "ResNet-50"
model_version["framework"] = "PyTorch"
model_version["signature"] = "Input: (B, 3, 224, 224) -> Output: (B, 1000)"
# Upload model files
model_version["model"].upload("model.pth")
model_version["preprocessing"].upload("preprocessing.py")
# Link to training run
model_version["training/run_id"] = run._sys_id
# Log validation metrics
model_version["validation/accuracy"] = 0.92
model_version["validation/f1_score"] = 0.89
# Change stage
model_version.change_stage("staging") # or "production", "archived"
model_version.stop()
Dataset Versioning
import neptune

run = neptune.init_run(project="workspace/project")

# Neptune versions datasets by tracking files as artifacts:
# track_files() records the location, size, and a version hash of the data
run["datasets/train"].track_files("data/train/")
run["datasets/val"].track_files("data/val/")

# Log dataset metadata alongside the tracked files
run["datasets/name"] = "cifar10-v2"
run["datasets/size"] = 60000
run["datasets/preprocessing"] = "augmentation + normalization"
run["datasets/split/train"] = 50000
run["datasets/split/val"] = 10000

# Log data statistics
run["datasets/statistics/mean"] = [0.485, 0.456, 0.406]
run["datasets/statistics/std"] = [0.229, 0.224, 0.225]

run.stop()
Integration with Frameworks
PyTorch Lightning:
import os
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import NeptuneLogger
neptune_logger = NeptuneLogger(
project="workspace/project",
api_token=os.getenv("NEPTUNE_API_TOKEN"),
name="experiment-1",
tags=["pytorch-lightning", "baseline"]
)
trainer = Trainer(logger=neptune_logger)
trainer.fit(model, datamodule)
HuggingFace Transformers:
from transformers import Trainer, TrainingArguments
import neptune
from transformers.integrations import NeptuneCallback
run = neptune.init_run(project="workspace/project")
training_args = TrainingArguments(
output_dir="./results",
report_to="none" # We'll use Neptune callback instead
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
callbacks=[NeptuneCallback(run=run)]
)
trainer.train()
Keras:
import neptune
from neptune.integrations.tensorflow_keras import NeptuneCallback
run = neptune.init_run(project="workspace/project")
model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
callbacks=[NeptuneCallback(run=run, base_namespace="metrics")]
)
Scikit-learn:
from neptune.integrations.sklearn import create_regressor_summary
import neptune
run = neptune.init_run(project="workspace/project")
# Log model parameters
run["model/parameters"] = model.get_params()
# Log model summary
run["model/summary"] = create_regressor_summary(
model, X_train, X_test, y_train, y_test
)
# Log pickled model
run["model/pickled"].upload(neptune.types.File.as_pickle(model))
Querying Runs
Neptune provides a powerful API to fetch and analyze runs:
import neptune
# Fetch specific run
run = neptune.init_run(
with_id="PROJ-123", # Run ID
project="workspace/project",
mode="read-only"
)
# Access metadata
accuracy = run["validation/accuracy"].fetch()
params = run["parameters"].fetch()
# Fetch multiple runs
project = neptune.init_project(
name="workspace/project",
mode="read-only"
)
# Query runs with filters
runs_df = project.fetch_runs_table(
columns=["sys/id", "sys/name", "parameters/learning_rate", "metrics/accuracy"],
query='(parameters/learning_rate: float > 0.001) AND (tags: string CONTAINS "baseline")'
).to_pandas()
print(runs_df)
Comparing Runs
import neptune
project = neptune.init_project(name="workspace/project", mode="read-only")
# Fetch runs to compare
runs_table = project.fetch_runs_table(
tag=["experiment-1"],
columns=[
"sys/id",
"parameters/learning_rate",
"parameters/batch_size",
"metrics/final_accuracy"
]
).to_pandas()
# Sort by accuracy
best_runs = runs_table.sort_values("metrics/final_accuracy", ascending=False)
print(best_runs.head(10))
Best Practices for Neptune
1. Use Consistent Namespace Structure
# Good organization
run["parameters/model/learning_rate"] = 0.001
run["parameters/training/batch_size"] = 32
run["metrics/train/loss"].append(loss)
run["metrics/validation/accuracy"].append(acc)
run["artifacts/model/checkpoint"].upload("model.pth")
2. Tag Runs Appropriately
run = neptune.init_run(
    project="workspace/project",
    tags=["baseline", "resnet50", "aug-v2", "experiment-1"]
)
3. Use Model Registry for Production Models
# Register model when it's production-ready
model_version = neptune.init_model_version(
    model="workspace/production-models",
    name="text-classifier-v1.2"
)
model_version["model"].upload("model.pth")
model_version.change_stage("production")
4. Log Dependencies and Environment
run["environment/requirements"].upload("requirements.txt")
run["environment/python_version"] = sys.version
run["environment/cuda_version"] = torch.version.cuda
5. Use Read-Only Mode for Analysis
# Don't modify runs during analysis
run = neptune.init_run(
    with_id="PROJ-123",
    project="workspace/project",
    mode="read-only"
)
6. Handle Interruptions Gracefully
import neptune
import signal
import sys

run = neptune.init_run(project="workspace/project")

def signal_handler(sig, frame):
    print("Stopping Neptune run...")
    run.stop()
    sys.exit(0)

signal.signal(signal.SIGINT, signal_handler)
Common Issues and Solutions
Issue: API token not recognized
- Solution: Ensure NEPTUNE_API_TOKEN is set correctly, check for extra spaces
Issue: Project not found
- Solution: Verify project format is workspace/project-name, check workspace name in Neptune UI
Issue: Slow uploads
- Solution: Use asynchronous mode with run["metric"].append(value, wait=False)
Issue: Quota exceeded on free tier
- Solution: Archive old runs, delete unnecessary data, or upgrade plan
TensorBoard
Overview
TensorBoard is TensorFlow's open-source visualization toolkit, now framework-agnostic and widely used for ML experiment tracking. Unlike WandB and Neptune, TensorBoard runs locally or requires manual setup for remote access, making it ideal for HPC environments where data privacy is critical.
Why TensorBoard on HPC?
On shared HPC clusters like Sherlock, TensorBoard has historically posed security challenges because it lacks built-in authentication mechanisms. As noted in Stanford's Sherlock documentation:
"There is no notion of user session, credentials, nor access control in TensorBoard."
This means any cluster user could potentially access another user's TensorBoard instance and their associated data. Sherlock's solution uses an authenticating reverse proxy through the OnDemand portal, which provides secure, authenticated access to TensorBoard sessions with browser cookie-based verification.
Getting Started
1. Installation
pip install tensorboard torch
# Or for TensorFlow
pip install tensorflow # Includes TensorBoard
2. Basic Integration with PyTorch
import torch
from torch.utils.tensorboard import SummaryWriter
# Create a SummaryWriter (logs to ./runs by default)
writer = SummaryWriter(log_dir='runs/experiment_1')
# Log hyperparameters
hparams = {
'learning_rate': 0.001,
'batch_size': 32,
'epochs': 10
}
# Training loop
for epoch in range(epochs):
train_loss = train_epoch()
val_loss, val_acc = validate()
# Log scalars
writer.add_scalar('Loss/train', train_loss, epoch)
writer.add_scalar('Loss/validation', val_loss, epoch)
writer.add_scalar('Accuracy/validation', val_acc, epoch)
# Log hyperparameters with final metrics
writer.add_hparams(
hparams,
{'hparam/accuracy': final_acc, 'hparam/loss': final_loss}
)
# Close writer
writer.close()
3. Viewing TensorBoard Locally
tensorboard --logdir=runs
Then open http://localhost:6006 in your browser.
Secure TensorBoard on Sherlock (HPC)
For users on Stanford's Sherlock cluster, use the TensorBoard OnDemand app for secure access:
Setting Up TensorBoard on Sherlock OnDemand
1. Access the Sherlock OnDemand Portal
- Navigate to the Sherlock OnDemand portal
- Log in with your SUNet credentials
2. Launch a TensorBoard Session
- Go to "Interactive Apps" or "My Interactive Sessions"
- Select "TensorBoard"
- Configure your session:
  - Log Directory: path to your TensorBoard logs (e.g., $HOME/runs or $SCRATCH/experiments/logs)
  - Partition: select an appropriate compute partition
  - Time: session duration (hours)
  - Memory: required memory
3. Connect to Your Session
- Once the session starts, click "Connect to TensorBoard"
- The OnDemand system automatically handles authentication via a security cookie
- Your session remains private - only you can access it
How Security Works
The OnDemand implementation uses:
- Authenticating reverse proxy: verifies your identity before allowing access
- Browser cookie-based verification: the proxy checks your security cookie on each request
- Transparent authentication: no manual configuration required
- Cookie regeneration: if you lose your cookie, revisit "My Interactive Sessions" and select "Connect"
Best Practices for Sherlock TensorBoard
# 1. Organize logs by experiment in your scratch space
export LOG_DIR=$SCRATCH/tensorboard_logs/experiment_name
mkdir -p $LOG_DIR
# 2. In your Python script
import os
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter(log_dir=os.getenv('LOG_DIR', './runs'))
# 3. Submit your training job
sbatch train_script.sh
# 4. Launch TensorBoard session via OnDemand pointing to $SCRATCH/tensorboard_logs
# Monitor training in real-time through the secure portal
Advanced TensorBoard Features
Logging Different Data Types
from torch.utils.tensorboard import SummaryWriter
import torch
import torchvision
import numpy as np
import matplotlib.pyplot as plt
writer = SummaryWriter('runs/advanced_logging')
# Scalars (single values)
writer.add_scalar('Loss/train', train_loss, global_step)
writer.add_scalar('Accuracy/test', test_acc, global_step)
# Scalars from multiple sources (creates comparison charts)
writer.add_scalars('Loss/comparison', {
'train': train_loss,
'validation': val_loss,
'test': test_loss
}, global_step)
# Images (tensors of shape [B, C, H, W])
writer.add_image('Input/sample', img_tensor, global_step)
# Multiple images in a grid
img_grid = torchvision.utils.make_grid(img_batch)
writer.add_image('Batch/samples', img_grid, global_step)
# Histograms (for weights, gradients, activations)
writer.add_histogram('Model/fc1.weight', model.fc1.weight, global_step)
writer.add_histogram('Gradients/fc1', model.fc1.weight.grad, global_step)
# Model graph (network architecture)
writer.add_graph(model, input_tensor)
# Embeddings (for dimensionality reduction visualization)
writer.add_embedding(
features, # [N, D] tensor
metadata=labels, # [N] labels
label_img=images, # [N, C, H, W] thumbnail images
global_step=global_step
)
# Text
writer.add_text('Notes', 'Experiment with increased dropout', global_step)
# Precision-Recall curves
writer.add_pr_curve('PR-Curve', labels, predictions, global_step)
# Custom matplotlib figures
fig = plt.figure()
plt.plot(x, y)
writer.add_figure('Custom/plot', fig, global_step)
# Audio
writer.add_audio('Audio/sample', audio_tensor, global_step, sample_rate=44100)
# Video
writer.add_video('Video/training', video_tensor, global_step, fps=30)
writer.close()
Hyperparameter Tuning Visualization
from torch.utils.tensorboard import SummaryWriter
# Run multiple experiments with different hyperparameters
for lr in [0.001, 0.01, 0.1]:
for batch_size in [16, 32, 64]:
writer = SummaryWriter(f'runs/lr_{lr}_bs_{batch_size}')
# Train model
final_acc = train_model(lr, batch_size)
# Log hyperparameters and metrics
writer.add_hparams(
{'learning_rate': lr, 'batch_size': batch_size},
{'accuracy': final_acc, 'loss': final_loss}
)
writer.close()
# View parallel coordinates plot in TensorBoard's HPARAMS tab
Profiling (Performance Analysis)
import torch
import torch.profiler
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
schedule=torch.profiler.schedule(wait=1, warmup=1, active=3, repeat=2),
on_trace_ready=torch.profiler.tensorboard_trace_handler('./runs/profiler'),
record_shapes=True,
profile_memory=True,
with_stack=True
) as prof:
for step, batch in enumerate(dataloader):
output = model(batch)
loss = criterion(output, target)
loss.backward()
optimizer.step()
prof.step()
# View in TensorBoard's PROFILE tab
Integration with PyTorch Lightning
from pytorch_lightning import Trainer
from pytorch_lightning.loggers import TensorBoardLogger
logger = TensorBoardLogger(
save_dir='logs/',
name='my_experiment',
version='v1',
log_graph=True,
default_hp_metric=False
)
trainer = Trainer(
logger=logger,
log_every_n_steps=50,
max_epochs=10
)
trainer.fit(model, datamodule)
Integration with TensorFlow/Keras
import tensorflow as tf
from tensorflow import keras
# Create callback
tensorboard_callback = keras.callbacks.TensorBoard(
log_dir='logs/fit',
histogram_freq=1, # Log histograms every epoch
write_graph=True,
write_images=True,
update_freq='epoch',
profile_batch='500,520', # Profile batches 500-520
embeddings_freq=1
)
# Train model
model.fit(
X_train, y_train,
validation_data=(X_val, y_val),
epochs=10,
callbacks=[tensorboard_callback]
)
Organizing TensorBoard Logs
Directory Structure
Here's a recommended directory structure for organizing your TensorBoard logs:
logs/
experiment_1/
run_1/
events.out.tfevents.xxx
run_2/
events.out.tfevents.xxx
run_3/
events.out.tfevents.xxx
experiment_2/
baseline/
events.out.tfevents.xxx
improved/
events.out.tfevents.xxx
View all experiments:
tensorboard --logdir=logs
View specific experiment:
tensorboard --logdir=logs/experiment_1
Compare specific runs:
tensorboard --logdir_spec=baseline:logs/experiment_2/baseline,improved:logs/experiment_2/improved
Custom Naming with SummaryWriter
import datetime
from torch.utils.tensorboard import SummaryWriter
# Timestamped runs
timestamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
writer = SummaryWriter(f'runs/experiment_{timestamp}')
# Parametrized runs
writer = SummaryWriter(f'runs/lr_{lr}_bs_{batch_size}_{timestamp}')
# Hierarchical organization
writer = SummaryWriter(f'runs/{project}/{experiment}/{run_name}')
Remote Access to TensorBoard (General HPC)
If you're not using Sherlock OnDemand or need alternative access methods:
SSH Tunneling (Basic Method)
# On HPC cluster
tensorboard --logdir=runs --port=6006 --bind_all
# On local machine
ssh -N -L 6006:compute-node:6006 username@hpc-cluster.edu
Then access http://localhost:6006
Security Warning: This method lacks authentication. Only use on trusted networks.
Using ngrok (For Temporary Sharing)
# On compute node
tensorboard --logdir=runs --port=6006 &
ngrok http 6006
Security Warning: This exposes your TensorBoard publicly. Only use for temporary demos.
Best Practice for Secure HPC Access
- Use institutional solutions like Sherlock OnDemand when available
- Never expose TensorBoard directly to the internet without authentication
- Use VPN + SSH tunneling if institutional solutions aren't available
- Consider exporting logs and viewing locally for maximum security:
# On HPC cluster
rsync -avz username@hpc:/path/to/runs ./local_runs
# On local machine
tensorboard --logdir=./local_runs
TensorBoard.dev (Public Sharing)
For sharing results publicly (e.g., with papers):
# Upload logs to TensorBoard.dev
tensorboard dev upload --logdir runs/experiment_1 \
--name "My Experiment" \
--description "ResNet-50 on CIFAR-10"
# Returns a public URL like https://tensorboard.dev/experiment/xxx
Note: Uploaded data is public; only use this for published research. Be aware that Google has since shut down TensorBoard.dev, so verify the service is still available before depending on it.
Comparing Multiple Experiments
In TensorBoard UI
1. Launch TensorBoard with multiple runs:
tensorboard --logdir=runs
2. Use the run selector (left sidebar) to toggle runs on/off
3. Use regex filtering to select runs matching patterns:
.*lr_0\.01.*   # All runs with lr=0.01
4. Compare metrics in the SCALARS tab
Programmatic Comparison
from tensorboard.backend.event_processing import event_accumulator
import pandas as pd
def load_tensorboard_data(log_dir):
ea = event_accumulator.EventAccumulator(log_dir)
ea.Reload()
# Extract scalar data
data = {}
for tag in ea.Tags()['scalars']:
events = ea.Scalars(tag)
data[tag] = [(e.step, e.value) for e in events]
return data
# Load and compare
run1_data = load_tensorboard_data('runs/experiment_1')
run2_data = load_tensorboard_data('runs/experiment_2')
# Create comparison DataFrame
df = pd.DataFrame({
'run1_loss': [v for _, v in run1_data['Loss/train']],
'run2_loss': [v for _, v in run2_data['Loss/train']]
})
print(df.describe())
Best Practices for TensorBoard
1. Log at Consistent Intervals
if global_step % log_interval == 0:
    writer.add_scalar('Loss/train', loss, global_step)
2. Use Hierarchical Tag Names
writer.add_scalar('Loss/train', train_loss, step)
writer.add_scalar('Loss/validation', val_loss, step)
writer.add_scalar('Metrics/accuracy', accuracy, step)
writer.add_scalar('Metrics/f1_score', f1, step)
3. Log Model Graph Early
# Log graph once at the beginning
dummy_input = torch.randn(1, 3, 224, 224)
writer.add_graph(model, dummy_input)
4. Monitor Gradients and Weights
for name, param in model.named_parameters():
    writer.add_histogram(f'Weights/{name}', param, step)
    if param.grad is not None:
        writer.add_histogram(f'Gradients/{name}', param.grad, step)
5. Clean Up Old Runs
# Delete runs older than 30 days
find runs/ -type f -mtime +30 -delete
6. Use Context Managers
with SummaryWriter('runs/experiment_1') as writer:
    # Training code
    writer.add_scalar('Loss', loss, step)
    # Writer automatically closed on exit
7. Flush Regularly for Real-time Updates
writer.add_scalar('Loss/train', loss, step)
writer.flush()  # Ensure data is written immediately
Common Issues and Solutions
Issue: TensorBoard not updating in real-time
- Solution: Call writer.flush() after logging, or reduce flush_secs in SummaryWriter (see the sketch after this list)
Issue: "No dashboards are active" message - Solution: Ensure log directory contains valid event files, check file permissions
Issue: Port already in use
- Solution: Use a different port: tensorboard --logdir=runs --port=6007
Issue: TensorBoard consuming too much disk space
- Solution: Reduce logging frequency, delete old runs, or use --purge_orphaned_data flag
Issue: Cannot access TensorBoard on HPC
- Solution: Use an institutional solution (e.g., Sherlock OnDemand) or SSH tunneling
Issue: Graph not displaying
- Solution: Ensure model and input are on the same device, try model.eval() before logging graph
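For the real-time update issue above, the flush interval can also be shortened when the writer is created; a small sketch:
from torch.utils.tensorboard import SummaryWriter

# Flush pending events to disk every 10 seconds instead of the default interval
writer = SummaryWriter(log_dir='runs/experiment_1', flush_secs=10)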
Hybrid Approach: Combining Tools
Many teams use multiple tools together for optimal workflow:
TensorBoard + WandB/Neptune
Use TensorBoard for immediate feedback during development, and WandB/Neptune for long-term tracking:
import wandb
from torch.utils.tensorboard import SummaryWriter
# Initialize both
writer = SummaryWriter('runs/experiment_1')
run = wandb.init(project="my-project", name="experiment-1")
# Log to both
for epoch in range(epochs):
train_loss = train_epoch()
val_acc = validate()
# TensorBoard (for immediate feedback)
writer.add_scalar('Loss/train', train_loss, epoch)
writer.add_scalar('Accuracy/val', val_acc, epoch)
# WandB (for team sharing and long-term storage)
wandb.log({
"train_loss": train_loss,
"val_accuracy": val_acc,
"epoch": epoch
})
# Sync TensorBoard logs to WandB
wandb.save("runs/experiment_1/*")
writer.close()
wandb.finish()
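If you prefer not to log every metric twice, WandB can also ingest TensorBoard events automatically when the run is initialized with sync_tensorboard=True; a minimal sketch:
import wandb
from torch.utils.tensorboard import SummaryWriter

# With sync_tensorboard=True, scalars written via SummaryWriter are
# mirrored to the WandB run automatically
wandb.init(project="my-project", sync_tensorboard=True)
writer = SummaryWriter('runs/experiment_1')

writer.add_scalar('Loss/train', 0.42, 0)   # also appears in the WandB UI
writer.close()
wandb.finish()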
Local Development + Cloud Backup
import os
from torch.utils.tensorboard import SummaryWriter
# Always use TensorBoard locally
writer = SummaryWriter('runs/experiment_1')
# Log to cloud if available (e.g., not on HPC)
if os.getenv('WANDB_API_KEY') and not os.getenv('SLURM_JOB_ID'):
import wandb
run = wandb.init(project="my-project")
use_wandb = True
else:
use_wandb = False
# Training loop
for step in range(num_steps):
loss = train_step()
writer.add_scalar('Loss', loss, step)
if use_wandb:
wandb.log({"loss": loss}, step=step)
Model Development Workflow
- Rapid iteration (HPC): Use TensorBoard for quick experiments
- Promising results: Log to WandB/Neptune for team visibility
- Model selection: Compare across all platforms
- Production deployment: Use Neptune Model Registry or WandB Artifacts
Conclusion
Each platform has its strengths:
- WandB: Best for teams needing real-time collaboration, comprehensive integrations, and powerful sweeps
- Neptune: Best for structured metadata organization, dataset versioning, and cleaner UI
- TensorBoard: Best for HPC environments, privacy-sensitive work, and zero-cost tracking
Recommendation for most users:
- Start with TensorBoard if working on Sherlock or similar HPC clusters
- Set up a personal WandB or Neptune account for long-term experiment tracking and collaboration
- Use institutional solutions like Sherlock OnDemand for secure TensorBoard access
- Consider a hybrid approach: TensorBoard for immediate feedback + WandB/Neptune for team sharing
The most important thing is to start tracking your experiments systematically. Even basic tracking is infinitely better than no tracking when you need to reproduce results or understand what worked six months later.