Multi-GPU training
13 minute read
DAIC-specific configuration required
DAIC GPU nodes have GPUs on different NUMA nodes (CPU sockets). You must set NCCL_P2P_DISABLE=1 in your job scripts for multi-GPU training to work. See NCCL Configuration below.

What you’ll learn
By the end of this tutorial, you’ll be able to:
- Understand when and why to use multiple GPUs
- Train models across GPUs with PyTorch Lightning
- Use native PyTorch Distributed Data Parallel (DDP)
- Scale training with Hugging Face Accelerate
- Configure Slurm jobs for multi-GPU and multi-node training
- Debug common distributed training issues
Time: About 60 minutes
Prerequisites: Complete Slurm Basics and Python Environments first. Familiarity with PyTorch is assumed.
Runnable examples
This tutorial includes complete, runnable example scripts in the examples/ directory. Copy them to your project storage and test on DAIC:
- examples/lightning/ - PyTorch Lightning example
- examples/ddp/ - Native PyTorch DDP example
- examples/accelerate/ - Hugging Face Accelerate example
When to use multiple GPUs
Training on multiple GPUs makes sense when:
- Training is slow: A single GPU takes hours or days per epoch
- Model fits in memory: The model fits on one GPU, but you want faster training
- Large batch sizes: You need larger effective batch sizes for better convergence
Multiple GPUs do not help when:
- Your model doesn’t fit on a single GPU (you need model parallelism instead)
- Data loading is the bottleneck
- Training is already fast (communication overhead may slow things down)
- The dataset is small (like MNIST) - GPU communication overhead exceeds computation time
About the examples
The examples use CIFAR-10 with ResNet18, which is large enough to demonstrate multi-GPU speedup (~1.3x with 2 GPUs). For production workloads with larger models and datasets, expect near-linear scaling.

Scaling strategies
| Strategy | What it does | When to use |
|---|---|---|
| Data Parallel | Same model on each GPU, different data batches | Most common, covered here |
| Model Parallel | Model split across GPUs | Very large models (LLMs) |
| Pipeline Parallel | Model layers on different GPUs | Very deep networks |
This tutorial focuses on data parallelism - the most common and easiest approach.
How data parallelism works
- The model is replicated on each GPU
- Each GPU processes a different batch of data
- Gradients are synchronized across GPUs
- Weights are updated identically on all GPUs
With 2 GPUs and batch size 32 per GPU, you effectively train with batch size 64.
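The arithmetic above can be checked with a few lines of plain Python (no framework required):

```python
# Effective batch size under data parallelism:
# each of N GPUs processes its own per-GPU batch every step.
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    return per_gpu_batch * num_gpus

# With 2 GPUs and batch size 32 per GPU (the example above):
assert effective_batch_size(32, 2) == 64

# Samples per GPU shrink accordingly: 60000 MNIST samples
# split across 2 GPUs means each GPU sees 30000 per epoch.
samples, num_gpus = 60000, 2
samples_per_gpu = samples // num_gpus
print(samples_per_gpu)  # 30000
```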
Part 1: PyTorch Lightning
PyTorch Lightning is the easiest way to scale training. It handles distributed training automatically - you write single-GPU code, Lightning handles the rest.
Setup
Create a project with Lightning:
$ cd /tudelft.net/staff-umbrella/<project>
$ uv init lightning-multi-gpu
$ cd lightning-multi-gpu
$ uv add torch torchvision lightning --index https://download.pytorch.org/whl/cu124
Single-GPU baseline
First, write a standard Lightning module:
# src/train.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
class ImageClassifier(L.LightningModule):
def __init__(self, learning_rate=1e-3):
super().__init__()
self.save_hyperparameters()
self.model = nn.Sequential(
nn.Flatten(),
nn.Linear(28 * 28, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, 10)
)
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = F.cross_entropy(logits, y)
acc = (logits.argmax(dim=1) == y).float().mean()
self.log('train_loss', loss, prog_bar=True)
self.log('train_acc', acc, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = F.cross_entropy(logits, y)
acc = (logits.argmax(dim=1) == y).float().mean()
self.log('val_loss', loss, prog_bar=True)
self.log('val_acc', acc, prog_bar=True)
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
class MNISTDataModule(L.LightningDataModule):
def __init__(self, data_dir='./data', batch_size=64, num_workers=4):
super().__init__()
self.data_dir = data_dir
self.batch_size = batch_size
self.num_workers = num_workers
self.transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
def prepare_data(self):
# Download (runs on rank 0 only)
datasets.MNIST(self.data_dir, train=True, download=True)
datasets.MNIST(self.data_dir, train=False, download=True)
def setup(self, stage=None):
if stage == 'fit' or stage is None:
mnist_full = datasets.MNIST(
self.data_dir, train=True, transform=self.transform
)
self.mnist_train, self.mnist_val = random_split(
mnist_full, [55000, 5000]
)
if stage == 'test' or stage is None:
self.mnist_test = datasets.MNIST(
self.data_dir, train=False, transform=self.transform
)
def train_dataloader(self):
return DataLoader(
self.mnist_train,
batch_size=self.batch_size,
shuffle=True,
num_workers=self.num_workers,
persistent_workers=True
)
def val_dataloader(self):
return DataLoader(
self.mnist_val,
batch_size=self.batch_size,
num_workers=self.num_workers,
persistent_workers=True
)
def main():
# Data
datamodule = MNISTDataModule(
data_dir='/tudelft.net/staff-umbrella/<project>/data',
batch_size=64,
num_workers=4
)
# Model
model = ImageClassifier(learning_rate=1e-3)
# Trainer - single GPU
trainer = L.Trainer(
max_epochs=10,
accelerator='gpu',
devices=1,
precision='16-mixed',
enable_progress_bar=True,
)
trainer.fit(model, datamodule)
if __name__ == '__main__':
main()
Scaling to multiple GPUs
The only change needed is in the Trainer configuration:
# Multi-GPU: use all available GPUs on one node
trainer = L.Trainer(
max_epochs=10,
accelerator='gpu',
devices=2, # Use 2 GPUs
strategy='ddp', # Distributed Data Parallel
precision='16-mixed',
)
That’s it. Lightning handles:
- Spawning processes for each GPU
- Distributing data across GPUs
- Synchronizing gradients
- Logging from rank 0 only
Slurm job script for multi-GPU
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/lightning-multi-gpu
# Set number of workers based on CPUs
export NUM_WORKERS=$((SLURM_CPUS_PER_TASK / 4))
srun uv run python src/train.py
Key points:
- --gres=gpu:2: Request 2 GPUs
- --cpus-per-task=8: Enough CPUs for data loading (4 per GPU)
- --ntasks-per-node=1: Lightning spawns its own processes
Multi-node training
Scale beyond one machine with minimal changes:
trainer = L.Trainer(
max_epochs=10,
accelerator='gpu',
devices=2, # GPUs per node
num_nodes=2, # Number of nodes
strategy='ddp',
precision='16-mixed',
)
Slurm script for multi-node:
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=4:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/lightning-multi-gpu
# Get master address from first node
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
srun uv run python src/train.py
Exercise 1: Scale with Lightning
- Create the Lightning project above
- Train on 1 GPU and note the time per epoch
- Change to 2 GPUs and compare
- Verify both runs achieve similar accuracy
Check your work
Both configurations should achieve ~97% validation accuracy. Note that for MNIST, you may not see a speedup - the dataset is too small and communication overhead dominates. With larger datasets and models, you would see near-linear scaling.

Part 2: PyTorch DDP (native)
If you need more control or can’t use Lightning, PyTorch’s DistributedDataParallel (DDP) is the native approach.
Key concepts
- World size: Total number of processes (GPUs)
- Rank: Unique ID for each process (0 to world_size-1)
- Local rank: GPU index on the current node (0 to GPUs_per_node-1)
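torchrun exposes these values through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. A small helper for reading them is shown below; the single-process fallback values are an illustrative assumption for running the same script outside torchrun, not something torchrun itself provides:

```python
import os

def dist_info():
    """Read the distributed-training identifiers set by torchrun.

    Falls back to single-process values when the variables are absent,
    so the same script also runs outside torchrun.
    """
    rank = int(os.environ.get("RANK", 0))              # global process ID
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

# Outside torchrun the fallbacks apply, e.g. (0, 0, 1)
print(dist_info())
```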
DDP training script
# src/train_ddp.py
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
def setup():
"""Initialize distributed training."""
dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ['LOCAL_RANK']))
def cleanup():
"""Clean up distributed training."""
dist.destroy_process_group()
def get_rank():
return dist.get_rank()
def is_main_process():
return get_rank() == 0
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(28 * 28, 256)
self.fc2 = nn.Linear(256, 128)
self.fc3 = nn.Linear(128, 10)
self.dropout = nn.Dropout(0.2)
def forward(self, x):
x = self.flatten(x)
x = F.relu(self.fc1(x))
x = self.dropout(x)
x = F.relu(self.fc2(x))
x = self.dropout(x)
return self.fc3(x)
def train_epoch(model, loader, optimizer, device):
model.train()
total_loss = 0
correct = 0
total = 0
for batch_idx, (data, target) in enumerate(loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.cross_entropy(output, target)
loss.backward()
optimizer.step()
total_loss += loss.item()
pred = output.argmax(dim=1)
correct += pred.eq(target).sum().item()
total += target.size(0)
return total_loss / len(loader), correct / total
def validate(model, loader, device):
model.eval()
total_loss = 0
correct = 0
total = 0
with torch.no_grad():
for data, target in loader:
data, target = data.to(device), target.to(device)
output = model(data)
total_loss += F.cross_entropy(output, target).item()
pred = output.argmax(dim=1)
correct += pred.eq(target).sum().item()
total += target.size(0)
return total_loss / len(loader), correct / total
def main():
# Initialize distributed
setup()
local_rank = int(os.environ['LOCAL_RANK'])
device = torch.device(f'cuda:{local_rank}')
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST(
'/tudelft.net/staff-umbrella/<project>/data',
train=True, download=False, transform=transform
)
val_dataset = datasets.MNIST(
'/tudelft.net/staff-umbrella/<project>/data',
train=False, download=False, transform=transform
)
# Distributed sampler ensures each GPU gets different data
train_sampler = DistributedSampler(train_dataset, shuffle=True)
val_sampler = DistributedSampler(val_dataset, shuffle=False)
train_loader = DataLoader(
train_dataset,
batch_size=64,
sampler=train_sampler,
num_workers=4,
pin_memory=True
)
val_loader = DataLoader(
val_dataset,
batch_size=64,
sampler=val_sampler,
num_workers=4,
pin_memory=True
)
# Model - wrap in DDP
model = SimpleNet().to(device)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Training loop
for epoch in range(10):
# Important: set epoch for proper shuffling
train_sampler.set_epoch(epoch)
train_loss, train_acc = train_epoch(model, train_loader, optimizer, device)
val_loss, val_acc = validate(model, val_loader, device)
# Only print from main process
if is_main_process():
print(f'Epoch {epoch+1}: '
f'train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, '
f'val_loss={val_loss:.4f}, val_acc={val_acc:.4f}')
# Save model (only from main process)
if is_main_process():
torch.save(model.module.state_dict(), 'model.pt')
print('Model saved to model.pt')
cleanup()
if __name__ == '__main__':
main()
Key differences from single-GPU
- Initialize process group: dist.init_process_group()
- Wrap model in DDP: model = DDP(model, device_ids=[local_rank])
- Use DistributedSampler: Ensures each GPU gets different data
- Set sampler epoch: train_sampler.set_epoch(epoch) for proper shuffling
- Save from rank 0 only: Avoid file conflicts
- Access original model: Use model.module when saving
Slurm script for DDP
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/ddp-example
export MASTER_ADDR=$(hostname)
export MASTER_PORT=29500
srun uv run torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    src/train_ddp.py
Note: srun launches torchrun once (--ntasks-per-node=1); torchrun then spawns one worker process per GPU (--nproc_per_node=2, matching --gres=gpu:2).
Exercise 2: Native DDP
- Create the DDP training script
- Run with 2 GPUs using torchrun
- Verify the DistributedSampler splits data correctly
Check your work
Each GPU should process half the data:
# With 60000 training samples and 2 GPUs:
# Each GPU sees 30000 samples per epoch
GPU 0: Processing batches 0-468
GPU 1: Processing batches 0-468
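DistributedSampler's partitioning can be reproduced in plain Python: the index list is padded to a multiple of the world size, then each rank takes a strided slice. The sketch below is not PyTorch's actual implementation, but it follows the same scheme and reproduces the split shown above:

```python
import math

def rank_indices(dataset_len, num_replicas, rank):
    # Pad so every rank gets the same number of samples
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    indices += indices[: total_size - dataset_len]  # wrap-around padding
    # Each rank takes a strided slice: rank, rank+R, rank+2R, ...
    return indices[rank:total_size:num_replicas]

# 60000 MNIST samples across 2 GPUs -> 30000 each, disjoint
r0 = rank_indices(60000, 2, 0)
r1 = rank_indices(60000, 2, 1)
print(len(r0), len(r1))        # 30000 30000
print(set(r0).isdisjoint(r1))  # True
```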
Part 3: Hugging Face Accelerate
Accelerate provides a middle ground - more control than Lightning, less boilerplate than raw DDP.
Setup
$ uv add accelerate transformers datasets
Accelerate training script
# src/train_accelerate.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from accelerate import Accelerator
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(28 * 28, 256)
self.fc2 = nn.Linear(256, 128)
self.fc3 = nn.Linear(128, 10)
def forward(self, x):
x = self.flatten(x)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
def main():
# Initialize accelerator
accelerator = Accelerator(mixed_precision='fp16')
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST(
'/tudelft.net/staff-umbrella/<project>/data',
train=True, download=False, transform=transform
)
train_loader = DataLoader(
train_dataset,
batch_size=64,
shuffle=True,
num_workers=4
)
# Model and optimizer
model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Prepare for distributed training
model, optimizer, train_loader = accelerator.prepare(
model, optimizer, train_loader
)
# Training loop
for epoch in range(10):
model.train()
total_loss = 0
for batch_idx, (data, target) in enumerate(train_loader):
optimizer.zero_grad()
output = model(data)
loss = F.cross_entropy(output, target)
accelerator.backward(loss)
optimizer.step()
total_loss += loss.item()
# Print from main process only
if accelerator.is_main_process:
print(f'Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}')
# Save model
accelerator.wait_for_everyone()
if accelerator.is_main_process:
unwrapped_model = accelerator.unwrap_model(model)
torch.save(unwrapped_model.state_dict(), 'model.pt')
if __name__ == '__main__':
main()
Key features
- Minimal code changes: Just wrap with accelerator.prepare()
- Automatic device placement: No manual .to(device)
- Mixed precision: Built-in with mixed_precision='fp16'
- Gradient accumulation: Easy with accumulate() context
Configuration file
Generate a config with:
$ uv run accelerate config
Or create accelerate_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2
mixed_precision: fp16
Slurm script for Accelerate
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/accelerate-example
srun uv run accelerate launch \
    --num_processes=2 \
--mixed_precision=fp16 \
src/train_accelerate.py
Part 4: Best practices
Data loading
Data loading often becomes the bottleneck with multiple GPUs.
Tips:
- Use num_workers proportional to CPUs: typically 4 workers per GPU
- Enable pin_memory=True for faster GPU transfer
- Use persistent_workers=True to avoid worker restart overhead
- Store data on fast storage (SSD/NVMe when available)
DataLoader(
dataset,
batch_size=64,
num_workers=4, # Per GPU
pin_memory=True, # Faster transfer to GPU
persistent_workers=True, # Keep workers alive
prefetch_factor=2, # Batches to prefetch per worker
)
Batch size scaling
When using N GPUs, you have options:
Keep per-GPU batch size: Effective batch = N * per_GPU_batch
- Faster training, may need learning rate adjustment
Keep total batch size: per_GPU_batch = total / N
- Same training dynamics, just faster
Learning rate scaling rule: When increasing batch size by factor K, increase learning rate by factor K (or sqrt(K) for more conservative scaling).
# Example: scaling from 1 to 2 GPUs
import math

base_lr = 1e-3
base_batch = 64
num_gpus = 2
# Linear scaling
scaled_lr = base_lr * num_gpus           # 2e-3
# Conservative sqrt scaling
sqrt_lr = base_lr * math.sqrt(num_gpus)  # ~1.4e-3
Gradient accumulation
Simulate larger batches without more memory:
# Lightning
trainer = L.Trainer(
accumulate_grad_batches=4, # Effective batch = 4 * batch_size * num_gpus
)
# Accelerate
accelerator = Accelerator(gradient_accumulation_steps=4)
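Why accumulated gradients match full-batch gradients (when each microbatch loss is divided by the number of accumulation steps) can be checked by hand. A plain-Python sketch with a one-parameter model, gradients computed analytically rather than by a framework:

```python
def grad_mse(w, xs, ts):
    """Gradient of mean squared error for y = w*x, computed analytically."""
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]

# Full-batch gradient in one step
full = grad_mse(w, xs, ts)

# Same batch split into 2 equal microbatches; each microbatch
# gradient is divided by the number of accumulation steps
steps = 2
acc = sum(grad_mse(w, xs[i:i + 2], ts[i:i + 2]) / steps for i in range(0, 4, 2))

print(abs(full - acc) < 1e-12)  # True
```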
Checkpointing
Save checkpoints that work across different GPU configurations:
# Lightning - automatic
trainer = L.Trainer(
callbacks=[
L.callbacks.ModelCheckpoint(
dirpath='checkpoints',
filename='epoch_{epoch:02d}',
save_top_k=3,
monitor='val_loss'
)
]
)
# DDP - save unwrapped model
if is_main_process():
torch.save(model.module.state_dict(), 'model.pt')
Exercise 3: Optimize data loading
- Train with num_workers=0 and measure throughput
- Increase to num_workers=4 and compare
- Add pin_memory=True and persistent_workers=True
- Measure the improvement
Check your work
You should see significant speedup:
num_workers=0: ~100 samples/sec
num_workers=4: ~400 samples/sec
+ pin_memory + persistent_workers: ~500 samples/sec
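Throughput can be measured framework-agnostically by timing one pass over the loader. A minimal sketch, here using a dummy iterable standing in for a DataLoader (the numbers above are illustrative; yours will differ):

```python
import time

def measure_throughput(loader, batch_size):
    """Return samples/sec for one pass over the loader."""
    start = time.perf_counter()
    n_batches = 0
    for _batch in loader:
        n_batches += 1
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed

# Dummy loader: 100 "batches" of size 64
dummy = [[0] * 64 for _ in range(100)]
rate = measure_throughput(dummy, batch_size=64)
print(rate > 0)  # True
```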
NCCL configuration on DAIC
DAIC GPU nodes have GPUs distributed across multiple NUMA nodes (CPU sockets). The GPUs communicate via the QPI/UPI interconnect rather than NVLink, which requires specific NCCL configuration.
Required settings
Add these environment variables to your job scripts:
# Required: Disable P2P (peer-to-peer) communication
# P2P doesn't work between GPUs on different NUMA nodes
export NCCL_P2P_DISABLE=1
Why this is needed
Check GPU topology on a compute node:
$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X SYS 16-17 2
GPU1 SYS X 32-33 4
The SYS connection means GPUs communicate through the CPU interconnect (QPI/UPI), not direct P2P. Without NCCL_P2P_DISABLE=1, NCCL attempts P2P transfers that hang.
Performance expectations
With NCCL_P2P_DISABLE=1 on DAIC:
| Configuration | ResNet18 on CIFAR-10 | Speedup |
|---|---|---|
| 1 GPU | 7.8s/epoch | baseline |
| 2 GPUs | 6.1s/epoch | 1.28x |
The speedup is less than 2x because communication goes through CPU memory. Larger models and datasets see better scaling.
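The table's numbers translate directly into scaling efficiency:

```python
# Speedup and scaling efficiency from the measured epoch times above
t1, t2, num_gpus = 7.8, 6.1, 2
speedup = t1 / t2
efficiency = speedup / num_gpus
print(round(speedup, 2), round(efficiency, 2))  # 1.28 0.64
```

An efficiency of roughly 64% is what the CPU-interconnect path costs on this small workload; larger models amortize the communication better.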
Part 5: Troubleshooting
Training hangs with multiple GPUs
Training hangs after “Initializing distributed” or “All distributed processes registered”.
Cause: NCCL P2P communication fails between GPUs on different NUMA nodes.
Solution:
export NCCL_P2P_DISABLE=1
NCCL errors
NCCL error: unhandled system error
Causes:
- Network issues between nodes
- Mismatched CUDA/NCCL versions
- Firewall blocking ports
Solutions:
# Use shared memory for single-node
export NCCL_SHM_DISABLE=0
# Debug logging
export NCCL_DEBUG=INFO
# Specify network interface
export NCCL_SOCKET_IFNAME=eth0
Hanging at initialization
Training hangs at init_process_group().
Causes:
- Wrong MASTER_ADDR or MASTER_PORT
- Firewall blocking communication
- Mismatched world size
Solutions:
# Verify connectivity
$ srun --nodes=2 hostname
# Check MASTER_ADDR is reachable
$ ping $MASTER_ADDR
Out of memory with DDP
DDP uses more memory than single GPU due to gradient buffers.
Solutions:
- Reduce batch size
- Use gradient checkpointing
- Enable mixed precision (fp16)
# Gradient checkpointing: Hugging Face transformers models expose
# gradient_checkpointing_enable(); for custom PyTorch modules,
# wrap expensive layers with torch.utils.checkpoint.checkpoint instead
model.gradient_checkpointing_enable()
Uneven GPU utilization
One GPU doing more work than others.
Causes:
- Uneven batch sizes (last batch smaller)
- Data loading bottleneck on rank 0
Solutions:
# Drop incomplete batches
DataLoader(..., drop_last=True)
# Each rank loads its own data
# (default with DistributedSampler)
Exercise 4: Debug a distributed job
- Submit a 2-GPU job with intentionally wrong MASTER_PORT
- Observe the error message
- Fix the port and verify training starts
Check your work
With wrong port, you’ll see:
RuntimeError: connect() timed out
After fixing, training should start within seconds.
Summary
You’ve learned to scale training across multiple GPUs:
| Framework | Complexity | Best for |
|---|---|---|
| Lightning | Low | Most users, fast prototyping |
| Accelerate | Medium | HF ecosystem, moderate control |
| DDP | High | Full control, custom training |
Key takeaways
- Start with Lightning - handles distributed training automatically
- Request resources correctly - GPUs, CPUs for data loading, memory
- Scale batch size or learning rate - adjust for multi-GPU
- Optimize data loading - often the real bottleneck
- Save from rank 0 only - avoid checkpoint conflicts
Quick reference
# Lightning multi-GPU
trainer = L.Trainer(accelerator='gpu', devices=2, strategy='ddp')
# DDP launch
torchrun --nproc_per_node=4 train.py
# Accelerate launch
accelerate launch --num_processes=4 train.py
Slurm essentials
#SBATCH --gres=gpu:2 # Number of GPUs
#SBATCH --cpus-per-task=8 # CPUs for data loading
#SBATCH --ntasks-per-node=1 # For Lightning
#SBATCH --ntasks-per-node=4 # For DDP/torchrun
Next steps
- Apptainer Tutorial - Containerize your training environment
- PyTorch Lightning Docs - Advanced features
- Accelerate Docs - Hugging Face integration