Multi-GPU training

Scale deep learning across multiple GPUs on DAIC.

What you’ll learn

By the end of this tutorial, you’ll be able to:

  • Understand when and why to use multiple GPUs
  • Train models across GPUs with PyTorch Lightning
  • Use native PyTorch Distributed Data Parallel (DDP)
  • Scale training with Hugging Face Accelerate
  • Configure Slurm jobs for multi-GPU and multi-node training
  • Debug common distributed training issues

Time: About 60 minutes

Prerequisites: Complete Slurm Basics and Python Environments first. Familiarity with PyTorch is assumed.


When to use multiple GPUs

Training on multiple GPUs makes sense when:

  • Training is slow: A single GPU takes hours or days per epoch
  • Model fits in memory: The model fits on one GPU, but you want faster training
  • Large batch sizes: You need larger effective batch sizes for better convergence

Multiple GPUs do not help when:

  • Your model doesn’t fit on a single GPU (you need model parallelism instead)
  • Data loading is the bottleneck
  • Training is already fast (communication overhead may slow things down)
  • The dataset is small (like MNIST) - GPU communication overhead exceeds computation time

Scaling strategies

Strategy            What it does                                     When to use
Data Parallel       Same model on each GPU, different data batches   Most common, covered here
Model Parallel      Model split across GPUs                          Very large models (LLMs)
Pipeline Parallel   Model layers on different GPUs                   Very deep networks

This tutorial focuses on data parallelism - the most common and easiest approach.

How data parallelism works

  1. The model is replicated on each GPU
  2. Each GPU processes a different batch of data
  3. Gradients are synchronized across GPUs
  4. Weights are updated identically on all GPUs

With 2 GPUs and batch size 32 per GPU, you effectively train with batch size 64.
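The arithmetic behind steps 1-4 can be checked with a toy example in plain Python (no GPUs involved; a 1-D linear model stands in for a network):

```python
# Toy data parallelism: two "GPUs" each compute a gradient on their own
# batch for the model y = w*x with squared-error loss.

def grad(w, batch):
    # Mean gradient over the batch: d/dw (w*x - t)^2 = 2*(w*x - t)*x
    return sum(2 * (w * x - t) * x for x, t in batch) / len(batch)

w = 0.5
batch_gpu0 = [(1.0, 2.0), (2.0, 4.0)]
batch_gpu1 = [(3.0, 6.0), (4.0, 8.0)]

# Each "GPU" computes a gradient on its own batch...
g0, g1 = grad(w, batch_gpu0), grad(w, batch_gpu1)
# ...then gradients are averaged (what the all-reduce step does in DDP)
g_sync = (g0 + g1) / 2

# Identical to one GPU seeing the combined (effective) batch
g_full = grad(w, batch_gpu0 + batch_gpu1)
assert abs(g_sync - g_full) < 1e-12
```

Averaging the per-GPU gradients gives exactly the gradient of the combined batch, which is why the effective batch size is the per-GPU batch times the number of GPUs.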


Part 1: PyTorch Lightning

PyTorch Lightning is the easiest way to scale training. It handles distributed training automatically - you write single-GPU code, Lightning handles the rest.

Setup

Create a project with Lightning:

$ cd /tudelft.net/staff-umbrella/<project>
$ uv init lightning-multi-gpu
$ cd lightning-multi-gpu
$ uv add torch torchvision lightning --index https://download.pytorch.org/whl/cu124

Single-GPU baseline

First, write a standard Lightning module:

# src/train.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

class ImageClassifier(L.LightningModule):
    def __init__(self, learning_rate=1e-3):
        super().__init__()
        self.save_hyperparameters()
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, 128),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(128, 10)
        )

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log('train_loss', loss, prog_bar=True)
        self.log('train_acc', acc, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        acc = (logits.argmax(dim=1) == y).float().mean()
        self.log('val_loss', loss, prog_bar=True)
        self.log('val_acc', acc, prog_bar=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)


class MNISTDataModule(L.LightningDataModule):
    def __init__(self, data_dir='./data', batch_size=64, num_workers=4):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.num_workers = num_workers
        self.transform = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.1307,), (0.3081,))
        ])

    def prepare_data(self):
        # Download (runs on rank 0 only)
        datasets.MNIST(self.data_dir, train=True, download=True)
        datasets.MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        if stage == 'fit' or stage is None:
            mnist_full = datasets.MNIST(
                self.data_dir, train=True, transform=self.transform
            )
            self.mnist_train, self.mnist_val = random_split(
                mnist_full, [55000, 5000]
            )
        if stage == 'test' or stage is None:
            self.mnist_test = datasets.MNIST(
                self.data_dir, train=False, transform=self.transform
            )

    def train_dataloader(self):
        return DataLoader(
            self.mnist_train,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            persistent_workers=True
        )

    def val_dataloader(self):
        return DataLoader(
            self.mnist_val,
            batch_size=self.batch_size,
            num_workers=self.num_workers,
            persistent_workers=True
        )


def main():
    # Data
    datamodule = MNISTDataModule(
        data_dir='/tudelft.net/staff-umbrella/<project>/data',
        batch_size=64,
        num_workers=4
    )

    # Model
    model = ImageClassifier(learning_rate=1e-3)

    # Trainer - single GPU
    trainer = L.Trainer(
        max_epochs=10,
        accelerator='gpu',
        devices=1,
        precision='16-mixed',
        enable_progress_bar=True,
    )

    trainer.fit(model, datamodule)


if __name__ == '__main__':
    main()

Scaling to multiple GPUs

The only change needed is in the Trainer configuration:

# Multi-GPU: use all available GPUs on one node
trainer = L.Trainer(
    max_epochs=10,
    accelerator='gpu',
    devices=2,              # Use 2 GPUs
    strategy='ddp',         # Distributed Data Parallel
    precision='16-mixed',
)

That’s it. Lightning handles:

  • Spawning processes for each GPU
  • Distributing data across GPUs
  • Synchronizing gradients
  • Logging from rank 0 only

Slurm job script for multi-GPU

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out

module purge
module load 2025/gpu cuda/12.9

cd /tudelft.net/staff-umbrella/<project>/lightning-multi-gpu

# Set number of workers based on CPUs
export NUM_WORKERS=$((SLURM_CPUS_PER_TASK / 4))

srun uv run python src/train.py

Key points:

  • --gres=gpu:2: Request 2 GPUs
  • --cpus-per-task=8: Enough CPUs for data loading (4 per GPU)
  • --ntasks-per-node=1: Lightning spawns its own processes
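The NUM_WORKERS variable exported above is only useful if the training script reads it. A minimal sketch (the variable name is this tutorial's own convention, not a Slurm built-in):

```python
import os

# Workers per GPU: read the value exported by the job script,
# falling back to 4 for local runs where it isn't set
num_workers = int(os.environ.get('NUM_WORKERS', 4))
print(f'using {num_workers} dataloader workers')
```

Pass this value as `num_workers` to `MNISTDataModule` in `main()` instead of the hardcoded 4.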

Multi-node training

Scale beyond one machine with minimal changes:

trainer = L.Trainer(
    max_epochs=10,
    accelerator='gpu',
    devices=2,              # GPUs per node
    num_nodes=2,            # Number of nodes
    strategy='ddp',
    precision='16-mixed',
)

Slurm script for multi-node:

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=4:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=4
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out

module purge
module load 2025/gpu cuda/12.9

cd /tudelft.net/staff-umbrella/<project>/lightning-multi-gpu

# Get master address from first node
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500

srun uv run python src/train.py

Note: for multi-node jobs, Lightning's Slurm integration expects --ntasks-per-node to match devices, so that srun starts one process per GPU on every node.

Exercise 1: Scale with Lightning

  1. Create the Lightning project above
  2. Train on 1 GPU and note the time per epoch
  3. Change to 2 GPUs and compare
  4. Verify both runs achieve similar accuracy

Part 2: PyTorch DDP (native)

If you need more control or can’t use Lightning, PyTorch’s DistributedDataParallel (DDP) is the native approach.

Key concepts

  • World size: Total number of processes (GPUs)
  • Rank: Unique ID for each process (0 to world_size-1)
  • Local rank: GPU index on the current node (0 to GPUs_per_node-1)
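torchrun exposes all three as environment variables in every process it launches; this is how the training script below finds its place in the job:

```python
import os

# Set by torchrun for each worker process; the defaults
# cover a plain single-process run without torchrun
rank = int(os.environ.get('RANK', 0))
local_rank = int(os.environ.get('LOCAL_RANK', 0))
world_size = int(os.environ.get('WORLD_SIZE', 1))
print(f'process {rank}/{world_size} on local GPU {local_rank}')
```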

DDP training script

# src/train_ddp.py
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms


def setup():
    """Initialize distributed training."""
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(int(os.environ['LOCAL_RANK']))


def cleanup():
    """Clean up distributed training."""
    dist.destroy_process_group()


def get_rank():
    return dist.get_rank()


def is_main_process():
    return get_rank() == 0


class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = F.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)


def train_epoch(model, loader, optimizer, device):
    model.train()
    total_loss = 0
    correct = 0
    total = 0

    for batch_idx, (data, target) in enumerate(loader):
        data, target = data.to(device), target.to(device)

        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        pred = output.argmax(dim=1)
        correct += pred.eq(target).sum().item()
        total += target.size(0)

    return total_loss / len(loader), correct / total


def validate(model, loader, device):
    model.eval()
    total_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            total_loss += F.cross_entropy(output, target).item()
            pred = output.argmax(dim=1)
            correct += pred.eq(target).sum().item()
            total += target.size(0)

    return total_loss / len(loader), correct / total


def main():
    # Initialize distributed
    setup()

    local_rank = int(os.environ['LOCAL_RANK'])
    device = torch.device(f'cuda:{local_rank}')

    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    train_dataset = datasets.MNIST(
        '/tudelft.net/staff-umbrella/<project>/data',
        train=True, download=False, transform=transform
    )
    val_dataset = datasets.MNIST(
        '/tudelft.net/staff-umbrella/<project>/data',
        train=False, download=False, transform=transform
    )

    # Distributed sampler ensures each GPU gets different data
    train_sampler = DistributedSampler(train_dataset, shuffle=True)
    val_sampler = DistributedSampler(val_dataset, shuffle=False)

    train_loader = DataLoader(
        train_dataset,
        batch_size=64,
        sampler=train_sampler,
        num_workers=4,
        pin_memory=True
    )
    val_loader = DataLoader(
        val_dataset,
        batch_size=64,
        sampler=val_sampler,
        num_workers=4,
        pin_memory=True
    )

    # Model - wrap in DDP
    model = SimpleNet().to(device)
    model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Training loop
    for epoch in range(10):
        # Important: set epoch for proper shuffling
        train_sampler.set_epoch(epoch)

        train_loss, train_acc = train_epoch(model, train_loader, optimizer, device)
        val_loss, val_acc = validate(model, val_loader, device)

        # Only print from main process
        if is_main_process():
            print(f'Epoch {epoch+1}: '
                  f'train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, '
                  f'val_loss={val_loss:.4f}, val_acc={val_acc:.4f}')

    # Save model (only from main process)
    if is_main_process():
        torch.save(model.module.state_dict(), 'model.pt')
        print('Model saved to model.pt')

    cleanup()


if __name__ == '__main__':
    main()

Key differences from single-GPU

  1. Initialize process group: dist.init_process_group()
  2. Wrap model in DDP: model = DDP(model, device_ids=[local_rank])
  3. Use DistributedSampler: Ensures each GPU gets different data
  4. Set sampler epoch: train_sampler.set_epoch(epoch) for proper shuffling
  5. Save from rank 0 only: Avoid file conflicts
  6. Access original model: Use model.module when saving
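How DistributedSampler divides the data can be pictured in plain Python. A simplified sketch (the real sampler also reshuffles per epoch and pads the dataset so every rank gets the same number of samples):

```python
def shard(indices, rank, world_size):
    # Each rank takes every world_size-th index, offset by its rank
    return indices[rank::world_size]

dataset = list(range(10))
shards = [shard(dataset, r, 2) for r in range(2)]
print(shards)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]

# Together the shards cover the dataset exactly once
assert sorted(shards[0] + shards[1]) == dataset
```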

Slurm script for DDP

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out

module purge
module load 2025/gpu cuda/12.9

cd /tudelft.net/staff-umbrella/<project>/ddp-example

export MASTER_ADDR=$(hostname)
export MASTER_PORT=29500

srun uv run torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    src/train_ddp.py

Note: keep --ntasks-per-node=1 so srun starts a single torchrun per node; torchrun itself spawns one worker process per GPU (--nproc_per_node=2 to match --gres=gpu:2).

Exercise 2: Native DDP

  1. Create the DDP training script
  2. Run with 2 GPUs using torchrun
  3. Verify the DistributedSampler splits data correctly

Part 3: Hugging Face Accelerate

Accelerate provides a middle ground - more control than Lightning, less boilerplate than raw DDP.

Setup

$ uv add accelerate transformers datasets

Accelerate training script

# src/train_accelerate.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from accelerate import Accelerator


class SimpleNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(28 * 28, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.flatten(x)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


def main():
    # Initialize accelerator
    accelerator = Accelerator(mixed_precision='fp16')

    # Data
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])

    train_dataset = datasets.MNIST(
        '/tudelft.net/staff-umbrella/<project>/data',
        train=True, download=False, transform=transform
    )

    train_loader = DataLoader(
        train_dataset,
        batch_size=64,
        shuffle=True,
        num_workers=4
    )

    # Model and optimizer
    model = SimpleNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    # Prepare for distributed training
    model, optimizer, train_loader = accelerator.prepare(
        model, optimizer, train_loader
    )

    # Training loop
    for epoch in range(10):
        model.train()
        total_loss = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            optimizer.zero_grad()
            output = model(data)
            loss = F.cross_entropy(output, target)
            accelerator.backward(loss)
            optimizer.step()
            total_loss += loss.item()

        # Print from main process only
        if accelerator.is_main_process:
            print(f'Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}')

    # Save model
    accelerator.wait_for_everyone()
    if accelerator.is_main_process:
        unwrapped_model = accelerator.unwrap_model(model)
        torch.save(unwrapped_model.state_dict(), 'model.pt')


if __name__ == '__main__':
    main()

Key features

  1. Minimal code changes: Just wrap with accelerator.prepare()
  2. Automatic device placement: No manual .to(device)
  3. Mixed precision: Built-in with mixed_precision='fp16'
  4. Gradient accumulation: Easy with accumulate() context

Configuration file

Generate a config with:

$ uv run accelerate config

Or create accelerate_config.yaml:

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2  # one process per GPU
mixed_precision: fp16

Slurm script for Accelerate

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out

module purge
module load 2025/gpu cuda/12.9

cd /tudelft.net/staff-umbrella/<project>/accelerate-example

srun uv run accelerate launch \
    --num_processes=2 \
    --mixed_precision=fp16 \
    src/train_accelerate.py

Part 4: Best practices

Data loading

Data loading often becomes the bottleneck with multiple GPUs.

Tips:

  • Use num_workers proportional to CPUs: typically 4 workers per GPU
  • Enable pin_memory=True for faster GPU transfer
  • Use persistent_workers=True to avoid worker restart overhead
  • Store data on fast storage (SSD/NVMe when available)

DataLoader(
    dataset,
    batch_size=64,
    num_workers=4,           # Per GPU
    pin_memory=True,         # Faster transfer to GPU
    persistent_workers=True, # Keep workers alive
    prefetch_factor=2,       # Batches to prefetch per worker
)

Batch size scaling

When using N GPUs, you have options:

  1. Keep per-GPU batch size: Effective batch = N * per_GPU_batch

    • Faster training, may need learning rate adjustment
  2. Keep total batch size: per_GPU_batch = total / N

    • Same training dynamics, just faster

Learning rate scaling rule: When increasing batch size by factor K, increase learning rate by factor K (or sqrt(K) for more conservative scaling).

# Example: scaling from 1 to 2 GPUs
base_lr = 1e-3
base_batch = 64
num_gpus = 2

# Linear scaling
scaled_lr = base_lr * num_gpus  # 2e-3

# Square-root scaling (more conservative)
scaled_lr_sqrt = base_lr * num_gpus ** 0.5  # ~1.4e-3

Gradient accumulation

Simulate larger batches without more memory:

# Lightning
trainer = L.Trainer(
    accumulate_grad_batches=4,  # Effective batch = 4 * batch_size * num_gpus
)

# Accelerate
accelerator = Accelerator(gradient_accumulation_steps=4)
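The resulting effective batch size is simply the product of the three factors:

```python
# Effective batch size with gradient accumulation on 2 GPUs
batch_size = 64
num_gpus = 2
accumulate_grad_batches = 4

effective_batch = accumulate_grad_batches * batch_size * num_gpus
print(effective_batch)  # 512
```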

Checkpointing

Save checkpoints that work across different GPU configurations:

# Lightning - automatic
trainer = L.Trainer(
    callbacks=[
        L.callbacks.ModelCheckpoint(
            dirpath='checkpoints',
            filename='epoch_{epoch:02d}',
            save_top_k=3,
            monitor='val_loss'
        )
    ]
)

# DDP - save unwrapped model
if is_main_process():
    torch.save(model.module.state_dict(), 'model.pt')

Exercise 3: Optimize data loading

  1. Train with num_workers=0 and measure throughput
  2. Increase to num_workers=4 and compare
  3. Add pin_memory=True and persistent_workers=True
  4. Measure the improvement

NCCL configuration on DAIC

DAIC GPU nodes have GPUs distributed across multiple NUMA nodes (CPU sockets). The GPUs communicate via the QPI/UPI interconnect rather than NVLink, which requires specific NCCL configuration.

Required settings

Add these environment variables to your job scripts:

# Required: Disable P2P (peer-to-peer) communication
# P2P doesn't work between GPUs on different NUMA nodes
export NCCL_P2P_DISABLE=1

Why this is needed

Check GPU topology on a compute node:

$ nvidia-smi topo -m
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      SYS     16-17           2
GPU1    SYS      X      32-33           4

The SYS connection means GPUs communicate through the CPU interconnect (QPI/UPI), not direct P2P. Without NCCL_P2P_DISABLE=1, NCCL attempts P2P transfers that hang.

Performance expectations

With NCCL_P2P_DISABLE=1 on DAIC:

Configuration   ResNet18 on CIFAR-10   Speedup
1 GPU           7.8s/epoch             baseline
2 GPUs          6.1s/epoch             1.28x

The speedup is less than 2x because communication goes through CPU memory. Larger models and datasets see better scaling.
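Those epoch times correspond to a scaling efficiency of about 64%:

```python
# Scaling efficiency derived from the epoch times in the table above
t_1gpu, t_2gpu = 7.8, 6.1   # seconds per epoch
speedup = t_1gpu / t_2gpu
efficiency = speedup / 2    # fraction of the ideal 2x
print(f'{speedup:.2f}x speedup, {efficiency:.0%} efficiency')
```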


Part 5: Troubleshooting

Training hangs with multiple GPUs

Training hangs after “Initializing distributed” or “All distributed processes registered”.

Cause: NCCL P2P communication fails between GPUs on different NUMA nodes.

Solution:

export NCCL_P2P_DISABLE=1

NCCL errors

NCCL error: unhandled system error

Causes:

  • Network issues between nodes
  • Mismatched CUDA/NCCL versions
  • Firewall blocking ports

Solutions:

# Keep shared-memory transport enabled for single-node jobs (0 = enabled, the default)
export NCCL_SHM_DISABLE=0

# Debug logging
export NCCL_DEBUG=INFO

# Specify network interface
export NCCL_SOCKET_IFNAME=eth0

Hanging at initialization

Training hangs at init_process_group().

Causes:

  • Wrong MASTER_ADDR or MASTER_PORT
  • Firewall blocking communication
  • Mismatched world size

Solutions:

# Verify connectivity
$ srun --nodes=2 hostname

# Check MASTER_ADDR is reachable
$ ping $MASTER_ADDR

Out of memory with DDP

DDP uses more memory than single GPU due to gradient buffers.

Solutions:

  • Reduce batch size
  • Use gradient checkpointing
  • Enable mixed precision (fp16)

# Gradient checkpointing: Hugging Face models expose a helper for this
model.gradient_checkpointing_enable()

# Plain PyTorch/Lightning modules: wrap expensive blocks with
# torch.utils.checkpoint.checkpoint inside forward()

Uneven GPU utilization

One GPU doing more work than others.

Causes:

  • Uneven batch sizes (last batch smaller)
  • Data loading bottleneck on rank 0

Solutions:

# Drop incomplete batches
DataLoader(..., drop_last=True)

# Each rank loads its own data
# (default with DistributedSampler)

Exercise 4: Debug a distributed job

  1. Submit a 2-GPU job with intentionally wrong MASTER_PORT
  2. Observe the error message
  3. Fix the port and verify training starts

Summary

You’ve learned to scale training across multiple GPUs:

Framework    Complexity   Best for
Lightning    Low          Most users, fast prototyping
Accelerate   Medium       HF ecosystem, moderate control
DDP          High         Full control, custom training

Key takeaways

  1. Start with Lightning - handles distributed training automatically
  2. Request resources correctly - GPUs, CPUs for data loading, memory
  3. Scale batch size or learning rate - adjust for multi-GPU
  4. Optimize data loading - often the real bottleneck
  5. Save from rank 0 only - avoid checkpoint conflicts

Quick reference

# Lightning multi-GPU
trainer = L.Trainer(accelerator='gpu', devices=2, strategy='ddp')

# DDP launch
torchrun --nproc_per_node=4 train.py

# Accelerate launch
accelerate launch --num_processes=4 train.py

Slurm essentials

#SBATCH --gres=gpu:2          # Number of GPUs
#SBATCH --cpus-per-task=8     # CPUs for data loading
#SBATCH --ntasks-per-node=1   # Single node: Lightning and torchrun spawn per-GPU processes
#SBATCH --ntasks-per-node=2   # Multi-node Lightning: one task per GPU

Next steps