Multi-GPU training
13 minute read
DAIC-specific configuration required
DAIC GPU nodes have GPUs on different NUMA nodes (CPU sockets). You must set NCCL_P2P_DISABLE=1 in your job scripts for multi-GPU training to work. See NCCL Configuration below.

What you’ll learn
By the end of this tutorial, you’ll be able to:
- Understand when and why to use multiple GPUs
- Train models across GPUs with PyTorch Lightning
- Use native PyTorch Distributed Data Parallel (DDP)
- Scale training with Hugging Face Accelerate
- Configure Slurm jobs for multi-GPU and multi-node training
- Debug common distributed training issues
Time: About 60 minutes
Prerequisites: Complete Slurm Basics and Python Environments first. Familiarity with PyTorch is assumed.
Runnable examples
This tutorial includes complete, runnable example scripts in the examples/ directory. Copy them to your project storage and test on DAIC:
- examples/lightning/ - PyTorch Lightning example
- examples/ddp/ - Native PyTorch DDP example
- examples/accelerate/ - Hugging Face Accelerate example
When to use multiple GPUs
Training on multiple GPUs makes sense when:
- Training is slow: A single GPU takes hours or days per epoch
- Model fits in memory: The model fits on one GPU, but you want faster training
- Large batch sizes: You need larger effective batch sizes for better convergence
Multiple GPUs do not help when:
- Your model doesn’t fit on a single GPU (you need model parallelism instead)
- Data loading is the bottleneck
- Training is already fast (communication overhead may slow things down)
- The dataset is small (like MNIST) - GPU communication overhead exceeds computation time
About the examples
The examples use CIFAR-10 with ResNet18, which is large enough to demonstrate multi-GPU speedup (~1.3x with 2 GPUs). For production workloads with larger models and datasets, expect near-linear scaling.

Scaling strategies
| Strategy | What it does | When to use |
|---|---|---|
| Data Parallel | Same model on each GPU, different data batches | Most common, covered here |
| Model Parallel | Model split across GPUs | Very large models (LLMs) |
| Pipeline Parallel | Model layers on different GPUs | Very deep networks |
This tutorial focuses on data parallelism - the most common and easiest approach.
How data parallelism works
- The model is replicated on each GPU
- Each GPU processes a different batch of data
- Gradients are synchronized across GPUs
- Weights are updated identically on all GPUs
With 2 GPUs and batch size 32 per GPU, you effectively train with batch size 64.
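The arithmetic above can be checked with a few lines of plain Python (no framework required):

```python
# Effective batch size under data parallelism:
# each of N GPUs processes its own per-GPU batch every step.
def effective_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    return per_gpu_batch * num_gpus

# With 2 GPUs and batch size 32 per GPU (the example above):
assert effective_batch_size(32, 2) == 64

# Samples per GPU shrink accordingly: 60000 MNIST samples
# split across 2 GPUs means each GPU sees 30000 per epoch.
samples, num_gpus = 60000, 2
samples_per_gpu = samples // num_gpus
print(samples_per_gpu)  # 30000
```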
Part 1: PyTorch Lightning
PyTorch Lightning is the easiest way to scale training. It handles distributed training automatically - you write single-GPU code, Lightning handles the rest.
Setup
Create a project with Lightning:
$ cd /tudelft.net/staff-umbrella/<project>
$ uv init lightning-multi-gpu
$ cd lightning-multi-gpu
$ uv add torch torchvision lightning --index https://download.pytorch.org/whl/cu124
Single-GPU baseline
First, write a standard Lightning module:
# src/train.py
import torch
import torch.nn as nn
import torch.nn.functional as F
import lightning as L
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms
class ImageClassifier(L.LightningModule):
def __init__(self, learning_rate=1e-3):
super().__init__()
self.save_hyperparameters()
self.model = nn.Sequential(
nn.Flatten(),
nn.Linear(28 * 28, 256),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(256, 128),
nn.ReLU(),
nn.Dropout(0.2),
nn.Linear(128, 10)
)
def forward(self, x):
return self.model(x)
def training_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = F.cross_entropy(logits, y)
acc = (logits.argmax(dim=1) == y).float().mean()
self.log('train_loss', loss, prog_bar=True)
self.log('train_acc', acc, prog_bar=True)
return loss
def validation_step(self, batch, batch_idx):
x, y = batch
logits = self(x)
loss = F.cross_entropy(logits, y)
acc = (logits.argmax(dim=1) == y).float().mean()
self.log('val_loss', loss, prog_bar=True)
self.log('val_acc', acc, prog_bar=True)
def configure_optimizers(self):
return torch.optim.Adam(self.parameters(), lr=self.hparams.learning_rate)
class MNISTDataModule(L.LightningDataModule):
def __init__(self, data_dir='./data', batch_size=64, num_workers=4):
super().__init__()
self.data_dir = data_dir
self.batch_size = batch_size
self.num_workers = num_workers
self.transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
def prepare_data(self):
# Download (runs on rank 0 only)
datasets.MNIST(self.data_dir, train=True, download=True)
datasets.MNIST(self.data_dir, train=False, download=True)
def setup(self, stage=None):
if stage == 'fit' or stage is None:
mnist_full = datasets.MNIST(
self.data_dir, train=True, transform=self.transform
)
self.mnist_train, self.mnist_val = random_split(
mnist_full, [55000, 5000]
)
if stage == 'test' or stage is None:
self.mnist_test = datasets.MNIST(
self.data_dir, train=False, transform=self.transform
)
def train_dataloader(self):
return DataLoader(
self.mnist_train,
batch_size=self.batch_size,
shuffle=True,
num_workers=self.num_workers,
persistent_workers=True
)
def val_dataloader(self):
return DataLoader(
self.mnist_val,
batch_size=self.batch_size,
num_workers=self.num_workers,
persistent_workers=True
)
def main():
# Data
datamodule = MNISTDataModule(
data_dir='/tudelft.net/staff-umbrella/<project>/data',
batch_size=64,
num_workers=4
)
# Model
model = ImageClassifier(learning_rate=1e-3)
# Trainer - single GPU
trainer = L.Trainer(
max_epochs=10,
accelerator='gpu',
devices=1,
precision='16-mixed',
enable_progress_bar=True,
)
trainer.fit(model, datamodule)
if __name__ == '__main__':
main()
Scaling to multiple GPUs
The only change needed is in the Trainer configuration:
# Multi-GPU: use all available GPUs on one node
trainer = L.Trainer(
max_epochs=10,
accelerator='gpu',
devices=2, # Use 2 GPUs
strategy='ddp', # Distributed Data Parallel
precision='16-mixed',
)
That’s it. Lightning handles:
- Spawning processes for each GPU
- Distributing data across GPUs
- Synchronizing gradients
- Logging from rank 0 only
Slurm job script for multi-GPU
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/lightning-multi-gpu
# Set number of workers based on CPUs
export NUM_WORKERS=$((SLURM_CPUS_PER_TASK / 4))
srun uv run python src/train.py
Key points:
- --gres=gpu:2: Request 2 GPUs
- --cpus-per-task=8: Enough CPUs for data loading (4 per GPU)
- --ntasks-per-node=1: Lightning spawns its own processes
Multi-node training
Scale beyond one machine with minimal changes:
trainer = L.Trainer(
max_epochs=10,
accelerator='gpu',
devices=2, # GPUs per node
num_nodes=2, # Number of nodes
strategy='ddp',
precision='16-mixed',
)
Slurm script for multi-node:
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=4:00:00
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/lightning-multi-gpu
# Get master address from first node
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=29500
srun uv run python src/train.py
Exercise 1: Scale with Lightning
- Create the Lightning project above
- Train on 1 GPU and note the time per epoch
- Change to 2 GPUs and compare
- Verify both runs achieve similar accuracy
Check your work
Both configurations should achieve ~97% validation accuracy. Note that for MNIST, you may not see a speedup - the dataset is too small and communication overhead dominates. With larger datasets and models, you would see near-linear scaling.

Part 2: PyTorch DDP (native)
If you need more control or can’t use Lightning, PyTorch’s DistributedDataParallel (DDP) is the native approach.
Key concepts
- World size: Total number of processes (GPUs)
- Rank: Unique ID for each process (0 to world_size-1)
- Local rank: GPU index on the current node (0 to GPUs_per_node-1)
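torchrun exposes these values through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. A small helper for reading them is shown below; the single-process fallback values are an illustrative assumption for running the same script outside torchrun, not something torchrun itself provides:

```python
import os

def dist_info():
    """Read the distributed-training identifiers set by torchrun.

    Falls back to single-process values when the variables are absent,
    so the same script also runs outside torchrun.
    """
    rank = int(os.environ.get("RANK", 0))              # global process ID
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # GPU index on this node
    world_size = int(os.environ.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size

# Outside torchrun the fallbacks apply, e.g. (0, 0, 1)
print(dist_info())
```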
DDP training script
# src/train_ddp.py
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torchvision import datasets, transforms
def setup():
"""Initialize distributed training."""
dist.init_process_group(backend='nccl')
torch.cuda.set_device(int(os.environ['LOCAL_RANK']))
def cleanup():
"""Clean up distributed training."""
dist.destroy_process_group()
def get_rank():
return dist.get_rank()
def is_main_process():
return get_rank() == 0
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(28 * 28, 256)
self.fc2 = nn.Linear(256, 128)
self.fc3 = nn.Linear(128, 10)
self.dropout = nn.Dropout(0.2)
def forward(self, x):
x = self.flatten(x)
x = F.relu(self.fc1(x))
x = self.dropout(x)
x = F.relu(self.fc2(x))
x = self.dropout(x)
return self.fc3(x)
def train_epoch(model, loader, optimizer, device):
model.train()
total_loss = 0
correct = 0
total = 0
for batch_idx, (data, target) in enumerate(loader):
data, target = data.to(device), target.to(device)
optimizer.zero_grad()
output = model(data)
loss = F.cross_entropy(output, target)
loss.backward()
optimizer.step()
total_loss += loss.item()
pred = output.argmax(dim=1)
correct += pred.eq(target).sum().item()
total += target.size(0)
return total_loss / len(loader), correct / total
def validate(model, loader, device):
model.eval()
total_loss = 0
correct = 0
total = 0
with torch.no_grad():
for data, target in loader:
data, target = data.to(device), target.to(device)
output = model(data)
total_loss += F.cross_entropy(output, target).item()
pred = output.argmax(dim=1)
correct += pred.eq(target).sum().item()
total += target.size(0)
return total_loss / len(loader), correct / total
def main():
# Initialize distributed
setup()
local_rank = int(os.environ['LOCAL_RANK'])
device = torch.device(f'cuda:{local_rank}')
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST(
'/tudelft.net/staff-umbrella/<project>/data',
train=True, download=False, transform=transform
)
val_dataset = datasets.MNIST(
'/tudelft.net/staff-umbrella/<project>/data',
train=False, download=False, transform=transform
)
# Distributed sampler ensures each GPU gets different data
train_sampler = DistributedSampler(train_dataset, shuffle=True)
val_sampler = DistributedSampler(val_dataset, shuffle=False)
train_loader = DataLoader(
train_dataset,
batch_size=64,
sampler=train_sampler,
num_workers=4,
pin_memory=True
)
val_loader = DataLoader(
val_dataset,
batch_size=64,
sampler=val_sampler,
num_workers=4,
pin_memory=True
)
# Model - wrap in DDP
model = SimpleNet().to(device)
model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Training loop
for epoch in range(10):
# Important: set epoch for proper shuffling
train_sampler.set_epoch(epoch)
train_loss, train_acc = train_epoch(model, train_loader, optimizer, device)
val_loss, val_acc = validate(model, val_loader, device)
# Only print from main process
if is_main_process():
print(f'Epoch {epoch+1}: '
f'train_loss={train_loss:.4f}, train_acc={train_acc:.4f}, '
f'val_loss={val_loss:.4f}, val_acc={val_acc:.4f}')
# Save model (only from main process)
if is_main_process():
torch.save(model.module.state_dict(), 'model.pt')
print('Model saved to model.pt')
cleanup()
if __name__ == '__main__':
main()
Key differences from single-GPU
- Initialize process group: dist.init_process_group()
- Wrap model in DDP: model = DDP(model, device_ids=[local_rank])
- Use DistributedSampler: Ensures each GPU gets different data
- Set sampler epoch: train_sampler.set_epoch(epoch) for proper shuffling
- Save from rank 0 only: Avoid file conflicts
- Access original model: Use model.module when saving
Slurm script for DDP
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/ddp-example
export MASTER_ADDR=$(hostname)
export MASTER_PORT=29500
srun uv run torchrun \
    --nnodes=1 \
    --nproc_per_node=2 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    src/train_ddp.py
Note: srun launches torchrun once (--ntasks-per-node=1); torchrun then spawns one worker process per GPU (--nproc_per_node=2, matching --gres=gpu:2).
Exercise 2: Native DDP
- Create the DDP training script
- Run with 2 GPUs using torchrun
- Verify the DistributedSampler splits data correctly
Check your work
Each GPU should process half the data:
# With 60000 training samples and 2 GPUs:
# Each GPU sees 30000 samples per epoch
GPU 0: Processing batches 0-468
GPU 1: Processing batches 0-468
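DistributedSampler's partitioning can be reproduced in plain Python: the index list is padded to a multiple of the world size, then each rank takes a strided slice. The sketch below is not PyTorch's actual implementation, but it follows the same scheme and reproduces the split shown above:

```python
import math

def rank_indices(dataset_len, num_replicas, rank):
    # Pad so every rank gets the same number of samples
    num_samples = math.ceil(dataset_len / num_replicas)
    total_size = num_samples * num_replicas
    indices = list(range(dataset_len))
    indices += indices[: total_size - dataset_len]  # wrap-around padding
    # Each rank takes a strided slice: rank, rank+R, rank+2R, ...
    return indices[rank:total_size:num_replicas]

# 60000 MNIST samples across 2 GPUs -> 30000 each, disjoint
r0 = rank_indices(60000, 2, 0)
r1 = rank_indices(60000, 2, 1)
print(len(r0), len(r1))        # 30000 30000
print(set(r0).isdisjoint(r1))  # True
```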
Part 3: Hugging Face Accelerate
Accelerate provides a middle ground - more control than Lightning, less boilerplate than raw DDP.
Setup
$ uv add accelerate transformers datasets
Accelerate training script
# src/train_accelerate.py
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from accelerate import Accelerator
class SimpleNet(nn.Module):
def __init__(self):
super().__init__()
self.flatten = nn.Flatten()
self.fc1 = nn.Linear(28 * 28, 256)
self.fc2 = nn.Linear(256, 128)
self.fc3 = nn.Linear(128, 10)
def forward(self, x):
x = self.flatten(x)
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
return self.fc3(x)
def main():
# Initialize accelerator
accelerator = Accelerator(mixed_precision='fp16')
# Data
transform = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.1307,), (0.3081,))
])
train_dataset = datasets.MNIST(
'/tudelft.net/staff-umbrella/<project>/data',
train=True, download=False, transform=transform
)
train_loader = DataLoader(
train_dataset,
batch_size=64,
shuffle=True,
num_workers=4
)
# Model and optimizer
model = SimpleNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Prepare for distributed training
model, optimizer, train_loader = accelerator.prepare(
model, optimizer, train_loader
)
# Training loop
for epoch in range(10):
model.train()
total_loss = 0
for batch_idx, (data, target) in enumerate(train_loader):
optimizer.zero_grad()
output = model(data)
loss = F.cross_entropy(output, target)
accelerator.backward(loss)
optimizer.step()
total_loss += loss.item()
# Print from main process only
if accelerator.is_main_process:
print(f'Epoch {epoch+1}: loss={total_loss/len(train_loader):.4f}')
# Save model
accelerator.wait_for_everyone()
if accelerator.is_main_process:
unwrapped_model = accelerator.unwrap_model(model)
torch.save(unwrapped_model.state_dict(), 'model.pt')
if __name__ == '__main__':
main()
Key features
- Minimal code changes: Just wrap with accelerator.prepare()
- Automatic device placement: No manual .to(device)
- Mixed precision: Built-in with mixed_precision='fp16'
- Gradient accumulation: Easy with accumulate() context
Configuration file
Generate a config with:
$ uv run accelerate config
Or create accelerate_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 2
mixed_precision: fp16
Slurm script for Accelerate
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=64G
#SBATCH --gres=gpu:2
#SBATCH --output=train_%j.out
module purge
module load 2025/gpu cuda/12.9
cd /tudelft.net/staff-umbrella/<project>/accelerate-example
srun uv run accelerate launch \
    --num_processes=2 \
--mixed_precision=fp16 \
src/train_accelerate.py
Part 4: Best practices
Data loading
Data loading often becomes the bottleneck with multiple GPUs.
Tips:
- Use num_workers proportional to CPUs: typically 4 workers per GPU
- Enable pin_memory=True for faster GPU transfer
- Use persistent_workers=True to avoid worker restart overhead
- Store data on fast storage (SSD/NVMe when available)
DataLoader(
dataset,
batch_size=64,
num_workers=4, # Per GPU
pin_memory=True, # Faster transfer to GPU
persistent_workers=True, # Keep workers alive
prefetch_factor=2, # Batches to prefetch per worker
)
Batch size scaling
When using N GPUs, you have options:
Keep per-GPU batch size: Effective batch = N * per_GPU_batch
- Faster training, may need learning rate adjustment
Keep total batch size: per_GPU_batch = total / N
- Same training dynamics, just faster
Learning rate scaling rule: When increasing batch size by factor K, increase learning rate by factor K (or sqrt(K) for more conservative scaling).
# Example: scaling from 1 to 2 GPUs
import math

base_lr = 1e-3
base_batch = 64
num_gpus = 2
# Linear scaling
scaled_lr = base_lr * num_gpus           # 2e-3
# Conservative sqrt scaling
sqrt_lr = base_lr * math.sqrt(num_gpus)  # ~1.4e-3
Gradient accumulation
Simulate larger batches without more memory:
# Lightning
trainer = L.Trainer(
accumulate_grad_batches=4, # Effective batch = 4 * batch_size * num_gpus
)
# Accelerate
accelerator = Accelerator(gradient_accumulation_steps=4)
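Why accumulated gradients match full-batch gradients (when each microbatch loss is divided by the number of accumulation steps) can be checked by hand. A plain-Python sketch with a one-parameter model, gradients computed analytically rather than by a framework:

```python
def grad_mse(w, xs, ts):
    """Gradient of mean squared error for y = w*x, computed analytically."""
    return sum(2 * x * (w * x - t) for x, t in zip(xs, ts)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0]
ts = [2.0, 4.0, 6.0, 8.0]

# Full-batch gradient in one step
full = grad_mse(w, xs, ts)

# Same batch split into 2 equal microbatches; each microbatch
# gradient is divided by the number of accumulation steps
steps = 2
acc = sum(grad_mse(w, xs[i:i + 2], ts[i:i + 2]) / steps for i in range(0, 4, 2))

print(abs(full - acc) < 1e-12)  # True
```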
Checkpointing
Save checkpoints that work across different GPU configurations:
# Lightning - automatic
trainer = L.Trainer(
callbacks=[
L.callbacks.ModelCheckpoint(
dirpath='checkpoints',
filename='epoch_{epoch:02d}',
save_top_k=3,
monitor='val_loss'
)
]
)
# DDP - save unwrapped model
if is_main_process():
torch.save(model.module.state_dict(), 'model.pt')
Exercise 3: Optimize data loading
- Train with num_workers=0 and measure throughput
- Increase to num_workers=4 and compare
- Add pin_memory=True and persistent_workers=True
- Measure the improvement
Check your work
You should see significant speedup:
num_workers=0: ~100 samples/sec
num_workers=4: ~400 samples/sec
+ pin_memory + persistent_workers: ~500 samples/sec
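Throughput can be measured framework-agnostically by timing one pass over the loader. A minimal sketch, here using a dummy iterable standing in for a DataLoader (the numbers above are illustrative; yours will differ):

```python
import time

def measure_throughput(loader, batch_size):
    """Return samples/sec for one pass over the loader."""
    start = time.perf_counter()
    n_batches = 0
    for _batch in loader:
        n_batches += 1
    elapsed = time.perf_counter() - start
    return n_batches * batch_size / elapsed

# Dummy loader: 100 "batches" of size 64
dummy = [[0] * 64 for _ in range(100)]
rate = measure_throughput(dummy, batch_size=64)
print(rate > 0)  # True
```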
NCCL configuration on DAIC
DAIC GPU nodes have GPUs distributed across multiple NUMA nodes (CPU sockets). The GPUs communicate via the QPI/UPI interconnect rather than NVLink, which requires specific NCCL configuration.
Required settings
Add these environment variables to your job scripts:
# Required: Disable P2P (peer-to-peer) communication
# P2P doesn't work between GPUs on different NUMA nodes
export NCCL_P2P_DISABLE=1
Why this is needed
Check GPU topology on a compute node:
$ nvidia-smi topo -m
GPU0 GPU1 CPU Affinity NUMA Affinity
GPU0 X SYS 16-17 2
GPU1 SYS X 32-33 4
The SYS connection means GPUs communicate through the CPU interconnect (QPI/UPI), not direct P2P. Without NCCL_P2P_DISABLE=1, NCCL attempts P2P transfers that hang.
Performance expectations
With NCCL_P2P_DISABLE=1 on DAIC:
| Configuration | ResNet18 on CIFAR-10 | Speedup |
|---|---|---|
| 1 GPU | 7.8s/epoch | baseline |
| 2 GPUs | 6.1s/epoch | 1.28x |
The speedup is less than 2x because communication goes through CPU memory. Larger models and datasets see better scaling.
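The table's numbers translate directly into scaling efficiency:

```python
# Speedup and scaling efficiency from the measured epoch times above
t1, t2, num_gpus = 7.8, 6.1, 2
speedup = t1 / t2
efficiency = speedup / num_gpus
print(round(speedup, 2), round(efficiency, 2))  # 1.28 0.64
```

An efficiency of roughly 64% is what the CPU-interconnect path costs on this small workload; larger models amortize the communication better.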
Part 5: Troubleshooting
Training hangs with multiple GPUs
Training hangs after “Initializing distributed” or “All distributed processes registered”.
Cause: NCCL P2P communication fails between GPUs on different NUMA nodes.
Solution:
export NCCL_P2P_DISABLE=1
NCCL errors
NCCL error: unhandled system error
Causes:
- Network issues between nodes
- Mismatched CUDA/NCCL versions
- Firewall blocking ports
Solutions:
# Use shared memory for single-node
export NCCL_SHM_DISABLE=0
# Debug logging
export NCCL_DEBUG=INFO
# Specify network interface
export NCCL_SOCKET_IFNAME=eth0
Hanging at initialization
Training hangs at init_process_group().
Causes:
- Wrong MASTER_ADDR or MASTER_PORT
- Firewall blocking communication
- Mismatched world size
Solutions:
# Verify connectivity
$ srun --nodes=2 hostname
# Check MASTER_ADDR is reachable
$ ping $MASTER_ADDR
Out of memory with DDP
DDP uses more memory than single GPU due to gradient buffers.
Solutions:
- Reduce batch size
- Use gradient checkpointing
- Enable mixed precision (fp16)
# Gradient checkpointing: Hugging Face transformers models expose
# gradient_checkpointing_enable(); for custom PyTorch modules,
# wrap expensive layers with torch.utils.checkpoint.checkpoint instead
model.gradient_checkpointing_enable()
Uneven GPU utilization
One GPU doing more work than others.
Causes:
- Uneven batch sizes (last batch smaller)
- Data loading bottleneck on rank 0
Solutions:
# Drop incomplete batches
DataLoader(..., drop_last=True)
# Each rank loads its own data
# (default with DistributedSampler)
Exercise 4: Debug a distributed job
- Submit a 2-GPU job with intentionally wrong MASTER_PORT
- Observe the error message
- Fix the port and verify training starts
Check your work
With wrong port, you’ll see:
RuntimeError: connect() timed out
After fixing, training should start within seconds.
Summary
You’ve learned to scale training across multiple GPUs:
| Framework | Complexity | Best for |
|---|---|---|
| Lightning | Low | Most users, fast prototyping |
| Accelerate | Medium | HF ecosystem, moderate control |
| DDP | High | Full control, custom training |
Key takeaways
- Start with Lightning - handles distributed training automatically
- Request resources correctly - GPUs, CPUs for data loading, memory
- Scale batch size or learning rate - adjust for multi-GPU
- Optimize data loading - often the real bottleneck
- Save from rank 0 only - avoid checkpoint conflicts
Quick reference
# Lightning multi-GPU
trainer = L.Trainer(accelerator='gpu', devices=2, strategy='ddp')
# DDP launch
torchrun --nproc_per_node=4 train.py
# Accelerate launch
accelerate launch --num_processes=4 train.py
Slurm essentials
#SBATCH --gres=gpu:2 # Number of GPUs
#SBATCH --cpus-per-task=8 # CPUs for data loading
#SBATCH --ntasks-per-node=1 # For Lightning
#SBATCH --ntasks-per-node=4 # For DDP/torchrun
Next steps
- Apptainer Tutorial - Containerize your training environment
- PyTorch Lightning Docs - Advanced features
- Accelerate Docs - Hugging Face integration