Slurm basics

Understanding the job scheduler on DAIC.

What you’ll learn

By the end of this tutorial, you’ll be able to:

  • Submit batch jobs that run on compute nodes
  • Request CPUs, memory, and GPUs for your jobs
  • Monitor job status and troubleshoot failures
  • Use interactive sessions for testing
  • Run parameter sweeps with job arrays

Time: About 45 minutes

Prerequisites: Complete the Bash Basics tutorial first, or be comfortable with Linux command line.


What is Slurm?

When you log into DAIC, you land on a login node. This is a shared computer where users prepare their work - but you shouldn’t run computations here. The actual computing happens on compute nodes, powerful machines with GPUs and lots of memory.

Slurm is the traffic controller that manages these compute nodes. When you want to run a computation, you don’t run it directly - you ask Slurm to run it for you. Slurm finds available resources, starts your job, and makes sure it doesn’t interfere with other users’ jobs.

Think of it like a restaurant: you don’t walk into the kitchen and cook your own food. You submit an order (your job), and the kitchen (Slurm) prepares it when they have capacity.

Why can’t I just run my code?

You might wonder: “Why can’t I just type python train.py and let it run?”

On a personal computer, that works fine. But DAIC is shared by hundreds of researchers, each wanting to use expensive GPUs. Without a scheduler:

  • Everyone would fight over the same resources
  • Your job might get killed when someone else starts theirs
  • GPUs would sit idle when no one happens to be logged in
  • There would be no fairness - whoever types fastest wins

Slurm solves these problems by:

  • Queueing jobs and running them in order
  • Guaranteeing that your job gets the resources you requested
  • Ensuring fair access based on policies
  • Maximizing utilization of expensive hardware

The two ways to run jobs

Batch jobs: submit and walk away

Most of the time, you’ll use batch jobs. You write a script that describes what you want to run, submit it, and Slurm runs it whenever resources are available. You don’t need to stay logged in - you can submit at 5pm, go home, and check results the next morning.

$ sbatch my_job.sh
Submitted batch job 12345

Your job enters a queue. When resources become available, Slurm runs it. Output goes to a file you can read later.

Interactive jobs: real-time access

Sometimes you need to work interactively - debugging, testing, or exploring data. For this, you request an interactive job. Slurm allocates resources, and you get a shell on a compute node.

$ salloc --account=<your-account> --partition=all --time=1:00:00 --gres=gpu:1
salloc: Granted job allocation 12346
$ srun nvidia-smi
$ srun python -c "import torch; print(torch.cuda.is_available())"
True

Interactive jobs are great for testing but expensive - you’re reserving resources the whole time, even if you’re just thinking. Use batch jobs for actual computations.

Your first batch job

Let’s walk through creating and submitting a batch job step by step.

Step 1: Create a Python script

First, create a simple script to run. This one just prints some information:

$ cd /tudelft.net/staff-umbrella/<project>
$ vim hello.py
import socket
import os

print(f"Hello from {socket.gethostname()}")
print(f"Job ID: {os.environ.get('SLURM_JOB_ID', 'not in slurm')}")
print(f"CPUs allocated: {os.environ.get('SLURM_CPUS_PER_TASK', 'unknown')}")

Step 2: Create a batch script

Now create the Slurm script that will run your Python code:

$ vim hello_job.sh
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=0:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=hello_%j.out

echo "Job started at $(date)"

srun python hello.py

echo "Job finished at $(date)"

Let’s understand each line:

Line                            Purpose
#!/bin/bash                     This is a bash script
#SBATCH --account=...           Which account to bill (required)
#SBATCH --partition=all         Which group of nodes to use
#SBATCH --time=0:10:00          Maximum runtime: 10 minutes
#SBATCH --ntasks=1              Run one task
#SBATCH --cpus-per-task=1       Use one CPU core
#SBATCH --mem=1G                Request 1 GB of memory
#SBATCH --output=hello_%j.out   Where to write output (%j = job ID)
srun python hello.py            The actual command to run

Step 3: Find your account

Before submitting, you need to know your account name:

$ sacctmgr show associations user=$USER format=Account -P
Account
ewi-insy-reit

Replace <your-account> in your script with this value (e.g., ewi-insy-reit).

Step 4: Submit the job

$ sbatch hello_job.sh
Submitted batch job 12345

The number 12345 is your job ID. You’ll use this to track your job.

Step 5: Check job status

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345       all hello_jo  netid01 PD       0:00      1 (Priority)

The ST column shows the status:

  • PD = Pending - waiting in queue
  • R = Running
  • CG = Completing - wrapping up

The REASON column tells you why a job is pending:

  • Priority = other jobs are ahead of you in the queue
  • Resources = waiting for nodes to become free
  • QOSMaxJobsPerUserLimit = you’ve hit your job limit

Step 6: Check the output

Once the job completes, read the output file:

$ cat hello_12345.out
Job started at Fri Mar 20 10:15:32 CET 2026
Hello from gpu23.ethernet.tudhpc
Job ID: 12345
CPUs allocated: 1
Job finished at Fri Mar 20 10:15:33 CET 2026

Your code ran on gpu23, not on the login node. Slurm handled everything.

Understanding resource requests

The most confusing part of Slurm is figuring out what resources to request. Request too little and your job crashes; request too much and you wait longer in the queue.

Time (--time)

How long your job will run. Format: D-HH:MM:SS or HH:MM:SS

#SBATCH --time=0:30:00      # 30 minutes
#SBATCH --time=4:00:00      # 4 hours
#SBATCH --time=1-00:00:00   # 1 day
#SBATCH --time=7-00:00:00   # 7 days (maximum on DAIC)

Important: If your job exceeds this time, Slurm kills it. But requesting more time means waiting longer in the queue. Start with a generous estimate, then use seff on completed jobs to tune it.
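To see how the two formats above relate, here is a small conversion helper. This is purely illustrative (Slurm does this parsing itself); it handles only the D-HH:MM:SS and HH:MM:SS forms shown above:

```python
def slurm_time_to_seconds(spec: str) -> int:
    """Convert a Slurm time spec (D-HH:MM:SS or HH:MM:SS) to seconds."""
    days = 0
    if "-" in spec:
        day_str, spec = spec.split("-", 1)
        days = int(day_str)
    hours, minutes, seconds = (int(p) for p in spec.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(slurm_time_to_seconds("0:30:00"))     # 1800
print(slurm_time_to_seconds("1-00:00:00"))  # 86400
```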

Memory (--mem)

How much RAM your job needs.

#SBATCH --mem=4G      # 4 gigabytes
#SBATCH --mem=32G     # 32 gigabytes
#SBATCH --mem=128G    # 128 gigabytes

If your job exceeds this limit, Slurm kills it with an “out of memory” error. Check your code’s actual memory usage with seff after a successful run.

CPUs (--cpus-per-task)

How many CPU cores your job needs.

#SBATCH --cpus-per-task=1    # Single-threaded code
#SBATCH --cpus-per-task=4    # Code that uses 4 threads
#SBATCH --cpus-per-task=16   # Heavily parallel CPU code

Match this to what your code actually uses:

  • Simple Python scripts: 1 CPU
  • PyTorch with DataLoader workers: workers + 1 (e.g., 4 workers = 5 CPUs)
  • NumPy/Pandas with parallelism: however many threads you configure
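Rather than hard-coding a thread count, your script can read what Slurm actually granted. The sketch below shows one way to do this; the environment variables are real Slurm/OpenMP ones, but the "workers = CPUs - 1" split just follows the rule of thumb above:

```python
import os

# Read the core count Slurm granted; fall back to 1 outside a job.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# Cap numeric thread pools (NumPy/OpenMP honor these if set before import).
os.environ.setdefault("OMP_NUM_THREADS", str(cpus))
os.environ.setdefault("MKL_NUM_THREADS", str(cpus))

# "workers + 1" rule from above, inverted: keep one core for the main
# process and give the rest to DataLoader workers.
num_workers = max(cpus - 1, 0)
print(f"CPUs: {cpus}, DataLoader workers: {num_workers}")
```

This way the same script stays correct when you change --cpus-per-task in the batch file.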

GPUs (--gres)

Request GPUs with the --gres (generic resources) option:

#SBATCH --gres=gpu:1    # One GPU (any type)
#SBATCH --gres=gpu:2    # Two GPUs
#SBATCH --gres=gpu:l40:1   # Specifically an L40 GPU
#SBATCH --gres=gpu:a40:2   # Two A40 GPUs

Available GPU types on DAIC include L40, A40, and RTX Pro 6000. Request specific types only if your code requires it - being flexible gets you through the queue faster.

Running GPU jobs

Most deep learning jobs need GPUs. Here’s a complete example:

The Python training script

# train.py
import torch
import torch.nn as nn

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Simple training loop
model = nn.Linear(1000, 100).to(device)
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(100):
    x = torch.randn(64, 1000, device=device)
    y = model(x)
    loss = y.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, loss: {loss.item():.4f}")

print("Training complete!")

The batch script

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --output=train_%j.out

# Clean environment and load required modules
module purge
module load 2025/gpu cuda/12.9

# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Running on: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Start time: $(date)"

# Run training
srun python train.py

echo "End time: $(date)"

Understanding the module system

DAIC uses an environment modules system to manage software. Instead of having every version of every library available at once (which would cause conflicts), software is organized into modules that you load when needed.

The module commands set up your software environment:

module purge            # Clear any previously loaded modules
module load 2025/gpu    # Load the 2025 GPU software stack
module load cuda/12.9   # Load CUDA 12.9

Why use modules?

  • Version control: Run module load python/3.11 today, python/3.12 tomorrow
  • Avoid conflicts: Different projects can use different library versions
  • Clean environment: module purge gives you a fresh start

Common module commands:

Command              Purpose
module avail         List all available modules
module avail cuda    List modules matching "cuda"
module list          Show currently loaded modules
module load <name>   Load a module
module purge         Unload all modules

For a complete guide, see Loading Software.

Submit and monitor

$ sbatch train_job.sh
Submitted batch job 12350

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12350       all train_jo  netid01  R       0:45      1 gpu15

$ tail -f train_12350.out
Job ID: 12350
Running on: gpu15.ethernet.tudhpc
GPUs: 0
Start time: Fri Mar 20 11:00:00 CET 2026
Using device: cuda
GPU: NVIDIA L40
Memory: 45.0 GB
Epoch 0, loss: 156.7823
Epoch 10, loss: 89.3421
...

The tail -f command shows output in real-time as your job runs.

Interactive jobs for testing

Before submitting a long batch job, test your code interactively:

Request an interactive session

$ salloc --account=<your-account> --partition=all --time=1:00:00 --cpus-per-task=4 --mem=8G --gres=gpu:1
salloc: Pending job allocation 12351
salloc: job 12351 queued and waiting for resources
salloc: job 12351 has been allocated resources
salloc: Granted job allocation 12351

You now have resources reserved. But you’re still on the login node - you need srun to actually use the compute node:

Run commands on the compute node

$ srun hostname
gpu15.ethernet.tudhpc

$ srun nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.9     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA L40          On   | 00000000:41:00.0 Off |                    0 |
| N/A   30C    P8    22W / 300W |      0MiB / 46068MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

$ srun python train.py
Using device: cuda
...

Start an interactive shell on the compute node

For more extended testing, start a shell on the compute node:

$ srun --pty bash
$ hostname
gpu15.ethernet.tudhpc
$ python train.py
...
$ exit

Don’t forget to release resources

When done testing, release your allocation:

$ exit
salloc: Relinquishing job allocation 12351

If you forget, you’ll hold resources for the full time you requested, even if you’re not using them. This isn’t fair to other users.

Job arrays: running many similar jobs

Often you need to run the same code with different parameters - different random seeds, different hyperparameters, or different data splits. Job arrays make this easy.

The problem

You want to run your experiment with seeds 1 through 10. You could submit 10 separate jobs:

$ sbatch --export=SEED=1 experiment.sh
$ sbatch --export=SEED=2 experiment.sh
$ sbatch --export=SEED=3 experiment.sh
... # tedious!

The solution: job arrays

Instead, use a single job array:

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --array=1-10
#SBATCH --output=experiment_%A_%a.out

# %A = array job ID, %a = array task ID
echo "Array job ID: $SLURM_ARRAY_JOB_ID"
echo "Array task ID: $SLURM_ARRAY_TASK_ID"

srun python experiment.py --seed $SLURM_ARRAY_TASK_ID

Submit once, get 10 jobs:

$ sbatch experiment_array.sh
Submitted batch job 12360

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12360_1       all experime  netid01  R       0:30      1 gpu01
12360_2       all experime  netid01  R       0:30      1 gpu02
12360_3       all experime  netid01  R       0:30      1 gpu03
12360_4       all experime  netid01 PD       0:00      1 (Resources)
...

Array variations

#SBATCH --array=1-100        # Tasks 1 through 100
#SBATCH --array=0-9          # Tasks 0 through 9
#SBATCH --array=1,3,5,7      # Just these specific tasks
#SBATCH --array=1-100%10     # 1-100, but max 10 running at once

The %10 syntax limits concurrent tasks, useful if you don’t want to flood the queue.

Using array indices creatively

Your Python code can use $SLURM_ARRAY_TASK_ID for more than just seeds:

import os
import json

task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', 0))

# Load hyperparameter configurations
with open('configs.json') as f:
    configs = json.load(f)

# Submit with --array=0-3 so task IDs match the list indices
config = configs[task_id]
print(f"Running with config: {config}")

Where configs.json contains:

[
  {"lr": 0.001, "batch_size": 32},
  {"lr": 0.001, "batch_size": 64},
  {"lr": 0.01, "batch_size": 32},
  {"lr": 0.01, "batch_size": 64}
]
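You don't have to write configs.json by hand. A sketch that generates the same four configurations with itertools.product:

```python
import itertools
import json

# Cartesian product of the two hyperparameters shown above.
grid = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in itertools.product([0.001, 0.01], [32, 64])
]

with open("configs.json", "w") as f:
    json.dump(grid, f, indent=2)

# 4 configs -> submit with --array=0-3 so the task ID indexes the list.
print(len(grid))  # 4
```

Adding a hyperparameter or a value later is then a one-line change, and the array range follows from len(grid).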

Job dependencies: workflows

Sometimes jobs must run in a specific order. Job dependencies let you express this.

Run after another job succeeds

$ sbatch preprocess.sh
Submitted batch job 12370

$ sbatch --dependency=afterok:12370 train.sh
Submitted batch job 12371

Job 12371 won’t start until job 12370 completes successfully. If 12370 fails, 12371 never runs.

Dependency types

Dependency         Meaning
afterok:jobid      Start after job succeeds
afternotok:jobid   Start after job fails
afterany:jobid     Start after job finishes (either way)
after:jobid        Start after job starts
singleton          Only one job with this name at a time

Complex workflows

Chain multiple dependencies:

$ sbatch download_data.sh
Submitted batch job 12380

$ sbatch --dependency=afterok:12380 preprocess.sh
Submitted batch job 12381

$ sbatch --dependency=afterok:12381 train.sh
Submitted batch job 12382

$ sbatch --dependency=afterok:12382 evaluate.sh
Submitted batch job 12383

Or depend on multiple jobs:

$ sbatch train_model_a.sh
Submitted batch job 12390

$ sbatch train_model_b.sh
Submitted batch job 12391

$ sbatch --dependency=afterok:12390:12391 ensemble.sh
Submitted batch job 12392

Job 12392 waits for both 12390 and 12391 to complete.
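For longer chains, capturing job IDs by hand gets error-prone. A sketch of doing it programmatically (the submit_job helper is illustrative, not a Slurm tool; it only relies on sbatch's standard "Submitted batch job N" output):

```python
import re
import subprocess

def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))

def submit_job(script, depends_on=None):
    """Submit a script, optionally with an afterok dependency."""
    cmd = ["sbatch"]
    if depends_on is not None:
        cmd.append(f"--dependency=afterok:{depends_on}")
    cmd.append(script)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_job_id(out.stdout)

# Usage (on a system with sbatch available):
# prep = submit_job("preprocess.sh")
# train = submit_job("train.sh", depends_on=prep)
```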

Checking job history and efficiency

View past jobs

$ sacct -u $USER --starttime=2026-03-01
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12340          training        all  ewi-insy          8  COMPLETED      0:0
12341            failed        all  ewi-insy          4     FAILED      1:0
12342          training        all  ewi-insy          8    TIMEOUT      0:0

The ExitCode column has the form code:signal:

  • 0:0 = success
  • 1:0 = your code exited with an error
  • 0:9 = killed by signal 9 (often out of memory)

A TIMEOUT state means the job exceeded its time limit.
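The code:signal pair in the ExitCode column can be split mechanically. An illustrative sketch:

```python
def parse_exit_code(field: str) -> tuple:
    """Split sacct's ExitCode field into (exit_code, signal)."""
    code, signal = field.split(":")
    return int(code), int(signal)

code, signal = parse_exit_code("0:9")
if signal == 9:
    print("Killed by SIGKILL - often the out-of-memory killer")
```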

Check efficiency

The seff command shows how well you used the resources you requested:

$ seff 12340
Job ID: 12340
Cluster: daic
State: COMPLETED
Nodes: 1
Cores per node: 8
CPU Utilized: 06:30:15
CPU Efficiency: 81.3% of 08:00:00 core-walltime
Job Wall-clock time: 01:00:00
Memory Utilized: 24.5 GB
Memory Efficiency: 76.6% of 32.0 GB

This job used 81% of allocated CPU and 77% of allocated memory - reasonable efficiency. If you see numbers below 50%, you’re requesting more than you need.
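The percentages seff prints are simple ratios, which you can reproduce from the numbers above (a sketch; the figures come from the example seff output):

```python
def to_seconds(hms: str) -> int:
    """Convert HH:MM:SS to seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h * 3600 + m * 60 + s

# From the seff output above: 8 cores for 1 hour of wall-clock time.
cpu_used = to_seconds("06:30:15")            # CPU time actually consumed
core_walltime = 8 * to_seconds("01:00:00")   # cores * wall-clock time
efficiency = 100 * cpu_used / core_walltime
print(f"CPU efficiency: {efficiency:.1f}%")  # 81.3%
```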

Adjusting based on efficiency

If seff shows:

  • Low CPU efficiency: Reduce --cpus-per-task
  • Low memory efficiency: Reduce --mem
  • Very high efficiency (>95%): Consider requesting slightly more headroom

Troubleshooting

Job stuck in pending

Check why with squeue:

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345       all training  netid01 PD       0:00      1 (Resources)

Common reasons:

  • Priority - Other jobs are ahead of you. Wait, or request fewer resources.
  • Resources - Not enough free nodes. Wait, or request fewer resources.
  • QOSMaxJobsPerUserLimit - You’ve hit your concurrent job limit. Wait for some to finish.
  • AssocMaxJobsLimit - Your account has hit its limit.

Job killed immediately

Check the output file for errors. Common issues:

Out of memory:

slurmstepd: error: Detected 1 oom-kill event(s) in step 12345.0

Solution: Increase --mem

Time limit:

slurmstepd: error: *** JOB 12345 ON gpu01 CANCELLED AT 2026-03-20T12:00:00 DUE TO TIME LIMIT ***

Solution: Increase --time or add checkpointing to your code

Module not found:

ModuleNotFoundError: No module named 'torch'

Solution: Add module load commands to your script

Can’t find GPUs

Your code can’t see GPUs even though you requested them:

torch.cuda.is_available()  # Returns False

Common causes:

  1. Forgot --gres=gpu:1 in your script
  2. Running on login node instead of through srun
  3. Missing module load cuda
  4. CUDA version mismatch
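The first two causes can be checked from inside the job itself. The helper below is a hypothetical diagnostic sketch (not a DAIC tool) that inspects the relevant environment variables, which Slurm sets for allocated jobs:

```python
import os

def gpu_allocation_hints(env=os.environ) -> list:
    """Return likely causes when CUDA reports no GPUs."""
    hints = []
    if "SLURM_JOB_ID" not in env:
        hints.append("Not inside a Slurm job - did you use sbatch or srun?")
    if not env.get("CUDA_VISIBLE_DEVICES"):
        hints.append("No GPU allocated - is --gres=gpu:1 in your script?")
    return hints

for hint in gpu_allocation_hints():
    print(hint)
```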

Best practices

1. Test before submitting long jobs

$ salloc --time=0:30:00 --gres=gpu:1 ...
$ srun python train.py --max-epochs 1  # Quick test
$ exit
$ sbatch full_training.sh  # Now submit the real job

2. Request only what you need

Larger requests wait longer in the queue. Start small and increase if needed.

3. Use meaningful job names

#SBATCH --job-name=bert-finetune-lr001

Makes squeue output much more readable.

4. Save checkpoints

For long jobs, save state periodically so you can resume if killed:

# Save checkpoint every epoch
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, f'checkpoint_epoch_{epoch}.pt')
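Saving is only half the story: a resubmitted job should find and load the newest checkpoint. A sketch of the resume side, matching the filename pattern above (the torch loading lines are shown as comments since they depend on your model):

```python
import glob
import re

def latest_checkpoint(pattern="checkpoint_epoch_*.pt"):
    """Return the checkpoint path with the highest epoch number, or None."""
    paths = glob.glob(pattern)
    if not paths:
        return None
    # Sort numerically on the epoch embedded in the filename.
    return max(paths, key=lambda p: int(re.search(r"(\d+)", p).group(1)))

ckpt = latest_checkpoint()
if ckpt is None:
    print("No checkpoint found - starting from epoch 0")
else:
    print(f"Resuming from {ckpt}")
    # state = torch.load(ckpt)
    # model.load_state_dict(state['model_state_dict'])
    # optimizer.load_state_dict(state['optimizer_state_dict'])
    # start_epoch = state['epoch'] + 1
```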

5. Use job arrays instead of many scripts

One job array is easier to manage than 100 separate submissions.

6. Check efficiency and tune

After your first successful run, check seff and adjust requests.

Quick reference

Submit and monitor

Command            Purpose
sbatch script.sh   Submit batch job
salloc ...         Request interactive session
srun command       Run command on allocated nodes
squeue -u $USER    View your jobs
scancel 12345      Cancel a job
scancel -u $USER   Cancel all your jobs

Information

Command                          Purpose
sinfo                            View partitions and nodes
scontrol show job 12345          Detailed job info
sacct -u $USER                   View job history
seff 12345                       Check job efficiency
sacctmgr show assoc user=$USER   View your accounts

Common sbatch options

Option            Example      Purpose
--account         ewi-insy     Billing account
--partition       all          Node group
--time            4:00:00      Time limit
--cpus-per-task   8            CPU cores
--mem             32G          Memory
--gres            gpu:1        GPUs
--output          log_%j.out   Output file
--array           1-10         Job array

Summary

You’ve learned:

Concept                       Key Commands
Submit a batch job            sbatch script.sh
Request interactive session   salloc --time=1:00:00 --gres=gpu:1 ...
Run on allocated nodes        srun python train.py
Check job status              squeue -u $USER
Cancel a job                  scancel <jobid>
View job history              sacct -u $USER
Check efficiency              seff <jobid>
Run parameter sweep           #SBATCH --array=1-10
Chain jobs                    --dependency=afterok:<jobid>

Exercises

Try these on your own to solidify your understanding:

Exercise 1: Basic job submission

Create and submit a job that prints your username, hostname, and current date. Check the output.

Exercise 2: GPU job

Modify the basic job to request a GPU. Add nvidia-smi to verify the GPU is available.

Exercise 3: Resource tuning

Submit a job, then use seff to check its efficiency. Was your resource request appropriate?

Exercise 4: Job array

Create a job array that runs 5 tasks. Each task should print its array task ID.

Exercise 5: Dependencies

Submit two jobs where the second depends on the first completing successfully.

Next steps