Slurm basics
What you’ll learn
By the end of this tutorial, you’ll be able to:
- Submit batch jobs that run on compute nodes
- Request CPUs, memory, and GPUs for your jobs
- Monitor job status and troubleshoot failures
- Use interactive sessions for testing
- Run parameter sweeps with job arrays
Time: About 45 minutes
Prerequisites: Complete the Bash Basics tutorial first, or be comfortable with the Linux command line.
What is Slurm?
When you log into DAIC, you land on a login node. This is a shared computer where users prepare their work - but you shouldn’t run computations here. The actual computing happens on compute nodes, powerful machines with GPUs and lots of memory.
Slurm is the traffic controller that manages these compute nodes. When you want to run a computation, you don’t run it directly - you ask Slurm to run it for you. Slurm finds available resources, starts your job, and makes sure it doesn’t interfere with other users’ jobs.
Think of it like a restaurant: you don’t walk into the kitchen and cook your own food. You submit an order (your job), and the kitchen (Slurm) prepares it when they have capacity.
Why can’t I just run my code?
You might wonder: “Why can’t I just type python train.py and let it run?”
On a personal computer, that works fine. But DAIC is shared by hundreds of researchers, each wanting to use expensive GPUs. Without a scheduler:
- Everyone would fight over the same resources
- Your job might get killed when someone else starts theirs
- GPUs would sit idle when no one happens to be logged in
- There would be no fairness - whoever types fastest wins
Slurm solves these problems by:
- Queueing jobs and running them in order
- Guaranteeing that your job gets the resources you requested
- Ensuring fair access based on policies
- Maximizing utilization of expensive hardware
The two ways to run jobs
Batch jobs: submit and walk away
Most of the time, you’ll use batch jobs. You write a script that describes what you want to run, submit it, and Slurm runs it whenever resources are available. You don’t need to stay logged in - you can submit at 5pm, go home, and check results the next morning.
$ sbatch my_job.sh
Submitted batch job 12345
Your job enters a queue. When resources become available, Slurm runs it. Output goes to a file you can read later.
Interactive jobs: real-time access
Sometimes you need to work interactively - debugging, testing, or exploring data. For this, you request an interactive job. Slurm allocates resources, and you get a shell on a compute node.
$ salloc --account=<your-account> --partition=all --time=1:00:00 --gres=gpu:1
salloc: Granted job allocation 12346
$ srun nvidia-smi
$ srun python -c "import torch; print(torch.cuda.is_available())"
True
Interactive jobs are great for testing but expensive - you’re reserving resources the whole time, even if you’re just thinking. Use batch jobs for actual computations.
Your first batch job
Let’s walk through creating and submitting a batch job step by step.
Step 1: Create a Python script
First, create a simple script to run. This one just prints some information:
$ cd /tudelft.net/staff-umbrella/<project>
$ vim hello.py
import socket
import os
print(f"Hello from {socket.gethostname()}")
print(f"Job ID: {os.environ.get('SLURM_JOB_ID', 'not in slurm')}")
print(f"CPUs allocated: {os.environ.get('SLURM_CPUS_PER_TASK', 'unknown')}")
Step 2: Create a batch script
Now create the Slurm script that will run your Python code:
$ vim hello_job.sh
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=0:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=hello_%j.out
echo "Job started at $(date)"
srun python hello.py
echo "Job finished at $(date)"
Let’s understand each line:
| Line | Purpose |
|---|---|
#!/bin/bash | This is a bash script |
#SBATCH --account=... | Which account to bill (required) |
#SBATCH --partition=all | Which group of nodes to use |
#SBATCH --time=0:10:00 | Maximum runtime: 10 minutes |
#SBATCH --ntasks=1 | Run one task |
#SBATCH --cpus-per-task=1 | Use one CPU core |
#SBATCH --mem=1G | Request 1 GB of memory |
#SBATCH --output=hello_%j.out | Where to write output (%j = job ID) |
srun python hello.py | The actual command to run |
Step 3: Find your account
Before submitting, you need to know your account name:
$ sacctmgr show associations user=$USER format=Account -P
Account
ewi-insy-reit
Replace <your-account> in your script with this value (e.g., ewi-insy-reit).
Step 4: Submit the job
$ sbatch hello_job.sh
Submitted batch job 12345
The number 12345 is your job ID. You’ll use this to track your job.
Step 5: Check job status
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 all hello_jo netid01 PD 0:00 1 (Priority)
The ST column shows the status:
- PD = Pending - waiting in the queue
- R = Running
- CG = Completing - wrapping up
The REASON column tells you why a job is pending:
- Priority = other jobs are ahead of you in the queue
- Resources = waiting for nodes to become free
- QOSMaxJobsPerUserLimit = you've hit your job limit
Step 6: Check the output
Once the job completes, read the output file:
$ cat hello_12345.out
Job started at Fri Mar 20 10:15:32 CET 2026
Hello from gpu23.ethernet.tudhpc
Job ID: 12345
CPUs allocated: 1
Job finished at Fri Mar 20 10:15:33 CET 2026
Your code ran on gpu23, not on the login node. Slurm handled everything.
Understanding resource requests
The most confusing part of Slurm is figuring out what resources to request. Request too little and your job crashes; request too much and you wait longer in the queue.
Time (--time)
How long your job will run. Format: D-HH:MM:SS or HH:MM:SS
#SBATCH --time=0:30:00 # 30 minutes
#SBATCH --time=4:00:00 # 4 hours
#SBATCH --time=1-00:00:00 # 1 day
#SBATCH --time=7-00:00:00 # 7 days (maximum on DAIC)
Important: If your job exceeds this time, Slurm kills it. But requesting more time means waiting longer in the queue. Start with a generous estimate, then use seff on completed jobs to tune it.
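If you estimate runtimes in seconds (say, from a quick benchmark), a small helper can format them for the --time flag. This is a hypothetical convenience function, not part of Slurm:

```python
def slurm_time(seconds: int) -> str:
    """Format a runtime estimate in seconds as Slurm's D-HH:MM:SS."""
    days, rem = divmod(seconds, 86400)      # 86400 seconds per day
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    if days:
        return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"
    return f"{hours}:{minutes:02d}:{secs:02d}"

print(slurm_time(1800))    # a 30-minute job
print(slurm_time(90061))   # one day, one hour, one minute, one second
```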
Memory (--mem)
How much RAM your job needs.
#SBATCH --mem=4G # 4 gigabytes
#SBATCH --mem=32G # 32 gigabytes
#SBATCH --mem=128G # 128 gigabytes
If your job exceeds this limit, Slurm kills it with an “out of memory” error. Check your code’s actual memory usage with seff after a successful run.
CPUs (--cpus-per-task)
How many CPU cores your job needs.
#SBATCH --cpus-per-task=1 # Single-threaded code
#SBATCH --cpus-per-task=4 # Code that uses 4 threads
#SBATCH --cpus-per-task=16 # Heavily parallel CPU code
Match this to what your code actually uses:
- Simple Python scripts: 1 CPU
- PyTorch with DataLoader workers: workers + 1 (e.g., 4 workers = 5 CPUs)
- NumPy/Pandas with parallelism: however many threads you configure
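Inside a job, the allocation is visible through the SLURM_CPUS_PER_TASK environment variable, so code can size its thread pool to match instead of hard-coding a number. A minimal sketch (setting OMP_NUM_THREADS covers OpenMP-backed libraries like NumPy; framework-specific calls such as torch.set_num_threads follow the same pattern):

```python
import os

# Fall back to 1 when not running under Slurm (e.g., local testing).
n_cpus = int(os.environ.get('SLURM_CPUS_PER_TASK', '1'))

# Tell OpenMP-backed libraries (NumPy, scikit-learn, ...) how many
# threads to use; set this before the library spins up its threads.
os.environ['OMP_NUM_THREADS'] = str(n_cpus)

print(f"Using {n_cpus} thread(s)")
```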
GPUs (--gres)
Request GPUs with the --gres (generic resources) option:
#SBATCH --gres=gpu:1 # One GPU (any type)
#SBATCH --gres=gpu:2 # Two GPUs
#SBATCH --gres=gpu:l40:1 # Specifically an L40 GPU
#SBATCH --gres=gpu:a40:2 # Two A40 GPUs
Available GPU types on DAIC include L40, A40, and RTX Pro 6000. Request specific types only if your code requires it - being flexible gets you through the queue faster.
Running GPU jobs
Most deep learning jobs need GPUs. Here’s a complete example:
The Python training script
# train.py
import torch
import torch.nn as nn

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Simple training loop
model = nn.Linear(1000, 100).to(device)
optimizer = torch.optim.Adam(model.parameters())
for epoch in range(100):
    x = torch.randn(64, 1000, device=device)
    y = model(x)
    loss = y.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if epoch % 10 == 0:
        print(f"Epoch {epoch}, loss: {loss.item():.4f}")
print("Training complete!")
The batch script
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --output=train_%j.out
# Clean environment and load required modules
module purge
module load 2025/gpu cuda/12.9
# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Running on: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Start time: $(date)"
# Run training
srun python train.py
echo "End time: $(date)"
Understanding the module system
DAIC uses an environment modules system to manage software. Instead of having every version of every library available at once (which would cause conflicts), software is organized into modules that you load when needed.
The module commands set up your software environment:
module purge # Clear any previously loaded modules
module load 2025/gpu # Load the 2025 GPU software stack
module load cuda/12.9 # Load CUDA 12.9
Why use modules?
- Version control: run module load python/3.11 today, python/3.12 tomorrow
- Avoid conflicts: different projects can use different library versions
- Clean environment: module purge gives you a fresh start
Common module commands:
| Command | Purpose |
|---|---|
module avail | List all available modules |
module avail cuda | List modules matching “cuda” |
module list | Show currently loaded modules |
module load <name> | Load a module |
module purge | Unload all modules |
For a complete guide, see Loading Software.
Submit and monitor
$ sbatch train_job.sh
Submitted batch job 12350
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12350 all train_jo netid01 R 0:45 1 gpu15
$ tail -f train_12350.out
Job ID: 12350
Running on: gpu15.ethernet.tudhpc
GPUs: 0
Start time: Fri Mar 20 11:00:00 CET 2026
Using device: cuda
GPU: NVIDIA L40
Memory: 45.0 GB
Epoch 0, loss: 156.7823
Epoch 10, loss: 89.3421
...
The tail -f command shows output in real-time as your job runs.
Interactive jobs for testing
Before submitting a long batch job, test your code interactively:
Request an interactive session
$ salloc --account=<your-account> --partition=all --time=1:00:00 --cpus-per-task=4 --mem=8G --gres=gpu:1
salloc: Pending job allocation 12351
salloc: job 12351 queued and waiting for resources
salloc: job 12351 has been allocated resources
salloc: Granted job allocation 12351
You now have resources reserved. But you’re still on the login node - you need srun to actually use the compute node:
Run commands on the compute node
$ srun hostname
gpu15.ethernet.tudhpc
$ srun nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.9 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 NVIDIA L40 On | 00000000:41:00.0 Off | 0 |
| N/A 30C P8 22W / 300W | 0MiB / 46068MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
$ srun python train.py
Using device: cuda
...
Start an interactive shell on the compute node
For more extended testing, start a shell on the compute node:
$ srun --pty bash
$ hostname
gpu15.ethernet.tudhpc
$ python train.py
...
$ exit
Don’t forget to release resources
When done testing, release your allocation:
$ exit
salloc: Relinquishing job allocation 12351
If you forget, you’ll hold resources for the full time you requested, even if you’re not using them. This isn’t fair to other users.
Job arrays: running many similar jobs
Often you need to run the same code with different parameters - different random seeds, different hyperparameters, or different data splits. Job arrays make this easy.
The problem
You want to run your experiment with seeds 1 through 10. You could submit 10 separate jobs:
$ sbatch --export=SEED=1 experiment.sh
$ sbatch --export=SEED=2 experiment.sh
$ sbatch --export=SEED=3 experiment.sh
... # tedious!
The solution: job arrays
Instead, use a single job array:
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --array=1-10
#SBATCH --output=experiment_%A_%a.out
# %A = array job ID, %a = array task ID
echo "Array job ID: $SLURM_ARRAY_JOB_ID"
echo "Array task ID: $SLURM_ARRAY_TASK_ID"
srun python experiment.py --seed $SLURM_ARRAY_TASK_ID
Submit once, get 10 jobs:
$ sbatch experiment_array.sh
Submitted batch job 12360
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12360_1 all experime netid01 R 0:30 1 gpu01
12360_2 all experime netid01 R 0:30 1 gpu02
12360_3 all experime netid01 R 0:30 1 gpu03
12360_4 all experime netid01 PD 0:00 1 (Resources)
...
Array variations
#SBATCH --array=1-100 # Tasks 1 through 100
#SBATCH --array=0-9 # Tasks 0 through 9
#SBATCH --array=1,3,5,7 # Just these specific tasks
#SBATCH --array=1-100%10 # 1-100, but max 10 running at once
The %10 syntax limits concurrent tasks, useful if you don’t want to flood the queue.
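A single task ID can also index a grid of combinations. With divmod, task IDs 0-5 map onto every pairing of, say, 3 learning rates and 2 batch sizes (the specific values here are illustrative):

```python
import os

learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64]

# Submit with #SBATCH --array=0-5 (3 * 2 = 6 combinations, 0-indexed).
task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', '0'))

# divmod splits one flat index into (row, column) of the grid.
lr_idx, bs_idx = divmod(task_id, len(batch_sizes))
lr = learning_rates[lr_idx]
batch_size = batch_sizes[bs_idx]
print(f"Task {task_id}: lr={lr}, batch_size={batch_size}")
```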
Using array indices creatively
Your Python code can use $SLURM_ARRAY_TASK_ID for more than just seeds:
import os
import json

task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', 0))

# Load hyperparameter configurations
with open('configs.json') as f:
    configs = json.load(f)

# Note: configs is a 0-indexed list, so submit with --array=0-3
# (or use configs[task_id - 1] if you submit with --array=1-4).
config = configs[task_id]
print(f"Running with config: {config}")
Where configs.json contains:
[
{"lr": 0.001, "batch_size": 32},
{"lr": 0.001, "batch_size": 64},
{"lr": 0.01, "batch_size": 32},
{"lr": 0.01, "batch_size": 64}
]
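A file like this doesn't have to be written by hand; a short script can generate the grid. A sketch, writing configs.json to the current directory:

```python
import itertools
import json

# Cartesian product of the hyperparameter values to sweep.
grid = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in itertools.product([0.001, 0.01], [32, 64])
]

with open('configs.json', 'w') as f:
    json.dump(grid, f, indent=2)

print(f"Wrote {len(grid)} configs; use #SBATCH --array=0-{len(grid) - 1}")
```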
Job dependencies: workflows
Sometimes jobs must run in a specific order. Job dependencies let you express this.
Run after another job succeeds
$ sbatch preprocess.sh
Submitted batch job 12370
$ sbatch --dependency=afterok:12370 train.sh
Submitted batch job 12371
Job 12371 won’t start until job 12370 completes successfully. If 12370 fails, 12371 never runs.
Dependency types
| Dependency | Meaning |
|---|---|
afterok:jobid | Start after job succeeds |
afternotok:jobid | Start after job fails |
afterany:jobid | Start after job finishes (either way) |
after:jobid | Start after job starts |
singleton | Only one job with this name at a time |
Complex workflows
Chain multiple dependencies:
$ sbatch download_data.sh
Submitted batch job 12380
$ sbatch --dependency=afterok:12380 preprocess.sh
Submitted batch job 12381
$ sbatch --dependency=afterok:12381 train.sh
Submitted batch job 12382
$ sbatch --dependency=afterok:12382 evaluate.sh
Submitted batch job 12383
Or depend on multiple jobs:
$ sbatch train_model_a.sh
Submitted batch job 12390
$ sbatch train_model_b.sh
Submitted batch job 12391
$ sbatch --dependency=afterok:12390:12391 ensemble.sh
Submitted batch job 12392
Job 12392 waits for both 12390 and 12391 to complete.
Checking job history and efficiency
View past jobs
$ sacct -u $USER --starttime=2026-03-01
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12340 training all ewi-insy 8 COMPLETED 0:0
12341 failed all ewi-insy 4 FAILED 1:0
12342 training all ewi-insy 8 TIMEOUT 0:0
Exit codes:
- 0:0 = success
- 1:0 = your code exited with an error
- 0:9 = killed by signal 9 (often out of memory)
- TIMEOUT = exceeded the time limit
Check efficiency
The seff command shows how well you used the resources you requested:
$ seff 12340
Job ID: 12340
Cluster: daic
State: COMPLETED
Nodes: 1
Cores per node: 8
CPU Utilized: 06:30:15
CPU Efficiency: 81.3% of 08:00:00 core-walltime
Job Wall-clock time: 01:00:00
Memory Utilized: 24.5 GB
Memory Efficiency: 76.6% of 32.0 GB
This job used 81% of allocated CPU and 77% of allocated memory - reasonable efficiency. If you see numbers below 50%, you’re requesting more than you need.
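The efficiency figures are plain ratios you can recompute yourself: CPU efficiency is CPU time used divided by core-walltime (cores times wall-clock time). A sketch using the numbers from the seff output above:

```python
def hms_to_seconds(t: str) -> int:
    """Convert an HH:MM:SS string to seconds."""
    h, m, s = (int(x) for x in t.split(':'))
    return h * 3600 + m * 60 + s

cores = 8
cpu_used = hms_to_seconds('06:30:15')        # CPU Utilized
walltime = hms_to_seconds('01:00:00')        # Job Wall-clock time

# Core-walltime is what you reserved: cores * wall-clock time.
efficiency = cpu_used / (cores * walltime) * 100
print(f"CPU Efficiency: {efficiency:.1f}%")  # matches seff's 81.3%
```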
Adjusting based on efficiency
If seff shows:
- Low CPU efficiency: reduce --cpus-per-task
- Low memory efficiency: reduce --mem
- Very high efficiency (>95%): consider requesting slightly more headroom
Troubleshooting
Job stuck in pending
Check why with squeue:
$ squeue -u $USER
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345 all training netid01 PD 0:00 1 (Resources)
Common reasons:
- Priority - other jobs are ahead of you. Wait, or request fewer resources.
- Resources - not enough free nodes. Wait, or request fewer resources.
- QOSMaxJobsPerUserLimit - you've hit your concurrent job limit. Wait for some to finish.
- AssocMaxJobsLimit - your account has hit its limit.
Job killed immediately
Check the output file for errors. Common issues:
Out of memory:
slurmstepd: error: Detected 1 oom-kill event(s) in step 12345.0
Solution: Increase --mem
Time limit:
slurmstepd: error: *** JOB 12345 ON gpu01 CANCELLED AT 2026-03-20T12:00:00 DUE TO TIME LIMIT ***
Solution: Increase --time or add checkpointing to your code
Module not found:
ModuleNotFoundError: No module named 'torch'
Solution: Add module load commands to your script
Can’t find GPUs
Your code can’t see GPUs even though you requested them:
torch.cuda.is_available() # Returns False
Common causes:
- Forgot --gres=gpu:1 in your script
- Running on the login node instead of through srun
- Missing module load cuda
- CUDA version mismatch
Best practices
1. Test before submitting long jobs
$ salloc --time=0:30:00 --gres=gpu:1 ...
$ srun python train.py --max-epochs 1 # Quick test
$ exit
$ sbatch full_training.sh # Now submit the real job
2. Request only what you need
Larger requests wait longer in the queue. Start small and increase if needed.
3. Use meaningful job names
#SBATCH --job-name=bert-finetune-lr001
Makes squeue output much more readable.
4. Save checkpoints
For long jobs, save state periodically so you can resume if killed:
# Save checkpoint every epoch
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, f'checkpoint_epoch_{epoch}.pt')
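The same pattern enables resuming: at startup, look for the newest checkpoint and continue from there. A framework-agnostic sketch using pickle (with PyTorch you would swap in torch.save/torch.load and checkpoint_*.pt files; the 5-epoch loop is a stand-in for a real training loop):

```python
import glob
import os
import pickle

def latest_checkpoint(pattern='checkpoint_epoch_*.pkl'):
    """Return the path of the newest checkpoint, or None if none exist."""
    paths = glob.glob(pattern)
    return max(paths, key=os.path.getmtime) if paths else None

start_epoch = 0
path = latest_checkpoint()
if path is not None:
    with open(path, 'rb') as f:
        state = pickle.load(f)
    start_epoch = state['epoch'] + 1        # resume after the saved epoch

for epoch in range(start_epoch, 5):
    # ... one epoch of training would go here ...
    with open(f'checkpoint_epoch_{epoch}.pkl', 'wb') as f:
        pickle.dump({'epoch': epoch}, f)

print(f"Reached epoch 4, started from epoch {start_epoch}")
```

If Slurm kills the job at the time limit, resubmitting the same script picks up where the last saved epoch left off instead of starting over.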
5. Use job arrays instead of many scripts
One job array is easier to manage than 100 separate submissions.
6. Check efficiency and tune
After your first successful run, check seff and adjust requests.
Quick reference
Submit and monitor
| Command | Purpose |
|---|---|
sbatch script.sh | Submit batch job |
salloc ... | Request interactive session |
srun command | Run command on allocated nodes |
squeue -u $USER | View your jobs |
scancel 12345 | Cancel a job |
scancel -u $USER | Cancel all your jobs |
Information
| Command | Purpose |
|---|---|
sinfo | View partitions and nodes |
scontrol show job 12345 | Detailed job info |
sacct -u $USER | View job history |
seff 12345 | Check job efficiency |
sacctmgr show assoc user=$USER | View your accounts |
Common sbatch options
| Option | Example | Purpose |
|---|---|---|
--account | ewi-insy | Billing account |
--partition | all | Node group |
--time | 4:00:00 | Time limit |
--cpus-per-task | 8 | CPU cores |
--mem | 32G | Memory |
--gres | gpu:1 | GPUs |
--output | log_%j.out | Output file |
--array | 1-10 | Job array |
Summary
You’ve learned:
| Concept | Key Commands |
|---|---|
| Submit a batch job | sbatch script.sh |
| Request interactive session | salloc --time=1:00:00 --gres=gpu:1 ... |
| Run on allocated node | srun python train.py |
| Check job status | squeue -u $USER |
| Cancel a job | scancel <jobid> |
| View job history | sacct -u $USER |
| Check efficiency | seff <jobid> |
| Run parameter sweep | #SBATCH --array=1-10 |
| Chain jobs | --dependency=afterok:<jobid> |
Exercises
Try these on your own to solidify your understanding:
Exercise 1: Basic job submission
Create and submit a job that prints your username, hostname, and current date. Check the output.
Check your work
Your output file should contain something like:
netid01
gpu15.ethernet.tudhpc
Fri Mar 20 10:30:00 CET 2026
The hostname should be a compute node (not daic01).
Exercise 2: GPU job
Modify the basic job to request a GPU. Add nvidia-smi to verify the GPU is available.
Check your work
Your output should include nvidia-smi output showing a GPU:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI ... Driver Version: ... CUDA Version: ... |
|-------------------------------+----------------------+----------------------+
| GPU Name ...
If you see “NVIDIA-SMI has failed”, check that you requested a GPU with --gres=gpu:1.
Exercise 3: Resource tuning
Submit a job, then use seff to check its efficiency. Was your resource request appropriate?
Check your work
Run seff <jobid> after your job completes. Good efficiency looks like:
CPU Efficiency: 70-95%
Memory Efficiency: 50-90%
If efficiency is below 50%, reduce your request next time.
Exercise 4: Job array
Create a job array that runs 5 tasks. Each task should print its array task ID.
Check your work
You should see 5 output files (e.g., job_12345_1.out through job_12345_5.out). Each should contain its task ID:
$ cat job_*_1.out
Task ID: 1
$ cat job_*_5.out
Task ID: 5
Exercise 5: Dependencies
Submit two jobs where the second depends on the first completing successfully.
Check your work
After submitting both jobs, squeue -u $USER should show:
JOBID PARTITION NAME USER ST REASON
12346 all second netid01 PD (Dependency)
12345 all first netid01 R
The second job shows (Dependency) while waiting. After the first completes, the second starts automatically.
Next steps
- Apptainer Tutorial - Package your environment in containers
- Vim Tutorial - Edit files efficiently on the cluster
- Modules - Load pre-installed software