Slurm basics

Understanding the job scheduler on DAIC.

What you’ll learn

By the end of this tutorial, you’ll be able to:

  • Submit batch jobs that run on compute nodes
  • Request CPUs, memory, and GPUs for your jobs
  • Monitor job status and troubleshoot failures
  • Use interactive sessions for testing
  • Run parameter sweeps with job arrays

Time: About 45 minutes

Prerequisites: Complete the Bash Basics tutorial first, or be comfortable with Linux command line.


What is Slurm?

When you log into DAIC, you land on a login node. This is a shared computer where users prepare their work - but you shouldn’t run computations here. The actual computing happens on compute nodes, powerful machines with GPUs and lots of memory.

Slurm is the traffic controller that manages these compute nodes. When you want to run a computation, you don’t run it directly - you ask Slurm to run it for you. Slurm finds available resources, starts your job, and makes sure it doesn’t interfere with other users’ jobs.

Think of it like a restaurant: you don’t walk into the kitchen and cook your own food. You submit an order (your job), and the kitchen (Slurm) prepares it when they have capacity.

Why can’t I just run my code?

You might wonder: “Why can’t I just type python train.py and let it run?”

On a personal computer, that works fine. But DAIC is shared by hundreds of researchers, each wanting to use expensive GPUs. Without a scheduler:

  • Everyone would fight over the same resources
  • Your job might get killed when someone else starts theirs
  • GPUs would sit idle when no one happens to be logged in
  • There would be no fairness - whoever types fastest wins

Slurm solves these problems by:

  • Queueing jobs and running them in order
  • Guaranteeing that your job gets the resources you requested
  • Ensuring fair access based on policies
  • Maximizing utilization of expensive hardware

The two ways to run jobs

Batch jobs: submit and walk away

Most of the time, you’ll use batch jobs. You write a script that describes what you want to run, submit it, and Slurm runs it whenever resources are available. You don’t need to stay logged in - you can submit at 5pm, go home, and check results the next morning.

$ sbatch my_job.sh
Submitted batch job 12345

Your job enters a queue. When resources become available, Slurm runs it. Output goes to a file you can read later.

Interactive jobs: real-time access

Sometimes you need to work interactively - debugging, testing, or exploring data. For this, you request an interactive job. Slurm allocates resources, and you get a shell on a compute node.

$ salloc --account=<your-account> --partition=all --time=1:00:00 --gres=gpu:1
salloc: Granted job allocation 12346
$ srun nvidia-smi
$ srun python -c "import torch; print(torch.cuda.is_available())"
True

Interactive jobs are great for testing but expensive - you’re reserving resources the whole time, even if you’re just thinking. Use batch jobs for actual computations.

Your first batch job

Let’s walk through creating and submitting a batch job step by step.

Step 1: Create a Python script

First, create a simple script to run. This one just prints some information:

$ cd /tudelft.net/staff-umbrella/<project>
$ vim hello.py
import socket
import os

print(f"Hello from {socket.gethostname()}")
print(f"Job ID: {os.environ.get('SLURM_JOB_ID', 'not in slurm')}")
print(f"CPUs allocated: {os.environ.get('SLURM_CPUS_PER_TASK', 'unknown')}")

Step 2: Create a batch script

Now create the Slurm script that will run your Python code:

$ vim hello_job.sh
#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=0:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=hello_%j.out

echo "Job started at $(date)"

srun python hello.py

echo "Job finished at $(date)"

Let’s understand each line:

Line                            Purpose
#!/bin/bash                     This is a bash script
#SBATCH --account=...           Which account to bill (required)
#SBATCH --partition=all         Which group of nodes to use
#SBATCH --time=0:10:00          Maximum runtime: 10 minutes
#SBATCH --ntasks=1              Run one task
#SBATCH --cpus-per-task=1       Use one CPU core
#SBATCH --mem=1G                Request 1 GB of memory
#SBATCH --output=hello_%j.out   Where to write output (%j = job ID)
srun python hello.py            The actual command to run

Step 3: Find your account

Before submitting, you need to know your account name:

$ sacctmgr show associations user=$USER format=Account -P
Account
ewi-insy-reit

Replace <your-account> in your script with this value (e.g., ewi-insy-reit).

Step 4: Submit the job

$ sbatch hello_job.sh
Submitted batch job 12345

The number 12345 is your job ID. You’ll use this to track your job.

Step 5: Check job status

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345       all hello_jo  netid01 PD       0:00      1 (Priority)

The ST column shows the status:

  • PD = Pending - waiting in queue
  • R = Running
  • CG = Completing - wrapping up

The REASON column tells you why a job is pending:

  • Priority = other jobs are ahead of you in the queue
  • Resources = waiting for nodes to become free
  • QOSMaxJobsPerUserLimit = you’ve hit your job limit

Step 6: Check the output

Once the job completes, read the output file:

$ cat hello_12345.out
Job started at Fri Mar 20 10:15:32 CET 2026
Hello from gpu23.ethernet.tudhpc
Job ID: 12345
CPUs allocated: 1
Job finished at Fri Mar 20 10:15:33 CET 2026

Your code ran on gpu23, not on the login node. Slurm handled everything.

Understanding resource requests

The most confusing part of Slurm is figuring out what resources to request. Request too little and your job crashes; request too much and you wait longer in the queue.

Time (--time)

How long your job will run. Format: D-HH:MM:SS or HH:MM:SS

#SBATCH --time=0:30:00      # 30 minutes
#SBATCH --time=4:00:00      # 4 hours
#SBATCH --time=1-00:00:00   # 1 day
#SBATCH --time=7-00:00:00   # 7 days (maximum on DAIC)

Important: If your job exceeds this time, Slurm kills it. But requesting more time means waiting longer in the queue. Start with a generous estimate, then use seff on completed jobs to tune it.
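To see how the two formats above relate, here is a small conversion helper. This is purely illustrative (Slurm does this parsing itself); it handles only the D-HH:MM:SS and HH:MM:SS forms shown above:

```python
def slurm_time_to_seconds(spec: str) -> int:
    """Convert a Slurm time spec (D-HH:MM:SS or HH:MM:SS) to seconds."""
    days = 0
    if "-" in spec:
        day_str, spec = spec.split("-", 1)
        days = int(day_str)
    hours, minutes, seconds = (int(p) for p in spec.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds

print(slurm_time_to_seconds("0:30:00"))     # 1800
print(slurm_time_to_seconds("1-00:00:00"))  # 86400
```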

Memory (--mem)

How much RAM your job needs.

#SBATCH --mem=4G      # 4 gigabytes
#SBATCH --mem=32G     # 32 gigabytes
#SBATCH --mem=128G    # 128 gigabytes

If your job exceeds this limit, Slurm kills it with an “out of memory” error. Check your code’s actual memory usage with seff after a successful run.

CPUs (--cpus-per-task)

How many CPU cores your job needs.

#SBATCH --cpus-per-task=1    # Single-threaded code
#SBATCH --cpus-per-task=4    # Code that uses 4 threads
#SBATCH --cpus-per-task=16   # Heavily parallel CPU code

Match this to what your code actually uses:

  • Simple Python scripts: 1 CPU
  • PyTorch with DataLoader workers: workers + 1 (e.g., 4 workers = 5 CPUs)
  • NumPy/Pandas with parallelism: however many threads you configure
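Rather than hard-coding a thread count, your script can read what Slurm actually granted. The sketch below shows one way to do this; the environment variables are real Slurm/OpenMP ones, but the "workers = CPUs - 1" split just follows the rule of thumb above:

```python
import os

# Read the core count Slurm granted; fall back to 1 outside a job.
cpus = int(os.environ.get("SLURM_CPUS_PER_TASK", "1"))

# Cap numeric thread pools (NumPy/OpenMP honor these if set before import).
os.environ.setdefault("OMP_NUM_THREADS", str(cpus))
os.environ.setdefault("MKL_NUM_THREADS", str(cpus))

# "workers + 1" rule from above, inverted: keep one core for the main
# process and give the rest to DataLoader workers.
num_workers = max(cpus - 1, 0)
print(f"CPUs: {cpus}, DataLoader workers: {num_workers}")
```

This way the same script stays correct when you change --cpus-per-task in the batch file.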

GPUs (--gres)

Request GPUs with the --gres (generic resources) option:

#SBATCH --gres=gpu:1    # One GPU (any type)
#SBATCH --gres=gpu:2    # Two GPUs
#SBATCH --gres=gpu:l40:1   # Specifically an L40 GPU
#SBATCH --gres=gpu:a40:2   # Two A40 GPUs

Available GPU types on DAIC include L40, A40, and RTX Pro 6000. Request specific types only if your code requires it - being flexible gets you through the queue faster.

Running GPU jobs

Most deep learning jobs need GPUs. Here’s a complete example:

The Python training script

# train.py
import torch
import torch.nn as nn

# Check GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Simple training loop
model = nn.Linear(1000, 100).to(device)
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(100):
    x = torch.randn(64, 1000, device=device)
    y = model(x)
    loss = y.sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        print(f"Epoch {epoch}, loss: {loss.item():.4f}")

print("Training complete!")

The batch script

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=1:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --output=train_%j.out

# Clean environment and load required modules
module purge
module load 2025/gpu cuda/12.9

# Print job info for debugging
echo "Job ID: $SLURM_JOB_ID"
echo "Running on: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Start time: $(date)"

# Run training
srun python train.py

echo "End time: $(date)"

Understanding the module system

DAIC uses an environment modules system to manage software. Instead of having every version of every library available at once (which would cause conflicts), software is organized into modules that you load when needed.

The module commands set up your software environment:

module purge            # Clear any previously loaded modules
module load 2025/gpu    # Load the 2025 GPU software stack
module load cuda/12.9   # Load CUDA 12.9

Why use modules?

  • Version control: Run module load python/3.11 today, python/3.12 tomorrow
  • Avoid conflicts: Different projects can use different library versions
  • Clean environment: module purge gives you a fresh start

Common module commands:

Command              Purpose
module avail         List all available modules
module avail cuda    List modules matching "cuda"
module list          Show currently loaded modules
module load <name>   Load a module
module purge         Unload all modules

For a complete guide, see Loading Software.

Submit and monitor

$ sbatch train_job.sh
Submitted batch job 12350

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12350       all train_jo  netid01  R       0:45      1 gpu15

$ tail -f train_12350.out
Job ID: 12350
Running on: gpu15.ethernet.tudhpc
GPUs: 0
Start time: Fri Mar 20 11:00:00 CET 2026
Using device: cuda
GPU: NVIDIA L40
Memory: 45.0 GB
Epoch 0, loss: 156.7823
Epoch 10, loss: 89.3421
...

The tail -f command shows output in real-time as your job runs.

Interactive jobs for testing

Before submitting a long batch job, test your code interactively:

Request an interactive session

$ salloc --account=<your-account> --partition=all --time=1:00:00 --cpus-per-task=4 --mem=8G --gres=gpu:1
salloc: Pending job allocation 12351
salloc: job 12351 queued and waiting for resources
salloc: job 12351 has been allocated resources
salloc: Granted job allocation 12351

You now have resources reserved. But you’re still on the login node - you need srun to actually use the compute node:

Run commands on the compute node

$ srun hostname
gpu15.ethernet.tudhpc

$ srun nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.9     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA L40          On   | 00000000:41:00.0 Off |                    0 |
| N/A   30C    P8    22W / 300W |      0MiB / 46068MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

$ srun python train.py
Using device: cuda
...

Start an interactive shell on the compute node

For more extended testing, start a shell on the compute node:

$ srun --pty bash
$ hostname
gpu15.ethernet.tudhpc
$ python train.py
...
$ exit

Don’t forget to release resources

When done testing, release your allocation:

$ exit
salloc: Relinquishing job allocation 12351

If you forget, you’ll hold resources for the full time you requested, even if you’re not using them. This isn’t fair to other users.

Job arrays: running many similar jobs

Often you need to run the same code with different parameters - different random seeds, different hyperparameters, or different data splits. Job arrays make this easy.

The problem

You want to run your experiment with seeds 1 through 10. You could submit 10 separate jobs:

$ sbatch --export=SEED=1 experiment.sh
$ sbatch --export=SEED=2 experiment.sh
$ sbatch --export=SEED=3 experiment.sh
... # tedious!

The solution: job arrays

Instead, use a single job array:

#!/bin/bash
#SBATCH --account=<your-account>
#SBATCH --partition=all
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --array=1-10
#SBATCH --output=experiment_%A_%a.out

# %A = array job ID, %a = array task ID
echo "Array job ID: $SLURM_ARRAY_JOB_ID"
echo "Array task ID: $SLURM_ARRAY_TASK_ID"

srun python experiment.py --seed $SLURM_ARRAY_TASK_ID

Submit once, get 10 jobs:

$ sbatch experiment_array.sh
Submitted batch job 12360

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
12360_1       all experime  netid01  R       0:30      1 gpu01
12360_2       all experime  netid01  R       0:30      1 gpu02
12360_3       all experime  netid01  R       0:30      1 gpu03
12360_4       all experime  netid01 PD       0:00      1 (Resources)
...

Array variations

#SBATCH --array=1-100        # Tasks 1 through 100
#SBATCH --array=0-9          # Tasks 0 through 9
#SBATCH --array=1,3,5,7      # Just these specific tasks
#SBATCH --array=1-100%10     # 1-100, but max 10 running at once

The %10 syntax limits concurrent tasks, useful if you don’t want to flood the queue.

Using array indices creatively

Your Python code can use $SLURM_ARRAY_TASK_ID for more than just seeds:

import os
import json

task_id = int(os.environ.get('SLURM_ARRAY_TASK_ID', 0))

# Load hyperparameter configurations
with open('configs.json') as f:
    configs = json.load(f)

# Submit with --array=0-3 so task IDs match the list indices
config = configs[task_id]
print(f"Running with config: {config}")

Where configs.json contains:

[
  {"lr": 0.001, "batch_size": 32},
  {"lr": 0.001, "batch_size": 64},
  {"lr": 0.01, "batch_size": 32},
  {"lr": 0.01, "batch_size": 64}
]
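You don't have to write configs.json by hand. A sketch that generates the same four configurations with itertools.product:

```python
import itertools
import json

# Cartesian product of the two hyperparameters shown above.
grid = [
    {"lr": lr, "batch_size": bs}
    for lr, bs in itertools.product([0.001, 0.01], [32, 64])
]

with open("configs.json", "w") as f:
    json.dump(grid, f, indent=2)

# 4 configs -> submit with --array=0-3 so the task ID indexes the list.
print(len(grid))  # 4
```

Adding a hyperparameter or a value later is then a one-line change, and the array range follows from len(grid).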

Job dependencies: workflows

Sometimes jobs must run in a specific order. Job dependencies let you express this.

Run after another job succeeds

$ sbatch preprocess.sh
Submitted batch job 12370

$ sbatch --dependency=afterok:12370 train.sh
Submitted batch job 12371

Job 12371 won’t start until job 12370 completes successfully. If 12370 fails, 12371 never runs.

Dependency types

Dependency         Meaning
afterok:jobid      Start after job succeeds
afternotok:jobid   Start after job fails
afterany:jobid     Start after job finishes (either way)
after:jobid        Start after job starts
singleton          Only one job with this name at a time

Complex workflows

Chain multiple dependencies:

$ sbatch download_data.sh
Submitted batch job 12380

$ sbatch --dependency=afterok:12380 preprocess.sh
Submitted batch job 12381

$ sbatch --dependency=afterok:12381 train.sh
Submitted batch job 12382

$ sbatch --dependency=afterok:12382 evaluate.sh
Submitted batch job 12383

Or depend on multiple jobs:

$ sbatch train_model_a.sh
Submitted batch job 12390

$ sbatch train_model_b.sh
Submitted batch job 12391

$ sbatch --dependency=afterok:12390:12391 ensemble.sh
Submitted batch job 12392

Job 12392 waits for both 12390 and 12391 to complete.
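For longer chains, capturing job IDs by hand gets error-prone. A sketch of doing it programmatically (the submit_job helper is illustrative, not a Slurm tool; it only relies on sbatch's standard "Submitted batch job N" output):

```python
import re
import subprocess

def parse_job_id(sbatch_output: str) -> int:
    """Extract the job ID from sbatch's 'Submitted batch job N' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))

def submit_job(script, depends_on=None):
    """Submit a script, optionally with an afterok dependency."""
    cmd = ["sbatch"]
    if depends_on is not None:
        cmd.append(f"--dependency=afterok:{depends_on}")
    cmd.append(script)
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_job_id(out.stdout)

# Usage (on a system with sbatch available):
# prep = submit_job("preprocess.sh")
# train = submit_job("train.sh", depends_on=prep)
```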

Checking job history and efficiency

View past jobs

$ sacct -u $USER --starttime=2026-03-01
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
12340          training        all  ewi-insy          8  COMPLETED      0:0
12341            failed        all  ewi-insy          4     FAILED      1:0
12342          training        all  ewi-insy          8    TIMEOUT      0:0

The ExitCode column has the form code:signal:

  • 0:0 = success
  • 1:0 = your code exited with an error
  • 0:9 = killed by signal 9 (often out of memory)

A TIMEOUT state means the job exceeded its time limit.
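The code:signal pair in the ExitCode column can be split mechanically. An illustrative sketch:

```python
def parse_exit_code(field: str) -> tuple:
    """Split sacct's ExitCode field into (exit_code, signal)."""
    code, signal = field.split(":")
    return int(code), int(signal)

code, signal = parse_exit_code("0:9")
if signal == 9:
    print("Killed by SIGKILL - often the out-of-memory killer")
```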

Check efficiency

The seff command shows how well you used the resources you requested:

$ seff 12340
Job ID: 12340
Cluster: daic
State: COMPLETED
Nodes: 1
Cores per node: 8
CPU Utilized: 06:30:15
CPU Efficiency: 81.3% of 08:00:00 core-walltime
Job Wall-clock time: 01:00:00
Memory Utilized: 24.5 GB
Memory Efficiency: 76.6% of 32.0 GB

This job used 81% of allocated CPU and 77% of allocated memory - reasonable efficiency. If you see numbers below 50%, you’re requesting more than you need.
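The percentages seff prints are simple ratios, which you can reproduce from the numbers above (a sketch; the figures come from the example seff output):

```python
def to_seconds(hms: str) -> int:
    """Convert HH:MM:SS to seconds."""
    h, m, s = (int(p) for p in hms.split(":"))
    return h * 3600 + m * 60 + s

# From the seff output above: 8 cores for 1 hour of wall-clock time.
cpu_used = to_seconds("06:30:15")            # CPU time actually consumed
core_walltime = 8 * to_seconds("01:00:00")   # cores * wall-clock time
efficiency = 100 * cpu_used / core_walltime
print(f"CPU efficiency: {efficiency:.1f}%")  # 81.3%
```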

Adjusting based on efficiency

If seff shows:

  • Low CPU efficiency: Reduce --cpus-per-task
  • Low memory efficiency: Reduce --mem
  • Very high efficiency (>95%): Consider requesting slightly more headroom

Troubleshooting

Job stuck in pending

Check why with squeue:

$ squeue -u $USER
  JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  12345       all training  netid01 PD       0:00      1 (Resources)

Common reasons:

  • Priority - Other jobs are ahead of you. Wait, or request fewer resources.
  • Resources - Not enough free nodes. Wait, or request fewer resources.
  • QOSMaxJobsPerUserLimit - You’ve hit your concurrent job limit. Wait for some to finish.
  • AssocMaxJobsLimit - Your account has hit its limit.

Job killed immediately

Check the output file for errors. Common issues:

Out of memory:

slurmstepd: error: Detected 1 oom-kill event(s) in step 12345.0

Solution: Increase --mem

Time limit:

slurmstepd: error: *** JOB 12345 ON gpu01 CANCELLED AT 2026-03-20T12:00:00 DUE TO TIME LIMIT ***

Solution: Increase --time or add checkpointing to your code

Module not found:

ModuleNotFoundError: No module named 'torch'

Solution: Add module load commands to your script

Can’t find GPUs

Your code can’t see GPUs even though you requested them:

torch.cuda.is_available()  # Returns False

Common causes:

  1. Forgot --gres=gpu:1 in your script
  2. Running on login node instead of through srun
  3. Missing module load cuda
  4. CUDA version mismatch
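The first two causes can be checked from inside the job itself. The helper below is a hypothetical diagnostic sketch (not a DAIC tool) that inspects the relevant environment variables, which Slurm sets for allocated jobs:

```python
import os

def gpu_allocation_hints(env=os.environ) -> list:
    """Return likely causes when CUDA reports no GPUs."""
    hints = []
    if "SLURM_JOB_ID" not in env:
        hints.append("Not inside a Slurm job - did you use sbatch or srun?")
    if not env.get("CUDA_VISIBLE_DEVICES"):
        hints.append("No GPU allocated - is --gres=gpu:1 in your script?")
    return hints

for hint in gpu_allocation_hints():
    print(hint)
```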

Best practices

1. Test before submitting long jobs

$ salloc --time=0:30:00 --gres=gpu:1 ...
$ srun python train.py --max-epochs 1  # Quick test
$ exit
$ sbatch full_training.sh  # Now submit the real job

2. Request only what you need

Larger requests wait longer in the queue. Start small and increase if needed.

3. Use meaningful job names

#SBATCH --job-name=bert-finetune-lr001

Makes squeue output much more readable.

4. Save checkpoints

For long jobs, save state periodically so you can resume if killed:

# Save checkpoint every epoch
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
}, f'checkpoint_epoch_{epoch}.pt')
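Saving is only half the story: a resubmitted job should find and load the newest checkpoint. A sketch of the resume side, matching the filename pattern above (the torch loading lines are shown as comments since they depend on your model):

```python
import glob
import re

def latest_checkpoint(pattern="checkpoint_epoch_*.pt"):
    """Return the checkpoint path with the highest epoch number, or None."""
    paths = glob.glob(pattern)
    if not paths:
        return None
    # Sort numerically on the epoch embedded in the filename.
    return max(paths, key=lambda p: int(re.search(r"(\d+)", p).group(1)))

ckpt = latest_checkpoint()
if ckpt is None:
    print("No checkpoint found - starting from epoch 0")
else:
    print(f"Resuming from {ckpt}")
    # state = torch.load(ckpt)
    # model.load_state_dict(state['model_state_dict'])
    # optimizer.load_state_dict(state['optimizer_state_dict'])
    # start_epoch = state['epoch'] + 1
```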

5. Use job arrays instead of many scripts

One job array is easier to manage than 100 separate submissions.

6. Check efficiency and tune

After your first successful run, check seff and adjust requests.

Quick reference

Submit and monitor

Command            Purpose
sbatch script.sh   Submit batch job
salloc ...         Request interactive session
srun command       Run command on allocated nodes
squeue -u $USER    View your jobs
scancel 12345      Cancel a job
scancel -u $USER   Cancel all your jobs

Information

Command                          Purpose
sinfo                            View partitions and nodes
scontrol show job 12345          Detailed job info
sacct -u $USER                   View job history
seff 12345                       Check job efficiency
sacctmgr show assoc user=$USER   View your accounts

Common sbatch options

Option            Example      Purpose
--account         ewi-insy     Billing account
--partition       all          Node group
--time            4:00:00      Time limit
--cpus-per-task   8            CPU cores
--mem             32G          Memory
--gres            gpu:1        GPUs
--output          log_%j.out   Output file
--array           1-10         Job array

Summary

You’ve learned:

Concept                       Key Commands
Submit a batch job            sbatch script.sh
Request interactive session   salloc --time=1:00:00 --gres=gpu:1 ...
Run on allocated nodes        srun python train.py
Check job status              squeue -u $USER
Cancel a job                  scancel <jobid>
View job history              sacct -u $USER
Check efficiency              seff <jobid>
Run parameter sweep           #SBATCH --array=1-10
Chain jobs                    --dependency=afterok:<jobid>

Exercises

Try these on your own to solidify your understanding:

Exercise 1: Basic job submission

Create and submit a job that prints your username, hostname, and current date. Check the output.

Exercise 2: GPU job

Modify the basic job to request a GPU. Add nvidia-smi to verify the GPU is available.

Exercise 3: Resource tuning

Submit a job, then use seff to check its efficiency. Was your resource request appropriate?

Exercise 4: Job array

Create a job array that runs 5 tasks. Each task should print its array task ID.

Exercise 5: Dependencies

Submit two jobs where the second depends on the first completing successfully.

Next steps