Troubleshooting

Common issues and troubleshooting steps for DAIC.

Storage access errors

“Permission denied” or “Stale file handle” when accessing linuxhome

Cause: Missing Kerberos ticket. Logging in with SSH keys bypasses password authentication, so no Kerberos ticket is issued, and Kerberos-protected network shares such as linuxhome refuse access.

Solution: Run kinit and enter your NetID password:

kinit

Verify your ticket with klist:

klist

You should see output like:

Default principal: <YourNetID>@TUDELFT.NET
Valid starting     Expires            Service principal
03/23/26 11:05:12  03/23/26 21:05:12  krbtgt/TUDELFT.NET@TUDELFT.NET
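Since tickets expire (in the output above, 10 hours after issue), it can help to check for a valid ticket before starting a long session. A minimal sketch; klist and kinit behave this way on standard MIT Kerberos installations:

```shell
# klist -s is silent and exits non-zero when no valid ticket exists,
# so kinit (which prompts for your NetID password) runs only when needed.
klist -s || kinit
```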

Storage takes long to access on first use

Cause: Network storage mounts on-demand and may take up to 30 seconds on first access.

Solution: Wait and retry. Subsequent accesses will be fast.
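If a script needs the share right away, it can poll until the mount responds instead of failing on the first access. A sketch; the retry count and delay are illustrative, not DAIC-specific values:

```shell
# Poll a path that mounts on demand before using it.
wait_for_mount() {
  dir=$1
  for _ in 1 2 3 4 5 6; do
    ls "$dir" >/dev/null 2>&1 && return 0   # mounted and readable
    sleep 5                                 # wait while the share mounts
  done
  return 1                                  # still unreachable after ~30 s
}
```

Call it as `wait_for_mount ~/linuxhome` at the top of any script that depends on the share.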

Job submission errors

“Disk quota exceeded” in home directory

Cause: The cluster home directory (/trinity/home) has a 5 MB quota and is intended for configuration files only.

Solution: Store code and data in ~/linuxhome or project storage. Check quota with:

quota -s
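One common fix is to move a large directory out of cluster home and leave a symlink behind so existing paths keep working. A sketch; the demo below runs in a temporary directory standing in for your real home, and "myproject" is a hypothetical directory name (on DAIC you would operate on ~ and ~/linuxhome directly):

```shell
# Temporary stand-in for the real home and linuxhome directories.
home=$(mktemp -d)
mkdir -p "$home/linuxhome" "$home/myproject"
echo data > "$home/myproject/notes.txt"

# Move the directory to the large storage, then symlink it back so
# existing paths keep working without consuming home quota.
mv "$home/myproject" "$home/linuxhome/myproject"
ln -s "$home/linuxhome/myproject" "$home/myproject"
```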

Job fails immediately with no output

Cause: Usually a module that fails to load or an incorrect path in the job script.

Solution:

  1. Check the error file: cat <jobname>_<jobid>.err
  2. Verify that the modules load correctly: module load 2025/gpu cuda/12.9
  3. Make sure the working directory in the job script uses $SLURM_SUBMIT_DIR
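A minimal job script incorporating these checks might look like the sketch below; train.py is a hypothetical entry point, and %x/%j are Slurm's placeholders for job name and job ID:

```shell
#!/bin/sh
#SBATCH --job-name=train
#SBATCH --output=%x_%j.out      # expands to the <jobname>_<jobid> pattern above
#SBATCH --error=%x_%j.err

module load 2025/gpu cuda/12.9  # fails loudly here if a module is missing
cd "$SLURM_SUBMIT_DIR"          # run from the directory you submitted from

srun python train.py            # hypothetical entry point
```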

Multi-GPU training issues

Training hangs with multiple GPUs

Symptoms: Training hangs after “Initializing distributed” or “All distributed processes registered”. NCCL all_reduce operations never complete.

Cause: DAIC GPU nodes have GPUs on different NUMA nodes (CPU sockets). NCCL P2P (peer-to-peer) communication fails between GPUs that aren’t directly connected.

Solution: Add this to your job script:

export NCCL_P2P_DISABLE=1

This forces NCCL to use shared memory instead of P2P, which works across NUMA boundaries.
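In context, the relevant part of a job script might look like the sketch below; NCCL_DEBUG is optional and simply makes the selected transport visible in the job logs, and train.py is a hypothetical training script:

```shell
export NCCL_P2P_DISABLE=1   # force shared-memory transport across NUMA nodes
export NCCL_DEBUG=INFO      # optional: log which transport NCCL selects

srun python train.py        # hypothetical multi-GPU training script
```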

Verify GPU topology

Check how GPUs are connected:

nvidia-smi topo -m

If the matrix shows SYS between a pair of GPUs (rather than NV# for NVLink), traffic between them must cross the inter-socket (NUMA) link, and you need NCCL_P2P_DISABLE=1.

See Multi-GPU Training for details.