Support

Help yourself, or get help from the community.

1 - Contact

Ways to contact the DAIC support team.

Community support

Join the DAIC Mattermost channel for questions, discussions, and announcements:

Join DAIC Mattermost

This is the best place to get help from fellow users and the DAIC team.

Service Desk

For technical issues, account problems, or storage issues, submit a ticket through the TU Delft Self-Service Portal:

TU Delft Self-Service Portal

Request forms

Request                Link
DAIC account access    Request Access
Project storage        Request Storage
General inquiry        Contact Form

Scientific output

Share your DAIC-based publications in the ScientificOutput channel:

ScientificOutput Channel

2 - Troubleshooting

Common issues and troubleshooting steps for DAIC.

Storage access errors

“Permission denied” or “Stale file handle” when accessing linuxhome

Cause: Missing Kerberos ticket. Password logins obtain a ticket automatically, but logins with SSH keys do not, so the network storage cannot authenticate you.

Solution: Run kinit and enter your NetID password:

kinit

Verify your ticket with klist:

klist

You should see output like:

Default principal: <YourNetID>@TUDELFT.NET
Valid starting     Expires            Service principal
03/23/26 11:05:12  03/23/26 21:05:12  krbtgt/TUDELFT.NET@TUDELFT.NET
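
If you usually log in with SSH keys, a small guard in your shell startup file can acquire a ticket automatically. This is a convenience sketch, not official DAIC configuration; `klist -s` is the standard silent check that exits non-zero when no valid ticket exists:

```shell
# Sketch for ~/.bashrc: run kinit only from an interactive login
# that has no valid Kerberos ticket yet.
if [ -t 0 ] && command -v klist >/dev/null && ! klist -s 2>/dev/null; then
    kinit        # prompts once for your NetID password
fi
```

The `[ -t 0 ]` test keeps the prompt out of non-interactive sessions such as `scp` or batch jobs.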

Storage takes long to access on first use

Cause: Network storage is mounted on demand; the first access may take up to 30 seconds.

Solution: Wait and retry. Subsequent accesses will be fast.
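
In scripts, the wait-and-retry can be automated. A sketch (not DAIC-specific tooling) that polls a directory until the automounter has it ready, instead of failing on the first slow access:

```shell
# Poll a directory until it becomes accessible, or give up after
# a number of tries. Returns 0 on success, 1 on timeout.
wait_for_mount() {
    dir=$1
    tries=${2:-10}                     # ~30 s total with a 3 s pause
    while [ "$tries" -gt 0 ]; do
        ls "$dir" >/dev/null 2>&1 && return 0
        tries=$((tries - 1))
        sleep 3
    done
    return 1
}
```

For example, `wait_for_mount ~/linuxhome` at the top of a job script blocks until the storage responds.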

Job submission errors

“Disk quota exceeded” in home directory

Cause: The cluster home directory (/trinity/home) has a 5 MB quota and is intended for configuration files only.

Solution: Store code and data in ~/linuxhome or project storage. Check quota with:

quota -s
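
To find what is actually consuming the quota, you can list the largest entries in your cluster home (standard GNU du/sort flags; nothing DAIC-specific assumed):

```shell
# List top-level entries in the home directory, biggest first,
# to spot what is filling the 5 MB quota.
du -ah --max-depth=1 "$HOME" 2>/dev/null | sort -rh | head -n 10
```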

Job fails immediately with no output

Cause: Often a missing module or an incorrect path in the job script.

Solution:

  1. Check error file: cat <jobname>_<jobid>.err
  2. Verify modules load correctly: module load 2025/gpu cuda/12.9
  3. Check working directory in job script uses $SLURM_SUBMIT_DIR
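
The checks above can be combined in a minimal job script sketch. The resource directives and the training command are illustrative placeholders; the module names are the ones from step 2:

```shell
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=%x_%j.out        # %x = job name, %j = job id
#SBATCH --error=%x_%j.err
#SBATCH --time=01:00:00           # placeholder resource requests
#SBATCH --gres=gpu:1

set -e                            # stop at the first error so it lands in the .err file

module load 2025/gpu cuda/12.9    # step 2: verify the modules load

cd "$SLURM_SUBMIT_DIR"            # step 3: run from the submission directory

srun python train.py              # placeholder for your actual command
```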

Multi-GPU training issues

Training hangs with multiple GPUs

Symptoms: Training hangs after “Initializing distributed” or “All distributed processes registered”. NCCL all_reduce operations never complete.

Cause: DAIC GPU nodes have GPUs on different NUMA nodes (CPU sockets). NCCL P2P (peer-to-peer) communication fails between GPUs that aren’t directly connected.

Solution: Add this to your job script:

export NCCL_P2P_DISABLE=1

This forces NCCL to use shared memory instead of P2P, which works across NUMA boundaries.

Verify GPU topology

Check how GPUs are connected:

nvidia-smi topo -m

If you see SYS between GPUs (not NV# for NVLink), you need NCCL_P2P_DISABLE=1.
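
That check can be scripted so the workaround is applied only when the topology requires it. A sketch: the first grep keeps only the matrix rows (which start with GPU), so the SYS entry in the legend below the matrix does not cause a false positive:

```shell
# Disable NCCL P2P only when a GPU pair is connected via SYS,
# i.e. only through the CPU interconnect across NUMA nodes.
if nvidia-smi topo -m 2>/dev/null | grep '^GPU' | grep -q 'SYS'; then
    export NCCL_P2P_DISABLE=1
fi
```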

See Multi-GPU Training for details.