Support
- 1: Contact - Mattermost, Service Desk, request forms
- 2: Troubleshooting - Common issues and solutions
1 - Contact
Community support
Join the DAIC Mattermost channel for questions, discussions, and announcements:
Join DAIC Mattermost

This is the best place to get help from fellow users and the DAIC team.
Service Desk
For technical issues, account problems, or storage issues, submit a ticket through the TU Delft Self-Service Portal:
TU Delft Self-Service Portal

Request forms
| Request | Link |
|---|---|
| DAIC account access | Request Access |
| Project storage | Request Storage |
| General inquiry | Contact Form |
Scientific output
Share your DAIC-based publications in the ScientificOutput channel:
ScientificOutput Channel

2 - Troubleshooting
Storage access errors
“Permission denied” or “Stale file handle” when accessing linuxhome
Cause: Missing Kerberos ticket. This happens when you log in with SSH keys instead of a password.
Solution: Run kinit and enter your NetID password:
kinit
Verify your ticket with klist:
klist
You should see output like:
Default principal: <YourNetID>@TUDELFT.NET
Valid starting Expires Service principal
03/23/26 11:05:12 03/23/26 21:05:12 krbtgt/TUDELFT.NET@TUDELFT.NET
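Because SSH-key logins skip password authentication, no ticket is created automatically at login. One way to avoid typing kinit by hand is a small sketch for your shell startup file (e.g. ~/.bash_profile, an assumption about your setup) that requests a ticket only when none is valid:

```shell
# Run kinit only when there is no active Kerberos ticket.
# `klist -s` is silent and exits non-zero if the ticket cache
# is empty or the ticket has expired.
if ! klist -s 2>/dev/null; then
    kinit
fi
```

This keeps interactive logins fast when a valid ticket already exists.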
Storage takes long to access on first use
Cause: Network storage mounts on-demand and may take up to 30 seconds on first access.
Solution: Wait and retry. Subsequent accesses will be fast.
Job submission errors
“Disk quota exceeded” in home directory
Cause: Cluster home (/trinity/home) has a 5 MB quota for config files only.
Solution: Store code and data in ~/linuxhome or project storage. Check quota with:
quota -s
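When the quota is exceeded, it helps to see what is actually taking the space. A quick sketch using standard GNU tools (run it in the over-quota home directory):

```shell
# List the ten largest files and directories under the home
# directory, largest first, with human-readable sizes.
du -ah "$HOME" 2>/dev/null | sort -rh | head -n 10
```

Anything that is not a small config file can then be moved to ~/linuxhome or project storage.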
Job fails immediately with no output
Cause: Often a missing module or incorrect path.
Solution:
- Check error file:
- Check the error file:
cat <jobname>_<jobid>.err
- Verify that modules load correctly:
module load 2025/gpu cuda/12.9
- Check that the working directory in the job script uses $SLURM_SUBMIT_DIR
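These checks can be tied together in a minimal job script sketch. The module line matches the example above; the job name, GPU request, time limit, and training command are placeholders for your own:

```shell
#!/bin/bash
#SBATCH --job-name=myjob
#SBATCH --output=%x_%j.out    # expands to <jobname>_<jobid>.out
#SBATCH --error=%x_%j.err     # expands to <jobname>_<jobid>.err
#SBATCH --time=01:00:00       # placeholder time limit
#SBATCH --gres=gpu:1          # placeholder GPU request

# Fail fast so a broken module load shows up in the .err file
# instead of the job silently continuing.
set -e

module load 2025/gpu cuda/12.9

# Run from the directory the job was submitted from.
cd "$SLURM_SUBMIT_DIR"

python train.py               # placeholder for your command
```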
Multi-GPU training issues
Training hangs with multiple GPUs
Symptoms: Training hangs after “Initializing distributed” or “All distributed processes registered”. NCCL all_reduce operations never complete.
Cause: DAIC GPU nodes have GPUs on different NUMA nodes (CPU sockets). NCCL P2P (peer-to-peer) communication fails between GPUs that aren’t directly connected.
Solution: Add this to your job script:
export NCCL_P2P_DISABLE=1
This forces NCCL to use shared memory instead of P2P, which works across NUMA boundaries.
Verify GPU topology
Check how GPUs are connected:
nvidia-smi topo -m
If you see SYS between GPUs (not NV# for NVLink), you need NCCL_P2P_DISABLE=1.
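The check and the fix can be combined in the job script so the variable is only set when needed. A sketch, assuming a single-node job; note that the grep is restricted to the matrix rows (which start with GPU) because the legend printed below the matrix always mentions SYS:

```shell
# Disable NCCL peer-to-peer only when the GPU topology matrix
# shows SYS links (GPUs connected across the NUMA boundary).
if nvidia-smi topo -m | grep "^GPU" | grep -q "SYS"; then
    export NCCL_P2P_DISABLE=1
fi
```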
See Multi-GPU Training for details.