Policies & Usage Guidelines
User agreement
This user agreement establishes expectations between all users and administrators of the cluster with respect to fair-use and fair-share of cluster resources. By using the DAIC cluster you agree to these terms and conditions.
Definitions
- Cluster structure: DAIC is made up of shared resources contributed by different labs and groups. Pooling resources benefits everyone: it enables larger, parallelized computations and more efficient use with less idle time.
- Basic principles: Cluster use is based on fair-use and fair-share (through priority) of resources. All users are expected to ensure their cluster use does not hinder other users.
- Policies: Cluster policies are decided by the user board and enforced by the job scheduler (based on QoS limits) and administrators (for stability and performance).
Support
| Role | Responsibility |
|---|---|
| Cluster administrators | Ensure stability and performance, provide generic software, help with cluster-specific questions (during office hours) |
| Contact persons | Add and manage users at group level, communicate between groups and administrators |
| HPC Engineers | Maintain documentation, run training courses, collaborate on research projects |
Cluster workflow
- Test your code locally or on a login node
- Determine resources needed for your job
- Submit the job to the scheduler
- Monitor job progress
- Repeat until results are obtained
Testing with GPUs
For jobs that need more than 4 CPUs, more than 4 GB of memory, or any GPU, use an interactive session with salloc instead of running on the login node.
Access and accounts
DAIC is dedicated to TU Delft researchers (PhD students, postdocs, etc.) from participating departments.
Requesting access
Eligible candidates can request an account via the DAIC Request Access form.
Terms of service
Resource limits: Use cluster resources within the QoS restrictions of your account. Depending on your group, you may have access to specific partitions with higher priorities.
Reservations: Your group may be eligible for limited-time node reservations (e.g., before conference deadlines). Check with your lab.
Communications: Official DAIC emails are sent to your TU Delft mailbox:
- Scheduled maintenance notifications
- User board meeting announcements
- Automated job efficiency warnings
- Job cancellation or ban notifications
Self-service: You are responsible for debugging your own code. Administrators may offer advice with cancellation notices, but personalized code debugging is not provided.
User board: You may join quarterly user board meetings for updates and to suggest improvements. Announcements are sent by email and posted on Mattermost.
Expectations from users
Responsibility: Your jobs must not interfere with other users’ cluster usage. Resources are limited and shared.
Research only: The cluster may only be used for studies and research.
Responsiveness: Respond to administrator emails requesting information or action regarding your cluster use.
Acknowledgment: Cite and acknowledge DAIC in your publications using the format in How to Cite.
Responsible usage
You are responsible for running jobs efficiently:
Monitor your jobs: Watch for unexpected behavior and respond to automated efficiency emails.
Short jobs: If running many short jobs (minutes each), consider grouping them to reduce overhead from module loading and job startup.
GPU efficiency: For multi-GPU jobs, communication overhead between GPUs and CPUs (e.g., data loaders) can reduce efficiency. Consider using fewer GPUs with more memory each, or specialized multi-GPU libraries.
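The short-jobs advice above can be illustrated with a minimal sketch that groups many small tasks into a few larger batches, so that each submitted job pays the startup and module-loading cost once. The function name `batched`, the input file names, and the batch size are hypothetical choices for illustration, not part of any DAIC tooling.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batched(tasks: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size tasks."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(tasks)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Hypothetical workload: 10 short tasks, grouped so each job runs up to 4.
inputs = [f"sample_{i:02d}.txt" for i in range(10)]
jobs = list(batched(inputs, batch_size=4))

for job_index, batch in enumerate(jobs):
    # In practice each batch would become one submitted job that loops
    # over its tasks; here we only print the grouping.
    print(f"job {job_index}: {len(batch)} tasks -> {batch}")
```

A reasonable batch size depends on how long each task runs; the aim is for the work in each job to clearly outweigh its startup time.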
Consequences of irresponsible usage
Jobs may be canceled if:
- The node becomes unresponsive and must be restarted
- The job overloads the node (e.g., network saturation)
- The job adversely affects other users’ jobs
- The job ignores administrator directions
- The job shows clear problems (hanging, idle, not using requested resources)
You may receive a ban for:
- Disallowed use of the cluster or computing time
- Attempting unauthorized access or causing disruptions
- Unresponsiveness to administrator emails
- Repeated unresolved problems
Your access will be restored when all parties are confident the problem is understood and won’t reoccur.
Jobs won’t be canceled for:
- Scheduled maintenance (jobs are held, not killed)
What to do in case of problems
Follow these steps in order:
- Ask colleagues: Contact fellow cluster users in your lab who may have solutions.
- Ask on Mattermost: Post questions on the DAIC Mattermost channel.
- Contact your supervisor: For prolonged problems, escalate to your PI.
- Contact administrators: For technical or persistent problems, submit a request through the Self Service Portal referencing “DAIC cluster”.
- User board: For recurring problems, complaints, or policy suggestions, contact the advisory board to add it to the next meeting agenda.
Usage guidelines
DAIC has substantial but limited resources. Use them efficiently and fairly.
One rule: Respect your fellow users.
We reserve the right to terminate any job that interferes with others’ ability to complete work.
Login node usage
Login nodes are for:
- Compiling software
- Preparing and submitting batch scripts
- Monitoring jobs
- Analyzing results
- Managing files
Do not run production computations on login nodes. Request an interactive session for testing that requires significant resources.
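As a rough illustration of when to leave the login node, the sketch below encodes the testing thresholds given earlier on this page (more than 4 CPUs, more than 4 GB of memory, or any GPU) and assembles a matching salloc command line. The helper names and the one-hour default time limit are assumptions made for this example. The options used (`--cpus-per-task`, `--mem`, `--time`, `--gres`) are standard Slurm options, but DAIC may require additional ones, such as a partition, account, or QoS, that are not covered here.

```python
import shlex
from typing import List

# Thresholds taken from the testing guidance on this page.
MAX_LOGIN_CPUS = 4
MAX_LOGIN_MEM_GB = 4


def needs_interactive_session(cpus: int, mem_gb: float, gpus: int = 0) -> bool:
    """Return True if a test exceeds what should run on a login node."""
    return cpus > MAX_LOGIN_CPUS or mem_gb > MAX_LOGIN_MEM_GB or gpus > 0


def salloc_command(cpus: int, mem_gb: int, gpus: int = 0,
                   time_limit: str = "01:00:00") -> List[str]:
    """Build a salloc argument list (hypothetical helper, not a DAIC tool)."""
    command = [
        "salloc",
        f"--cpus-per-task={cpus}",
        f"--mem={mem_gb}G",
        f"--time={time_limit}",
    ]
    if gpus > 0:
        command.append(f"--gres=gpu:{gpus}")
    return command


# Example request: 8 CPUs, 16 GB of memory, 1 GPU.
if needs_interactive_session(cpus=8, mem_gb=16, gpus=1):
    print(shlex.join(salloc_command(cpus=8, mem_gb=16, gpus=1)))
```

The script only prints the command; run the printed line yourself on a login node after adjusting it to your group's partitions and limits.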
Multi-threaded applications
Applications such as MATLAB and Java programs can use all available CPU cores by default. If you must run on a login node, limit threads to at most 25% of available cores (e.g., 4 threads on a 16-core node).
Recommendations
- Save results frequently: jobs can crash and servers can become overloaded
- Write modular code so you can continue from the last checkpoint
- Monitor your jobs at least twice daily
- If a job isn’t working correctly, terminate it and fix the problem before resubmitting
- Watch server load and consider moving jobs if resources are near limits (>90% usage)
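The recommendations above on saving results and continuing from the last checkpoint can be sketched as follows. This is a minimal example under assumed conditions: the state is a small JSON-serializable dictionary, the file name `checkpoint.json` and the helper names are hypothetical, and the "work" is a placeholder sum. Real jobs would usually save through the checkpoint facilities of their own framework.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temporary file, then rename, so a crash mid-write
    leaves the previous checkpoint intact."""
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(state))
    os.replace(tmp_path, path)


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "total": 0}


def run(path: Path, n_steps: int, stop_at: Optional[int] = None) -> dict:
    """Run steps from the last checkpoint; stop_at simulates an interruption."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        if stop_at is not None and step == stop_at:
            break  # simulated crash or cancellation
        state["total"] += step  # placeholder for the real computation
        state["next_step"] = step + 1
        save_checkpoint(path, state)
    return state


checkpoint = Path(tempfile.mkdtemp()) / "checkpoint.json"
partial = run(checkpoint, n_steps=10, stop_at=6)  # interrupted before step 6
resumed = run(checkpoint, n_steps=10)             # continues from step 6
print(partial["next_step"], resumed["next_step"], resumed["total"])
```

The sketch saves after every step for clarity; for expensive state it is usually better to save every N steps or every few minutes.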