Policies & Usage Guidelines

User agreement, access requirements, and guidelines for responsible cluster usage.

User agreement

This user agreement establishes expectations between all users and administrators of the cluster with respect to fair-use and fair-share of cluster resources. By using the DAIC cluster, you agree to these terms and conditions.

Definitions

  • Cluster structure: DAIC is made up of shared resources contributed by different labs and groups. Pooling resources benefits everyone: it enables larger, parallelized computations and more efficient use with less idle time.
  • Basic principles: Cluster use is based on fair-use and fair-share (through priority) of resources. All users are expected to ensure their cluster use does not hinder other users.
  • Policies: Cluster policies are decided by the user board and enforced by the job scheduler (based on QoS limits) and administrators (for stability and performance).

Support

Role | Responsibility
Cluster administrators | Ensure stability and performance, provide generic software, help with cluster-specific questions (during office hours)
Contact persons | Add and manage users at group level, communicate between groups and administrators
HPC Engineers | Maintain documentation, run training courses, collaborate on research projects

Cluster workflow

  1. Test your code locally or on a login node
  2. Determine resources needed for your job
  3. Submit the job to the scheduler
  4. Monitor job progress
  5. Repeat until results are obtained
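As an illustration of steps 2 and 3, the sketch below renders a batch script from a set of resource requests. It assumes the scheduler is Slurm (the source only mentions a job scheduler with QoS limits and partitions); the `JobRequest` class, the `render_batch_script` function, and all resource values are hypothetical placeholders, not DAIC defaults.

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Resources for one job; every default here is an illustrative placeholder."""
    name: str
    command: str
    time_limit: str = "00:30:00"  # wall-clock limit, hh:mm:ss
    cpus: int = 2
    memory: str = "4G"


def render_batch_script(job: JobRequest) -> str:
    """Return the text of a Slurm-style batch script for `job` (assumes Slurm)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --time={job.time_limit}",
        f"#SBATCH --cpus-per-task={job.cpus}",
        f"#SBATCH --mem={job.memory}",
        "",
        job.command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "python train.py" is a stand-in for your own program.
    print(render_batch_script(JobRequest(name="demo", command="python train.py")))
```

If the cluster does run Slurm, the rendered text could be saved to a file and submitted with `sbatch`, and the job followed with `squeue`. Requesting only what the job needs keeps it within QoS limits and leaves resources free for other users.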

Access and accounts

DAIC is dedicated to TU Delft researchers (PhD students, postdocs, etc.) from participating departments.

Terms of service

  1. Resource limits: Use cluster resources within the QoS restrictions of your account. Depending on your group, you may have access to specific partitions with higher priorities.

  2. Reservations: Your group may be eligible for limited-time node reservations (e.g., before conference deadlines). Check with your lab.

  3. Communications: Official DAIC emails are sent to your TU Delft mailbox:

    • Scheduled maintenance notifications
    • User board meeting announcements
    • Automated job efficiency warnings
    • Job cancellation or ban notifications

  4. Self-service: You are responsible for debugging your own code. Administrators may offer advice with cancellation notices, but personalized code debugging is not provided.

  5. User board: You may join quarterly user board meetings for updates and to suggest improvements. Announcements are sent by email and posted on Mattermost.

Expectations from users

  1. Responsibility: Your jobs must not interfere with other users’ cluster usage. Resources are limited and shared.

  2. Research only: The cluster may only be used for studies and research.

  3. Responsiveness: Respond to administrator emails requesting information or action regarding your cluster use.

  4. Acknowledgment: Cite and acknowledge DAIC in your publications using the format in How to Cite.

Responsible usage

You are responsible for running jobs efficiently:

  1. Monitor your jobs: Watch for unexpected behavior and respond to automated efficiency emails.

  2. Short jobs: If running many short jobs (minutes each), consider grouping them to reduce overhead from module loading and job startup.

  3. GPU efficiency: For multi-GPU jobs, communication overhead between GPUs and CPUs (e.g., data loaders) can reduce efficiency. Consider using fewer GPUs with more memory each, or specialized multi-GPU libraries.
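The advice on short jobs can be sketched as follows: instead of submitting one job per task, pack several tasks into each job so that startup and module-loading costs are paid once per group. The function name `group_tasks`, the task commands, and the group size are hypothetical; a suitable group size depends on how long each task runs.

```python
from typing import Iterator, List, Sequence


def group_tasks(tasks: Sequence[str], per_job: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `per_job` tasks."""
    if per_job < 1:
        raise ValueError("per_job must be at least 1")
    for start in range(0, len(tasks), per_job):
        yield list(tasks[start:start + per_job])


if __name__ == "__main__":
    # 10 hypothetical short tasks, packed 4 per job: 3 jobs instead of 10.
    tasks = [f"python simulate.py --seed {i}" for i in range(10)]
    for index, group in enumerate(group_tasks(tasks, per_job=4)):
        print(f"job {index}: {len(group)} tasks")
```

Each group would then run as a single job that executes its tasks one after another.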

Consequences of irresponsible usage

Jobs may be canceled if:

  • The node becomes unresponsive and must be restarted
  • The job overloads the node (e.g., network saturation)
  • The job adversely affects other users’ jobs
  • The job ignores administrator directions
  • The job shows clear problems (hanging, idle, not using requested resources)
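One way to make "not using requested resources" concrete is CPU efficiency: the CPU time a job actually consumed divided by the CPU time it reserved. The sketch below is an illustration only; the function name is hypothetical, and the source does not state which metric or threshold the automated warnings use.

```python
def cpu_efficiency(cpu_seconds_used: float, elapsed_seconds: float, cpus_allocated: int) -> float:
    """Fraction of reserved CPU time that was actually used (0.0 to about 1.0)."""
    if elapsed_seconds <= 0 or cpus_allocated < 1:
        raise ValueError("elapsed_seconds must be positive and cpus_allocated at least 1")
    return cpu_seconds_used / (elapsed_seconds * cpus_allocated)


if __name__ == "__main__":
    # A job that held 4 CPUs for one hour but kept only one of them busy.
    efficiency = cpu_efficiency(cpu_seconds_used=3600, elapsed_seconds=3600, cpus_allocated=4)
    print(f"CPU efficiency: {efficiency:.0%}")
```

A low value usually means the job requested more CPUs than its code can use, which is a cue to reduce the request before resubmitting.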

You may receive a ban for:

  • Disallowed use of the cluster or computing time
  • Attempting unauthorized access or causing disruptions
  • Unresponsiveness to administrator emails
  • Repeated unresolved problems

Your access will be restored when all parties are confident the problem is understood and will not recur.

Jobs won’t be canceled for:

  • Scheduled maintenance (jobs are held, not killed)

What to do in case of problems

Follow these steps in order:

  1. Ask colleagues: Contact fellow cluster users in your lab who may have solutions.
  2. Ask on Mattermost: Post questions on the DAIC Mattermost channel.
  3. Contact your supervisor: For prolonged problems, escalate to your PI.
  4. Contact administrators: For technical or persistent problems, submit a request through the Self Service Portal referencing “DAIC cluster”.
  5. User board: For recurring problems, complaints, or policy suggestions, contact the advisory board to add it to the next meeting agenda.

Usage guidelines

DAIC has substantial but limited resources. Use them efficiently and fairly.

Login node usage

Login nodes are for:

  • Compiling software
  • Preparing and submitting batch scripts
  • Monitoring jobs
  • Analyzing results
  • Managing files

Do not run production computations on login nodes. Request an interactive session for testing that requires significant resources.

Recommendations

  • Save results frequently: jobs can crash and servers can become overloaded
  • Write modular code so you can continue from the last checkpoint
  • Monitor your jobs at least twice daily
  • If a job isn’t working correctly, terminate it and fix the problem before resubmitting
  • Watch server load and consider moving jobs if resources are near limits (>90% usage)
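The first two recommendations can be combined in a simple checkpoint-and-resume loop. This is a minimal sketch, assuming the job's state fits in a small JSON file; the function names, the file name, and the stand-in workload are all hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "total": 0}


def save_checkpoint(path: Path, state: dict) -> None:
    """Write to a temporary file first, then rename it over the old checkpoint."""
    tmp = path.parent / (path.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)  # a crash mid-write leaves the previous checkpoint intact


def run(path: Path, n_steps: int, save_every: int = 10) -> dict:
    """Run up to `n_steps`, resuming from the checkpoint at `path` if present."""
    state = load_checkpoint(path)
    for step in range(state["next_step"], n_steps):
        state["total"] += step  # stand-in for the real computation
        state["next_step"] = step + 1
        if state["next_step"] % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        checkpoint = Path(folder) / "checkpoint.json"
        run(checkpoint, n_steps=12)           # first run stops early
        final = run(checkpoint, n_steps=25)   # second run resumes at step 12
        print(final["next_step"], final["total"])
```

Because the second call picks up where the first stopped, a job that crashes or is terminated loses at most the work done since the last save.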