Policies

What is required to access DAIC, usage policies, and how to acknowledge DAIC in your publication.

User agreement

This user agreement is intended to establish the expectations between all users and administrators of the cluster with respect to fair-use and fair-share of cluster resources. By using the DAIC cluster you agree to these terms and conditions.

General information about the DAIC cluster

  • Cluster structure: The DAIC cluster is made up of shared resources contributed by different labs and groups. The pooling of resources from different groups is beneficial for everyone: it enables larger, parallelized computations and more efficient use of resources with less idle time.
  • Basic principles: Regardless of the specific details, cluster use is always based on basic principles of fair-use and fair-share (through priority) of resources, and all users are expected to take care at all times that their cluster use is not hindering other users.
  • Policies: Cluster policies are decided by the user board and enforced by various automated and non-automated actions, for example by the job scheduler based on QoS limits and the administrators for ensuring the stability and performance of the cluster.
  • Support:
    • Cluster administrators offer, during office hours, different levels of support, which include (in order of priority): ensuring the stability and performance of the cluster, providing generic software, helping with cluster-specific questions and problems, and providing information (via e-mails and during the board meeting) about cluster updates.
    • Contact persons from participating groups add and manage users at the level of their respective groups, communicate needs and updates between their groups and system administrators, and may help with cluster-specific questions and problems.
    • HPC Engineers, in CS@Delft, provide support to (CS) students, researchers and staff members to efficiently use DAIC resources. This includes: maintaining updated documentation resources, running onboarding and advanced training courses on cluster usage, organizing workshops to assess compute needs, plan infrastructure upgrades, and may collaborate with researchers on individual projects as fits.
  • More information: Please see the General cluster usage and What to do in case of problems sections on where to find more information about cluster use.
  • Cluster workflow:
    • The typical steps for running a job on the cluster are: Test → Determine resources → Submit → Monitor job → Repeat until results are obtained. See Quickstart
    • You can use the logins nodes for testing your code, determining the required resources and submitting jobs (see Computing on login nodes).
    • For testing jobs which require larger resources (more than 4 CPUs and/or more than 4 GB of memory and/or one or more powerful GPUs), start an interactive job (see Interactive jobs).
    • For determining resources of larger jobs, you can submit a single (short) test job (see Submitting jobs)
  • QoS:
    • A Quality of Service (QoS) is a set of limits that controls what resources a job can use and determines the priority level of a job. DAIC adopts multiple QoSs to optimize the throughput of job scheduling and to reduce the waiting times in the cluster (see Quality of Service).
    • The DAIC QoS limits are set by the DAIC user board, and the scheduler strictly enforces these limits. Thus, no user can use more resources than the amount that was set by the user board.
    • Any (perceived) imbalance in the use of resources by a certain QoS or user should not be hold against a user or the scheduler, but should be discussed in the user board.

General cluster usage

  1. You may use cluster resources for your research within the QoS restrictions of your domain user and user group. Depending on your user group, you might be eligible to use specific partitions, giving higher priorities on certain nodes. See Priority tiers, and please check this with your lab.
  2. Depending on your user group, you might be eligible to get priorities on certain nodes. For example, you might have access to a specialized partition or limited-time node reservation for your group or department (for example before a conference deadline). Please check this with your lab and try to use these in your *.sbatch file, your jobs should then start faster! See Resources Reservations for more information.
  3. In general, you will be informed about standard administrative actions on the cluster. All official DAIC cluster e-mails are sent to your official TU Delft mailbox, so it is advised to check it regularly.
    1. You will receive e-mails about downtimes relating to scheduled maintenance.
    2. You, or your supervisor, will receive e-mails about scheduled cluster user board meetings where any updates and changes to the cluster structure, software, or hardware will be announced. Please check with your lab or feel free to join the cluster board meetings if you want to be up-to-date about any changes.
    3. You will receive automated e-mails regarding the efficiency of your jobs. The cluster monitors the use of resources of all jobs. When certain specific inefficiencies are detected for a significant number of jobs in the same day, an automated efficiency mail is sent to inform you about these problems with your resource use, to help you optimize your jobs. These mails will not lead to automatic cancellations or bans. To avoid spamming, limited inefficient use will not trigger a mail.
    4. You will receive an e-mail when your jobs are canceled or you receive a cluster ban (see the Expectations from cluster users and Regulations sections). You will be informed about why your jobs were canceled or why you were banned from the cluster (often before the bans take place). If the problem is still not clear to you from the e-mails you already received, please follow the steps detailed in the What to do in case of problems section.
    5. You are not entitled to receive personalized help on how to debug your code via e-mail. It is your responsibility to solve technical problems stemming from your code. Please first consult with your lab for a solution to a technical problem (see What to do in case of problems). However, admins might offer help, advice and solutions along with information regarding a job cancellation or ban. Please listen to such advice, it might help you solve your problem and improve fair use of the cluster.
  4. You may join cluster user board meetings. In the meetings you will be informed of any new developments, hardware and software updates and can suggest changes and improvements. These meetings take place roughly every 3 months and will be announced by e-mail and on the MatterMost channel.

Expectations from cluster users

  1. You are responsible for your jobs not interfering with other users’ cluster usage. Please try to always keep in mind that cluster resources are limited and shared between all users, and that fair use benefits everyone.
  2. You are not allowed to use the cluster for reasons unrelated to your studies and research.
  3. If your jobs are destructive to other users’ jobs or are threatening cluster integrity, your jobs might be canceled. You have the responsibility at all times to avoid behavior which interferes negatively with other users’ cluster usage. See Regulations.
  4. If the destructive behavior of your jobs does not change over time or you are unresponsive to e-mails from system admins requesting information or requiring immediate action regarding your cluster use, you might receive a ban from the cluster. See Regulations.

Regulations

  1. Your jobs might be canceled if:
    1. The node your jobs are running on becomes unresponsive and the node is automatically restarted.
    2. The job is overloading the node (for example overloading the network communication of the node).
    3. The job is adversely affecting the execution of other jobs (jobs that are not using all requested resources (effectively) and thus unfairly block waiting jobs from running may also be canceled).
    4. The jobs ignore the directions from the administrators (for example if a job is (still) affected by the same problem that the administrators informed you about before, and asked you to fix and test before resubmitting).
    5. The job is showing clear signs of a problem (like hanging, or being idle, or using only 1 CPU of the multiple CPUs requested, or not using a GPU that was requested).
  2. You might receive a cluster access ban for:
    1. Disallowed use of the cluster, including disallowed use of computing time, purposefully ignoring directions, guidelines, fair-use principles and/or (trying to gain) unauthorized access and/or causing disruptions to the cluster or parts thereof (even if unintentional).
    2. Unresponsiveness to e-mails from system admins requesting information or requiring immediate action regarding your cluster use.
    3. Repeated problems caused by your cluster use which go unsolved even after attempts to resolve the issue.
    4. Your cluster use privileges will be returned when all parties are confident that you understand the problem and it won’t reoccur.
  3. Your jobs won’t be canceled for:
    1. Scheduled maintenance. This is planned in advance and jobs that would run during scheduled maintenance times won’t start until the end of maintenance.

What to do in case of problems?

When you encounter problems, please follow the subsequent steps, in the indicated order:

  1. First, please contact your colleagues and fellow cluster users in your lab, concerning problems with your code, job performance and efficiency. They may be running similar jobs and potentially have solutions for your problem.
  2. You can also ask questions to fellow users on the MatterMost channel.
  3. For prolonged problems, your initial contact point is your supervisor/PI.
  4. As a final step, you can contact the cluster administrators for technical sysadmin problems or persistent efficiency problems, or for more information if you are not sure why you are banned from the cluster. You can do this by reporting your question, through the Self Service Portal , to the Service Desk. In your question, refer to the ‘DAIC cluster’.
  5. For severe recurring problems, complaints and suggestions for policy changes, or issues affecting multiple users, you can contact the DAIC advisory board to bring it up as an agenda point in the next user board meeting.

Responsible cluster usage

You are responsible that your jobs run efficiently:

  1. Please keep an eye on your jobs and the automated efficiency e-mails to check for unexpected behavior.
  2. Sometimes many jobs from the same user, or from student groups, will be running on many nodes at the same time. While this may seem like one user, or user group, is blocking the cluster for everyone else, please keep in mind that the scheduler operates on a set of predetermined rules based on the QoS and priority settings. We do not want idle resources. Therefore, at the time that those jobs were started, the resources were idle, no higher priority jobs were in the queue and the jobs did not exceed the QoS limits. If you repeatedly observe pending jobs, please bring it up in the user board meeting.
  3. Short job efficiency: If you are running many (hundreds or thousands of) very short jobs (duration of a few minutes), you may want to consider that starting and individually loading the same modules for each job may create overheads. When reasonably possible, it might save computation time to instead group some jobs together. The jobs can still be submitted to the short queue if the runtime is less than 4 hours.
  4. GPU job efficiency: If you are running multi-GPU jobs (for example due to GPU memory limitations), you may want to consider that the communications between the GPUs and other CPU processes (for example data loaders) may create overheads. It might be useful to consider running jobs on less GPUs with more GPU memory each, or taking advantage of specialized libraries optimized for multi-GPU computing in your code.

Acknowledging DAIC

Please acknowledge DAIC in your scientific publications using this sentence:

Acknowledgement

> Research reported in this work was partially or completely facilitated by computational resources and support of the the Delft AI Cluster (DAIC) at TU Delft (URL:https://doc.daic.tudelft.nl/), but remains the sole responsibility of the authors, not the DAIC team.

Reporting of scientific outputs

Please remember to post any scientific output based-off work performed on DAIC to the ScientificOutput MatterMost channel.

Access and accounts

  • DAIC is a cluster dedicated for TU Delft researchers (eg, PhD students, postdocs, .. etc) from participating groups (see Contributing departments).
  • To access DAIC resources, eligible candidates from these groups can request an account via the DAIC request Access form.
  • Additionally, requests for resources reservations can also be accommodated (see General cluster usage).