Priorities and waiting times

Slurm’s job scheduling and waiting times

When Slurm is not configured for FIFO scheduling, jobs are prioritized in the following order:

  1. Jobs that can preempt: Not enabled in DAIC
  2. Jobs with an advanced reservation: See Slurm's Advanced Resource Reservation Guide
  3. Partition PriorityTier: See Priority tiers
  4. Job priority: See Priority calculations and QoS priority
  5. Job ID
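
If you want to check whether any advance reservations are currently defined (item 2 above), you can query the scheduler directly. A minimal sketch; the command is standard Slurm:

# List all current and upcoming advanced resource reservations (empty output if there are none)
scontrol show reservation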

Priority tiers

DAIC partitions are tiered:

  • The general partition is in the lowest priority tier,
  • Department partitions (e.g., insy, st) are in the middle priority tier, and
  • Partitions for specific groups (e.g., influence, mmll) are in the highest priority tier. These partitions correspond to resources contributed by the respective groups or departments (see Contributing departments).
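
You can inspect the priority tier of a partition yourself. A minimal sketch using scontrol (the partition names are examples; the exact output fields can vary between Slurm versions):

# Show the full configuration of a partition, including its PriorityTier
scontrol show partition general

# Or extract just the tier value
scontrol show partition insy | grep -o 'PriorityTier=[0-9]*'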

When resources become available, the scheduler first looks at the highest priority tier among the partitions those resources belong to, and starts the highest (user) priority jobs that fit within those resources (if any). If resources remain, the scheduler checks the next lower priority tier, and so on. Finally, the scheduler tries to backfill lower (user) priority jobs that fit (if any).

The partition priorities have no impact on resources that are in use, so jobs have to wait until the resources become available.

Partition selection

The purpose of this tiering is to let you submit your jobs to multiple partitions (e.g., --partition=mml,insy,general), allowing the scheduler to determine where the job can start the soonest. This ensures your job has the highest possible priority across different partitions in the cluster, without negatively impacting your or others’ resource access.
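
As a sketch, a job script that targets several partitions at once could look as follows (the resource requests and the workload are placeholders; adjust them, and the partition list, to your own situation):

#!/bin/sh
#SBATCH --partition=mml,insy,general   # let the scheduler pick whichever partition starts the job soonest
#SBATCH --qos=short                    # the QoS also affects priority, see QoS priority below
#SBATCH --time=00:30:00                # placeholder time limit
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2              # placeholder resource request
#SBATCH --mem=4G                       # placeholder resource request

srun hostname                          # placeholder workload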

Keep in mind that:

  • Resources of all partitions (e.g., st) are also part of the general partition (see Fig 1). Thus:
    • Submitting to the general partition allows jobs to use all nodes.
    • Submitting to group-specific partitions alone results in longer waiting times, since the general partition has far more resources than any of them (the bigger the resource pool, the more chances a job has to be scheduled or backfilled).
    • The optimal strategy is to submit to both the general and the group-specific partitions whenever you have access to them. This lets your job skip over higher (user) priority jobs in the general partition that would otherwise start first on resources that also belong to the specific partition.
  • You should only submit jobs to partitions that your account has access to. Submitting jobs to unauthorized partitions (e.g., using --partition=insy,st when your submitting account does not have access to both of these) results in the job remaining in a pending state and generates excessive logging, potentially overloading the Slurm controller nodes.
Correct: Explicit default account and partition specification

#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,general
Correct: Implicit default account omitted, since it has access to the specified partitions

#SBATCH --partition=insy,general
Incorrect: Multiple partitions with account mismatch

#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,st  
Incorrect: Specifying a wrong account for the partition

#SBATCH --account=ewi-st
#SBATCH --partition=insy 
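
To avoid the incorrect combinations above, you can first check which accounts, partitions and QoSs your user is associated with. A minimal sketch using standard Slurm accounting commands (the format fields are illustrative):

# List your associations: the accounts you may submit under, per partition, and their allowed QoS
sacctmgr show associations user=$USER format=Cluster,Account,Partition,QOS%40

# List your accounts together with your current fairshare values
sshare -u $USER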

Priority calculations

Slurm continually calculates job priorities and schedules the execution of jobs based on its configuration. The following configuration parameters affect priority computations:

  • SchedulerType: The type of scheduling used, based on available resources, requested resources, and job priorities. On DAIC, Slurm uses the backfill scheduling mechanism, which allows low-priority jobs to backfill idle resources as long as doing so does not delay the expected start time of any higher-priority job (based on resource availability).
  • PriorityType: The way priority is computed. On DAIC, a multifactor computation is applied, where job priority at any given time is a weighted sum of the following factors:
    • Fairshare: a measure of the amount of resources that a group (i.e., account in Slurm terminology) has contributed, and the historical usage of the group and the user.
    • QoS: the quality of service associated with the job, which is specified with the Slurm --qos directive (see QoS priority).
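
To see how these parameters are set on the cluster, you can query the scheduler configuration and the configured priority weights. A minimal sketch (the grep patterns are illustrative):

# Scheduling and priority plugins in use (e.g. sched/backfill and priority/multifactor)
scontrol show config | grep -E '^(SchedulerType|PriorityType)'

# Weights used in the multifactor priority computation
scontrol show config | grep '^PriorityWeight'

# The same weights as reported by sprio
sprio --weights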

The following commands are useful for checking prioritization of your own jobs:

Command                          Purpose
sprio -j <YourJobID>             Determine the priority of your job
squeue -j <YourJobID> --start    Request your job's estimated start time
sshare -u <YourNetID>            Determine your current fairshare value

QoS priority

The purpose of the (multiple) QoSs in DAIC is to optimize the throughput of the cluster and to reduce the waiting times for jobs:

  • Long jobs block resources for a long time, thus leading to long waiting times and fragmentation of resources.
  • Short jobs block resources only for short times, can more easily fill gaps in the resource schedule (and thus start sooner), and are therefore better for both throughput and waiting times.

Thus, DAIC has the following policy:

  • To encourage short jobs, the short QoS has a higher priority and allows you to use a larger share of all resources than the medium and long QoS.

  • To prevent long jobs from blocking all resources in the cluster for long times (thus causing long waiting times), only a certain part of all cluster resources is available to all running long QoS jobs (of all users) combined.

  • All running medium QoS jobs combined can use a somewhat larger share of all resources in the cluster, and all running short QoS jobs combined are allowed to fill the largest share of the cluster.

    • These limits are called the QoS group limits.
    • When this limit is reached, no new jobs with this QoS can be started, until some of the running jobs with this QoS finish and release some resources.
    • The scheduler will indicate this with the reason QoS Group CPU/memory/GRES limit.
  • To prevent one user from single-handedly using all available resources in a certain QoS, there are also limits for the total resources that all running jobs of one user in a specific QoS can use.

    • These are called the QoS per-user limits.
    • When this limit is reached, no new jobs of this user with this QoS can be started, until some of the running jobs of this user and with this QoS finish and release some resources.
    • The scheduler will indicate this with the reason QoS User CPU/memory/GRES limit (commands to inspect these limits are sketched after this list).
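
The QoS definitions, including the group and per-user limits described above, can be inspected directly, and squeue can show why a pending job is still waiting. A minimal sketch (the format fields and column widths are illustrative):

# QoS priorities, wall-time limits, group limits (GrpTRES) and per-user limits (MaxTRESPerUser)
sacctmgr show qos format=Name,Priority,MaxWall,GrpTRES,MaxTRESPerUser

# Show the reason your pending jobs are not starting (e.g. a QoS group or per-user limit)
squeue -u $USER -t PENDING -o "%.10i %.12P %.8q %.20r"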

These per-group and per-user limits are set by the DAIC user board, and the scheduler strictly enforces these limits. Thus, no user can use more resources than the amount that was set by the user board. Any (perceived) imbalance in the use of resources by a certain QoS or user should not be held against a user or the scheduler, but should be discussed in the user board.