Priorities, Partitions, Quality of Service & Reservations
Slurm’s job scheduling and waiting times
When slurm is not configured for FIFO scheduling, jobs are prioritized in the following order:
- Jobs that can preempt: Not enabled in DAIC
- Jobs with an advanced reservation: See Slurm's Advanced Resource Reservation Guide
- Partition PriorityTier: See Priority tiers
- Job priority: See Priority calculations and QoS priority
- Job ID
Priority tiers
DAIC partitions are tiered:
- The `general` partition is in the lowest priority tier,
- Department partitions (eg, `insy`, `st`) are in the middle priority tier, and
- Partitions for specific groups (eg, `influence`, `mmll`) are in the highest priority tier.

Those partitions correspond to resources contributed by the respective groups or departments (see Contributing departments).
When resources become available, the scheduler will first look for jobs in the highest priority partition that those resources are in, and start the highest (user) priority jobs that fit within the resources (if any). When resources remain, the scheduler will check the next lower priority tier, and so on. Finally, the scheduler will try to backfill lower (user) priority jobs that fit (if any).
The partition priorities have no impact on resources that are in use, so jobs have to wait until the resources become available.
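You can check the tier of each partition by looking at its PriorityTier value in the scontrol output. A sketch (the `general` partition shows PriorityTier=1, the lowest tier; group partitions show higher values):
# List each partition together with the line that contains its PriorityTier value.
$ scontrol show partition -a | grep -E 'PartitionName|PriorityTier'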
Partition selection
The purpose of this tiering is to let you submit your jobs to multiple partitions (e.g., `--partition=mmll,insy,general`), allowing the scheduler to determine where the job can start the soonest. This ensures your job has the highest possible priority across different partitions in the cluster, without negatively impacting your or others’ resource access.
Keep in mind that:
- Resources of all partitions (eg, `st`) are also part of the `general` partition (see Fig 1). Thus:
  - Submitting to the `general` partition allows jobs to use all nodes.
  - Submitting to group-specific partitions alone results in longer waiting times, since the `general` partition has far more resources than any of them (the bigger the resource pool, the more chances a job has to be scheduled or backfilled).
  - The optimal strategy is to submit to both `general` and the group-specific partitions you have access to, as in the example after this list. This lets your job skip over higher (user) priority jobs that would otherwise start first on resources that are also in the specific partition.
- You should only submit jobs to partitions that your account has access to. Submitting jobs to unauthorized partitions (e.g., using `--partition=insy,st` when your submitting account does not have access to both of these) will result in the job remaining in a pending state and generate excessive logging, potentially overloading the Slurm controller nodes.
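For example, a user on the `ewi-insy-prb` account from the example below could submit the same job script to both the `insy` and `general` partitions and let the scheduler pick whichever starts first. A minimal sketch (the script name `job.sbatch` is illustrative):
$ sbatch --account=ewi-insy-prb --partition=insy,general job.sbatch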
Warning
Always ensure you are submitting jobs to partitions accessible by your account. You can check your account and partition permissions with the following commands; example output for a user is shown below:
$ sacctmgr show user "$USER" withassoc Format='DefaultAccount,Account' --parsable # Check your account(s)
Def Acct|Account|
ewi-insy-prb|ewi-st|
ewi-insy-prb|ewi-insy-prb|
$ echo "Partition AllowAccounts"; scontrol show partition -a | \
> awk '
> /PartitionName=/ {
> split($1, a, "=");
> partition = a[2]
> }
> /AllowAccounts=/ {
> split($2, b, "=");
> print partition, b[2]
> }
> ' | \
> grep -E 'ALL|ewi-insy-prb' # Check partitions accessible to your *default* account
Partition AllowAccounts
general ALL
insy ewi-insy,ewi-insy-cgv,ewi-insy-cys,ewi-insy-ii,ewi-insy-ii-influence,ewi-insy-mmc,ewi-insy-prb,ewi-insy-prb-dbl,ewi-insy-prb-prlab,ewi-insy-prb-spclab,ewi-insy-prb-visionlab,ewi-insy-reit,ewi-insy-sdm,ewi-insy-sup
This shows that the user can use the `ewi-insy-prb` or the `ewi-st` accounts.
The second command shows that all accounts can submit to the `general` partition and that several accounts can submit to the `insy` partition.
Replace `ewi-insy-prb` in the grep line above to get the partition details for your specific account.
For the example above, note the following correct and incorrect submissions:

Correct (`ewi-insy-prb` has access to both `insy` and `general`):
#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,general

Correct (the default account `ewi-insy-prb` has access to both partitions):
#SBATCH --partition=insy,general

Incorrect (`ewi-insy-prb` does not have access to the `st` partition):
#SBATCH --account=ewi-insy-prb
#SBATCH --partition=insy,st

Incorrect (`ewi-st` does not have access to the `insy` partition):
#SBATCH --account=ewi-st
#SBATCH --partition=insy
Priority calculations
Slurm continually calculates job priorities and schedules the execution of jobs based on its configurations. A few configuration parameters affect priority computations:
`SchedulerType`: The type of scheduling used based on available resources, requested resources, and job priorities. On DAIC, Slurm is used with the `backfill` scheduling mechanism. This mechanism allows low-priority jobs to backfill idle resources if doing so does not delay the expected start time of any high-priority job (based on resource availability).
Tip
With `sched/backfill`, jobs can only be started when the resources that they request fit within the available idle resources. Thus:
- The fewer resources a job requests, the higher the chance that it will fit within the available idle resources.
- The more resources a job requests, the longer it will have to wait before enough resources become available to start.

To check how the cluster is configured, you may run:
$ scontrol show config | grep SchedulerType
SchedulerType = sched/backfill
More details are available in Slurm’s SchedulerType documentation.
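Because backfill uses the requested time limit and resources to decide whether a job fits into an idle gap, asking for a realistic (smaller) `--time` and resource set can help your job start sooner. A sketch with illustrative values:
# Illustrative values only: a 2-CPU, 2-hour request is far easier to backfill
# into idle gaps than a 32-CPU, 7-day request.
$ sbatch --qos=short --time=02:00:00 --cpus-per-task=2 --mem=4G job.sbatch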
`PriorityType`: The way priority is computed. On DAIC, a `multifactor` computation is applied, where job priority at any given time is a weighted sum of the following factors:
- Fairshare: a measure of the amount of resources that a group (ie `account` in Slurm terminology) has contributed, and the historical usage of the group and the user.
- QOS: the quality of service associated with the job, which is specified with the Slurm `--qos` directive (see QoS priority).
Info
The whole idea behind FairShare scheduling in DAIC is to share all the available resources fairly and efficiently with all users (instead of imposing strict limits on the amount of resources used or on which hardware users can compute). The resources in the cluster are contributed in different amounts by different groups (see Contributing departments), and the scheduler makes sure that each group can use a share of the resources proportional to what the group contributed. To check how the cluster is configured, you may run:
$ scontrol show config | grep PriorityType
PriorityType = priority/multifactor
$ sprio --weights
JOBID PARTITION PRIORITY SITE FAIRSHARE QOS
Weights 1 20000000 40000000
The following commands are useful for checking prioritization of your own jobs:
Command | Purpose |
---|---|
`sprio -j <YourJobID>` | Determine the priority of your job |
`squeue -j <YourJobID> --start` | Request your job’s estimated start time |
`sshare -u <YourNetID>` | Determine your current fairshare value |
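For example, for a (hypothetical) job ID 1234567:
$ sprio -j 1234567           # per-factor breakdown of this job's priority
$ squeue -j 1234567 --start  # estimated start time (may show N/A while still undetermined)
$ sshare -u "$USER"          # your current fairshare value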
Info
To get more complete priority configurations of a cluster, run the command:
$ scontrol show config | grep ^Priority
PriorityParameters = (null)
PrioritySiteFactorParameters = (null)
PrioritySiteFactorPlugin = (null)
PriorityDecayHalfLife = 2-00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = No
PriorityFlags =
PriorityMaxAge = 7-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 0
PriorityWeightAssoc = 0
PriorityWeightFairShare = 20000000
PriorityWeightJobSize = 0
PriorityWeightPartition = 0
PriorityWeightQOS = 40000000
PriorityWeightTRES = (null)
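As a rough illustration of these weights (the factor values below are made up, and Slurm normalizes each factor to the 0–1 range): a job with a fairshare factor of 0.5 and a QOS factor of 0.5 would get a priority of about 0.5 × 20000000 + 0.5 × 40000000 = 30000000. With these settings the QOS factor can outweigh the fairshare factor by up to a factor of two.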
QoS priority
The purpose of the (multiple) QoSs in DAIC is to optimize the throughput of the cluster and to reduce the waiting times for jobs:
- Long jobs block resources for a long time, thus leading to long waiting times and fragmentation of resources.
- Short jobs block resources only for short times, and can more easily fill in the gaps in the scheduling of resources (thus start sooner), and are therefore better for throughput and waiting times.
Thus, DAIC has the following policy:
- To stimulate short jobs, the `short` QoS has a higher priority, and allows you to use a larger part of all resources, than the `medium` and `long` QoS.
- To prevent long jobs from blocking all resources in the cluster for long times (thus causing long waiting times), only a certain part of all cluster resources is available to all running `long` QoS jobs (of all users) combined.
- All running `medium` QoS jobs together can use a somewhat larger part of all resources in the cluster, and all running `short` QoS jobs combined are allowed to fill the biggest part of the cluster.
  - These limits are called the QoS group limits.
  - When this limit is reached, no new jobs with this QoS can be started, until some of the running jobs with this QoS finish and release some resources.
  - The scheduler will indicate this with the reason `QoS Group CPU/memory/GRES limit`.
- To prevent one user from single-handedly using all available resources in a certain QoS, there are also limits for the total resources that all running jobs of one user in a specific QoS can use.
  - These are called the QoS per-user limits.
  - When this limit is reached, no new jobs of this user with this QoS can be started, until some of the running jobs of this user and with this QoS finish and release some resources.
  - The scheduler will indicate this with the reason `QoS User CPU/memory/GRES limit`.
These per-group and per-user limits are set by the DAIC user board, and the scheduler strictly enforces these limits. Thus, no user can use more resources than the amount that was set by the user board. Any (perceived) imbalance in the use of resources by a certain QoS or user should not be held against a user or the scheduler, but should be discussed in the user board.
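To see whether one of these limits (or something else) is currently holding back your pending jobs, you can inspect the reason column of `squeue`, for example:
# %R prints the reason a pending job is not running yet
# (eg a QoS group or per-user limit, Priority, or Resources).
$ squeue -u "$USER" --states=PENDING -o "%.10i %.9P %.8q %.8T %.30R"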
Partitions
In SLURM, a partition is a scheduling construct that groups nodes or resources based on certain characteristics or policies. Partitions are used to organize and manage resources within a cluster, and they allow system administrators to control how jobs are allocated and executed on different nodes.
See partition definitions
On DAIC the `scontrol` command only shows you the `general` partition. More partitions are available.
$ scontrol show partition
PartitionName=general
AllowGroups=ALL AllowAccounts=ALL DenyQos=influence
AllocNodes=login[1-3],oodtest Default=YES QoS=N/A
DefaultTime=00:01:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
MaxNodes=UNLIMITED MaxTime=UNLIMITED MinNodes=0 LLN=NO MaxCPUsPerNode=UNLIMITED
Nodes=3dgi[1-2],100plus,awi[01-26],cor1,gpu[01-11],grs[1-4],influ[1-6],insy[11-16],tbm5,wis1
PriorityJobFactor=1 PriorityTier=1 RootOnly=NO ReqResv=NO OverSubscribe=NO
OverTimeLimit=NONE PreemptMode=OFF
State=UP TotalCPUs=4064 TotalNodes=59 SelectTypeParameters=NONE
JobDefaults=(null)
DefMemPerNode=1024 MaxMemPerNode=UNLIMITED
TRESBillingWeights=CPU=0.5,Mem=0.083333333G,GRES/gpu=16.0
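The `TRESBillingWeights` line shows how a job's resource request is converted into a billing value (which in turn drives fairshare usage). As a rough illustration with the weights above, and assuming the default behaviour of summing the weighted resources: a job requesting 4 CPUs, 48G of memory and 1 GPU is billed at about 4 × 0.5 + 48 × 0.083333 + 1 × 16 ≈ 22, so one GPU weighs roughly as much as 32 CPUs or about 192G of memory.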
Quality of Service (QoS)
When you submit a job in a Slurm-based system, it enters a queue waiting for resources. The partition and Quality of Service (QoS) are the two job parameters Slurm uses to assign resources for a job:
- The partition is a set of compute nodes on which a job can be scheduled. In DAIC, the nodes contributed or funded by a certain group are lumped into a corresponding partition (see Contributing departments). All nodes in DAIC are part of the `general` partition, but other partitions exist for prioritization purposes on select nodes (see Priority tiers).
- The Quality of Service is a set of limits that controls what resources a job can use and, therefore, determines the priority level of a job. This includes the run time, CPU, GPU and memory limits on the given partition. Jobs that exceed these limits are automatically terminated (see QoS priority).
For DAIC, Table 1 shows the QoS limits on the `general` partition.

Partition | QoS | Priority | Max run time | Jobs per user | CPU per QoS | CPU per user | GPU per QoS | GPU per user | Memory per QoS | Memory per user |
---|---|---|---|---|---|---|---|---|---|---|
general | interactive | high | 1 hour | 1 running | - | 2 | - | 2 | - | 16G |
general | short | normal | 4 hours | 10000 | 3672 (85%) | 2160 (50%) | 109 (85%) | 64 (50%) | 23159G (85%) | 13623G (50%) |
general | medium | medium | 1 ½ day | 2000 | 3456 (80%) | 1512 (35%) | 103 (80%) | 45 (35%) | 21796G (80%) | 9536G (35%) |
general | long | low | 7 days | 1000 | 3240 (75%) | 864 (20%) | 96 (75%) | 25 (20%) | 20434G (75%) | 5449G (20%) |
general | infinite* | none | infinite | 1 running | 32 | - | 2 | - | 250G | - |

*infinite QoS jobs will be killed when servers go down, eg, during maintenance. It is not recommended to submit jobs with this QoS.
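For example, a job that needs 6 hours does not fit the `short` QoS (4 hour limit) and would be submitted with `--qos=medium`, while a job that fits within 4 hours benefits from `short`'s higher priority and larger limits. A sketch (time values are illustrative):
$ sbatch --qos=short  --time=03:30:00 job.sbatch   # fits within short's 4 hour limit
$ sbatch --qos=medium --time=06:00:00 job.sbatch   # too long for short, so use medium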
Note
The priority of a job is a function of both QoS and previous usage (less is better). Read Priority and waiting times for more information.
See Quality of Service definitions
On DAIC you can check the QoS policies with the `sacctmgr` command:
$ sacctmgr list qos
Name Priority GraceTime Preempt PreemptExemptTime PreemptMode Flags UsageThres UsageFactor GrpTRES GrpTRESMins GrpTRESRunMin GrpJobs GrpSubmit GrpWall MaxTRES MaxTRESPerNode MaxTRESMins MaxWall MaxTRESPU MaxJobsPU MaxSubmitPU MaxTRESPA MaxJobsPA MaxSubmitPA MinTRES
---------- ---------- ---------- ---------- ------------------- ----------- ---------------------------------------- ---------- ----------- ------------- ------------- ------------- ------- --------- ----------- ------------- -------------- ------------- ----------- ------------- --------- ----------- ------------- --------- ----------- -------------
normal 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=1
short 50 00:00:00 cluster DenyOnLimit 1.000000 cpu=3562,gre+ 65536 04:00:00 cpu=2096,gre+ 10000 cpu=1,mem=1M
long 25 00:00:00 cluster DenyOnLimit 1.000000 cpu=3144,gre+ 65536 7-00:00:00 cpu=838,gres+ 1000 cpu=1,mem=1M
infinite 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=32,gres/+ 65536 1 100 cpu=1,mem=1M
interacti+ 100 00:00:00 cluster DenyOnLimit 2.000000 65536 01:00:00 cpu=2,gres/g+ 1 1 cpu=1,mem=1M
student 10 00:00:00 cluster DenyOnLimit 1.000000 cpu=192,gres+ 65536 04:00:00 cpu=2,gres/g+ 1 100 cpu=1,mem=1M
reservati+ 100 00:00:00 cluster DenyOnLimit,RequiresReservation 1.000000 65536 10000 cpu=1,mem=1M
influence 100 00:00:00 cluster DenyOnLimit 1.000000 65536 10000 cpu=1,mem=1M
guest-sho+ 10 00:00:00 cluster DenyOnLimit 1.000000 cpu=200,gres+ 65536 04:00:00 cpu=128,gres+ 100 cpu=1,mem=1M
guest-long 0 00:00:00 cluster DenyOnLimit 1.000000 cpu=200,gres+ 65536 7-00:00:00 cpu=128,gres+ 1 10 cpu=1,mem=1M
medium 35 00:00:00 cluster DenyOnLimit 1.000000 cpu=3352,gre+ 65536 1-12:00:00 cpu=1466,gre+ 2000 cpu=1,mem=1M
How to use QoS in your sbatch scripts?
In your `sbatch.slurm` script you can specify the QoS with the `#SBATCH --qos=...` option.
Example:
#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --partition=general
#SBATCH --account=ewi-insy-reit
#SBATCH --qos=short # This is how you specify QoS
#SBATCH --time=0:01:00
#SBATCH --nodes=1
#SBATCH --tasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=1GB
#SBATCH --output=slurm-%n-%j.out
#SBATCH --error=slurm-%n-%j.err
srun echo 'Hi, from Slurm!'
sleep 30 # Wait for 30 seconds before exiting.
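Assuming you saved this script as `sbatch.slurm` (the file name is up to you), submit it with:
$ sbatch sbatch.slurm
Submitted batch job 1234567   # the job ID here is illustrative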
QoS for reservations
In case you have a reservation, you need to specify `--qos=reservation` and `--reservation=<reservation name>` (see Resources reservations below).
Resources reservations
Slurm makes it possible to reserve one or more compute nodes exclusively for a specific user or group of users. A reservation ensures that the designated node(s) are dedicated solely to the reservation holder’s tasks and are not shared with other users during the reserved period. This feature allows users to plan the execution of future workloads, and accommodates cluster users with special needs beyond the batch system (eg latency measurement scenarios).
Note
Using reservations is in line with the General cluster usage clauses of DAIC users’ agreement. However, please be mindful that reservations are intended to facilitate special needs that cannot be satisfied by the batch system, and should not be requested to guarantee fast throughput for production runs.
Requesting a Reservation
To request a reservation for nodes, please use the Request Reservation form. You can request a reservation for an entire compute node (or a group of nodes) if you have contributed this node (or these nodes) to the cluster and you have special needs that need to be accommodated.
General guidelines for reservations’ requests:
- You can be granted a reservation only on nodes from a partition that is contributed by your group (See Partitions to check the name of the partition contributed by your group, and System specifications for a listing of available nodes and their features).
- Please ask for the least amount of resources you need, so as to minimize the impact on other users.
- Plan ahead and request your reservation as soon as possible: reservations usually ignore running jobs, so any running job on the machine(s) you request will continue to run when the reservation starts. While jobs from other users will not start on the reserved node(s), the resources in use by an already running job at the start time of the reservation will not be available in the reservation until this running job ends. The earlier you request resources, the easier it is to allocate them.
Using reservations
Once your reservation request is approved and a reservation is placed on the system, you can run your jobs in the reservation by specifying `--qos=reservation` along with the following directives to your slurm commands: `--reservation=<name>` and `--partition=<partition>`. For example, to submit the job `job.sbatch` to a reservation named `icra_iv` on the `cor1` node in the `cor` partition, use:
$ sbatch --qos=reservation --reservation=icra_iv --partition=cor job.sbatch
Alternatively, it is possible to add the following lines to the `job.sbatch` file, and submit this file as usual:
#SBATCH --qos=reservation
#SBATCH --reservation=icra_iv
#SBATCH --partition=cor
Note
It is possible to submit jobs to a reservation once it is created. Jobs will start as soon as the reservation becomes available, but jobs already running on the reserved resources will not be canceled for the reservation to start.
Note
When a reservation is used to run your jobs, remember to also pass the reservation parameters to your srun steps:
$ srun --qos=reservation --reservation=<reservation_name> --partition=<partition_name> <some_script.sh>
To make use of an existing reservation you have to specify `--qos=reservation` and `--reservation=<reservation-name>` in your `sbatch` script.
Viewing reservations
To view all active and future reservations, run the `scontrol` command as follows:
$ scontrol show reservations
ReservationName=icra_iv StartTime=2023-09-09T00:00:00 EndTime=2023-09-16T00:00:00 Duration=7-00:00:00
Nodes=cor1 NodeCnt=1 CoreCnt=32 Features=(null) PartitionName=cor Flags=
TRES=cpu=64
Users=(null) Groups=(null) Accounts=3me-cor Licenses=(null) State=ACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
ReservationName=maintenance weekend 2023-10-14 StartTime=2023-10-13T20:00:00 EndTime=2023-10-16T09:00:00 Duration=2-13:00:00
Nodes=3dgi[1-2],100plus,awi[01-26],cor1,gpu[01-11],grs[1-4],influ[1-6],insy[11-12,14-16],tbm5,wis1 NodeCnt=58 CoreCnt=2000 Features=(null) PartitionName=(null) Flags=MAINT,IGNORE_JOBS,SPEC_NODES,ALL_NODES
TRES=cpu=4000
Users=root Groups=(null) Accounts=(null) Licenses=(null) State=INACTIVE BurstBuffer=(null) Watts=n/a
MaxStartDelay=(null)
Note
- Jobs can run on a reservation only if explicitly requested as shown in the Requesting a reservation section.
- Only jobs from the `Users` or `Accounts` associated with the reservation (as shown in the `scontrol show reservations` output) will be run on the reservation.
- The `STATE` of a reservation will show as `ACTIVE` (instead of `INACTIVE`) during the reservation window.