At present, DAIC and DelftBlue have different software stacks. This pertains to the operating system (Red Hat Enterprise Linux 7 vs Red Hat Enterprise Linux 8, respectively) and, consequently, the available software. Please refer to the respective DelftBlue modules and Software sections before commencing your experiments.
Operating System
DAIC runs the Red Hat Enterprise Linux 7 distribution, which provides the usual Linux software base. Most common software, including programming languages, libraries and development files for compiling your own software, is installed on the nodes (see Available software). However, a less common program that you need might not be installed, and if your research requires a state-of-the-art program that is not (yet) available as a package for Red Hat 7, it will not be available either. See Installing software for more information.
Login Nodes
The login nodes are the gateway to the DAIC HPC cluster and are specifically designed for lightweight tasks such as job submission, file management, and compiling code (on certain nodes). These nodes are not intended for running resource-intensive jobs, which should be submitted to the Compute Nodes.
Specifications and usage notes
Hostname | CPU (Sockets x Model) | Total Cores | Total RAM | Operating System | GPU Type | GPU Count | Usage Notes |
---|---|---|---|---|---|---|---|
login1 | 1 x Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz | 8 | 15.39 GB | OpenShift Enterprise | Quadro K2200 | 1 | For file transfers, job submission, and lightweight tasks. |
login2 | 1 x Intel(R) Xeon(R) CPU E5-2683 v3 @ 2.00GHz | 1 | 3.70 GB | OpenShift Enterprise | N/A | N/A | Virtual server, for non-intensive tasks. No compilation. |
login3 | 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz | 32 | 503.60 GB | RHEV | Quadro K2200 | 1 | For large compilation and interactive sessions. |
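Access to the cluster goes through these login nodes over SSH. As a minimal sketch (the hostname below is an assumption for illustration; use the address given on the SSH access page):
$ # Log in with your TU Delft NetID (hostname is a placeholder):
$ ssh <YourNetID>@login.daic.tudelft.nl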
Compute Nodes
DAIC compute nodes are all multi CPU servers, with large memories, and some with GPUs. The nodes in the cluster are heterogeneous, i.e. they have different types of hardware (processors, memory, GPUs), different functionality (some more advanced than others) and different performance characteristics. If a program requires specific features, you need to specifically request those for that job (see Submitting jobs).
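For example, a job that needs a particular node feature can ask for it with a Slurm constraint. The sketch below is illustrative only: the feature name avx2 and the program name are assumptions, so check the features actually advertised by the nodes (e.g. with the sinfo command shown in the note below) before relying on them.
#!/bin/sh
#SBATCH --job-name=feature-demo
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
# Only run on nodes that advertise the (assumed) "avx2" feature:
#SBATCH --constraint=avx2
srun ./my_program    # hypothetical program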
Note
All compute nodes have Advanced Vector Extensions 1 and 2 (AVX, AVX2) support and hyper-threading (ht) processors (two CPUs per core, always allocated in pairs).
Note
You can use Slurm’s sinfo command to get various information about the cluster nodes. For example, to get an overview of the compute nodes on DAIC, you can use the command:
$ sinfo --all --format="%P %N %c %m %G %b" --hide -S P,N -a | grep -v "general" | awk 'NR==1 {print; next} {match($5, /gpu:[^,]+:[0-9]+/); if (RSTART) print $1, $2, $3, $4, substr($5, RSTART, RLENGTH), $6; else print $1, $2, $3, $4, "-", $6 }'
Check out Slurm’s sinfo page and Wikipedia’s awk page for more information on these commands.
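To inspect a single node in more detail (its CPUs, memory, features and GPUs), scontrol can be used; gpu01 below is just one of the node names from the table that follows:
$ # Full details for one node:
$ scontrol show node gpu01
$ # One-line summary per node for the whole cluster:
$ sinfo --Node --long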
List of all nodes
The following table gives an overview of current nodes and their characteristics:
Hostname | CPU (Sockets x Model) | Cores per Socket | Total Cores | CPU Speed (MHz) | Total RAM | GPU Type | GPU Count |
---|---|---|---|---|---|---|---|
100plus | 2 x Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz | 16 | 32 | 2097.488 | 755.585 GB | ||
3dgi1 | 1 x AMD EPYC 7502P 32-Core Processor | 32 | 32 | 2500 | 251.41 GB | ||
3dgi2 | 1 x AMD EPYC 7502P 32-Core Processor | 32 | 32 | 2500 | 251.41 GB | ||
awi01 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 2996.569 | 376.384 GB | Tesla V100 PCIe 32GB | 1 |
awi02 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2900.683 | 503.619 GB | Tesla V100 SXM2 16GB | 2 |
awi03 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi04 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 3231.884 | 503.625 GB | ||
awi05 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 3258.984 | 503.625 GB | ||
awi07 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi08 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi09 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi10 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi11 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi12 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 503.625 GB | ||
awi19 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi20 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi21 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi22 | 2 x Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz | 14 | 28 | 2899.951 | 251.641 GB | ||
awi23 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 3221.038 | 376.385 GB | ||
awi24 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 2580.2 | 376.385 GB | ||
awi25 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 3399.884 | 376.385 GB | ||
awi26 | 2 x Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz | 18 | 36 | 3442.7 | 376.385 GB | ||
cor1 | 2 x Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz | 16 | 64 | 3599.975 | 1510.33 GB | Tesla V100 SXM2 32GB | 8 |
gpu01 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu02 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu03 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu04 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu05 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu06 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu07 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu08 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu09 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu10 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu11 | 2 x AMD EPYC 7413 24-Core Processor | 24 | 48 | 2650 | 503.402 GB | NVIDIA A40 | 3 |
gpu14 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.613 | 503.275 GB | NVIDIA A40 | 3 |
gpu15 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.938 | 503.275 GB | NVIDIA A40 | 3 |
gpu16 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.604 | 503.275 GB | NVIDIA A40 | 3 |
gpu17 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.878 | 503.275 GB | NVIDIA A40 | 3 |
gpu18 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.57 | 503.275 GB | NVIDIA A40 | 3 |
gpu19 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.682 | 503.275 GB | NVIDIA A40 | 3 |
gpu20 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.651 | 1007.24 GB | NVIDIA A40 | 3 |
gpu21 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.646 | 1007.24 GB | NVIDIA A40 | 3 |
gpu22 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.963 | 1007.24 GB | NVIDIA A40 | 3 |
gpu23 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.658 | 1007.24 GB | NVIDIA A40 | 3 |
gpu24 | 2 x AMD EPYC 7543 32-Core Processor | 32 | 64 | 2794.664 | 1007.24 GB | NVIDIA A40 | 3 |
grs1 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3499.804 | 251.633 GB | ||
grs2 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3577.734 | 251.633 GB | ||
grs3 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3499.804 | 251.633 GB | ||
grs4 | 2 x Intel(R) Xeon(R) CPU E5-2667 v4 @ 3.20GHz | 8 | 16 | 3499.804 | 251.633 GB | ||
influ1 | 2 x Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz | 16 | 32 | 2955.816 | 376.391 GB | GeForce RTX 2080 Ti | 8 |
influ2 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 187.232 GB | GeForce RTX 2080 Ti | 4 |
influ3 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 187.232 GB | GeForce RTX 2080 Ti | 4 |
influ4 | 2 x AMD EPYC 7452 32-Core Processor | 32 | 64 | 1500 | 251.626 GB | ||
influ5 | 2 x AMD EPYC 7452 32-Core Processor | 32 | 64 | 2350 | 503.611 GB | ||
influ6 | 2 x AMD EPYC 7452 32-Core Processor | 32 | 64 | 1500 | 503.61 GB | ||
insy15 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 754.33 GB | GeForce RTX 2080 Ti Rev. A | 4 |
insy16 | 2 x Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz | 16 | 32 | 2300 | 754.33 GB | GeForce RTX 2080 Ti Rev. A | 4 |
Total | | 1206 | 2380 | | 28 TB | | 101 |
CPUs
All nodes have multiple Central Processing Units (CPUs) that perform the operations. Each CPU can process one thread (i.e. an independent stream of instructions) at a time. A computer program consists of one or multiple threads, and thus needs one or multiple CPUs simultaneously to do its computations (see wikipedia's CPU page).
Note
Most programs use a fixed number of threads. Requesting more CPUs for a program than its number of threads will not make it any faster, because the program does not know how to use the extra CPUs. Conversely, when a program has fewer CPUs available than threads, the threads have to time-share the available CPUs (each thread only gets part-time use of a CPU), so the program runs slower, and slower still because of the added overhead of switching between threads. It is therefore always necessary to match the number of CPUs to the number of threads, or the other way around. See submitting jobs for setting resources for batch jobs.

The number of threads running simultaneously determines the load of a server. If the number of running threads equals the number of available CPUs, the server is loaded 100% (or 1.00). When the number of threads that want to run exceeds the number of available CPUs, the load rises above 100%.
The CPU functionality is provided by the hardware cores in the processor chips of the machines. Traditionally, one physical core contained one logical CPU, so the CPUs operated completely independently. Most current chips feature hyper-threading: one core contains two (or more) logical CPUs. These CPUs share parts of the core and the cache, so one CPU may have to wait when a shared resource is in use by the other CPU. Therefore these CPUs are always allocated in pairs by the job scheduler.
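As a sketch of matching the allocation to the thread count, the batch script below requests as many CPUs as the program will use threads; the program is a hypothetical OpenMP-style application that reads OMP_NUM_THREADS:
#!/bin/sh
#SBATCH --job-name=threads-demo
#SBATCH --time=00:30:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8    # request exactly as many CPUs as the program will use
#SBATCH --mem=8G
# Start one thread per allocated CPU, so threads do not time-share CPUs:
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun ./my_threaded_program   # hypothetical program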
GPUs
A few types of GPUs are available in some of the DAIC nodes, as shown in table 1. The total number of GPUs per type and their technical specifications are shown in table 2. See using graphic cards for requesting GPUs for a computational job.
GPU (slurm) type | Count | Model | Architecture | Compute Capability | CUDA cores | Memory |
---|---|---|---|---|---|---|
a40 | 66 | NVIDIA A40 | Ampere | 8.6 | 10752 | 46068 MiB |
turing | 24 | NVIDIA GeForce RTX 2080 Ti | Turing | 7.5 | 4352 | 11264 MiB |
v100 | 11 | Tesla V100-SXM2-32GB | Volta | 7.0 | 5120 | 32768 MiB |
In table 2, the headers denote:
Model
: The official product name of the GPU.
Architecture
: The hardware design used, and thus the hardware specifications and performance characteristics of the GPU. Each new architecture brings forward a new generation of GPUs.
Compute capability
: Determines the general functionality, available features and CUDA support of the GPU. A GPU with a higher capability supports more advanced functionality.
CUDA cores
: The number of cores that perform the computations: the more cores, the more work can be done in parallel (provided the algorithm can make use of the higher parallelism).
Memory
: Total installed GPU memory. The GPUs provide their own internal (fixed-size) memory for storing data for GPU computations. All required data needs to fit in this internal memory, or your computations will suffer a big performance penalty.
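To reserve one of these GPUs for a job, the Slurm type from the first column of table 2 can be used in a gres request. This is a sketch only: whether a type must be specified, and the exact syntax accepted on DAIC, is described on the using graphic cards page.
$ # Interactive session with one A40 GPU (type name taken from table 2);
$ # use --gres=gpu:1 if any GPU type will do:
$ sinteractive --cpus-per-task=2 --mem=4096 --time=00:30:00 --gres=gpu:a40:1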
Note
To inspect a given GPU and obtain the data of table 2, you can run the following commands in an interactive session or from an sbatch script (see Jobs on GPU resources). The apptainer image used in this code snippet was built as demonstrated in the Apptainer tutorial.
$ sinteractive --cpus-per-task=2 --mem=500 --time=00:02:00 --gres=gpu
Note: interactive sessions are automatically terminated when they reach their time limit (1 hour)!
srun: job 8607783 queued and waiting for resources
srun: job 8607783 has been allocated resources
15:50:29 up 51 days, 3:26, 0 users, load average: 60,33, 59,72, 54,65
SomeNetID@influ1:~$ nvidia-smi --format=csv,noheader --query-gpu=name
NVIDIA GeForce RTX 2080 Ti
SomeNetID@influ1:~$ nvidia-smi -q | grep Architecture
Product Architecture : Turing
SomeNetID@influ1:~$ nvidia-smi --query-gpu=compute_cap --format=csv,noheader
7.5
SomeNetID@influ1:~$ apptainer run --nv cuda_based_image.sif | grep "CUDA Cores" # using the apptainer image of the tutorial
(068) Multiprocessors, (064) CUDA Cores/MP: 4352 CUDA Cores
SomeNetID@influ1:~$ nvidia-smi --format=csv,noheader --query-gpu=memory.total
11264 MiB
SomeNetID@influ1:~$ exit
Memory
All machines have large main memories for performing computations on big data sets. A job cannot use more than its allocated amount of memory; if it needs more, it will fail or be killed. It is not possible to combine the memory from multiple nodes for a single task. 32-bit programs can only address (use) up to 3 GB (gigabytes) of memory. See Submitting jobs for setting resources for batch jobs.
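For example (a sketch; the program name, memory size and job ID are placeholders), a batch job requests its memory explicitly, and sacct can show afterwards how much a finished job actually used:
#!/bin/sh
#SBATCH --job-name=mem-demo
#SBATCH --time=00:15:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G            # the job fails or is killed if it uses more than this
srun ./my_analysis           # hypothetical program
After the job has finished, compare the requested and actually used memory:
$ sacct -j <jobid> --format=JobID,ReqMem,MaxRSS,State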
Storage
DAIC compute nodes have direct access to the TU Delft home, group and project storage. You can use your TU Delft installed machine or an SCP or SFTP client to transfer files to and from these storage areas and others (see data transfer), as demonstrated throughout this page.
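For example (a sketch; the login hostname and project name are placeholders, see data transfer for the exact endpoints), files can be copied from your own machine with scp or rsync:
$ # Copy a dataset into a project storage folder:
$ scp -r ./dataset <YourNetID>@login.daic.tudelft.nl:/tudelft.net/staff-umbrella/<project>/
$ # rsync only transfers changed files and can resume interrupted transfers:
$ rsync -av ./dataset/ <YourNetID>@login.daic.tudelft.nl:/tudelft.net/staff-umbrella/<project>/dataset/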
File System Overview
Unlike TU Delft’s DelftBlue, DAIC does not have a dedicated storage filesystem. This means there is no /scratch space for storing temporary files (see DelftBlue’s Storage description and Disk quota and scratch space). Instead, DAIC relies on a direct connection to the TU Delft network storage filesystem (see Overview data storage) from all its nodes, and offers the following types of storage areas:
Personal storage (aka home folder)
The Personal Storage is private and is meant to store personal files (program settings, bookmarks). A backup service protects your home files from both hardware failures and user error (you can restore previous versions of files from up to two weeks ago). The available space is limited by a quota (since this space is not meant to be used for research data).
You have two (separate) home folders: one for Linux and one for Windows (because Linux and Windows store program settings differently). You can access these home folders from a machine running Linux or Windows using a command-line interface or a browser via TU Delft's webdata. For example, the Windows home has a My Documents folder, which can be found on a Linux machine under /winhome/<YourNetID>/My Documents.
Home directory | Access from | Storage location |
---|---|---|
Linux home folder | Linux | /home/nfs/<YourNetID> |
Linux home folder | Windows | only accessible using an scp/sftp client (see SSH access) |
Linux home folder | webdata | not available |
Windows home folder | Linux | /winhome/<YourNetID> |
Windows home folder | Windows | H: or \\tudelft.net\staff-homes\[a-z]\<YourNetID> |
Windows home folder | webdata | https://webdata.tudelft.nl/staff-homes/[a-z]/<YourNetID> |
It’s possible to access the backups yourself. On Linux, the backups are located under the (hidden, read-only) ~/.snapshot/ folder. On Windows, you can right-click the H: drive and choose Restore previous versions.
Note
To see your disk usage, run something like:
du -h '</path/to/folder>' | sort -h | tail
Group storage
The Group Storage is meant for sharing files (documents, educational and research data) with department/group members. The whole department or group has access to this storage, so it is not suitable for confidential or project data. A backup service protects the files, with previous versions available from up to two weeks ago. A Fair-Use policy applies to the used space.
Destination | Access from | Storage location |
---|---|---|
Group Storage | Linux | /tudelft.net/staff-groups/<faculty>/<department>/<group> or /tudelft.net/staff-bulk/<faculty>/<department>/<group>/<NetID> |
Group Storage | Windows | M: or \\tudelft.net\staff-groups\<faculty>\<department>\<group>, or L: or \\tudelft.net\staff-bulk\ewi\insy\<group>\<NetID> |
Group Storage | webdata | https://webdata.tudelft.nl/staff-groups/<faculty>/<department>/<group>/ |
Project Storage
The Project Storage is meant for storing (research) data (datasets, generated results, downloaded files and programs, …) for projects. Only the project members (including external persons) can access the data, so it is suitable for confidential data (but you may want to use encryption for highly sensitive confidential data). There is a backup service and a Fair-Use policy for the used space.
Project leaders (or supervisors) can request a Project Storage location via the Self-Service Portal or the Service Desk.
Destination | Access from | Storage location |
---|---|---|
Project Storage | Linux | /tudelft.net/staff-umbrella/<project> |
Project Storage | Windows | U: or \\tudelft.net\staff-umbrella\<project> |
Project Storage | webdata | https://webdata.tudelft.nl/staff-umbrella/<project> |
Tip
Data deleted from project storage (staff-umbrella) remains in a hidden .snapshot folder. If accidentally deleted, you can recover such data by copying it back from the (hidden) .snapshot folder in your storage.
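A sketch of recovering a deleted file on Linux (the snapshot and file names are placeholders; snapshot folder names are site-specific, so list them first):
$ # List the available snapshots of your project storage:
$ ls /tudelft.net/staff-umbrella/<project>/.snapshot/
$ # Copy the lost file back from a chosen snapshot:
$ cp /tudelft.net/staff-umbrella/<project>/.snapshot/<snapshot>/results.csv /tudelft.net/staff-umbrella/<project>/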
Local Storage
Local storage is meant for temporary storage of (large amounts of) data with fast access on a single computer. You can create your own personal folder inside the local storage. Unlike the network storage above, local storage is only accessible on that computer, not on other computers or through network file servers or webdata. There is no backup service nor quota. The available space is large but fixed, so leave enough space for other users. Files under /tmp that have not been accessed for 10 days are automatically removed.
Destination | Access from | Storage location |
---|---|---|
Local storage | Linux | /tmp/<NetID> |
Local storage | Windows | not available |
Local storage | webdata | not available |
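A sketch of using local storage inside a job script (the project path, file names and program are placeholders; that $USER expands to your NetID on DAIC is an assumption worth verifying):
# Stage the input on the node-local disk, compute, copy the results back, clean up:
mkdir -p /tmp/$USER/myjob
cp /tudelft.net/staff-umbrella/<project>/input.dat /tmp/$USER/myjob/
./my_program /tmp/$USER/myjob/input.dat > /tmp/$USER/myjob/output.dat   # hypothetical program
cp /tmp/$USER/myjob/output.dat /tudelft.net/staff-umbrella/<project>/
rm -rf /tmp/$USER/myjob    # leave space for other users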
Memory Storage
Memory storage is meant for short-term storage of limited amounts of data with very fast access on a single computer. You can create your own personal folder inside the memory storage location. Memory storage is only accessible on that computer, and there is no backup service nor quota. The available space is limited and shared with running programs, so leave enough space free (the computer will likely crash if you don't!). Files that have not been accessed for 1 day are automatically removed.
Destination | Access from | Storage location |
---|---|---|
Memory storage | Linux | /dev/shm/<NetID> |
Memory storage | Windows | not available |
Memory storage | webdata | not available |
Warning
Use memory storage only when using other storage makes your job or the whole computer slow.
Workload scheduler
DAIC uses the Slurm scheduler to efficiently manage workloads. All jobs for the cluster have to be submitted as batch jobs into a queue. The scheduler then manages and prioritizes the jobs in the queue, allocates resources (CPUs, memory) for the jobs, executes the jobs and enforces the resource allocations. See the job submission pages for more information.
A Slurm-based cluster is composed of a set of login nodes that are used to access the cluster and submit computational jobs. A central manager orchestrates the computational demands across a set of compute nodes. These nodes are organized logically into groups called partitions, which define job limits or access rights. The central manager provides fault-tolerant hierarchical communication to ensure optimal and fair use of the available compute resources by eligible users, and makes it easier to run and schedule complex jobs across multiple nodes.
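In practice you write a small script with #SBATCH directives (as in the sketches earlier on this page) and hand it to the scheduler; the script name and job ID below are placeholders:
$ sbatch my_job.sbatch    # submit the script; Slurm prints the assigned job ID
$ squeue -u $USER         # show your pending and running jobs
$ scancel <jobid>         # cancel a job that is no longer needed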