Slurm

Reservation and management of FRIDA compute resources is based on Slurm (Simple Linux Utility for Resource Management), a software package for submitting, scheduling, and monitoring jobs on large compute clusters. Slurm governs access to the cluster's resources through three views; account/user association, partition, and QoS (Quality of Service). Accounts link users into groups (based on research lab, project or other shared properties). Partitions link nodes and QoS allow users to control the priority and limits on a job level. Each of these views can impose limits and affect the submitted job's priority.

Nodes

FRIDA currently consists of one login node and several compute nodes with characteristics listed in the table below. The total compute currently consists of 1456 vCPUs with 8576GB of system RAM, and 40 GPUs with 3392GB GPU RAM. Although the login node has Python 3.10 and venv pre-installed, these are provided only to aid in quick scripting, and the login node is not intended for any intensive processing. Refrain from installing user space additions like conda and others. All computationally intensive tasks must be submitted as Slurm jobs via the corresponding Slurm commands. User accounts that fail to adhere to these guidelines will be subject to suspension.

NODE	ROLE	vCPU	MEM	nGPU	GPU type
login	login	64	256GB	-	-
aga	compute	256	512GB	4	NVIDIA A100 40GB SXM4
apl	compute	112	1TB	8	NVIDIA L4 24GB PCIe
ana	compute	112	1TB	8	NVIDIA A100 80GB PCIe
ixh	compute	224	2TB	8	NVIDIA H100 80GB HBM3
ixb	compute	224	2TB	8	NVIDIA B200 180GB
gh[1-2]	compute	72	576GB	1	NVIDIA GH200 480GB
api	compute	384	768GB	2	AMD MI210 64GB PCIe

The compute node naming scheme follows a two/three letter acronym that is based on the node architecture and suffixed by a number, if multiple such nodes exist. For example node name ana stands for AMD CPU, NVLink interconnect, and Ampere GPU, api stands for AMD CPU, PCIe interconnect, and AMD Instinct MIxxx GPU, aga stands for AMD CPU, gang NVLink (gen 3) interconnect (i.e. no NVSwitch), Ampere GPU, ixh stands for Intel CPU, SXM5 - NVLink + NVSwitch (gen 4) interconnect, and Hopper GPU, Ampere GPU, ixb stands for Intel CPU, SXM6 - NVLink + NVSwitch (gen 5) interconnect, and Blackwell GPU, and gh stands for GraceHopper superchip (i.e. Grace CPU and Hopper GPU tightly bound via a 900GB/s NVLink-C2C interconnect, allowing for cache and memory coherency). More details about the nodes can be obtaind by executing the command scontrol show node <node_name>; note that each node is also assigned a series of features which can be used in conjuction with parameter -C/--constraint <key>:<value> to target a specific node.

NODE	CPU_BRD	CPU_GEN	CPU_SKU	CPU_L3	CPU_MEM	GPU_BRD	GPU_GEN	GPU_SKU	GPU_MEM	GPU_CC
aga	AMD	ZEN3	EPYC_7763	256MB	512GB	NVIDIA	AMPERE	A100_SXM4_40GB	40GB	8.0
apl	AMD	ZEN3	EPYC_7453	64MB	1TB	NVIDIA	ADA_LOVELACE	L4_24GB_PCIE	24GB	8.9
ana	AMD	ZEN3	EPYC_7453	64MB	1TB	NVIDIA	AMPERE	A100_80GB_PCIE	80GB	8.0
ixh	INTEL	GOLDEN_COVE	PLATINUM_8480CL	105MB	2TB	NVIDIA	HOPPER	H100_80GB_HBM3	80GB	9.0
ixb	INTEL	EMERALD_RAPIDS	PLATINUM_8570	200MB	2TB	NVIDIA	BLACKWELL	B200_180GB	180GB	10.0
gh[1-2]	ARM	NEO2	GRACE	234MB	576GB	NVIDIA	HOPPER	GH200_480GB	96GB	9.0
api	AMD	ZEN4	EPYC_9684X	1152MB	768GB	AMD	CDNA2	MI210	64GB	-

Partitions

Within Slurm subsets of compute nodes are organized into partitions. On FRIDA there are two types of partitions, general and private (available to selected research labs or groups based on their co-funding of FRIDA). Interactive jobs can be run only on partition dev. Production runs are not permitted in interactive jobs, dev partition is thus intended to be used for code development, testing, and debugging only.

PARTITION	TYPE	nodes	default time	max time	available gres types
frida	general	all*	4h	7d	gpu, gpu:L4, gpu:A100, gpu:A100_80GB, gpu:H100, gpu:B200
dev	general	aga,ana,apl	2h	12h	gpu, gpu:L4, gpu:A100, gpu:A100_80GB
cjvt	private	axa	4h	4d	gpu, gpu:A100
psuiis	private	ana	4h	4d	gpu, gpu:A100_80GB
nxt	experimental	gh[1-2]	2h	2d	gpu, gpu:GH200
amd	experimental	api	2h	2d	gpu, gpu:MI210

Note

To avoid issues related to differences in CPU and/or GPU architecture, partition frida includes all nodes, but those that are part of partitions nxt and amd.

Tip

Jobs that are CPU intense should target partition amd as it is equipped with large cache AMD Epyc 9684X CPUs.

Shared storage

All FRIDA nodes have access to a limited amount of shared data storage. For performance reasons, it is served by a raid0 backend. Note that FRIDA does not provide automatic backups, these are entirely in the domain of the end user. Also, as of current settings, FRIDA does not enforce shared storage quotas, but this may change in future upgrades. Access permissions on the user folder are set to user only. Depending on the groups a user is a member of, they may have access to multiple workspace folders. These folders have the group special (SGID) bit set, so files created within them will automatically have the correct group ownership. All group members will have read and execute rights. Note, however, that group write access is masked, so users who wish to make their files writable by other group members should change permissions by using the chmod g+w command. All other security measures dictated by the nature of your data are the responsibility of the end users.

In addition to access to shared storage, compute nodes provide also an even smaller amount of local storage. The amount varies per node and may change with FRIDA's future updates. Local storage is intended as scratch space, the corresponding path is created on a per-job basis at job start and purged as soon as the job ends.

TYPE	backend	backups	access	location	env	quota
shared	weka	no	user	`/shared/home/$USER`	`$HOME`	-
shared	weka	no	group	`/shared/workspace/$SLURM_JOB_ACCOUNT`	`$WORK`	-
scratch	raid0	no	job	`/local/scratch/$USER/$SLURM_JOB_ID`	`$SCRATCH`	-

Usage

In Slurm there are two principal ways of submitting jobs. Interactive and non-interactive, or batch submission. What follows in the next subsections is a quick review of most typical use cases.

Some useful Slurm commands with their typical use case, notes and corresponding Slurm help are displayed in the table below. In the next sections, we will be using these commands to submit and view jobs.

CMD	typical use case	notes	help
`frida`	view the current state of resources	this is a custom, FRIDA command, which in a compact form presents important FRIDA notices about maintenance; it also prints detailed info of the last five jobs submitted by the user, a list of their shared storage locations with usage and remaining space, and a list of currently available GPUs
`slurm`	view the current status of all resources	this is a custom, FRIDA alias built on top of the more general `sinfo` Slurm command	see Slurm docs on `sinfo`
`salloc`	resource reservation and allocation for interactive use	intended for interactive jobs; upon successful allocation `srun` commands can be used to open a shell with the allocated resources	see Slurm docs on `salloc`
`srun`	resource reservation, allocation and execution of supplied command on these resources	a blocking command; with `--pty` can be used for interactive jobs; by appending `&` followed by a `wait` the command can be turned into non-blocking	see Slurm docs on `srun`
`stunnel`	resource reservation, allocation and setup of a vscode tunnel to these resources	this is a custom, FRIDA alias built on top of the more general `srun --pty` Slurm command; requires a GitHub or Microsoft account for tunnel registration; intended for dvelopment, testing and debugging	see section code tunnel
`sbatch`	resource reservation, allocation and execution of non-interactive batch job on these resources	asynchronous execution of sbatch script; when combined with `srun` multiple sub-steps become possible	see Slurm docs on `sbatch`
`squeue`	view the current queue status	on large clusters the output can be large; filter can be applied to limit output to specific partition by `-p {partition}`	see Slurm docs on `squeue`
`sprio`	view the current priority status on queue	this is a custom FRIDA alias built on top of the more general `sprio` Slurm command	see Slurm docs on `sprio`
`scancel`	cancel a running job		see Slurm docs on `scancel`
`scontrol show job`	show detailed information of a job		see Slurm docs on `scontrol`

Interactive sessions

Slurm in general provides two ways of running interactive jobs; either via the salloc or via the srun --pty command. The former will first reserve the resources, which you then explicitly use via the latter, while the latter can be used to do the reservation and execution in a single step.

For example, the following snippet shows how to allocate resources on dev partition with default parameters (2 vCPU, 8GB RAM) and then start a bash shell on the allocated resources. Done that it shows how to check what was allocated, then exits the bash shell and finally releases the allocated resources.

ilb@login-frida:~$ salloc -p dev
salloc: Granted job allocation 498
salloc: Waiting for resource configuration
salloc: Nodes ana are ready for job

ilb@login-frida:~ [interactive]$ srun --pty bash

ilb@ana-frida:~ [interactive]$ scontrol show job $SLURM_JOB_ID | grep AllocTRES
   AllocTRES=cpu=2,mem=8G,node=1,billing=2

ilb@ana-frida:~ [interactive]$ exit
exit

ilb@login-frida:~ [interactive]$ exit
exit
salloc: Relinquishing job allocation 498
salloc: Job allocation 498 has been revoked.

ilb@login-frida:~$

The following snipped, on the other hand, allocates resources on dev partition and starts a bash shell in one call, this time requesting for 32 vCPU cores and 32 GB of RAM. It then checks what was allocated, and finally exits from the bash shell jointly releasing the allocated resources.

ilb@login-frida:~$ srun -p dev -c32 --mem=32G --pty bash

ilb@ana-frida:~ [bash]$ scontrol show job $SLURM_JOB_ID | grep AllocTRES
   AllocTRES=cpu=32,mem=32G,node=1,billing=32

ilb@ana-frida:~ [bash]$ exit
exit

ilb@login-frida:~$

Access to the allocated resources (nodes) can be achieved also by invoking ssh; however, note that then the Slurm-defined environment variables will not be available to you. The auto inclusion of the corresponding environment variables is an added benefit of accessing the allocated resources via srun --pty. With multi-node resource allocations use -w to specify the node on which you wish to start a shell.

ilb@login-frida:~$ srun -p dev --pty bash

ilb@ana-frida:~ [bash]$ echo $SLURM_
$SLURM_CLUSTER_NAME       $SLURM_JOB_ID             $SLURM_JOB_QOS            $SLURM_SUBMIT_HOST
$SLURM_CONF               $SLURM_JOB_NAME           $SLURM_NNODES             $SLURM_TASKS_PER_NODE
$SLURM_JOBID              $SLURM_JOB_NODELIST       $SLURM_NODELIST
$SLURM_JOB_ACCOUNT        $SLURM_JOB_NUM_NODES      $SLURM_NODE_ALIASES
$SLURM_JOB_CPUS_PER_NODE  $SLURM_JOB_PARTITION      $SLURM_SUBMIT_DIR

ilb@ana-frida:~ [bash]$ exit
exit

ilb@login-frida:~$

Note also that when you have multiple jobs running on a single node ssh will always open a shell with the resources allocated by the last submitted job. The srun command on the other hand provides two parameters through which one can specify the desire to not start a new job, but open an additional shell in an already running job. For example, assuming a job with id 500 is already running, the next snippet shows how to open a second bash shell in the same job.

ilb@login-frida:~$ srun --overlap --jobid=500 --pty bash

ilb@ana-frida:~ [bash]$ echo $SLURM_JOB_ID
500

ilb@ana-frida:~ [bash]$ exit
exit

ilb@login-frida:~$

Running jobs in containers

FRIDA was set up by following deepops, an open-source infrastructure automation tool by NVIDIA. Our setup is focused on supporting ML/AI training tasks, and based on current trends of containerization even in HPC systems, it is also purely container-based (modules are not supported). Here we opted for Enroot an extremely lightweight container runtime capable of turning traditional container/OS images into unprivileged sandboxes. An added benefit of Enroot is its tight integration into Slurm commands via the Pyxis plugin, thus providing a very good user experience.

Regardless if the Enroot/Pyxis bundle turns containers/OS images into unprivileged sandboxes, automatic root remapping allows users to extend and update the imported images with ease. The changes can be retained for the duration of a job or exported to a squashfs file for cross-job retention. Some of the most commonly used Pyxis-provided srun parameters are listed in the table below. For further details on all available parameters consult the official Pyxis documentation, while information on how to set up credentials that enable importing containers from private container registries can be found in the official Enroot documentation.

srun parameter	format	notes
`--container-image`	`[USER@][REGISTRY#]IMAGE[:TAG]\\|PATH`	container image to load, path must point to a squashfs file
`--container-name`	`NAME`	name the container for easier access to running containters; can be used to change container contents without saving to a squashfs file, but note that non-running named containers are kept only within a `salloc` or `sbatch` allocation
`--container-save`	`PATH`	save the container state to a squashfs file
`--container-mount-home`		bind mount the user home; not mounted by default to prevent local config collisions
`--container-mounts`	`SRC:DST[:FLAGS][,SRC:DST...]`	bind points to mount into the container; `$SCRATCH` is auto-mounted

For example, the following snippet will check the OS release on dev partition, node ana, then start a bash shell in a CentOS container on the same node and recheck. In this case, the CentOS image tagged latest will be pulled from the official DockerHub page.

ilb@login-frida:~$ srun -p dev -w ana bash -c 'echo "$HOSTNAME: `grep PRETTY /etc/os-release`"'
ana: PRETTY_NAME="Ubuntu 22.04.3 LTS"

ilb@login-frida:~$ srun -p dev -w ana --container-image=centos --pty bash
pyxis: importing docker image: centos
pyxis: imported docker image: centos
[root@ana /]# echo "$HOSTNAME: `grep PRETTY /etc/os-release`"
ana: PRETTY_NAME="CentOS Linux 8"

[root@ana /]# exit
exit

ilb@login-frida:~$

The following snippet allocates resources on dev partition, then starts an ubuntu:20.04 container and checks if file utility exists. The command results in an error indicating that the utility is not present. The snippet then creates a named container that is based on ubuntu:20.04 and installs the utility. Then it rechecks if the utility exists in the named container. The snippet ends by releasing the resources, which jointly purges the named container.

ilb@login-frida:~$ salloc -p dev
salloc: Granted job allocation 526
salloc: Waiting for resource configuration
salloc: Nodes ana are ready for job

ilb@login-frida:~ [interactive]$ srun --container-image=ubuntu:20.04 which file
pyxis: importing docker image: ubuntu:20.04
pyxis: imported docker image: ubuntu:20.04
srun: error: ana: task 0: Exited with exit code 1

ilb@login-frida:~ [interactive]$ srun --container-image=ubuntu:20.04 --container-name=myubuntu sh -c 'apt-get update && apt-get install -y file'
pyxis: importing docker image: ubuntu:20.04
pyxis: imported docker image: ubuntu:20.04
Get:1 http://security.ubuntu.com/ubuntu focal-security InRelease [114 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal InRelease [265 kB]
Get:3 http://security.ubuntu.com/ubuntu focal-security/main amd64 Packages [3238 kB]
Get:4 http://security.ubuntu.com/ubuntu focal-security/multiverse amd64 Packages [29.3 kB]
Get:5 http://security.ubuntu.com/ubuntu focal-security/restricted amd64 Packages [3079 kB]
Get:6 http://security.ubuntu.com/ubuntu focal-security/universe amd64 Packages [1143 kB]
Get:7 http://archive.ubuntu.com/ubuntu focal-updates InRelease [114 kB]
Get:8 http://archive.ubuntu.com/ubuntu focal-backports InRelease [108 kB]
Get:9 http://archive.ubuntu.com/ubuntu focal/main amd64 Packages [1275 kB]
Get:10 http://archive.ubuntu.com/ubuntu focal/multiverse amd64 Packages [177 kB]
Get:11 http://archive.ubuntu.com/ubuntu focal/restricted amd64 Packages [33.4 kB]
Get:12 http://archive.ubuntu.com/ubuntu focal/universe amd64 Packages [11.3 MB]
Get:13 http://archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages [3726 kB]
Get:14 http://archive.ubuntu.com/ubuntu focal-updates/restricted amd64 Packages [3228 kB]
Get:15 http://archive.ubuntu.com/ubuntu focal-updates/multiverse amd64 Packages [32.0 kB]
Get:16 http://archive.ubuntu.com/ubuntu focal-updates/universe amd64 Packages [1439 kB]
Get:17 http://archive.ubuntu.com/ubuntu focal-backports/main amd64 Packages [55.2 kB]
Get:18 http://archive.ubuntu.com/ubuntu focal-backports/universe amd64 Packages [28.6 kB]
Fetched 29.4 MB in 3s (9686 kB/s)
Reading package lists...
Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libmagic-mgc libmagic1
The following NEW packages will be installed:
  file libmagic-mgc libmagic1
0 upgraded, 3 newly installed, 0 to remove and 0 not upgraded.
Need to get 317 kB of archives.
After this operation, 6174 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu focal/main amd64 libmagic-mgc amd64 1:5.38-4 [218 kB]
Get:2 http://archive.ubuntu.com/ubuntu focal/main amd64 libmagic1 amd64 1:5.38-4 [75.9 kB]
Get:3 http://archive.ubuntu.com/ubuntu focal/main amd64 file amd64 1:5.38-4 [23.3 kB]
debconf: delaying package configuration, since apt-utils is not installed
Fetched 317 kB in 1s (345 kB/s)
Selecting previously unselected package libmagic-mgc.
(Reading database ... 4124 files and directories currently installed.)
Preparing to unpack .../libmagic-mgc_1%3a5.38-4_amd64.deb ...
Unpacking libmagic-mgc (1:5.38-4) ...
Selecting previously unselected package libmagic1:amd64.
Preparing to unpack .../libmagic1_1%3a5.38-4_amd64.deb ...
Unpacking libmagic1:amd64 (1:5.38-4) ...
Selecting previously unselected package file.
Preparing to unpack .../file_1%3a5.38-4_amd64.deb ...
Unpacking file (1:5.38-4) ...
Setting up libmagic-mgc (1:5.38-4) ...
Setting up libmagic1:amd64 (1:5.38-4) ...
Setting up file (1:5.38-4) ...
Processing triggers for libc-bin (2.31-0ubuntu9.12) ...


ilb@login-frida:~ [interactive]$ srun --container-name=myubuntu which file
/usr/bin/file

ilb@login-frida:~ [interactive]$ exit
exit
salloc: Relinquishing job allocation 526
salloc: Job allocation 526 has been revoked.

ilb@login-frida:~$

Start a second shell inside the same container

Using named containers becomes handy on occasions when one needs to open a second shell inside the same container (within the same job with the same resources). Let's assume a job with id 527 that was created with the parameter --container-name=myubuntu is running. Let's also assume that the file utility has been installed. The next snippet shows how to open a second bash shell into the same container.

ilb@login-frida:~$ srun --overlap --jobid=527 --container-name=myubuntu --pty bash

root@ana:/# which file
/usr/bin/file

root@ana:/# exit
exit

ilb@login-frida:~$

Without the use of named containers, the task becomes more challenging as one needs to first start an overlapping shell on the node, find out the PID of the Enroot container in question, and then use Enroot commands to start a bash shell in that container. The following snippet shows an example.

ilb@login-frida:~$ srun --overlap --jobid=527 --pty bash

ilb@ana-frida:~ [bash]$ enroot list -f
NAME               PID    COMM  STATE  STARTED  TIME   MNTNS       USERNS      COMMAND
pyxis_527_527.0  11229  bash  Ss+    20:35    00:23  4026538269  4026538268  /usr/bin/bash

ilb@ana-frida:~ [bash]$ enroot exec 11229 bash

root@ana:/# which file
/usr/bin/file

root@ana:/# exit
exit

ilb@login-frida:~$

Save the container image to a local filesystem

Named containers work very well throughout a single sbatch allocation, but when the same scripts are run multiple times, downloading from online container registries may take too much time. With multi-node runs certain container registries may even throttle downloads (e.g. when a large number of nodes starts to download concurrently). Or simply the container is seen just as a starting point on which one builds (installs other dependencies). For such cases it is useful to create a local copy of the container via the parameter --container-save. For example, the following snippet shows this workflow on the earlier example with file.

ilb@login-frida:~$ srun -p dev --container-image=ubuntu:20.04 --container-save=./ubuntu_with_file.sqfs sh -c 'apt-get update && apt-get install -y file'
srun: job 7822 queued and waiting for resources
srun: job 7822 has been allocated resources
pyxis: importing docker image: ubuntu:20.04
pyxis: imported docker image: ubuntu:20.04
...
pyxis: exported container pyxis_7822_7822.0 to ./ubuntu_with_file.sqfs

ilb@login-frida:~$ ls -alht ubuntu_with_file.sqfs
-rw-r----- 1 ilb lpt 128M Oct 24 12:07 ubuntu_with_file.sqfs

ilb@login-frida:~$ srun -p dev --container-image=./ubuntu_with_file.sqfs which file
srun: job 7823 queued and waiting for resources
srun: job 7823 has been allocated resources
/usr/bin/file

Warning

The parameter --container-save expects a file path, providing a folder path may lead to data loss. The parameter --container-image by default assumes an online image name, so local files should be provided in absolute path format, or prepended with ./ in the case of relative path format.

Private container registries

Enroot uses credentials configured through $HOME/.config/enroot/.credentials. Because the Pyxis/Enroot bundle pulls containers on the fly during the Slurm job start, the credentials file needs to be accessible in a shared filesystem which all nodes can access at job start. The file format of the credentials file is the following.

$HOME/.config/enroot/.credentials

machine <hostname> login <username> password <password>

For example, credentials for the NVIDIA NGC registry would look as follows.

# NVIDIA GPU Cloud (both endpoints are required)
machine nvcr.io login $oauthtoken password <token>
machine authn.nvidia.com login $oauthtoken password <token>

For further details consult the Enroot documentation, which provides additional examples.

Requesting GPUs

All FRIDA compute nodes provide GPUs, but they differ in provider, architecture, and available memory. In Slurm, a request for allocation of GPU resources can be done using either by specifying the --gpus=[type:]{count} or --gres=gpu[:type][:{count}] parameter, as shown in the snippet below. The available GPU types depend on the selected partition, whose available GPU types in turn depend on the nodes that constitute the partition (see section partitions).

ilb@login-frida:~$ srun -p dev --gres=gpu nvidia-smi -L
GPU 0: NVIDIA L4 (UUID: GPU-42461b88-e6ff-2a6d-cad6-583f99df26d2)

ilb@login-frida:~$ srun -p dev --gpus=A100_80GB:4 nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-845e3442-e0ca-a376-a3de-50e4cb7fd421)
GPU 1: NVIDIA A100 80GB PCIe (UUID: GPU-7f73bd51-df5b-f0ba-2cf2-5dadbfa297e1)
GPU 2: NVIDIA A100 80GB PCIe (UUID: GPU-d666e6d2-5265-29ae-9a6f-d8772807d34f)
GPU 3: NVIDIA A100 80GB PCIe (UUID: GPU-fd552b1f-e767-edd6-5cc0-69514f1748d2)

Code tunnel

When an interactive session is intended for code development, testing, and/or debugging, many a time it is desirable to work with a suitable IDE. The requirement of resource allocation via Slurm and the lack of any toolsets on bare-metal hosts might seem too much of an added complexity, but in reality, there is a very elegant solution by using Visual Studio Code in combination with the Remote Development Extension (via code_tunnel). Running code_tunnel will allow you to use VSCode to connect directly to the container that is running on the compute node that was assigned to your job. Combined with root remapping this has the added benefit of a user experience that feels like working with your own VM.

On every run of code_tunnel you will need to register the tunnel with your GitHub or Microsoft account; this interaction requires the srun command to be run with the parameter --pty, and for this reason, a suitable alias command named stunnel was setup. Once registered you will be able to access the compute node (while code_tunnel is running) either in a browser by visiting https://vscode.dev or with a desktop version of VSCode via the Remote Explorer (part of the Remote Development Extension pack). In both cases, the 'Remotes (Tunnels/SSH)' list in the Remote Explorer pane should contain a list of tunnels to FRIDA (named frida-{job-name}) that you have registered and also denote which of these are currently online (with your job running). In the browser, it is also possible to connect directly to a workspace on the node (on which the code_tunnel job is running). Simply visit the URL of the form https://vscode.dev/tunnel/frida-{job-name}[/{path-to-workspace}]. If you wish to close the tunnel before the job times out press Ctrl-C in the terminal where you started the job.

ilb@login-frida:~$ stunnel -c32 --mem=48G --gres=gpu:1 --container-image=nvcr.io#nvidia/pytorch:23.10-py3 --job-name torch-debug
# srun -p dev -c32 --mem=48G --gres=gpu:1 --container-image=nvcr.io#nvidia/pytorch:23.10-py3 --job-name torch-debug --pty code_tunnel
pyxis: importing docker image: nvcr.io#nvidia/pytorch:23.10-py3
pyxis: imported docker image: nvcr.io#nvidia/pytorch:23.10-py3
Tue Nov 28 21:31:40 UTC 2023
*
* Visual Studio Code Server
*
* By using the software, you agree to
* the Visual Studio Code Server License Terms (https://aka.ms/vscode-server-license) and
* the Microsoft Privacy Statement (https://privacy.microsoft.com/en-US/privacystatement).
*
✔ How would you like to log in to Visual Studio Code? · Github Account
To grant access to the server, please log into https://github.com/login/device and use code 17AC-E3AE
[2023-11-28 21:31:15] info Creating tunnel with the name: frida-torch-debug

Open this link in your browser https://vscode.dev/tunnel/frida-torch-debug/workspace

[2023-11-28 21:31:41] info [tunnels::connections::relay_tunnel_host] Opened new client on channel 2
[2023-11-28 21:31:41] info [russh::server] wrote id
[2023-11-28 21:31:41] info [russh::server] read other id
[2023-11-28 21:31:41] info [russh::server] session is running
[2023-11-28 21:31:43] info [rpc.0] Checking /local/scratch/root/590/.vscode/cli/ana/servers/Stable-1a5daa3a0231a0fbba4f14db7ec463cf99d7768e/log.txt and /local/scratch/root/590/.vscode/cli/ana/servers/Stable-1a5daa3a0231a0fbba4f14db7ec463cf99d7768e/pid.txt for a running server...
[2023-11-28 21:31:43] info [rpc.0] Downloading Visual Studio Code server -> /tmp/.tmpbbxB9w/vscode-server-linux-x64.tar.gz
[2023-11-28 21:31:45] info [rpc.0] Starting server...
[2023-11-28 21:31:45] info [rpc.0] Server started
^C[2023-11-28 21:37:45] info Shutting down: Ctrl-C received
[2023-11-28 21:37:45] info [rpc.0] Disposed of connection to running server.
Tue Nov 28 21:37:45 UTC 2023

ilb@login-frida:~$

Keeping interactive jobs alive

Interactive sessions are typically run in the foreground, require a terminal, and last for the duration of the SSH session, or until the Slurm reservation times out (whichever comes first). On flaky internet connections, this can become a problem, where a lost connection might lead to a stopped job. One remedy is to use tmux or screen which allows for running interactive jobs in detached mode.

Batch jobs

Once development, testing, and/or debugging is complete, at least in ML/AI training tasks, the datasets typically become such that jobs last much longer than a normal SSH session. They can also be safely run without any interaction and also without a terminal. In Slurm parlance these types of jobs are termed batch jobs. A batch job is a shell script (typically marked by a .sbatch extension), whose header provides Slurm parameters that specify the resources needed to run the job. For details of all available parameters consult the official Slurm documentation. Below is a very brief deep-dive introduction to Slurm batch jobs.

Assume a toy MNIST training script based on Pytorch and Lightning, named train.py.

train.py

import os
import torch
from torch import nn
import torch.nn.functional as F
from torchvision import transforms
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchmetrics.functional import accuracy
import lightning.pytorch as L
from pathlib import Path


class LitMNIST(L.LightningModule):
    def __init__(self, data_dir=os.getcwd(), hidden_size=64, learning_rate=2e-4):
        super().__init__()

        # We hardcode dataset specific stuff here.
        self.data_dir = data_dir
        self.num_classes = 10
        self.dims = (1, 28, 28)
        channels, width, height = self.dims
        self.transform = transforms.Compose(
            [
                transforms.ToTensor(),
                transforms.Normalize((0.1307,), (0.3081,)),
            ]
        )

        self.hidden_size = hidden_size
        self.learning_rate = learning_rate

        # Build model
        self.model = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * width * height, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden_size, self.num_classes),
        )

    def forward(self, x):
        x = self.model(x)
        return F.log_softmax(x, dim=1)

    def training_step(self, batch):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.nll_loss(logits, y)
        preds = torch.argmax(logits, dim=1)
        acc = accuracy(preds, y, task="multiclass", num_classes=10)
        self.log("val_loss", loss, prog_bar=True)
        self.log("val_acc", acc, prog_bar=True)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.learning_rate)
        return optimizer

    def prepare_data(self):
        # download
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        # Assign train/val datasets for use in dataloaders
        if stage == "fit" or stage is None:
            mnist_full = MNIST(self.data_dir, train=True, transform=self.transform)
            self.mnist_train, self.mnist_val = random_split(mnist_full, [55000, 5000])

        # Assign test dataset for use in dataloader(s)
        if stage == "test" or stage is None:
            self.mnist_test = MNIST(
                self.data_dir, train=False, transform=self.transform
            )

    def train_dataloader(self):
        return DataLoader(self.mnist_train, batch_size=4)

    def val_dataloader(self):
        return DataLoader(self.mnist_val, batch_size=4)

    def test_dataloader(self):
        return DataLoader(self.mnist_test, batch_size=4)


# Some prints that might be useful
print("SLURM_NTASKS =", os.environ["SLURM_NTASKS"])
print("SLURM_TASKS_PER_NODE =", os.environ["SLURM_TASKS_PER_NODE"])
print("SLURM_GPUS_PER_NODE =", os.environ.get("SLURM_GPUS_PER_NODE", "N/A"))
print("SLURM_CPUS_PER_NODE =", os.environ.get("SLURM_CPUS_PER_NODE", "N/A"))
print("SLURM_NNODES =", os.environ["SLURM_NNODES"])
print("SLURM_NODELIST =", os.environ["SLURM_NODELIST"])
print(
    "PL_TORCH_DISTRIBUTED_BACKEND =",
    os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND", "nccl"),
)


# Explicitly specify the process group backend if you choose to
ddp = L.strategies.DDPStrategy(
    process_group_backend=os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND", "nccl")
)

# setup checkpointing
checkpoint_callback = L.callbacks.ModelCheckpoint(
    dirpath="./checkpoints/",
    filename="mnist_{epoch:02d}",
    every_n_epochs=1,
    save_top_k=-1,
)

# resume from checkpoint
existing_checkpoints = list(Path('./checkpoints/').iterdir()) if Path('./checkpoints/').exists() else []
last_checkpoint = str(sorted(existing_checkpoints, key=os.path.getmtime)[-1]) if len(existing_checkpoints)>0 else None

# model
model = LitMNIST()
# train model
trainer = L.Trainer(
    callbacks=[checkpoint_callback],
    max_epochs=100,
    accelerator="gpu",
    devices=-1,
    num_nodes=int(os.environ["SLURM_NNODES"]),
    strategy=ddp,
)
trainer.fit(model,ckpt_path=last_checkpoint)

To run this training script on a single node with a single GPU of type NVIDIA A100 40GB SXM4 one needs to prepare the following script (train_1xA100.sbatch). The script will reserve 16 vCPU, 32GB RAM, and 1 NVIDIA A100 40GB SXM4 for 20 minutes. It will also give the job a name, for easier finding when listing the partition queue status. As part of the actual job steps, it will start a pytorch:23.08-py3 container, mount the current path, install lightning:2.1.2, and finally run the training, all with one single srun command. Without additional parameters, the srun command will use all of the allocated resources. The standard output of the script will be saved into a file named slurm-{SLURM_JOB_ID}.out.

train_1xA100.sbatch

#!/bin/bash
#SBATCH --job-name=mnist-demo
#SBATCH --time=20:00
#SBATCH --gres=gpu:A100
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G

srun \
  --container-image nvcr.io#nvidia/pytorch:23.08-py3 \
  --container-mounts ${PWD}:${PWD} \
  --container-workdir ${PWD} \
  bash -c 'pip install lightning==2.1.2; python3 train.py'

The Slurm job is then executed with the following command. It runs in the background without any terminal output, but its progress can be monitored by viewing the corresponding output file.

ilb@login-frida:~/mnist-demo$ sbatch -p frida train_1x100.sbatch
Submitted batch job 600

ilb@login-frida:~/mnist-demo$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               600     frida mnist-de      ilb  R       0:35      1 axa

ilb@login-frida:~/mnist-demo-1$ tail -f slurm-600.out
pyxis: importing docker image: nvcr.io#nvidia/pytorch:23.08-py3
pyxis: imported docker image: nvcr.io#nvidia/pytorch:23.08-py3
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting lightning==2.1.2
  Obtaining dependency information for lightning==2.1.2 from https://files.pythonhosted.org/packages/9e/8a/9642fdbdac8de47d68464ca3be32baca3f70a432aa374705d6b91da732eb/lightning-2.1.2-py3-none-any.whl.metadata
  Downloading lightning-2.1.2-py3-none-any.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.8/61.8 kB 2.8 MB/s eta 0:00:00
...

To run the same training script on more GPUs one needs to change the SBATCH line with --gres and add a line that defines the number of parallel tasks Slurm should start (--tasks). For example, to run on 2 GPUs of type NVIDIA H100 80GB HBM3 and this time save the standard output in a file named {SLURM_JOB_NAME}_{SLURM_JOB_ID}.out, one needs to prepare the following batch script.

train_2xH100.sbatch

#!/bin/bash
#SBATCH --job-name=mnist-demo
#SBATCH --partition=frida
#SBATCH --time=20:00
#SBATCH --gres=gpu:H100:2
#SBATCH --tasks=2
#SBATCH --cpus-per-task=16
#SBATCH --mem=32G
#SBATCH --output=%x_%j.out

srun \
  --container-image nvcr.io#nvidia/pytorch:23.08-py3 \
  --container-mounts ${PWD}:${PWD} \
  --container-workdir ${PWD} \
  bash -c 'pip install lightning==2.1.2; python3 train.py'

Since the script also preselects the partition on which to run it is not necesary to provide it when executing the sbatch command.

ilb@login-frida:~/mnist-demo$ sbatch train_2xH100.sbatch
Submitted batch job 601

ilb@login-frida:~/mnist-demo$

A more advanced script that allows for a custom organization of experiment results, autoarchival of ran scripts, can be run single-node single-GPU or multi-node multi-GPU, modifies the required container on the fly, and is capable of auto-re-queueing itself and continuation, can be seen in the following listing. The process of submitting the job remains the same as before.

train_2x2.sbatch

#!/bin/bash
#SBATCH --job-name=mnist-demo # provide a job name
#SBATCH --account=lpt # povide the account which determines the resource limits
#SBATCH --partition=frida # provide the partition from where the resources should be allocated
#SBATCH --time=20:00 # provide the time you require the resources for
#SBATCH --nodes=2 # provide the number of nodes you are requesting
#SBATCH --ntasks-per-node=2 # provide the number of tasks to run on a node; slurm runs this many replicas of the subsequent commands
#SBATCH --gpus-per-node=2 # provide the number of GPUs that is required per node; avoid using --gpus-per-tasks as there are issues with allocation as well as inter-gpu communication - see https://github.com/NVIDIA/pyxis/issues/73
#SBATCH --cpus-per-task=4 # provide the number of CPUs that is required per task; be mindful that different nodes have different numbers of available CPUs
#SBATCH --mem-per-cpu=4G # provide the required memory per CPU that is required per task; be mindful that different nodes have different amounts of memory
#SBATCH --output=/dev/null # we do not care for the log of this setup script, we redirect the log of the execution srun
#SBATCH --signal=B:USR2@90 # instruct Slurm to send a notification 90 seconds prior to running out of time
# SBATCH --signal=B:TERM@60 # this is to enable the graceful shutdown of the job, see https://slurm.schedmd.com/sbatch.html#lbAH and lightning auto-resubmit, see https://lightning.ai/docs/pytorch/stable/cloudsy/cluster_advanced.html#enable-auto-wall-time-resubmissions

# get the name of this script
if [ -n "${SLURM_JOB_ID:-}" ] ; then
  SBATCH=$(scontrol show job "$SLURM_JOB_ID" | awk -F= '/Command=/{print $2}')
else
  SBATCH=$(realpath "$0")
fi

# allow passing a version, for autorequeue
if [ $# -gt 1 ] || [[ "$0" == *"help" ]] || [ -z "${SLURM_JOB_ID:-}" ] || [[ "$1" != "--version="* ]]; then
  echo -e "\nUsage: sbatch ${SBATCH##*$PWD/} [--version=<version>] \n"
  exit 1
fi

# convert the --key=value arguments to variables
for argument in "$@"
do
  if [[ $argument == *"="* ]]; then
    key=$(echo $argument | cut -f1 -d=)
    value=$(echo $argument | cut -f2 -d=)
    if [[ $key == *"--"* ]]; then
        v="${key/--/}"
        declare ${v,,}="${value}"
   fi
  fi
done

# setup signal handler for autorequeue
if [ "${version}" ]; then
  sig_handler() {
    echo -e "\n\n$(date): Signal received, requeuing...\n\n" >> ${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}/${RUN_NAME}.1.txt
    scontrol requeue ${SLURM_JOB_ID}
  }
  trap 'sig_handler' SIGUSR2
fi

# time of running script
DATETIME=`date "+%Y%m%d-%H%M"`
version=${version:-$DATETIME} # if version is not set, use DATETIME as default

# work dir
WORK_DIR=${HOME}/mnist

# experiment name
EXPERIMENT_NAME=demo

# experiment dir
EXPERIMENT_DIR=${WORK_DIR}/exp
mkdir -p ${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}

# set run name
if [ "${version}" == "${DATETIME}" ]; then
  RUN_NAME=${version}
else
  RUN_NAME=${version}_R${DATETIME}
fi

# archive this script
cp -rp ${SBATCH} ${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}/${RUN_NAME}.sbatch

# prepare execution script name
SCRIPT=${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}/${RUN_NAME}.sh
touch $SCRIPT
chmod a+x $SCRIPT

IS_DISTRIBUTED=$([ 1 -lt $SLURM_JOB_NUM_NODES ] && echo " distributed over $SLURM_JOB_NUM_NODES nodes" || echo " on 1 node")

# prepare the execution script content
echo """#!/bin/bash

# using `basename $SBATCH` -> $RUN_NAME.sbatch, running $SLURM_NPROCS tasks$IS_DISTRIBUTED

# prepare sub-script for debug outputs
echo -e \"\"\"
# starting at \$(date)
# running process \$SLURM_PROCID on \$SLURMD_NODENAME
\$(nvidia-smi | grep Version | sed -e 's/ *| *//g' -e \"s/   */\n# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\" -e \"s/^/# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\")
\$(nvidia-smi -L | sed -e \"s/^/# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\")
\$(python -c 'import torch; print(f\"torch: {torch.__version__}\")' | sed -e \"s/^/# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\")
\$(python -c 'import torch, torch.utils.collect_env; torch.utils.collect_env.main()' | sed \"s/^/# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\")
\$(env | grep -i Slurm | sort | sed \"s/^/# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\")
\$(cat /etc/nccl.conf | sed \"s/^/# \${SLURMD_NODENAME}.\${SLURM_PROCID}>   /g\")
\"\"\"

# set unbuffered python for realtime container logging
export PYTHONFAULTHANDLER=1
export NCCL_DEBUG=INFO

# train
python /train/train.py

echo -e \"\"\"
# finished at \$(date)
\"\"\"
""" >> $SCRIPT

# install lightning into container, ensure it is run only once per node
srun \
  --ntasks=${SLURM_JOB_NUM_NODES} \
  --ntasks-per-node=1 \
  --container-image nvcr.io#nvidia/pytorch:23.08-py3 \
  --container-name ${SLURM_JOB_NAME} \
  --output="${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}/${RUN_NAME}.%s.txt" \
  pip install lightning==2.1.2 \
  || exit 1

# run training script, run in background to receive signal notification prior to timout
srun \
  --container-name ${SLURM_JOB_NAME} \
  --container-mounts ${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}:/exp,${WORK_DIR}:/train \
  --output="${EXPERIMENT_DIR}/${EXPERIMENT_NAME}/${version}/${RUN_NAME}.%s.txt" \
  --container-workdir /exp \
  /exp/${RUN_NAME}.sh \
  &
PID=$!
wait $PID;

This script can then be run and monitored with the standard commands.

ilb@login-frida:~/mnist-demo$ sbatch train_2x2.sbatch --version=v1
Submitted batch job 602

ilb@login-frida:~/mnist-demo$ tail -f exp/demo/v1/v1_20231213-2237.1.txt
...