This is a brief document on the usage of the A100 GPU nodes on Atlas. The A100 GPUs installed in Atlas are the 80 GB memory versions. There are 5 A100 nodes, each housing 8 A100 GPUs. 4 of these nodes have their A100s in MIG (Multi-Instance GPU) mode, meaning each A100 is partitioned into 7 individual GPU instances with 10 GB of memory attached to each instance. That gives each of these nodes 56 GPU instances. One of the nodes does not have its A100s in a MIG configuration, so it offers all 8 A100s in their non-MIG form. All A100 GPUs are located in the gpu-a100 partition.
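To see the A100 nodes and the GPU resources they advertise, you can query the gpu-a100 partition directly. The commands below are standard Slurm; the exact node names and Gres strings reported on your account may differ.

# List the nodes in the gpu-a100 partition along with their generic resources (Gres)
sinfo -p gpu-a100 -o "%N %G"

# Show the full Gres configuration of one node (node name shown is illustrative)
scontrol show node atlas-0241 | grep -i gres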
There are 224 total MIG instances across the 4 nodes with A100s in a MIG configuration. The name of each instance is a100_1g.10gb. In Slurm, GPUs are a type of Generic Resource, or gres, in your submission scripts. To allocate a single A100 MIG instance, you would request a gres with that resource type and name:
#SBATCH --gres=gpu:a100_1g.10gb:1
This would allocate 1 of the MIG instances. The format of the --gres specification is:
--gres=(type):(name-of-gres):(Number-requested)
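Following that same format, a request for two MIG instances would look like the line below. Whether a single job may request more than one instance depends on local scheduling policy, so treat the count as illustrative.

#SBATCH --gres=gpu:a100_1g.10gb:2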
There is one type of gres installed in Atlas, the gpu type, but there are three different names for these resources, depending on which GPU you intend to use:
v100 -> These are the existing V100 cards in the "gpu" partition.
a100 -> These are the full (non-MIG) A100 GPUs (e.g., --gres=gpu:a100:1); a batch-header sketch using this name follows the list below.
a100_1g.10gb -> This is the name of 1 A100 MIG instance
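As a point of comparison with the MIG request shown earlier, here is a minimal batch-header sketch for requesting one full (non-MIG) A100. The job name, core count, time limit, and account are placeholders; substitute your own values.

#!/bin/bash -l
#SBATCH -J Full-A100             # job name (placeholder)
#SBATCH -p gpu-a100              # all A100 GPUs are in this partition
#SBATCH --gres=gpu:a100:1        # one full (non-MIG) A100
#SBATCH -n 2                     # processor cores (placeholder)
#SBATCH -t 2:00:00               # time limit (placeholder)
#SBATCH -A your_account          # project/account (placeholder)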
Here is an example of an interactive allocation of 1 A100 MIG instance, utilizing 2 processor cores, with a time limit of 6 hours:
[joey.jones@atlas-login-2 ~]$ salloc -p gpu-a100 -n 2 --gres=gpu:a100_1g.10gb:1 -A admin -t 6:00:00
salloc: Pending job allocation 12163950
salloc: job 12163950 queued and waiting for resources
salloc: job 12163950 has been allocated resources
salloc: Granted job allocation 12163950
salloc: Waiting for resource configuration
salloc: Nodes atlas-0241 are ready for job
[joey.jones@atlas-login-2 ~]$ srun hostname
atlas-0241
atlas-0241
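Once inside the allocation, one way to confirm which GPU instance was assigned is to run nvidia-smi on the compute node. This assumes nvidia-smi is on the node's default path; the listing should show the A100 and the MIG device attached to the job.

[joey.jones@atlas-login-2 ~]$ srun --ntasks=1 nvidia-smi -L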
NVIDIA provides a number of containers that are optimized for the A100 hardware, and software you need may already be available as a package through the NVIDIA container repository. The following link points to the repository of available containers. Requests for containers to be made available can be submitted to the helpdesk: scinet_vrsc@usda.gov
NVIDIA Container Repository
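If you want to experiment before a container is installed centrally, Apptainer can also pull images from NVIDIA's NGC registry into your own space. This is a sketch only: it assumes the node you run it on has outbound network access, that you have sufficient disk quota for the image, and the tag shown is illustrative.

module load apptainer/1.0.2
apptainer pull pytorch-23.04.sif docker://nvcr.io/nvidia/pytorch:23.04-py3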
As an example, PyTorch is one of the optimized software packages that NVIDIA distributes. That container is installed in the A100 application tree under /apps/containers/ and can be accessed as follows.
module load apptainer/1.0.2
apptainer exec --nv /apps/containers/pytorch/pytorch-23.04.sif python3
After these commands, a Python prompt will be available with access to PyTorch:
>>> import torch
>>> torch.cuda.is_available()
True
>>>
It is also possible to run this from a script, in batch fashion:
$ cat myscript.py
import torch
print(torch.cuda.is_available())

$ module load apptainer/1.0.2
$ apptainer exec --nv /apps/containers/pytorch/pytorch-23.04.sif python3 myscript.py
True
And as an extension, this could be done via Slurm's batch method, using the same Python script:
$ cat sbatch.test
#!/bin/bash -l
#SBATCH -J Container-GPU
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p gpu-a100
#SBATCH --gres=gpu:a100_1g.10gb:1
#SBATCH -t 4:00:00
#SBATCH -A Admin

module purge
module load apptainer/1.0.2
srun apptainer exec --nv /apps/containers/pytorch/pytorch-23.04.sif python3 myscript.py

$ cat slurm-12169583.out
13:4: not a valid test operator: (
13:4: not a valid test operator: 525.105.17
True
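For reference, the workflow that produces an output file like the one shown above is the standard Slurm one: submit the script with sbatch and monitor it with squeue. The job ID in the output file name will match whatever ID sbatch reports at submission time.

$ sbatch sbatch.test
$ squeue -u $USER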
Note that utilizing containers on the MIG instances of the A100 nodes is largely experimental, and we are monitoring and updating with patches from the associated vendors whenever possible.