This is a brief document on the usage of the A100 GPU nodes on Atlas. The A100 GPUs installed in Atlas are the 80 GB memory versions. There are 5 A100 nodes, each housing 8 A100 GPUs. 4 of these nodes have their A100s in MIG (Multi-Instance GPU) mode, meaning each A100 is partitioned into 7 individual GPU instances with 10 GB of memory attached to each instance. That gives each of these nodes 56 GPU instances. One of the nodes does not have its A100s in a MIG configuration, so it offers all 8 A100s in their non-MIG form. All A100 GPUs are located in the gpu-a100 partition.
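To see the A100 nodes and the GPU resources they advertise, you can query the gpu-a100 partition directly. The commands below are standard Slurm; the exact node names and Gres strings reported on your account may differ.

# List the nodes in the gpu-a100 partition along with their generic resources (Gres)
sinfo -p gpu-a100 -o "%N %G"

# Show the full Gres configuration of one node (node name shown is illustrative)
scontrol show node atlas-0241 | grep -i gres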
There are 224 total MIG instances across the 4 nodes with A100s in a MIG configuration. The name of each instance is a100_1g.10gb. In Slurm, GPUs are a type of Generic Resource, or gres, in your submission scripts. To allocate a single A100 MIG instance, you would request a gres with that resource type and name:
#SBATCH --gres=gpu:a100_1g.10gb:1
This would allocate 1 of the MIG instances. The format of the --gres specification is:
--gres=(type):(name-of-gres):(Number-requested)
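Following that same format, a request for two MIG instances would look like the line below. Whether a single job may request more than one instance depends on local scheduling policy, so treat the count as illustrative.

#SBATCH --gres=gpu:a100_1g.10gb:2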
There is one type of gres installed in Atlas, the gpu type, but there are three different names for these resources, depending on which GPU you intend to use:
v100 -> These are the existing V100 cards in the "gpu" partition.
a100 -> These are the full (non-MIG) A100 GPUs (e.g., --gres=gpu:a100:1); a batch-header sketch using this name follows the list below.
a100_1g.10gb -> This is the name of 1 A100 MIG instance
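As a point of comparison with the MIG request shown earlier, here is a minimal batch-header sketch for requesting one full (non-MIG) A100. The job name, core count, time limit, and account are placeholders; substitute your own values.

#!/bin/bash -l
#SBATCH -J Full-A100             # job name (placeholder)
#SBATCH -p gpu-a100              # all A100 GPUs are in this partition
#SBATCH --gres=gpu:a100:1        # one full (non-MIG) A100
#SBATCH -n 2                     # processor cores (placeholder)
#SBATCH -t 2:00:00               # time limit (placeholder)
#SBATCH -A your_account          # project/account (placeholder)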
Here is an example of an interactive allocation of 1 A100 MIG instance, utilizing 2 processor cores, with a time limit of 6 hours:
[joey.jones@atlas-login-2 ~]$ salloc -p gpu-a100 -n 2 --gres=gpu:a100_1g.10gb:1 -A admin -t 6:00:00
salloc: Pending job allocation 12163950
salloc: job 12163950 queued and waiting for resources
salloc: job 12163950 has been allocated resources
salloc: Granted job allocation 12163950
salloc: Waiting for resource configuration
salloc: Nodes atlas-0241 are ready for job
[joey.jones@atlas-login-2 ~]$ srun hostname
atlas-0241
atlas-0241
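Once inside the allocation, one way to confirm which GPU instance was assigned is to run nvidia-smi on the compute node. This assumes nvidia-smi is on the node's default path; the listing should show the A100 and the MIG device attached to the job.

[joey.jones@atlas-login-2 ~]$ srun --ntasks=1 nvidia-smi -L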
NVIDIA provides a number of containers that are optimized for the A100 hardware, and software you need may already be available as a package through the NVIDIA container repository. The following link points to the repository of available containers. Requests for containers to be made available can be submitted to the helpdesk: scinet_vrsc@usda.gov
NVIDIA Container Repository
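If you want to experiment before a container is installed centrally, Apptainer can also pull images from NVIDIA's NGC registry into your own space. This is a sketch only: it assumes the node you run it on has outbound network access, that you have sufficient disk quota for the image, and the tag shown is illustrative.

module load apptainer/1.0.2
apptainer pull pytorch-23.04.sif docker://nvcr.io/nvidia/pytorch:23.04-py3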
As an example, PyTorch is one of the optimized software packages that NVIDIA distributes. That container is installed in the A100 application tree under /apps/containers/ and can be accessed as follows.
module load apptainer/1.0.2
apptainer exec --nv /apps/containers/pytorch/pytorch-23.04.sif python3
After these commands, a Python prompt will be available with access to PyTorch:
>>> import torch
>>> torch.cuda.is_available()
True
>>>
It is also possible to run this from a script, in batch fashion:
$ cat myscript.py
import torch
print(torch.cuda.is_available())

$ module load apptainer/1.0.2
$ apptainer exec --nv /apps/containers/pytorch/pytorch-23.04.sif python3 myscript.py
True
And as an extension, this could be done via Slurm's batch method, using the same Python script:
$ cat sbatch.test
#!/bin/bash -l
#SBATCH -J Container-GPU
#SBATCH -N 1
#SBATCH -n 1
#SBATCH -p gpu-a100
#SBATCH --gres=gpu:a100_1g.10gb:1
#SBATCH -t 4:00:00
#SBATCH -A Admin

module purge
module load apptainer/1.0.2
srun apptainer exec --nv /apps/containers/pytorch/pytorch-23.04.sif python3 myscript.py

$ cat slurm-12169583.out
13:4: not a valid test operator: (
13:4: not a valid test operator: 525.105.17
True
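For reference, the workflow that produces an output file like the one shown above is the standard Slurm one: submit the script with sbatch and monitor it with squeue. The job ID in the output file name will match whatever ID sbatch reports at submission time.

$ sbatch sbatch.test
$ squeue -u $USER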
Note that utilizing containers on the MIG instances of the A100 nodes is largely experimental, and we are monitoring and updating with patches from the associated vendors whenever possible.