Atlas

Atlas Changes (2024/05/03)


Note the following changes to the Atlas compute cluster:

OS: The Rocky Linux 9.1 distribution is now used.

SW: a new software stack has been installed with many version changes.
Use the module commands to see available software.
examples: "module av" and "module spider _pkg_"
Expect that most user codes will need to be recompiled.
Submit a help ticket for any software issues or to request additional software.

Partition changes:

development-gpu partition has been removed and the node added to the gpu-v100 partition
A combination of the other gpu partitions and the qos's should now be utilized

gpu partition renamed to gpu-v100 to better reflect the gpu type utilized

gpu-a100-mig7 partition created to distinguish the A100 GPUs that are
configured for Multi-Instance GPU (MIG) use.  The number of nodes in this partition has been reduced
from 4 to 2, with the removed nodes added to the gpu-a100 partition

gpu-a100 now has 3 nodes, each with 8 full A100 GPUs available.



ssh host keys:
All service nodes (atlas-login-[1-2], atlas-dtn-[1-2], atlas-devel-[1-2])
have updated ssh host keys, and the new keys must be accepted on the next connection.
On Linux systems, the old keys can be removed using the following commands:

ssh-keygen -R atlas-devel.hpc.msstate.edu
ssh-keygen -R atlas-devel-1.hpc.msstate.edu
ssh-keygen -R atlas-devel-2.hpc.msstate.edu
ssh-keygen -R atlas-dtn.hpc.msstate.edu
ssh-keygen -R atlas-dtn-1.hpc.msstate.edu
ssh-keygen -R atlas-dtn-2.hpc.msstate.edu
ssh-keygen -R atlas-login.hpc.msstate.edu
ssh-keygen -R atlas-login-1.hpc.msstate.edu
ssh-keygen -R atlas-login-2.hpc.msstate.edu
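
The nine commands above follow a regular naming pattern, so they can also be generated in a loop. The following is a minimal sketch; the function names atlas_service_hosts and remove_old_atlas_keys are illustrative and are not part of any Atlas tooling.

```shell
# Print the nine service-node hostnames listed above, one per line.
atlas_service_hosts() {
    local role n
    for role in devel dtn login; do
        echo "atlas-${role}.hpc.msstate.edu"
        for n in 1 2; do
            echo "atlas-${role}-${n}.hpc.msstate.edu"
        done
    done
}

# Remove any stale host key for each name from ~/.ssh/known_hosts.
remove_old_atlas_keys() {
    local host
    for host in $(atlas_service_hosts); do
        ssh-keygen -R "$host"
    done
}
```

Calling remove_old_atlas_keys once on a workstation should have the same effect as running the individual commands.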

Atlas is a Cray CS500 Linux Cluster with 11,520 2.40GHz Xeon Platinum 8260 processor cores, 101 terabytes of RAM, 8 NVIDIA V100 GPUs, and a Mellanox HDR100 InfiniBand interconnect. Atlas has a peak performance of 565 TeraFLOPS. Atlas also has an additional five HPE Apollo XL675d nodes equipped with eight NVIDIA A100 GPUs per node, for a total of 40 physical GPUs. Additional information available here.

Atlas is composed of 240 nodes, two login nodes, and two data transfer nodes. All nodes contain two 2.40GHz Xeon Platinum 8260 2nd Generation Scalable Processors. The processors contain 24 cores each, for a total of 48 cores per node. All Atlas Memory is DDR4-2933 2R RDIMM. The 228 standard Atlas compute nodes plus the login nodes have a total of 384GB RAM in each node. The 2 Data Transfer nodes have a total of 192GB RAM in each node. The 8 Big Mem nodes have a total of 1536GB RAM in each node. The 4 GPU nodes have 384GB RAM and 2 NVIDIA V100 GPUs in each node. The 5 A100 GPU nodes have 1152GB RAM and 8 NVIDIA A100 GPUs in each node.
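
The headline figures for the Cray CS500 portion of Atlas can be cross-checked from the per-node numbers in the paragraph above. This is only a sanity-check sketch; the variable names are illustrative, and the A100 nodes are left out because they are described separately.

```shell
# Node counts and per-node figures are taken from the paragraph above.
standard_nodes=228
bigmem_nodes=8
v100_nodes=4
cores_per_node=48

cray_nodes=$((standard_nodes + bigmem_nodes + v100_nodes))
total_cores=$((cray_nodes * cores_per_node))
total_ram_gb=$((standard_nodes * 384 + bigmem_nodes * 1536 + v100_nodes * 384))
total_v100=$((v100_nodes * 2))

echo "nodes=$cray_nodes cores=$total_cores ram_gb=$total_ram_gb v100=$total_v100"
# prints: nodes=240 cores=11520 ram_gb=101376 v100=8
```

The 101376 GB total corresponds to the roughly 101 terabytes of RAM quoted in the overview.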

Nodes and Specifications

Node Names	Node Type	Cores/Node, CPU Type	Memory/Node, Configuration	GPUs/Node
atlas-login-[1-2]	Login	48 cores (2x24 core 2.40GHz Intel Cascade Lake Xeon Platinum 8260)	384 GB (12x 32GB DDR-4 Dual Rank 2933MHz)	N/A
atlas-dtn-[1-2]	Data Transfer	48 cores (2x24 core 2.40GHz Intel Cascade Lake Xeon Platinum 8260)	192 GB (12x 16GB DDR-4 Dual Rank 2933MHz)	N/A
atlas-devel-[1-2]	Development	48 cores (2x24 core 2.40GHz Intel Cascade Lake Xeon Platinum 8260)	768 GB (12x 64GB DDR-4 Dual Rank 2933MHz)	N/A
atlas-[0001-0228]	Compute	48 cores (2x24 core 2.40GHz Intel Cascade Lake Xeon Platinum 8260)	384 GB (12x 32GB DDR-4 Dual Rank 2933MHz)	N/A
atlas-[0229-0236]	Compute (Big Mem)	48 cores (2x24 core 2.40GHz Intel Cascade Lake Xeon Platinum 8260)	1536 GB (24x 64GB DDR-4 Dual Rank 2933MHz)	N/A
atlas-[0237-0240]	Compute (GPU V100)	48 cores (2x24 core 2.40GHz Intel Cascade Lake Xeon Platinum 8260)	384 GB (12x 32GB DDR-4 Dual Rank 2933MHz)	2x NVIDIA V100
atlas-[0241-0242]	Compute (GPU A100, MIG=7)	128 cores (2x64 core 2.00GHz AMD EPYC Milan 7713)	1 TB (16x 64GB DDR-4 Dual Rank 3200MHz)	8x NVIDIA A100 (80GB) mig=7
atlas-[0243-0245]	Compute (GPU A100)	128 cores (2x64 core 2.00GHz AMD EPYC Milan 7713)	1 TB (16x 64GB DDR-4 Dual Rank 3200MHz)	8x NVIDIA A100 (80GB) mig=1

Partitions and Limits

Partition	TotalNodes	Nodes	MaxNodes (Per Job)	MaxTime	DefMemPerCPU	Allowed Qos
atlas*	228	atlas-[0001-0228]	QoS limited	QoS limited	7686	ALL
bigmem	8	atlas-[0229-0236]	QoS limited	QoS limited	31879	ALL
gpu-v100	4	atlas-[0237-0240]	QoS limited	QoS limited	7686	ALL
gpu-a100-mig7	2	atlas-[0241-0242]	QoS limited	QoS limited	7686	ALL
gpu-a100	3	atlas-[0243-0245]	QoS limited	QoS limited	7686	ALL
development	2	atlas-devel-[1-2]	QoS limited	QoS limited	15710	ALL
service	2	atlas-dtn-[1-2]	QoS limited	QoS limited	3827	ALL

* Default Partition

QoS's and Limits

QoS	Priority	MaxNodes	MaxTime	Notes
normal	20	30	14 Days	Default QoS
debug	30	3	30 Minutes	Debugging Jobs MaxJobsPerUser=2
ood	1	1	8 Days	Open OnDemand Interactive Desktop Jobs MaxCPUSPerNode=2
special	20	UNLIMITED	UNLIMITED	By request and approval only
priority	100	UNLIMITED	UNLIMITED	By request and approval only
sandbox	10	1	2 Hours	MaxJobsPerUser=2 MaxJobsPerAccount=10
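
Because the partitions are all "QoS limited", the QoS table is what determines how many nodes a job may request. As a rough sketch, the MaxNodes column can be encoded in a small helper that checks a request before submission. The function names are illustrative, and the limits are copied from the table above and may change.

```shell
# MaxNodes per QoS, copied from the table above.
qos_max_nodes() {
    case "$1" in
        normal)           echo 30 ;;
        debug)            echo 3 ;;
        ood|sandbox)      echo 1 ;;
        special|priority) echo UNLIMITED ;;
        *)                return 1 ;;   # unknown QoS name
    esac
}

# fits_qos <qos> <nodes>: succeeds when the node count is within the QoS limit.
fits_qos() {
    local max
    max=$(qos_max_nodes "$1") || return 2
    [ "$max" = UNLIMITED ] || [ "$2" -le "$max" ]
}
```

For example, fits_qos debug 4 || echo "too many nodes for the debug QoS" would print the warning, since debug allows at most 3 nodes.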

Accessing Atlas

Currently, only 4 machines from the Atlas cluster can be accessed from outside of the Mississippi State HPC2 Network. The two login nodes and two DTN nodes can be accessed by connecting via ssh:

 ssh <SCINet UserID>@Atlas-login.hpc.msstate.edu 
 ssh <SCINet UserID>@Atlas-dtn.hpc.msstate.edu

For older Microsoft Windows machines, we recommend using PuTTY or OpenSSH (see the SCINet Quick Start Guide). When you log in, you will be on an Atlas login node. The login node is a shared resource among all SCINet users who are currently logged in to the system. Please do NOT run computationally or memory-intensive tasks on the login node, as this will negatively impact performance for all other users on the system. See the Slurm section for instructions on how to run such tasks on compute nodes.

File Transfers

Globus Online can be used to transfer data to and from the Atlas cluster. The two DTN nodes on Atlas can also be accessed from outside of the Mississippi State HPC2 network via a single Globus endpoint:

 msuhpc2#Atlas-dtn

For small amounts of data that need to be transferred to user home directories, the scp command can be used. This command copies files between hosts on a network. It uses ssh for data transfer, and uses the same authentication and provides the same security as ssh. SCP will ask for passwords as well as two-factor authentication codes.

To copy a file from a remote host to local host:

 $ scp  <username>@<remotehost>:/path/to/file.txt  /local/directory/

To copy a file from a local host to a remote host:

 $ scp  /path/to/file.txt  <username>@<remotehost>:/remote/directory/

To copy a directory from a remote host to local host:

 $ scp  -r  <username>@<remotehost>:/remote/directory  /local/directory

To copy a directory from a local host to a remote host:

 $ scp  -r  /local/directory  <username>@<remotehost>:/remote/directory
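
For Atlas specifically, the remote host in the commands above would normally be the DTN address given under Accessing Atlas. A small sketch follows; the helper name atlas_remote, the user ID first.last, and the results directory are all hypothetical placeholders.

```shell
# atlas_remote <SCINet UserID> <remote path>
# Prints an scp-style remote spec that points at the Atlas DTN address.
atlas_remote() {
    printf '%s@atlas-dtn.hpc.msstate.edu:%s\n' "$1" "$2"
}

# Example (not run here): copy a local directory into the remote home directory.
# A relative remote path is resolved against the user's home directory.
#   scp -r ./results "$(atlas_remote first.last results/)"
```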

Internet Connectivity

On this cluster, only certain nodes are reachable from the internet. Any software packages, libraries, or datasets needed for jobs or software development can be downloaded on the login nodes, devel nodes, or dtn nodes. The compute nodes of the cluster are on a private network, and they are unreachable from the internet.

Modules

Atlas uses LMOD as an environment module system. For a guide on how to use LMOD to set up the programming environment, please refer to the official LMOD User Guide.

Atlas uses a hierarchy based on the Compilers and MPI implementations. Software in the Core tree is built using the default system compilers. Software built against a specific compiler is available only after that compiler module has been loaded. Software built against a specific MPI implementation is available only after that MPI module has been loaded. Information on available modules can be found with the "module avail" and "module spider" commands:

 $ module spider mesa

 ---------------------------------------------
   mesa: mesa/20.1.6
 ---------------------------------------------
    This module can be loaded directly: module load mesa/20.1.6

    Additional variants of this module can also be loaded after loading the following modules:
      gcc/10.2.0
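
The hierarchy means that the order of module commands matters: the compiler is loaded first, and only then do the builds that depend on it become visible. A minimal sketch, assuming the gcc/10.2.0 and mesa names from the sample output above still exist in the current software stack (versions may have changed); list_with_compiler is an illustrative name.

```shell
# list_with_compiler <compiler/version> <package>
# Loads a compiler module, then lists the builds of <package> that become visible.
list_with_compiler() {
    module purge          # start from a clean environment
    module load "$1"      # expose the software tree built with this compiler
    module avail "$2"     # list matching modules now on the module path
}

# Example (on Atlas): list_with_compiler gcc/10.2.0 mesa
```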

Quota

Each user has a home folder with a 10GB quota. Once this limit is exceeded, the quota will be enforced and users will not be able to create any more files until they have cleared enough disk space to be back under their allotted quota. The command to see quota usage (in human readable format) is:

 $ quota -s

Each project directory under /project also has a quota. This quota is applied to the entire project space, and is not set on a per-user basis. The default storage space for each project on Atlas is 1 TB. The usage and quota of all projects can be checked by running the following script:

 $ /apps/bin/reportFSUsage

Specific projects can also be added as an argument to the script. With the -p flag, it will display information for single projects as well as comma-separated lists of project names:

 $ /apps/bin/reportFSUsage -p proj1,proj2,proj3

 ------------------------------------------------------------------------------------
 Directory/Group             Usage(GB)   Quota(GB)   Limit(GB)      Files  Percentage
 ------------------------------------------------------------------------------------
 proj1                          41417       92160      102400    2211354        44.9
 proj2                          18287       23040       25600    1816769        79.4
 proj3                              0         922        1024          1         0.0
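
The report can be filtered to find projects that are close to their quota. The sketch below assumes the six-column layout shown in the sample output above, with the percentage in the last column; the function name projects_over is illustrative.

```shell
# projects_over <percent>
# Reads reportFSUsage output on stdin and prints the names of projects
# whose Percentage column exceeds the given threshold.
projects_over() {
    awk -v limit="$1" '
        NF == 6 && $6 ~ /^[0-9.]+$/ && ($6 + 0) > (limit + 0) { print $1 }
    '
}

# Example (on Atlas):
#   /apps/bin/reportFSUsage -p proj1,proj2,proj3 | projects_over 75
```

Header and separator lines are skipped because their last field is not numeric. With the sample output above, a threshold of 75 would list only proj2.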

Project Space

On Atlas, the parallel filesystem is mounted as /project. All projects have their own directory located at /project/<projectname>/, and have quotas that are set by VRSC. This filesystem is considered a 'temporary' or 'scratch' filesystem.

DATA IN THIS LOCATION IS NOT BACKED UP AND CANNOT BE RESTORED IF DELETED!

It is important that users submit and run jobs from their respective /project directories instead of their home directories. The /home filesystem is not designed or configured for high performance use, nor does it have much space on it. Home directories will run out of space quickly on parallel jobs and will cause jobs to fail. After useful data is generated from compute jobs, it is recommended that users transfer this data to a more long-term storage location.

/project/reference/data is the location for reference datasets. It is sync'd nightly from ceres.scinet.usda.gov:/reference/data

Local Scratch Space

Atlas compute nodes provide up to 2TB of local disk space at /local/scratch that may be used for temporary storage. However, data stored in this location is not backed up and is susceptible to loss due to disk failure or corruption. Each job sets up a unique local space, available only within the job script, via the $TMPDIR environment variable:


 TMPDIR=/local/scratch/$SLURM_JOB_USER/$SLURM_JOB_ID

You can use this for any scratch disk space you need, or if you plan to compute on an existing large data set (such as a sequence assembly job) it might be beneficial to copy all your input data to this space at the beginning of your job, and then do all your computation on $TMPDIR. You must copy any output data you need to keep back to permanent storage before the job ends, since $TMPDIR will be erased upon job exit. The following example shows how to copy data in, and then run from $TMPDIR:

 #!/bin/bash -l

 #SBATCH --job-name="TMPDIR example"
 #SBATCH --partition=atlas
 #SBATCH --account=projectname
 #SBATCH --nodes=1
 #SBATCH --ntasks=48
 #SBATCH --time=08:00:00

 # Always good practice to reset environment when you start
 module purge

 # Stage the contents of the submission directory into the job's $TMPDIR
 MYDIR=$(pwd)
 /bin/cp -r "$MYDIR"/. "$TMPDIR"/
 cd "$TMPDIR" || exit 1

 # add regular job commands like module load
 # and commands to launch scientific software

 # Copy output data off of local scratch before the job ends
 /bin/cp -r output "$MYDIR"/

$TMPDIR is defined as the above directory at the beginning of every job, before the job scripts are executed. Users must overwrite this definition inside of the batch script itself if TMPDIR needs to be set to a different location.
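
Overriding the definition amounts to reassigning and exporting TMPDIR near the top of the batch script. A minimal sketch follows; the helper name use_job_tmpdir and the example base directory are hypothetical. A directory created this way is presumably not cleaned up automatically at job exit, so the script should remove it when finished.

```shell
# use_job_tmpdir <base-dir>
# Points TMPDIR at <base-dir>/<job id> and creates the directory.
# Falls back to the name "manual" when run outside of a Slurm job.
use_job_tmpdir() {
    TMPDIR="$1/${SLURM_JOB_ID:-manual}"
    mkdir -p "$TMPDIR" && export TMPDIR
}

# Example (inside a batch script; the path is a placeholder):
#   use_job_tmpdir /project/projectname/tmp
```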

Arbiter

On each login node, we have a utility named Arbiter which regulates activity by monitoring and limiting resource consumption via cgroups. Users are limited to using 4 CPU cores and 50 GB of memory at a time while their status is "normal". When a user uses more than half of their cap for 10 minutes, they are sent a warning email and penalized by having their usage caps reduced. Each violation of this usage policy results in an occurrence, which raises the penalty level. A user's occurrence level will drop by 1 for every 3 hours that they go without triggering another usage violation. The following table outlines the penalty/status levels that we currently have defined.

Status	CPU Cap	Memory Cap	Penalty Timeout
Normal	4 Cores	50 GB	N/A
Penalty1	3 Cores	40 GB	30 Minutes
Penalty2	2 Cores	25 GB	1 Hour
Penalty3	1 Cores	15 GB	2 Hours
Penalty4	0.2 Cores	5 GB	4 Hours

Certain programs, such as compilers and build utilities, are whitelisted. These whitelisted programs will not cause the user to be penalized. Login nodes can still be freely used to build and test software. The purpose of Arbiter is to identify and limit computationally intensive jobs which should be run on the compute nodes instead of the login and development nodes.

Slurm

Atlas uses the Slurm Workload Manager as a scheduler and resource manager. For a guide on how to use the Slurm system to submit and run jobs on this cluster, please refer to the official Slurm Quickstart Guide.

Slurm has three primary job allocation commands which accept almost identical options:

SBATCH Submits a job runscript for later execution (batch mode)
SALLOC Creates a job allocation and starts a shell to use it (interactive mode)
SRUN Creates a job allocation and launches the job step (typically an MPI job)

The salloc command is configured to have the default functionality on Atlas. The salloc command allocates resources for the job, but spawns a shell on the login node with various Slurm environment variables set. Job steps can be launched from the salloc shell with the srun command.

Example salloc usage:

 jake.frulla@Atlas-login-1 ~$ salloc -A admin
 salloc: Pending job allocation 527990
 salloc: job 527990 queued and waiting for resources
 salloc: Granted job allocation 527990
 salloc: Waiting for resource configuration
 salloc: Nodes Atlas-0029 are ready for job

 jake.frulla@Atlas-login-1 ~$ srun hostname
 
 Atlas-0029.HPC.MsState.Edu

The srun command can be used to launch an interactive shell on an allocated node or set of nodes. Simply specify the --pty option while launching a shell (such as bash) with srun. It is also recommended to set the wallclock limit along with the number of nodes and processors needed for the interactive shell.

Example interactive shell:

 jake.frulla@Atlas-login-1 ~$ srun -A admin --pty --preserve-env bash
 srun: job 527987 queued and waiting for resources
 srun: job 527987 has been allocated resources

 jake.frulla@Atlas-0029 ~$  hostname
 
 Atlas-0029.HPC.MsState.Edu
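
Following the recommendation to set a wallclock limit and resource counts, an interactive request might be wrapped as below. This is a sketch only; interactive_shell is an illustrative name and projectname is a placeholder account. The -N, -n and -t options are the standard srun flags for nodes, tasks and time limit.

```shell
# interactive_shell <account> <nodes> <tasks> <HH:MM:SS>
# Requests an interactive bash shell with explicit resource limits.
interactive_shell() {
    srun -A "$1" -N "$2" -n "$3" -t "$4" --pty --preserve-env bash
}

# Example (on an Atlas login node): one node, four tasks, one hour.
#   interactive_shell projectname 1 4 01:00:00
```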

When running batch jobs, it is necessary to interact with the job queue. It is usually helpful to be able to see information about the system, the queue, the nodes, and your job. This can be accomplished with a set of important commands:

SQUEUE Displays information about jobs in the scheduling queue.
SJSTAT Displays short summary of running jobs and scheduling pool data.
SHOWUSERJOBS Displays short summary of jobs by user and account, along with a summary of node state.
SHOWPARTITIONS Displays short summary and current state of the available partitions.
SSTAT Displays information about specific jobs.
SINFO Reports system status (nodes, queues, etc).
SACCT Displays accounting information from the Slurm database.

Each of these commands has a variety of functions, options, and filters that refine the information returned and displayed. Users can customize filtering, sorting, and output format using command line options or environment variables. Please consult the man page of each command or the Slurm Documentation for more information on using these commands.
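
As a starting point, a few common invocations are sketched below using standard Slurm options (-u for a user filter, -p for a partition, -j and --format for accounting output). The wrapper name job_overview is illustrative, and the set of fields is just one reasonable choice.

```shell
# job_overview <jobid>
# Shows the caller's queued jobs, the state of the default partition,
# and the accounting record for one job.
job_overview() {
    squeue -u "$USER"
    sinfo -p atlas
    sacct -j "$1" --format=JobID,JobName,State,Elapsed
}

# Example (on Atlas): job_overview 527990
```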

The default walltime is 15 minutes. Any jobs that do not specify a walltime will be terminated 15 minutes after starting.
The default allocation is 1 node. Any jobs that do not specify the number of nodes will run on one node.
The default number of tasks is 1. Any jobs that do not specify the number of tasks will run on only 1 core.

When submitting jobs, all users must specify a valid account that they are associated with.

To see which accounts you are on, along with valid QoS's for that account, use the following command:

 $ sacctmgr show associations where user=$USER format=account%20,qos%50
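
Since the defaults are small (15 minutes, 1 node, 1 task) and an account is mandatory, it is usually safest to state all four explicitly. The sketch below prints the top of a batch script with those values filled in; job_header is an illustrative name and projectname is a placeholder account.

```shell
# job_header <account> <nodes> <ntasks> <HH:MM:SS>
# Prints the first lines of a batch script with every default made explicit.
job_header() {
    cat <<EOF
#!/bin/bash -l
#SBATCH --account=$1
#SBATCH --nodes=$2
#SBATCH --ntasks=$3
#SBATCH --time=$4
EOF
}

# Example: start a new job script, then append the job commands to it.
#   job_header projectname 2 96 04:00:00 > job.sh
```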

Nodesharing

Currently, the atlas partition is not set to assign users exclusive nodes by default. Users will only get the amount of cores specified per node and will leave the rest of the cores on the nodes unallocated and available for other users' jobs.

Slurm allocates all of a node's memory by default, so in order to take advantage of nodesharing, users must specify the memory required per node for their jobs using the --mem option in their runscript or srun command. Specifying a memory limit with the --mem option will ensure that user jobs are allocated the amount specified. For example, if a user's job only needs 150 GB of memory per node, the user can pass the option to srun as follows (the equivalent runscript directive is "#SBATCH --mem=150G"):

 $ srun -n 10 -N 2 --mem=150G ./example_program

If a user requests 10 cores and 50 GB of memory for one job, along with 10 cores and 50 GB of memory for a second job, then both of these jobs may run on the same node. The same principle would also work for jobs owned by two different users.
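
The reasoning is simple addition against a node's capacity. The sketch below uses the installed figures for a standard compute node (48 cores, 384 GB); the memory Slurm can actually allocate is likely somewhat lower, so treat the result as an upper bound. The function name is illustrative.

```shell
# fits_on_standard_node <cores> <mem_gb> [<cores> <mem_gb> ...]
# Succeeds when the summed requests fit within one standard compute node.
fits_on_standard_node() {
    local cores=0 mem=0
    while [ "$#" -ge 2 ]; do
        cores=$((cores + $1))
        mem=$((mem + $2))
        shift 2
    done
    [ "$cores" -le 48 ] && [ "$mem" -le 384 ]
}

# The two jobs from the paragraph above: 10 cores and 50 GB each.
fits_on_standard_node 10 50 10 50 && echo "both jobs can share one node"
```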

In order to disallow sharing the remainder of the cores while running on less than 48 cores, users must specify the "--exclusive" option in their runscripts or in their salloc/srun commands:

 $ srun -n 10 -N 1 --exclusive ./example_program

The gpu and bigmem partitions will give users exclusive nodes by default.

Job Dependencies and Pipelines

Job dependencies are used to defer the start of a job until the specified dependencies have been satisfied. They are specified with the --dependency option to sbatch in the format:

 sbatch --dependency=<type:job_id[:job_id][,type:job_id[:job_id]]> ...

Dependency types:

after:jobid[:jobid...]	job can begin after the specified jobs have started
afterany:jobid[:jobid...]	job can begin after the specified jobs have terminated
afternotok:jobid[:jobid...]	job can begin after the specified jobs have failed
afterok:jobid[:jobid...]	job can begin after the specified jobs have run to completion with an exit code of zero
singleton	jobs can begin execution after all previously launched jobs with the same name and user have ended

Job dependencies are useful in setting up job pipelines. If a particular job needs a dataset downloaded before it can run, this must be submitted as two jobs: the first job downloads the dataset on the DTN nodes in the service partition and the second job operates on the dataset retrieved by the first job. To set up pipelines using job dependencies, the most useful types are afterany, afterok and singleton. The simplest way is to use the afterok dependency for single consecutive jobs. For example:

 $ sbatch job1.sh
11254323

 $ sbatch --dependency=afterok:11254323 job2.sh

Now when job1 ends with an exit code of zero, job2 will become eligible for scheduling. However, if job1 fails (ends with a non-zero exit code), job2 will not be scheduled but will remain in the queue and needs to be canceled manually. As an alternative, the afterany dependency can be used and checking for successful execution of the prerequisites can be done in the jobscript itself.
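
Copying the job ID by hand can be avoided by capturing it at submission time. The sketch below assumes sbatch's --parsable option, which prints the job ID, optionally followed by a semicolon and a cluster name; submit_pipeline is an illustrative name.

```shell
# submit_pipeline <first-script> <second-script>
# Submits the first job, then submits the second so that it only becomes
# eligible once the first has finished with an exit code of zero.
submit_pipeline() {
    local first
    first=$(sbatch --parsable "$1") || return 1
    first=${first%%;*}        # drop a trailing ";cluster" if one is present
    sbatch --dependency="afterok:${first}" "$2"
}

# Example (on Atlas): submit_pipeline job1.sh job2.sh
```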

Container Notes

Containers are a portable method for running software on separate machines in a reproducible manner. Atlas provides Apptainer, from the original Singularity developers, for running containerized packages. It can be loaded into a user's environment with the following command:

module load apptainer/1.1.9

Apptainer is configured on Atlas such that users do not have to define additional environment variables to have access to their working folders in the container. However, users wishing to utilize the "remote build" features will need to unset the APPTAINER_BIND variable.
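
Once the module is loaded, a command can be run inside a container image with apptainer exec. A minimal sketch follows; the wrapper name run_in_container and the image name tools.sif are hypothetical, and the module version is the one given above.

```shell
# run_in_container <image.sif> <command> [args...]
# Loads the Apptainer module, then runs the command inside the image.
run_in_container() {
    module load apptainer/1.1.9
    apptainer exec "$@"
}

# Example (on Atlas; tools.sif is a placeholder image):
#   run_in_container tools.sif python3 --version
```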

Many containers are available on Atlas, and inquiries about accessing existing containers or adding new containers may be submitted by emailing scinet_vrsc@usda.gov.
