Atlas

Atlas is a Cray CS500 Linux Cluster with 11,520 2.40GHz Xeon Platinum 8260 processor cores, 101 terabytes of RAM, 8 NVIDIA V100 GPUs, and a Mellanox HDR100 InfiniBand interconnect. Atlas has a peak performance of 565 TeraFLOPS.

Node Specifications

Atlas is composed of 240 compute nodes, two login nodes, and two data transfer nodes. All nodes contain two 2.40GHz Xeon Platinum 8260 2nd Generation Scalable Processors with 24 cores each, for a total of 48 cores per node. All Atlas memory is DDR4-2933 2R RDIMM. The 228 standard compute nodes and the two login nodes each have 384 GB of RAM. The two data transfer nodes each have 192 GB of RAM. The eight big mem nodes each have 1536 GB of RAM. The four GPU nodes each have 384 GB of RAM and two NVIDIA V100 GPUs.

Node Name(s)        Node Type                 Memory    GPUs
atlas-login         Login Nodes               384 GB    N/A
atlas-dtn           Data Transfer Nodes       192 GB    N/A
atlas-[0001-0228]   Compute Nodes             384 GB    N/A
atlas-[0229-0236]   Compute Nodes (Big Mem)   1536 GB   N/A
atlas-[0237-0240]   Compute Nodes (GPU)       384 GB    NVIDIA V100 x 2

Accessing Atlas

Currently, only four machines in the Atlas cluster can be accessed from outside of the Mississippi State HPC2 network: the two login nodes and the two data transfer (DTN) nodes. They can be reached by connecting via ssh:

 ssh <username>@Atlas-login.hpc.msstate.edu
 ssh <username>@Atlas-dtn.hpc.msstate.edu

File Transfers

Globus Online can be used to transfer data to and from the Atlas cluster. The two DTN nodes on Atlas can also be accessed from outside of the Mississippi State HPC2 network via a single Globus endpoint:

 msuhpc2#Atlas-dtn
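
For larger or scripted transfers, the Globus CLI can also be used. The sketch below is illustrative only: it assumes the globus-cli tool is installed and authenticated (via globus login), and the endpoint UUIDs and paths are placeholders you would look up yourself.

 $ globus endpoint search "msuhpc2#Atlas-dtn"       # find the Atlas DTN endpoint UUID
 $ globus transfer <source-endpoint-uuid>:/path/to/file.dat \
       <atlas-dtn-endpoint-uuid>:/project/<projectname>/file.dat --label "to Atlas"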

For small amounts of data that need to be transferred to user home directories, the scp command can be used. scp copies files between hosts on a network; it uses ssh for data transfer and provides the same authentication and security as ssh. scp will prompt for your password as well as your two-factor authentication code.

To copy a file from a remote host to a local host:

 $ scp  <username>@<remotehost>:/path/to/file.txt  /local/directory

To copy a file from a local host to a remote host:

 $ scp  /path/to/file.txt  <username>@<remotehost>:/remote/directory

To copy a directory from a remote host to a local host:

 $ scp  -r  <username>@<remotehost>:/remote/directory  /local/directory

To copy a directory from a local host to a remote host:

 $ scp  -r  /local/directory  <username>@<remotehost>:/remote/directory

Modules

Atlas uses LMOD as an environment module system. For a guide on how to use LMOD to set up the programming environment, please refer to the official LMOD User Guide.

Atlas uses a hierarchy based on compilers and MPI implementations. Software in the Core tree is built using the default system compilers. Software built against a specific compiler is available only after that compiler module has been loaded, and software built against a specific MPI implementation is available only after that MPI module has been loaded. Information on available modules can be found with the "module avail" and "module spider" commands:

 $ module spider mesa

 ---------------------------------------------
   mesa: mesa/20.1.6
 ---------------------------------------------
    This module can be loaded directly: module load mesa/20.1.6

    Additional variants of this module can also be loaded after loading the following modules:
      gcc/10.2.0
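
For example, a module in the gcc branch of the hierarchy becomes visible only after that compiler is loaded. The session below is a minimal sketch; the openmpi module name is an assumption and may differ from what is actually installed on Atlas.

 $ module avail                  # Core tree: software built with the system compilers
 $ module load gcc/10.2.0        # expose software built against this compiler
 $ module avail
 $ module load openmpi           # assumed MPI module name; exposes MPI-dependent software
 $ module list                   # show the currently loaded modules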

Quota

Each user has a home directory with a 5 GB quota. Once usage exceeds 5 GB, the quota is enforced and the user cannot create any more files until enough disk space has been cleared to fall back under the 5 GB limit. The command to see quota usage (in human-readable format) is:

 $ quota -s
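
If the quota has been exceeded, summarizing directory sizes can help locate what to clean up. This is a generic sketch, not an Atlas-specific tool:

 $ du -sh ~/* | sort -h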

Project Space

On Atlas, the parallel filesystem is mounted as /project. All projects have their own directory located at /project/<projectname>/. Unlike users' home directories, /project does not have a quota on the amount of data that can be stored, but this filesystem is considered a 'temporary' or 'scratch' filesystem.

DATA IN THIS LOCATION IS NOT BACKED UP AND CANNOT BE RESTORED IF DELETED!

It is important that users submit and run jobs from their respective /project directories instead of their home directories. The /home filesystem is not designed or configured for high-performance use, nor does it have much space. Home directories will quickly run out of space during parallel jobs, causing those jobs to fail. After useful data has been generated by compute jobs, it is recommended that users transfer it to a more permanent, long-term storage location.
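
A minimal sketch of a batch script that runs from a /project directory (the account name, project name, and executable are placeholders; adjust them for your own allocation):

 #!/bin/bash
 #SBATCH --job-name=example
 #SBATCH --account=<projectaccount>
 #SBATCH --partition=atlas
 #SBATCH --nodes=1
 #SBATCH --ntasks=48
 #SBATCH --time=01:00:00

 # Run from the project (scratch) filesystem, not /home
 cd /project/<projectname>
 srun ./my_application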

/project/reference/data is the location for reference datasets. It is synced nightly from ceres.scinet.usda.gov:/reference/data.

Local Disk Storage

Atlas compute nodes provide a limited amount of local disk space at /local/scratch that may be used for temporary storage. However, data stored in this location is not backed up and is susceptible to loss due to disk failure or corruption.
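
A sketch of staging data through local scratch inside a job script (the per-job subdirectory layout is an assumption, not a documented Atlas convention):

 # Stage input to node-local scratch, compute there, then copy results back
 LOCAL_DIR=/local/scratch/$USER/$SLURM_JOB_ID
 mkdir -p "$LOCAL_DIR"
 cp /project/<projectname>/input.dat "$LOCAL_DIR"
 cd "$LOCAL_DIR"
 ./my_application input.dat            # placeholder executable
 cp results.out /project/<projectname>/
 rm -rf "$LOCAL_DIR"                   # clean up before the job ends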

Arbiter

On each login node, we run a utility named Arbiter, which regulates activity by monitoring and limiting resource consumption via cgroups. Users are limited to 10 CPU cores and 50 GB of memory at a time while their status is "normal". When a user uses more than half of their cap for 10 minutes, they are sent a warning email and penalized by having their usage caps reduced. Each violation of this usage policy results in an occurrence, which raises the penalty level. A user's occurrence level drops by 1 for every 3 hours they go without triggering another usage violation. The following table outlines the penalty/status levels currently defined.

Status     CPU Cap    Memory Cap   Penalty Timeout
Normal     10 Cores   50 GB        N/A
Penalty1   8 Cores    40 GB        30 Minutes
Penalty2   5 Cores    25 GB        1 Hour
Penalty3   3 Cores    15 GB        3 Hours

Certain programs, such as compilers and build utilities, are whitelisted; running them will not cause the user to be penalized, so login nodes can still be freely used to build and test software. The purpose of Arbiter is to identify and limit computationally intensive jobs that should be run on the compute nodes instead of the login and development nodes.

Slurm

Atlas uses the Slurm Workload Manager as a scheduler and resource manager. For a guide on how to use the Slurm system to submit and run jobs on this cluster, please refer to the official Slurm Quickstart Guide.

On Atlas, the salloc command is configured with its default behavior: it allocates compute nodes and spawns a shell on the login node with various Slurm environment variables set. Job steps can then be launched from the salloc shell with the srun command.

Example salloc usage:

 jake.frulla@Atlas-login-1 ~$ salloc -A admin
 salloc: Pending job allocation 527990
 salloc: job 527990 queued and waiting for resources
 salloc: Granted job allocation 527990
 salloc: Waiting for resource configuration
 salloc: Nodes Atlas-0029 are ready for job

 jake.frulla@Atlas-login-1 ~$ srun hostname
 
 Atlas-0029.HPC.MsState.Edu

The srun command can be used to launch an interactive shell on an allocated node or set of nodes. Simply specify the --pty option while launching a shell (such as bash) with srun. It is also recommended to set the wallclock limit along with the number of nodes and processors needed for the interactive shell.

Example interactive shell:

 jake.frulla@Atlas-login-1 ~$ srun -A admin --pty --preserve-env bash
 srun: job 527987 queued and waiting for resources
 srun: job 527987 has been allocated resources

 jake.frulla@Atlas-0029 ~$  hostname
 
 Atlas-0029.HPC.MsState.Edu

When running batch jobs, it is necessary to interact with the job queue. It is usually helpful to be able to see information about the system, the queue, the nodes, and your jobs. This can be accomplished with a set of important commands:

squeue          Displays information about jobs in the scheduling queue.
sjstat          Displays a short summary of running jobs and scheduling pool data.
showuserjobs    Displays a short summary of jobs by user and account, along with a summary of node state.
showpartitions  Displays a short summary and the current state of the available partitions.
sstat           Displays information about specific jobs.
sinfo           Reports system status (nodes, queues, etc.).
sacct           Displays accounting information from the Slurm database.

Each of these commands has a variety of functions, options, and filters that refine the information returned and displayed. You can customize filtering, sorting, and output format using command line options or environment variables. Below, the most common and useful examples of each command are given, but many more options exist for each of them. Please consult the man page of each command or the Slurm Documentation for more information on them.
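
A few illustrative invocations (the username and job ID below are placeholders):

 $ squeue -u <username>                                     # jobs belonging to one user
 $ sinfo -p atlas                                           # node states in the default partition
 $ sstat -j <jobid>                                         # live statistics for a running job
 $ sacct -j <jobid> --format=JobID,JobName,Elapsed,State    # accounting for a completed job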

Available Atlas QOSs

QOS        Priority   Max Nodes (Per QOS)   Max Walltime   Notes
normal     20         30                    14 Days        Default QOS
debug      30         6                     30 Minutes     Max of 2 jobs per user, 3 nodes per job
special    20         N/A                   N/A            Must be requested and approved
priority   100        N/A                   N/A            Must be requested and approved
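
For example, a short test run could request the debug QOS (the account name and script are placeholders):

 $ sbatch --qos=debug --account=<projectaccount> --nodes=1 --time=00:30:00 test_job.sh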

Available Atlas Partitions

Partition   Available Nodes   Memory    Notes
atlas*      228               384 GB    N/A
bigmem      8                 1536 GB   N/A
gpu         4                 384 GB    2 x NVIDIA V100 GPUs per node
service     2                 192 GB    DTN nodes; for submitting data transfer jobs

* Default Partition
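
As a sketch, a GPU job might be submitted as follows; the --gres specification assumes the GPUs are exposed under the standard "gpu" GRES name, which is not confirmed here:

 $ sbatch --partition=gpu --gres=gpu:1 --account=<projectaccount> --time=04:00:00 gpu_job.sh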