When dealing with computational problems, limits on time and computing resources can make a solution hard to reach. Given enough time and resources, we could find the answer, and a supercomputer or high-performance cluster can help us get there. The following are some great workshops about HPC:
Using HPC systems typically involves working with a shell through a command-line interface, which is a prerequisite for this topic (see here).
This tutorial covers basic scheduling commands, submitting jobs, transferring files between a local computer and a cluster, and installing software on clusters.
Related document:
On an HPC system, we need a scheduler to manage how jobs run on the cluster. One of the most common schedulers is SLURM. The following are some practical SLURM commands (see the quick start user guide):
sinfo -s # shows summary info about all partitions
sjstat -c # shows computing resources info
srun # run parallel jobs
sbatch # submit a job to the scheduler
JOB_ID=$(sbatch --parsable file.sh) # submit a job and capture its job ID in a variable
sbatch --dependency=afterok:$JOB_ID file.sh # submit a job that starts only after the given job finishes successfully
sbatch --dependency=singleton file.sh # start the job only after any earlier job with the same name and user has ended
sacct # displays accounting data for all jobs and job steps in the SLURM job accounting log
squeue -u <userid> # check on a user's job status
squeue -u <userid> --start # show the estimated start time of pending jobs
scancel JOBID # cancel the job with JOBID
scancel -u <userid> # cancel all of the user's jobs
To see more details about any of these commands, use <command> --help.
Let’s connect to the cluster with ssh user@server and do some practice. For example, use nano example-job.sh to create a job file containing:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --mem 16G
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --partition hpc0
#SBATCH --account general
#SBATCH --time 02-05:00
#SBATCH --job-name NewJobName
#SBATCH --mail-user your@email.com
#SBATCH --mail-type END
echo 'This script is running on:'
hostname
sleep 120
The special characters #! (shebang) at the beginning of a script specify which program should run it (e.g. /bin/bash or /usr/bin/python3). SLURM uses the #SBATCH special comment to denote scheduler-specific options; to see more options, use sbatch --help. For example, the file above requests 1 node, 16 gigabytes of memory, 1 task with 4 CPUs per task, the hpc0 partition under the general account, and 2 days and 5 hours of walltime; it also gives the job a new name and emails you when the job ends. Now we can submit the job file with sbatch example-job.sh. We can use squeue -u USER or sacct to check the job status, and scancel JOBID to cancel the job. You may find more sbatch options here.
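For example, a typical submit-check-cancel cycle with the job file above might look like this (a small sketch; replace JOBID with the ID that sbatch prints):
sbatch example-job.sh # submit the job; prints "Submitted batch job JOBID"
squeue -u $USER # check the status of your jobs
sacct -j JOBID # show accounting info for that specific job
scancel JOBID # cancel the job if needed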
To run a single command, we can use srun. For instance, srun -c 2 echo "This job will use 2 CPUs." submits a job and allocates 2 CPUs to it. We can also use srun to open a program in interactive mode. For example, srun --pty bash will open a Bash shell on a compute node chosen by the scheduler.
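If the default interactive allocation is too small, we can request resources explicitly; a minimal sketch (the partition name hpc0 is simply the one used in the example above):
srun --partition hpc0 --cpus-per-task 2 --mem 4G --time 01:00:00 --pty bash # interactive shell with 2 CPUs and 4 GB of memory for 1 hour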
Note: in general, when we connect to a cluster we land on a login node, which is not meant for heavy computational tasks. So, to run our computations properly, we should always use either sbatch or srun.
Usually there are many software modules available on a cluster. To find and load them, use:
module avail # shows all available modules (programs) on the cluster
module load <name> # load a module, e.g. module load R or module load python
module list # shows the list of loaded modules
module unload <name> # unload a module
module purge # unload all modules
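Many clusters provide several versions of the same module; a small sketch of picking a specific one (the module name and version here are hypothetical, check module avail on your cluster):
module avail python # list the Python modules/versions available
module load python/3.10 # load a specific version (hypothetical name)
python3 --version # confirm which interpreter is now on the PATH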
To create a simple template sbatch job file, use the following steps:
1. Write the code or script you want to run (e.g. a Python file)
2. Create an environment file that loads the required modules (environment.sh)
3. Create a job file with the #SBATCH options (job_file.sh)
4. Use sbatch to run the file in step 3
For example, let's run the following Python code called test.py:
#!/usr/bin/python3
print("Hello world")
Then use nano environment.sh to create the environment file including:
#!/bin/bash
module load miniconda3
Then use nano job-test.sh to make the job file:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Test1
echo === $(date)
echo $SLURM_JOB_ID
source ./environment.sh
module list
srun python3 ./test.py
echo === $(date) $(hostname)
Now we can use sbatch job-test.sh to run this job.
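By default, SLURM writes the job's output to a file named slurm-<JOBID>.out in the submission directory (this can be changed with the --output option). For example, assuming a hypothetical job ID of 12345:
cat slurm-12345.out # view the job's standard output and error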
If there are dependencies between jobs, SLURM can defer the start of a job until the specified dependencies have been satisfied. For instance, let's create another job called job-test-2.sh:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Test2
echo === $(date)
echo $SLURM_JOB_ID
echo === This is a new job
echo === $(date) $(hostname)
We need another job, called job-test-3.sh, to run both job-test.sh and job-test-2.sh:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Dependency
echo === $(date)
JID=$(sbatch --parsable job-test.sh)
echo $JID
sbatch --dependency=afterok:$JID job-test-2.sh
echo === $(date) $(hostname)
Here JID is the job ID from sbatch job-test.sh, which job-test-2.sh depends on. Now, by running sbatch job-test-3.sh, we make sure that job-test-2.sh will run only after job-test.sh has completed successfully.
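While job-test.sh is still running, the dependent job sits in the queue; a quick way to watch this (the Reason column typically shows Dependency for the held job):
squeue -u $USER # job-test-2.sh stays PENDING with reason (Dependency)
sacct --format=JobID,JobName,State # check the states of your recent jobs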
Note that there are some other tools, such as Snakemake, that could be used for workflow management.
We can use secure copy, or scp, to transfer files from a local computer to a cluster and vice versa. For example, working from the local computer, let's transfer code_example.py from the /temp directory on the remote.edu cluster to the Documents/ directory on the local machine. For this we can use:
cd ~/Documents
scp user@remote.edu:/temp/code_example.py .
The . at the end of the command means: copy the file here, keeping the same name as the source file. To do the reverse (upload from the local machine to the cluster):
cd ~/Documents
scp code_example.py user@remote.edu:/temp/.
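If we want the copied file to have a different name, we can give an explicit destination filename instead of . (a small sketch using the same hypothetical paths as above):
scp user@remote.edu:/temp/code_example.py ./code_example_backup.py # download and rename locally
scp code_example.py user@remote.edu:/temp/code_example_v2.py # upload under a new name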
To recursively copy a directory (with all the files it contains), we just need to add the -r (recursive) flag. For example, to download the temp folder, use:
cd ~/Documents
scp -r user@remote.edu:/temp .
Rsync is a fast, versatile, remote (and local) file-copying tool. Rsync has two great features: first, it syncs your data (i.e. it only transfers files that have changed since the last transfer), and second, its compress option makes transferring large files easier. To use rsync:
# From local to remote
rsync local-directory user@remote.edu:remote-directory
# From remote to local
rsync user@remote.edu:remote-directory local-directory
This transfers files from local-directory on the local machine into remote-directory on the remote machine (add the -r flag to copy a directory and its contents recursively). Some important options for rsync are (use rsync --help to see all options):
-r, --recursive : recurse into directories
-v, --verbose : increase verbosity
-h, --human-readable : output numbers in a human-readable format
-z, --compress : compress file data during the transfer
-P, --partial --progress : keep partially transferred files, which should make a subsequent transfer of the rest of the file much faster
For example:
rsync -rPz ./home/myfiles user@remote.edu:./myproject
This will transfer files recursively and in “partial” mode from ./home/myfiles on the local machine into the remote ./myproject directory. Additionally, compression will be used to reduce the size of the data portions of the transfer.
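Before a large transfer, it can be useful to preview what rsync would do with the -n (--dry-run) flag; a sketch using the same hypothetical paths:
rsync -rPzv --dry-run ./home/myfiles user@remote.edu:./myproject # list what would be transferred without copying anything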
Another method to transfer data between the local machine and a cluster is the SSH file transfer protocol, or sftp. Its great advantage is tab completion on both the local and remote sides, which makes finding sources and destinations much easier. We can connect to a cluster through sftp very much like ssh, by running sftp username@server.
We can also use most of the usual Bash commands within sftp and access both the cluster and the local computer at the same time. Usually we can run a command on the local system by adding l to the beginning of the command. For example:
pwd # print working directory on the cluster
lpwd # print working directory on the local computer
cd # change directory on the cluster
lcd # change directory on the local computer
Use put to upload and get to download a file.
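For example, a short interactive session might look like this (paths and filenames are hypothetical):
sftp user@remote.edu
cd /temp # change directory on the cluster
lcd ~/Documents # change directory on the local computer
get code_example.py # download the file into ~/Documents
put results.csv # upload a local file into /temp
exit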
To get data from the web, we can also use the wget command to download files directly onto the cluster.
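For example (the URL is just a placeholder):
wget https://example.com/data/sample_data.tar.gz # download a file directly onto the cluster
tar -xzf sample_data.tar.gz # extract the downloaded archive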
When we log in to a cluster, as a regular user we only have permission to change user-level files (our home directory, cd ~, and its subdirectories). So we will never be able to install or update software located under the root directory (cd /). Note that we can find the location of a piece of software with the module show <software-name> command.
As a cluster user, we have several ways to build our own environment and install or update the software we need (a combined sketch of these options is given at the end of this section):
Python: If we only need a few Python packages, probably the easiest way is to create a virtual environment with the venv module in Python 3. After that we will be able to use the pip package manager to install packages.
Miniconda: It lets us install many kinds of software, including Python, R, and their packages. We can try module load anaconda3 to load the module and then use conda to create a virtual environment and install software and packages. Note that if the cluster does not include miniconda3, you may use the third option to install it first. Review Virtual environments in Python to learn more.
Spack: It offers a wider variety of software and packages to install (see here). To use Spack, we need to install it in a user-level directory (inside our home directory, cd ~) and then use spack to install and load packages. Note that installing Spack and the required modules this way may take more time, so first make sure the second option cannot provide what you need. Review Install software with Spack to learn more.
Manually: There is still plenty of software that is not available through Conda or Spack. We should follow the software's own instructions to install it. Make sure to review the README or INSTALL file (if it exists) and check the configure options, ./configure --help, in the installation directory. Since we are not using the root directory, make sure to install into a directory where all the dependencies are already available, e.g. ./configure --prefix=${PWD}. This can be the hardest way, so first make sure Conda and Spack cannot help you. Note that software names might be slightly different in Conda or Spack, so look at all names that are close.
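Below is a minimal, hedged sketch of what each of these options can look like in practice; the module, package, and path names are placeholders and the exact commands depend on your cluster's setup.
# 1) Python venv + pip (inside your home directory)
python3 -m venv ~/envs/myenv
source ~/envs/myenv/bin/activate
pip install numpy pandas # install the packages you need
# 2) Miniconda/Anaconda environment
module load anaconda3 # or miniconda3, depending on the cluster
conda create -n myenv python=3.10 # create an environment (version is an example)
conda activate myenv
conda install r-base # install software/packages into the environment
# 3) Spack installed in your home directory
git clone https://github.com/spack/spack.git ~/spack
source ~/spack/share/spack/setup-env.sh
spack install samtools # package name is an example
spack load samtools
# 4) Manual build from source, with the prefix inside your home directory
./configure --prefix=${PWD} # run inside the unpacked source directory
make && make install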