When dealing with computational problems, limits on time and computing resources can make a solution hard to reach. Given enough time and resources, we could find the answer, and a supercomputer or high-performance cluster can help us get there. The following are some great workshops about HPC:
Using HPC systems typically involves working with a shell through a command-line interface, which is a prerequisite for this topic (see here).
This tutorial covers basic scheduling commands, submitting jobs, transferring files between a local computer and a cluster, and installing software on clusters.
Related document:
On an HPC system, we need a scheduler to manage how jobs run on the cluster. One of the most common schedulers is SLURM. The following are some practical SLURM commands (see the quick start user guide):
sinfo -s # shows summary info about all partitions
sjstat -c # shows computing resources info
srun # run parallel jobs
sbatch # submit a job to the scheduler
JOB_ID=$(sbatch --parsable file.sh) # submit a job and capture its job ID in a variable
sbatch --dependency=afterok:$JOB_ID file.sh # submit a job that starts only after the given job finishes successfully
sbatch --dependency=singleton file.sh # start the job only after any earlier job with the same name and user has ended
sacct # displays accounting data for all jobs and job steps in the SLURM job accounting log
squeue -u <userid> # check on a user's job status
squeue -u <userid> --start # show the estimated start time of pending jobs
scancel JOBID # cancel the job with JOBID
scancel -u <userid> # cancel all of the user's jobs
To see more details about any of these commands, use <command> --help.
Let’s connect to the cluster with ssh user@server and do some practice. For example, use nano example-job.sh to create a job file containing:
#!/bin/bash
#SBATCH --nodes 1
#SBATCH --mem 16G
#SBATCH --ntasks 1
#SBATCH --cpus-per-task 4
#SBATCH --partition hpc0
#SBATCH --account general
#SBATCH --time 02-05:00
#SBATCH --job-name NewJobName
#SBATCH --mail-user your@email.com
#SBATCH --mail-type END
echo 'This script is running on:'
hostname
sleep 120
The special characters #! (shebang) at the beginning of a script specify which program should run it (e.g. /bin/bash or /usr/bin/python3). SLURM uses the #SBATCH special comment to denote scheduler-specific options; to see more options, use sbatch --help. For example, the file above requests 1 node, 16 gigabytes of memory, 1 task with 4 CPUs per task, the hpc0 partition under the general account, and 2 days and 5 hours of walltime; it also gives the job a new name and emails you when the job ends. Now we can submit the job file with sbatch example-job.sh. We can use squeue -u USER or sacct to check the job status, and scancel JOBID to cancel the job. You may find more sbatch options here.
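For example, a typical submit-check-cancel cycle with the job file above might look like this (a small sketch; replace JOBID with the ID that sbatch prints):
sbatch example-job.sh # submit the job; prints "Submitted batch job JOBID"
squeue -u $USER # check the status of your jobs
sacct -j JOBID # show accounting info for that specific job
scancel JOBID # cancel the job if needed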
To run a single command, we can use srun. For instance, srun -c 2 echo "This job will use 2 CPUs." submits a job and allocates 2 CPUs to it. We can also use srun to open a program in interactive mode. For example, srun --pty bash will open a Bash shell on a compute node chosen by the scheduler.
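If the default interactive allocation is too small, we can request resources explicitly; a minimal sketch (the partition name hpc0 is simply the one used in the example above):
srun --partition hpc0 --cpus-per-task 2 --mem 4G --time 01:00:00 --pty bash # interactive shell with 2 CPUs and 4 GB of memory for 1 hour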
Note: in general, when we connect to a cluster we land on a login node, which is not meant for heavy computational tasks. So, to run our computations properly, we should always use either sbatch or srun.
Usually there are many software modules available on a cluster. To find and load them, use:
module avail # shows all available modules (programs) on the cluster
module load <name> # load a module, e.g. module load R or module load python
module list # shows the list of loaded modules
module unload <name> # unload a module
module purge # unload all modules
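Many clusters provide several versions of the same module; a small sketch of picking a specific one (the module name and version here are hypothetical, check module avail on your cluster):
module avail python # list the Python modules/versions available
module load python/3.10 # load a specific version (hypothetical name)
python3 --version # confirm which interpreter is now on the PATH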
To create a simple template sbatch job file, use the following steps:
1. Write the code or script you want to run (e.g. a Python file)
2. Create an environment file that loads the required modules (environment.sh)
3. Create a job file with the #SBATCH options (job_file.sh)
4. Use sbatch to run the file in step 3
For example, let's run the following Python code called test.py:
#!/usr/bin/python3
print("Hello world")
Then use nano environment.sh to create the environment file including:
#!/bin/bash
module load miniconda3
Then use nano job-test.sh to make the job file:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Test1
echo === $(date)
echo $SLURM_JOB_ID
source ./environment.sh
module list
srun python3 ./test.py
echo === $(date) $(hostname)
Now we can use sbatch job-test.sh to run this job.
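By default, SLURM writes the job's output to a file named slurm-<JOBID>.out in the submission directory (this can be changed with the --output option). For example, assuming a hypothetical job ID of 12345:
cat slurm-12345.out # view the job's standard output and error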
If there are dependencies between jobs, SLURM can defer the start of a job until the specified dependencies have been satisfied. For instance, let's create another job called job-test-2.sh:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Test2
echo === $(date)
echo $SLURM_JOB_ID
echo === This is a new job
echo === $(date) $(hostname)
We need another job, called job-test-3.sh, to run both job-test.sh and job-test-2.sh:
#!/bin/bash
#SBATCH --mem 1G
#SBATCH --job-name Dependency
echo === $(date)
JID=$(sbatch --parsable job-test.sh)
echo $JID
sbatch --dependency=afterok:$JID job-test-2.sh
echo === $(date) $(hostname)
Here JID is the job ID from sbatch job-test.sh, which job-test-2.sh depends on. Now, by running sbatch job-test-3.sh, we make sure that job-test-2.sh will run only after job-test.sh has completed successfully.
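While job-test.sh is still running, the dependent job sits in the queue; a quick way to watch this (the Reason column typically shows Dependency for the held job):
squeue -u $USER # job-test-2.sh stays PENDING with reason (Dependency)
sacct --format=JobID,JobName,State # check the states of your recent jobs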
Note that there are some other tools, such as Snakemake, that could be used for workflow management.
We can use secure copy, or scp, to transfer files from a local computer to a cluster and vice versa. For example, working from the local computer, let's transfer code_example.py from the /temp directory on the remote.edu cluster to the Documents/ directory on the local machine. For this we can use:
cd ~/Documents
scp user@remote.edu:/temp/code_example.py .
The . at the end of the command means: copy the file here, keeping the same name as the source file. To do the reverse (upload from the local machine to the cluster):
cd ~/Documents
scp code_example.py user@remote.edu:/temp/.
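If we want the copied file to have a different name, we can give an explicit destination filename instead of . (a small sketch using the same hypothetical paths as above):
scp user@remote.edu:/temp/code_example.py ./code_example_backup.py # download and rename locally
scp code_example.py user@remote.edu:/temp/code_example_v2.py # upload under a new name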
To recursively copy a directory (with all the files it contains), we just need to add the -r (recursive) flag. For example, to download the temp folder, use:
cd ~/Documents
scp -r user@remote.edu:/temp .
Rsync is a fast, versatile, remote (and local) file-copying tool. Rsync has two great features: first, it syncs your data (i.e. it only transfers files that have changed since the last transfer), and second, its compress option makes transferring large files easier. To use rsync:
# From local to remote
rsync local-directory user@remote.edu:remote-directory
# From remote to local
rsync user@remote.edu:remote-directory local-directory
This transfers files from local-directory on the local machine into remote-directory on the remote machine (add the -r flag to copy a directory and its contents recursively). Some important options for rsync are (use rsync --help to see all options):
-r, --recursive : recurse into directories
-v, --verbose : increase verbosity
-h, --human-readable : output numbers in a human-readable format
-z, --compress : compress file data during the transfer
-P, --partial --progress : keep partially transferred files, which should make a subsequent transfer of the rest of the file much faster
For example:
rsync -rPz ./home/myfiles user@remote.edu:./myproject
This will transfer files recursively and in “partial” mode from ./home/myfiles on the local machine into the remote ./myproject directory. Additionally, compression will be used to reduce the size of the data portions of the transfer.
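Before a large transfer, it can be useful to preview what rsync would do with the -n (--dry-run) flag; a sketch using the same hypothetical paths:
rsync -rPzv --dry-run ./home/myfiles user@remote.edu:./myproject # list what would be transferred without copying anything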
Another method to transfer data between the local machine and a cluster is the SSH file transfer protocol, or sftp. Its great advantage is tab completion on both the local and remote sides, which makes finding sources and destinations much easier. We can connect to a cluster through sftp very much like ssh, by running sftp username@server.
We can also use most of the usual Bash commands within sftp and access both the cluster and the local computer at the same time. Usually we can run a command on the local system by adding l to the beginning of the command. For example:
pwd # print working directory on the cluster
lpwd # print working directory on the local computer
cd # change directory on the cluster
lcd # change directory on the local computer
Use put to upload and get to download a file.
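For example, a short interactive session might look like this (paths and filenames are hypothetical):
sftp user@remote.edu
cd /temp # change directory on the cluster
lcd ~/Documents # change directory on the local computer
get code_example.py # download the file into ~/Documents
put results.csv # upload a local file into /temp
exit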
To get data from the web, we can also use the wget command to download files directly onto the cluster.
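For example (the URL is just a placeholder):
wget https://example.com/data/sample_data.tar.gz # download a file directly onto the cluster
tar -xzf sample_data.tar.gz # extract the downloaded archive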
When we log in to a cluster, as a regular user we only have permission to change user-level files (our home directory, cd ~, and its subdirectories). So we will never be able to install or update software located under the root directory (cd /). Note that we can find the location of a piece of software with the module show <software-name> command.
As a cluster user, we have several ways to build our own environment and install or update the software we need (a combined sketch of these options is given at the end of this section):
Python: If we only need a few Python packages, probably the easiest way is to create a virtual environment with the venv module in Python 3. After that we will be able to use the pip package manager to install packages.
Miniconda: It lets us install many kinds of software, including Python, R, and their packages. We can try module load anaconda3 to load the module and then use conda to create a virtual environment and install software and packages. Note that if the cluster does not include miniconda3, you may use the third option to install it first. Review Virtual environments in Python to learn more.
Spack: It offers a wider variety of software and packages to install (see here). To use Spack, we need to install it in a user-level directory (inside our home directory, cd ~) and then use spack to install and load packages. Note that installing Spack and the required modules this way may take more time, so first make sure the second option cannot provide what you need. Review Install software with Spack to learn more.
Manually: There is still plenty of software that is not available through Conda or Spack. We should follow the software's own instructions to install it. Make sure to review the README or INSTALL file (if it exists) and check the configure options, ./configure --help, in the installation directory. Since we are not using the root directory, make sure to install into a directory where all the dependencies are already available, e.g. ./configure --prefix=${PWD}. This can be the hardest way, so first make sure Conda and Spack cannot help you. Note that software names might be slightly different in Conda or Spack, so look at all names that are close.
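Below is a minimal, hedged sketch of what each of these options can look like in practice; the module, package, and path names are placeholders and the exact commands depend on your cluster's setup.
# 1) Python venv + pip (inside your home directory)
python3 -m venv ~/envs/myenv
source ~/envs/myenv/bin/activate
pip install numpy pandas # install the packages you need
# 2) Miniconda/Anaconda environment
module load anaconda3 # or miniconda3, depending on the cluster
conda create -n myenv python=3.10 # create an environment (version is an example)
conda activate myenv
conda install r-base # install software/packages into the environment
# 3) Spack installed in your home directory
git clone https://github.com/spack/spack.git ~/spack
source ~/spack/share/spack/setup-env.sh
spack install samtools # package name is an example
spack load samtools
# 4) Manual build from source, with the prefix inside your home directory
./configure --prefix=${PWD} # run inside the unpacked source directory
make && make install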