Using R in High Performance Computing (HPC) Cluster

December 09, 2022

HPC Cluster

An HPC cluster is a type of computing system that consists of multiple interconnected computers or nodes working together to provide high-performance computing capabilities.

Component of HPC Cluster

Cluster includes servers, networks and storage.
Partitions: Collection of nodes
Node (server): Contains multiple processors.
Processor: Contains multiple cores. Eg: CPU and GPU.
Core: Receives instructions and performs calculations.

Nodes

Login nodes: The secure entry point to the cluster, where users can submit jobs, manage files, and interact with the cluster.
Compute nodes: Perform the actual computation tasks requested by users.
Storage nodes: High-capacity, high-performance storage for user data and applications that are connected to the compute nodes via a high-speed network.
Data transfer nodes: Dedicated to high-speed data transfer between the cluster and external networks or storage systems.

When to use the HPC Cluster

Computation needs much more memory than the memory limit of personal computers.
The same program needs to be run multiple times.
The program that you use takes too long to run, but it can be run faster in parallel.

Using R in the HPC Cluster

Module

A module refers to a software package or application that can be loaded and unloaded dynamically, allowing users to easily switch between different versions of the same software or to use different software packages without conflicts.

Load R to the HPC Cluster

R is treated as a module in the HPC Cluster, you can check available versions of R using

[ID@login ~]$ module avail

Then, load ‘R’ and the ‘gcc’ compiler to the Cluster.

[ID@login ~]$ module load gcc/11.3.0
[ID@login ~]$ module load r/4.2.2

If you want the cluster to load R automatically every time you log in to the cluster, you may use

[ID@login ~]$ module initadd gcc/11.3.0
[ID@login ~]$ module initadd r/4.2.2

Install R Packages

Open R in the Cluster

[ID@login ~]$ R

Install needed packages

> install.packages("RequiredPackages")

Quit R

> q()

Submit R jobs

In the HPC cluster, users cannot interact with compute node directly. Thus, a submission scripts are needed to schedule jobs. A submission script can either be written on the user’s own PC and then transferred to the cluster or written directly in the login node using editors such as Vim or Nano.

An Example of Submission Script

The common job scheduler in the HPC cluster is the Slurm Workload Manager. If you have an R script named ‘my.R’, you may write the following commonds to a file named ‘simul.slurm’.

#!/bin/bash
#SBATCH --partition=general  ## Specify the partition
#SBATCH --nodes=1            ## Number of nodes
#SBATCH --cpus-per-task=1     ## Number of CPUs requested in each node
#SBATCH --time=12:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=xxxx@xxxx.com
Rscript my.R

Submit the job

Simply run the following command on the Cluster

[NetID@cn01 ~]$ sbatch simul.slurm

Commonly Used Slurm Commonds

Submit jobs
- Jobs should be submitted via bash scripts.
```
  [NetID@cn01 ~]$ sbatch <script_name>.slurm
```

You can read detailed documentations for sbatch.

Cancel jobs

Cancel one job

  [NetID@cn01 ~]$ scancel <job_id>

Cancel multiple jobs start with same numbers

  [NetID@cn01 ~]$ squeue -u <user_id> | grep <common_names> | awk '{sprint $1}' | xrgs -n 1 scancel

You can read detailed documentations for scancel.

Check current job queue
- Check all jobs by all users
```
  [NetID@cn01 ~]$ squeue
```
- Check all jobs by a specific user
```
  [NetID@cn01 ~]$ squeue -u <user_id>
```

You can read detailed documentations for squeue.

View or modify Slurm configuration and state.
- Hold a job: This commond will kill the running job and force it hold.
```
  [NetID@cn01 ~]$ scontrol hold <job_id>
```
- Requeue a job: This commond will requeue a running job but won’t requeue a completed job.
```
  [NetID@cn01 ~]$ scontrol requeue <job_id>
```
- Release a hold job.
```
  [NetID@cn01 ~]$ scontrol release <job_id>
```
- Check imformation about a partition.
```
  [NetID@cn01 ~]$ scontrol show partition <partition_name>
```

You can read detailed documentations for scontrol.

You can only control a running job or job ended within five miniutes.

View and modify Slurm account information.

Check the QOS about a partition

  [NetID@cn01 ~]$ sacctmgr show qos where name=<partition_name> format name,maxJob,maxSubmit

You can read detailed documentations for sacctmgr.

Check jobs’ history
- Check the details of your ended jobs
```
  [NetID@cn01 ~]$ shist
```

Parallel Computing

R is an interpreted language, meaning that code is executed line-by-line at runtime. This can slow down the execution of loops compared to compiled languages like C or Fortran. Moreover, R is a dynamically typed language, which means that the type of a variable can change during runtime. This can cause additional overhead when looping over large data structures, as R needs to constantly check and update variable types. Memory management in R can also contribute to slow loops, especially when dealing with large data structures that require frequent copying and allocation of memory. Thus, parallel computing is needed when the number of iteration in the loop is large. I will introduce how to use R to do parallel computing on a personal computer in my next blog.