Zehan Yang

Using R in a High Performance Computing (HPC) Cluster

HPC Cluster

An HPC cluster is a type of computing system that consists of multiple interconnected computers or nodes working together to provide high-performance computing capabilities.

Components of an HPC Cluster

Nodes

When to use the HPC Cluster

Using R in the HPC Cluster

Module

A module refers to a software package or application that can be loaded and unloaded dynamically, allowing users to easily switch between different versions of the same software or to use different software packages without conflicts.

Load R in the HPC Cluster

R is provided as a module in the HPC Cluster. You can check the available versions of R using

[ID@login ~]$ module avail

Then, load the ‘gcc’ compiler and ‘R’ in the Cluster.

[ID@login ~]$ module load gcc/11.3.0
[ID@login ~]$ module load r/4.2.2
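After loading the modules, it can be worth confirming that the compiler and R are actually on your PATH before going further. The following is a minimal sketch; `report_cmd` is a hypothetical helper name introduced here for illustration, not part of the module system.

```shell
# Hypothetical helper: print where a command lives, or say that it is missing.
report_cmd() {
  if cmd_path=$(command -v "$1" 2>/dev/null); then
    echo "$1 -> $cmd_path"
  else
    echo "$1 is not on PATH; was the module loaded?"
  fi
}

# Check the tools that the two modules above should provide.
report_cmd gcc
report_cmd R
report_cmd Rscript
```

If any of the three is reported as missing, rerun the `module load` commands and check the exact version names shown by `module avail`.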

If you want the cluster to load R automatically every time you log in, you may use

[ID@login ~]$ module initadd gcc/11.3.0
[ID@login ~]$ module initadd r/4.2.2

Install R Packages

Open R in the Cluster

[ID@login ~]$ R

Install needed packages

> install.packages("RequiredPackages")
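If the system-wide R library is not writable for your account, packages can go to a personal library instead. The sketch below sets one up from the shell, under some assumptions: `setup_r_lib`, the `R_LIB_DIR` override, and the path `$HOME/R/library` are hypothetical choices for illustration. `R_LIBS_USER` is the environment variable R reads for the user library.

```shell
# Hypothetical helper: create a personal R library and point R at it.
# R_LIBS_USER is the environment variable R consults for the user library.
setup_r_lib() {
  lib_dir="$1"
  mkdir -p "$lib_dir" || return 1
  R_LIBS_USER="$lib_dir"
  export R_LIBS_USER
  echo "R packages will be installed to: $R_LIBS_USER"
}

# Assumed location; change it to wherever your site recommends.
setup_r_lib "${R_LIB_DIR:-$HOME/R/library}" ||
  echo "could not create the library directory" >&2

# Then install non-interactively (replace RequiredPackages with a real name):
# Rscript -e 'install.packages("RequiredPackages", lib = Sys.getenv("R_LIBS_USER"), repos = "https://cloud.r-project.org")'
```

The `export` only lasts for the current session; to keep it, the same assignment would need to go into your shell startup file.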

Quit R

> q()

Submit R jobs

In the HPC cluster, users cannot interact with the compute nodes directly. Thus, a submission script is needed to schedule jobs. A submission script can either be written on the user’s own PC and then transferred to the cluster, or written directly on the login node using an editor such as Vim or Nano.

An Example of a Submission Script

A common job scheduler in HPC clusters is the Slurm Workload Manager. If you have an R script named ‘my.R’, you may write the following commands to a file named ‘simul.slurm’.

#!/bin/bash
#SBATCH --partition=general    ## Partition (queue) to submit to
#SBATCH --nodes=1              ## Number of nodes
#SBATCH --cpus-per-task=1      ## Number of CPUs requested per task
#SBATCH --time=12:00:00        ## Wall-time limit (hh:mm:ss)
#SBATCH --mail-type=END        ## Send an email when the job ends
#SBATCH --mail-user=xxxx@xxxx.com
Rscript my.R
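To make the job independent of whatever happens to be loaded in your login session, one option is to load the modules inside the script itself. The sketch below writes such a variant of ‘simul.slurm’ from the shell and checks its syntax with `bash -n`. It assumes that the `module` command is available in batch shells on your cluster and reuses the module versions from earlier; the `SCRIPT_FILE` variable is a hypothetical convenience.

```shell
# File to write; SCRIPT_FILE is a hypothetical override, defaulting to simul.slurm.
script_file="${SCRIPT_FILE:-simul.slurm}"

# The quoted 'EOF' keeps the shell from expanding anything inside the script text.
cat > "$script_file" <<'EOF'
#!/bin/bash
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --cpus-per-task=1
#SBATCH --time=12:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=xxxx@xxxx.com

# Assumption: the module command is available in batch shells on your cluster.
module load gcc/11.3.0
module load r/4.2.2

Rscript my.R
EOF

# Parse the script without running it to catch shell syntax errors early.
bash -n "$script_file" && echo "syntax OK: $script_file"
```

Note that `bash -n` only checks shell syntax; it does not validate the `#SBATCH` options, which the scheduler reads at submission time.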

Submit the job

Simply run the following command on the Cluster

[ID@login ~]$ sbatch simul.slurm
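When a job is submitted from a script, it helps to keep the job ID for later use with commands such as squeue or scancel. The sketch below relies on sbatch's `--parsable` option, which prints the job ID (optionally followed by `;` and the cluster name) instead of the usual message. `submit_job` is a hypothetical helper name.

```shell
# Hypothetical helper: submit a script and print only the job ID.
submit_job() {
  if ! command -v sbatch >/dev/null 2>&1; then
    echo "sbatch not found; run this on the cluster" >&2
    return 1
  fi
  # --parsable prints "jobid" or "jobid;cluster"; keep the part before ';'.
  out=$(sbatch --parsable "$1") || return 1
  echo "${out%%;*}"
}

if job_id=$(submit_job simul.slurm); then
  echo "Submitted job $job_id"
else
  echo "Job was not submitted"
fi
```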

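Commonly Used Slurm Commands

sbatch submits a batch script to the scheduler. You can read the detailed documentation for sbatch.

scancel cancels a pending or running job. You can read the detailed documentation for scancel.

squeue lists the jobs in the scheduling queue. You can read the detailed documentation for squeue.

scontrol displays, and where permitted modifies, the configuration and state of jobs. You can read the detailed documentation for scontrol.

You can only control a running job or a job that ended within the last five minutes.

sacctmgr views and modifies Slurm account information. You can read the detailed documentation for sacctmgr.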
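As a rough illustration of how these commands are typically invoked, the sketch below wraps them in a small guard so that it can also be read through on a machine without Slurm. `slurm_try` is a hypothetical wrapper, and the job ID `12345` is a placeholder, not a real job.

```shell
# Hypothetical wrapper: run a Slurm command only if it is installed.
slurm_try() {
  if command -v "$1" >/dev/null 2>&1; then
    "$@"
  else
    echo "skipped (not on a Slurm cluster): $*"
  fi
}

job_id="${JOB_ID:-12345}"   # placeholder; use the ID printed by sbatch

# List your own jobs in the queue.
slurm_try squeue -u "${USER:-$(id -un)}"

# Show the full record of one job.
slurm_try scontrol show job "$job_id"

# Cancel the job (left commented out so nothing is cancelled by accident).
# slurm_try scancel "$job_id"
```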

Parallel Computing

R is an interpreted language, meaning that code is executed line by line at runtime. This can make loops slower than in compiled languages like C or Fortran. Moreover, R is a dynamically typed language, which means that the type of a variable can change during runtime. This adds overhead when looping over large data structures, as R needs to check variable types repeatedly. Memory management in R can also contribute to slow loops, especially when large data structures are frequently copied and reallocated. Thus, parallel computing is helpful when the number of iterations in a loop is large. I will introduce how to use R to do parallel computing on a personal computer in my next blog post.