Slurm Job Array
When conducting a simulation study, it is often necessary to run the same
code many times with different inputs. Slurm job arrays handle this task
efficiently, allowing a large number of similar jobs to be submitted and
managed with a single batch script.
Example: Bootstrap Replicates of a Linear Model
To obtain the R-squared coefficient for each of 10,000 bootstrap replicates of a
linear model, we want to run the following function 10,000 times on an HPC
cluster, each time with a distinct random seed.
myfunc <- function(seed) {
  # Restore the supplied L'Ecuyer-CMRG stream as the current RNG state
  assign(".Random.seed", seed, envir = .GlobalEnv)
  # Resample the rows of mtcars with replacement and refit the model
  b_df <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  mdl <- lm(mpg ~ wt + disp, data = b_df)
  return(summary(mdl)$r.squared)
}
To achieve this, we will split the 10,000 replicates across a Slurm job array
with 250 tasks, each handling 40 replicates. Because every replicate runs the
same code with a different random seed, a job array is a natural fit. To
request the array, we need to add the following line to the batch script prior
to submission:
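#SBATCH --array=0-249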
Inside the R script, Sys.getenv('SLURM_ARRAY_TASK_ID') tells us which array
task is currently running, so each task can select its own slice of the input
arguments (i.e., random seeds). With 40 replicates per task and
SLURM_ARRAY_TASK_ID = k (where k runs from 0 to 249), task k should process the
(40k + 1)-th through the 40(k + 1)-th seeds in R's 1-based indexing; for
example, task 3 handles seeds 121 through 160. The R script submitted to
the cluster is provided below.
library(parallel)
# Write the function to execute for each bootstrap replicate
myfunc <- function(seed) {
  # Restore the supplied L'Ecuyer-CMRG stream as the current RNG state
  assign(".Random.seed", seed, envir = .GlobalEnv)
  # Resample the rows of mtcars with replacement and refit the model
  b_df <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  mdl <- lm(mpg ~ wt + disp, data = b_df)
  return(summary(mdl)$r.squared)
}
# Generate the random seeds
RNGkind("L'Ecuyer-CMRG")
njobs <- 10000 ## Total number of jobs
set.seed(2002)
seeds <- list(.Random.seed)
for (i in seq(2, njobs, 1)) {
  seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1]])
}
# Obtain the array task id (an integer between 0 and 249)
id <- as.integer(Sys.getenv('SLURM_ARRAY_TASK_ID'))
# Derive the results for this task's slice of seeds
nary <- 250          ## Number of array tasks
jpc <- njobs / nary  ## Number of replicates per array task
trd <- 2             ## Number of cores used by mclapply (matches --cpus-per-task)
out <- mclapply(seeds[(id * jpc + 1):((id + 1) * jpc)],
                myfunc, mc.cores = trd,
                mc.set.seed = FALSE)
saveRDS(out, file = paste0('results_', id, '.RDS'))
The R script shown above can be saved as my.R. To submit the job to the
HPC cluster, we can create a bash script as follows:
#!/bin/bash
#SBATCH --array=0-249
#SBATCH --output=slurm_%a.out
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --cpus-per-task=2
#SBATCH --time=12:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=xxxx@xxxx.com
#SBATCH --mem-per-cpu=4G
Rscript my.R
It’s worth noting that in the bash script provided above, the number of nodes
is set to one: each array task is an independent job, so it needs no more than a
single node. The number of CPUs per task is set to two so that it matches the
number of cores passed to mclapply() in the R script; beyond that, independent
array tasks like these require no additional computing resources.
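Assuming the batch script above is saved as my_array.sh (the file name here is
only for illustration), the entire 250-task array can be submitted with a single
command:
[NetID@cn01 ~]$ sbatch my_array.sh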
Assuming that the job ID is 1000, we can cancel specific array tasks by
running the following command in the terminal:
[NetID@cn01 ~]$ scancel 1000_[array_id1,array_id2,...]
where array_id1, array_id2, etc. refer to the IDs of the specific array tasks
that need to be canceled. If we want to cancel the entire job array, we just
need to run
[NetID@cn01 ~]$ scancel 1000
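Once all 250 array tasks have finished, the per-task output files can be
combined into a single set of results. The following is a minimal sketch,
assuming the results_<id>.RDS files written by my.R all sit in the current
working directory:
# Read the 250 per-task result files and flatten them into one numeric vector
files <- paste0("results_", 0:249, ".RDS")
rsq <- unlist(lapply(files, readRDS))
length(rsq)  ## should be 10000
hist(rsq, main = "Bootstrap distribution of R-squared")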
The rslurm Package
Introduction
The rslurm package is an R package that provides an interface for submitting
and managing batch jobs on HPC clusters that use the Slurm workload manager.
It allows users to submit Slurm batch jobs from within R and to work with job
arrays and parallel jobs. The package includes functions for submitting jobs,
checking job status, canceling jobs, and retrieving output files. It also lets
users set job parameters such as the number of nodes, CPUs, and memory required
for a job, and pass additional Slurm options such as dependencies between jobs.
Overall, rslurm aims to simplify the process of submitting and managing batch
jobs on HPC clusters that run Slurm.
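As a quick sketch of that workflow (the toy function, its parameters, and the
job name below are purely illustrative), a job can be submitted, monitored,
collected, and cleaned up without leaving R:
library(rslurm)
# Submit a single function call as a Slurm job
sjob <- slurm_call(function(n) mean(rnorm(n)), params = list(n = 1e6),
                   jobname = "toy_job")
get_job_status(sjob)                      # is the job pending, running, or done?
res <- get_slurm_out(sjob, wait = TRUE)   # wait for completion and fetch the result
# cancel_slurm(sjob)                      # or cancel the job instead of waiting
cleanup_files(sjob)                       # remove the temporary _rslurm_toy_job folder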
The Previous Example
If we use the rslurm package to run the previous example, we do not need to
write the job-array batch scripts ourselves; the slurm_map() function generates
and submits them for us. Here is an example that runs the myfunc() function in
parallel across the 10,000 bootstrap replicates, split into 250 array tasks of
40 replicates each:
library(rslurm)
# Write the function to execute for each bootstrap replicate
myfunc <- function(seed) {
  # Restore the supplied L'Ecuyer-CMRG stream as the current RNG state
  assign(".Random.seed", seed, envir = .GlobalEnv)
  # Resample the rows of mtcars with replacement and refit the model
  b_df <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  mdl <- lm(mpg ~ wt + disp, data = b_df)
  return(summary(mdl)$r.squared)
}
# Generate the random seeds
RNGkind("L'Ecuyer-CMRG")
njobs <- 10000 ## Total number of jobs
set.seed(2002)
seeds <- list(.Random.seed)
for (i in seq(2, njobs, 1)) {
  seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1]])
}
# Submit the replicates as a 250-task job array; slurm_map() writes
# and submits the batch scripts for us
sjob <- slurm_map(seeds, myfunc, nodes = 250,
                  cpus_per_node = 1,
                  slurm_options =
                    list(partition = "general",
                         time = "12:00:00",
                         `mail-type` = "END",
                         `mail-user` = "xxxx@xxxx.com",
                         `mem-per-cpu` = "4G"),
                  submit = TRUE)
# Save the job object so the results can be collected in a later session
save(sjob, file = "result.RData")
The R code above can itself be saved as my.R and launched on the HPC cluster
with the short batch script below. Because rslurm writes and submits the array
scripts for us, this approach saves a significant amount of time when many
similar jobs need to be submitted.
#!/bin/bash
#SBATCH --partition=general
#SBATCH --nodes=1
#SBATCH --time=00:10:00
#SBATCH --mail-type=END
#SBATCH --mail-user=xxxx@xxxx.com
#SBATCH --mem-per-cpu=4G
Rscript my.R
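Once the 250 array tasks have completed, the results can be collected from a
later R session by reloading the saved job object. The following is a minimal
sketch, assuming it is run from the same working directory in which my.R was
executed (rslurm keeps its files in a _rslurm_<jobname> subfolder there); the
output file name is only illustrative:
load("result.RData")                         # restores the sjob object saved by my.R
out <- get_slurm_out(sjob, outtype = "raw")  # a list of 10,000 R-squared values
cleanup_files(sjob)                          # remove the temporary _rslurm_ folder
saveRDS(out, file = "bootstrap_rsq.RDS")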