Zehan Yang

Parallel Computing in R

Loops in Computer Languages

Compiled Language (C/Fortran)

Interpreted Language (R/Python)

Loops in compiled languages are generally faster than loops in interpreted languages, since the machine code for the loop has already been generated and optimized for the specific processor architecture.

Parallel Computing

Parallel computing is a type of computing architecture in which multiple processors or computing resources work together to solve a problem or execute a program. In parallel computing, a task is divided into smaller sub-tasks, which can be executed simultaneously by multiple processors or computing resources.

Parallel Computing in R

Parallel computing in R can be achieved using several techniques, including the parallel package, the foreach package, and the multicore package.

The parallel Package

This package provides support for parallel computing using multiple processes. The parallel package comes pre-installed with R, so there is no need to install it separately. Here’s an example of how to use the parSapply() function from the parallel package to parallelize the application of a function to a list of inputs.

To begin with, you need to create a cluster on your own computer by utilizing the makeCluster() function, and the number of nodes to be specified in the function should correspond to the number of cores available on your CPU.

library(parallel)
cl <- makeCluster(n) # Create the cluster with n cores

After creating the cluster, you can execute your function myfunc() with various inputs mylist using the parSapply() function. The inputs will be split across the n cores, and all cores will execute their assigned jobs concurrently

results <- parSapply(cl, mylist, myfunc())
stopCluster(cl) # Stop the cluster

Assign Random Seeds

For reproducibility, each job in parallel computing should be assigned a unique random seed. In R, the set.seed() function can be used to set a random seed for a single experiment. However, when running 1000 parallel jobs, assigning sequential values such as 1 to 1000 as random seeds is not truly random and may not produce the desired level of reproducibility.

An often employed method involves generating a new random seed by utilizing another random seed. To produce the initial seed, the following code snippet can be utilized:

RNGkind("L'Ecuyer-CMRG")
njobs <- 1000
set.seed(2002)
seeds <- list(.Random.seed) # Set up an initial seed

Subsequently, generate the next seed by building upon the previous one by the nextRNGStream() function in the parallel package.

for (i in seq(2, njobs, 1)) {
  # Generate the following seed based on the previous one
  seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1]])
}

It is important to note that for each task, the random seed produced using this approach must be allocated to the global environment to have an impact. As an illustration, if the initial random seed needs to be established for the first task, the following code must be included:

assign(".Random.seed", seeds[[1]], envir = .GlobalEnv)

Example

The subsequent code generates 10,000 bootstrap replicates of a linear model and returns the R square coefficients.

library(parallel)
library(microbenchmark)
# Write the function
myfunc <- function(seed) {
  assign(".Random.seed", seed, envir = .GlobalEnv)
  b_df <- mtcars[sample(nrow(mtcars), rep = TRUE), ]
  mdl <- lm(mpg ~ wt + disp, data = b_df)
  return(summary(mdl)$r.square)
}
# Generate the random seeds
RNGkind("L'Ecuyer-CMRG")
njobs <- 10000
set.seed(2002)
seeds <- list(.Random.seed)
for (i in seq(2, njobs, 1)) {
  seeds[[i]] <- parallel::nextRNGStream(seeds[[i - 1]])
}

wrap <- function(ncore) {
  # Execute the code on the cluster
  cl <- makeCluster(ncore)
  results <- parLapply(cl, seeds, myfunc)
  stopCluster(cl)
  return(results)
}

# Compare the computing between parLapply, lapply and mcLapply
microbenchmark(lapply(seeds, myfunc), wrap(4),
               mclapply(seeds, myfunc, mc.cores = 4,
               mc.set.seed = FALSE), times = 10)

The parallelized versions of lapply() are parLapply() and mclapply(). The key difference is that mclapply() can only be used on OS/Linux operating systems and does not require creating a cluster environment. To ensure reproducibility, I advise users to use self-defined random seeds instead of setting mc.set.seed = TRUE, which enables tracking of the random seeds employed in each sub-task.

The parallel package contains several apply functions for various purposes, and the documentation of the parallel package provides comprehensive information on this matter.

Parallel computing on HPC Cluster using R

A personal computer has a restricted number of CPU cores. Consequently, executing 100 or 1000 tasks simultaneously on a personal computer is not feasible, and a High-Performance Computing (HPC) cluster is required. In my upcoming blog post, I will outline how to utilize parallel computing in R on an HPC cluster.