An Advanced Introduction to R

Daniel Hintz | Seth Rankins | 2024-03-25

An R-Ladies Workshop

Outline (1)

  • Loops, Functionals (to replace loops), and Logicals
  • R Function Basics
    • Structure
    • Some practice to start
  • R Functions Extended
    • formals(), body(), and environment()
    • More on Environments
    • Default Argument Assignment
    • NULL Conditional Execution
    • message(), warning(), and stop()
    • Conditional Execution with missing()
    • Using the Three Dots Ellipsis, ...
    • Anonymous Functions

Outline (2)

  • Functional Programming with base R
    • Function Lookup Table
    • Examples
  • Functional programming with purrr
    • Function Lookup Table
    • Examples
  • Parallel Computing
    • Parallel Computing with furrr
    • A Note on Parallel Computing
  • Practice Questions

webR Code Execution

  • Use cmd + enter to execute a line

  • As of webR 0.2.3, webR does not support smart execution, so for multiline code, highlight the entire section before pressing cmd + enter

Loops, Functions, and Logical Statements

What is the Difference Between Vectorization and Iteration?

  • If y = 4x + 3, what is y for x = 4?
  • If \(N_{(t + 1)} = N_{t}e^{r\left(1 ~-~ \frac{N_{t}}{K}\right)}\), what is \(N\) for \(t = 4\) \(~~\)(\(N_{1} = 2\), \(r = 0.5\), and \(K = 30\))?
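The two questions above contrast a closed-form expression, which R can evaluate for a whole vector at once, with a recurrence in which each value depends on the previous one. A minimal sketch, assuming illustrative variable names (`x`, `y`, `N`, `r`, `K`) that mirror the symbols in the formulas:

```r
# Vectorized: y = 4x + 3 is evaluated for every x at once, no loop needed
x <- 1:4
y <- 4 * x + 3
y[x == 4]  # 19

# Iterative: N[t + 1] depends on N[t], so each step needs the previous result
r <- 0.5
K <- 30
N <- numeric(4)
N[1] <- 2
for (t in 1:3) {
  N[t + 1] <- N[t] * exp(r * (1 - N[t] / K))
}
N[4]  # population size at t = 4
```

The first calculation never refers to an earlier element of `y`, which is what makes it vectorizable; the second cannot skip ahead to `N[4]` without first computing `N[2]` and `N[3]`.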

Now for the Math in R

Now Onto a More Complex for Loop

  • For people who write their own algorithms, while loops are a good skill to know
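As a hedged illustration of why while loops suit hand-written algorithms, the sketch below uses the Babylonian method for square roots, where the number of iterations is not known in advance. The example and its names (`a`, `guess`, `tol`, `n_iter`) are assumptions chosen for illustration, not the workshop's original code:

```r
# Babylonian (Heron's) method: refine a guess for sqrt(a) until it is close enough
a <- 2
guess <- 1
tol <- 1e-10
n_iter <- 0
while (abs(guess^2 - a) > tol) {
  guess <- (guess + a / guess) / 2  # average the guess with a / guess
  n_iter <- n_iter + 1
}
guess   # close to sqrt(2)
n_iter  # number of iterations the loop needed
```

A for loop needs its number of iterations up front; a while loop keeps going until a condition is met, which fits convergence checks like this one.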

Now for apply Functions

apply Functions (cont’d)

  • You might want to look at the pbapply package if you regularly parallelize your code.
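For reference, a minimal base R sketch of the apply family on a small matrix; the data and names (`m`, `squares`) are made up for illustration. The pbapply package mentioned above is not used here.

```r
# A small numeric matrix: 3 rows, 4 columns (illustrative data)
m <- matrix(1:12, nrow = 3)

row_sums <- apply(m, 1, sum)    # MARGIN = 1: one result per row
col_means <- apply(m, 2, mean)  # MARGIN = 2: one result per column

# sapply() iterates over the elements of a vector or list
squares <- sapply(1:5, function(i) i^2)

row_sums   # 22 26 30
col_means  # 2 5 8 11
squares    # 1 4 9 16 25
```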

Logical Statements (1)

Logical Statements (2)
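A short sketch of the main logical tools in R, using an invented temperature vector `temp` and a hypothetical helper `classify()`; it is meant only to show the difference between scalar and vectorized conditions:

```r
temp <- c(12, 25, 31, NA)

# if / else if / else works on a single TRUE or FALSE
classify <- function(t) {
  if (is.na(t)) {
    "unknown"
  } else if (t >= 30) {
    "hot"
  } else if (t >= 20) {
    "warm"
  } else {
    "cool"
  }
}
labels <- sapply(temp, classify)  # "cool" "warm" "hot" "unknown"

# ifelse() is vectorized: it tests every element at once (NA stays NA)
simple <- ifelse(temp >= 20, "warm or hot", "cool")

# & and | compare element-wise; && and || expect single values
in_range <- temp > 10 & temp < 30
```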

Let's Try Some Exercises

Q1

Read in the 20 “Weather Station” data files with a for loop and combine into one dataframe.

Q2

Now modify your for loop to omit any files that contain an NA.

Q3

Use an apply function to perform a Shapiro-Wilk test on each column of the data set created below (dat) to determine if the data in each column is normally distributed.

Q4

Make a column in the data created below (a subset of the weather station data) for week. When X (this is the row ID or day of the month) is 1-7, Week should be 1; when X (day) is 8-14, Week should be 2; when X (day) is 15-21, Week should be 3; when X (day) is 22-28, Week should be 4; and for X (day) from 29-30, Week should be assigned a value of NA.

R Function Basics

Function Structure

Functions are like grammatically correct sentences; they require arguments, a body, and an environment.

Let's Write Some Functions

A function returns the value of the last expression it evaluates, so the body should end with the object itself (e.g. x) or an explicit return(x) for the result to reach the calling environment
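A minimal sketch of how a function hands its result back; the function names (`add_one`, `add_one_explicit`, `add_one_silent`) are hypothetical:

```r
# Implicit return: the value of the last evaluated expression is returned
add_one <- function(x) {
  x + 1
}

# Explicit return: return() exits the function immediately with that value
add_one_explicit <- function(x) {
  return(x + 1)
}

# Ending on an assignment returns the value invisibly, so nothing prints
add_one_silent <- function(x) {
  y <- x + 1
}

add_one(1)              # prints 2
add_one_explicit(1)     # prints 2
add_one_silent(1)       # prints nothing
z <- add_one_silent(1)  # the value can still be captured: z is 2
```

Ending the body with the object itself (or `return()`) is what makes the result visible at the console.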

R Functions Extended

Function formals(), body() and environment()
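As a sketch, the three accessors can be called on any closure; `f` here is a throwaway function defined only for illustration:

```r
f <- function(x, y = 2) {
  x + y
}

formals(f)      # the argument list: x (no default) and y = 2
body(f)         # the code inside the function
environment(f)  # where f was defined (the global environment when run at top level)
```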

More on Environments

  • Environments are important to understand as they relate to lexical scoping
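A minimal sketch of how environments drive name lookup, using hypothetical functions `f` and `g`: when a name is not found inside a function, R searches the environment where that function was defined, then that environment's parent, and so on.

```r
x <- 10

f <- function() {
  x <- 20        # local x, lives only in f's execution environment
  g <- function() {
    x * 2        # x is not defined in g, so R looks in the enclosing f
  }
  g()
}

f()  # 40: g finds the x defined inside f, not the global x
x    # 10: the global x is untouched
```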

Lexical Scoping

Default Argument Assignment
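A sketch of default argument assignment, with invented functions `power()` and `span()`:

```r
# `exponent` defaults to 2 unless the caller supplies another value
power <- function(x, exponent = 2) {
  x^exponent
}

power(3)                # 9, uses the default exponent
power(3, exponent = 3)  # 27, overrides the default

# Defaults are evaluated lazily and may refer to other arguments
span <- function(x, lower = min(x), upper = max(x)) {
  upper - lower
}
span(c(2, 5, 9))        # 7
```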

NULL Conditional Execution (1)
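One common pattern, sketched here with a hypothetical `describe()` function, is to default an optional argument to NULL and branch on is.null():

```r
# `label = NULL` marks an optional argument; is.null() detects whether it was supplied
describe <- function(x, label = NULL) {
  if (is.null(label)) {
    label <- "value"  # fall back to a generic label
  }
  paste0(label, ": ", mean(x))
}

describe(c(1, 2, 3))                  # "value: 2"
describe(c(1, 2, 3), label = "mean")  # "mean: 2"
```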

message(), warning(), and stop()
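The three condition functions differ in severity. A hedged sketch using a made-up validator, `check_proportion()`, which expects a single number:

```r
check_proportion <- function(p) {
  if (!is.numeric(p)) {
    stop("p must be numeric")                    # error: execution halts
  }
  if (p < 0 || p > 1) {
    warning("p is outside [0, 1]; clamping it")  # warning: execution continues
    p <- min(max(p, 0), 1)
  }
  message("Using p = ", p)                       # informational note
  p
}

check_proportion(0.3)                    # message only
suppressWarnings(check_proportion(1.5))  # returns 1 (warning suppressed here)
try(check_proportion("a"))               # the error is caught and printed by try()
```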

NULL Conditional Execution (2)

Conditional Execution with missing()
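A sketch of missing(), which reports whether the caller supplied an argument; `greet()` is a hypothetical example:

```r
greet <- function(name) {
  if (missing(name)) {   # TRUE when the caller did not supply `name`
    return("Hello, stranger!")
  }
  paste0("Hello, ", name, "!")
}

greet()       # "Hello, stranger!"
greet("Ada")  # "Hello, Ada!"
```

Unlike a NULL default, missing() needs no default value in the function signature.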

... dot-dot-dot (1)

  • The `...` argument is also known as an Ellipsis or simply dot-dot-dot.
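Two typical uses of `...`, sketched with hypothetical functions `mean_of()` and `count_args()`: forwarding extra arguments to another function, and capturing them as a list.

```r
# Forward extra arguments to another function
mean_of <- function(x, ...) {
  mean(x, ...)   # e.g. na.rm or trim are passed straight through
}
mean_of(c(1, NA, 3))                # NA
mean_of(c(1, NA, 3), na.rm = TRUE)  # 2

# Capture the dots as a list to inspect them
count_args <- function(...) {
  args <- list(...)
  length(args)
}
count_args(1, "a", TRUE)  # 3
```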

... dot-dot-dot (2)

`...` has pros and cons; for more info, see here

... dot-dot-dot (3)

So why does f5 work?

Anonymous Functions

“An Anonymous Function (also known as a lambda expression) is a function definition that is not bound to an identifier. That is, it is a function that is created and used, but never assigned to a variable” (see link)

base R anonymous function syntax:
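A minimal sketch of the base R forms (the `\(x)` shorthand requires R 4.1.0 or later):

```r
# Classic form: function(x) ... used in place without being named
sapply(1:3, function(x) x^2)  # 1 4 9

# Shorthand available since R 4.1.0: \(x) is equivalent to function(x)
sapply(1:3, \(x) x^2)         # 1 4 9

# An anonymous function can also be called immediately
(function(x, y) x + y)(1, 2)  # 3
```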

purrr’s anonymous function syntax:
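A minimal sketch of purrr's formula shorthand, assuming the purrr package is installed:

```r
library(purrr)

# Formula shorthand: ~ starts the function, .x is its argument
map_dbl(1:3, ~ .x^2)           # 1 4 9

# With two inputs, .x and .y refer to the first and second argument
map2_dbl(1:3, 4:6, ~ .x + .y)  # 5 7 9

# Base R's \(x) shorthand also works inside purrr functions
map_dbl(1:3, \(x) x^2)         # 1 4 9
```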

Functional Programming with base R

| Function | Description |
| --- | --- |
| apply(X, MARGIN, FUN, ...) | Applies a function over the margins (rows or columns) of an array or matrix. |
| sapply(X, FUN, ...) | Simplifies the result of lapply() by attempting to reduce the result to a vector, matrix, or higher-dimensional array. |
| vapply(X, FUN, FUN.VALUE) | Similar to sapply(), but with a specified type of return value, making it safer and faster by avoiding unexpected type coercion. |
| lapply(X, FUN, ...) | Applies a function to each element of a list or vector and returns a list. |
| tapply(X, INDEX, FUN = NULL) | Applies a function over subsets of a vector, array, or data frame, split by the levels of a factor or list of factors. |
| do.call(what, args, ...) | Constructs and executes a function call from a name or a function and a list of arguments to be passed to it. |
| mapply(FUN, ...) | A multivariate version of sapply(); applies a function to the 1st elements of each argument, then the 2nd elements of each argument, and so on. |
| Map(f, ...) | Similar to mapply() but always returns a list, regardless of the output type. |
| Reduce(f, x, init, right = FALSE, accumulate = FALSE) | Applies a function successively to elements of a vector (from left to right by default) so as to reduce the vector to a single value. |
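To make the table concrete, here is a hedged sketch that calls most of these functions on small made-up inputs (`scores` and the grouping labels are illustrative only):

```r
scores <- list(a = c(1, 2, 3), b = c(4, 5, 6))

as_list <- lapply(scores, mean)                           # list: a = 2, b = 5
as_vec  <- sapply(scores, mean)                           # named numeric vector: a = 2, b = 5
checked <- vapply(scores, mean, FUN.VALUE = numeric(1))   # same, but the return type is checked

pair_sums  <- mapply(function(x, y) x + y, 1:3, 4:6)      # 5 7 9
pair_sums2 <- Map(function(x, y) x + y, 1:3, 4:6)         # list of 5, 7, 9

# Sum values within each group label: g1 = 40, g2 = 60
by_group <- tapply(c(10, 20, 30, 40), c("g1", "g2", "g1", "g2"), sum)

stacked <- do.call(rbind, scores)                         # 2 x 3 matrix, one row per list element
total   <- Reduce(`+`, 1:4)                               # 10
```

vapply() is often preferred over sapply() inside functions because it fails loudly if a result does not match FUN.VALUE.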

base R Examples (1)

base R Examples (2)

  • “[1] 3 2” indicates that among the three random samples generated, the numbers 3 and 2 are the only ones that appear at least once in all three vectors.
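The kind of result described above can be reproduced with Reduce() and intersect(). In this sketch, three fixed vectors stand in for the random samples, so the values are illustrative assumptions rather than the slide's original data:

```r
# Three fixed vectors stand in for the random samples (illustrative values)
samples <- list(
  c(3, 5, 2, 3),
  c(2, 3, 7, 1),
  c(9, 3, 2, 2)
)

# Reduce() applies intersect() pairwise: intersect(intersect(s1, s2), s3)
common <- Reduce(intersect, samples)
common  # 3 2
```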

Functional Programming with purrr

| Function | Description |
| --- | --- |
| map(.x, .f, ...) | Applies a function to each element of a list or vector and returns a list. Useful for operations on list elements. |
| map2(.x, .y, .f, ...) | Applies a function to the corresponding elements of two vectors/lists; useful for element-wise operations on two inputs. |
| pmap(.l, .f, ...) | Applies a function element-wise over any number of inputs supplied together as a list (.l) of vectors or lists. "Parallel" here means the inputs are iterated in step with one another, not that the work runs on multiple cores. |
| reduce(.x, .f, ..., .init, .dir) | Reduces a list or vector to a single value by iteratively applying a function that takes two arguments. |
| _dbl, _int, _chr, _lgl, and _vec suffixes | Variants of map, map2, and pmap that fix the output type, e.g., map_dbl, map_int, map_chr, map_lgl, map_vec, map2_dbl, ... |
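A hedged sketch of the purrr functions in the table, assuming purrr is installed; the inputs are small made-up vectors:

```r
library(purrr)

x <- list(a = 1:3, b = 4:6)

means_list <- map(x, mean)                   # list: a = 2, b = 5
means_dbl  <- map_dbl(x, mean)               # named double vector: a = 2, b = 5

products <- map2_dbl(1:3, 4:6, ~ .x * .y)    # 4 10 18

# pmap() takes a list of inputs; argument names are matched to list names
sums <- pmap_dbl(list(a = 1:2, b = 3:4, c = 5:6),
                 function(a, b, c) a + b + c)  # 9 12

total      <- reduce(1:4, `+`)               # 10
total_init <- reduce(1:4, `+`, .init = 100)  # 110
```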

purrr Examples

Parallel Computing

Parallel Computing with furrr

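# Naive recursive Fibonacci; deliberately slow, which makes it a useful timing benchmark
fib_n <- function(n) {
    if (n == 0 || n == 1) {   # base cases; || is the scalar "or" intended for if()
        return(1)
    }
    fib_n(n - 1) + fib_n(n - 2)
}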
library(furrr) # library(furrr) loads future by default
library(tictoc)

future::plan(multisession, workers = 1) # setting num of cores/workers
tic()
num <- 1:30 |> future_map(fib_n)
toc() # 4.151 sec elapsed
future::plan(multisession, workers = 2) # Using 1 additional core
tic()
num <- 1:30 |> future_map(fib_n)
toc() # 2.174 sec elapsed

A Note on Parallel Computing

Parallel computing is not a magic bullet. Performance depends on the overhead of parallelization, task granularity, and whether or not the task is sequential.

  • Sequential tasks, generally speaking, cannot be parallelized, though they can be “functionally parallelized”: given a function, seq_func, the internals of seq_func cannot be parallelized, but the calls to seq_func can be.
  • We will illustrate this concept with Random Walks

Random Walks

library(furrr)
library(tictoc)

nworkers = parallel::detectCores() - 1 # select nworkers to amount of cores - 1 
random_walk <- function(steps) {
  position <- numeric(steps) # Initialize the position vector
  position[1] <- 0 # Start at the origin
  for (i in seq_len(steps)[-1]) {        # Simulate each step of the walk (no iterations when steps = 1)
    if (runif(1) < 0.5) {
      position[i] <- position[i - 1] + 1 # Move forward
    } else {
      position[i] <- position[i - 1] - 1 # Move backward
    }
  }
  return(position)
}
steps = 10000; n_random_walks = 300 # Define the number of steps and walks

future::plan(multisession, workers = 1) # setting num of cores/workers
tic() # Measure time taken to execute the random walk
set.seed(1); walks = future_map(1:n_random_walks , ~random_walk(steps),.options = furrr_options(seed = TRUE)) 
toc() # 3.088 sec elapsed

tic()
future::plan(multisession, workers = nworkers) # setting num of cores/workers
set.seed(1);walks = future_map(1:n_random_walks , ~random_walk(steps),.options = furrr_options(seed = TRUE)) 
toc() # 1.713 sec elapsed

pdf("random_walks.pdf") # write the first 10 walks to a PDF, one plot per page
invisible(
  lapply(1:10, function(i) {
    plot(walks[[i]], type = "l", ylab = "Position", xlab = "Step",
         main = paste("Random Walk", i))
  })
)
dev.off()

Our First 10 Random Walk Plots

Bootstraps in Parallel

boot <- function(x, B = 5000, m, theta.f, w = 1, rdist, ...) {
  plan(multisession, workers = w) # Set up for parallel execution
  b_indices <- 1:B # vector of indices for bootstrapping iterations
  iterate_func <- function(b) { # apply for each bootstrap iteration
    if (m == "p") {
      d.b <- rdist(...) # parametric bootstrap
    } else if (m == "np") {
      d.b <- x[sample(seq_along(x), replace = TRUE)] # nonparametric bootstrap
    } else {
      stop("m must be 'p' (parametric) or 'np' (nonparametric)")
    }
    theta.f(d.b)
  }
  # future_map_dbl to apply iterate_func over each index in parallel with proper seeding
  t.s <- future_map_dbl(b_indices, iterate_func, .options = furrr_options(seed = TRUE))
  samp.o(t.s) # Summarize the bootstrap results
}
samp.o <- function(t.s) { # summarize bootstrap replicates: mean, sd, and 95% percentile interval
  round(c(mean = mean(t.s), sd = sd(t.s),
          lower = quantile(t.s, 0.025, names = FALSE),
          upper = quantile(t.s, 0.975, names = FALSE)), digits = 6)
}
library(furrr)  # provides future_map_dbl() and furrr_options(); attaches future for plan()
library(tictoc)

# boot <- function(x, B = 5000, m, theta.f, w = 1, rdist, ...) {} # see above
# samp.o = function(t.s) {} # see above
theta.f = function(d.b) {p = mean(d.b); p/(1-p)} # odds of success; mean(d.b) avoids relying on the global n

set.seed(1); n = 800000; y = 480; B = 5000
data <- c(rep(1, y), rep(0, n-y)); phat <- sum(data)/n
tic()
(b_p_future <- boot(data, B = B, m = "p", theta.f = theta.f, w = 1,
                    rdist = rbinom, n = n, size = 1, prob = phat))
toc() # 49.859 sec elapsed
tic()
(b_p_future = boot(data, B = B, m = "p", theta.f = theta.f, w = 9,
                    rdist = rbinom, n = n, size = 1, prob = phat))
toc() # 8.014 sec elapsed

It's Time for Some More Exercises

Q5

Write an R function named is_positive that takes a single numeric input and returns TRUE if the number is positive, and FALSE otherwise.

Q6

Create a function named sqrt_safe that computes the square root of a number. If the input is negative, the function should stop execution and return an error message "Cannot take square root of a negative number."

Q7

Write a function named find_first_negative that takes a numeric vector and returns the position of the first negative number. If there are no negative numbers, return NA.

Q8

Create a function named halve_until_less_than_one that takes a single numeric argument and keeps halving it until it is less than 1, then returns the result. Keep track of the number of times the input is halved; print the function output as list(result = x, nsteps = count)

Q9

Write a function named scale_columns that takes a matrix and scales (normalizes) each column to have a mean of 0 and a standard deviation of 1. Use the given dataframe M

Q10

Using the purrr package, write a function that takes a list of numeric vectors and returns a list of their means. Use purrr::map.

Q11

Create a function named multiply_and_add that takes an arbitrary number of numeric vectors. It should multiply each vector by its index in the argument list and then sum all the results into a single number.

Acknowledgements

  • Thank you, Molly Caldwell, for your efforts organizing!
  • A big thanks to the R-Ladies Community!
  • Thank you, Seth, for your patience with my experiment using Quarto revealjs and webR!!
  • Photo Credit: Conny Schneider

Wanna Hear From Us?

Thank you!!