An Advanced Introduction to R

Daniel Hintz | Seth Rankins | 2024-03-25

An R-Ladies Workshop

Outline (1)

Loops, Functionals (to replace loops), and Logicals
R Function Basics
- Structure
- Some practice to start
R Functions Extended
- formals(), body(), and environment()
- More on Environments
- Default Argument Assignment
- NULL Conditional Execution
- message(), warning(), and stop()
- Conditional Execution with missing()
- Using the Three Dots Ellipsis, ...
- Anonymous Functions

Outline (2)

Functional Programming with base R
- Function Lookup Table
- Examples
Functional Programming with purrr
- Function Lookup Table
- Examples
Parallel Computing
- Parallel Computing with furrr
- A Note on Parallel Computing
Practice Questions

webR Code Execution

Use cmd + enter to execute a line.
As of webR 0.2.3, webR does not support smart execution, so for multi-line code, highlight the entire section before pressing cmd + enter.

Loops, Functions, and Logical Statements

What is the Difference Between Vectorization and Iteration?

If y = 4x + 3, what is y for x = 4?
If \(N_{t+1} = N_{t}\,e^{r\left(1 - \frac{N_{t}}{K}\right)}\), what is \(N\) at \(t = 4\) (given \(N_{1} = 2\), \(r = 0.5\), and \(K = 30\))?

Now for the Math in R
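The two questions above can be sketched in R as follows. This is a minimal illustration, not the workshop's original code; the variable names are assumptions. The first calculation is vectorized because every `y` can be computed independently, while the second must iterate because each `N` depends on the previous one.

```r
# Vectorized: one arithmetic expression handles every x at once
x <- 1:4
y <- 4 * x + 3
y[4]  # value of y at x = 4, i.e. 19

# Iterative: N at time t + 1 depends on N at time t,
# so we step through time with a for loop
r <- 0.5
K <- 30
N <- numeric(4)  # pre-allocate the result vector
N[1] <- 2
for (t in 1:3) {
  N[t + 1] <- N[t] * exp(r * (1 - N[t] / K))
}
N[4]  # population size at t = 4
```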

Now Onto a More Complex `for` Loop

For people who write their own algorithms, `while` loops are a good skill to know.
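As a small sketch of when a `while` loop is the natural choice, the example below keeps iterating the population model from earlier until the population reaches 90% of `K`. The stopping threshold and the `max_steps` safety cap are assumptions added for illustration; the point is that the number of iterations is not known in advance.

```r
r <- 0.5
K <- 30
N <- 2
steps <- 0
max_steps <- 1000  # safety cap so the loop cannot run forever

# Keep stepping until the population reaches 90% of carrying capacity
while (N < 0.9 * K && steps < max_steps) {
  N <- N * exp(r * (1 - N / K))
  steps <- steps + 1
}

steps  # number of iterations that were needed
N      # first population size at or above 0.9 * K
```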

Now for `apply` Functions

`apply` Functions (cont’d)

You might want to look at the pbapply package if you regularly parallelize your code.

Logical Statements (1)

Logical Statements (2)
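A brief sketch of the main logical tools, using a made-up `temp` vector (an assumption, not workshop data): element-wise comparison, vectorized `ifelse()`, and scalar `if`/`else` with `&&`.

```r
temp <- c(12, NA, 25, 31)

# Element-wise comparison; NA propagates
is_hot <- temp > 24

# ifelse() is vectorized: one decision per element
label <- ifelse(is.na(temp), "missing",
                ifelse(temp > 24, "hot", "mild"))

# if / else works on a single TRUE or FALSE; && short-circuits
x <- 5
if (x > 0 && x %% 2 == 1) {
  kind <- "positive odd"
} else {
  kind <- "other"
}

# any() and all() collapse a logical vector to one value
any_above_30 <- any(temp > 30, na.rm = TRUE)
```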

Let's Try Some Exercises

Q1

Read in the 20 “Weather Station” data files with a `for` loop and combine them into one data frame.

Q2

Now modify your for loop to omit any files that contain an NA.

Q3

Use an apply function to perform a Shapiro-Wilk test on each column of the data set created below (dat) to determine if the data in each column is normally distributed.

Q4

Make a column in the data created below (a subset of the weather station data) for week. When X (the row ID, or day of the month) is 1-7, Week should be 1; when X is 8-14, Week should be 2; when X is 15-21, Week should be 3; when X is 22-28, Week should be 4; and when X is 29-30, Week should be assigned a value of NA.

R Function Basics

Function Structure

Functions are like grammatically correct sentences: they require arguments, a body, and an environment.

Let's Write Some Functions

Functions need an object call as their last expression (e.g. `x` or `return(x)`) in order to return the value of `x` to the environment the function was called from.
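A minimal sketch of this point, with hypothetical function names: a function whose last expression is an assignment still returns a value, but invisibly, so nothing prints at the console.

```r
# Last expression is an assignment: the value is returned invisibly
add_one_silent <- function(x) {
  y <- x + 1
}

# Last expression is the object itself: the value is returned visibly
add_one <- function(x) {
  y <- x + 1
  y
}

# return() makes the returned value explicit
add_one_explicit <- function(x) {
  return(x + 1)
}

add_one_silent(1)    # prints nothing
add_one(1)           # prints 2
add_one_explicit(1)  # prints 2
```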

R Functions Extended

Function `formals()`, `body()` and `environment()`
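As a short sketch, the three components of a function can be inspected directly. The function `power` below is a hypothetical example.

```r
power <- function(x, exponent = 2) {
  x^exponent
}

formals(power)         # the argument list, including defaults
names(formals(power))  # "x" "exponent"
body(power)            # the code inside the function
environment(power)     # where the function was defined
```

When `power` is defined at the top level of a script, `environment(power)` is the global environment.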

Lexical Scoping

Imagine a castle with three levels (Keep, Main, Lower), where some levels can see more of the castle than others. For example, from the Keep you can see inside the Keep, into the Main level, and into the Lower level. In R, there are also varying degrees of “sight”. For example, inside a function, objects defined inside the function are visible, as are objects defined outside it, such as those in the global environment.
If you were in the castle and wanted to search for an object, one possible procedure is to search the Lower level first (say, f3); if you have not found it, search the Main level (f4); and finally, if you still have not found it, search the Keep. If it is not there either, you would declare that the object is not in the castle. This is analogous to how lexical scoping works in R.
In R, the `<<-` operator does not simply assign a value to a variable in the parent environment; it does so only if the variable already exists there. Otherwise, R searches up through the chain of enclosing environments until it finds one where the variable exists and assigns the value there. If the variable is not found in any enclosing environment, R creates it in the global environment.
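The lookup order and the behavior of `<<-` can be sketched as below. The function names `f_outer`, `f_inner`, and `make_counter` are hypothetical and are not the `f3`/`f4` functions from the slides.

```r
x <- "global"

f_outer <- function() {
  x <- "outer"
  f_inner <- function() {
    x  # not defined here, so R looks in f_outer's environment next
  }
  f_inner()
}
f_outer()  # "outer", found before the global x is ever consulted

# <<- modifies an existing variable in an enclosing environment
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # updates count in make_counter's environment
    count
  }
}
counter <- make_counter()
counter()  # 1
counter()  # 2
```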

Default Argument Assignment
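A small sketch of default arguments, using hypothetical functions: defaults are used when the caller omits an argument, and a default may refer to other arguments because it is evaluated lazily inside the function.

```r
# digits and na.rm have defaults, so only x is required
summarize_vec <- function(x, digits = 2, na.rm = TRUE) {
  round(mean(x, na.rm = na.rm), digits)
}

summarize_vec(c(1.234, 2.345, NA))              # uses both defaults
summarize_vec(c(1.234, 2.345, NA), digits = 0)  # overrides one default

# A default can depend on another argument
first_n <- function(x, n = length(x)) {
  x[seq_len(n)]
}
first_n(letters[1:5])      # all five elements
first_n(letters[1:5], 2)   # "a" "b"
```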

`NULL` Conditional Execution (1)

`message()`, `warning()`, and `stop()`
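The three condition functions differ in severity, which the hypothetical `safe_log()` below tries to illustrate: `message()` informs, `warning()` flags a problem but lets the function continue, and `stop()` aborts with an error.

```r
safe_log <- function(x) {
  # stop(): the input is unusable, so abort with an error
  if (!is.numeric(x)) {
    stop("`x` must be numeric")
  }
  # warning(): something is off, but we can still return a result
  if (any(x <= 0)) {
    warning("non-positive values were set to NA")
    x[x <= 0] <- NA
  }
  # message(): purely informational
  message("taking the log of ", length(x), " value(s)")
  log(x)
}

safe_log(c(1, 10))  # message only
safe_log(c(1, -1))  # warning and message; second element is NA
# safe_log("a")     # would raise an error
```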

`NULL` Conditional Execution (2)
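One common pattern, sketched here with a hypothetical `describe()` function, is to default an optional argument to `NULL` and branch on `is.null()` inside the body.

```r
describe <- function(x, label = NULL) {
  # Only fall back to a generic label when none was supplied
  if (is.null(label)) {
    label <- "x"
  }
  paste0(label, ": mean = ", mean(x))
}

describe(1:3)                    # "x: mean = 2"
describe(1:3, label = "height")  # "height: mean = 2"
```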

Conditional Execution with `missing()`
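`missing()` checks whether the caller supplied an argument at all, so no default value is needed. A minimal sketch with a hypothetical `greet()` function:

```r
greet <- function(name) {
  # missing(name) is TRUE when the caller did not pass `name`
  if (missing(name)) {
    return("Hello, whoever you are!")
  }
  paste0("Hello, ", name, "!")
}

greet()        # "Hello, whoever you are!"
greet("Ada")   # "Hello, Ada!"
```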

`...` dot-dot-dot (1)

The `...` argument is also known as an Ellipsis or simply dot-dot-dot.
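A short sketch of the two most common uses of `...`, with hypothetical function names: forwarding extra arguments to another function, and capturing them with `list(...)`.

```r
# Forwarding: anything passed in ... is handed on to mean()
my_summary <- function(x, ...) {
  c(mean = mean(x, ...), n = length(x))
}
my_summary(c(1, 2, NA), na.rm = TRUE)  # na.rm reaches mean() through ...

# Capturing: list(...) turns the extra arguments into a list
count_args <- function(...) {
  args <- list(...)
  length(args)
}
count_args(1, "a", TRUE)  # 3
```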

`...` dot-dot-dot (2)

`...` has pros and cons, for more info see here

`...` dot-dot-dot (3)

So why does f5 work?

Anonymous Functions

“An Anonymous Function (also known as a lambda expression) is a function definition that is not bound to an identifier. That is, it is a function that is created and used, but never assigned to a variable” (see link)

base R anonymous function syntax:
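A minimal sketch of the base R forms. The `\(x)` shorthand requires R 4.1 or later.

```r
# Classic form: function(x) ... used inline without a name
squares_long <- sapply(1:3, function(x) x^2)

# Shorthand lambda (R >= 4.1): \(x) is equivalent to function(x)
squares_short <- sapply(1:3, \(x) x^2)

# An anonymous function can also be called immediately
three <- (function(x) x + 1)(2)
```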

purrr’s anonymous function syntax:
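A minimal sketch of purrr's formula shorthand, assuming the purrr package is installed. Inside a `~` formula, `.x` refers to the first argument and `.y` to the second.

```r
library(purrr)

# Formula shorthand: ~ .x^2 stands for function(.x) .x^2
sq_formula <- map_dbl(1:3, ~ .x^2)

# The base R lambda works in purrr too
sq_lambda <- map_dbl(1:3, \(x) x^2)

# Two inputs: .x and .y
sums <- map2_dbl(1:3, 4:6, ~ .x + .y)
```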

Functional Programming with `base` R

Function	Description
`apply(X, MARGIN, FUN, ...)`	Applies a function over the margins (rows or columns) of an array or matrix.
`sapply(X, FUN, ...)`	Simplifies the result of `lapply()` by attempting to reduce the result to a vector, matrix, or higher-dimensional array.
`vapply(X, FUN, FUN.VALUE)`	Similar to `sapply()`, but with a specified type of return value, making it safer and faster by avoiding unexpected type coercion.
`lapply(X, FUN, ...)`	Applies a function to each element of a list or vector and returns a list.
`tapply(X, INDEX, FUN = NULL)`	Applies a function over subsets of a vector, array, or data frame, split by the levels of a factor or list of factors.
`do.call(what, args, ...)`	Constructs and executes a function call from a name or a function and a list of arguments to be passed to it.
`mapply(FUN, ...)`	A multivariate version of `sapply()`, applies a function to the 1st elements of each argument, then the 2nd elements of each argument, and so on.
`Map(f, ...)`	Similar to `mapply()` but always returns a list, regardless of the output type.
`Reduce(f, x, init, ...)`	Applies a function successively to elements of a vector from left to right so as to reduce the vector to a single value.

`base` R Examples (1)

apply
sapply
lapply
vapply
tapply
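The five functions listed above can be sketched on small made-up inputs. The objects `m`, `lst`, `values`, and `groups` are assumptions for illustration, not the workshop's data.

```r
m <- matrix(1:6, nrow = 2)  # 2 rows, 3 columns

# apply: MARGIN = 1 works over rows, MARGIN = 2 over columns
row_sums <- apply(m, 1, sum)  # 9 12
col_sums <- apply(m, 2, sum)  # 3 7 11

lst <- list(a = 1:3, b = 4:6)

# lapply always returns a list
means_list <- lapply(lst, mean)

# sapply simplifies the result to a named vector when it can
means_vec <- sapply(lst, mean)

# vapply does the same but checks each result against FUN.VALUE
means_checked <- vapply(lst, mean, FUN.VALUE = numeric(1))

# tapply applies a function within groups
values <- c(10, 20, 30, 40)
groups <- c("a", "b", "a", "b")
group_means <- tapply(values, groups, mean)  # a = 20, b = 30
```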

`base` R Examples (2)

do.call
mapply
Map
Reduce

“[1] 3 2” indicates that among the three random samples generated, the numbers 3 and 2 are the only ones that appear at least once in all three vectors.
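The four functions above can be sketched as follows. To keep the result reproducible, the `Reduce()` call uses three fixed vectors (an assumption) in place of random samples; they are chosen so that the common elements are again 3 and 2.

```r
# do.call: call rbind with the list elements as its arguments
stacked <- do.call(rbind, list(1:2, 3:4))  # same as rbind(1:2, 3:4)

# mapply: walk over two vectors in step, simplifying the result
pair_sums <- mapply(function(x, y) x + y, 1:3, 4:6)  # 5 7 9

# Map: same idea, but always returns a list
pair_sums_list <- Map(function(x, y) x + y, 1:3, 4:6)

# Reduce: combine elements pairwise from left to right
total <- Reduce(`+`, 1:4)  # ((1 + 2) + 3) + 4 = 10

# Reduce with intersect keeps only values present in every vector
common <- Reduce(intersect, list(c(3, 1, 2), c(3, 2, 5), c(2, 3, 4)))
common  # 3 2
```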

Functional Programming with `purrr`

Function	Description
`map(.x, .f, ...)`	Applies a function to each element of a list or vector and returns a list. Useful for operations on list elements.
`map2(.x, .y, .f, ...)`	Applies a function to the corresponding elements of two vectors/lists, useful for element-wise operations on two inputs.
`pmap(.l, .f, ...)`	Applies a function element-wise over any number of lists or vectors supplied together in the list `.l`; generalizes `map2()` to more than two inputs.
`reduce(.x, .f, ..., .init, .dir)`	Reduces a list or vector to a single value by iteratively applying a function that takes two arguments.
`_dbl`, `_int`, `_chr`, `_lgl`, and `_vec`	`map`, `map2`, and `pmap` variants that change the output type, e.g., `map_dbl`, `map_int`, `map_chr`, `map_lgl`, `map_vec`, `map2_dbl`, ...

`purrr` Examples

map
map2
pmap
reduce
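The four functions above can be sketched on small made-up inputs, assuming the purrr package is installed. These are illustrative calls, not the workshop's original examples.

```r
library(purrr)

# map: one input, always returns a list
roots <- map(c(1, 4, 9), sqrt)

# Typed variant: returns a double vector instead of a list
roots_dbl <- map_dbl(c(1, 4, 9), sqrt)  # 1 2 3

# map2: two inputs walked in step
products <- map2_dbl(1:3, 4:6, ~ .x * .y)  # 4 10 18

# pmap: any number of inputs, supplied together as one list
triple_sums <- pmap_dbl(list(1:3, 4:6, 7:9),
                        function(a, b, c) a + b + c)  # 12 15 18

# reduce: collapse to a single value
total <- reduce(1:4, `+`)  # 10
```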

Parallel Computing

Parallel Computing with `furrr`

fib_n <- function(n) {
    if ((n == 0) || (n == 1)) # scalar test, so use || rather than |
        return(1)
    else
        return(fib_n(n-1) + fib_n(n-2))
}

library(furrr) # library(furrr) loads future by default
library(tictoc)

future::plan(multisession, workers = 1) # setting num of cores/workers
tic()
num <- 1:30 |> future_map(fib_n)
toc() # 4.151 sec elapsed

future::plan(multisession, workers = 2) # Using 1 additional core
tic()
num <- 1:30 |> future_map(fib_n)
toc() # 2.174 sec elapsed

A Note on Parallel Computing

Parallel computing is not a magic bullet. Performance depends on the overhead of parallelization, the task granularity, and whether or not the task is sequential.

Sequential tasks, generally speaking, cannot be parallelized, though they can be “functionally parallelized”: given a function, seq_func, the internals of seq_func cannot be parallelized, but repeated calls to seq_func can be.
We will illustrate this concept with random walks.

Random Walks

library(furrr)
library(tictoc)

nworkers = parallel::detectCores() - 1 # select nworkers to amount of cores - 1 
random_walk <- function(steps) {
  position <- numeric(steps) # Initialize the position vector
  position[1] <- 0 # Start at the origin
  for (i in 2:steps) {                   # Simulate each step of the walk
    if (runif(1) < 0.5) {
      position[i] <- position[i - 1] + 1 # Move forward
    } else {
      position[i] <- position[i - 1] - 1 # Move backward
    }
  }
  return(position)
}

steps = 10000; n_random_walks = 300 # Define the number of steps and walks

future::plan(multisession, workers = 1) # setting num of cores/workers
tic() # Measure time taken to execute the random walk
set.seed(1); walks = future_map(1:n_random_walks , ~random_walk(steps),.options = furrr_options(seed = TRUE)) 
toc() # 3.088 sec elapsed

tic()
future::plan(multisession, workers = nworkers) # setting num of cores/workers
set.seed(1);walks = future_map(1:n_random_walks , ~random_walk(steps),.options = furrr_options(seed = TRUE)) 
toc() # 1.713 sec elapsed

pdf("random_walks.pdf")
invisible(
  lapply(1:10, function(i) 
    plot(walks[[i]],type = "l", ylab = "Position", xlab = "Step",
         main = paste("Random Walk",i)))
  );dev.off()

Our First 10 Random Walks Plots

Bootstraps in Parallel

boot <- function(x, B = 5000, m, theta.f, w = 1, rdist, ...) {
  plan(multisession, workers = w) # Set up for parallel execution
  b_indices <- 1:B # vector of indices for bootstrapping iterations
  iterate_func <- function(b) { # apply for each bootstrap iteration
    if (m == "p") {
      d.b <- rdist(...) # parametric bootstrap
    } else if (m == "np") {
      d.b <- x[sample(1:length(x), replace = TRUE)] # nonparametric bootstrap
    } else {
      stop("possible arguments for m is 'p' (parametric) or 'np' (nonparametric)")
    }
    theta.f(d.b)
  }
  # future_map_dbl to apply iterate_func over each index in parallel with proper seeding
  t.s <- future_map_dbl(b_indices, iterate_func, .options = furrr_options(seed = TRUE))
  samp.o(t.s) # Summarize the bootstrap results
}
samp.o = function(t.s) {
  round(c(mean=mean(t.s),sd=sd(t.s),lower=quantile(t.s, 0.025, names = F),
  upper= quantile(t.s, 0.975, names = F)),digits=6)}

library(furrr) # boot() calls future_map_dbl() and furrr_options()
library(future)
library(tictoc)

# boot <- function(x, B = 5000, m, theta.f, w = 1, rdist, ...) {} # see above
# samp.o = function(t.s) {} # see above
theta.f = function(d.b) {p = sum(d.b)/n; p/(1-p)} 

set.seed(1); n = 800000; y = 480; B = 5000
data <- c(rep(1, y), rep(0, n-y)); phat <- sum(data)/n

tic()
(b_p_future <- boot(data, B = B, m = "p", theta.f = theta.f, w = 1,
                    rdist = rbinom, n = n, size = 1, prob = phat))
toc() # 49.859 sec elapsed

tic()
(b_p_future = boot(data, B = B, m = "p", theta.f = theta.f, w = 9,
                    rdist = rbinom, n = n, size = 1, prob = phat))
toc() # 8.014 sec elapsed

It's Time for Some More Exercises

Q5

Write an R function named is_positive that takes a single numeric input and returns TRUE if the number is positive, and FALSE otherwise.

Q6

Create a function named sqrt_safe that computes the square root of a number. If the input is negative, the function should stop execution and return an error message "Cannot take square root of a negative number."

Q7

Write a function named find_first_negative that takes a numeric vector and returns the position of the first negative number. If there are no negative numbers, return NA.

Q8

Create a function named halve_until_less_than_one that takes a single numeric argument and keeps halving it until it is less than 1, then returns the result. Keep track of the number of times the input is halved; print the function output as list(result = x, nsteps = count)

Q9

Write a function named scale_columns that takes a matrix and scales (normalizes) each column to have a mean of 0 and a standard deviation of 1. Use the given dataframe M

Q10

Using the purrr package, write a function that takes a list of numeric vectors and returns a list of their means. Use purrr::map.

Q11

Create a function named multiply_and_add that takes an arbitrary number of numeric vectors. It should multiply each vector by its index in the argument list and then sum all the results into a single number.

Acknowledgements

Thank you, Molly Caldwell, for your efforts organizing!
A big thanks to the R-Ladies Community!
Thank you, Seth, for your patience with my experiment using Quarto revealjs and webR!
Photo Credit: Conny Schneider

Wanna Hear From Us?

Thank you!!