29 Simulation and Reproducibility – Statistical Computing using R and Python

29.1 Objectives

Understand the limitations of pseudorandom number generation
Program a simulation for a specific task, process, or model
Determine the appropriate data structure to use for storing simulation task results

Simulation is an extremely important part of computational statistics. Bayesian statistics, in particular, relies on Markov Chain Monte Carlo (MCMC) to get results from even the most basic of models. In this module, we’re going to touch on a few foundational pieces of simulation in computing. Hopefully, you will get more exposure to simulation in both theory and methods courses down the line.

29.2 Pseudorandom Number Generation

Computers are almost entirely deterministic, which makes it very difficult to come up with “random” numbers. In addition to the deterministic nature of computing, it’s also somewhat important to be able to run the same code and get the same results every time, which isn’t possible if you rely on truly random numbers.

Historically, pseudorandom numbers were generated using linear congruential generators (LCGs) [1]. These algorithms aren’t typically used anymore, but they provide a good demonstration of how one might go about generating numbers that seem “random” but are actually deterministic. LCGs use modular arithmetic: \[X_{n+1} = (aX_n + c) \mod m\] where $X_0$ is the start value (the seed), $a$ is the multiplier, $c$ is the increment, and $m$ is the modulus. When using a LCG, the user generally specifies only the seed.

LCGs generate numbers which at first appear random, but once sufficiently many numbers have been generated, it is clear that there is some structure in the data. (Image from Wikimedia)

The important thing to note here is that if you specify the same generator values ($a$, $c$, $m$, and $X_0$), you will always get the same series of numbers. Since $a$, $c$, $m$ are usually specified by the implementation, as a user, you should expect that if you specify the same seed, you will get the same results, every time.

Warning

It is critically important to set your seed if you want the results to be reproducible and you are using an algorithm that depends on randomness.

Once you set your seed, the remaining results will only be reproducible if you generate the same set of random numbers every time.

I once helped a friend fit a model for their masters thesis using Simulated Annealing (which relies on random seeds). We got brilliant results, but couldn’t ever reproduce them, because I hadn’t set the seed first and we never could figure out what the original seed was. 😭

Example: Setting Seeds for Reproducibility

R
Python

set.seed(342512)

# Get 10 numbers after the seed is set
sample(1:100, 10)
##  [1] 65 51 64 21 45 53  3  6 43  8

# Compute something else that depends on randomness
mean(rnorm(50))
## [1] -0.1095366

# Get 10 more numbers
sample(1:100, 10)
##  [1]  4 57 69 10 76 15 67  1  3 91

import random
import numpy as np

# Create a random generator with a specific seed
rng = np.random.default_rng(342512)

# Generate 10 integers
rng.integers(low = 1, high = 100, size = 10)
## array([18, 43, 71,  4, 35, 26, 41, 91, 42, 13])

# Generate 500 std normal draws and take the mean
np.mean(rng.standard_normal(500))
## -0.008197259441979758

# Get 10 more numbers
rng.integers(low = 1, high = 100, size = 10)
## array([33, 38,  3, 95,  3, 58, 79,  3, 77, 23])

Compare the results above to these results:

R
Python

set.seed(342512)

# Get 10 numbers after the seed is set
sample(1:100, 10)
##  [1] 65 51 64 21 45 53  3  6 43  8

# Compute something else that depends on randomness
mean(rnorm(30))
## [1] -0.1936645

# Get 10 more numbers
sample(1:100, 10)
##  [1]  49  37   6  34   9   3 100  43   7  29

import random
import numpy as np

# Create a random generator with a specific seed
rng = np.random.default_rng(342512)

# Generate 10 integers
rng.integers(low = 1, high = 100, size = 10)
## array([18, 43, 71,  4, 35, 26, 41, 91, 42, 13])

# Generate 30 std normal draws and take the mean
np.mean(rng.standard_normal(30))
## 0.3016849078747997

# Get 10 more numbers
rng.integers(low = 1, high = 100, size = 10)
## array([21, 49, 21, 99, 45,  1, 56, 70, 15, 82])

Notice how the results have changed?

To make my documents more reproducible, I will sometimes set a new seed at the start of an important chunk, even if I’ve already set the seed earlier. This introduces certain “fixed points” where results won’t change immediately after I’ve re-set the seed. This is particularly important when I’m generating bootstrap estimates, fitting models, or simulating data for graphics experiments.

A gif of Dr. River Song from Dr. Who, in a spacesuit, saying 'Fixed points can be rewritten' while fluttering her eyelashes. — If you know, you know. If not, go watch some Dr. Who sometime. Source

Pick your seed in any way you want. I tend to just randomly wiggle my fingers over the number keys, but I have also heard of people using the date in yyyymmdd format, favorite people’s birthdays, the current time in hhmmss format… basically, you can use anything.

29.3 Built-in simulations from distributions

Often, we can get away with just simulating data from a known distribution. As both R and python are meant for statistical computing, this is extremely easy by design.

R
Python

You can see the various distribution options using ?Distributions. In general, dxxx is the PDF/PMF, pxxx is the CDF, qxxx is the quantile function, and rxxx gives you random nubmers generated from the distribution. (xxx, obviously, is whatever distribution you’re looking to use.)

library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(109025879)

tibble(
  norm = rnorm(500),
  gamma = rgamma(500, shape = 3, scale = 1),
  exp = rexp(500, rate = 1), # R uses a exp(-ax) 
  t = rt(500, df = 5),
  chisq = rchisq(500, 5)
) %>%
  pivot_longer(1:5, names_to = "dist", values_to = "value") %>%
  ggplot(aes(x = value)) + geom_density() + facet_wrap(~dist, scales = "free", nrow = 1)

import random
random.seed(109025879)

import pandas as pd
import numpy as np

wide_df = pd.DataFrame({
  "norm": np.random.normal(size=500),
  "gamma": np.random.gamma(size=500, shape = 3, scale = 1),
  "exp": np.random.exponential(size = 500, scale = 1),
  "t": np.random.standard_t(df = 5, size = 500),
  "chisq": np.random.chisquare(df = 5, size = 500)
})

long_df = pd.melt(wide_df, id_vars = None, var_name = "dist", value_name = "value")

from plotnine import *

(ggplot(long_df, aes(x = "value")) + geom_density() + facet_wrap("dist", scales="free", nrow = 1))
## <Figure Size: (640 x 480)>

Try it out

Generate variables x and y, where x is a sequence from -10 to 10 and y is equal to $x + \epsilon$, $\epsilon \sim N(0, 1)$. Fit a linear regression to your simulated data (in R, lm, in Python, sklearn.linear_model’s LinearRegression).

Hint: Sample code for regression using sklearn [2].

set.seed(20572983)
data <- tibble(x = seq(-10, 10, .1), 
               y = x + rnorm(length(x)))
regression <- lm(y ~ x, data = data)
summary(regression)
## 
## Call:
## lm(formula = y ~ x, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.14575 -0.70986  0.03186  0.65429  2.40305 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.01876    0.06869  -0.273    0.785    
## x            0.99230    0.01184  83.823   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9738 on 199 degrees of freedom
## Multiple R-squared:  0.9725, Adjusted R-squared:  0.9723 
## F-statistic:  7026 on 1 and 199 DF,  p-value: < 2.2e-16

import random

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

random.seed(20572983)

data = pd.DataFrame({'x': np.arange(-10, 10, .1)})
data['y'] = data.x + np.random.normal(size = data.x.size)

# Fitting the regression and predictions
# scikit-learn requires that we reshape everything into
# nparrays before we pass them into the model.fit() function.
model = LinearRegression().\
  fit(data.x.values.reshape(-1, 1),\
      data.y.values.reshape(-1, 1))
data['pred'] = model.predict(data.x.values.reshape(-1, 1))

# Plotting the results
import matplotlib.pyplot as plt
plt.clf()

plt.scatter(data.x, data.y)
plt.plot(data.x, data.pred, color='red')
plt.show()

Simulation from different distributions can be used to determine which estimators are most appropriate for a given scenario, to determine how likely it is to observe a specific value in a sample of size $n$, and for many other applications.

Example: Mean vs. Median

Suppose you want to determine which estimator is better - mean or median - for $t$-distributions with different degrees of freedom and different numbers of observations.

You consider degrees of freedom $\nu = 2, 5, 10, 30$ and samples of size $n = 10, 20, 30, 50$. For each sample, you calculate the mean and median, $\hat\theta$, and you repeat this process $N=1000$ times. You then calculate the bias of the sample estimator $\hat\theta$ (mean, median) and the MSE of that estimator. The bias is $\text{Bias}(\hat\theta, \theta) = \text{Bias}_\theta\left[\hat\theta\right] = E_{x|\theta}\left[\hat\theta\right]- \theta = E_{x|\theta} \left[\hat\theta - \theta\right]$. The MSE (mean squared error) of $\hat\theta$ is $\text{MSE}(\hat\theta) = E\left[(\hat\theta-\theta)^2\right]$. We know that the MSE is related to the bias and the variance of $\hat\theta$: $\text{MSE}(\hat\theta) = \left(\text{Bias}(\hat\theta,\theta)\right)^2+\text{Var}(\hat\theta)$ (the Bias-Variance tradeoff).

How can you use this information to determine which estimator is preferable in each $\nu, n$ situation?

R
Python

First, let’s conduct the simulation. I find it helpful to make a function to simulate the data and calculate the necessary quantities, and then to use the purrr package to run my function $N$ times with each parameter set.

I always save the original simulated data, even when it’s not explicitly necessary, because I want to know exactly what results contributed, and this is the easiest way to get it. This is not the most memory efficient way to go, so if you are concerned about overflowing your computer’s memory, you might omit saving the data once you’re confident the simulation function works as planned.

library(tibble)
library(purrr)
library(tidyr)
library(dplyr)

# Write a function to draw a single sample
draw_sample <- function(n = 10, v = 2) {
  my_samp <- rt(n, v)
  tibble(data = list(my_samp), mean = mean(my_samp), median = median(my_samp))
}

pars <- expand_grid(n = c(10, 20, 30, 50), v = c(2, 5, 10, 30), id = 1:1000)

# Only recompute results if they're not present already
if (!file.exists("data/t-sim-results.RDS")) {
  results <- pars |>
  rowwise() |>
  mutate(res = map2(n, v, draw_sample))
  
  results <- unnest(results, "res")
  saveRDS(results, file = "data/t-sim-results.RDS")
} else {
  results <- readRDS("data/t-sim-results.RDS")
}


result_sum <- results |>
  pivot_longer(mean:median, names_to = "estimator", values_to = "value") |>
  group_by(n, v, estimator) |>
  summarize(bias = mean(value), variance = var(value), mse = mean(value^2))

Now that I have the simulation summaries, I can think about plotting the results to focus on the difference in Bias/MSE between estimators under different parameter sets. As our goal is to determine whether mean or median is preferable, we should focus on making that comparison as easy as possible by plotting mean and median lines on the same plot.

library(ggplot2)
result_sum |> 
  pivot_longer(bias:mse, names_to = "quantity", values_to = "value") |>
  mutate(quantity = factor(quantity, levels = c("bias", "variance", "mse"), ordered = T)) |>
  ggplot(aes(x = n, y = value, linetype = estimator, color = factor(v))) + 
  geom_line() + 
  guides(color = 'none') + 
  facet_grid(v~quantity, scales = "free_y", space = "free_y") + 
  xlab("Sample Size") + ylab("Expected Value") +
  theme_bw()

A chart showing lines comparing the mean and median performance under simulation. The chart has three columns and four rows, with columns corresponding to the bias, variance, and mse and rows to degrees of freedom (2, 5, 10, 30) shown as rows. The y-axis of each subplot is the expected value of the corresponding measurement (bias, variance, mse), and the x-axis of each subplot is the number of observations in the sample (10, 20, 30, 50). In each subplot, there are two lines showing the mean and median. The first row is allocated much more space than subsequent rows, with y values ranging between 0 and .75, while subsequent rows range between 0 and 0.15. The lines representing bias (first column) are very flat and close to 0, though there is more variability around the 0 line when degrees of freedom = 2. The variance and MSE tend to decrease as n increases across all panels, but again when DF=2, there is more variability and the trend is less clear. When DF=2, the mean line is higher than the median line for the variance and MSE over all values of n, indicating that the variance of the mean is higher than the variance of the median (the MSE is the bias + variance, and is technically redundant here). When DF=5 and we look at the variance or MSE, the lines are very close and it is difficult to determine which estimator would be better. When DF=10 or DF=20 and we consider the variance or MSE, the median line is higher than the mean across all values of n, indicating that in these situations we would prefer the mean as an estimator over the median. — Summary chart of t-distribution estimator simulation results. Across $n$ and $ u$, we see that the bias is around zero - that is, both mean and median are unbiased estimators (which is sensible given that the $t$-distribution is symmetrical). We see that when $ u=2$, the variance component is much higher than $ u=5,10,30$, and as MSE is the sum of the bias and variance, the MSE is also higher in this case. At $ u=5$, the mean and median perform fairly similarly; but for $ u=2$, the median has lower variance and thus lower MSE than the mean, which is due to the very heavy tails of the $t$-distribution with low degrees of freedom; the mean is influenced by the higher chances of extreme observations. Thus, when $ u=2$, we would prefer the median as an estimator. When $ u=10,30$, however, the mean has lower variance and thus MSE than the median across all values of $n$, indicating that when there are sufficient degrees of freedom, we should prefer the mean as an estimator.

from scipy.stats import t # t distribution class
import numpy as np # mean, median
import pandas as pd

rv = t(2)
n_opts = [10, 20, 30, 50]
v_opts = [2, 5, 10, 30]

pars = [(n, v, i) for n in n_opts for v in v_opts for i in range(1000)]
# Simulate t-data for each parameter combination and iteration
# IMO, it's easier to use list comprehension for this, 
# but you could absolutely write a function instead
data = [t(x[1]).rvs(x[0]) for x in pars]
median= [np.median(dat) for dat in data]
mean = [np.mean(dat) for dat in data]

# Construct a data structure to hold all results together
results = pd.DataFrame(pars, columns = ['n', 'df', 'i'])
results['data'] = data
results['median'] = median
results['mean'] = mean

def mse(x, x0=0):
  return np.mean((x - x0)**2)

# summarize
res_sum = results.groupby(['n', 'df']).agg({'median': ['mean', 'var', mse], 'mean': ['mean', 'var', mse]})
# Merge column indices together
res_sum.columns = ['_'.join(col) for col in res_sum.columns]
# "Ungroup" python style
res_sum = res_sum.reset_index()
# Melt to long form
res_sum_long = pd.melt(res_sum, id_vars=['n', 'df'], var_name = 'est', value_name = 'value')
res_sum_long[['estimator', 'measure']] = res_sum_long.est.str.split("_", expand = True)
res_sum_long['measure'] = res_sum_long['measure'].str.replace('mean', 'bias', regex = True)
res_sum_long = res_sum_long.drop('est', axis = 1)

Now that I have the simulation summaries, I can think about plotting the results to focus on the difference in Bias/MSE between estimators under different parameter sets. As our goal is to determine whether mean or median is preferable, we should focus on making that comparison as easy as possible by plotting mean and median lines on the same plot.

import seaborn.objects as so
import matplotlib.pyplot as plt
(
  so.Plot(res_sum_long, x = "n", y = "value", linestyle = 'estimator')
  .add(so.Line())
  .facet('measure', 'df')
  .share(y="row")
  .label(x = "Sample Size", y = "Value")
  .show()
)

Considering the results we obtained, it seems clear that when we have a very low ($\nu<=5$) degrees of freedom, the median is a preferable estimator because it has lower variance; when we have a higher number of degrees of freedom, the mean is a preferable estimator on the basis of variance. Both estimators are asymptotically unbiased, and are unbiased even for small sample sizes when the distribution is symmetric.

29.4 Simulation to test model assumptions

One of the more powerful ways to use simulation in practice is to use it to test the assumptions of your model. Suppose, for instance, that your data are highly skewed, but you want to use a method that assumes normally distributed errors. How bad will your results be? Where can you trust the results, and where should you be cautious?

The purrr::map notation specifies that we are using the map function from the purrr package. When functions are named generically, and there may be more than one package with a function name, it is often more readable to specify the package name along with the function.

purrr::map takes an argument and for each “group” calls the compute_interval function, storing the results in res. So each row in res is a 1x2 tibble with columns lb and ub.

This pattern is very useful in all sorts of applications. You can read more about purrr in Chapter 27.

Example: Confidence Interval coverage rates

Suppose, for instance, that we have a lognormal distribution (highly skewed) and we want to compute a 95% confidence interval for the mean of our 25 observations.

You want to assess the coverage probability of a confidence interval computed under two different modeling scenarios:

Working with the log-transformed values, ln(x), and then transform the computed interval back
Working with the raw values, x, compute an interval assuming the data are symmetric, essentially treating the lognormal distribution as if it were normal.

Scenario 1:

the expected value of the standard normal deviates is 0
the variance of the data is 1
the SE($\overline x$) is $\sqrt\frac{1}{25} = \frac{1}{5}$

Our theoretical interval should be $(\exp(-1.96/5), \exp(1.96/5)) = (0.6757, 1.4799)$.

Scenario 2

The expected value of the lognormal distribution is $\exp(1/2) = 1.6487213$
The variance of the data is $(\exp(1) - 1)(\exp(1)) = 4.6707743$
The SE($\overline x$) is thus $\sqrt{\frac{(e^1 - 1)e^1}{25}} = \frac{\sqrt{(e^1 - 1)e^1}}{5} = 0.4322$

Our theoretical interval should be $(0.8015, 2.4959)$. This interval could, if the circumstances were slightly different, contain 0, which is implausible for lognormally distributed data.

Our expected values are different under scenario 1 and scenario 2:

In scenario 1 we are computing an interval for $\mu$
In scenario 2, we are computing an interval for the population mean, which is $\exp(\mu + .5\sigma^2)$

Both are valid quantities we might be interested in, but they do not mean the same thing.

set.seed(40295023)

sim <- tibble(
  id = rep(1:100, each = 25), # generate 100 samples of 25 points each
  ln_x = rnorm(25*100), # generate 25 normal deviates for each sample
  x = exp(ln_x), # transform into lognormal deviates
) %>%
  # this creates a 100-row data frame, with one row for each id. 
  # the columns x, ln_x are stored in the data list-column as a tibble.
  nest(data = c(x, ln_x))
  
head(sim)
## # A tibble: 6 × 2
##      id data             
##   <int> <list>           
## 1     1 <tibble [25 × 2]>
## 2     2 <tibble [25 × 2]>
## 3     3 <tibble [25 × 2]>
## 4     4 <tibble [25 × 2]>
## 5     5 <tibble [25 × 2]>
## 6     6 <tibble [25 × 2]>
sim$data[[1]]
## # A tibble: 25 × 2
##        x    ln_x
##    <dbl>   <dbl>
##  1 0.310 -1.17  
##  2 0.622 -0.475 
##  3 0.303 -1.19  
##  4 1.05   0.0525
##  5 0.529 -0.636 
##  6 1.09   0.0891
##  7 1.97   0.676 
##  8 8.94   2.19  
##  9 0.598 -0.514 
## 10 0.183 -1.70  
## # ℹ 15 more rows

compute_interval <- function(x) {
  s1 <- exp(mean(log(x)) + c(-1, 1) * qnorm(.975) * sd(log(x))/sqrt(length(x)))
  s2 <- mean(x) + c(-1, 1) * qnorm(.975) * sd(x)/sqrt(length(x))
  tibble(scenario = c("scenario_1", "scenario_2"),
         mean = c(1, exp(1/2)),
         lb = c(s1[1], s2[1]), ub = c(s1[2], s2[2]),
         in_interval = (lb < mean) & (ub > mean))
}


sim_long <- sim %>%
  # This line takes each data entry and computes an interval for x.
  # .$x is code for take the argument you passed in to map and get the x column
  mutate(res = purrr::map(data, ~compute_interval(.$x))) %>%
  # this "frees" res and we end up with two columns: lb and ub, for each scenario
  unnest(res)
  

ci_df <- tibble(scenario = c("scenario_1", "scenario_2"),
                mu = c(1, exp(1/2)),
                lb = c(exp(-1.96/5), exp(.5) - 1.96*sqrt((exp(1) - 1)*exp(1))/5),
                ub = c(exp(1.96/5), exp(.5) + 1.96*sqrt((exp(1) - 1)*exp(1))/5))

ggplot() + 
  geom_rect(aes(xmin = lb, xmax = ub, ymin = -Inf, ymax = Inf), 
            data = ci_df,
            fill = "grey", alpha = .5, color = NA) + 
  geom_vline(aes(xintercept = mu), data = ci_df) + 
  geom_segment(aes(x = lb, xend = ub, y = id, yend = id, color = in_interval),
               data = sim_long) + 
  scale_color_manual(values = c("red", "black")) + 
  theme_bw() + 
  facet_wrap(~scenario)

From this, we can see that working with the log-transformed, normally distributed results has better coverage probability than working with the raw data and computing the population mean: the estimates in the latter procedure have lower coverage probability, and many of the intervals are much wider than necessary; in some cases, the interval actually lies outside of the domain.

Here is a similar example worked through in SAS with IML. Note the use of BY-group processing to analyze each group at once - this is very similar to the use of purrr::map() in the R code.

Example: Multilevel Regression and Post Stratification simulation

Multilevel regression and post-stratification simulation with toddler bedtimes [3]

This example talks about how to take a biased sample and then recover the original unbiased estimates – which is something you have to test using simulation to be sure it works, because you never actually know what the true population features are when you are working with real world data. When reading this example, you may not be all that interested with the specific model - but focus on the process of simulating data for your analysis so that you understand how and why you would want to simulate data in order to test a computational method.

Example: Regression and high-leverage points

What happens if we have one high-leverage point (e.g. a point which is an outlier in both x and y)? How pathological do our regression coefficient estimates get?

The challenging part here is to design a data generating mechanism.

gen_data <- function(o = 1, n = 30, error_sd = 2) {
  # generate the main part of the regression data
  data <- tibble(x = rnorm(n = n - o, 
                           mean = seq(-10, 10, length.out = n - o), 
                           sd = .1),
                 y = x + rnorm(length(x), 
                               mean = 0, 
                               sd = error_sd))
  # generate the outlier - make it at ~(-10, 5)
  outdata <- tibble(x = rnorm(o, -10), y = rnorm(o, 5, error_sd))
  bind_rows(data, outdata)
}

sim_data <- crossing(id = 1:100, outliers = 0:2) %>%
  mutate(
  # call gen_data for each row in sim_data, 
  # but don't use id as a parameter.
  data = purrr::map(outliers, gen_data) 
)

head(sim_data)
## # A tibble: 6 × 3
##      id outliers data             
##   <int>    <int> <list>           
## 1     1        0 <tibble [30 × 2]>
## 2     1        1 <tibble [30 × 2]>
## 3     1        2 <tibble [30 × 2]>
## 4     2        0 <tibble [30 × 2]>
## 5     2        1 <tibble [30 × 2]>
## 6     2        2 <tibble [30 × 2]>

# plot a few datasets just to check they look like we expect:
sim_data %>%
  filter(id %% 100 < 3) %>%
  unnest(data) %>%
  ggplot(aes(x = x, y = y)) + 
  geom_point() + 
  facet_grid(id ~ outliers, labeller = label_both)

library(broom) # the broom package cleans up model objects to tidy form

sim_data <- sim_data %>%
  # fit linear regression
  mutate(model = purrr::map(data, ~lm(y ~ x, data = .)))  %>%
  mutate(tidy_model = purrr::map(model, tidy))

# Get the coefficients out
tidy_coefs <- select(sim_data, id, outliers, tidy_model) %>%
  unnest(tidy_model) %>%
  mutate(group = case_when(outliers == 0 ~ "No HLPs",
                           outliers == 1 ~ "1 HLP",
                           outliers == 2 ~ "2 HLPs") %>%
           factor(levels = c("No HLPs", "1 HLP", "2 HLPs")))

ggplot(tidy_coefs, aes(x = estimate, color = group)) + 
  facet_grid(term ~ .) + 
  geom_density()

Obviously, you should experiment with different methods of generating a high-leverage point (maybe use a different distribution?) but this generating mechanism is simple enough for our purposes and shows that the addition of high leverage points biases the true values (slope = 1, intercept = 0).

Try it out

Let’s explore what happens to estimates when certain observations are censored.

Suppose we have a poorly-designed digital thermometer which cannot detect temperatures above 102$^\circ F$; for these temperatures, the thermometer will record a value of 102.0.

It is estimated that normal body temperature for dogs and cats is 101 to 102.5 degrees Fahrenheit, and values above 104 degrees F are indicative of illness. Given that you have this poorly calibrated thermometer, design a simulation which estimates the average temperature your thermometer would record for a sample of 100 dogs or cats, and determine the magnitude of the effect of the thermometer’s censoring.

If most pets have a normal body temperature between 101 and 102.5 degrees, can you use these bounds to determine appropriate parameters for a normal distribution? What if you assume that 101 and 102.5 are the 2SD bounds?

If 101 and 102.5 are the anchor points we have, let’s assume that 95% of normal pet temperatures fall in that range. So our average temperature would be 101.75, and our standard deviation would be .75/2 = 0.375.

We can simulate 1000 observations from $N(101.75, 0.375)$, create a new variable which truncates them at 102, and compute the mean of both variables to determine just how biased our results are.

set.seed(204209527)
dogtemp <- tibble(
  actual = rnorm(1000, 101.75, 0.375),
  read = pmin(actual, 102)
) 
dogtemp %>%
  summarize_all(mean) %>%
  diff()
## Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)]: non-numeric argument to binary operator

The effect of the thermometer’s censoring is around 0.06 degrees F for animals that are not ill.

import numpy as np
import pandas as pd
import random

random.seed(204209527)
dogtemp = pd.DataFrame({
  "actual": np.random.normal(size = 1000, loc = 101.75, scale = 0.375)
})
dogtemp['read'] = np.minimum(dogtemp.actual, 102)

np.diff(dogtemp.mean())
## array([-0.04985633])

The effect of the thermometer’s censoring is around 0.06 degrees F for animals that are not ill.

29.5 Monte Carlo methods

Monte carlo methods [4] are methods which rely on repeated random sampling in order to solve numerical problems. Often, the types of problems approached with MC methods are extremely difficult or impossible to solve analytically.

In general, a MC problem involves these steps:

Define the input domain
Generate inputs randomly from an appropriate probability distribution
Perform a computation using those inputs
Aggregate the results.

Example: Sum of Uniform Random Variables

Let’s try it out by using MC simulation to estimate the number of uniform (0,1) random variables needed for the sum to exceed 1.

More precisely, if $u_i \sim U(0,1)$, where _{i=1}^k u_i > 1, what is the expected value of $k$?

In this simulation, our input domain is [0,1].
Our input is $u_i \sim U(0,1)$
We generate new $u_i$ until $\sum_{i=1}^k > 1$ and save the value of $k$
We average the result of $N$ such simulations.

# It's easier to think through the code if we write it inefficiently first
sim_fcn <- function() {
  usum <- 0
  k <- 0
  # prevent infinite loops by monitoring the value of k as well
  while (usum < 1 & k < 15) {
    usum <- runif(1) + usum
    k <- k + 1
  }
  return(k)
}

set.seed(302497852)
res <- tibble(k = replicate(1000, sim_fcn(), simplify = T))

mean(res$k)
## [1] 2.717

If we want to see whether the result converges to something, we can increase the number of trials we run:

set.seed(20417023)

sim_res <- tibble(samp = replicate(250000, sim_fcn(), simplify = T)) 

sim_res <- sim_res %>%
  mutate(running_avg_est = cummean(samp),
         N = row_number())

ggplot(aes(x = N, y = running_avg_est), data = sim_res) + 
  geom_hline(yintercept = exp(1), color = "red") + 
  geom_line()

import numpy as np
import random
import pandas as pd


def sim_fcn():
  usum = 0
  k = 0
  # prevent infinite loops by monitoring the value of k as well
  while usum < 1 and k < 15:
    # print("k = ", k)
    usum = np.random.uniform(size=1) + usum
    k += 1
  return k

random.seed(302497852)
res = pd.DataFrame({"k": [sim_fcn() for _ in range(1000)]})

If we want to see whether the result converges to something, we can increase the number of trials we run:

random.seed(20417023)

sim_res = pd.DataFrame({"k": [sim_fcn() for _ in range(250000)]})
sim_res['running_avg_est'] = sim_res.k.expanding().mean()
sim_res['N'] = np.arange(len(sim_res))

import seaborn as sns
import matplotlib.pyplot as plt
plt.clf()

graph = sns.lineplot(data = sim_res, x = 'N', y = 'running_avg_est', color = "black")
graph.axhline(y = np.exp(1), xmin = 0, xmax = 1, color = "red")
plt.show()

The expected number of uniform RV draws required to sum to 1 is $e$!

Explanation of why this works

Monte Carlo methods are often used to approximate the value of integrals which do not have a closed-form (in particular, these integrals tend to pop up frequently in Bayesian methods).

Example: Integration

Suppose you want to integrate \[\int_0^1 x^2 \sin \left(\frac{1}{x}\right) dx\]

A function that oscillates around 0 very rapidly from 0 to 0.12 and then reaches a minimum at approximately x = 0.25 before increasing at an approximately linear rate through x = 1.0. — f(x) over the interval [0,1].

You could set up Riemann integration and evaluate the integral using a sum over $K$ points, but that approach only converges for smooth functions (and besides, that’s boring Calc 2 stuff, right?).

Instead, let’s observe that this is equivalent to $\int_0^1 x^2 \sin \left(\frac{1}{x}\right) \cdot 1 dx$, where $p(x) = 1$ for a uniform random variable. That is, this integral can be written as the expected value of the function over the interval $[0,1]$. What if we just generate a bunch of uniform(0,1) variables, evaluate the value of the function at those points, and average the result?

You can use the law of large numbers to prove that this approach will converge. [5]

set.seed(20491720)
fn <- function(x)  x^2 * sin(1/x)

sim_data <- tibble(x = runif(100000),
                   y = fn(x))
mean(sim_data$y)
## [1] 0.28607461

random.seed(20491720)

def fn(x):
  return x**2 * np.sin(1/x)

sim_data = pd.DataFrame({"x": np.random.uniform(size = 100000)})
sim_data['y'] = fn(sim_data.x)

sim_data.y.mean()
## 0.2872329317146997

fn <- function(x)  x^2 * sin(1/x)

riemann <- tibble(x = seq(0, 1, length.out = 10000)[-1],
                  y = fn(x))
mean(riemann$y)
## [1] 0.28657161

Example: Integration in 2d

Let’s say that you want to find an estimate for $\pi$, and you know that a circle with radius 1 has an area of exactly that. You also know, that all of the points on this circle can be written as $x^2 + y^2 \le 1$.

A circle centered in (0,0) with radius 1. — The unit circle.

Evaluating the area of the circle mathematically, would need us to either change to polar-coordinates or separate the graph into suitable functions (half-circles), and evaluate the integral between the top and the bottom: \[ \int_{-1}^1 2 \sqrt{1-x^2} dx \] Instead, we note that the circle is encapsulated in a square with side length 2. We can reach all points in that square by using two independent uniform random random variables over the interval $[-1,1]$, i.e. when we generate two random values from U[-1,1], and use one as the $x$ coordinate and one as the $y$ coordinate, we get a point in the square. If the sum of the squares of the coordinates are less than 1, the point will also fall inside the circle. If not, the point falls in one of the four corners of the square that are outside the circle.

A circle centered in (0,0) with radius 1 overlaid with randomly generated points. The points inside the circle are drawn in a different color from the ones outside the circle. — The unit circle is encapsulated by a square and overlaid with uniform points from U[-1,1] x U[-1,1].

How do we get to an estimate of $\pi$ from there? We know that the area of the square is simply $2^2 = 4$. The area of the circle is then directly proportional to the rate at which points fall into the circle, ie.

\[ \hat{\pi} = 4 \times \frac{\text{Number of points with } x^2+y^2 \le 1}{\text{Number of points generated}}. \] The more points we generate, the closer our estimate will be to the real value.

This problem is an example for Monte-Carlo Integration using an Acceptance-Rejection approach: we can slightly re-write the simulation and think of the generation of a new point in the circle as a two step process, where we first generate a value for $x$ from U[-1,1], and in second step generate a candidate $c$ for $y$ from U[-1, 1], which we will only accept as $y$, if $|c| \le \sqrt{1-x^2}$. Acceptance-Rejection sampling is the basis of a lot of Markov-Chain Monte-Carlo (MCMC) methods, such as e.g. the Metropolis-Hastings algorithm.

set.seed(20491720)

calculate_pi <- function(R) {
  x = runif(R, min=-1, max=1)
  y = runif(R, min=-1, max=1)
  in_circle = x^2+y^2<1
  
  4 * sum(in_circle) / R
}

# Quite a bit of variability with just 100 values
calculate_pi(100)
## [1] 3.16
calculate_pi(100)
## [1] 3.2
calculate_pi(100)
## [1] 2.96

# Better with 10,000
calculate_pi(10000)
## [1] 3.126
calculate_pi(10000)
## [1] 3.1392

# Better, but still only good for about 2-3 digits
calculate_pi(1000000) 
## [1] 3.14344

pi
## [1] 3.1415927

random.seed(20491720)

def calculate_pi(R):
  x = np.random.uniform(size = R)
  y = np.random.uniform(size = R)
  in_circle = x**2+y**2<1
  
  return 4 * sum(in_circle) / R


# Quite a bit of variability with just 100 values
calculate_pi(100)
## 3.24
calculate_pi(100)
## 3.04
calculate_pi(100)
## 3.04

# Better with 10,000
calculate_pi(10000)
## 3.1272
calculate_pi(10000)
## 3.1728

# Better, but still only good for about 2-3 digits
calculate_pi(1000000) 
## 3.1429

np.pi
## 3.141592653589793

set.seed(20491720)
fn <- function(x)  2*sqrt(1-x^2)

integrate(fn, lower=-1, upper=1)
## 3.1415927 with absolute error < 2e-09

pi
## [1] 3.1415927

Try it out

Buffon’s needle is a mathematical problem which can be boiled down to a simple physical simulation. Read this science friday description of the problem and develop a monte carlo simulation method which estimates $\pi$ using the Buffon’s needle method. Your method should be a function which

allows the user to specify how many sticks are dropped
plots the result of the physical simulation
prints out a numerical estimate of pi.

Let’s start out with horizontal lines at 0 and 1, and set our stick length to 1. We need to randomly generate a position (of one end of the stick) and an angle. The position in $x$ doesn’t actually make much of a difference (since what we care about is the $y$ coordinates), but we can draw a picture if we generate $x$ as well.

needle_sim <- function(sticks = 100) {
  df <- tibble(xstart = runif(sticks, 0, 10), 
         ystart = runif(sticks, 0, 1), 
         angle = runif(sticks, 0, 360),
         xend = xstart + cos(angle/180*pi), 
         yend = ystart + sin(angle/180*pi)
  ) %>%
    # We can see if a stick crosses a line if the floor() function of ystart is 
    # different than floor(yend). 
    # Note this only works for integer line values.
  mutate(crosses_line = floor(ystart) != floor(yend)) 
  
  
  gg <- ggplot() + 
  geom_hline(yintercept = c(0, 1)) + 
  geom_segment(aes(x = xstart, y = ystart, xend = xend, yend = yend,
                   color = crosses_line), data = df) + 
  coord_fixed()
  
  return(list(est = 2 * sticks / sum(df$crosses_line), plot = gg))
}

needle_sim(10)
## $est
## [1] 3.3333333
## 
## $plot


needle_sim(100)
## $est
## [1] 3.125
## 
## $plot


needle_sim(1000)
## $est
## [1] 3.1446541
## 
## $plot


needle_sim(10000)
## $est
## [1] 3.1730922
## 
## $plot

def needle_sim(sticks = 100):
  df = pd.DataFrame({
    "xstart": np.random.uniform(0, 10, size = sticks),
    "ystart": np.random.uniform(0, 1, size = sticks),
    "angle": np.random.uniform(0, 360, size = sticks)
  })
  
  df['xend'] = df.xstart + np.cos(df.angle/180*np.pi)
  df['yend'] = df.ystart + np.sin(df.angle/180*np.pi)
  df['crosses_line'] = np.floor(df.ystart) != np.floor(df.yend)
  
  return df

data = needle_sim(100000)
data['N'] = np.arange(len(data)) + 1
data['cum_est'] = 2*data.N / data.crosses_line.expanding().sum()


import seaborn as sns
import matplotlib.pyplot as plt
plt.clf()

graph = sns.lineplot(data = data, x = "N", y = "cum_est", color = "black")
graph.axhline(y = np.pi, xmin = 0, xmax = 1, color = "red")
plt.show()

29.6 Other Resources

Simulation (R programming for Data Science chapter)
Simulation - R Studio lesson
Simulation, focusing on statistical modeling (R)
Simulating Data with SAS (Excerpt)
Simulating a Drunkard’s Walk in 2D in SAS
Simulation from a triangle distribution (SAS)
Simulating the Monty Hall problem (SAS)

29.7 References

[1]

Wikipedia contributors, “Linear congruential generator.” Sep. 09, 2022 [Online]. Available: https://en.wikipedia.org/w/index.php?title=Linear_congruential_generator&oldid=1109285432

[2]

A. Menon, “Linear Regression in 6 lines of Python,” Medium. Oct. 2018 [Online]. Available: https://towardsdatascience.com/linear-regression-in-6-lines-of-python-5e1d0cd05b8d. [Accessed: Oct. 17, 2022]

[3]

R. Alexander, “Telling stories with data,” Jul. 27, 2023. [Online]. Available: https://tellingstorieswithdata.com/. [Accessed: Aug. 01, 2023]

[4]

Wikipedia contributors, “Monte Carlo method,” Wikipedia. Oct. 2022 [Online]. Available: https://en.wikipedia.org/w/index.php?title=Monte_Carlo_method&oldid=1116159451. [Accessed: Oct. 17, 2022]

[5]

Y.-C. Chen, “Lecture 2: Monte Carlo Simulation,” Monte Carlo Simulation. 2017 [Online]. Available: http://faculty.washington.edu/yenchic/17Sp_403/Lec2_MonteCarlo.pdf