29 Simulation and Reproducibility
29.1 Objectives
- Program a simulation for a specific task, process, or model
- Understand the limitaitons of pseudorandom number generation
Simulation is an extremely important part of computational statistics. Bayesian statistics, in particular, relies on Markov Chain Monte Carlo (MCMC) to get results from even the most basic of models. In this module, we’re going to touch on a few foundational pieces of simulation in computing, and you will get more exposure to simulation-based methods in other courses down the line.
29.2 Pseudorandom Number Generation
Computers are almost entirely deterministic, which makes it very difficult to come up with “random” numbers. In addition to the deterministic nature of computing, it’s also somewhat important to be able to run the same code and get the same results every time, which isn’t possible if you rely on truly random numbers.
Historically, pseudorandom numbers were generated using linear congruential generators (LCGs) [1]. These algorithms aren’t typically used anymore, but they provide a good demonstration of how one might go about generating numbers that seem “random” but are actually deterministic. LCGs use modular arithmetic: \[X_{n+1} = (aX_n + c) \mod m\] where \(X_0\) is the start value (the seed), \(a\) is the multiplier, \(c\) is the increment, and \(m\) is the modulus. When using a LCG, the user generally specifies only the seed.
The important thing to note here is that if you specify the same generator values (\(a\), \(c\), \(m\), and \(X_0\)), you will always get the same series of numbers. Since \(a\), \(c\), \(m\) are usually specified by the implementation, as a user, you should expect that if you specify the same seed, you will get the same results, every time.
It is critically important to set your seed if you want the results to be reproducible and you are using an algorithm that depends on randomness.
Once you set your seed, the remaining results will only be reproducible if you generate the same set of random numbers every time.
I once helped a friend fit a model for their masters thesis using Simulated Annealing (which relies on random seeds). We got brilliant results, but couldn’t ever reproduce them, because I hadn’t set the seed first and we never could figure out what the original seed was. 😭
import random
import numpy as np
## ModuleNotFoundError: No module named 'numpy'
# Create a random generator with a specific seed
= np.random.default_rng(342512)
rng ## NameError: name 'np' is not defined
# Generate 10 integers
= 1, high = 100, size = 10)
rng.integers(low ## NameError: name 'rng' is not defined
# Generate 500 std normal draws and take the mean
500))
np.mean(rng.standard_normal(## NameError: name 'np' is not defined
# Get 10 more numbers
= 1, high = 100, size = 10)
rng.integers(low ## NameError: name 'rng' is not defined
Compare the results above to these results:
import random
import numpy as np
## ModuleNotFoundError: No module named 'numpy'
# Create a random generator with a specific seed
= np.random.default_rng(342512)
rng ## NameError: name 'np' is not defined
# Generate 10 integers
= 1, high = 100, size = 10)
rng.integers(low ## NameError: name 'rng' is not defined
# Generate 30 std normal draws and take the mean
30))
np.mean(rng.standard_normal(## NameError: name 'np' is not defined
# Get 10 more numbers
= 1, high = 100, size = 10)
rng.integers(low ## NameError: name 'rng' is not defined
Notice how the results have changed?
To make my documents more reproducible, I will sometimes set a new seed at the start of an important chunk, even if I’ve already set the seed earlier. This introduces certain “fixed points” where results won’t change immediately after I’ve re-set the seed. This is particularly important when I’m generating bootstrap estimates, fitting models, or simulating data for graphics experiments.
Pick your seed in any way you want. I tend to just randomly wiggle my fingers over the number keys, but I have also heard of people using the date in yyyymmdd format, favorite people’s birthdays, the current time in hhmmss format… basically, you can use anything.
29.3 Built-in simulations from distributions
Often, we can get away with just simulating data from a known distribution. As both R and python are meant for statistical computing, this is extremely easy by design.
You can see the various distribution options using ?Distributions
. In general, dxxx
is the PDF/PMF, pxxx
is the CDF, qxxx
is the quantile function, and rxxx
gives you random nubmers generated from the distribution. (xxx
, obviously, is whatever distribution you’re looking to use.)
library(tibble)
library(dplyr)
library(tidyr)
library(ggplot2)
set.seed(109025879)
tibble(
norm = rnorm(500),
gamma = rgamma(500, shape = 3, scale = 1),
exp = rexp(500, rate = 1), # R uses a exp(-ax)
t = rt(500, df = 5),
chisq = rchisq(500, 5)
) %>%
pivot_longer(1:5, names_to = "dist", values_to = "value") %>%
ggplot(aes(x = value)) + geom_density() + facet_wrap(~dist, scales = "free", nrow = 1)
import random
109025879)
random.seed(
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'
import numpy as np
## ModuleNotFoundError: No module named 'numpy'
= pd.DataFrame({
wide_df "norm": np.random.normal(size=500),
"gamma": np.random.gamma(size=500, shape = 3, scale = 1),
"exp": np.random.exponential(size = 500, scale = 1),
"t": np.random.standard_t(df = 5, size = 500),
"chisq": np.random.chisquare(df = 5, size = 500)
})## NameError: name 'pd' is not defined
= pd.melt(wide_df, id_vars = None, var_name = "dist", value_name = "value")
long_df ## NameError: name 'pd' is not defined
from plotnine import *
## ModuleNotFoundError: No module named 'plotnine'
= "value")) + geom_density() + facet_wrap("dist", scales="free", nrow = 1))
(ggplot(long_df, aes(x ## NameError: name 'ggplot' is not defined
Generate variables x and y, where x is a sequence from -10 to 10 and y is equal to \(x + \epsilon\), \(\epsilon \sim N(0, 1)\). Fit a linear regression to your simulated data (in R, lm
, in Python, sklearn.linear_model
’s LinearRegression
).
Hint: Sample code for regression using sklearn [2].
set.seed(20572983)
data <- tibble(x = seq(-10, 10, .1),
y = x + rnorm(length(x)))
regression <- lm(y ~ x, data = data)
summary(regression)
##
## Call:
## lm(formula = y ~ x, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.14575 -0.70986 0.03186 0.65429 2.40305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.01876 0.06869 -0.273 0.785
## x 0.99230 0.01184 83.823 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9738 on 199 degrees of freedom
## Multiple R-squared: 0.9725, Adjusted R-squared: 0.9723
## F-statistic: 7026 on 1 and 199 DF, p-value: < 2.2e-16
import random
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'
import numpy as np
## ModuleNotFoundError: No module named 'numpy'
from sklearn.linear_model import LinearRegression
## ModuleNotFoundError: No module named 'sklearn'
20572983)
random.seed(
= pd.DataFrame({'x': np.arange(-10, 10, .1)})
data ## NameError: name 'pd' is not defined
'y'] = data.x + np.random.normal(size = data.x.size)
data[## NameError: name 'data' is not defined
# Fitting the regression and predictions
# scikit-learn requires that we reshape everything into
# nparrays before we pass them into the model.fit() function.
= LinearRegression().\
model -1, 1),\
fit(data.x.values.reshape(-1, 1))
data.y.values.reshape(## NameError: name 'LinearRegression' is not defined
'pred'] = model.predict(data.x.values.reshape(-1, 1))
data[## NameError: name 'model' is not defined
# Plotting the results
import matplotlib.pyplot as plt
## ModuleNotFoundError: No module named 'matplotlib'
plt.clf()## NameError: name 'plt' is not defined
plt.scatter(data.x, data.y)## NameError: name 'plt' is not defined
='red')
plt.plot(data.x, data.pred, color## NameError: name 'plt' is not defined
plt.show()## NameError: name 'plt' is not defined
29.4 Simulation to test model assumptions
One of the more powerful ways to use simulation in practice is to use it to test the assumptions of your model. Suppose, for instance, that your data are highly skewed, but you want to use a method that assumes normally distributed errors. How bad will your results be? Where can you trust the results, and where should you be cautious?
The purrr::map
notation specifies that we’re using the map
function from the purrr
package. When functions are named generically, and there may be more than one package with a function name, it is often more readable to specify the package name along with the function.
purrr::map
takes an argument and for each “group” calls the compute_interval function, storing the results in res
. So each row in res
is a 1x2 tibble with columns lb and ub.
This pattern is very useful in all sorts of applications. You can read more about purrr in Chapter 27.
Suppose, for instance, that we have a lognormal distribution (highly skewed) and we want to compute a 95% confidence interval for the mean of our 25 observations.
You want to assess the coverage probability of a confidence interval computed under two different modeling scenarios:
- Working with the log-transformed values, ln(x), and then transform the computed interval back
- Working with the raw values, x, compute an interval assuming the data are symmetric, essentially treating the lognormal distribution as if it were normal.
Scenario 1:
- the expected value of the standard normal deviates is 0
- the variance of the data is 1
- the SE(\(\overline x\)) is \(\sqrt\frac{1}{25} = \frac{1}{5}\)
Our theoretical interval should be \((\exp(-1.96/5), \exp(1.96/5)) = (0.6757, 1.4799)\).
Scenario 2
- The expected value of the lognormal distribution is \(\exp(1/2) = 1.6487213\)
- The variance of the data is \((\exp(1) - 1)(\exp(1)) = 4.6707743\)
- The SE(\(\overline x\)) is thus \(\sqrt{\frac{(e^1 - 1)e^1}{25}} = \frac{\sqrt{(e^1 - 1)e^1}}{5} = 0.4322\)
Our theoretical interval should be \((0.8015, 2.4959)\). This interval could, if the circumstances were slightly different, contain 0, which is implausible for lognormally distributed data.
Our expected values are different under scenario 1 and scenario 2:
- In scenario 1 we are computing an interval for \(\mu\)
- In scenario 2, we are computing an interval for the population mean, which is \(\exp(\mu + .5\sigma^2)\)
Both are valid quantities we might be interested in, but they do not mean the same thing.
set.seed(40295023)
sim <- tibble(
id = rep(1:100, each = 25), # generate 100 samples of 25 points each
ln_x = rnorm(25*100), # generate 25 normal deviates for each sample
x = exp(ln_x), # transform into lognormal deviates
) %>%
# this creates a 100-row data frame, with one row for each id.
# the columns x, ln_x are stored in the data list-column as a tibble.
nest(data = c(x, ln_x))
head(sim)
## # A tibble: 6 × 2
## id data
## <int> <list>
## 1 1 <tibble [25 × 2]>
## 2 2 <tibble [25 × 2]>
## 3 3 <tibble [25 × 2]>
## 4 4 <tibble [25 × 2]>
## 5 5 <tibble [25 × 2]>
## 6 6 <tibble [25 × 2]>
sim$data[[1]]
## # A tibble: 25 × 2
## x ln_x
## <dbl> <dbl>
## 1 0.310 -1.17
## 2 0.622 -0.475
## 3 0.303 -1.19
## 4 1.05 0.0525
## 5 0.529 -0.636
## 6 1.09 0.0891
## 7 1.97 0.676
## 8 8.94 2.19
## 9 0.598 -0.514
## 10 0.183 -1.70
## # ℹ 15 more rows
compute_interval <- function(x) {
s1 <- exp(mean(log(x)) + c(-1, 1) * qnorm(.975) * sd(log(x))/sqrt(length(x)))
s2 <- mean(x) + c(-1, 1) * qnorm(.975) * sd(x)/sqrt(length(x))
tibble(scenario = c("scenario_1", "scenario_2"),
mean = c(1, exp(1/2)),
lb = c(s1[1], s2[1]), ub = c(s1[2], s2[2]),
in_interval = (lb < mean) & (ub > mean))
}
sim_long <- sim %>%
# This line takes each data entry and computes an interval for x.
# .$x is code for take the argument you passed in to map and get the x column
mutate(res = purrr::map(data, ~compute_interval(.$x))) %>%
# this "frees" res and we end up with two columns: lb and ub, for each scenario
unnest(res)
ci_df <- tibble(scenario = c("scenario_1", "scenario_2"),
mu = c(1, exp(1/2)),
lb = c(exp(-1.96/5), exp(.5) - 1.96*sqrt((exp(1) - 1)*exp(1))/5),
ub = c(exp(1.96/5), exp(.5) + 1.96*sqrt((exp(1) - 1)*exp(1))/5))
ggplot() +
geom_rect(aes(xmin = lb, xmax = ub, ymin = -Inf, ymax = Inf),
data = ci_df,
fill = "grey", alpha = .5, color = NA) +
geom_vline(aes(xintercept = mu), data = ci_df) +
geom_segment(aes(x = lb, xend = ub, y = id, yend = id, color = in_interval),
data = sim_long) +
scale_color_manual(values = c("red", "black")) +
theme_bw() +
facet_wrap(~scenario)
From this, we can see that working with the log-transformed, normally distributed results has better coverage probability than working with the raw data and computing the population mean: the estimates in the latter procedure have lower coverage probability, and many of the intervals are much wider than necessary; in some cases, the interval actually lies outside of the domain.
Here is a similar example worked through in SAS with IML. Note the use of BY-group processing to analyze each group at once - this is very similar to the use of purrr::map()
in the R code.
Multilevel regression and post-stratification simulation with toddler bedtimes [3]
This example talks about how to take a biased sample and then recover the original unbiased estimates – which is something you have to test using simulation to be sure it works, because you never actually know what the true population features are when you are working with real world data. When reading this example, you may not be all that interested with the specific model - but focus on the process of simulating data for your analysis so that you understand how and why you would want to simulate data in order to test a computational method.
What happens if we have one high-leverage point (e.g. a point which is an outlier in both x and y)? How pathological do our regression coefficient estimates get?
The challenging part here is to design a data generating mechanism.
gen_data <- function(o = 1, n = 30, error_sd = 2) {
# generate the main part of the regression data
data <- tibble(x = rnorm(n = n - o,
mean = seq(-10, 10, length.out = n - o),
sd = .1),
y = x + rnorm(length(x),
mean = 0,
sd = error_sd))
# generate the outlier - make it at ~(-10, 5)
outdata <- tibble(x = rnorm(o, -10), y = rnorm(o, 5, error_sd))
bind_rows(data, outdata)
}
sim_data <- crossing(id = 1:100, outliers = 0:2) %>%
mutate(
# call gen_data for each row in sim_data,
# but don't use id as a parameter.
data = purrr::map(outliers, gen_data)
)
head(sim_data)
## # A tibble: 6 × 3
## id outliers data
## <int> <int> <list>
## 1 1 0 <tibble [30 × 2]>
## 2 1 1 <tibble [30 × 2]>
## 3 1 2 <tibble [30 × 2]>
## 4 2 0 <tibble [30 × 2]>
## 5 2 1 <tibble [30 × 2]>
## 6 2 2 <tibble [30 × 2]>
# plot a few datasets just to check they look like we expect:
sim_data %>%
filter(id %% 100 < 3) %>%
unnest(data) %>%
ggplot(aes(x = x, y = y)) +
geom_point() +
facet_grid(id ~ outliers, labeller = label_both)
# Get the coefficients out
tidy_coefs <- select(sim_data, id, outliers, tidy_model) %>%
unnest(tidy_model) %>%
mutate(group = case_when(outliers == 0 ~ "No HLPs",
outliers == 1 ~ "1 HLP",
outliers == 2 ~ "2 HLPs") %>%
factor(levels = c("No HLPs", "1 HLP", "2 HLPs")))
ggplot(tidy_coefs, aes(x = estimate, color = group)) +
facet_grid(term ~ .) +
geom_density()
Obviously, you should experiment with different methods of generating a high-leverage point (maybe use a different distribution?) but this generating mechanism is simple enough for our purposes and shows that the addition of high leverage points biases the true values (slope = 1, intercept = 0).
Let’s explore what happens to estimates when certain observations are censored.
Suppose we have a poorly-designed digital thermometer which cannot detect temperatures above 102\(^\circ F\); for these temperatures, the thermometer will record a value of 102.0.
It is estimated that normal body temperature for dogs and cats is 101 to 102.5 degrees Fahrenheit, and values above 104 degrees F are indicative of illness. Given that you have this poorly calibrated thermometer, design a simulation which estimates the average temperature your thermometer would record for a sample of 100 dogs or cats, and determine the magnitude of the effect of the thermometer’s censoring.
If most pets have a normal body temperature between 101 and 102.5 degrees, can you use these bounds to determine appropriate parameters for a normal distribution? What if you assume that 101 and 102.5 are the 2SD bounds?
If 101 and 102.5 are the anchor points we have, let’s assume that 95% of normal pet temperatures fall in that range. So our average temperature would be 101.75, and our standard deviation would be .75/2 = 0.375.
We can simulate 1000 observations from \(N(101.75, 0.375)\), create a new variable which truncates them at 102, and compute the mean of both variables to determine just how biased our results are.
The effect of the thermometer’s censoring is around 0.06 degrees F for animals that are not ill.
import numpy as np
## ModuleNotFoundError: No module named 'numpy'
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'
import random
204209527)
random.seed(= pd.DataFrame({
dogtemp "actual": np.random.normal(size = 1000, loc = 101.75, scale = 0.375)
})## NameError: name 'pd' is not defined
'read'] = np.minimum(dogtemp.actual, 102)
dogtemp[## NameError: name 'np' is not defined
np.diff(dogtemp.mean())## NameError: name 'np' is not defined
The effect of the thermometer’s censoring is around 0.06 degrees F for animals that are not ill.
29.5 Monte Carlo methods
Monte carlo methods [4] are methods which rely on repeated random sampling in order to solve numerical problems. Often, the types of problems approached with MC methods are extremely difficult or impossible to solve analytically.
In general, a MC problem involves these steps:
- Define the input domain
- Generate inputs randomly from an appropriate probability distribution
- Perform a computation using those inputs
- Aggregate the results.
Let’s try it out by using MC simulation to estimate the number of uniform (0,1) random variables needed for the sum to exceed 1.
More precisely, if \(u_i \sim U(0,1)\), where _{i=1}^k u_i > 1, what is the expected value of \(k\)?
- In this simulation, our input domain is [0,1].
- Our input is \(u_i \sim U(0,1)\)
- We generate new \(u_i\) until \(\sum_{i=1}^k > 1\) and save the value of \(k\)
- We average the result of \(N\) such simulations.
# It's easier to think through the code if we write it inefficiently first
sim_fcn <- function() {
usum <- 0
k <- 0
# prevent infinite loops by monitoring the value of k as well
while (usum < 1 & k < 15) {
usum <- runif(1) + usum
k <- k + 1
}
return(k)
}
set.seed(302497852)
res <- tibble(k = replicate(1000, sim_fcn(), simplify = T))
mean(res$k)
## [1] 2.717
If we want to see whether the result converges to something, we can increase the number of trials we run:
import numpy as np
## ModuleNotFoundError: No module named 'numpy'
import random
import pandas as pd
## ModuleNotFoundError: No module named 'pandas'
def sim_fcn():
= 0
usum = 0
k # prevent infinite loops by monitoring the value of k as well
while usum < 1 and k < 15:
# print("k = ", k)
= np.random.uniform(size=1) + usum
usum += 1
k return k
302497852)
random.seed(= pd.DataFrame({"k": [sim_fcn() for _ in range(1000)]})
res ## NameError: name 'pd' is not defined
If we want to see whether the result converges to something, we can increase the number of trials we run:
20417023)
random.seed(
= pd.DataFrame({"k": [sim_fcn() for _ in range(250000)]})
sim_res ## NameError: name 'pd' is not defined
'running_avg_est'] = sim_res.k.expanding().mean()
sim_res[## NameError: name 'sim_res' is not defined
'N'] = np.arange(len(sim_res))
sim_res[## NameError: name 'np' is not defined
import seaborn as sns
## ModuleNotFoundError: No module named 'seaborn'
import matplotlib.pyplot as plt
## ModuleNotFoundError: No module named 'matplotlib'
plt.clf()## NameError: name 'plt' is not defined
= sns.lineplot(data = sim_res, x = 'N', y = 'running_avg_est', color = "black")
graph ## NameError: name 'sns' is not defined
= np.exp(1), xmin = 0, xmax = 1, color = "red")
graph.axhline(y ## NameError: name 'graph' is not defined
plt.show()## NameError: name 'plt' is not defined
The expected number of uniform RV draws required to sum to 1 is \(e\)!
Monte Carlo methods are often used to approximate the value of integrals which do not have a closed-form (in particular, these integrals tend to pop up frequently in Bayesian methods).
Suppose you want to integrate \[\int_0^1 x^2 \sin \left(\frac{1}{x}\right) dx\]
You could set up Riemann integration and evaluate the integral using a sum over \(K\) points, but that approach only converges for smooth functions (and besides, that’s boring Calc 2 stuff, right?).
Instead, let’s observe that this is equivalent to \(\int_0^1 x^2 \sin \left(\frac{1}{x}\right) \cdot 1 dx\), where \(p(x) = 1\) for a uniform random variable. That is, this integral can be written as the expected value of the function over the interval \([0,1]\). What if we just generate a bunch of uniform(0,1) variables, evaluate the value of the function at those points, and average the result?
You can use the law of large numbers to prove that this approach will converge. [5]
20491720)
random.seed(
def fn(x):
return x**2 * np.sin(1/x)
= pd.DataFrame({"x": np.random.uniform(size = 100000)})
sim_data ## NameError: name 'pd' is not defined
'y'] = fn(sim_data.x)
sim_data[## NameError: name 'sim_data' is not defined
sim_data.y.mean()## NameError: name 'sim_data' is not defined
Let’s say that you want to find an estimate for \(\pi\), and you know that a circle with radius 1 has an area of exactly that. You also know, that all of the points on this circle can be written as \(x^2 + y^2 \le 1\).
Evaluating the area of the circle mathematically, would need us to either change to polar-coordinates or separate the graph into suitable functions (half-circles), and evaluate the integral between the top and the bottom: \[ \int_{-1}^1 2 \sqrt{1-x^2} dx \] Instead, we note that the circle is encapsulated in a square with side length 2. We can reach all points in that square by using two independent uniform random random variables over the interval \([-1,1]\), i.e. when we generate two random values from U[-1,1], and use one as the \(x\) coordinate and one as the \(y\) coordinate, we get a point in the square. If the sum of the squares of the coordinates are less than 1, the point will also fall inside the circle. If not, the point falls in one of the four corners of the square that are outside the circle.
How do we get to an estimate of \(\pi\) from there? We know that the area of the square is simply \(2^2 = 4\). The area of the circle is then directly proportional to the rate at which points fall into the circle, ie.
\[ \hat{\pi} = 4 \times \frac{\text{Number of points with } x^2+y^2 \le 1}{\text{Number of points generated}}. \] The more points we generate, the closer our estimate will be to the real value.
This problem is an example for Monte-Carlo Integration using an Acceptance-Rejection approach: we can slightly re-write the simulation and think of the generation of a new point in the circle as a two step process, where we first generate a value for \(x\) from U[-1,1], and in second step generate a candidate \(c\) for \(y\) from U[-1, 1], which we will only accept as \(y\), if \(|c| \le \sqrt{1-x^2}\). Acceptance-Rejection sampling is the basis of a lot of Markov-Chain Monte-Carlo (MCMC) methods, such as e.g. the Metropolis-Hastings algorithm.
set.seed(20491720)
calculate_pi <- function(R) {
x = runif(R, min=-1, max=1)
y = runif(R, min=-1, max=1)
in_circle = x^2+y^2<1
4 * sum(in_circle) / R
}
# Quite a bit of variability with just 100 values
calculate_pi(100)
## [1] 3.16
calculate_pi(100)
## [1] 3.2
calculate_pi(100)
## [1] 2.96
# Better with 10,000
calculate_pi(10000)
## [1] 3.126
calculate_pi(10000)
## [1] 3.1392
# Better, but still only good for about 2-3 digits
calculate_pi(1000000)
## [1] 3.14344
pi
## [1] 3.1415927
20491720)
random.seed(
def calculate_pi(R):
= np.random.uniform(size = R)
x = np.random.uniform(size = R)
y = x**2+y**2<1
in_circle
return 4 * sum(in_circle) / R
# Quite a bit of variability with just 100 values
100)
calculate_pi(## NameError: name 'np' is not defined
100)
calculate_pi(## NameError: name 'np' is not defined
100)
calculate_pi(## NameError: name 'np' is not defined
# Better with 10,000
10000)
calculate_pi(## NameError: name 'np' is not defined
10000)
calculate_pi(## NameError: name 'np' is not defined
# Better, but still only good for about 2-3 digits
1000000)
calculate_pi(## NameError: name 'np' is not defined
np.pi## NameError: name 'np' is not defined
Buffon’s needle is a mathematical problem which can be boiled down to a simple physical simulation. Read this science friday description of the problem and develop a monte carlo simulation method which estimates \(\pi\) using the Buffon’s needle method. Your method should be a function which
- allows the user to specify how many sticks are dropped
- plots the result of the physical simulation
- prints out a numerical estimate of pi.
Let’s start out with horizontal lines at 0 and 1, and set our stick length to 1. We need to randomly generate a position (of one end of the stick) and an angle. The position in \(x\) doesn’t actually make much of a difference (since what we care about is the \(y\) coordinates), but we can draw a picture if we generate \(x\) as well.
needle_sim <- function(sticks = 100) {
df <- tibble(xstart = runif(sticks, 0, 10),
ystart = runif(sticks, 0, 1),
angle = runif(sticks, 0, 360),
xend = xstart + cos(angle/180*pi),
yend = ystart + sin(angle/180*pi)
) %>%
# We can see if a stick crosses a line if the floor() function of ystart is
# different than floor(yend).
# Note this only works for integer line values.
mutate(crosses_line = floor(ystart) != floor(yend))
gg <- ggplot() +
geom_hline(yintercept = c(0, 1)) +
geom_segment(aes(x = xstart, y = ystart, xend = xend, yend = yend,
color = crosses_line), data = df) +
coord_fixed()
return(list(est = 2 * sticks / sum(df$crosses_line), plot = gg))
}
needle_sim(10)
## $est
## [1] 3.3333333
##
## $plot
needle_sim(100)
## $est
## [1] 3.125
##
## $plot
needle_sim(1000)
## $est
## [1] 3.1446541
##
## $plot
needle_sim(10000)
## $est
## [1] 3.1730922
##
## $plot
def needle_sim(sticks = 100):
= pd.DataFrame({
df "xstart": np.random.uniform(0, 10, size = sticks),
"ystart": np.random.uniform(0, 1, size = sticks),
"angle": np.random.uniform(0, 360, size = sticks)
})
'xend'] = df.xstart + np.cos(df.angle/180*np.pi)
df['yend'] = df.ystart + np.sin(df.angle/180*np.pi)
df['crosses_line'] = np.floor(df.ystart) != np.floor(df.yend)
df[
return df
= needle_sim(100000)
data ## NameError: name 'pd' is not defined
'N'] = np.arange(len(data)) + 1
data[## NameError: name 'np' is not defined
'cum_est'] = 2*data.N / data.crosses_line.expanding().sum()
data[## NameError: name 'data' is not defined
import seaborn as sns
## ModuleNotFoundError: No module named 'seaborn'
import matplotlib.pyplot as plt
## ModuleNotFoundError: No module named 'matplotlib'
plt.clf()## NameError: name 'plt' is not defined
= sns.lineplot(data = data, x = "N", y = "cum_est", color = "black")
graph ## NameError: name 'sns' is not defined
= np.pi, xmin = 0, xmax = 1, color = "red")
graph.axhline(y ## NameError: name 'graph' is not defined
plt.show()## NameError: name 'plt' is not defined
29.6 Other Resources
Simulation (R programming for Data Science chapter)
Simulation - R Studio lesson
Simulating Data with SAS (Excerpt)