Statistical Computing using R and Python

Author

Susan Vanderplas

Published

October 26, 2024

Preface

Cover image for Statistical Computing using R and Python. Shows little fuzzball 'monsters' completing data-related tasks such as rearranging data frames, brewing complete documents using markdown, and assembling data analyses by arranging, wrangling, visualizing, and modeling data. Images assembled from a collection of drawings by Allison Horst; used with permission.

Content Overload!

This book is designed to demonstrate introductory statistical programming concepts and techniques. It is intended as a substitute for hours and hours of video lectures - watching someone code and talk about code is not usually the best way to learn how to code. It’s far better to learn how to code by … coding.

I hope that you will work through this book week by week over the semester. I have included comics, snark, gifs, YouTube videos, extra resources, and more: my goal is to make this a collection of the best information I can find on statistical programming.

In most cases, this book includes way more information than you need. Everyone comes into this class with a different level of computing experience, so I’ve attempted to make this book comprehensive. Unfortunately, that means some people will be bored and some will be overwhelmed. Use this book in the way that works best for you - skip over the stuff you know already, ignore the stuff that seems too complex until you understand the basics. Come back to the scary stuff later and see if it makes more sense to you.

Book Format Guide

I’ve made an effort to use some specific formatting and enable certain features that make this book a useful tool for this class.

Special Sections

Some instructions depend on your operating system. Where it’s shorter, I will use tabs to provide you with OS-specific instructions. Here are the icons I will use:

Windows-specific instructions

Mac-specific instructions

Linux-specific instructions. I will usually try to make these generic, but if they’re GUI-based, my instructions will usually be for KDE.

Warnings

These sections contain things you may want to look out for: common errors, mistakes, and unfortunate situations that may arise when programming.

Demonstrations

These sections demonstrate how the code being discussed is used (in a simple way).

Examples

These sections contain illustrations of the concepts discussed in the chapter. Don’t skip them, even though they may be long!

Try it out

These sections contain activities you should do to reinforce the things you’ve just read. You will be much more successful if you read the material, review the example, and then try to write your own code. Most of the time, these sections will have a specific format:

The problem will be in the first tab for you to start with.

A solution will be provided in R, potentially with an explanation.

A solution will be provided in Python as well.

In some cases, the problem will be more open-ended and may not adhere to this format, but most Try it out sections in this book will have solutions provided. I highly recommend that you attempt to solve the problem yourself before you look at the solutions - this is the best way to learn. Passively reading code does not result in information retention.

Essential Reading

These sections may direct you to additional reading material that is essential for understanding the topic. For instance, I will sometimes link to other online textbooks rather than try to rehash the content myself when someone else has done it better.

Learn More

These sections will direct you to additional resources that may be helpful to consult as you learn about a topic. You do not have to use these sections unless you are 1) bored, or 2) hopelessly lost. They’re provided to help but are not expected reading (unlike the Essential Reading sections in red).

Notes

These generic sections contain information I may want to call attention to, but that isn’t necessarily urgent or a common error trap.

Advanced

These sections are intended to apply to more advanced courses. If you are taking an introductory course, feel free to skip that content for now.

Expandable Sections

These are expandable sections, which show additional information when you click on the line.

This additional information may be information that is helpful but not essential, or it may be that an example just takes a LOT of space and I want to make sure you can skim the book without having to scroll through a ton of output.

Answers or punchlines may be hidden in this type of expandable section as well.

Analytics

I have enabled Google Analytics on this site for the purposes of measuring this work’s impact and use both in my own classes and elsewhere. I’m not using the individual tracking/ad-targeting settings (to the best of my knowledge) - my only purpose in using Google Analytics is to assess how often this site is used, and where its users are located at a rough (state/regional) level.

If you are using this site and aren’t affiliated with the University of Nebraska-Lincoln, or have found it useful, please let me know by making a comment in Giscus (below) or sending me an email! These affirmations help me make a case that spending time on this resource is actually a good investment.

Acknowledgements

The cover of this book is an amalgam of different images by the lovely @allison_horst, which are released under the CC BY 4.0 license. I have modified them to remove most of the R package references and arrange them to represent the topics covered in this book.

Laptop icon used in the tab/logo created by Good Ware - Flaticon

Throughout this book, I have borrowed liberally from other online tutorials, published books, and blog posts. I have tried to ensure that I link to the source material throughout the book and provide appropriate credit to anyone whose examples I have used, modified, or repurposed. Special thanks to the tutorials provided by Posit/RStudio and the tidyverse project.

I don’t have official editors, but thank you to those who make use of the Giscus comment box to let me know about issues and typos. So far, you’ve helped me fix at least 3 issues!

This book was built with the following parameters/settings/library versions:

import os
import sys

# Environment variables relevant to the R/Python build of this book
itemlist = [
    "PWD", "SHELL", "USER", "PYTHONIOENCODING", "VIRTUAL_ENV",
    "RETICULATE_PYTHON", "R_HOME", "R_PLATFORM", "LD_LIBRARY_PATH",
    "R_LIBS_USER", "R_LIBS_SITE", "RENV_PROJECT", "RSTUDIO_PANDOC",
    "RMARKDOWN_MATHJAX_PATH", "R_SESSION_INITIALIZED", "PYTHONPATH",
]

# Keep only the variables that are actually set, in alphabetical order
itemlist = sorted(set(itemlist) & set(os.environ))

for item in itemlist:
    print(f"{item} : {os.environ[item]}")
## PWD : /Users/hofmann/Documents/Teaching/Stat 850/stat-computing-r-python
## PYTHONIOENCODING : utf-8
## PYTHONPATH : /Users/hofmann/.pyenv/versions/3.11.2/lib/python311.zip:/Users/hofmann/.pyenv/versions/3.11.2/lib/python3.11:/Users/hofmann/.pyenv/versions/3.11.2/lib/python3.11/lib-dynload:/Users/hofmann/.virtualenvs/book/lib/python3.11/site-packages:/Users/hofmann/Library/Caches/org.R-project.R/R/renv/cache/v5/macos/R-4.4/x86_64-apple-darwin20/reticulate/1.39.0/e1a5d04397edc1580c5e0ed1dbdccf76/reticulate/python
## RENV_PROJECT : /Users/hofmann/Documents/Teaching/Stat 850/stat-computing-r-python
## RMARKDOWN_MATHJAX_PATH : /Applications/RStudio.app/Contents/Resources/app/resources/mathjax-27
## RSTUDIO_PANDOC : /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/x86_64
## R_HOME : /Library/Frameworks/R.framework/Resources
## R_LIBS_SITE : /Library/Frameworks/R.framework/Resources/site-library
## R_LIBS_USER : /Users/hofmann/Documents/Teaching/Stat 850/stat-computing-r-python/renv/library/macos/R-4.4/x86_64-apple-darwin20
## R_PLATFORM : x86_64-apple-darwin20
## R_SESSION_INITIALIZED : PID=53381:NAME="reticulate"
## SHELL : /bin/bash
## USER : hofmann
## VIRTUAL_ENV : /Users/hofmann/.virtualenvs/book

print(sys.path)
## ['', '/Users/hofmann/.pyenv/versions/3.11.2/bin', '/Users/hofmann/.pyenv/versions/3.11.2/lib/python311.zip', '/Users/hofmann/.pyenv/versions/3.11.2/lib/python3.11', '/Users/hofmann/.pyenv/versions/3.11.2/lib/python3.11/lib-dynload', '/Users/hofmann/.virtualenvs/book/lib/python3.11/site-packages', '/Users/hofmann/Library/Caches/org.R-project.R/R/renv/cache/v5/macos/R-4.4/x86_64-apple-darwin20/reticulate/1.39.0/e1a5d04397edc1580c5e0ed1dbdccf76/reticulate/python', '/Users/hofmann/.virtualenvs/book/lib/python311.zip', '/Users/hofmann/.virtualenvs/book/lib/python3.11', '/Users/hofmann/.virtualenvs/book/lib/python3.11/lib-dynload']
library(devtools)
devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.4.1 (2024-06-14)
##  os       macOS Sonoma 14.4.1
##  system   x86_64, darwin20
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/Chicago
##  date     2024-09-09
##  pandoc   3.3 @ /usr/local/bin/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  ! package     * version date (UTC) lib source
##  P cachem        1.1.0   2024-05-16 [?] CRAN (R 4.4.0)
##    cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.0)
##  P devtools    * 2.4.5   2022-10-11 [?] CRAN (R 4.4.0)
##  P digest        0.6.37  2024-08-19 [?] CRAN (R 4.4.1)
##  P ellipsis      0.3.2   2021-04-29 [?] CRAN (R 4.4.0)
##  P evaluate      0.24.0  2024-06-10 [?] CRAN (R 4.4.0)
##  P fastmap       1.2.0   2024-05-15 [?] CRAN (R 4.4.0)
##  P fontawesome * 0.5.2   2023-08-19 [?] CRAN (R 4.4.0)
##  P fs            1.6.4   2024-04-25 [?] CRAN (R 4.4.0)
##    glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
##  P htmltools     0.5.8.1 2024-04-04 [?] CRAN (R 4.4.0)
##  P htmlwidgets   1.6.4   2023-12-06 [?] CRAN (R 4.4.0)
##  P httpuv        1.6.15  2024-03-26 [?] CRAN (R 4.4.0)
##  P jsonlite      1.8.8   2023-12-04 [?] CRAN (R 4.4.0)
##  P knitr         1.48    2024-07-07 [?] CRAN (R 4.4.0)
##  P later         1.3.2   2023-12-06 [?] CRAN (R 4.4.0)
##  P lattice       0.22-6  2024-03-20 [?] CRAN (R 4.4.1)
##    lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
##    magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
##  P Matrix        1.7-0   2024-04-26 [?] CRAN (R 4.4.1)
##  P memoise       2.0.1   2021-11-26 [?] CRAN (R 4.4.0)
##  P mime          0.12    2021-09-28 [?] CRAN (R 4.4.0)
##  P miniUI        0.1.1.1 2018-05-18 [?] CRAN (R 4.4.0)
##  P pkgbuild      1.4.4   2024-03-17 [?] CRAN (R 4.4.0)
##  P pkgload       1.4.0   2024-06-28 [?] CRAN (R 4.4.0)
##  P png           0.1-8   2022-11-29 [?] CRAN (R 4.4.0)
##  P profvis       0.3.8   2023-05-02 [?] CRAN (R 4.4.0)
##  P promises      1.3.0   2024-04-05 [?] CRAN (R 4.4.0)
##  P purrr         1.0.2   2023-08-10 [?] CRAN (R 4.4.0)
##  P R6            2.5.1   2021-08-19 [?] CRAN (R 4.4.0)
##  P Rcpp          1.0.13  2024-07-17 [?] CRAN (R 4.4.0)
##  P remotes       2.5.0   2024-03-17 [?] CRAN (R 4.4.0)
##    renv          1.0.7   2024-04-11 [1] CRAN (R 4.4.0)
##  P reticulate    1.39.0  2024-09-05 [?] CRAN (R 4.4.1)
##    rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.0)
##  P rmarkdown     2.28    2024-08-17 [?] CRAN (R 4.4.1)
##  P rstudioapi    0.16.0  2024-03-24 [?] CRAN (R 4.4.0)
##  P sessioninfo   1.2.2   2021-12-06 [?] CRAN (R 4.4.0)
##  P shiny         1.9.1   2024-08-01 [?] CRAN (R 4.4.0)
##  P stringi       1.8.4   2024-05-06 [?] CRAN (R 4.4.0)
##  P stringr       1.5.1   2023-11-14 [?] CRAN (R 4.4.0)
##  P urlchecker    1.0.1   2021-11-30 [?] CRAN (R 4.4.0)
##  P usethis     * 3.0.0   2024-07-29 [?] CRAN (R 4.4.0)
##    vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
##  P xfun          0.47    2024-08-17 [?] CRAN (R 4.4.1)
##  P xtable        1.8-4   2019-04-21 [?] CRAN (R 4.4.0)
##  P yaml          2.3.10  2024-07-26 [?] CRAN (R 4.4.0)
## 
##  [1] /Users/hofmann/Documents/Teaching/Stat 850/stat-computing-r-python/renv/library/macos/R-4.4/x86_64-apple-darwin20
##  [2] /Users/hofmann/Library/Caches/org.R-project.R/R/renv/sandbox/macos/R-4.4/x86_64-apple-darwin20/2edc1867
## 
##  P ── Loaded and on-disk path mismatch.
## 
## ─ Python configuration ───────────────────────────────────────────────────────
##  python:         /Users/hofmann/.virtualenvs/book/bin/python
##  libpython:      /Users/hofmann/.pyenv/versions/3.11.2/lib/libpython3.11.dylib
##  pythonhome:     /Users/hofmann/.virtualenvs/book:/Users/hofmann/.virtualenvs/book
##  version:        3.11.2 (main, Nov 20 2023, 11:27:18) [Clang 13.0.0 (clang-1300.0.29.30)]
##  numpy:          /Users/hofmann/.virtualenvs/book/lib/python3.11/site-packages/numpy
##  numpy_version:  2.1.1
##  
##  NOTE: Python version was forced by VIRTUAL_ENV
## 
## ──────────────────────────────────────────────────────────────────────────────
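
The Python output above reports environment variables and the module search path, while the R session info also lists package versions. If you want a comparable version listing on the Python side, a minimal sketch using the standard library's `importlib.metadata` might look like the following. The helper name `package_versions` and the package names other than numpy are hypothetical choices for illustration, not part of this book's build.

```python
from importlib import metadata


def package_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the active environment
            versions[name] = None
    return versions


# Hypothetical selection of packages; swap in whatever your project imports.
for name, version in package_versions(["numpy", "pandas", "matplotlib"]).items():
    print(f"{name} : {version if version is not None else 'not installed'}")
```

Because the lookup goes through the active interpreter, running this inside a virtual environment reports the versions installed in that environment rather than system-wide ones.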