2 + 2 + 3
[1] 7
2 + 2 # + 3
[1] 4
# This line is entirely commented out
The fundamental goal of this chapter is to learn how to talk to R and Python. In the very beginning, people told computers what to do using punch cards [1]. This required that you have every step of your program and data planned out in advance - you’d submit your punch cards to the computer, and then come back 24-72 hours later to find out you’d gotten two cards out of order. Dropping a tray of punch cards was … problematic.
Thankfully, we’re mostly free of the days where being a bit clumsy could erase a semester of hard work. As things grew more evolved and we got actual monitors and (eventually) graphical interfaces, we started using interactive terminals (interactive mode) to boss computers around.
Open RStudio and navigate to the Console tab. You can issue commands directly to R by typing something in at the >
prompt.
Try typing in 2+2
and hit enter.
Open RStudio and navigate to the Terminal tab. This is your computer’s ‘terminal’ - where you tell the computer what to do.
First, we have to tell it what language we’d like to work in - by default, it’s going to work in Batch (Windows), Zsh (Mac), or Bash (Linux). Luckily, we can avoid these and tell the computer we want to work in python by typing in python3
or python
(depending on how your computer is set up). This will launch an interactive python session (ipython).
You should get a prompt that looks like this: >>>
Type in 2+2
and hit enter.
Interactive mode is useful for quick, one-off analyses, but if you need to repeat an analysis (or remember what you did), interactive mode is just awful. Once you close the program, the commands (and results) are gone. This is particularly inconvenient when you need to run the same task multiple times. For example, each day I may want to pull the weather forecast and observed weather values from the national weather service using the same commands. I don’t want to manually re-type them each day!
To somewhat address this issue, most computing languages allow you to provide a sequence of commands in a text file known as a script. Scripts are typically meant to run on their own - they may perform computations, format data and save it, scrape data from the web… the possibilities are endless, but they are typically meant to run without the person running the script having to read all of the commands.
Download scripts.zip and unzip the file.
Open a system terminal in the directory where you unzipped the files.
Follow the directions below exactly to ensure that you have the terminal open in the correct location.
Open the folder. Type cmd into the location bar at the top of the window and hit enter. The command prompt will open in the desired location.
Open a finder window and navigate to the folder you want to use. If you don’t have a path bar at the bottom of the finder window, choose View > Show Path Bar. Control-click the folder in the path bar and choose Open in Terminal.
Open the folder in your file browser. Select the path to the folder in the path bar and copy it to the clipboard. Launch a terminal and type cd
, and then paste the copied path. Hit enter. (There may be more efficient ways to do this, but these instructions work for most window managers).
For more information about how to use system terminals, see Section 31.1.
This assumes that the R binary has been added to your system path. If these instructions don’t work, please ask for help or visit office hours.
In the terminal, type Rscript words.R dickens-oliver-twist.txt
You should get some output that looks like this:
user@computer:~/scripts$ Rscript words.R dickens-oliver-twist.txt
text
the and to of a his in he was
8854 4902 4558 3767 3763 3569 2272 2224 1931 1684
This assumes that the python binary has been added to your system path. If these instructions don’t work, please ask for help or visit office hours.
In the terminal, type python3 words.py
and hit Enter. You will be prompted for the file name. Enter dickens-oliver-twist.txt
and hit Enter again.
You should get some output that looks like this:
user@computer:~/scripts$ python3 words.py
Enter file:dickens-oliver-twist.txt
the 8854
Scripts, and compiled programs generated from scripts, are responsible for much of what you interact with on a computer or cell phone day-to-day. When the goal is to process a file or complete a task in exactly the same way each time, a script is the right choice for the job.
However, when working with data, we sometimes prefer to combine scripts with interactive mode - that is, we use a script file to keep track of which commands we run, but we run the script interactively. About 60% of my day-to-day computing is done using R or python scripts that are run interactively.
If you haven’t already, download scripts.zip and unzip the file.
Open RStudio and use RStudio to complete the following tasks.
Use RStudio to open the words-noinput.R
file in the scripts
folder you downloaded and unzipped.
What do you notice about the appearance of the file? Is there an icon in the tab to tell you what type of file it is? Are some words in the file highlighted?
Copy the path to the scripts folder.
OS Specific Instructions: Windows, Mac, Linux
In the R Console, type in setwd("<paste path here>")
, where you paste your file path from step 3 between the quotes. Hit enter.
In the words-noinput.R
file, hit the “source” button in the top right. Do you get the same output that you got from running the file as a script from the terminal? Why do you think that is?
Click on the last line of the file and hit Run (or Ctrl/Cmd + Enter). Do you get the output now?
Click on the first line of the file and hit Run (or Ctrl/Cmd + Enter). This runs a single line of the file. Use this to run each line of the file in turn. What could you learn from doing this?
Use RStudio or your preferred python editor to open the words-noinput.py
file in the scripts
folder you downloaded and unzipped.
What do you notice about the appearance of the file? Is there an icon in the tab to tell you what type of file it is? Are some words in the file highlighted?
Copy the path to the scripts folder.
OS Specific Instructions: Windows, Mac, Linux
In the R Console, type in setwd("<paste path here>")
, where you paste your file path from step 3 between the quotes. Hit enter.
In the words-noinput.py
file, hit the “source” button in the top right. Do you get the same output that you got from running the file as a script from the terminal? What changes?
Click on the first line of the file and hit Run (or Ctrl/Cmd + Enter). This runs a single line of the file. Use this to run each line of the file in turn. What do you learn from doing this?
Using scripts interactively allows us to see what is happening in the script step-by-step, and to examine the results during the program’s evaluation. This can be beneficial when applying a script to a new dataset, because it allows us to change things on the fly while still keeping the same basic order of operations.
One problem with scripts and interactive modes of using programming languages is that we’re spending most of our time writing code for computers to read – which doesn’t necessarily imply that our code is easy for humans to read.
There are two solutions to this problem, and I encourage you to make liberal use of both of them (together).
A comment is a part of computer code which is intended only for people to read. It is not evaluated or run by the computing language.
To “comment out” a single line of code in R or python, put a #
(pound sign/hashtag) just before the part of the code you do not want to be evaluated.
2 + 2 + 3
[1] 7
2 + 2 # + 3
[1] 4
# This line is entirely commented out
2 + 2 + 3
7
2 + 2 # + 3
4
# This line is entirely commented out
Many computing languages, such as Java, C/C++, and JavaScript have mechanisms to comment out an entire paragraph. Neither R nor Python has so-called “block comments” - instead, you can use keyboard shortcuts in RStudio to comment out an entire chunk of code (or text) using Ctrl/Cmd-Shift-C.
While code comments add human-readable text to code, scripts with comments are still primarily formatted for the computer’s convenience. However, most of the time spent on any given document is spent by people, not by computers. We often write parallel documents - user manuals, academic papers, tutorials, etc. which explain the purpose of our code and how to use it, but this can get clumsy over time, and requires updating multiple documents (sometimes in multiple places), which often leads to the documentation getting out-of-sync from the code.
To solve this problem, Donald Knuth invented the concept of literate programming: interspersing text and code in the same document using structured text to indicate which lines are code and which lines are intended for human consumption.
This textbook is written using a literate format - quarto markdown - which allows me to include code chunks in R, python, and other languages, alongside the text, pictures, and other formatting necessary to create a textbook.
One type of literate programming document is a quarto markdown document.
We will use quarto markdown documents for most of the components of this class because they allow you to answer assignment questions, write reports with figures and tables generated from data, and provide code all in the same file.
While literate documents aren’t ideal for jobs where a computer is doing things unobserved (such as pulling data from a web page every hour), they are extremely useful in situations where it is desireable to have both code and an explanation of what the code is doing and what the results of that code are in the same document.
In RStudio, create a new quarto markdown document: File > New File > Quarto Document. Give your document a title and an author, and select HTML as the output.
Copy the following text into your document and hit the “Render” button at the top of the file.
This defines an R code chunk. The results will be included in the compiled HTML file.
```{r}
2 + 2
```
This defines a python code chunk. The results will be included in the compiled HTML file.
```{python}
2 + 2
```
# This is a header
## This is a subheader
I can add paragraphs of text, as well as other structured text such as lists:
1. First thing
2. Second thing
- nested list
- nested list item 2
3. Third thing
[links](https://www.oldest.org/entertainment/memes/)
I can even include images and
![Goodwin's law is almost as old as the internet itself.](https://www.oldest.org/wp-content/uploads/2017/10/Godwins-Law.jpg)
Markdown is a format designed to be readable and to allow document creators to focus on content rather than style.
A Markdown-formatted document should be publishable as-is, as plain text, without looking like it’s been marked up with tags or formatting instructions. – John Gruber
You can read more about pandoc markdown (and quarto markdown, which is a specific type of pandoc markdown) here [2].
Markdown documents are compiled into their final form (usually, HTML, PDF, Docx) in multiple stages:
All code chunks are run and the results are saved and inserted into the markdown document.
Rmd/qmd -> md
The markdown document is converted into its final format using pandoc, a program that is designed to ensure you can generate almost any document format. This may involve conversion to an intermediate file (e.g. .tex files for PDF documents).
An error in your code will likely cause a failure at stage 1 of the process. An error in the formatting of your document, or missing pictures, and miscellaneous other problems may cause errors in stage 2.
Quarto markdown is the newest version of a long history of literate document writing in R. A previous version, Rmarkdown, had to be compiled using R; quarto can be compiled using R or python or the terminal directly.
Prior to Rmarkdown, the R community used knitr
and Sweave
to integrate R code with LaTeX documents (another type of markup document that has a steep learning curve and is harder to read).
Where quarto comes primarily out of the R community and those who are agnostic whether R or Python is preferable for data science related computing, Jupyter is essentially an equivalent notebook technology that comes from the python side of the world.
Quarto supports using the jupyter engine for chunk compilation, but jupyter notebooks have some (rather technical) features that make them less desirable for an introductory computing class [3].
There are some excellent opinions surrounding the use of notebooks in data analysis:
Yihui Xie is the person responsible for knitr
and Rmarkdown
and was involved in the development of quarto
.