13 Control Structures – Statistical Computing using R and Python

Objectives

Understand how to use conditional statements
Understand how conditional statements are evaluated by a program
Use program flow diagrams to break a problem into parts and evaluate how a program will execute
Understand how to use loops
Select the appropriate type of loop for a problem

13.1 Mindset

Before we start on the types of control structures, let’s get in the right mindset. We’re all used to “if-then” logic, and use it in everyday conversation, but computers require another level of specificity when you’re trying to provide instructions.

Check out this video of the classic “make a peanut butter sandwich instructions challenge”:

Here’s another example:

‘If you’re done being pedantic, we should get dinner.’ ‘You did it again!’ ‘No, I didn’t.’ Image from Randall Munroe, xkcd.com, available under a CC BY-NC 2.5 license.

The key takeaway from these bits of media is that you should read this section with a focus on precision: state exactly what you mean, and the computer will do what you say. If you instead expect the computer to get what you mean, you’re going to have a bad time.

13.2 Conditional Statements

Conditional statements determine if code is evaluated.

They look like this:

if (condition)
  then
    (thing to do)
  else
    (other thing to do)

The else (other thing to do) part may be omitted.

When the computer reads this statement, it checks whether condition is true or false. If the condition is true, then (thing to do) is run. If the condition is false, then (other thing to do) is run instead; if the else part was omitted, nothing is run and the program moves on.

Let’s try this out:


x <- 3
y <- 1

if (x > 2) { 
  y <- 8
} else {
  y <- 4
}

print(paste("x =", x, "; y =", y))
## [1] "x = 3 ; y = 8"

In R, the logical condition after if must be in parentheses. It is common to then enclose the statement to be run if the condition is true in {} so that it is clear what code matches the if statement. You can technically omit the braces and put the statement on the same line as if (x > 2), and everything will still work, but then the else clause also has to go on that same line, and that gets hard to read.

x <- 3
y <- 1

if (x > 2) y <- 8 else y <- 4

print(paste("x =", x, "; y =", y))
## [1] "x = 3 ; y = 8"

So while the second version of the code technically works, the first version with the braces is much easier to read and understand. Please try to emulate the first version!

x = 3
y = 1

if x > 2:
  y = 8
else:
  y = 4

print("x =", x, "; y =", y)
## x = 3 ; y = 8

In Python, code grouping is accomplished with indentation instead of with braces. So in Python, we write our if statement as if x > 2: with the colon indicating that what follows is the code to evaluate. The next line is indented with 2 spaces to show that the code on those lines belongs to that if statement (any consistent indentation works; 4 spaces is the most common convention). Then, we use the else: statement to provide an alternative set of code to run if the logical condition in the if statement is false. Again, we indent the code under the else statement to show where it “belongs”.

Warning

Python will throw errors if you mess up the spacing. This is one thing that is very annoying about Python… but it’s a consequence of trying to make the code more readable.
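To see what such an error looks like without derailing a session, the sketch below hands Python a deliberately mis-indented snippet as text and catches the error. The names bad_code, good_code, error_name, and namespace are illustrative only and do not appear elsewhere in this chapter.

```python
# A snippet whose if-body is NOT indented (deliberately wrong)
bad_code = "x = 3\nif x > 2:\ny = 8\n"

try:
  # compile() parses the text without running it
  compile(bad_code, "<example>", "exec")
  error_name = None
except IndentationError as err:
  # Python reports the problem before any of the code runs
  error_name = type(err).__name__
  print(error_name, "-", err.msg)

# The same snippet with the body indented compiles and runs
good_code = "x = 3\nif x > 2:\n  y = 8\n"
namespace = {}
exec(compile(good_code, "<example>", "exec"), namespace)
print("y =", namespace["y"])
```

The exact wording of the error message differs between Python versions, but the error type is IndentationError in each case.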

13.2.1 Representing Conditional Statements as Diagrams

A common way to represent conditional logic is to draw a flow chart diagram.

In a flow chart, conditional statements are represented as diamonds, and other code is represented as a rectangle. Yes/no or True/False branches are labeled. Typically, after a conditional statement, the program flow returns to a single point.

Program flow diagram outline of a simple if/else statement

13.2.2 Chaining Conditional Statements: Else-If

In many cases, it can be helpful to have a long chain of conditional statements describing a sequence of alternative statements.

Example - Conditional Evaluation

Suppose I want to determine what categorical age bracket someone falls into based on their numerical age. All of the bins are mutually exclusive - you can’t be in the 26-40 bracket and the 41-55 bracket.

Program flow map for a series of mutually exclusive categories. If our goal is to take a numeric age variable and create a categorical set of age brackets, such as <18, 18-25, 26-40, 41-55, 56-65, and >65, we can do this with a series of if-else statements chained together. Only one of the bracket assignments is evaluated, so it is important to place the most restrictive condition first.

The important thing to realize when examining this program flow map is that if age < 18 is true, then none of the other conditional statements even get evaluated. That is, once a statement is true, none of the later statements matter. Because of this, it is important to place the most restrictive statement first.

Program flow map for a series of mutually exclusive categories, emphasizing that only some statements are evaluated. When age = 40, only (age < 18), (age <= 25), and (age <= 40) are evaluated conditionally. Of the assignment statements, only bracket = ‘26-40’ is evaluated when age = 40.

If for some reason you wrote your conditional statements in the wrong order, the wrong label would get assigned:

Program flow map for a series of mutually exclusive categories, with category labels in the wrong order - <40 is evaluated first, and so <= 25 and <= 18 will never be evaluated and the wrong label will be assigned for anything in those categories.

In code, we would write this statement using else-if (or elif) statements.

age <- 40 # change this as you will to see how the code works

if (age < 18) {
  bracket <- "<18"
} else if (age <= 25) {
  bracket <- "18-25"
} else if (age <= 40) {
  bracket <- "26-40"
} else if (age <= 55) {
  bracket <- "41-55" 
} else if (age <= 65) {
  bracket <- "56-65"
} else {
  bracket <- ">65"
}

bracket
## [1] "26-40"

Python uses elif as a shorthand for else if statements. As always, indentation/white space in Python matters. If you are typing at the interactive prompt, a blank line ends the current block, so an extra blank line between two elif statements will make the interpreter complain (in a script file, blank lines are fine). If you don’t indent properly, the interpreter will complain.

age = 40 # change this to see how the code works

if age < 18:
  bracket = "<18"
elif age <= 25:
  bracket = "18-25"
elif age <= 40:
  bracket = "26-40"
elif age <= 55:
  bracket = "41-55"
elif age <= 65:
  bracket = "56-65"
else:
  bracket = ">65"
  
bracket
## '26-40'
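To make the ordering issue from the flow maps concrete, here is a small sketch that runs a shortened version of the chain twice for the same age, once in the correct order and once with the broadest condition first. The variable names right and wrong are illustrative, and the chain is cut off at 40 to keep the example short.

```python
age = 20

# Correct order: most restrictive condition first
if age < 18:
  right = "<18"
elif age <= 25:
  right = "18-25"
elif age <= 40:
  right = "26-40"
else:
  right = ">40"

# Wrong order: the broadest condition is checked first,
# so the later branches can never be reached for ages <= 40
if age <= 40:
  wrong = "26-40"
elif age <= 25:
  wrong = "18-25"
elif age < 18:
  wrong = "<18"
else:
  wrong = ">40"

print("correct order:", right)
print("wrong order:  ", wrong)
```

With age = 20, the first chain assigns "18-25", while the second stops at its first true condition and assigns "26-40".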

Try it out - Chained If/Else Statements

The US Tax code has brackets, such that the first $10,275 of your income is taxed at 10%, anything between $10,275 and $41,775 is taxed at 12%, and so on.

Here is the table of tax brackets for single filers in 2022:

rate	Income
10%	$0 to $10,275
12%	$10,275 to $41,775
22%	$41,775 to $89,075
24%	$89,075 to $170,050
32%	$170,050 to $215,950
35%	$215,950 to $539,900
37%	$539,900 or more

Note: For the purposes of this problem, we’re ignoring the personal exemption and the standard deduction, so we’re already simplifying the tax code.

Write a set of if statements that assess someone’s income and determine what their overall tax rate is.

Hint: You may want to keep track of how much of the income has already been taxed in a variable and what the total tax accumulation is in another variable.

The control flow diagram for the tax brackets

Control flow diagrams can be extremely helpful when figuring out how programs work (and where gaps in your logic are when you’re debugging). It can be very helpful to map out your program flow as you’re untangling a problem.

# Start with total income
income <- 200000

# x will hold income that hasn't been taxed yet
x <- income
# y will hold taxes paid
y <- 0

if (x <= 10275) {
  y <- x*.1 # tax paid
  x <- 0 # All money has been taxed
} else {
  y <- y + 10275 * .1
  x <- x - 10275 # Money remaining that hasn't been taxed
}

if (x <= (41775 - 10275)) {
  y <- y + x * .12
  x <- 0
} else {
  y <- y + (41775 - 10275) * .12
  x <- x - (41775 - 10275) 
}

if (x <= (89075 - 41775)) {
  y <- y + x * .22
  x <- 0
} else {
  y <- y + (89075 - 41775) * .22
  x <- x - (89075 - 41775)
}

if (x <= (170050 - 89075)) {
  y <- y + x * .24
  x <- 0
} else {
  y <- y + (170050 - 89075) * .24
  x <- x - (170050 - 89075)
}

if (x <= (215950 - 170050)) {
  y <- y + x * .32
  x <- 0
} else {
  y <- y + (215950 - 170050) * .32
  x <- x - (215950 - 170050)
}

if (x <= (539900 - 215950)) {
  y <- y + x * .35
  x <- 0
} else {
  y <- y + (539900 - 215950) * .35
  x <- x - (539900 - 215950)
}

if (x > 0) {
  y <- y + x * .37
}


print(paste("Total Tax Rate on $", format(income, scientific = FALSE), " in income = ", round(y/income, 4)*100, "%"))
## [1] "Total Tax Rate on $ 200000  in income =  22.12 %"

# Start with total income
income = 200000

# untaxed will hold income that hasn't been taxed yet
untaxed = income
# taxes will hold taxes paid
taxes = 0

if untaxed <= 10275:
  taxes = untaxed*.1 # tax paid
  untaxed = 0 # All money has been taxed
else:
  taxes = taxes + 10275 * .1
  untaxed = untaxed - 10275 # money remaining that hasn't been taxed

if untaxed <= (41775 - 10275):
  taxes = taxes + untaxed * .12
  untaxed = 0
else:
  taxes = taxes + (41775 - 10275) * .12
  untaxed = untaxed - (41775 - 10275) 


if untaxed <= (89075 - 41775):
  taxes = taxes + untaxed * .22
  untaxed = 0
else: 
  taxes = taxes + (89075 - 41775) * .22
  untaxed = untaxed - (89075 - 41775)

if untaxed <= (170050 - 89075):
  taxes = taxes + untaxed * .24
  untaxed = 0
else: 
  taxes = taxes + (170050 - 89075) * .24
  untaxed = untaxed - (170050 - 89075)

if untaxed <= (215950 - 170050):
  taxes = taxes + untaxed * .32
  untaxed = 0
else:
  taxes = taxes + (215950 - 170050) * .32
  untaxed = untaxed - (215950 - 170050)

if untaxed <= (539900 - 215950):
  taxes = taxes + untaxed * .35
  untaxed = 0
else: 
  taxes = taxes + (539900 - 215950) * .35
  untaxed = untaxed - (539900 - 215950)


if untaxed > 0:
  taxes = taxes + untaxed * .37



print("Total Tax Rate on $", income, " in income = ", round(taxes/income, 4)*100, "%")
## Total Tax Rate on $ 200000  in income =  22.12 %

We will find a better way to represent this calculation once we discuss loops - we can store each bracket’s start and end point in a vector and loop through them. Any time you find yourself copy-pasting code and changing values, you should consider using a loop (or eventually a function) instead.

13.3 Loops

Often, we write programs which update a variable in a way that the new value of the variable depends on the old value:

x = x + 1

This means that we add one to the current value of x.

Before we write a statement like this, we have to initialize the value of x because otherwise, we don’t know what value to add one to.

x = 0
x = x + 1

We sometimes use the word increment to talk about adding one to the value of x; decrement means subtracting one from the value of x.

A particularly powerful tool for making these types of repetitive changes in programming is the loop, which executes statements a certain number of times. Loops can be written in several different ways, but all loops allow for executing a block of code a variable number of times.

13.3.1 While Loops

In the previous section, we discussed conditional statements, where a block of code is only executed if a logical statement is true. The simplest type of loop is the while loop, which executes a block of code until a statement is no longer true.

Example - While Loops

Flow map showing while-loop pseudocode (while x <= N) { # code that changes x in some way} and the program flow map expansion where we check if x > N (exiting the loop if true); otherwise, we continue into the loop, execute the main body of #code and then change x and start over.

x <- 0

while (x < 10) { 
  # Everything in here is executed 
  # during each iteration of the loop
  print(x)
  x <- x + 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9

x = 0

while x < 10:
  print(x)
  x = x + 1
## 0
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9

Try it Out - While Loops

Write a while loop that verifies that \[\lim_{N \rightarrow \infty} \prod_{k=1}^N \left(1 + \frac{1}{k^2}\right) = \frac{e^\pi - e^{-\pi}}{2\pi}.\]

Terminate your loop when you get within 0.0001 of $\frac{e^\pi - e^{-\pi}}{2\pi}$. At what value of $k$ is this point reached?

Breaking down math notation for code:

If you are unfamiliar with the notation $\prod_{k=1}^N f(k)$, this is the product of $f(k)$ for $k = 1, 2, ..., N$, \[f(1)\cdot f(2)\cdot ... \cdot f(N)\]
To evaluate a limit, we just keep increasing $N$ until we get arbitrarily close to the right hand side of the equation.

In this problem, we can just keep increasing $k$ and keep track of the cumulative product. So we define k=1, prod = 1, and ans before the loop starts. Then, we loop over k, multiplying prod by $(1 + 1/k^2)$ and then incrementing $k$ by one each time. At each iteration, we test whether prod is close enough to ans to stop the loop.

In R, you will use pi and exp() - these are available by default without any additional libraries or packages.

k <- 1
prod <- 1
ans <- (exp(pi) - exp(-pi))/(2*pi)
delta <- 0.0001

while (abs(prod - ans) >= delta) {
  prod <- prod * (1 + 1/k^2)
  k <- k + 1
}

k
## [1] 36761
prod
## [1] 3.675978
ans
## [1] 3.676078

Note that in python, you will have to import the math library to get the values of pi and the exp function. You can refer to these as math.pi and math.exp() respectively.

import math

k = 1
prod = 1
ans = (math.exp(math.pi) - math.exp(-math.pi))/(2*math.pi)
delta = 0.0001

while abs(prod - ans) >= delta:
  prod = prod * (1 + k**-2)
  k = k + 1
  if k > 500000:
    break


# k was incremented after the last multiplication, so k - 1 factors were used
print("After", k - 1, "iterations, the product is", prod, "compared to the limit", ans, ".")
## After 36760 iterations, the product is 3.675977910975878 compared to the limit 3.676077910374978 .

Warning: Avoid Infinite Loops

It is very easy to create an infinite loop when you are working with while loops. Infinite loops never exit, because the condition is always true. If in the while loop example we decrement x instead of incrementing x, the loop will run forever.

You want to try very hard to avoid ever creating an infinite loop - it can cause your session to crash.

One common way to avoid infinite loops is to create a second variable that just counts how many times the loop has run. If that variable gets over a certain threshold, you exit the loop.


This while loop runs until either x >= 10 or n > 50, so it will run an indeterminate number of times that depends on the random values added to x. Since this process (a ‘random walk’) could theoretically continue forever, we add the n > 50 check to the loop so that we don’t tie up the computer for eternity.

x <- 0
n <- 0 # count the number of times the loop runs

while (x < 10) { 
  print(x)
  x <- x + rnorm(1) # add a random normal (0, 1) draw each time
  n <- n + 1
  if (n > 50) 
    break # this stops the loop if n > 50
}
## [1] 0
## [1] 1.869738
## [1] 3.180108
## [1] 4.717037
## [1] 3.993688
## [1] 5.031116
## [1] 5.340486
## [1] 5.836731
## [1] 6.547092
## [1] 6.45585
## [1] 7.408307
## [1] 7.626212
## [1] 7.707442
## [1] 7.428602
## [1] 8.89643
## [1] 9.787483
## [1] 8.665186
## [1] 7.915138
## [1] 7.587419
## [1] 7.986841
## [1] 9.396551

import numpy as np # for the random normal draw

x = 0
n = 0 # count the number of times the loop runs

while x < 10:
  print(x)
  x = x + np.random.normal(0, 1, 1) # add a random normal (0, 1) draw each time
  n = n + 1
  if n > 50:
    break # this stops the loop if n > 50
## 0
## [2.05640036]
## [1.64110256]
## [1.84490872]
## [4.20633578]
## [4.37295345]
## [4.95047605]
## [4.58778118]
## [4.60037911]
## [5.35284153]
## [5.87862449]
## [7.60241795]
## [7.63478723]
## [7.26736008]
## [6.73708633]
## [8.06995118]
## [7.14281015]
## [6.07483497]
## [5.28945274]
## [6.71734985]
## [7.03880733]
## [5.75305753]
## [4.48856535]
## [3.89709297]
## [3.0049643]
## [3.70985543]
## [3.84141759]
## [3.0474656]
## [2.46814314]
## [3.67591447]
## [2.71513203]
## [4.13476784]
## [5.6318165]
## [5.42104515]
## [6.17941767]
## [7.96734265]
## [8.41698154]
## [7.20092102]
## [7.58638313]
## [7.9635956]
## [7.11117283]
## [6.86539324]
## [6.741047]
## [7.97673482]
## [5.80301199]
## [7.2727839]
## [6.94528783]
## [7.87542911]
## [7.05487125]
## [6.91585117]
## [8.07938378]

In both of the examples above, there are more efficient ways to write a random walk, but we will get to that later. The important thing here is that we want to make sure that our loops don’t run for all eternity.

13.3.2 For Loops

Another common type of loop is a for loop. In a for loop, we run the block of code, iterating through a series of values (commonly, one to N, but not always). Generally speaking, for loops are known as definite loops because the code inside a for loop is executed a specific number of times. While loops are known as indefinite loops because the code within a while loop is evaluated until the condition is falsified, which is not always a known number of times.

Illustrated for loop where the input vector is a parade of monsters, including monsters that are circles, triangles, and squares. The for loop they enter has an if-else statement: if the monster is a triangle, it gets sunglasses. Otherwise, it gets a hat. The output is the parade of monsters where the same input parade of monsters shows up, now wearing either sunglasses (if triangular) or a hat (if any other shape). — A visual demonstration of for loops iterating through a vector of monsters to dress them up for a parade. Image by Allison Horst.

Example - For Loop Syntax

Flow map showing for-loop pseudocode (for j in 1 to N) { # code} and the program flow map expansion where j starts at 1 and we check if j > N (exiting the loop if true); otherwise, we continue into the loop, execute the main body of #code and then increment j and start over.

for (i in 1:5 ) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

for i in range(5):
  print(i)
## 0
## 1
## 2
## 3
## 4

By default, range(5) starts at 0 and stops just before the upper bound, 5, so it produces the sequence 0, 1, 2, 3, 4. The loop exits after the iteration with i = 4; i never takes the value 5.
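A quick way to check what a range contains is to convert it to a list. The sketch below also shows the optional start and step arguments; the variable names are illustrative.

```python
# Start defaults to 0; the stop value 5 is excluded
zero_to_four = list(range(5))

# Explicit start: 1 up to (but not including) 6
one_to_five = list(range(1, 6))

# Optional third argument is the step size
evens = list(range(0, 10, 2))

print(zero_to_four)
print(one_to_five)
print(evens)
```

Using range(1, 6) gives the same values as 1:5 in R, which can help when translating a loop from one language to the other.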

For loops are often run from 1 to N (or 0 to N-1 in python) but in essence, a for loop is very commonly used to do a task for every value of a vector.

For instance, in R, there is a built-in variable called month.name. Type month.name into your R console to see what it looks like. If we want to iterate along the values of month.name, we can:

for (i in month.name)
  print(i)
## [1] "January"
## [1] "February"
## [1] "March"
## [1] "April"
## [1] "May"
## [1] "June"
## [1] "July"
## [1] "August"
## [1] "September"
## [1] "October"
## [1] "November"
## [1] "December"

We can even pick out the first 3 letters of each month name and store them into a vector called abbr3.

# Create new vector of the correct length
abbr3 <- rep("", length(month.name))

# We have to iterate along the index (1 to length) instead of the name 
# in this case because we want to store the result in a corresponding
# row of a new vector
for (i in 1:length(month.name))
  abbr3[i] <- substr(month.name[i], 1, 3)

# We can combine the two vectors into a data frame 
# so that each row corresponds to a month and there are two columns:
# full month name, and abbreviation
data.frame(full_name = month.name, abbrev = abbr3)
##    full_name abbrev
## 1    January    Jan
## 2   February    Feb
## 3      March    Mar
## 4      April    Apr
## 5        May    May
## 6       June    Jun
## 7       July    Jul
## 8     August    Aug
## 9  September    Sep
## 10   October    Oct
## 11  November    Nov
## 12  December    Dec

In python, we have to define our vector or list to start out with, but that’s easy enough:

import calendar
# Create a list with month names. calendar.month_name has "" as its
# first entry so that month numbers 1-12 index the names directly,
# so we drop that empty entry
month_name = list(calendar.month_name)[1:13]

for i in month_name:
  print(i)
## January
## February
## March
## April
## May
## June
## July
## August
## September
## October
## November
## December

We can even pick out the first 3 letters of each month name and store them into a vector called abbr3.

Python handles lists best when you use pythonic expressions. The linked post has an excellent explanation of why enumerate works best here.

# Create new vector of the correct length
abbr3 = [""] * len(month_name)

# We have to iterate along the index because we want to 
# store the result in a corresponding row of a new vector
# Python allows us to iterate along both the index i and the value val
# at the same time, which is convenient.
for i, val in enumerate(month_name):
  abbr3[i] = val[0:3] # Strings are indexed by character, so this gets 
                      # characters 0, 1, and 2.
  
abbr3
## ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
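The for loop also offers a more compact way to redo the tax bracket exercise from Section 13.2.2: store each bracket's upper limit and rate in a list and loop over the pairs. This is one possible sketch, not the only way to do it. The names brackets, lower, upper, and taxed_here are illustrative, and using float("inf") to stand for "no upper limit" on the top bracket is an assumption made for this example.

```python
income = 200000

# (upper limit of bracket, rate) for single filers in 2022
brackets = [
  (10275, 0.10),
  (41775, 0.12),
  (89075, 0.22),
  (170050, 0.24),
  (215950, 0.32),
  (539900, 0.35),
  (float("inf"), 0.37),  # assumption: top bracket has no upper limit
]

taxes = 0  # taxes accumulated so far
lower = 0  # lower limit of the current bracket

for upper, rate in brackets:
  if income > lower:
    # Portion of the income that falls inside this bracket
    taxed_here = min(income, upper) - lower
    taxes = taxes + taxed_here * rate
  lower = upper  # the next bracket starts where this one ends

print("Total Tax Rate on $", income, " in income = ", round(taxes/income, 4)*100, "%")
```

Each pass through the loop does the job of one of the copy-pasted if/else blocks, so changing the tax table only requires editing the list.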

13.4 Other Control Structures

13.4.1 Conditional Statements

case statements, e.g. case_when in tidyverse

13.4.2 Loops

13.4.2.1 Controlling Loops

While I do not often use break, next, and continue statements, they can be useful for controlling the flow of program execution: break exists in both languages, while skipping to the next iteration is written next in R and continue in Python. I have moved the section on this to Section 43.2 for the sake of brevity and to reduce the amount of new material those without programming experience are being exposed to in this section.

13.4.2.2 Other Types of Loops

There are other types of loops in most languages, such as the do-while loop, which runs the code first and then evaluates the logical condition to determine whether the loop will be run again.

Example: do-while loops


In R, do-while loops are most naturally implemented using a very primitive type of iteration: a repeat statement.

repeat {
  # statements go here
  if (condition)
    break # this exits the repeat statement
}

In Python, do-while loops are most naturally implemented using a while loop with the condition True:

while True:
  # statements go here
  if condition:
    break
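As a runnable sketch of this pattern, the example below starts with a value that already meets the exit condition. The loop body still runs once, which is the defining feature of a do-while loop; an ordinary while loop with the same test would not run at all. The names x, n, y, and m are illustrative, and the loop uses Python's built-in constant True.

```python
# do-while style: the body runs first, the condition is checked last
x = 0.5
n = 0  # number of times the body has run
while True:
  x = x / 2
  n = n + 1
  if x < 1:
    break  # exit once the condition is met

# Ordinary while loop with the same test: the body never runs,
# because the condition is checked before the first iteration
y = 0.5
m = 0
while y >= 1:
  y = y / 2
  m = m + 1

print("do-while style ran", n, "time(s); x =", x)
print("plain while ran", m, "time(s); y =", y)
```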

An additional means of running code an indeterminate number of times is the use of recursion, which we cannot cover until we learn about functions. I have added an additional section, Section 43.3, to cover this topic, but it is not essential to being able to complete most basic data programming tasks. Recursion is useful when working with structures such as trees (including phylogenetic trees) and nested lists.
