10 Data Structures
This chapter introduces some of the most important structures for storing and working with data: vectors, matrices, lists, and data frames.
10.1 Objectives
- Understand the differences between lists, vectors, data frames, matrices, and arrays in R and python
- Be able to use location-based indexing in R or python to pull out subsets of a complex data object
10.2 Python Package Installation
You will need the numpy and pandas packages for this section. Pick one of the following ways to install python packages:

Option 1 (R console): This package installation method requires that you have a virtual environment set up (that is, if you are on Windows, don’t try to install packages this way).

reticulate::py_install(c("numpy", "pandas"))

Option 2 (python chunk): In a python chunk (or the python terminal), you can run the following command. This depends on something called “IPython magic” commands, so if it doesn’t work for you, try the System Terminal method instead.

%pip install numpy pandas

Option 3 (system terminal): Run the following command in your system terminal:

pip3 install numpy pandas
10.3 Data Structures Overview
In Chapter 8, we discussed 4 different data types: strings/characters, numeric/double/floats, integers, and logical/booleans. As you might imagine, things are about to get more complicated.
Data structures are more complex arrangements of information, but they are still (usually) created using the same data types we have previously discussed.
 | Homogeneous | Heterogeneous |
---|---|---|
1D | vector | list |
2D | matrix | data frame |
N-D | array | |
Those of you who have taken programming classes that were more computer science focused will realize that I am leaving out a lot of information about lower-level structures like pointers. I’m making a deliberate choice to gloss over most of those details in this chapter, because it’s already hard enough to learn 2 languages worth of data structures at a time. In addition, R doesn’t have pointers [1], so leaving out this material in python streamlines teaching both languages, at the cost of oversimplifying some python concepts. If you want to read more about the Python concepts I’m leaving out, check out [2].
10.4 Lists
A list is a one-dimensional column of heterogeneous data - the things stored in a list can be of different types.
x <- list("a", 3, FALSE)
x
## [[1]]
## [1] "a"
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] FALSE
x = ["a", 3, False]
x
## ['a', 3, False]
The most important thing to know about lists, for the moment, is how to pull things out of the list. We call that process indexing.
10.4.1 Indexing
Every element in a list has an index (a location, indicated by an integer position)1.
In R, we count from 1.
x <- list("a", 3, FALSE)
x[1] # This returns a list
## [[1]]
## [1] "a"
x[1:2] # This returns multiple elements in the list
## [[1]]
## [1] "a"
##
## [[2]]
## [1] 3
x[[1]] # This returns the item
## [1] "a"
x[[1:2]] # This doesn't work - you can only use [[]] with a single index
## Error in x[[1:2]]: subscript out of bounds
In R, list indexing with [] will return a list with the specified elements. To actually retrieve the item in the list, use [[]]. The only downside to [[]] is that you can only access one thing at a time.
In Python, we count from 0.
x = ["a", 3, False]

x[0]
## 'a'
x[1]
## 3
x[0:2]
## ['a', 3]
In Python, we can use single brackets to get an object or a list back out, but we have to know how slices work. Essentially, in Python, 0:2 indicates that we want objects 0 and 1, but want to stop at 2 (not including 2). If you use a slice, Python will return a list; if you use a single index, python just returns the value in that location in the list.
We’ll talk more about indexing as it relates to vectors, but indexing is a general concept that applies to just about any multi-value object.
10.5 Vectors
A vector is a one-dimensional column of homogeneous data. Homogeneous means that every element in a vector has the same data type.
We can have vectors of any data type and length we want.
10.5.1 Indexing by Location
Each element in a vector has an index - an integer telling you what the item’s position within the vector is. I’m going to demonstrate indices with a string vector.
R | Python |
---|---|
1-indexed language | 0-indexed language |
Count elements as 1, 2, 3, 4, …, N | Count elements as 0, 1, 2, 3, …, N-1 |
In R, we create vectors with the c() function, which stands for “concatenate” - basically, we stick a bunch of objects into a row.
digits_pi <- c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5)
# Access individual entries
digits_pi[1]
## [1] 3
digits_pi[2]
## [1] 1
digits_pi[3]
## [1] 4
# R is 1-indexed - a list of 11 things goes from 1 to 11
digits_pi[0]
## numeric(0)
digits_pi[11]
## [1] 5
# Print out the vector
digits_pi
## [1] 3 1 4 1 5 9 2 6 5 3 5
In python, we create vectors using the array function in the numpy module. To add a python module, we use the syntax import <name> as <nickname>. Many modules have conventional (and very short) nicknames - for numpy, we will use np as the nickname. Any functions we reference in the numpy module will then be called using np.fun_name() so that python knows where to find them.2
import numpy as np

digits_list = [3,1,4,1,5,9,2,6,5,3,5]
digits_pi = np.array(digits_list)

# Access individual entries
digits_pi[0]
## 3
digits_pi[1]
## 1
digits_pi[2]
## 4

# Python is 0 indexed - a list of 11 things goes from 0 to 10
digits_pi[0]
## 3
digits_pi[11]
## index 11 is out of bounds for axis 0 with size 11

# multiplication works on the whole vector at once
digits_pi * 2
## array([ 6,  2,  8,  2, 10, 18,  4, 12, 10,  6, 10])

# Print out the vector
print(digits_pi)
## [3 1 4 1 5 9 2 6 5 3 5]
Python has multiple things that look like vectors, including the pandas library’s Series structure. A Series is a one-dimensional array-like object containing a sequence of values and an associated array of labels (called its index).
import pandas as pd

digits_pi = pd.Series([3,1,4,1,5,9,2,6,5,3,5])

# Access individual entries
digits_pi[0]
## 3
digits_pi[1]
## 1
digits_pi[2]
## 4

# Python is 0 indexed - a list of 11 things goes from 0 to 10
digits_pi[0]
## 3
digits_pi[11]
## 11

# logical indexing works here too
digits_pi[digits_pi > 3]
## 2     4
## 4     5
## 5     9
## 7     6
## 8     5
## 10    5
## dtype: int64

# simple multiplication works in a vectorized manner
# that is, the whole vector is multiplied at once
digits_pi * 2
## 0      6
## 1      2
## 2      8
## 3      2
## 4     10
## 5     18
## 6      4
## 7     12
## 8     10
## 9      6
## 10    10
## dtype: int64

# Print out the series
print(digits_pi)
## 0     3
## 1     1
## 2     4
## 3     1
## 4     5
## 5     9
## 6     2
## 7     6
## 8     5
## 9     3
## 10    5
## dtype: int64
The Series object has a list of labels in the first printed column, and a list of values in the second. If we want, we can specify the labels manually to use as e.g. plot labels later:
import pandas as pd

weekdays = pd.Series(['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'], index = ['S', 'M', 'T', 'W', 'R', 'F', 'Sat'])

# access individual objs
weekdays[0]
## 'Sunday'
weekdays[1]
## 'Monday'
weekdays['S']
## 'Sunday'
weekdays['Sat']
## 'Saturday'

# access the index
weekdays.index
## Index(['S', 'M', 'T', 'W', 'R', 'F', 'Sat'], dtype='object')

weekdays.index[6] = 'Z' # you can't assign things to the index to change it
## Index does not support mutable operations

weekdays
## S         Sunday
## M         Monday
## T        Tuesday
## W      Wednesday
## R       Thursday
## F         Friday
## Sat     Saturday
## dtype: object
We can pull out items in a vector by indexing, but we can also replace specific things as well:
favorite_cats <- c("Grumpy", "Garfield", "Jorts", "Jean")
favorite_cats
## [1] "Grumpy" "Garfield" "Jorts" "Jean"
favorite_cats[2] <- "Nyan Cat"
favorite_cats
## [1] "Grumpy" "Nyan Cat" "Jorts" "Jean"
favorite_cats = ["Grumpy", "Garfield", "Jorts", "Jean"]
favorite_cats
## ['Grumpy', 'Garfield', 'Jorts', 'Jean']

favorite_cats[1] = "Nyan Cat"
favorite_cats
## ['Grumpy', 'Nyan Cat', 'Jorts', 'Jean']
If you’re curious about any of these cats, see the footnotes3.
10.5.2 Indexing with Logical Vectors
As you might imagine, we can create vectors of all sorts of different data types. One particularly useful trick is to create a logical vector that goes along with a vector of another type to use as a logical index.
If we let the black lego represent “True” and the grey lego represent “False”, we can use the logical vector to pull out all values in the main vector.
(Figure: two copies of the same lego vector indexed with a logical vector, under each labeling convention - Black = True, Grey = False on the left, and Grey = True, Black = False on the right - picking out complementary subsets.)
Note that for logical indexing to work properly, the logical index must be the same length as the vector we’re indexing. This constraint will return when we talk about data frames, but for now just keep in mind that logical indexing doesn’t make sense when this constraint isn’t true.
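A quick numpy sketch of this length constraint (the vector names here are made up for illustration): an index of the right length works, while a too-short index raises an error.

```python
import numpy as np

days = np.array(["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"])

# Same length as days: a valid logical index
weekend = np.array([True, False, False, False, False, False, True])
days[weekend]  # picks out 'Sun' and 'Sat'

# Wrong length: numpy refuses with an IndexError
too_short = np.array([True, False])
try:
    days[too_short]
except IndexError as e:
    print("IndexError:", e)
```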
# Define a character vector
weekdays <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
weekend <- c("Sunday", "Saturday")
# Create logical vectors
relax_days <- c(1, 0, 0, 0, 0, 0, 1) # doing this the manual way
relax_days <- weekdays %in% weekend # This creates a logical vector
# with less manual construction
relax_days
## [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE
school_days <- !relax_days # FALSE if weekend, TRUE if not
school_days
## [1] FALSE TRUE TRUE TRUE TRUE TRUE FALSE
# Using logical vectors to index the character vector
weekdays[school_days] # print out all school days
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
import numpy as np

animals = np.array(["Cat", "Dog", "Snake", "Lizard", "Tarantula", "Hamster", "Gerbil", "Otter"])

# Define a logical vector
good_pets = np.array([True, True, False, False, False, True, True, False])
bad_pets = np.invert(good_pets) # Invert the logical vector
                                # so True -> False and False -> True

animals[good_pets]
## array(['Cat', 'Dog', 'Hamster', 'Gerbil'], dtype='<U9')
animals[bad_pets]
## array(['Snake', 'Lizard', 'Tarantula', 'Otter'], dtype='<U9')
animals[~good_pets] # equivalent to using bad_pets
## array(['Snake', 'Lizard', 'Tarantula', 'Otter'], dtype='<U9')
10.5.3 Reviewing Types
As vectors are a collection of things of a single type, what happens if we try to make a vector with differently-typed things?
import numpy as np

np.array([2, False, 3.1415, "animal"]) # all converted to strings
## array(['2', 'False', '3.1415', 'animal'], dtype='<U32')
np.array([2, False, 3.1415]) # converted to floats
## array([2.    , 0.    , 3.1415])
np.array([2, False]) # converted to integers
## array([2, 0])
As a reminder, this is an example of implicit type conversion - R and python decide what type to use for you, going with the type that doesn’t lose data but takes up as little space as possible.
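If you’d rather control the conversion yourself, numpy arrays support explicit conversion via the astype method (a sketch; note that converting floats to integers truncates):

```python
import numpy as np

x = np.array([2, False, 3.1415])  # implicitly converted to floats

x.astype(int)  # explicit conversion; truncates 3.1415 to 3
x.astype(str)  # every element becomes a string
```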
Try it Out!
Create a vector of the integers from one to 30. Use logical indexing to pick out only the numbers which are multiples of 3.
x <- 1:30
x[x %% 3 == 0]
## [1] 3 6 9 12 15 18 21 24 27 30
import numpy as np

x = np.array(range(1, 31)) # range(1, 31) because python is 0 indexed
x[x % 3 == 0]
## array([ 3,  6,  9, 12, 15, 18, 21, 24, 27, 30])
Extra challenge: Pick out numbers which are multiples of 2 or 3, but not multiples of 6!
This operation is xor, a.k.a. exclusive or. That is, X or Y, but not X AND Y.
We can write xor as (X OR Y) & !(X AND Y), or we can use a predefined function: xor() in R, ^ in python.
import numpy as np

x = np.array(range(1, 31))

x2 = x % 2 == 0 # multiples of 2
x3 = x % 3 == 0 # multiples of 3
x2xor3 = x2 ^ x3

x[x2xor3]
## array([ 2,  3,  4,  8,  9, 10, 14, 15, 16, 20, 21, 22, 26, 27, 28])
10.6 Matrices
A matrix is the next step after a vector - it’s a set of values arranged in a two-dimensional, rectangular format.
# Minimal matrix in R: take a vector,
# tell R how many rows you want
matrix(1:12, nrow = 3)
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
matrix(1:12, ncol = 3) # or columns
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
# by default, R will fill in column-by-column
# the byrow parameter tells R to go row-by-row
matrix(1:12, nrow = 3, byrow = T)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
# We can also easily create square matrices
# with a specific diagonal (this is useful for modeling)
diag(rep(1, times = 4))
## [,1] [,2] [,3] [,4]
## [1,] 1 0 0 0
## [2,] 0 1 0 0
## [3,] 0 0 1 0
## [4,] 0 0 0 1
In python, matrices are just a special case of a class called ndarray - n-dimensional arrays.
import numpy as np

# Minimal ndarray in python by typing in the values in a structured format
# This syntax creates a list of the rows we want in our matrix
np.array([[0, 1, 2],
          [3, 4, 5],
          [6, 7, 8],
          [9, 10, 11]])
## array([[ 0,  1,  2],
##        [ 3,  4,  5],
##        [ 6,  7,  8],
##        [ 9, 10, 11]])

# Matrix in python using a data vector and size parameters
np.reshape(range(0,12), (3,4))
## array([[ 0,  1,  2,  3],
##        [ 4,  5,  6,  7],
##        [ 8,  9, 10, 11]])
np.reshape(range(0,12), (4,3))
## array([[ 0,  1,  2],
##        [ 3,  4,  5],
##        [ 6,  7,  8],
##        [ 9, 10, 11]])
np.reshape(range(0,12), (3,4), order = 'F')
## array([[ 0,  3,  6,  9],
##        [ 1,  4,  7, 10],
##        [ 2,  5,  8, 11]])
In python, we create 2-dimensional arrays (aka matrices) either by creating a list of rows to join together or by reshaping a 1-dimensional array. The trick with reshaping the 1-dimensional array is the order argument: ‘F’ stands for “Fortran-like” and ‘C’ stands for “C-like”… so to go by column, you use ‘F’ and to go by row, you use ‘C’. Totally intuitive, right?
Most of the problems we’re going to work on will not require much in the way of matrix or array operations. For now, you need the following:
- Know that matrices exist and what they are (2-dimensional arrays of numbers)
- Understand how they are indexed (because it is extremely similar to data frames that we’ll work with in the next chapter)
- Be aware that there are lots of functions that depend on matrix operations at their core (including linear regression)
For more on matrix operations and matrix calculations, see Chapter 11.
10.6.1 Indexing in Matrices
Both R and python use [row, column] to index matrices. To extract the bottom-left element of a 3x4 matrix in R, we would use [3,1] to get to the third row and first column entry; in python, we would use [2,0] (remember that Python is 0-indexed).
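The same bracket notation also extracts whole rows or columns in python when combined with the : slice (a sketch using the 3x4 matrix layout described above):

```python
import numpy as np

mat = np.reshape(range(0, 12), (3, 4))

mat[2, 0]  # bottom-left element (third row, first column): 8
mat[0, :]  # entire first row
mat[:, 1]  # entire second column
```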
As with vectors, you can replace elements in a matrix using assignment.
my_mat <- matrix(1:12, nrow = 3, byrow = T)
my_mat[3,1] <- 500
my_mat
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 500 10 11 12
Remember that zero-indexing!
import numpy as np

my_mat = np.reshape(range(1, 13), (3,4))
my_mat[2,0] = 500

my_mat
## array([[  1,   2,   3,   4],
##        [  5,   6,   7,   8],
##        [500,  10,  11,  12]])
10.6.2 Matrix Operations
There are a number of matrix operations that we need to know for basic programming purposes:
- scalar multiplication \[c*\textbf{X} = c * \left[\begin{array}{cc} x_{1,1} & x_{1, 2}\\x_{2,1} & x_{2,2}\end{array}\right] = \left[\begin{array}{cc} c*x_{1,1} & c*x_{1, 2}\\c*x_{2,1} & c*x_{2,2}\end{array}\right]\]
- transpose - flip the matrix across the left top -> right bottom diagonal. \[t(\textbf{X}) = \left[\begin{array}{cc} x_{1,1} & x_{1, 2}\\x_{2,1} & x_{2,2}\end{array}\right]^T = \left[\begin{array}{cc} x_{1,1} & x_{2,1}\\x_{1,2} & x_{2,2}\end{array}\right]\]
- matrix multiplication (dot product) - If you haven’t had this in Linear Algebra, here’s a preview. See [3] for a better explanation \[\textbf{X}*\textbf{Y} = \left[\begin{array}{cc} x_{1,1} & x_{1, 2}\\x_{2,1} & x_{2,2}\end{array}\right] * \left[\begin{array}{cc} y_{1,1} \\y_{2,1} \end{array}\right] = \left[\begin{array}{c}x_{1,1}*y_{1,1} + x_{1,2}*y_{2,1} \\x_{2, 1}*y_{1,1} + x_{2,2}*y_{2,1}\end{array}\right]\] Note that matrix multiplication depends on having matrices of compatible dimensions. If you have two matrices of dimension \((a \times b)\) and \((c \times d)\), then \(b\) must be equal to \(c\) for the multiplication to work, and your result will be \((a \times d)\).
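The dimension-compatibility rule is easy to verify in numpy, where multiplying incompatible shapes raises an error (a sketch):

```python
import numpy as np

a = np.ones((2, 3))  # (a x b) = (2 x 3)
b = np.ones((3, 4))  # (c x d) = (3 x 4); b == c, so this works
(a @ b).shape        # result is (a x d) = (2, 4)

# (2 x 3) times (4 x 2) fails because 3 != 4
try:
    a @ np.ones((4, 2))
except ValueError as e:
    print("ValueError:", e)
```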
x <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = T)
y <- matrix(c(5, 6), nrow = 2)
# Scalar multiplication
x * 3
## [,1] [,2]
## [1,] 3 6
## [2,] 9 12
3 * x
## [,1] [,2]
## [1,] 3 6
## [2,] 9 12
# Transpose
t(x)
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
t(y)
## [,1] [,2]
## [1,] 5 6
# matrix multiplication (dot product)
x %*% y
## [,1]
## [1,] 17
## [2,] 39
import numpy as np

x = np.array([[1,2],[3,4]])
y = np.array([[5],[6]])

# scalar multiplication
x*3
## array([[ 3,  6],
##        [ 9, 12]])
3*x
## array([[ 3,  6],
##        [ 9, 12]])

# transpose
x.T # shorthand
## array([[1, 3],
##        [2, 4]])
x.transpose() # Long form
## array([[1, 3],
##        [2, 4]])

# Matrix multiplication (dot product)
np.dot(x, y)
## array([[17],
##        [39]])
10.7 Arrays
Arrays are a generalized n-dimensional version of a vector: all elements have the same type, and they are indexed using square brackets in both R and python: [dim1, dim2, dim3, ...]
I don’t think you will need to create 3+ dimensional arrays in this class, but if you want to try it out, here is some code.
Note that displaying this requires 2 slices, since it’s hard to display 3D information in a 2D terminal arrangement.
import numpy as np

np.array([[[1,2],[3,4]],[[5,6],[7,8]]])
## array([[[1, 2],
##         [3, 4]],
##
##        [[5, 6],
##         [7, 8]]])
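Indexing this 3-dimensional array uses one index per dimension, just like the 2D case (a sketch):

```python
import numpy as np

arr = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

arr.shape      # (2, 2, 2)
arr[0, 1, 1]   # first 2x2 slice, second row, second column: 4
arr[1, :, :]   # the whole second 2x2 slice
```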
10.8 Data Frames
In the previous sections, we talked about homogeneous structures: arrangements of data, like vectors and matrices, where every entry in the larger structure has the same type. In the rest of this chapter, we’ll be talking about the root of most data science analysis projects: the data frame.
Like an excel spreadsheet, data frames are arrangements of data in columns and rows.
This format has two main restrictions:
- Every entry in each column must have the same data type
- Every column must have the same number of rows
The picture above shows a data frame of 4 columns, each with a different data type (brick size/hue). The data frame has 12 rows. This picture may look similar to one that we used to show logical indexing in the last chapter, and that is not a coincidence. You can get everything from a data frame that you would get from a collection of 4 separate vectors… but there are advantages to keeping things in a data frame instead.
Consider for a moment https://worldpopulationreview.com/states, which lists the population of each state. You can find this dataset in CSV form here.
In the previous sections, we learned how to make different vectors in R, numpy, and pandas. Let’s see what happens when we work with the data above as a set of vectors/Series compared to what happens when we work with data frames.
(I’m going to cheat and read this in using pandas functions we haven’t learned yet to demonstrate why this stuff matters.)
import pandas as pd

data = pd.read_html("https://worldpopulationreview.com/states")[0]
list(data.columns) # get names

# Create a few population series
population2022 = pd.Series(data['2022 Population'].values, index = data['State'].values)
population2021 = pd.Series(data['2021 Population'].values, index = data['State'].values)
population2010 = pd.Series(data['2010 Census'].values, index = data['State'].values)
Suppose that we want to sort each population vector by the population in that year.
import pandas as pd

data = pd.read_html("https://worldpopulationreview.com/states")[0]

population2022 = pd.Series(data['2022 Population'].values, index = data['State'].values).sort_values()
population2021 = pd.Series(data['2021 Population'].values, index = data['State'].values).sort_values()
population2010 = pd.Series(data['2010 Census'].values, index = data['State'].values).sort_values()

population2022.head()
population2021.head()
population2010.head()
The only problem is that by doing this, we’ve now lost the ordering that matched across all 3 vectors. Pandas Series are great for this, because they use labels that allow us to reconstitute which value corresponds to which label, but in R or even in numpy arrays, vectors don’t inherently come with labels. In these situations, sorting by one value can actually destroy the connection between two vectors!
df <- read.csv("https://raw.githubusercontent.com/srvanderplas/Stat151/main/data/population2022.csv")
# Use vectors instead of the data frame
state <- df$State
pop2022 <- df$Pop
pop2021 <- df$Pop2021
pop2010 <- df$Pop2010
# Create a vector to index population in 2022 in order
order2022 <- order(pop2022)
# To keep variables together, we have to do things like this:
head(state[order2022])
## [1] "Wyoming" "Vermont" "District of Columbia"
## [4] "Alaska" "North Dakota" "South Dakota"
head(pop2022[order2022])
## [1] 582233 622882 718355 720763 774008 902542
# It makes more sense just to reorder the whole data frame:
head(df[order2022,])
## rank State Pop Growth Pop2021 Pop2010 growthSince2010
## 52 52 Wyoming 582233 0.0020 581075 564487 0.0314
## 51 51 Vermont 622882 -0.0006 623251 625879 -0.0048
## 50 50 District of Columbia 718355 0.0059 714153 605226 0.1869
## 49 49 Alaska 720763 -0.0050 724357 713910 0.0096
## 48 48 North Dakota 774008 0.0052 770026 674715 0.1472
## 47 47 South Dakota 902542 0.0066 896581 816166 0.1058
## Percent density
## 52 0.0017 5.9967
## 51 0.0019 67.5797
## 50 0.0021 11776.3115
## 49 0.0021 1.2631
## 48 0.0023 11.2173
## 47 0.0027 11.9052
The primary advantage to data frames is that rows of data are kept together. Since we often think of a row of data as a single observation in a sample, this is an extremely important feature that makes data frames a huge improvement on a collection of vectors of the same length: it’s much harder for observations in a single row to get shuffled around and mismatched!
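Here is a minimal pandas sketch of that advantage (the data below is made up for illustration): sorting a DataFrame reorders entire rows, so each state stays attached to its population.

```python
import pandas as pd

df = pd.DataFrame({
    "state": ["Wyoming", "Texas", "Vermont"],
    "pop2022": [582233, 29945493, 622882]})

# sort_values reorders whole rows at once; state and pop2022 stay matched
df.sort_values("pop2022")
```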
10.8.1 Data Frame Basics
In R, data frames are built in as type data.frame, though there are packages that provide other implementations of data frames that have additional features, such as the tibble package used in many other common packages. We will cover functions from both base R and the tibble package in this chapter.

In Python, we will use the pandas library, which is conventionally abbreviated pd. So before you use any data frames in python, you will need to add the following line to your code: import pandas as pd.
10.8.1.1 Examining Data Frames
When you examine the structure of a data frame, as shown below, you get each column shown in a row, with its type and the first few values in the column. The head(n) command shows the first \(n\) rows of a data frame (enough to see what’s there, not enough to overflow your screen).
data(mtcars) # Load the data -- included in base R
head(mtcars) # Look at the first 6 rows
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
str(mtcars) # Examine the structure of the object
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
You can change column values or add new columns easily using assignment. The summary() function can be used on specific columns to perform summary operations (a 5-number summary useful for making e.g. boxplots is provided by default).

Often, it is useful to know the dimensions of a data frame. The number of rows can be obtained by using nrow(df) and similarly, the columns can be obtained using ncol(df) (or, get both with dim()). There is also an easy way to get a summary of each column in the data frame, using summary().
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb gpm
## Min. :0.0000 Min. :3.000 Min. :1.000 Min. :0.02950
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:0.04386
## Median :0.0000 Median :4.000 Median :2.000 Median :0.05208
## Mean :0.4062 Mean :3.688 Mean :2.812 Mean :0.05423
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:0.06483
## Max. :1.0000 Max. :5.000 Max. :8.000 Max. :0.09615
dim(mtcars)
## [1] 32 12
nrow(mtcars)
## [1] 32
ncol(mtcars)
## [1] 12
Missing variables in an R data frame are indicated with NA.
When you examine the structure of a data frame, as shown below, you get each column shown in a row, with its type and the first few values in the column. The df.head(n) command shows the first \(n\) rows of a data frame (enough to see what’s there, not enough to overflow your screen).
import pandas as pd

mtcars = pd.read_csv("https://vincentarelbundock.github.io/Rdatasets/csv/datasets/mtcars.csv")

mtcars.head(5)
##             rownames   mpg  cyl   disp   hp  ...   qsec  vs  am  gear  carb
## 0          Mazda RX4  21.0    6  160.0  110  ...  16.46   0   1     4     4
## 1      Mazda RX4 Wag  21.0    6  160.0  110  ...  17.02   0   1     4     4
## 2         Datsun 710  22.8    4  108.0   93  ...  18.61   1   1     4     1
## 3     Hornet 4 Drive  21.4    6  258.0  110  ...  19.44   1   0     3     1
## 4  Hornet Sportabout  18.7    8  360.0  175  ...  17.02   0   0     3     2
##
## [5 rows x 12 columns]
mtcars.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 32 entries, 0 to 31
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 rownames 32 non-null object
## 1 mpg 32 non-null float64
## 2 cyl 32 non-null int64
## 3 disp 32 non-null float64
## 4 hp 32 non-null int64
## 5 drat 32 non-null float64
## 6 wt 32 non-null float64
## 7 qsec 32 non-null float64
## 8 vs 32 non-null int64
## 9 am 32 non-null int64
## 10 gear 32 non-null int64
## 11 carb 32 non-null int64
## dtypes: float64(5), int64(6), object(1)
## memory usage: 3.1+ KB
You can change column values or add new columns easily using assignment. It’s also easy to access specific columns to perform summary operations. You can access a column named xyz using df.xyz or using df["xyz"]. To create a new column, you must use df["xyz"].
mtcars["gpm"] = 1/mtcars.mpg # gpm is sometimes used to assess efficiency

mtcars.gpm.describe()
## count    32.000000
## mean 0.054227
## std 0.016424
## min 0.029499
## 25% 0.043860
## 50% 0.052083
## 75% 0.064834
## max 0.096154
## Name: gpm, dtype: float64
mtcars.mpg.describe()
## count    32.000000
## mean 20.090625
## std 6.026948
## min 10.400000
## 25% 15.425000
## 50% 19.200000
## 75% 22.800000
## max 33.900000
## Name: mpg, dtype: float64
Often, it is useful to know the dimensions of a data frame. The dimensions of a data frame (rows x columns) can be accessed using df.shape. There is also an easy way to get a summary of each column in the data frame, using df.describe().
mtcars.describe()
##              mpg        cyl        disp  ...       gear    carb        gpm
## count 32.000000 32.000000 32.000000 ... 32.000000 32.0000 32.000000
## mean 20.090625 6.187500 230.721875 ... 3.687500 2.8125 0.054227
## std 6.026948 1.785922 123.938694 ... 0.737804 1.6152 0.016424
## min 10.400000 4.000000 71.100000 ... 3.000000 1.0000 0.029499
## 25% 15.425000 4.000000 120.825000 ... 3.000000 2.0000 0.043860
## 50% 19.200000 6.000000 196.300000 ... 4.000000 2.0000 0.052083
## 75% 22.800000 8.000000 326.000000 ... 4.000000 4.0000 0.064834
## max 33.900000 8.000000 472.000000 ... 5.000000 8.0000 0.096154
##
## [8 rows x 12 columns]
mtcars.shape
## (32, 13)
Missing variables in a pandas data frame are indicated with NaN or None.
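A short sketch of spotting missing values in a pandas Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

s.isna()        # True wherever a value is missing
s.isna().sum()  # number of missing values: 1
```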
The dataset state.x77 contains information on US state statistics in the 1970s. By default, it is a matrix, but we can easily convert it to a data frame, as shown below.
data(state)
state_facts <- data.frame(state.x77)
state_facts <- cbind(state = row.names(state_facts), state_facts, stringsAsFactors = F)
# State names were stored as row labels
# Store them in a variable instead, and add it to the data frame
row.names(state_facts) <- NULL # get rid of row names
head(state_facts)
## state Population Income Illiteracy Life.Exp Murder HS.Grad Frost Area
## 1 Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## 2 Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## 3 Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## 4 Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## 5 California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## 6 Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
# Write data out so that we can read it in using Python
write.csv(state_facts, file = "data/state_facts.csv", row.names = F)
We can write out the built in R data and read it in using pd.read_csv, which creates a DataFrame in pandas.
import pandas as pd

state_facts = pd.read_csv("https://raw.githubusercontent.com/srvanderplas/unl-stat850/main/data/state_facts.csv")
1. How many rows and columns does it have? Can you find different ways to get that information?
2. The Illiteracy column contains the percent of the population of each state that is illiterate. Calculate the number of people in each state who are illiterate, and store that in a new column called TotalNumIlliterate. Note: Population contains the population in thousands.
3. Calculate the average population density of each state (population per square mile) and store it in a new column PopDensity. Using the R reference card, can you find functions that you can combine to get the state with the minimum population density?
# 3 ways to get rows and columns
str(state_facts)
## 'data.frame': 50 obs. of 9 variables:
## $ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Population: num 3615 365 2212 2110 21198 ...
## $ Income : num 3624 6315 4530 3378 5114 ...
## $ Illiteracy: num 2.1 1.5 1.8 1.9 1.1 0.7 1.1 0.9 1.3 2 ...
## $ Life.Exp : num 69 69.3 70.5 70.7 71.7 ...
## $ Murder : num 15.1 11.3 7.8 10.1 10.3 6.8 3.1 6.2 10.7 13.9 ...
## $ HS.Grad : num 41.3 66.7 58.1 39.9 62.6 63.9 56 54.6 52.6 40.6 ...
## $ Frost : num 20 152 15 65 20 166 139 103 11 60 ...
## $ Area : num 50708 566432 113417 51945 156361 ...
dim(state_facts)
## [1] 50 9
nrow(state_facts)
## [1] 50
ncol(state_facts)
## [1] 9
# Illiteracy
state_facts$TotalNumIlliterate <- state_facts$Population * 1e3 * (state_facts$Illiteracy/100)
# Population Density
state_facts$PopDensity <- state_facts$Population * 1e3/state_facts$Area
# in people per square mile
# minimum population
state_facts$state[which.min(state_facts$PopDensity)]
## [1] "Alaska"
# Ways to get rows and columns
state_facts.shape
## name 'state_facts' is not defined
# rows
state_facts.index.size
## name 'state_facts' is not defined
# columns
state_facts.columns.size
## name 'state_facts' is not defined
# columns + rows + missing counts + data types
state_facts.info()
## name 'state_facts' is not defined
# Illiteracy
state_facts["TotalNumIlliterate"] = state_facts["Population"] * 1e3 * state_facts["Illiteracy"]/100
## name 'state_facts' is not defined
# Population Density
state_facts["PopDensity"] = state_facts["Population"] * 1e3/state_facts["Area"]
## name 'state_facts' is not defined
# in people per square mile
# minimum population density
min_dens = state_facts["PopDensity"].min()
## name 'state_facts' is not defined
# Get location of minimum population density
loc_min_dens = state_facts.PopDensity.isin([min_dens])
## name 'state_facts' is not defined
# Pull out matching state
state_facts.state[loc_min_dens]
## name 'state_facts' is not defined
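The name 'state_facts' errors above all stem from the failed download: the 404 meant the data frame was never created, so every later line failed. As a sanity check of the same workflow, here is a minimal sketch on a hand-typed stand-in frame (state_mini and its values are made up for illustration); it also shows idxmin() as a more direct route to the minimum-density row than the min()/isin() combination:

```python
import pandas as pd

# Hypothetical stand-in for state_facts (made-up values)
state_mini = pd.DataFrame({
    "state": ["A", "B", "C"],
    "Population": [3615, 365, 2212],   # in thousands
    "Illiteracy": [2.1, 1.5, 1.8],     # percent
    "Area": [50708, 566432, 113417],   # square miles
})

# Same calculations as in the solution above
state_mini["TotalNumIlliterate"] = state_mini["Population"] * 1e3 * state_mini["Illiteracy"] / 100
state_mini["PopDensity"] = state_mini["Population"] * 1e3 / state_mini["Area"]

# idxmin() returns the index label of the minimum value directly
print(state_mini.loc[state_mini["PopDensity"].idxmin(), "state"])
```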
10.8.2 Creating Data Frames
It is possible to create data frames from scratch by building them out of simpler components, such as lists of vectors (in R) or dicts of Series (in pandas). This tends to be useful for small data sets, but it is more common to read data in from e.g. CSV files, something I've done several times already without showing you how (see Chapter 17 for the full how-to).
10.8.2.1 Data Frames from Scratch
math_and_lsd <- data.frame(
lsd_conc = c(1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41),
test_score = c(78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97))
math_and_lsd
## lsd_conc test_score
## 1 1.17 78.93
## 2 2.97 58.20
## 3 3.26 67.47
## 4 4.69 37.47
## 5 5.83 45.65
## 6 6.00 32.92
## 7 6.41 29.97
# add a column - character vector
math_and_lsd$subjective <- c("finally coming back", "getting better", "it's totally better", "really tripping out", "is it over?", "whoa, man", "I can taste color, but I can't do math")
math_and_lsd
## lsd_conc test_score subjective
## 1 1.17 78.93 finally coming back
## 2 2.97 58.20 getting better
## 3 3.26 67.47 it's totally better
## 4 4.69 37.47 really tripping out
## 5 5.83 45.65 is it over?
## 6 6.00 32.92 whoa, man
## 7 6.41 29.97 I can taste color, but I can't do math
math_and_lsd = pd.DataFrame({
  "lsd_conc": [1.17, 2.97, 3.26, 4.69, 5.83, 6.00, 6.41],
  "test_score": [78.93, 58.20, 67.47, 37.47, 45.65, 32.92, 29.97]})
math_and_lsd
## lsd_conc test_score
## 0 1.17 78.93
## 1 2.97 58.20
## 2 3.26 67.47
## 3 4.69 37.47
## 4 5.83 45.65
## 5 6.00 32.92
## 6 6.41 29.97
# add a column - character vector
math_and_lsd["subjective"] = ["finally coming back", "getting better", "it's totally better", "really tripping out", "is it over?", "whoa, man", "I can taste color, but I can't do math"]
math_and_lsd
## lsd_conc test_score subjective
## 0 1.17 78.93 finally coming back
## 1 2.97 58.20 getting better
## 2 3.26 67.47 it's totally better
## 3 4.69 37.47 really tripping out
## 4 5.83 45.65 is it over?
## 5 6.00 32.92 whoa, man
## 6 6.41 29.97 I can taste color, but I can't do math
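The pandas example above builds the frame from a dict of lists; a dict of Series works just as well, and lets each column carry its own dtype explicitly. A minimal sketch (lsd_demo is a hypothetical name, and only the first three rows are reused):

```python
import pandas as pd

# Dict of Series: columns are aligned on their (shared) default index
lsd_demo = pd.DataFrame({
    "lsd_conc": pd.Series([1.17, 2.97, 3.26], dtype="float64"),
    "test_score": pd.Series([78.93, 58.20, 67.47], dtype="float64"),
})

print(lsd_demo.shape)   # (3, 2)
```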
While it’s not so hard to create data frames from scratch for small data sets, it’s very tedious if you have a lot of data (or if you can’t type accurately). An easier way to create a data frame (rather than typing the whole thing in) is to read in data from somewhere else - a file, a table on a webpage, etc. We’re not going to go into the finer points of this (you’ll get into that in Chapter 17), but it is useful to know how to read neatly formatted data.
One source of (relatively neat) data is the TidyTuesday github repository.
10.8.2.2 Reading in Data
In Base R, we can read the data in using the read.csv function.
airmen <- read.csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-08/airmen.csv')
head(airmen)
## name last_name first_name graduation_date
## 1 Adams, John H., Jr. Adams John H., Jr. 1945-04-15T00:00:00Z
## 2 Adams, Paul Adams Paul 1943-04-29T00:00:00Z
## 3 Adkins, Rutherford H. Adkins Rutherford H. 1944-10-16T00:00:00Z
## 4 Adkins, Winston A. Adkins Winston A. 1944-02-08T00:00:00Z
## 5 Alexander, Halbert L. Alexander Halbert L. 1944-11-20T00:00:00Z
## 6 Alexander, Harvey R. Alexander Harvey R. 1944-04-15T00:00:00Z
## rank_at_graduation class graduated_from pilot_type
## 1 2nd Lt SE-45-B TAAF Single engine
## 2 2nd Lt SE-43-D TAAF Single engine
## 3 2nd Lt SE-44-I-1 TAAF Single engine
## 4 2nd Lt TE-44-B TAAF Twin engine
## 5 2nd Lt SE-44-I TAAF Single engine
## 6 2nd Lt TE-44-D TAAF Twin engine
## military_hometown_of_record state aerial_victory_credits
## 1 Kansas City KS <NA>
## 2 Greenville SC <NA>
## 3 Alexandria VA <NA>
## 4 Chicago IL <NA>
## 5 Georgetown IL <NA>
## 6 Georgetown IL <NA>
## number_of_aerial_victory_credits reported_lost reported_lost_date
## 1 0 <NA> <NA>
## 2 0 <NA> <NA>
## 3 0 <NA> <NA>
## 4 0 <NA> <NA>
## 5 0 <NA> <NA>
## 6 0 <NA> <NA>
## reported_lost_location web_profile
## 1 <NA> https://cafriseabove.org/john-h-adams-jr/
## 2 <NA> https://cafriseabove.org/paul-adams/
## 3 <NA> https://cafriseabove.org/rutherford-h-adkins/
## 4 <NA> <NA>
## 5 <NA> https://cafriseabove.org/halbert-l-alexander/
## 6 <NA> https://cafriseabove.org/harvey-r-alexander/
If we want instead to create a tibble, we can use the readr package's read_csv function, which is a bit more robust and has a few additional features.
library(readr)
airmen <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-08/airmen.csv')
head(airmen)
## # A tibble: 6 × 16
## name last_name first_name graduation_date rank_at_graduation class
## <chr> <chr> <chr> <dttm> <chr> <chr>
## 1 Adams, John… Adams John H., … 1945-04-15 00:00:00 2nd Lt SE-4…
## 2 Adams, Paul Adams Paul 1943-04-29 00:00:00 2nd Lt SE-4…
## 3 Adkins, Rut… Adkins Rutherfor… 1944-10-16 00:00:00 2nd Lt SE-4…
## 4 Adkins, Win… Adkins Winston A. 1944-02-08 00:00:00 2nd Lt TE-4…
## 5 Alexander, … Alexander Halbert L. 1944-11-20 00:00:00 2nd Lt SE-4…
## 6 Alexander, … Alexander Harvey R. 1944-04-15 00:00:00 2nd Lt TE-4…
## # ℹ 10 more variables: graduated_from <chr>, pilot_type <chr>,
## # military_hometown_of_record <chr>, state <chr>,
## # aerial_victory_credits <chr>, number_of_aerial_victory_credits <dbl>,
## # reported_lost <chr>, reported_lost_date <dttm>,
## # reported_lost_location <chr>, web_profile <chr>
In pandas, we can read the csv using pd.read_csv.
import pandas as pd
airmen = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-02-08/airmen.csv")
airmen.head()
## name ... web_profile
## 0 Adams, John H., Jr. ... https://cafriseabove.org/john-h-adams-jr/
## 1 Adams, Paul ... https://cafriseabove.org/paul-adams/
## 2 Adkins, Rutherford H. ... https://cafriseabove.org/rutherford-h-adkins/
## 3 Adkins, Winston A. ... NaN
## 4 Alexander, Halbert L. ... https://cafriseabove.org/halbert-l-alexander/
##
## [5 rows x 16 columns]
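pd.read_csv accepts any file-like object, not just a URL or path, so you can test your reading code without a network connection by wrapping a string in io.StringIO. A minimal sketch (the two rows are copied from the output above; airmen_demo is a hypothetical name):

```python
import io
import pandas as pd

# A tiny CSV as a string; quoted fields protect the commas inside names
csv_text = '''name,pilot_type
"Adams, Paul",Single engine
"Adkins, Winston A.",Twin engine
'''

airmen_demo = pd.read_csv(io.StringIO(csv_text))
print(airmen_demo.shape)             # (2, 2)
print(airmen_demo["name"].iloc[0])   # Adams, Paul
```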
10.8.3 Working with Data Frames
Often, we want to know what a data frame contains. R and pandas both have easy summary methods for data frames.
10.8.3.1 Data Frame Summaries
Notice that the type of summary depends on the data type.
summary(airmen)
## name last_name first_name
## Length:1006 Length:1006 Length:1006
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## graduation_date rank_at_graduation class
## Min. :1942-03-06 00:00:00.000 Length:1006 Length:1006
## 1st Qu.:1943-10-22 00:00:00.000 Class :character Class :character
## Median :1944-05-23 00:00:00.000 Mode :character Mode :character
## Mean :1944-07-02 13:18:52.462
## 3rd Qu.:1945-04-15 00:00:00.000
## Max. :1948-10-12 00:00:00.000
## NA's :11
## graduated_from pilot_type military_hometown_of_record
## Length:1006 Length:1006 Length:1006
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## state aerial_victory_credits number_of_aerial_victory_credits
## Length:1006 Length:1006 Min. :0.0000
## Class :character Class :character 1st Qu.:0.0000
## Mode :character Mode :character Median :0.0000
## Mean :0.1118
## 3rd Qu.:0.0000
## Max. :4.0000
##
## reported_lost reported_lost_date reported_lost_location
## Length:1006 Min. :1943-07-02 Length:1006
## Class :character 1st Qu.:1943-07-02 Class :character
## Mode :character Median :1943-07-02 Mode :character
## Mean :1943-07-02
## 3rd Qu.:1943-07-02
## Max. :1943-07-02
## NA's :1004
## web_profile
## Length:1006
## Class :character
## Mode :character
##
##
##
##
library(skimr) # Fancier summaries
skim(airmen)
Name: airmen
Number of rows: 1006
Number of columns: 16
Column type frequency:
  character: 13
  numeric: 1
  POSIXct: 2
Group variables: None
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 9 | 28 | 0 | 1003 | 0 |
last_name | 0 | 1.00 | 3 | 12 | 0 | 617 | 0 |
first_name | 0 | 1.00 | 3 | 17 | 0 | 804 | 0 |
rank_at_graduation | 5 | 1.00 | 3 | 14 | 0 | 7 | 0 |
class | 20 | 0.98 | 3 | 9 | 0 | 72 | 0 |
graduated_from | 0 | 1.00 | 4 | 23 | 0 | 4 | 0 |
pilot_type | 0 | 1.00 | 11 | 13 | 0 | 5 | 0 |
military_hometown_of_record | 9 | 0.99 | 3 | 19 | 0 | 366 | 0 |
state | 11 | 0.99 | 2 | 5 | 0 | 48 | 0 |
aerial_victory_credits | 934 | 0.07 | 31 | 137 | 0 | 50 | 0 |
reported_lost | 1004 | 0.00 | 1 | 1 | 0 | 1 | 0 |
reported_lost_location | 1004 | 0.00 | 23 | 23 | 0 | 1 | 0 |
web_profile | 813 | 0.19 | 34 | 95 | 0 | 190 | 0 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
number_of_aerial_victory_credits | 0 | 1 | 0.11 | 0.46 | 0 | 0 | 0 | 0 | 4 | ▇▁▁▁▁ |
Variable type: POSIXct
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
graduation_date | 11 | 0.99 | 1942-03-06 | 1948-10-12 | 1944-05-23 | 52 |
reported_lost_date | 1004 | 0.00 | 1943-07-02 | 1943-07-02 | 1943-07-02 | 1 |
# All variables - strings are summarized with NaNs
airmen.describe(include = 'all')
## name ... web_profile
## count 1006 ... 193
## unique 1003 ... 190
## top Brothers, James E. ... https://cafriseabove.org/captain-graham-smith-...
## freq 2 ... 2
## mean NaN ... NaN
## std NaN ... NaN
## min NaN ... NaN
## 25% NaN ... NaN
## 50% NaN ... NaN
## 75% NaN ... NaN
## max NaN ... NaN
##
## [11 rows x 16 columns]
# Only summarize numeric variables
import numpy as np
airmen.describe(include = [np.number])
## number_of_aerial_victory_credits
## count 1006.000000
## mean 0.111829
## std 0.457844
## min 0.000000
## 25% 0.000000
## 50% 0.000000
## 75% 0.000000
## max 4.000000
# Only summarize string variables (objects)
airmen.describe(include = ['O'])
## name ... web_profile
## count 1006 ... 193
## unique 1003 ... 190
## top Brothers, James E. ... https://cafriseabove.org/captain-graham-smith-...
## freq 2 ... 2
##
## [4 rows x 15 columns]
# Get counts of how many NAs in each column
airmen.info(show_counts=True)
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 1006 entries, 0 to 1005
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 name 1006 non-null object
## 1 last_name 1006 non-null object
## 2 first_name 1006 non-null object
## 3 graduation_date 995 non-null object
## 4 rank_at_graduation 999 non-null object
## 5 class 986 non-null object
## 6 graduated_from 1006 non-null object
## 7 pilot_type 1006 non-null object
## 8 military_hometown_of_record 997 non-null object
## 9 state 995 non-null object
## 10 aerial_victory_credits 72 non-null object
## 11 number_of_aerial_victory_credits 1006 non-null float64
## 12 reported_lost 2 non-null object
## 13 reported_lost_date 2 non-null object
## 14 reported_lost_location 2 non-null object
## 15 web_profile 193 non-null object
## dtypes: float64(1), object(15)
## memory usage: 125.9+ KB
In pandas, you will typically want to separate out .describe() calls for numeric and non-numeric columns. Another handy function in pandas is .info(), which you can use to show the number of non-NA values. This is particularly useful in sparse datasets where there may be a LOT of missing values and you may want to find out which columns have useful information for more than just a few rows.
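To see the numeric/object split and the non-null counts side by side without re-downloading anything, here is a minimal sketch on a tiny made-up frame (df, credits, and rank are hypothetical names):

```python
import numpy as np
import pandas as pd

# Tiny frame: one numeric column, one string column, each with a missing value
df = pd.DataFrame({
    "credits": [0.0, 0.0, 2.0, np.nan],
    "rank": ["2nd Lt", "2nd Lt", "Capt", None],
})

# Numeric summary: quartiles etc.; count skips the NaN
num = df.describe(include=[np.number])
print(num.loc["count", "credits"])   # 3.0

# Object summary: count/unique/top/freq instead of quartiles
obj = df.describe(include=["O"])
print(obj.loc["top", "rank"])        # 2nd Lt

# info() reports non-null counts per column
df.info(show_counts=True)
```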