class(FALSE)
class(2L) # by default, R treats all numbers as numeric/decimal values.
# The L indicates that we're talking about an integer.
class(2)
class("Hello, programmer!")
[1] "logical"
[1] "integer"
[1] "numeric"
[1] "character"
Let’s start this section with some basic vocabulary.
1
, 2
, "Hello, World"
, and so on.2
is an integer, "Hello, World"
is a string (it contains a “string” of letters). Strings are in quotation marks to let us know that they are not variable names.In both R and python, there are some very basic data types:
logical or boolean - FALSE/TRUE or 0/1 values. Sometimes, boolean is shortened to bool
integer - whole numbers (positive or negative)
double or float or numeric- decimal numbers.
character or string - holds text, usually enclosed in quotes.
If you don’t know what type a value is, both R and python have functions to help you with that.
class(FALSE)
class(2L) # by default, R treats all numbers as numeric/decimal values.
# The L indicates that we're talking about an integer.
class(2)
class("Hello, programmer!")
[1] "logical"
[1] "integer"
[1] "numeric"
[1] "character"
type(False)
type(2)
type(3.1415)
type("This is python code")
<class 'bool'>
<class 'int'>
<class 'float'>
<class 'str'>
In R, boolean values are TRUE
and FALSE
, but in Python they are True
and False
. Capitalization matters a LOT.
Other things matter too: if we try to write a million, we would write it 1000000
instead of 1,000,000
(in both languages). Commas are used for separating numbers, not for proper spacing and punctuation of numbers. This is a hard thing to get used to but very important – especially when we start reading in data.
Programming languages use variables - names that refer to values. Think of a variable as a container that holds something - instead of referring to the value, you can refer to the container and you will get whatever is stored inside.
We assign variables values using the syntax object_name <- value
(R) or object_name = value
(python). You can read this as “object name gets value” in your head.
<- "So long and thanks for all the fish"
message <- 2025
year <- 42L
the_answer <- FALSE earth_demolished
= "So long and thanks for all the fish"
message = 2025
year = 42
the_answer = False earth_demolished
Note that in R, we assign variables values using the <-
operator, where in Python, we assign variables values using the =
operator. Technically, =
will work for assignment in both languages, but <-
is more common than =
in R by convention.
We can then use the variables - do numerical computations, evaluate whether a proposition is true or false, and even manipulate the content of strings, all by referencing the variable by name.
There are only two hard things in Computer Science: cache invalidation and naming things.
– Phil Karlton
Object names must start with a letter and can only contain letters, numbers, _
, and .
in R. In Python, object names must start with a letter and can consist of letters, numbers, and _
(that is, .
is not a valid character in a Python variable name). While it is technically fine to use uppercase variable names in Python, it’s recommended that you use lowercase names for variables (you’ll see why later).
What happens if we try to create a variable name that isn’t valid?
In both languages, starting a variable name with a number will get you an error message that lets you know that something isn’t right - “unexpected symbol” in R and “invalid syntax” in python.
<- "check your variable names!" 1st_thing
Error: <text>:1:2: unexpected symbol
1: 1st_thing
^
1st_thing <- "check your variable names!"
Note: Run the above chunk in your python window - the book won’t compile if I set it to evaluate 😿. It generates an error of SyntaxError: invalid syntax (<string>, line 1)
<- "this isn't valid" second.thing
Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'second' is not defined
Detailed traceback:
File "<string>", line 1, in <module>
In python, trying to have a .
in a variable name gets a more interesting error: “.
. We’ll get into this more later, but there is a good reason for python’s restriction about not using .
in variable names.
Naming things is difficult! When you name variables, try to make the names descriptive - what does the variable hold? What are you going to do with it? The more (concise) information you can pack into your variable names, the more readable your code will be.
Why is naming things hard? - Blog post by Neil Kakkar
There are a few different conventions for naming things that may be useful:
some_people_use_snake_case
, where words are separated by underscoressomePeopleUseCamelCase
, where words are appended but anything after the first word is capitalized (leading to words with humps like a camel).some.people.use.periods
(in R, obviously this doesn’t work in python)variables_thatLookLike.this
and they are almost universally hated 👿As long as you pick ONE naming convention and don’t mix-and-match, you’ll be fine. It will be easier to remember what you named your variables (or at least guess) and you’ll have fewer moments where you have to go scrolling through your script file looking for a variable you named.
We talked about values and types above, but skipped over a few details because we didn’t know enough about variables. It’s now time to come back to those details.
What happens when we have an integer and a numeric type and we add them together? Hopefully, you don’t have to think too hard about what the result of 2 + 3.5
is, but this is a bit more complicated for a computer for two reasons: storage, and arithmetic.
In days of yore, programmers had to deal with memory allocation - when declaring a variable, the programmer had to explicitly define what type the variable was. This tended to look something like the code chunk below:
int a = 1
double b = 3.14159
Typically, an integer would take up 32 bits of memory, and a double would take up 64 bits, so doubles used 2x the memory that integers did. Both R and python are dynamically typed, which means you don’t have to deal with any of the trouble of declaring what your variables will hold - the computer automatically figures out how much memory to use when you run the code. So we can avoid the discussion of memory allocation and types because we’re using higher-level languages that handle that stuff for us2.
But the discussion of types isn’t something we can completely avoid, because we still have to figure out what to do when we do operations on things of two different types - even if memory isn’t a concern, we still have to figure out the arithmetic question.
So let’s see what happens with a couple of examples, just to get a feel for type conversion (aka type casting or type coercion), which is the process of changing an expression from one data type to another.
mode(2L + 3.14159) # add 2 and pi
[1] "numeric"
mode(2L + TRUE) # add integer 2 and TRUE
[1] "numeric"
mode(TRUE + FALSE) # add TRUE and FALSE
[1] "numeric"
In R, all of the examples above are ‘numeric’ - basically, a catch-all class for things that are in some way, shape, or form numbers. Integers and decimal numbers are both numeric, but so are logicals (because they can be represented as 0 or 1).
type(2 + 3.14159)
<class 'float'>
type(2 + True)
<class 'int'>
type(True + False)
<class 'int'>
In python, by contrast, anything without a decimal point is converted into an integer - essentially, python tries to minimize memory used without losing data. So it will never convert a float into an integer implicitly - if you want Python to do that, you’ll have to tell it to do so directly.
You may be asking yourself at this point why this matters, and that’s a decent question. We will eventually be reading in data from spreadsheets and other similar tabular data, and types become very important at that point, because we’ll have to know how R and python both handle type conversions.
Do a bit of experimentation - what happens when you try to add a string and a number? Which types are automatically converted to other types? Fill in the following table in your notes:
Adding a ___ and a ___ produces a ___:
Logical | Integer | Decimal | String | |
---|---|---|---|---|
Logical | ||||
Integer | ||||
Decimal | ||||
String |
Above, we looked at automatic type conversions, but in many cases, we also may want to convert variables manually, specifying exactly what type we’d like them to be. A common application for this in data analysis is when there are “*” or “.” or other indicators in an otherwise numeric column of a spreadsheet that indicate missing data: when this data is read in, the whole column is usually read in as character data. So we need to know how to tell R and python that we want our string to be treated as a number, or vice-versa.
In R, we can explicitly convert a variable’s type using as.XXX()
functions, where XXX is the type you want to convert to (as.numeric
, as.integer
, as.logical
, as.character
, etc.).
<- 3
x <- "3.14159"
y
+ y x
Error in x + y: non-numeric argument to binary operator
+ as.numeric(y) x
[1] 6.14159
In python, the same basic idea holds true, but in python, we just use the variable type as a function: int()
, float()
, str()
, and bool()
.
= 3
x = "3.14159"
y
+ y x
Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: unsupported operand type(s) for +: 'int' and 'str'
Detailed traceback:
File "<string>", line 1, in <module>
+ float(y) x
6.14159
In addition to variables, functions are extremely important in programming.
Let’s first start with a special class of functions called operators. You’re probably familiar with operators as in arithmetic expressions: +, -, /, *, and so on.
Here are a few of the most important ones:
Operation | R symbol | Python symbol |
---|---|---|
Addition | + |
+ |
Subtraction | - |
- |
Multiplication | * |
* |
Division | / |
/ |
Integer Division | %/% |
// |
Modular Division | %% |
% |
Exponentiation | ^ |
** |
Note that integer division is the whole number answer to A/B, and modular division is the fractional remainder when A/B.
So 14 %/% 3
in R would be 4, and 14 %% 3
in R would be 2.
14 %/% 3
[1] 4
14 %% 3
[1] 2
14 // 3
4
14 % 3
2
Note that these operands are all intended for scalar operations (operations on a single number) - vectorized versions, such as matrix multiplication, are somewhat more complicated (and different between R and python).
Both R and Python operate under the same mathematical rules of precedence that you learned in school. You may have learned the acronym PEMDAS, which stands for Parentheses, Exponents, Multiplication/Division, and Addition/Subtraction. That is, when examining a set of mathematical operations, we evaluate parentheses first, then exponents, and then we do multiplication/division, and finally, we add and subtract.
1+1)^(5-2) # 2 ^ 3 = 8 (
[1] 8
1 + 2^3 * 4 # 1 + (8 * 4)
[1] 33
3*1^3 # 3 * 1
[1] 3
1+1)**(5-2) (
8
1 + 2**3*4
33
3*1**3
3
Python has some additional operators that work on strings. In R, you will have to use functions to perform these operations, as R does not have string operators.
In Python, +
will concatenate (stick together) two strings, and multiplying a string by an integer will repeat the string the specified number of times
"first " + "second"
'first second'
"hello " * 3
'hello hello hello '
In R, to concatenate things, we need to use functions: paste
or paste0
:
paste("first", "second", sep = " ")
[1] "first second"
paste("first", "second", collapse = " ")
[1] "first second"
paste(c("first", "second"), sep = " ") # sep only works on separate parameters
[1] "first" "second"
paste(c("first", "second"), collapse = " ") # collapse works on vectors
[1] "first second"
paste(c("a", "b", "c", "d"),
c("first", "second", "third", "fourth"),
sep = "-", collapse = " ")
[1] "a-first b-second c-third d-fourth"
# sep is used to collapse parameters, then collapse is used to collapse vectors
paste0(c("a", "b", "c"))
[1] "a" "b" "c"
paste0("a", "b", "c") # equivalent to paste(..., sep = "")
[1] "abc"
You don’t need to understand the details of this at this point in the class, but it is useful to know how to combine strings in both languages.
Functions are sets of instructions that take arguments and return values. Strictly speaking, operators (like those above) are a special type of functions – but we aren’t going to get into that now.
We’re also not going to talk about how to create our own functions just yet. Instead, I’m going to show you how to use functions.
It may be helpful at this point to print out the R reference card3 and the Python reference card4. These cheat sheets contain useful functions for a variety of tasks in each language.
Methods are a special type of function that operate on a specific variable type. In Python, methods are applied using the syntax variable.method_name()
. So, you can get the length of a string variable my_string
using my_string.length()
.
R has methods too, but they are invoked differently. In R, you would get the length of a string variable using length(my_string)
.
Right now, it is not really necessary to know too much more about functions than this: you can invoke a function by passing in arguments, and the function will do a task and return the value.
Try out some of the functions mentioned on the R and Python cheatsheets.
Can you figure out how to define a list or vector of numbers? If so, can you use a function to calculate the maximum value?
Can you find the R functions that will allow you to repeat a string variable multiple times or concatenate two strings? Can you do this task in Python?
This means that doubles take up more memory but can store more decimal places. You don’t need to worry about this much in R, and only a little in Python, but in older and more precise languages such as C/C++/Java, the difference between floats and doubles can be important.↩︎
In some ways, this is like the difference between an automatic and a manual transmission - you have fewer things to worry about, but you also don’t know what’s going on under the hood nearly as well↩︎
From https://cran.r-project.org/doc/contrib/Short-refcard.pdf↩︎
From http://sixthresearcher.com/wp-content/uploads/2016/12/Python3_reference_cheat_sheet.pdf↩︎