4  Basic Data Types

4.1 Values and Types

Let’s start this section with some basic vocabulary.

  • a value is a basic unit of stuff that a program works with, like 1, 2, "Hello, World", and so on.
  • values have types - 2 is an integer, "Hello, World" is a string (it contains a “string” of letters). Strings are in quotation marks to let us know that they are not variable names.

In both R and python, there are some very basic data types:

  • logical or boolean - FALSE/TRUE or 0/1 values. Sometimes, boolean is shortened to bool

  • integer - whole numbers (positive or negative)

  • double or float or numeric - decimal numbers.

    • float is short for floating-point value.
    • double is a floating-point value with more precision (“double precision”).
    • R uses the name numeric to indicate a decimal value, regardless of precision.
  • character or string - holds text, usually enclosed in quotes.

If you don’t know what type a value is, both R and python have functions to help you with that.

class(FALSE)
class(2L) # by default, R treats all numbers as numeric/decimal values. 
          # The L indicates that we're talking about an integer. 
class(2)
class("Hello, programmer!")
[1] "logical"
[1] "integer"
[1] "numeric"
[1] "character"

type(False)
type(2)
type(3.1415)
type("This is python code")
<class 'bool'>
<class 'int'>
<class 'float'>
<class 'str'>

In R, boolean values are TRUE and FALSE, but in Python they are True and False. Capitalization matters a LOT.

Other things matter too: if we try to write a million, we would write it 1000000 instead of 1,000,000 (in both languages). In code, commas separate one value from the next; they are not used to group the digits within a single number. This is a hard thing to get used to but very important – especially when we start reading in data.
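To see why the comma matters, here is a small Python sketch (the variable names are illustrative and not part of the examples above). Python reads 1,000,000 as three separate values rather than as one number. If you want visual grouping of digits, Python accepts underscores inside a number.

```python
# A comma separates values, so this creates a group of three numbers
# (a "tuple"), not the number one million.
not_a_million = 1,000,000
print(not_a_million)        # (1, 0, 0)
print(type(not_a_million))  # <class 'tuple'>

# Write the number without commas...
a_million = 1000000

# ...or use underscores to group digits for readability.
also_a_million = 1_000_000
print(a_million == also_a_million)  # True

# Text read from a file may contain commas; remove them before converting.
from_text = int("1,000,000".replace(",", ""))
print(from_text)  # 1000000
```

The last line previews a situation that comes up when reading in data: the commas have to be removed from the text before it can be converted to a number.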

4.2 Variables

Programming languages use variables - names that refer to values. Think of a variable as a container that holds something - instead of referring to the value, you can refer to the container and you will get whatever is stored inside.

We assign variables values using the syntax object_name <- value (R) or object_name = value (python). You can read this as “object name gets value” in your head.

message <- "So long and thanks for all the fish"
year <- 2025
the_answer <- 42L
earth_demolished <- FALSE

message = "So long and thanks for all the fish"
year = 2025
the_answer = 42
earth_demolished = False

Note that in R, we assign variables values using the <- operator, whereas in Python, we use the = operator. Technically, = will work for assignment in both languages, but by convention <- is more common than = in R.

We can then use the variables - do numerical computations, evaluate whether a proposition is true or false, and even manipulate the content of strings, all by referencing the variable by name.
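As a rough illustration of those three uses, the Python sketch below reuses the variables defined above. The new names (next_year, is_answer_42, shout) are made up for this example.

```python
message = "So long and thanks for all the fish"
year = 2025
the_answer = 42
earth_demolished = False

# Numerical computation: use the value stored in year
next_year = year + 1
print(next_year)  # 2026

# Evaluate whether a proposition is true or false
is_answer_42 = the_answer == 42
print(is_answer_42)          # True
print(not earth_demolished)  # True

# Manipulate the content of a string
shout = message.upper()
print(shout)  # SO LONG AND THANKS FOR ALL THE FISH
```

Note that none of these operations change the original variables: year is still 2025 and message is still in mixed case unless we assign the result back to the same name.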

4.2.1 Valid Names

There are only two hard things in Computer Science: cache invalidation and naming things.
– Phil Karlton

In R, object names must start with a letter (or with a . that is not followed by a number) and can only contain letters, numbers, _, and .. In Python, object names must start with a letter or _ and can consist of letters, numbers, and _ (that is, . is not a valid character in a Python variable name). While it is technically fine to use uppercase variable names in Python, it’s recommended that you use lowercase names for variables (you’ll see why later).

What happens if we try to create a variable name that isn’t valid?

In both languages, starting a variable name with a number will get you an error message that lets you know that something isn’t right - “unexpected symbol” in R and “invalid syntax” in python.

1st_thing <- "check your variable names!"
Error: <text>:1:2: unexpected symbol
1: 1st_thing
     ^

1st_thing = "check your variable names!"

Note: Run the above chunk in your python window - the book won’t compile if I set it to evaluate 😿. It generates an error of SyntaxError: invalid syntax (<string>, line 1); newer versions of Python may word the message differently (for example, invalid decimal literal).

second.thing = "this isn't valid"
Error in py_call_impl(callable, dots$args, dots$keywords): NameError: name 'second' is not defined

Detailed traceback:
  File "<string>", line 1, in <module>

In python, trying to have a . in a variable name gets a more interesting error: “name 'second' is not defined”. This is because in python, a . is used to access the components and methods of an object, so python reads second.thing as “the thing component of an object named second” and then fails because no object named second exists. We’ll get into this more later, but there is a good reason for python’s restriction about not using . in variable names.
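If you would rather check a name before using it, Python can tell you whether a piece of text is a usable variable name. The sketch below is one possible way to do this with the built-in str.isidentifier() method and the standard keyword module; the list of candidate names is made up for illustration.

```python
import keyword

candidates = ["first_thing", "1st_thing", "second.thing", "_hidden", "class"]

results = {}
for name in candidates:
    # isidentifier() checks the spelling rules (allowed characters, no
    # leading digit); iskeyword() catches reserved words such as "class",
    # which are spelled like names but cannot be used as variables.
    results[name] = name.isidentifier() and not keyword.iskeyword(name)

for name, usable in results.items():
    print(name, usable)
# first_thing True
# 1st_thing False
# second.thing False
# _hidden True
# class False
```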

Naming things is difficult! When you name variables, try to make the names descriptive - what does the variable hold? What are you going to do with it? The more (concise) information you can pack into your variable names, the more readable your code will be.

Why is naming things hard? - Blog post by Neil Kakkar

There are a few different conventions for naming things that may be useful:

  • some_people_use_snake_case, where words are separated by underscores
  • somePeopleUseCamelCase, where words are appended but anything after the first word is capitalized (leading to words with humps like a camel).
  • some.people.use.periods (in R only; this doesn’t work in python)
  • A few people mix conventions with variables_thatLookLike.this and they are almost universally hated 👿

As long as you pick ONE naming convention and don’t mix-and-match, you’ll be fine. It will be easier to remember what you named your variables (or at least guess) and you’ll have fewer moments where you have to go scrolling through your script file looking for a variable you named.

4.3 Type Conversions

We talked about values and types above, but skipped over a few details because we didn’t know enough about variables. It’s now time to come back to those details.

What happens when we have an integer and a numeric type and we add them together? Hopefully, you don’t have to think too hard about what the result of 2 + 3.5 is, but this is a bit more complicated for a computer for two reasons: storage, and arithmetic.

In days of yore, programmers had to deal with memory allocation - when declaring a variable, the programmer had to explicitly define what type the variable was. This tended to look something like the code chunk below:

int a = 1;          // an integer variable
double b = 3.14159; // a double-precision decimal variable

Typically, an integer would take up 32 bits of memory, and a double would take up 64 bits, so doubles used 2x the memory that integers did. Both R and python are dynamically typed, which means you don’t have to deal with any of the trouble of declaring what your variables will hold - the computer automatically figures out how much memory to use when you run the code. So we can avoid the discussion of memory allocation and types because we’re using higher-level languages that handle that stuff for us.
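A minimal Python sketch of what dynamic typing looks like in practice: the same name can refer to values of different types at different times, and we never declare a type ourselves. The names a and type_history exist only for this example.

```python
type_history = []

a = 1                      # no type declaration needed
type_history.append(type(a).__name__)

a = 3.14159                # the same name can now hold a decimal number...
type_history.append(type(a).__name__)

a = "pi"                   # ...or a string
type_history.append(type(a).__name__)

print(type_history)  # ['int', 'float', 'str']
```

This flexibility is convenient, but it also means a variable may not hold the type you expect, which is one reason functions like type() are useful.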

But the discussion of types isn’t something we can completely avoid, because we still have to figure out what to do when we do operations on things of two different types - even if memory isn’t a concern, we still have to figure out the arithmetic question.

So let’s see what happens with a couple of examples, just to get a feel for type conversion (aka type casting or type coercion), which is the process of changing an expression from one data type to another.

mode(2L + 3.14159) # add 2 and pi
[1] "numeric"
mode(2L + TRUE) # add integer 2 and TRUE
[1] "numeric"
mode(TRUE + FALSE) # add TRUE and FALSE
[1] "numeric"

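In R, all of the examples above have the mode ‘numeric’ - basically, a catch-all category for things that are in some way, shape, or form numbers. Integers and decimal numbers are both numeric, but so are the results of adding logicals (because they can be represented as 0 or 1). The class() function is more specific: it reports "integer" for 2L + TRUE and for TRUE + FALSE.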

type(2 + 3.14159)
<class 'float'>
type(2 + True)
<class 'int'>
type(True + False)
<class 'int'>

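In python, by contrast, the result is only as general as it needs to be: booleans are treated as the integers 0 and 1 in arithmetic, so adding them produces an integer, and an integer is converted to a float only when it is combined with a float. Neither conversion discards information. Going the other way would, so Python will never convert a float into an integer implicitly - if you want Python to do that, you’ll have to tell it to do so directly.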
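Here is a short sketch of what “telling Python directly” can look like, using the built-in int() and round() functions. The variable names are illustrative.

```python
x = 3.99

truncated = int(x)   # int() drops the decimal part; it does not round
rounded = round(x)   # round() goes to the nearest whole number instead
print(truncated)  # 3
print(rounded)    # 4

# Truncation moves toward zero, so negative numbers get "larger"
negative_truncated = int(-3.99)
print(negative_truncated)  # -3

# Converting an integer to a float loses nothing
as_float = float(2)
print(as_float)  # 2.0
```

Because int() silently discards the decimal part, it is worth deciding whether you want truncation or rounding before converting.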

You may be asking yourself at this point why this matters, and that’s a decent question. We will eventually be reading in data from spreadsheets and other similar tabular data, and types become very important at that point, because we’ll have to know how R and python both handle type conversions.

In class activity

Do a bit of experimentation - what happens when you try to add a string and a number? Which types are automatically converted to other types? Fill in the following table in your notes:

Adding a ___ and a ___ produces a ___:

           Logical   Integer   Decimal   String
Logical    ___       ___       ___       ___
Integer    ___       ___       ___       ___
Decimal    ___       ___       ___       ___
String     ___       ___       ___       ___

Above, we looked at automatic type conversions, but in many cases, we also may want to convert variables manually, specifying exactly what type we’d like them to be. A common application for this in data analysis is when there are “*” or “.” or other indicators in an otherwise numeric column of a spreadsheet that indicate missing data: when this data is read in, the whole column is usually read in as character data. So we need to know how to tell R and python that we want our string to be treated as a number, or vice-versa.

In R, we can explicitly convert a variable’s type using as.XXX() functions, where XXX is the type you want to convert to (as.numeric, as.integer, as.logical, as.character, etc.).

x <- 3
y <- "3.14159"

x + y
Error in x + y: non-numeric argument to binary operator
x + as.numeric(y)
[1] 6.14159

In python, the same basic idea holds, but we use the name of the type as a function: int(), float(), str(), and bool().

x = 3
y = "3.14159"

x + y
Error in py_call_impl(callable, dots$args, dots$keywords): TypeError: unsupported operand type(s) for +: 'int' and 'str'

Detailed traceback:
  File "<string>", line 1, in <module>
x + float(y)
6.14159
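Returning to the spreadsheet scenario described above, where a numeric column contains “*” or “.” to mark missing data, the following is one possible sketch of a manual conversion in Python. The column contents and the helper function to_float_or_none are hypothetical, and using None to stand for a missing value is an assumption made for this example.

```python
# A column read in as text because of the missing-data markers
raw_column = ["3.5", "*", "2", ".", "7.25"]

def to_float_or_none(text):
    """Return text as a float, or None if it cannot be read as a number."""
    try:
        return float(text)
    except ValueError:
        # float() raises ValueError for text such as "*" or "."
        return None

cleaned = [to_float_or_none(value) for value in raw_column]
print(cleaned)  # [3.5, None, 2.0, None, 7.25]
```

The key idea is that float() fails loudly on text it cannot interpret, so we have to decide explicitly what should happen to those entries.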

4.4 Operators and Functions

In addition to variables, functions are extremely important in programming.

Let’s first start with a special class of functions called operators. You’re probably familiar with operators as in arithmetic expressions: +, -, /, *, and so on.

Here are a few of the most important ones:

Operation          R symbol   Python symbol
Addition           +          +
Subtraction        -          -
Multiplication     *          *
Division           /          /
Integer Division   %/%        //
Modular Division   %%         %
Exponentiation     ^          **

Note that integer division is the whole number part of the answer to A/B, and modular division is the remainder left over after dividing A by B.

So 14 %/% 3 in R would be 4, and 14 %% 3 in R would be 2.

14 %/% 3
[1] 4
14 %% 3
[1] 2

14 // 3
4
14 % 3
2
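Integer division and modular division fit together: multiplying the whole-number answer by the divisor and adding the remainder gives back the original number. The Python sketch below checks this, and also shows the built-in divmod() function, which returns both pieces at once. The variable names are illustrative.

```python
a, b = 14, 3

quotient = a // b    # whole-number part of 14 / 3
remainder = a % b    # what is left over
print(quotient, remainder)  # 4 2

# The two results reconstruct the original number
print(b * quotient + remainder == a)  # True

# divmod() returns both values together
both = divmod(a, b)
print(both)  # (4, 2)

# With a negative number, Python rounds the quotient down (toward
# negative infinity), so the remainder stays non-negative here.
negative_quotient = -14 // 3
negative_remainder = -14 % 3
print(negative_quotient, negative_remainder)  # -5 1
```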

Note that these operators are all intended for scalar operations (operations on a single number) - vectorized versions, such as matrix multiplication, are somewhat more complicated (and different between R and python).

4.4.1 Order of Operations

Both R and Python operate under the same mathematical rules of precedence that you learned in school. You may have learned the acronym PEMDAS, which stands for Parentheses, Exponents, Multiplication/Division, and Addition/Subtraction. That is, when examining a set of mathematical operations, we evaluate parentheses first, then exponents, and then we do multiplication/division, and finally, we add and subtract.

(1+1)^(5-2) # 2 ^ 3 = 8
[1] 8
1 + 2^3 * 4 # 1 + (8 * 4)
[1] 33
3*1^3 # 3 * 1
[1] 3

(1+1)**(5-2)
8
1 + 2**3*4
33
3*1**3
3
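A few cases tend to surprise people even when they know PEMDAS. The Python sketch below shows how a leading minus sign, stacked exponents, and mixed division and multiplication are evaluated. The variable names exist only for this example.

```python
# The exponent is applied before the minus sign
negated_square = -2 ** 2       # read as -(2 ** 2)
print(negated_square)  # -4

squared_negative = (-2) ** 2   # parentheses change the result
print(squared_negative)  # 4

# Stacked exponents are evaluated from right to left
stacked = 2 ** 3 ** 2          # read as 2 ** (3 ** 2) = 2 ** 9
print(stacked)  # 512

grouped = (2 ** 3) ** 2        # 8 ** 2
print(grouped)  # 64

# Division and multiplication have equal priority: left to right
left_to_right = 8 / 2 * 4      # (8 / 2) * 4
print(left_to_right)  # 16.0
```

When in doubt, add parentheses: they cost nothing and make your intent clear to the next reader.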

4.4.2 String Operations

Python has some additional operators that work on strings. In R, you will have to use functions to perform these operations, as R does not have string operators.

In Python, + will concatenate (stick together) two strings, and multiplying a string by an integer will repeat the string the specified number of times.

"first " + "second"
'first second'
"hello " * 3
'hello hello hello '

In R, to concatenate things, we need to use functions: paste or paste0:

paste("first", "second", sep = " ")
[1] "first second"
paste("first", "second", collapse = " ")
[1] "first second"
paste(c("first", "second"), sep = " ") # sep only works on separate parameters
[1] "first"  "second"
paste(c("first", "second"), collapse = " ") # collapse works on vectors
[1] "first second"
paste(c("a", "b", "c", "d"), 
      c("first", "second", "third", "fourth"), 
      sep = "-", collapse = " ")
[1] "a-first b-second c-third d-fourth"
# sep joins the parameters element by element, then collapse joins the resulting vector into one string

paste0(c("a", "b", "c"))
[1] "a" "b" "c"
paste0("a", "b", "c") # equivalent to paste(..., sep = "")
[1] "abc"

You don’t need to understand the details of this at this point in the class, but it is useful to know how to combine strings in both languages.
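For comparison, here is a sketch of how the paste examples above might be written in Python. It assumes the str.join() method as a stand-in for collapse and zip() with an f-string as a stand-in for sep; these are one reasonable translation, not the only one.

```python
# Like paste(c("first", "second"), collapse = " ")
words = ["first", "second"]
joined = " ".join(words)
print(joined)  # first second

# Like paste(c("a", ...), c("first", ...), sep = "-", collapse = " ")
letters = ["a", "b", "c", "d"]
ordinals = ["first", "second", "third", "fourth"]

# Step 1 (sep): pair up the elements and join each pair with "-"
pairs = [f"{letter}-{ordinal}" for letter, ordinal in zip(letters, ordinals)]
print(pairs)  # ['a-first', 'b-second', 'c-third', 'd-fourth']

# Step 2 (collapse): join the pieces into one string
combined = " ".join(pairs)
print(combined)  # a-first b-second c-third d-fourth

# Like paste0("a", "b", "c")
no_separator = "".join(["a", "b", "c"])
print(no_separator)  # abc
```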

4.4.3 Functions

Functions are sets of instructions that take arguments and return values. Strictly speaking, operators (like those above) are a special type of function – but we aren’t going to get into that now.

We’re also not going to talk about how to create our own functions just yet. Instead, I’m going to show you how to use functions.

It may be helpful at this point to print out the R reference card and the Python reference card. These cheat sheets contain useful functions for a variety of tasks in each language.

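Methods are a special type of function that operate on a specific variable type. In Python, methods are applied using the syntax variable.method_name(). So, you can convert a string variable my_string to upper case using my_string.upper(). Not everything is a method, though: to get the length of a string in Python, you use the function len(my_string).

R has methods too, but they are invoked differently. In R, you would get the number of characters in a string variable using nchar(my_string). (length(my_string) counts the elements in a vector, so it returns 1 for a single string.)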

Right now, it is not really necessary to know too much more about functions than this: you can invoke a function by passing in arguments, and the function will do a task and return the value.
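A brief Python sketch of both calling styles: functions, which take the value as an argument, and methods, which are attached to the value with a dot. The string and variable names are made up for this example.

```python
my_string = "Hello, programmer!"

# Function call: the variable is passed in as an argument
n_chars = len(my_string)
print(n_chars)  # 18

# Method calls: the method follows the variable name and a dot
upper = my_string.upper()
print(upper)  # HELLO, PROGRAMMER!

# Methods can take arguments too
replaced = my_string.replace("programmer", "world")
print(replaced)  # Hello, world!

# Functions can take more than one argument
rounded = round(3.14159, 2)   # round to 2 decimal places
print(rounded)  # 3.14
```

In each case the function or method returns a new value, which we store in a variable so that we can use it later.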

In class activity

Try out some of the functions mentioned on the R and Python cheatsheets.

Can you figure out how to define a list or vector of numbers? If so, can you use a function to calculate the maximum value?

Can you find the R functions that will allow you to repeat a string variable multiple times or concatenate two strings? Can you do this task in Python?

Footnotes

  1. This means that doubles take up more memory but can store more decimal places. You don’t need to worry about this much in R, and only a little in Python, but in older and more precise languages such as C/C++/Java, the difference between floats and doubles can be important.↩︎

  2. In some ways, this is like the difference between an automatic and a manual transmission - you have fewer things to worry about, but you also don’t know what’s going on under the hood nearly as well↩︎

  3. From https://cran.r-project.org/doc/contrib/Short-refcard.pdf↩︎

  4. From http://sixthresearcher.com/wp-content/uploads/2016/12/Python3_reference_cheat_sheet.pdf↩︎