Introduction to R

Author

Claus O. Wilke

Introduction

In this worksheet, we will cover some of the basic concepts of the R programming language. The worksheet is not an exhaustive introduction to the language, but it will cover the most important concepts and in particular the concepts where R differs from other languages you may be familiar with. If you have prior R experience you can skip this worksheet.

R is a language designed for interactive data analysis, and some of its features may seem strange when approached from the perspective of a general purpose programming language. Keep in mind that language features that simplify interactive work may get in the way of writing complex programs and vice versa.

Please wait a moment until the live R session is fully set up and all packages are loaded.

Basic data types

R implements all the standard mathematical operations you would expect, such as addition, subtraction, etc., as well as special functions. This will generally work just like you would expect from other languages.

Try this out. Can you calculate 2 to the power of 5? Or the sin of pi/4? Or the square-root of 2?

One way in which R differs from most programming languages is that it is inherently vectorized. In R, you always work with vectors of numbers rather than with individual values. (A vector is an ordered set of values of the same data type.) Vectors are created with c(...), as in c(1, 2, 3). You can also create vectors of consecutive integers using the colon notation, as in 1:3 or 3:1. The latter places the integers into the reverse order.

Try this out. Make a vector of the integers from 1 to 10. Make a vector of the values 0.25, 0.5, 0.75. Make a vector of the words “orange”, “banana”, “grapefruit”.

Mathematical operations are also vectorized, so you can for example multiply all values in a vector by the same number or calculate multiple square roots at once. You can also do mathematical operations combining two (or more) vectors and the operation will be element-wise. If the numbers of elements don’t match you will get a warning but R will still give you a result. However, it’s generally best to avoid combining vectors with mismatched lengths, as the results can be non-intuitive.

Try this out. Make a vector of all the squares of the numbers from 1 to 5. Also make a vector of the values 0.25, 0.5, 0.75, by using a vectorized mathematical expression.

R also has logical values TRUE and FALSE (written in all capitals). Logical values also are vectorized, and they can be created by vectorized comparisons. This is very important for data analysis tasks.

Try this out. Manually create a vector of logical values. Then create a vector of logical values via comparison.

You can combine vectors of logical values with & (logical AND) and | (logical OR). You can negate logical values with ! (logical NOT).

Try this out. Combine some logical vectors with & and |. Also negate a logical vector.

Missing values

R supports the concept of missing values. Missing values are data values that don’t exist. This is a common issue in real-world data. For example, consider a scenario where people are asked to fill out a questionaire about various aspects of who they are and where they live, and one question asks about where they were born, and some people simply don’t answer that question. The result is a missing value, and we need the ability to express this concept.

In R, missing values are denoted by NA. You can use NA as a value for any vector.

Note that the NA indicating a missing value is not enclosed in quotes, even for vectors of words (i.e, character strings).

In computations, missing values remain missing. You can check for missingness via is.na().

Try this out. Make both a numerical and a character vector with some missing values. Also test for the missing positions in one of the vectors you made.

Variables and functions

Any data values or objects that you are working with in R can be assigned to variables to be reused later. Assignment in R is expressed with the <- operator. (You can also assign with = but this is generally discouraged.) If you subsequently just write the variable name by itself R prints out the value corresponding to that variable.

Try this out. Assign the number 5 to a variable called foo and then print the value of foo. Then calculate the cosine of this number.

In addition to working with variables, we commonly use and interact with functions in R. Functions are a way to store code and reuse it at a later stage. To call a function (which means to execute the code the function stores), you write the name of the variable storing the function followed by parentheses, as in sin() for the sine function. Inside the parentheses we can place values that are called “arguments,” as in sin(0.5). These arguments turn into variables that are used inside the function body.

In R, function arguments are always named, which means you can write the name of the argument when you provide the argument value, as in sin(x = 0.5). This is helpful for functions with many arguments, as without providing the argument names it can be confusing which value gets assigned to which argument. If you don’t name the function arguments, then values are assigned to arguments in order (positional matching), similarly to how most other programming languages work.

In the following example, the function example_fun() takes three arguments, a, b, and c, and each argument has a default value that will be used in case the argument is not provided when the function is called. The function then simply prints the values of the arguments. This allows us to explore how argument matching works in R.

Now try this out yourself. Using the function example_fun() defined above, see what happens when you provide different arguments, positional or named. Do you understand what example_fun(2, a = 1) does?

Packages

Many R features are provided by extension packages. You need to load those packages with library() before you can use them. For example, throughout this class, we make extensive use of the tidyverse package and therefore you will see library(tidyverse) at the beginning of most worksheets and homework templates. One of the most common problem students encounter in assignments is that they want to use a function from a package but have not properly loaded the package.

Note that we don’t normally put the package name in quotes inside the library() statement.

Upon loading, some packages write out all sorts of messages. In particular, the tidyverse package lists a number of “conflicts”. This frequently confuses students as they think something has gone wrong. You can just ignore these conflicts. They are expected and they will not interfere with your work in this class.

Try out loading a package. Load the package ggridges. Then load the package cowplot.

Numerical and logical indexing

R has a variety of ways to extract specific values or subsets of a vector. First, you can extract an individual element by indexing with square brackets. For example, if x is a vector, x[1] is the first element, x[2] is the second element, and so on. Note that in R, the first element of a vector is number 1, not number 0 as it is in most other languages (Python, C, Rust, etc.). You can also extract multiple elements by placing a vector of numeric values inside the square brackets.

Negative indices remove the respective elements.

Try this out. Extract the first element from the names vector, then extract the last two, then extract all but the first.

In addition to numerical indexing, a frequently used indexing approach in R is logical indexing. In logical indexing, you provide inside the square brackets a logical vector that indicates for each element whether you want to keep it (TRUE) or not (FALSE). The benefit of this indexing approach is that you can combine it with logical statements to extract all elements that meet a specific condition.

Try this out, by extracting all the even numbers from the numerical vector 1:10. To test whether a number x is even, you can use x %% 2 == 0.

Data frames

A core concept of R is the data frame, which holds data in tabular form. A data frame is made up of multiple columns that all have the same number of elements. Different columns can be of different types.

There are a variety of ways to create a data frame. We will usually use the tibble() function from the tidyverse package.

Note how we can assign names to the columns via named arguments in the tibble() function.

Try this out. Create a tibble of your own.

Flow control

R has standard flow-control features such as for loops and if/else statements. These are almost never needed in data analysis and therefore I will not cover them here. If you find yourself wanting to use those constructs chances are you are replicating procedural programming patterns you have learned in other languages but that are not the most elegant way to solving a data analysis problem. I would encourage you to think about how to solve your problem using vectorized or functional programming patterns instead. (Functional programming patterns such as map() go beyond this basic tutorial but will be covered later in this class.)

A concept closely related to flow control is the if_else() function from the tidyverse package. With if_else(), you can run a comparison at each position in a vector and then create a new vector whose elements depend on the outcome of each comparison.

For example, we can replace each occurrence of the word “orange” by “citrus” like so:

The first argument to if_else() is the logical condition you want to execute, the second argument is the resulting value if the condition is true, and the third argument is the resulting value if the condition is false.

Try this out. In the following example, replace all numbers greater than 5 with the number 5 in the vector numbers.