Getting to know your data 1
Introduction
In this worksheet, we will discuss how to perform basic inspection of a dataset and simple data-cleaning tasks.
First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.
Next we set up the data.
We will be working with the dataset NCbirths, which contains data about 1450 births in the state of North Carolina in 2001.
NCbirths
Basic inspection of the data
When first working with a new dataset, you should always start by looking at the data. The simplest way to do this is to enter the name of the dataset at the R command line and run it, which prints the data. You can also use head(...) to see only the first six rows or glimpse(...) to get a list of all columns with their types and first few values.
Try this yourself. Write code that displays the entire NCbirths dataset, the first six rows, or a list of all columns.
NCbirths
head(NCbirths)
glimpse(NCbirths)
NCbirths
It is often useful to get a list of all names of the columns in a data frame. You can obtain this with names(...). Try this yourself.
names(NCbirths)
To inspect individual columns, you can extract them either with pull(), like so: data |> pull(column), or with the $ operator, like so: data$column. The second option is shorter, but the first option integrates better into longer analysis pipelines. Try both options on the NCbirths dataset, for example for the Smoke column.
# option using pull()
NCbirths |>
  pull(Smoke)
# option using $ operator
NCbirths$Smoke
Finally, to see all distinct values in a column, you can apply the function unique() to it. Try this with the Smoke column.
NCbirths |>
  pull(Smoke) |>
  ___
NCbirths |>
  pull(Smoke) |>
  unique()
Recoding of data values
We frequently want to recode data values so that they are more human-readable. For example, we might want to write smoker/non-smoker instead of 1/0. We can do this with if_else(), which takes three arguments: a logical condition, the data value to use if the condition is true, and the data value to use if the condition is false. Try this out on the Smoke column, creating a new column Smoke_recoded that is human-readable.
NCbirths |>
  mutate(
    Smoke_recoded = if_else(___, ___, ___)
  ) |>
  select(Smoke, Smoke_recoded) |>
  unique()
NCbirths |>
  mutate(
    Smoke_recoded = if_else(Smoke == 0, "non-smoker", "smoker")
  ) |>
  select(Smoke, Smoke_recoded) |>
  unique()
When you want to recode a variable with more than two categories, you could nest if_else() calls, but usually it is simpler to use case_when(). With case_when(), you provide a list of conditions and corresponding data values as formulas of the form condition ~ data value. For example, the recoding exercise for the Smoke column could be written with case_when() as follows:
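A minimal sketch of that recoding, run on a small stand-in data frame so that it is self-contained. The data frame toy is hypothetical and only mimics the Smoke column of NCbirths, assuming it is coded 0/1 as in the previous exercise.

```r
library(dplyr)

# Hypothetical stand-in for NCbirths with a 0/1 Smoke column
toy <- data.frame(Smoke = c(0, 1, 0, 1))

toy_recoded <- toy |>
  mutate(
    Smoke_recoded = case_when(
      Smoke == 0 ~ "non-smoker",
      Smoke == 1 ~ "smoker",
      TRUE ~ NA_character_  # anything else becomes a missing value
    )
  )

toy_recoded
```

To run this on the real data, replace toy with NCbirths.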
When using case_when(), it is usually a good idea to provide an explicit fallback that is used when none of the earlier conditions match. The logical conditions are evaluated in order, so you want to list the most specific conditions first and the least specific conditions last. The fallback condition is simply TRUE. It always applies when no earlier condition matched.
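The effect of the ordering is easiest to see on a small example. The sketch below uses a made-up vector age (hypothetical, not taken from NCbirths) and compares a sensible ordering with a careless one. Note that in the careless version the TRUE fallback also swallows the missing value.

```r
library(dplyr)

# Hypothetical ages, including one missing value
age <- c(17, 24, 36, NA)

# Most specific conditions first, fallback last
age_group <- case_when(
  is.na(age) ~ "unknown",
  age < 20 ~ "under 20",
  age < 35 ~ "20 to 34",  # only reached when age >= 20
  TRUE ~ "35 or older"    # fallback
)

# Careless ordering: age < 35 matches first, so age < 20 is never used,
# and the NA falls through to the fallback
wrong_order <- case_when(
  age < 35 ~ "20 to 34",
  age < 20 ~ "under 20",
  TRUE ~ "35 or older"
)

age_group
wrong_order
```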
Now use case_when() to recode the Plural column into singlet/twins/triplets.
NCbirths |>
  mutate(
    Plural_recoded = case_when(
      Plural == 1 ~ "singlet",
      ___,
      ___,
      ___
    )
  ) |>
  select(Plural, Plural_recoded) |>
  unique()
NCbirths |>
  mutate(
    Plural_recoded = case_when(
      Plural == 1 ~ "singlet",
      Plural == 2 ~ "twins",
      Plural == 3 ~ "triplets",
      TRUE ~ NA
    )
  ) |>
  select(Plural, Plural_recoded) |>
  unique()
Summaries of data columns
When exploring a new dataset, it is usually a good idea to look at summaries of the data values in each column. This gives a quick sense of the range of data values and shows whether there are any unexpected outliers. There are two useful functions for this purpose: summary() for numerical data and table() for categorical data.
First try this for numerical data. Perform summaries for the data columns MomAge, Weeks, and BirthWeightGm.
summary(NCbirths$MomAge)
___
___
summary(NCbirths$MomAge)
summary(NCbirths$Weeks)
summary(NCbirths$BirthWeightGm)
Now try this for categorical data. Perform summaries for the data columns Plural, Smoke, and RaceMom.
table(NCbirths$Plural)
___
___
table(NCbirths$Plural)
table(NCbirths$Smoke)
table(NCbirths$RaceMom)
Do you understand what the output means? If not, look it up in the R documentation for the table() function.
One quirk of the table() function is that by default it omits any NA values. However, it is important to know whether a data column contains any NA values. We can get table() to tabulate NAs as well by providing it with the argument useNA = "ifany". Repeat the previous exercise with this modification and see which of the three columns Plural, Smoke, or RaceMom contains any NAs.
table(NCbirths$Plural, useNA = "ifany")
___
___
table(NCbirths$Plural, useNA = "ifany")
table(NCbirths$Smoke, useNA = "ifany")
table(NCbirths$RaceMom, useNA = "ifany")
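The difference between the two calls is easy to verify on a small example. The sketch below uses a made-up character vector x (hypothetical, not taken from NCbirths) with one missing value and compares the default behavior of table() with useNA = "ifany".

```r
# Hypothetical categorical values with one missing entry
x <- c("a", "b", NA, "a")

# Default: the NA is silently dropped, so the counts sum to 3
counts_default <- table(x)

# With useNA = "ifany", the NA gets its own category and the counts sum to 4
counts_with_na <- table(x, useNA = "ifany")

counts_default
counts_with_na
```

Comparing the total of the counts with the length of the vector is a quick way to spot dropped NAs.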