Getting to know your data 1

Author

Claus O. Wilke

Introduction

In this worksheet, we will discuss how to perform basic inspection of a dataset and simple data-cleaning tasks.

First we need to load the required R packages. Please wait a moment until the live R session is fully set up and all packages are loaded.

Next we set up the data.

We will be working with the dataset NCbirths, which contains data about 1450 births in the state of North Carolina in 2001.

NCbirths

Basic inspection of the data

When you first work with a new dataset, always start by looking at the data. The simplest way to do this is to enter the name of the dataset at the R command line and run it, which prints the data. You can also use head(...) to see only the first six rows, or glimpse(...) to get a list of all columns with their types and first few values.

Try this yourself. Write code that displays the entire NCbirths dataset, the first six rows, or a list of all columns.

Hint
NCbirths
Solution
head(NCbirths)
glimpse(NCbirths)
NCbirths
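
As a small illustration of these inspection functions, the sketch below uses a made-up stand-in data frame, births_toy (hypothetical values, not taken from NCbirths), so that it runs without the worksheet's data. It assumes the dplyr package is installed, which provides glimpse().

```r
library(dplyr)

# Hypothetical stand-in for NCbirths; all values are made up for illustration.
births_toy <- data.frame(
  Plural = c(1, 1, 2, 1, 3, 1, 1, 2),
  Smoke  = c(0, 1, 0, NA, 0, 1, 0, 0),
  MomAge = c(24, 31, 28, 19, 35, 22, 27, 30)
)

# head() returns the first six rows by default; n changes that number.
first_six   <- head(births_toy)
first_three <- head(births_toy, n = 3)

# glimpse() prints one line per column: name, type, and first few values.
glimpse(births_toy)
```

The n argument of head() is handy when six rows are more (or fewer) than you want to see.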

It is often useful to get a list of all names of the columns in a data frame. You can obtain this with names(...). Try this yourself.

Solution
names(NCbirths)

To inspect an individual column, you can extract it either with pull(), as in data |> pull(column), or with the $ operator, as in data$column. The second option is shorter, but the first integrates better into longer analysis pipelines. Try both options on the NCbirths dataset, for example with the Smoke column.

Solution
# option using pull()
NCbirths |>
  pull(Smoke)

# option using $ operator
NCbirths$Smoke
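
To make the comparison concrete, here is a minimal sketch on a made-up data frame (births_toy and its values are hypothetical, and the filter() step is only an assumed example of a longer pipeline). It checks that both forms return the same vector and shows pull() at the end of a chain.

```r
library(dplyr)

# Hypothetical stand-in for NCbirths; values are made up.
births_toy <- data.frame(
  Smoke  = c(0, 1, 0, NA, 1),
  MomAge = c(24, 31, 28, 19, 35)
)

# Both forms extract the column as a plain vector.
via_pull   <- births_toy |> pull(Smoke)
via_dollar <- births_toy$Smoke
identical(via_pull, via_dollar)

# pull() slots into a longer pipeline without breaking the chain.
older_smoke <- births_toy |>
  filter(MomAge >= 28) |>
  pull(Smoke)
older_smoke
```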

Finally, to see all distinct values in a column, you can apply the function unique() to it. Try this with the Smoke column.

Hint
NCbirths |>
  pull(Smoke) |>
  ___
Solution
NCbirths |>
  pull(Smoke) |>
  unique()

Recoding of data values

We frequently want to recode data values so that they are more human-readable. For example, we might want to write smoker/non-smoker instead of 1/0. We can do this with if_else(), which takes three arguments: a logical condition, the data value if the logical condition is true, and the data value if the logical condition is false. Try this out on the Smoke column, creating a new, human-readable column Smoke_recoded.

Hint
NCbirths |>
  mutate(
    Smoke_recoded = if_else(___, ___, ___)
  ) |>
  select(Smoke, Smoke_recoded) |>
  unique()
Solution
NCbirths |>
  mutate(
    Smoke_recoded = if_else(Smoke == 0, "non-smoker", "smoker")
  ) |>
  select(Smoke, Smoke_recoded) |>
  unique()
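
One detail worth knowing: when the condition evaluates to NA, if_else() returns NA rather than either label, unless a value is supplied through its missing argument. A minimal sketch on a made-up data frame (births_toy, its values, and the "unknown" label are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for NCbirths; values are made up.
births_toy <- data.frame(Smoke = c(0, 1, NA, 0))

recoded <- births_toy |>
  mutate(
    # NA in Smoke gives an NA condition, so the result is NA as well
    Smoke_recoded = if_else(Smoke == 0, "non-smoker", "smoker"),
    # the missing argument supplies a label for NA conditions instead
    Smoke_labelled = if_else(Smoke == 0, "non-smoker", "smoker", missing = "unknown")
  )
recoded
```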

When you want to recode a variable with more than two categories, you could nest if_else() commands, but usually it is simpler to use case_when(). With case_when(), you provide a list of conditions and corresponding data values as formulas of the form condition ~ data value. For example, the recoding exercise for the Smoke column could be written with case_when() as follows:
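
A minimal sketch of such a call, run here on a small made-up stand-in for NCbirths so that it is self-contained (births_toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical stand-in for NCbirths; values are made up.
births_toy <- data.frame(Smoke = c(0, 1, 1, 0, NA))

smoke_levels <- births_toy |>
  mutate(
    Smoke_recoded = case_when(
      Smoke == 0 ~ "non-smoker",
      Smoke == 1 ~ "smoker",
      TRUE ~ NA_character_  # fallback for anything else, including NA
    )
  ) |>
  select(Smoke, Smoke_recoded) |>
  unique()
smoke_levels
```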

When using case_when(), it is usually a good idea to provide an explicit fallback that is used when none of the earlier conditions match. The logical conditions are evaluated in order and the first one that matches determines the result, so list the most specific conditions first and the least specific ones last. The fallback condition is simply TRUE; it always applies when no previous condition has matched.
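
The effect of condition order can be seen in a small sketch. The vector weeks_toy, the cut-offs 30 and 37, and the labels are all made up for illustration and are not part of the worksheet's data.

```r
library(dplyr)

# Made-up gestation lengths in weeks, including one missing value.
weeks_toy <- c(25, 33, 36, 39, NA)

# Most specific condition first: each value gets the intended label.
right_order <- case_when(
  weeks_toy < 30 ~ "under 30",
  weeks_toy < 37 ~ "30 to 36",    # only reached when weeks_toy >= 30
  weeks_toy >= 37 ~ "37 or more",
  TRUE ~ "unknown"                # fallback: nothing above matched (here, NA)
)

# Less specific condition first: it also catches 25, so "under 30" never applies.
wrong_order <- case_when(
  weeks_toy < 37 ~ "30 to 36",
  weeks_toy < 30 ~ "under 30",
  weeks_toy >= 37 ~ "37 or more",
  TRUE ~ "unknown"
)

right_order
wrong_order
```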

Now use case_when() to recode the Plural column into singlet/twins/triplets.

Hint
NCbirths |>
  mutate(
    Plural_recoded = case_when(
      Plural == 1 ~ "singlet",
      ___,
      ___,
      ___
    )
  ) |>
  select(Plural, Plural_recoded) |>
  unique()
Solution
NCbirths |>
  mutate(
    Plural_recoded = case_when(
      Plural == 1 ~ "singlet",
      Plural == 2 ~ "twins",
      Plural == 3 ~ "triplets",
      TRUE ~ NA
    )
  ) |>
  select(Plural, Plural_recoded) |>
  unique()

Summaries of data columns

When exploring a new dataset, it is usually a good idea to look at summaries of the data values in each column, to get a quick sense of the range of data values, to see whether there are any unexpected outliers, and so on. There are two useful functions for this purpose: summary() for numerical data and table() for categorical data.

First try this for numerical data. Perform summaries for the data columns MomAge, Weeks, and BirthWeightGm.

Hint
summary(NCbirths$MomAge)
___
___
Solution
summary(NCbirths$MomAge)
summary(NCbirths$Weeks)
summary(NCbirths$BirthWeightGm)
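
When a numerical column contains missing values, summary() reports their count as an extra NA's entry. A minimal sketch on a made-up vector (mom_age_toy is hypothetical, not taken from NCbirths):

```r
# Made-up ages with one missing value.
mom_age_toy <- c(24, 31, 28, NA, 35)

age_summary <- summary(mom_age_toy)
age_summary

# The summary is a named vector, so single entries can be looked up by name.
age_summary[["Min."]]
age_summary[["Max."]]
age_summary[["NA's"]]
```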

Now try this for categorical data. Perform summaries for the data columns Plural, Smoke, and RaceMom.

Hint
table(NCbirths$Plural)
___
___
Solution
table(NCbirths$Plural)
table(NCbirths$Smoke)
table(NCbirths$RaceMom)

Do you understand what the output means? If not, look it up in the R documentation for the table() function.
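
As a reading aid, the sketch below tabulates a small made-up vector (plural_toy is hypothetical). The names of the result are the distinct values and the entries are how often each value occurs.

```r
# Made-up values standing in for a categorical column.
plural_toy <- c(1, 1, 2, 1, 3, 1, 2)

plural_counts <- table(plural_toy)
plural_counts

names(plural_counts)   # the distinct values, as character strings
plural_counts[["1"]]   # number of observations equal to 1
sum(plural_counts)     # total number of tabulated observations
```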

One quirk of the table() function is that by default it omits NA values. However, it is important to know whether a data column contains any NA values. We can get table() to tabulate NAs as well by providing the argument useNA = "ifany". Repeat the previous exercise with this modification and see which of the three columns Plural, Smoke, and RaceMom contain NAs.

Hint
table(NCbirths$Plural, useNA = "ifany")
___
___
Solution
table(NCbirths$Plural, useNA = "ifany")
table(NCbirths$Smoke, useNA = "ifany")
table(NCbirths$RaceMom, useNA = "ifany")
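
The difference between the two calls is easiest to see side by side. A minimal sketch on a made-up vector (smoke_toy is hypothetical, not taken from NCbirths):

```r
# Made-up values with two missing entries.
smoke_toy <- c(0, 1, 0, NA, 0, NA)

default_counts <- table(smoke_toy)                   # NA values are dropped silently
with_na_counts <- table(smoke_toy, useNA = "ifany")  # adds an <NA> entry when needed

default_counts
with_na_counts

# A direct count of missing values gives the same number as the <NA> entry.
sum(is.na(smoke_toy))
```

With the default, the counts add up to fewer observations than the vector holds, which is easy to miss; with useNA = "ifany" the counts always add up to the full length.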