Quantiles and Correlations with dplyr

Dividing a data-frame into quantiles with ntile()

The dplyr function ntile() divides a data-frame into n evenly-sized groups. It takes two arguments: a column name with which to rank the data, and the number of groups that the data should be split into. As an example, we will split the iris data-frame into 5 groups based on Petal.Length:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Divide iris data-frame into 5 evenly sized groups based on Petal.Length
iris %>% mutate(quintile = ntile(Petal.Length, 5)) -> iris_five
head(iris_five)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species quintile
## 1          5.1         3.5          1.4         0.2  setosa        1
## 2          4.9         3.0          1.4         0.2  setosa        1
## 3          4.7         3.2          1.3         0.2  setosa        1
## 4          4.6         3.1          1.5         0.2  setosa        1
## 5          5.0         3.6          1.4         0.2  setosa        1
## 6          5.4         3.9          1.7         0.4  setosa        2

Each row has been assigned a group that is stored in the column quintile. At this point, we have only added a variable (quintile), but we have not yet “grouped” the data-frame. To use other dplyr functions such as summarize() and mutate() the newly-created quintiles, we must explicitly group the data-frame using group_by(). We will now group the data by the quintile column, and then compute the mean Sepal.Length and mean Sepal.Width for each quintile.

iris_five %>% group_by(quintile) %>% # Group by quintile
  summarize(mean_sepal_length = mean(Sepal.Length), mean_sepal_width = mean(Sepal.Width), count = n()) %>%
  head()
## Source: local data frame [5 x 4]
## 
##   quintile mean_sepal_length mean_sepal_width count
##      (int)             (dbl)            (dbl) (int)
## 1        1          4.940000         3.383333    30
## 2        2          5.173333         3.153333    30
## 3        3          5.926667         2.790000    30
## 4        4          6.266667         2.876667    30
## 5        5          6.910000         3.083333    30

After splitting iris into 5 groups, each row now corresponds to a quintile. As you can see from the count column, each group contains an equal number of observations.

Extracting p-values and correlation coefficients from cor.test()

When computing a Pearson correlation coefficient, it is often useful to also retrieve a p-value to determine if the correlation is significant. The base R function cor.test() can provide us with both a correlation coefficient and a p-value. We’ll again use the iris data-frame as an example:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Compute a correlation between Sepal.Length and Sepal.Width
sepal_cor <- cor.test(iris$Sepal.Length, iris$Sepal.Width)
sepal_cor
## 
##  Pearson's product-moment correlation
## 
## data:  iris$Sepal.Length and iris$Sepal.Width
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.27269325  0.04351158
## sample estimates:
##        cor 
## -0.1175698

We are interested in both the correlation coefficient and the p-value. We can extract these values from sepal.cor:

# The correlation coefficient is stored as 'estimate'
sepal_cor$estimate
##        cor 
## -0.1175698
# The p-value is stored as 'p.value'
sepal_cor$p.value
## [1] 0.1518983

We can take a shortcut and extract a correlation coefficient and p-value without storing any information in sepal_cor:

cor.test(iris$Sepal.Length, iris$Sepal.Width)$estimate
##        cor 
## -0.1175698
cor.test(iris$Sepal.Length, iris$Sepal.Width)$p.value
## [1] 0.1518983

Lastly, cor.test will play nicely with dplyr’s summarize function, as long as cor.test only returns a single value. Here’s an example:

# This calculates the correlation coefficient and the degrees of freedom
iris %>% summarize(sepal_cor = cor.test(Sepal.Length, Sepal.Width)$estimate,
                   sepal_df = cor.test(Sepal.Length, Sepal.Width)$parameter)
##    sepal_cor sepal_df
## 1 -0.1175698      148

Note that within the cor.test function, I did not use any $ symbols. Carefully compare the above code block with the code blocks preceding it. If you want cor.test (or any function that you use with summarize) to respect dplyr groupings, you must use only the column names by themselves.