
Hadley Wickham’s dplyr is a powerful R package for data analysis. A common data analysis technique, known as split-apply-combine, involves creating statistical summaries by groups within a data frame.
This post will introduce the power of using logical vectors within your dplyr code to create complex data summaries with ease.
Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the age vector below.
age <- c(23, 31, 27, 41, 54, 34, 25)
age
## [1] 23 31 27 41 54 34 25
What if we would like to know the number of people who are 30 or older and what percentage of the total respondents this group represents?
We can answer this question by first using R’s >= operator to find where values stored in the age vector are greater than or equal to the value 30. Anytime we use R’s comparison operators (>, >=, <, <=, ==) on a vector, we will get a logical vector consisting of TRUE/FALSE values indicating where our condition was met.
For example, running the code below produces a sequence of TRUE/FALSE values that test where our respondents are 30 or older in the age vector.
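age >= 30
## [1] FALSE  TRUE FALSE  TRUE  TRUE  TRUE FALSE
To answer our question above, we can use the following properties of logical vectors in R:
The sum() of a logical vector returns the number of TRUE values.
The mean() of a logical vector returns the proportion of TRUE values.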
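Both the count and the proportion come from the way R treats logical values in arithmetic: TRUE behaves as 1 and FALSE as 0. Here is a minimal sketch of that behaviour. The vectors `flags` and `flags_na` are hypothetical and are not part of the survey data.

```r
# Hypothetical logical vector, used only for illustration
flags <- c(TRUE, FALSE, TRUE, TRUE)

# In arithmetic, TRUE is treated as 1 and FALSE as 0
as.integer(flags)   # 1 0 1 1

# Adding up the 1s counts the TRUE values
sum(flags)          # 3

# Averaging the 1s and 0s gives the share of TRUE values
mean(flags)         # 0.75

# Missing values propagate unless they are dropped explicitly
flags_na <- c(TRUE, NA, FALSE)
sum(flags_na)                 # NA
sum(flags_na, na.rm = TRUE)   # 1
```

If your own data contains missing values, passing `na.rm = TRUE` to `sum()` and `mean()` is one way to keep the summaries from turning into NA.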
We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.
sum(age >= 30)
## [1] 4
mean(age >= 30)
## [1] 0.5714286
Let’s go through a simple example where using these two properties can help with performing complex statistical summaries with dplyr.
We will be working with a subset of the mpg dataset, which ships with ggplot2 and is available as soon as the tidyverse is loaded.
This dataset contains the fuel efficiency and other interesting properties of 234 cars.
library(tidyverse)
mpg_df <- mpg %>%
          select(manufacturer, model, drv, hwy)
mpg_df %>% head()
## # A tibble: 6 x 4
##   manufacturer model drv     hwy
##   <chr>        <chr> <chr> <int>
## 1 audi         a4    f        29
## 2 audi         a4    f        29
## 3 audi         a4    f        31
## 4 audi         a4    f        30
## 5 audi         a4    f        26
## 6 audi         a4    f        26
Using the split-apply-combine technique with dplyr usually involves taking a data frame, forming subsets with the group_by() function, applying a summary function to the groups, and collecting the results into a single data frame.
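To make the three steps concrete, here is a rough sketch of split-apply-combine written out by hand in base R. The data frame `toy_df` and its values are made up for illustration. This is only meant to show the idea that group_by() and summarise() wrap up for us, not how dplyr works internally.

```r
# Hypothetical toy data frame standing in for a larger dataset
toy_df <- data.frame(
  manufacturer = c("a", "a", "b", "b", "b"),
  hwy          = c(20, 30, 10, 20, 30)
)

# Split: one vector of hwy values per manufacturer
hwy_by_manuf <- split(toy_df$hwy, toy_df$manufacturer)

# Apply: compute a summary for each piece
n_cars  <- sapply(hwy_by_manuf, length)
avg_hwy <- sapply(hwy_by_manuf, mean)

# Combine: collect the per-group results into a single data frame
summary_df <- data.frame(
  manufacturer = names(hwy_by_manuf),
  n_cars       = unname(n_cars),
  avg_hwy      = unname(avg_hwy)
)
summary_df
```

With dplyr, the split happens in group_by(), and the apply and combine steps both happen inside summarise().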
A simple example would be to answer the following questions about our subset of mpg:
How many cars are there by manufacturer? What is the average highway fuel efficiency by manufacturer?
The code below answers these questions with ease.
mpg_df %>% group_by(manufacturer) %>% 
           summarise(n_cars = n(),
                     avg_hwy = mean(hwy))
## # A tibble: 15 x 3
##    manufacturer n_cars avg_hwy
##    <chr>         <int>   <dbl>
##  1 audi             18    26.4
##  2 chevrolet        19    21.9
##  3 dodge            37    17.9
##  4 ford             25    19.4
##  5 honda             9    32.6
##  6 hyundai          14    26.9
##  7 jeep              8    17.6
##  8 land rover        4    16.5
##  9 lincoln           3    17  
## 10 mercury           4    18  
## 11 nissan           13    24.6
## 12 pontiac           5    26.4
## 13 subaru           14    25.6
## 14 toyota           34    24.9
## 15 volkswagen       27    29.2
What if someone asked us the following questions:
How many cars have a highway fuel efficiency greater than 16, by manufacturer? What proportion of the total cars does this group represent within each manufacturer?
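These questions can be answered without the use of logical vectors, but it involves a surprising amount of work! The steps are listed below:
Count the cars by manufacturer.
Filter the cars with a hwy value greater than 16 and count them by manufacturer into a separate data frame.
Join the two data frames by manufacturer and compute the proportion within each group.
The R code below implements this logic.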
# Counts by manufacturer
cars_by_manuf <- mpg_df %>% group_by(manufacturer) %>% 
                 summarise(n_cars = n())
# Counts by manufacturer for hwy > 16
cars_by_manuf_16 <- mpg_df %>% 
                    filter(hwy > 16) %>% 
                    group_by(manufacturer) %>% 
                    summarise(n_cars_16 = n())
# Combine into one data frame and compute proportion within each group
result <- cars_by_manuf %>% 
          left_join(cars_by_manuf_16, by = 'manufacturer') %>%
          mutate(prop_cars_16 = n_cars_16/n_cars)
# View results
result
## # A tibble: 15 x 4
##    manufacturer n_cars n_cars_16 prop_cars_16
##    <chr>         <int>     <int>        <dbl>
##  1 audi             18        18        1    
##  2 chevrolet        19        16        0.842
##  3 dodge            37        25        0.676
##  4 ford             25        22        0.88 
##  5 honda             9         9        1    
##  6 hyundai          14        14        1    
##  7 jeep              8         6        0.75 
##  8 land rover        4         2        0.5  
##  9 lincoln           3         2        0.667
## 10 mercury           4         4        1    
## 11 nissan           13        13        1    
## 12 pontiac           5         5        1    
## 13 subaru           14        14        1    
## 14 toyota           34        33        0.971
## 15 volkswagen       27        27        1    
I wasn’t joking when I said that it was a surprising amount of work! Let’s see how logical vectors can come to our rescue.
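Part of the rescue is robustness, not just brevity. The filter-and-join version has a pitfall that the mpg data happens not to trigger: if a group has no rows that pass the filter, it drops out of the filtered summary and comes back from the join as NA rather than 0. Below is a small sketch of that situation. The tibble `toy_df` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: every "b" car is at or below the threshold of 16
toy_df <- tibble(
  manufacturer = c("a", "a", "b", "b"),
  hwy          = c(20, 30, 12, 15)
)

totals <- toy_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n())

# filter() removes every "b" row before grouping, so "b" never appears here
over_16 <- toy_df %>%
  filter(hwy > 16) %>%
  group_by(manufacturer) %>%
  summarise(n_cars_16 = n())

# "b" has no match in over_16, so its count is NA rather than 0
joined <- totals %>%
  left_join(over_16, by = "manufacturer")
joined

# Summing a logical vector keeps every group and reports 0 for "b"
by_flag <- toy_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n(),
            n_cars_16 = sum(hwy > 16))
by_flag
```

The join-based result could be patched by replacing the missing counts with 0, but that is one more step to remember.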
Using the two properties of logical vectors from above, we can compute the results in a single dplyr expression.
mpg_df %>% group_by(manufacturer) %>% 
           summarise(n_cars = n(),
                     n_cars_16 = sum(hwy > 16), 
                     prop_cars_16 = mean(hwy > 16))
## # A tibble: 15 x 4
##    manufacturer n_cars n_cars_16 prop_cars_16
##    <chr>         <int>     <int>        <dbl>
##  1 audi             18        18        1    
##  2 chevrolet        19        16        0.842
##  3 dodge            37        25        0.676
##  4 ford             25        22        0.88 
##  5 honda             9         9        1    
##  6 hyundai          14        14        1    
##  7 jeep              8         6        0.75 
##  8 land rover        4         2        0.5  
##  9 lincoln           3         2        0.667
## 10 mercury           4         4        1    
## 11 nissan           13        13        1    
## 12 pontiac           5         5        1    
## 13 subaru           14        14        1    
## 14 toyota           34        33        0.971
## 15 volkswagen       27        27        1    
Who knew that logical vectors were the secret to simple and efficient dplyr code? As Alex has mentioned before, small wins aren’t life changing, but if you find enough of them, things start to feel a lot easier.