Hadley Wickham’s dplyr package is an **incredibly** powerful **R** package for data analysis. A common data analysis technique, known as `split-apply-combine`, involves creating statistical summaries *by groups* within a data frame.

This post will introduce the power of using logical vectors within your `dplyr` code to create complex data summaries with ease.

Imagine we have data from a survey we recently conducted where 7 people responded and provided their age. This data is stored in the `age` vector below.

```
age <- c(23, 31, 27, 41, 54, 34, 25)
age
```

`## [1] 23 31 27 41 54 34 25`

What if we would like to know the number of people who are 30 or older and what percentage of the total respondents this group represents?

We can answer this question by first using R’s `>=` operator to find where values stored in the `age` vector are greater than or equal to the value 30. Anytime we use R’s comparison operators (`>`, `>=`, `<`, `<=`, `==`) on a vector, we will get a logical vector consisting of TRUE/FALSE values indicating where our condition was met.

For example, running the code below produces a sequence of TRUE/FALSE values that test where our respondents are 30 or older in the `age` vector.

`age >= 30`

`## [1] FALSE TRUE FALSE TRUE TRUE TRUE FALSE`
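To make the shape of this result concrete, here is a small sketch using the same `age` vector. The variable name `older` is introduced here only for illustration. It shows that the comparison returns one logical value per respondent and that the result can be used to pick out the matching ages.

```r
age <- c(23, 31, 27, 41, 54, 34, 25)

# The comparison is applied element-wise: one TRUE/FALSE per value in age
older <- age >= 30

class(older)    # the result is a logical vector
length(older)   # same length as age

# A logical vector can also be used to subset the original vector
age[older]
```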

To answer our question above, we can use the following properties of logical vectors in R:

- the sum of a logical vector returns the **number of TRUE values**
- the mean of a logical vector returns the **proportion of TRUE values**

We see from the output below that 4 people in our survey were 30 years or older and that this represents 57% of the total respondents.

`sum(age >= 30)`

`## [1] 4`

`mean(age >= 30)`

`## [1] 0.5714286`
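Both properties follow from the way R treats `TRUE` as 1 and `FALSE` as 0 in arithmetic. The sketch below spells this out for the same `age` vector. The names `flags`, `n_older` and `prop_older` are introduced here only for illustration.

```r
age <- c(23, 31, 27, 41, 54, 34, 25)

# Converting the logical vector to integers shows the 0/1 values behind it
flags <- as.integer(age >= 30)
flags

# Summing the 0/1 values counts the TRUEs
n_older <- sum(flags)

# Dividing that count by the number of respondents is what mean() computes
prop_older <- n_older / length(age)
```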

Let’s go through a simple example where using these two properties can help with performing complex statistical summaries with `dplyr`.

We will be working with a subset of the `mpg` dataset, which ships with `ggplot2` and is therefore available as soon as the `tidyverse` package is loaded in R.

This dataset contains the fuel efficiency and other interesting properties of 234 cars.

```
library(tidyverse)

# Keep only the columns needed for the examples
mpg_df <- mpg %>%
  select(manufacturer, model, drv, hwy)

mpg_df %>% head()
```

```
## # A tibble: 6 x 4
##   manufacturer model drv     hwy
##   <chr>        <chr> <chr> <int>
## 1 audi         a4    f        29
## 2 audi         a4    f        29
## 3 audi         a4    f        31
## 4 audi         a4    f        30
## 5 audi         a4    f        26
## 6 audi         a4    f        26
```

Using the `split-apply-combine` technique with `dplyr` usually involves taking a data frame, forming subsets with the `group_by()` function, applying a summary function to the groups, and collecting the results into a single data frame.
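To make the three steps explicit, here is a minimal base R sketch on a small made-up data frame. The data frame `cars` and its `maker` column are hypothetical and not part of `mpg`. With `dplyr`, `group_by()` and `summarise()` handle all three steps for us.

```r
# Hypothetical toy data, not taken from the mpg dataset
cars <- data.frame(
  maker = c("a", "a", "b", "b", "b"),
  hwy   = c(30, 26, 18, 20, 22)
)

# Split: one vector of hwy values per maker
pieces <- split(cars$hwy, cars$maker)

# Apply: compute a summary for each piece
avg <- sapply(pieces, mean)

# Combine: collect the summaries into a single data frame
result <- data.frame(maker = names(avg), avg_hwy = unname(avg))
result
```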

A simple example would be to answer the following questions about our subset of `mpg`:

- How many cars are there by manufacturer?
- What is the average highway fuel efficiency by manufacturer?

The code below answers these questions with ease.

```
mpg_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n(),
            avg_hwy = mean(hwy))
```

```
## # A tibble: 15 x 3
##    manufacturer n_cars avg_hwy
##    <chr>         <int>   <dbl>
##  1 audi             18    26.4
##  2 chevrolet        19    21.9
##  3 dodge            37    17.9
##  4 ford             25    19.4
##  5 honda             9    32.6
##  6 hyundai          14    26.9
##  7 jeep              8    17.6
##  8 land rover        4    16.5
##  9 lincoln           3    17
## 10 mercury           4    18
## 11 nissan           13    24.6
## 12 pontiac           5    26.4
## 13 subaru           14    25.6
## 14 toyota           34    24.9
## 15 volkswagen       27    29.2
```

What if someone asked us the following questions:

- How many cars have a highway fuel efficiency greater than 16, by manufacturer?
- What proportion of the total cars does this group represent within each manufacturer?

These questions can be answered without the use of logical vectors, but it involves a surprising amount of work! The steps are listed below:

- We must calculate the number of cars by manufacturer and store it in a new data frame
- Next we calculate the number of cars by manufacturer that have a `hwy` value greater than 16 and store it in a separate data frame
- Finally we join the data together into our final result and calculate the proportion

The R code below implements this logic.

```
# Counts by manufacturer
cars_by_manuf <- mpg_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n())

# Counts by manufacturer for hwy > 16
cars_by_manuf_16 <- mpg_df %>%
  filter(hwy > 16) %>%
  group_by(manufacturer) %>%
  summarise(n_cars_16 = n())

# Combine into one data frame and compute proportion within each group
result <- cars_by_manuf %>%
  left_join(cars_by_manuf_16, by = 'manufacturer') %>%
  mutate(prop_cars_16 = n_cars_16 / n_cars)

# View results
result
```

```
## # A tibble: 15 x 4
##    manufacturer n_cars n_cars_16 prop_cars_16
##    <chr>         <int>     <int>        <dbl>
##  1 audi             18        18        1
##  2 chevrolet        19        16        0.842
##  3 dodge            37        25        0.676
##  4 ford             25        22        0.88
##  5 honda             9         9        1
##  6 hyundai          14        14        1
##  7 jeep              8         6        0.75
##  8 land rover        4         2        0.5
##  9 lincoln           3         2        0.667
## 10 mercury           4         4        1
## 11 nissan           13        13        1
## 12 pontiac           5         5        1
## 13 subaru           14        14        1
## 14 toyota           34        33        0.971
## 15 volkswagen       27        27        1
```

I wasn’t joking when I said that it was a surprising amount of work! Let’s see how logical vectors can come to our rescue.

Using the two properties of logical vectors from above, we can compute the results in a **single dplyr expression**.

```
mpg_df %>%
  group_by(manufacturer) %>%
  summarise(n_cars = n(),
            n_cars_16 = sum(hwy > 16),
            prop_cars_16 = mean(hwy > 16))
```

```
## # A tibble: 15 x 4
##    manufacturer n_cars n_cars_16 prop_cars_16
##    <chr>         <int>     <int>        <dbl>
##  1 audi             18        18        1
##  2 chevrolet        19        16        0.842
##  3 dodge            37        25        0.676
##  4 ford             25        22        0.88
##  5 honda             9         9        1
##  6 hyundai          14        14        1
##  7 jeep              8         6        0.75
##  8 land rover        4         2        0.5
##  9 lincoln           3         2        0.667
## 10 mercury           4         4        1
## 11 nissan           13        13        1
## 12 pontiac           5         5        1
## 13 subaru           14        14        1
## 14 toyota           34        33        0.971
## 15 volkswagen       27        27        1
```

Who knew that logical vectors were the secret to simple and efficient dplyr code? As Alex has mentioned before, small wins aren’t life changing, but if you find enough of them, things start to feel a lot easier.