Splitting text columns with inconsistent delimiter positions in R

30 de Mayo de 2022

Working with delimited text fields

Text data can be tricky to work with. Add a separator like a comma, semicolon, or space, and it becomes even more challenging to manipulate into a format that is ready for analysis.

Although tools like Excel and Google Sheets provide UI to disentangle such text records, most only handle the simple use case of breaking the string into multiple columns in its original order.

Survey data example

Let’s a look at some sample data with results from a multiple choice survey question: What are your favorite colors?

Respondents can select as many colors as they desire from a list of seven. Five respondents take the survey and raw data is downloaded from the survey platform and imported into an R dataframe named survey.

respondent	response
1	blue;green;red
2	brown
3	orange;pink
4	black;red
5	pink;red

Text to columns

The Text To Columns… feature in Excel is similar to separate() from dplyr in R, which needs to know names for which columns that results should be placed. I’ve manually added 1:7 because there are only seven distinct colors in the survey and therefore any given respondent could select seven at most.

library(tidyverse)

survey %>% 
  separate(col = response, sep = ";", into= str_c('color',1:7 )) # color1 through color7 column names

respondent	color1	color2	color3	color4	color5	color6	color7
1	blue	green	red	NA	NA	NA	NA
2	brown	NA	NA	NA	NA	NA	NA
3	orange	pink	NA	NA	NA	NA	NA
4	black	red	NA	NA	NA	NA	NA
5	pink	red	NA	NA	NA	NA	NA

Speeding up calculations

Wouldn’t it be better to have results from each unique response option grouped automatically? Enter the library splitstackshape by Ananda Mahto.

Its core function cSplit() would achieve the same result as separate() with less effort by running cSplit(indt = survey, splitCols = "response", sep = ";").

But to solve our problem we want to use the function cSplit_e(), passing in the following parameters.

data: our dataframe
split.col: the dataframe column we want to split
sep: the separator/delimiter character
type: the underlying datatype

library(splitstackshape)

cSplit_e(data = survey, split.col = "response", sep=";", type = "character")

respondent	response	response_black	response_blue	response_brown	response_green	response_orange	response_pink	response_red
1	blue;green;red	NA	1	NA	1	NA	NA	1
2	brown	NA	NA	1	NA	NA	NA	NA
3	orange;pink	NA	NA	NA	NA	1	1	NA
4	black;red	1	NA	NA	NA	NA	NA	1
5	pink;red	NA	NA	NA	NA	NA	1	1

The function automatically splits the data into as many unique values are contained within the column - regardless of what position they might are reported! If a given respondent did not mention a specific color, the cell becomes NA. We could remove the original column by adding drop=TRUE and change missing values to zero with fill=0.

clean <- cSplit_e(data = survey, split.col = "response", sep=";", drop = TRUE, type = "character", fill = 0)   

clean

respondent	response_black	response_blue	response_brown	response_green	response_orange	response_pink	response_red
1	0	1	0	1	0	0	1
2	0	0	1	0	0	0	0
3	0	0	0	0	1	1	0
4	1	0	0	0	0	0	1
5	0	0	0	0	0	1	1

Summarizing results

From here we can head back to the tidyverse to explore the survey results.

results <- clean %>% 
  select(contains("response_")) %>% # select the columns with color detail
  pivot_longer(cols = everything()) %>% # make it long format
  group_by(name) %>% 
  summarise(selected = sum(value)) %>% # count colors selected
  mutate(percent = selected / nrow(survey)) %>% # calculate percent of total
  arrange(desc(selected)) %>% 
  mutate(name = str_replace_all(name, "response_", "")) # get rid of prefix

results

Now we know how many people selected each color with 60 percent of respondents indicating that red, for example, is one of their favorite colors.

name	selected	percent
red	3	0.6
pink	2	0.4
black	1	0.2
blue	1	0.2
brown	1	0.2
green	1	0.2
orange	1	0.2

Visualizing results

Finally, we turn to ggplot to make a simple visualization showcasing the relative popularity of the colors red and pink among our survey respondents.

ggplot(results, aes(x = reorder(name, desc(percent)), y = percent )) + geom_col(fill='#1FA187') +
  geom_text(aes(label=scales::percent(percent)),nudge_y = .01) +
  theme_minimal() +
  scale_y_continuous(labels=percent) +
  labs(title='Percent of Survey Respondents Selecting a Given Color',
     x = NULL, y = NULL)

The complete code

In just a few lines of code we can convert a messy text column with inconsistently delimited values into a tidy dataframe, calculate summary results, and generate a clean visualization to convey our findings.

Not bad at all.

library(tidyverse)
library(splitstackshape)
library(DT)
library(scales)

survey <- tibble(respondent = 1:5,
                 response = c('blue;green;red',
                              'brown',
                              'orange;pink',
                              'black;red',
                              'pink;red'
                              ))

clean <- cSplit_e(data = survey, split.col = "response", sep=";", drop = TRUE, type = "character", fill = 0) 

results <- clean %>% select(contains("response_")) %>% 
  pivot_longer(cols = everything()) %>% 
  group_by(name) %>% 
  summarise(selected = sum(value)) %>% 
  mutate(percent = selected / nrow(survey)) %>% 
  arrange(desc(selected)) %>% 
  mutate(name = str_replace_all(name, "response_", "")) 

ggplot(results, aes(x = reorder(name, desc(percent)), y = percent )) + geom_col(fill='#1FA187') +
  geom_text(aes(label=scales::percent(percent)),nudge_y = .01) +
  theme_minimal() +
  scale_y_continuous(labels=percent) +
  labs(title='Percent of Respondents Selecting a Given Color',
     x = NULL, y = NULL)