Building Your First Regression Model in Python with Scikit-learn

Free Live ML Workshop on July 2 - Register Now

Splitting text columns with inconsistent delimiter positions in R

Splitting text columns with inconsistent delimiter positions in R

May 30, 2022

Working with delimited text fields

Text data can be tricky to work with. Add a separator like a comma, semicolon, or space, and it becomes even more challenging to manipulate into a format that is ready for analysis.

Although tools like Excel and Google Sheets provide UI to disentangle such text records, most only handle the simple use case of breaking the string into multiple columns in its original order.

Survey data example

Let’s a look at some sample data with results from a multiple choice survey question: What are your favorite colors?

Respondents can select as many colors as they desire from a list of seven. Five respondents take the survey and raw data is downloaded from the survey platform and imported into an R dataframe named survey.

respondent response
1 blue;green;red
2 brown
3 orange;pink
4 black;red
5 pink;red

Text to columns

The Text To Columns… feature in Excel is similar to separate() from dplyr in R, which needs to know names for which columns that results should be placed. I’ve manually added 1:7 because there are only seven distinct colors in the survey and therefore any given respondent could select seven at most.

library(tidyverse)

survey %>% 
  separate(col = response, sep = ";", into= str_c('color',1:7 )) # color1 through color7 column names
respondent color1 color2 color3 color4 color5 color6 color7
1 blue green red NA NA NA NA
2 brown NA NA NA NA NA NA
3 orange pink NA NA NA NA NA
4 black red NA NA NA NA NA
5 pink red NA NA NA NA NA

Speeding up calculations

Wouldn’t it be better to have results from each unique response option grouped automatically? Enter the library splitstackshape by Ananda Mahto.

Its core function cSplit() would achieve the same result as separate() with less effort by running cSplit(indt = survey, splitCols = "response", sep = ";").

But to solve our problem we want to use the function cSplit_e(), passing in the following parameters.

  • data: our dataframe
  • split.col: the dataframe column we want to split
  • sep: the separator/delimiter character
  • type: the underlying datatype
library(splitstackshape)

cSplit_e(data = survey, split.col = "response", sep=";", type = "character")  
respondent response response_black response_blue response_brown response_green response_orange response_pink response_red
1 blue;green;red NA 1 NA 1 NA NA 1
2 brown NA NA 1 NA NA NA NA
3 orange;pink NA NA NA NA 1 1 NA
4 black;red 1 NA NA NA NA NA 1
5 pink;red NA NA NA NA NA 1 1

The function automatically splits the data into as many unique values are contained within the column - regardless of what position they might are reported! If a given respondent did not mention a specific color, the cell becomes NA. We could remove the original column by adding drop=TRUE and change missing values to zero with fill=0.

clean <- cSplit_e(data = survey, split.col = "response", sep=";", drop = TRUE, type = "character", fill = 0)   

clean
respondent response_black response_blue response_brown response_green response_orange response_pink response_red
1 0 1 0 1 0 0 1
2 0 0 1 0 0 0 0
3 0 0 0 0 1 1 0
4 1 0 0 0 0 0 1
5 0 0 0 0 0 1 1

Summarizing results

From here we can head back to the tidyverse to explore the survey results.

results <- clean %>% 
  select(contains("response_")) %>% # select the columns with color detail
  pivot_longer(cols = everything()) %>% # make it long format
  group_by(name) %>% 
  summarise(selected = sum(value)) %>% # count colors selected
  mutate(percent = selected / nrow(survey)) %>% # calculate percent of total
  arrange(desc(selected)) %>% 
  mutate(name = str_replace_all(name, "response_", "")) # get rid of prefix

results

Now we know how many people selected each color with 60 percent of respondents indicating that red, for example, is one of their favorite colors.

name selected percent
red 3 0.6
pink 2 0.4
black 1 0.2
blue 1 0.2
brown 1 0.2
green 1 0.2
orange 1 0.2

Visualizing results

Finally, we turn to ggplot to make a simple visualization showcasing the relative popularity of the colors red and pink among our survey respondents.

ggplot(results, aes(x = reorder(name, desc(percent)), y = percent )) + geom_col(fill='#1FA187') +
  geom_text(aes(label=scales::percent(percent)),nudge_y = .01) +
  theme_minimal() +
  scale_y_continuous(labels=percent) +
  labs(title='Percent of Survey Respondents Selecting a Given Color',
     x = NULL, y = NULL)

The complete code

In just a few lines of code we can convert a messy text column with inconsistently delimited values into a tidy dataframe, calculate summary results, and generate a clean visualization to convey our findings.

Not bad at all.

library(tidyverse)
library(splitstackshape)
library(DT)
library(scales)

survey <- tibble(respondent = 1:5,
                 response = c('blue;green;red',
                              'brown',
                              'orange;pink',
                              'black;red',
                              'pink;red'
                              ))

clean <- cSplit_e(data = survey, split.col = "response", sep=";", drop = TRUE, type = "character", fill = 0) 

results <- clean %>% select(contains("response_")) %>% 
  pivot_longer(cols = everything()) %>% 
  group_by(name) %>% 
  summarise(selected = sum(value)) %>% 
  mutate(percent = selected / nrow(survey)) %>% 
  arrange(desc(selected)) %>% 
  mutate(name = str_replace_all(name, "response_", "")) 

ggplot(results, aes(x = reorder(name, desc(percent)), y = percent )) + geom_col(fill='#1FA187') +
  geom_text(aes(label=scales::percent(percent)),nudge_y = .01) +
  theme_minimal() +
  scale_y_continuous(labels=percent) +
  labs(title='Percent of Respondents Selecting a Given Color',
     x = NULL, y = NULL)
splitstackshape.knit