# Tree diagrams in R

June 28, 2020

A tree diagram can effectively illustrate conditional probabilities. We start with a simple example and then walk through R code that uses the `data.tree` package to build a tree diagram dynamically and display the probability associated with each sequential outcome.

You can find the single-function solution on GitHub.

Gracie Skye is an ambitious 10-year-old. Each Saturday, she sells lemonade on the bike path behind her house during peak cycling hours. It is a lot of work to prepare the stand and bring the right quantity of ingredients, which she shops for every Friday after school for optimal freshness.

It didn’t take Gracie long to realize that weather has a huge impact on potential sales. Not surprisingly, people buy more lemonade on hot days with no rain than they do on wet, cold days. She has even estimated a demand equation based on temperature.

When it rains, demand falls an additional 20% across the temperature spectrum. To generate a more realistic view of her business, and to inform ingredient purchasing decisions, Gracie collected historic data to help her better anticipate weather conditions.

She finds:

• Probability of rain: p(rain) = 0.28
• Probability of no rain: p(no rain) = 0.72

Further, she knows the temperature distribution differs widely depending on whether it rains.

| Temperature | No rain | Rain |
|-------------|---------|------|
| 95°F        | 0.25    | 0.05 |
| 85°F        | 0.55    | 0.25 |
| 75°F        | 0.15    | 0.35 |
| 65°F        | 0.05    | 0.35 |

## Visualizing likely outcomes

Gracie translates these probabilities into a tree diagram to get a better sense of all potential outcomes and their respective likelihoods. The most probable outcome is to have no rain and a temperature of 85°F. There is a probability of 0.396 associated with this. The least likely outcome is rain with a temperature of 95°F (p=0.014).
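
As a rough check on those numbers, here is a minimal base R sketch of the joint probabilities. It assumes p(no rain) = 0.72 and p(rain) = 0.28, the values that reproduce the 0.396 and 0.014 quoted above; the object names (`p_weather`, `p_temp`, `joint`) are mine and not part of the original solution.

```r
# Marginal probability of each weather state (assumed: no rain = 0.72, rain = 0.28)
p_weather <- c("no rain" = 0.72, "rain" = 0.28)

# Conditional probability of each temperature (in °F) given the weather state
p_temp <- rbind(
  "no rain" = c("95" = 0.25, "85" = 0.55, "75" = 0.15, "65" = 0.05),
  "rain"    = c("95" = 0.05, "85" = 0.25, "75" = 0.35, "65" = 0.35)
)

# Joint probability = p(weather) * p(temperature | weather),
# i.e. multiply every row of p_temp by the matching weather probability
joint <- sweep(p_temp, 1, p_weather[rownames(p_temp)], "*")

round(joint, 3)

# Row/column position of the most and least likely outcomes
which(joint == max(joint), arr.ind = TRUE)
which(joint == min(joint), arr.ind = TRUE)
```

The eight joint probabilities sum to 1, which is a quick way to confirm that no branch was left out.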

## Expected outcomes

She then uses her demand function to calculate revenue, cost, and profit expectations for each scenario based on:

• Selling price: \$2 per glass
• Cost of goods sold: \$0.8 per glass
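
| Temperature (°F) | Rain    | Prob  | Demand (glasses) | Revenue (\$) | Cost (\$) | Profit (\$) |
|------------------|---------|-------|------------------|--------------|-----------|-------------|
| 85               | no rain | 0.396 | 44               | 88           | 35.2      | 52.8        |
| 95               | no rain | 0.18  | 62               | 124          | 49.6      | 74.4        |
| 75               | no rain | 0.108 | 28               | 56           | 22.4      | 33.6        |
| 75               | rain    | 0.098 | 22               | 44           | 17.6      | 26.4        |
| 65               | rain    | 0.098 | 8                | 16           | 6.4       | 9.6         |
| 85               | rain    | 0.07  | 36               | 72           | 28.8      | 43.2        |
| 65               | no rain | 0.036 | 10               | 20           | 8         | 12          |
| 95               | rain    | 0.014 | 49               | 98           | 39.2      | 58.8        |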

Taking the sum of all probabilities multiplied against their associated business outcome, Gracie calculates expected values for revenue, cost, and profit for her lemonade stand operations.

• Expected Revenue: \$76.23
• Expected Cost: \$30.49
• Expected Profit: \$45.74
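
For anyone who wants to reproduce these figures, the sketch below recomputes them in base R from the scenario probabilities and demand values in the table above. The data frame and variable names (`scenarios`, `price`, `unit_cost`) are illustrative choices of mine, not part of the original code.

```r
# Scenario probabilities and demand (glasses), copied from the table above
scenarios <- data.frame(
  temperature = c(85, 95, 75, 75, 65, 85, 65, 95),
  rain        = c("no rain", "no rain", "no rain", "rain",
                  "rain", "rain", "no rain", "rain"),
  prob        = c(0.396, 0.180, 0.108, 0.098, 0.098, 0.070, 0.036, 0.014),
  demand      = c(44, 62, 28, 22, 8, 36, 10, 49),
  stringsAsFactors = FALSE
)

price     <- 2.0  # selling price per glass
unit_cost <- 0.8  # cost of goods sold per glass

scenarios$revenue <- scenarios$demand * price
scenarios$cost    <- scenarios$demand * unit_cost
scenarios$profit  <- scenarios$revenue - scenarios$cost

# Expected value = sum over scenarios of probability * outcome
expected <- sapply(scenarios[, c("revenue", "cost", "profit")],
                   function(outcome) sum(scenarios$prob * outcome))

round(expected, 2)
```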

## Making tree diagrams in R

I created this example because there don’t seem to be many R packages with flexible outputs for tree diagrams. Specifically, I needed something with the ability to:

1. Take individual probabilities as inputs
2. Calculate and display the joint or cumulative probabilities for each potential outcome

The solution was to use the `data.tree` package and build the tree diagram with custom nodes.

All that is required is two columns:

1. pathString: This defines how the tree should be structured. In our example, the first branch level is rain or no rain. To add a second branch of decisions or possible paths, append the outcome to the first branch name with a `/` separator. For instance, `rain/95°F` indicates the outcome of rain and a temperature of 95 degrees. The name of this variable, pathString, is important because it is expected by the `as.Node()` function we’ll call later.
2. prob: The probability associated with a specified event.
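
If you would rather type the input by hand than read it from a file, a data frame with those two columns might look like the sketch below. The name `prob_data_manual` and the sanity check are my own additions: the check assumes that the probabilities of sibling branches (rows that share a parent) should sum to 1.

```r
# Temperature labels, written with an escaped degree sign to keep the source ASCII
temps <- paste0(c(95, 85, 75, 65), "\u00b0F")

prob_data_manual <- data.frame(
  pathString = c("no rain", paste0("no rain/", temps),
                 "rain",    paste0("rain/", temps)),
  prob       = c(0.72, 0.25, 0.55, 0.15, 0.05,
                 0.28, 0.05, 0.25, 0.35, 0.35),
  stringsAsFactors = FALSE
)

# The parent of a path is everything before its last "/";
# first-level rows are grouped under a placeholder root
has_parent <- grepl("/", prob_data_manual$pathString, fixed = TRUE)
parent <- ifelse(has_parent,
                 sub("/[^/]+$", "", prob_data_manual$pathString),
                 "(root)")

# Each group of siblings should add up to 1
sibling_totals <- tapply(prob_data_manual$prob, parent, sum)
sibling_totals
```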

Let’s load in our input data from which we want to create a tree diagram.

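```r
library(tidyverse)  # read_csv(), dplyr verbs, and stringr helpers used below
library(data.tree)  # as.Node() and the tree styling functions

prob_data <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQWc07o1xTNCJcGhw-tWYAnD3xPCjS0_jE4CIBR-rp5ff3flVGJQf2K24bJ5FE-DauQvLrtB8wWJNuc/pub?gid=0&single=true&output=csv")
```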
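
| pathString   | prob |
|--------------|------|
| no rain      | 0.72 |
| no rain/95°F | 0.25 |
| no rain/85°F | 0.55 |
| no rain/75°F | 0.15 |
| no rain/65°F | 0.05 |
| rain         | 0.28 |
| rain/95°F    | 0.05 |
| rain/85°F    | 0.25 |
| rain/75°F    | 0.35 |
| rain/65°F    | 0.35 |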

## Create some helper variables

The goal in this step is to generate some new variables from the original inputs that will help define the required tree structure.

1. tree_level: the branch level on a tree for a specific probability.
2. tree_group: the name of the first branch to lookup parent probabilities
3. node_type: a unique name to build custom components in the visualization

We also make a variable named `max_tree_level` that tells us the total number of branch levels in our tree.

```r
prob_data <- prob_data %>%
  mutate(tree_level = str_count(string = pathString, pattern = "/") + 1,
         tree_group = str_replace(string = pathString, pattern = "/.*", replacement = ""),
         node_type = "decision_node"
  )

max_tree_level <- max(prob_data$tree_level, na.rm = TRUE)
```
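
To see what `tree_level` and `tree_group` hold, here is a small base R sketch on a handful of made-up paths. It swaps the `stringr` calls for base R equivalents (`gregexpr()` and `sub()`), so treat it as an illustration of the idea rather than the code used in this post.

```r
paths <- c("no rain", "no rain/95", "rain", "rain/65")

# Branch level = number of "/" separators + 1
slash_matches <- regmatches(paths, gregexpr("/", paths, fixed = TRUE))
tree_level <- lengths(slash_matches) + 1

# First branch name = everything before the first "/"
tree_group <- sub("/.*", "", paths)

data.frame(paths, tree_level, tree_group)

# Deepest branch level in the tree
max(tree_level)
```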

## Finding parent probabilities

For us to determine the cumulative probability for a given outcome, we need to multiply the probabilities in secondary branches against the probability of the associated parent branch.

To do this, we create a new data frame, `parent_lookup`, that contains all probabilities from our input source. We then loop through the tree levels, grabbing the probabilities of all parent branches. Finally, we calculate the cumulative probability `overall_prob` by multiplying across all probabilities throughout a branch sequence.

```r
# Distinct path/probability pairs, used to look up each row's parent probability
parent_lookup <- prob_data %>% distinct(pathString, prob)

# Loop through the tree levels to attach the probability of every ancestor
# (needed for the cumulative probability). seq_len() skips the loop entirely
# when the tree has a single level.
for (i in seq_len(max_tree_level - 1)) {

  names(parent_lookup)[1] <- paste0("parent", i)
  names(parent_lookup)[2] <- paste0("parent_prob", i)

  # Strip the last path segment i times to get the i-th ancestor
  for (j in 1:i) {
    if (j == 1) {
      prob_data[[paste0("parent", i)]] <- sub("/[^/]+$", "", prob_data$pathString)
    } else {
      prob_data[[paste0("parent", i)]] <- sub("/[^/]+$", "", prob_data[[paste0("parent", i)]])
    }
  }

  prob_data <- prob_data %>% left_join(parent_lookup, by = paste0("parent", i))
}

# Cumulative probability: product of the row's own and its parents' probabilities
prob_data$overall_prob <- apply(prob_data %>% select(contains("prob")), 1, prod, na.rm = TRUE)
```
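
The same idea can be written as a small standalone function, which may be easier to reason about than the join-based loop. This is only a sketch: `cumulative_prob()` and the named vector `lookup` are hypothetical helpers of mine, and the approach assumes every ancestor path appears in the lookup.

```r
# Cumulative probability of a path = product of the probabilities of the
# path itself and every one of its ancestors
cumulative_prob <- function(path, lookup) {
  parts <- strsplit(path, "/", fixed = TRUE)[[1]]

  # All prefixes of the path, e.g. "no rain" and "no rain/85"
  prefixes <- vapply(seq_along(parts),
                     function(k) paste(parts[1:k], collapse = "/"),
                     character(1))

  # A prefix missing from the lookup yields NA, which flags incomplete input
  prod(lookup[prefixes])
}

# Path -> probability lookup for a few rows of the example data
lookup <- c("no rain" = 0.72, "no rain/85" = 0.55,
            "rain"    = 0.28, "rain/95"    = 0.05)

cumulative_prob("no rain/85", lookup)
cumulative_prob("rain/95", lookup)
```

Because the function walks the path itself, it does not care how many branch levels the tree has.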

## Generating terminal nodes

For us to display the final probability on the tree diagram, we will need to pass data from a node_type named terminal. Because we need unique `pathString` values, we do this by replicating the final branch probabilities (along with the cumulative probabilities we calculated above) and adding `/overall` to the `pathString`.

The final steps are to:

1. add the terminal rows back to the `prob_data` data frame
2. add one last row, with tree_level 0, that is the starting point for the tree. Here we name this `weather` for its `pathString`.

```r
# Create new rows that will display the terminal (final) calculations on the tree
terminal_data <- prob_data %>%
  filter(tree_level == max_tree_level) %>%
  mutate(node_type = "terminal",
         pathString = paste0(pathString, "/overall"),
         prob = NA,
         tree_level = max_tree_level + 1)

start_node <- "weather" # name the root node

# Bind everything together and prefix every path with the root node
prob_data <- bind_rows(prob_data, terminal_data) %>%
  mutate(pathString = paste0(start_node, "/", pathString),
         overall_prob = ifelse(node_type == "terminal", overall_prob, NA),
         prob_rank = rank(-overall_prob, ties.method = "min", na.last = "keep"))

# Add one new row to serve as the start node label
prob_data <- bind_rows(prob_data, data.frame(pathString = start_node, node_type = "start", tree_level = 0)) %>%
  select(-contains("parent"))
```
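
The `rank()` call is worth a closer look, because its arguments decide how ties and non-terminal rows are handled. The sketch below applies the same call to the eight cumulative probabilities from the lemonade example plus one `NA`, which stands in for a non-terminal row; the vector name `overall_prob_example` is mine.

```r
# Cumulative probabilities of the eight outcomes, plus NA for a non-terminal row
overall_prob_example <- c(0.396, 0.180, 0.108, 0.098, 0.098, 0.070, 0.036, 0.014, NA)

# Negating the values ranks the largest probability first.
# ties.method = "min" gives tied values the same (lowest) rank,
# and na.last = "keep" leaves NA rows unranked.
prob_rank_example <- rank(-overall_prob_example, ties.method = "min", na.last = "keep")

prob_rank_example
```

The two outcomes with probability 0.098 share rank 4, so the next outcome is ranked 6 and no outcome gets rank 5.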

Our final data frame, `prob_data`, is now ready to be passed to a visualization function.

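| pathString           | prob | tree_level | tree_group | node_type     | overall_prob | prob_rank |
|----------------------|------|------------|------------|---------------|--------------|-----------|
| weather              |      | 0          |            | start         |              |           |
| weather/no rain      | 0.72 | 1          | no rain    | decision_node |              |           |
| weather/rain         | 0.28 | 1          | rain       | decision_node |              |           |
| weather/no rain/95°F | 0.25 | 2          | no rain    | decision_node |              |           |
| weather/no rain/85°F | 0.55 | 2          | no rain    | decision_node |              |           |
| weather/no rain/75°F | 0.15 | 2          | no rain    | decision_node |              |           |
| weather/no rain/65°F | 0.05 | 2          | no rain    | decision_node |              |           |
| weather/rain/95°F    | 0.05 | 2          | rain       | decision_node |              |           |
| weather/rain/85°F    | 0.25 | 2          | rain       | decision_node |              |           |
| weather/rain/75°F    | 0.35 | 2          | rain       | decision_node |              |           |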

## Visualization function

We make a function, `make_my_tree`, that takes the prepared data frame (with columns `pathString` and `prob` plus the helper columns created above) and returns a tree diagram along with the cumulative probability for each path. The function can handle three additional arguments:

1. display_level: Enables us to control how many branches of the tree diagram to display. The default is NULL which will show the entire tree.
2. show_rank: Option to show the rank of the final probability along with the terminal node. The default is FALSE where the rank is not shown.
3. direction: Option to change the direction in which the tree is shown. The default is `LR`, or left to right. `RL` will display the tree right to left. Any other value will show it top-down.

```r
make_my_tree <- function(mydf, display_level = NULL, show_rank = FALSE, direction = "LR") {

  # Optionally keep only the first `display_level` branch levels
  if (!is.null(display_level)) {
    mydf <- mydf %>% filter(tree_level <= display_level)
  }

  # Build the tree structure from the pathString column
  mytree <- as.Node(mydf)

  # Label every edge with the probability of the node it leads to
  GetEdgeLabel <- function(node) switch(node$node_type, node$prob)

  # Alternative names must match the node_type values assigned earlier
  GetNodeShape <- function(node) switch(node$node_type, start = "box", decision_node = "circle", terminal = "none")

  # Terminal nodes show the cumulative probability (and optionally its rank);
  # every other node shows its own name
  GetNodeLabel <- function(node) switch(node$node_type,
    terminal = if (show_rank) {
      paste0("Prob: ", node$overall_prob, "\nRank: ", node$prob_rank)
    } else {
      paste0("Prob: ", node$overall_prob)
    },
    node$name)

  SetEdgeStyle(mytree, fontname = 'helvetica', label = GetEdgeLabel)

  SetNodeStyle(mytree, fontname = 'helvetica', label = GetNodeLabel, shape = GetNodeShape)

  SetGraphStyle(mytree, rankdir = direction)

  plot(mytree)

}
```
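
The styling helpers rely on how base R's `switch()` treats named and unnamed alternatives, so here is a short sketch of that behavior on its own. The functions `shape_for()` and `label_for()` are stand-ins I made up for illustration; they take plain values instead of `data.tree` nodes.

```r
# Named alternatives: the value of node_type must match a name exactly
shape_for <- function(node_type) {
  switch(node_type, start = "box", decision_node = "circle", terminal = "none")
}

shape_for("start")          # "box"
shape_for("decision_node")  # "circle"

# No matching name and no default: switch() returns NULL
is.null(shape_for("some_other_type"))  # TRUE

# A single unnamed alternative acts as the default,
# so it is returned for every node_type
label_for <- function(node_type, prob) switch(node_type, prob)

label_for("decision_node", 0.72)  # 0.72
label_for("start", 0.28)          # 0.28
```

This is why the names passed to `switch()` need to match the `node_type` values exactly: a mismatch does not raise an error, it silently returns `NULL`.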

## Show the decision tree

Passing our data frame to `make_my_tree` produces this baseline visual.

``make_my_tree(prob_data)``

## Alternative outputs

And here are a few alternative versions based on the optional arguments we included in the `make_my_tree` function.

### Only the first branch

``make_my_tree(prob_data, display_level = 1)``

### Full tree without the conditional probabilities

``make_my_tree(prob_data, display_level = 2)``

### Everything including ranks

``make_my_tree(prob_data, show_rank = TRUE)``

## Next steps

This approach works with tree diagrams of any size, although trees with many branch levels quickly become hard to read. Some things we could also consider:

• Enhance the script to calculate and display payoff amounts by branch as well as generate overall expected value figures.
• Add formatting options such as color, font type, and font size for the visual.
• Turn the data frame manipulation into its own function that will return all conditional probabilities. This could be especially useful as the number of branches grows larger.

## Inspiration

All of the code above was built on top of approaches found in the resources below:
