10 Understanding distributions

10.1 What is a distribution?

Day 44

Distributions are somewhat of a buzzword within the data community. Perhaps because the concept is both simple and complex, there tends to be anxiety when non-data people hear the term thrown around.

Let’s start with something that almost everyone understands: a series of data points. We all can envision that column in Excel filled with a long list of numbers.

A distribution is the visual shape of those numbers arranged smallest to largest from left to right. Values in that range with many observations appear taller, while areas with few observations appear shorter.

It is the series of numbers that forms a distribution and dictates what shape it will take.


Different distributions

Because numbers within a data series can be very different, distributions can look very different. Let’s say our data series had values ranging from zero to ten. Our distribution might look like this.



We can see that there are some observations at zero and ten, and a lot of observations from one to nine. The distribution helps us understand where values tend to be located across the range.

We can also think of it in terms of probability. If the distribution is based on a lot of data points and representative of something (e.g., typical snowfall in centimeters for a city), we can use it to see that each value from one through nine is equally likely, and more likely than either zero or ten.
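As a rough illustration of both ideas, the sketch below counts how often each value occurs in a small data series and turns the counts into probabilities. The series is hypothetical, made up to resemble the shape described above, and Python is used here only as a convenient stand-in for a spreadsheet.

```python
from collections import Counter

# Hypothetical data series: whole numbers from zero to ten (not the book's data).
series = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10]

counts = Counter(series)   # number of observations at each value
total = len(series)
probabilities = {value: counts[value] / total for value in range(11)}

# Text histogram: longer bars mean more observations at that value.
for value, share in probabilities.items():
    print(f"{value:>2} | {'#' * counts[value]:<3} {share:.0%}")
```

The bars for one through nine come out the same length and longer than the bars for zero and ten, which is the distribution's shape expressed as counts.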


What else might a distribution look like?



These distributions can be interpreted as follows:

  1. Upper Left: All values occur at equal levels or probabilities.
  2. Upper Right: Much more likely to find values closer to zero than closer to ten.
  3. Lower Left: Five is the most common with fewer values as you approach zero or ten.
  4. Lower Right: Most values occur in the center of the distribution with very few at the extremes.

The directional sense of an underlying distribution is very important. In the real world there are certain distributions that are more common than others and that come with special properties. For instance, the chart on the lower right is the famous normal distribution, which we’ll explore further.



10.2 Normal distribution

10.2.1 Establishing normality

Day 45

The normal distribution may also be referred to as the Gaussian distribution or bell curve. It is characterized by being symmetric in shape with most values falling toward the center of the distribution, in which the measures of central tendency are approximately the same.


Normal distribution

Properties

  • Symmetric in appearance
  • The mean, median, and mode are all the same or approximately the same
  • The area under the curve equals 1 or 100%

Examples

  • Height of adults
  • Scores on the Graduate Record Examinations (GRE)


Let’s use business metrics from a help desk as an example. A three-person team receives support requests throughout the workday. They track the time it takes in minutes for someone to provide the first non-automated response to a given user. Typically it takes around 70 minutes for this to occur, but it can be longer or shorter depending on how busy the day turns out to be.

Here is what response times look like for the previous thousand support requests.



The support team averaged 70 minutes to send a formal reply to users. Its fastest time was 42 minutes and its slowest time was 102 minutes. Half of the time - as shown by the interquartile range (IQR) - a response was sent between 63 and 77 minutes.

This distribution seems to meet the normal distribution characteristics described above. The mean and median are nearly identical and it is rare to find values near the extreme points on either side of the curve.

In fact, if we overlay a perfectly normal distribution curve we see that our response times track quite closely.



How can I prove that my distribution is normal?

To address this common question let’s add a second distribution that is notably less normal, the life_expect variable from our countries dataset.


Approach 1. Adding a theoretical normal distribution to compare visually

The first approach is simply eye-balling the observed distribution against the theoretical curve.



At first glance the response times seem to fit the bell-curve more closely. The life expectancy chart on the right is somewhat lopsided and doesn’t have the expected number of observations continue down the curve to the right in a smooth way.


Approach 2. Create a QQ plot

Statisticians also use a Quantile-Quantile (QQ) plot to examine these differences more closely. A QQ plot maps the observed values against the theoretical quantile distribution expected in a pure Gaussian curve. A perfectly normal distribution will have all points on the reference line.



The observations seen in the response time QQ plot fall within the shaded boundaries, further evidence for normality. In contrast, the life expectancy QQ plot displays more volatility around the reference line and, especially at the top end of the distribution, even falls off entirely.
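For readers curious about what sits behind a QQ plot, here is a minimal sketch. It uses simulated response times (an assumption, since the real help desk data is not reproduced here) and pairs each sorted observation with the value a fitted normal curve would predict at the same quantile. The helper name `qq_points` and the plotting position `(i + 0.5) / n` are illustrative choices, not the method used for the charts above.

```python
import random
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Pair each sorted observation with the value a fitted normal curve predicts."""
    ordered = sorted(values)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Theoretical quantile for the i-th smallest observation.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

rng = random.Random(45)  # arbitrary seed so the run is repeatable
simulated = [rng.gauss(70, 10) for _ in range(1000)]  # hypothetical response times

points = qq_points(simulated)
central = points[100:900]  # middle 80 percent of the observations
largest_gap = max(abs(observed - expected) for expected, observed in central)
print(f"Largest gap from the reference line (central 80%): {largest_gap:.2f} minutes")
```

Plotting the pairs with the theoretical value on one axis and the observed value on the other gives the QQ plot. Points from a near-normal series stay close to the line where the two are equal.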


Approach 3. Running statistical tests

Finally, there is a statistical test called the Shapiro-Wilk normality test that can be deployed. It returns a p-value, which is something we’ll define later. If the p-value returned by the test is very small (e.g., less than 0.05), there is sufficient evidence that the distribution in question is not normal. If it is above that decision threshold (e.g., more than 0.05), the test has found no evidence against normality, so we can reasonably treat the data series as approximately normal.


Shapiro-Wilk normality test results

1. Response time in minutes: p-value = 0.4764686 –> normally distributed

2. Life expectancy in years: p-value = 0.0000732 –> not normally distributed


This result is a third validation that response time is normally distributed, while life expectancy is not. The statistical tests were run in R but there are also web versions available for you to experiment with.


So what?

We’ve looked at ways to tell if the data we have approximately follow a normal distribution. Next we’ll see why this is a useful characteristic for numeric data and examine what additional techniques can be applied.



10.2.2 Analyzing normal data

Day 46

The response time data from our help desk example was shown to be approximately normal. We keep saying approximately because you won’t find a perfectly normally distributed data series - and that’s ok.

When your data is approximately normal there are several useful analytical extensions.


68-95-99.7 Rule

We defined the 68-95-99.7 Rule earlier. In essence, if the data is normal, you should expect:

  • 68 percent of observations will be within one standard deviation from the mean
  • 95 percent of observations will be within two standard deviations from the mean
  • 99.7 percent of observations will be within three standard deviations from the mean

Based on a response time mean of 70.16 and a standard deviation of 9.92 we find:



In the case of help desk response times this rule works very well, a benefit of the data being approximately normal. Without having access to the full distribution, you can simply use the mean and standard deviation to learn a lot.
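The ranges implied by the rule can be worked out from the mean and standard deviation alone. The sketch below does so with the figures quoted above (70.16 and 9.92) and also checks what share of a theoretical normal curve falls in each range. It describes the theoretical curve only, not the observed response times.

```python
from statistics import NormalDist

mean_minutes, sd_minutes = 70.16, 9.92
curve = NormalDist(mean_minutes, sd_minutes)

ranges = {}
for k in (1, 2, 3):
    low = mean_minutes - k * sd_minutes
    high = mean_minutes + k * sd_minutes
    share = curve.cdf(high) - curve.cdf(low)  # area under the curve between the bounds
    ranges[k] = (low, high, share)
    print(f"within {k} sd: {low:.2f} to {high:.2f} minutes, {share:.1%} of the curve")
```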


Turning distributions into probabilities

Another benefit of working with data sets that are normally distributed is that we can treat them as probability distributions. As all values must fall within our distribution, the visual is showing a hundred percent of the outcomes.

When you overlay a perfect normal distribution, you’ll likely notice that analyzing the red curve instead of the actual distribution won’t give identical results.

There are some ranges that are slightly under the curve, and other areas slightly above. By assuming normality, however, we are saying that we are ok with these limitations as the benefits will outweigh the costs.



Again, creating a full distribution requires us to have all of the actual response time values. The theoretical distribution only requires the mean and the standard deviation. The area underneath the theoretical bell curve represents 100 percent of the expected values.

We can use this understanding to estimate probabilities for given values occurring within the range. You can already tell from the visual, or from your 68-95-99.7 calculations, that most values occur from around 55 minutes to 85 minutes. But what are the precise probability estimates for such values?


Probability of a specific value

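The Probability Density Function (PDF) helps answer these questions by returning the height of the curve at a given value. Because our values are measured in minutes, that height approximates the area underneath the curve for the one-minute window around the value. We can use it to answer a question like, “what’s the probability that a response time for a randomly submitted help ticket will be about 80 minutes?”.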



You can use calculus to determine the area underneath a curve, or you can rely on embedded spreadsheet functions to do so. In Microsoft Excel and Google Sheets, NORM.DIST() accomplishes the task.


=NORM.DIST(specific value, mean, standard deviation, FALSE)


The FALSE input at the end will return the height of the curve at the specific value passed in, which approximates the probability of landing in the one-minute window around it. So to estimate the probability of a response time being about 80 minutes, use:


=NORM.DIST(80, 70.16, 9.92, FALSE)


This will return 0.0246, or roughly a 2.5% chance. Although interesting, that is a very specific number of minutes from within the range.
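The same calculation can be sketched outside a spreadsheet. The example below assumes Python's standard-library `NormalDist` as a stand-in for NORM.DIST() and compares the height of the curve at 80 with the area under the curve from 79.5 to 80.5 minutes, to show that the two are close.

```python
from statistics import NormalDist

response_times = NormalDist(mu=70.16, sigma=9.92)

# Height of the curve at 80, the counterpart of NORM.DIST(80, 70.16, 9.92, FALSE).
density_at_80 = response_times.pdf(80)

# Area under the curve for the one-minute window centered on 80.
window = response_times.cdf(80.5) - response_times.cdf(79.5)

print(f"density at 80 minutes:       {density_at_80:.4f}")
print(f"P(79.5 to 80.5 minutes):     {window:.4f}")
```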


Probability of a given value or less

The Cumulative Distribution Function (CDF) is often more useful. It returns the probability of getting up to and including a given value. The formula is the same as for PDF with the exception of the final argument. Here we use TRUE to return the cumulative probability.


=NORM.DIST(upper value, mean, standard deviation, TRUE)


What is the probability of the response time to a given ticket being 80 minutes or less?


=NORM.DIST(80, 70.16, 9.92, TRUE)


The returned probability is 0.8394 or 83.9% and is shown visually by the shaded area in the chart below.
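For comparison, a minimal sketch of the same cumulative calculation in Python, again assuming the standard-library `NormalDist` in place of the spreadsheet function:

```python
from statistics import NormalDist

response_times = NormalDist(mu=70.16, sigma=9.92)

# Counterpart of NORM.DIST(80, 70.16, 9.92, TRUE): area under the curve up to 80.
p_80_or_less = response_times.cdf(80)
print(f"P(response time <= 80 minutes) = {p_80_or_less:.4f}")
```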



Probability of more than a given value

To find the probability of a help ticket response taking more than 80 minutes, simply subtract the previous result from one.


= 1 - NORM.DIST(80, 70.16, 9.92, TRUE)


The result is 0.1606 or 16.1%.



Probability between two points

Finally, we can use these functions to calculate the probability of a response time being between two values of interest by subtracting two CDFs.

Here we find the probability of a response time taking between 50 and 80 minutes to be 0.8184 or 81.8%.


= NORM.DIST(80, 70.16, 9.92, TRUE) - NORM.DIST(50, 70.16, 9.92, TRUE)
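Both the "more than" and the "between" calculations are subtractions on the cumulative curve. A short sketch, assuming Python's `NormalDist` as a substitute for NORM.DIST() with TRUE:

```python
from statistics import NormalDist

response_times = NormalDist(mu=70.16, sigma=9.92)

# More than 80 minutes: everything that is not "80 or less".
p_more_than_80 = 1 - response_times.cdf(80)

# Between 50 and 80 minutes: area up to 80 minus area up to 50.
p_between_50_and_80 = response_times.cdf(80) - response_times.cdf(50)

print(f"P(more than 80 minutes)    = {p_more_than_80:.4f}")
print(f"P(between 50 and 80 min)   = {p_between_50_and_80:.4f}")
```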



It is incredible how much we are able to deduce with only the mean, standard deviation, and the belief that the underlying distribution is approximately normal.

You will find all the probability calculations shown above in this worksheet.



10.2.3 Data transformation

Day 47

Many statistical techniques work well with normal data. Some methods even require it. Data people sometimes go to great lengths to transform non-normal data into normal data because of this. There are other potential benefits that we’ll see as well, including relationship detection and outlier mitigation.

Data transformation involves taking all the original values from a data series and doing something consistent to them in an attempt to change the shape of the distribution.


Examples of data transformation

Let’s look at total economic output - gdp - and population from the countries dataset.



Both variables are clearly non-normal. Many observations are on the far left of the distribution and there are a small number of really large values toward the right of the horizontal axis.


QQ Plots for the Original Data


The QQ plots show observations very far removed from the reference line, another indication that both series are non-normal.


Logarithmic transformation to the rescue

Something incredible happens when we take the log of each data point and re-plot the variables.



All three visuals look completely different. It is much easier to see variation across the individual distributions and the linear relationship now appears quite clear in the scatter plot.


QQ Plots for the Transformed Data


The QQ plots now indicate near-normality for the transformed data. Not surprising given the new histograms.


Common data transformation approaches

A logarithmic transformation is just one potential approach.


Common data transformations

  • Log: =log(data)
  • Square Root: =sqrt(data)
  • Cube Root: =data^(1/3)
  • Inverse: =1/data
  • Box Cox: read more
  • Johnson Transformation: read more


The first four approaches can be seen here. Box Cox and the Johnson Transformation are more advanced techniques that use power functions and can’t be easily shown in spreadsheets.
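To get a feel for how the first four transformations change a distribution's shape, the sketch below applies them to a simulated, strongly right-skewed series and reports the skewness before and after. The data is hypothetical (it is not gdp from the countries dataset), and skewness is used here only as a quick stand-in for judging lopsidedness, with values near zero indicating a more symmetric shape.

```python
import math
import random
from statistics import mean, pstdev

def skewness(values):
    """Average cubed z-score: near zero for symmetric data, large when lopsided."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

rng = random.Random(47)  # arbitrary seed so the run is repeatable
data = [rng.lognormvariate(10, 1.5) for _ in range(500)]  # hypothetical, all positive

transforms = {
    "log": math.log10,                  # =log(data) uses base 10 in spreadsheets
    "square root": math.sqrt,           # =sqrt(data)
    "cube root": lambda v: v ** (1 / 3),  # =data^(1/3)
    "inverse": lambda v: 1 / v,         # =1/data
}

skew = {"original": skewness(data)}
for name, transform in transforms.items():
    skew[name] = skewness([transform(v) for v in data])

for name, value in skew.items():
    print(f"{name:<12} skewness = {value:6.2f}")
```

For this particular simulated series the log transformation brings skewness closest to zero. With differently shaped data another transformation may do better.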

Let’s apply all the transformations above to gdp and observe the impact.



For gdp, the Johnson Transformation and Logarithmic approaches succeeded in changing the underlying distribution to normal. Depending on the characteristics of the original distribution, different transformations will be more effective than others.

For instance, see how many work for the gini coefficient, a measure of income inequality.



Five of the six data transformations tested on gini led to normal distribution designations. So, which one should you pick? It is generally best to select one that you and your audience will understand.


Downsides

Although there are benefits to transforming data to better approach normality, there are also costs. Most importantly, your results will likely be harder for many audiences to interpret. It is easy to wrap our brains around data such as population totals. Deciphering the cube root transformation of total population, or its relevance, is more difficult.



10.3 Bernoulli distribution

Day 48

We are often involved in win-lose events in which there are only two outcomes. The probabilities associated with a given success or failure are shown by the Bernoulli distribution, which is fairly basic.


Bernoulli Distribution Characteristics

Properties

  • There are only two possible outcomes
    • Outcome 1: Success –> Something happened (p)
    • Outcome 2: Failure –> Something didn’t happen (q = 1 - p)
  • The probabilities associated with the two outcomes, p and q, sum to 1 or 100%


Let’s return to Gracie’s lemonade stand business and her expectations for rain, which has a negative impact on her expected number of customers.

Gracie’s research indicates that the probability of rain on any given Saturday is p(rain) = 0.28 or 28%. When modeled as a Bernoulli random variable this probability of success is labeled p.

We already know that the probability of something not happening is one minus the probability of something happening. In a Bernoulli distribution the probability of failure is defined as q with q = 1 - p. So q in our example is 1 - 0.28, which equals 0.72 or 72%.
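A Bernoulli random variable is simple enough to simulate in a few lines. The sketch below assumes the 28 percent rain probability from the example and simulates a large number of hypothetical Saturdays to show the share of rainy ones settling near p.

```python
import random

p = 0.28     # probability of rain (success)
q = 1 - p    # probability of no rain (failure)

rng = random.Random(48)  # arbitrary seed so the run is repeatable
# Each simulated Saturday is True (rain) with probability p, otherwise False.
saturdays = [rng.random() < p for _ in range(10_000)]
share_rainy = sum(saturdays) / len(saturdays)

print(f"p = {p:.2f}, q = {q:.2f}")
print(f"share of simulated Saturdays with rain: {share_rainy:.3f}")
```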

We can visualize this Bernoulli random variable with a simple column chart.



By itself this isn’t terribly useful, but Bernoulli random variables set the stage for the geometric distribution and the binomial distribution, both of which pack plenty of practical relevance.



10.4 Geometric distribution

Day 49

A Bernoulli distribution shows the probabilities of success or failure for a given event. A geometric distribution shows how quickly we should expect our first success based upon those values.

Not surprisingly, the higher the probability of success, the sooner we should expect the first success to occur.

The geometric distribution calculates the probability of getting your first success on a given attempt.


\[\text{First success on attempt } n = (1-p)^{n-1} \times p\]

For instance, you can find the probability of having rain on the first day from our lemonade stand example.


\[\text{First rain occurs on day one} = (1-0.28)^{1-1} \times 0.28 = 0.28 \text{ or } 28\%\]

Day one matches the underlying probability of rain. On the other hand, there is only a 1.5 percent chance that it rains for the first time on the 10th day of business.


\[\text{First rain occurs on day ten} = (1-0.28)^{10-1} \times 0.28 = 0.0146 \text{ or } 1.5\%\]
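The formula translates directly into a small function. This is a minimal sketch in Python, with `first_success_on` as an illustrative name, that reproduces the day one and day ten figures.

```python
def first_success_on(n, p):
    """Probability that the first success arrives on attempt n."""
    # (n - 1) failures in a row, followed by one success.
    return (1 - p) ** (n - 1) * p

p_rain = 0.28
day_one = first_success_on(1, p_rain)
day_ten = first_success_on(10, p_rain)

print(f"first rain on day one: {day_one:.4f}")
print(f"first rain on day ten: {day_ten:.4f}")
```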

We can plot the distribution for the first 20 days, after which the probabilities approach zero.



The chart highlights that given our probability of success, a 28 percent chance of rain, it is most likely to get the first rain within just a handful of days.

We can also look at the cumulative probabilities from the geometric distribution by adding each probability to the sum of the previous probabilities for each stage.



Plotting the cumulative probabilities shows that the chance of having seen the first day of rain grows as more days are taken into consideration.



This tells us, for example, that there is a 96.3 percent chance that we have had our first rain by the 10th day. By the 20th day, having a first rain is a near certainty as the probability approaches 100 percent.
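The running total can be sketched as follows. The example also checks the total against a shortcut: the chance of at least one rainy day in n days is one minus the chance of n dry days in a row. The shortcut is standard probability rather than something taken from the worksheet.

```python
from itertools import accumulate

p = 0.28
days = range(1, 21)

# Probability that the first rain falls on each day, then the running total.
first_rain = [(1 - p) ** (n - 1) * p for n in days]
cumulative = list(accumulate(first_rain))

# Shortcut: 1 minus the probability of n dry days in a row.
shortcut = [1 - (1 - p) ** n for n in days]

print(f"first rain by day 10: {cumulative[9]:.3f}")
print(f"first rain by day 20: {cumulative[19]:.4f}")
```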

You can find the calculations for the rain example here.


Business conversions

The geometric approach works with any set of success-failure probabilities. In the business world we often attempt to convert some wider audience into a lead or sale. Since the world is big, conversion rates are generally low.

Let’s say your conversion rate from online ads was only 0.4 percent, meaning that one in every 250 people who see the ad click on it and get redirected to your website. We can model this scenario as we did for the rain expectations.



The chart on the right is especially useful because with a low probability of success there will be many sequential events that still have differentiated probabilities.

In this case, we can see that if a marketer only gets their ad in front of 100 people, there is only a 33 percent chance that a conversion will have occurred. The conversion probability more than doubles to 87 percent after 500 people have viewed the ad and approaches 100 percent by the 1,000th viewer.
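These cumulative figures follow from the same logic as the rain example. A minimal sketch, assuming the 0.4 percent conversion rate above and using `at_least_one_by` as an illustrative function name:

```python
def at_least_one_by(viewers, rate):
    """Probability that at least one conversion has occurred after this many viewers."""
    return 1 - (1 - rate) ** viewers

conversion_rate = 0.004  # 0.4 percent, or one in every 250 viewers

by_100 = at_least_one_by(100, conversion_rate)
by_500 = at_least_one_by(500, conversion_rate)
by_1000 = at_least_one_by(1000, conversion_rate)

for viewers, chance in ((100, by_100), (500, by_500), (1000, by_1000)):
    print(f"{viewers:>5} viewers: {chance:.1%} chance of at least one conversion")
```

Changing `conversion_rate` and rerunning is a quick way to see how much a marginally better rate shifts these expectations.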

These likelihoods are very useful to set business expectations when conversion rates are known and to quantify the impact of even marginally improving them.



10.5 Binomial distribution

Day 50

The binomial distribution builds on the Bernoulli distribution, in which we know the probabilities of success and failure for a given event.

Instead of just one occurrence or trial, however, the binomial distribution evaluates a certain number of trials that are denoted n. The distribution then shows the likelihood of getting a given number of successes, k, across all trials.


Binomial Distribution

Properties

  • Displays the probabilities associated with a given number of successes (k) occurring within a certain number of trials (n) based on the probability of success (p).

Appropriate when each trial:

  • doesn’t impact the outcome of another (independence)
  • can be evaluated as a success or a failure
  • has the same probabilities associated with success-failure outcomes


Let’s continue with the lemonade stand example in which there is a 28 percent chance of rain on any given Saturday morning. How many Saturdays throughout the year are expected to have rain?

We can model this with the binomial distribution formula.


\[p(k \text{ successes in } n \text{ trials}) = \frac{n!}{k!(n-k)!} \times p^k (1-p)^{n-k}\]

A bit of a mouthful - thankfully you will find spreadsheet shortcuts below. But if you did want to use the full formula, you could construct the following to estimate the probability of having 10 days of rain (k) during the 52 Saturdays in a year (n) with the 28 percent underlying chance of rain (p).


\[p(\text{10 rainy days from 52 Saturdays}) = \frac{52!}{10!(52-10)!} \times 0.28^{10} (1-0.28)^{52-10} = 0.0477 \text{ or } 4.8\%\]

Note that the ! indicates a factorial, for which you take the number shown and multiply it by every whole number below it, all the way down to one. 10! = 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 3,628,800. This can also be done in a spreadsheet using =FACT(10).
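The full formula can also be written out in a few lines of code. The sketch below uses Python's `math.comb` for the n! / (k!(n-k)!) term and `math.factorial` for the factorial itself. The function name `binomial_probability` is illustrative.

```python
from math import comb, factorial

def binomial_probability(k, n, p):
    """Probability of exactly k successes in n trials."""
    # comb(n, k) equals n! / (k! * (n - k)!)
    return comb(n, k) * p**k * (1 - p) ** (n - k)

ten_factorial = factorial(10)
ten_rainy_days = binomial_probability(10, 52, 0.28)

print(f"10! = {ten_factorial:,}")
print(f"p(10 rainy days from 52 Saturdays) = {ten_rainy_days:.4f}")
```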


The binomial distribution is revealed when you apply the formula above to a range of possible success counts. Here are the probabilities of achieving k days of rain across the 52 Saturdays.



We find a roughly symmetric distribution from which we can start building confidence for rain outcomes during the year. It appears that the most likely number of rainy days is between 10 and 20. In fact, if we look at the respective probabilities, we find a 91 percent chance that the number of rainy days will be in this range.
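Adding up the individual probabilities across a range is straightforward once every success count has been calculated. A minimal sketch for the rain example, with illustrative variable names:

```python
from math import comb

n, p = 52, 0.28

# Probability of exactly k rainy Saturdays, for every possible k from 0 to 52.
distribution = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

most_likely = max(range(n + 1), key=lambda k: distribution[k])
p_10_to_20 = sum(distribution[10:21])  # k = 10 through 20 inclusive

print(f"most likely number of rainy Saturdays: {most_likely}")
print(f"p(10 to 20 rainy Saturdays) = {p_10_to_20:.3f}")
```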


Specific number of successes

We can calculate the individual probabilities associated with a specific number of successes in Microsoft Excel and Google Sheets using BINOM.DIST().


=BINOM.DIST(number of successes, number of trials, probability of success, FALSE)


The FALSE input at the end will return the probability for a given number of successes based on the number of trials and the underlying probability of success. So to calculate the probability that there are 10 days of rain across the 52 Saturdays, use:


=BINOM.DIST(10, 52, 0.28, FALSE) = 0.0477 or 4.8%


You can find solutions to the spreadsheet approach here.


Maximum number of successes

The binomial cumulative distribution returns the probability of getting at most a given number of successes within a certain number of trials. The only difference from the spreadsheet function above is the use of TRUE to return the cumulative probability.


=BINOM.DIST(number of successes, number of trials, probability of success, TRUE)


What is the probability that there are a maximum of 10 days with rain?


=BINOM.DIST(10, 52, 0.28, TRUE) = 0.1020 or 10.2%
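The cumulative version is simply the sum of the individual probabilities from zero up to the chosen count. A sketch of that idea in Python, where `at_most` is an illustrative helper rather than a spreadsheet function:

```python
from math import comb

def exactly(k, n, p):
    """Probability of exactly k successes in n trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def at_most(k, n, p):
    """Probability of k or fewer successes in n trials."""
    return sum(exactly(i, n, p) for i in range(k + 1))

at_most_10 = at_most(10, 52, 0.28)
at_most_9 = at_most(9, 52, 0.28)

print(f"p(at most 10 rainy Saturdays) = {at_most_10:.4f}")
# The gap between neighboring cumulative values is the single-count probability.
print(f"p(exactly 10 rainy Saturdays) = {at_most_10 - at_most_9:.4f}")
```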



Consulting capacity

Let’s explore another example.

You run a consulting business and work mostly with government organizations. These organizations have a short window at the end of each fiscal year to accept proposals and take a long time to make approval decisions.

Each year you send proposals to as many relevant opportunities as possible. When a proposal is accepted you need to quickly match a lead consultant to the new project. But the labor market is tight and it isn’t easy to find quality talent on short notice.

To combat this you arrange for a certain number of vetted consultants to be ready to assist if needed.

How many consultants should you keep on retainer? It should depend on the number of proposals you are submitting and the probability that a given proposal will be accepted.

Historically, only 5 percent of proposals are accepted. This year you have submitted 250 proposals to various opportunities.

You could multiply 250 by 5 percent to find the expected number of new projects, which is 12.5. Or you could use the binomial distribution to gain deeper understanding of the probabilities along the potential project count range.



You now have a much better sense for how to make decisions on your retainer approach. There is very little chance (1.5%) that you’ll need more than 20 consultants. There is also a very good chance (86%) that you’ll end up needing between 8 and 17.
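The probabilities quoted for the consulting example can be approximated with the same binomial building blocks. The sketch below assumes 250 independent proposals with a 5 percent acceptance rate each, which is the simplification the binomial distribution relies on.

```python
from math import comb

proposals, acceptance_rate = 250, 0.05

# Probability of winning exactly k projects, for every k from 0 to 250.
project_counts = [
    comb(proposals, k) * acceptance_rate**k * (1 - acceptance_rate) ** (proposals - k)
    for k in range(proposals + 1)
]

expected_projects = proposals * acceptance_rate
p_more_than_20 = sum(project_counts[21:])   # 21 or more projects
p_8_to_17 = sum(project_counts[8:18])       # 8 through 17 projects inclusive

print(f"expected projects:        {expected_projects}")
print(f"p(more than 20 projects): {p_more_than_20:.3f}")
print(f"p(8 to 17 projects):      {p_8_to_17:.3f}")
```

Swapping in different retainer thresholds for 20, 8, and 17 shows how the chance of being over or under capacity changes with each choice.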

Your final decision will depend on (1) the cost of retaining talent that might go unused and (2) the missed financial opportunity to pursue a given project due to lack of resources. Either way, the binomial distribution adds a lot of information beyond the baseline expected consultant need of 12.5 (250 × 5%).