# 7 Finding relationships

In this section we cover how to examine and describe relationships between two or more variables. Specifically, we will learn techniques to work with numeric, categorical, or a combination of variable types.

## 7.1 Numeric

### 7.1.1 Correlations

*Day 25*

Let’s begin with a way to describe *linear* relationships between two numeric variables through the **correlation coefficient**, usually denoted as lowercase `r`

.

The possible values for the correlation coefficient range between negative one and positive one.

**Possible Correlation Values**

**1: A perfect positive linear relationship.**For every unit increase in one variable, the other variable goes up by an equal amount.**0: No linear relationship.**A change in one variable is not associated with movement in the other variable.**-1: A perfect negative linear relationship.**For every unit increase in one variable, the other variable goes down by an equal amount.

The best visualization to see correlations are scatter plots, which display pairs of numeric values.

Graph 1 highlights a positive relationship between two variables with a correlation value of r = 0.73. In graph 2 we see a clear negative relationship, quantified by an `r`

value of r = -0.87. Finally, in graph 3, it is difficult to identify any relationship between the two variables and the `r`

value is very close to zero.

You can calculate the correlation coefficient with the following formula.

\[\text{r = }\frac{\text{Sum ((observed x - mean x) * (observed y - mean y))}}{\sqrt{\text{((Sum(observed x - mean x)^2) * (Sum(observed y - mean y)^2))}}}\]

It looks more complex than it is and you can see it worked out here in `Google Sheets`

for `gdp`

and `population`

from the countries dataset.

But of course there are embedded spreadsheet functions that will take care of the calculations for you behind the scenes.

`=CORREL(range for data series 1, range for data series 2)`

When you pass two data ranges into the function above, the spreadsheet will return a correlation value between -1 and 1.

**Non-linear relationships**

A correlation value of zero does not necessarily imply that two variables are not related. We can only conclude that they are not *linearly related*.

This is why it is so important to look at the visual relationship in addition to reviewing correlation values. Take `salary`

and `years of work experience`

data as an example.

If you only look at the correlation value you would find a `r`

of -0.02, which indicates no *linear* relationship. Although the data points have a correlation value near zero, when you plot them you clearly see that they are related in a non-linear way.

Income initially increases along with years of work experience before eventually declining as people wind down their careers.

It’s always a good idea to visualize variables when searching for relationships in order to spot such non-linear associations that would otherwise have been missed through the correlation coefficient alone.

**Spurious correlations**

Another important fact is that strong linear correlation between two variables does not imply that one of them is causing a change in the other. This leads to the often used saying that **correlation does not imply causation**.

You’ll find many examples of **spurious correlations** on Tyler Vigen’s fantastic website. Give it a visit to see the tragic relationship between Nicolas Cage films and people drowning in pools.

Are Nicolas Cage film appearances causing people to fall into pools? Maybe, but it is highly unlikely!

Another amusing example shows the correlation between the divorce rate in Maine and the per capita consumption of margarine at nearly 1. Should we conclude that eating margarine causes people to get divorced? Of course not.

The main point is that causation can’t be determined from observational data alone. The only way to validate causation is through randomized, controlled experimentation.

### 7.1.2 Correlation matrix

*Day 26*

It is useful to see all the linear correlation values for your dataset in one place, especially if you have a large number of numeric variables. A **correlation matrix** facilitates this by creating a table with all variables repeated across the rows and columns.

Each cell in the matrix represents the correlation coefficient, `r`

, between the respective row and column variable combination. Down the diagonal of the correlation matrix you will typically find a value of `1`

, which represents the perfectly positive linear relationship between the same variable (e.g., `inflation`

and `inflation`

).

A **correlation matrix** like this can be created in spreadsheet tools such as `Google Sheets`

through use of `=CORREL()`

and you’ll find an example from our countries dataset here.

In the spreadsheet you’ll notice conditional heatmap formatting placed on the cells to draw readers attention to especially strong positive (blue) or negative (red) correlation values. This visual enhancement speeds up the identification of strong associations.

**What is strong? What is weak?**

It is natural to ask what levels of correlation values relate to different relationship strengths. So far all we’ve done is define a continuum from negative one (perfect negative relationship) to positive one (perfect positive relationship).

Although there are not exact cutoff points or widespread agreement, I’ve always liked this directional summary graphic provided in an undergraduate statistics course at Southeastern Louisiana University.

If you the descriptive ranges helpful, you can group the variables accordingly.

**Doing more**

Although beyond the scope of this data literacy program, you should know that statistical programming software such as R and Python give you the ability to generate much more complex correlation visuals.

We made the following plot using `library(corplot)`

in R. It shows the magnitude of the correlation in both color intensity and size of circle.

You could also choose to swap out the circles with the actual values.

In either variation, you likely can determine relationships more quickly than in the basic matrix.

**Incorporating categorical variables**

With `library(GGally)`

, again in R, we can add even more complexity by visualizing the numeric correlations and distributions against a categorical variable, demonstrated here with `region`

.

## 7.2 Categorical

### 7.2.1 Crosstab counts

*Day 29*

We learned that frequency counts and percentages are the primary summary approach for both nominal and ordinal data. Simply count the number of observations within a variable that match a given value label to get a sense of how they are distributed.

A **crosstab** or **contingency table** is the starting point when you need to summarize *two* categorical variables. Let’s look at `region`

and `income_level`

from the countries dataset. The pure frequency counts for the variables are as follows:

**Crosstabs** enable us to push these standalone summaries together to summarize the relationship between a pair of categorical variables.

The table displays the unique value labels on the vertical and horizontal headers and provides the frequency counts for all unique combinations in the center.

To build a crosstab for `region`

and `income_level`

we display their unique labels along the row and column headers, respectively.

Each combination in our data set is counted and displayed in the appropriate location. We can see that there are three countries in the `Americas`

, for example, that are in the `Lower middle income`

bracket.

It is also common to include **column totals** and **row totals** in crosstab tables. We end up with 25 distinct data points - a lot to potentially talk about.

**Pivot tables**

The most efficient way to build a crosstab in a spreadsheet is by taking advantage of **Pivot Table** functionality. Pivot tables significantly speed up exploratory analysis when compared with using multiple functions like `=COUNTIF()`

to explore individual value labels.

**Creating a pivot table in Google Sheets**

- Select all the cells from your table of interest
- Click
`Data`

from the menu bar - Select
`Pivot table`

- Click
`Create`

- Click
`Add`

next to`Rows`

in the Pivot table editor window to include desired row variables - Click
`Add`

next to`Columns`

in the Pivot table editor window to include desired column variables - Click
`Add`

next to`Values`

in the Pivot table editor window to include choose the type of summary data to show. Hint: If you want counts, select any variable that has no missing data.

Following these steps will populate the table below, which can be further customized from a design perspective or have other variables swapped in.

Notice that the column order for income is different than in our initial table. This is because `Google Sheets`

ordered the income level groups alphabetically. Although annoying, this can be overcome with some manual tweaks.

**Visualizing two categorical variables**

There are many approaches for visualizing multiple categorical variables. The standard starting point is a column chart, which enables us to split the first categorical variable by a second.

Here we see income levels split side-by-side and placed above their respective regions.

Another option is a bubble chart variant. It shows each categorical value combination in a similar way to the crosstab table, but with the size of the bubbles representing its frequency.

### 7.2.2 Crosstab percents

*Day 30*

We can also transform a crosstab or contingency table to reveal more data perspectives by calculating relative frequencies or proportions from different cohort perspectives.

Let’s first revisit our crosstab between `region`

and `income_level`

that shows all combinations for the 142 countries in the dataset.

**Percent of all**

We will first modify the table by dividing each cell by the total number of countries. This returns the proportion of each unique category combination.

The table shows us, for instance, that 9.9 percent of countries in the dataset are from the `Americas`

*and* are `Upper middle income`

. Through the row and column totals, it also reveals the relative frequency of each individual value label. For example, 16.2 percent of all countries are in the `Low income`

bracket.

**Row percentages**

Another perspective it to calculate row percentages based upon the total number of observations in each row.

Notice that in **row percentage** tables the `Total column`

has 100 percent values for each row. This is because we’ve calculated the proportion of each cells observations to that total row count. As an outcome, we can now say that 70.7 percent of countries in `Europe`

are `High income`

. The last `Total row`

tells us that 16.2 percent of *all* countries in the sample are `Low income`

.

**Column percentages**

Finally, we can make the proportion calculations based upon the total observations in each column. In our case, this reflects the `income_level`

groupings.

With **column percentage** tables we find 100 percent values for the final `Total row`

of the table. This allows us to focus on the `income_level`

variable as our unit of analysis. Among `Low income`

countries globally, 91.3 percent are located in `Middle East & Africa`

.

**So much to say**

There are many potential data points to focus on across the four crosstab variants. What you choose to report or explore further likely depends on what the results in each table look like and who your audience might be. We’ll discuss data storytelling techniques in later lessons to help you make such decisions.

**Calculating percentage crosstabs in spreadsheets**

It would be tedious to make each total, row, or column percentage calculation in a spreadsheet. Thankfully pivot tables have options to transform your table from counts to a desired percentage view.

You can toggle between the various options to mimic the approaches discussed above by selecting `Default`

(i.e., pure counts), `% of row`

, `% of column`

, or `% of grand total`

.

## 7.3 Mixed

*Day 31*

We are not limited to summarizing *only* numeric variables or *only* categorical variables. We are able to mix and match variable types across the descriptive techniques discussed earlier. There are several ways to present such information.

**Table format**

Let’s start by taking some common summary statistics for each unique region in the countries dataset.

We see than `Europe`

boasts average life spans more than 10 years longer than those observed in the `Middle East and Africa`

. At the same time the standard deviation for countries in the `Middle East and Africa`

is more than double Europe, indicating that some countries will be quite different than the regional mean.

**Two or more categorical variables**

We can further drill down into the data by including both `region`

and `income_level`

. This looks like a crosstab table, but instead of focusing on counts or relative proportions we’ll instead report the mean `life_expect`

for each unique value combination.

The table shows that average life expectancy tends to be longer in higher income countries. This is true for each world region.

One thing to keep in mind when crossing categorical variables is the sample size or number of observations, typically denoted `n`

. When you replace counts with other summary statistics it is no longer evident how much data is backing up each cell. This `n size`

can fall quickly as more variables are crossed against one another or the more unique values a given categorical variable may have.

**Dealing with a small number of observations**

When you know that there are a small number of observations you can decide to either:

**Hide summary statistics**for cells that have total observations below some threshold. Many data scientists will make this threshold 30, for reasons we’ll dive into later.**Show all the data but explicitly label the n sizes**associated with each cell so that readers know to take certain values with caution.

You could also choose a hybrid approach that hides cells below a threshold (e.g., 10 observations) *while* still showing `n sizes`

for the remaining values. This makes our results look sparse.

*Table Note: Data only shown for records with 10 or more observations.*

**A spreadsheet approach**

Pivot tables once again make finding summary values across variables quite easy and avoid having to use several instances of formulas such as `=AVERAGEIF()`

.

Simply select a numeric variable from the `Values`

field of the `Pivot table editor`

in Google Sheets and click the `Summarize by`

drop down to select a desired summary measure.

**Chart form**

There are many options to visualize numeric summaries for categorical variables. Here are two column chart examples for one and two categorical numeric summaries. A box and whisker chart would show more numeric detail, but is harder for many to interpret.

See the data visualization section of the appendix for more detail.