A Data Visualization Blueprint with 5 Guiding Questions

May 31, 2021

Visualization is one of our most powerful tools to summarize complex data relationships in a way that helps diverse audiences absorb key findings.

This can be approached in the traditional static sense, where we use Excel or PowerPoint to create a simple bar chart; as an infographic, where we intertwine a variety of visual elements to tell a story; or in interactive environments, where users take control through dynamic filtering and real-time exploration.

The format, style, or level of interaction you choose to share will likely involve a little science, but a lot of art. It is a mix because we have so many visualization tools at our disposal.

Communicating data findings

Everyone has different information digestion preferences and these preferences change depending on a variety of factors and situations.

As the data artist, you must choose how to surface results along a continuum that ranges from handing over the raw data to crafting a pithy social media post.

  • Raw Data
  • Interactive Dashboards
  • Comprehensive Reports
  • Executive Summaries
  • Webinars
  • Blogs
  • Infographics
  • Tweets

Which format you decide to deploy likely comes down to three key questions. First, what messages are you trying to convey? Second, who will be consuming the data output? Finally, where and how will they be consuming it?

The only certainty is that the typical person won’t have the tools, skillsets, or time to synthesize the raw data on their own. This means it is your job to create a visual summary that grabs the attention of your target audience.

Choosing the best visualization

There is a well-established library of traditional charts and graphs with which you can experiment.

  • Area Chart
  • Bar Chart
  • Box & Whisker
  • Bubble Chart
  • Circle Diagram
  • Column Chart
  • Formatted Table
  • Histogram
  • Line Chart
  • Map
  • Pie Chart
  • Radar Chart
  • Scatter Plot
  • Tree Diagram

Some data types go better with certain visuals. For instance, if we want to show the relationship between two numeric variables, we lean on scatter plots. If we are comparing across groups, bar or column charts are often the best bet. And if we are tracking trends over time, a line chart generally does the job most effectively.
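The rules of thumb above can be written down as a small lookup. This is only a sketch: the task names ("relationship", "comparison", "trend") and the function `suggest_chart` are hypothetical labels we made up for illustration, and the result is a starting point rather than a rule.

```python
# Starting-point chart types taken from the paragraph above.
# The dictionary keys are hypothetical task labels, not a standard vocabulary.
CHART_SUGGESTIONS = {
    "relationship": "scatter plot",        # two numeric variables
    "comparison": "bar or column chart",   # values across groups
    "trend": "line chart",                 # change over time
}


def suggest_chart(task: str) -> str:
    """Return a first chart type to try, or a nudge to experiment."""
    return CHART_SUGGESTIONS.get(task, "no default; try several chart types")


print(suggest_chart("trend"))  # line chart
```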

That said, there are no concrete rules and in most cases you’ll try multiple approaches and ultimately be forced to use your best judgement. During this process, the from Data to Viz website can be an invaluable resource to help you narrow down appropriate chart types.

In terms of general design principles, I also recommend the online book DATA + DESIGN. It shares tactical advice for creating attractive and digestible visualizations and also discusses the ongoing balance that data practitioners must work towards.

“In most cases, you want a chart or graph to be able to stand alone, outside of any narrative text. It’s important to find a balance between giving enough information to your audience and keeping your text simple.” | DATA + DESIGN

We want a visualization to contain enough information that a reader will find meaning in it without additional context. At the same time, we want to keep it simple and clean enough so that they won’t be overwhelmed.

Thankfully, even without explicit rules to create attractive and effective visualizations, we can still make things easier by considering five guiding questions that apply in most situations.

To do so, let’s take a look at average college_share values by world region from this countries dataset. Keep in mind that it is possible for a country to have a college_share above 100 percent due to inbound student mobility.

If we were to calculate the average by world region and call the default column chart in Microsoft Excel, the output would look something like this.
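For readers working outside a spreadsheet, the same regional average can be sketched in plain Python. The rows below are invented placeholders, not values from the countries dataset, and every region name other than Europe and Middle East & Africa is an assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows standing in for the countries dataset:
# (country, region, college_share as a fraction)
rows = [
    ("Country A", "Europe", 0.80),
    ("Country B", "Europe", 0.70),
    ("Country C", "Americas", 0.55),
    ("Country D", "Americas", 0.45),
    ("Country E", "Asia & Pacific", 0.40),
    ("Country F", "Middle East & Africa", 0.25),
    ("Country G", "Middle East & Africa", 0.15),
]


def average_by_region(records):
    """Group college_share values by region and average each group."""
    grouped = defaultdict(list)
    for _country, region, share in records:
        grouped[region].append(share)
    # Alphabetical order mirrors what a default chart would show.
    return {region: mean(values) for region, values in sorted(grouped.items())}


region_means = average_by_region(rows)
for region, value in region_means.items():
    print(f"{region}: {value:.2f}")
```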

To be honest, basic spreadsheet software has improved substantially in recent years in terms of default visualizations and formatting options. Still, there is a lot that we can consider in order to improve this baseline chart with the goal of maximizing understanding and engagement.

Five Guiding Questions

1. How can you craft a title to capture attention?

The first element to consider is the title, which surprisingly tends to be the most overlooked. Your title is a chance to summarize key messages and clearly define what the readers are looking at.

In our example, it is accurate but limiting to simply say Average college participation rates. We could add a bit more depth by changing it to College participation by world region in 2020.

If you really want to take advantage of the opportunity and give your readers a clear takeaway, consider a main title like Europe leads the world in access to higher education and a subtitle adding the data definition Average college participation rates by world region in 2020.

If people love your visual, they will want to share it. They may snap a picture at a conference or take a screenshot from a report before forwarding it on. Although it is great for others to benefit from your creation, it would also be good for you or your organization to get some credit. This is why it is also a good idea to add a logo or source text when there is a chance a visual may be shared more widely.

We’ve added Source: World Bank World Development Indicators (WDI) in smaller text on the bottom of our chart so that readers at least know where the underlying data comes from.

2. What can you do with axis values and labels to reduce visual clutter?

Next let’s look at our axes, both the vertical y-axis that is showing the mean college participation rate, and the horizontal x-axis that is displaying the distinct world regions from our dataset. We can make two quick fixes.

Numeric formatting

Change the default Excel labels on the vertical axis from decimals such as 0.2 and 0.4 to their percentage form of 20 percent and 40 percent.
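If the chart is built in code rather than in Excel, the same fix is a one-line format. Here is a minimal sketch using Python’s built-in percent format; the helper name `as_percent_label` and the tick values are our own choices for illustration.

```python
def as_percent_label(value: float) -> str:
    """Turn a fraction such as 0.2 into a whole-number percent label."""
    return f"{value:.0%}"


tick_values = [0.0, 0.2, 0.4, 0.6, 0.8]
tick_labels = [as_percent_label(v) for v in tick_values]
print(tick_labels)  # ['0%', '20%', '40%', '60%', '80%']
```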

Text wrapping and font direction

Adjust the font spacing and wrapping of the category labels shown on the horizontal axis to give the Middle East & Africa label more room to breathe.
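In code, long category labels can be wrapped before they are passed to a charting library. This sketch uses the standard library’s `textwrap` module; the 12-character width is an arbitrary assumption that you would tune to your chart size.

```python
import textwrap


def wrap_label(label: str, width: int = 12) -> str:
    """Break a long axis label onto multiple lines at word boundaries."""
    return textwrap.fill(label, width=width)


print(wrap_label("Middle East & Africa"))
# Middle East
# & Africa
```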

Although not a problem here, it is best to avoid vertical or rotated text labels because they reduce legibility. If you have too many labels, consider switching to a horizontal bar chart.

3. Is it possible to highlight specific data points on which you want readers to focus?

If you have relatively few data points to show, it is usually a good idea to completely remove the axis with the numeric summary and simply label the actual values on each column.

If you do this, you can also remove the horizontal gridlines to further reduce clutter. Just make sure to keep your minimum vertical axis value at zero since the distance between that point and the top of the column is what creates an accurate proportional representation.

Taking these steps works well with our example as we only have four data points.

Finally, think about how you choose to sort the results in the chart. Our example began in alphabetical order. If our focus really is Europe, we could also sort based on college_share values from high to low.

In this case, you could consider adding a reference line to help readers further internalize the difference between Europe and the other world regions.

Here we see the impact of our changes. The vertical axis is gone, data labels are added, and results are sorted from high to low. In my opinion, these steps help focus the reader on the main message without introducing undue complexity.
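The same three changes can be sketched with matplotlib for readers who prefer code to spreadsheets. This is one possible approach, not the method used to build the chart in this post. The regional values are hypothetical placeholders, and `bar_label` assumes matplotlib 3.4 or later.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical regional averages; not the values from the countries dataset.
region_means = {
    "Americas": 0.50,
    "Asia & Pacific": 0.40,
    "Europe": 0.75,
    "Middle East & Africa": 0.20,
}

# Sort from high to low so the leading region appears first.
ordered = sorted(region_means.items(), key=lambda item: item[1], reverse=True)
labels = [name for name, _ in ordered]
values = [value for _, value in ordered]

fig, ax = plt.subplots(figsize=(7, 4))
bars = ax.bar(labels, values, color="steelblue")

# Label each column directly, then drop the numeric axis and gridlines.
ax.bar_label(bars, labels=[f"{v:.0%}" for v in values], padding=3)
ax.yaxis.set_visible(False)
ax.grid(False)
for side in ("top", "right", "left"):
    ax.spines[side].set_visible(False)

# Keep the baseline at zero so column heights stay proportional.
ax.set_ylim(bottom=0)
```

Calling `fig.savefig("chart.png")` afterwards would write the result to a file; the filename is just an example.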

Regardless of your final decision, the right question to ask is, does the selected ordering or data highlighting make sense and how might it help or hurt a reader’s interpretation of the results?

4. Is a legend really needed?

Another element at our disposal is the legend, something that shouldn’t be included if the information is already embedded into the chart.

In our simple example, a legend doesn’t make sense as the horizontal axis labels make it clear which region is associated with each column. But let’s add another variable, income_level, to see a case where a legend is required.

This action introduces another judgment call. Where do we place the legend? I generally recommend putting it underneath the title if it shows the different color categories that appear in the chart so that readers can scan the color assignments before getting to the data results.

It is also helpful to keep the ordering consistent between the legend and the visual. For instance, the one we added goes left-to-right, matching how the data is displayed for each region. A vertical legend would have been inconsistent with the rest of the visual, forcing readers to take unnecessary mental steps to match the legend to the columns.

5. How can color enhance interpretation?

The final element we consider is color, which can either enhance or detract from our final result.

Thankfully there are great resources that provide guidance in terms of both color scale selection and color code identification, which can then be plugged into spreadsheet software.

Multiple colors

We recommend a website called Color Brewer, which asks you how many distinct colors you need and what type of data scale you are using. If you don’t know the answers to these questions, the website also helps you match data scales to color scales.

  • Sequential: Use light to dark gradients of the same color when there is a natural low-to-high ordering as with ordinal data. We used this approach in the income levels chart above.
  • Diverging: Use when values can go above or below some baseline level. For instance, GDP growth ranges, which can include both negative and positive values.
  • Qualitative: Use colors that look nice together but don’t convey a natural ordering for nominal variables such as world region.
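The three scale types above boil down to two questions about the variable: does it have a natural order, and does it have a meaningful midpoint? A rough sketch of that decision follows. The function name and arguments are hypothetical, and the scheme names are simply examples of ColorBrewer palettes for each type.

```python
# One example ColorBrewer scheme name per scale type (many others exist).
EXAMPLE_SCHEMES = {
    "sequential": "Blues",
    "diverging": "RdBu",
    "qualitative": "Set2",
}


def pick_scale_type(ordered: bool, has_midpoint: bool = False) -> str:
    """Match a variable to a color scale type using the guidance above."""
    if has_midpoint:
        return "diverging"    # values fall above or below a baseline
    if ordered:
        return "sequential"   # natural low-to-high ordering
    return "qualitative"      # nominal categories with no ordering


print(pick_scale_type(ordered=True))                     # income_level
print(pick_scale_type(ordered=True, has_midpoint=True))  # GDP growth
print(pick_scale_type(ordered=False))                    # world region
```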

All the same color

Another choice would be to just use one color. For instance, we could keep the columns blue for all regions and rely on the labels to do the talking. This is the default Excel selection.

A single callout color

We could also use one specific color to highlight a value to which we want readers to pay attention. Since our narrative is about Europe boasting higher college participation rates than other regions, we might decide to make it green while changing the other regions to a neutral gray.

Now Europe stands out in terms of both ordering and color. It would be hard for a reader to miss the main data message.
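When a chart is built in code, the callout approach amounts to assigning one color per category. A minimal sketch, assuming hypothetical region names and two hex codes we picked for illustration (any green and gray would do):

```python
HIGHLIGHT = "#2ca25f"  # a green, chosen for illustration
NEUTRAL = "#bdbdbd"    # a light gray, chosen for illustration


def callout_colors(categories, focus, highlight=HIGHLIGHT, neutral=NEUTRAL):
    """Return one color per category, highlighting only the focus category."""
    return [highlight if name == focus else neutral for name in categories]


# Region names other than Europe and Middle East & Africa are assumptions.
regions = ["Europe", "Americas", "Asia & Pacific", "Middle East & Africa"]
colors = callout_colors(regions, focus="Europe")
print(colors)
```

Most plotting libraries accept a list like this as the color argument for a set of columns.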

Finally, it is worth considering the ability of our audience to discern color differences.

Black and white print outs

Some people still print out presentations and reports. It is therefore useful to test how your graphics would look with both color and black-and-white print settings.
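A quick way to anticipate the black-and-white result is to convert each color to an approximate gray level and check that the levels are far enough apart. The sketch below uses the common Rec. 601 luma weights. The `min_gap` threshold of 40 is an arbitrary assumption, and real printers may convert colors differently, so treat this as a rough screen rather than a guarantee.

```python
def hex_to_gray(hex_color: str) -> float:
    """Approximate brightness (0 to 255) using Rec. 601 luma weights."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) for i in (0, 2, 4))
    return 0.299 * r + 0.587 * g + 0.114 * b


def distinguishable_in_gray(colors, min_gap: float = 40) -> bool:
    """Check that every pair of neighboring gray levels differs by min_gap."""
    grays = sorted(hex_to_gray(c) for c in colors)
    return all(high - low >= min_gap for low, high in zip(grays, grays[1:]))


# Pure red and a dark gray look very different in color,
# but land on nearly the same gray level.
print(distinguishable_in_gray(["#ff0000", "#4c4c4c"]))  # False
print(distinguishable_in_gray(["#000000", "#ffffff"]))  # True
```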

Color blindness

It is estimated that roughly one in twelve men (about 8 percent) are colorblind. Certain color combinations will therefore be indistinguishable for them (source: http://www.colourblindawareness.org). Thankfully, the Color Brewer resource mentioned above provides color options to reduce this risk.


Final version

So here we are. We’ve gone from the default Excel chart on the left to the thought-out visual on the right.

We made subjective decisions as we went through each guiding question. Although you may or may not agree with our choices, this type of internal dialogue or collaborative conversation is extremely helpful when fine-tuning important visualizations.

Want to learn more about creating effective data visualizations? Take a look at the learning resources below.