4 Getting data
4.1 A starting point
Welcome to day 1 of the #66DaysOfData Literacy program. We start without numbers because most of us don’t just have data that is ready to explore.
This is one reason that it may feel intimidating to start working with data. I believe it is much easier to learn core data concepts if you understand how a dataset was constructed and are at least somewhat interested in what it has to say. Before we get there, however, let’s brainstorm some ways in which people get data in the first place.
Surveys remain a primary way for organizations to quantify the world. Although use cases vary (e.g., understanding why people like to travel or determining if current customers are happy), approaches are similar. Agree on what you want to know, come up with questions that help reach that goal, put them into a survey questionnaire, and attempt to get people to complete it. Oftentimes surveys are preceded by one-on-one interviews or focus groups that help uncover important themes that can then be validated more widely through a formal survey.
There are many survey services available with varying levels of support, functionality, and pricing. Survey Monkey is a good starting point and has a free tier. Qualtrics is great when you require more flexibility and expect more respondents to complete it. Although Google Forms doesn’t look as pretty, it is also an effective and free tool to distribute online questionnaires.
Customers, members, or users are another source of potential insights. Administrative records include information collected through normal business operations and can include data on who the customer is, what they look like, and what their purchase or engagement history has been.
Beyond formal interactions, customer behavior is now easier to track than ever if you have a digital property like a website, app, or social media channel. The hope is to use such information to improve the existing user experience or identify new product opportunities. Of course such tracking, and your response to what is uncovered, raises a set of ethical and legal questions that we’ll save for another day.
There is also a wide world of secondary data sources that exist beyond your organization’s servers.
Publicly available information is often released by government or international organizations. Topics such as employment data from the U.S. Bureau of Labor Statistics or economic trends from the IMF World Economic Outlook are good examples.
Although the techniques we’ll cover in these 66 days can be applied to many different data sets, it is important to have a good understanding of the data we’re using. This will help you ask better questions and internalize the analytical approaches you’re learning.
Click here to see the main dataset we’ll use. It is country-level data on a subset of variables from the World Bank’s World Development Indicators (WDI). You can make your own copy in Google Sheets or download to your desktop as an Excel file.
Tomorrow we’ll take a look at what’s included.
4.2.1 Survey logistics
Everyone is being asked for their opinion these days and surveys are increasingly seen as a burden. You likely receive - and perhaps ignore - multiple invitations each week. So when you make the decision to put a survey into the world, make sure that you don’t waste the research opportunity.
As previously discussed, it often helps to have informal conversations with people that fit your target population before building the actual survey questionnaire. This gives you the chance to test initial assumptions and uncover important perspectives from the group that you hadn’t previously considered.
It is also crucial to get input from your colleagues as you go through the survey development process, especially the people who have a vested interest in the final results. There is nothing worse than someone saying, “why didn’t we ask about this or that?”, when results are shared. And it is very difficult to go back after the survey closes to ask each respondent for one more piece of information.
Data science hiring plans in Europe
Let’s say that you are interested in understanding data science hiring plans for European companies. You decide to reach out to human resource (HR) professionals with recruiting responsibilities in the market.
Broadly speaking we can think of potential respondents as:
- Population: Global HR professionals who have recruiting responsibilities.
- Target population: HR professionals in Europe. We want to generalize results for this group.
- Sample: HR professionals in Europe for which we have contact information and will send the survey. Data from the respondents will be used to describe the target population.
We want our survey results to reflect the actual hiring plans for the companies in our target population. To do this, we need to take steps that minimize bias during survey creation and administration. The two most common types of survey bias include:
Sampling bias: When we don’t hear from people who reflect our target population. This can arise, for example, when we take a sample of convenience (e.g., only HR professionals who work with our partner organizations) instead of a random sample (e.g., all companies in Europe are equally likely to be invited and respond).
Response bias: Anything that encourages misleading responses. This can come from poorly worded survey questions such as, “with the economy in such poor shape this year, do you think it will really improve next year?” This is an example of a leading question. A better, more neutral variant is, “what do you think will happen with the economy next year?”.
A related issue comes from asking double barrel questions. Even if they avoid leading language, they are still difficult to decipher when summarizing results.
An example would be, “Why do you hire Business Analysts and Data Engineers?”. The reasons for a respondent hiring a Business Analyst are likely different than for a Data Engineer. When the roles are combined into the same question, it is impossible to disentangle the underlying drivers for each. Be cautious whenever you see the word “and” in a question. It is generally better to break such instances into multiple prompts.
Making analysis easier
You also want to make sure that there is consistency in your response options. Let’s look at the question, “Will you hire one or more data scientists next year?”.
Balance: It is generally best to have a balanced set of possible responses. Take these options.
- Not balanced:
Definitely not | Maybe not | Probably not | Probably yes | Yes
- Balanced:
Definitely not | Probably not | Probably yes | Yes
The first example has three negative options compared with two affirmative ones. This itself is somewhat leading. It also makes reporting look awkward and could raise methodological questions. The second set of options is balanced and could be evenly collapsed into No and Yes for further analysis.
Indifference: Although there is no right or wrong here, you often need to decide if respondents will be able to select an option of indifference. The benefit of including indifference is that you can quantify a sense of uncertainty. The drawback may be that indifference is harder to take action on. By removing the indifference option, you force respondents to take a stand on either side of the scale.
- Indifference:
Definitely not | Probably not | Not sure | Probably yes | Yes
- No indifference:
Definitely not | Probably not | Probably yes | Yes
Testing the survey
The draft survey needs to be programmed into survey software. It is very important that a preview version of the survey on the platform is widely tested. This will ensure that (1) the questions and logic are operating as expected and (2) that other stakeholders have one more chance to give feedback and make final content recommendations.
Another question to address at this stage is the survey length. Ask your colleagues to take the survey as if they were encountering the questions for the first time and report back how long it took to complete. The optimal length depends on the anticipated engagement of your audience as well as your expectations for the total number of responses.
The guiding philosophy here: the survey should be no longer than it needs to be. Keep it as short as possible without sacrificing the key topics you hope to better understand from the results.
The question types you choose matter as they place varying levels of mental burden on your respondents. A single-select multiple choice question on a non-controversial topic takes less effort than an open-ended text question asking respondents to justify their religious beliefs. Open-ended questions, which are more taxing, are generally best placed at the end of the survey and made optional as some people will drop off when they encounter these. When placed at or near the end, you should be able to salvage earlier responses from those dropping off.
Once you and other stakeholders are happy with the questionnaire, many decisions remain.
How will you deliver the survey to potential respondents?
Create a generic link: A generic link is one web link that you can share via an email invitation, on your website, through social media, or any other place that people may see it. The benefit of a generic link is that it is easy to distribute to the masses. The drawback is that you won’t know who started or completed the survey, unless you ask directly for that information in the questionnaire. This approach also makes survey reminders more problematic as you’ll likely have to blast everyone, even the people who already completed it.
Create a personalized link: A personalized link is a unique web link that relates to a specific potential respondent. It is generally connected to an email address that is used to send the invitation through the survey platform or your own mail service. The benefit of a personalized link is that you can easily track who opened, started, and completed the survey. This means that you can choose to send targeted reminders to those who have not yet engaged with it.
What messaging will be used to encourage a response?
You also need to decide what type of messaging to use in order to engage potential respondents and maximize survey completions. Given the volume of noise coming into people’s email accounts, it makes sense to keep the message short, clear, and inspiring. This is true for both the subject line and the body. Make it sound like it came from an actual human and clearly communicate the benefits of taking the survey.
It is also important to briefly explain what will be done with the survey responses. Are they anonymous? Confidential? Where will the information be stored? Given the long list of potential questions, it often makes sense to link to a full data privacy and survey participation policy.
Here is an example of a concise survey invite.
Subject: Help others improve data science hiring
Everyone is struggling to hire data talent today. We’re working with HR professionals from around Europe to understand key challenges and opportunities. Everyone who participates will receive the full set of results, complete with benchmarking opportunities.
Would you like to join the community? If so, please complete this 5 minute survey:
Start here –> https://www.surveyplatform.com/data_hiring
Your responses are anonymous and only aggregate data will be reported out. Click here for our data privacy policies or feel free to reach out to me directly.
Should you use incentives? What should they be?
Incentives can be a double-edged sword. Although they may be necessary given the competition for attention today, they may also indirectly introduce sampling bias as a certain type of person may become over-represented in the results.
If you decide to use an incentive, you also have to choose (1) what it should be and (2) who will be eligible.
Perhaps you will send a summary report of the final results to everyone who completed the survey. This is a non-monetary incentive.
It is also common for organizations to provide monetary alternatives, such as a five-dollar Amazon gift card, to help compensate respondents for their time. Some choose to give small amounts to everyone who responds. Others add a sense of urgency by only offering that amount to the first n respondents. Another approach is to select a larger incentive amount that will be limited to a smaller number of randomly selected respondents.
There are many options to consider. Regardless of what you choose, be sure to follow through on whatever promise you make and have a legal set of conditions available to share.
How often should you remind people to complete it?
Most survey projects send out reminders to further boost response rates. The number of reminders and any changes to messaging or incentives will likely depend on progress made against the original sample target. You normally should plan on sending at least one reminder message, but try to avoid sending more than two or you risk being seen as a spammer.
What are your success metrics?
Your ability to calculate various success measures depends on how you ultimately distributed the survey. Here are some possible options if you sent out invitations via email.
- Opened the email: Proportion of people who opened the initial message.
- Clicked on the survey link: With the right tracking software, you should also be able to determine the proportion of people who clicked on the survey call-to-action link.
- Response rate: Finally, you can divide the number of people who have taken the survey by the total number of people invited to find the response rate. Your response rate will depend on several variables such as your relationship with the audience, the effectiveness of your invitation message, the magnitude of included incentives, and the topic and length of the survey itself. There is also the potential distinction between partial completes - people who started but did not finish the survey - and full completes - the people who press the final submit button.
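These funnel metrics are simple ratios. Here is a minimal sketch in Python using made-up counts (all numbers below are hypothetical, not from the text):

```python
# Hypothetical funnel counts from an email-distributed survey.
invited = 1200          # total invitations sent
opened = 540            # opened the initial message
clicked = 260           # clicked the survey call-to-action link
full_completes = 150    # pressed the final submit button
partial_completes = 30  # started but did not finish

open_rate = opened / invited
click_rate = clicked / invited
response_rate = full_completes / invited  # counting only full completes here

print(f"Open rate:     {open_rate:.1%}")      # 45.0%
print(f"Click rate:    {click_rate:.1%}")     # 21.7%
print(f"Response rate: {response_rate:.1%}")  # 12.5%
```

Whether partial completes count toward the response rate is a reporting decision you should make (and disclose) up front.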
Building a questionnaire, delivering the survey to potential respondents, and collecting results takes a lot of coordination and effort. It is also just the starting point to make use of the insights found. Next we’ll look at how to conduct survey analysis based on the most common question types.
4.2.2 Survey analysis
A few basic question types represent the vast majority of content in most survey questionnaires. Understanding what they are and what can be done with their respective results is crucial for getting the most value from a survey project.
1. Multiple choice - select one answer
A multiple choice question in which respondents are limited to making only one selection can be used when there is only one possible choice (e.g., what month were you born?) or when you want to understand the most relevant selection from a set of given options (e.g., what is your favorite type of ice cream?).
Since respondents are forced to make only one choice, be sure to include all possible options. When that is not practical due to length, simply include an “other - please specify” option. In this case, we added an “I do not know” response in case the respondent doesn’t have enough information about hiring from the previous or the current year. These response options from our example create a nominal variable.
The most common way to summarize select-one questions is to share the proportion of responses from each unique option relative to the total number of respondents. The sum of these will always total 100 percent.
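As a minimal sketch, here is how those proportions can be computed in Python (the responses below are hypothetical, not from the dataset in the text):

```python
from collections import Counter

# Hypothetical select-one responses to "Did you hire a data scientist last year?"
responses = ["Yes", "No", "Yes", "I do not know", "No", "Yes", "No", "No"]

counts = Counter(responses)
total = len(responses)
proportions = {option: count / total for option, count in counts.items()}

for option, share in sorted(proportions.items(), key=lambda kv: -kv[1]):
    print(f"{option}: {share:.1%}")

# Proportions of a select-one question always sum to 100 percent.
assert abs(sum(proportions.values()) - 1.0) < 1e-9
```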
Two common feedback variants
Likert scale: This response scale attempts to gauge satisfaction or agreement from low to high. The survey may make a statement and provide Strongly disagree | Somewhat disagree | Somewhat agree | Strongly agree response options. Count and percentage summary measures are most common, although you may also see mean scores reported by assigning a numeric value to the ordered response sequence. This is mathematically possible because Likert scales lead to results that are ordinal in nature, although many statisticians would advise against using mean values in these situations.
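A quick sketch of both summaries, assuming the ordered options are mapped to the values 1 through 4 (the mapping and the responses are illustrative assumptions):

```python
from statistics import mean

# Assumed numeric mapping for the ordered Likert responses (1 = most negative).
scale = {"Strongly disagree": 1, "Somewhat disagree": 2,
         "Somewhat agree": 3, "Strongly agree": 4}

# Hypothetical responses to a single Likert statement.
responses = ["Strongly agree", "Somewhat agree", "Somewhat disagree",
             "Strongly agree", "Somewhat agree"]

counts = {option: responses.count(option) for option in scale}
mean_score = mean(scale[r] for r in responses)  # use with caution: ordinal data

print(counts)
print(round(mean_score, 2))
```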
Net Promoter Score (NPS): NPS is a popular and simple business metric created by Bain & Company. Respondents are asked how likely they would be on a scale from zero (not at all likely) to ten (extremely likely) to recommend a specific product, service, or company to other people. Respondents are then grouped accordingly.
Promoters: Responses of nine and ten.
Neutral: Responses of seven and eight.
Detractors: Responses of zero through six.
The final NPS score is the percentage of respondents who are promoters minus the percentage who are detractors. It can therefore range from minus 100 to positive 100 and is often reported without the percent sign (e.g., we have an NPS of 32). The higher the value, the better.
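The NPS calculation described above can be sketched in a short function (the scores below are hypothetical):

```python
def nps(scores):
    """Net Promoter Score: % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))

# Hypothetical 0-10 answers to "How likely are you to recommend us?"
scores = [10, 9, 9, 8, 7, 7, 6, 5, 10, 3]
print(nps(scores))  # 4 promoters, 3 detractors, 10 respondents -> NPS of 10
```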
2. Multiple choice - select all answers that apply
Sometimes you don’t want to limit your respondents to only one choice. A select-all multiple choice question allows people to choose as many options as are applicable. Unlike select-one multiple choice questions, summary percentages will usually sum to more than 100 percent as respondents are able to pick all the options that reflect their reality.
Here we see that it is more common for companies to anticipate Data Scientist hires (77.0%) when compared with Data Architect hires (28.5%).
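Computing select-all shares means dividing each option’s pick count by the number of respondents, not by the number of picks. A minimal sketch with hypothetical answers:

```python
# Hypothetical select-all answers: each respondent picks every role they plan to hire.
respondents = [
    {"Data Scientist", "Data Analyst"},
    {"Data Scientist"},
    {"Data Scientist", "Data Architect", "Data Engineer"},
    {"Data Analyst", "Data Engineer"},
]

total = len(respondents)
options = sorted(set().union(*respondents))
shares = {o: sum(o in picks for picks in respondents) / total for o in options}

for option, share in shares.items():
    print(f"{option}: {share:.0%}")

# Unlike select-one questions, these shares can sum past 100 percent.
print(f"Sum: {sum(shares.values()):.0%}")
```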
3. Rank order
Select-one multiple choice questions limit the respondent to one response. Although select-all questions enable a respondent to choose as many options as relevant, they don’t reveal how much more important one selection is compared with another. That’s where rank-order questions come in.
Rank order questions explicitly ask respondents to make value judgments against a series of options. There are a few ways to analyze such responses. Assuming everyone responds to each option, you should have a rank for each strategic option. In our case, this ranges from a rank of one to a rank of four.
The mean rank calculates the average selected rank for each response option. A lower average value indicates a higher rank (e.g., 1.2) and a higher average value indicates a lower rank (e.g., 3.4).
Here we see that launching new products has a higher average rank (1.96) when compared with entering new markets (2.73), a glimpse into the strategic plans of these companies.
Proportion of ranked values
You could also look at how many times a specific rank was assigned to a given option. For instance, 63 percent of respondents assigned launching new products as their top priority. Meanwhile, only 6.5 percent of respondents indicated that entering new markets was the top goal.
Although both approaches show the same directional finding, the proportion of ranked values measure is likely more impactful in this case as it demonstrates wider numeric differentiation.
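Both rank-order summaries, the mean rank and the proportion of top ranks, can be sketched as follows (the ranks below are hypothetical, not the 1.96/2.73 figures from the text):

```python
from statistics import mean

# Hypothetical ranks (1 = top priority) assigned by five respondents to two options.
ranks = {
    "Launch new products": [1, 1, 2, 1, 3],
    "Enter new markets":   [3, 2, 4, 2, 1],
}

# Mean rank: a lower average indicates a higher priority.
mean_ranks = {option: mean(r) for option, r in ranks.items()}

# Proportion of respondents who assigned rank 1 to each option.
top_share = {option: sum(x == 1 for x in r) / len(r) for option, r in ranks.items()}

print(mean_ranks)  # launching new products earns the better (lower) mean rank
print(top_share)
```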
4. Text entry
Text entry questions work well when you don’t have a clear idea of which response options frame a certain topic, or when you just want to give respondents a chance to provide deeper, less-controlled input.
Analysis of text data is not as straightforward as the other question types covered. However, we can use basic text analytics and Natural Language Processing (NLP) to get a sense of what people are thinking.
Here are some customer reviews from an e-commerce platform to use as an example with responses shown for the first ten respondents.
It would be time consuming to go through the 23,486 records manually to tag certain words or emotions. Thankfully we can turn this into a data problem by putting each word in its own row and then analyzing the adjusted data series.
It then becomes easy to count which words appear most often and even apply sentiment analysis to get a sense of whether customers are satisfied with the service. For now we’ll just take a look at the most frequently mentioned words, excluding common stop words such as “the” and “and”.
Overall, customers look pretty happy. A wordcloud is a common way to visualize which terms dominate the conversation, aligning word size with the number of mentions.
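The one-word-per-row approach and the frequency count can be sketched in a few lines of Python (the reviews and stop-word list below are illustrative assumptions, not the 23,486-record dataset):

```python
import re
from collections import Counter

# A few hypothetical customer reviews standing in for the full dataset.
reviews = [
    "Love this dress, the fit is perfect!",
    "The fabric feels cheap but the color is lovely.",
    "Perfect fit and great color. Love it.",
]

# A tiny illustrative stop-word list; real analyses use longer standard lists.
stop_words = {"the", "is", "and", "it", "this", "but", "a"}

# One word per row: flatten every review into individual lowercase tokens.
words = [w for text in reviews for w in re.findall(r"[a-z']+", text.lower())]
counts = Counter(w for w in words if w not in stop_words)

print(counts.most_common(3))
```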
Multiple choice (select one), multiple choice (select all), rank order, and text entry are the primary methods that researchers employ in surveys to better understand the world. Although there are more advanced question types and analytical techniques, understanding these big four provides you with the foundation to uncover meaningful insights from survey results.
4.3 Data dictionaries
Let’s say your manager asks you to retrieve some market data. The business goal is to decide in which country the company should open its first global office. Senior leadership wants a stable operating environment that also has an attractive domestic market for new sales opportunities.
You were fortunate enough to find a global dataset posted by an economics graduate student on his website. It includes several interesting-sounding variables. Score!
You open the file in Excel and are pleased with the country coverage. The top of the dataset looks something like this:
Note: you can access the full dataset here.
The good news is that the table appears optimally organized for data analysis. It consists of:
- Rows: Observations from a single object of interest (e.g., Argentina)
- Columns: Variables that describe or measure something for each object (e.g., Population)
- Cells: The value of a specific variable (e.g., 44.9 million)
The more you look at it, the more you realize that just because you have data doesn’t necessarily mean that you’re ready to start making sense of it. Some of the column headers seemingly make sense, such as population. Others, like college_share, are rather ambiguous. So, what do you do?
This lack of clarity is a real problem - and here we are only talking about one small dataset. Imagine all the data tables your organization dumps into SQL Server with minimal documentation, or the datasets that provide supporting evidence for the news articles that you read.
We need to answer several questions before we have the confidence to perform meaningful analysis.
- What do the column names mean? Tip: When building your own datasets, make column names as descriptive as possible. Also, avoid white space as this could turn into a headache in subsequent analysis. I prefer using all lowercase with the underscore character as needed. For example, instead of Population in 2020, consider population_in_2020.
- How was the data collected? Did the results come from a survey? How many people responded? Was it representative of a wider group of interest? Are the results based on internet traffic? Does it include people who weren’t signed into the site or that had their ad-blockers on?
- How often is it updated? One of the most common follow-up questions you’ll hear is, “has the data changed over time?” Knowing when the next data refresh will be available, or how old the current data is, is very helpful.
- Who is responsible for ensuring its updates and accuracy? Many companies have difficulty assigning ownership to datasets. If the data truly is an organizational asset, someone needs to be guiding the collection, cleaning, and storing process. If it’s not, why bother collecting and storing it in the first place?
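The column-naming tip above (all lowercase, underscores instead of white space) can be automated with a small helper. This is a minimal sketch; the function name and headers are hypothetical:

```python
import re

def to_snake_case(name):
    """Normalize a column header: lowercase, underscores instead of spaces/symbols."""
    name = re.sub(r"[^\w]+", "_", name.strip().lower())
    return name.strip("_")

headers = ["Population in 2020", "College Share (%)", "GDP per capita"]
print([to_snake_case(h) for h in headers])
# -> ['population_in_2020', 'college_share', 'gdp_per_capita']
```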
All of this information should be documented somewhere in a data dictionary or data glossary. This can be as simple as a shared Word doc or a dedicated tool connected directly to a company’s data storage systems.
What should you do about the data provided in our example above? You decide to reach out to the graduate student and ask these clarifying questions. You’re thrilled when she emails back a new spreadsheet that has more detail. She also informs you that the data, which was originally found in the World Bank’s World Development Indicators, represents the most recent year available for each variable.
Note: the data dictionary is available here.
Two days and still no analysis? What’s the deal? Well, another question you should ask of a new dataset is what types of variables are included and what they enable in terms of analytical possibilities.
In Day 3 we will introduce four specific data types to provide guidance on possible next steps.
4.4 Approaching analysis
As data becomes more integral in your planning and decision-making, you need a directional blueprint for approaching analysis. It is easy to get lost or distracted without one.
Here are seven steps to help keep your data projects on track.
Identify a clear question or problem you wish to address
You want to be sure that the analysis you are about to conduct has practical relevance. The best way to confirm this is to ask, “if we had this data, what would we do differently because of it?”. Having clarity around this question will help others understand the importance of your project and add more impact to the results you eventually present.
Speak with others who have a vested interest in the content or potential results
A typical mistake when conducting analysis is to take the initial concept and then lock yourself in a room until it is complete. But it is incredibly important that others in your organization - and potentially outside of it - have a chance to understand and provide feedback on the project objectives.
This will undoubtedly help refine your core set of questions. Giving others a voice will also make them more receptive to the final results due to their early involvement, reducing “why didn’t you think of x, y, and z” questions at the culmination.
Collect the required information
Once you have further clarified objectives based on these conversations, you are ready to start collecting data. Although the techniques you’ll use depend on the nature of the project, be sure to document what is being collected and store it in clean and accessible ways.
Explore what is available
Now the fun begins. You’re able to dig into what you’ve uncovered to explore relationships, trends, and anything else that will get you closer to your project objectives. In this stage, you may also find that the original goal needs revision. This is completely ok. Just be sure to have further conversations with the people in step two so that everyone remains on the same page.
Determine the practical implications of what you have found
This is the pivotal step in your data project. It is likely that no one else will have either the skill or energy to sift through all the raw data that you uncovered and connected. The only items that make it into the final report or presentation will therefore be guided by the original objectives, internal conversations, and your unique business perspective.
Clearly communicate findings to key stakeholders
After you’ve identified what matters most, you still need to package the findings in a way that resonates with a wider audience. This means balancing important detail with digestibility.
Few decision-makers will have the time to read hundreds of pages filled with methodologies and comprehensive cross tabs. You need to create a limited number of impactful visuals and talking points that guide others towards action.
Push for a decision or change in behavior
Don’t let your efforts go to waste. Most data projects these days are not designed as simple fact-finding missions or nice-to-have industry overviews. They are generally started with the intention to make changes that improve the status quo.
Positioning your data findings to make explicit recommendations or raise provocative questions about current operations will dramatically increase the perceived value of your work.
An opportunity to lead with data
As organizations craft strategic plans to become more data-driven, employees with data skills and mindsets have a real opportunity to align themselves with these ambitions.
However, as more people acquire baseline data skills, differentiating yourself will require more robust abilities. In other words, your data-centered career shouldn’t be limited to data retrieval.
Moving up the data value chain
You move up the data value chain by not only providing what is asked for, but through supporting requests with complementary information, such as how have things changed over time, and eventually through experienced-based recommendations that align with the data.
This progression should make working with data more rewarding as you get to be involved in higher-level conversations, which in turn makes it even easier for you to provide deeper insights again in the future.
Working with data therefore becomes a professional development opportunity that empowers you to better internalize and support organizational goals.
4.5 Organizational strategy
Most of our discussion so far has focused on you, the individual, and how to frame data problems. However, you are likely working in an organization with hundreds or thousands of other people. The organization itself, or rather the leaders of it, have a responsibility to build a cohesive data strategy that empowers its people to be the data-driven employees it claims to need.
Below are five questions that your company should work through on its path to becoming data-centric. Although it is unlikely that you’ll be directly involved in each stage, you should proactively press senior management if you sense a lack of clarity or progress in any of the areas.
What are our organizational objectives?
The first question is all about focus and priorities. What are we trying to accomplish? What are our unique capabilities? What does success look like?
Although data may not be the centerpiece of your core business outputs, it will almost certainly support your operations and product development. Data Strategy by Bernard Marr is a great resource to help organizations begin to balance overall strategy with data considerations.
What data do we have or need to find in order to support these goals?
Once you answer question 1, you should:
Take inventory of existing data: This is a very important exercise, especially for medium and large organizations that are likely sitting on more pieces of information than they realize. You should get representatives from each team in a room and go around discovering what datasets are being generated or used by a given group. There are many benefits to this. The organization will get a better sense of how much data is currently available and individuals will learn how others are using data to support their work.
Brainstorm potential data: This exercise is essentially a gap analysis. Start by reiterating your organizational objectives and summarizing the sources of data identified during the inventory. Then ask, “what data do we not have that are needed to achieve our goals?”. The results turn into a list of action items. Some will be easy (e.g., turn on Google Analytics for all web properties) and some will take more effort (e.g., collect consistent user feedback metrics across all divisions).
How can we collect and store this data in the best way?
Now things turn more technical. How will specific data assets be collected? Once collected, where will they be stored? Coordination between data engineers, analysts, and business leaders is crucial. This stage requires significant investment in both talent and infrastructure and can take months or years to set up depending on the current systems and complexity of plans.
End users (e.g., employees trying to understand the market better or customers purchasing insight products) don’t care about how hard it is to maintain the data lake or data warehouse; they just expect a useful environment in which to interact with accurate insights.
How do we ensure the proper legal, ethical, and accurate use of these data assets?
Data collection and use is increasingly under the microscope for ethical and legal reasons. Policies such as the General Data Protection Regulation (GDPR) in Europe and the California Consumer Privacy Act (CCPA) in the United States are forcing organizations to have a better idea of why and what they are collecting and how it is going to be stored and used.
Cross-functional Data Governance groups are often organized as a centralized resource in addressing such issues. They generally contain representatives from across the organization who have an interest or role in collecting, maintaining, or utilizing data. Most commonly these include members from IT, legal, product, and analytics teams. In addition to keeping an eye on compliance, the committee can also help oversee the organization of - and potentially assign monetary value to - a company’s data assets.
How can we empower our employees to generate insights, make evidence-based decisions, and create derivative value from our data?
Finally, we need to ensure that everyone in the organization is empowered to use available data to better understand the business, so that discussions and decisions are based on information over instinct.
Business Intelligence (BI) tools are often at the forefront of this goal. Dashboarding products such as Tableau, Microsoft Power BI, or Looker connect to an organization’s data assets and have the potential to surface data and insights in engaging ways - even for non-technical people.
Success of a BI tool rollout depends on (1) the condition of a company’s data systems, (2) clear documentation of what is available, (3) encouragement from senior leadership for making data-driven decisions, and (4) a baseline level of data literacy for all employees.
Organizations spend millions of dollars in the pursuit of data transformations. These investments are justified based on the belief that becoming a data-centric organization will lead to better operational decisions, product enhancements, and customer engagement - all of which should improve the bottom line. Having clear strategies and assigned ownership relating to the five questions above will improve your company’s chances of success.
You can help keep your company on track by asking questions when clarity is lacking. For instance:
- Do we have a data governance team?
- Where can I find more information about the data in this dashboard?
- How long do we store customer information for?
- What is the single source of truth for monthly financial numbers?