Ask the Expert: On Bad Data, Bullets, And Survivorship Bias

Technology   |   Alan Jacobson   |   Apr 2, 2020

Editor’s Note: Data scientists and analysts hang their professional hats on data every day, but we also recognize that there’s a very human element of subjective decision-making in what data to include and how to draw conclusions from it. In this new “Ask the Expert” series with Alan Jacobson, CDAO at Alteryx, we explore how cognitive biases can influence the way data scientists and analysts analyze data and draw insights from it.

Q: How do you handle bad data? Do you worry that you can draw incorrect conclusions from it?

A: I don’t necessarily believe there is such a thing as “bad” data. But it is true that data sources frequently have erroneous and/or missing data. I have met very few real-world data sets that were perfectly clean. That said, most data can still be used effectively, as it is good enough to provide value, insight, and a solution to a problem.

Let’s start with an example of how data can mislead people — or, more accurately, how people’s biases can cause them to use data in ways that might provide the wrong answer.

History offers plenty of moments in which biased data led people to the wrong conclusions. Let’s take a famous story about Abraham Wald from World War II. The legend goes that to make planes safer and more likely to survive battle, a team of researchers evaluated every plane that came back from missions with bullet holes to assess where to place more armor.

This is a great data science problem. Adding armor to the wrong places makes a plane heavier, slower, and more likely to be shot down; adding armor to the right places makes the plane more likely to survive battle. It’s an optimization problem with meaningful consequences, exactly the kind of stuff that data scientists love to work on!

Normalizing the number of bullet holes by surface area, the team concluded that the best place to add armor was wherever they saw the most bullet holes per square foot, which in this case was the fuselage.

As the legend goes, when they reviewed this data with Wald, he responded that the armor doesn’t go where the bullet holes are; it goes where they aren’t: the engine.

Wald’s “aha moment”? The damaged portions of returning planes show locations where they can sustain damage and still return home; those hit in other places do not survive. Even though the data appeared to be saying that the engine was the least likely to need armor, Wald pushed past the cognitive bias we call “survivorship bias” to dig deeper.
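The effect is easy to reproduce in a quick simulation. Everything below is an illustrative sketch, not historical data: the section areas, hit counts, and per-section fatality probabilities are invented assumptions chosen only to make the bias visible.

```python
import random

random.seed(0)

# Hypothetical plane sections: surface area (sq ft) and an ASSUMED
# probability that a single hit in that section downs the plane.
sections = {
    "fuselage": {"area": 400, "p_fatal": 0.05},
    "wings":    {"area": 300, "p_fatal": 0.10},
    "engine":   {"area": 100, "p_fatal": 0.60},
}
names = list(sections)
areas = [sections[n]["area"] for n in names]

survivor_hits = {n: 0 for n in names}  # what the researchers could count
all_hits = {n: 0 for n in names}       # ground truth, including lost planes

for _ in range(10_000):
    # Each sortie takes a few hits, landing on sections in proportion to area.
    hits = random.choices(names, weights=areas, k=3)
    survived = all(random.random() > sections[h]["p_fatal"] for h in hits)
    for h in hits:
        all_hits[h] += 1
        if survived:
            survivor_hits[h] += 1

print("Hits per sq ft among SURVIVORS (what the team saw):")
for n in names:
    print(f"  {n:8s} {survivor_hits[n] / sections[n]['area']:.2f}")

print("Hits per sq ft across ALL planes (ground truth):")
for n in names:
    print(f"  {n:8s} {all_hits[n] / sections[n]['area']:.2f}")
```

Across all planes, hits per square foot are roughly uniform, because hits land in proportion to area. Among the survivors alone, the engine looks almost untouched, not because it is rarely hit, but because planes hit there rarely come home. Analyzing only the returning planes reproduces exactly the mistake Wald caught.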

Survivorship Bias

The logical error of concentrating on the people or things that “survived” and overlooking those that did not, typically because of their lack of visibility.

This method was used throughout World War II as well as the Korean and Vietnam wars. It’s a great example of how the initial analysis would have been fatally incorrect had it not been examined more carefully.

There are many other types of cognitive biases beyond this one, and each can affect an outcome. It’s important to consider them as you look at data to ensure you get the best possible result. And keep in mind, the data wasn’t “bad”; it just wasn’t used in the right way.


We’ve got to be careful to avoid bias, to not be blindsided by our assumptions and the conclusions we draw from data, and to never forget to include clever humans (like Wald) in the process. Avoiding cognitive bias starts with being aware it exists, and then actively combating it. The internet is full of tips, including checking your ego, not making decisions under time pressure, and avoiding multitasking.

But one of the most powerful tools in your arsenal is computer augmentation: using data science tools that free people to think more deeply about the analysis and arrive at better answers.

Some have suggested that the way to prevent these issues of cognitive bias in evaluating data is to not provide data science tools broadly to knowledge workers. My view is quite the opposite. True, providing data science tools to people who aren’t domain experts can add more bias into the analysis, and not understanding the context of the data and the problem you’re trying to solve greatly increases the risk of making a mistake. But the solution is not to leave these tools only in the hands of elite, highly trained data scientists. Instead, prevent bias by educating citizen data scientists in advanced data analysis.

Education is the key to eliminating bias and preventing errors. And to your original question, the data will always be dirty. It’s what you do with it that counts.

