Let's Ponder Data Dictionaries
One of the larger challenges in answering a question with data is often finding the data in the first place. Various studies have shown this task can make up 10-40% of the total time spent solving a problem. So what are the best ways to find data? Let's first consider the current state of data science and analytics: according to an IDC info brief, most data workers spend 90% of their workweek on data-related activities.
There have been many approaches to this problem, with the oldest and most common being the data dictionary. The idea is simple: have someone create a document that defines every field in every system, so that data scientists, analysts, and business users can easily find what they need. While in principle this sounds logical, in practice this approach is time-consuming, impractical, and rarely works well. That's not to say that having a data dictionary is bad, but it is typically neither a sufficient nor a reliable basis for making business decisions.
Diving into a Dictionary
“How can this be?” you ask. Why wouldn’t a dictionary solve all these problems? Two main challenges emerge: recency and completeness. To see why, one needs to understand the dynamic nature of most businesses. Systems and fields change continuously, as do the ways the systems are used. I have worked with systems that were 20-plus years old, where the way several fields were used had long since diverged from what the data dictionary stated.
An outdated data dictionary can set up serious roadblocks to data science and waste valuable time. In addition, the nuances that would fully define what the data means, and how it can and cannot be used, would likely require an essay rather than a sentence to capture appropriately. The IDC info brief also found that 44% of data workers' time is wasted every week because their activities are unsuccessful.
Subject matter experts can appreciate that certain measurement factors have multiple definitions for a spectrum of purposes — to annotate a data dictionary with completeness for each possible scenario would be madness.
To illustrate, imagine working at a global manufacturing company and looking for the right system to find the cost of all the parts that go into a product. You would likely find hundreds of systems and data sources for this cost data: variable cost, total landed cost, warranty cost, and numerous other cost elements.
To make it even more confusing, numerous systems that contain the same cost element, say variable cost, would have different sub-attributes (e.g., the cost being recorded at time of order vs. time of payment, some including tax, others separating it, etc.). Figuring out which system and field to use would require reading nearly all of these entries to sort out the answer. To make matters even more complicated, data isn’t simply born within the four walls of the company. When data scientists and analysts work to solve a problem, they frequently require data from external sources to supplement internal data.
These external elements would be burdensome to include in a data dictionary, and the recency problem applies once again.
Solve Specific Problems
So how does one know which data fields to use to solve a specific problem? How do the most successful data scientists quickly know what data to leverage to solve a problem? My experience would suggest there are two key elements to success:
- Experienced people who know the systems and the domain and, even more importantly, have a broad network of people to leverage
- Data scientists who are good at checking data and validating whether it's reliable enough to solve the problem at hand
So how can technology help with these problems? There are phenomenal solutions in this space that help tackle data cataloging and management. Many analytic solutions have helpful technology on the latter item, profiling data and providing useful heuristics. These data profiles show things like the number of NULL values, max and min values, and the distribution of values.
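As a sketch of what such profiling heuristics compute, here is a minimal example in plain Python. The function name and the sample cost data are invented for illustration; real profiling tools compute these same summaries (and more) automatically.

```python
from collections import Counter
import statistics

def profile_column(values):
    """Summarize one column: null count, min/max/mean, and a value distribution."""
    nulls = sum(1 for v in values if v is None)
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": nulls,
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "mean": statistics.mean(present) if present else None,
        "distribution": Counter(present),  # how often each value occurs
    }

# Hypothetical variable-cost field pulled from one of many candidate systems
costs = [4.20, 4.20, None, 5.10, 4.20, None]
profile = profile_column(costs)
print(profile["nulls"])  # 2
```

A profile like this quickly surfaces red flags (a third of the values missing, say, or a max far outside the expected range) before any analysis is built on the field.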
More sophisticated users can run checks with correlation analysis and other analytic tools, but in the end, all of this requires some knowledge of what the data should look like to determine if it is good enough to use. That leads back to the first bullet point: finding experts who know the data and the domain. Few technologies have attacked this problem directly, but there are solutions that use a social mechanism to solve it.
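To illustrate the kind of correlation check mentioned above, here is a small sketch in plain Python. The fields and sample values are hypothetical; the idea is that two cost fields expected to move together (such as unit cost and landed cost) should show a strong correlation, and a weak one would be a signal to investigate before trusting either field.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical unit cost vs. total landed cost for the same parts
unit_cost = [1.0, 2.0, 3.0, 4.0]
landed_cost = [1.5, 2.6, 3.4, 4.5]
print(round(pearson(unit_cost, landed_cost), 3))
```

A result near 1.0 matches the domain expectation; a result near zero would suggest one of the two fields is not what the dictionary claims it is.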
Let’s Get Social
The social means of solving a problem has become standard in the software community: finding technical help on nearly any problem through online forums and communities is often faster than going through corporate help desks. Crowdsourcing has even become the standard for how I diagnose and repair my consumer goods. I'm not sure I've looked at an owner's manual in the last decade; someone has already encountered and fixed the problem I'm experiencing, and all I need to do is read the thread.
I’ve seen best-in-class examples of this type of collaborative data discovery, which provides a social means of understanding who is using different pieces of data and what they are doing with it.
What do you think? Is this social approach to problem-solving the best new way to speed up the challenge of finding data? Or do you think Merriam-Webster could have a new market to attack?