A New Approach to Data Quality for the Era of Cloud & AI

Technology   |   Paul Warburg   |   March 19, 2019

Data quality has been going through a renaissance recently.

As a growing number of organizations ramp up efforts to transition computing infrastructure to the cloud and invest in cutting-edge machine learning and AI initiatives, they are finding that the #1 barrier to success is the quality of their data.

The old adage “Garbage In, Garbage Out” has never been more relevant. With the speed and scale of today’s analytics workloads and the businesses that they support, the costs associated with poor data quality have also never been higher.

You’re seeing this reflected in a massive uptick in media coverage on the topic. Over the past few months, data quality has been the focus of feature articles in The Wall Street Journal, Forbes, Harvard Business Review and MIT Sloan Management Review, among others. The common theme is that the success of machine learning and AI is completely dependent on data quality. To quote Thomas Redman, the author of the HBR article referenced above, “If Your Data Is Bad, Your Machine Learning Tools Are Useless.”

We’re seeing this increasing focus on data quality reflected in the work of our customers, including the Centers for Medicare & Medicaid Services (CMS), Deutsche Boerse and GlaxoSmithKline. The need to accelerate data quality assessment, remediation and monitoring has never been more critical for organizations, and they are finding that traditional approaches to data quality don’t provide the speed, scale and agility required by today’s businesses.

This repeated pattern is what led to today’s announcement of our expansion into data quality and the unveiling of two major new platform capabilities: Active Profiling and Smart Cleaning. This is a big moment for our company because it’s the first time we’ve expanded our focus beyond data preparation. By adding new data quality functionality, we are advancing Trifacta’s capabilities to handle a wider set of data management tasks as part of a modern DataOps platform.

Legacy approaches to data quality involve many manual, disparate activities as part of a broader process. Dedicated data quality teams, often disconnected from the business context of the data they are working with, manage the process of profiling, fixing and continually monitoring data quality in operational workflows. Each step must be managed in a completely separate interface, which makes it hard to move back and forth iteratively between steps such as profiling and remediation. Worst of all, the individuals doing the work of managing data quality often don’t have the appropriate context for the data to make informed decisions when business rules change or new situations arise.

Trifacta takes a different approach. Interactive visualizations and machine intelligence guide users by highlighting data quality issues and providing intelligent suggestions on how to address them. Profiling, user interaction, intelligent suggestions, and guided decision-making are all interconnected, and each drives the others. Users can seamlessly move back and forth between steps to ensure their work is correct. This guided approach lowers the barrier to entry for users and helps to democratize the work beyond siloed data quality teams, allowing those with the business context to own and deliver quality outputs with greater efficiency to downstream analytics initiatives.

In upcoming posts, my colleagues in product and engineering will provide a more detailed overview of the new capabilities we announced today, including Active Profiling and Smart Cleaning. They’ll share not only what users are able to do with these new features but also context on the design and development of each function.

Keep in mind that this is just the first (albeit significant) step for our company into data quality. We have much more planned. Later in the year, Designer Cloud will be adding new capabilities to govern and monitor data quality in automated workflows, allowing users to isolate bad data, orchestrate workflows, and set and monitor data quality thresholds.
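
To make the idea of a data quality threshold more concrete, here is a minimal sketch in plain Python. It is only an illustration of the general pattern (split out the bad rows, then compare the share of valid rows against a threshold), not a description of how our product implements it. The names `check_quality`, `QualityReport` and `has_valid_age`, the sample rows and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class QualityReport:
    good: List[Row]          # rows that passed the rule
    quarantined: List[Row]   # rows isolated for review
    valid_ratio: float       # share of rows that passed
    passed: bool             # True if valid_ratio meets the threshold


def check_quality(
    rows: List[Row],
    is_valid: Callable[[Row], bool],
    threshold: float,
) -> QualityReport:
    """Isolate invalid rows and compare the valid share to a threshold."""
    good = [row for row in rows if is_valid(row)]
    quarantined = [row for row in rows if not is_valid(row)]
    # Treat an empty batch as fully valid to avoid dividing by zero.
    valid_ratio = len(good) / len(rows) if rows else 1.0
    return QualityReport(good, quarantined, valid_ratio, valid_ratio >= threshold)


def has_valid_age(row: Row) -> bool:
    """Hypothetical business rule: age must be an integer from 0 to 120."""
    age = row.get("age")
    return isinstance(age, int) and 0 <= age <= 120


# Hypothetical sample batch with one missing and one out-of-range value.
rows = [
    {"id": 1, "age": 34},
    {"id": 2, "age": None},
    {"id": 3, "age": -5},
    {"id": 4, "age": 58},
]

report = check_quality(rows, has_valid_age, threshold=0.9)
if not report.passed:
    print(f"Only {report.valid_ratio:.0%} of rows are valid; "
          f"{len(report.quarantined)} rows quarantined.")
```

In an automated workflow, a failed check like this one would typically stop the downstream job or raise an alert, while the quarantined rows go to someone with the business context to decide how to fix them.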