Automatic data quality assessment is a Google Cloud Dataprep user favorite. Who wouldn’t want to give their eyes a rest from combing through data while Cloud Dataprep automatically points out possible data flaws?
The feature is particularly useful when onboarding or integrating unfamiliar data. With unfamiliar data, it’s not only difficult to tell what errors might be lurking in the data, but it can be tough to tell where to even begin transforming the data. Automatic data quality assessment helps map out the work you need to do before reaching the level of data quality required for your analytics initiative.
Of course, checking for data quality isn’t a one-time occurrence. At every step of the data preparation process, Cloud Dataprep offers quality checks and suggests transformations to clean the data. Cloud Dataprep’s rich set of data quality capabilities includes:
- Continually refreshed data profiles that provide a snapshot of the data’s contents, including data type inference, data distribution, mismatched values, missing values, statistics and outliers.
- A rich set of transformations to clean data, including “transform by example”, data standardization based on similarities, transformation suggestions based on patterns, and combining data sets based on fuzzy matching.
- Data transformation and remediation suggestions based on identified data issues.
- Data quality rules suggestions and monitoring.
- Data validation and profiling results at scale.
Cloud Dataprep Profile Results & Quality Rules
When you run a data preparation job with Cloud Dataprep, you can request “Profile Results”. Cloud Dataprep then collects profiling statistics from your transformed dataset and displays them in the job workspace after the job has run.
Additionally, when you create Data Quality Rules in your recipe, quality bars are also displayed in the job workspace under the Rules tab.
Since Profile and Quality Rules results are collected after each job, many of our users have retrieved this information through the API and built a comprehensive dashboard that monitors data quality over time. Data quality dashboards allow users to track trends in their data and quickly intervene should they see data quality go awry. Users can also customize the dashboard to highlight specific data quality indicators related to their analytics initiatives.
Data Quality Dashboard with Data Studio
Here is an example of a Data Studio Data Quality dashboard we created to monitor both the profile results and data quality rules over time. Try it out yourself to discover how you can filter the results and identify data quality trends.
The first page highlights Data Profiling results. You can select a particular Output Dataset or a particular date to zoom in on a specific piece of information.
The dashboard shows the usual profiling statistics, while the bottom chart shows summary information for each job run so that you can see the trends for each dataset.
The second page summarizes the Data Quality Rules results. It has richer features.
You can select a particular dataset, a rule type (Valid, Unique, Not Missing, Match, Implies, Greater Than, Formula), a specific rule name, the status of the rule result (passed or failed), or a particular date to focus on.
The bottom of the dashboard provides a history of each run. We even created a Quality Score (in this case, the number of passing records divided by the total number of records). With this quality score plotted as a line over time, it becomes easy to track trends.
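To make the Quality Score concrete, here is a minimal Python sketch of the ratio described above. The `RuleResult` class, its field names, and the sample numbers are our own illustrative assumptions, not the format Cloud Dataprep returns.

```python
from dataclasses import dataclass


@dataclass
class RuleResult:
    """One data quality rule outcome for a single job run (hypothetical shape)."""
    rule_name: str
    passed_records: int
    total_records: int


def quality_score(result: RuleResult) -> float:
    """Return passing records divided by total records, or 0.0 for an empty run."""
    if result.total_records == 0:
        return 0.0
    return result.passed_records / result.total_records


# Illustrative values only, not taken from a real job.
runs = [
    RuleResult("email_is_valid", 950, 1000),
    RuleResult("email_is_valid", 990, 1000),
]
scores = [quality_score(r) for r in runs]
```

Plotting one such score per job run gives the kind of trend line shown at the bottom of the dashboard.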
Building Your Own Data Quality Dashboard
What if you could build your own Data Quality Dashboard and refresh it automatically every time you run a Cloud Dataprep job? We have you covered! Read the step-by-step guide to building the entire, fully automated solution, which uses Google Cloud Functions and APIs to collect the Dataprep Profile and Rules results, loads them into BigQuery, and gets the Data Studio report ready.
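As a rough idea of what the middle of that pipeline might look like, the sketch below flattens one job's rule results into rows and serializes them as newline-delimited JSON, a format BigQuery load jobs accept. It is not the code from the guide: the input keys (`name`, `type`, `failed`, `total`) and the output column names are hypothetical, and the calls that fetch results from the Dataprep API and load them into BigQuery are deliberately left out.

```python
import json
from datetime import datetime, timezone


def rule_results_to_rows(job_id, dataset_name, rule_results, run_time=None):
    """Flatten one job's rule results into one dict per rule.

    Every key used here is an assumption made for illustration; adapt the
    names to the actual API response and to your own BigQuery table schema.
    """
    run_time = run_time or datetime.now(timezone.utc)
    rows = []
    for rule in rule_results:
        rows.append({
            "job_id": job_id,
            "dataset_name": dataset_name,
            "run_time": run_time.isoformat(),
            "rule_name": rule["name"],
            "rule_type": rule["type"],
            "status": "passed" if rule["failed"] == 0 else "failed",
            "passed_records": rule["total"] - rule["failed"],
            "total_records": rule["total"],
        })
    return rows


def to_newline_delimited_json(rows):
    """Serialize rows as one JSON object per line."""
    return "\n".join(json.dumps(row) for row in rows)
```

Keeping one row per rule per job run means the dashboard can filter by dataset, rule type, rule name, status, or date without any further reshaping.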
While we built the solution with Data Studio, which is free and feature-rich, you may already be familiar with another BI solution such as Tableau, Qlik or Looker. You can just as easily connect those solutions to the Profiling and Rules tables stored in BigQuery. These BI solutions may support alerting, so you could also define an alert that fires when a quality score drops below a certain threshold, letting you take immediate action.
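If your BI tool does not offer alerting, the same threshold check is easy to express yourself. The Python sketch below is only an illustration: the function name, the 0.95 default threshold, and the sample history are assumptions, and how you deliver the alert (email, chat message, ticket) is up to you.

```python
def scores_below_threshold(score_history, threshold=0.95):
    """Return the (run_label, score) pairs whose score is under the threshold."""
    return [(label, score) for label, score in score_history if score < threshold]


# Illustrative history of quality scores per job run, not real data.
history = [
    ("2021-03-01", 0.99),
    ("2021-03-02", 0.91),
    ("2021-03-03", 0.97),
]
alerts = scores_below_threshold(history)
for label, score in alerts:
    print(f"Quality score {score:.2f} on {label} is below the threshold")
```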