As part of our expanded focus on data quality, we recently announced Smart Cleaning, a new approach aimed at resolving data quality issues quickly and intuitively. This blog covers two features of Smart Cleaning: Cluster Clean and Pattern Clean. Cluster Clean lets users quickly and flexibly resolve similar values to a single standard value, whether chosen from the existing data or entered by the user. Pattern Clean addresses mismatched formats, for instance in dates or phone numbers. Together, these features bring scale and flexibility to the traditionally rigid process of standardization.
We recently introduced a brand new way for users to standardize and clean misspelled or varied data in Designer Cloud: Cluster Clean. Many of our users struggle with messy data that is hard to reconcile. Sometimes it’s data that has been manually entered into systems; other times it’s data coming from multiple sources. We’ve found that traditional methods of clustering and standardizing similar values are slow and brittle. Users choose a clustering method and then spend lots of time manually moving values into the right cluster. When they get new data, they need to painstakingly reconcile it with their existing clusters. Our approach is different. We know that no single clustering method does a great job of catching every type of issue, so we designed Cluster Clean to let users quickly explore multiple clustering options and catch new problems. It’s also resilient to new data, easily incorporating new values without being tied to whatever clustering method happened to work best the first time.
We approached standardization with user flexibility in mind. Our initial model let users take advantage of two clustering algorithms, key collision (which compares string similarity) and metaphone (which compares pronunciation), while still being able to break out of the suggested groupings when the algorithms did not produce the desired result. To accomplish this, we allowed the user not only to specify a value for the whole cluster to resolve to, but also to specify a new value for any individual value within the cluster.
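To make the idea concrete, here is a minimal Python sketch of key collision clustering with per-cluster and per-value resolution. It is an illustration under stated assumptions, not Designer Cloud's implementation: the fingerprinting rules (lowercase, strip accents and punctuation, sort unique tokens) follow a commonly used key collision recipe, and the names `fingerprint`, `cluster_by_key`, and `build_mapping` are hypothetical. Metaphone is omitted because it is not in the Python standard library.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key; similar spellings collide."""
    # Strip accents by decomposing characters and dropping non-ASCII marks.
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase and remove punctuation.
    text = re.sub(r"[^\w\s]", "", text.lower())
    # Sort unique tokens so word order does not matter.
    return " ".join(sorted(set(text.split())))


def cluster_by_key(values):
    """Group values by fingerprint; keep only groups with differing spellings."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return {key: vals for key, vals in groups.items() if len(set(vals)) > 1}


def build_mapping(clusters, cluster_targets, value_overrides=None):
    """Resolve each cluster to its target, then apply per-value overrides."""
    mapping = {}
    for key, members in clusters.items():
        if key in cluster_targets:
            for member in members:
                mapping[member] = cluster_targets[key]
    # Overrides let a user pull individual values out of a cluster.
    mapping.update(value_overrides or {})
    return mapping


# Made-up sample data for illustration only.
values = ["Acme Inc.", "acme inc", "Inc. Acme", "ACME, Inc", "Apex Ltd"]
clusters = cluster_by_key(values)
mapping = build_mapping(
    clusters,
    cluster_targets={"acme inc": "Acme Inc."},
    value_overrides={"Inc. Acme": "Acme Incorporated"},
)
```

Because the result is a value-to-value mapping rather than a fixed cluster assignment, a different clustering method could be run later and its resolutions merged into the same mapping.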
Through user testing, we found that this model gave our users the flexibility they were looking for, but it was missing a key capability for cases where the clusters didn’t match the user’s expectations: bulk editing. Users felt that it was tedious to pull multiple values out of a cluster one by one, so we added the ability to select multiple values and edit them in bulk. Through this iteration, we discovered that selecting values was generally very useful and intuitive, so we extended the pattern to work across clusters as well. Now, users can select a single cluster, multiple clusters, specific values within a single cluster, or specific values across multiple clusters, and resolve them appropriately.
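One way to picture this selection model is as a set of values assembled from whole clusters, individual values, or both, and then resolved in one step. The sketch below is a hypothetical illustration in Python; the function names and the cluster data are assumptions made for the example and do not reflect the product's internals.

```python
def expand_selection(clusters, cluster_keys=(), values=()):
    """Combine whole clusters and hand-picked values into one selection."""
    selected = set(values)
    for key in cluster_keys:
        selected.update(clusters[key])
    return selected


def resolve_selection(mapping, selected, new_value):
    """Return a copy of the mapping with every selected value resolved."""
    updated = dict(mapping)
    for value in selected:
        updated[value] = new_value
    return updated


# Hypothetical clusters, keyed by an arbitrary cluster id.
example_clusters = {
    "c1": ["New York", "new york", "NewYork"],
    "c2": ["NYC", "N.Y.C."],
    "c3": ["Newark", "newark"],
}

# Select all of c1, all of c2, and nothing from c3.
selection = expand_selection(example_clusters, cluster_keys=["c1", "c2"])
resolved = resolve_selection({}, selection, "New York")

# Select specific values across clusters and resolve them in bulk.
picked = expand_selection(example_clusters, values=["NewYork", "newark"])
resolved = resolve_selection(resolved, picked, "Needs review")
```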
Another data quality issue that Designer Cloud can quickly address is mismatched formatting. If you read our blog on Active Profiling, you saw how Designer Cloud helps users identify and drill down on issues related to mismatched formatting. By interacting with that profiling information, users also get a powerful method to address those mismatched values: Pattern Clean.
More traditional approaches to addressing these types of mismatched values often rely heavily on regular expressions and complex conditions. It can take enormous amounts of manual effort to identify the issues and patterns, and equal amounts of manual effort to replace them with the correct format. Designer Cloud identifies the patterns for you, and as you interact with those patterns, Pattern Clean predicts the best way to resolve them.
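As a rough illustration of the general idea, the Python sketch below profiles a column by reducing each value to a format signature and then rewrites values into one target format. This is a simplified stand-in, not how Pattern Clean works internally: the signature scheme (digits become `9`, letters become `a`), the helper names, and the assumption of 10-digit phone numbers are all choices made for this example.

```python
import re
from collections import Counter


def pattern_of(value: str) -> str:
    """Reduce a value to a format signature: digits -> 9, letters -> a."""
    signature = re.sub(r"\d", "9", value)
    return re.sub(r"[A-Za-z]", "a", signature)


def profile_patterns(values):
    """Count how many values share each format signature."""
    return Counter(pattern_of(value) for value in values)


def normalize_phone(value: str):
    """Rewrite a 10-digit phone number as 999-999-9999, else return None."""
    digits = re.sub(r"\D", "", value)
    if len(digits) != 10:
        return None  # Leave unexpected values for manual review.
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"


# Made-up sample data for illustration only.
phones = ["555-123-4567", "(555) 123-4567", "555-987-6543", "5551234567", "n/a"]
profile = profile_patterns(phones)
cleaned = [normalize_phone(phone) for phone in phones]
```

The profile surfaces the competing formats and their counts, and the value that cannot be normalized (`"n/a"`) comes back as `None` rather than being silently rewritten.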
This is just the first step in our focus on making data cleaning more intelligent and efficient, and we have a lot of exciting extensions of this work coming soon! With Reference Clean, we’ll make it possible to use an existing set of ‘gold standard’ values to speed the process up even more. We also have new clustering methods planned, allowing for more ways to explore how values may be related. We’ll be layering in more intelligence over time, giving users suggestions on how to resolve their values and clusters. Try it out today and let us know what you think!