What Is Data Cleansing?
Data cleansing, also known as data cleaning or scrubbing, identifies and fixes errors, duplicates, and irrelevant data from a raw dataset. Part of the data preparation process, data cleansing allows for accurate, defensible data that generates reliable visualizations, models, and business decisions.
Why Is Data Cleansing Important?
Analyses and algorithms are only as good as the data they’re based on. On average, organizations believe that nearly 30% of their data is inaccurate. This dirty data costs companies 12% of their overall revenue — and they’re losing more than just money. Cleansing produces consistent, structured, accurate data, which allows for informed, intelligent decisions. It also highlights areas for improvement in upstream data entry and storage environments, saving time and money now and in the future.
The Data Cleansing Process
Data cleansing is an essential step to any analytics process and typically involves six steps.
Dedupe: Duplicates, or dupes, usually show up when data is blended from different sources (e.g., spreadsheets, websites, and databases) or when a customer has multiple points of contact with a company or has submitted redundant forms. This repeated data uses up server space and processing resources, creating larger files and less efficient analysis. Rules around deduping depend on a company’s desired outcome. For example, if a customer submitted different emails on different pages of a website, a conservative approach would eliminate only rows of data where every field is an exact match.
Remove Irrelevant Observations: Data that’s not relevant to the problem being solved can slow down processing time. Removing these irrelevant observations doesn’t delete them from the source but excludes them from the current analysis. For example, when investigating campaigns from the last year, there’s no need to include data from outside that time frame. Keep in mind, however, that even if a certain variable isn’t needed, it might be correlated with the outcome being investigated (e.g., a customer’s age could impact which email was most successful).
Manage Incomplete Data: Data might be missing values for a few reasons (e.g., customers not providing certain information), and addressing it is vital to analysis as it prevents bias and miscalculations. After isolating and examining the incomplete values, which can show up as “0,” “NA,” “none,” “null,” or “not applicable,” determine if those are plausible values or due to missing information. While the easiest solution might be to drop the incomplete data, be aware of any bias that might result in that action. Alternatives include replacing null values with substitutes based on statistical or conditional modeling or flagging and commenting on the missing data.
Identify Outliers: Data points that are far removed from the rest of a population can significantly distort the reality of the data. These outliers can be identified with visual or numerical techniques including box plots, histograms, scatterplots, or z-scores; when part of an automated process, it allows for quick assumptions, testing those assumptions, and resolving data issues with confidence. Once identified, outliers can be included or omitted depending on how extreme they are and what statistical methods are being used in an analysis.
Fix Structural Errors: It’s important to correct errors and inconsistencies including typography, capitalization, abbreviation, and formatting. Look at the data type for each column and make sure entries are correct and consistent, which may include standardizing fields, and remove unwanted characters such as extra whitespaces.
Validate: Validation is the opportunity to ensure data is accurate, complete, consistent, and uniform. This happens throughout an automated data cleansing process, but it’s still important to run a sample to ensure everything aligns. This is also an opportunity to document what tools and techniques were used throughout the cleansing process.
The Future of Data Cleansing
Data cleansing is essential to valid, powerful analysis, yet for many companies it’s a manual, siloed process that wastes time and resources. Analytic Process Automation (APA) allows for repeatable, scalable, accessible data cleansing and enables:
- The democratization of data and analytics
- The automation of business processes
- The upskilling of people for quick wins and transformative outcomes
Data cleansing is the foundation of Analytic Process Automation (APA), and with that strong foundation, companies have a clear path to deeper analysis with data science and machine learning.
Getting Started With Data Cleansing
Manual data cleansing is tedious, error-prone, and time-consuming. With its suite of easy-to-use automation building blocks, Alteryx Analytics Automation™ empowers organizations to identify and clean dirty data in a variety of ways — without code. The end-to-end analytics platform is designed with the significance and specifications of data exploration in mind and on the understanding that clean data leads to good analysis. The Alteryx Platform creates a fast, repeatable, and auditable process that can be built once and automated forever.