motif blanc

Data Cleansing

motif blanc

What Is Data Cleansing?

Data cleansing, also known as data cleaning or scrubbing, identifies and fixes errors, duplicates, and irrelevant data from a raw dataset. Part of the data preparation process, data cleansing allows for accurate, defensible data that generates reliable visualizations, models, and business decisions.

Why Is Data Cleansing Important?

Analyses and algorithms are only as good as the data they’re based on. On average, organizations believe that nearly 30% of their data is inaccurate. This dirty data costs companies 12% of their overall revenue — and they’re losing more than just money. Cleansing produces consistent, structured, accurate data, which allows for informed, intelligent decisions. It also highlights areas for improvement in upstream data entry and storage environments, saving time and money now and in the future.

The Data Cleansing Process

Data Cleansing Process

Data cleansing is an essential step to any analytics process and typically involves six steps.

Dedupe: Duplicates, or dupes, usually show up when data is blended from different sources (e.g., spreadsheets, websites, and databases) or when a customer has multiple points of contact with a company or has submitted redundant forms. This repeated data uses up server space and processing resources, creating larger files and less efficient analysis. Rules around deduping depend on a company’s desired outcome. For example, if a customer submitted different emails on different pages of a website, a conservative approach would eliminate only rows of data where every field is an exact match.

Remove Irrelevant Observations: Data that’s not relevant to the problem being solved can slow down processing time. Removing these irrelevant observations doesn’t delete them from the source but excludes them from the current analysis. For example, when investigating campaigns from the last year, there’s no need to include data from outside that time frame. Keep in mind, however, that even if a certain variable isn’t needed, it might be correlated with the outcome being investigated (e.g., a customer’s age could impact which email was most successful).

Manage Incomplete Data: Data might be missing values for a few reasons (e.g., customers not providing certain information), and addressing it is vital to analysis as it prevents bias and miscalculations. After isolating and examining the incomplete values, which can show up as “0,” “NA,” “none,” “null,” or “not applicable,” determine if those are plausible values or due to missing information. While the easiest solution might be to drop the incomplete data, be aware of any bias that might result in that action. Alternatives include replacing null values with substitutes based on statistical or conditional modeling or flagging and commenting on the missing data.

Identify Outliers: Data points that are far removed from the rest of a population can significantly distort the reality of the data. These outliers can be identified with visual or numerical techniques including box plots, histograms, scatterplots, or z-scores; when part of an automated process, it allows for quick assumptions, testing those assumptions, and resolving data issues with confidence. Once identified, outliers can be included or omitted depending on how extreme they are and what statistical methods are being used in an analysis.

Fix Structural Errors: It’s important to correct errors and inconsistencies including typography, capitalization, abbreviation, and formatting. Look at the data type for each column and make sure entries are correct and consistent, which may include standardizing fields, and remove unwanted characters such as extra whitespaces.

Validate: Validation is the opportunity to ensure data is accurate, complete, consistent, and uniform. This happens throughout an automated data cleansing process, but it’s still important to run a sample to ensure everything aligns. This is also an opportunity to document what tools and techniques were used throughout the cleansing process.

The Future of Data Cleansing

Data cleansing is essential to valid, powerful analysis, yet for many companies it’s a manual, siloed process that wastes time and resources. Analytic Process Automation (APA) allows for repeatable, scalable, accessible data cleansing and enables:

  • The democratization of data and analytics
  • The automation of business processes
  • The upskilling of people for quick wins and transformative outcomes

Data cleansing is the foundation of Analytic Process Automation (APA), and with that strong foundation, companies have a clear path to deeper analysis with data science and machine learning.

Getting Started With Data Cleansing

Manual data cleansing is tedious, error-prone, and time-consuming. With its suite of easy-to-use automation building blocks, Alteryx Analytics Automation™ empowers organizations to identify and clean dirty data in a variety of ways — without code. The end-to-end analytics platform is designed with the significance and specifications of data exploration in mind and on the understanding that clean data leads to good analysis. The Alteryx Platform creates a fast, repeatable, and auditable process that can be built once and automated forever.

Voiture de course McLaren
Témoignage client
Temps de lecture : 5 minutes

McLaren Racing mise sur l'analytique des données pour booster ses résultats

Avec plus de 20 week-ends de course de Formule 1 prévus, chacun générant 1,5 To de données, il est vraiment essentiel de pouvoir collecter, traiter et exploiter ces données. L'équipe McLaren Racing s'appuie sur la plateforme Alteryx d'Automatisation des processus analytiques pour accélérer la prise de décision stratégique, tant sur le circuit qu'en dehors.

Chaîne d'approvisionnement
Responsable Analytique
BI/Analytique/Data science
Lire maintenant
	5 cas d'usage pour aider les professionnels de la FP&A à gagner du temps Time_dynamic_432x767
Temps de lecture : 7 minutes

5 cas d'usage pour aider les professionnels de la FP&A à gagner du temps

Les processus FP&A manuels vous énervent et vous prennent tout votre temps ? Découvrez dans cet e-book cinq processus FP&A à rationaliser pour gagner du temps, améliorer les prévisions et prendre de meilleures décisions.

Planification et analyse financières
Lire maintenant
Personne montant des escaliers
Temps de lecture : 5 minutes


Les directeurs financiers qui s'approprient vraiment l'IA rendent service à leur entreprise, et à leur carrière.

Responsable Business
Alteryx Platform
Lisez vite !

Intelligence Suite Starter Kit

This Starter Kit provides analytics templates to jumpstart your journey to no-code advanced analytics using assisted modeling – the guided creation of machine learning models.