Dedupe: Duplicates, or dupes, usually show up when data is blended from different sources (e.g., spreadsheets, websites, and databases) or when a customer has multiple points of contact with a company or has submitted redundant forms. This repeated data consumes storage and processing resources, inflating file sizes and making analysis less efficient. Rules around deduping depend on a company’s desired outcome. For example, if a customer submitted different emails on different pages of a website, a conservative approach would eliminate only rows of data where every field is an exact match.
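As a rough sketch of that conservative rule in pandas (assuming pandas is the tool in use; the contact columns name, email, and signup_page are made up for illustration):

```python
import pandas as pd

# Hypothetical contact records blended from two sources.
df = pd.DataFrame({
    "name": ["Ana Silva", "Ana Silva", "Ana Silva"],
    "email": ["ana@example.com", "ana@example.com", "ana.s@example.com"],
    "signup_page": ["pricing", "pricing", "blog"],
})

# Conservative dedupe: drop a row only when every field is an exact match.
# The third row survives because its email differs.
deduped = df.drop_duplicates()
print(deduped)

# A more aggressive rule might treat any rows sharing a name as duplicates:
# df.drop_duplicates(subset=["name"], keep="first")
```

Leaving the aggressive rule commented out reflects the point above: how far to collapse near-duplicates is a business decision, not a purely technical one.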
Remove Irrelevant Observations: Data that’s not relevant to the problem being solved can slow down processing time. Removing these irrelevant observations doesn’t delete them from the source but excludes them from the current analysis. For example, when investigating campaigns from the last year, there’s no need to include data from outside that time frame. Keep in mind, however, that a variable that seems unnecessary may still be correlated with the outcome being investigated (e.g., a customer’s age could affect which email was most successful), so confirm it truly adds nothing before excluding it.
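A minimal sketch of that time-frame filter, again assuming pandas; the campaign table and its sent_date column are hypothetical. The filter produces a new view for the current analysis and leaves the source data untouched.

```python
import pandas as pd

# Hypothetical campaign data spanning more than the period of interest.
campaigns = pd.DataFrame({
    "campaign": ["spring_sale", "holiday_push", "summer_promo"],
    "sent_date": pd.to_datetime(["2023-04-01", "2022-12-05", "2023-07-15"]),
    "opens": [1200, 950, 1430],
})

# Keep only campaigns sent within the last year of interest; the
# original DataFrame is unchanged, so nothing is deleted at the source.
cutoff = pd.Timestamp("2023-01-01")
recent = campaigns[campaigns["sent_date"] >= cutoff]
print(recent)
```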
Manage Incomplete Data: Data might be missing values for a few reasons (e.g., customers not providing certain information), and addressing those gaps is vital because it prevents bias and miscalculations in the analysis. After isolating and examining the incomplete values, which can show up as “0,” “NA,” “none,” “null,” or “not applicable,” determine whether they are plausible values or stand-ins for missing information. While the easiest solution might be to drop the incomplete data, be aware of any bias that might result from that action. Alternatives include replacing null values with substitutes based on statistical or conditional modeling, or flagging and commenting on the missing data.
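The sketch below (pandas, with a hypothetical survey table) shows one way to work through those choices: normalize placeholder values to true nulls, count what is missing, then either drop the incomplete rows or impute a flagged statistical substitute. The column names, the decision to treat an age of 0 as missing, and the median imputation are all illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses where missing data hides behind
# placeholder strings as well as true nulls.
responses = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "age": [34, 0, np.nan, 29],
    "plan": ["pro", "none", "basic", "N/A"],
})

# Normalize placeholders ("none", "N/A", and an implausible age of 0)
# to real NaNs so they can be counted and handled consistently.
responses["plan"] = responses["plan"].replace({"none": np.nan, "N/A": np.nan})
responses["age"] = responses["age"].replace(0, np.nan)

# Examine how much is missing before deciding what to do.
print(responses.isna().sum())

# Option 1: drop incomplete rows (simple, but can introduce bias).
dropped = responses.dropna()

# Option 2: impute a simple statistical substitute (the median age here)
# and flag which rows were filled in so the imputation stays visible.
responses["age_was_missing"] = responses["age"].isna()
responses["age"] = responses["age"].fillna(responses["age"].median())
print(responses)
```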
Identify Outliers: Data points that are far removed from the rest of a population can significantly distort the reality of the data. These outliers can be identified with visual or numerical techniques, including box plots, histograms, scatterplots, and z-scores; when outlier detection is part of an automated process, it becomes faster to form assumptions about the data, test them, and resolve data issues with confidence. Once identified, outliers can be included or omitted depending on how extreme they are and what statistical methods are being used in the analysis.
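A short numerical illustration of two of those techniques in pandas, using made-up order values; the z-score cutoff of two standard deviations is an assumption (two or three are common choices), and the interquartile-range rule shown is the same one a box plot visualizes.

```python
import pandas as pd

# Hypothetical order values with one extreme point.
orders = pd.Series([52, 48, 55, 61, 47, 53, 950], name="order_value")

# Z-score method: flag points more than 2 standard deviations from the mean.
z_scores = (orders - orders.mean()) / orders.std()
z_outliers = orders[z_scores.abs() > 2]

# IQR method (what a box plot shows): flag points beyond 1.5 * IQR
# outside the first and third quartiles.
q1, q3 = orders.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = orders[(orders < q1 - 1.5 * iqr) | (orders > q3 + 1.5 * iqr)]

print("z-score outliers:\n", z_outliers)
print("IQR outliers:\n", iqr_outliers)
```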
Fix Structural Errors: It’s important to correct errors and inconsistencies such as typos, inconsistent capitalization, abbreviations, and formatting. Look at the data type for each column and make sure entries are correct and consistent, which may include standardizing fields, and remove unwanted characters such as extra whitespace.
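One way this might look in pandas, with hypothetical customer records; the state abbreviation map and column names are invented for the example.

```python
import pandas as pd

# Hypothetical customer records with inconsistent formatting.
customers = pd.DataFrame({
    "state": [" ca", "CA ", "Calif.", "ny", "N.Y."],
    "signup_date": ["2023-01-04", "2023-01-09", "2023-02-17", "2023-03-02", "2023-03-15"],
})

# Trim extra whitespace and standardize capitalization.
customers["state"] = customers["state"].str.strip().str.upper()

# Map abbreviations and variants to a single canonical form.
customers["state"] = customers["state"].replace({"CALIF.": "CA", "N.Y.": "NY"})

# Enforce the correct data type for each column, e.g. parse date strings
# into a true datetime column.
customers["signup_date"] = pd.to_datetime(customers["signup_date"])

print(customers)
```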
Validate: Validation is the opportunity to ensure data is accurate, complete, consistent, and uniform. This happens throughout an automated data cleansing process, but it’s still important to spot-check a sample of the cleaned data to ensure everything aligns. This is also an opportunity to document which tools and techniques were used throughout the cleansing process.
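A sketch of what such a spot check could look like in pandas; the specific rules, column names, and sample size are placeholders for whatever the earlier cleansing steps were meant to guarantee.

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list:
    """Run simple checks on a cleaned dataset and return a list of
    human-readable problems (empty if every check passes)."""
    problems = []
    if df.duplicated().any():
        problems.append("duplicate rows remain")
    if df.isna().any().any():
        problems.append("null values remain")
    if not pd.api.types.is_datetime64_any_dtype(df["signup_date"]):
        problems.append("signup_date is not a datetime column")
    if not df["email"].str.contains("@").all():
        problems.append("malformed email addresses present")
    return problems

# Spot-check a random sample of the cleaned data rather than the full set.
cleaned = pd.DataFrame({
    "email": ["ana@example.com", "li@example.com"],
    "signup_date": pd.to_datetime(["2023-01-04", "2023-02-17"]),
})
issues = validate(cleaned.sample(frac=0.5, random_state=42))
print(issues or "all checks passed")
```

Keeping the checks in one function also doubles as lightweight documentation of which rules the cleansing process was expected to enforce.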