6 Steps to a Bulletproof Data Strategy
1. A Bird’s-Eye View Sharper Than Hawkeye’s
Before you start working intensely with a new dataset, it’s a good idea to step boldly into the raw material and do a bit of exploring. Genetically modified eyesight can help (like Hawkeye’s), but it’s not required. Start with a mental picture of what you’re looking for, but also keep an open mind and let the data do the talking.
Tips: Data exploration
- Scan column names and field descriptions to see if any anomalies stand out, or any information is missing or incomplete.
- Do a temperature check to see if your variables are healthy: how many unique values do they contain? What are the ranges and modes?
- Spot any unusual data points that may skew your results. You can use visual methods — say, box plots, histograms, or scatter plots — or numerical approaches such as z-scores.
- Scrutinize those outliers. Should you investigate, adjust, omit, or ignore them?
- Examine patterns and relationships for statistical significance.
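The tips above can be sketched in a few lines. This is a minimal pandas example using hypothetical sales data; the column names and the z-score cutoff of 2 are illustrative choices, not prescriptions.

```python
import pandas as pd

# Hypothetical dataset with one suspicious data point
df = pd.DataFrame({"region": ["N", "S", "E", "W", "N", "S"],
                   "sales": [120, 135, 128, 980, 117, 131]})

# Temperature check: ranges, modes, and unique values
print(df["sales"].describe())          # count, mean, min/max, quartiles
print(df["region"].nunique())          # how many unique regions?

# Numerical outlier detection via z-scores
z = (df["sales"] - df["sales"].mean()) / df["sales"].std()
outliers = df[z.abs() > 2]             # flags the 980 for scrutiny
```

Whether that flagged point gets investigated, adjusted, omitted, or ignored is still your call; the code only surfaces it.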
2. Data More Refined Than Iron Man’s Core
Data full of errors and inconsistencies carries a big price tag: studies have shown that dirty data can shave millions off a company’s revenue. These mistakes can be as expensive as a palladium core, so to prevent big losses, you’ll need to clean your data until it glows like an arc reactor.
Tips: Data cleansing
- Ditch all duplicate records that clog your server space and distort your analysis.
- Remove irrelevant rows or columns that won't affect the problem you’re trying to solve.
- Investigate and possibly remove missing or incomplete info.
- Nip any unwanted outliers you discovered during data exploration.
- Fix structural errors — typography, capitalization, abbreviation, formatting, extra characters.
- Validate that your work is accurate, complete, and consistent, documenting all tools and techniques you used.
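A minimal sketch of those cleansing steps in pandas, using a hypothetical customer table; the messy values here (whitespace, capitalization, a duplicate, a missing name) are invented to illustrate each fix.

```python
import pandas as pd

# Hypothetical messy records: inconsistent capitalization, extra
# whitespace, a duplicate row, and a missing customer name
df = pd.DataFrame({
    "customer": ["Acme", "acme ", "Bolt Co", "Acme", None],
    "amount": [100.0, 100.0, 250.0, 100.0, 75.0],
})

# Fix structural errors: trim extra characters, normalize capitalization
df["customer"] = df["customer"].str.strip().str.title()

# Ditch duplicates, then drop rows with missing info
clean = df.drop_duplicates().dropna(subset=["customer"]).reset_index(drop=True)
```

Note that the order matters: fixing the capitalization first is what lets `drop_duplicates` recognize "acme " and "Acme" as the same record.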
3. A Stronger Combination Than the Avengers
The more high-quality sources you incorporate into your analysis, the deeper and richer your insights. A typical project draws on six or more data sources, which means you’ll need data blending tools to fuse them together seamlessly. Basically, you need to assemble the ultimate team of high-quality data.
Tips: Data blending
- Acquire and prep. If you’re using modern data tools rather than trying to make files conform to a spreadsheet, you can include almost any file type or structure that relates to the business problem you’re trying to solve and transform all datasets quickly into a common structure. Think files and documents, cloud platforms, PDFs, text files, RPA bots, and application assets like ERP, CRM, ITSM, and more.
- Blend. In spreadsheets, this is where you flex your VLOOKUP muscles. (They do get tired, though, don’t they?) If you’re using self-service analytics instead, this process is just drag-and-drop.
- Validate. It’s important to review your results for consistency and explore any unmatched records to see if more cleansing or other prep tasks are in order.
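In code, the blend-and-validate steps can look like this minimal pandas sketch. The two sources and the shared `customer_id` key are hypothetical; the `indicator` flag is what lets you hunt down unmatched records instead of a tired VLOOKUP.

```python
import pandas as pd

# Two hypothetical sources that share a customer_id key
orders = pd.DataFrame({"customer_id": [1, 2, 3], "order_total": [50, 80, 20]})
crm = pd.DataFrame({"customer_id": [1, 2, 4], "segment": ["gold", "silver", "gold"]})

# Blend: an outer join keeps every record from both sides
blended = orders.merge(crm, on="customer_id", how="outer", indicator=True)

# Validate: unmatched records may signal more cleansing or prep work
unmatched = blended[blended["_merge"] != "both"]
```

Here customers 3 and 4 appear in only one source apiece, so they land in `unmatched` for review.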
4. Data Sense Is the New Spidey Sense
Data profiling, data exploration’s cousin, requires more scrutiny. It means examining a dataset specifically for its relevance to a particular project or application. You'll have to use your instincts and know-how to determine whether a dataset should be used at all — a big decision that could have serious financial consequences for your company.
Tips: Data profiling
- Structure profiling. How big is the dataset and what types of data does it contain? Is the formatting consistent, correct, and compatible with its destination?
- Content profiling. What information does the data contain? Are there gaps or errors? This is the stage where you’ll run summary statistics on numerical fields; check for nulls, blanks, and unique values; and look for systemic errors in spelling, abbreviations, or IDs.
- Relationship profiling. Are there spots where data overlaps or is misaligned? What are the connections between units of data? Examples might be formulas that connect cells, or tables that collect information regularly from external sources. Identify and describe all relationships, and make sure you preserve them if you move the data to a new destination.
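The three profiling passes can be condensed into a quick pandas sketch. The dataset and the choice of `id` as the candidate key are hypothetical; the point is that a few lines answer the structure, content, and relationship questions before you commit to the data.

```python
import pandas as pd

# Hypothetical dataset to profile before deciding whether to use it
df = pd.DataFrame({"id": [1, 2, 3, 3],
                   "price": [9.99, None, 4.50, 4.50],
                   "code": ["AB-1", "AB-2", "ab-3", "AB-3"]})

profile = {}

# Structure profiling: how big is it, and what types does it contain?
profile["rows"] = len(df)
profile["dtypes"] = df.dtypes.astype(str).to_dict()

# Content profiling: nulls, blanks, and unique values
profile["nulls"] = df.isna().sum().to_dict()
profile["unique_ids"] = df["id"].nunique()

# Relationship profiling: can "id" safely link to other tables?
profile["id_is_unique"] = df["id"].is_unique
```

The duplicated `id` of 3 shows up immediately, which is exactly the kind of finding that could disqualify the dataset for a join-heavy project.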
5. Establish Your Secret Base
With the enormous volume and complexity of data sources available to you, it’s inevitable that you’ll need to extract it, integrate it, and store it in a centralized location that allows you to retrieve it for analysis whenever you need it — kind of like a secret base (Batcave?) for your day-saving data.
Tips: ETL (Extract, Transform, Load)
- Extract. Pull any and all data — structured or unstructured, one source or many — and validate its quality. (Be extra thorough if you’re pulling from legacy systems or external sources.)
- Transform. Do a deep cleanse here, and make sure your formatting matches the technical requirements for your target destination.
- Load. Write the transformed data to its storage location — usually, a data warehouse. Then sample and check for data quality errors.
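A toy end-to-end ETL run, assuming Python with pandas: an in-memory CSV stands in for the source system and an in-memory SQLite database stands in for the warehouse, so every name here is hypothetical.

```python
import io
import sqlite3
import pandas as pd

# Extract: pull raw data (a CSV string stands in for a real source)
raw = io.StringIO("name,joined\n alice ,2021-03-01\nBOB,2022-07-15\n")
df = pd.read_csv(raw)

# Transform: deep cleanse, and match the target's formatting requirements
df["name"] = df["name"].str.strip().str.title()
df["joined"] = pd.to_datetime(df["joined"]).dt.strftime("%Y-%m-%d")

# Load: write to the storage location (an in-memory "warehouse" here)
conn = sqlite3.connect(":memory:")
df.to_sql("customers", conn, index=False)

# Sample and check for data quality errors
loaded = pd.read_sql("SELECT * FROM customers", conn)
```

The final read-back is the "sample and check" step: comparing `loaded` against expectations catches load-time errors before anyone builds a dashboard on them.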
6. Wrangling Better Than Wonder Woman’s Lasso of Truth
The term “data wrangling” is often used loosely to mean “data preparation,” but it actually refers to the preparation that occurs during the process of analysis and building predictive models. Even if you prepped your data well early on, once you get to analysis, you’ll likely have to wrangle (or munge or lasso) it to make sure your model will consume it — not spit it back out.
Tips: Data wrangling
- Explore. If your model doesn’t perform the way you thought it would, it’s time to dive back into the data to look for the reason.
- Transform. You should structure your data from the beginning with your model in mind. If your dataset’s orientation needs to pivot to provide the output you’re looking for, you’ll need to spend some time manipulating it. (Automated analytics software can do this in one step.)
- Cleanse. Correct errors and remove duplicates.
- Enrich. Add more sources, such as authoritative third-party data.
- Store. Wrangling is hard work. Preserve your processes so they can be reproduced in the future.
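The transform step above (pivoting a dataset’s orientation) can be lassoed in one pandas call. The long-format table and the idea that a model wants one column per metric are hypothetical stand-ins:

```python
import pandas as pd

# Hypothetical long-format data that a model expects in wide orientation
long = pd.DataFrame({"month": ["Jan", "Jan", "Feb", "Feb"],
                     "metric": ["clicks", "spend", "clicks", "spend"],
                     "value": [100, 25.0, 140, 30.0]})

# Transform: pivot so each metric becomes its own column
wide = long.pivot(index="month", columns="metric", values="value").reset_index()
```

Preserving a script like this, rather than pivoting by hand, is also the "store" tip in action: the wrangling can be reproduced on next quarter’s data with no extra work.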