The first step in profiling any data, whether an entire database or just one file, is to look at its structure and format. Some questions to ask during structure profiling:
- What’s the overall size of the dataset?
- What types of data does it contain (e.g., strings, floats, datetimes, Booleans, spatial objects)?
- Is the data formatted consistently and correctly? This matters most when migrating data to a new repository.
After answering these questions, label and tag the data with the findings to improve its usability.
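The structure questions above can be sketched in a few lines of standard-library Python. This is only an illustrative sketch: the function names `infer_type` and `profile_structure`, the coarse type labels, and the sample rows are assumptions made for the example, not part of any particular profiling tool. It assumes the data has already been read into a list of dictionaries of raw strings (as `csv.DictReader` would produce) and that datetimes are expected in ISO 8601 form.

```python
from collections import Counter
from datetime import datetime


def infer_type(value: str) -> str:
    """Guess a coarse type label for one raw string value."""
    text = value.strip()
    if text == "":
        return "empty"
    if text.lower() in {"true", "false"}:
        return "boolean"
    # Try the strictest parser first; fall through to "string".
    for label, parser in (("integer", int), ("float", float),
                          ("datetime", datetime.fromisoformat)):
        try:
            parser(text)
            return label
        except ValueError:
            continue
    return "string"


def profile_structure(rows: list[dict[str, str]]) -> dict:
    """Report dataset size and the mix of inferred types per column."""
    columns = list(rows[0].keys()) if rows else []
    type_counts = {col: Counter(infer_type(row[col]) for row in rows)
                   for col in columns}
    return {
        "row_count": len(rows),
        "column_count": len(columns),
        "types": type_counts,
        # More than one non-empty type in a column suggests inconsistent formatting.
        "inconsistent": [col for col, counts in type_counts.items()
                         if len(set(counts) - {"empty"}) > 1],
    }


# Hypothetical sample: one signup date is not in ISO format.
sample = [
    {"id": "1", "signup": "2023-01-15", "score": "4.5"},
    {"id": "2", "signup": "15/01/2023", "score": "3.0"},
    {"id": "3", "signup": "2023-02-01", "score": ""},
]
report = profile_structure(sample)
print(report["row_count"], report["column_count"], report["inconsistent"])
```

Here the `signup` column is flagged because one value fails to parse as an ISO date and falls back to a string. The resulting report is the kind of finding that could be recorded as a label or tag on the dataset.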
Looking at the content, from both a cognitive and a visual perspective, can provide a better understanding of the data and highlight where it has gaps or errors. During content profiling, one should:
- Run summary statistics, such as min/max values for numerical fields and value frequencies for categorical fields
- Check for the number of null values, blanks, and unique values to gain insight into the range and quality of the data and whether a field is relevant
- Look for systemic errors such as misspellings and inconsistent representation of values (e.g., “Doctor” versus “Dr.”), which can derail an analytic process
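As a rough illustration of these content checks, the sketch below summarizes a single column using only the Python standard library. The helper names `profile_column` and `normalize`, the `ALIASES` mapping, and the sample values are all hypothetical. The sketch assumes values arrive as raw strings, with `None` or blank strings standing in for missing data.

```python
from collections import Counter
from typing import Optional


def profile_column(values: list[Optional[str]]) -> dict:
    """Summarize one column: missing count, unique count, and min/max or frequencies."""
    present = [v.strip() for v in values if v is not None and v.strip() != ""]
    summary = {
        "missing": len(values) - len(present),  # None or blank entries
        "unique": len(set(present)),
    }
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # Not every value is numeric, so treat the column as categorical.
        summary["frequencies"] = Counter(present)
    else:
        if numbers:
            summary["min"], summary["max"] = min(numbers), max(numbers)
    return summary


# Hypothetical alias table for collapsing variant spellings of one value.
ALIASES = {"dr.": "Doctor", "dr": "Doctor"}


def normalize(value: str) -> str:
    """Map known variants to a canonical spelling; leave other values unchanged."""
    cleaned = value.strip()
    return ALIASES.get(cleaned.lower(), cleaned)


titles = ["Doctor", "Dr.", "Nurse", "", None, "Doctor"]
ages = ["34", "51", "", "29"]

title_profile = profile_column(titles)
age_profile = profile_column(ages)
normalized_titles = Counter(normalize(v) for v in titles if v and v.strip())
```

Comparing `title_profile["frequencies"]` with `normalized_titles` shows how a variant such as “Dr.” splits what is really one category. A high `missing` count or a `unique` count of one would suggest a field that may not be relevant to the analysis.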
Identifying the key relationships across data can guide retention efforts and highlight where data might need to be transformed to be more useful. A relationship could be as simple as a formula in one spreadsheet cell that references another cell, or as complex as a table that aggregates sales data from a collection of regularly updated tables.
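Two simple relationship checks can be sketched as follows: one looks for rows whose key has no match in a related table, and the other confirms that an aggregated table still agrees with the detail rows it was built from. The function names, the column names (`store_id`, `amount`, `total`), and the sample tables are assumptions chosen for illustration, and the tables are assumed to be lists of dictionaries.

```python
from collections import defaultdict


def find_orphans(child_rows, foreign_key, parent_rows, primary_key):
    """Return foreign-key values in child_rows that have no match in parent_rows."""
    parent_keys = {row[primary_key] for row in parent_rows}
    return sorted({row[foreign_key] for row in child_rows} - parent_keys)


def aggregate_mismatches(detail_rows, summary_rows, key, amount, total):
    """Return keys whose summary total differs from the sum of the detail rows."""
    sums = defaultdict(float)
    for row in detail_rows:
        sums[row[key]] += row[amount]
    # A small tolerance avoids flagging floating-point rounding noise.
    return sorted(row[key] for row in summary_rows
                  if abs(sums[row[key]] - row[total]) > 1e-9)


# Hypothetical tables: store "C" is unknown, and the total for "B" is stale.
stores = [{"store_id": "A"}, {"store_id": "B"}]
sales = [
    {"store_id": "A", "amount": 10.0},
    {"store_id": "A", "amount": 5.0},
    {"store_id": "B", "amount": 7.5},
    {"store_id": "C", "amount": 2.0},
]
sales_summary = [
    {"store_id": "A", "total": 15.0},
    {"store_id": "B", "total": 9.0},
]

orphans = find_orphans(sales, "store_id", stores, "store_id")
mismatches = aggregate_mismatches(sales, sales_summary, "store_id", "amount", "total")
```

In this sample, `orphans` identifies sales that reference a store missing from the `stores` table, and `mismatches` identifies an aggregate that no longer matches its source rows. Both are places where data may need to be repaired or transformed before use.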