What is AI’s role in the data preparation process? It doesn’t take much more than asking Siri to clean your data to realize that we can’t sit back and let AI take care of all of our data preparation needs. In session 3 of the Data School with Professor Joe Hellerstein, Joe looks at how human intelligence and artificial intelligence can work together to make data cleaning easy and intuitive.
Here’s what we do know: traditional data transformation interfaces are incredibly tedious to work with, asking users to write complex code or use IT power tools to work with data at scale. On the other side of the spectrum, spreadsheets provide an easy-to-use interface but lack scale and governance.
There must be something between “Siri, clean my data” and a 600-line Python script to solve this problem. We want a medium that makes it natural for people to immerse themselves in their data. Speech is probably not that medium, but what about the visual medium? Things like spreadsheets and dashboards are familiar to those who work with data. The visual medium provides a great foundation for seeing how your data changes as you clean, structure and blend it, and it also gives users visuals of their data to interact with. This interaction is exactly where AI comes into the equation.
Let’s take an example. You have a date column with multiple formats: some rows resemble 03/17/15 and others look like 17-Nov-2017. You only discovered this after loading your data into your favorite visualization tool and seeing that a large group of rows in your date column show up as null. Using a traditional data transformation interface, the best-case scenario is that you select an edit column block and then create a complex if/then or case statement using regular expressions. That statement has to identify each condition, for instance where the first 2 digits are the day, followed by a “-”, followed by a three-letter string, followed by another “-”, then followed by a four-digit year, and then reformat the match to a different pattern that you specify using more regular expressions. Once you have that in place, you have reformatted just one of the date formats in your column. What happens when you have 4 different formats? There goes your Wednesday on just this one task. And that’s only your reality if you’re code savvy, and the majority of us aren’t.
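To make that tedium concrete, here is a minimal sketch in Python of what handling just one of those formats by hand might look like. It is an illustration only, not the method of any particular tool: the names MONTHS, DAY_MON_YEAR and reformat_day_mon_year are hypothetical, and the choice of YYYY-MM-DD as the output pattern is an assumption.

```python
import re

# Hypothetical lookup table for three-letter English month abbreviations.
MONTHS = {
    "jan": 1, "feb": 2, "mar": 3, "apr": 4, "may": 5, "jun": 6,
    "jul": 7, "aug": 8, "sep": 9, "oct": 10, "nov": 11, "dec": 12,
}

# Two-digit day, "-", three letters, "-", four-digit year (e.g. 17-Nov-2017).
DAY_MON_YEAR = re.compile(r"(\d{2})-([A-Za-z]{3})-(\d{4})")


def reformat_day_mon_year(value):
    """Rewrite DD-Mon-YYYY as YYYY-MM-DD; return any other value unchanged."""
    match = DAY_MON_YEAR.fullmatch(value)
    if match is None:
        return value  # a different format: this rule does not apply
    day, mon, year = match.groups()
    month = MONTHS.get(mon.lower())
    if month is None:
        return value  # three letters that are not a month abbreviation
    return f"{year}-{month:02d}-{day}"
```

Calling reformat_day_mon_year("17-Nov-2017") gives "2017-11-17", while a value such as "03/17/15" passes through untouched. Every additional format in the column would need its own pattern and its own rewrite rule.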
What happens if we add some visualizations and AI to solve this problem? We can quickly identify when a column contains multiple formats, which would cause an issue in your analytics. Simply clicking on the column that has funky data, flagged by a visual indicator, prompts the AI to provide a ranked list of suggestions for resolving the issue. You can clean up mismatched dates with just a couple of clicks rather than lines and lines of complex code, saving hours of time and frustration.
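As a rough illustration of the idea, the sketch below profiles a column by format and ranks possible fixes by how many rows each would affect. It is a toy heuristic under stated assumptions, not the suggestion engine of any real product: KNOWN_FORMATS, classify, profile_formats and rank_suggestions are hypothetical names, and ranking by row count is an assumed stand-in for whatever a real system learns from user interaction.

```python
import re
from collections import Counter

UNRECOGNIZED = "unrecognized"

# Hypothetical catalogue of formats the profiler knows how to spot.
KNOWN_FORMATS = {
    "MM/DD/YY": re.compile(r"\d{2}/\d{2}/\d{2}"),
    "DD-Mon-YYYY": re.compile(r"\d{2}-[A-Za-z]{3}-\d{4}"),
}


def classify(value):
    """Return the name of the first known format the value matches."""
    for name, pattern in KNOWN_FORMATS.items():
        if pattern.fullmatch(value):
            return name
    return UNRECOGNIZED


def profile_formats(values):
    """Count how many values fall under each format."""
    return Counter(classify(v) for v in values)


def rank_suggestions(values):
    """Suggest converting minority formats to the most common known one."""
    ranked = profile_formats(values).most_common()
    recognized = [name for name, _ in ranked if name != UNRECOGNIZED]
    if not recognized:
        return []  # nothing recognizable to standardize on
    target = recognized[0]
    return [
        f"Convert {count} value(s) from {name} to {target}"
        for name, count in ranked
        if name != target
    ]
```

For a column such as ["03/17/15", "04/02/15", "11/30/16", "17-Nov-2017", "05-Jan-2016", "n/a"], the profile shows three formats in play, and the top suggestion is to convert the two DD-Mon-YYYY values to MM/DD/YY. In a visual interface, the profile would drive the indicator on the column and the ranked list would appear as clickable suggestions.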
Make sure to watch the full video above to see how AI can significantly improve the experience of cleaning your data!
Where else can you find Professor Joe Hellerstein?
Joe Hellerstein is the Chief Strategy Officer and Co-Founder of Trifacta. You can also find him at UC Berkeley as the Jim Gray Chair of Computer Science. He has produced many academic resources for public consumption, including undergraduate course videos on database systems, notes from his graduate course, and research from his team and affiliated labs at UC Berkeley: DSF and RISELab.