As we grew our massive new Data Science program at Berkeley, it became clear that we needed a class aimed specifically at Data Engineering. The goals of Data Engineering differ from those of Software Engineering, so it was interesting to think through this curriculum and how we would teach it differently from our established database classes.
In this new approach, we ended up emphasizing four steps to SQL for Data Engineering that are atypical of a traditional databases class: data quality, data reshaping, “spreadsheet tasks,” and data pipeline testing.
Joe Hellerstein • September 7, 2021
When we use SQL for Transformation—the “T” in ELT—the focus changes. In this case, we’re taking many messy and disparate tables and manipulating them into a more usable or common form. To take our example from before, we may be extracting and loading sales data from 17 electronics chains that sold the phones, and our job in SQL is to write transformation queries that integrate that data together.
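As a rough illustration of that kind of integration query, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`chain_a_sales`, `chain_b_sales`), their columns, the sample rows, and the `all_sales` view are all hypothetical, invented for this example; a real warehouse would have its own schemas and SQL dialect. The idea is only that each chain's raw table is loaded as-is, and a transformation query maps them into one common shape.

```python
import sqlite3

# Hypothetical raw tables, loaded untouched from two of the chains.
# Table names, columns, and rows are assumptions made for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chain_a_sales (sale_date TEXT, model TEXT, units INTEGER, price_usd REAL);
CREATE TABLE chain_b_sales (sold_on TEXT, phone_model TEXT, qty INTEGER, price_cents INTEGER);

INSERT INTO chain_a_sales VALUES ('2021-08-01', 'Phone X', 2, 699.00);
INSERT INTO chain_b_sales VALUES ('08/02/2021', 'phone x', 1, 64900);
""")

# The "T" step: one view that puts both sources into a common form.
conn.executescript("""
CREATE VIEW all_sales AS
SELECT 'chain_a' AS chain,
       sale_date,
       UPPER(model) AS model,
       units,
       price_usd
FROM chain_a_sales
UNION ALL
SELECT 'chain_b',
       -- Rewrite MM/DD/YYYY as YYYY-MM-DD so dates sort and compare consistently.
       substr(sold_on, 7, 4) || '-' || substr(sold_on, 1, 2) || '-' || substr(sold_on, 4, 2),
       UPPER(phone_model),
       qty,
       price_cents / 100.0  -- convert cents to dollars
FROM chain_b_sales;
""")

rows = conn.execute(
    "SELECT chain, sale_date, model, units, price_usd FROM all_sales ORDER BY sale_date"
).fetchall()
for row in rows:
    print(row)
```

With 17 chains, the same pattern would presumably grow to 17 branches of the `UNION ALL`, each one absorbing that source's quirks in naming, date format, and units.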
Joe Hellerstein • August 30, 2021
ELT is increasingly attractive these days. Modern data warehouses are flexible and increasingly cost-effective, allowing us to store large volumes of data, even messy data that includes large amounts of text and images. In this environment, transformations occur in the data warehouse, where the native language is SQL.
Joe Hellerstein • August 23, 2021
For the first decades of the millennium, it seemed like the Java-centric approach was the "hot new thing," but SQL has been roaring back. Today, SQL seems to be the focus of every data engineering conversation and is popping back up on billboards in Silicon Valley.
The comparison of the two "shops" inevitably leads to the question: which is better? There are pros and cons to emphasizing one or the other.
Joe Hellerstein • August 16, 2021