Blazingly fast execution with BigQuery Pushdown for Google Cloud Dataprep

Technologie   |   Bertrand Cariou   |   7. April 2021

In the previous blog of this series, we discussed how Cloud Dataprep intelligently “pushes-down” the transformation logic to the source database, thereby reducing the overall data load ingested into the application, and significantly improving performance.

Now let’s take it to the next level, and talk about what Cloud Dataprep can do if the source and destination for your data transformation is in the same cloud data warehouse or data lake, and how Designer Cloud can fully pushdown the entire transformation to the powerful compute engine of the underlying cloud database itself.

Introducing Google BigQuery Pushdown for Designer Cloud flows!

When the data is already inside BigQuery, instead of extracting the data, running a transformation job in Google Dataflow, and publishing it back to BigQuery, Dataprep intelligently translates it’s wrangle language, fully or partially, into standard SQL, and executes this natively inside BigQuery using it’s compute power. The result of this is upto 20x faster job executions! Yes, you read that right! 

Not only is performance blazingly fast, but there are some cost optimizations for our customers running workloads directly inside Google BigQuery, primarily due to the efficient optimizations of execution, but also because there is no data ingestion or extraction cost. More details about performance enhancements and cost optimizations are shared in Google Cloud’s blog from Sudhir Hasbe, Sr. Director, Product Management, Smart Analytics at Google Cloud.  

Fig. Dataprep processing engine options with Dataflow and BigQuery

Apart from performance and cost optimization benefits, security standards are also elevated with SQL security enforcement at the user or service level with IAM & OAuth. There are also optimization settings available at a flow level, for fine-grained control for data manipulations such as filters, joins, unions, and aggregations.

The best part about this exciting new feature is that the data engineering experience is un-changed for the user, and all this magic happens intelligently under the hood!

From ETL to ELT

Think about BigQuery Pushdown as ELT (Extract, Load, Transform), which has  become the preferred mode for Data Engineering as opposed to ETL (Extract, Transform, Load). The ELT framework allows business users to focus on the business transformation of data within the data warehouse, instead of worrying about the loading of the data and transforming it mid-stream.

This Paradigm powered by BigQuery pushdown truly democratizes data-engineering for building ELT data pipelines without the need to write or maintain a single line of code. This also empowers business users to become citizen data engineers with a no code / low code solution.

More Performance Optimizations Coming Soon

Execution for most flows where source and destination are both BigQuery will now be pushed down to BigQuery for execution with the next release of our product. We are working hard to make sure every flow, and every transformation gets executed as efficiently as possible within the data warehouse, including other use-cases where the source could be a Relational Database or a file system.

Our next blog in this series will be a technical deep dive for a more technical audience to talk about what happens under the hood, and the innovations done by the Alteryx engineering teams to make these pushdowns possible.