Efficient Data Engineering for the Modern Data Stack

Technology | Naresh Govindaraj | Shyam Srinivasan | Dec 15, 2021

This is the second post in a series where we discuss data architectures and the continued need for scalable and efficient data engineering. In the first post, we talked about the evolution of data architectures and how data teams are looking for a scalable and modern data solution that can address the needs of today and be agile enough to meet future requirements. Towards the end of the post, we introduced the concept of “Speed of Thought Design” for modern data engineering and how it is critical to enabling data practitioners and accelerating the deployment of the modern data stack. It’s now time to dive deep into this concept and show how to achieve it with ease and at scale with Designer Cloud, the Data Engineering Cloud.

The cloud has enabled nearly infinite storage and computing capacity, but this also means you need the right tools to get the right data in place. You also need this data to be in the right format and of high quality for use in different applications. AI and ML are used both for getting the right data in place and for pushing it downstream. This means AI needs to be leveraged at the time of data transformation by understanding the data at its most granular level. This is where a data engineering leader like Trifacta comes in.

Strong Foundation for your Data

Designer Cloud is built from the ground up to get you the data in the format you need, at the quality your business and applications require, and with the ease of use that data practitioners, including analysts, engineers, and scientists, expect. Designer Cloud provides a visually rich interface to load, transform, and understand your data. With an intelligent AI engine at the back end, it suggests which aspects of your data need to be transformed and shows real-time previews of those suggestions. Using an agile design platform, you can collaborate with other users to transform and pipeline your data. There is no need to be proficient in languages such as SQL or to have operational expertise in data pipelining tools. Designer Cloud caters to both no-code users and those who want to work with code such as SQL and Python.

The Modern Design Paradigm for Data Engineering

With data architectures evolving from ETL to ELT, Trifacta delivers the most advanced, modern, cloud-based platform to engineer your data efficiently and at scale. Designer Cloud provides the level of efficiency, intelligence, and sophistication needed by data paradigms such as warehouses, lakes, and fabrics.

Data Sampling and Profiling

Designer Cloud offers a flexible approach to choosing your data through both connectivity and data sampling. You can choose the size of the data sample you work on, and the transformations you apply are then propagated to your entire data set. There are different sampling techniques to choose from, including random, filter-based, cluster-based, anomaly, and stratified sampling. Working with different types of data samples enables you to prepare your data at its most basic level and cover edge cases to avoid errors downstream. Trifacta provides intuitive ways to view your data so you can explore and understand it better and implement the required transformations with high accuracy.
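To illustrate why the choice of sampling technique matters for edge cases, here is a minimal sketch of stratified sampling in plain Python. It is not Designer Cloud's implementation; the function name, the column name, and the data are hypothetical and chosen only for the example.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each group defined by `key`,
    keeping at least one row per group so rare values are not lost."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 98 rows for a common value and one row each for two rare values.
rows = [{"id": i, "country": "US"} for i in range(98)]
rows += [{"id": 98, "country": "IS"}, {"id": 99, "country": "MT"}]

sample = stratified_sample(rows, key="country", fraction=0.1)
countries = {row["country"] for row in sample}
```

A purely random 10% sample of these rows could easily miss both rare countries, while the stratified sample always contains at least one row for each, so a transformation designed against it is exercised on those edge cases too.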

AI-assisted design with real-time previews

Designer Cloud gives users a natural, spreadsheet-like view of their data, so they can apply data transformations and see the data change in real time. The real-time preview lets users validate transformations quickly and accurately and avoid the multiple iterations they may experience in other tools. In Designer Cloud, you don’t need to remember the syntax of a complex regular expression to parse out components such as zip codes from address fields. When you brush over and highlight the data segment you are interested in, the AI system learns from your actions, derives the formula for you, and shows you what the results will look like across multiple records.
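For a sense of what writing such a pattern by hand involves, here is a small Python sketch that pulls a US zip code out of a free-text address. The pattern and the sample addresses are assumptions made for this example, not output generated by Designer Cloud.

```python
import re

# Five digits, optionally followed by "-" and four more digits (ZIP+4),
# anchored to the end of the address so house numbers are not matched.
ZIP_PATTERN = re.compile(r"\b(\d{5})(?:-(\d{4}))?\s*$")

def extract_zip(address):
    """Return the trailing zip code of an address, or None if there is none."""
    match = ZIP_PATTERN.search(address)
    return match.group(0).strip() if match else None

addresses = [
    "1600 Pennsylvania Ave NW, Washington, DC 20500",
    "350 Fifth Ave, New York, NY 10118-0110",
    "12345 Main St, Springfield",  # five-digit house number, no zip code
]
zips = [extract_zip(a) for a in addresses]
```

The third address is the kind of edge case that is easy to get wrong when writing the expression blind, and easy to spot when the result is previewed across many records.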

In the Designer Cloud design interface, null values in columns and missing data are revealed and the AI system recommends transformations to fix these types of common data-related problems. Additionally, you can perform all the popular database operations such as joins, unions, and aggregations to combine different datasets and generate summaries along with quick previews of your result. Also, working with JSON arrays and objects can be done with ease as you can see the data change on the screen as you perform these complex operations.
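As a rough picture of what handling missing values and nested JSON means in code, the following Python sketch flattens one nested record into flat rows and fills a null field with a placeholder. The record layout, the field names, and the "UNKNOWN" placeholder are all hypothetical.

```python
import json

# Hypothetical nested record with a null city and an array of line items.
raw = (
    '{"order_id": 7, "customer": {"name": "Ada", "city": null}, '
    '"items": [{"sku": "A1", "qty": 2}, {"sku": "B4", "qty": 1}]}'
)

def flatten_order(record, missing="UNKNOWN"):
    """Return one flat row per element of the nested `items` array."""
    customer = record.get("customer") or {}
    base = {
        "order_id": record["order_id"],
        "customer_name": customer.get("name") or missing,
        "customer_city": customer.get("city") or missing,
    }
    return [{**base, **item} for item in record.get("items", [])]

rows = flatten_order(json.loads(raw))
```

Each element of the `items` array becomes its own row, with the order and customer fields repeated alongside it.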

During execution, the Designer Cloud runtime optimizer maps recipe functions to SQL equivalents or user-defined functions (UDFs) so they can execute natively on data warehouse engines. By leveraging UDFs, a richer set of data quality and parsing functions is made available to users beyond what the native data warehouse SQL language provides.
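The mapping idea can be sketched in a few lines of Python. This is only an illustration of the general technique, assuming a simple lookup table; the function names, the `udf_` prefix, and the templates are hypothetical and do not reflect the actual optimizer.

```python
# Hypothetical recipe function names mapped to native SQL templates.
SQL_EQUIVALENTS = {
    "uppercase": "UPPER({col})",
    "trim": "TRIM({col})",
    "length": "LENGTH({col})",
}

def to_sql_expr(function, column):
    """Translate one recipe function into a SQL expression for `column`."""
    template = SQL_EQUIVALENTS.get(function)
    if template is not None:
        return template.format(col=column)
    # No native equivalent: fall back to a (hypothetical) registered UDF.
    return f"udf_{function}({column})"

native = to_sql_expr("trim", "city")
fallback = to_sql_expr("parse_phone", "phone")
```

Here `native` is the string `TRIM(city)`, while `fallback` is `udf_parse_phone(phone)`, standing in for a function the warehouse's SQL dialect does not provide on its own.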

Designer Cloud also provides collaboration features for users to share best practices via reusable recipes and macros and other shared assets. Data teams can work together to conquer the mountain of information they are presented with.

Self-Documenting Recipes

Designer Cloud captures data transformation steps in recipes and displays them in readable text, so you can easily understand what operations have been performed. This self-documenting capability makes Designer Cloud recipes much easier to maintain than work done in other tools or code-based environments, where it is the users’ responsibility to maintain documentation that can easily go out of sync.
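One way to picture self-documentation is to store each step as structured data and derive the readable text from it, so the description cannot drift from what the step does. The sketch below assumes a hypothetical `Step` structure and wording; it is not the Designer Cloud recipe format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    action: str       # what the step does, in words
    column: str       # the column it operates on
    detail: str = ""  # optional extra information

    def describe(self):
        text = f"{self.action} column '{self.column}'"
        return f"{text} {self.detail}" if self.detail else text

# Hypothetical recipe with three steps.
recipe = [
    Step("Trim whitespace in", "city"),
    Step("Replace missing values in", "state", "with 'UNKNOWN'"),
    Step("Extract ZIP code from", "address", "into 'zip'"),
]

def render(recipe):
    """Render the recipe as a numbered, human-readable list of steps."""
    return "\n".join(
        f"{i}. {step.describe()}" for i, step in enumerate(recipe, start=1)
    )

documentation = render(recipe)
```

Because the text is generated from the same objects that would drive execution, editing a step updates its description automatically.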

Closed Loop Metadata and Data-Driven Design

The Designer Cloud Data Engineering platform is both metadata-driven and data-driven. Unlike in other tools, in Designer Cloud you operate on data during design, and the system captures the metadata in the background. Designer Cloud preserves all the advantages of a metadata-driven system, such as lineage and reuse, and because it is also data-driven you can continue to operate on data naturally, much as you would in a spreadsheet. The closed-loop system enables the AI system to learn more and make new recommendations based on usage patterns captured in the metadata.

Uncompromised Execution at Scale

The Designer Cloud runtime optimizer generates data warehouse-compatible SQL and uses pushdown (ELT) operations to run the SQL directly in the cloud data warehouse (CDW) engine to perform the transformations. The data stays in the CDW environment, and interactive query responses return in under a second, taking advantage of the performant and scalable CDW engines.
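The pushdown pattern itself can be shown with a small Python sketch. An in-memory SQLite database stands in for the cloud data warehouse here, and the table and column names are hypothetical; the point is only that the transformation runs as SQL inside the engine instead of on rows pulled out to the client.

```python
import sqlite3

# In-memory SQLite as a stand-in for a cloud data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(" east ", 10.0), ("EAST", 5.0), ("west", 7.5), (None, 2.0)],
)

# The cleanup and aggregation are expressed as SQL and executed by the engine;
# no raw rows are fetched into the client to be transformed.
conn.execute(
    """
    CREATE TABLE clean_orders AS
    SELECT COALESCE(UPPER(TRIM(region)), 'UNKNOWN') AS region,
           SUM(amount) AS total
    FROM raw_orders
    GROUP BY COALESCE(UPPER(TRIM(region)), 'UNKNOWN')
    """
)

# Only the small, already-transformed result leaves the engine.
result = conn.execute(
    "SELECT region, total FROM clean_orders ORDER BY region"
).fetchall()
```

The client only issues statements and reads back the three summary rows, which is what lets the approach scale with the engine and not with the machine running the design tool.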

If you need to extract data from the CDW and load it back into your operational applications (via reverse ETL), then the Designer Cloud execution paradigm continues to work and leverages the data warehouse engines for transformations before data is sent to the required operational environments. If your backend data lake is running on a cluster, then Designer Cloud generates Spark jobs for native execution. This approach isolates implementation from execution and provides more deployment flexibility for users.

Now you have a data engineering design interface that operates at near speed of thought and can leverage and complement the speed of data processing provided by modern CDWs. The recipe metadata is captured in an execution-neutral manner, so if you need to change execution environments in the future to Spark or other SQL-based environments, it can be done with minimal changes.

Agility to the Modern Data Stack

Designer Cloud adds speed and agility to the modern data stack in the cloud. With Designer Cloud, you are equipped with a modern, explorative, and agile data engineering interface to fine-tune raw data, convert it into high-quality, useful information, and derive maximum value from it. Test drive Designer Cloud today with our free 30-day trial and leverage these exciting capabilities.