Google Cloud Dataprep by Trifacta
Google Cloud Dataprep by Trifacta is a native Google Cloud service jointly developed and supported by Google and Trifacta. Dataprep by Trifacta combines Trifacta’s award-winning, interactive data wrangling experience with the elastic scale and security measures of Google Cloud storage and processing. Dataprep by Trifacta is available in the Google Cloud Console and in the Google Cloud Marketplace, and it inherits Google Cloud’s consumption, invoicing, and security principles.
Dataprep by Trifacta is the only native and serverless data preparation solution on Google Cloud. Designed for enterprise-wide deployments, it can scale securely to support any number of users and any volume of data.
Trifacta Security by Design
Trifacta ensures security by design for Dataprep by determining security requirements and architecture best practices during the design phase of Trifacta’s Software Development Life Cycle (SDLC).
Trifacta has implemented a vulnerability management program to promptly identify, track, and remediate security vulnerabilities in the Trifacta Software, which is the basis for Google Cloud Dataprep by Trifacta.
- Every month, production application base images of the Trifacta Software are scanned for security vulnerabilities.
- Every quarter, vulnerability scanning for open-source components of the Trifacta Software is performed by an automated system.
- Multiple times a year, Trifacta engages a third-party data security provider to conduct penetration testing of the Trifacta hosted production environment.
For each case, Trifacta reviews and tracks security vulnerabilities to resolution.
Integrity and activity monitoring tools help Trifacta detect and report changes to software and configuration parameters and monitor user service availability, web services, databases, policy changes, security groups, and firewalls. When an issue is detected, Trifacta reviews and analyzes it through to resolution.
Google Cloud Organization Policies
Dataprep by Trifacta relies on the Google Cloud Organization policies that each Customer sets up for its projects, so that constraints such as Google Cloud’s Domain Restricted Sharing and Resource Location Constraints are honored. If the Customer chooses, working within a shared VPC Service Controls (VPC-SC) perimeter is also supported.
All access to resources and APIs in the Customer’s projects to execute and monitor Dataprep jobs is governed by the designated Google Service Account, as provisioned and defined by the Customer. Dataprep by Trifacta’s authentication and authorization to Customer data relies on Service Account grants, which the Customer also manages in Google IAM.
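The Service Account model described above can be illustrated with a minimal sketch that inspects an IAM policy document in its standard `bindings` form. The role list, the policy contents, and the service account name below are hypothetical examples, not the grants that Dataprep by Trifacta actually requires.

```python
from typing import Dict, Set

# Illustrative only: the roles a Customer actually needs to grant are
# defined in the Google Cloud documentation, not in this sketch.
EXAMPLE_REQUIRED_ROLES: Set[str] = {
    "roles/dataflow.developer",
    "roles/bigquery.user",
}


def roles_for_member(policy: Dict, member: str) -> Set[str]:
    """Return every role that the policy binds to the given member."""
    return {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }


def missing_roles(policy: Dict, member: str, required: Set[str]) -> Set[str]:
    """Return the required roles that the member has not been granted."""
    return required - roles_for_member(policy, member)


# Hypothetical service account and project policy, for illustration only.
example_member = (
    "serviceAccount:dataprep-jobs@example-project.iam.gserviceaccount.com"
)
example_policy = {
    "bindings": [
        {"role": "roles/dataflow.developer", "members": [example_member]},
        {"role": "roles/storage.objectViewer", "members": ["user:analyst@example.com"]},
    ]
}
```

Calling `missing_roles(example_policy, example_member, EXAMPLE_REQUIRED_ROLES)` reports which of the example roles still need to be granted. The point of the sketch is that the Customer's policy, and nothing held by Trifacta, determines what the Service Account can reach.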
Dataprep by Trifacta Security Architecture
Google Cloud Dataprep by Trifacta is architected with data security in mind. Dataprep by Trifacta translates user-generated metadata describing data transformation logic into a job that is executed by the scalable Google Cloud Dataflow or BigQuery data processing engines or, for small datasets of around 1 GB, by Dataprep by Trifacta’s in-memory engine.
The Dataprep by Trifacta job reads, transforms, and writes Customer data between the data source and target systems, with data never stored outside of the Customer-controlled Google Cloud project resources. Dataprep by Trifacta uses a secure connection between the data source and target systems, encrypted with Secure Sockets Layer (SSL) or Transport Layer Security (TLS).
The Customer’s users work in the Dataprep by Trifacta web interface to define the data transformation logic and schedule job execution. The Dataprep by Trifacta instance stores these definitions only as metadata in an encrypted Google Cloud SQL relational database. Dataprep by Trifacta does not store any Customer data.
Dataprep by Trifacta inherits the Customer’s user permissions defined by the Customer in Google Cloud Identity and Access Management (IAM) set on data resources. As such, the Customer’s users can only prepare the data they have access to.
The Dataprep by Trifacta service is hosted in the US-Central region of Google Cloud. Note that each Customer may choose other regions worldwide for its own projects.
Fig. Dataprep by Trifacta Security Architecture including data at rest, data in motion, and metadata
User Authentication and Authorization
For authentication management, Dataprep by Trifacta relies entirely on, and inherits, the security settings that each Customer determines within Google Cloud. Trifacta never accesses or stores Customer passwords.
Data authorization to Google Cloud sources or destinations, such as Cloud Storage, BigQuery, or Google Sheets, is managed by each Customer within Google IAM. Google allows Customers to determine whether these authorizations are defined at the Dataprep service level or at the user-specific level, leveraging IAM and OAuth 2.0. Trifacta has no ability to alter or supersede these authorizations.
A Customer may use Dataprep by Trifacta to access other data sources, such as applications and databases. To do so, the Customer is responsible for creating a connection in the Dataprep by Trifacta user interface with the proper credentials. These credentials are stored in a Google Cloud SQL database and are encrypted using AES-256.
IAM authentication and authorization govern Dataprep by Trifacta access via the Google Cloud Console, APIs, and the Google Cloud Command Line Interface (CLI) to ensure all the access points are verified.
Dataprep by Trifacta Processing Engines
Google Cloud Dataprep by Trifacta optimizes runtime execution by selecting the appropriate data processing engine based on the characteristics of the workload. This approach makes it possible to meet users’ requirements while balancing latency needs, varying data volumes, and different source data formats. The Dataprep by Trifacta runtime optimizer chooses the engine based on the type of workload so that performance goals are met while overall processing costs are minimized.
The runtime engines supported by Google Cloud Dataprep by Trifacta are the following:
Google Cloud Dataflow
Google Cloud Dataflow is leveraged for processing very large data sets via the scalable and elastic infrastructure provided by Dataflow. Terabyte and petabyte-scale processing can be achieved by running natively on Dataflow. Also, leveraging Dataflow ensures native access to Google Cloud data sources in the Customer’s Google Cloud projects to provide optimal performance. The Google Cloud Dataflow service runs securely within the Customer’s Google Cloud project.
Google BigQuery
For data that is already in the Customer’s BigQuery data warehouse, SQL-based processing takes advantage of the highly performant and elastic BigQuery SQL engine. This optimized pushdown approach (known as ELT, for Extract, Load, and Transform) ensures that data stays in the data warehouse environment and is not moved out of the database or across the network. BigQuery processing executes securely in the Customer’s Google Cloud project.
Dataprep by Trifacta In-memory Engine
The Google Cloud Dataprep by Trifacta in-memory engine is a high-performance native in-memory engine that is optimized for low-latency processing of smaller datasets (approximately 1 GB and under). This mode of processing is used at design time in the web browser to allow users to wrangle sample datasets and see transformation results in real time. The Dataprep by Trifacta in-memory engine is also used at runtime so that smaller datasets, such as files, are processed quickly by taking advantage of in-memory operations.
Dataprep by Trifacta in-memory processing is executed in Dataprep’s project in processes that are dedicated to the Customer. No data is retained during processing. It is possible for a Customer’s Dataprep Administrator to disable this engine.
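The engine descriptions above can be summarized in a small sketch. This is not the actual runtime optimizer, whose rules are not documented here. It only encodes three statements from the text: data already in BigQuery is processed with SQL pushdown, datasets of roughly 1 GB and under can use the in-memory engine unless an administrator has disabled it, and everything else runs on Dataflow. The `Workload` type, the function name, and the order in which the rules are checked are assumptions.

```python
from dataclasses import dataclass

# Approximate threshold taken from the text ("approximately 1 GB and under").
IN_MEMORY_LIMIT_BYTES = 1 * 1024**3


@dataclass
class Workload:
    """Hypothetical description of a job, used only for this sketch."""
    size_bytes: int
    source_in_bigquery: bool


def choose_engine(workload: Workload, in_memory_enabled: bool = True) -> str:
    """Pick an engine name using simplified, assumed rules."""
    # Data that already lives in BigQuery stays there (ELT pushdown).
    if workload.source_in_bigquery:
        return "bigquery"
    # Small datasets use the in-memory engine unless it has been disabled.
    if in_memory_enabled and workload.size_bytes <= IN_MEMORY_LIMIT_BYTES:
        return "in-memory"
    # Everything else falls back to Dataflow.
    return "dataflow"
```

For example, a 500 MB file is routed to the in-memory engine by default, and to Dataflow once `in_memory_enabled` is set to `False`, which mirrors the administrator option described above.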
Dataprep by Trifacta Data Management
For certain source data formats, some preprocessing of data is needed. For instance, data originating in non-tabular formats needs to be converted to Comma-Separated Values (CSV) format before runtime processing. The data preprocessing step executes in Dataprep’s project. During preprocessing, no data is retained within Trifacta’s environment. In all cases, data is encrypted over the wire using TLS during transmission. Data is preprocessed in the following scenario:
- Conversion of Microsoft Excel and Portable Document Formats (PDFs) to CSV
Fig. Dataprep by Trifacta within the Google Cloud Analytics ecosystem and its processing engine options
Data originating from certain data sources may need to be persisted in physical storage before executing the data preparation process. In those cases, the data is persisted ONLY in the Customer’s Google Cloud project in a Google Cloud Storage bucket as configured by the Customer. Data is encrypted in the storage bucket as configured by the Customer. Google Cloud Customer-Managed Encryption Keys (CMEK) are supported in this context. The data is kept for these purposes only as long as it is used and is then deleted. NO data for these purposes is retained outside of the Customer’s Google Cloud project.
Data is persisted in the Customer’s Google Cloud project for the following types of data sources:
- Excel and PDFs
- JSON and other hierarchical datasets
- External Relational data (e.g. Oracle, MySQL)
- Software as a Service (SaaS) application data (e.g. Salesforce, Oracle NetSuite)
- Other REST API based data sources
Data in external, non-Google Cloud data sources, such as Software as a Service (SaaS) applications and databases, needs to be accessed and streamed to the Customer’s Google Cloud project before processing. This access point executes in the Customer’s Dataprep project, directly in the Customer’s Google Cloud project. The data is NOT stored in Trifacta’s environment at any point. In all cases, data is encrypted over the wire using TLS during transmission. The following types of data sources require streaming of data:
- Relational databases such as Oracle and MySQL
- SaaS applications such as Salesforce, Oracle NetSuite, etc.
- REST API based data sources
- Other JDBC data sources
- sFTP sources
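The three lists above (preprocessing, persistence in the Customer's bucket, and streaming) can be read together as a lookup from source type to handling steps. The sketch below encodes only what those lists state. The short source-type labels and the function name are hypothetical, and the sketch does not claim any ordering between the steps.

```python
from typing import Set

# Source types converted to CSV in Dataprep's project before processing.
PREPROCESSED: Set[str] = {"excel", "pdf"}

# Source types persisted in the Customer's Cloud Storage bucket.
STAGED_IN_CUSTOMER_BUCKET: Set[str] = {
    "excel", "pdf", "json", "relational", "saas", "rest_api",
}

# Source types streamed into the Customer's Google Cloud project.
STREAMED: Set[str] = {"relational", "saas", "rest_api", "jdbc", "sftp"}


def handling_steps(source_type: str) -> Set[str]:
    """Return the handling steps the text lists for a source type."""
    steps: Set[str] = set()
    if source_type in PREPROCESSED:
        steps.add("preprocess")
    if source_type in STREAMED:
        steps.add("stream")
    if source_type in STAGED_IN_CUSTOMER_BUCKET:
        steps.add("stage")
    return steps
```

In this reading, an Excel file is preprocessed and staged, a relational database is streamed and staged, and a source type that appears in none of the lists gets an empty set of extra steps.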
Note that the Customer is responsible for ensuring that any databases behind firewalls are reachable from the Customer’s Google Cloud project. This access may require whitelisting of Dataprep by Trifacta IP addresses so that the database can be reached over the network. Trifacta recommends configuring the database to use SSL or TLS so that connections are encrypted and data is protected during transmission.
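As a rough illustration of the two recommendations above, the following sketch checks a client address against an allowlist of CIDR ranges and builds a client-side TLS context that verifies certificates. The address ranges are placeholders taken from blocks reserved for documentation, not real Dataprep by Trifacta addresses, and the TLS 1.2 minimum is an assumption chosen for the example. Both function names are hypothetical.

```python
import ipaddress
import ssl
from typing import Iterable

# Placeholder ranges from blocks reserved for documentation (RFC 5737).
# The real service addresses must be obtained from the vendor.
EXAMPLE_ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.7/32"]


def is_allowed(client_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any allowlisted CIDR range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


def strict_tls_context() -> ssl.SSLContext:
    """Build a client context that verifies certificates and requires TLS 1.2+."""
    # create_default_context() enables certificate and hostname verification.
    context = ssl.create_default_context()
    # Assumed minimum version for this example; the text only says SSL or TLS.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context
```

In practice, the allowlist is usually enforced in the firewall or in the database's own access rules rather than in application code. The function only shows the matching logic that such a rule expresses.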
Data at Rest
- Customer storage and databases are managed by the Customer. Encryption is under the control of the Customer.
- Sample data, intermediate files, and file-based job results are stored in the Customer’s Google Cloud Storage bucket(s). Encryption is under the control of the Customer.
- The Dataprep by Trifacta service stores only Customer metadata (e.g. data preparation recipes, flow names, user names, etc.). Although this does not contain any Customer production data, it is nevertheless stored in Google Cloud SQL instances with AES-256 encryption.
Data in Transit
- Dataflow and BigQuery configuration is managed by the Customer with encryption under the Customer’s control.
- Browser communication is encrypted with TLS.
- All API communications between Google Cloud Services are encrypted with TLS.
Data Processing Examples
The following scenarios illustrate how data is processed in the Google Cloud Dataprep by Trifacta service, depending on the origin of the data source, the format of the source data, the volume of the data, and the target system. Please refer to the following high-level architecture diagram for these examples.
Fig. Dataprep by Trifacta execution data pipeline
Google Cloud Dataprep Sign Up Process
During the Dataprep sign-up process, started from the Google Cloud Console or the Google Cloud Marketplace, the Customer needs to agree to the terms and grant the access authorizations with Google Alphabet, Google Cloud, and Trifacta that allow the Dataprep service to operate.
The signup process requires a Customer to:
- Agree to the Google Cloud Terms of Service and the terms of service of any applicable services and APIs.
- Agree to the Google Cloud Dataprep Terms of Service.
- Agree that Google Cloud may share the Customer account information with Trifacta. This is the standard Google Cloud practice that allows a Google Cloud Customer to use partner-integrated services with Google Cloud. This authorization is necessary for technical support purposes, sales attribution for billing via the Google Cloud services, and product update communications. Account information is limited to email contact in those specific circumstances.
- Allow the Dataprep by Trifacta service to access the Customer’s Google Cloud project data. This is necessary to enable the Dataprep by Trifacta service to perform the data transformation instructions authored by the user, on behalf of the user. Dataprep by Trifacta runs and instructs Google Cloud Dataflow jobs on behalf of the user within the project. Roles and permissions are defined with Google Cloud IAM.
- Agree with the Trifacta Terms of Service.
Trifacta takes the security of its Customers’ data very seriously. The Google Cloud Dataprep by Trifacta platform is designed so that Dataprep by Trifacta has as little involvement with actual Customer data as possible and so that all Customer data is stored solely in Customer-controlled environments (including the Customer-controlled Google Cloud). Trifacta follows rigorous processes and controls to secure Customer data. Taking steps to ensure our platform remains secure is vital to protecting our data as well as our Customers’ information. This is our highest priority.
The Trifacta Data Engineering Cloud platform is built with ease of use, performance, reliability, and security at its core to protect your most valuable asset.
If you want to know more about Trifacta, reach out to [email protected].
If you need to report a security concern, email us at [email protected].