From Raw to Refined: The Staging Areas of Your Data Lake (Part 1)

Technology   |   Bertrand Cariou   |   May 9, 2016

In this two-part series, we’re talking about the Hadoop data lake, both in terms of the necessary components and people involved. Our first post covers the different staging areas of the Hadoop data lake and what they should accomplish. Our second post is now up – check it out to continue learning about the roles each person plays in the data lake. 

Businesses are increasingly driven by data. This is evident both in the number of new initiatives and in the ever-expanding ecosystem of tools and best practices that serve them. One such solution is the data lake, which can host all the internal data (e.g. transactional systems) and external data (e.g. mobile app interaction data, weather, social media) that the organization needs. It’s cost-effective and flexible, setting the organization up for long-term success.

However, while an IT organization will rightfully focus on building a robust data lake solution that is secure, predictable, scalable and governed, business users have their own needs. They want on-demand access to data to execute agile analytics projects. And if they can’t achieve that with the data lake, they’ll turn elsewhere, most likely to Excel, which is common among less-technical analysts but can’t scale with the business and comes with a high risk of reproducing errors.

Setting up the data lake for success

To foster adoption, IT organizations need to take into account supporting applications, such as self-service data preparation, that invite business users onto the lake. That can take many forms: for a digital/event marketing manager it might mean being able to clean, format, and de-duplicate leads from an event before bringing them into a marketing system, while a supply chain analyst might combine POS data with weather data to model the impact of weather on sales.
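To make the event-lead example concrete, here is a minimal sketch in plain Python of what "clean, format, and de-duplicate" could look like. The field names (name, email), the choice of email as the de-duplication key, and the clean_leads helper are assumptions made for illustration; they are not taken from any particular data preparation product.

```python
def clean_leads(rows):
    """Normalize and de-duplicate lead records, keyed by email address.

    Assumption: each row is a dict with optional "name" and "email" fields.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # Format: trim whitespace and lower-case the email so variants match.
        email = row.get("email", "").strip().lower()
        if not email or email in seen:
            continue  # drop rows without an email and repeat entries
        seen.add(email)
        # Clean: collapse repeated spaces and apply consistent capitalization.
        name = " ".join(row.get("name", "").split()).title()
        cleaned.append({"name": name, "email": email})
    return cleaned


leads = [
    {"name": "  ada   lovelace ", "email": "Ada@Example.com "},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "grace hopper", "email": "grace@example.com"},
    {"name": "No Email", "email": ""},
]

for lead in clean_leads(leads):
    print(lead["name"], lead["email"])
```

In this sketch the first occurrence of each email wins; a real project might instead keep the most complete or most recent record.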

From a data lake storage perspective, this translates into having various zones where data can be refined based on business requirements. There’s general agreement that a lake needs at least three zones, each with a different purpose, type of user, and level of security.

In this two-part post, we’ll first discuss the various stages of a data lake—and how each should be supported by applications—and then describe how different users can, and should, get involved.

Types of zones:

Landing zone

Also called the raw zone, the bronze zone, or even the swamp, this is the place that contains the source data as is, with no transformation, such as a raw log file or a binary file coming from a mainframe.

The initial landing zone is often managed by the IT organization, which automates the data lake ingestion process. However, business-driven projects may also bring their own raw data (such as external data) into the raw zone for future use.

Refinery zone

Also called the silver zone, the pond, the sandbox, or the exploration zone, this is the place where data can be discovered, explored, and experimented with for hypothesis validation and tests.

It usually includes private zones for each individual user and a shared zone for team collaboration. It is often seen as a sandbox with minimal security constraints where end users can access and process the data they want with light automation.

Production zone

Also called the gold zone, the refined zone, the lagoon, or the operationalization zone, this is where clean, well-structured data is stored in the optimal format to inform critical business decisions and drive efficient operations.

It often includes an operational data store that feeds traditional data warehouses and data marts. This zone has strict security restrictions on data access and automated provisioning of data, and end users have read-only access.
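The three zones described above can be sketched as a small lookup structure. This is only an illustration: the Zone class, its field names, and the resolve and next_zone helpers are hypothetical, and the landing zone's write flag is an assumption based on the note that business-driven projects may bring their own raw data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str
    aliases: tuple
    end_user_can_write: bool
    security: str


# Ordered from least to most refined.
ZONES = (
    # Assumption: writable by business projects that land their own raw data.
    Zone("landing", ("raw", "bronze", "swamp"), True, "IT-managed ingestion"),
    Zone("refinery", ("silver", "pond", "sandbox", "exploration"), True, "minimal"),
    Zone("production", ("gold", "refined", "lagoon", "operationalization"), False, "strict"),
)


def resolve(label):
    """Map a zone name or any of its aliases to the matching Zone record."""
    key = label.strip().lower()
    for zone in ZONES:
        if key == zone.name or key in zone.aliases:
            return zone
    raise KeyError(label)


def next_zone(label):
    """Return the zone data is promoted into next, or None after production."""
    idx = ZONES.index(resolve(label))
    return ZONES[idx + 1] if idx + 1 < len(ZONES) else None


print(resolve("Bronze").name)
print(next_zone("sandbox").name)
print(resolve("gold").end_user_can_write)
```

Keeping the aliases next to the canonical name lets teams that say "bronze" and teams that say "raw" refer to the same zone without ambiguity.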

The bottom line

There might be many variations to these zones, usage and level of security, but the above classifications paint a general picture as to how the zones should be managed and accessed for loading and consumption.

Common to all zones is the ability for business users to use a data preparation solution, such as Designer Cloud, to prepare and move the data from one zone to another. IT organizations, too, often use data preparation to obtain detailed specifications from business users, who increasingly want their work to be operationalized. Stay tuned for our second post, where we delve deeper into the different roles and responsibilities for each stage of the data lake!
