Building an Open Data Lake Analytics Stack Using Presto, Hudi, and AWS S3
In this blog, you will learn more about the open data lake analytics stack using Presto, Hudi, and AWS S3.
Rise of Open Data Lake Analytics
Data warehouses have long been considered the standard for performing analytics on structured data, but they cannot handle unstructured data such as text, images, audio, and video. Additionally, machine learning and AI are becoming common in every aspect of business, and they need access to vast amounts of data outside of data warehouses.
The cloud transformation has triggered the disaggregation of compute and storage, which brings cost benefits and the flexibility to store data arriving from many different sources. All of this has led to a new data platform architecture called the Open Data Lake Analytics stack. This stack solves the challenges of the traditional cloud data warehouse through its use of open-source and open-format technologies such as Presto and Hudi.
What Is Open Data Lake Analytics
Open Data Lake Analytics is based on the concept of running analytics on technologies and tools that require no vendor lock-in of any kind: no proprietary licensing, data formats, interfaces, or infrastructure.
Specifically, there are four key elements to the Open Data Lake Analytics stack:
Open source — The technologies in the stack we will be exploring for Open Data Lake Analytics are completely open source under the Apache 2.0 license. This means that you benefit from the best innovations, not just from one vendor but from the entire community.
Open formats — The stack does not use any proprietary formats. In fact, it supports most of the common formats, such as JSON, Apache ORC, and Apache Parquet.
Open interfaces — The interfaces are industry-standard ANSI SQL compatible, and standard JDBC/ODBC drivers can be used to connect to any reporting, dashboarding, or notebook tool. And because the stack is open source, industry-standard language clauses continue to be added and expanded on.
Open cloud — The stack is cloud-agnostic; because it holds no storage of its own, it aligns naturally with containers and can be run on any cloud.
Why Open Data Lake Analytics
Open data lake analytics allows you to consolidate structured and unstructured data in a central repository at lower cost, and it removes the complexity of running ETL, delivering high performance while reducing the cost and time to run analytics. Key benefits include:
- Bringing compute to your data (decoupling of compute and storage)
- Flexibility at the governance/transaction layer
- Flexibility and low cost to store structured and semi/unstructured data
- Flexibility at every layer — pick and choose which technology works best for your workloads/use case
The Open Data Lake Analytics Stack
Now let’s dive into the stack itself and each of its layers, and discuss the problems each layer solves.
BI/Application Tools — Data Visualization, Data Science Tools
Plug in the BI/analytical application tool of your choice. The Open Data Lake Analytics stack supports the use of JDBC/ODBC drivers, so you can connect Tableau, Looker, Preset, Jupyter notebooks, and others, based on your use case and workload.
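For example, a notebook can talk to Presto through its Python DB-API client. Here is a minimal sketch; the hostname, user, catalog, schema, and table names are hypothetical placeholders:

```python
# Minimal sketch: querying Presto from a notebook with the
# presto-python-client package (pip install presto-python-client).
# Hostname, user, catalog, schema, and table are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto.example.com",  # Presto coordinator
    port=8080,
    user="analyst",
    catalog="hive",             # Hive connector pointed at S3
    schema="default",
)

cur = conn.cursor()
cur.execute("SELECT order_id, amount FROM orders LIMIT 10")
for row in cur.fetchall():
    print(row)
```

Any tool that speaks JDBC/ODBC can connect the same way by pointing its driver at the coordinator.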
Presto — SQL Query Engine for the Data Lake
Presto is a parallel-distributed SQL query engine for the data lake. It enables interactive, ad-hoc analytics on large amounts of data on data lakes. With Presto, you can query data where it lives, including data sources like AWS S3, relational databases, NoSQL databases, and some proprietary data stores.
Presto is built for high-performance, interactive querying with in-memory execution.
Key characteristics include:
- High scalability from 1 to 1000s of workers
- Flexibility to support a wide range of SQL use cases
- Highly pluggable architecture that makes it easy to extend Presto with custom integrations for security, event listeners, etc.
- Federation of data sources particularly data lakes via Presto connectors
- Seamless integration with existing SQL systems with ANSI SQL standard
A full deployment of Presto has a coordinator and multiple workers. Queries are submitted to the coordinator by a client such as the command-line interface (CLI), a BI tool, or a notebook that supports SQL. The coordinator parses, analyzes, and creates the optimal query execution plan using metadata and data distribution information. That plan is then distributed to the workers for processing. The advantage of this decoupled storage model is that Presto can provide a single view of all of your data that has been aggregated into a data storage tier like S3.
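To make the federation point concrete, here is a sketch of a single query that joins a table stored on S3 (through the Hive connector) with a table in an operational database (through a MySQL connector). The catalog, schema, and table names are hypothetical, and the snippet reuses the `conn` object from the connection example above:

```python
# Sketch of a federated Presto query: one statement joining an
# S3-backed Hive table with a MySQL table. Names are hypothetical.
cur = conn.cursor()
cur.execute("""
    SELECT c.customer_name, SUM(o.amount) AS total_spend
    FROM hive.sales.orders AS o        -- data files on S3
    JOIN mysql.crm.customers AS c      -- rows in an operational database
      ON o.customer_id = c.customer_id
    GROUP BY c.customer_name
    ORDER BY total_spend DESC
    LIMIT 10
""")
print(cur.fetchall())
```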
Apache Hudi — Streaming Transactions in the Open Data Lake
One of the big drawbacks of traditional data warehouses is keeping the data updated. It requires building data marts/cubes, then doing constant ETL from source to destination mart, resulting in additional time, cost, and duplication of data. Similarly, data in the data lake needs to be updated and consistent without that operational overhead.
A transactional layer in your Open Data Lake Analytics stack is critical, especially as data volumes grow and the frequency at which data must be updated continues to increase. Using a technology like Apache Hudi solves the following:
- Ingesting incremental data
- Change data capture, including both inserts and deletions
- Incremental data processing
- ACID transactions
Apache Hudi, which stands for Hadoop Upserts Deletes and Incrementals, is an open-source transaction layer with storage abstraction for analytics, developed by Uber. In short, Hudi enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake. Hudi uses the open file formats Parquet and Avro for data storage and provides two internal table types, known as Copy-On-Write and Merge-On-Read.
It has built-in integration with Presto, so you can query Hudi datasets stored in these open file formats.
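As a concrete sketch, here is what an upsert into a Hudi table on S3 looks like from PySpark. It assumes a Spark session launched with the Hudi Spark bundle on the classpath; the bucket, table, and field names are hypothetical:

```python
# Minimal PySpark sketch of a Hudi upsert to S3. Assumes the Hudi
# Spark bundle is on the classpath; bucket/table/field names are
# hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-demo").getOrCreate()

updates = spark.createDataFrame(
    [("trip-001", "2022-01-01", 9.50, 1641034800),
     ("trip-002", "2022-01-01", 14.25, 1641038400)],
    ["trip_id", "trip_date", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",    # unique key
    "hoodie.datasource.write.partitionpath.field": "trip_date",
    "hoodie.datasource.write.precombine.field": "ts",        # picks the latest record on key collisions
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",   # or MERGE_ON_READ
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")   # append mode plus the upsert operation updates matching keys
    .save("s3a://my-bucket/hudi/trips"))
```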
Hudi Data Management
Hudi has a table format that is based on a directory structure: the table has partitions, which are folders containing the data files for that partition. It has indexing capabilities to support fast upserts. Hudi has two table types that define how data is indexed and laid out, which in turn defines how the underlying data is exposed to queries.
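As a rough sketch (the bucket, table, and partition names are hypothetical), a Hudi table on S3 is laid out something like this:

```
s3://my-bucket/hudi/trips/              <- table base path
├── .hoodie/                            <- Hudi metadata: commit timeline, etc.
├── trip_date=2022-01-01/               <- partition folder
│   └── <file_id>_<write_token>_<commit_ts>.parquet
└── trip_date=2022-01-02/
    └── <file_id>_<write_token>_<commit_ts>.parquet
```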
(Image source: Apache Hudi)
- Copy-On-Write (COW): Data is stored in the Parquet file format (columnar storage), and each new update creates a new version of the files during a write. Updating an existing set of rows results in a rewrite of the entire Parquet files that contain the rows being updated.
- Merge-On-Read (MOR): Data is stored in a combination of Parquet file format (columnar) and Avro (row-based) file formats. Updates are logged to row-based delta files until compaction, which will produce new versions of the columnar files.
Based on the two table types, Hudi provides three logical views for querying data from the data lake (a query sketch follows this list):
- Read-optimized — Queries see the latest committed dataset from CoW tables and the latest compacted dataset from MoR tables
- Incremental — Queries see only the new data written to the table after a commit/compaction. This helps in building incremental data pipelines and the analytics on top of them.
- Real-time — Provides the latest committed data from an MoR table by merging the columnar and row-based files inline
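For example, the incremental view can be read from Spark by passing a begin instant time, so a pipeline only processes records written since its last run. Here is a sketch for recent Hudi versions, with a hypothetical path and timestamp (for MoR tables synced to a Hive metastore, the read-optimized and real-time views commonly appear to Presto as separate tables suffixed _ro and _rt):

```python
# Sketch of a Hudi incremental query in PySpark: fetch only records
# committed after the given instant. Path and timestamp are hypothetical;
# reuses the `spark` session from the upsert example.
incremental = (spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20220101000000")
    .load("s3a://my-bucket/hudi/trips"))

incremental.createOrReplaceTempView("trips_incremental")
spark.sql("SELECT trip_id, fare FROM trips_incremental").show()
```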
AWS S3 — The Data Lake
The data lake is the central location for storing data from disparate sources, whether structured, semi-structured, or unstructured, in open formats on object storage such as AWS S3.
Amazon Simple Storage Service (Amazon S3) is the de facto centralized storage to implement Open Data Lake Analytics.
Getting Started
Now that you know the details of this stack, it’s time to get started. Here I’ll quickly show how you can run open data lake analytics workloads, using Presto to query your Apache Hudi datasets on S3.
Data can be ingested into the data lake from different sources, such as Kafka and other databases, by introducing Hudi into the data pipeline. Hudi creates or updates the needed tables, and the data is stored in S3 in either Parquet or Avro format, depending on the table type. BI tools and applications can then query the data using Presto, and the results will reflect updates as the data changes. A sketch of the ingestion side follows.
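Here is that ingestion path sketched as a Spark Structured Streaming job that reads from Kafka and writes into the Hudi table (Hudi also ships a DeltaStreamer utility for this kind of pipeline). The broker, topic, schema, and paths are hypothetical, and the snippet reuses `spark` and `hudi_options` from the upsert example:

```python
# Sketch: continuous ingestion from Kafka into a Hudi table on S3 with
# Spark Structured Streaming. Broker, topic, schema, and paths are
# hypothetical placeholders.
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, DoubleType, LongType

schema = (StructType()
    .add("trip_id", StringType())
    .add("trip_date", StringType())
    .add("fare", DoubleType())
    .add("ts", LongType()))

events = (spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka.example.com:9092")
    .option("subscribe", "trips")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*"))

(events.writeStream.format("hudi")
    .options(**hudi_options)   # same table options as the upsert example
    .option("checkpointLocation", "s3a://my-bucket/checkpoints/trips")
    .outputMode("append")
    .start("s3a://my-bucket/hudi/trips"))
```

Once the stream is running, the earlier Presto queries will pick up new and updated records as each commit completes.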
Conclusion
The Open Data Lake Analytics stack is becoming more widely used because of its simplicity, flexibility, performance, and cost.
The technologies that make up that stack are critical. Presto, the de facto SQL query engine for the data lake, combined with the transactional support and change data capture capabilities of Hudi, makes for a strong open-source, open-format solution for data lake analytics. One missing component is data lake governance, which allows queries on S3 to run more securely. AWS has recently introduced Lake Formation, a data governance solution for data lakes, and Ahana, a managed service for Presto, seamlessly integrates Presto with AWS Lake Formation to run interactive queries on your AWS S3 data lakes with fine-grained access to data.