Data Fabric vs. Data Lake: Operational Comparison
In this article, we will focus on which is the most appropriate big data store for high-scale, real-time, operational use cases – data fabric vs data lake.
This article compares data fabric and data lake as big data stores for high-scale, real-time, operational use cases. It also discusses data warehouses, as well as relational and non-relational databases.
What Are Operational Use Cases?
Data-intensive enterprises are driven by a broad array of real-time use cases requiring a high-scale, high-speed data architecture that can support millions of concurrent transactions. Examples include:
- A 360-degree customer view, assembled from many different legacy systems and delivered to a self-service IVR or mobile/web portal, customer service reps, chat agents/bots, and field technicians.
- Churn prediction.
- Credit scoring.
- Fraud prevention.
- Payment card transaction security, and more.
Operational Use Case Requirements
Operational use cases need a big data platform capable of performing complex data queries in milliseconds while dealing with:
- Live data, which is continually updated from operational systems (with millions to billions of updates each day).
- Terabytes of fragmented data, spanning many different databases or tables, typically in different formats and technologies.
- A specific instance of a business entity, such as a single customer, product, location, etc.
- High concurrency, representing thousands of requests every second.
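The single-entity requirement above can be illustrated with a minimal Python sketch. It contrasts scanning every source table with a keyed lookup into data that has already been grouped by entity. The tables, field names, and helpers (`scan_for_customer`, `build_entity_index`, `lookup_customer`) are hypothetical and exist only for illustration; a real platform would do this over distributed storage rather than in-memory lists.

```python
from collections import defaultdict

# Hypothetical records from two separate source systems, keyed by customer_id.
billing = [
    {"customer_id": 1, "invoice": "INV-100", "amount": 42.0},
    {"customer_id": 2, "invoice": "INV-101", "amount": 17.5},
    {"customer_id": 1, "invoice": "INV-102", "amount": 8.0},
]
tickets = [
    {"customer_id": 2, "ticket": "T-7", "status": "open"},
    {"customer_id": 1, "ticket": "T-9", "status": "closed"},
]


def scan_for_customer(customer_id):
    """Scan every source table in full: cost grows with total data volume."""
    return {
        "billing": [r for r in billing if r["customer_id"] == customer_id],
        "tickets": [r for r in tickets if r["customer_id"] == customer_id],
    }


def build_entity_index(sources):
    """Group records by entity once, so later lookups touch one entity only."""
    index = defaultdict(lambda: defaultdict(list))
    for name, rows in sources.items():
        for row in rows:
            index[row["customer_id"]][name].append(row)
    return index


entity_index = build_entity_index({"billing": billing, "tickets": tickets})


def lookup_customer(customer_id):
    """Single-entity query: one keyed lookup instead of a full scan."""
    entry = entity_index.get(customer_id, {})
    return {
        "billing": list(entry.get("billing", [])),
        "tickets": list(entry.get("tickets", [])),
    }
```

Both functions return the same data for a given customer; the difference is that the keyed lookup does not depend on how many other customers exist, which is what matters at thousands of requests per second.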
Big Data Storage Options
Today, the storage options that data teams most commonly rely on include:
Data Lake
According to an analyst at Gartner, a data lake is a collection of storage instances of various data assets. These assets are stored and maintained as an exact, or near-exact, replica of the source format, whether structured or unstructured, in addition to the original data stores. Examples of data lake providers include Amazon S3, Apache Hadoop, and Azure Data Lake.
Data Warehouses (DWH)
A data warehouse refers to a storage architecture designed to persist data extracted from operational data stores, transaction systems, and external sources. It combines the data in an aggregated form appropriate for enterprise-wide data analysis and reporting. Examples of DWH providers include Amazon Redshift, Google BigQuery, and Snowflake.
Database Management Systems (DBMS)
A database management system stores and organizes data with defined formats and structures. A DBMS is categorized by its basic structure and by its use or deployment.
A relational DBMS, which usually includes a Structured Query Language (SQL) API, is organized and accessed via the relationships between the data entities. Examples of relational DBMS providers include Microsoft SQL Server, Oracle, and PostgreSQL.
A non-relational (NoSQL) DBMS is often used in big data and real-time web applications. Although optimized for high-scale use, a non-relational database typically does not enforce relationships between data entities. Examples of non-relational DBMS providers include Cassandra, MongoDB, and Redis.
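As a rough illustration of what "enforcing relationships" means, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a relational DBMS, and a plain dictionary as a stand-in for a simple key-value store. The table names, keys, and the `insert_orphan_order` helper are assumptions made for this example, and real NoSQL products differ in how much validation they offer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customer(id))"
)
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ada')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (10, 1)")  # valid reference


def insert_orphan_order():
    """Try to insert an order for a customer that does not exist."""
    try:
        conn.execute("INSERT INTO orders (id, customer_id) VALUES (11, 999)")
        return "accepted"
    except sqlite3.IntegrityError:
        # The relational engine rejects the dangling reference.
        return "rejected"


# A plain key-value store (modeled here as a dict) performs no such check.
document_store = {}
document_store["order:11"] = {"customer_id": 999}  # stored without complaint
```

In the relational case the database itself guards the relationship; in the key-value case that responsibility moves into application code.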
Data Fabric
A data fabric can be defined as an integrated layer of connected data that is ingested and normalized from an enterprise's data sources, regardless of the data's format, technology, or source system. It holds the processed data in its own data store and delivers it on demand to big data stores, consuming applications, and AI/ML/real-time decision-making engines. Examples of data fabric providers include IBM Cloud Pak for Data, K2View, Denodo, Talend, and Informatica.
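The "ingest and normalize" step described above might look roughly like the following Python sketch, which maps two differently shaped source payloads onto one schema and merges them into a single customer view. The payload shapes, field names, and functions (`normalize_crm`, `normalize_billing`, `merge_entity`) are hypothetical and do not represent any vendor's API.

```python
# Hypothetical payloads from two source systems with different shapes.
crm_record = {"CustID": "0042", "FullName": "Ada Lovelace", "Email": "ADA@EXAMPLE.COM"}
billing_row = ("42", "ada@example.com", "2023-01-15", 129.90)  # id, email, date, balance


def normalize_crm(rec):
    """Map CRM field names and types onto the common schema."""
    return {
        "customer_id": int(rec["CustID"]),
        "name": rec["FullName"],
        "email": rec["Email"].lower(),
    }


def normalize_billing(row):
    """Map a positional billing row onto the common schema."""
    cust_id, email, last_invoice, balance = row
    return {
        "customer_id": int(cust_id),
        "email": email.lower(),
        "last_invoice": last_invoice,
        "balance": balance,
    }


def merge_entity(*parts):
    """Combine normalized fragments that refer to the same customer_id."""
    ids = {part["customer_id"] for part in parts}
    if len(ids) != 1:
        raise ValueError("fragments belong to different entities")
    merged = {}
    for part in parts:
        merged.update(part)
    return merged


customer_view = merge_entity(normalize_crm(crm_record), normalize_billing(billing_row))
```

Here `customer_view` holds one record per customer regardless of how each source formatted its identifiers or email addresses.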
Storage Options – Pros and Cons
The following summarizes the strengths and weaknesses of data fabric vs data lake/DWH, as well as relational and non-relational databases.
Data Lake/DWH
Strengths
- Support for complex data queries, across structured and unstructured data.
Weaknesses
- Not designed for single-entity queries, resulting in slow response times when one customer's or product's data is requested.
- No support for live data, so data that needs to be constantly updated is unreliable or delivered at unacceptably slow response times.
Relational Database
Strengths
- Support for SQL, broad adoption, and ease of use.
Weaknesses
- Non-linear scalability, needing expensive hardware to perform complex queries on terabytes of data in near real-time.
- Unacceptably slow response times under high concurrency.
NoSQL Database
Strengths
- Distributed data store architecture, with support for linear scalability.
Weaknesses
- Limited or no support for standard SQL, so querying requires specialized skills.
- In order to support data queries, indexes need to be predefined – or complex application logic needs to be embedded – hampering development agility and time to market.
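The point about predefined indexes can be sketched with a toy key-value store in Python. Lookups by key always work, but lookups by any other attribute only work if an index on that attribute was declared up front and maintained on every write. The `KeyValueStore` class and its methods are hypothetical and far simpler than any real NoSQL engine.

```python
class KeyValueStore:
    """Toy key-value store with optional, predefined secondary indexes."""

    def __init__(self, indexed_fields=()):
        self._rows = {}
        # One value -> set-of-keys map per field declared at creation time.
        self._indexes = {field: {} for field in indexed_fields}

    def put(self, key, row):
        old = self._rows.get(key)
        if old is not None:
            self._unindex(key, old)  # drop stale index entries on overwrite
        self._rows[key] = row
        for field, index in self._indexes.items():
            if field in row:
                index.setdefault(row[field], set()).add(key)

    def _unindex(self, key, row):
        for field, index in self._indexes.items():
            if field in row:
                index.get(row[field], set()).discard(key)

    def get(self, key):
        return self._rows.get(key)

    def find_by(self, field, value):
        """Query by attribute; only possible for fields indexed in advance."""
        if field not in self._indexes:
            raise LookupError(f"no index defined on {field!r}")
        return sorted(self._indexes[field].get(value, set()))


store = KeyValueStore(indexed_fields=("city",))
store.put("c1", {"city": "Oslo", "plan": "gold"})
store.put("c2", {"city": "Oslo", "plan": "basic"})
store.put("c3", {"city": "Rome", "plan": "gold"})
```

Calling `store.find_by("city", "Oslo")` succeeds because `city` was indexed, while `store.find_by("plan", "gold")` raises `LookupError`. Supporting that new query means either adding an index and rebuilding it, or writing application code that walks every row, which is the agility cost described above.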
Operational Data Fabric
Strengths
- Full support for SQL.
- Distributed data store architecture, with support for linear scalability.
- Support for high concurrency, with high performance.
- Support for complex queries for single business entities.
Weaknesses
- No inherent support for querying across multiple Micro-Databases (the per-entity data stores that hold each business entity's data), although a search engine such as Elasticsearch can be used to address this.
Conclusion
In the data fabric vs data lake comparison, the strengths and weaknesses listed above favor data fabric as the architecture for real-time operational use cases. However, data fabric solutions and data lakes are complementary: a data fabric can prepare trusted data for data lakes, while data lakes can provide operational intelligence to the data fabric for immediate use.
Opinions expressed by DZone contributors are their own.