Busting the Myth of Zero ETL: Why Data Transformation Is So Critical
This article discusses the truth about Zero ETL and describes the crucial role of data transformation.
A Brief History of Data Transformation
Twenty years ago, many organizations ran Oracle ERP together with an on-premises Oracle data warehouse. All the data needed for analytics was right there in the same network. Yet there was still a need for specialized tools to get the data ready for analysis. This involved joining, transforming, cleansing, and refining the transactional data so it served the analytics needs of users across the business.
Now, 20 years later, computing power has increased manyfold, and compute costs have dropped significantly. Many systems are now in the cloud, and modern technologies can store and retrieve data more quickly. Plus, with cloud elasticity, you no longer need to purchase hardware up front to support a five-year growth plan.
But does this mean your data is also automatically ready for analysis? Of course not. Data has also grown significantly in volume, and there are more data sources and structures, which adds complexity. Some data has no structure at all, or its structure changes continually.
Twenty years ago, if you just replicated all your transactional tables from the Oracle database to your Oracle data warehouse, it would not have made any sense to the users. And that’s still the case despite all the technological innovations over the last twenty years.
Why is that? Because technology does not fix data. Let’s explore this concept further.
Applications Don’t Generate Analytics-Ready Data
The source data extracted or ingested into a data lake or a data warehouse is created by business applications. The main purpose of such applications is to enable business interactions (such as a point-of-sale transaction or an order on a retail website) and store the resulting data in an optimized form for that application. Whether that data is ready for any analysis is not the concern of these applications. Unfortunately, this data is usually cryptic, stored in a normalized form, sometimes erroneous, and difficult for anyone other than the application stakeholders to understand.
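To make this concrete, here is a minimal Python sketch of the gap between what an application stores and what an analyst needs. All table and column names here are hypothetical, invented purely for illustration:

```python
import pandas as pd

# Hypothetical raw rows as a transactional application might store them:
# cryptic column names, foreign keys, status codes only the application
# team understands, and amounts stored in cents.
raw_orders = pd.DataFrame({
    "ord_id":  [1001, 1002],
    "cust_fk": [77, 78],          # foreign key into a customer table
    "stat_cd": ["C", "X"],        # application-specific status codes
    "amt_c":   [125000, 89900],   # amount in cents
})

# What an analyst needs: readable names, decoded statuses, business units.
status_map = {"C": "completed", "X": "cancelled"}
orders = raw_orders.rename(columns={"ord_id": "order_id"})
orders["status"] = raw_orders["stat_cd"].map(status_map)
orders["amount_usd"] = raw_orders["amt_c"] / 100
print(orders[["order_id", "status", "amount_usd"]])
```

Nothing in the application's own storage format would have told a business user that "X" means a cancelled order; that decoding is transformation work.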
Even when the transactional application and the data warehouse are owned by the same company, and even when the same table serves both transactional and analytics use cases, that alone does not automatically make the data ready for analysis.
There’s yet another consideration: the possibility of mergers and acquisitions. When this happens, sometimes companies consolidate their IT departments. And then you’re dealing with multiple sources of data on the same subject matter (sales invoices, for example) and need to understand how to put it all together.
Finally, there is the analytics data model. Most organizations standardize on a data model suitable for analytics, such as a dimensional data model or data vault. Transformations are also required to make the raw data conform to such a data model.
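For instance, conforming raw records to a dimensional model means splitting them into dimension and fact tables. A minimal pandas sketch, with hypothetical table and column names, might look like this:

```python
import pandas as pd

# Hypothetical raw, denormalized extract from a source system.
raw = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer": ["Acme", "Acme", "Globex"],
    "region":   ["West", "West", "East"],
    "amount":   [250.0, 90.0, 410.0],
})

# Dimension table: one row per customer, with a surrogate key.
dim_customer = (raw[["customer", "region"]]
                .drop_duplicates()
                .reset_index(drop=True))
dim_customer["customer_key"] = dim_customer.index + 1

# Fact table: measures plus foreign keys into the dimension.
fact_sales = raw.merge(dim_customer, on=["customer", "region"])
fact_sales = fact_sales[["order_id", "customer_key", "amount"]]
print(dim_customer)
print(fact_sales)
```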
It's for these reasons that you need data transformation. A metadata-driven tool is essential for configuring such transformations because it enables you to implement them using either of two popular patterns (a minimal sketch contrasting the two follows the list below):
- Classical Extract-Transform-Load (ETL)
- Extract-Load-Transform (ELT)
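As a rough sketch (with hypothetical names and in-memory stand-ins for real systems), the two patterns differ mainly in where the transformation step runs:

```python
def extract():
    # Raw rows as a source application might store them.
    return [{"stat_cd": "C", "amt_c": 125000},
            {"stat_cd": "X", "amt_c": 89900}]

def transform(rows):
    # Decode status codes and convert cents to dollars.
    status_map = {"C": "completed", "X": "cancelled"}
    return [{"status": status_map[r["stat_cd"]],
             "amount_usd": r["amt_c"] / 100} for r in rows]

# ETL: transform in the integration layer, then load the result.
warehouse_etl = []
warehouse_etl.extend(transform(extract()))

# ELT: load the raw rows first, then transform where the data lives
# (in practice, by running SQL inside the warehouse).
warehouse_elt = []
warehouse_elt.extend(extract())           # load raw
warehouse_elt = transform(warehouse_elt)  # transform in the warehouse

print(warehouse_etl == warehouse_elt)     # same analytics-ready result
```

Either way, the transformation itself cannot be skipped; the patterns only move where it happens.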
Why is this raw data unsuitable for analytics as-is? Let’s explore that in a little more detail.
Why Data Transformation Is Critical for Analytics
The purpose of a data warehouse is to provide analytics that cut across business functions. An organization needs analyses that combine data across sales, finance, support, and services and provide meaningful metrics so business users can make key decisions. Apart from data that comes from the company's ERP or CRM applications, a lot of other data is now being generated:
- 50+ cloud applications which store data related to various business functions
- User interactions on your website
- Social interactions by users such as tweets, reviews, etc.
- Machine-generated data
For any meaningful cross-functional analysis, you need to combine data from two or more of these sources. Some examples include:
- Long-term performance of marketing campaigns
- Correlation between your users’ social interactions and their subscriptions or cancellations
- Product-specific profitability and margins
- Operational effectiveness reports that include original company stores, acquired company stores and online sales
- Recently added maintenance alerts on devices used by customers in their locations
Getting meaningful, actionable insights from such analyses requires clean and complete data that can be consistently connected. For example, a customer's recent purchase and her product reviews on your retail website are two discrete sets of data that may or may not share attributes that easily link them. Correctly joining such disparate data is challenging for three key reasons:
- The data resides in separate databases, files and applications. In each case, the identifiers of the specific entity could be different. A customer ID used in the ERP may not be available in the social interaction record.
- The grain of the data can vary across business functions even within the same applications. Sales might use two different records for a large enterprise customer based on two different teams they have worked with. In fact, service or support records may contain 10+ different instances of the same customer based on their threads of interaction.
- Data managed by an acquired company can be in a separate application, using their own identifiers, codes, etc.
This requires extensive data transformation: aggregating data at lower grains, cleansing erroneous records, de-duplicating customer or order/interaction records, replacing source-specific codes with commonly used ones, applying business rules specific to your organization, calculating metrics from the results, and much more. The sketch below illustrates a few of these steps.
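Here is a minimal pandas sketch of the identifier and grain problems described above (all identifiers and tables are hypothetical). It resolves source-specific customer IDs through a cross-reference table and aggregates to a common grain:

```python
import pandas as pd

# Hypothetical support records: the same customer appears under several
# source-specific identifiers and inconsistently spelled names.
support = pd.DataFrame({
    "src_cust_id": ["S-77", "S-91", "S-12"],
    "name":        ["Acme Corp", "ACME Corporation", "Globex"],
    "tickets":     [4, 9, 2],
})

# A cross-reference table, often maintained by an MDM process or the
# integration tool, mapping source IDs to one canonical customer ID.
xref = pd.DataFrame({
    "src_cust_id": ["S-77", "S-91", "S-12"],
    "customer_id": ["C-001", "C-001", "C-002"],
})

# Resolve identifiers, then aggregate to one row per customer, i.e., a
# grain that finance and sales data can be joined against.
resolved = support.merge(xref, on="src_cust_id")
by_customer = resolved.groupby("customer_id", as_index=False)["tickets"].sum()
print(by_customer)
```

Without the cross-reference step, the two "Acme" records would be counted as two different customers, silently skewing any cross-functional metric.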
Technology Cannot Eliminate the Need to Make Data Ready for Analytics
Most claims of “Zero ETL” are based on the following:
- The ability to load data quickly, combined with the assumption that because these systems reside in the same cloud, the data hosted in one is directly consumable by another
- Tables that can support both transactional and analytic use cases, such as Apache Iceberg
- Data virtualization that allows you to interact with data from an ERP or CRM application from within a data warehouse
These claims fall short for the following reasons:
- Such features simplify the work of connecting to a third-party system and fetching its data into a data warehouse, or vice versa. But it is still raw data, and it still needs transformation to become analytics-ready.
- An open table format such as Apache Iceberg, which supports both transactional and analytical use cases, can "technically" store transactional data and run analytics directly on it. These formats are great for data lake use cases, where they provide reliability and relational-database-like simplicity of access. But they do nothing inherently to make the underlying data ready for analysis.
- Allowing direct access to a third-party system's data from within a data warehouse application, or vice versa, simplifies access to the "raw" data. However, it does not eliminate the need to transform that data.
Applying cool technology or simply moving data as-is into another system does not inherently make the data readable. If I had never learned the English alphabet and went to an optometrist, I would not be able to make sense of the letters in front of me. Applying technology over raw data is exactly like that. Can I do this using the ELT pattern, then? Or can I just do it in the analytics layer instead? Not really. Let's look at that perspective too.
The “T” in ETL, ELT or Reverse ETL
Transformation of data is typically done using data integration tools. Regardless of whether you use the ETL or ELT approach, cleansing and transforming data and storing it in a consistent data model is necessary for enterprise-wide analyses. For accurate reporting and decision making, it is critical that all departments and users interpret the data in the same way.
If you do not transform data upfront, you will end up doing it at the time of analysis. While the latter is OK for one-off data science or machine learning (ML) projects, for enterprise-wide projects it is costly, redundant and prone to errors.
Having the data available in a cloud storage system and being able to load it "quickly" into a data warehouse or a database is just one small step toward getting it ready. Similarly, having a single data store for all data also leaves all the transformation work to the users. For example, an open table format may be efficient for both transactional and analytics use cases, but it does not eliminate the need to transform the data. You are just postponing the transformation, which may end up "costing" you more: someone will have to write SQL or Python code downstream for the transformation, and that will hardwire them to a particular data warehouse.
This is similar to writing PL/SQL code in the old days; people tried it and soon ran into the maintenance challenges that come with it. Hand-written transformation code also makes it hard to obtain fine-grained lineage, generate impact analysis, and drive general data governance.
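By contrast, a metadata-driven approach declares transformations as data and has a generic engine execute them, so lineage and impact analysis can be derived from the metadata itself. Here is a toy Python sketch of the idea (not any vendor's actual API; all rule names and columns are hypothetical):

```python
import pandas as pd

# Hypothetical metadata: transformations declared as data, not code.
# A metadata-driven tool stores rules like these and can generate
# lineage and impact analysis directly from them.
rules = [
    {"op": "rename", "from": "stat_cd", "to": "status"},
    {"op": "map",    "column": "status",
     "values": {"C": "completed", "X": "cancelled"}},
    {"op": "scale",  "column": "amount", "factor": 0.01,
     "to": "amount_usd"},
]

def apply_rules(df, rules):
    """Generic engine: interprets rule metadata against a DataFrame."""
    for r in rules:
        if r["op"] == "rename":
            df = df.rename(columns={r["from"]: r["to"]})
        elif r["op"] == "map":
            df[r["column"]] = df[r["column"]].map(r["values"])
        elif r["op"] == "scale":
            df[r["to"]] = df[r["column"]] * r["factor"]
    return df

raw = pd.DataFrame({"stat_cd": ["C", "X"], "amount": [125000, 89900]})
print(apply_rules(raw, rules))
```

Because the rules are plain data, the same definitions can be inspected, versioned, and retargeted to a different warehouse without rewriting the logic.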
Using a metadata-driven data integration tool provides features that address the core issue: data readiness for analytics. This includes graphical, endpoint-agnostic ways to transform, cleanse, and massage the data so it is consistently understood across the enterprise, giving you analytics-ready data at scale. Don't take the shortcut of trying to eliminate data transformation. Instead, future-proof your investment in data, perhaps your most critical asset, by investing in an intelligent, enterprise-grade, scalable data integration tool.