Data Quality Faults With Your Data Vault
In this article, we’ll dive into data vault architecture, the challenges and best practices for maintaining data quality, and how data observability can help.
Over the past several years, data warehouses have evolved dramatically, but that doesn’t mean the fundamentals underpinning sound data architecture need to be thrown out the window.
In fact, with increasingly strict data regulations like GDPR and a renewed emphasis on optimizing technology costs, we’re now seeing a revitalization of “Data Vault 2.0” data modeling.
While data vault has many benefits, it is a sophisticated and complex methodology that can present challenges to data quality. In this article, we’ll dive into data vault architecture, the challenges and best practices for maintaining data quality, and how data observability can help.
What Is a Data Vault Model?
For those unfamiliar, data vault is a data warehouse modeling methodology created by Dan Linstedt in 2000 and updated in 2013 (you may be more familiar with the Kimball or Inmon models). The data vault collects and organizes raw data as an underlying structure that acts as the source to feed Kimball or Inmon dimensional models.
The data vault paradigm addresses the desire to overlay organization on top of semi-permanent raw data storage. There are three components to a data vault model:
- Satellite tables: contain the descriptive attributes, and their history over time, for a particular business concept.
- Hub tables: store the business keys that identify each business concept, either as standard keys, composite keys, or more security-conscious hashed keys, and tie the satellite tables for that concept together.
- Link tables: map different hub tables together through their keys (or equivalent), capturing relationships between business concepts (a minimal schema sketch of all three follows below).
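To make the three table types concrete, here is a minimal, hypothetical schema sketch in SQL. The table and column names (a customer hub, a customer-details satellite, and a customer-policy link) are illustrative, not taken from any specific implementation; the hash keys, load dates, and record sources are the conventional Data Vault housekeeping columns.

```sql
-- Hub: one row per unique business key for the "customer" concept.
CREATE TABLE hub_customer (
    customer_hash_key  CHAR(32)     NOT NULL PRIMARY KEY,  -- e.g., MD5 of the business key
    customer_id        VARCHAR(50)  NOT NULL,              -- the business key itself
    load_date          TIMESTAMP    NOT NULL,
    record_source      VARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes and their history for the customer hub.
CREATE TABLE sat_customer_details (
    customer_hash_key  CHAR(32)     NOT NULL REFERENCES hub_customer (customer_hash_key),
    load_date          TIMESTAMP    NOT NULL,
    customer_name      VARCHAR(200),
    customer_segment   VARCHAR(50),
    record_source      VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_hash_key, load_date)
);

-- Link: relationship between the customer hub and a policy hub (policy hub omitted here).
CREATE TABLE link_customer_policy (
    customer_policy_hash_key  CHAR(32)     NOT NULL PRIMARY KEY,
    customer_hash_key         CHAR(32)     NOT NULL REFERENCES hub_customer (customer_hash_key),
    policy_hash_key           CHAR(32)     NOT NULL,
    load_date                 TIMESTAMP    NOT NULL,
    record_source             VARCHAR(100) NOT NULL
);
```

Because satellites are insert-only and keyed by hash key plus load date, they accumulate the full change history for each business concept.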
Data Vault Benefits
Data vault modeling has a lot of advantages compared to other methodologies (or not modeling at all!), such as its suitability for auditing, quickly redefining relationships, and easily adding new datasets.
Pie Insurance, a leading small business insurtech, leverages a data vault 2.0 architecture (with some minor deviations) to achieve its data integration objectives around scalability and the use of metadata.
“A data vault data model is intended to be a physical implementation of the organization’s business model, so it becomes a standard place to plug in data from multiple sources. For example, as new data is added to our data warehouse, we are able to plug in the data to our model by adding Satellite tables. Or as new subject areas become in-scope, we can add new Hub and Link tables based on our business model,” said Ken Wood, staff data engineer in data architecture, Pie.
“The other advantage is because we follow a standard design, we are able to generate a lot of our code using code templates and metadata. The metadata contains our data mappings, and the code templates contain the expected structure of our ETL code scripts/files,” he said.
Other advantages Pie has experienced from its data vault implementation include:
- Organizational – Pie can find the data it needs quickly because the data is organized around its business model.
- Foundational – Data vault provides a solid foundation that allows for fast “speed-to-insights.” When new questions arise, Pie can build dimensional tables downstream from data vault tables and even virtualize downstream content (views).
- History – The design of Satellite tables allows Pie to search and query changes to data over time, essentially providing the data needed for slowly changing dimensions and fact history views (see the query sketch below).
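As a hedged illustration of that last point, using the hypothetical sat_customer_details table sketched earlier: because a satellite keeps one row per detected change, retrieving the history of a record is a simple ordered query. The business key and the MD5 hashing convention here are illustrative assumptions.

```sql
-- Full change history for one customer, newest first.
-- Assumes the illustrative convention of hashing the business key with MD5(UPPER(TRIM(...))).
SELECT load_date,
       customer_name,
       customer_segment
FROM   sat_customer_details
WHERE  customer_hash_key = MD5(UPPER(TRIM('C-1001')))
ORDER  BY load_date DESC;
```

A point-in-time view (for slowly changing dimensions) simply filters this history to the latest load_date at or before the date of interest.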
How Do You Implement a Data Vault Architecture?
While deployments will vary, Pie’s data vault architecture implementation includes four conceptual layers of data architecture that make up their data pipeline ecosystem:
- Ingestion Layer – Landing and staging raw data from source systems.
  - Landing – Source files land in AWS S3 buckets.
  - Staging – Raw source data is stored in VARIANT columns within Snowflake tables (a sketch of this step follows the list).
- Curation Layer – Organizes the raw data.
  - Raw Data Vault – Lives within the Snowflake environment and applies only minor transformations to map the data into Hub, Satellite, and Link tables, as recommended by the Data Vault 2.0 methodology.
  - Business Data Model – Pie’s data vault design is the physical model of its business data model, as opposed to a design based on each source system’s data. This gives Pie a single model to conform to, regardless of the source.
- Transformation Layer – Transforms and cleans data using business logic.
  - Business Vault – Pre-transformed data, following business transformation rules.
  - Information Warehouse – This layer alone follows the dimensional (or Kimball) star (or snowflaked) data model.
- Presentation Layer – The reporting layer for the vast majority of users. This layer has minimal transformation rules.
  - BI/Reporting Tool(s) – Pie uses Looker, which has a metadata layer that reflects the “Information Warehouse” (transformed data).
  - Future Reporting Tool Plug-in – Future reporting or querying tools can be added without major development because the data is already transformed in the database layer.
  - Dynamic Rules – Dynamic rules or calculations that need to change depending on the grain or aggregated view self-service users need to see their information.
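As a sketch of what the Landing-to-Staging step can look like in Snowflake (the stage name @landing_s3, table name stg_policy_raw, and JSON file format are assumptions for illustration, not details of Pie’s actual pipeline):

```sql
-- Staging table: each raw source record lands as-is in a single VARIANT column.
CREATE TABLE IF NOT EXISTS stg_policy_raw (
    raw_record VARIANT
);

-- Load JSON files from the S3-backed external stage into the VARIANT column.
COPY INTO stg_policy_raw
FROM @landing_s3/policies/
FILE_FORMAT = (TYPE = 'JSON');

-- Downstream vault loads can then pull typed fields out of the raw records, e.g.:
SELECT raw_record:policy_id::STRING   AS policy_id,
       raw_record:customer_id::STRING AS customer_id
FROM   stg_policy_raw;
```

Keeping the staged data in VARIANT form preserves the raw payload, so the vault layer can be reloaded or extended later without going back to the source system.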
“We think of our architecture from left to right. The far left is data in its most raw form, and the far right is information that has been fully transformed, cleansed, and is ready to be consumed by the business,” said Ken.
Data Quality Faults With Your Data Vault
There are many benefits to data vault architecture, but it does create more tables with more complex transformations and relationships between upstream and downstream assets than other methodologies. This can create data quality challenges if not addressed properly.
Some of these challenges include:
Code Maintenance
The ETL code for Hub, Satellite, and Link tables must follow the same rules for common column value definitions (like business and hash key definitions) so that the tables can load independently. As a result, any change to that code may have to be made in multiple places to keep the definitions consistent.
One tip? “Our code generation is metadata-driven, so when we change the metadata in one place, it regenerates the ETL code wherever that particular metadata is used,” said Ken.
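To illustrate why that consistency matters, here is a hedged sketch using the illustrative tables from earlier plus a hypothetical typed staging view, stg_customer (assumed to expose fields extracted from the VARIANT staging table). If the hub and satellite loads hash the business key even slightly differently, their rows will never join; whether the expression is generated from metadata or maintained by hand, it must be identical in every load.

```sql
-- Hub load: register business keys not yet present, using the shared hashing rule.
INSERT INTO hub_customer (customer_hash_key, customer_id, load_date, record_source)
SELECT DISTINCT
       MD5(UPPER(TRIM(src.customer_id))),   -- shared business-key hashing rule
       src.customer_id,
       CURRENT_TIMESTAMP,
       'crm_extract'
FROM   stg_customer src
WHERE  NOT EXISTS (
           SELECT 1
           FROM   hub_customer h
           WHERE  h.customer_hash_key = MD5(UPPER(TRIM(src.customer_id)))
       );

-- Satellite load: must use the identical expression, or its rows orphan from the hub.
-- (A real load would also compare a hash diff so only changed attributes are inserted.)
INSERT INTO sat_customer_details (customer_hash_key, load_date, customer_name, customer_segment, record_source)
SELECT MD5(UPPER(TRIM(src.customer_id))),   -- same rule, same result
       CURRENT_TIMESTAMP,
       src.customer_name,
       src.customer_segment,
       'crm_extract'
FROM   stg_customer src;
```

Metadata-driven generation, as Ken describes, keeps a single definition of that expression in one place and stamps it into every generated script.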
Multiple, Complex Transformations Between Layers
Transformations are a necessary step in any data engineering pipeline using any methodology, but they can create data quality incidents. This can happen either as the transformation code gets modified (perhaps incorrectly) or the input data isn’t aligned with the underlying expectations of the transformation model (perhaps there has been an unexpected schema change or the data doesn’t arrive on time).
Long blocks of transformation code at multiple layers within a data vault architecture can compound these errors and make root cause analysis more difficult. A best practice here is to keep transformations as simple as possible.
“We are working to evolve our design to apply complex transformations in only one place, the Information Warehouse, within the data pipeline,” said Ken. “As the raw data vault scales, the transformation logic becomes more complex, so we are designing ways to reduce complexity.”
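One way to keep the vault-side logic thin, sketched here with the illustrative tables from earlier (the view name and column choices are assumptions, and QUALIFY is Snowflake syntax), is to leave the raw vault loads as near pass-throughs and concentrate business logic in a dimensional view in the Information Warehouse layer:

```sql
-- Current-state customer dimension built downstream of the vault.
-- The vault tables stay raw; the presentation logic lives in this single view.
CREATE OR REPLACE VIEW dim_customer AS
SELECT h.customer_id,
       s.customer_name,
       s.customer_segment,
       s.load_date AS effective_from
FROM   hub_customer h
JOIN   sat_customer_details s
  ON   s.customer_hash_key = h.customer_hash_key
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY s.customer_hash_key
            ORDER BY s.load_date DESC) = 1;  -- keep only the latest satellite row per customer
```

Because the view is virtual, it can be redefined as the business logic evolves without reloading any vault tables.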
Maintaining Integrity Across the Hub, Link, and Satellite Tables
Any load errors (or other issues) in the data vault hub, link, and satellite tables will mar downstream queries, producing outputs with partial or missing data.
“The key is to have automated validation checks or data observability in place to detect these anomalies when they happen,” said Ken.
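As a hedged sketch of the kind of automated validation check Ken describes, again using the illustrative tables from earlier: count satellite and link rows whose hash keys have no matching hub row, and alert whenever the counts are non-zero. A data observability tool can run and monitor checks like these continuously.

```sql
-- Orphaned satellite rows: attribute history with no parent hub key.
SELECT COUNT(*) AS orphaned_satellite_rows
FROM   sat_customer_details s
LEFT JOIN hub_customer h
  ON   h.customer_hash_key = s.customer_hash_key
WHERE  h.customer_hash_key IS NULL;

-- Orphaned link rows: relationships pointing at a customer that never loaded into the hub.
SELECT COUNT(*) AS orphaned_link_rows
FROM   link_customer_policy l
LEFT JOIN hub_customer h
  ON   h.customer_hash_key = l.customer_hash_key
WHERE  h.customer_hash_key IS NULL;
```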
Published at DZone with permission of Lior Gavish. See the original article here.