Data Quality Faults With Your Data Vault
In this article, we’ll dive into data vault architecture, the challenges and best practices for maintaining data quality, and how data observability can help.
Over the past several years, data warehouses have evolved dramatically, but that doesn’t mean the fundamentals underpinning sound data architecture need to be thrown out the window.
In fact, with increasingly strict data regulations like GDPR and a renewed emphasis on optimizing technology costs, we’re now seeing a revitalization of “Data Vault 2.0” data modeling.
While data vault has many benefits, it is a sophisticated and complex methodology that can present challenges to data quality. In this article, we’ll dive into data vault architecture, the challenges and best practices for maintaining data quality, and how data observability can help.
What Is a Data Vault Model?
For those unfamiliar, data vault is a data warehouse modeling methodology created by Dan Linstedt in 2000 and updated in 2013 (you may be more familiar with the Kimball or Inmon models). The data vault collects and organizes raw data as an underlying structure that acts as the source to feed Kimball or Inmon dimensional models.
The data vault paradigm addresses the desire to overlay organization on top of semi-permanent raw data storage. There are three components to a data vault model:
- Satellite tables: contain the descriptive attributes, and their history over time, for a particular business concept.
- Hub tables: store the business keys that identify each business concept, either as standard keys, composite keys, or more security-conscious hashed keys, and tie the satellite tables for that concept together.
- Link tables: map different hub tables together through their keys (or equivalent), capturing relationships between business concepts (a minimal schema sketch of all three follows below).
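To make the three table types concrete, here is a minimal, hypothetical schema sketch in SQL. The table and column names (a customer hub, a customer-details satellite, and a customer-policy link) are illustrative, not taken from any specific implementation; the hash keys, load dates, and record sources are the conventional Data Vault housekeeping columns.

```sql
-- Hub: one row per unique business key for the "customer" concept.
CREATE TABLE hub_customer (
    customer_hash_key  CHAR(32)     NOT NULL PRIMARY KEY,  -- e.g., MD5 of the business key
    customer_id        VARCHAR(50)  NOT NULL,              -- the business key itself
    load_date          TIMESTAMP    NOT NULL,
    record_source      VARCHAR(100) NOT NULL
);

-- Satellite: descriptive attributes and their history for the customer hub.
CREATE TABLE sat_customer_details (
    customer_hash_key  CHAR(32)     NOT NULL REFERENCES hub_customer (customer_hash_key),
    load_date          TIMESTAMP    NOT NULL,
    customer_name      VARCHAR(200),
    customer_segment   VARCHAR(50),
    record_source      VARCHAR(100) NOT NULL,
    PRIMARY KEY (customer_hash_key, load_date)
);

-- Link: relationship between the customer hub and a policy hub (policy hub omitted here).
CREATE TABLE link_customer_policy (
    customer_policy_hash_key  CHAR(32)     NOT NULL PRIMARY KEY,
    customer_hash_key         CHAR(32)     NOT NULL REFERENCES hub_customer (customer_hash_key),
    policy_hash_key           CHAR(32)     NOT NULL,
    load_date                 TIMESTAMP    NOT NULL,
    record_source             VARCHAR(100) NOT NULL
);
```

Because satellites are insert-only and keyed by hash key plus load date, they accumulate the full change history for each business concept.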
Data Vault Benefits
Data vault modeling has a lot of advantages compared to other methodologies (or not modeling at all!), such as its suitability for auditing, quickly redefining relationships, and easily adding new datasets.
Pie Insurance, a leading small business insurtech, leverages a data vault 2.0 architecture (with some minor deviations) to achieve its data integration objectives around scalability and the use of metadata.
“A data vault data model is intended to be a physical implementation of the organization’s business model, so it becomes a standard place to plug in data from multiple sources. For example, as new data is added to our data warehouse, we are able to plug in the data to our model by adding Satellite tables. Or as new subject areas become in-scope, we can add new Hub and Link tables based on our business model,” said Ken Wood, staff data engineer in data architecture, Pie.
“The other advantage is because we follow a standard design, we are able to generate a lot of our code using code templates and metadata. The metadata contains our data mappings, and the code templates contain the expected structure of our ETL code scripts/files,” he said.
Other advantages Pie has experienced from its data vault implementation include:
- Organizational – Pie can find the data it needs quickly because the data is organized around its business model.
- Foundational – Data vault provides a solid foundation that allows for fast “speed-to-insights.” When new questions arise, Pie can build dimensional tables downstream from data vault tables and even virtualize downstream content (views).
- History – The design of Satellite tables allows Pie to search and query changes to data over time, essentially providing the data needed for slowly changing dimensions and fact history views (see the query sketch below).
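As a hedged illustration of that last point, using the hypothetical sat_customer_details table sketched earlier: because a satellite keeps one row per detected change, retrieving the history of a record is a simple ordered query. The business key and the MD5 hashing convention here are illustrative assumptions.

```sql
-- Full change history for one customer, newest first.
-- Assumes the illustrative convention of hashing the business key with MD5(UPPER(TRIM(...))).
SELECT load_date,
       customer_name,
       customer_segment
FROM   sat_customer_details
WHERE  customer_hash_key = MD5(UPPER(TRIM('C-1001')))
ORDER  BY load_date DESC;
```

A point-in-time view (for slowly changing dimensions) simply filters this history to the latest load_date at or before the date of interest.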
How Do You Implement a Data Vault Architecture?
While deployments will vary, Pie’s data vault architecture implementation includes four conceptual layers of data architecture that make up their data pipeline ecosystem:
- Ingestion Layer – Landing and staging raw data from source systems.
  - Landing – Source files land in AWS S3 buckets.
  - Staging – Raw source data is stored in VARIANT columns within Snowflake tables (a sketch of this step follows the list).
- Curation Layer – Organizes the raw data.
  - Raw Data Vault – Lives within the Snowflake environment and applies only minor transformations to map the data into Hub, Satellite, and Link tables, as recommended by the Data Vault 2.0 methodology.
  - Business Data Model – Pie’s data vault design is the physical model of its business data model, as opposed to a design based on each source system’s data. This gives Pie a single model to conform to, regardless of the source.
- Transformation Layer – Transforms and cleans data using business logic.
  - Business Vault – Pre-transformed data, following business transformation rules.
  - Information Warehouse – This layer alone follows the dimensional (or Kimball) star (or snowflaked) data model.
- Presentation Layer – The reporting layer for the vast majority of users. This layer has minimal transformation rules.
  - BI/Reporting Tool(s) – Pie uses Looker, which has a metadata layer that reflects the “Information Warehouse” (transformed data).
  - Future Reporting Tool Plug-in – Future reporting or querying tools can be added without major development because the data is already transformed in the database layer.
  - Dynamic Rules – Dynamic rules or calculations that need to change depending on the grain or aggregated view self-service users need to see their information.
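As a sketch of what the Landing-to-Staging step can look like in Snowflake (the stage name @landing_s3, table name stg_policy_raw, and JSON file format are assumptions for illustration, not details of Pie’s actual pipeline):

```sql
-- Staging table: each raw source record lands as-is in a single VARIANT column.
CREATE TABLE IF NOT EXISTS stg_policy_raw (
    raw_record VARIANT
);

-- Load JSON files from the S3-backed external stage into the VARIANT column.
COPY INTO stg_policy_raw
FROM @landing_s3/policies/
FILE_FORMAT = (TYPE = 'JSON');

-- Downstream vault loads can then pull typed fields out of the raw records, e.g.:
SELECT raw_record:policy_id::STRING   AS policy_id,
       raw_record:customer_id::STRING AS customer_id
FROM   stg_policy_raw;
```

Keeping the staged data in VARIANT form preserves the raw payload, so the vault layer can be reloaded or extended later without going back to the source system.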
“We think of our architecture from left to right. The far left is data in its most raw form, and the far right is information that has been fully transformed, cleansed, and is ready to be consumed by the business,” said Ken.
Data Quality Faults With Your Data Vault
There are many benefits to data vault architecture, but it does create more tables with more complex transformations and relationships between upstream and downstream assets than other methodologies. This can create data quality challenges if not addressed properly.
Some of these challenges include:
Code Maintenance
The ETL code for Hub, Satellite, and Link tables must follow the same rules for common column value definitions (like business and hash key definitions) so that the tables can load independently. As a result, any change to that code may have to be made in multiple places to keep the definitions consistent.
One tip? “Our code generation is metadata-driven, so when we change the metadata in one place, it regenerates the ETL code wherever that particular metadata is used,” said Ken.
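To illustrate why that consistency matters, here is a hedged sketch using the illustrative tables from earlier plus a hypothetical typed staging view, stg_customer (assumed to expose fields extracted from the VARIANT staging table). If the hub and satellite loads hash the business key even slightly differently, their rows will never join; whether the expression is generated from metadata or maintained by hand, it must be identical in every load.

```sql
-- Hub load: register business keys not yet present, using the shared hashing rule.
INSERT INTO hub_customer (customer_hash_key, customer_id, load_date, record_source)
SELECT DISTINCT
       MD5(UPPER(TRIM(src.customer_id))),   -- shared business-key hashing rule
       src.customer_id,
       CURRENT_TIMESTAMP,
       'crm_extract'
FROM   stg_customer src
WHERE  NOT EXISTS (
           SELECT 1
           FROM   hub_customer h
           WHERE  h.customer_hash_key = MD5(UPPER(TRIM(src.customer_id)))
       );

-- Satellite load: must use the identical expression, or its rows orphan from the hub.
-- (A real load would also compare a hash diff so only changed attributes are inserted.)
INSERT INTO sat_customer_details (customer_hash_key, load_date, customer_name, customer_segment, record_source)
SELECT MD5(UPPER(TRIM(src.customer_id))),   -- same rule, same result
       CURRENT_TIMESTAMP,
       src.customer_name,
       src.customer_segment,
       'crm_extract'
FROM   stg_customer src;
```

Metadata-driven generation, as Ken describes, keeps a single definition of that expression in one place and stamps it into every generated script.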
Multiple, Complex Transformations Between Layers
Transformations are a necessary step in any data engineering pipeline using any methodology, but they can create data quality incidents. This can happen either as the transformation code gets modified (perhaps incorrectly) or the input data isn’t aligned with the underlying expectations of the transformation model (perhaps there has been an unexpected schema change or the data doesn’t arrive on time).
Long blocks of transformation code at multiple layers within a data vault architecture can compound these errors and make root cause analysis more difficult. A best practice here is to keep transformations as simple as possible.
“We are working to evolve our design to apply complex transformations in only one place, the Information Warehouse, within the data pipeline,” said Ken. “As the raw data vault scales, the transformation logic becomes more complex, so we are designing ways to reduce complexity.”
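One way to keep the vault-side logic thin, sketched here with the illustrative tables from earlier (the view name and column choices are assumptions, and QUALIFY is Snowflake syntax), is to leave the raw vault loads as near pass-throughs and concentrate business logic in a dimensional view in the Information Warehouse layer:

```sql
-- Current-state customer dimension built downstream of the vault.
-- The vault tables stay raw; the presentation logic lives in this single view.
CREATE OR REPLACE VIEW dim_customer AS
SELECT h.customer_id,
       s.customer_name,
       s.customer_segment,
       s.load_date AS effective_from
FROM   hub_customer h
JOIN   sat_customer_details s
  ON   s.customer_hash_key = h.customer_hash_key
QUALIFY ROW_NUMBER() OVER (
            PARTITION BY s.customer_hash_key
            ORDER BY s.load_date DESC) = 1;  -- keep only the latest satellite row per customer
```

Because the view is virtual, it can be redefined as the business logic evolves without reloading any vault tables.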
Maintaining Integrity Across the Hub, Link, and Satellite Tables
Any load errors (or other issues) in the data vault hub, link, and satellite tables will mar downstream queries, producing outputs with partial or missing data.
“The key is to have automated validation checks or data observability in place to detect these anomalies when they happen,” said Ken.
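As a hedged sketch of the kind of automated validation check Ken describes, again using the illustrative tables from earlier: count satellite and link rows whose hash keys have no matching hub row, and alert whenever the counts are non-zero. A data observability tool can run and monitor checks like these continuously.

```sql
-- Orphaned satellite rows: attribute history with no parent hub key.
SELECT COUNT(*) AS orphaned_satellite_rows
FROM   sat_customer_details s
LEFT JOIN hub_customer h
  ON   h.customer_hash_key = s.customer_hash_key
WHERE  h.customer_hash_key IS NULL;

-- Orphaned link rows: relationships pointing at a customer that never loaded into the hub.
SELECT COUNT(*) AS orphaned_link_rows
FROM   link_customer_policy l
LEFT JOIN hub_customer h
  ON   h.customer_hash_key = l.customer_hash_key
WHERE  h.customer_hash_key IS NULL;
```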
Published at DZone with permission of Lior Gavish. See the original article here.