Data Lineage in a Data-Driven World
Data lineage plays a pivotal role across various aspects of data governance and system evolution. It helps organizations to make informed decisions about data.
Join the DZone community and get the full member experience.
Join For FreeData Lineage
It won’t be an exaggeration to say that the success of today's business is driven by the data. Whether it be a small enterprise or a big business house, everyone has understood that data can give them an edge in this competitive world. This realization of the importance of data is leading them towards implementing better data governance in their business. Data lineage is an important function of data governance that tracks the journey of data from its origin to its final destinations via various hops.
Importance of Data Lineage
The necessity for data lineage in businesses arises due to various factors and different reasons that may apply to different enterprises.
- Inorganic growth of data: Across various industries, the accumulation of data has surged significantly over time. The rapid pace at which organizations accumulated vast amounts of data led to multiple challenges. Organizations seldom follow good old practices of documentation, data modeling, and standards. This resulted in the inorganic growth of data lacking comprehensive or updated documentation. This growth served the needs of the organizations in the short term. But in the long run, it caused issues like redundancies, disparities, and inconsistencies and had a direct impact on strategic initiatives around data.
- Lack of SME knowledge: In certain organizations – software or codes has persisted for a decade or longer without substantial modifications. The software industry has been known for its high attrition rate. There is a rare chance that people who would have designed these systems or built these codes are still working in the same organization. So, over time, fewer people will know these codes. In some places, it has gone to the extent that the entire system has become a black box. People are just using it as it is working. Now, if an organization wants some enhancements in such a system, they face the challenge of lacking SME knowledge.
- Lack of understanding of old-generation languages: In industries at the forefront of the IT revolution, such as banks and insurance companies, there still exist programs coded in COBOL and Fortran that operate on mainframe computers. Over time, these skills have become very rare. You will hardly find any new-generation programmer who can understand these languages. Now, if an organization wants to make some changes in such codes, they find it difficult to find such people. We have seen a lot of technology migration projects these days where they want to transition from COBOL to new world languages, and then lineage becomes crucial there.
- Lack of data lineage in procedural languages: Procedural languages such as PL/SQL, BTEQ, SAS, and Python play an important role in data-intensive operations. Codes written in these procedural languages lack inherent data lineage capabilities. Unlike ETL tools, which provide a graphical interface, these codes become difficult to interpret. Complex code may also use dynamic code constructions (e.g., SAS Macros) making it more difficult to understand the data flow. So, you need not only a technical expert but a functional expert to establish data lineage.
- Data spread across multiple systems and software: In today's interconnected world — from its origin to final destination — data flows from multiple systems developed using diverse technology. Some tools, like Ab Initio and SAS DI studio, can generate their data lineage. However, not all technologies or systems provide this functionality. An organization having a mix of technologies may have partial/fragmented lineage available but not the complete lineage. Furthermore, such lineage will be disconnected as Ab Initio will hold lineage for codes written in AB initio, and SAS will hold lineage for codes written in SAS. So even lineage is present – it’s a disconnected lineage. What organizations seek is a complete journey of data on a single canvas, regardless of the technology used in the movement of data.
- End user-defined applications: EUDAs – of late, EUDA has become another concern of organizations. These EUDAs are diverse, ranging from numerous spreadsheets to small-scale applications or ad hoc codes. Each organization has tens and hundreds of such EUDAs. They may vary in size and complexity, but their existence is widespread. They influence operations in almost every organization. However, the presence of these EUDAs often disrupts the data lineage, creating gaps and inconsistencies within the lineage records.
- Regulatory requirements: This is one of the main reasons that is forcing organizations to be serious about their Data Lineage. The organizations regulated by governments and independent bodies have been mandated to demonstrate the data lineage. An example of such control would be BCBS239, which applies to the banking sector. Under BCBS 239, banks are challenged with understanding, managing, and tracing their metadata. Similar controls exist for other industries. Another example is SOX controls, which go one step ahead and demand the placement of reconciliation control at every important junction of data. How can someone put a control unless they understand a data lineage?
Types and Levels of Data Lineage
In data modeling, we have different levels of data models for different types of users and use cases. For example, the subject area/conceptual model will give provide high-level entity and relationships for analysts and architects. A logical data model is more for designers to explore key normalization. Finally, the physical model is for developers to understand physical aspects. Overall, it represents the same information for different use cases. The same analogy can be applied to data lineage, where we need different types and levels of data lineage for different users.
Business Lineage
This can be further divided into high-level and process-level lineage.
High Level
The main purpose of business lineage is to understand the impact of data on the business.
- It describes the source and evolution of data into business information.
- At a high level, it describes the purpose of the data and the users/stakeholders of the data.
- This lineage is for non-technical users. This lineage is at a very high level and does not capture any technical details like low-level components, databases, tables, etc.
- It helps business users make business-focused decisions. It describes how different processes or sources of information contribute to business-relevant data.
- High-level business lineage helps in data governance and compliance
Process Level
This is one level below a high-level lineage where the focus is on individual processes rather than the business area as a whole.
- It describes process-level data dependencies on upstream and downstream processes.
- It elaborates a process into sub-processes and articulates the data flow at each level.
- This lineage is particularly useful for Business Analysts to understand/describe various processes in detail.
Technical Lineage
This can be further divided into Component Level and Attribute level.
Entity/Component Level
This lineage provides a detailed view of data flows at a technical level.
- Technical lineage at this level focuses on the details of technical data flow, including the source of the data and the high-level transformations it undergoes.
- It talks about how data is joined, inclusion and exclusion applied, which data is dominant in joins, and so on.
- It talks about technologies that facilitate processing — detailing databases, applications, and data integration tools.
- This lineage serves as an Input for reverse engineering in Technology Migration Programs.
Attribute Level Lineage
This lineage is one level below entity level lineage where the grain of the lineage is at individual attribute in a table.
- This is the most granular level lineage but, at the same time, a comprehensive lineage.
- It details the transformation/derivation applied to develop a target attribute.
- It also lists the source attribute(s) that have contributed to the derivation logic of the target attribute. In lineage terms, these are called direct contributors or direct lineage.
- This lineage also describes the attributes made up using hardcoded values – which are also called constants in lineage terms.
- In addition to direct lineage, attribute-level lineage also covers indirect lineage. These attributes are not part of transformation logic but are used in conditions, case statements, filter conditions, group by clause, etc., and hence are indirect contributors. This is explained with examples later in the document.
- Indirect lineage is also referred to as dependency lineage or conditional lineage.
- Adding indirect lineage to lineage analysis may complicate the lineage and, in some cases, would add huge attributes to the analysis. Hence, a good data lineage tool must provide a filter to include or exclude indirect lineage from presentation/lineage output.
Indirect Lineage Examples
Example 1
IF preferred_contact_method = 'mobile' THEN
contact_number = mobile_number
ELSE
contact_number = landline_number
END IF
In this example, the target attribute contact_number
will either hold the value of mobile_number
or landline_number
. So, these two are direct contributors, and hence, it is a direct lineage. However, in order to determine which one to use – one needs to evaluate the condition of the attribute preferred_contact_method
. So preferred_contact_method value
— although not flowing — has an impact on which value will flow and is the indirect lineage contributor.
Let’s take another example of INSERT
statement.
INSERT INTO my_table (student_id,student_name)
SELECT s_id,s_name from landing_table where age >=18;
In this example, Student_name
is populated from s_name
which is direct lineage.
However, the record needs to satisfy the age condition, and hence, age becomes an indirect lineage contributor.
Lineage Contributors and Non-Contributors
It is important to understand which codes contribute to lineage and which do not. As such, every data movement – where data changes – is a lineage contributor. However, there are some non-contributing codes as well. Let’s see these in detail.
Contributors
- Every insert/update /merge statement contributes to a data lineage.
- Any API calls facilitating data changes contribute to a data lineage.
- DDL statements such as Create View and Create Table AS Select form part of a lineage.
- Any row / cursor-based data manipulation creates a lineage if data movement occurs.
- Variables are populated via SELECT INTO statement and afterward used in a data flow to form part of a lineage.
- Code that dynamically generates other codes will not have a meaning. However, the generated code must be separately processed to derive lineage out of it.
Non-Contributors
- Delete statement does not have an impact on data movement. It affects the number of records only.
- DDL statements like Create table (just definition), Create index, Alter table
- Statements like statistics computation for performance
- Mere select statements sorting of data does not contribute to lineage
- Any code that is put in for auditing purposes can be skipped from the lineage
Lineage Use Cases
Impact and Dependency Analysis
Through attribute-level lineage, organizations gain a comprehensive understanding of the data flow, enabling them to make informed decisions about attribute modifications, system changes, or report modifications. Attribute level lineage can help in impact and dependency analysis. It helps answer questions like what will impact the system if a particular attribute gets removed. Using the lineage, you can identify which attributes are used in a particular report.
Compliances
Data lineage serves as an important tool in tracing the complete journey of an attribute from its inception to its final destination within the data flow. Regulatory compliance like SOX demands completeness and accuracy checks on KDEs throughout the data journey at crucial hops. Attribute-level lineage facilitates the implementation of the same.
System/Technology Migration
In migration or reverse engineering projects, attribute-level lineage becomes crucial as it offers an extensive repository of information with a granular breakdown of the data flow. It shows how attributes traverse through various stages, what transformations are applied, and relationships within the data. From an architect's point of view — lineage can help them design new systems by providing deep insights into the current world. For developers, lineage may serve as a low-level source to target mapping documents, which helps them identify transformation logic.
Root Cause Analysis/Debugging
Lineage can be useful in root cause analysis when something goes wrong. It helps in identifying the origin of incorrect data or the point where transformations might have gone wrong.
Conclusion
Data lineage plays a pivotal role across various aspects of data management, data governance, and system evolution. Through its detailed insights into data flow, it empowers organizations to make informed decisions when it comes to modifications to the system, impact and dependency analysis, and root cause analysis. This lineage is a crucial input for impact and dependency analysis. In regulated organizations, it helps in the implementation of controls and thus helps in achieving compliance. Organizations must give due consideration to lineage in today's data-driven world.
Opinions expressed by DZone contributors are their own.
Comments