Data lineage comes up often in discussions of data management, data governance, data quality, BI, reporting, and analytics. Let’s take a closer look at what it is and how it can be created and maintained.
Dataversity provides the following definition: “Data Lineage describes data origins, movements, characteristics, and quality. […] Meaningful Data Lineage needs to contain multiple dimensions: who, what, where, why, and how” or the 5 Ws. We want to know what the dataset is, who created or updated it and why, where it happened, and how to access it. We want to see the complete provenance path of the data from creation all the way to its consumption. If there are any interim repositories, we need to know what happened to the data there (again, the 5 Ws).
Source: Data Lineage Demystified
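As a rough illustration of the definition above, a lineage entry can be modeled as a record holding the who, what, where, why, and how of a dataset, plus links to its upstream sources. The sketch below is hypothetical: the `LineageRecord` class, the `provenance_path` function, and the sample dataset names are all assumptions made for illustration and do not come from Dataversity or any specific tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LineageRecord:
    """One dataset's lineage entry (hypothetical structure)."""
    what: str                  # dataset name
    who: str                   # owner or steward who created/updated it
    where: str                 # system or repository where it lives
    why: str                   # business reason for the dataset or change
    how: str                   # how to access it
    sources: List[str] = field(default_factory=list)  # upstream dataset names


def provenance_path(dataset: str, records: Dict[str, LineageRecord]) -> List[str]:
    """Return all upstream datasets of `dataset`, nearest first (breadth-first)."""
    found: List[str] = []
    queue = list(records[dataset].sources) if dataset in records else []
    while queue:
        name = queue.pop(0)
        if name in found:
            continue  # already visited via another path
        found.append(name)
        if name in records:
            queue.extend(records[name].sources)
    return found


# Illustrative data only: three datasets forming a simple chain.
records = {
    "crm_raw": LineageRecord("crm_raw", "CRM team", "CRM system",
                             "customer capture", "nightly export"),
    "crm_clean": LineageRecord("crm_clean", "data steward", "staging area",
                               "deduplication", "SQL view", sources=["crm_raw"]),
    "sales_report": LineageRecord("sales_report", "BI analyst", "BI platform",
                                  "monthly reporting", "dashboard",
                                  sources=["crm_clean"]),
}

upstream = provenance_path("sales_report", records)
print(upstream)
```

Walking the `sources` links from the report back to the raw extract is the "complete provenance path" in miniature; each interim repository along the way carries its own record.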
People: Data owners/stewards – providing timely updates on the status of their datasets, the source data used, and the transformations performed on them.
Owners of automated solutions – providing transparent descriptions of the data transformations happening within their solutions.
Data and business architects – ensuring data flows reflect the flow of business processes.
Process: Data lineage can be observed and monitored in one place (via specialized technology), but it is managed wherever data is created, changed, or moved. It therefore requires a solid, well-controlled process for timely updates to each part of the overall lineage picture, including updates triggered by sequential dependencies across different data management processes.
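One way to picture updates triggered by sequential dependencies is to treat lineage as a directed graph and, when one dataset changes, list every downstream dataset whose lineage entry may need review. The following is a minimal sketch under that assumption; the `downstream_of` function and the dataset names are hypothetical and not part of any particular product.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple


def downstream_of(changed: str, edges: List[Tuple[str, str]]) -> List[str]:
    """List datasets downstream of `changed`, in breadth-first order.

    `edges` holds (source, target) pairs meaning "target is derived from source".
    """
    children: Dict[str, List[str]] = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)

    ordered: List[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for nxt in children[node]:
            if nxt not in seen:  # guard against revisiting shared descendants
                seen.add(nxt)
                ordered.append(nxt)
                queue.append(nxt)
    return ordered


# Illustrative dependency edges only.
edges = [
    ("orders_raw", "orders_clean"),
    ("orders_clean", "revenue_mart"),
    ("orders_clean", "churn_features"),
    ("revenue_mart", "exec_dashboard"),
]

to_review = downstream_of("orders_raw", edges)
print(to_review)
```

In practice the owners of each listed dataset would be the people asked to confirm or update their part of the lineage picture.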
Technology: Technology options vary widely and can be roughly classified as follows (Disclaimer: software products mentioned below are examples only and cannot be considered as being endorsed or recommended by Info-Tech without due analysis of the member’s requirements and strategy):
Technology Type | Examples | Pros | Cons
Low-tech tools | Info-Tech’s Data Lineage Tool, Dataset Certificate | $0 software cost. Can start today! | Difficult to aggregate and see/analyze the complete picture. 100% manual input.
Specialized tools | TopBraid Enterprise Data Governance, Collibra, IBM InfoSphere Information Governance Catalog, Waterline Data, Alation | Complete data governance suite. Lots of automation in data collection and lineage analysis. | $$ software cost. Disjointed from data management/ETL*.
Architecture tools | | $ software cost. Cost and usage can be shared across architecture & data governance. | Disjointed from data management/ETL*. Optimized for architects.
ETL or data management tools | | 100% accurate lineage – as long as the data is flowing through this platform. Cost and usage can be shared across IT & data governance. | $$$ software cost. Not very business user friendly.
* Some tools can import metadata from data management/ETL but cannot provide input to the data management/ETL processes.
Determine the scope and depth of data lineage required for your organization before looking at the enabling technology.
Remember that data lineage can also be part of a data catalog or the metadata provided by an ETL or data management tool.
A graphical representation of data lineage, with the ability to drill down into details, is feasible and should be the preferred way to document it.