Data lineage

Definition

Data Lineage is metadata that identifies the sources of data and the transformations through which it has passed up to the point of applying. (DAMA-NL, 2020)

Notes 1

Other definitions:

Notes 2

Data lineage answers the 5 W’s of data:

  1. Where does the data come from or where does it go?
  2. Who uses it?
  3. When was it created?
  4. What information does it contain? What transformations are executed?
  5. Why does it exist?
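As an illustration, the five questions can be captured as fields of a single metadata record. The sketch below is only one possible layout; the class name `LineageRecord`, its field names, and all sample values are hypothetical and not taken from the DAMA-NL definition.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class LineageRecord:
    """Hypothetical lineage metadata for one data element."""
    element: str
    source: str                 # Where does the data come from?
    target: str                 # Where does it go?
    users: List[str]            # Who uses it?
    created_on: date            # When was it created?
    content: str                # What information does it contain?
    transformations: List[str]  # What transformations are executed?
    purpose: str                # Why does it exist?


# Sample values are invented for illustration only.
record = LineageRecord(
    element="customer_balance",
    source="core_banking.accounts",
    target="regulatory_report.balances",
    users=["finance", "risk"],
    created_on=date(2020, 1, 31),
    content="End-of-month balance per customer",
    transformations=["convert to EUR", "aggregate per customer"],
    purpose="Monthly regulatory reporting",
)
```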

Synonym

Data chain

Purposes

Life cycle

Phase   Activity
Plan    * Define the scope
        * Select a way to store the data lineage (DL) information, e.g., in an editor or in a DL tool
        * Collect the relevant metadata
        * Enter, change, or delete the metadata
        * “Stitch the nodes”
Do      * Use the DL for its purpose
Check   * Evaluate the effectiveness of the DL
Act     * Adapt the DL
        * Maintain the DL
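One possible reading of “stitching the nodes” is merging lineage fragments collected from separate systems into a single graph, joined on shared node names. The sketch below assumes that reading; the `fragments` dictionary, the system names, and the `stitch` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata fragments, each collected from a different system.
# Every pair is (source node, target node).
fragments = {
    "etl_tool": [("crm.customer", "staging.customer")],
    "warehouse": [("staging.customer", "dwh.customer")],
    "reporting": [("dwh.customer", "report.customers")],
}


def stitch(fragments):
    """Merge per-system edges into one graph: node -> set of downstream nodes."""
    graph = defaultdict(set)
    for edges in fragments.values():
        for source, target in edges:
            graph[source].add(target)
            graph[target]  # touch the key so end nodes also appear in the graph
    return dict(graph)


lineage = stitch(fragments)
```

Because the fragments share node names such as `staging.customer`, the merged graph forms one continuous chain from `crm.customer` to `report.customers`. In practice the names rarely match this neatly, so a mapping step is usually needed first.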

Characteristics and requirements

Characteristic    Requirement
Completeness      DL is complete with respect to its scope.
Maintainability   DL can be maintained efficiently.
Clarity           DL can be interpreted easily (e.g., by zooming and filtering).
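The completeness requirement can be checked mechanically once the scope is written down as a list of data elements. The following is a minimal sketch under that assumption; the function `missing_from_lineage` and the element names are hypothetical.

```python
def missing_from_lineage(scope, documented):
    """Return the in-scope data elements for which no lineage is recorded."""
    return sorted(set(scope) - set(documented))


# Hypothetical scope and documented elements.
scope = ["report.balance", "report.customer_id", "report.currency"]
documented = ["report.balance", "report.currency"]

gaps = missing_from_lineage(scope, documented)
```

An empty result would indicate that the lineage is complete with respect to the chosen scope.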

Relations

Data lineage is parent of backward data lineage
Data lineage is parent of forward data lineage
Data lineage is parent of horizontal data lineage
Data lineage is parent of vertical data lineage
Data lineage is an element of a data quality management system
Data lineage is part of the business or technical metadata
Data lineage covers a set of data elements, in particular the critical data elements
Data lineage facilitates the root cause analysis of data issues
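Backward and forward lineage can be pictured as two traversal directions over the same graph: backward tracing follows data to its sources, which supports root cause analysis, and forward tracing follows data to its consumers. The sketch below assumes lineage is stored as (source, target) pairs; the edge list and the `trace` function are hypothetical.

```python
# Hypothetical lineage edges: (source, target).
EDGES = [
    ("crm.customer", "staging.customer"),
    ("core.account", "staging.account"),
    ("staging.customer", "dwh.customer_balance"),
    ("staging.account", "dwh.customer_balance"),
    ("dwh.customer_balance", "report.balances"),
]


def trace(start, edges, direction="backward"):
    """Return every node reachable from `start` in the given direction."""
    neighbours = {}
    for source, target in edges:
        if direction == "backward":
            neighbours.setdefault(target, set()).add(source)
        else:
            neighbours.setdefault(source, set()).add(target)

    found, todo = set(), [start]
    while todo:
        node = todo.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in found:  # also protects against cycles
                found.add(nxt)
                todo.append(nxt)
    return found


# Root cause analysis: which upstream nodes feed the report?
upstream = trace("report.balances", EDGES, "backward")

# Forward tracing: what is affected if the CRM table changes?
downstream = trace("crm.customer", EDGES, "forward")
```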

Example(s)

Example 1: Horizontal data lineage

Example 2: Horizontal data lineage

Example 3: Horizontal and vertical data lineage

Story

Legislation requires the Valencia bank to report monthly to its regulator, the central bank. The regulator, however, also wants to know how these reports are produced and where the data comes from, so that it can assess the quality of the data.

Because the reports are generated by complex data flows, the bank decides to apply data lineage to map these flows and make them visible. It soon turns out that fields with the same meaning have different names in the systems involved.

Nevertheless, the fields can be linked, and it becomes clear where the reported data comes from. The bank can now satisfactorily inform the regulator about the origin of the reported data. A data steward is made responsible for maintaining the data lineage in the tool, so that the metadata is kept up to date.

Data lineage also proves useful when the systems are changed: the downstream impact of a change can be assessed more quickly.

Reference(s)