What is Data Lineage
Tracking data origin
Data lineage is the record of the complete path data takes from source to consumer, including every transformation, aggregation, and movement between systems along the way.
Lineage Types
| Type | Description |
|------|-------------|
| Technical Lineage | At table, column, SQL level |
| Business Lineage | Business terms and KPIs |
| Operational Lineage | Jobs, schedules, dependencies |
| Column-level | Field-level transformations |
Why Data Lineage Matters
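- Impact analysis: which downstream tables, reports, and jobs break when a source changes
- Root cause analysis: where in the pipeline an error originated
- Compliance: demonstrating adherence to regulations such as GDPR and SOX
- Documentation: understanding what the data means and where it comes from
- Migration: planning the move from one system to another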
Tools
| Tool | Features |
|------|----------|
| Apache Atlas | Open source, built for the Hadoop ecosystem |
| OpenLineage | Open standard with integrations for common tools |
| DataHub | Originated at LinkedIn, graph-based |
| Atlan | Modern data catalog |
| Collibra | Enterprise data governance |
Automatic Lineage Collection
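- SQL parsing: analyzing queries to find the tables and columns they read and write
- API integrations: lineage events emitted by Airflow, dbt, and Spark
- Log analysis: reconstructing data flows from the logs of processing systems
- Metadata harvesting: pulling schema and dependency metadata from catalogs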
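
As an illustration of the SQL parsing approach, here is a minimal sketch that extracts table-level lineage from a single statement. It is intentionally naive: the `extract_table_lineage` helper, the regular expressions, and the table names are all hypothetical, and it only handles simple `INSERT INTO ... SELECT` and `CREATE TABLE ... AS SELECT` statements. Production tools rely on a full SQL parser rather than regular expressions.

```python
import re

# Hypothetical, deliberately naive extractor (an assumption for illustration).
# CTEs, subquery aliases, and quoted identifiers are not handled.
TARGET_RE = re.compile(
    r"\b(?:insert\s+into|create\s+(?:or\s+replace\s+)?table)\s+([\w.]+)",
    re.IGNORECASE,
)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_table_lineage(sql: str) -> dict:
    """Return the target table and the sorted source tables of one statement."""
    target_match = TARGET_RE.search(sql)
    target = target_match.group(1) if target_match else None
    # Every table named after FROM or JOIN counts as a source.
    sources = sorted({name for name in SOURCE_RE.findall(sql) if name != target})
    return {"target": target, "sources": sources}


# Made-up example statement.
example_sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount)
FROM raw.orders AS o
JOIN raw.customers AS c ON c.id = o.customer_id
GROUP BY o.order_date
"""

lineage = extract_table_lineage(example_sql)
print(lineage)
# {'target': 'analytics.daily_revenue', 'sources': ['raw.customers', 'raw.orders']}
```

Each result is one set of edges in the lineage graph: every source table points to the target table.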
Visualization
- Dependency graphs
- Upstream/downstream analysis
- Impact assessment
- Transformation timeline
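
Upstream and downstream analysis both reduce to a traversal of the dependency graph. The sketch below assumes lineage is already available as a list of (upstream, downstream) pairs. The dataset names, `EDGES`, `build_adjacency`, and `reachable` are hypothetical and not taken from any particular tool.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream dataset, downstream dataset).
EDGES = [
    ("raw.orders", "staging.orders_clean"),
    ("raw.customers", "staging.customers_clean"),
    ("staging.orders_clean", "analytics.daily_revenue"),
    ("staging.customers_clean", "analytics.daily_revenue"),
    ("analytics.daily_revenue", "dashboards.revenue_kpi"),
]


def build_adjacency(edges):
    """Build downstream and upstream adjacency maps from edge pairs."""
    downstream = defaultdict(set)
    upstream = defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream


def reachable(start, adjacency):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen and neighbor != start:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


downstream, upstream = build_adjacency(EDGES)

# Impact analysis: everything that depends on raw.orders.
impacted = reachable("raw.orders", downstream)
print(sorted(impacted))
# ['analytics.daily_revenue', 'dashboards.revenue_kpi', 'staging.orders_clean']

# Root cause analysis: everything the dashboard is derived from.
origins = reachable("dashboards.revenue_kpi", upstream)
print(sorted(origins))
```

The same traversal serves both directions; only the adjacency map changes.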
Practical Applications
- Debugging data issues
- Compliance reporting
- Data migration planning
- New employee onboarding
- Data asset documentation