What Is a Data Lake?
Raw data storage in any format
What Is a Data Lake?
A data lake is a centralized repository that lets you store structured and unstructured data at any scale, in its original form, without preprocessing it first.
Data Lake vs Data Warehouse
| Characteristic | Data Lake | Data Warehouse |
|----------------|-----------|----------------|
| Data | Raw, unprocessed | Processed, structured |
| Schema | Schema-on-read | Schema-on-write |
| Users | Data Scientists, engineers | Business analysts |
| Flexibility | High | Limited |
| Cost | Low | High |
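The schema-on-read row is easiest to see in code. The following is a minimal sketch, not a reference implementation: the sample records, the field names, and the helper `read_with_schema` are all hypothetical and use only the Python standard library. The raw lines are stored untouched, and each consumer applies its own schema when it reads them.

```python
import json

# Raw events as they might land in a data lake: nothing is validated on write.
# The records and field names are invented for illustration.
raw_lines = [
    '{"user_id": "42", "amount": "19.99", "country": "DE"}',
    '{"user_id": "7", "amount": "5"}',
    '{"user_id": "13", "amount": "n/a", "note": "free-form field"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time.

    `schema` maps field names to cast functions. Fields that are missing
    or fail to parse become None instead of rejecting the whole record.
    """
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            try:
                row[field] = cast(record[field])
            except (KeyError, ValueError):
                row[field] = None
        yield row


# Two consumers read the same raw data with different schemas.
billing_schema = {"user_id": int, "amount": float}
geo_schema = {"user_id": int, "country": str}

billing_rows = list(read_with_schema(raw_lines, billing_schema))
geo_rows = list(read_with_schema(raw_lines, geo_schema))
```

A schema-on-write system would instead reject or transform the second and third records at load time, before any consumer saw them.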
Data Lake Architecture
- Bronze Layer — raw data (as-is)
- Silver Layer — cleaned and validated
- Gold Layer — aggregated for analytics
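The three layers can be sketched as a small in-memory pipeline. This is an illustrative toy under stated assumptions: the order records, the validation rules, and the function names `to_silver` and `to_gold` are hypothetical, and a real data lake would read and write files or tables rather than Python lists.

```python
from collections import defaultdict

# Bronze: raw records kept exactly as received (hypothetical sample data).
bronze = [
    {"order_id": "1", "country": "de", "amount": "10.50"},
    {"order_id": "2", "country": "US", "amount": "oops"},   # unparseable amount
    {"order_id": "3", "country": "DE ", "amount": "4.50"},
    {"order_id": "1", "country": "de", "amount": "10.50"},  # duplicate order
]


def to_silver(records):
    """Clean and validate: parse amounts, normalize country codes,
    and drop invalid or duplicate orders."""
    seen = set()
    silver = []
    for rec in records:
        try:
            amount = float(rec["amount"])
        except (KeyError, ValueError):
            continue  # skip records without a usable amount
        order_id = rec.get("order_id")
        if order_id is None or order_id in seen:
            continue  # skip records without an ID and repeated orders
        seen.add(order_id)
        silver.append({
            "order_id": order_id,
            "country": rec.get("country", "").strip().upper(),
            "amount": amount,
        })
    return silver


def to_gold(records):
    """Aggregate for analytics: total order amount per country."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["country"]] += rec["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
```

Because `bronze` is never modified, the silver and gold layers can be rebuilt at any time if the cleaning rules change.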
Popular Platforms
| Platform | Features |
|----------|----------|
| AWS S3 + Athena | Serverless, pay-per-query |
| Azure Data Lake | Power BI integration |
| Google Cloud Storage | BigQuery integration |
| Apache Hadoop HDFS | Open-source, on-premises |
| Databricks Delta Lake | ACID transactions |
Storage Formats
- Parquet: columnar, compressed, fast analytical queries
- ORC: columnar, optimized for Hive
- Avro: row-based, supports schema evolution
- JSON/CSV: plain text, for simple scenarios
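The difference between row-based and columnar layouts can be shown without any file-format library. The sketch below is a pure-Python illustration, not how Parquet, ORC, or Avro are implemented: the sample rows and the helpers `to_columns` and `run_length_encode` are hypothetical. It shows why a columnar layout lets a query touch only the column it needs, and why repeated values in a column compress well. Run-length encoding stands in here for the family of encodings columnar formats can apply.

```python
# Row-based layout: each record is stored together (sample data is invented).
rows = [
    {"id": 1, "city": "Berlin", "temp_c": 21.5},
    {"id": 2, "city": "Oslo", "temp_c": 14.0},
    {"id": 3, "city": "Berlin", "temp_c": 23.0},
]


def to_columns(records):
    """Pivot row-based records into a columnar layout: one list per field."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeated values into [value, count] pairs."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = to_columns(rows)

# An aggregate over one field reads a single column and ignores the rest.
avg_temp = sum(columns["temp_c"]) / len(columns["temp_c"])

# Values of one type sit next to each other, so repeats compress well.
encoded_city = run_length_encode(sorted(columns["city"]))
```

A row-based layout is the better fit when whole records are written or read one at a time, which is the typical case for streaming ingestion.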
Benefits
- Store any data type
- Low storage costs
- Flexibility for ML/AI tasks
- Scalability to petabytes
- Preserve original data