Open Table Format = the metadata layer that makes files in a data lake behave like database tables.
What is a Data Lake?
A data lake is a storage system that keeps large amounts of raw and processed data in inexpensive object storage such as:
- Amazon S3
- Google Cloud Storage
- Azure Blob Storage
Instead of storing data inside a traditional database, the data is stored as files:
s3://company-data/
├── sales/
│ ├── 2026-01.parquet
│ ├── 2026-02.parquet
│ └── 2026-03.parquet
├── customers/
└── products/
Common file formats:
- Parquet
- ORC
- Avro
- JSON
- CSV
Why Data Lakes Became Popular
Advantages:
- Cheap storage
- Virtually unlimited scale
- Supports structured and unstructured data
- Decouples storage from compute
Multiple engines can read the same data:
- Apache Spark
- Trino
- Presto
- Apache Flink
- DuckDB
The Problem with Traditional Data Lakes
Imagine a table:
sales
stored as:
sales/
├── part-001.parquet
├── part-002.parquet
└── part-003.parquet
Questions become difficult:
- Which files belong to the latest version?
- What happens if a write fails halfway?
- How do you perform transactions?
- How do you update or delete records?
- How do you support schema evolution?
- How do multiple writers avoid corrupting data?
A plain data lake doesn’t answer these questions.
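To make the "write fails halfway" question concrete, here is a minimal Python sketch that uses a local temporary directory as a stand-in for object storage. The helper names (`write_files`, `list_table_files`) and the placeholder file contents are assumptions made for illustration; no real Parquet data is written.

```python
import tempfile
from pathlib import Path


def write_files(table_dir, names, crash_after=None):
    """Write files one by one; raise after `crash_after` files to mimic a failed job."""
    for written, name in enumerate(names):
        if crash_after is not None and written == crash_after:
            raise RuntimeError("writer crashed mid-job")
        (table_dir / name).write_bytes(b"placeholder, not real Parquet data")


def list_table_files(table_dir):
    """A plain data lake 'table' is simply whatever files the directory contains."""
    return sorted(p.name for p in table_dir.glob("*.parquet"))


with tempfile.TemporaryDirectory() as tmp:
    sales = Path(tmp) / "sales"
    sales.mkdir()
    write_files(sales, ["part-001.parquet"])  # an earlier job that finished
    try:
        # This job intends to add two files but dies after the first one.
        write_files(sales, ["part-002.parquet", "part-003.parquet"], crash_after=1)
    except RuntimeError:
        pass
    visible_after_crash = list_table_files(sales)

print(visible_after_crash)  # ['part-001.parquet', 'part-002.parquet']
```

A reader that lists the directory sees `part-002.parquet` even though the job that produced it never finished, and nothing in the directory marks that file as incomplete.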
This is where Open Table Formats come in.
What is an Open Table Format (OTF)?
An Open Table Format adds a metadata layer on top of files stored in a data lake.
Examples:
- Apache Iceberg
- Delta Lake
- Apache Hudi
Think of it as:
Object Storage
+
Table Metadata
+
Transactions
+
Schema Management
Result:
Data Lake + OTF = Lakehouse
How Iceberg Works
Suppose we have a table:
sales
The actual files are:
sales/
├── data-1.parquet
├── data-2.parquet
└── data-3.parquet
Iceberg adds metadata files:
sales/
├── metadata/
│   ├── v1.metadata.json
│   ├── v2.metadata.json
│   └── manifest files
└── data/
    ├── data-1.parquet
    ├── data-2.parquet
    └── data-3.parquet
Instead of scanning directories, query engines read Iceberg metadata.
The metadata tells them:
- Which files belong to the table
- Which snapshot is current
- Which partitions exist
- Column statistics
- Schema versions
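The lookup path described above can be sketched in a few lines of Python. The dictionaries below are a deliberately simplified stand-in for Iceberg metadata, and the names `TABLE_METADATA`, `MANIFESTS`, and `plan_scan` are invented for this example; real metadata files carry much more detail, and real snapshots reference a manifest list rather than the manifests directly.

```python
# Toy stand-ins for Iceberg metadata (assumed structure, for illustration only).
TABLE_METADATA = {
    "current_snapshot_id": 2,
    "snapshots": {
        1: {"manifests": ["manifest-a"]},
        2: {"manifests": ["manifest-a", "manifest-b"]},
    },
}

# Each manifest records the data files it tracks.
MANIFESTS = {
    "manifest-a": ["data/data-1.parquet", "data/data-2.parquet"],
    "manifest-b": ["data/data-3.parquet"],
}


def plan_scan(metadata, manifests, snapshot_id=None):
    """Return the data files for a snapshot without listing any directory."""
    if snapshot_id is None:
        snapshot_id = metadata["current_snapshot_id"]
    snapshot = metadata["snapshots"][snapshot_id]
    files = []
    for manifest_name in snapshot["manifests"]:
        files.extend(manifests[manifest_name])
    return files


current_files = plan_scan(TABLE_METADATA, MANIFESTS)
print(current_files)
```

The engine follows pointers from the current snapshot to the manifests and then to the data files, so files that are not referenced by any manifest are never read.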
What is an Iceberg Catalog?
The catalog is the entry point to Iceberg tables.
Think of it as the table registry.
Without a catalog:
Where is table sales?
With a catalog:
analytics.sales
The catalog knows:
analytics.sales
↓
metadata location
↓
s3://warehouse/sales/metadata/v25.metadata.json
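Conceptually, the catalog is a mapping from table name to the location of the current metadata file. The sketch below models only that idea in Python; `InMemoryCatalog` and its methods are hypothetical names, not the API of any real catalog implementation.

```python
class InMemoryCatalog:
    """Hypothetical catalog: maps a table name to its current metadata file."""

    def __init__(self):
        self._metadata_locations = {}

    def register_table(self, name, metadata_location):
        """Record (or update) where the table's current metadata file lives."""
        self._metadata_locations[name] = metadata_location

    def load_table(self, name):
        """Resolve a table name to its metadata location, or fail clearly."""
        try:
            return self._metadata_locations[name]
        except KeyError:
            raise LookupError(f"table not found in catalog: {name}") from None


catalog = InMemoryCatalog()
catalog.register_table(
    "analytics.sales", "s3://warehouse/sales/metadata/v25.metadata.json"
)
location = catalog.load_table("analytics.sales")
print(location)
```

An engine only needs the table name; everything else (snapshots, manifests, data files) is reachable from the metadata location the catalog returns.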
Popular Iceberg catalogs:
- Apache Hive Metastore
- Project Nessie
- AWS Glue Data Catalog
- Apache Polaris
- Snowflake Open Catalog
Why OTFs are Important
1. ACID Transactions
Traditional data lake:
Write file
Write file
Write file
Crash
The table becomes inconsistent: readers may see some of the new files but not the rest.
Iceberg:
Create new snapshot
Commit metadata atomically
Either all changes appear or none do.
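The all-or-nothing behavior comes from changing a single pointer to the current snapshot. The Python sketch below illustrates the idea with a compare-and-swap guarded by a lock; the class `SnapshotTable` is an invented, in-memory model and does not reflect how any particular catalog implements its atomic swap.

```python
import threading


class SnapshotTable:
    """Toy model: the only mutable state is which snapshot is current."""

    def __init__(self):
        self._lock = threading.Lock()
        self.snapshots = {0: ()}  # snapshot id -> tuple of data file names
        self.current_id = 0

    def commit(self, base_id, new_files):
        """Publish a new snapshot only if nobody committed since base_id."""
        with self._lock:
            if self.current_id != base_id:
                return False  # another writer won; the caller must retry
            new_id = base_id + 1
            self.snapshots[new_id] = self.snapshots[base_id] + tuple(new_files)
            self.current_id = new_id  # the single atomic "switch"
            return True

    def current_files(self):
        return self.snapshots[self.current_id]


table = SnapshotTable()
base = table.current_id                        # both writers start from snapshot 0
ok_a = table.commit(base, ["data-1.parquet"])  # writer A commits first
ok_b = table.commit(base, ["data-2.parquet"])  # writer B is stale and is rejected
print(ok_a, ok_b, table.current_files())
```

A writer that crashes before calling `commit` changes nothing that readers can see, and a writer working from a stale snapshot is rejected instead of silently overwriting another writer's changes.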
2. Time Travel
You can query historical snapshots.
Example:
SELECT *
FROM sales
VERSION AS OF 12345;
Use cases:
- Auditing
- Debugging
- Reproducing reports
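Time travel works because old snapshots stay in the table metadata. Engines such as Spark also accept a timestamp form (`TIMESTAMP AS OF`), which requires finding the snapshot that was current at that moment. The sketch below shows one way that lookup could work in Python; the snapshot log values and the function name `snapshot_as_of` are made up for illustration.

```python
from bisect import bisect_right

# Hypothetical snapshot log: (commit time in ms, snapshot id), oldest first.
snapshot_log = [(1000, 11111), (2000, 12345), (3000, 13579)]


def snapshot_as_of(log, timestamp_ms):
    """Return the id of the snapshot that was current at timestamp_ms."""
    commit_times = [committed_at for committed_at, _ in log]
    # Count the snapshots committed at or before the requested time.
    idx = bisect_right(commit_times, timestamp_ms)
    if idx == 0:
        raise ValueError("no snapshot exists at or before that time")
    return log[idx - 1][1]


print(snapshot_as_of(snapshot_log, 2500))  # 12345
```

Once the snapshot ID is known, the query is planned against that snapshot's file list exactly as it would be for the current one.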
3. Schema Evolution
In a traditional data lake, queries often break when the schema changes, because older files no longer match the columns that readers expect.
Iceberg allows:
ALTER TABLE sales
ADD COLUMN discount DECIMAL(10, 2);
without rewriting all historical files.
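No rewrite is needed because Iceberg tracks each column by a unique field ID rather than by name or position, so a file written before the column existed simply yields null for it. The Python sketch below illustrates that idea; the schemas, row values, and the `read_rows` helper are invented for this example.

```python
# Schemas map a stable field id to the current column name (assumed values).
schema_v1 = {1: "order_id", 2: "amount"}
schema_v2 = {1: "order_id", 2: "amount", 3: "discount"}

# Rows as stored in data files, keyed by field id.
old_file_rows = [{1: 1001, 2: 59.9}]            # written under schema_v1
new_file_rows = [{1: 1002, 2: 20.0, 3: 5.0}]    # written under schema_v2


def read_rows(rows, schema):
    """Project stored rows onto a schema; fields missing from the file become None."""
    return [
        {name: row.get(field_id) for field_id, name in schema.items()}
        for row in rows
    ]


print(read_rows(old_file_rows, schema_v2))
print(read_rows(new_file_rows, schema_v2))
```

The old file is untouched on disk; only the schema in the table metadata changed, and the reader fills in the gap at query time.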
4. Hidden Partitioning
Old-style partitioning:
sales/year=2026/month=05/day=20/
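Users must know the partition columns and filter on them explicitly; otherwise the engine scans every file.
Iceberg instead records how partition values are derived from regular columns (for example, the month or day of order_date) and prunes files automatically.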
Users simply query:
SELECT *
FROM sales
WHERE order_date='2026-05-20';
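The following Python sketch illustrates the mechanism behind a query like this one, assuming the table is partitioned by the month of `order_date`. The transform is simplified to a `(year, month)` pair, and the function names and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def month_transform(d):
    """Simplified month transform: a (year, month) pair derived from the date."""
    return (d.year, d.month)


def write_rows(rows):
    """Group rows into partitions using the transform, as a writer would."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[month_transform(row["order_date"])].append(row)
    return dict(partitions)


def query_by_date(partitions, wanted):
    """Filter on order_date only; the partition to scan is derived, not user-supplied."""
    candidates = partitions.get(month_transform(wanted), [])
    matches = [row for row in candidates if row["order_date"] == wanted]
    return matches, len(candidates)


rows = [
    {"order_id": 1, "order_date": date(2026, 4, 30)},
    {"order_id": 2, "order_date": date(2026, 5, 20)},
    {"order_id": 3, "order_date": date(2026, 5, 21)},
]
partitions = write_rows(rows)
matches, rows_scanned = query_by_date(partitions, date(2026, 5, 20))
print([r["order_id"] for r in matches], rows_scanned)  # [2] 2
```

The query never mentions a partition column, yet only the May 2026 partition is scanned and the April row is skipped.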
5. Multi-Engine Interoperability
The same Iceberg table can be queried by:
- Teradata
- Trino
- Flink
- DuckDB
- Snowflake
- Databricks
- Spark
All engines read the same metadata.
No data duplication.
Architecture Example
A modern architecture might look like:
        +----------------+
        |     Kafka      |
        +--------+-------+
                 |
                 v
        +----------------+
        | Flink / Spark  |
        +--------+-------+
                 |
                 v
        +----------------+
        | Iceberg Table  |
        +--------+-------+
                 |
      -----------------------
      |          |          |
      v          v          v
    Trino    Snowflake   DuckDB
The storage layer is shared while compute engines are independent.
Real-World Use Case
E-Commerce Analytics Platform
Imagine a company similar to Amazon.
Data Sources
- Website clicks
- Orders
- Payments
- Product catalog
- Customer events
Data volume:
10 TB/day
Billions of events
Traditional Approach
Data copied multiple times:
Spark Warehouse
↓
Data Warehouse
↓
Analytics DB
↓
ML Platform
Problems:
- Storage duplication
- Data latency
- Synchronization issues
Iceberg-Based Lakehouse
Data stored once:
S3
└── Iceberg Tables
Tables:
orders
customers
products
clickstream
payments
Consumers:
- Analytics Team: Trino
- Data Science Team: Spark
- Real-Time Monitoring: Flink
- BI Dashboards: Snowflake
All access the same Iceberg tables.
Benefits:
- Single source of truth
- ACID transactions
- Time travel
- Lower storage costs
- Multi-engine access
Why Many Companies Are Moving to Iceberg
Historically:
Data Warehouse
OR
Data Lake
Today:
Lakehouse
where:
Object Storage
+
Iceberg
+
Multiple Compute Engines
This gives the low cost and scalability of a data lake while adding database-like features such as transactions, schema evolution, governance, and time travel.
For many modern data platforms, Apache Iceberg has become the preferred open table format because it cleanly separates storage, metadata, and compute. That separation lets organizations avoid vendor lock-in while sharing the same data across Spark, Flink, Trino, Snowflake, and other engines.