TDOpenTableFormat

Open Table Format = the metadata layer that makes those files behave like database tables.

What is a Data Lake?

A data lake is a storage system that keeps large amounts of raw and processed data in inexpensive object storage such as:

Amazon S3
Google Cloud Storage
Azure Blob Storage

Instead of storing data inside a traditional database, the data is stored as files:

s3://company-data/
├── sales/
│   ├── 2026-01.parquet
│   ├── 2026-02.parquet
│   └── 2026-03.parquet
├── customers/
└── products/

Common file formats:

Parquet
ORC
Avro
JSON
CSV

Why Data Lakes Became Popular

Advantages:

Cheap storage
Virtually unlimited scale
Supports structured and unstructured data
Decouples storage from compute

Multiple engines can read the same data:

Apache Spark
Trino
Presto
Apache Flink
DuckDB

The Problem with Traditional Data Lakes

Imagine a table:

sales

stored as:

sales/
├── part-001.parquet
├── part-002.parquet
├── part-003.parquet

Questions become difficult:

Which files belong to the latest version?
What happens if a write fails halfway?
How do you perform transactions?
How do you update or delete records?
How do you support schema evolution?
How do multiple writers avoid corrupting data?

A plain data lake doesn’t answer these questions.

This is where Open Table Formats come in.

What is an Open Table Format (OTF)?

An Open Table Format adds a metadata layer on top of files stored in a data lake.

Examples:

Apache Iceberg
Delta Lake
Apache Hudi

Think of it as:

Object Storage
      +
Table Metadata
      +
Transactions
      +
Schema Management

Result:

Data Lake + OTF = Lakehouse

How Iceberg Works

Suppose we have a table:

sales

The actual files are:

sales/
├── data-1.parquet
├── data-2.parquet
├── data-3.parquet

Iceberg adds metadata files:

sales/
├── metadata/
│   ├── v1.metadata.json
│   ├── v2.metadata.json
│   └── manifest files
├── data/
│   ├── data-1.parquet
│   ├── data-2.parquet
│   └── data-3.parquet

Instead of scanning directories, query engines read Iceberg metadata.

The metadata tells them:

Which files belong to the table
Which snapshot is current
Which partitions exist
Column statistics
Schema versions

What is an Iceberg Catalog?

The catalog is the entry point to Iceberg tables.

Think of it as the table registry.

Without a catalog:

Where is table sales?

With a catalog:

analytics.sales

The catalog knows:

analytics.sales
   ↓
metadata location
   ↓
s3://warehouse/sales/metadata/v25.json

Popular Iceberg catalogs:

Apache Hive Metastore
Project Nessie
AWS Glue Data Catalog
Apache Polaris
Snowflake Open Catalog

Why OTFs are Important

1. ACID Transactions

Traditional data lake:

Write file
Write file
Write file
Crash

Table becomes inconsistent.

Iceberg:

Create new snapshot
Commit metadata atomically

Either all changes appear or none do.

2. Time Travel

You can query historical snapshots.

Example:

SELECT *
FROM sales
VERSION AS OF 12345;

Use cases:

Auditing
Debugging
Reproducing reports

3. Schema Evolution

Traditional lakes often break when schemas change.

Iceberg allows:

ALTER TABLE sales
ADD COLUMN discount;

without rewriting all historical files.

4. Hidden Partitioning

Old-style partitioning:

sales/year=2026/month=05/day=20/

Users must know partition columns.

Iceberg manages partitions internally.

Users simply query:

SELECT *
FROM sales
WHERE order_date='2026-05-20';

5. Multi-Engine Interoperability

The same Iceberg table can be queried by:

Teradata
Trino
Flink
DuckDB
Snowflake
Databricks
Spark

All engines read the same metadata.

No data duplication.

Architecture Example

A modern architecture might look like:

                +----------------+
                | Kafka          |
                +--------+-------+
                         |
                         v

                +----------------+
                | Flink / Spark  |
                +--------+-------+
                         |
                         v

                +----------------+
                | Iceberg Table  |
                +--------+-------+
                         |
              -----------------------
              |          |          |
              v          v          v

            Trino    Snowflake   DuckDB

The storage layer is shared while compute engines are independent.

Real-World Use Case

E-Commerce Analytics Platform

Imagine a company similar to Amazon.

Data Sources

Website clicks
Orders
Payments
Product catalog
Customer events

Data volume:

10 TB/day
Billions of events

Traditional Approach

Data copied multiple times:

Spark Warehouse
      ↓
Data Warehouse
      ↓
Analytics DB
      ↓
ML Platform

Problems:

Storage duplication
Data latency
Synchronization issues

Iceberg-Based Lakehouse

Data stored once:

S3
  └── Iceberg Tables

Tables:

orders
customers
products
clickstream
payments

Consumers:

Analytics Team

Trino

Data Science Team

Spark

Real-Time Monitoring

Flink

BI Dashboards

Snowflake

All access the same Iceberg tables.

Benefits:

Single source of truth
ACID transactions
Time travel
Lower storage costs
Multi-engine access

Why Many Companies Are Moving to Iceberg

Historically:

Data Warehouse
    OR
Data Lake

Today:

Lakehouse

where:

Object Storage
      +
Iceberg
      +
Multiple Compute Engines

This gives the low cost and scalability of a data lake while adding database-like features such as transactions, schema evolution, governance, and time travel.

For many modern data platforms, Apache Iceberg has become the preferred open table format because it separates storage, metadata, and compute cleanly, allowing organizations to avoid vendor lock-in while sharing the same data across Spark, Flink, Trino, Snowflake, and other engines.