Szehon Ho, October 4 2022
Apache Iceberg’s Best Secret
A Guide to Metadata Tables
Apache Iceberg Project
• Developed to address Hive shortcomings
• Apache Incubator 2018-2020
• 295 contributors from many companies
• Collaboration with Spark/Flink/Trino communities
• Wide adoption in 2022
What is Apache Iceberg?
In its own Words
What is Apache Iceberg?
• Hive: a directory contains all files in tables and partitions
• Iceberg: follow a tree of “Metadata Files” that track the data files of tables and partitions
“Table Format” = Layout of Files in Table
Metadata Files
Unlocking many new features: only some shown here
Category | Hive Behavior | Iceberg Metadata Feature
Atomicity on Object Store (S3) | Inconsistent listing, non-atomic | Data file listings in metadata file
Time Travel / Rollback | Not supported | Snapshot file
Isolation Level | Need explicit directory lock | Snapshot info on each data file; check only conflicts
Performance (Predicate Pruning) | Partition (directory) level filter only | 1. Partition stats at multiple layers; 2. Min/Max column stats
“Open” Table Format
• Metadata Files are the basis for all of Iceberg’s advanced feature set
• Metadata Tables: expose all Metadata Files in a user-friendly way
• Interface: exposed via SQL as system tables
• Performance: queries are much faster than data queries
• Full Transparency: users and systems can easily self-explore Metadata Tables to learn how the
system works, and how to improve it
• Most tough problems can be debugged (at least partially) via the Iceberg metadata tables
• Decide how to optimize the table pre-emptively
• Build monitoring, auditing, and data quality checks beyond Iceberg
My First Metadata Table
Partitions Table
Partitions table = “db.table.partitions”
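A quick first look might be (a sketch, assuming a Spark SQL session and an Iceberg table named db.table; `record_count` and `file_count` are columns of the partitions metadata table):

```sql
-- One row per partition, with aggregate stats maintained by Iceberg
SELECT partition, record_count, file_count
FROM db.table.partitions;
```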
Metadata Tables
The Full List
• history
• metadata_logs
• snapshots
• manifests
• all_manifests
• entries
• all_entries
• files
• data_files
• delete_files
• all_files
• all_data_files
• all_delete_files
Partitions is just an aggregate view of the files table
Iceberg Metadata Tables:
Hierarchical Structure
• Catalog (atomic pointer to Root
Metadata)
• Metadata File (Root Metadata)
• Snapshot Files (Manifest List)
• Manifest Files
• Data Files
Metadata Files Review
Metadata Tables
Mapping to Metadata Files
Metadata Table | Queries | About
metadata_logs | Last Metadata File | Metadata File
snapshots | Last Metadata File | Snapshot Files (Manifest Lists)
manifests | Snapshot Files (Manifest Lists) | Manifests
Files/Entries (see next slide) | Manifests | Data Files
SHOW TBLPROPERTIES
• Each Metadata Table has information about all or a subset of one layer of “Metadata File”
• The table for a Metadata File doesn’t read that layer’s metadata file, but rather the layer above it
Files/Entries Tables
Various Views of “Data Files” for User Convenience
• Partitions table is just an aggregate view of the Files table
• Files/Entries: equivalent. Manifest File Entry = metadata about a data file
• Files = the “files” part of a Manifest Entry; only physical attributes of a file
• Entries = the complete row, including snapshot information of the file
• All_ tables: all_manifests, all_files, all_entries
• all_x = all Metadata Files of layer X
• x = Metadata Files of layer X that are pointed to by the current snapshot
• Data/Delete: data_files, delete_files
• Delete Files are a V2 concept for Merge-on-Read
• The “files” table selects both types of files
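To see the difference in practice, a sketch (column names follow the Iceberg files/entries schemas; db.table is a placeholder):

```sql
-- files: physical attributes of each data file only
SELECT file_path, file_size_in_bytes, record_count
FROM db.table.files;

-- entries: the full manifest entry, including snapshot information
SELECT status, snapshot_id, data_file.file_path
FROM db.table.entries;
```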
FAQ: Partition Information
• How many files per partition?
• Total size of each partition?
• Last update time per partition?

SELECT partition, file_count
FROM db.table.partitions;

partition | file_count
{"date":"2022-10-04","hour":5} | 5

SELECT partition,
  sum(file_size_in_bytes) AS partition_size
FROM db.table.files
GROUP BY partition;

partition | partition_size
{"date":"2022-10-04","hour":5} | 937

SELECT e.data_file.partition,
  MAX(s.committed_at) AS last_modified_time
FROM db.table.snapshots s
JOIN db.table.entries e
  ON s.snapshot_id = e.snapshot_id
GROUP BY e.data_file.partition;

partition | last_modified_time
{"date":"2022-10-04","hour":5} | 2022-09-07 01:30:52.371
Closer Look at Snapshots
• A snapshot points to the list of files belonging to the table at a point in time
• A snapshot is also an operation on files (adding, removing)
• The entries table tracks which snapshot operated on the file
• entries.snapshot_id
• entries.status: 0 = EXISTING (aka rewrite), 1 = ADDED, 2 = DELETED
Two Meanings vis-a-vis Files
FAQ: Snapshot Questions
• What files are added by snapshot 8339536322928208593?

SELECT data_file.file_path
FROM db.table.entries
WHERE snapshot_id = 8339536322928208593
  AND status = 1;

• What files are referenced by snapshot 8339536322928208593?
• Use time travel (SQL syntax)

SELECT file_path
FROM db.table.files
VERSION AS OF 8339536322928208593;
FAQs: How to Keep Iceberg Maintained
• Expire Snapshots (Cleanup)
• RewriteManifests (Metadata Files Optimization)
• RewriteFiles (Data Files Optimization)
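In Spark SQL with the Iceberg extensions enabled, these map to stored procedures; a sketch (the catalog name `catalog` and the retention timestamp are assumptions, not from the deck):

```sql
-- Cleanup: expire snapshots older than a cutoff, removing files no longer reachable
CALL catalog.system.expire_snapshots(table => 'db.table', older_than => TIMESTAMP '2022-09-01 00:00:00');

-- Metadata files optimization: compact and re-cluster manifest files
CALL catalog.system.rewrite_manifests('db.table');

-- Data files optimization: compact small data files
CALL catalog.system.rewrite_data_files(table => 'db.table');
```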
FAQ: Disk Usage and Expire Snapshots
• User Question: I am hitting HDFS quotas. I ran compact files / deleted partitions; why do I still hit the quota limit?
• Answer: expire snapshots
• Metadata Tables:
• all_manifests, all_files will show you everything reachable, even from previous snapshots
• manifests, files will show everything reachable from the current snapshot
• Useful Queries for Dashboards:

select sum(file_size_in_bytes) from db.table.all_files; // all reachable data files size
select sum(length) from db.table.all_manifests; // all reachable manifest files size
select sum(file_size_in_bytes) from db.table.files; // current snapshot data files size
select sum(length) from db.table.manifests; // current snapshot manifest files size
FAQ: Disk Usage
committed_at | snapshot_id | summary
2022-08-24 14:01:43.191 | 4077543616265127980 | {"added-data-files":"1", "added-files-size":"904", "added-records":"1", "changed-partition-count":"1", "spark.app.id":"local-1661374186213", "total-data-files":"23", "total-delete-files":"0", "total-equality-deletes":"0", "total-files-size":"20792", "total-position-deletes":"0", "total-records":"23"}
Snapshots Table Alternative
SELECT committed_at, snapshot_id, summary FROM db.table.snapshots;
FAQ: When to Optimize Metadata
• Improve query planning time and metadata table query time by reducing the overhead of reading metadata files

// How many manifests?
SELECT count(*)
FROM db.table.manifests;

count(1)
200

// Which manifests?
SELECT path,
  added_data_files_count +
  existing_data_files_count +
  deleted_data_files_count AS files
FROM db.table.manifests;

path | files
s3://my_bucket/db/table/… | 2
s3://my_bucket/db/table/… | 4

// Are manifests sorted?
SELECT path, partition_summaries
FROM db.table.manifests;

path | partition_summaries
s3://my_bucket/db/table/… | {"lower_bound":"2022-10-04", "upper_bound":"2022-10-04"}
FAQ: When to Optimize Data
• Improve query time by minimizing file-read overhead
• Sort to improve the selectivity of files, and the compression ratio of files

// Are data files sorted?
// Note: column coming soon
SELECT file_path,
  readable_metrics.emp.upper_bound,
  readable_metrics.emp.lower_bound
FROM db.table.files;

file_path | col.lower_bound | col.upper_bound
s3://my_bucket/db/table/… | Abigail Adams | Mike Monroe
s3://my_bucket/db/table/… | Nancy Nomura | Zachary Zunich

// Too many small data files?
SELECT partition, count(*) AS file_count,
  sum(file_size_in_bytes)/count(*) AS avg_size
FROM db.table.files
GROUP BY partition;

partition | file_count | avg_size
{"date":"2022-10-04","hour":5} | 100 | 5120000
Beyond Iceberg
Use Case: Ingest Monitoring
• Measuring a system’s data completeness and latency is typically hard, but becomes doable in Iceberg
• Incoming dataset from Flink:
• (data string, event_time timestamp) partitioned by hour(event_time)

// Data latency, with a custom UDF for calculating the time difference.
// Will be easier with the readable_metrics column
SELECT max(diff(entries.data_file.lower_bounds[1], hour(snapshots.committed_at))) AS max_latency
FROM db.table.entries JOIN db.table.snapshots
  ON entries.snapshot_id = snapshots.snapshot_id
GROUP BY entries.data_file.partition;

// Data completeness
SELECT record_count AS received, partition
FROM db.table.partitions;
Beyond Iceberg
Use Case: Data Quality Alerts
• Iceberg keeps interesting metrics per data file for every column:
• column_sizes
• value_counts
• null_values
• nan_values
• lower_bounds
• upper_bounds
• Can create alerts for partitions with nan_values

SELECT partition, sum(to_int(nan_values[0])) AS nan_values
FROM db.table.files
GROUP BY partition;
Future
Stay Tuned for Puffin Files
• Puffin Files introduced into the Iceberg spec
• https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
• For (TBD)
• Bloom Filters
• Datasketches
• Apply to a data file or a set of data files (TBD)
• Can be used for data quality percentiles
Questions?
Thank you for attending!
