Szehon Ho, October 4 2022
Apache Iceberg’s Best Secret
A Guide to Metadata Tables
Apache Iceberg Project
• Developed to address Hive shortcomings
• Apache Incubator 2018-2020
• 295 contributors from many companies
• Collaboration with Spark/Flink/Trino communities
• Wide adoption in 2022
What is Apache Iceberg?
In its own Words
What is Apache Iceberg?
• Hive: a directory contains all files in tables and partitions
• Iceberg: follow a tree of “Metadata Files” that track the data files of tables and partitions
“Table Format” = Layout of Files in Table
Metadata Files
Unlocking many new features: only some shown here
Category | Hive Behavior | Iceberg Metadata Feature
Atomicity on Object Store (S3) | Inconsistent listing, non-atomic | Data file listings in metadata file
Time Travel / Rollback | Not supported | Snapshot file
Isolation Level | Need explicit directory lock | Snapshot info on each data file; check only conflicts
Performance (Predicate Pruning) | Partition (directory) level filter only | 1. Partition stats at multiple layers; 2. Min/Max column stats
“Open” Table Format
• Metadata Files are the basis for all of Iceberg’s advanced feature set
• Metadata Tables: expose all Metadata Files in a user-friendly way
• Interface: exposed via SQL as system tables
• Performance: queries are much faster than data queries
• Full Transparency: users and systems can easily self-explore Metadata Tables to learn how the
system works, and how to improve it
• Most tough problems can be debugged (at least partially) via the Iceberg metadata tables
• Decide how to optimize the table pre-emptively
• Build monitoring, auditing, and data quality checks beyond Iceberg
My First Metadata Table
Partitions Table
Partitions table = “db.table.partitions”
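A quick first look might be (a sketch, assuming a Spark SQL session and an Iceberg table named db.table; `record_count` and `file_count` are columns of the partitions metadata table):

```sql
-- One row per partition, with aggregate stats maintained by Iceberg
SELECT partition, record_count, file_count
FROM db.table.partitions;
```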
Metadata Tables
The Full List
• history
• metadata_logs
• snapshots
• manifests
• all_manifests
• entries
• all_entries
• files
• data_files
• delete_files
• all_files
• all_data_files
• all_delete_files
Partitions is just an aggregate view of the files table
Iceberg Metadata Tables:
Hierarchical Structure
• Catalog (atomic pointer to Root
Metadata)
• Metadata File (Root Metadata)
• Snapshot Files (Manifest List)
• Manifest Files
• Data Files
Metadata Files Review
Metadata Tables
Mapping to Metadata Files
Metadata Table | Queries | About
metadata_logs | Last Metadata File | Metadata File
snapshots | Last Metadata File | Snapshot Files (Manifest Lists)
manifests | Snapshot Files (Manifest Lists) | Manifests
Files/Entries (see next slide) | Manifests | Data Files
SHOW TBLPROPERTIES
• Each Metadata Table has information about all or a subset of one layer of “Metadata File”
• The table for a Metadata File doesn’t read that layer’s metadata file, but rather the layer above it
Files/Entries Tables
Various Views of “Data Files” for User Convenience
• Partitions table is just an aggregate view of the Files table
• Files/Entries: equivalent. Manifest File Entry = metadata about a data file
• Files = the “files” part of a Manifest Entry; only physical attributes of a file
• Entries = the complete row, including snapshot information of the file
• All_ tables: all_manifests, all_files, all_entries
• all_x = all Metadata Files of layer X
• x = Metadata Files of layer X that are pointed to by the current snapshot
• Data/Delete: data_files, delete_files
• Delete Files are a V2 concept for Merge-on-Read
• The “files” table selects both types of files
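To see the difference in practice, a sketch (column names follow the Iceberg files/entries schemas; db.table is a placeholder):

```sql
-- files: physical attributes of each data file only
SELECT file_path, file_size_in_bytes, record_count
FROM db.table.files;

-- entries: the full manifest entry, including snapshot information
SELECT status, snapshot_id, data_file.file_path
FROM db.table.entries;
```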
FAQ: Partition Information
• How many files per partition?
• Total size of each partition?
• Last update time per partition?

SELECT partition, file_count
FROM db.table.partitions;

partition | file_count
{"date":"2022-10-04","hour":5} | 5

SELECT partition,
  sum(file_size_in_bytes) AS partition_size
FROM db.table.files
GROUP BY partition;

partition | partition_size
{"date":"2022-10-04","hour":5} | 937

SELECT e.data_file.partition,
  MAX(s.committed_at) AS last_modified_time
FROM db.table.snapshots s
JOIN db.table.entries e
  ON s.snapshot_id = e.snapshot_id
GROUP BY e.data_file.partition;

partition | last_modified_time
{"date":"2022-10-04","hour":5} | 2022-09-07 01:30:52.371
Closer Look at Snapshots
• A snapshot points to the list of files belonging to the table at a point in time
• A snapshot is also an operation on files (adding, removing)
• The entries table tracks which snapshot operated on the file
• entries.snapshot_id
• entries.status: 0 = EXISTING (aka rewrite), 1 = ADDED, 2 = DELETED
Two Meanings vis-a-vis Files
FAQ: Snapshot Questions
• What files are added by snapshot 8339536322928208593?

SELECT data_file.file_path
FROM db.table.entries
WHERE snapshot_id = 8339536322928208593
  AND status = 1;

• What files are referenced by snapshot 8339536322928208593?
• Use time travel (SQL syntax)

SELECT file_path
FROM db.table.files
VERSION AS OF 8339536322928208593;
FAQs: How to Keep Iceberg Maintained
• Expire Snapshots (Cleanup)
• RewriteManifests (Metadata Files Optimization)
• RewriteFiles (Data Files Optimization)
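In Spark SQL with the Iceberg extensions enabled, these map to stored procedures; a sketch (the catalog name `catalog` and the retention timestamp are assumptions, not from the deck):

```sql
-- Cleanup: expire snapshots older than a cutoff, removing files no longer reachable
CALL catalog.system.expire_snapshots(table => 'db.table', older_than => TIMESTAMP '2022-09-01 00:00:00');

-- Metadata files optimization: compact and re-cluster manifest files
CALL catalog.system.rewrite_manifests('db.table');

-- Data files optimization: compact small data files
CALL catalog.system.rewrite_data_files(table => 'db.table');
```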
FAQ: Disk Usage and Expire Snapshots
• User Question: I am hitting HDFS quotas. I ran compact files / deleted partitions; why do I still hit the quota limit?
• Answer: expire snapshots
• Metadata Tables:
• all_manifests, all_files will show you everything reachable, even from previous snapshots
• manifests, files will show everything reachable from the current snapshot
• Useful Queries for Dashboards:

select sum(file_size_in_bytes) from db.table.all_files; // all reachable data files size
select sum(length) from db.table.all_manifests; // all reachable manifest files size
select sum(file_size_in_bytes) from db.table.files; // current snapshot data files size
select sum(length) from db.table.manifests; // current snapshot manifest files size
FAQ: Disk Usage
committed_at | snapshot_id | summary
2022-08-24 14:01:43.191 | 4077543616265127980 | {"added-data-files":"1", "added-files-size":"904", "added-records":"1", "changed-partition-count":"1", "spark.app.id":"local-1661374186213", "total-data-files":"23", "total-delete-files":"0", "total-equality-deletes":"0", "total-files-size":"20792", "total-position-deletes":"0", "total-records":"23"}
Snapshots Table Alternative
SELECT committed_at, snapshot_id, summary FROM db.table.snapshots;
FAQ: When to Optimize Metadata
• Improve query planning time and metadata table query time by reducing the overhead of reading metadata files

// How many manifests?
SELECT count(*)
FROM db.table.manifests;

count(1)
200

// Which manifests?
SELECT path,
  added_data_files_count +
  existing_data_files_count +
  deleted_data_files_count AS files
FROM db.table.manifests;

path | files
s3://my_bucket/db/table/… | 2
s3://my_bucket/db/table/… | 4

// Are manifests sorted?
SELECT path, partition_summaries
FROM db.table.manifests;

path | partition_summaries
s3://my_bucket/db/table/… | {"lower_bound":"2022-10-04", "upper_bound":"2022-10-04"}
FAQ: When to Optimize Data
• Improve query time by minimizing file-read overhead
• Sort to improve the selectivity of files, and the compression ratio of files

// Are data files sorted?
// Note: column coming soon
SELECT file_path,
  readable_metrics.emp.upper_bound,
  readable_metrics.emp.lower_bound
FROM db.table.files;

file_path | col.lower_bound | col.upper_bound
s3://my_bucket/db/table/… | Abigail Adams | Mike Monroe
s3://my_bucket/db/table/… | Nancy Nomura | Zachary Zunich

// Too many small data files?
SELECT partition, count(*) AS file_count,
  sum(file_size_in_bytes)/count(*) AS avg_size
FROM db.table.files
GROUP BY partition;

partition | file_count | avg_size
{"date":"2022-10-04","hour":5} | 100 | 5120000
Beyond Iceberg
Use Case: Ingest Monitoring
• Measuring a system’s data completeness and latency is typically hard, but becomes doable in Iceberg
• Incoming dataset from Flink:
• (data string, event_time timestamp) partitioned by hour(event_time)

// Data latency, with a custom UDF for calculating the time difference.
// Will be easier with the readable_metrics column
SELECT max(diff(entries.data_file.lower_bounds[1], hour(snapshots.committed_at))) AS max_latency
FROM db.table.entries JOIN db.table.snapshots
  ON entries.snapshot_id = snapshots.snapshot_id
GROUP BY entries.data_file.partition;

// Data completeness
SELECT record_count AS received, partition
FROM db.table.partitions;
Beyond Iceberg
Use Case: Data Quality Alerts
• Iceberg keeps interesting metrics per data file for every column:
• column_sizes
• value_counts
• null_values
• nan_values
• lower_bounds
• upper_bounds
• Can create alerts for partitions with nan_values

SELECT partition, sum(to_int(nan_values[0])) AS nan_values
FROM db.table.files
GROUP BY partition;
Future
Stay Tuned for Puffin Files
• Puffin Files introduced into the Iceberg spec
• https://github.com/apache/iceberg/blob/master/format/puffin-spec.md
• For (TBD)
• Bloom Filters
• Datasketches
• Apply to a data file or a set of data files (TBD)
• Can be used for data quality percentiles
Questions?
Thank you for attending!
