Warehouse Compaction¶
The incremental pipeline writes one new part file per (cell, year) bucket
per run. After N incremental runs a bucket has N part files, each
potentially containing duplicate id rows if the same STAC item appeared in
more than one inventory delta.
Compaction merges all part files in each bucket into a single, deduplicated, sorted file and rebuilds the Iceberg catalog.
When to run¶
- After each incremental run if strict deduplication is required.
- Weekly or monthly for most use cases — the incremental pipeline can run without compaction; compaction just cleans up the accumulation.
- After a large backfill to collapse all per-chunk part files into one file
per bucket (this is done automatically inside
run_backfillvia Level 1).
What compaction does¶
For each (grid_partition, year) bucket with ≥ threshold part files:
- Orphan sweep — delete any
.parquetfiles in the prefix that are not in the known part-file list (cleanup from interrupted previous runs). - Download all part files.
- Merge into a single Arrow table (schema metadata preserved from the first file).
- Deduplicate on
id: sort by(id ASC, updated DESC), keep the first occurrence of each id (i.e. the most recently updated version). - Sort by
(platform, datetime)for optimal Parquet predicate pushdown. - Write one consolidated GeoParquet file to the warehouse store.
- Delete all input part files.
After all buckets are compacted, the Iceberg catalog is rebuilt from scratch
(drop table → recreate → add_files(all_surviving_paths)), producing exactly
one clean snapshot.
Dedup scope
Deduplication is within each (cell, year) file, not globally unique
across the warehouse. Because of the fan-out, one id legitimately
appears in multiple cell files (one per intersecting H3 cell).
CLI usage¶
# Dry run — see what would be compacted without touching any files
python -m earthcatalog.maintenance.compact \
--warehouse /tmp/warehouse \
--catalog /tmp/catalog.db \
--dry-run
# Compact all buckets with ≥ 2 part files (default)
python -m earthcatalog.maintenance.compact \
--warehouse /tmp/warehouse \
--catalog /tmp/catalog.db
# Custom threshold (only compact buckets with ≥ 5 files)
python -m earthcatalog.maintenance.compact \
--warehouse /tmp/warehouse \
--catalog /tmp/catalog.db \
--threshold 5
# With S3 lock (recommended for production)
python -m earthcatalog.maintenance.compact \
--warehouse /tmp/warehouse \
--catalog /tmp/catalog.db \
--use-lock
Python API¶
from earthcatalog.maintenance.compact import compact_warehouse
summary = compact_warehouse(
warehouse_path="/tmp/warehouse",
catalog_path="/tmp/catalog.db",
threshold=2, # compact any bucket with ≥ 2 files
use_lock=False, # set True in production
dry_run=False,
)
print(summary)
# {
# "buckets_scanned": 150,
# "buckets_compacted": 45,
# "files_before": 195,
# "files_after": 150,
# }
Return value¶
compact_warehouse() returns a dict:
| Key | Type | Description |
|---|---|---|
buckets_scanned |
int | Total (cell, year) buckets found in the warehouse |
buckets_compacted |
int | Buckets that met the threshold and were compacted |
files_before |
int | Total part files before compaction |
files_after |
int | Total part files after compaction |
Scale considerations¶
At H3 resolution 1, compacting the most recent year involves ~100–200 buckets and runs sequentially in a few minutes. This is fast enough for a weekly maintenance job.
A full warehouse compaction (all 44 years × ~150 cells ≈ ~6,600 buckets) takes longer sequentially. For that scale, use the Level-1 compaction inside the backfill pipeline which runs all buckets in parallel with Dask.
Iceberg catalog rebuild¶
After physical compaction, compact_warehouse performs a full catalog rebuild:
- Drop the existing Iceberg table (removes all stale manifest entries).
- Recreate the table with the same schema and partition spec.
table.add_files(all_surviving_paths)— register every surviving Parquet file in a single new snapshot.
The result is exactly 1 Iceberg snapshot regardless of how many incremental runs produced the input files.
Brief table-not-found window
Between dropping and recreating the Iceberg table, the catalog has no table.
Any concurrent reader will see a NoSuchTableError. Use --use-lock in
production to prevent concurrent access during this window.