
earthcatalog¶
Spatially-partitioned STAC ingest pipeline backed by Apache Iceberg.
earthcatalog ingests STAC item catalogs from AWS S3 into a spatially-partitioned Apache Iceberg table. The resulting catalog can be queried with DuckDB or any Iceberg-compatible engine using efficient spatial and temporal predicate pushdown.
Project Status: Alpha — The schema, partition spec, and CLI are stable for the ITS_LIVE velocity-pair catalog. Public bucket access requires no AWS credentials.
What it does¶
earthcatalog transforms STAC items from public S3 buckets into a queryable Parquet catalog. Each STAC item is mapped to one or more DGGS cells (H3 by default), and the resulting rows are grouped by cell and year into Parquet files:
- Input: S3 Inventory manifest listing the .stac.json keys (the STAC items)
- Spatial partitioning: One row per (item × H3 cell); a point near a cell boundary maps to multiple cells
- Output: One Parquet file per (grid_partition, year) bucket
- Catalog: PyIceberg table backed by SQLite, hosted on S3
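
The fan-out and grouping described above can be sketched in plain Python. This is an illustration only, not earthcatalog's implementation: `toy_cells` is a hypothetical stand-in for the H3 lookup (it snaps a bounding box to a 10-degree grid), and the item dictionaries are made up.

```python
from collections import defaultdict
from math import floor

def toy_cells(bbox, size=10.0):
    """Hypothetical stand-in for the DGGS lookup: return every grid cell
    (size x size degrees) that the bbox (west, south, east, north) touches."""
    west, south, east, north = bbox
    cols = range(floor(west / size), floor(east / size) + 1)
    rows = range(floor(south / size), floor(north / size) + 1)
    return [f"c{col}_{row}" for col in cols for row in rows]

def bucket_items(items):
    """Emit one row per (item x cell), grouped by (grid_partition, year)."""
    buckets = defaultdict(list)
    for item in items:
        year = int(item["datetime"][:4])
        for cell in toy_cells(item["bbox"]):
            buckets[(cell, year)].append(item["id"])
    return dict(buckets)

items = [
    {"id": "a", "bbox": (1.0, 61.0, 2.0, 62.0), "datetime": "2020-03-01T00:00:00Z"},
    # Straddles the 10-degree meridian, so it lands in two cells.
    {"id": "b", "bbox": (9.5, 61.0, 10.5, 62.0), "datetime": "2020-07-15T00:00:00Z"},
    {"id": "c", "bbox": (1.0, 61.0, 2.0, 62.0), "datetime": "2021-01-10T00:00:00Z"},
]
buckets = bucket_items(items)
```

Each key of `buckets` corresponds to one output Parquet file. Item `b` appears under two keys, which is the duplication the "one row per (item × H3 cell)" rule implies.
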
Why spatial partitioning matters →
File pruning happens at read time: a DuckDB query for a point reads only the Parquet files for that point's cell and year, so no full scan is required.
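
A rough sketch of that pruning step, assuming a manifest that records the partition values of each data file (Iceberg manifests keep comparable per-file partition values). The `DataFile` class, the cell IDs, and the file paths are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    grid_partition: str
    year: int

def prune(files, cells, start_year, end_year):
    """Keep only files whose partition values can match the query."""
    wanted = set(cells)
    return [
        f.path
        for f in files
        if f.grid_partition in wanted and start_year <= f.year <= end_year
    ]

manifest = [
    DataFile("cellA/2019.parquet", "cellA", 2019),
    DataFile("cellA/2020.parquet", "cellA", 2020),
    DataFile("cellB/2020.parquet", "cellB", 2020),
    DataFile("cellC/2020.parquet", "cellC", 2020),
]

# A point query resolves to a single cell; the datetime range covers 2020 only.
selected = prune(manifest, cells=["cellA"], start_year=2020, end_year=2020)
```

Only the files left in `selected` need to be opened; the rest are skipped without reading any of their bytes.
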
Why earthcatalog¶
Traditional STAC implementations use databases (PostgreSQL with PostGIS, Cloud SQL, etc.) to serve API queries. While fine for single-item lookups, they struggle with bulk exports — retrieving 100K+ rows means streaming through a database cursor with all the serialization overhead.
earthcatalog takes a different approach — spatially partitioned GeoParquet:
- No moving parts: Parquet files sit on S3, no database to maintain or sync
- Spatial partitioning: Queries with spatial filters open only relevant files — typically 2-10 files out of 5,000
- Zero serialization overhead: DuckDB reads directly from S3; bulk exports are limited only by network bandwidth
- SQLite on S3: No infrastructure (no RDS, no Glue, no REST API) — the catalog is a single SQLite file on S3
How to use¶
1. Install¶
2. Quick start¶
import earthcatalog
from obstore.store import S3Store
from shapely.geometry import box
import duckdb
# Open — returns EarthCatalog
store = S3Store(bucket="its-live-data", region="us-west-2", skip_signature=True)
catalog = earthcatalog.open(store=store, base="s3://its-live-data/test-space/stac/catalog")
# Search — rustac-powered with Iceberg file pruning
search = catalog.search(
    intersects={"type": "Point", "coordinates": [0, 60]},
    datetime="2020-01-01/2020-12-31",
    filter={"op": "=", "args": [{"property": "platform"}, "sentinel-1"]},
    max_items=10,
)
# Results come back as pystac items
for item in search.items():
    print(item)
# Or get results as a PyArrow table
table = catalog.search_to_arrow(bbox=[-60, 60, -20, 85])
3. Ingest¶
# Daily delta (single-node)
catalog.ingest("s3://bucket/delta.parquet", mode="delta", update_hash_index=True)
# Large backfill (Dask/Coiled)
import coiled
catalog.bulk_ingest("s3://bucket/full.parquet", create_client=coiled.Client)
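
To illustrate what a hash index buys during delta ingest, here is a minimal sketch of duplicate detection over a sorted list of 128-bit digests. It is not earthcatalog's code: `HashIndex` and `filter_new` are hypothetical names, the choice of the item key as the hashed value is an assumption, and `hashlib.blake2b` with a 16-byte digest stands in for xxh3_128 so the example needs only the standard library.

```python
import bisect
import hashlib

def digest(key: str) -> bytes:
    """128-bit digest of an item key (blake2b stands in for xxh3_128)."""
    return hashlib.blake2b(key.encode("utf-8"), digest_size=16).digest()

class HashIndex:
    """Sorted list of digests; membership is a binary search, O(log n)."""

    def __init__(self, keys=()):
        self._digests = sorted({digest(k) for k in keys})

    def __contains__(self, key: str) -> bool:
        d = digest(key)
        i = bisect.bisect_left(self._digests, d)
        return i < len(self._digests) and self._digests[i] == d

    def add(self, key: str) -> None:
        d = digest(key)
        i = bisect.bisect_left(self._digests, d)
        if i == len(self._digests) or self._digests[i] != d:
            self._digests.insert(i, d)

def filter_new(index: HashIndex, delta_keys):
    """Return keys not yet ingested and record them in the index."""
    fresh = []
    for key in delta_keys:
        if key not in index:
            index.add(key)
            fresh.append(key)
    return fresh

index = HashIndex(["a.stac.json", "b.stac.json"])
fresh = filter_new(index, ["b.stac.json", "c.stac.json", "c.stac.json"])
```

Here `fresh` keeps only the key that was not already indexed, and the repeated key inside the delta is dropped as well. Fixed-width digests keep the index compact regardless of how long the original keys are.
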
Key features¶
| Feature | Detail |
|---|---|
| obstore for all S3 I/O | No credentials needed for public buckets (skip_signature=True) |
| H3 spatial partitioning | Resolution 1 hex grid: 842 global cells averaging ~0.6M km² each |
| rustac GeoParquet | geo metadata and geoarrow.wkb extension handled automatically |
| PyIceberg + SQLite | Zero-infra catalog — no Glue, no REST server |
| Iceberg partition pruning | IdentityTransform(grid_partition) + YearTransform(datetime) |
| S3 atomic lock | If-None-Match: * conditional write — no DynamoDB required |
| Incremental delta ingest | Delta parquet with new items only |
| Hash index | xxh3_128 for O(log n) duplicate detection |
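
The S3 atomic lock row relies on create-if-absent semantics: a write sent with `If-None-Match: *` succeeds only when the key does not exist yet. The sketch below shows the same pattern on a local filesystem using Python's exclusive-create mode (`"x"`). It is an analogy rather than earthcatalog's lock code, and `try_acquire` and `release` are hypothetical helpers.

```python
import os
import tempfile

def try_acquire(lock_path: str, owner: str) -> bool:
    """Create the lock file only if it does not exist (like If-None-Match: *)."""
    try:
        with open(lock_path, "x") as fh:  # "x" fails if the file already exists
            fh.write(owner)
        return True
    except FileExistsError:
        return False

def release(lock_path: str) -> None:
    """Delete the lock file so another writer can acquire it."""
    os.remove(lock_path)

lock_dir = tempfile.mkdtemp()
lock_path = os.path.join(lock_dir, "catalog.lock")

first = try_acquire(lock_path, "writer-1")   # lock is free
second = try_acquire(lock_path, "writer-2")  # lock is already held
release(lock_path)
third = try_acquire(lock_path, "writer-2")   # lock is free again
```

On S3 the losing writer receives a 412 Precondition Failed response instead of a `FileExistsError`, so exactly one writer wins without an external coordination service.
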
Built from commit 7a0a3f6 (2026-05-01)