earthcatalog

Spatially-partitioned STAC ingest pipeline backed by Apache Iceberg.

earthcatalog ingests STAC item catalogs from AWS S3 into a spatially-partitioned Apache Iceberg table. The resulting catalog can be queried with DuckDB or any Iceberg-compatible engine using efficient spatial and temporal predicate pushdown.

Project Status: Alpha — The schema, partition spec, and CLI are stable for the ITS_LIVE velocity-pair catalog. Public bucket access requires no AWS credentials.


What it does

earthcatalog transforms STAC items from public S3 buckets into a queryable Parquet catalog. Each STAC item is mapped to one or more DGGS cells (H3 by default), and the resulting rows are grouped by cell and year into Parquet files:

  • Input: An S3 Inventory manifest listing the .stac.json keys, i.e. the STAC items
  • Spatial partitioning: One row per (item × H3 cell), so an item whose footprint crosses a cell boundary maps to multiple cells
  • Output: One Parquet file per (grid_partition, year) bucket
  • Catalog: PyIceberg table backed by SQLite, hosted on S3
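The fan-out and bucketing described above can be sketched in a few lines. This is an illustration only, not earthcatalog's code: a coarse 30° latitude/longitude grid (the hypothetical `toy_cells` helper) stands in for H3 so that the example runs with the standard library alone, and the item dictionaries are made up.

```python
from collections import defaultdict

CELL_DEG = 30  # toy grid size in degrees; a stand-in for H3, not how H3 works


def toy_cells(bbox):
    """Return every toy grid cell id that a (west, south, east, north) bbox touches."""
    west, south, east, north = bbox
    cols = range(int((west + 180) // CELL_DEG), int((east + 180) // CELL_DEG) + 1)
    rows = range(int((south + 90) // CELL_DEG), int((north + 90) // CELL_DEG) + 1)
    return [f"r{r}c{c}" for r in rows for c in cols]


def partition(items):
    """Fan each item out to one row per cell, then bucket rows by (cell, year)."""
    buckets = defaultdict(list)
    for item in items:
        year = int(item["datetime"][:4])
        for cell in toy_cells(item["bbox"]):
            buckets[(cell, year)].append(item["id"])
    return dict(buckets)


# Made-up items; "b" straddles the toy cell boundary at longitude 30.
items = [
    {"id": "a", "bbox": (10, 61, 12, 62), "datetime": "2020-03-01T00:00:00Z"},
    {"id": "b", "bbox": (29, 61, 31, 62), "datetime": "2020-07-15T00:00:00Z"},
    {"id": "c", "bbox": (10, 61, 12, 62), "datetime": "2021-01-10T00:00:00Z"},
]
buckets = partition(items)
for key in sorted(buckets):
    print(key, buckets[key])
```

Item "b" lands in two buckets because its footprint touches two cells, and each (cell, year) key corresponds to one output Parquet file in the scheme described above.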

Why spatial partitioning matters →

File pruning happens at read time: a DuckDB query for a point reads only the Parquet files for that cell and year, so no full scan is required.
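As a rough sketch of what pruning means here, the following filters a file listing by partition values before any data is read. The listing, the paths, and the `prune` helper are hypothetical; in earthcatalog this selection is done by Iceberg from table metadata, not by user code.

```python
from datetime import date

# Hypothetical file listing: one entry per (grid_partition, year) bucket.
FILES = [
    {"path": "cell=A/year=2019/part.parquet", "grid_partition": "A", "year": 2019},
    {"path": "cell=A/year=2020/part.parquet", "grid_partition": "A", "year": 2020},
    {"path": "cell=B/year=2020/part.parquet", "grid_partition": "B", "year": 2020},
    {"path": "cell=B/year=2021/part.parquet", "grid_partition": "B", "year": 2021},
]


def prune(files, cells, start, end):
    """Keep only files whose partition values can match the query."""
    years = range(start.year, end.year + 1)
    return [
        f["path"]
        for f in files
        if f["grid_partition"] in cells and f["year"] in years
    ]


# A point query resolves to a single cell; the date range covers one year.
selected = prune(FILES, cells={"A"}, start=date(2020, 1, 1), end=date(2020, 12, 31))
print(selected)
```

Only one of the four files survives, and the remaining row-level filtering happens inside that file.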


Why earthcatalog

Traditional STAC implementations use databases (PostgreSQL with PostGIS, Cloud SQL, etc.) to serve API queries. While fine for single-item lookups, they struggle with bulk exports — retrieving 100K+ rows means streaming through a database cursor with all the serialization overhead.

earthcatalog takes a different approach — spatially partitioned GeoParquet:

  • No moving parts: Parquet files sit on S3, no database to maintain or sync
  • Spatial partitioning: Queries with spatial filters open only relevant files — typically 2-10 files out of 5,000
  • Zero serialization overhead: DuckDB reads directly from S3; bulk exports are limited only by network bandwidth
  • SQLite on S3: No infrastructure (no RDS, no Glue, no REST API) — the catalog is a single SQLite file on S3

How to use

1. Install

mamba env create -f environment.yml
mamba activate itslive-ingest
pip install -e .

2. Quick start

import earthcatalog
from obstore.store import S3Store
from shapely.geometry import box
import duckdb

# Open — returns EarthCatalog
store = S3Store(bucket="its-live-data", region="us-west-2", skip_signature=True)
catalog = earthcatalog.open(store=store, base="s3://its-live-data/test-space/stac/catalog")

# Search — rustac-powered with Iceberg file pruning
search = catalog.search(
    intersects={"type": "Point", "coordinates": [0, 60]},
    datetime="2020-01-01/2020-12-31",
    filter={"op": "=", "args": [{"property": "platform"}, "sentinel-1"]},
    max_items=10,
)

# Iterate over the results; each one is a pystac item
for item in search.items():
    print(item)

# Or get results as a PyArrow table
table = catalog.search_to_arrow(bbox=[-60, 60, -20, 85])
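The `filter` argument in the search above has the shape of a CQL2-JSON expression. Assuming the search accepts general CQL2-JSON (the source only shows a single equality), two small helpers can keep larger filters readable. The helper names `eq` and `all_of` and the property values are illustrative, not part of earthcatalog.

```python
import json


def eq(prop, value):
    """CQL2-JSON equality clause, same shape as the filter in the quick start."""
    return {"op": "=", "args": [{"property": prop}, value]}


def all_of(*clauses):
    """Combine clauses with a CQL2-JSON logical AND."""
    return {"op": "and", "args": list(clauses)}


# Illustrative values; "platform" and "constellation" are STAC common metadata fields.
cql2 = all_of(eq("platform", "sentinel-1"), eq("constellation", "sentinel-1"))
print(json.dumps(cql2, indent=2))
```

The resulting dictionary could then be passed as `filter=cql2`, provided the search backend supports the `and` operator.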

3. Ingest

# Daily delta (single-node)
catalog.ingest("s3://bucket/delta.parquet", mode="delta", update_hash_index=True)

# Large backfill (Dask/Coiled)
import coiled

catalog.bulk_ingest("s3://bucket/full.parquet", create_client=coiled.Client)
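To give an idea of what a hash index buys during delta ingest, here is a minimal sketch of duplicate detection with logarithmic lookups. It is not earthcatalog's implementation: it assumes item ids are what gets hashed, uses BLAKE2b truncated to 128 bits as a standard-library stand-in for xxh3_128, and keeps the index as a sorted in-memory list searched with `bisect`.

```python
import bisect
import hashlib


def digest(item_id: str) -> bytes:
    """128-bit digest of an item id (BLAKE2b here; a stand-in for xxh3_128)."""
    return hashlib.blake2b(item_id.encode("utf-8"), digest_size=16).digest()


class SortedHashIndex:
    """Sorted list of digests; membership is a binary search, so O(log n)."""

    def __init__(self, ids=()):
        self._hashes = sorted({digest(i) for i in ids})

    def __contains__(self, item_id):
        h = digest(item_id)
        pos = bisect.bisect_left(self._hashes, h)
        return pos < len(self._hashes) and self._hashes[pos] == h

    def add(self, item_id):
        h = digest(item_id)
        pos = bisect.bisect_left(self._hashes, h)
        if pos == len(self._hashes) or self._hashes[pos] != h:
            self._hashes.insert(pos, h)  # keeps the list sorted


def new_items(index, delta_ids):
    """Return ids from a delta that are not yet indexed, and index them."""
    fresh = []
    for item_id in delta_ids:
        if item_id not in index:
            index.add(item_id)
            fresh.append(item_id)
    return fresh


index = SortedHashIndex(["item-001", "item-002"])
fresh = new_items(index, ["item-002", "item-003", "item-003"])
print(fresh)
```

Only the genuinely new id passes through, including when the delta itself contains repeats.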

Key features

Feature | Detail
obstore for all S3 I/O | No credentials needed for public buckets (skip_signature=True)
H3 spatial partitioning | Resolution 1 hex grid: 842 global cells averaging ~600,000 km² each
rustac | GeoParquet geo metadata and geoarrow.wkb extension handled automatically
PyIceberg + SQLite | Zero-infra catalog: no Glue, no REST server
Iceberg partition pruning | IdentityTransform(grid_partition) + YearTransform(datetime)
S3 atomic lock | If-None-Match: * conditional write, so no DynamoDB is required
Incremental delta ingest | Delta parquet with new items only
Hash index | xxh3_128 for O(log n) duplicate detection
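The S3 atomic lock relies on create-if-absent semantics: a PUT sent with `If-None-Match: *` succeeds only if the key does not exist yet, and S3 rejects it with 412 Precondition Failed otherwise. The sketch below models that behaviour with an in-memory stand-in so it runs without network access. `FakeBucket`, `try_acquire`, and the lock key are hypothetical names, and the atomicity that S3 provides server-side is not reproduced here.

```python
class PreconditionFailed(Exception):
    """Raised when the key already exists (S3 answers 412 Precondition Failed)."""


class FakeBucket:
    """In-memory stand-in for an S3 bucket; only models If-None-Match: *."""

    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, body):
        # Equivalent of a PUT with the header "If-None-Match: *".
        if key in self._objects:
            raise PreconditionFailed(key)
        self._objects[key] = body

    def delete(self, key):
        self._objects.pop(key, None)


def try_acquire(bucket, key, owner):
    """Return True if this caller created the lock object, False if it is held."""
    try:
        bucket.put_if_absent(key, owner.encode("utf-8"))
        return True
    except PreconditionFailed:
        return False


bucket = FakeBucket()
LOCK_KEY = "catalog/.lock"  # hypothetical key name

first = try_acquire(bucket, LOCK_KEY, "writer-1")
second = try_acquire(bucket, LOCK_KEY, "writer-2")
bucket.delete(LOCK_KEY)  # writer-1 releases the lock
third = try_acquire(bucket, LOCK_KEY, "writer-2")
print(first, second, third)
```

Because the existence check and the write are a single request on the S3 side, two writers cannot both believe they hold the lock, which is what removes the need for an external coordinator such as DynamoDB.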


Built from commit 7a0a3f6 (2026-05-01)