Configuration

earthcatalog can be configured with a YAML file passed via --config or with individual CLI flags. A config file is recommended for reproducibility.


YAML config file

# config/h3_r1.yaml

catalog:
  db_path:   /tmp/earthcatalog.db          # local path to the SQLite catalog
  warehouse: /tmp/earthcatalog_warehouse   # local path to the Parquet warehouse

grid:
  type:       h3
  resolution: 1    # H3 resolution (0–15); 1 = production default

ingest:
  chunk_size:       500    # STAC items per ThreadPoolExecutor batch
  max_workers:      16     # concurrent S3 fetch threads
  batch_add_files:  false  # true = one Iceberg snapshot for entire run (backfill)

Pass it with:

earthcatalog incremental --config config/h3_r1.yaml --inventory /tmp/delta.csv

Config sections

catalog

| Key | Type | Required | Default | Description |
|---|---|---|---|---|
| db_path | string | yes | | Local filesystem path to catalog.db (SQLite) |
| warehouse | string | yes | | Local filesystem path to the Parquet warehouse root |

grid

| Key | Type | Required | Default | Description |
|---|---|---|---|---|
| type | h3 or geojson | yes | | Partitioner type |
| resolution | int | no | 1 | H3 resolution (0–15). Only used when type = h3 |
| boundaries_path | string | no | | Path to GeoJSON file. Only used when type = geojson |
| id_field | string | no | "id" | Feature property to use as partition key for GeoJSON |

ingest

| Key | Type | Required | Default | Description |
|---|---|---|---|---|
| chunk_size | int | no | 500 | Items fetched per batch |
| max_workers | int | no | 8 | Parallel S3 fetch threads |
| batch_add_files | bool | no | false | Accumulate all file paths and register in one Iceberg snapshot at the end of the run. Recommended for backfill. |
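
The required keys and defaults listed for the three sections can be summarized in a short Python sketch. The helper name `apply_defaults` and its error messages are hypothetical and not part of earthcatalog. It assumes the YAML file has already been parsed into a plain dict, and it only encodes what the section tables state.

```python
import copy

# Defaults copied from the section tables; keys without an entry have no documented default.
DEFAULTS = {
    "grid": {"resolution": 1, "id_field": "id"},
    "ingest": {"chunk_size": 500, "max_workers": 8, "batch_add_files": False},
}
# Keys the tables mark as required.
REQUIRED = {"catalog": ["db_path", "warehouse"], "grid": ["type"]}


def apply_defaults(config):
    """Return a copy of `config` with documented defaults filled in (hypothetical helper)."""
    merged = copy.deepcopy(config)
    for section, keys in REQUIRED.items():
        for key in keys:
            if key not in merged.get(section, {}):
                raise ValueError(f"missing required key: {section}.{key}")
    for section, defaults in DEFAULTS.items():
        target = merged.setdefault(section, {})
        for key, value in defaults.items():
            target.setdefault(key, value)
    grid = merged["grid"]
    if grid["type"] not in ("h3", "geojson"):
        raise ValueError("grid.type must be 'h3' or 'geojson'")
    if grid["type"] == "h3" and not 0 <= grid["resolution"] <= 15:
        raise ValueError("grid.resolution must be between 0 and 15")
    return merged


config = apply_defaults({
    "catalog": {"db_path": "/tmp/earthcatalog.db", "warehouse": "/tmp/earthcatalog_warehouse"},
    "grid": {"type": "h3"},
})
print(config["grid"]["resolution"], config["ingest"]["max_workers"])  # 1 8
```

Working on a deep copy keeps the caller's dict untouched, so the same parsed file can be reused across runs.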

H3 resolution guide

| Resolution | Avg. cell area | Global cells | Recommendation |
|---|---|---|---|
| 0 | ~4,250,000 km² | 122 | Very coarse; continental scale |
| 1 | ~607,220 km² | 842 | Production default |
| 2 | ~86,750 km² | 5,882 | Sub-regional |
| 3 | ~12,390 km² | 41,162 | Dense urban datasets |
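
The "Global cells" figures follow the standard H3 cell-count formula, 2 + 120 × 7^r for resolution r, so the count can be reproduced for any resolution. A minimal sketch (the function name is illustrative and not part of earthcatalog):

```python
def h3_cell_count(resolution):
    """Total number of H3 cells (hexagons plus 12 pentagons) at a given resolution."""
    if not 0 <= resolution <= 15:
        raise ValueError("H3 resolution must be between 0 and 15")
    return 2 + 120 * 7 ** resolution


for res in range(4):
    print(res, h3_cell_count(res))
# 0 122
# 1 842
# 2 5882
# 3 41162
```

Each step up in resolution multiplies the number of possible partitions by roughly seven, which is worth weighing before moving away from the default.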

Test resolution

Integration tests use resolution 2 for faster H3 calculations. The production ITS_LIVE catalog uses resolution 1.


GeoJSON partitioner config

grid:
  type:            geojson
  boundaries_path: /path/to/regions.geojson
  id_field:        region_name

Each feature in the GeoJSON file becomes a partition cell, identified by the value of the id_field property.
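
Because the partition key comes from a feature property, it can help to check a boundaries file before pointing earthcatalog at it. The sketch below is a pre-flight check one might run, not earthcatalog's own validation. The function name is hypothetical, and rejecting missing or duplicate keys is an assumption about what makes a usable partition key.

```python
import json


def partition_keys(feature_collection, id_field="id"):
    """Collect the partition key of every feature, rejecting missing or duplicate keys."""
    keys = []
    seen = set()
    for index, feature in enumerate(feature_collection["features"]):
        properties = feature.get("properties") or {}
        if id_field not in properties:
            raise KeyError(f"feature {index} has no '{id_field}' property")
        key = properties[id_field]
        if key in seen:
            raise ValueError(f"duplicate partition key: {key!r}")
        seen.add(key)
        keys.append(key)
    return keys


# Illustrative two-feature collection; geometries omitted for brevity.
regions = json.loads("""
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"region_name": "alaska"}, "geometry": null},
  {"type": "Feature", "properties": {"region_name": "greenland"}, "geometry": null}
]}
""")
print(partition_keys(regions, id_field="region_name"))  # ['alaska', 'greenland']
```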


Inventory file format

The inventory (and delta) file tells earthcatalog which STAC items to ingest. It must contain at minimum two columns — bucket and key — pointing to .stac.json files on S3.

Parquet

bucket: string    # S3 bucket (e.g. "its-live-data")
key:    string    # S3 object key ending in ".stac.json"

Optional column for since= filtering:

last_modified_date: timestamp  # used when --since is passed

CSV

Same columns as the Parquet format. A header row is required when using --since:

bucket,key,last_modified_date
its-live-data,path/to/item.stac.json,2026-04-28T01:00:00.000Z
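
To illustrate how a --since cutoff relates to the last_modified_date column, here is a minimal sketch using only the Python standard library. It is not earthcatalog's implementation: the function names are hypothetical, the second sample row is invented for the example, and treating the cutoff as inclusive and as UTC is an assumption.

```python
import csv
import io
from datetime import datetime, timezone


def parse_timestamp(text):
    """Parse an ISO 8601 date or timestamp; a trailing 'Z' is treated as UTC."""
    parsed = datetime.fromisoformat(text.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without a timezone (e.g. a bare date) are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def rows_since(csv_text, since):
    """Yield (bucket, key) pairs whose last_modified_date is at or after `since`."""
    cutoff = parse_timestamp(since)
    for row in csv.DictReader(io.StringIO(csv_text)):
        if parse_timestamp(row["last_modified_date"]) >= cutoff:
            yield row["bucket"], row["key"]


sample = """bucket,key,last_modified_date
its-live-data,path/to/item.stac.json,2026-04-28T01:00:00.000Z
its-live-data,path/to/older.stac.json,2026-03-15T12:00:00.000Z
"""
print(list(rows_since(sample, "2026-04-01")))
# [('its-live-data', 'path/to/item.stac.json')]
```

csv.DictReader relies on the header row to name the columns, which is why the header matters once last_modified_date is consulted.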

Delta files

Delta files use the same schema as the full inventory; only the rows differ (new or modified items only). Both ec.ingest() and ec.bulk_ingest() read any supported format.

Manifest (AWS S3 Inventory)

The inventory can also be a manifest.json that references multiple Parquet data files in a private destination bucket. earthcatalog reads credentials from environment variables or ~/.aws/credentials.
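
As a rough illustration of what such a manifest contains, the sketch below lists the data files it references. It assumes the standard AWS S3 Inventory manifest layout, with a destinationBucket ARN and a files array whose entries carry a key. The function name, bucket name, and object keys are made up for the example, and this is not how earthcatalog itself reads the manifest.

```python
import json


def manifest_data_files(manifest):
    """Return s3:// URIs for the data files listed in an S3 Inventory manifest."""
    # destinationBucket is an ARN such as "arn:aws:s3:::my-bucket"; keep only the bucket name.
    bucket = manifest["destinationBucket"].split(":::")[-1]
    return [f"s3://{bucket}/{entry['key']}" for entry in manifest["files"]]


# Hypothetical manifest content, trimmed to the fields used above.
manifest = json.loads("""
{
  "sourceBucket": "its-live-data",
  "destinationBucket": "arn:aws:s3:::example-inventory-bucket",
  "fileFormat": "Parquet",
  "files": [
    {"key": "inventory/data/part-0000.parquet", "size": 1024},
    {"key": "inventory/data/part-0001.parquet", "size": 2048}
  ]
}
""")
for uri in manifest_data_files(manifest):
    print(uri)
# s3://example-inventory-bucket/inventory/data/part-0000.parquet
# s3://example-inventory-bucket/inventory/data/part-0001.parquet
```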

Environment variables (S3 store)

earthcatalog reads AWS credentials from standard environment variables when writing to or reading from a private S3 bucket:

export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=us-west-2

For the public ITS_LIVE bucket no credentials are needed (skip_signature=True is set automatically).


CLI flags reference

earthcatalog incremental

--config       Path to YAML config file (optional)
--inventory    Path to S3 Inventory file (CSV, CSV.gz, Parquet, or manifest.json)
--catalog      Path to catalog.db (overrides config)
--warehouse    Path to warehouse root (overrides config)
--since        ISO date — skip items not modified since this date (e.g. 2026-04-01)
--limit        Maximum number of STAC items to ingest (useful for smoke tests)
--chunk-size   Items per batch (default: 500)
--max-workers  Fetch threads (default: 8)
--no-lock      Skip the S3 lock (for local development)
--resolution   H3 resolution (default: 1)
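
The "overrides config" behavior of --catalog and --warehouse amounts to a precedence order: CLI flag first, then the config file, then any built-in default. The sketch below shows one way to express that with argparse. It is an assumption about the behavior, not earthcatalog's actual argument handling, and `resolve_option` is a hypothetical helper.

```python
import argparse


def resolve_option(cli_value, config_value, default=None):
    """Pick the first value that is set: CLI flag, then config file, then default."""
    for candidate in (cli_value, config_value):
        if candidate is not None:
            return candidate
    return default


parser = argparse.ArgumentParser(prog="earthcatalog incremental")
# default=None lets us tell "flag not passed" apart from an explicit value.
parser.add_argument("--catalog", default=None)
parser.add_argument("--warehouse", default=None)

args = parser.parse_args(["--catalog", "/data/catalog.db"])
# Values as if read from the catalog section of the YAML file.
config_catalog = {"db_path": "/tmp/earthcatalog.db", "warehouse": "/tmp/earthcatalog_warehouse"}

db_path = resolve_option(args.catalog, config_catalog.get("db_path"))
warehouse = resolve_option(args.warehouse, config_catalog.get("warehouse"))
print(db_path)    # /data/catalog.db
print(warehouse)  # /tmp/earthcatalog_warehouse
```

Comparing against None instead of truthiness means an explicit falsy value such as 0 or an empty string still counts as set.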

earthcatalog backfill

--inventory          S3 Inventory file path
--catalog            catalog.db path
--warehouse          Warehouse root path
--scheduler          Dask scheduler: synchronous | local | coiled (default: synchronous)
--workers            Dask local-cluster worker count (default: 4)
--limit              Cap on STAC items to ingest
--chunk-size         Items per Dask task (default: 500)
--coiled-n-workers   Coiled cluster size
--coiled-software    Coiled software environment name
--coiled-region      AWS region for Coiled cluster