Querying the Catalog¶
Five search methods, from fastest to most flexible:
| Method | Returns | Speed | When to use |
|---|---|---|---|
| duck_search() | pandas.DataFrame (all columns) | ~2× | Fastest results, any query |
| search_uris() | DataFrame(id, uri) | fastest for URLs | Bulk download URL extraction |
| search() | lazy EarthCatalogItemSearch → pystac.Item | 1× | Need pystac objects, lazy iteration |
| search_to_arrow() | pyarrow.Table | 1× | Arrow-native workflows |
| search_files() | list[str] (file paths) | n/a | Custom DuckDB SQL |
Fastest — duck_search()¶
Uses DuckDB internally for parallel Parquet I/O. ~2× faster than
the other methods across all query types. Returns a pandas.DataFrame
with flat columns — no pystac conversion overhead.
import earthcatalog as ec
import cql2
from obstore.store import S3Store
store = S3Store(bucket='its-live-data', region='us-west-2', skip_signature=True)
catalog = ec.open(store=store, base='s3://its-live-data/test-space/stac/catalog')
df = catalog.duck_search(
    intersects={"type": "Point", "coordinates": [0, 60]},
    datetime="2020-01-01/2020-12-31",
    filter=cql2.parse_text('platform = "sentinel-1"').to_json(),
    max_items=100,
)
# df is a pandas.DataFrame; iterate or convert as needed
for _, row in df.iterrows():
    print(row["id"], row["platform"])
max_items note¶
DuckDB's SQL LIMIT triggers a 7× slower query plan for multi-file
scans. duck_search() avoids this by fetching all matching rows and
truncating in Python. For max_items ≤ 100,000 the overhead is
negligible; for larger result sets use search_files() + hand-written
DuckDB SQL (see BYO section).
Lazy iteration — search()¶
Returns a lazy EarthCatalogItemSearch that yields pystac.Item
objects. Comparable to search_to_arrow() in speed. Best for
interactive use with max_items=100 where early exit avoids wasted
work (sequential per-file processing stops as soon as enough items
are found).
results = catalog.search(
    intersects={"type": "Point", "coordinates": [0, 60]},
    datetime="2020-01-01/2020-12-31",
    filter=cql2.parse_text('platform = "sentinel-1"').to_json(),
    max_items=100,
)
for item in results.items():
    print(item.id, item.properties["platform"])
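The early-exit behaviour described above can be illustrated with a plain Python generator. This is a minimal sketch, not earthcatalog's implementation: `read_file` and `lazy_items` are hypothetical names, and the in-memory `FILES` dict stands in for Parquet files on S3.

```python
from itertools import islice

# Hypothetical stand-in for Parquet files on S3: path -> rows.
FILES = {
    "part-0.parquet": [{"id": f"a{i}"} for i in range(80)],
    "part-1.parquet": [{"id": f"b{i}"} for i in range(80)],
    "part-2.parquet": [{"id": f"c{i}"} for i in range(80)],
}

files_read = []  # records which files were actually opened


def read_file(path):
    """Pretend to download and scan one file."""
    files_read.append(path)
    return FILES[path]


def lazy_items(paths):
    """Yield rows file by file; a file is read only when the consumer needs it."""
    for path in paths:
        yield from read_file(path)


# Taking 100 items touches only the first two files; the third is never read.
first_100 = list(islice(lazy_items(FILES), 100))
print(len(first_100), files_read)
```

Because the generator is consumed lazily, stopping after 100 items means the remaining files are never opened, which is the effect the max_items=100 recommendation relies on.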
CQL2 filters¶
Filters use cql2.parse_text() for a natural SQL-like syntax:
import cql2
cql2.parse_text('platform = "sentinel-2"')
cql2.parse_text('percent_valid_pixels > 50')
cql2.parse_text('platform IN ("sentinel-2", "landsat-8", "landsat-9")')
cql2.parse_text('platform = "landsat-8" AND percent_valid_pixels > 70')
Temporal filtering uses the top-level datetime kwarg (STAC-standard).
Do not reference datetime inside CQL2 — rustac generates broken
SQL when datetime appears in a CQL2 expression:
# ✅ Correct
results = catalog.search(
    datetime="2020-01-01/2020-12-31",
    filter=cql2.parse_text('percent_valid_pixels >= 80').to_json(),
)
Raw CQL2 JSON dicts are also accepted:
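As an illustration, the text filter `percent_valid_pixels >= 80` corresponds to the CQL2 JSON structure below. The dicts follow the CQL2 JSON encoding (`op` plus `args`). Passing such a dict directly as `filter=` is assumed from the sentence above rather than verified, so the call is left commented out.

```python
import json

# CQL2 JSON equivalent of the text filter: percent_valid_pixels >= 80
cql2_filter = {
    "op": ">=",
    "args": [{"property": "percent_valid_pixels"}, 80],
}

# Combined condition: platform equals landsat-8 AND percent_valid_pixels > 70
combined_filter = {
    "op": "and",
    "args": [
        {"op": "=", "args": [{"property": "platform"}, "landsat-8"]},
        {"op": ">", "args": [{"property": "percent_valid_pixels"}, 70]},
    ],
}

print(json.dumps(cql2_filter))

# Assumed usage, mirroring the search() call shown earlier:
# results = catalog.search(
#     datetime="2020-01-01/2020-12-31",
#     filter=cql2_filter,
# )
```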
Pagination and metadata¶
# pages() — one batch per file
for i, page in enumerate(results.pages()):
    print(f"Page {i}: {len(page)} items")
# matched() — estimated upper bound from Iceberg manifest (no Parquet I/O)
print(f"Up to {results.matched():,} matching rows")
# stats() — file count and data volume from manifests
s = results.stats()
print(f"{s['files']} files, ~{s['rows_upper_bound']:,} rows")
PyArrow — search_to_arrow()¶
Returns a pyarrow.Table. Useful for zero-copy interchange with
other Arrow-native tools. Same speed as search().
Performance¶
See search_performance.md for detailed
benchmarks across all methods against the production catalog.
| Method | Narrow query (2 files) | Wide query (32 files) | Wide + 100k limit |
|---|---|---|---|
| duck_search() | ~2s | ~3s (11×) | ~28s (2×) |
| search() | ~2s | ~36s | ~55s |
| search_to_arrow() | ~2s | ~34s | ~47s |
How it works¶
- Iceberg partition pruning resolves the spatial/temporal query to a list of file paths (zero I/O on non-matching files).
- DuckDB or rustac then reads those files and applies CQL2 filters.
For narrow queries (few files), all methods are bottlenecked by the
S3 download + Parquet scan time (~1s per file). For wide queries with
sparse data spread across many files, duck_search() benefits
from DuckDB's internal parallel I/O while rustac reads files sequentially.
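The difference between sequential and parallel file reads can be sketched with the standard library. This only illustrates the I/O pattern and is not how DuckDB or rustac are implemented: `fetch_file` is a hypothetical stand-in that sleeps briefly to mimic per-file latency, and the file names are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor

PATHS = [f"part-{i}.parquet" for i in range(8)]  # hypothetical file list


def fetch_file(path, latency=0.05):
    """Stand-in for an S3 download + Parquet scan of one file."""
    time.sleep(latency)
    return path, 1  # (path, rows matched) placeholder result


def read_sequential(paths):
    """Read files one after another; total time is the sum of all latencies."""
    return [fetch_file(p) for p in paths]


def read_parallel(paths, workers=8):
    """Overlap the waiting time of several reads using a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_file, paths))  # map preserves input order


t0 = time.perf_counter()
seq = read_sequential(PATHS)
t1 = time.perf_counter()
par = read_parallel(PATHS)
t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.2f}s, parallel: {t2 - t1:.2f}s")
```

With many files and latency-bound reads, the parallel variant's wall time approaches that of a single file, which is consistent with the gap seen on wide queries.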
Performance Tips¶
- Use duck_search() for fastest results
- Use search() for lazy iteration with max_items=100; early exit avoids wasted work
- Prefer temporal filters: the year partition is heavily pruned
- Spatial + temporal = fastest: both partitions are pruned before any file is opened
- Pair highly selective CQL2 filters with max_items for consistent latency
BYO Query Engine: DuckDB¶
If you need raw SQL access (aggregations, joins, arbitrary expressions),
use search_files() to get the file list, then query with DuckDB
directly. Remember to configure anonymous S3 access:
import duckdb
import earthcatalog as ec
from obstore.store import S3Store
from shapely.geometry import Polygon
store = S3Store(bucket='its-live-data', region='us-west-2', skip_signature=True)
catalog = ec.open(store=store, base='s3://its-live-data/test-space/stac/catalog')
greenland = Polygon([(-60, 60), (-20, 60), (-20, 85), (-60, 85), (-60, 60)])
paths = catalog.search_files(greenland, start_datetime='2020-01-01', end_datetime='2020-12-31')
con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
# Anonymous S3 — DuckDB inherits AWS creds from env; explicitly clear them
con.execute("SET s3_access_key_id='';")
con.execute("SET s3_secret_access_key='';")
con.execute("SET s3_session_token='';")
df = con.execute(f"""
    SELECT id, platform, datetime,
           ST_XMin(geometry) AS xmin, ST_YMin(geometry) AS ymin
    FROM read_parquet({paths})
    WHERE platform = 'sentinel-2'
      AND ST_YMax(geometry) >= 60
    ORDER BY datetime
    LIMIT 100
""").df()
Time series for a single location¶
from shapely.geometry import Point
point = Point(-149.5, 63.5)
paths = catalog.search_files(point, start_datetime='2018-01-01', end_datetime='2023-12-31')
df = con.execute(f"""
    SELECT DATE_TRUNC('month', datetime) AS month, COUNT(*) AS scenes
    FROM read_parquet({paths})
    WHERE ST_Intersects(geometry, ST_GeomFromText('{point.wkt}'))
    GROUP BY month
    ORDER BY month
""").df()
Spatial query options¶
# ST_Intersects
ST_Intersects(geometry, ST_GeomFromText('POINT(-133.99 58.74)'))
# Bounding box filter (faster, no geometry parsing)
SELECT id, platform, datetime
FROM read_parquet({paths})
WHERE ST_XMin(geometry) >= -140
AND ST_XMax(geometry) <= -130
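The `{paths}` interpolation used in the queries above works because the `repr` of a Python list of strings happens to be a valid DuckDB list literal. A slightly more defensive variant builds the literal explicitly and escapes single quotes. This is a sketch: `sql_list_literal` is a hypothetical helper, not part of earthcatalog or DuckDB, and the paths are made up.

```python
def sql_list_literal(paths):
    """Render a list of paths as a SQL list literal, e.g. ['a', 'b']."""
    # Double any single quote so the path cannot terminate the SQL string early.
    quoted = ("'" + p.replace("'", "''") + "'" for p in paths)
    return "[" + ", ".join(quoted) + "]"


paths = [
    "s3://example-bucket/catalog/part-0.parquet",  # made-up paths
    "s3://example-bucket/catalog/part-1.parquet",
]

query = f"""
    SELECT id, platform, datetime
    FROM read_parquet({sql_list_literal(paths)})
    LIMIT 100
"""
print(query)
```

The resulting string can be passed to `con.execute(...)` in the same way as the queries above.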
Bulk URL extraction¶
Fastest way to get data download URLs for thousands of items matching
a spatial/temporal/property filter. Uses search_files() + a targeted
DuckDB query that reads only the columns needed.
import duckdb, json
import earthcatalog as ec
from obstore.store import S3Store
from shapely.geometry import box
store = S3Store(bucket='its-live-data', region='us-west-2', skip_signature=True)
catalog = ec.open(store=store, base='s3://its-live-data/test-space/stac/catalog')
# 1. Iceberg pruning — fast, zero I/O
greenland = box(-60, 60, -20, 85)
paths = catalog.search_files(greenland, start_datetime='2020-01-01', end_datetime='2020-12-31')
# 2. DuckDB — only reads assets + geometry + filter columns
con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")
con.execute("SET s3_access_key_id='';")
con.execute("SET s3_secret_access_key='';")
con.execute("SET s3_session_token='';")
df = con.execute(f"""
    SELECT id, assets
    FROM read_parquet({paths})
    WHERE percent_valid_pixels > 50
      AND ST_Intersects(geometry, ST_GeomFromText('{greenland.wkt}'))
""").df()
# 3. Extract data URLs from the JSON assets column
urls = []
for _, row in df.iterrows():
    assets = json.loads(row["assets"])
    href = assets.get("data", {}).get("href")
    if href:
        urls.append(href)
print(f"{len(urls)} data URLs")
# e.g. 'https://its-live-data.s3.amazonaws.com/velocity_image_pair/...nc'
This is faster than duck_search() for URL extraction because it selects
only 2 columns (id, assets), plus the columns used for filtering, instead
of all 30+. For large result sets the savings are significant.
If you prefer the simpler API at the cost of reading all columns:
df = catalog.duck_search(...)
urls = [json.loads(a).get("data", {}).get("href") for a in df["assets"] if a]
Or use the dedicated method that does all of the above:
df = catalog.search_uris(
    intersects={"type": "Point", "coordinates": [-45, 70]},
    datetime="2020-01-01/2020-12-31",
    max_items=100,
)
# df has columns: id, uri (data download URL extracted from assets)
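For the DataFrame-based routes, the asset parsing step can be factored into a small helper that tolerates missing or malformed entries. This is a sketch under the assumption, taken from the examples above, that `assets` holds a JSON string with a `data.href` entry. `extract_data_urls` is a hypothetical name and the sample values are made up.

```python
import json


def extract_data_urls(assets_column, asset_key="data"):
    """Return the href of `asset_key` for each parseable assets JSON string."""
    urls = []
    for raw in assets_column:
        if not raw:
            continue  # skip None and empty strings
        try:
            assets = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip values that are not valid JSON
        if not isinstance(assets, dict):
            continue  # skip JSON that is not an object
        href = (assets.get(asset_key) or {}).get("href")
        if href:
            urls.append(href)
    return urls


sample = [
    '{"data": {"href": "https://example.com/a.nc"}}',
    '{"thumbnail": {"href": "https://example.com/a.png"}}',
    None,
    "not json",
]
print(extract_data_urls(sample))
```

It could be applied as `extract_data_urls(df["assets"])` to the result of either query route.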
API Reference¶
EarthCatalog.info()¶
Returns CatalogInfo with grid metadata:
catalog.search_files(geometry, start_datetime, end_datetime)¶
Prunes files by H3 cell + year partition: