Search Performance

Comparison of the search methods against the production catalog (63.4M rows, 5,024 files, H3 resolution 1). All times are from isolated single-query runs against the real S3 catalog.

Search methods

Method	Engine	Returns	I/O
`search()`	rustac (DuckDB per file)	lazy `EarthCatalogItemSearch` → `pystac.Item`	Sequential
`duck_search(pystac)`	DuckDB `read_parquet`	eager `list[pystac.Item]`	Parallel
`duck_search()`	DuckDB `read_parquet`	eager `pandas.DataFrame`	Parallel
`search_to_arrow()`	rustac → Arrow	eager `pyarrow.Table`	Sequential

Dense region: small polygon, northern Greenland

This is the densest cell cluster: queries hit many files, with abundant matching items across 34+ years.

Query	search()	duck_search(pystac)	duck_search()	search_to_arrow()
1980–2017, no filter, max=100k	55.5s	57.0s	26.7s (2.1×)	47.2s
1980–2017, pvp>=1, max=100k	53.9s	56.6s	28.8s (1.9×)	57.2s
1980–2026, no filter, max=100k	53.1s	58.7s	28.0s (1.9×)	46.2s
1980–2026, pvp>=1, max=100k	—	—	—	—

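duck_search() is ~2× faster than search() on every measured query. Converting the results to pystac Items costs about 30s for 100k items (the gap between duck_search(pystac) and duck_search()), which erases DuckDB's parallel-read advantage and makes duck_search(pystac) comparable to search().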

The pvp>=1 filter has negligible effect — almost all items satisfy it.
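The speedup figures in the table can be reproduced directly from the timings. The sketch below recomputes them and also estimates the pystac conversion cost as the gap between the two DuckDB-backed columns. It assumes that the second timing column is the duck_search(pystac) variant and that every query returned the full 100k items; the dictionary keys and helper names are illustrative and are not part of the library.

```python
# Timings in seconds, copied from the dense-region table.
# Assumption: the second timing column is duck_search() with pystac output.
DENSE_TIMINGS = {
    "1980-2017, no filter": {"search": 55.5, "duck_pystac": 57.0, "duck": 26.7},
    "1980-2017, pvp>=1": {"search": 53.9, "duck_pystac": 56.6, "duck": 28.8},
    "1980-2026, no filter": {"search": 53.1, "duck_pystac": 58.7, "duck": 28.0},
}
MAX_ITEMS = 100_000  # assumed number of items actually returned per query


def speedup(baseline_s: float, candidate_s: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def conversion_overhead_s(with_pystac_s: float, dataframe_s: float) -> float:
    """Extra time spent turning rows into pystac.Item objects."""
    return with_pystac_s - dataframe_s


for label, t in DENSE_TIMINGS.items():
    ratio = speedup(t["search"], t["duck"])
    overhead = conversion_overhead_s(t["duck_pystac"], t["duck"])
    per_item_ms = overhead / MAX_ITEMS * 1000
    print(f"{label}: {ratio:.1f}x, overhead {overhead:.1f}s ({per_item_ms:.2f} ms/item)")
```

Under those assumptions the conversion overhead works out to roughly 28–31s per query, or about 0.3 ms per item.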

Sparse region: small polygon, Alaska

After pruning, the query covers 126 files (~1.4M estimated rows). The pvp<50 filter is selective (~5–10% of items satisfy it), so every method must scan more rows before reaching the 100k limit.

Query	search()	duck_search(pystac)	duck_search()	search_to_arrow()
1980–2026, no filter, max=100k	62.0s	60.0s	30.3s (2.0×)	54.6s
1980–2026, pvp<50, max=100k	108.2s	93.6s	62.5s (1.7×)	99.1s
1980–2026, date_dt>=50, max=100k	T/O	T/O	T/O	T/O

duck_search() remains ~2× faster without a filter. The selective pvp<50 filter narrows the gap to 1.7× because DuckDB must still scan all rows regardless of parallelism (the filter is applied after the read).
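To see why a selective filter increases scan volume, the following back-of-the-envelope sketch estimates how many rows must be read before 100k rows pass the filter. It assumes matching rows are spread uniformly through the data, which is a simplification, and the function name is hypothetical.

```python
def rows_to_scan(max_items: int, selectivity_pct: int) -> int:
    """Rows read before max_items rows pass a filter keeping selectivity_pct percent."""
    if not 0 < selectivity_pct <= 100:
        raise ValueError("selectivity_pct must be in (0, 100]")
    # Integer ceiling division avoids floating-point rounding.
    return -((-max_items * 100) // selectivity_pct)


EST_ROWS = 1_400_000  # estimated rows in the 126 pruned files
MAX_ITEMS = 100_000

for pct in (100, 10, 5):
    needed = rows_to_scan(MAX_ITEMS, pct)
    scanned = min(needed, EST_ROWS)  # cannot scan more rows than exist
    print(f"{pct:>3}% selective: need {needed:,} rows, scan {scanned:,}")
```

At 10% selectivity the limit is reached after about 1M of the 1.4M rows; at 5% the estimate exceeds the available rows, so the whole pruned set would be scanned without ever reaching the limit.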

The date_dt>=50 query timed out (>10 min) with all methods: highly selective filters over a wide temporal range can be expensive regardless of approach. For such cases, narrow the temporal range or keep max_items small.
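One way to narrow the temporal range is to split a wide query into per-year windows and run them one at a time. The helper below is a minimal sketch of that idea using only the standard library; `year_windows` is a hypothetical name, and how each window is passed to a search method (parameter name, date format) is left open because it depends on the library's API.

```python
from datetime import date


def year_windows(first_year: int, last_year: int) -> list[tuple[date, date]]:
    """Split an inclusive year range into one (start, end) window per year."""
    if first_year > last_year:
        raise ValueError("first_year must not be after last_year")
    return [
        (date(year, 1, 1), date(year, 12, 31))
        for year in range(first_year, last_year + 1)
    ]


windows = year_windows(1980, 2026)
print(len(windows), "windows, first:", windows[0], "last:", windows[-1])
```

Each window can then be queried separately, stopping as soon as enough items have been collected.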

Sparse query: single point, 1980–2015

Few files per cell (32 files), each with sparse matching items.

Query	search()	duck_search(pystac)	duck_search()	search_to_arrow()
Point Greenland, 1980–2015, pvp>=1, 642 items	36.5s	4.1s (8.9×)	3.3s (11×)	33.6s

DuckDB's advantage is maximized when items are spread thinly across many files — it reads them in parallel while rustac reads them one at a time.
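The effect of sequential versus concurrent reads can be illustrated with a toy model. The sketch below replaces each file read with a short sleep and compares a one-at-a-time loop against a thread pool. It is not how DuckDB or rustac schedule their I/O; the function names, the latency, and the worker count are arbitrary choices for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fetch(file_id: int, latency_s: float) -> int:
    """Stand-in for downloading and scanning one Parquet file."""
    time.sleep(latency_s)
    return file_id


def read_sequential(n_files: int, latency_s: float) -> float:
    """Read files one at a time and return the elapsed seconds."""
    start = time.perf_counter()
    for file_id in range(n_files):
        fetch(file_id, latency_s)
    return time.perf_counter() - start


def read_parallel(n_files: int, latency_s: float, workers: int) -> float:
    """Read files concurrently with a thread pool and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda file_id: fetch(file_id, latency_s), range(n_files)))
    return time.perf_counter() - start


# 32 files as in the point query; latency is scaled down so the demo runs quickly.
seq = read_sequential(32, 0.02)
par = read_parallel(32, 0.02, workers=8)
print(f"sequential {seq:.2f}s, parallel {par:.2f}s, ratio {seq / par:.1f}x")
```

In this model sequential time grows linearly with the number of files, while parallel time grows with the number of files divided by the worker count, so the ratio widens as a query touches more files.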

Narrow queries: year-targeted

Iceberg prunes to 2 files regardless of geometry size. All methods are bounded by the S3 download + scan time for one or two files (~1s/file).

Query	search()	duck_search(pystac)	duck_search()	search_to_arrow()
Point Greenland, year=2020, max=100	2.7s	2.8s	2.1s	2.1s
Point Greenland, year=2020, pvp>=80, max=100	2.2s	2.2s	2.0s	2.2s
Bbox Greenland, year=2020, max=100	2.5s	1.7s	1.8s	1.9s
Polygon Iceland, year=2020, max=100	2.2s	1.5s	1.4s	1.6s

When to use each

Scenario	Recommended
Interactive exploration, max_items=100	`search()` — same speed, lazy
Wide query, need pystac Items	`search()` (lazy) or `duck_search()` + manual conversion
Wide query, fastest total time	`duck_search()` — 2× faster
PyArrow table output	`search_to_arrow()`
Raw SQL / aggregations / joins	`search_files()` + DuckDB directly

Key insight

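duck_search() is roughly 2× faster than search() on wide queries (1.7–2.1× in the tables above) thanks to DuckDB's parallel I/O; on year-targeted queries that prune to one or two files, all methods finish in about 1.5–3s. The pystac conversion overhead (~30s for 100k items) erases this advantage, making duck_search(pystac) comparable to search().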

DuckDB's parallelism gives the biggest win when items are spread across many files (sparse queries across wide temporal ranges). For a point query over 34 years with sparse matching items, the gain reaches 8.9–11× because DuckDB reads all files concurrently while rustac reads them one at a time.

Selective filters (pvp<50, date_dt>=50) reduce the gain because DuckDB must scan all rows regardless of parallelism — the filter is applied after the read. For extreme cases (highly selective + wide temporal range), the query may time out with all methods.