Summary
~58,000 watershed boundary polygons (plus upstream river networks) have been delineated from MERIT-Hydro for the gauges already supported by RivRetrieve. This issue proposes making them accessible directly through the library.
What exists
hydra-shed (a pure-Rust reimplementation of delineator by Matt Heberger) was used to delineate catchments for every gauge station in the RivRetrieve catalog. The ~58k basins correspond to all gauges whose coordinates have more than 3 decimal places of precision and whose outlets fall inside the MERIT-Hydro basin domain. ~565 outlets could not be snapped to the MERIT-Hydro network and are excluded.
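The precision rule above can be read as a simple text check on the catalog coordinates. The sketch below is one possible interpretation, not the filter that was actually applied: it assumes coordinates are available as finite decimal strings, counts trailing zeros as digits, and requires both latitude and longitude to pass. The function names `decimal_places` and `is_precise_enough` are hypothetical.

```python
from decimal import Decimal


def decimal_places(value: str) -> int:
    """Count the digits after the decimal point of a coordinate given as text."""
    exponent = Decimal(value).as_tuple().exponent
    # A negative exponent is the number of fractional digits; integers give 0.
    return max(0, -exponent)


def is_precise_enough(lat: str, lon: str, min_places: int = 4) -> bool:
    """Assumed rule: 'more than 3 decimal places' means at least 4 on both axes."""
    return decimal_places(lat) >= min_places and decimal_places(lon) >= min_places
```

For scale, 0.001° of latitude is roughly 111 m, so gauges reported with three or fewer decimals have an outlet position that is uncertain by about that much or more.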
Coverage
| Country | Gauges delineated |
|---|---|
| USA | 23,841 |
| Canada | 7,630 |
| Australia | 6,241 |
| France | 5,327 |
| Norway | 4,541 |
| Brazil | 4,579 |
| Poland | 1,301 |
| South Africa | 1,285 |
| Czech Republic | 825 |
| Japan | 816 |
| Slovenia | 713 |
| Chile | 529 |
| Germany (Berlin) | 189 |
| Lithuania | 95 |
| Portugal | 73 |
| Total | ~58,000 |
GeoPackage schema (two layers per file)
| Layer | Key columns | Geometry |
|---|---|---|
| watersheds | gauge_id, gauge_name, gauge_lat, gauge_lon, snap_lat, snap_lon, snap_dist, area, basin, country, low_res | POLYGON |
| rivers | gauge_id, comid, uparea, strahler, shreve | LINESTRING |
Total data size is ~31 GB across 15 GeoPackage files (USA alone is ~10 GB), so per-country lazy downloading is essential.
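Because a GeoPackage is an SQLite database, the schema above can be inspected without any geospatial dependency. The sketch below is only an illustration of that idea, not part of the proposed API (which reads files with geopandas): it lists the feature layers from the standard `gpkg_contents` table and fetches a few attribute columns for one gauge. The function names are hypothetical, and geometry is deliberately not decoded.

```python
import sqlite3
from contextlib import closing
from typing import Dict, List, Optional


def list_feature_layers(gpkg_path: str) -> List[str]:
    """Return the names of the vector layers registered in the GeoPackage."""
    with closing(sqlite3.connect(gpkg_path)) as conn:
        rows = conn.execute(
            "SELECT table_name FROM gpkg_contents "
            "WHERE data_type = 'features' ORDER BY table_name"
        ).fetchall()
    return [row[0] for row in rows]


def watershed_attributes(gpkg_path: str, gauge_id: str) -> Optional[Dict]:
    """Return selected attribute columns for one gauge, or None if absent."""
    with closing(sqlite3.connect(gpkg_path)) as conn:
        conn.row_factory = sqlite3.Row  # rows behave like mappings
        row = conn.execute(
            "SELECT gauge_id, gauge_name, snap_dist, area, country "
            "FROM watersheds WHERE gauge_id = ?",
            (gauge_id,),
        ).fetchone()
    return dict(row) if row is not None else None
```

A check like this could be handy for validating a downloaded file before handing it to geopandas, though whether the library needs it is an open design choice.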
Proposed hosting
Hugging Face Datasets at rivretrieve/watersheds: free hosting, programmatic access via huggingface_hub, and built-in caching.
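A per-country lazy download could look roughly like the sketch below. It assumes one file per country named `<country_code>.gpkg` at the repository root, which the proposal does not specify, and the helper name `fetch_country_gpkg` is hypothetical. The `download` parameter is an illustrative injection point so the function can be exercised without network access; by default it falls back to `huggingface_hub.hf_hub_download`, which returns the local path of the cached file.

```python
from typing import Callable, Optional

REPO_ID = "rivretrieve/watersheds"  # repository name taken from the proposal


def fetch_country_gpkg(
    country_code: str,
    download: Optional[Callable[..., str]] = None,
) -> str:
    """Return a local path to one country's GeoPackage, downloading on first use."""
    if download is None:
        # Lazy import: huggingface_hub is only needed when a file is requested.
        from huggingface_hub import hf_hub_download

        download = hf_hub_download
    return download(
        repo_id=REPO_ID,
        filename=f"{country_code}.gpkg",  # assumed file layout
        repo_type="dataset",
    )
```

Repeated calls for the same country would reuse the huggingface_hub cache, so only the first request for, say, the ~10 GB USA file pays the download cost.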
Proposed API
    from rivretrieve import PortugalFetcher

    fetcher = PortugalFetcher()

    # Returns a GeoDataFrame with the watershed polygon for gauge "04K/04A"
    watershed = fetcher.get_watershed("04K/04A")

    # Optionally include the upstream river network
    watershed, rivers = fetcher.get_watershed("04K/04A", include_rivers=True)
Implementation outline
- COUNTRY_CODE class attribute on each fetcher (e.g. PortugalFetcher.COUNTRY_CODE = "portugal"). Fetchers without watershed data (Spain, UK-EA, UK-NRFA) still get the attribute; for these, get_watershed() raises a clear ValueError listing the available countries.
- get_watershed() concrete method on the RiverDataFetcher base class, so all fetchers inherit it automatically:

      def get_watershed(self, gauge_id: str, include_rivers: bool = False):
          # Lazy import, so the optional geo dependencies load only when needed
          from . import watersheds
          return watersheds.query_watershed(
              country_code=self.COUNTRY_CODE,
              gauge_id=gauge_id,
              include_rivers=include_rivers,
          )
- New module rivretrieve/watersheds.py: a thin wrapper around huggingface_hub.hf_hub_download and geopandas.read_file. It handles per-country GeoPackage download, caching, and gauge ID lookup.
- geopandas as an optional dependency, installed together with huggingface_hub through a "geo" extra:

      # setup.py
      extras_require={
          "geo": ["geopandas>=0.12.0", "huggingface_hub>=0.20.0"],
      }

  Install with (quoted so that shells such as zsh do not expand the brackets):

      pip install "rivretrieve[geo]"
- Clear error messages for: gauge not found in database, country not covered, geopandas/huggingface_hub not installed.
- Unit tests with a tiny fixture GeoPackage, mocking hf_hub_download.
- Docs page docs/watersheds.rst.
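The three error cases in the outline above could be handled by small guard functions along these lines. This is a sketch under assumptions, not the planned implementation: only the ValueError for uncovered countries and the code "portugal" come from the proposal, while the other country codes, the choice of ImportError and KeyError, the message wording, and all function names are hypothetical.

```python
import importlib.util
from typing import Callable

# Hypothetical country codes, one per row of the coverage table. Only
# "portugal" appears in the proposal; the other spellings are assumptions.
COVERED_COUNTRIES = frozenset({
    "usa", "canada", "australia", "france", "norway", "brazil", "poland",
    "south_africa", "czech_republic", "japan", "slovenia", "chile",
    "germany_berlin", "lithuania", "portugal",
})

GEO_EXTRAS = ("geopandas", "huggingface_hub")


def check_country(country_code: str) -> None:
    """Raise ValueError for fetchers that have no watershed file."""
    if country_code not in COVERED_COUNTRIES:
        available = ", ".join(sorted(COVERED_COUNTRIES))
        raise ValueError(
            f"No watershed data for country {country_code!r}. "
            f"Available countries: {available}"
        )


def check_geo_extras(find_spec: Callable = importlib.util.find_spec) -> None:
    """Raise ImportError naming any optional dependency that is missing."""
    missing = [name for name in GEO_EXTRAS if find_spec(name) is None]
    if missing:
        raise ImportError(
            f"Watershed support requires {', '.join(missing)}. "
            'Install with: pip install "rivretrieve[geo]"'
        )


def check_gauge_found(n_matches: int, gauge_id: str, country_code: str) -> None:
    """Raise KeyError when the gauge lookup returned no rows (assumed error type)."""
    if n_matches == 0:
        raise KeyError(
            f"Gauge {gauge_id!r} not found in the {country_code!r} watershed file"
        )
```

Passing `find_spec` as a parameter is only there to make the missing-dependency path easy to test; the same effect could be achieved by patching `importlib.util.find_spec` in the unit tests.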
Questions for discussion
- Is the HuggingFace hosting approach acceptable, or would you prefer a different mechanism (e.g. Zenodo, direct S3)?
- Should COUNTRY_CODE be a required abstract class attribute, or is a softer convention acceptable?
- Should the rivers layer be exposed at all in v1, or should v1 be watershed-only?
Credits