S4 — enrichment.py: numpy/pyarrow keyed gather (drop pandas merge) #89

Description

@Polichinel

Epic: #85 · S4 · [UNILATERAL — low priority]

Background

GaulLookupEnricher is OWN-CHOICE pandas: it loads the GAUL lookup via pd.read_parquet (enrichment.py:48) and attaches metadata with base.merge(left_on=pg_id_col, right_index=True, how="left") (enrichment.py:117-119). This is a keyed metadata-attach join, not frame algebra. It does not block samples (it attaches geo columns to whatever rows exist), so this is low priority.

Work

  • Replace pd.read_parquet(lookup) + .merge with a pandas-free keyed gather: load the lookup once into a priogrid_gid → row numpy/dict structure (via pyarrow.parquet, already imported for _read_version), then gather metadata by cell id, producing NaN/sentinel for misses to preserve the fail-loud null behaviour.
  • Explicitly do NOT push this into views-frames: PredictionFrame carries values + a SpatioTemporalIndex, not arbitrary GAUL columns; modelling a 9-column geographic attach as a frame op would distort the value-object contract. The target is a plain numpy/arrow keyed gather.
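
A minimal sketch of the keyed gather described above, not the actual enrichment.py code. It assumes the lookup has unique priogrid_gid keys and that unmapped cells should come back as NaN, with integer columns upcast to float64 and non-numeric columns to object when a miss occurs (the usual outcome of a left merge). The names load_lookup_columns, gather_metadata, and _take_with_nan are hypothetical, as is the gaul0_code column used in the usage note.

```python
import numpy as np


def load_lookup_columns(path):
    """Read the lookup parquet into {column name: numpy array}, without pandas."""
    import pyarrow.parquet as pq  # local import so the gather itself needs only numpy

    table = pq.read_table(path)
    return {name: table.column(name).to_numpy() for name in table.column_names}


def _take_with_nan(values, src, hit):
    """Gather values[src], writing NaN wherever hit is False."""
    if hit.all():
        return values[src]  # no misses: dtype is preserved
    if np.issubdtype(values.dtype, np.floating):
        out_dtype = values.dtype
    elif np.issubdtype(values.dtype, np.integer):
        out_dtype = np.float64  # assumption: integers upcast so they can hold NaN
    else:
        out_dtype = object  # strings and other types carry a float NaN for misses
    out = np.full(hit.shape, np.nan, dtype=out_dtype)
    out[hit] = values[src[hit]]
    return out


def gather_metadata(cell_ids, lookup, key_col="priogrid_gid"):
    """Attach every non-key lookup column to cell_ids by key; misses become NaN."""
    cell_ids = np.asarray(cell_ids)
    keys = np.asarray(lookup[key_col])
    if np.unique(keys).size != keys.size:
        raise ValueError("lookup keys must be unique for a one-to-one gather")

    order = np.argsort(keys, kind="stable")
    sorted_keys = keys[order]
    if keys.size == 0:
        # Empty lookup: every cell is a miss.
        src = np.zeros(cell_ids.shape, dtype=np.intp)
        hit = np.zeros(cell_ids.shape, dtype=bool)
    else:
        # Binary search each cell id, then confirm the candidate really matches.
        pos = np.minimum(np.searchsorted(sorted_keys, cell_ids), keys.size - 1)
        hit = sorted_keys[pos] == cell_ids
        src = order[pos]  # row in the original (unsorted) lookup

    return {
        name: _take_with_nan(np.asarray(values), src, hit)
        for name, values in lookup.items()
        if name != key_col
    }
```

Called as gather_metadata(base_ids, load_lookup_columns(lookup_path)), this returns one array per metadata column, aligned row for row with base_ids. Sorting the keys once and using searchsorted keeps the lookup vectorised; a plain dict keyed by priogrid_gid would also work but loops in Python per row.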

Acceptance criteria

  • Enricher no longer imports/uses pandas; output is column-for-column identical to the merge path, including NaN positions for unmapped cells.
  • The country_iso_a3.isna() warning path fires identically.

Parity / validation

tests/test_enrichment.py + tests/test_append_metadata.py as the oracle: feed identical inputs to old merge vs new gather, assert equality incl. NaN positions and dtypes.
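
One way the parity check could be written, as a sketch rather than the existing test code. It compares two column mappings (column name to numpy array) for identical column order, dtypes, NaN positions, and non-null values. The helper names null_mask and assert_columns_identical are hypothetical; the old merge output would first need converting, for example with {c: merged[c].to_numpy() for c in merged.columns}.

```python
import numpy as np


def null_mask(values):
    """Boolean mask of missing entries (NaN for floats, None/NaN for object arrays)."""
    values = np.asarray(values)
    if np.issubdtype(values.dtype, np.floating):
        return np.isnan(values)
    if values.dtype == object:
        return np.array(
            [v is None or (isinstance(v, float) and v != v) for v in values],
            dtype=bool,
        )
    return np.zeros(values.shape, dtype=bool)  # ints, bools, etc. cannot hold NaN


def assert_columns_identical(expected, actual):
    """Raise AssertionError unless both mappings match column for column."""
    if list(expected) != list(actual):
        raise AssertionError(f"columns differ: {list(expected)} vs {list(actual)}")
    for name in expected:
        exp = np.asarray(expected[name])
        act = np.asarray(actual[name])
        if exp.dtype != act.dtype:
            raise AssertionError(f"{name}: dtype {exp.dtype} != {act.dtype}")
        exp_null, act_null = null_mask(exp), null_mask(act)
        if not np.array_equal(exp_null, act_null):
            raise AssertionError(f"{name}: NaN positions differ")
        # With identical masks, the remaining values can be compared directly.
        if not np.array_equal(exp[~exp_null], act[~act_null]):
            raise AssertionError(f"{name}: non-null values differ")
```

Comparing the null masks separately avoids the NaN != NaN pitfall, which would otherwise make two correct outputs look unequal.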

Dependencies

Independent of S1–S3 (can land any time). Keep gaul_schema.py (the 9-column contract) as the single source of truth; coordinate with S5.

Files

views_postprocessing/unfao/enrichment.py, tests/test_enrichment.py, tests/test_append_metadata.py.

Metadata

Labels: implementation (code implementation work), story (a single reviewable unit of an epic)
