S5 — build_gaul_lookup.py: pyarrow-native build #90

Description

@Polichinel

Epic: #85 · S5 · [UNILATERAL — lowest priority]

Background

scripts/build_gaul_lookup.py builds the GAUL lookup parquet from views-datafactory's source parquets using pandas + pyarrow. It runs offline, outside the delivery path, so it touches nothing interconnected and has no relevance to samples. Making it pandas-free is for completeness only.

Work

  • Make the builder pandas-free (pure pyarrow): reimplement on arrow tables the join/filter/dtype/sort steps currently done with pandas (build_gaul_lookup.py ~:71-166), and keep the embedded provenance metadata that _read_version reads.

Acceptance criteria

  • The builder produces a lookup parquet identical to the current one from the same datafactory inputs (table contents + embedded provenance metadata).
  • No pandas import in the script.

Parity / validation

Build the lookup both ways from the same inputs; assert identical arrow table and identical provenance metadata. (Downstream tests/test_enrichment.py lookup-integrity assertions must still pass against the rebuilt file.)

Dependencies

Independent; lowest urgency. Keep in sync with S4 and gaul_schema.py (the 9-column contract + rename map).

Files

scripts/build_gaul_lookup.py.

Labels

implementation (code implementation work), story (a single reviewable unit of an epic)