Epic: #85 · S5 · [UNILATERAL — lowest priority]
Background
scripts/build_gaul_lookup.py builds the GAUL lookup parquet from views-datafactory's source parquets using pandas + pyarrow. It runs offline, outside the delivery path, so it touches nothing interconnected and has zero samples relevance. Going pandas-free here is for completeness only.
Work
- Make the builder pandas-free (pure pyarrow): reimplement on arrow tables the join/filter/dtype/sort steps currently done with pandas (build_gaul_lookup.py ~:71-166); keep the embedded provenance metadata that _read_version reads.
Acceptance criteria
Parity / validation
Build the lookup both ways from the same inputs; assert identical arrow table and identical provenance metadata. (Downstream tests/test_enrichment.py lookup-integrity assertions must still pass against the rebuilt file.)
Dependencies
Independent; lowest urgency. Keep in sync with S4 and gaul_schema.py (the 9-column contract + rename map).
Files
scripts/build_gaul_lookup.py.