Skip to content

Add public transfer artifact manifest#377

Draft
MaxGhenis wants to merge 2 commits intomainfrom
codex/public-transfer-artifact-manifest
Draft

Add public transfer artifact manifest#377
MaxGhenis wants to merge 2 commits intomainfrom
codex/public-transfer-artifact-manifest

Conversation

@MaxGhenis
Copy link
Copy Markdown
Contributor

Summary

  • add a committed enhanced_cps_manifest_2025.json with row counts, checksums, git blob SHAs, exchange-rate/build assumptions, loss diagnostics, and weight diagnostics
  • add fast contract tests that compare the manifest, README, source CSV, H5 row counts, checksums, and weight diagnostics
  • document public transfer methodology/limitations and clarify that policybench_transfer_* files are legacy 1k artifacts while Python aliases point to the current enhanced_cps builder
  • add a separate manual workflow and Make target for uploading only the public transfer artifacts to the public Hugging Face repo

Why

The public transfer artifact is now part of the PolicyBench path, but its provenance should not depend on the private eFRS deployment workflow. This keeps public artifact publication separate from private UKDS-derived artifacts and makes misuse harder.

Validation

  • UV_FROZEN=0 uv run --python 3.13 ruff check policyengine_uk_data/datasets/enhanced_cps.py policyengine_uk_data/datasets/__init__.py policyengine_uk_data/datasets/policybench_transfer.py policyengine_uk_data/utils/enhanced_cps_manifest.py policyengine_uk_data/storage/write_enhanced_cps_manifest.py policyengine_uk_data/storage/upload_public_transfer_dataset.py policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py
  • git diff --check
  • UV_FROZEN=0 uv run --python 3.13 pytest policyengine_uk_data/tests/test_enhanced_cps_artifact_manifest.py policyengine_uk_data/tests/test_policybench_transfer.py policyengine_uk_data/tests/test_release_manifest.py policyengine_uk_data/tests/test_hf_destinations.py

@vahid-ahmadi
Copy link
Copy Markdown
Collaborator

Safe to merge from a data-protection standpoint. This PR creates a parallel public upload path that touches none of the FRS-derived plumbing CLAUDE.md protects. Code quality is solid; tests are unusually thorough.

A few things worth a look before merge:

  1. Token scope. The workflow uses secrets.HUGGING_FACE_TOKEN. If that token has write access to -private, a future swap of PUBLIC_REPOPRIVATE_REPO could leak. Two follow-ups worth considering: un-xfail test_every_hf_upload_routes_through_guard_constants once remaining literal call sites are migrated, and/or provision a dedicated public-only token.

  2. POLICYBENCH_TRANSFER_SOURCE_FILE = ENHANCED_CPS_SOURCE_FILE is a footgun. Anyone importing the legacy alias silently gets the 28k CSV rather than the 1k file. Documented in the docstring, but either a louder warning or a runtime DeprecationWarning on import would surface this to consumers earlier.

  3. Legacy policybench_transfer_2025.h5 provenance. Was the 1,000-row legacy artifact ever FRS-derived in an earlier commit? Worth a one-liner confirmation from whoever produced it. Not a regression here.

  4. Repo size. enhanced_cps_2025.h5 is committed to git. If it's >50MB, git lfs would be friendlier; if it's small, ignore.

  5. Minor: _pick_region uses household_id * 2654435761 — a # Knuth multiplicative hash comment would help future readers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants