Add stage-organized pipeline artifact uploads #617

Open
anth-volk wants to merge 1 commit into main from add-pipeline-artifact-uploads

@anth-volk
Collaborator

Fixes #616

Summary

  • New utility module policyengine_us_data/utils/pipeline_artifacts.py with mirror_to_pipeline() as a single-call interface for uploading artifacts to policyengine/policyengine-us-data-pipeline
  • Each stage upload writes a manifest.json with SHA256 checksums, git provenance, and timestamp
  • Hook calls added at 4 existing upload points (purely additive — no changes to existing behavior):
    • upload_completed_datasets.py: stage_0_raw (policy_data.db) + stage_1_base (CPS/enhanced datasets)
    • remote_calibration_runner.py: stage_4_source_imputed + stage_6_weights
    • local_area.py: stage_7_local_area (manifest-only — files too large to double-upload)
  • All mirror uploads are failure-tolerant and never block the main pipeline
  • Verified upload works against the real HF repo
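
As a rough illustration of what a per-stage manifest might contain, the sketch below builds one with SHA256 checksums, the current git commit, and a UTC timestamp. The function names (`sha256_of`, `git_commit`, `build_manifest`, `write_manifest`), the field names, and the use of `git rev-parse` are assumptions made for illustration; they are not taken from the actual `pipeline_artifacts.py`.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path
from typing import List, Optional


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large artifacts are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit() -> Optional[str]:
    """Return the current commit hash, or None if git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def build_manifest(stage: str, run_id: str, files: List[Path]) -> dict:
    """Hypothetical manifest layout; the field names are illustrative only."""
    return {
        "stage": stage,
        "run_id": run_id,
        "created_at": datetime.now(timezone.utc).isoformat(),
        "git_commit": git_commit(),
        "files": {
            p.name: {"sha256": sha256_of(p), "size_bytes": p.stat().st_size}
            for p in files
        },
    }


def write_manifest(manifest: dict, out_dir: Path) -> Path:
    """Serialize the manifest as manifest.json inside out_dir."""
    path = out_dir / "manifest.json"
    path.write_text(json.dumps(manifest, indent=2))
    return path
```

In manifest-only mode (as described for stage_7_local_area), a function like this would presumably still hash the local files but only the resulting `manifest.json` would be uploaded.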

Test plan

  • 12 unit tests covering run ID generation, manifest schema, upload operations, manifest-only mode, error resilience, and folder structure
  • Manual integration test: uploaded a test file to the pipeline repo and verified folder structure, then cleaned up

🤖 Generated with Claude Code

New utility module (pipeline_artifacts.py) mirrors existing build
artifacts to policyengine/policyengine-us-data-pipeline with a
stage-organized folder structure. Each stage gets a manifest.json
with SHA256 checksums and git provenance.

Hook points added at 4 existing upload sites:
- upload_completed_datasets.py: stage_0_raw + stage_1_base
- remote_calibration_runner.py: stage_4_source_imputed + stage_6_weights
- local_area.py: stage_7_local_area (manifest-only)

All mirror uploads are additive and failure-tolerant — they never
block the main pipeline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@anth-volk anth-volk requested a review from juaristi22 March 17, 2026 22:22
@anth-volk anth-volk marked this pull request as ready for review March 17, 2026 22:25
Collaborator

@juaristi22 juaristi22 left a comment

Review

Looks good — clean design with a single-call interface, good reuse of existing utilities, thorough tests, and consistent failure tolerance. Approving with a few small suggestions.

Suggestions

  1. Missing try/except around _upload_source_imputed mirror call (remote_calibration_runner.py:205-213): The import and mirror_to_pipeline() call are outside a try/except, unlike the other three hook sites. While mirror_to_pipeline swallows exceptions internally, the from ... import could fail (e.g., missing dependency in the Modal environment) and would crash _upload_source_imputed. The other three sites all wrap the import+call in try/except — this one should too for consistency.

  2. Temp file leak on exception (pipeline_artifacts.py:175-180): The manifest_path temp file is created with delete=False and only cleaned up at the end of the try block. If hf_create_commit_with_retry raises, os.unlink is skipped (control jumps to the except block). Consider a finally block for cleanup.

  3. File name collisions in manifest (pipeline_artifacts.py:125): manifest["files"] is keyed by p.name (basename only). If two files from different directories share the same name, the second silently overwrites the first. Unlikely with current stages but worth noting.
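
The three suggestions could look roughly like the following. This is a sketch under stated assumptions, not the PR's code: `safe_mirror`, `upload_manifest`, and `manifest_key` are hypothetical names, the `commit` callable is a stand-in for `hf_create_commit_with_retry` (whose real signature is not shown here), and the arguments to `mirror_to_pipeline()` are passed through untouched because its signature is not reproduced in this review.

```python
import json
import logging
import os
import tempfile
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def safe_mirror(*args, **kwargs) -> bool:
    """Suggestion 1: guard the import as well as the call."""
    try:
        from policyengine_us_data.utils.pipeline_artifacts import (
            mirror_to_pipeline,
        )

        mirror_to_pipeline(*args, **kwargs)
        return True
    except Exception:
        # An ImportError (e.g. missing dependency) lands here too,
        # so the surrounding upload function keeps running.
        logger.warning("Pipeline mirror skipped", exc_info=True)
        return False


def upload_manifest(manifest: dict, commit: Callable[[str], None]) -> None:
    """Suggestion 2: remove the temp file even if the commit raises.

    `commit` is a hypothetical stand-in for hf_create_commit_with_retry.
    """
    f = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
    manifest_path = f.name
    try:
        with f:
            json.dump(manifest, f)
        commit(manifest_path)
    finally:
        # Runs on success and on failure, so nothing is left behind.
        os.unlink(manifest_path)


def manifest_key(path: Path, root: Path) -> str:
    """Suggestion 3: key entries by path relative to root, not basename."""
    return path.relative_to(root).as_posix()
```

With the `finally` block in place, any exception from the commit still propagates to the caller's existing `except` handler, so the failure-tolerant behavior is unchanged; only the cleanup becomes unconditional.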

Development

Successfully merging this pull request may close these issues.

Add stage-organized pipeline artifact uploads to HuggingFace

2 participants