Conversation
New utility module (pipeline_artifacts.py) mirrors existing build artifacts to policyengine/policyengine-us-data-pipeline with a stage-organized folder structure. Each stage gets a manifest.json with SHA256 checksums and git provenance. Hook points added at 4 existing upload sites: - upload_completed_datasets.py: stage_0_raw + stage_1_base - remote_calibration_runner.py: stage_4_source_imputed + stage_6_weights - local_area.py: stage_7_local_area (manifest-only) All mirror uploads are additive and failure-tolerant — they never block the main pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
juaristi22
left a comment
There was a problem hiding this comment.
Review
Looks good — clean design with a single-call interface, good reuse of existing utilities, thorough tests, and consistent failure tolerance. Approving with a few small suggestions.
Suggestions
-
Missing
try/exceptaround_upload_source_imputedmirror call (remote_calibration_runner.py:205-213): The import andmirror_to_pipeline()call are outside a try/except, unlike the other three hook sites. Whilemirror_to_pipelineswallows exceptions internally, thefrom ... importcould fail (e.g., missing dependency in the Modal environment) and would crash_upload_source_imputed. The other three sites all wrap the import+call in try/except — this one should too for consistency. -
Temp file leak on exception (
pipeline_artifacts.py:175-180): Themanifest_pathtemp file is created withdelete=Falseand only cleaned up at the end of the try block. Ifhf_create_commit_with_retryraises,os.unlinkis skipped (control jumps to the except block). Consider afinallyblock for cleanup. -
File name collisions in manifest (
pipeline_artifacts.py:125):manifest["files"]is keyed byp.name(basename only). If two files from different directories share the same name, the second silently overwrites the first. Unlikely with current stages but worth noting.
Fixes #616
Summary
policyengine_us_data/utils/pipeline_artifacts.pywithmirror_to_pipeline()as a single-call interface for uploading artifacts topolicyengine/policyengine-us-data-pipelinemanifest.jsonwith SHA256 checksums, git provenance, and timestampupload_completed_datasets.py: stage_0_raw (policy_data.db) + stage_1_base (CPS/enhanced datasets)remote_calibration_runner.py: stage_4_source_imputed + stage_6_weightslocal_area.py: stage_7_local_area (manifest-only — files too large to double-upload)Test plan
🤖 Generated with Claude Code