Add extend_single_year_dataset for fast dataset year projection #7700
PR Review

🔴 Critical (Must Fix)

1. If both constructor arguments are supplied, `USMultiYearDataset.__init__` falls through an if/if chain; it should use if/elif/else and reject both-or-neither.
2. HDFStore (PyTables) files are accessed via h5py in `validate_file_path`. Use pandas instead:

```python
with pd.HDFStore(file_path, mode="r") as store:
    return bool(entity_names & {k.strip("/") for k in store.keys()})
```

3. No handling of missing files: `_resolve_dataset_path` returns `None` silently. The dual-path detection handles format, but an unresolvable path should raise `FileNotFoundError`.

🟡 Should Address

4. `USSingleYearDataset.load()` does not detect duplicate column names.
5. `Microsimulation.__init__` has no explicit `USMultiYearDataset` branch.
6. Test mocking strategy is fragile (not thread-safe).
7. No tests for file I/O paths.
8. `_is_hdfstore_format` also uses h5py on PyTables files.

Recommendation: address the critical items before merging.
Adds USSingleYearDataset and USMultiYearDataset schema classes, extend_single_year_dataset() with multiplicative uprating from the parameter tree, and dual-path loading in Microsimulation that auto-detects entity-level HDFStore files and extends them without routing through the simulation engine. Legacy h5py files continue to work via the existing code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
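The auto-detection described above can be sketched as a set intersection over the file's top-level keys. This is a hypothetical stand-in helper, not the PR's actual `_is_hdfstore_format`; it assumes `pd.HDFStore` keys carry a leading `/` while h5py keys do not:

```python
# Entity table names that mark an entity-level HDFStore dataset.
US_ENTITY_NAMES = {"person", "household", "tax_unit", "spm_unit", "family", "marital_unit"}

def looks_like_hdfstore(top_level_keys):
    """True if any entity table name appears among the top-level keys.

    Entity names (person, household, ...) indicate an entity-level
    HDFStore file; variable names indicate a legacy h5py file.
    """
    return bool(US_ENTITY_NAMES & {key.strip("/") for key in top_level_keys})

# HDFStore-style keys carry a leading slash:
looks_like_hdfstore(["/person", "/household", "/tax_unit"])  # True
# Legacy h5py keys are variable names:
looks_like_hdfstore(["employment_income", "age"])  # False
```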
22 tests covering _resolve_parameter, _apply_single_year_uprating, and end-to-end extend_single_year_dataset. Uses mock system objects to avoid loading the full tax-benefit system (~0.3s total runtime). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
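The mock-system strategy can be sketched like this. All names here are hypothetical; `SimpleNamespace` stands in for the real variable registry and parameter tree, and `resolve_parameter_sketch` mimics what a `_resolve_parameter` helper would do with a dotted path:

```python
from types import SimpleNamespace

class FakeParameter:
    """Callable leaf node: year -> value, like a parameter in the tree."""
    def __init__(self, values):
        self._values = values
    def __call__(self, year):
        return self._values[year]

# A minimal mock "system": one uprated variable, one that is never uprated.
mock_system = SimpleNamespace(
    variables={
        "employment_income": SimpleNamespace(uprating="calibration.employment_income"),
        "age": SimpleNamespace(uprating=None),
    },
    parameters=SimpleNamespace(
        calibration=SimpleNamespace(
            employment_income=FakeParameter({2024: 100.0, 2025: 103.0})
        )
    ),
)

def resolve_parameter_sketch(parameters, dotted_path):
    """Walk 'a.b.c' one attribute at a time down the parameter tree."""
    node = parameters
    for part in dotted_path.split("."):
        node = getattr(node, part)
    return node

param = resolve_parameter_sketch(
    mock_system.parameters, mock_system.variables["employment_income"].uprating
)
factor = param(2025) / param(2024)  # 103.0 / 100.0 = 1.03
```

Because nothing here imports the tax-benefit system, a test built on such objects runs in milliseconds and is fully deterministic.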
- Fix `USMultiYearDataset.__init__` if/if bug (use if/elif/else, reject both or neither args)
- Fix `validate_file_path` to use `pd.HDFStore` instead of h5py
- Fix `USSingleYearDataset.load()` to detect duplicate column names
- Fix `_is_hdfstore_format` to use `pd.HDFStore` instead of h5py
- Fix `_resolve_dataset_path` to raise `FileNotFoundError` instead of returning `None` silently
- Add explicit `USMultiYearDataset` branch in `Microsimulation.__init__`
- Refactor test mocking to use `patch.dict` for thread safety
- Add 12 new tests: init validation, duplicate keys, format detection, path resolution, and file I/O roundtrips

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
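The `patch.dict` refactor relies on `unittest.mock.patch.dict` restoring the mapping on exit, even when the body raises, which direct assignment into `sys.modules` does not guarantee. A minimal sketch with a hypothetical module name:

```python
import importlib
import sys
from unittest.mock import MagicMock, patch

fake_system = MagicMock()

# patch.dict snapshots sys.modules and restores it on exit, so a
# failing test can never leak the fake module into other tests.
with patch.dict(sys.modules, {"fake_tax_system": fake_system}):
    assert importlib.import_module("fake_tax_system") is fake_system

# Outside the context manager the entry is gone again.
assert "fake_tax_system" not in sys.modules
```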
Review fixes applied

All 8 review items have been addressed.

Critical fixes

1. `USMultiYearDataset.__init__` now uses if/elif/else and rejects both-or-neither arguments.
2. `validate_file_path` now opens files via `pd.HDFStore` instead of h5py.
3. `_resolve_dataset_path` now raises `FileNotFoundError` instead of returning `None` silently.

Should-fix items

4. `USSingleYearDataset.load()` now detects duplicate column names.
5. `Microsimulation.__init__` now has an explicit `USMultiYearDataset` branch.
6. Test mocking strategy is fragile: refactored to `patch.dict` for thread safety.
7. No tests for file I/O paths: added file I/O roundtrip tests.
8. `_is_hdfstore_format` now uses `pd.HDFStore` instead of h5py.

Summary: total new tests added: 12 (34 total, up from 22). All pass in ~2s.
PR Review (Updated)

Previous review findings were mostly addressed in the "Fix review items" commit. This is a re-review of the current state.

🔴 Critical (Must Fix)

1. Missing `tables` runtime dependency for `pd.HDFStore`.
2. Bare `except Exception` in `_is_hdfstore_format` and `validate_file_path`.
3. HDFStore opened without `mode="r"` in the dataset constructors.

🟡 Should Address

4. Duplicate `kwargs.get("dataset")` calls in `Microsimulation.__init__`.
5. Dead `None` check in `Microsimulation.__init__`.
6. Tests patch `sys.modules`; accept an injected `system` instead.
7. Entity constant duplication (inline list instead of reusing `US_ENTITIES`).
8. No validation for `end_year >= start_year` in `extend_single_year_dataset`.
9. No duplicate-column detection in `USMultiYearDataset.load()`.
10. Unused `validate()` method.
11. 14 unrelated whitespace-only changes.

Next steps: address the findings and re-request review.
- Add `tables>=3.9` runtime dependency for `pd.HDFStore` (finding #1)
- Narrow bare `except Exception` to specific types in `_is_hdfstore_format`, `validate_file_path` (findings #2, reviewer #3)
- Open HDFStore in `mode="r"` in `USSingleYearDataset` and `USMultiYearDataset` constructors (findings #3, reviewer #2)
- Make optional entities (spm_unit, family, marital_unit) fall back to empty DataFrame when absent from HDF5 file (reviewer #1)
- Consolidate duplicate `kwargs.get("dataset")` in `Microsimulation.__init__` and remove dead `None` check (findings #4, #5)
- Accept `system=None` in `extend_single_year_dataset` and `_apply_uprating` to allow direct injection, eliminating `sys.modules` patching in tests (#6)
- Import and reuse `US_ENTITIES` instead of inline duplication (#7)
- Add `end_year >= start_year` validation in `extend_single_year_dataset` (#8)
- Add duplicate-column detection in `USMultiYearDataset.load()` (#9)
- Remove unused `validate()` method (#10)
- Extract `DEFAULT_END_YEAR` constant (green suggestion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-review fixes (commit a820388)

All 11 findings from the re-review have been addressed, plus 3 additional issues found during a follow-up code review.

Original review findings: all fixed as listed in the commit message above. Also extracted `DEFAULT_END_YEAR` as a constant (green suggestion).

Additional issues found (pre-existing, not regressions): these three issues were present in the original PR code and were caught by a follow-up code review. They predate the re-review.
Fixes #7699
Why this is needed
The API v2 alpha and the `policyengine` Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through `sim.calculate()` via `create_datasets()`, a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: `policyengine-uk-data` publishes entity-level HDFStore files directly, and `policyengine-uk` has `extend_single_year_dataset()`, which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames, with no simulation engine involved. This PR brings the same capability to the US.

How it works
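The two layouts can be contrasted in memory. Variable names and values below are hypothetical, and plain dicts stand in for the actual HDF5 files:

```python
import numpy as np
import pandas as pd

# Variable-centric h5py layout: "variable/year" -> flat array.
variable_centric = {
    "employment_income/2024": np.array([30_000.0, 55_000.0, 0.0]),
    "age/2024": np.array([34, 41, 7]),
}

# Entity-level HDFStore layout: one table per entity.
entity_level = {
    "person": pd.DataFrame(
        {"employment_income": [30_000.0, 55_000.0, 0.0], "age": [34, 41, 7]}
    ),
    "household": pd.DataFrame({"household_id": [1, 2]}),
}

# Same numbers, different shape: the conversion this PR avoids re-deriving
# through the simulation engine is essentially a reshaping between these two.
```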
Dataset schema classes (`dataset_schema.py`)

- `USSingleYearDataset` holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a `time_period`. It can load from / save to Pandas HDFStore files, and provides `.copy()` for deep-copying all DataFrames.
- `USMultiYearDataset` wraps a `dict[int, USSingleYearDataset]` keyed by year. Its `.load()` returns data in `{variable: {year: array}}` format (`time_period_arrays`), which is what policyengine-core's `Microsimulation` expects for multi-year datasets.

Uprating logic (`economic_assumptions.py`)

`extend_single_year_dataset(dataset, end_year=2035)` takes a single base-year dataset and produces a multi-year dataset by:

- copying the base-year tables for each year from `base_year` through `end_year`
- for each uprated variable, reading `system.variables[var].uprating` to get a dotted parameter path (e.g. `"calibration.gov.irs.soi.employment_income"`), resolving it against `system.parameters`, and computing `factor = param(current_year) / param(previous_year)`; the column values are then multiplied by that factor
- carrying non-uprated columns (e.g. `age`, entity IDs) forward unchanged
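The single-year schema class can be approximated by a small sketch. The class and field names here are hypothetical simplifications, not the real `USSingleYearDataset`:

```python
from dataclasses import dataclass
import pandas as pd

# The six entity tables a single-year dataset carries.
US_ENTITY_NAMES = ["person", "household", "tax_unit", "spm_unit", "family", "marital_unit"]

@dataclass
class SingleYearSketch:
    """Stand-in for USSingleYearDataset: one DataFrame per entity plus a year."""
    entities: dict  # entity name -> DataFrame
    time_period: int

    def copy(self):
        # Deep-copy every table so edits to the copy never leak back
        # into the original dataset.
        return SingleYearSketch(
            {name: df.copy(deep=True) for name, df in self.entities.items()},
            self.time_period,
        )

base = SingleYearSketch({"person": pd.DataFrame({"age": [30, 40]})}, time_period=2024)
clone = base.copy()
clone.entities["person"]["age"] += 1  # base is unaffected
```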
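A minimal sketch of the multiplicative step above, assuming a pre-resolved `factors` mapping from column name to yearly parameter values (the real code resolves these from `system.parameters` at runtime; helper names are hypothetical):

```python
import pandas as pd

def apply_uprating_sketch(df, factors, current_year):
    """Multiply each uprated column by param(current) / param(previous).

    `factors` maps column name -> {year: parameter value}; columns with
    no entry (age, entity IDs) are carried forward unchanged.
    """
    out = df.copy()
    previous_year = current_year - 1
    for column, by_year in factors.items():
        if column not in out.columns:
            continue
        previous = by_year[previous_year]
        if previous == 0:
            continue  # division-by-zero guard, mirroring the PR's tests
        out[column] = out[column] * (by_year[current_year] / previous)
    return out

# One chained step: 2024 -> 2025 with a 3% uprating parameter.
people = pd.DataFrame({"employment_income": [100.0, 200.0], "age": [30, 40]})
factors = {"employment_income": {2024: 1.00, 2025: 1.03}}
uprated = apply_uprating_sketch(people, factors, current_year=2025)
```

Chaining this year over year is what makes the uprating compound from year N to N+1 to N+2 rather than always scaling from the base year.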
This is the same approach used by `policyengine-uk`. The uprating mapping is derived entirely from `system.variables` at runtime: the 62 variables with an explicit `uprating = "..."` and the 108 variables assigned via `default_uprating.py` are all picked up automatically. No separate list to maintain.

Dual-path loading (`system.py`)

`Microsimulation.__init__` now auto-detects the dataset format before calling `super().__init__()`:

- Entity-level HDFStore files (`person`, `household` as top-level HDF5 keys): loaded as `USSingleYearDataset`, extended via `extend_single_year_dataset()`, with the resulting `USMultiYearDataset` passed to policyengine-core.
- Legacy h5py files: routed through the core `Microsimulation` code path, unchanged.

Format detection (`_is_hdfstore_format`) inspects the top-level HDF5 keys: entity names indicate HDFStore, variable names indicate h5py.

How we verify correctness
Unit tests (22 tests, ~0.3s)
The test suite in `tests/microsimulation/data/` uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

- `_resolve_parameter` (3 tests): valid dotted path, invalid path, partially valid path
- `_apply_single_year_uprating` (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
- `extend_single_year_dataset` (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, time_period correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)

Roundtrip validation (policyengine-us-data PR #568)
A separate one-off validation script in policyengine-us-data reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.

Depends on
Test plan
- `make test-other` passes (runs the 22 unit tests via pytest)
- Manual: `Microsimulation(dataset="path/to/STATE.hdfstore.h5")`, verify it loads and extends correctly
- Manual: `Microsimulation(dataset="path/to/STATE.h5")`, verify the existing path still works
- Verify uprated variables (e.g. `employment_income`) grow year-over-year
- Verify non-uprated variables (e.g. `age`) are carried forward unchanged

🤖 Generated with Claude Code