Add extend_single_year_dataset for fast dataset year projection #7700
PR Review

🔴 Critical (Must Fix)

1. If both constructor arguments are supplied, `USMultiYearDataset.__init__` falls through an if/if chain; it should use if/elif/else and reject both-or-neither.
2. HDFStore (PyTables) files are accessed via h5py in `validate_file_path`. Use pandas instead:

```python
with pd.HDFStore(file_path, mode="r") as store:
    return bool(entity_names & {k.strip("/") for k in store.keys()})
```

3. No handling of missing files: `_resolve_dataset_path` returns `None` silently. The dual-path detection handles format, but an unresolvable path should raise `FileNotFoundError`.

🟡 Should Address

4. `USSingleYearDataset.load()` does not detect duplicate column names.
5. `Microsimulation.__init__` has no explicit `USMultiYearDataset` branch.
6. Test mocking strategy is fragile (not thread-safe).
7. No tests for file I/O paths.
8. `_is_hdfstore_format` also uses h5py on PyTables files.

Recommendation: address the critical items before merging.
Adds USSingleYearDataset and USMultiYearDataset schema classes, extend_single_year_dataset() with multiplicative uprating from the parameter tree, and dual-path loading in Microsimulation that auto-detects entity-level HDFStore files and extends them without routing through the simulation engine. Legacy h5py files continue to work via the existing code path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
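The auto-detection described above can be sketched as a set intersection over the file's top-level keys. This is a hypothetical stand-in helper, not the PR's actual `_is_hdfstore_format`; it assumes `pd.HDFStore` keys carry a leading `/` while h5py keys do not:

```python
# Entity table names that mark an entity-level HDFStore dataset.
US_ENTITY_NAMES = {"person", "household", "tax_unit", "spm_unit", "family", "marital_unit"}

def looks_like_hdfstore(top_level_keys):
    """True if any entity table name appears among the top-level keys.

    Entity names (person, household, ...) indicate an entity-level
    HDFStore file; variable names indicate a legacy h5py file.
    """
    return bool(US_ENTITY_NAMES & {key.strip("/") for key in top_level_keys})

# HDFStore-style keys carry a leading slash:
looks_like_hdfstore(["/person", "/household", "/tax_unit"])  # True
# Legacy h5py keys are variable names:
looks_like_hdfstore(["employment_income", "age"])  # False
```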
22 tests covering _resolve_parameter, _apply_single_year_uprating, and end-to-end extend_single_year_dataset. Uses mock system objects to avoid loading the full tax-benefit system (~0.3s total runtime). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
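The mock-system strategy can be sketched like this. All names here are hypothetical; `SimpleNamespace` stands in for the real variable registry and parameter tree, and `resolve_parameter_sketch` mimics what a `_resolve_parameter` helper would do with a dotted path:

```python
from types import SimpleNamespace

class FakeParameter:
    """Callable leaf node: year -> value, like a parameter in the tree."""
    def __init__(self, values):
        self._values = values
    def __call__(self, year):
        return self._values[year]

# A minimal mock "system": one uprated variable, one that is never uprated.
mock_system = SimpleNamespace(
    variables={
        "employment_income": SimpleNamespace(uprating="calibration.employment_income"),
        "age": SimpleNamespace(uprating=None),
    },
    parameters=SimpleNamespace(
        calibration=SimpleNamespace(
            employment_income=FakeParameter({2024: 100.0, 2025: 103.0})
        )
    ),
)

def resolve_parameter_sketch(parameters, dotted_path):
    """Walk 'a.b.c' one attribute at a time down the parameter tree."""
    node = parameters
    for part in dotted_path.split("."):
        node = getattr(node, part)
    return node

param = resolve_parameter_sketch(
    mock_system.parameters, mock_system.variables["employment_income"].uprating
)
factor = param(2025) / param(2024)  # 103.0 / 100.0 = 1.03
```

Because nothing here imports the tax-benefit system, a test built on such objects runs in milliseconds and is fully deterministic.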
- Fix `USMultiYearDataset.__init__` if/if bug (use if/elif/else, reject both or neither args)
- Fix `validate_file_path` to use `pd.HDFStore` instead of h5py
- Fix `USSingleYearDataset.load()` to detect duplicate column names
- Fix `_is_hdfstore_format` to use `pd.HDFStore` instead of h5py
- Fix `_resolve_dataset_path` to raise `FileNotFoundError` instead of returning `None` silently
- Add explicit `USMultiYearDataset` branch in `Microsimulation.__init__`
- Refactor test mocking to use `patch.dict` for thread safety
- Add 12 new tests: init validation, duplicate keys, format detection, path resolution, and file I/O roundtrips

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
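The `patch.dict` refactor relies on `unittest.mock.patch.dict` restoring the mapping on exit, even when the body raises, which direct assignment into `sys.modules` does not guarantee. A minimal sketch with a hypothetical module name:

```python
import importlib
import sys
from unittest.mock import MagicMock, patch

fake_system = MagicMock()

# patch.dict snapshots sys.modules and restores it on exit, so a
# failing test can never leak the fake module into other tests.
with patch.dict(sys.modules, {"fake_tax_system": fake_system}):
    assert importlib.import_module("fake_tax_system") is fake_system

# Outside the context manager the entry is gone again.
assert "fake_tax_system" not in sys.modules
```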
Review fixes applied

All 8 review items have been addressed.

Critical fixes

1. `USMultiYearDataset.__init__` now uses if/elif/else and rejects both-or-neither arguments.
2. `validate_file_path` now opens files via `pd.HDFStore` instead of h5py.
3. `_resolve_dataset_path` now raises `FileNotFoundError` instead of returning `None` silently.

Should-fix items

4. `USSingleYearDataset.load()` now detects duplicate column names.
5. `Microsimulation.__init__` now has an explicit `USMultiYearDataset` branch.
6. Test mocking strategy is fragile: refactored to `patch.dict` for thread safety.
7. No tests for file I/O paths: added file I/O roundtrip tests.
8. `_is_hdfstore_format` now uses `pd.HDFStore` instead of h5py.

Summary: total new tests added: 12 (34 total, up from 22). All pass in ~2s.
PR Review (Updated)

Previous review findings were mostly addressed in the "Fix review items" commit. This is a re-review of the current state.

🔴 Critical (Must Fix)

1. Missing `tables` runtime dependency for `pd.HDFStore`.
2. Bare `except Exception` in `_is_hdfstore_format` and `validate_file_path`.
3. HDFStore opened without `mode="r"` in the dataset constructors.

🟡 Should Address

4. Duplicate `kwargs.get("dataset")` calls in `Microsimulation.__init__`.
5. Dead `None` check in `Microsimulation.__init__`.
6. Tests patch `sys.modules`; accept an injected `system` instead.
7. Entity constant duplication (inline list instead of reusing `US_ENTITIES`).
8. No validation for `end_year >= start_year` in `extend_single_year_dataset`.
9. No duplicate-column detection in `USMultiYearDataset.load()`.
10. Unused `validate()` method.
11. 14 unrelated whitespace-only changes.

Next steps: address the findings and re-request review.
- Add `tables>=3.9` runtime dependency for `pd.HDFStore` (finding #1)
- Narrow bare `except Exception` to specific types in `_is_hdfstore_format`, `validate_file_path` (findings #2, reviewer #3)
- Open HDFStore in `mode="r"` in `USSingleYearDataset` and `USMultiYearDataset` constructors (findings #3, reviewer #2)
- Make optional entities (spm_unit, family, marital_unit) fall back to empty DataFrame when absent from HDF5 file (reviewer #1)
- Consolidate duplicate `kwargs.get("dataset")` in `Microsimulation.__init__` and remove dead `None` check (findings #4, #5)
- Accept `system=None` in `extend_single_year_dataset` and `_apply_uprating` to allow direct injection, eliminating `sys.modules` patching in tests (#6)
- Import and reuse `US_ENTITIES` instead of inline duplication (#7)
- Add `end_year >= start_year` validation in `extend_single_year_dataset` (#8)
- Add duplicate-column detection in `USMultiYearDataset.load()` (#9)
- Remove unused `validate()` method (#10)
- Extract `DEFAULT_END_YEAR` constant (green suggestion)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Re-review fixes (commit a820388)

All 11 findings from the re-review have been addressed, plus 3 additional issues found during a follow-up code review.

Original review findings: all fixed as listed in the commit message above. Also extracted `DEFAULT_END_YEAR` as a constant (green suggestion).

Additional issues found (pre-existing, not regressions): these three issues were present in the original PR code and were caught by a follow-up code review. They predate the re-review.
Fixes #7699
Why this is needed
The API v2 alpha and the `policyengine` Python package require entity-level Pandas HDFStore datasets (one table per entity: person, household, tax_unit, etc.) to run microsimulations. The current US data pipeline (policyengine-us-data) publishes variable-centric h5py files (variable/year → array), so converting between the two formats currently requires routing every variable through `sim.calculate()` via `create_datasets()`, a process that takes over an hour per state and doesn't scale to the 500+ geographic datasets we need to serve.

The UK avoids this entirely: `policyengine-uk-data` publishes entity-level HDFStore files directly, and `policyengine-uk` has `extend_single_year_dataset()`, which projects a single base-year dataset to multiple years via simple multiplicative uprating on DataFrames, with no simulation engine involved. This PR brings the same capability to the US.

How it works
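The two layouts can be contrasted in memory. Variable names and values below are hypothetical, and plain dicts stand in for the actual HDF5 files:

```python
import numpy as np
import pandas as pd

# Variable-centric h5py layout: "variable/year" -> flat array.
variable_centric = {
    "employment_income/2024": np.array([30_000.0, 55_000.0, 0.0]),
    "age/2024": np.array([34, 41, 7]),
}

# Entity-level HDFStore layout: one table per entity.
entity_level = {
    "person": pd.DataFrame(
        {"employment_income": [30_000.0, 55_000.0, 0.0], "age": [34, 41, 7]}
    ),
    "household": pd.DataFrame({"household_id": [1, 2]}),
}

# Same numbers, different shape: the conversion this PR avoids re-deriving
# through the simulation engine is essentially a reshaping between these two.
```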
Dataset schema classes (`dataset_schema.py`)

- `USSingleYearDataset` holds six entity DataFrames (person, household, tax_unit, spm_unit, family, marital_unit) plus a `time_period`. It can load from / save to Pandas HDFStore files, and provides `.copy()` for deep-copying all DataFrames.
- `USMultiYearDataset` wraps a `dict[int, USSingleYearDataset]` keyed by year. Its `.load()` returns data in `{variable: {year: array}}` format (`time_period_arrays`), which is what policyengine-core's `Microsimulation` expects for multi-year datasets.

Uprating logic (`economic_assumptions.py`)

`extend_single_year_dataset(dataset, end_year=2035)` takes a single base-year dataset and produces a multi-year dataset by:

- copying the base-year tables for each year from `base_year` through `end_year`
- for each uprated variable, reading `system.variables[var].uprating` to get a dotted parameter path (e.g. `"calibration.gov.irs.soi.employment_income"`), resolving it against `system.parameters`, and computing `factor = param(current_year) / param(previous_year)`; the column values are then multiplied by that factor
- carrying non-uprated columns (e.g. `age`, entity IDs) forward unchanged
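The single-year schema class can be approximated by a small sketch. The class and field names here are hypothetical simplifications, not the real `USSingleYearDataset`:

```python
from dataclasses import dataclass
import pandas as pd

# The six entity tables a single-year dataset carries.
US_ENTITY_NAMES = ["person", "household", "tax_unit", "spm_unit", "family", "marital_unit"]

@dataclass
class SingleYearSketch:
    """Stand-in for USSingleYearDataset: one DataFrame per entity plus a year."""
    entities: dict  # entity name -> DataFrame
    time_period: int

    def copy(self):
        # Deep-copy every table so edits to the copy never leak back
        # into the original dataset.
        return SingleYearSketch(
            {name: df.copy(deep=True) for name, df in self.entities.items()},
            self.time_period,
        )

base = SingleYearSketch({"person": pd.DataFrame({"age": [30, 40]})}, time_period=2024)
clone = base.copy()
clone.entities["person"]["age"] += 1  # base is unaffected
```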
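A minimal sketch of the multiplicative step above, assuming a pre-resolved `factors` mapping from column name to yearly parameter values (the real code resolves these from `system.parameters` at runtime; helper names are hypothetical):

```python
import pandas as pd

def apply_uprating_sketch(df, factors, current_year):
    """Multiply each uprated column by param(current) / param(previous).

    `factors` maps column name -> {year: parameter value}; columns with
    no entry (age, entity IDs) are carried forward unchanged.
    """
    out = df.copy()
    previous_year = current_year - 1
    for column, by_year in factors.items():
        if column not in out.columns:
            continue
        previous = by_year[previous_year]
        if previous == 0:
            continue  # division-by-zero guard, mirroring the PR's tests
        out[column] = out[column] * (by_year[current_year] / previous)
    return out

# One chained step: 2024 -> 2025 with a 3% uprating parameter.
people = pd.DataFrame({"employment_income": [100.0, 200.0], "age": [30, 40]})
factors = {"employment_income": {2024: 1.00, 2025: 1.03}}
uprated = apply_uprating_sketch(people, factors, current_year=2025)
```

Chaining this year over year is what makes the uprating compound from year N to N+1 to N+2 rather than always scaling from the base year.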
This is the same approach used by `policyengine-uk`. The uprating mapping is derived entirely from `system.variables` at runtime: the 62 variables with an explicit `uprating = "..."` and the 108 variables assigned via `default_uprating.py` are all picked up automatically. No separate list to maintain.

Dual-path loading (`system.py`)

`Microsimulation.__init__` now auto-detects the dataset format before calling `super().__init__()`:

- Entity-level HDFStore files (`person`, `household` as top-level HDF5 keys): loaded as `USSingleYearDataset`, extended via `extend_single_year_dataset()`, with the resulting `USMultiYearDataset` passed to policyengine-core.
- Legacy h5py files: routed through the core `Microsimulation` code path, unchanged.

Format detection (`_is_hdfstore_format`) inspects the top-level HDF5 keys: entity names indicate HDFStore, variable names indicate h5py.

How we verify correctness
Unit tests (22 tests, ~0.3s)
The test suite in `tests/microsimulation/data/` uses mock system objects (mock parameters, mock variables) to avoid loading the full tax-benefit system, keeping tests fast and deterministic. Coverage includes:

- `_resolve_parameter` (3 tests): valid dotted path, invalid path, partially valid path
- `_apply_single_year_uprating` (7 tests): correct multiplicative scaling, non-uprated variables unchanged, household entity uprating, unknown columns passed through, unresolvable uprating path, division-by-zero guard (previous param value = 0), zero base values preserved
- `extend_single_year_dataset` (12 tests): correct year count, single-year edge case, default end year (2035), base year values unchanged, year 1 uprating, year 2 chaining (verifies uprating compounds from year N to N+1 to N+2, not from base), non-uprated variable identical across all years, row counts preserved, time_period correctness per year, return type, input dataset immutability, multi-entity uprating (person + household)

Roundtrip validation (policyengine-us-data PR #568)
A separate one-off validation script in policyengine-us-data reads an existing h5py state dataset (e.g. NV.h5), converts it to HDFStore using the same splitting logic, and compares all ~183 variables between the two formats. This passed 183/183 on the Nevada dataset.

Depends on
Test plan
- `make test-other` passes (runs the 22 unit tests via pytest)
- Manual: `Microsimulation(dataset="path/to/STATE.hdfstore.h5")`, verify it loads and extends correctly
- Manual: `Microsimulation(dataset="path/to/STATE.h5")`, verify the existing path still works
- Verify uprated variables (e.g. `employment_income`) grow year-over-year
- Verify non-uprated variables (e.g. `age`) are carried forward unchanged

🤖 Generated with Claude Code