Include PUF aggregate records for top-tail income representation#608
Draft
Conversation
The CPS has -95% to -99% calibration errors for $5M+ AGI brackets. Two changes to fix this:

1. `puf.py`: Replace `puf = puf[puf.MARS != 0]` (which dropped $140B+ in weighted AGI) with `impute_aggregate_mars()` — a QRF trained on income variables imputes MARS; the downstream QRF handles remaining demographics (age, gender, etc.)
2. `extended_cps.py`: Add `_inject_high_income_puf_records()` to append PUF records with AGI > $1M directly into the ExtendedCPS after all processing, giving the reweighter actual high-income observations.

Fixes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
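A minimal sketch of the imputation idea, using a nearest-neighbor match in standardized income space as a crude stand-in for the QRF; the column names and toy data are illustrative, not the project's actual feature set:

```python
import numpy as np
import pandas as pd

def impute_aggregate_mars(puf: pd.DataFrame, income_cols) -> pd.DataFrame:
    """Fill MARS for aggregate (MARS == 0) records from the most similar
    regular record in income space (a crude stand-in for the QRF)."""
    regular = puf[puf.MARS != 0]
    X = regular[income_cols].to_numpy(dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-9
    Xn = (X - mu) / sd  # standardize so no single income variable dominates
    out = puf.copy()
    for idx in out.index[out.MARS == 0]:
        q = (out.loc[idx, income_cols].to_numpy(dtype=float) - mu) / sd
        nearest = int(np.argmin(((Xn - q) ** 2).sum(axis=1)))
        out.loc[idx, "MARS"] = regular["MARS"].iloc[nearest]
    return out

# Toy data: three regular records plus one aggregate (MARS == 0) record.
puf = pd.DataFrame({
    "E00200": [40_000.0, 250_000.0, 12_000_000.0, 9_500_000.0],
    "E01000": [0.0, 30_000.0, 45_000_000.0, 60_000_000.0],
    "MARS": [1, 2, 2, 0],
})
imputed = impute_aggregate_mars(puf, ["E00200", "E01000"])
assert (imputed.MARS != 0).all()
assert int(imputed.MARS.iloc[3]) == 2  # matched to the high-income filer
```

The real change trains a quantile regression forest rather than matching a single neighbor, but the interface is the same: aggregate records keep their income amounts and receive a statistically plausible MARS instead of being dropped.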
Force-pushed 9774ba8 to 20aa0ec
- Narrow except Exception to (KeyError, ValueError, RuntimeError)
- Use endswith("_id") instead of "_id" in key to avoid false matches
- Remove unnecessary .copy() in impute_aggregate_mars
- Use numpy arrays instead of list() for np.isin() calls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
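The `endswith("_id")` fix guards against substring false positives; a toy illustration (column names hypothetical):

```python
columns = ["person_id", "spouse_id_present", "dividend_income"]

# Substring test misfires on any column merely containing "_id".
substring_match = [c for c in columns if "_id" in c]
assert substring_match == ["person_id", "spouse_id_present"]

# Suffix test matches only true identifier columns.
suffix_match = [c for c in columns if c.endswith("_id")]
assert suffix_match == ["person_id"]
```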
…safely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ValueError from f-string comma formatting inside %-style logger
- Handle dtype casting failures when PUF and CPS have incompatible types (e.g. county_fips: numeric in PUF vs string in CPS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a PUF variable can't be cast to the CPS dtype, we were skipping it entirely — leaving that variable shorter than all others. Now pad with zeros/empty values to keep array lengths aligned across all variables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Injecting high-income PUF records increases unweighted citizen % from ~90% to ~96% because tax filers are almost all citizens. Widen the test's expected range from (80-95%) to (80-98%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate()

The per-variable `puf_sim.calculate()` loop was running the full simulation engine for each of 100+ variables, causing the CI build to hang for 7+ hours. Now:

- Only use Microsimulation once (to compute AGI for the household filter)
- Free the simulation immediately after
- Read all variable values from raw PUF arrays (`puf_data[variable]`)
- Pad with zeros for variables not in PUF (CPS-only)

This should reduce the injection step from hours to seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
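The pattern can be sketched as follows; `puf_data`, the variable names, and the precomputed `agi` array are hypothetical stand-ins for the real dataset and the single Microsimulation pass:

```python
import numpy as np

# Hypothetical raw PUF arrays (read directly, no simulation engine per variable).
puf_data = {
    "employment_income": np.array([50_000.0, 2_500_000.0, 80_000.0]),
    "interest_income": np.array([100.0, 900_000.0, 250.0]),
}
# AGI computed once by the simulation, then the simulation is freed.
agi = np.array([52_000.0, 3_400_000.0, 81_000.0])

high_income = agi > 1_000_000  # household filter, one pass
n = int(high_income.sum())

injected = {}
for variable in ["employment_income", "interest_income", "county_fips"]:
    if variable in puf_data:
        injected[variable] = puf_data[variable][high_income]  # raw array read
    else:
        injected[variable] = np.zeros(n)  # CPS-only variable: pad with zeros

assert injected["employment_income"][0] == 2_500_000.0
assert len(injected["county_fips"]) == 1
```

The key change is that the O(variables) cost is now array indexing rather than a full simulation run per variable.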
Variables not in PUF (like immigration_status) were padded with np.zeros(n) which creates float zeros. For string/enum variables this becomes '0.0' — an invalid enum value. Now pad with np.zeros(n, dtype=existing.dtype) which creates empty strings for string arrays and numeric zeros for numeric arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Empty strings are not valid enum values either. For string/enum variables not in PUF (like immigration_status), use the first value from the existing CPS array as the default — this is always a valid enum member (e.g. 'CITIZEN'). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
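Both dtype pitfalls above can be reproduced directly in numpy; `pad_default` is a hypothetical helper showing the final approach (reuse a known-valid enum value for string arrays, zeros for numeric arrays):

```python
import numpy as np

existing = np.array(["CITIZEN", "CITIZEN", "NONCITIZEN"])
n_new = 2

# Float zeros cast into a string dtype read back as "0.0" — not a valid enum value.
assert np.zeros(n_new).astype(existing.dtype)[0] == "0.0"

# Zeros allocated *in* the string dtype are empty strings — still not valid.
assert np.zeros(n_new, dtype=existing.dtype)[0] == ""

def pad_default(existing: np.ndarray, n_new: int) -> np.ndarray:
    if existing.dtype.kind in ("U", "S"):  # string-like: reuse a known-valid value
        return np.full(n_new, existing[0], dtype=existing.dtype)
    return np.zeros(n_new, dtype=existing.dtype)  # numeric: zeros are fine

padded = np.concatenate([existing, pad_default(existing, n_new)])
assert list(padded[-2:]) == ["CITIZEN", "CITIZEN"]
```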
PUF stores numeric floats for variables that CPS stores as string enums (or int). Casting float->bytes produces garbage enum values like b'0.0'. Now only use PUF array values when: - Variable exists in PUF data - Array length matches the entity mask - dtype kind matches (both numeric or both string) Otherwise use the default (zeros for numeric, first existing enum value for strings). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
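The garbage-cast problem and the guard can be sketched like this (`compatible` is a hypothetical helper; the dtype-kind check treats booleans, integers, unsigned ints, and floats as "numeric"):

```python
import numpy as np

puf_values = np.array([1.0, 2.0, 4.0])        # PUF stores numeric floats
cps_values = np.array([b"SINGLE", b"JOINT"])  # CPS stores byte-string enums

# Casting float -> bytes produces garbage enum values like b'1.0'.
assert puf_values.astype(cps_values.dtype)[0] == b"1.0"

def compatible(puf_arr, cps_arr, n_expected):
    """Use the PUF array only if its length matches the entity mask
    and both arrays are numeric, or both are string-like."""
    same_kind = (puf_arr.dtype.kind in "biuf") == (cps_arr.dtype.kind in "biuf")
    return len(puf_arr) == n_expected and same_kind

assert not compatible(puf_values, cps_values, 3)       # kind mismatch: reject
assert compatible(np.array([1.0, 2.0]), np.array([3, 4]), 2)  # both numeric
```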
Variables not in the tax-benefit system (raw dataset variables like county_fips, random seeds, etc.) were being skipped, leaving them shorter than injected variables. This caused IndexError in downstream stratification code. Now infer entity from array length for unknown variables and always pad to keep all arrays consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
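Inferring the entity from array length can be sketched as below; the entity names and counts are illustrative:

```python
import numpy as np

# Hypothetical entity sizes at injection time.
entity_counts = {"person": 7, "household": 3}

def infer_entity(array, counts):
    """For variables unknown to the tax-benefit system, guess the entity
    from the array length so the variable can still be padded."""
    matches = [name for name, n in counts.items() if len(array) == n]
    return matches[0] if len(matches) == 1 else None

assert infer_entity(np.zeros(3), entity_counts) == "household"
assert infer_entity(np.zeros(7), entity_counts) == "person"
assert infer_entity(np.zeros(5), entity_counts) is None  # ambiguous/unknown
```

The ambiguity when two entities happen to have the same record count is why this is a heuristic; the real code only needs it for raw dataset variables outside the simulation's variable registry.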
The ExtendedCPS has a structural assumption: first half = CPS, second half = PUF clone, both exactly the same size. Appending a third segment of raw PUF records breaks downstream code that uses n//2 to split halves (stratification, reweighting, loss matrix). The PUF aggregate records (Phase 1 core change) are still included via impute_aggregate_mars() in puf.py. The puf_clone_dataset() step already transfers PUF income patterns to CPS records, so the aggregate records' high-income patterns will flow through to the extended CPS via the clone imputation. Direct PUF injection needs a different architecture (e.g. injecting into both halves symmetrically, or at the EnhancedCPS level after reweighting). Deferring to a follow-up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
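The structural invariant and how a third segment breaks it can be shown in a few lines (sizes are illustrative):

```python
import numpy as np

# ExtendedCPS invariant: 2 * n_base records, CPS half then PUF-clone half.
n_base = 4
weights = np.arange(2 * n_base, dtype=float)

cps_half = weights[: len(weights) // 2]
clone_half = weights[len(weights) // 2 :]
assert len(cps_half) == len(clone_half) == n_base

# Appending a third segment of raw PUF records silently breaks the split:
extended = np.concatenate([weights, np.array([99.0, 98.0, 97.0])])
mid = len(extended) // 2
assert mid != n_base  # downstream n//2 code now slices into the clone half
```

Any consumer that splits on `n // 2` (stratification, reweighting, the loss matrix) would misalign records rather than fail loudly, which is why the injection was reverted instead of patched.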
Summary
Include PUF aggregate records (MARS=0) instead of dropping them. These 4 records contain $140B+ in weighted AGI from ultra-high-income filers — mostly in the $10M+ bracket — that were being discarded by
`puf = puf[puf.MARS != 0]`.

Changes
puf.py

- `impute_aggregate_mars()`: Trains a QRF on regular PUF records' income variables (wages, interest, dividends, capital gains, partnership income, social security, pensions, XTOT) to predict MARS for the 4 aggregate records. Zero hardcoded demographics — the `impute_missing_demographics()` QRF runs downstream as usual.
- Replaces `puf = puf[puf.MARS != 0]` with `puf = impute_aggregate_mars(puf)`.

How it flows through the pipeline
1. `impute_aggregate_mars()` — QRF: income vars → MARS
2. `preprocess_puf()` — derives filing_status from MARS
3. `impute_missing_demographics()` — QRF: [E00200, MARS, DSI, EIC, XTOT] → age, gender, etc.
4. `puf_clone_dataset()` — transfers PUF income patterns (including aggregate records) to CPS clones

What's NOT in this PR
Direct injection of PUF records into ExtendedCPS was attempted but removed — the ExtendedCPS has a doubled-dataset structure (first half = CPS, second half = PUF clone) that downstream code depends on. Appending a third segment breaks stratification, reweighting, and the loss matrix. A follow-up PR will address this with an architecture that preserves the structural invariant.
Test plan
🤖 Generated with Claude Code