Include PUF aggregate records for top-tail income representation#608
Draft
Conversation
The CPS has -95% to -99% calibration errors for $5M+ AGI brackets. Two changes to fix this:

1. `puf.py`: Replace `puf = puf[puf.MARS != 0]` (which dropped $140B+ in weighted AGI) with `impute_aggregate_mars()` — a QRF trained on income variables imputes MARS; the downstream QRF handles remaining demographics (age, gender, etc.)
2. `extended_cps.py`: Add `_inject_high_income_puf_records()` to append PUF records with AGI > $1M directly into the ExtendedCPS after all processing, giving the reweighter actual high-income observations.

Fixes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
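A minimal sketch of the imputation idea, using a nearest-neighbor match in standardized income space as a crude stand-in for the QRF; the column names and toy data are illustrative, not the project's actual feature set:

```python
import numpy as np
import pandas as pd

def impute_aggregate_mars(puf: pd.DataFrame, income_cols) -> pd.DataFrame:
    """Fill MARS for aggregate (MARS == 0) records from the most similar
    regular record in income space (a crude stand-in for the QRF)."""
    regular = puf[puf.MARS != 0]
    X = regular[income_cols].to_numpy(dtype=float)
    mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-9
    Xn = (X - mu) / sd  # standardize so no single income variable dominates
    out = puf.copy()
    for idx in out.index[out.MARS == 0]:
        q = (out.loc[idx, income_cols].to_numpy(dtype=float) - mu) / sd
        nearest = int(np.argmin(((Xn - q) ** 2).sum(axis=1)))
        out.loc[idx, "MARS"] = regular["MARS"].iloc[nearest]
    return out

# Toy data: three regular records plus one aggregate (MARS == 0) record.
puf = pd.DataFrame({
    "E00200": [40_000.0, 250_000.0, 12_000_000.0, 9_500_000.0],
    "E01000": [0.0, 30_000.0, 45_000_000.0, 60_000_000.0],
    "MARS": [1, 2, 2, 0],
})
imputed = impute_aggregate_mars(puf, ["E00200", "E01000"])
assert (imputed.MARS != 0).all()
assert int(imputed.MARS.iloc[3]) == 2  # matched to the high-income filer
```

The real change trains a quantile regression forest rather than matching a single neighbor, but the interface is the same: aggregate records keep their income amounts and receive a statistically plausible MARS instead of being dropped.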
Force-pushed 9774ba8 to 20aa0ec
- Narrow except Exception to (KeyError, ValueError, RuntimeError)
- Use endswith("_id") instead of "_id" in key to avoid false matches
- Remove unnecessary .copy() in impute_aggregate_mars
- Use numpy arrays instead of list() for np.isin() calls
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
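The `endswith("_id")` fix guards against substring false positives; a toy illustration (column names hypothetical):

```python
columns = ["person_id", "spouse_id_present", "dividend_income"]

# Substring test misfires on any column merely containing "_id".
substring_match = [c for c in columns if "_id" in c]
assert substring_match == ["person_id", "spouse_id_present"]

# Suffix test matches only true identifier columns.
suffix_match = [c for c in columns if c.endswith("_id")]
assert suffix_match == ["person_id"]
```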
…safely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ValueError from f-string comma formatting inside %-style logger
- Handle dtype casting failures when PUF and CPS have incompatible types (e.g. county_fips: numeric in PUF vs string in CPS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a PUF variable can't be cast to the CPS dtype, we were skipping it entirely — leaving that variable shorter than all others. Now pad with zeros/empty values to keep array lengths aligned across all variables. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Injecting high-income PUF records increases unweighted citizen % from ~90% to ~96% because tax filers are almost all citizens. Widen the test's expected range from (80-95%) to (80-98%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate()

The per-variable `puf_sim.calculate()` loop was running the full simulation engine for each of 100+ variables, causing the CI build to hang for 7+ hours. Now:

- Only use Microsimulation once (to compute AGI for the household filter)
- Free the simulation immediately after
- Read all variable values from raw PUF arrays (`puf_data[variable]`)
- Pad with zeros for variables not in PUF (CPS-only)

This should reduce the injection step from hours to seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
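The pattern can be sketched as follows; `puf_data`, the variable names, and the precomputed `agi` array are hypothetical stand-ins for the real dataset and the single Microsimulation pass:

```python
import numpy as np

# Hypothetical raw PUF arrays (read directly, no simulation engine per variable).
puf_data = {
    "employment_income": np.array([50_000.0, 2_500_000.0, 80_000.0]),
    "interest_income": np.array([100.0, 900_000.0, 250.0]),
}
# AGI computed once by the simulation, then the simulation is freed.
agi = np.array([52_000.0, 3_400_000.0, 81_000.0])

high_income = agi > 1_000_000  # household filter, one pass
n = int(high_income.sum())

injected = {}
for variable in ["employment_income", "interest_income", "county_fips"]:
    if variable in puf_data:
        injected[variable] = puf_data[variable][high_income]  # raw array read
    else:
        injected[variable] = np.zeros(n)  # CPS-only variable: pad with zeros

assert injected["employment_income"][0] == 2_500_000.0
assert len(injected["county_fips"]) == 1
```

The key change is that the O(variables) cost is now array indexing rather than a full simulation run per variable.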
Variables not in PUF (like immigration_status) were padded with np.zeros(n) which creates float zeros. For string/enum variables this becomes '0.0' — an invalid enum value. Now pad with np.zeros(n, dtype=existing.dtype) which creates empty strings for string arrays and numeric zeros for numeric arrays. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Empty strings are not valid enum values either. For string/enum variables not in PUF (like immigration_status), use the first value from the existing CPS array as the default — this is always a valid enum member (e.g. 'CITIZEN'). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
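Both dtype pitfalls above can be reproduced directly in numpy; `pad_default` is a hypothetical helper showing the final approach (reuse a known-valid enum value for string arrays, zeros for numeric arrays):

```python
import numpy as np

existing = np.array(["CITIZEN", "CITIZEN", "NONCITIZEN"])
n_new = 2

# Float zeros cast into a string dtype read back as "0.0" — not a valid enum value.
assert np.zeros(n_new).astype(existing.dtype)[0] == "0.0"

# Zeros allocated *in* the string dtype are empty strings — still not valid.
assert np.zeros(n_new, dtype=existing.dtype)[0] == ""

def pad_default(existing: np.ndarray, n_new: int) -> np.ndarray:
    if existing.dtype.kind in ("U", "S"):  # string-like: reuse a known-valid value
        return np.full(n_new, existing[0], dtype=existing.dtype)
    return np.zeros(n_new, dtype=existing.dtype)  # numeric: zeros are fine

padded = np.concatenate([existing, pad_default(existing, n_new)])
assert list(padded[-2:]) == ["CITIZEN", "CITIZEN"]
```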
PUF stores numeric floats for variables that CPS stores as string enums (or int). Casting float->bytes produces garbage enum values like b'0.0'. Now only use PUF array values when: - Variable exists in PUF data - Array length matches the entity mask - dtype kind matches (both numeric or both string) Otherwise use the default (zeros for numeric, first existing enum value for strings). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
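The garbage-cast problem and the guard can be sketched like this (`compatible` is a hypothetical helper; the dtype-kind check treats booleans, integers, unsigned ints, and floats as "numeric"):

```python
import numpy as np

puf_values = np.array([1.0, 2.0, 4.0])        # PUF stores numeric floats
cps_values = np.array([b"SINGLE", b"JOINT"])  # CPS stores byte-string enums

# Casting float -> bytes produces garbage enum values like b'1.0'.
assert puf_values.astype(cps_values.dtype)[0] == b"1.0"

def compatible(puf_arr, cps_arr, n_expected):
    """Use the PUF array only if its length matches the entity mask
    and both arrays are numeric, or both are string-like."""
    same_kind = (puf_arr.dtype.kind in "biuf") == (cps_arr.dtype.kind in "biuf")
    return len(puf_arr) == n_expected and same_kind

assert not compatible(puf_values, cps_values, 3)       # kind mismatch: reject
assert compatible(np.array([1.0, 2.0]), np.array([3, 4]), 2)  # both numeric
```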
Variables not in the tax-benefit system (raw dataset variables like county_fips, random seeds, etc.) were being skipped, leaving them shorter than injected variables. This caused IndexError in downstream stratification code. Now infer entity from array length for unknown variables and always pad to keep all arrays consistent. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
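Inferring the entity from array length can be sketched as below; the entity names and counts are illustrative:

```python
import numpy as np

# Hypothetical entity sizes at injection time.
entity_counts = {"person": 7, "household": 3}

def infer_entity(array, counts):
    """For variables unknown to the tax-benefit system, guess the entity
    from the array length so the variable can still be padded."""
    matches = [name for name, n in counts.items() if len(array) == n]
    return matches[0] if len(matches) == 1 else None

assert infer_entity(np.zeros(3), entity_counts) == "household"
assert infer_entity(np.zeros(7), entity_counts) == "person"
assert infer_entity(np.zeros(5), entity_counts) is None  # ambiguous/unknown
```

The ambiguity when two entities happen to have the same record count is why this is a heuristic; the real code only needs it for raw dataset variables outside the simulation's variable registry.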
The ExtendedCPS has a structural assumption: first half = CPS, second half = PUF clone, both exactly the same size. Appending a third segment of raw PUF records breaks downstream code that uses n//2 to split halves (stratification, reweighting, loss matrix). The PUF aggregate records (Phase 1 core change) are still included via impute_aggregate_mars() in puf.py. The puf_clone_dataset() step already transfers PUF income patterns to CPS records, so the aggregate records' high-income patterns will flow through to the extended CPS via the clone imputation. Direct PUF injection needs a different architecture (e.g. injecting into both halves symmetrically, or at the EnhancedCPS level after reweighting). Deferring to a follow-up. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
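The structural invariant and how a third segment breaks it can be shown in a few lines (sizes are illustrative):

```python
import numpy as np

# ExtendedCPS invariant: 2 * n_base records, CPS half then PUF-clone half.
n_base = 4
weights = np.arange(2 * n_base, dtype=float)

cps_half = weights[: len(weights) // 2]
clone_half = weights[len(weights) // 2 :]
assert len(cps_half) == len(clone_half) == n_base

# Appending a third segment of raw PUF records silently breaks the split:
extended = np.concatenate([weights, np.array([99.0, 98.0, 97.0])])
mid = len(extended) // 2
assert mid != n_base  # downstream n//2 code now slices into the clone half
```

Any consumer that splits on `n // 2` (stratification, reweighting, the loss matrix) would misalign records rather than fail loudly, which is why the injection was reverted instead of patched.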
Summary
Include PUF aggregate records (MARS=0) instead of dropping them. These 4 records contain $140B+ in weighted AGI from ultra-high-income filers — mostly in the $10M+ bracket — that were being discarded by
`puf = puf[puf.MARS != 0]`.

Changes
puf.py

- `impute_aggregate_mars()`: Trains a QRF on regular PUF records' income variables (wages, interest, dividends, capital gains, partnership income, social security, pensions, XTOT) to predict MARS for the 4 aggregate records. Zero hardcoded demographics — the `impute_missing_demographics()` QRF runs downstream as usual.
- Replaces `puf = puf[puf.MARS != 0]` with `puf = impute_aggregate_mars(puf)`.

How it flows through the pipeline
1. `impute_aggregate_mars()` — QRF: income vars → MARS
2. `preprocess_puf()` — derives filing_status from MARS
3. `impute_missing_demographics()` — QRF: [E00200, MARS, DSI, EIC, XTOT] → age, gender, etc.
4. `puf_clone_dataset()` — transfers PUF income patterns (including aggregate records) to CPS clones

What's NOT in this PR
Direct injection of PUF records into ExtendedCPS was attempted but removed — the ExtendedCPS has a doubled-dataset structure (first half = CPS, second half = PUF clone) that downstream code depends on. Appending a third segment breaks stratification, reweighting, and the loss matrix. A follow-up PR will address this with an architecture that preserves the structural invariant.
Test plan
🤖 Generated with Claude Code