
Include PUF aggregate records for top-tail income representation#608

Draft
MaxGhenis wants to merge 13 commits into main from top-tail-income-representation

Conversation


@MaxGhenis MaxGhenis commented Mar 15, 2026

Summary

Include PUF aggregate records (MARS=0) instead of dropping them. These 4 records contain $140B+ in weighted AGI from ultra-high-income filers — mostly in the $10M+ bracket — that were being discarded by puf = puf[puf.MARS != 0].

Changes

puf.py

  • impute_aggregate_mars(): Trains a QRF on regular PUF records' income variables (wages, interest, dividends, capital gains, partnership income, social security, pensions, XTOT) to predict MARS for the 4 aggregate records. Zero hardcoded demographics.
  • Sets DSI=0 and EIC=0, which are correct by definition: ultra-high-income filers are neither claimable as dependents nor EITC-eligible
  • AGERANGE, GENDER, EARNSPLIT, AGEDP1-3 are imputed by the existing impute_missing_demographics() QRF downstream
  • Replaces puf = puf[puf.MARS != 0] with puf = impute_aggregate_mars(puf)
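The shape of the imputation can be sketched as below. This is a minimal stand-in, not the PR's implementation: it substitutes a nearest-neighbor donor for the QRF, and the income column list is illustrative (the actual feature set lives in puf.py).

```python
import numpy as np
import pandas as pd


def impute_aggregate_mars(
    puf: pd.DataFrame,
    income_cols=("E00200", "E00300", "E00600"),  # illustrative subset of PUF income fields
) -> pd.DataFrame:
    """Predict MARS for MARS==0 aggregate records from the income
    variables of regular records (nearest-neighbor sketch of the QRF step)."""
    puf = puf.copy()
    is_agg = puf["MARS"] == 0
    if not is_agg.any():
        return puf
    X_reg = puf.loc[~is_agg, list(income_cols)].to_numpy(dtype=float)
    X_agg = puf.loc[is_agg, list(income_cols)].to_numpy(dtype=float)
    # Each aggregate record borrows MARS from its nearest regular record
    # in income space; the PR uses a QRF prediction instead.
    dists = ((X_agg[:, None, :] - X_reg[None, :, :]) ** 2).sum(axis=2)
    donor_idx = dists.argmin(axis=1)
    puf.loc[is_agg, "MARS"] = puf.loc[~is_agg, "MARS"].to_numpy()[donor_idx]
    # Correct by definition for ultra-high-income filers (per the PR).
    puf.loc[is_agg, ["DSI", "EIC"]] = 0
    return puf
```

The remaining demographics (AGERANGE, GENDER, etc.) are intentionally left untouched here, since the downstream `impute_missing_demographics()` QRF fills them.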

How it flows through the pipeline

  1. impute_aggregate_mars() — QRF: income vars → MARS
  2. preprocess_puf() — derives filing_status from MARS
  3. impute_missing_demographics() — QRF: [E00200, MARS, DSI, EIC, XTOT] → age, gender, etc.
  4. puf_clone_dataset() — transfers PUF income patterns (including aggregate records) to CPS clones
  5. Reweighting optimizer adjusts weights to match SOI calibration targets

What's NOT in this PR

Direct injection of PUF records into ExtendedCPS was attempted but removed — the ExtendedCPS has a doubled-dataset structure (first half = CPS, second half = PUF clone) that downstream code depends on. Appending a third segment breaks stratification, reweighting, and the loss matrix. A follow-up PR will address this with an architecture that preserves the structural invariant.
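The structural invariant can be illustrated with a small numpy sketch (the array is a stand-in for any per-record column in the ExtendedCPS):

```python
import numpy as np

# ExtendedCPS invariant: first half = CPS, second half = PUF clone, equal sizes.
n_cps = 4
column = np.arange(2 * n_cps, dtype=float)  # stand-in for any per-record array

cps_half, clone_half = np.split(column, 2)  # downstream code splits at n // 2
assert len(cps_half) == len(clone_half) == n_cps

# Appending a third segment of raw PUF records silently shifts the midpoint,
# so the "CPS half" would absorb clone records:
extended = np.concatenate([column, np.array([99.0, 98.0])])
mid = len(extended) // 2  # 5, not 4
assert mid != n_cps
```

This is why stratification, reweighting, and the loss matrix all break on a three-segment layout, and why the injection is deferred to a follow-up PR.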

Test plan

  • QRF MARS imputation tested with mock data — produces valid MARS values [1-4]
  • Regular records confirmed unchanged by imputation
  • Imports verified
  • Full CI build on Modal

🤖 Generated with Claude Code

@MaxGhenis MaxGhenis closed this Mar 15, 2026
@MaxGhenis MaxGhenis reopened this Mar 15, 2026
The CPS has -95% to -99% calibration errors for $5M+ AGI brackets.
Two changes to fix this:

1. puf.py: Replace `puf = puf[puf.MARS != 0]` (which dropped $140B+
   in weighted AGI) with `impute_aggregate_mars()` — a QRF trained on
   income variables imputes MARS; downstream QRF handles remaining
   demographics (age, gender, etc.)

2. extended_cps.py: Add `_inject_high_income_puf_records()` to append
   PUF records with AGI > $1M directly into the ExtendedCPS after all
   processing, giving the reweighter actual high-income observations.
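The selection step inside `_inject_high_income_puf_records()` might look like the following sketch (the helper name is from this PR; the function body here is illustrative, not the actual implementation):

```python
import numpy as np

AGI_THRESHOLD = 1_000_000  # per the PR: inject PUF records with AGI > $1M


def select_high_income_records(agi: np.ndarray) -> np.ndarray:
    """Return indices of PUF households above the AGI threshold."""
    return np.flatnonzero(agi > AGI_THRESHOLD)


idx = select_high_income_records(
    np.array([45_000.0, 2_500_000.0, 980_000.0, 12_000_000.0])
)
# idx -> [1, 3]
```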

Fixes #606

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis force-pushed the top-tail-income-representation branch from 9774ba8 to 20aa0ec Compare March 15, 2026 02:00
MaxGhenis and others added 8 commits March 14, 2026 20:18
- Narrow except Exception to (KeyError, ValueError, RuntimeError)
- Use endswith("_id") instead of "_id" in key to avoid false matches
- Remove unnecessary .copy() in impute_aggregate_mars
- Use numpy arrays instead of list() for np.isin() calls

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
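Two of the fixes above can be shown in miniature (the key names are hypothetical; only the patterns matter):

```python
import numpy as np

# endswith("_id") vs substring: a key like "tax_identity" contains "_id"
# but is not an id column.
keys = ["household_id", "tax_identity", "person_id"]
substring_match = [k for k in keys if "_id" in k]        # false positive
suffix_match = [k for k in keys if k.endswith("_id")]    # correct

# np.isin with a numpy array avoids rebuilding a Python list per call.
valid_mars = np.array([1, 2, 3, 4])
mask = np.isin(np.array([1, 2, 3, 4, 0]), valid_mars)
```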
…safely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix ValueError from f-string comma formatting inside %-style logger
- Handle dtype casting failures when PUF and CPS have incompatible
  types (e.g. county_fips: numeric in PUF vs string in CPS)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
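The logger fix follows the standard pattern of pre-formatting the number and passing it as a %-style argument, e.g. `logger.info("kept %s records", f"{n:,}")`. The dtype-casting fallback can be sketched as below (a hedged illustration; the helper name is hypothetical):

```python
import numpy as np


def try_cast(puf_values: np.ndarray, cps_dtype):
    """Attempt the PUF->CPS dtype cast; a failure signals that the caller
    should fall back to a default (e.g. county_fips is numeric in the PUF
    but a string in the CPS)."""
    try:
        return puf_values.astype(cps_dtype), True
    except (ValueError, TypeError):
        return None, False
```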
When a PUF variable can't be cast to the CPS dtype, we were skipping
it entirely — leaving that variable shorter than all others. Now pad
with zeros/empty values to keep array lengths aligned across all
variables.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
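The padding fix above amounts to something like this sketch (hypothetical helper name):

```python
import numpy as np


def pad_to_length(values, target_len, dtype=float):
    """Keep array lengths aligned: when a PUF variable can't be carried
    over for the injected records, pad with the dtype's zero/empty value
    instead of leaving the array short."""
    out = np.zeros(target_len, dtype=dtype)
    out[: len(values)] = values
    return out
```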
Injecting high-income PUF records increases unweighted citizen % from
~90% to ~96% because tax filers are almost all citizens. Widen the
test's expected range from (80-95%) to (80-98%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ate()

The per-variable puf_sim.calculate() loop was running the full
simulation engine for each of 100+ variables, causing the CI build
to hang for 7+ hours. Now:

- Only use Microsimulation once (to compute AGI for household filter)
- Free the simulation immediately after
- Read all variable values from raw PUF arrays (puf_data[variable])
- Pad with zeros for variables not in PUF (CPS-only)

This should reduce the injection step from hours to seconds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
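The optimized path described above can be sketched as follows (function and argument names are illustrative; the real code computes AGI once via Microsimulation, then drops it):

```python
import numpy as np


def build_injection_arrays(puf_data: dict, variables, household_mask):
    """Read injected values straight from the raw PUF arrays instead of
    running the simulation engine per variable; pad CPS-only variables
    with zeros."""
    n = int(household_mask.sum())
    out = {}
    for var in variables:
        if var in puf_data:
            out[var] = np.asarray(puf_data[var])[household_mask]
        else:
            out[var] = np.zeros(n)  # CPS-only variable, no PUF counterpart
    return out
```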
Variables not in PUF (like immigration_status) were padded with
np.zeros(n) which creates float zeros. For string/enum variables
this becomes '0.0' — an invalid enum value. Now pad with
np.zeros(n, dtype=existing.dtype) which creates empty strings
for string arrays and numeric zeros for numeric arrays.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Empty strings are not valid enum values either. For string/enum
variables not in PUF (like immigration_status), use the first value
from the existing CPS array as the default — this is always a valid
enum member (e.g. 'CITIZEN').

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
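The two commits above converge on a pad-value rule like this sketch (hypothetical helper name):

```python
import numpy as np


def pad_value(existing: np.ndarray):
    """Choose a pad value for records not present in the PUF.
    np.zeros(n) gives float 0.0, which stringifies to '0.0' — not a valid
    enum label — and the empty string is invalid too. For string/enum
    arrays, reuse the first existing value (always a valid member, e.g.
    'CITIZEN'); for numeric arrays, use the dtype's zero."""
    if existing.dtype.kind in ("U", "S", "O"):
        return existing[0]
    return existing.dtype.type(0)
```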
@MaxGhenis MaxGhenis requested a review from baogorek March 16, 2026 00:58
MaxGhenis and others added 4 commits March 15, 2026 21:02
PUF stores numeric floats for variables that CPS stores as
string enums (or int). Casting float->bytes produces garbage
enum values like b'0.0'. Now only use PUF array values when:
- Variable exists in PUF data
- Array length matches the entity mask
- dtype kind matches (both numeric or both string)

Otherwise use the default (zeros for numeric, first existing
enum value for strings).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
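The three-condition guard described above can be sketched like this (hypothetical helper name; numeric kinds are grouped so a float→int cast still counts as compatible):

```python
import numpy as np

NUMERIC_KINDS = set("iufb")  # int, unsigned, float, bool


def can_use_puf_values(var, puf_data, mask, cps_dtype):
    """Use PUF array values only when the variable exists in the PUF,
    its length matches the entity mask, and the dtype kinds agree
    (both numeric or both string) — otherwise casting float->bytes
    produces garbage enum values like b'0.0'."""
    if var not in puf_data:
        return False
    arr = np.asarray(puf_data[var])
    if len(arr) != len(mask):
        return False
    return (arr.dtype.kind in NUMERIC_KINDS) == (
        np.dtype(cps_dtype).kind in NUMERIC_KINDS
    )
```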
Variables not in the tax-benefit system (raw dataset variables like
county_fips, random seeds, etc.) were being skipped, leaving them
shorter than injected variables. This caused IndexError in downstream
stratification code.

Now infer entity from array length for unknown variables and always
pad to keep all arrays consistent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ExtendedCPS has a structural assumption: first half = CPS,
second half = PUF clone, both exactly the same size. Appending a
third segment of raw PUF records breaks downstream code that uses
n//2 to split halves (stratification, reweighting, loss matrix).

The PUF aggregate records (Phase 1 core change) are still included
via impute_aggregate_mars() in puf.py. The puf_clone_dataset() step
already transfers PUF income patterns to CPS records, so the
aggregate records' high-income patterns will flow through to the
extended CPS via the clone imputation.

Direct PUF injection needs a different architecture (e.g. injecting
into both halves symmetrically, or at the EnhancedCPS level after
reweighting). Deferring to a follow-up.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis marked this pull request as draft March 17, 2026 12:27