Flowsheet ETL: import and sync flowsheet data from tubafrenzy #258

Open
jakebromberg wants to merge 5 commits into main from worktree-feature/flowsheet-etl

@jakebromberg
Member

Summary

  • Add legacy_release_id, legacy_entry_id, and legacy_show_id columns with unique indexes for deduplication between Backend-Service and tubafrenzy
  • Update library-etl to populate legacy_release_id on insert and backfill existing rows
  • Update mirror middleware to persist legacy IDs after mirroring to tubafrenzy
  • Add @wxyc/flowsheet-etl job with bulk load (MySQL dump parser) and incremental sync (MirrorSQL) modes
  • Add 43 unit tests for the dump parser and data transformer

Test plan

  • All 43 new unit tests pass (tests/unit/jobs/flowsheet-etl/)
  • Existing 360 unit tests still pass
  • Typecheck passes across all workspaces
  • Lint passes with 0 errors
  • Build succeeds for @wxyc/database, @wxyc/library-etl, and @wxyc/flowsheet-etl
  • Bulk load dry run against local Postgres with dump file
  • Spot check: show/entry counts, library linking, ordering
  • Incremental dedup: run after bulk load, verify no re-imports
  • Sequence check: insert new entry via Backend API after bulk load

Closes #257

Jake Bromberg added 5 commits March 21, 2026 20:00
Add legacy_release_id to library, legacy_entry_id to flowsheet, and legacy_show_id to shows tables with unique indexes. These columns map tubafrenzy IDs to Backend-Service IDs, enabling deduplication when the flowsheet ETL imports historical data and syncs ongoing entries.

Update library-etl to populate legacy_release_id on insert and backfill existing rows that are missing it. Update the mirror middleware to persist legacy_entry_id after mirroring entries to tubafrenzy and legacy_show_id after mirroring shows.
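The insert-or-backfill behavior described above can be sketched as a small decision function. This is a minimal illustration, not the library-etl's actual code: the `ExistingAlbum` shape, the `AlbumAction` type, and `planAlbumAction` are hypothetical names, and the lookup result is assumed to carry an `id` plus a nullable legacy release ID.

```typescript
// Hypothetical result of looking up an album that may already exist.
// legacyReleaseId is null for rows that have not been backfilled yet.
interface ExistingAlbum {
  id: number;
  legacyReleaseId: number | null;
}

// The three outcomes this sketch distinguishes for one tubafrenzy release.
type AlbumAction =
  | { kind: "insert"; legacyReleaseId: number }
  | { kind: "backfill"; albumId: number; legacyReleaseId: number }
  | { kind: "skip"; albumId: number };

// Decide what to do with one release, given the lookup result.
function planAlbumAction(
  existing: ExistingAlbum | null,
  legacyReleaseId: number,
): AlbumAction {
  if (existing === null) {
    // No matching album: insert a new row with the legacy ID populated.
    return { kind: "insert", legacyReleaseId };
  }
  if (existing.legacyReleaseId === null) {
    // Album exists but predates the new column: backfill the legacy ID.
    return { kind: "backfill", albumId: existing.id, legacyReleaseId };
  }
  // Album exists and is already linked: nothing to write.
  return { kind: "skip", albumId: existing.id };
}
```

Keeping the decision separate from the database write makes each branch easy to unit test without a live Postgres connection.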

Implements the @wxyc/flowsheet-etl job with two modes:
- Bulk load: parses a MySQL dump file, imports ~71K shows and ~2.6M flowsheet entries
- Incremental sync: queries tubafrenzy via MirrorSQL for new shows and entries since last run

The ETL uses legacy_entry_id and legacy_show_id unique indexes for deduplication, ensuring entries mirrored from Backend-Service are not re-imported. Album IDs are resolved via the legacy_release_id mapping populated by the library-etl.
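One way a unique index can drive deduplication is Postgres's `INSERT ... ON CONFLICT DO NOTHING`. The sketch below builds such a statement with `$n` placeholders. Whether the ETL actually uses this clause is an assumption. The `FlowsheetRow` shape, the `show_id` and `track_title` columns, and `buildDedupInsert` are hypothetical. Only the `flowsheet` table and the `legacy_entry_id` column come from the description above.

```typescript
// Hypothetical, reduced row shape; the real flowsheet table has more columns.
interface FlowsheetRow {
  legacyEntryId: number;
  showId: number | null;
  trackTitle: string | null;
}

interface ParameterizedQuery {
  text: string;
  values: Array<number | string | null>;
}

// Build one multi-row INSERT that silently skips rows whose legacy_entry_id
// already exists. Assumes a plain (non-partial) unique index on that column.
function buildDedupInsert(rows: FlowsheetRow[]): ParameterizedQuery {
  if (rows.length === 0) {
    throw new Error("buildDedupInsert requires at least one row");
  }
  const values: Array<number | string | null> = [];
  const tuples = rows.map((row, index) => {
    const base = index * 3; // three placeholders per row
    values.push(row.legacyEntryId, row.showId, row.trackTitle);
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });
  const text =
    "INSERT INTO flowsheet (legacy_entry_id, show_id, track_title) " +
    `VALUES ${tuples.join(", ")} ` +
    "ON CONFLICT (legacy_entry_id) DO NOTHING";
  return { text, values };
}
```

With this approach, an entry that Backend-Service already mirrored to tubafrenzy (and therefore already carries a `legacy_entry_id`) is skipped on import rather than raising a unique-violation error.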

Includes MySQL dump parser (handles escaped strings, NULL, numeric values), data transformation layer (entry type mapping, timestamp conversion, string truncation), and unit tests for both modules.
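To illustrate what handling escaped strings, NULL, and numeric values involves, here is a minimal sketch of a tuple parser for the `VALUES` portion of a dump `INSERT` statement. It is not the PR's parser: `parseValues` and `SqlValue` are hypothetical names, and the sketch covers only quoted strings, `NULL`, and plain numbers.

```typescript
type SqlValue = string | number | null;

// MySQL backslash escapes that map to a different character.
// Any other escaped character (e.g. \' or \\) stands for itself.
const ESCAPES = new Map<string, string>([
  ["0", "\0"], ["n", "\n"], ["r", "\r"], ["t", "\t"], ["b", "\b"], ["Z", "\x1a"],
]);

// Parse text such as "(1,'a',NULL),(2,'b',3.5)" into rows of values.
function parseValues(valuesSql: string): SqlValue[][] {
  const rows: SqlValue[][] = [];
  const n = valuesSql.length;
  let i = 0;
  while (i < n) {
    // Advance to the opening parenthesis of the next tuple.
    while (i < n && valuesSql.charAt(i) !== "(") i++;
    if (i >= n) break;
    i++; // consume "("
    const row: SqlValue[] = [];
    while (i < n) {
      while (valuesSql.charAt(i) === " ") i++;
      if (valuesSql.charAt(i) === "'") {
        i++; // consume opening quote
        let text = "";
        while (i < n) {
          const c = valuesSql.charAt(i);
          if (c === "\\") {
            const next = valuesSql.charAt(i + 1);
            text += ESCAPES.get(next) ?? next;
            i += 2;
          } else if (c === "'" && valuesSql.charAt(i + 1) === "'") {
            text += "'"; // doubled quote inside a string
            i += 2;
          } else if (c === "'") {
            i++; // consume closing quote
            break;
          } else {
            text += c;
            i++;
          }
        }
        row.push(text);
      } else {
        // Unquoted token: either NULL or a numeric literal.
        const start = i;
        while (i < n && valuesSql.charAt(i) !== "," && valuesSql.charAt(i) !== ")") i++;
        const token = valuesSql.slice(start, i).trim();
        row.push(token.toUpperCase() === "NULL" ? null : Number(token));
      }
      while (valuesSql.charAt(i) === " ") i++;
      if (valuesSql.charAt(i) === ",") { i++; continue; }
      if (valuesSql.charAt(i) === ")") i++;
      break; // end of tuple (or malformed input)
    }
    rows.push(row);
  }
  return rows;
}
```

Scanning character by character, rather than splitting on commas, matters because commas and parentheses can legally appear inside quoted strings such as track titles.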

Add unit tests verifying findExistingAlbum returns id + legacy_release_id when the album exists, returns null when it doesn't, and correctly returns legacy_release_id as null for albums that haven't been backfilled yet. Export findExistingAlbum for testability. Fix Prettier formatting across all new files.

Truncate show_name to 128 chars in transformShow (tubafrenzy allows 255, Backend allows 128). Validate show_id references against the set of imported shows before inserting entries, setting show_id to null for the 19 orphan entries that reference deleted shows.
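The two fixes in this commit can be sketched roughly as follows. The helper names `truncateShowName` and `nullOrphanShowIds` and the `LegacyEntry` shape are hypothetical, and the sketch assumes the 128-character limit is counted in UTF-16 code units, which is what `String.prototype.slice` measures.

```typescript
// Column limit on the Backend-Service side, per the commit message.
const MAX_SHOW_NAME_LENGTH = 128;

// Hypothetical, reduced entry shape for illustration.
interface LegacyEntry {
  legacyEntryId: number;
  showId: number | null;
}

// Cut show names that fit tubafrenzy's 255-char column but not the 128-char one.
function truncateShowName(name: string | null): string | null {
  return name === null ? null : name.slice(0, MAX_SHOW_NAME_LENGTH);
}

// Replace show_id with null when it points at a show that was never imported,
// so the row does not reference a missing show.
function nullOrphanShowIds(
  entries: LegacyEntry[],
  importedShowIds: ReadonlySet<number>,
): LegacyEntry[] {
  return entries.map((entry) =>
    entry.showId !== null && !importedShowIds.has(entry.showId)
      ? { ...entry, showId: null }
      : entry,
  );
}
```

Nulling the reference keeps the orphan entries in the import instead of dropping them, at the cost of losing their link to a show.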
@jakebromberg jakebromberg requested a review from AyBruno March 26, 2026 04:02