Flowsheet ETL: import and sync flowsheet data from tubafrenzy by jakebromberg · Pull Request #258 · WXYC/Backend-Service

jakebromberg · 2026-03-22T04:13:53Z

Summary

Add legacy_release_id, legacy_entry_id, and legacy_show_id columns with unique indexes for deduplication between Backend-Service and tubafrenzy
Update library-etl to populate legacy_release_id on insert and backfill existing rows
Update mirror middleware to persist legacy IDs after mirroring to tubafrenzy
Add @wxyc/flowsheet-etl job with bulk load (MySQL dump parser) and incremental sync (MirrorSQL) modes
Add 43 unit tests for the dump parser and data transformer

Test plan

All 43 new unit tests pass (tests/unit/jobs/flowsheet-etl/)
Existing 360 unit tests still pass
Typecheck passes across all workspaces
Lint passes with 0 errors
Build succeeds for @wxyc/database, @wxyc/library-etl, and @wxyc/flowsheet-etl
Bulk load dry run against local Postgres with dump file
Spot check: show/entry counts, library linking, ordering
Incremental dedup: run after bulk load, verify no re-imports
Sequence check: insert new entry via Backend API after bulk load

Closes #257

Add legacy_release_id to library, legacy_entry_id to flowsheet, and legacy_show_id to shows tables with unique indexes. These columns map tubafrenzy IDs to Backend-Service IDs, enabling deduplication when the flowsheet ETL imports historical data and syncs ongoing entries. Update library-etl to populate legacy_release_id on insert and backfill existing rows that are missing it. Update the mirror middleware to persist legacy_entry_id after mirroring entries to tubafrenzy and legacy_show_id after mirroring shows.
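As a rough illustration of what the unique legacy-ID indexes provide, the sketch below mimics the deduplication decision in memory. The row shape and `partitionByLegacyId` are hypothetical names, not code from this PR; the actual job relies on the database's unique indexes rather than on a set held in application memory.

```typescript
// Hypothetical shape of a flowsheet entry coming from tubafrenzy.
interface IncomingEntry {
  legacyEntryId: number; // tubafrenzy's ID, stored in legacy_entry_id
  title: string;
}

interface Partition {
  toInsert: IncomingEntry[];
  skipped: IncomingEntry[];
}

// Split incoming rows into those that still need importing and those whose
// legacy ID is already present (for example, entries that Backend-Service
// mirrored to tubafrenzy and tagged with legacy_entry_id afterwards).
function partitionByLegacyId(
  incoming: IncomingEntry[],
  alreadyImported: ReadonlySet<number>,
): Partition {
  // Copy the set so duplicates inside the same batch are also caught.
  const seen = new Set<number>(alreadyImported);
  const toInsert: IncomingEntry[] = [];
  const skipped: IncomingEntry[] = [];

  for (const row of incoming) {
    if (seen.has(row.legacyEntryId)) {
      skipped.push(row);
    } else {
      seen.add(row.legacyEntryId);
      toInsert.push(row);
    }
  }
  return { toInsert, skipped };
}
```

A unique index gives the same guarantee at the storage layer, so a second ETL run or a concurrent mirror write cannot create a duplicate row even if an application-level check is missed.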

Implements the @wxyc/flowsheet-etl job with two modes:
- Bulk load: parses a MySQL dump file, imports ~71K shows and ~2.6M flowsheet entries
- Incremental sync: queries tubafrenzy via MirrorSQL for new shows and entries since last run

The ETL uses legacy_entry_id and legacy_show_id unique indexes for deduplication, ensuring entries mirrored from Backend-Service are not re-imported. Album IDs are resolved via the legacy_release_id mapping populated by the library-etl.

Includes MySQL dump parser (handles escaped strings, NULL, numeric values), data transformation layer (entry type mapping, timestamp conversion, string truncation), and unit tests for both modules.
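To make the dump-parsing step concrete, here is a minimal sketch of parsing one `VALUES` tuple from a MySQL dump, covering the three cases named above (escaped strings, NULL, numeric values). `parseTuple` and `SqlValue` are hypothetical names and this is not the PR's parser; it assumes a single complete tuple with no whitespace around the commas.

```typescript
type SqlValue = string | number | null;

// Backslash escapes that have a special meaning inside MySQL string literals.
// Any other escaped character (such as \' or \\) stands for itself.
const ESCAPES = new Map<string, string>([
  ["0", "\0"],
  ["b", "\b"],
  ["n", "\n"],
  ["r", "\r"],
  ["t", "\t"],
  ["Z", "\x1a"],
]);

// Parse a tuple such as (1,'it\'s',NULL,-3.5) into JavaScript values.
function parseTuple(tuple: string): SqlValue[] {
  if (tuple.charAt(0) !== "(") throw new Error("tuple must start with '('");
  const values: SqlValue[] = [];
  let i = 1;

  while (i < tuple.length) {
    if (tuple.charAt(i) === "'") {
      // Quoted string: read until the closing, unescaped quote.
      i++;
      let out = "";
      for (;;) {
        const ch = tuple.charAt(i);
        if (ch === "") throw new Error("unterminated string");
        if (ch === "\\") {
          const next = tuple.charAt(i + 1);
          if (next === "") throw new Error("dangling backslash");
          out += ESCAPES.get(next) ?? next;
          i += 2;
        } else if (ch === "'") {
          if (tuple.charAt(i + 1) === "'") {
            out += "'"; // a doubled quote is an escaped quote
            i += 2;
          } else {
            i++; // closing quote
            break;
          }
        } else {
          out += ch;
          i++;
        }
      }
      values.push(out);
    } else {
      // Bare token: either NULL or a number.
      let j = i;
      while (j < tuple.length && tuple.charAt(j) !== "," && tuple.charAt(j) !== ")") j++;
      const token = tuple.slice(i, j);
      if (token === "NULL") {
        values.push(null);
      } else {
        const n = Number(token);
        if (token === "" || Number.isNaN(n)) throw new Error(`unexpected token: ${token}`);
        values.push(n);
      }
      i = j;
    }

    // Every value must be followed by a separator or the end of the tuple.
    const sep = tuple.charAt(i);
    if (sep === ")") return values;
    if (sep !== ",") throw new Error(`expected ',' or ')' at offset ${i}`);
    i++;
  }
  throw new Error("unterminated tuple");
}

// The JavaScript string below is the dump text (42,'O\'Brien',NULL).
console.log(parseTuple("(42,'O\\'Brien',NULL)")); // logs the array [42, "O'Brien", null]
```

Scanning character by character, rather than splitting on commas, is what lets commas and parentheses inside quoted strings pass through untouched.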

Add unit tests verifying findExistingAlbum returns id + legacy_release_id when the album exists, returns null when it doesn't, and correctly returns legacy_release_id as null for albums that haven't been backfilled yet. Export findExistingAlbum for testability. Fix Prettier formatting across all new files.

Truncate show_name to 128 chars in transformShow (tubafrenzy allows 255, Backend allows 128). Validate show_id references against the set of imported shows before inserting entries, setting show_id to null for the 19 orphan entries that reference deleted shows.
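The two fixes in this commit can be sketched roughly as follows. The interfaces and the helper names `truncateShowName` and `nullOrphanShowIds` are hypothetical (the commit itself only names `transformShow`), and the field names are assumptions made for illustration.

```typescript
// Backend allows 128 characters for show_name; tubafrenzy allows 255.
const SHOW_NAME_MAX = 128;

// Hypothetical shape of an entry after transformation.
interface TransformedEntry {
  legacyEntryId: number;
  showId: number | null;
}

// Truncate a show name to the Backend limit. Note that slice() counts UTF-16
// code units, so names containing characters outside the Basic Multilingual
// Plane may need code-point-aware truncation instead.
function truncateShowName(name: string | null): string | null {
  return name === null ? null : name.slice(0, SHOW_NAME_MAX);
}

// Set showId to null on entries that reference a show that was not imported
// (for example, a show deleted in tubafrenzy), and report how many were hit.
function nullOrphanShowIds(
  entries: TransformedEntry[],
  importedShowIds: ReadonlySet<number>,
): { entries: TransformedEntry[]; orphanCount: number } {
  let orphanCount = 0;
  const cleaned = entries.map((entry) => {
    if (entry.showId !== null && !importedShowIds.has(entry.showId)) {
      orphanCount++;
      return { ...entry, showId: null };
    }
    return entry;
  });
  return { entries: cleaned, orphanCount };
}
```

Nulling the reference keeps the orphan entries in the import instead of dropping them, at the cost of losing their link to a show.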

Jake Bromberg added 5 commits March 21, 2026 20:00

fix: wrap async setTimeout to satisfy no-misused-promises lint rule
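For context, typescript-eslint's no-misused-promises rule flags an async function passed where a callback returning void is expected, as in `setTimeout(async () => { ... }, ms)`, because the returned promise would be silently dropped. One common way to satisfy the rule is sketched below; `toVoidCallback` is a hypothetical helper and may not match what this commit actually does.

```typescript
// Wrap an async task in a synchronous callback that returns void.
// Rejections are routed to onError instead of becoming unhandled.
function toVoidCallback(
  task: () => Promise<void>,
  onError: (err: unknown) => void,
): () => void {
  return () => {
    task().catch(onError);
  };
}
```

A timer can then be scheduled as `setTimeout(toVoidCallback(runSync, console.error), 5_000)`, where `runSync` stands in for whatever async function the job needs to run.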

f5c8a99

jakebromberg requested a review from AyBruno March 26, 2026 04:02

jakebromberg wants to merge 5 commits into main from worktree-feature/flowsheet-etl
