feat: diff optimization by reybahl · Pull Request #401 · coursetable/ferry

reybahl · 2026-03-11T02:52:11Z

Summary by CodeRabbit

Refactor
- Improved database synchronization performance through optimized change detection and row comparison logic.

…kups

coderabbitai · 2026-03-11T02:52:24Z

📝 Walkthrough

Walkthrough

Database synchronization logic optimized through set-based lookups replacing O(n) index checks and vectorized column-change detection replacing loop-based comparisons. Hash-based change detection narrows per-cell comparisons to modified rows only, reducing computational overhead while preserving existing validation and diff structure.

Changes

Cohort / File(s)	Summary
Database Sync Optimization `ferry/database/sync_db_courses.py`	Refactors change detection from O(n) index checks to set-based lookups; introduces hash-based filtering to identify changed rows; replaces loop-based column-change detection with vectorized numpy operations; maintains existing validation and diff structure while optimizing computational pathways.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through databases with grace,
Sets replace loops—a faster pace!
Hashes guide us to rows that changed,
Vectors dance where loops once ranged,
Database sync now swift and lean,
The most efficient it's ever been! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'feat: diff optimization' directly describes the main change—optimizing the diff computation in sync_db_courses.py through algorithmic improvements.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch diff-optimization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ferry/database/sync_db_courses.py`:
- Around line 120-124: The per-cell diff currently lets pd.NA propagate (via
subset_old == subset_new) so unequal_mask gets <NA> values; change the logic in
the block that computes subset_unequal (using subset_old, subset_new,
possibly_changed) to explicitly handle nulls—for example compute na_mismatch =
subset_old.isna() ^ subset_new.isna(), compute value_diff = (subset_old !=
subset_new).fillna(False), then set subset_unequal = value_diff | na_mismatch
and assign that into unequal_mask.loc[possibly_changed]; this ensures NA→value
and value→NA transitions are detected and yields a boolean mask.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9311fcf0-f37f-48b0-bf69-df641e6460dd

📥 Commits

Reviewing files that changed from the base of the PR and between 2e92672 and 39fee68.

📒 Files selected for processing (1)

ferry/database/sync_db_courses.py

coderabbitai · 2026-03-11T03:00:33Z

ferry/database/sync_db_courses.py

+            subset_unequal = ~(
+                (subset_old == subset_new)
+                | (subset_old.isna() & subset_new.isna())
            )
-            print(f"Old type: {old_types.iat[row, col]}")
-            print(f"New type: {new_types.iat[row, col]}")
-            print(f"Old value: {shared_rows_old.iat[row, col]}")
-            print(f"New value: {shared_rows_new.iat[row, col]}")
-            raise TypeError("Type mismatch")
-        unequal_mask = ~(
-            (shared_rows_old == shared_rows_new)
-            | (shared_rows_old.isna() & shared_rows_new.isna())
-        )
+            unequal_mask.loc[possibly_changed] = subset_unequal


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash python - <<'PY' import pandas as pd old = pd.DataFrame({"primary_crn": pd.Series([pd.NA], dtype="Int64")}) new = pd.DataFrame({"primary_crn": pd.Series([12345], dtype="Int64")}) current = ~((old == new) | (old.isna() & new.isna())) fixed = old.ne(new).fillna(True) & ~(old.isna() & new.isna()) print("current mask:") print(current) print("current any:", current.any(axis=1).tolist()) print("\nfixed mask:") print(fixed) print("fixed any:", fixed.any(axis=1).tolist()) PY

Repository: coursetable/ferry

Length of output: 179

🏁 Script executed:

#!/bin/bash pip install pandas --quiet python - <<'PY' import pandas as pd old = pd.DataFrame({"primary_crn": pd.Series([pd.NA], dtype="Int64")}) new = pd.DataFrame({"primary_crn": pd.Series([12345], dtype="Int64")}) current = ~((old == new) | (old.isna() & new.isna())) fixed = old.ne(new).fillna(True) & ~(old.isna() & new.isna()) print("current mask:") print(current) print("current any:", current.any(axis=1).tolist()) print("\nfixed mask:") print(fixed) print("fixed any:", fixed.any(axis=1).tolist()) PY

Repository: coursetable/ferry

Length of output: 431

🏁 Script executed:

cd /repo && cat -n ferry/database/sync_db_courses.py | sed -n '100,135p'

Repository: coursetable/ferry

Length of output: 116

🏁 Script executed:

cd /repo && cat -n ferry/database/sync_db_courses.py | sed -n '680,695p'

Repository: coursetable/ferry

Length of output: 116

🏁 Script executed:

find . -name "sync_db_courses.py" -type f

Repository: coursetable/ferry

Length of output: 96

🏁 Script executed:

find . -name "sync_db_courses.py" -type f -exec cat -n {} \; | sed -n '100,135p'

Repository: coursetable/ferry

Length of output: 2108

🏁 Script executed:

find . -name "sync_db_courses.py" -type f -exec cat -n {} \; | sed -n '680,695p'

Repository: coursetable/ferry

Length of output: 925

Make the per-cell diff mask NA-safe.

With pandas nullable dtypes, subset_old == subset_new returns pd.NA for null/value transitions. That propagates through the boolean expression, causing unequal_mask to contain <NA> cells. When unequal_mask.any(axis=1) is called at line 126, those NA values are treated as falsy, so rows with changes like primary_crn: <NA> → 12345 are silently dropped from the diff.

Suggested fix

- subset_unequal = ~( - (subset_old == subset_new) - | (subset_old.isna() & subset_new.isna()) - ) + both_na = subset_old.isna() & subset_new.isna() + subset_unequal = subset_old.ne(subset_new).fillna(True) & ~both_na

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

subset_unequal = ~(

(subset_old == subset_new)

| (subset_old.isna() & subset_new.isna())

)

print(f"Old type: {old_types.iat[row, col]}")

print(f"New type: {new_types.iat[row, col]}")

print(f"Old value: {shared_rows_old.iat[row, col]}")

print(f"New value: {shared_rows_new.iat[row, col]}")

raise TypeError("Type mismatch")

unequal_mask = ~(

(shared_rows_old == shared_rows_new)

| (shared_rows_old.isna() & shared_rows_new.isna())

)

unequal_mask.loc[possibly_changed] = subset_unequal

both_na = subset_old.isna() & subset_new.isna()

subset_unequal = subset_old.ne(subset_new).fillna(True) & ~both_na

unequal_mask.loc[possibly_changed] = subset_unequal

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@ferry/database/sync_db_courses.py` around lines 120 - 124, The per-cell diff currently lets pd.NA propagate (via subset_old == subset_new) so unequal_mask gets <NA> values; change the logic in the block that computes subset_unequal (using subset_old, subset_new, possibly_changed) to explicitly handle nulls—for example compute na_mismatch = subset_old.isna() ^ subset_new.isna(), compute value_diff = (subset_old != subset_new).fillna(False), then set subset_unequal = value_diff | na_mismatch and assign that into unequal_mask.loc[possibly_changed]; this ensures NA→value and value→NA transitions are detected and yields a boolean mask.

reybahl added 2 commits March 10, 2026 22:08

fix: use schema-preserving empty dataframes in changelog junction loo…

e922406

…kups

feat: diff optimizations

c4b4f5a

Merge branch 'master' into diff-optimization

39fee68

coderabbitai bot reviewed Mar 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: diff optimization#401

feat: diff optimization#401
reybahl wants to merge 3 commits intomasterfrom
diff-optimization

reybahl commented Mar 11, 2026 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Mar 11, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Mar 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

reybahl commented Mar 11, 2026 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Mar 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

reybahl commented Mar 11, 2026 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Mar 11, 2026 •

edited

Loading