Skip to content

feat: diff optimization#401

Open
reybahl wants to merge 3 commits intomasterfrom
diff-optimization
Open

feat: diff optimization#401
reybahl wants to merge 3 commits intomasterfrom
diff-optimization

Conversation

@reybahl
Copy link
Member

@reybahl reybahl commented Mar 11, 2026

Summary by CodeRabbit

  • Refactor
    • Improved database synchronization performance through optimized change detection and row comparison logic.

@coderabbitai
Copy link

coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Walkthrough

Database synchronization logic optimized through set-based lookups replacing O(n) index checks and vectorized column-change detection replacing loop-based comparisons. Hash-based change detection narrows per-cell comparisons to modified rows only, reducing computational overhead while preserving existing validation and diff structure.

Changes

Cohort / File(s) Summary
Database Sync Optimization
ferry/database/sync_db_courses.py
Refactors change detection from O(n) index checks to set-based lookups; introduces hash-based filtering to identify changed rows; replaces loop-based column-change detection with vectorized numpy operations; maintains existing validation and diff structure while optimizing computational pathways.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hopping through databases with grace,
Sets replace loops—a faster pace!
Hashes guide us to rows that changed,
Vectors dance where loops once ranged,
Database sync now swift and lean,
The most efficient it's ever been!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: diff optimization' directly describes the main change—optimizing the diff computation in sync_db_courses.py through algorithmic improvements.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch diff-optimization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@ferry/database/sync_db_courses.py`:
- Around line 120-124: The per-cell diff currently lets pd.NA propagate (via
subset_old == subset_new) so unequal_mask gets <NA> values; change the logic in
the block that computes subset_unequal (using subset_old, subset_new,
possibly_changed) to explicitly handle nulls—for example compute na_mismatch =
subset_old.isna() ^ subset_new.isna(), compute value_diff = (subset_old !=
subset_new).fillna(False), then set subset_unequal = value_diff | na_mismatch
and assign that into unequal_mask.loc[possibly_changed]; this ensures NA→value
and value→NA transitions are detected and yields a boolean mask.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9311fcf0-f37f-48b0-bf69-df641e6460dd

📥 Commits

Reviewing files that changed from the base of the PR and between 2e92672 and 39fee68.

📒 Files selected for processing (1)
  • ferry/database/sync_db_courses.py

Comment on lines +120 to +124
subset_unequal = ~(
(subset_old == subset_new)
| (subset_old.isna() & subset_new.isna())
)
print(f"Old type: {old_types.iat[row, col]}")
print(f"New type: {new_types.iat[row, col]}")
print(f"Old value: {shared_rows_old.iat[row, col]}")
print(f"New value: {shared_rows_new.iat[row, col]}")
raise TypeError("Type mismatch")
unequal_mask = ~(
(shared_rows_old == shared_rows_new)
| (shared_rows_old.isna() & shared_rows_new.isna())
)
unequal_mask.loc[possibly_changed] = subset_unequal
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
python - <<'PY'
import pandas as pd

old = pd.DataFrame({"primary_crn": pd.Series([pd.NA], dtype="Int64")})
new = pd.DataFrame({"primary_crn": pd.Series([12345], dtype="Int64")})

current = ~((old == new) | (old.isna() & new.isna()))
fixed = old.ne(new).fillna(True) & ~(old.isna() & new.isna())

print("current mask:")
print(current)
print("current any:", current.any(axis=1).tolist())

print("\nfixed mask:")
print(fixed)
print("fixed any:", fixed.any(axis=1).tolist())
PY

Repository: coursetable/ferry

Length of output: 179


🏁 Script executed:

#!/bin/bash
pip install pandas --quiet
python - <<'PY'
import pandas as pd

old = pd.DataFrame({"primary_crn": pd.Series([pd.NA], dtype="Int64")})
new = pd.DataFrame({"primary_crn": pd.Series([12345], dtype="Int64")})

current = ~((old == new) | (old.isna() & new.isna()))
fixed = old.ne(new).fillna(True) & ~(old.isna() & new.isna())

print("current mask:")
print(current)
print("current any:", current.any(axis=1).tolist())

print("\nfixed mask:")
print(fixed)
print("fixed any:", fixed.any(axis=1).tolist())
PY

Repository: coursetable/ferry

Length of output: 431


🏁 Script executed:

cd /repo && cat -n ferry/database/sync_db_courses.py | sed -n '100,135p'

Repository: coursetable/ferry

Length of output: 116


🏁 Script executed:

cd /repo && cat -n ferry/database/sync_db_courses.py | sed -n '680,695p'

Repository: coursetable/ferry

Length of output: 116


🏁 Script executed:

find . -name "sync_db_courses.py" -type f

Repository: coursetable/ferry

Length of output: 96


🏁 Script executed:

find . -name "sync_db_courses.py" -type f -exec cat -n {} \; | sed -n '100,135p'

Repository: coursetable/ferry

Length of output: 2108


🏁 Script executed:

find . -name "sync_db_courses.py" -type f -exec cat -n {} \; | sed -n '680,695p'

Repository: coursetable/ferry

Length of output: 925


Make the per-cell diff mask NA-safe.

With pandas nullable dtypes, subset_old == subset_new returns pd.NA for null/value transitions. That propagates through the boolean expression, causing unequal_mask to contain <NA> cells. When unequal_mask.any(axis=1) is called at line 126, those NA values are treated as falsy, so rows with changes like primary_crn: <NA> → 12345 are silently dropped from the diff.

Suggested fix
-            subset_unequal = ~(
-                (subset_old == subset_new)
-                | (subset_old.isna() & subset_new.isna())
-            )
+            both_na = subset_old.isna() & subset_new.isna()
+            subset_unequal = subset_old.ne(subset_new).fillna(True) & ~both_na
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
subset_unequal = ~(
(subset_old == subset_new)
| (subset_old.isna() & subset_new.isna())
)
print(f"Old type: {old_types.iat[row, col]}")
print(f"New type: {new_types.iat[row, col]}")
print(f"Old value: {shared_rows_old.iat[row, col]}")
print(f"New value: {shared_rows_new.iat[row, col]}")
raise TypeError("Type mismatch")
unequal_mask = ~(
(shared_rows_old == shared_rows_new)
| (shared_rows_old.isna() & shared_rows_new.isna())
)
unequal_mask.loc[possibly_changed] = subset_unequal
both_na = subset_old.isna() & subset_new.isna()
subset_unequal = subset_old.ne(subset_new).fillna(True) & ~both_na
unequal_mask.loc[possibly_changed] = subset_unequal
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@ferry/database/sync_db_courses.py` around lines 120 - 124, The per-cell diff
currently lets pd.NA propagate (via subset_old == subset_new) so unequal_mask
gets <NA> values; change the logic in the block that computes subset_unequal
(using subset_old, subset_new, possibly_changed) to explicitly handle nulls—for
example compute na_mismatch = subset_old.isna() ^ subset_new.isna(), compute
value_diff = (subset_old != subset_new).fillna(False), then set subset_unequal =
value_diff | na_mismatch and assign that into
unequal_mask.loc[possibly_changed]; this ensures NA→value and value→NA
transitions are detected and yields a boolean mask.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant