perf: Element-wise comparison only for tolerance-requiring data types#26
perf: Element-wise comparison only for tolerance-requiring data types#26Marius Merkle (MariusMerkleQC) wants to merge 8 commits intobenchmarkfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## benchmark #26 +/- ##
===========================================
Coverage 100.00% 100.00%
===========================================
Files 10 10
Lines 758 780 +22
===========================================
+ Hits 758 780 +22 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
This PR optimizes condition_equal_columns for nested list/array columns by avoiding the expensive element-wise comparison path when tolerances/special handling aren’t needed, and updates the performance benchmark accordingly.
Changes:
- Add
_needs_element_wise_comparison()(plus helpers) to decide when list/array columns require element-wise comparison. - Shortcut list/array comparisons to
eq_missing()when element-wise handling is deemed unnecessary. - Update the performance test to assert comparable performance for
list<i64>comparisons.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
diffly/_conditions.py |
Introduces dtype-based gating to skip element-wise list/array comparisons unless tolerances/special handling are needed. |
tests/test_performance.py |
Updates benchmark expectations to ensure the optimized path is not significantly slower than direct eq_missing(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Otherwise, we simply compare all fields independently | ||
| return pl.all_horizontal( | ||
| [ | ||
| _compare_columns( | ||
| col_left=col_left.struct[field], | ||
| col_right=col_right.struct[field], | ||
| dtype_left=fields_left[field], | ||
| dtype_right=fields_right[field], | ||
| max_list_length=max_list_length, | ||
| abs_tol=abs_tol, | ||
| rel_tol=rel_tol, | ||
| abs_tol_temporal=abs_tol_temporal, | ||
| ) | ||
| for field in fields_left | ||
| ] | ||
| ) |
There was a problem hiding this comment.
Should we also shortcut the comparison here?
# Otherwise, we simply compare all fields independently
if _needs_element_wise_comparison(dtype_left, dtype_right):
return pl.all_horizontal(
[
_compare_columns(
col_left=col_left.struct[field],
col_right=col_right.struct[field],
dtype_left=fields_left[field],
dtype_right=fields_right[field],
max_list_length=max_list_length,
abs_tol=abs_tol,
rel_tol=rel_tol,
abs_tol_temporal=abs_tol_temporal,
)
for field in fields_left
]
)
return col_left.eq_missing(col_right)There was a problem hiding this comment.
Ok, I added a performance test in #25, which already passes without the above optimization, so we should definitely not add it.
Motivation
See this comment.
Changes
Introduce a function
_needs_element_wise_comparisonthat checks whether element-wise comparison needs to be performed; this is the case for(1) float vs numeric columns -> absolute and relative tolerances apply (->
_is_float_numeric_pair())(2) temporal columns -> absolute temporal tolerance applies (->
_is_temporal_pair())In all other cases, naive comparison suffices, and this shortcut is taken if the above helper returns
False. This avoids the expensive_compare_sequence_columns(). The performance improvement can be seen in the benchmark test.