
(perf): zero-allocation RUNTIME_CHECK=1 hot path #37

Merged
mgyoo86 merged 3 commits into master from improve/runtime_check on Mar 26, 2026


mgyoo86 commented Mar 26, 2026

Problem

RUNTIME_CHECK=1 allocated 48 bytes/acquire + 128 bytes per overlap check after warmup, from two sources:

  1. _check_wrapper_mutation!: wrapper::Array (type-erased from Vector{Any}) caused MemoryRef and NTuple{N,Int} boxing
  2. _check_pointer_overlap — closure capturing 6 variables into heap-allocated object

Fix

  • Boxing: getfield(arr,:ref).mem → ccall(:jl_array_ptr), prod(getfield(arr,:size)) → length(::Array{T})
  • Closure: extract into @noinline function + @generated unrolling over FIXED_SLOT_FIELDS
  • Applied to CPU, CUDA, Metal extensions
  • Added S=1 zero-alloc regression tests + @info on RUNTIME_CHECK enabled
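The closure fix can be illustrated with a minimal sketch. Everything here is hypothetical: `Slot`, `Pool`, `SLOT_FIELDS` (a stand-in for `FIXED_SLOT_FIELDS`), `slot_overlaps`, and `any_slot_overlaps` are invented for illustration and are not the package's actual definitions. The idea shown is that a `@generated` function can emit one direct call per fixed field, so no closure object is needed to carry the query arguments.

```julia
# Hypothetical stand-ins for the pool's typed slots (not the package's types).
struct Slot{T}
    data::Vector{T}
end

struct Pool
    float64::Slot{Float64}
    int64::Slot{Int64}
    bool::Slot{Bool}
end

# Hypothetical analogue of FIXED_SLOT_FIELDS.
const SLOT_FIELDS = (:float64, :int64, :bool)

# Function barrier: compiled once per slot element type and kept out of line.
# Returns true when the byte range [p, p + nbytes) intersects the slot's buffer.
@noinline function slot_overlaps(slot::Slot{T}, p::UInt, nbytes::UInt) where {T}
    isempty(slot.data) && return false
    lo = UInt(pointer(slot.data))
    hi = lo + UInt(sizeof(slot.data))
    return p < hi && lo < p + nbytes
end

# The generator runs at compile time and emits one statically typed call per
# field, so the loop over fields is unrolled and nothing is captured on the heap.
@generated function any_slot_overlaps(pool::Pool, p::UInt, nbytes::UInt)
    checks = [:(slot_overlaps(getfield(pool, $(QuoteNode(f))), p, nbytes) && return true)
              for f in SLOT_FIELDS]
    return quote
        $(checks...)
        return false
    end
end

pool = Pool(Slot(zeros(Float64, 8)), Slot(zeros(Int64, 8)), Slot(Bool[]))
p = UInt(pointer(pool.int64.data))
any_slot_overlaps(pool, p, UInt(8))   # true: the range starts inside the Int64 slot
```

Because each `getfield` uses a literal field name, every call site has a concrete slot type, which is what lets the compiler avoid dynamic dispatch in this sketch.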

Result

All S=1 patterns (single/multi-type, N-D, nested, overlap check): 0 bytes after warmup.
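A "0 bytes after warmup" claim is usually checked with `@allocated` from inside a function, after one warmup call has triggered compilation. The sketch below shows that pattern only; `fill_sum!` and `allocs_after_warmup` are hypothetical stand-ins for a pooled acquire-and-check cycle and are not taken from test/test_zero_allocation.jl.

```julia
using Test

# Hypothetical workload standing in for an acquire-and-check cycle.
function fill_sum!(buf::Vector{Float64}, x::Float64)
    fill!(buf, x)
    return sum(buf)
end

# Measure inside a function so that untyped globals do not add allocations.
function allocs_after_warmup(buf::Vector{Float64}, iters::Int)
    fill_sum!(buf, 1.0)                          # warmup call, excluded from the count
    total = 0
    for _ in 1:iters
        total += @allocated fill_sum!(buf, 1.0)  # bytes allocated by this call
    end
    return total
end

@testset "zero allocation after warmup" begin
    buf = zeros(Float64, 64)
    @test allocs_after_warmup(buf, 100) == 0
end
```

Summing over many iterations makes a small per-call allocation easier to notice than a single measurement would.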

Eliminate heap allocations in S=1 safety checks after warmup:

- _check_wrapper_mutation!: replace MemoryRef boxing (getfield(arr,:ref).mem)
  with ccall(:jl_array_ptr) pointer comparison, and NTuple boxing
  (prod(getfield(arr,:size))) with length(::Array{T}) — 48→0 bytes/acquire
- _check_pointer_overlap: extract closure into @noinline _check_tp_pointer_overlap
  and use @generated _check_all_slots_pointer_overlap for zero-allocation
  dispatch over fixed slots — 128→0 bytes/call
- Apply same fixes to CUDA and Metal extensions
- Add S=1 zero-allocation tests (single/multi-type, N-D, overlap, nested)
- Show @info on load when RUNTIME_CHECK is enabled

codecov bot commented Mar 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.65%. Comparing base (38d66a6) to head (cb5c64a).
⚠️ Report is 1 commit behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master      #37      +/-   ##
==========================================
+ Coverage   96.61%   96.65%   +0.04%     
==========================================
  Files          14       14              
  Lines        2748     2753       +5     
==========================================
+ Hits         2655     2661       +6     
+ Misses         93       92       -1     
| Files with missing lines | Coverage Δ |
|---|---|
| src/debug.jl | 95.79% <100.00%> (+0.09%) ⬆️ |
| src/types.jl | 89.06% <ø> (ø) |

... and 1 file with indirect coverage changes


Copilot AI left a comment

Pull request overview

This PR eliminates runtime allocations in the RUNTIME_CHECK=1 hot path by removing boxing sources in wrapper mutation detection and by replacing an allocating overlap-check closure with generated, per-slot unrolled dispatch.

Changes:

  • Refactors pointer-overlap checks to avoid closure capture allocations (CPU/CUDA/Metal).
  • Removes boxing in wrapper mutation checks by using jl_array_ptr and length instead of MemoryRef / prod(dims).
  • Adds S=1 (RUNTIME_CHECK=1) zero-allocation regression tests.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
|---|---|
| src/debug.jl | Refactors overlap checking to generated per-slot calls; updates wrapper-mutation checks to avoid boxing. |
| src/types.jl | Adds an informational log when RUNTIME_CHECK is enabled. |
| test/test_zero_allocation.jl | Adds new zero-allocation regression tests for S=1 (runtime-check enabled) paths. |
| ext/AdaptiveArrayPoolsCUDAExt/debug.jl | Mirrors CPU overlap-check refactor for CUDA pools to avoid do-block closure allocations. |
| ext/AdaptiveArrayPoolsMetalExt/debug.jl | Mirrors CPU overlap-check refactor for Metal pools to avoid do-block closure allocations. |

mgyoo86 added 2 commits March 26, 2026 10:44
Replace length(wrapper::Array{T}) with _wrapper_prod_size(wrapper) function
barrier that reads getfield(wrapper, :size) directly. length() does not
reflect setfield!(:size) on Julia 1.11, causing mutation detection to miss
wrapper growth beyond backing vector.
- Add wrapper::Array assertion before ccall(:jl_array_ptr) to prevent
  segfault on corrupted wrapper (safe TypeError instead)
- Reduce S=1 zero-alloc test iterations from 1000 to 100 (align with
  existing tests, reduce CI time)
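
The two ideas in these commits, a type assertion that fails safely and a function barrier that specializes on the concrete array type, can be sketched as follows. This is an illustration under assumptions, not the PR's code: `_prod_size` and `total_elements` are hypothetical names, and the sketch uses the public `size` function rather than `getfield(wrapper, :size)` so it does not depend on array internals.

```julia
# Hypothetical function barrier: compiled separately for each concrete (T, N),
# so the size tuple has a known type inside the method body.
@noinline _prod_size(arr::Array{T,N}) where {T,N} = prod(size(arr))

# Wrappers stored in a Vector{Any} are type-erased at the call site.
function total_elements(wrappers::Vector{Any})
    total = 0
    for w in wrappers
        arr = w::Array               # throws TypeError for any non-Array entry
        total += _prod_size(arr)::Int
    end
    return total
end

total_elements(Any[zeros(2, 3), zeros(Int, 4)])   # 6 + 4 elements
```

Placing the assertion before any unsafe call means a corrupted entry surfaces as a catchable `TypeError` rather than undefined behavior.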
@mgyoo86 mgyoo86 merged commit eccb87c into master Mar 26, 2026
14 checks passed
@mgyoo86 mgyoo86 deleted the improve/runtime_check branch March 26, 2026 17:59