
Enable Paged Optimizer Support for XPU #1898
Merged
matthewdouglas merged 17 commits into bitsandbytes-foundation:main from jiqing-feng:bmg on Mar 17, 2026

Conversation


@jiqing-feng jiqing-feng commented Mar 12, 2026

Summary

Add paged optimizer support for Intel XPU devices using SYCL Unified Shared Memory (USM), enabling PagedAdamW, PagedAdam, and PagedLion on XPU. This brings feature parity with CUDA's paged optimizer implementation based on cudaMallocManaged.

Changes

C++ (csrc/pythonInterface.cpp)

  • Implement cget_managed_ptr, cprefetch, cfill_fp32, cfill_uint8 for XPU using SYCL USM APIs (sycl::malloc_shared, queue.prefetch, queue.fill)

Python

  • bitsandbytes/cextension.py: Add XpuBNBNativeLibrary class to properly set ctypes return types for the new XPU symbols
  • bitsandbytes/functional.py: Make device synchronization device-agnostic (CUDA/XPU) and rename cuda_ptr to managed_ptr
  • bitsandbytes/backends/triton/ops.py: Fix device context in optimizer wrappers to use g.device instead of state1.device (paged state tensors appear as CPU tensors)
  • tests/test_optim.py: Remove XPU paged optimizer skip
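The reason a dedicated XpuBNBNativeLibrary class must set ctypes return types is a general ctypes pitfall: foreign functions default to a c_int return type, which silently truncates 64-bit pointers such as the one cget_managed_ptr returns. A minimal, self-contained illustration of the same pattern using CPython's own allocator (not the actual bitsandbytes symbols):

```python
import ctypes

# Stand-ins for a native allocator; the real PR sets restype on cget_managed_ptr.
malloc = ctypes.pythonapi.PyMem_Malloc
free = ctypes.pythonapi.PyMem_Free
malloc.argtypes = [ctypes.c_size_t]
free.argtypes = [ctypes.c_void_p]

# Without this line ctypes assumes the function returns a C int,
# which truncates a 64-bit pointer to 32 bits and corrupts it.
malloc.restype = ctypes.c_void_p

ptr = malloc(64)
assert ptr is not None  # c_void_p restype yields a full-width integer address
free(ptr)
```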

Examples (examples/xpu/)

  • paged_xpu_training.py: Real training case with LLaMA + Alpaca dataset
  • benchmark_paged_memory.py: Memory benchmark showing ~65% GPU memory reduction with paged optimizers

Test Results

Paged optimizer reduces GPU memory by 65.9% (2524 MB → 861 MB) on a ~220M parameter LLaMA model by offloading optimizer states to USM shared memory.

=====================================================================================
  RESULTS
=====================================================================================
                                             AdamW         AdamW8bit        PagedAdamW    PagedAdamW8bit
  ------------------------------  ----------------  ----------------  ----------------  ----------------
  Peak GPU Memory                        2524.7 MB         1287.4 MB          861.3 MB          867.8 MB
  Optimizer State on GPU                 1658.2 MB          421.3 MB            0.2 MB            6.8 MB
  Optimizer State on CPU (USM)              0.0 MB            0.0 MB         1658.0 MB          414.5 MB
  ------------------------------  ----------------  ----------------  ----------------  ----------------
  GPU Memory Saved vs AdamW               baseline  1237.4 MB (49.0%)  1663.5 MB (65.9%)  1657.0 MB (65.6%)
=====================================================================================
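As a sanity check, the headline 65.9% figure follows directly from the peak-memory row (the benchmark table's 1663.5 MB vs the 1663.4 MB computed here is rounding of the underlying measurements):

```python
# Peak GPU memory from the benchmark table above, in MB
adamw_peak = 2524.7
paged_adamw_peak = 861.3

saved = adamw_peak - paged_adamw_peak  # ~1663.4 MB
pct = 100 * saved / adamw_peak         # ~65.9 %
print(f"{saved:.1f} MB ({pct:.1f}%)")
```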

How to Verify

# Build with XPU backend
cmake -DCOMPUTE_BACKEND=xpu -S . && make
pip install -e .

# Run existing paged optimizer tests (previously skipped on XPU)
pytest tests/test_optim.py -k "paged"

# Run memory benchmark
python examples/xpu/benchmark_paged_memory.py

# Run training example
python examples/xpu/paged_xpu_training.py --compare

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng marked this pull request as draft March 12, 2026 05:47
@jiqing-feng jiqing-feng marked this pull request as ready for review March 12, 2026 08:40
@jiqing-feng
Contributor Author

Hi @matthewdouglas. I have enabled the paged optimizer for XPU; the XPU support legend can be updated to full support after this PR is merged. Please review it. Thanks!

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 12, 2026
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor Author

Fixed lint.

@jiqing-feng
Contributor Author

Hi @matthewdouglas. I have reverted the wrong lint fix. The other test failures look like a network issue, not related to my changes.

Comment on lines +238 to 241
# Use g.device for device context: paged state tensors appear as CPU tensors
# but are backed by USM shared memory and accessible from the accelerator.
with torch_accelerator_module.device(g.device):
optimizer_update_8bit_blockwise_impl(
Member
Thanks for explaining this with the comment here. It's a good point!
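The device-context choice discussed in this thread can be sketched with stand-in tensors (FakeTensor and pick_context_device are hypothetical names for illustration only; the actual code enters torch_accelerator_module.device(g.device)):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    device: str  # stand-in for torch.Tensor.device

def pick_context_device(g: FakeTensor, state1: FakeTensor) -> str:
    # Paged optimizer state lives in USM/managed memory: it reports "cpu"
    # even though the accelerator can access it, so the gradient's device
    # is the reliable choice for the device context.
    return g.device

grad = FakeTensor("xpu:0")
paged_state = FakeTensor("cpu")  # paged state appears as a CPU tensor
assert pick_context_device(grad, paged_state) == "xpu:0"

# Non-paged path is unaffected: state and gradient share a device,
# so g.device and state1.device agree.
regular_state = FakeTensor("xpu:0")
assert pick_context_device(grad, regular_state) == regular_state.device
```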

@matthewdouglas matthewdouglas left a comment

PR #1898 Review: Enable Paged Optimizer Support for XPU

Classification: Feature — New backend capability (XPU paged memory via SYCL USM)
CI Status: All checks passing (builds, tests, lint, wheels, audit)
Files changed: 7 (4 modified, 2 new examples, 1 test change)
+714 / -9


Summary

Brings paged optimizer support (PagedAdamW, PagedAdam, PagedLion + 8-bit variants) to XPU devices using SYCL Unified Shared Memory (USM). This mirrors the existing CUDA cudaMallocManaged-based paged memory approach. The implementation adds cget_managed_ptr, cprefetch, cfill_fp32, and cfill_uint8 for XPU in C++, wraps them with XpuBNBNativeLibrary in Python, and makes device-agnostic fixes to synchronization and stream resolution.


Findings

Nits / Non-Blocking

1. [Low] Unused B parameter in XPU fill functions
csrc/pythonInterface.cpp — cfill_fp32 and cfill_uint8 accept B but don't use it. This matches the CUDA interface signature, so it's correct; adding (void)B; would suppress unused-parameter compiler warnings.

2. [Low] Commit hygiene
17 commits with messages like "fix example", "update example", "restore", "dix lint". Recommend squash-merge.

3. [Nit] XpuBNBNativeLibrary is minimal
cextension.py:96-102 — The class only sets cget_managed_ptr.restype, same as CudaBNBNativeLibrary minus the CUDA-specific get_context and compiled_with_cuda bits. The hasattr guard is more defensive than the CUDA version — fine as-is.


What Looks Good

  • C++ implementation follows the same pattern as CUDA: sycl::malloc_shared parallels cudaMallocManaged, queue.prefetch parallels cudaMemPrefetchAsync, both use a default stream/queue (not PyTorch's) for these housekeeping ops. The cfill_uint8 memset workaround for the Max 1550 driver bug is well-documented.
  • g.device fix in triton/ops.py correctly handles paged state tensors appearing as CPU tensors. Well-commented. No regression risk for non-paged paths since state1.device == g.device in that case, confirmed by all-green CUDA CI.
  • _get_tensor_stream fallback for CPU tensors (paged states) correctly resolves to the current XPU/CUDA device stream.
  • elementwise_func synchronization is now device-agnostic (XPU or CUDA).
  • Test skip removal is appropriate — the one skip in test_optimizer32bit was the only XPU paged skip.
  • Examples are thorough — benchmark shows 65.9% GPU memory reduction, training example verifies loss convergence, and paged vs non-paged comparison shows < 0.01 loss divergence.

Downstream Impact

No public API changes. Existing paged optimizer classes are now enabled on XPU where they previously skipped. No impact on Transformers/PEFT/Accelerate/TGI/vLLM — none call paged memory functions directly.

Security

External contributor (known Intel engineer). Standard SYCL USM calls in C++, no shell/file/network ops. Example files are in examples/, not library code. No obfuscation or suspicious patterns.


Verdict

Approve — Clean feature PR that brings XPU paged optimizer parity with CUDA. CI is fully green. Recommend squash-merge for commit hygiene.

@matthewdouglas matthewdouglas added the Optimizers Issues or feature requests relating to optimizers label Mar 17, 2026
@matthewdouglas matthewdouglas merged commit ecf9ca1 into bitsandbytes-foundation:main Mar 17, 2026
139 of 140 checks passed

Labels

Intel · Optimizers — Issues or feature requests relating to optimizers
