
Enable Paged Optimizer Support for XPU #1898
Merged
matthewdouglas merged 17 commits into bitsandbytes-foundation:main from jiqing-feng:bmg on Mar 17, 2026

Conversation


@jiqing-feng jiqing-feng commented Mar 12, 2026

Summary

Add paged optimizer support for Intel XPU devices using SYCL Unified Shared Memory (USM), enabling PagedAdamW, PagedAdam, and PagedLion on XPU. This brings feature parity with CUDA's paged optimizer implementation based on cudaMallocManaged.

Changes

C++ (csrc/pythonInterface.cpp)

  • Implement cget_managed_ptr, cprefetch, cfill_fp32, cfill_uint8 for XPU using SYCL USM APIs (sycl::malloc_shared, queue.prefetch, queue.fill)

Python

  • bitsandbytes/cextension.py: Add XpuBNBNativeLibrary class to properly set ctypes return types for the new XPU symbols
  • bitsandbytes/functional.py: Make device synchronization device-agnostic (CUDA/XPU) and rename cuda_ptr to managed_ptr
  • bitsandbytes/backends/triton/ops.py: Fix device context in optimizer wrappers to use g.device instead of state1.device (paged state tensors appear as CPU tensors)
  • tests/test_optim.py: Remove XPU paged optimizer skip
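The reason a dedicated XpuBNBNativeLibrary class must set ctypes return types is a general ctypes pitfall: foreign functions default to a c_int return type, which silently truncates 64-bit pointers such as the one cget_managed_ptr returns. A minimal, self-contained illustration of the same pattern using CPython's own allocator (not the actual bitsandbytes symbols):

```python
import ctypes

# Stand-ins for a native allocator; the real PR sets restype on cget_managed_ptr.
malloc = ctypes.pythonapi.PyMem_Malloc
free = ctypes.pythonapi.PyMem_Free
malloc.argtypes = [ctypes.c_size_t]
free.argtypes = [ctypes.c_void_p]

# Without this line ctypes assumes the function returns a C int,
# which truncates a 64-bit pointer to 32 bits and corrupts it.
malloc.restype = ctypes.c_void_p

ptr = malloc(64)
assert ptr is not None  # c_void_p restype yields a full-width integer address
free(ptr)
```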

Examples (examples/xpu/)

  • paged_xpu_training.py: Real training case with LLaMA + Alpaca dataset
  • benchmark_paged_memory.py: Memory benchmark showing ~65% GPU memory reduction with paged optimizers

Test Results

Paged optimizer reduces GPU memory by 65.9% (2524 MB → 861 MB) on a ~220M parameter LLaMA model by offloading optimizer states to USM shared memory.

=====================================================================================
  RESULTS
=====================================================================================
                                             AdamW         AdamW8bit        PagedAdamW    PagedAdamW8bit
  ------------------------------  ----------------  ----------------  ----------------  ----------------
  Peak GPU Memory                        2524.7 MB         1287.4 MB          861.3 MB          867.8 MB
  Optimizer State on GPU                 1658.2 MB          421.3 MB            0.2 MB            6.8 MB
  Optimizer State on CPU (USM)              0.0 MB            0.0 MB         1658.0 MB          414.5 MB
  ------------------------------  ----------------  ----------------  ----------------  ----------------
  GPU Memory Saved vs AdamW               baseline  1237.4 MB (49.0%)  1663.5 MB (65.9%)  1657.0 MB (65.6%)
=====================================================================================
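As a sanity check, the headline 65.9% figure follows directly from the peak-memory row (the benchmark table's 1663.5 MB vs the 1663.4 MB computed here is rounding of the underlying measurements):

```python
# Peak GPU memory from the benchmark table above, in MB
adamw_peak = 2524.7
paged_adamw_peak = 861.3

saved = adamw_peak - paged_adamw_peak  # ~1663.4 MB
pct = 100 * saved / adamw_peak         # ~65.9 %
print(f"{saved:.1f} MB ({pct:.1f}%)")
```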

How to Verify

# Build with XPU backend
cmake -DCOMPUTE_BACKEND=xpu -S . && make
pip install -e .

# Run existing paged optimizer tests (previously skipped on XPU)
pytest tests/test_optim.py -k "paged"

# Run memory benchmark
python examples/xpu/benchmark_paged_memory.py

# Run training example
python examples/xpu/paged_xpu_training.py --compare

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng marked this pull request as draft March 12, 2026 05:47
@jiqing-feng jiqing-feng marked this pull request as ready for review March 12, 2026 08:40
@jiqing-feng
Contributor Author

Hi @matthewdouglas. I have enabled the paged optimizer for XPU; the XPU support legend can be updated to full support after this PR is merged. Please review it. Thanks!

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 12, 2026
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor Author

Fixed lint.

@jiqing-feng
Contributor Author

Hi @matthewdouglas. I have reverted the wrong lint fix. The other test failures look like a network issue, not related to my changes.

Comment on lines +238 to 241
# Use g.device for device context: paged state tensors appear as CPU tensors
# but are backed by USM shared memory and accessible from the accelerator.
with torch_accelerator_module.device(g.device):
optimizer_update_8bit_blockwise_impl(
Member
Thanks for explaining this with the comment here. It's a good point!
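The device-context choice discussed in this thread can be sketched with stand-in tensors (FakeTensor and pick_context_device are hypothetical names for illustration only; the actual code enters torch_accelerator_module.device(g.device)):

```python
from dataclasses import dataclass

@dataclass
class FakeTensor:
    device: str  # stand-in for torch.Tensor.device

def pick_context_device(g: FakeTensor, state1: FakeTensor) -> str:
    # Paged optimizer state lives in USM/managed memory: it reports "cpu"
    # even though the accelerator can access it, so the gradient's device
    # is the reliable choice for the device context.
    return g.device

grad = FakeTensor("xpu:0")
paged_state = FakeTensor("cpu")  # paged state appears as a CPU tensor
assert pick_context_device(grad, paged_state) == "xpu:0"

# Non-paged path is unaffected: state and gradient share a device,
# so g.device and state1.device agree.
regular_state = FakeTensor("xpu:0")
assert pick_context_device(grad, regular_state) == regular_state.device
```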

@matthewdouglas matthewdouglas left a comment

PR #1898 Review: Enable Paged Optimizer Support for XPU

Classification: Feature — New backend capability (XPU paged memory via SYCL USM)
CI Status: All checks passing (builds, tests, lint, wheels, audit)
Files changed: 7 (4 modified, 2 new examples, 1 test change)
+714 / -9


Summary

Brings paged optimizer support (PagedAdamW, PagedAdam, PagedLion + 8-bit variants) to XPU devices using SYCL Unified Shared Memory (USM). This mirrors the existing CUDA cudaMallocManaged-based paged memory approach. The implementation adds cget_managed_ptr, cprefetch, cfill_fp32, and cfill_uint8 for XPU in C++, wraps them with XpuBNBNativeLibrary in Python, and makes device-agnostic fixes to synchronization and stream resolution.


Findings

Nits / Non-Blocking

1. [Low] Unused B parameter in XPU fill functions
csrc/pythonInterface.cpp — cfill_fp32 and cfill_uint8 accept B but don't use it. This matches the CUDA interface signature, so it's correct; adding (void)B; would suppress unused-parameter compiler warnings.

2. [Low] Commit hygiene
17 commits with messages like "fix example", "update example", "restore", "dix lint". Recommend squash-merge.

3. [Nit] XpuBNBNativeLibrary is minimal
cextension.py:96-102 — The class only sets cget_managed_ptr.restype, same as CudaBNBNativeLibrary minus the CUDA-specific get_context and compiled_with_cuda bits. The hasattr guard is more defensive than the CUDA version — fine as-is.


What Looks Good

  • C++ implementation follows the same pattern as CUDA: sycl::malloc_shared parallels cudaMallocManaged, queue.prefetch parallels cudaMemPrefetchAsync, both use a default stream/queue (not PyTorch's) for these housekeeping ops. The cfill_uint8 memset workaround for the Max 1550 driver bug is well-documented.
  • g.device fix in triton/ops.py correctly handles paged state tensors appearing as CPU tensors. Well-commented. No regression risk for non-paged paths since state1.device == g.device in that case, confirmed by all-green CUDA CI.
  • _get_tensor_stream fallback for CPU tensors (paged states) correctly resolves to the current XPU/CUDA device stream.
  • elementwise_func synchronization is now device-agnostic (XPU or CUDA).
  • Test skip removal is appropriate — the one skip in test_optimizer32bit was the only XPU paged skip.
  • Examples are thorough — benchmark shows 65.9% GPU memory reduction, training example verifies loss convergence, and paged vs non-paged comparison shows < 0.01 loss divergence.

Downstream Impact

No public API changes. Existing paged optimizer classes are now enabled on XPU where they previously skipped. No impact on Transformers/PEFT/Accelerate/TGI/vLLM — none call paged memory functions directly.

Security

External contributor (known Intel engineer). Standard SYCL USM calls in C++, no shell/file/network ops. Example files are in examples/, not library code. No obfuscation or suspicious patterns.


Verdict

Approve — Clean feature PR that brings XPU paged optimizer parity with CUDA. CI is fully green. Recommend squash-merge for commit hygiene.

@matthewdouglas matthewdouglas added the Optimizers Issues or feature requests relating to optimizers label Mar 17, 2026
@matthewdouglas matthewdouglas merged commit ecf9ca1 into bitsandbytes-foundation:main Mar 17, 2026
139 of 140 checks passed

Labels

Intel · Optimizers — Issues or feature requests relating to optimizers
