Enable Paged Optimizer Support for XPU#1898
matthewdouglas merged 17 commits into bitsandbytes-foundation:main
Conversation
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @matthewdouglas. I have enabled the paged optimizer for XPU. The XPU legend can be updated to full support after this PR is merged. Please review it. Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Fixed lint.
Hi @matthewdouglas. I have reverted the wrong lint fix. The other failed tests look like a network issue, not related to my changes.
```python
# Use g.device for device context: paged state tensors appear as CPU tensors
# but are backed by USM shared memory and accessible from the accelerator.
with torch_accelerator_module.device(g.device):
    optimizer_update_8bit_blockwise_impl(
```
Thanks for explaining this with the comment here. It's a good point!
matthewdouglas left a comment
PR #1898 Review: Enable Paged Optimizer Support for XPU
Classification: Feature — New backend capability (XPU paged memory via SYCL USM)
CI Status: All checks passing (builds, tests, lint, wheels, audit)
Files changed: 7 (4 modified, 2 new examples, 1 test change)
+714 / -9
Summary
Brings paged optimizer support (PagedAdamW, PagedAdam, PagedLion + 8-bit variants) to XPU devices using SYCL Unified Shared Memory (USM). This mirrors the existing CUDA cudaMallocManaged-based paged memory approach. The implementation adds cget_managed_ptr, cprefetch, cfill_fp32, and cfill_uint8 for XPU in C++, wraps them with XpuBNBNativeLibrary in Python, and makes device-agnostic fixes to synchronization and stream resolution.
Findings
Nits / Non-Blocking
1. [Low] Unused B parameter in XPU fill functions
csrc/pythonInterface.cpp — cfill_fp32 and cfill_uint8 accept B but don't use it. Matches the CUDA interface signature, so it's correct. A (void)B; would suppress compiler warnings.
2. [Low] Commit hygiene
17 commits with messages like "fix example", "update example", "restore", "dix lint". Recommend squash-merge.
3. [Nit] XpuBNBNativeLibrary is minimal
cextension.py:96-102 — The class only sets cget_managed_ptr.restype, same as CudaBNBNativeLibrary minus the CUDA-specific get_context and compiled_with_cuda bits. The hasattr guard is more defensive than the CUDA version — fine as-is.
What Looks Good
- C++ implementation follows the same pattern as CUDA: `sycl::malloc_shared` parallels `cudaMallocManaged`, `queue.prefetch` parallels `cudaMemPrefetchAsync`, and both use a default stream/queue (not PyTorch's) for these housekeeping ops. The `cfill_uint8` memset workaround for the Max 1550 driver bug is well-documented.
- The `g.device` fix in triton/ops.py correctly handles paged state tensors appearing as CPU tensors. Well-commented. No regression risk for non-paged paths since `state1.device == g.device` in that case, confirmed by all-green CUDA CI.
- The `_get_tensor_stream` fallback for CPU tensors (paged states) correctly resolves to the current XPU/CUDA device stream.
- `elementwise_func` synchronization is now device-agnostic (XPU or CUDA).
- Test skip removal is appropriate — the one skip in `test_optimizer32bit` was the only XPU paged skip.
- Examples are thorough — benchmark shows 65.9% GPU memory reduction, training example verifies loss convergence, and paged vs non-paged comparison shows < 0.01 loss divergence.
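The stream-resolution fallback noted in the list above can be sketched like this. The function name and body are assumptions for illustration (the real `_get_tensor_stream` lives in bitsandbytes internals and may differ):

```python
import torch


def get_tensor_stream(t: torch.Tensor):
    """Hedged sketch of a device-agnostic stream resolver.

    Paged state tensors report device='cpu' because they live in
    USM/managed memory, so a CPU-looking tensor falls back to the
    current accelerator stream instead of raising.
    """
    if t.device.type == "cuda":
        return torch.cuda.current_stream(t.device)
    if t.device.type == "xpu":
        return torch.xpu.current_stream(t.device)
    # CPU-looking tensor: assume it is a paged/managed buffer and
    # resolve to the current accelerator's stream, if one exists.
    if torch.cuda.is_available():
        return torch.cuda.current_stream()
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.xpu.current_stream()
    return None
```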
Downstream Impact
No public API changes. Existing paged optimizer classes are now enabled on XPU where they previously skipped. No impact on Transformers/PEFT/Accelerate/TGI/vLLM — none call paged memory functions directly.
Security
External contributor (known Intel engineer). Standard SYCL USM calls in C++, no shell/file/network ops. Example files are in examples/, not library code. No obfuscation or suspicious patterns.
Verdict
Approve — Clean feature PR that brings XPU paged optimizer parity with CUDA. CI is fully green. Recommend squash-merge for commit hygiene.
Merged ecf9ca1 into bitsandbytes-foundation:main
Summary
Add paged optimizer support for Intel XPU devices using SYCL Unified Shared Memory (USM), enabling `PagedAdamW`, `PagedAdam`, and `PagedLion` on XPU. This brings feature parity with CUDA's paged optimizer implementation based on `cudaMallocManaged`.

Changes

C++ (`csrc/pythonInterface.cpp`)
- Add `cget_managed_ptr`, `cprefetch`, `cfill_fp32`, `cfill_uint8` for XPU using SYCL USM APIs (`sycl::malloc_shared`, `queue.prefetch`, `queue.fill`)

Python
- `bitsandbytes/cextension.py`: Add `XpuBNBNativeLibrary` class to properly set ctypes return types for the new XPU symbols
- `bitsandbytes/functional.py`: Make device synchronization device-agnostic (CUDA/XPU) and rename `cuda_ptr` → `managed_ptr`
- `bitsandbytes/backends/triton/ops.py`: Fix device context in optimizer wrappers to use `g.device` instead of `state1.device` (paged state tensors appear as CPU tensors)
- `tests/test_optim.py`: Remove XPU paged optimizer skip

Examples (`examples/xpu/`)
- `paged_xpu_training.py`: Real training case with LLaMA + Alpaca dataset
- `benchmark_paged_memory.py`: Memory benchmark showing ~65% GPU memory reduction with paged optimizers

Test Results
Paged optimizer reduces GPU memory by 65.9% (2524 MB → 861 MB) on a ~220M parameter LLaMA model by offloading optimizer states to USM shared memory.
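A back-of-envelope check of the numbers quoted above. The optimizer-state layout (two fp32 states per parameter, as in Adam/AdamW) and the exact parameter count are assumptions for illustration:

```python
# Sanity-check the reported 65.9% reduction (2524 MB -> 861 MB) against the
# expected size of fp32 optimizer states for a ~220M parameter model.
params = 220e6                          # ~220M parameter LLaMA model
state_mib = 2 * 4 * params / 2**20      # exp_avg + exp_avg_sq, 4 bytes each
reduction = (2524 - 861) / 2524         # reported before/after GPU memory

# The saved GPU memory (2524 - 861 = 1663 MB) is close to the estimated
# optimizer-state footprint, consistent with states moving to USM memory.
print(f"optimizer states ~{state_mib:.0f} MiB, reduction {reduction:.1%}")
```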
How to Verify