
Fix LARS/LAMB optimizer support and non-contiguous tensor handling on XPU#1902

Open
jiqing-feng wants to merge 3 commits into bitsandbytes-foundation:main from jiqing-feng:xpu

Conversation


@jiqing-feng jiqing-feng commented Mar 19, 2026

Changes

  • Enable LARS/LAMB optimizers on XPU: Added the missing `"lars"` and `"lamb"` entries to the `name2optimizer_id`, `name2optimizer_32bit_fn`, and `name2optimizer_fn` dicts in the Triton backend, and to `name2optimizer_id` in the default backend. LARS maps to `MOMENTUM` (1-state) and LAMB maps to `ADAM` (2-state).

  • Fix Triton compilation error with fp16 gradients: In `_optimizer_precondition_1state_32bit`, the MOMENTUM branch's `step == 1` path performs a direct assignment, `s1_vals = g_vals`. When gradients are fp16, this changes `s1_vals` from fp32 to fp16, conflicting with the `else` branch, where the arithmetic auto-promotes to fp32. Fixed by casting `g_vals` to fp32 at the assignment site.

  • Fix non-contiguous tensor support for blockwise quantization: Triton kernels use linear offsets to access memory, which is incorrect for non-contiguous tensors. Added `.contiguous()` calls at the `quantize_blockwise` and `dequantize_blockwise` entry points.
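The registry additions in the first bullet can be sketched as follows. The dict names come from this PR's description, but the ID constants and the dict contents shown here are illustrative placeholders, not the actual bitsandbytes source:

```python
# Illustrative sketch only: the family IDs and existing entries are
# placeholders, not the real bitsandbytes backend definitions.
MOMENTUM, ADAM = 0, 1  # hypothetical 1-state / 2-state optimizer-family IDs

name2optimizer_id = {
    "momentum": MOMENTUM,
    "adam": ADAM,
    # Newly added entries:
    "lars": MOMENTUM,  # LARS reuses the 1-state momentum kernel path
    "lamb": ADAM,      # LAMB reuses the 2-state Adam kernel path
}
```

Mapping LARS onto the momentum family and LAMB onto the Adam family means no new kernels are needed, only new registry entries.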
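The dtype mismatch in the second bullet can be reproduced outside Triton. The NumPy sketch below is only an analogy for the kernel's type-promotion behavior, not the kernel itself:

```python
import numpy as np

g_vals = np.ones(4, dtype=np.float16)    # fp16 gradient values
s1_prev = np.zeros(4, dtype=np.float32)  # fp32 optimizer state

# else-branch analogue: arithmetic with an fp32 operand promotes to fp32
s1_else = s1_prev * 0.9 + g_vals
# step == 1 analogue before the fix: plain assignment keeps fp16
s1_buggy = g_vals
# the fix: cast at the assignment site so both branches agree on fp32
s1_fixed = g_vals.astype(np.float32)
```

Because Triton requires a variable to have a single type across both branches of the conditional, the fp16/fp32 disagreement is a compile-time error there rather than a silent precision loss.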
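The third bullet's issue amounts to flat linear indexing disagreeing with logical element order on a strided view. A NumPy analogue, with `np.ascontiguousarray` standing in for PyTorch's `.contiguous()`:

```python
import numpy as np

x = np.arange(6, dtype=np.float32).reshape(2, 3)
t = x.T  # non-contiguous view (transposed strides)
assert not t.flags["C_CONTIGUOUS"]

mem_order = t.ravel(order="K")  # raw memory order: what a flat-pointer kernel reads
logical = t.ravel()             # logical row-major order: what the op should read

# For a non-contiguous view the two disagree, so a linear-offset kernel
# would quantize the wrong elements per block:
assert not np.array_equal(mem_order, logical)

# Making a contiguous copy first (analogue of tensor.contiguous()) fixes it:
tc = np.ascontiguousarray(t)
assert np.array_equal(tc.ravel(order="K"), logical)
```

The `.contiguous()` call is a no-op on tensors that are already contiguous, so the fix only costs a copy in the non-contiguous case.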

Related tests:

```
pytest -k "xpu" -ra test_ops.py::TestNonContiguousInputs::test_quantize_blockwise_non_contiguous
pytest -k "xpu" -ra test_optim.py::test_optimizer32bit
```

Hi @matthewdouglas . Would you please review this PR? Thanks!

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
