fix(sft): enable max-length filtering for messages datasets by none0663 · Pull Request #1841 · THUDM/slime

none0663 · 2026-04-17T06:57:35Z

Summary

Extend SFT long-sample filtering to support messages-style prompts (list[dict]) using chat-template tokenization.
Prevent overlong samples from reaching dynamic batching and violating max_tokens_per_gpu token budget.

Why

Previously, max-length filtering mainly covered text-style prompts.
For messages datasets, overlong samples could pass through and later fail in training-time micro-batch partitioning (or cause OOM).

Test plan

python3 -m py_compile slime/slime/utils/data.py
Run SFT with --rollout-max-prompt-len and confirm log includes:
Filtered X samples longer than max_length=...
Verify no runtime failure from over-budget single samples in dynamic batching.

… comments

fix(sft): filter overlong message prompts and add prompt-length guard…

79a3c27

… comments

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(sft): enable max-length filtering for messages datasets#1841

fix(sft): enable max-length filtering for messages datasets#1841
none0663 wants to merge 1 commit intoTHUDM:mainfrom
none0663:add-sft-data-filter

none0663 commented Apr 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

none0663 commented Apr 17, 2026

Summary

Why

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant