Skip to content

fix(sft): enable max-length filtering for messages datasets#1841

Open
none0663 wants to merge 1 commit intoTHUDM:mainfrom
none0663:add-sft-data-filter
Open

fix(sft): enable max-length filtering for messages datasets#1841
none0663 wants to merge 1 commit intoTHUDM:mainfrom
none0663:add-sft-data-filter

Conversation

@none0663
Copy link
Copy Markdown
Contributor

Summary

  • Extend SFT long-sample filtering to support messages-style prompts (list[dict]) using chat-template tokenization.
  • Prevent overlong samples from reaching dynamic batching and violating max_tokens_per_gpu token budget.

Why

Previously, max-length filtering mainly covered text-style prompts.
For messages datasets, overlong samples could pass through and later fail in training-time micro-batch partitioning (or cause OOM).

Test plan

  • python3 -m py_compile slime/slime/utils/data.py
  • Run SFT with --rollout-max-prompt-len and confirm log includes:
    Filtered X samples longer than max_length=...
  • Verify no runtime failure from over-budget single samples in dynamic batching.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant