Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Conversation
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp `--workers` for `preprocess_data.py` to `max(1, self._args.num_workers)` to avoid `multiprocessing.Pool(0)` failures.
- Add an inline comment documenting why preprocessing clamps workers while other components may allow `num_workers=0`.
```diff
  f'--tokenizer-type {self._args.tokenizer_type} '
  f'--output-prefix {os.path.join(self._args.data_home, "dataset")} '
- f'--workers {str(self._args.num_workers)} '
+ # num_workers=0 is valid for DataLoader (main process loads data),
+ # but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
+ f'--workers {max(1, self._args.num_workers)} '
  f'--vocab-file {self._vocab_path} '
```
--output-prefix is hardcoded to ${data_home}/dataset, but the rest of the method uses self._args.data_prefix (and exposes it as a CLI arg). If --data_prefix is customized and the dataset needs to be generated (no existing .bin/.idx), preprocessing will generate dataset_text_document.* while the code checks for {data_prefix}.*, leading to a false failure. Consider deriving --output-prefix from data_prefix (e.g., strip a _text_document suffix when present) so generation and subsequent checks are consistent.
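A minimal sketch of the derivation the reviewer suggests; `derive_output_prefix` is a hypothetical helper (not in the PR), assuming Megatron's convention of appending `_text_document` to the output prefix:

```python
def derive_output_prefix(data_prefix: str) -> str:
    """Derive --output-prefix from data_prefix so that generated files
    (dataset files named {prefix}_text_document.*) match what the
    benchmark later checks for, even when --data_prefix is customized."""
    suffix = "_text_document"
    if data_prefix.endswith(suffix):
        # Strip the suffix Megatron appends, so preprocessing recreates it.
        return data_prefix[: -len(suffix)]
    return data_prefix

print(derive_output_prefix("/data/home/dataset_text_document"))  # /data/home/dataset
print(derive_output_prefix("/data/home/custom"))                 # /data/home/custom
```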
```python
# num_workers=0 is valid for DataLoader (main process loads data),
# but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
f'--workers {max(1, self._args.num_workers)} '
```
The new max(1, self._args.num_workers) clamp is a regression-critical behavior change (fixing the multiprocessing.Pool(0) crash), but there’s no unit test asserting that --workers passed to preprocess_data.py is at least 1 when --num_workers=0. Since this benchmark already has tests, consider adding a test that exercises _generate_dataset() with num_workers=0 and asserts the constructed preprocess command uses --workers 1.
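A standalone sketch of the suggested regression test; `build_preprocess_command` is a hypothetical stand-in for the command construction inside `_generate_dataset()`, which a real test would exercise on the benchmark instance with execution mocked out:

```python
def build_preprocess_command(num_workers: int) -> str:
    # Mirrors the fix: clamp to at least one worker for preprocess_data.py.
    return f"python tools/preprocess_data.py --workers {max(1, num_workers)}"

def test_workers_clamped_when_zero():
    assert "--workers 1" in build_preprocess_command(0)

def test_workers_preserved_when_positive():
    assert "--workers 8" in build_preprocess_command(8)

test_workers_clamped_when_zero()
test_workers_preserved_when_positive()
print("tests passed")
```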
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.
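The failure mode and the clamp can be reproduced in isolation (a sketch, independent of the benchmark code):

```python
import multiprocessing

def main():
    num_workers = 0  # valid for a DataLoader, but not for multiprocessing.Pool

    # Pool(0) raises ValueError ("Number of processes must be at least 1"),
    # which is the crash preprocess_data.py hits with --workers 0.
    try:
        multiprocessing.Pool(processes=num_workers)
    except ValueError as exc:
        print(f"Pool({num_workers}) rejected: {exc}")

    # The fix: clamp only the value handed to preprocessing.
    workers = max(1, num_workers)
    with multiprocessing.Pool(processes=workers) as pool:
        print(pool.map(abs, [-3, -2, -1]))  # [3, 2, 1]

if __name__ == "__main__":
    main()
```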