
Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800

Open
polarG wants to merge 1 commit into main from dev/hongtaozhang/fix-num_workers-usage

Conversation

Contributor

@polarG polarG commented Apr 1, 2026

Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.

This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.
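A minimal sketch of the failure mode and the fix (the helper name is illustrative, not from the PR): `multiprocessing.Pool` rejects a process count of 0, while `DataLoader` treats `num_workers=0` as "load in the main process", so the value is clamped only where it feeds `preprocess_data.py`.

```python
import multiprocessing

def preprocess_worker_count(num_workers: int) -> int:
    """Clamp the DataLoader-style worker count before passing it to
    preprocess_data.py; multiprocessing.Pool needs at least one process."""
    return max(1, num_workers)

# multiprocessing.Pool(0) raises ValueError("Number of processes must be
# at least 1") -- this is the crash the clamp avoids.
try:
    multiprocessing.Pool(processes=0)
except ValueError:
    pass

print(preprocess_worker_count(0))  # → 1
print(preprocess_worker_count(8))  # → 8
```

The original `num_workers` value is left untouched, so consumers that accept 0 (such as `DataLoader`) see no behavior change.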

@polarG polarG requested a review from a team as a code owner April 1, 2026 06:07
Copilot AI review requested due to automatic review settings April 1, 2026 06:07

Copilot AI left a comment


Pull request overview

This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).

Changes:

  • Clamp --workers for preprocess_data.py to max(1, self._args.num_workers) to avoid multiprocessing.Pool(0) failures.
  • Add inline clarification comment documenting why preprocessing clamps workers while other components may allow num_workers=0.


Comment on lines 659 to 664
  f'--tokenizer-type {self._args.tokenizer_type} '
  f'--output-prefix {os.path.join(self._args.data_home, "dataset")} '
- f'--workers {str(self._args.num_workers)} '
+ # num_workers=0 is valid for DataLoader (main process loads data),
+ # but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
+ f'--workers {max(1, self._args.num_workers)} '
  f'--vocab-file {self._vocab_path} '

Copilot AI Apr 1, 2026


--output-prefix is hardcoded to ${data_home}/dataset, but the rest of the method uses self._args.data_prefix (and exposes it as a CLI arg). If --data_prefix is customized and the dataset needs to be generated (no existing .bin/.idx), preprocessing will generate dataset_text_document.* while the code checks for {data_prefix}.*, leading to a false failure. Consider deriving --output-prefix from data_prefix (e.g., strip a _text_document suffix when present) so generation and subsequent checks are consistent.
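The suggested derivation could look like the following sketch (the helper name is hypothetical; it relies on the fact, noted above, that preprocessing appends `_text_document` to the prefix it is given):

```python
def output_prefix_from_data_prefix(data_prefix: str) -> str:
    """Derive the preprocess_data.py --output-prefix from data_prefix.

    preprocess_data.py writes {output_prefix}_text_document.{bin,idx},
    so strip that suffix when present to keep generation consistent
    with the later {data_prefix}.* existence checks.
    """
    suffix = '_text_document'
    if data_prefix.endswith(suffix):
        return data_prefix[: -len(suffix)]
    return data_prefix

print(output_prefix_from_data_prefix('/data/dataset_text_document'))  # → /data/dataset
print(output_prefix_from_data_prefix('/data/custom'))  # → /data/custom
```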

Comment on lines +661 to +663
+ # num_workers=0 is valid for DataLoader (main process loads data),
+ # but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
+ f'--workers {max(1, self._args.num_workers)} '

Copilot AI Apr 1, 2026


The new max(1, self._args.num_workers) clamp is a regression-critical behavior change (fixing the multiprocessing.Pool(0) crash), but there’s no unit test asserting that --workers passed to preprocess_data.py is at least 1 when --num_workers=0. Since this benchmark already has tests, consider adding a test that exercises _generate_dataset() with num_workers=0 and asserts the constructed preprocess command uses --workers 1.
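A hypothetical sketch of such a regression test. The real benchmark builds the command inside `_generate_dataset()`; here the flag-building step is reduced to a standalone function so the clamp can be asserted directly (the function and test names are illustrative, not from the repository):

```python
def build_workers_flag(num_workers: int) -> str:
    # num_workers=0 is valid for DataLoader, but preprocess_data.py
    # requires workers>=1 for multiprocessing.Pool.
    return f'--workers {max(1, num_workers)} '

def test_workers_clamped_when_num_workers_is_zero():
    assert build_workers_flag(0) == '--workers 1 '

def test_workers_passthrough_when_positive():
    assert build_workers_flag(4) == '--workers 4 '

test_workers_clamped_when_num_workers_is_zero()
test_workers_passthrough_when_positive()
```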
