Enhancement - Fix preprocessing crash when num_workers=0 in Megatron GPT3 dataset generation#800
Conversation
Pull request overview
This PR fixes a crash in the Megatron GPT-3 dataset preprocessing path when --num_workers=0 by ensuring the worker count passed to Megatron’s tools/preprocess_data.py is at least 1, while keeping num_workers unchanged for other consumers (e.g., PyTorch DataLoader).
Changes:
- Clamp `--workers` for `preprocess_data.py` to `max(1, self._args.num_workers)` to avoid `multiprocessing.Pool(0)` failures.
- Add an inline comment documenting why preprocessing clamps workers while other components may allow `num_workers=0`.
```diff
  f'--tokenizer-type {self._args.tokenizer_type} '
  f'--output-prefix {os.path.join(self._args.data_home, "dataset")} '
- f'--workers {str(self._args.num_workers)} '
+ # num_workers=0 is valid for DataLoader (main process loads data),
+ # but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
+ f'--workers {max(1, self._args.num_workers)} '
  f'--vocab-file {self._vocab_path} '
```
--output-prefix is hardcoded to ${data_home}/dataset, but the rest of the method uses self._args.data_prefix (and exposes it as a CLI arg). If --data_prefix is customized and the dataset needs to be generated (no existing .bin/.idx), preprocessing will generate dataset_text_document.* while the code checks for {data_prefix}.*, leading to a false failure. Consider deriving --output-prefix from data_prefix (e.g., strip a _text_document suffix when present) so generation and subsequent checks are consistent.
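A minimal sketch of the derivation the reviewer suggests; `derive_output_prefix` is a hypothetical helper (not in the PR), assuming Megatron's convention of appending `_text_document` to the output prefix:

```python
def derive_output_prefix(data_prefix: str) -> str:
    """Derive --output-prefix from data_prefix so that generated files
    (dataset files named {prefix}_text_document.*) match what the
    benchmark later checks for, even when --data_prefix is customized."""
    suffix = "_text_document"
    if data_prefix.endswith(suffix):
        # Strip the suffix Megatron appends, so preprocessing recreates it.
        return data_prefix[: -len(suffix)]
    return data_prefix

print(derive_output_prefix("/data/home/dataset_text_document"))  # /data/home/dataset
print(derive_output_prefix("/data/home/custom"))                 # /data/home/custom
```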
```python
# num_workers=0 is valid for DataLoader (main process loads data),
# but preprocess_data.py requires workers>=1 for multiprocessing.Pool.
f'--workers {max(1, self._args.num_workers)} '
```
The new max(1, self._args.num_workers) clamp is a regression-critical behavior change (fixing the multiprocessing.Pool(0) crash), but there’s no unit test asserting that --workers passed to preprocess_data.py is at least 1 when --num_workers=0. Since this benchmark already has tests, consider adding a test that exercises _generate_dataset() with num_workers=0 and asserts the constructed preprocess command uses --workers 1.
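A standalone sketch of the suggested regression test; `build_preprocess_command` is a hypothetical stand-in for the command construction inside `_generate_dataset()`, which a real test would exercise on the benchmark instance with execution mocked out:

```python
def build_preprocess_command(num_workers: int) -> str:
    # Mirrors the fix: clamp to at least one worker for preprocess_data.py.
    return f"python tools/preprocess_data.py --workers {max(1, num_workers)}"

def test_workers_clamped_when_zero():
    assert "--workers 1" in build_preprocess_command(0)

def test_workers_preserved_when_positive():
    assert "--workers 8" in build_preprocess_command(8)

test_workers_clamped_when_zero()
test_workers_preserved_when_positive()
print("tests passed")
```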
Description
preprocess_data.py uses multiprocessing.Pool(workers), which requires workers >= 1. When num_workers is set to 0 (valid for DataLoader, where it means "load in main process"), the preprocessing step crashes.
This change clamps the worker count to max(1, self._args.num_workers) before passing it to preprocess_data.py, while leaving the original num_workers value unchanged for other uses like DataLoader.
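The failure mode and the clamp can be reproduced in isolation (a sketch, independent of the benchmark code):

```python
import multiprocessing

def main():
    num_workers = 0  # valid for a DataLoader, but not for multiprocessing.Pool

    # Pool(0) raises ValueError ("Number of processes must be at least 1"),
    # which is the crash preprocess_data.py hits with --workers 0.
    try:
        multiprocessing.Pool(processes=num_workers)
    except ValueError as exc:
        print(f"Pool({num_workers}) rejected: {exc}")

    # The fix: clamp only the value handed to preprocessing.
    workers = max(1, num_workers)
    with multiprocessing.Pool(processes=workers) as pool:
        print(pool.map(abs, [-3, -2, -1]))  # [3, 2, 1]

if __name__ == "__main__":
    main()
```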