MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
News • Links • Demo • Conceptual Overview • Results
Getting Started • MAS-Orchestra • MASBench • Case Inspection
- [01/29/2026] We present MAS-Orchestra [Project Page | Paper | Code]
- [Project Page]
- [Paper]
- [Code]
- [Demo] (video link)
A short illustration of MAS-Orchestra (using AIME24 as an example).
Accuracy vs. cost Pareto front. MAS-Orchestra achieves Pareto-optimal performance with the highest accuracy at low cost.
MAS-Orchestra achieves state-of-the-art performance across both IID and OOD benchmarks while maintaining Pareto-optimal cost efficiency.
| Method | AIME24 (IID) | AIME25 (IID) | HotpotQA (IID) | BrowseComp+ (IID) | GPQA (OOD) |
|---|---|---|---|---|---|
| **Standalone Agents** | | | | | |
| CoTAgent | 50.00 | 45.00 | 33.56 | 1.12 | 60.54 |
| SCAgent | 57.50 | 51.67 | 35.50 | 0.75 | 62.88 |
| DebateAgent | 62.08 | 57.50 | 36.88 | 0.81 | 64.14 |
| ReflexionAgent | 60.83 | 50.42 | 36.63 | 1.00 | 62.37 |
| DeepResearchAgent | – | – | 46.44 | 8.56 | – |
| **SoTA Inference-time Orchestration** | | | | | |
| AFlow | 62.50 | 53.33 | – | – | 65.43 |
| MaAS | 32.50 | 40.83 | – | – | 40.78 |
| MAS-Zero | *No valid MAS generated with a 7B orchestrator* | | | | |
| **SoTA Public Training-time Orchestration** | | | | | |
| MAS-GPT | 58.75 | 43.33 | – | – | 63.51 |
| ToolOrchestra | 23.33 | 11.25 | 37.44 | 1.38 | 29.80 |
| **SoTA LLM as Orchestrator** | | | | | |
| GPT-5 | 55.00 | 47.72 | 25.87 | 0.50 | 59.01 |
| Claude-Sonnet-4.5 | 45.56 | 35.00 | 38.00 | 0.50 | 21.72 |
| **Ours** | | | | | |
| **MAS-Orchestra** | **66.25** | **61.25** | **49.00** | **11.00** | 65.21 |
Performance comparison across IID and OOD benchmarks. MAS-Orchestra achieves the best results on all IID tasks and remains competitive on the OOD GPQA task.
Set up the conda environment and install dependencies:

```bash
conda create -n mas-orchestra python=3.10
conda activate mas-orchestra
apt update && apt install -y wget curl

# Install verl and its pinned dependencies
cd ./verl
./install.sh
pip install --no-deps -e .
pip install ray==2.49.2 --force-reinstall
pip install protobuf==4.25.8 --force-reinstall

# API clients and answer verification
pip install together
pip install "math-verify[antlr4_13_2]"
pip install antlr4-python3-runtime==4.9.3

# Search and retrieval dependencies
pip install langchain-core langchain-together langchain-community duckduckgo-search tavily-python pydantic ddgs langchain_brightdata bs4
pip install pyserini faiss-gpu
pip install git+https://github.com/texttron/tevatron.git
```
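After installation, a quick sanity check can confirm that the pinned versions resolved correctly. This snippet is a sketch, not part of the repo; it only uses the standard library's `importlib.metadata`:

```python
import importlib.metadata as md

def check_pins(pins):
    """Map each pinned package to (expected_version, installed_version or None)."""
    report = {}
    for pkg, want in pins.items():
        try:
            report[pkg] = (want, md.version(pkg))
        except md.PackageNotFoundError:
            # Package is not installed at all.
            report[pkg] = (want, None)
    return report

# The versions pinned by the install commands above.
pins = {"ray": "2.49.2", "protobuf": "4.25.8",
        "antlr4-python3-runtime": "4.9.3"}
for pkg, (want, have) in check_pins(pins).items():
    status = "OK" if have == want else f"pinned {want}, found {have}"
    print(f"{pkg}: {status}")
```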
| Task | Model |
|---|---|
| Math (AIME) | harmony-grpo-7b-global-step-180 |
| HotpotQA | harmony-medium-grpo-7b-hotpot-global-step-250 |
| BrowseComp+ | harmony-medium-grpo-7b-browse-comp-plus-global-step-140 |
Set your API keys and launch training (BrowseComp+ example):

```bash
export OPENAI_API_KEY={YourKey}
export TOGETHER_API_KEY={YourKey}
export WANDB_API_KEY={YourKey}
LOG_FILE={YourLogFile}

python -u -m mas_r1_reasoner.main_mas_r1 \
    --config-path=configs \
    --config-name=grpo_trainer \
    data.max_prompt_length=15000 \
    data.max_validation_prompt_length=15000 \
    data.val_files=data/browse_comp/test_subset_200.parquet \
    data.train_files=data/browse_comp/train_subset_1066.parquet \
    azr.mas_r1.use_llm_judge=True \
    data.raw_data=True \
    data.train_batch_size=64 \
    actor_rollout_ref.rollout.n=32 \
    azr.mas_r1.execution_success_weight=0.0 \
    azr.mas_r1.final_answer_weight=1.0 \
    azr.mas_r1.agent.model_name=gpt-oss-120b \
    azr.mas_r1.multiply_processes=0 \
    azr.mas_r1.max_ray_workers=1 \
    azr.problem_type=harmony_medium \
    azr.mas_r1.agent.init_archive=['COT','COT_SC','Reflexion','LLM_debate','WebSearch'] \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=10 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.experiment_name=harmony_medium_grpo_7b_gpt_oss_120b_browse_comp_plus \
    "$@" 2>&1 | tee -a "$LOG_FILE"
```

MASBench is a controlled benchmark that characterizes tasks along five structural axes to rigorously study when and why multi-agent systems outperform single-agent systems.
| Axis | Definition |
|---|---|
| Depth | Length of the longest dependency chain |
| Horizon | Number of intermediate sub-tasks whose answers are needed |
| Breadth | Maximum in-degree, i.e., maximum dependencies of a sub-task |
| Parallel | Number of independent sub-task components in the task |
| Robustness | Number of sub-tasks with adversarial attacks |
The benchmark covers all five axes with axis values ranging from 2 to 12, and provides axis-specific training and test splits. The dataset is available on Hugging Face.
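The five axes can be illustrated on a toy dependency graph. The sketch below is not the official MASBench code; it assumes a task is represented as a DAG of sub-tasks and computes each axis directly from the table's definitions:

```python
# Sketch (assumed representation, not MASBench's implementation):
# nodes are sub-tasks; an edge (u, v) means sub-task v depends on u;
# `adversarial` lists the sub-tasks carrying adversarial attacks.
from collections import defaultdict

def axis_metrics(nodes, edges, adversarial=()):
    preds = defaultdict(set)  # sub-task -> its dependencies
    succs = defaultdict(set)  # sub-task -> sub-tasks that need its answer
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # Depth: number of nodes on the longest dependency chain.
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = 1 + max((depth(p) for p in preds[n]), default=0)
        return memo[n]

    # Parallel: independent components of the undirected dependency graph.
    seen, parallel = set(), 0
    for n in nodes:
        if n in seen:
            continue
        parallel += 1
        stack = [n]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(preds[x] | succs[x])

    return {
        "depth": max((depth(n) for n in nodes), default=0),
        "horizon": sum(1 for n in nodes if succs[n]),   # intermediate sub-tasks
        "breadth": max((len(preds[n]) for n in nodes), default=0),  # max in-degree
        "parallel": parallel,
        "robustness": len(set(adversarial)),
    }

# A diamond-shaped task: a -> b, a -> c, b -> d, c -> d
m = axis_metrics(["a", "b", "c", "d"],
                 [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
```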
Browse real, generated multi-agent designs produced by MAS-Orchestra. Each example shows the full orchestration trace: how the orchestrator decomposes a task, selects sub-agents, and aggregates their outputs.
Highlights from the case studies:
- AIME24 (Low DoM): MAS-Orchestra learns to delegate entirely to a single strong sub-agent (100% delegation after 20 training steps), dynamically selecting ReflexionAgent or DebateAgent, the best-performing standalone baselines.
- BrowseComp+ (High DoM): MAS-Orchestra generates substantially more sub-agents, invoking SearchAgent with 3–4 parallel search processes per question.
- General Pattern: MAS-Orchestra dynamically adapts to each task by proposing MAS designs that align with the underlying sub-task structure and delegating execution to the most effective agent configurations.
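The delegation pattern described above can be sketched as a toy routing policy. The `dom_score` input and the 0.5 threshold are illustrative assumptions, not MAS-Orchestra's actual learned policy; only the agent archive names come from the training command above:

```python
# Toy sketch of adaptive delegation (hypothetical scoring and threshold):
# low-DoM tasks go to one strong sub-agent, high-DoM tasks fan out to
# several parallel search agents.
ARCHIVE = ["COT", "COT_SC", "Reflexion", "LLM_debate", "WebSearch"]

def route(task, dom_score, strong_agent="Reflexion", n_search=4):
    """Return the list of sub-agent invocations for a task."""
    assert strong_agent in ARCHIVE  # only archive agents can be delegated to
    if dom_score < 0.5:
        # Low DoM (e.g., AIME24): delegate entirely to a single strong agent.
        return [strong_agent]
    # High DoM (e.g., BrowseComp+): spawn parallel search processes.
    return ["WebSearch"] * n_search

plan = route("Which 19th-century novel ...?", dom_score=0.9)
```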
- MAS-Zero: Designing Multi-Agent Systems with Zero Supervision – an inference-time self-refinement framework for automatic MAS design.
- MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems – analysis of process verification for multi-agent systems.
- SkillOrchestra: Learning to Route Agents via Skill Transfer – skill-based agent routing.
- LLM Reasoning Survey: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems.
If you find MAS-Orchestra helpful, please consider starring this repo and citing our work. We would be very grateful!
```bibtex
@misc{Ke2026MASOrchestra,
  title         = {MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
  author        = {Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
  year          = {2026},
  eprint        = {2601.14652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  note          = {Preprint; Work in Progress},
}
```

This project received help from many researchers at Salesforce AI Research. We also thank the authors of verl for their excellent contributions to the community!
Feel free to contact Zixuan Ke via email: zixuan.ke@salesforce.com




