MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks
News • Links • Demo • Conceptual Overview • Results
Getting Started • MAS-Orchestra • MASBench • Case Inspection
- [01/29/2026] We present MAS-Orchestra [Project Page | Paper | Code]
- [Project Page]
- [Paper]
- [Code]
- [Demo] (video link)
A short illustration of MAS-Orchestra (using AIME24 as an example).
Accuracy vs. cost Pareto front. MAS-Orchestra achieves Pareto-optimal performance with the highest accuracy at low cost.
MAS-Orchestra achieves state-of-the-art performance across both IID and OOD benchmarks while maintaining Pareto-optimal cost efficiency.
| Method | AIME24 (IID) | AIME25 (IID) | HotpotQA (IID) | BrowseComp+ (IID) | GPQA (OOD) |
|---|---|---|---|---|---|
| **Standalone Agents** | | | | | |
| CoTAgent | 50.00 | 45.00 | 33.56 | 1.12 | 60.54 |
| SCAgent | 57.50 | 51.67 | 35.50 | 0.75 | 62.88 |
| DebateAgent | 62.08 | 57.50 | 36.88 | 0.81 | 64.14 |
| ReflexionAgent | 60.83 | 50.42 | 36.63 | 1.00 | 62.37 |
| DeepResearchAgent | – | – | 46.44 | 8.56 | – |
| **SoTA Inference-time Orchestration** | | | | | |
| AFlow | 62.50 | 53.33 | – | – | 65.43 |
| MaAS | 32.50 | 40.83 | – | – | 40.78 |
| MAS-Zero | *No valid MAS generated with a 7B orchestrator* | | | | |
| **SoTA Public Training-time Orchestration** | | | | | |
| MAS-GPT | 58.75 | 43.33 | – | – | 63.51 |
| ToolOrchestra | 23.33 | 11.25 | 37.44 | 1.38 | 29.80 |
| **SoTA LLM as Orchestrator** | | | | | |
| GPT-5 | 55.00 | 47.72 | 25.87 | 0.50 | 59.01 |
| Claude-Sonnet-4.5 | 45.56 | 35.00 | 38.00 | 0.50 | 21.72 |
| **Ours** | | | | | |
| **MAS-Orchestra** | **66.25** | **61.25** | **49.00** | **11.00** | 65.21 |
Performance comparison across IID and OOD benchmarks. MAS-Orchestra achieves the best results on all IID tasks and remains competitive on the OOD GPQA task.
Set up the conda environment and install dependencies:

```bash
conda create -n mas-orchestra python=3.10
conda activate mas-orchestra
apt update && apt install -y wget curl

# Install verl and its pinned dependencies
cd ./verl
./install.sh
pip install --no-deps -e .
pip install ray==2.49.2 --force-reinstall
pip install protobuf==4.25.8 --force-reinstall

# API clients and answer verification
pip install together
pip install "math-verify[antlr4_13_2]"
pip install antlr4-python3-runtime==4.9.3

# Search and retrieval dependencies
pip install langchain-core langchain-together langchain-community duckduckgo-search tavily-python pydantic ddgs langchain_brightdata bs4
pip install pyserini faiss-gpu
pip install git+https://github.com/texttron/tevatron.git
```
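After installation, a quick sanity check can confirm that the pinned versions resolved correctly. This snippet is a sketch, not part of the repo; it only uses the standard library's `importlib.metadata`:

```python
import importlib.metadata as md

def check_pins(pins):
    """Map each pinned package to (expected_version, installed_version or None)."""
    report = {}
    for pkg, want in pins.items():
        try:
            report[pkg] = (want, md.version(pkg))
        except md.PackageNotFoundError:
            # Package is not installed at all.
            report[pkg] = (want, None)
    return report

# The versions pinned by the install commands above.
pins = {"ray": "2.49.2", "protobuf": "4.25.8",
        "antlr4-python3-runtime": "4.9.3"}
for pkg, (want, have) in check_pins(pins).items():
    status = "OK" if have == want else f"pinned {want}, found {have}"
    print(f"{pkg}: {status}")
```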
| Task | Model |
|---|---|
| Math (AIME) | harmony-grpo-7b-global-step-180 |
| HotpotQA | harmony-medium-grpo-7b-hotpot-global-step-250 |
| BrowseComp+ | harmony-medium-grpo-7b-browse-comp-plus-global-step-140 |
Set your API keys and launch training (BrowseComp+ example):

```bash
export OPENAI_API_KEY={YourKey}
export TOGETHER_API_KEY={YourKey}
export WANDB_API_KEY={YourKey}
LOG_FILE={YourLogFile}

python -u -m mas_r1_reasoner.main_mas_r1 \
    --config-path=configs \
    --config-name=grpo_trainer \
    data.max_prompt_length=15000 \
    data.max_validation_prompt_length=15000 \
    data.val_files=data/browse_comp/test_subset_200.parquet \
    data.train_files=data/browse_comp/train_subset_1066.parquet \
    azr.mas_r1.use_llm_judge=True \
    data.raw_data=True \
    data.train_batch_size=64 \
    actor_rollout_ref.rollout.n=32 \
    azr.mas_r1.execution_success_weight=0.0 \
    azr.mas_r1.final_answer_weight=1.0 \
    azr.mas_r1.agent.model_name=gpt-oss-120b \
    azr.mas_r1.multiply_processes=0 \
    azr.mas_r1.max_ray_workers=1 \
    azr.problem_type=harmony_medium \
    azr.mas_r1.agent.init_archive=['COT','COT_SC','Reflexion','LLM_debate','WebSearch'] \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=10 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.experiment_name=harmony_medium_grpo_7b_gpt_oss_120b_browse_comp_plus \
    "$@" 2>&1 | tee -a "$LOG_FILE"
```

MASBench is a controlled benchmark that characterizes tasks along five structural axes to rigorously study when and why multi-agent systems outperform single-agent systems.
| Axis | Definition |
|---|---|
| Depth | Length of the longest dependency chain |
| Horizon | Number of intermediate sub-tasks whose answers are needed |
| Breadth | Maximum in-degree, i.e., maximum dependencies of a sub-task |
| Parallel | Number of independent sub-task components in the task |
| Robustness | Number of sub-tasks with adversarial attacks |
The benchmark covers all five axes with axis values ranging from 2 to 12, and provides axis-specific training and test splits. The dataset is available on Hugging Face.
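The five axes can be illustrated on a toy dependency graph. The sketch below is not the official MASBench code; it assumes a task is represented as a DAG of sub-tasks and computes each axis directly from the table's definitions:

```python
# Sketch (assumed representation, not MASBench's implementation):
# nodes are sub-tasks; an edge (u, v) means sub-task v depends on u;
# `adversarial` lists the sub-tasks carrying adversarial attacks.
from collections import defaultdict

def axis_metrics(nodes, edges, adversarial=()):
    preds = defaultdict(set)  # sub-task -> its dependencies
    succs = defaultdict(set)  # sub-task -> sub-tasks that need its answer
    for u, v in edges:
        preds[v].add(u)
        succs[u].add(v)

    # Depth: number of nodes on the longest dependency chain.
    memo = {}
    def depth(n):
        if n not in memo:
            memo[n] = 1 + max((depth(p) for p in preds[n]), default=0)
        return memo[n]

    # Parallel: independent components of the undirected dependency graph.
    seen, parallel = set(), 0
    for n in nodes:
        if n in seen:
            continue
        parallel += 1
        stack = [n]
        while stack:
            x = stack.pop()
            if x not in seen:
                seen.add(x)
                stack.extend(preds[x] | succs[x])

    return {
        "depth": max((depth(n) for n in nodes), default=0),
        "horizon": sum(1 for n in nodes if succs[n]),   # intermediate sub-tasks
        "breadth": max((len(preds[n]) for n in nodes), default=0),  # max in-degree
        "parallel": parallel,
        "robustness": len(set(adversarial)),
    }

# A diamond-shaped task: a -> b, a -> c, b -> d, c -> d
m = axis_metrics(["a", "b", "c", "d"],
                 [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])
```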
Browse real, generated multi-agent designs produced by MAS-Orchestra. Each example shows the full orchestration trace: how the orchestrator decomposes a task, selects sub-agents, and aggregates their outputs.
Highlights from the case studies:
- AIME24 (Low DoM): MAS-Orchestra learns to delegate entirely to a single strong sub-agent (100% delegation after 20 training steps), dynamically selecting ReflexionAgent or DebateAgent, the best-performing standalone baselines.
- BrowseComp+ (High DoM): MAS-Orchestra generates substantially more sub-agents, invoking SearchAgent with 3–4 parallel search processes per question.
- General Pattern: MAS-Orchestra dynamically adapts to each task by proposing MAS designs that align with the underlying sub-task structure and delegating execution to the most effective agent configurations.
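The delegation pattern described above can be sketched as a toy routing policy. The `dom_score` input and the 0.5 threshold are illustrative assumptions, not MAS-Orchestra's actual learned policy; only the agent archive names come from the training command above:

```python
# Toy sketch of adaptive delegation (hypothetical scoring and threshold):
# low-DoM tasks go to one strong sub-agent, high-DoM tasks fan out to
# several parallel search agents.
ARCHIVE = ["COT", "COT_SC", "Reflexion", "LLM_debate", "WebSearch"]

def route(task, dom_score, strong_agent="Reflexion", n_search=4):
    """Return the list of sub-agent invocations for a task."""
    assert strong_agent in ARCHIVE  # only archive agents can be delegated to
    if dom_score < 0.5:
        # Low DoM (e.g., AIME24): delegate entirely to a single strong agent.
        return [strong_agent]
    # High DoM (e.g., BrowseComp+): spawn parallel search processes.
    return ["WebSearch"] * n_search

plan = route("Which 19th-century novel ...?", dom_score=0.9)
```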
- MAS-Zero: Designing Multi-Agent Systems with Zero Supervision – an inference-time self-refinement framework for automatic MAS design.
- MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems – analysis of process verification for multi-agent systems.
- SkillOrchestra: Learning to Route Agents via Skill Transfer – skill-based agent routing.
- LLM Reasoning Survey: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems.
If you find MAS-Orchestra helpful, please consider starring this repo and citing our work. We would be very grateful!
```bibtex
@misc{Ke2026MASOrchestra,
  title         = {MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
  author        = {Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
  year          = {2026},
  eprint        = {2601.14652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  note          = {Preprint; Work in Progress},
}
```

This project received help from many researchers at Salesforce AI Research. We also thank the authors of verl for their excellent contributions to the community!
Feel free to contact Zixuan Ke via email: zixuan.ke@salesforce.com




