
MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Paper · Project Page · GitHub · MASBench

Conceptual Overview

🔗 Links


🎬 Demo


A short illustration of MAS-Orchestra (AIME24 as an example).

MAS-Orchestra Demo Video

📊 Results


Pareto Front: Accuracy vs. Cost

Accuracy vs. cost Pareto front. MAS-Orchestra achieves Pareto-optimal performance with the highest accuracy at low cost.

MAS-Orchestra achieves state-of-the-art performance across both IID and OOD benchmarks while maintaining Pareto-optimal cost efficiency.
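For readers unfamiliar with the term, a Pareto-optimal method here is one that no other method beats on both axes at once (lower cost and higher accuracy). A minimal sketch of how such a front can be computed from (cost, accuracy) pairs; this is illustrative only and not code from the repo:

```python
def pareto_front(points):
    """Return the (cost, accuracy) points not dominated by any other point.

    A point is dominated if some other point has cost <= its cost AND
    accuracy >= its accuracy, with at least one inequality strict.
    """
    front = []
    for c, a in points:
        dominated = any(
            (c2 <= c and a2 >= a) and (c2 < c or a2 > a)
            for c2, a2 in points
        )
        if not dominated:
            front.append((c, a))
    return front


# Toy usage: the (3.0, 64.0) point costs more than (2.0, 65.0) yet scores
# lower, so it falls off the front.
methods = [(1.0, 60.0), (2.0, 65.0), (3.0, 64.0), (0.5, 50.0)]
front = pareto_front(methods)
```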

| Method | AIME24 (IID) | AIME25 (IID) | HotpotQA (IID) | BrowseComp+ (IID) | GPQA (OOD) |
|---|---|---|---|---|---|
| **Standalone Agents** | | | | | |
| CoTAgent | 50.00 | 45.00 | 33.56 | 1.12 | 60.54 |
| SCAgent | 57.50 | 51.67 | 35.50 | 0.75 | 62.88 |
| DebateAgent | 62.08 | 57.50 | 36.88 | 0.81 | 64.14 |
| ReflexionAgent | 60.83 | 50.42 | 36.63 | 1.00 | 62.37 |
| DeepResearchAgent | — | — | 46.44 | 8.56 | — |
| **SoTA Inference-time Orchestration** | | | | | |
| AFlow | 62.50 | 53.33 | — | — | 65.43 |
| MaAS | 32.50 | 40.83 | — | — | 40.78 |
| MAS-Zero | No valid MAS generated with the 7B orchestrator | | | | |
| **SoTA Public Training-time Orchestration** | | | | | |
| MAS-GPT | 58.75 | 43.33 | — | — | 63.51 |
| ToolOrchestra | 23.33 | 11.25 | 37.44 | 1.38 | 29.80 |
| **SoTA LLM as Orchestrator** | | | | | |
| GPT-5 | 55.00 | 47.72 | 25.87 | 0.50 | 59.01 |
| Claude-Sonnet-4.5 | 45.56 | 35.00 | 38.00 | 0.50 | 21.72 |
| **Ours** | | | | | |
| MAS-Orchestra | 66.25 | 61.25 | 49.00 | 11.00 | 65.21 |

Performance comparison across IID and OOD benchmarks. MAS-Orchestra achieves the best results on all IID tasks and remains competitive on the OOD task.

✨ Getting Started


🎄 Environment Setup

conda create -n mas-orchestra python=3.10
conda activate mas-orchestra

apt update && apt install -y wget curl

cd ./verl
./install.sh
pip install --no-deps -e .
pip install ray==2.49.2 --force-reinstall
pip install protobuf==4.25.8 --force-reinstall
pip install together
pip install math-verify[antlr4_13_2]
pip install antlr4-python3-runtime==4.9.3

pip install langchain-core langchain-together langchain-community duckduckgo-search tavily-python pydantic ddgs langchain_brightdata bs4
pip install pyserini faiss-gpu
pip install git+https://github.com/texttron/tevatron.git

📦 (Optional) Download Trained Orchestrators

| Task | Model |
|---|---|
| Math (AIME) | harmony-grpo-7b-global-step-180 |
| HotpotQA | harmony-medium-grpo-7b-hotpot-global-step-250 |
| BrowseComp+ | harmony-medium-grpo-7b-browse-comp-plus-global-step-140 |

πŸ‹οΈ MAS-Orchestra


β™ŸοΈ Example Training Script

export OPENAI_API_KEY={YourKey}
export TOGETHER_API_KEY={YourKey}
export WANDB_API_KEY={YourKey}
LOG_FILE={YourLogFile}

python -u -m mas_r1_reasoner.main_mas_r1 \
    --config-path=configs \
    --config-name=grpo_trainer \
    data.max_prompt_length=15000 \
    data.max_validation_prompt_length=15000 \
    data.val_files=data/browse_comp/test_subset_200.parquet \
    data.train_files=data/browse_comp/train_subset_1066.parquet \
    azr.mas_r1.use_llm_judge=True \
    data.raw_data=True \
    data.train_batch_size=64 \
    actor_rollout_ref.rollout.n=32 \
    azr.mas_r1.execution_success_weight=0.0 \
    azr.mas_r1.final_answer_weight=1.0 \
    azr.mas_r1.agent.model_name=gpt-oss-120b \
    azr.mas_r1.multiply_processes=0 \
    azr.mas_r1.max_ray_workers=1 \
    azr.problem_type=harmony_medium \
    azr.mas_r1.agent.init_archive=['COT','COT_SC','Reflexion','LLM_debate','WebSearch'] \
    trainer.val_before_train=True \
    trainer.test_freq=5 \
    trainer.save_freq=10 \
    actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
    trainer.experiment_name=harmony_medium_grpo_7b_gpt_oss_120b_browse_comp_plus \
    $@ 2>&1 | tee -a "$LOG_FILE"

πŸ“ MASBench


MASBench is a controlled benchmark that characterizes tasks along five structural axes to rigorously study when and why multi-agent systems outperform single-agent systems.

A Five-Axis Evaluation Framework

Five-Axis Evaluation Framework

| Axis | Definition |
|---|---|
| Depth | Length of the longest dependency chain |
| Horizon | Number of intermediate sub-tasks whose answers are needed |
| Breadth | Maximum in-degree, i.e., the maximum number of dependencies of any sub-task |
| Parallel | Number of independent sub-task components in the task |
| Robustness | Number of sub-tasks with adversarial attacks |
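As an illustrative sketch (not the official MASBench tooling), four of the five axes can be computed directly from a sub-task dependency DAG; Horizon is omitted here because it depends on which intermediate answers the final task actually needs. The `deps` encoding and function name below are assumptions made for this example:

```python
def compute_axes(deps, adversarial=()):
    """Compute structural axes of a task from its sub-task dependency DAG.

    deps: dict mapping each sub-task to the list of sub-tasks it depends on.
    adversarial: iterable of sub-tasks carrying an adversarial attack.
    """
    nodes = set(deps) | {p for ps in deps.values() for p in ps}

    # Depth: longest dependency chain, counted in sub-tasks.
    memo = {}
    def chain(n):
        if n not in memo:
            memo[n] = 1 + max((chain(p) for p in deps.get(n, [])), default=0)
        return memo[n]
    depth = max(chain(n) for n in nodes)

    # Breadth: maximum in-degree, i.e. most dependencies of any sub-task.
    breadth = max((len(ps) for ps in deps.values()), default=0)

    # Parallel: number of independent components (union-find over edges).
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n
    for n, ps in deps.items():
        for p in ps:
            parent[find(p)] = find(n)
    parallel = len({find(n) for n in nodes})

    # Robustness: number of sub-tasks under adversarial attack.
    robustness = len(set(adversarial) & nodes)

    return {"depth": depth, "breadth": breadth,
            "parallel": parallel, "robustness": robustness}


# Toy task: C needs A and B, B needs A; E needs D (an independent component).
axes = compute_axes({"C": ["A", "B"], "B": ["A"], "E": ["D"]},
                    adversarial=["B"])
```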

Benchmark Statistics

MASBench Statistics

The benchmark covers all five axes with axis values ranging from 2 to 12, and provides axis-specific training and test splits. The dataset is available on Hugging Face.

πŸ” Case Inspection


Browse real, generated multi-agent designs produced by MAS-Orchestra. Each example shows the full orchestration trace: how the orchestrator decomposes a task, selects sub-agents, and aggregates their outputs.

Case Teaser

Browse Examples

Highlights from the case studies:

  • AIME24 (Low DoM): MAS-Orchestra learns to delegate entirely to a single strong sub-agent (100% delegation after 20 training steps), dynamically selecting ReflexionAgent or DebateAgent, the best-performing standalone baselines.
  • BrowseComp+ (High DoM): MAS-Orchestra generates substantially more sub-agents, invoking SearchAgent with 3–4 parallel search processes per question.
  • General Pattern: MAS-Orchestra dynamically adapts to each task by proposing MAS designs that align with the underlying sub-task structure and delegating execution to the most effective agent configurations.
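As a toy illustration of the delegation pattern above (the `propose_design` function, its task-profile input, and the fixed fan-out of four are all invented for this sketch; only the agent roles mirror the case studies):

```python
def propose_design(task_profile):
    """Toy routing rule mirroring the observed delegation pattern.

    task_profile: dict with a 'dom' key ('low' or 'high') and, for low-DoM
    tasks, a 'best_standalone' key naming the strongest single agent.
    Returns (sub_agents, n_parallel).
    """
    if task_profile["dom"] == "low":
        # Low-DoM tasks (e.g. AIME24): delegate fully to one strong
        # sub-agent, such as a Reflexion- or debate-style agent.
        return ([task_profile["best_standalone"]], 1)
    # High-DoM tasks (e.g. BrowseComp+): fan out parallel search sub-agents.
    return (["WebSearch"] * 4, 4)
```

The point of the sketch is only the branching structure: the learned orchestrator conditions its MAS design on task structure rather than emitting one fixed pipeline.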

🤖 Check Out Our MAS Series


  • MAS-Zero: Designing Multi-Agent Systems with Zero Supervision, an inference-time self-refinement framework for automatic MAS design.
  • MAS-ProVe: Understanding the Process Verification of Multi-Agent Systems, an analysis of process verification for multi-agent systems.
  • SkillOrchestra: Learning to Route Agents via Skill Transfer, skill-based agent routing.
  • LLM Reasoning Survey: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems.

🎈 Citation


If you find MAS-Orchestra helpful, please consider starring this repo and citing our work. We would be very grateful!

@misc{Ke2026MASOrchestra,
  title         = {MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks},
  author        = {Zixuan Ke and Yifei Ming and Austin Xu and Ryan Chin and Xuan-Phi Nguyen and Prathyusha Jwalapuram and Semih Yavuz and Caiming Xiong and Shafiq Joty},
  year          = {2026},
  eprint        = {2601.14652},
  archivePrefix = {arXiv},
  primaryClass  = {cs.AI},
  note          = {Preprint; Work in Progress},
}

🌻 Acknowledgement


This project received help from many researchers at Salesforce AI Research. We also thank the authors of verl for their excellent contributions to the community!

📧 Contact


Feel free to contact Zixuan Ke via email: zixuan.ke@salesforce.com
