[WIP][experimental] add agentic trace replay benchmark infrastructure by cquil11 · Pull Request #993 · SemiAnalysisAI/InferenceX

cquil11 · 2026-04-01T20:25:38Z

Trace replay benchmarking for agentic coding workloads using real Claude Code traces. Includes:

Trace replay scripts for H200, MI355X, B200 (vLLM-based)
kv-cache-tester submodule (trace replayer + 522 anonymized traces)
AIPerf submodule (alternative synthetic benchmarking)
Pareto frontier plotting and sweep aggregation
Metrics collector (prometheus scraper + visualization)
Workload distribution analysis
GitHub Actions workflow with per-TP sweep configs
MI355X runner SCRIPT_SUFFIX support

Trace replay benchmarking for agentic coding workloads using real Claude Code traces. Includes: - Trace replay scripts for H200, MI355X, B200 (vLLM-based) - kv-cache-tester submodule (trace replayer + 522 anonymized traces) - AIPerf submodule (alternative synthetic benchmarking) - Pareto frontier plotting and sweep aggregation - Metrics collector (prometheus scraper + visualization) - Workload distribution analysis - GitHub Actions workflow with per-TP sweep configs - MI355X runner SCRIPT_SUFFIX support Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-01T20:25:51Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

.github/workflows/multiturn-sweep.yml

+    runs-on: ubuntu-latest
+    outputs:
+      matrix: ${{ steps.gen.outputs.matrix }}
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        if: ${{ inputs.config_file != '' }}
+        with:
+          token: ${{ secrets.REPO_PAT }}
+          fetch-depth: 1
+          ref: ${{ inputs.ref || github.ref }}
+          sparse-checkout: ${{ inputs.config_file }}
+
+      - id: gen
+        run: |
+          pip install -q pyyaml
+          python3 << 'PYEOF'
+          import json, os, sys
+
+          config_file = "${{ inputs.config_file }}".strip()
+
+          if config_file:
+              import yaml
+              with open(config_file) as f:
+                  full_config = yaml.safe_load(f)
+
+              config_key = "${{ inputs.config_key }}".strip()
+
+              # If config_key specified, use that section; otherwise auto-detect
+              if config_key and config_key in full_config:
+                  config = full_config[config_key]
+              elif config_key:
+                  print(f"ERROR: config_key '{config_key}' not found. Available: {list(full_config.keys())}")
+                  sys.exit(1)
+              elif len(full_config) == 1:
+                  config = next(iter(full_config.values()))
+              else:
+                  # Check if top-level keys look like tp entries (tp2, tp4, etc.)
+                  if all(k.startswith("tp") for k in full_config):
+                      config = full_config
+                  else:
+                      print(f"ERROR: Multiple entries in config, specify --config_key. Available: {list(full_config.keys())}")
+                      sys.exit(1)
+
+              includes = []
+              for key, settings in config.items():
+                  tp = int(key.replace("tp", ""))
+                  users = settings.get("users", [])
+                  offloads = settings.get("offload", ["on", "off"])
+                  ep = settings.get("ep", 0)
+                  for u in users:
+                      for o in offloads:
+                          entry = {"tp": tp, "users": u, "offload": o}
+                          if ep > 0:
+                              entry["ep"] = ep
+                          includes.append(entry)
+          else:
+              tp_values = json.loads('${{ inputs.tp_values }}')
+              user_values = json.loads('${{ inputs.user_values }}')
+              offload_values = json.loads('${{ inputs.offload_values }}')
+              includes = []
+              for tp in tp_values:
+                  for u in user_values:
+                      for o in offload_values:
+                          includes.append({"tp": tp, "users": u, "offload": o})
+
+          matrix = {"include": includes}
+          print(f"Generated {len(includes)} matrix entries")
+          with open(os.environ["GITHUB_OUTPUT"], "a") as f:
+              f.write(f"matrix={json.dumps(matrix)}\n")
+          PYEOF
+
+  # ---------------------------------------------------------------------------
+  # Matrix benchmark jobs — each cell calls the multiturn template
+  # ---------------------------------------------------------------------------
+  sweep:


.github/workflows/multiturn-sweep.yml

+    needs: generate-matrix
+    uses: ./.github/workflows/benchmark-multiturn-tmpl.yml
+    name: sweep /
+    strategy:
+      fail-fast: false
+      matrix: ${{ fromJson(needs.generate-matrix.outputs.matrix) }}
+    secrets: inherit
+    with:
+      runner: ${{ inputs.runner }}
+      image: ${{ inputs.image }}
+      model: ${{ inputs.model }}
+      precision: ${{ inputs.precision }}
+      exp-name: "multiturn_tp${{ matrix.tp }}_users${{ matrix.users }}_offload${{ matrix.offload }}"
+      tp: "${{ matrix.tp }}"
+      users: "${{ matrix.users }}"
+      offload-mode: ${{ matrix.offload }}
+      duration: ${{ inputs.duration }}
+      request-rate: ${{ inputs.request_rate }}
+      total-cpu-dram-gb: ${{ inputs.total_cpu_dram_gb }}
+      script-suffix: ${{ inputs.script_suffix }}
+      ep: "${{ matrix.ep || inputs.ep }}"
+      ref: ${{ inputs.ref }}
+
+  # ---------------------------------------------------------------------------
+  # Collect & aggregate results
+  # ---------------------------------------------------------------------------
+  collect:


In general, fix this by explicitly setting a permissions block either at the workflow root (to cover all jobs) or on individual jobs, granting only the minimal scopes required. Since this workflow does not rely on GITHUB_TOKEN for write operations (it uses secrets.REPO_PAT for checkout and only handles artifacts), we can safely set contents: read at the workflow level. This documents intent and ensures GITHUB_TOKEN cannot be used for repo writes, even if org defaults are permissive.

The best minimal fix without changing existing behavior is: add a permissions: block at the top level, right after the existing name/run-name and before on:. Set contents: read as the default for all jobs. No additional imports, methods, or definitions are needed because this is purely a YAML configuration change. The jobs (generate-matrix, sweep, collect) will automatically inherit these restricted permissions unless they define their own permissions (which they currently do not).

Concretely:

Edit .github/workflows/multiturn-sweep.yml.

Insert:

permissions: contents: read

between the run-name: line and the on: block (after line 2 and before line 4 in the provided snippet). This satisfies CodeQL’s requirement and implements least-privilege defaults for GITHUB_TOKEN across the workflow.

.github/workflows/multiturn-sweep.yml

+    runs-on: ubuntu-latest
+    needs: sweep
+    if: always()
+    name: Collect results
+    steps:
+      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
+        with:
+          token: ${{ secrets.REPO_PAT }}
+          fetch-depth: 1
+          ref: ${{ inputs.ref || github.ref }}
+
+      - uses: actions/setup-python@v5
+        with:
+          python-version: '3.11'
+
+      - name: Install dependencies
+        run: pip install pandas matplotlib numpy
+
+      - name: Download all artifacts
+        uses: actions/download-artifact@v4
+        with:
+          pattern: 'multiturn_*'
+          path: results/
+
+      - name: Run aggregation
+        run: |
+          python experimental/multiturn/vllm_benchmark/scripts/collect_sweep_results.py results/ aggregated/
+
+      - name: Upload aggregated results
+        uses: actions/upload-artifact@b7c566a772e6b6bfb58ed0dc250532a479d7789f # v6.0.0
+        with:
+          name: multiturn_aggregated
+          path: aggregated/


In general, the fix is to declare an explicit permissions: block applying the least privileges required. This can be done at the workflow root (applies to all jobs) or per job. Since none of the shown jobs needs to write to the repo or to other resources via GITHUB_TOKEN, we can safely restrict it to read-only repository contents. The minimal recommended setting is permissions: contents: read at the workflow root.

Concretely, in .github/workflows/multiturn-sweep.yml, add a top-level permissions: block right after the run-name: (or after on:) so it applies to all jobs (generate-matrix, sweep, and collect). Set it to:

permissions: contents: read

This change does not alter existing functionality: the actions/checkout steps use an explicit PAT via token: ${{ secrets.REPO_PAT }}, and all other actions (download/upload-artifact, setup-python, etc.) work with read-only contents permissions on GITHUB_TOKEN. No additional imports, methods, or other definitions are needed.

Replaced by vLLM's native kv_offload metrics. Removes subprocess/threading imports and ~100 lines of dead code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add VLLMMetricsParser and SGLangMetricsParser with shared MetricsSnapshot. Backend is auto-detected from metrics prefix (vllm: vs sglang:) on first poll. sglang metrics mapped: - token_usage / num_used_tokens → kv_cache_usage - num_running_reqs → num_requests_running - num_queue_reqs → num_requests_waiting - cache_hit_rate × prompt_tokens → prefix_cache_hits/queries - num_retracted_reqs → num_preemptions - realtime_tokens_total mode=prefill_compute/prefill_cache → token source Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replays SWE-bench/GAIA/WildClaw traces from sammshen/lmcache-agentic-traces via AIPerf with mooncake_trace format. Downloads and converts traces at runtime. Supports concurrency sweep with offload on/off. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Add --fixed-schedule to replay at exact trace timestamps - Remove --extra-inputs ignore_eos:true (let model stop naturally) - Remove unused REQUEST_RATE logic Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…tion

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…cessing Drops ~18GB per artifact by excluding inputs.json, conversations.jsonl, responses.json, GPU telemetry, raw records, and full aiperf_artifacts/. Only uploads the specific files used by collect_sweep_results.py and plot_pareto.py. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The profile_export.jsonl with 233K records was ~10GB per artifact. Switch collect_sweep_results.py and plot_pareto.py to read from the pre-computed profile_export_aiperf.csv (~4KB) instead. Remove the JSONL from the artifact upload. Existing client CSV and trace_replay paths are unchanged. Also exclude low-FreeMem H100 nodes (1, 7, 18) to avoid cudaMallocHost/mlock failures during vLLM CPU KV cache allocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

vLLM v0.18.0 follows the newer OpenAI API spec where the 'system' message role was renamed to 'developer'. The LMCache traces use 'system', causing 100% 400 Bad Request errors. Also drop the 15GB profile_export_aiperf.json from artifact uploads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The LMCache traces include explicit null values for optional fields (tool_calls, tool_call_id, name) on every message. vLLM's strict Pydantic validation rejects these, causing 100% HTTP 400 errors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Avoids flaky streaming downloads that fail mid-transfer. The dataset is now cached via hf download (same as model weights) and read from the local parquet files. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Based on H100 aiperf script with B200-specific changes: - TORCH_CUDA_ARCH_LIST=10.0 (Blackwell) - B200 compilation config (FULL_DECODE_ONLY cudagraphs, custom ops) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The dataset was updated (24K → 74K rows) and now includes entries with empty message lists, causing aiperf MooncakeTrace validation to fail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cquil11 added the experimental label Apr 1, 2026

github-project-automation bot added this to InferenceMAX Board Apr 1, 2026

github-advanced-security bot found potential problems Apr 1, 2026

View reviewed changes

cquil11 and others added 25 commits April 1, 2026 15:27

remove deprecated GpuTransferCollector from metrics collector

28991eb

Replaced by vLLM's native kv_offload metrics. Removes subprocess/threading imports and ~100 lines of dead code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

remove unused Protocol import

6a41d49

add H100 LMCache trace sweep config

ee76767

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

remove --fixed-schedule: use concurrency mode per Samuel's recommenda…

fc8e3cf

…tion

update yaml

6bbbfa9

fix H100 runner: add SCRIPT_SUFFIX support

a2e4fe6

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix: mkdir RESULT_DIR before trace conversion

fee0278

add H200 LMCache trace benchmark and config

769532c

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

update yaml

02876af

fix H200-nb runner: add SCRIPT_SUFFIX support

2134fd8

fix all H200 runners: add SCRIPT_SUFFIX support

ab2812a

fix all runners: add SCRIPT_SUFFIX support

5aa993f

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add exclusive

bd4ec30

add exclusive

a12cc9d

add exclusive

af49d11

debug

4f106b8

revert system->developer role conversion in LMCache traces

ede9bde

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

fix MetricsCollector missing gpu_transfer_collector attribute

a7ac440

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

cquil11 and others added 5 commits April 2, 2026 13:13

add B200 FP4 multiturn benchmark script using aiperf

195ca66

Based on H100 aiperf script with B200-specific changes: - TORCH_CUDA_ARCH_LIST=10.0 (Blackwell) - B200 compilation config (FULL_DECODE_ONLY cudagraphs, custom ops) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

add entry for b200 ds

09e6ec1

add expert parallel support to B200 FP4 aiperf script

951326a

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

skip LMCache trace entries with empty messages

0100fa1

The dataset was updated (24K → 74K rows) and now includes entries with empty message lists, causing aiperf MooncakeTrace validation to fail. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@@ -1,5 +1,7 @@
             name: Multi-Turn Benchmark Sweep
             run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"
+            permissions:
+              contents: read
             on:
               # push:

@@ -1,6 +1,9 @@
             name: Multi-Turn Benchmark Sweep
             run-name: "${{ inputs.run_name || format('Multi-Turn Sweep - tp={0} users={1} offload={2}', inputs.tp_values, inputs.user_values, inputs.offload_values) }}"
+            permissions:
+              contents: read
             on:
               # push:
               #   branches:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP][experimental] add agentic trace replay benchmark infrastructure#993

[WIP][experimental] add agentic trace replay benchmark infrastructure#993
cquil11 wants to merge 31 commits intomainfrom
experimental/agentic-benchmark

cquil11 commented Apr 1, 2026

Uh oh!

github-actions bot commented Apr 1, 2026

Uh oh!

Check warning

Check warning

Copilot Autofix

Check warning

Copilot Autofix

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cquil11 commented Apr 1, 2026

Uh oh!

github-actions bot commented Apr 1, 2026

Uh oh!

Check warning

Check warning

Copilot Autofix

Check warning

Copilot Autofix

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants