[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends #90
fffrog wants to merge 3 commits into pytorch:master
Conversation
albanD left a comment
Great proposal!
Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and doesn't block the first steps!
> The Relay Server writes all results from `L2` and above into `ClickHouse`, which powers the dedicated OOT HUD pages described above.
>
> ## Alternative Architecture
Just to be clear, we don't plan to do this for now (only if we need the extra reliability). And we would be able to move from the current design to this without any change to the integration from oot backends?
> Just to be clear, we don't plan to do this for now

I understand that the alternative architecture is merely a supplementary solution to further improve stability, and it's not needed at this stage.

> And we would be able to move from the current design to this without any change to the integration from oot backends?

Absolutely. The migration from the current design to the alternative one would be fully transparent to downstream backends. The downstream integration surface consists of two touchpoints: receiving `repository_dispatch` events and calling the `report-ci-result` action for callbacks. Both remain identical in either architecture; the change is entirely internal to the Relay Server (how it processes webhooks and fans out dispatches). No downstream workflow modifications would be needed.
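To make the two touchpoints concrete, here is a minimal sketch of what a downstream backend workflow might look like. This is a hypothetical illustration only: the event type, payload fields, action location, and input names are placeholder assumptions, not the actual relay interface; only the `repository_dispatch` trigger and the `report-ci-result` action name come from the RFC itself.

```yaml
# Hypothetical downstream workflow sketching the two integration touchpoints.
# Event type, client_payload fields, action path, and inputs are assumed names.
name: oot-ci-relay
on:
  repository_dispatch:
    types: [pytorch-ci-request]      # event type sent by the Relay Server (assumed)

jobs:
  run-and-report:
    runs-on: self-hosted             # backend vendor's accelerator runner
    steps:
      - uses: actions/checkout@v4
      - name: Run accelerator test suite against the requested commit
        run: ./ci/run_tests.sh "${{ github.event.client_payload.commit_sha }}"
      - name: Report the result back through the Relay Server
        if: always()                  # report success and failure alike
        uses: pytorch/test-infra/.github/actions/report-ci-result@main  # assumed location
        with:
          status: ${{ job.status }}
          run-id: ${{ github.event.client_payload.run_id }}
```

Because both touchpoints live at the workflow boundary (the trigger and one reporting step), the Relay Server's internals can change without touching this file.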
> | Change | Description |
> | :--- | :--- |
> | `L4 -> L3` | 1. Unstable CI / false positives <br/> 2. Slow or unresponsive oncall |
> | `L2 -> L1` | 1. Insufficient CI stability <br/> 2. Sending excessive or abusive requests to the Relay Server, affecting system stability |
Also L3 -> L2 for similar reasons?
Good catch and it is indeed necessary.
TBH, this proposal mainly focuses on the overall design. Regarding details such as the rules for changing accelerator levels, the information in this proposal is for reference only, as we need more input from the maintainers and the community.
Furthermore, I will try to refine these details based on your suggestions throughout the RFC. I will ping you once the RFC is updated.
> | Change | Description |
> | :--- | :--- |
> | `L4 -> L3` | 1. Unstable CI / false positives <br/> 2. Slow or unresponsive oncall |
I expect we can make these oncall rotations work. We will need more details on these before any backend is eligible for L4.
Let's just add a note about this here and not block this RFC on this discussion.
> Let's just add a note about this here and not block this RFC on this discussion.
Agreed. Will add a note right now.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
Since this is only visible on a separate HUD page, I think we can ease the success rate requirement (I expect some projects just moving into observation might not hit it, but that's ok).
I would add as a requirement that they follow the workflow above (don't spam the relay server) and the signal they send back is valid.
Agreed, and I will add those updates to this RFC.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
> | **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
> | **Mature** | `L4` | 1. Recognized and supported by the PyTorch community for the accelerator & CI/CD <br/> 2. Weekly downstream CI success rate > 99% <br/> 3. Effective and stable oncall rotation, issue triage SLA < 48h |
I think for 1. we can rely on Core Maintainers to approve the request based on community support and usage of the particular hw.
Agreed. Will clarify that L4 eligibility is determined by Core Maintainer approval based on community support and hardware usage.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
> | **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
> | **Mature** | `L4` | 1. Recognized and supported by the PyTorch community for the accelerator & CI/CD <br/> 2. Weekly downstream CI success rate > 99% <br/> 3. Effective and stable oncall rotation, issue triage SLA < 48h |
I'm not sure how realistic a 99% success rate is. I don't think we're even there for trunk for most jobs haha.
@ZainRizvi we should leave these numbers floating and adapt with time.
Fair point.
Will remove the hard 99% threshold and instead note that specific success rate targets for L4 will be calibrated based on real-world PyTorch CI performance and adjusted over time.
> > [!NOTE]
> > - Downstream repos that meet the requirements can apply to advance level by level (L1 → L2 → L3 → L4).
> > - `L2`: Most downstream repos should be at this level.
I would also add that most backends should be at L3 long term, as it is a good balance of early signal and good resource usage.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
> | **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
We should have a requirement for L3 and L4 to have enough hardware availability to keep queue time for these jobs under X min.
Agreed.
Will add a hardware availability requirement for L3 and L4 to ensure that PR-visible check results are delivered in a timely manner.
> The allowlist is designed to naturally support gradual progression from experimental participation to mature participation. The table below lists the requirements for advancing to each level.
>
> | Phase | Level | Requirements |
For the requirements, it might be helpful to separate Infra availability vs legitimate test breakage.
I think the requirement we want to have here is both:
- A very strong requirement on infra availability.
- A more relaxed requirement on test breakage.
We want to encourage both for sure, but they will be managed very differently so we most likely want to provide signal to backend writers independently.
Excellent point. Will update the requirements to separate these two dimensions.
> ## Demo
>
> An end-to-end prototype has been completed. A few key points are noted below.
Note that we use the HUD view page for most of the signal tracking on PRs.
For example pytorch/pytorch#177365 (comment) and https://hud.pytorch.org/pr/177365
Can we have an item to show these there in a friendly way as well?
Good call.
Will add a section covering the HUD PR view (hud.pytorch.org/pr/) integration.
For L3/L4 backends, the OOT check results should be displayed in the PR-level HUD view in a dedicated section.
Hi @albanD, so happy to get your approval, thank you. The initial code for L1 and L2 is complete, and I will submit a PR soon. I'll let you know when it's finished.

Hey @albanD, the new commit is ready, please help to review it again, thank you.
This RFC has been under discussion for several weeks; you can visit this link to see previous discussions if you are interested.
And thanks a lot @ZainRizvi, @seemethere, @afrittoli, @zxiiro, @mikaylagawarecki, and @jewelkm89 for the valuable suggestions.
Click here to see a preview of this RFC.