[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends#90

Open
fffrog wants to merge 3 commits into pytorch:master from fffrog:relay

Conversation


@fffrog fffrog commented Mar 10, 2026

This RFC has been under discussion for several weeks; you can visit this link to see the previous discussions if you are interested.

Thanks a lot to @ZainRizvi, @seemethere, @afrittoli, @zxiiro, @mikaylagawarecki, and @jewelkm89 for the valuable suggestions.

Click here to see a preview of this RFC.

Contributor

@albanD albanD left a comment


Great proposal!

Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and don't block the first steps!


The Relay Server writes all results from `L2` and above into `ClickHouse`, which powers the dedicated OOT HUD pages described above.

## Alternative Architecture
Contributor


Just to be clear, we don't plan to do this for now (only if we need the extra reliability). And we would be able to move from the current design to this without any change to the integration from oot backends?

Author

@fffrog fffrog Mar 18, 2026


Just to be clear, we don't plan to do this for now

I understand that the alternative architecture is merely a supplementary solution to further improve stability, and it's not needed at this stage.

And we would be able to move from the current design to this without any change to the integration from oot backends?

Absolutely. The migration from the current design to the alternative one would be fully transparent to downstream backends. The downstream integration surface consists of two touchpoints: receiving `repository_dispatch` events and calling the `report-ci-result` action for callbacks. Both remain identical in either architecture; the change is entirely internal to the Relay Server (how it processes webhooks and fans out dispatches). No downstream workflow modifications would be needed.
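To make the stability of this boundary concrete, here is a minimal sketch of the two touchpoints. The function names and payload fields are illustrative assumptions, not the Relay Server's actual code; only the `repository_dispatch` request shape (`event_type` plus `client_payload`, sent to GitHub's `POST /repos/{owner}/{repo}/dispatches` endpoint) comes from GitHub's documented API.

```python
# Hypothetical sketch of the two downstream touchpoints. Names and fields
# are illustrative; only the repository_dispatch body shape is GitHub's.

def build_repository_dispatch(event_type: str, pr_number: int, head_sha: str) -> dict:
    """Body the Relay Server would POST to each backend repo's /dispatches endpoint."""
    return {
        "event_type": event_type,
        "client_payload": {"pr_number": pr_number, "head_sha": head_sha},
    }

def build_ci_callback(backend: str, pr_number: int, conclusion: str) -> dict:
    """Result shape a downstream workflow could report back via the report-ci-result action."""
    if conclusion not in {"success", "failure", "cancelled"}:
        raise ValueError(f"unexpected conclusion: {conclusion}")
    return {"backend": backend, "pr_number": pr_number, "conclusion": conclusion}
```

Because both shapes are fixed at this boundary, either architecture can sit behind them without downstream changes.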

| Change | Description |
| :--- | :--- |
| `L4 -> L3` | 1. Unstable CI / false positives <br/> 2. Slow or unresponsive oncall |
| `L2 -> L1` | 1. Insufficient CI stability <br/> 2. Sending excessive or abusive requests to the Relay Server, affecting system stability |
Contributor


Also L3 -> L2 for similar reasons?

Author


Good catch and it is indeed necessary.

TBH, this proposal mainly focuses on the overall design. Details such as the rules for changing accelerator levels are for reference only, as we need more suggestions from the Maintainers and the community.

Furthermore, I will try to refine these details based on your suggestions throughout the RFC. I will ping you once the RFC is updated.
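As a reference-only illustration of how demotion rules like those in the table might be evaluated, here is a small sketch. The thresholds and metric fields below are assumptions mirroring the draft table, not agreed policy, and would change as the rules are refined.

```python
# Illustrative one-step demotion check based on the draft table; all
# thresholds and metric fields are placeholders, not settled policy.
from dataclasses import dataclass

@dataclass
class WeeklyMetrics:
    success_rate: float      # fraction of passing downstream CI runs this week
    oncall_responsive: bool  # oncall met the triage SLA this week
    abusive_requests: bool   # excessive load sent to the Relay Server

def demote(level: int, m: WeeklyMetrics) -> int:
    """Return the level after applying the demotion rules, one step at a time."""
    if level == 4 and (m.success_rate < 0.99 or not m.oncall_responsive):
        return 3
    if level == 3 and m.success_rate < 0.90:
        return 2
    if level == 2 and (m.success_rate < 0.70 or m.abusive_requests):
        return 1
    return level
```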


| Change | Description |
| :--- | :--- |
| `L4 -> L3` | 1. Unstable CI / false positives <br/> 2. Slow or unresponsive oncall |
Contributor


I expect we can make these oncalls work. We will need more details on these before any backend is eligible for L4.
Let's just add a note about this here and not block this RFC on this discussion.

Author


Let's just add a note about this here and not block this RFC on this discussion.

Agreed. Will add a note right now.

| Phase | Level | Requirements |
| :--- | :--- | :--- |
| **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
| **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
Contributor


Since this is only visible on a separate HUD page, I think we can ease the success rate requirement (I expect some projects just moving into observation might not hit it, but that's OK).
I would add as a requirement that they follow the workflow above (don't spam the Relay Server) and that the signal they send back is valid.

Author


Agreed, and I will add those updates to this RFC.
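Folding in the review suggestion above, an Observation-level (L2) eligibility check might look like the following sketch. The function name, parameters, and thresholds are placeholders, not settled policy; the success-rate bar in particular may be eased per the discussion.

```python
# Illustrative L2 (Observation) eligibility check; names and thresholds are
# placeholders. Per review feedback, workflow compliance and valid signal
# matter more than the raw success-rate bar at this level.
def eligible_for_l2(success_rate: float, max_runtime_h: float,
                    follows_workflow: bool, signal_valid: bool) -> bool:
    return (follows_workflow                 # doesn't spam the Relay Server
            and signal_valid                 # callbacks carry valid results
            and success_rate > 0.70          # bar may be eased over time
            and max_runtime_h <= 3.0)        # all runs finish within 3h/PR
```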

| **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
| **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
| **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
| **Mature** | `L4` | 1. Recognized and supported by the PyTorch community for the accelerator & CI/CD <br/> 2. Weekly downstream CI success rate > 99% <br/> 3. Effective and stable oncall rotation, issue triage SLA < 48h |
Contributor


I think for requirement 1 we can rely on Core Maintainers to approve the request based on community support and usage of the particular hardware.

Author


Agreed. Will clarify that L4 eligibility is determined by Core Maintainer approval based on community support and hardware usage.

| **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
| **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
| **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
| **Mature** | `L4` | 1. Recognized and supported by the PyTorch community for the accelerator & CI/CD <br/> 2. Weekly downstream CI success rate > 99% <br/> 3. Effective and stable oncall rotation, issue triage SLA < 48h |
Contributor


I'm not sure how realistic a 99% success rate is. I don't think we're even there for trunk for most jobs, haha.
@ZainRizvi we should leave these numbers floating and adapt them over time.

Author


Fair point.

Will remove the hard 99% threshold and instead note that specific success rate targets for L4 will be calibrated based on real-world PyTorch CI performance and adjusted over time.
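One way to keep these numbers floating, sketched below, is to treat the targets as configuration rather than hard-coded constants. The values and structure here are placeholders for illustration only; the L4 target is deliberately left unset pending real-world CI data.

```python
# Sketch of success-rate targets as adjustable configuration; values are
# placeholders, and L4 is unset pending calibration against real trunk CI.
LEVEL_TARGETS = {
    "L2": 0.70,
    "L3": 0.90,
    "L4": None,  # to be calibrated against observed PyTorch CI performance
}

def meets_target(level: str, observed_rate: float) -> bool:
    target = LEVEL_TARGETS[level]
    # An unset target means the level's bar is still being calibrated.
    return True if target is None else observed_rate >= target
```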


> \[!NOTE\]
> - Downstream repos that meet the requirements can apply to advance level by level (L1 → L2 → L3 → L4).
> - `L2`: Most downstream repos should be at this level.
Contributor


I would also add that most backends should be at L3 long term, as it is a good balance of early signal and resource usage.

Author


Good point, thank you.

| :--- | :--- | :--- |
| **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
| **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
| **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
Contributor


We should have a requirement for L3 and L4 to have enough hardware availability to keep queue time under X minutes for these jobs.

Author


Agreed.

Will add a hardware availability requirement for L3 and L4 to ensure that PR-visible check results are delivered in a timely manner.
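A hedged sketch of how such a queue-time requirement might be measured: check that the p90 queue time for a backend's jobs stays under a threshold. The 30-minute default is a placeholder, since the X in the suggestion above is intentionally unspecified.

```python
# Illustrative queue-time SLA check; the 30-minute threshold is a placeholder
# for the unspecified X in the requirement, and the p90 metric is an assumption.
def p90_queue_minutes(queue_times_min: list[float]) -> float:
    """Approximate 90th-percentile queue time (nearest-rank)."""
    ordered = sorted(queue_times_min)
    idx = max(0, int(0.9 * len(ordered)) - 1)
    return ordered[idx]

def meets_queue_sla(queue_times_min: list[float], threshold_min: float = 30.0) -> bool:
    """True when the backend has data and its p90 queue time is within the SLA."""
    return bool(queue_times_min) and p90_queue_minutes(queue_times_min) <= threshold_min
```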


The allowlist is designed to naturally support gradual progression from experimental participation to mature participation. The table below lists the requirements for advancing to each level.

| Phase | Level | Requirements |
Contributor


For the requirements, it might be helpful to separate Infra availability vs legitimate test breakage.

I think the requirement we want to have here is both:

  • Very strong requirement on Infra availability.
  • More relaxed requirement on test breakage

We want to encourage both for sure, but they will be managed very differently so we most likely want to provide signal to backend writers independently.

Author


Excellent point. Will update the requirements to separate the two dimensions.
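The separation suggested above could be computed as two independent metrics from the same weekly runs, as in this sketch. The run schema and failure categorization are assumptions for illustration; how a failure is attributed to infra in practice is an open detail.

```python
# Illustrative split of weekly signal into infra availability (strong
# requirement) vs legitimate test breakage (relaxed requirement). The run
# schema and the infra_failure flag are assumptions, not part of the RFC.
def split_signal(runs: list[dict]) -> dict:
    """Each run: {"status": "success" | "failure", "infra_failure": bool}."""
    total = len(runs)
    infra = sum(1 for r in runs if r["status"] == "failure" and r["infra_failure"])
    tests = sum(1 for r in runs if r["status"] == "failure" and not r["infra_failure"])
    return {
        # Strong requirement: infra should almost never be the cause.
        "infra_availability": (total - infra) / total if total else 1.0,
        # Relaxed requirement: real breakage is expected signal, tracked separately.
        "test_pass_rate": (total - tests) / total if total else 1.0,
    }
```

Reporting the two numbers independently lets backend writers see infra problems and legitimate breakage as separate signals, as suggested.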


## Demo

An end-to-end prototype has been completed. A few key points are noted below.
Contributor


Note that we use hud view page for most of the signal tracking on PRs.
For example pytorch/pytorch#177365 (comment) and https://hud.pytorch.org/pr/177365

Can we have an item to show these there in a friendly way as well?

Author

@fffrog fffrog Mar 18, 2026


Good call.

Will add a section covering the HUD PR view (hud.pytorch.org/pr/) integration.

For L3/L4 backends, the OOT check results should be displayed in a dedicated section of the PR-level HUD view.

Author

fffrog commented Mar 18, 2026

Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and don't block the first steps!

Hi @albanD, so happy to get your approval, thank you.

The initial code for L1 and L2 is complete, and I will submit a PR soon. I'll let you know when it's finished.

Author

fffrog commented Mar 18, 2026

Hey @albanD, the new commit is ready. Please take another look when you have a chance, thank you.
