[RFC] Cross-Repository CI Relay for PyTorch Out-of-Tree Backends #90
fffrog wants to merge 3 commits into pytorch:master
Conversation
albanD left a comment
Great proposal!
Left a few comments, but the general architecture sounds great to me.
Most of my comments are about setting up the specific rules and doesn't block the first steps!
> The Relay Server writes all results from `L2` and above into `ClickHouse`, which powers the dedicated OOT HUD pages described above.
>
> ## Alternative Architecture
Just to be clear, we don't plan to do this for now (only if we need the extra reliability). And we would be able to move from the current design to this without any change to the integration from oot backends?
> Just to be clear, we don't plan to do this for now

I understand that the alternative architecture is merely a supplementary solution to further improve stability, and it's not needed at this stage.

> And we would be able to move from the current design to this without any change to the integration from oot backends?

Absolutely. The migration from the current design to the alternative one would be fully transparent to downstream backends. The downstream integration surface consists of two touchpoints: receiving `repository_dispatch` events and calling the `report-ci-result` action for callbacks. Both remain identical in either architecture; the change is entirely internal to the Relay Server (how it processes webhooks and fans out dispatches). No downstream workflow modifications would be needed.
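To make the two touchpoints concrete, here is a minimal sketch of what a downstream backend workflow might look like. This is a hypothetical illustration only: the event type, payload fields, action location, and input names are placeholder assumptions, not the actual relay interface; only the `repository_dispatch` trigger and the `report-ci-result` action name come from the RFC itself.

```yaml
# Hypothetical downstream workflow sketching the two integration touchpoints.
# Event type, client_payload fields, action path, and inputs are assumed names.
name: oot-ci-relay
on:
  repository_dispatch:
    types: [pytorch-ci-request]      # event type sent by the Relay Server (assumed)

jobs:
  run-and-report:
    runs-on: self-hosted             # backend vendor's accelerator runner
    steps:
      - uses: actions/checkout@v4
      - name: Run accelerator test suite against the requested commit
        run: ./ci/run_tests.sh "${{ github.event.client_payload.commit_sha }}"
      - name: Report the result back through the Relay Server
        if: always()                  # report success and failure alike
        uses: pytorch/test-infra/.github/actions/report-ci-result@main  # assumed location
        with:
          status: ${{ job.status }}
          run-id: ${{ github.event.client_payload.run_id }}
```

Because both touchpoints live at the workflow boundary (the trigger and one reporting step), the Relay Server's internals can change without touching this file.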
> | Change | Description |
> | :--- | :--- |
> | `L4 -> L3` | 1. Unstable CI / false positives <br/> 2. Slow or unresponsive oncall |
> | `L2 -> L1` | 1. Insufficient CI stability <br/> 2. Sending excessive or abusive requests to the Relay Server, affecting system stability |
Also L3 -> L2 for similar reasons?
Good catch and it is indeed necessary.
TBH, this proposal mainly focuses on the overall design. Regarding details such as the rules for changing accelerator levels, the information in this proposal is for reference only, as we need more input from the maintainers and the community.
Furthermore, I will try to refine these details based on your suggestions throughout the RFC. I will ping you once the RFC is updated.
> | Change | Description |
> | :--- | :--- |
> | `L4 -> L3` | 1. Unstable CI / false positives <br/> 2. Slow or unresponsive oncall |
I expect we can make these oncall rotations work. We will need more details on these before any backend is eligible for L4.
Let's just add a note about this here and not block this RFC on this discussion.
> Let's just add a note about this here and not block this RFC on this discussion.
Agreed. Will add a note right now.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
Since this is only visible on a separate HUD page, I think we can ease the success rate requirement (I expect some projects just moving into observation might not hit it, but that's ok).
I would add as a requirement that they follow the workflow above (don't spam the relay server) and the signal they send back is valid.
Agreed, and I will add those updates to this RFC.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
> | **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
> | **Mature** | `L4` | 1. Recognized and supported by the PyTorch community for the accelerator & CI/CD <br/> 2. Weekly downstream CI success rate > 99% <br/> 3. Effective and stable oncall rotation, issue triage SLA < 48h |
I think for 1. we can rely on Core Maintainers to approve the request based on community support and usage of the particular hw.
Agreed. Will clarify that L4 eligibility is determined by Core Maintainer approval based on community support and hardware usage.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
> | **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
> | **Mature** | `L4` | 1. Recognized and supported by the PyTorch community for the accelerator & CI/CD <br/> 2. Weekly downstream CI success rate > 99% <br/> 3. Effective and stable oncall rotation, issue triage SLA < 48h |
I'm not sure how realistic a 99% success rate is. I don't think we're even there for trunk for most jobs haha.
@ZainRizvi we should leave these numbers floating and adapt with time.
Fair point.
Will remove the hard 99% threshold and instead note that specific success rate targets for L4 will be calibrated based on real-world PyTorch CI performance and adjusted over time.
> > [!NOTE]
> > - Downstream repos that meet the requirements can apply to advance level by level (L1 → L2 → L3 → L4).
> > - `L2`: Most downstream repos should be at this level.
I would also add that most backends should be at L3 long term, as it is a good balance of early signal and good resource usage.
> | Phase | Level | Requirements |
> | :--- | :--- | :--- |
> | **Onboarding** | `L1` | 1. Provide verifiable accelerator hardware information <br/> 2. Provide a downstream adaptation repo for the accelerator |
> | **Observation** | `L2` | 1. Weekly downstream CI success rate > 70% <br/> 2. All workflow runs complete within 3h/PR |
> | **Stable** | `L3` | 1. Pass the PyTorch core test suite, related [RFC](https://github.com/pytorch/pytorch/issues/174469) @mikaylagawarecki <br/> 2. Weekly downstream CI success rate > 90% |
We should have a requirement for L3 and L4 to have enough hardware availability to keep queue time for these jobs under X min.
Agreed.
Will add a hardware availability requirement for L3 and L4 to ensure that PR-visible check results are delivered in a timely manner.
> The allowlist is designed to naturally support gradual progression from experimental participation to mature participation. The table below lists the requirements for advancing to each level.
>
> | Phase | Level | Requirements |
For the requirements, it might be helpful to separate Infra availability vs legitimate test breakage.
I think the requirement we want to have here is both:
- A very strong requirement on infra availability.
- A more relaxed requirement on test breakage.
We want to encourage both for sure, but they will be managed very differently so we most likely want to provide signal to backend writers independently.
Excellent point. Will update the requirements to separate these two dimensions.
> ## Demo
>
> An end-to-end prototype has been completed. A few key points are noted below.
Note that we use the HUD view page for most of the signal tracking on PRs.
For example pytorch/pytorch#177365 (comment) and https://hud.pytorch.org/pr/177365
Can we have an item to show these there in a friendly way as well?
Good call.
Will add a section covering the HUD PR view (hud.pytorch.org/pr/) integration.
For L3/L4 backends, the OOT check results should be displayed in the PR-level HUD view in a dedicated section.
Hi @albanD, so happy to get your approval, thank you. The initial code for L1 and L2 is complete, and I will submit a PR soon. I'll let you know when it's finished.

Hey @albanD, the new commit is ready, please help to review it again, thank you.
This RFC has been under discussion for several weeks; you can visit this link to see previous discussions if you are interested.
And thanks a lot @ZainRizvi, @seemethere, @afrittoli, @zxiiro, @mikaylagawarecki, and @jewelkm89 for the valuable suggestions.
Click here to see a preview of this RFC.