diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
new file mode 100644
index 0000000..3afebcf
--- /dev/null
+++ b/incidents/2026-03-04.md
@@ -0,0 +1,69 @@
+# 2026-03-04 Incident Report
+
+- Incident Commander: @ryanaslett
+- Severity Level: P1
+
+For several days following the release announcement, the macOS installer package (`.pkg`) for Node.js v22.22.1 served a duplicate file with a mismatched SHA256 checksum due to a failed rclone upload step during a Jenkins job re-run. While having a different hash, this file has been generated and signed legitimately by Node.js' CI and was safe to run.
+
+## Timeline
+
+- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct (backup origin server) and the R2 dist-staging bucket.
+
+- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
+
+- **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket (which serves `nodejs.org`), including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
+
+- **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
+
+- **2026-03-08 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
+
+- **2026-03-08 23:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
+
+- **2026-03-09 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
+
+- **2026-03-09 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to direct but failed to sync to R2, causing version mismatch.
+
+- **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted. Impact resolved shortly after.
+
+## Impact
+
+Users downloading the macOS installer package from `https://nodejs.org/dist/v22.22.1/node-v22.22.1.pkg` received a file whose SHA256 checksum (`1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) did not match the checksum published in [`SHASUMS256.txt`](https://nodejs.org/dist/latest-v22.x/SHASUMS256.txt) (`ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`).
+
+Both files were legitimately signed by the Node.js Foundation Apple Developer account, but represented different build artifacts from separate Jenkins runs. The file served from direct.nodejs.org was correct, but Cloudflare R2 (serving most users via the release worker) contained the outdated version.
+
+## Root Cause
+
+A workflow issue in the Jenkins release process allowed files to become out of sync between direct.nodejs.org (www) and the R2 bucket.
+
+The release process works as follows:
+1. Jenkins builds the macOS package and signs it
+2. The package is copied to direct via `scp`
+3. Jenkins SSHs into direct and uses `rclone` to copy the file to the R2 dist-staging bucket
+4. Releaser runs script which SSHs into direct and copies files from the R2 dist-staging bucket to the R2 dist-prod bucket
+5. Script generates `SHASUMS256.txt` based on files on direct, not R2, and writes this to the R2 dist-prod bucket
+
+During the v22.22.1 release:
+1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2 staging
+2. The job was re-run, producing a new signed package at 21:00 UTC
+3. The second run successfully copied the new package to direct
+4. The `rclone` step to R2 staging failed with `kex_exchange_identification: Connection closed by remote host`
+5. The Jenkins job marked the build as failed but did not roll back the direct upload
+6. Releaser ran script, which promoted the original package from R2 staging to prod, but generated `SHASUMS256.txt` based on the regenerated package on direct
+
+This left direct with matching package and `SHASUMS256.txt` files, but the R2 prod bucket with the outdated package file, creating a checksum mismatch for most users.
+
+## Fix
+
+The immediate fix was to manually sync the correct file from direct to the R2 dist-staging bucket using `rclone copyto`, and then to the R2 dist-prod bucket.
+
+## Follow-up Work
+
+- Improve Jenkins workflow to prevent partial uploads when rclone fails
+  - Either roll back direct uploads if R2 sync fails, or upload to both destinations atomically
+  - Add verification step to compare checksums between direct and R2 before marking build as complete
+- Add monitoring/alerting for checksum mismatches between distribution sources
+- Investigate why the rclone SSH connection failed mid-release
+- Consider adding checksum verification as part of the promotion workflow
+- Generate checksums based on R2 dist-prod contents rather than direct
+- Add better logging/auditing for release builds to track which artifacts were uploaded where and when
+- Create or make known what documentation/sources of truth to point to for any further incidents like this
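
The checksum verification proposed under Follow-up Work can be sketched in Python. This is a minimal illustration under stated assumptions, not the project's release tooling. It assumes `SHASUMS256.txt` uses the `sha256sum`-style layout (`<hex digest>  <filename>` per line) and that the assets to be promoted are available in a local directory standing in for the bucket contents. The function names `sha256_of`, `parse_shasums`, and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_shasums(text: str) -> dict[str, str]:
    """Parse `<hex digest>  <filename>` lines into {filename: digest}.

    Assumption: sha256sum-style layout; a leading `*` (binary-mode marker)
    on the filename is dropped, and blank lines are ignored.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def find_mismatches(shasums_text: str, asset_dir: Path) -> list[str]:
    """Return listed asset names that are missing or hash differently.

    `asset_dir` is a hypothetical local copy of the files about to be
    served; an empty result means every listed file matches its checksum.
    """
    bad = []
    for name, digest in parse_shasums(shasums_text).items():
        candidate = asset_dir / name
        if not candidate.is_file() or sha256_of(candidate) != digest:
            bad.append(name)
    return bad
```

Run against the files that will actually be served (rather than the copies on direct), a non-empty result from `find_mismatches` would be a reason to stop the promotion. In this incident, such a check on the staged `node-v22.22.1.pkg` would have reported the file, because the staged copy came from the first Jenkins run while the checksum was computed from the second.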