From 422f00c9a3e8f04d40e8195df0e8805fac533419 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:27:47 -0400
Subject: [PATCH 01/18] feat(incident): Add incident report for macOS installer version mismatch

Documented the incident regarding a macOS installer package version mismatch due to a Jenkins job failure. Included timeline, impact, root cause, fix, and follow-up work.
---
 incidents/2026-03-20.md | 62 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 incidents/2026-03-20.md

diff --git a/incidents/2026-03-20.md b/incidents/2026-03-20.md
new file mode 100644
index 0000000..66b7685
--- /dev/null
+++ b/incidents/2026-03-20.md
@@ -0,0 +1,62 @@
+# 2026-03-04 Incident Report
+
+- Incident Commander: @ryanaslett
+- Severity Level: P1
+
+For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22.22.1 served an incorrect version with a mismatched SHA256 checksum due to a failed rclone upload step during a Jenkins job re-run.
+
+## Timeline
+
+- **2026-03-04 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
+
+- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (`www.`) with the outdated file.
+
+- **2026-03-05 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
+
+- **2026-03-05 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
+
+- **2026-03-05 11:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
+
+- **2026-03-05 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
+
+- **2026-03-05 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to www but failed to sync to R2, causing version mismatch.
+
+- **2026-03-06 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted. Impact resolved.
+
+## Impact
+
+Users downloading the macOS installer package from `https://nodejs.org/dist/v22.22.1/node-v22.22.1.pkg` received a file whose SHA256 checksum (`1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) did not match the checksum published in [`SHASUMS256.txt`](https://nodejs.org/dist/latest-v22.x/SHASUMS256.txt) (`ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`).
+
+Both files were legitimately signed by the Node.js Foundation Apple Developer account, but represented different build artifacts from separate Jenkins runs. The file served from direct.nodejs.org was correct, but Cloudflare R2 (serving most users via the release worker) contained the outdated version.
+
+## Root Cause
+
+A workflow issue in the Jenkins release process allowed files to become out of sync between direct.nodejs.org (www) and the R2 bucket.
+
+The release process works as follows:
+1. Jenkins builds the macOS package and signs it
+2. The package is copied to direct.nodejs.org via `scp`
+3. Jenkins SSHs into www and uses `rclone` to copy the file from www to R2 dist-staging
+
+During the v22.22.1 release:
+1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both www and R2
+2. The job was re-run, producing a new signed package at 21:00 UTC
+3. The second run successfully copied the new package to www
+4. The `rclone` step to R2 failed with `kex_exchange_identification: Connection closed by remote host`
+5. The Jenkins job marked the build as failed but did not roll back the www upload
+
+This left www with the correct file (matching SHASUMS256.txt) while R2 served the outdated file, creating a checksum mismatch for most users.
+
+## Fix
+
+The immediate fix was to manually sync the correct file from direct.nodejs.org to the R2 dist-staging bucket using `rclone copyto`.
+
+## Follow-up Work
+
+- Improve Jenkins workflow to prevent partial uploads when rclone fails
+  - Either roll back www uploads if R2 sync fails, or upload to both destinations atomically
+  - Add verification step to compare checksums between www and R2 before marking build as complete
+- Add monitoring/alerting for checksum mismatches between distribution sources
+- Investigate why the rclone SSH connection failed mid-release
+- Consider adding checksum verification as part of the promotion workflow
+- Add better logging/auditing for release builds to track which artifacts were uploaded where and when
From 179e14f2e069bd0b5461d23db7bfba2e1a170246 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:29:14 -0400
Subject: [PATCH 02/18] Update incident report date to 2026-03-08

---
 incidents/{2026-03-20.md => 2026-03-08.md} | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
 rename incidents/{2026-03-20.md => 2026-03-08.md} (99%)

diff --git a/incidents/2026-03-20.md b/incidents/2026-03-08.md
similarity index 99%
rename from incidents/2026-03-20.md
rename to incidents/2026-03-08.md
index 66b7685..a1b222c 100644
--- a/incidents/2026-03-20.md
+++ b/incidents/2026-03-08.md
@@ -1,4 +1,4 @@
-# 2026-03-04 Incident Report
+# 2026-03-08 Incident Report
 
 - Incident Commander: @ryanaslett
 - Severity Level: P1
From 099a14fecde02d033a4d3ea20bacb25ce2de18e0 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:30:23 -0400
Subject: [PATCH 03/18] Correct timeline dates for Node.js v22.22.1
 incident

Updated incident timeline to reflect correct dates for events related to the macOS installer package issue.
---
 incidents/2026-03-08.md | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-08.md
index a1b222c..0b724ba 100644
--- a/incidents/2026-03-08.md
+++ b/incidents/2026-03-08.md
@@ -7,21 +7,23 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 ## Timeline
 
-- **2026-03-04 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
+- **2026-03-08 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
 
-- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (`www.`) with the outdated file.
+- **2026-03-08 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (`www.`) with the outdated file.
 
-- **2026-03-05 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
+- **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
-- **2026-03-05 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
+- **2026-03-08 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
 
-- **2026-03-05 11:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
+- **2026-03-08 11:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
 
-- **2026-03-05 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
+- **2026-03-09 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
 
-- **2026-03-05 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to www but failed to sync to R2, causing version mismatch.
+- **2026-03-09 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to www but failed to sync to R2, causing version mismatch.
 
-- **2026-03-06 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted. Impact resolved.
+- **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted.
+
+- **2026-03-09 01:29 UTC**: Cache purged. Impact resolved.
 
 ## Impact
 
From a3d948edd65f188282f3e2839a232fb1bf6bd327 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:30:57 -0400
Subject: [PATCH 04/18] Update incidents/2026-03-08.md

Co-authored-by: Matt Cowley
---
 incidents/2026-03-08.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-08.md
index 0b724ba..e737484 100644
--- a/incidents/2026-03-08.md
+++ b/incidents/2026-03-08.md
@@ -3,7 +3,7 @@
 - Incident Commander: @ryanaslett
 - Severity Level: P1
 
-For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22.22.1 served an incorrect version with a mismatched SHA256 checksum due to a failed rclone upload step during a Jenkins job re-run.
+For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22.22.1 served a duplicate file with a mismatched SHA256 checksum due to a failed rclone upload step during a Jenkins job re-run. While having a different hash, this file has been generated and signed legitimately by Node.js' CI and was safe to run.
 
 ## Timeline
 
From 488f523901ebf01e913eaad45039404ecf80e930 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:35:08 -0400
Subject: [PATCH 05/18] Update 2026-03-08.md

---
 incidents/2026-03-08.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-08.md
index e737484..2e0583a 100644
--- a/incidents/2026-03-08.md
+++ b/incidents/2026-03-08.md
@@ -41,13 +41,13 @@ The release process works as follows:
 3. Jenkins SSHs into www and uses `rclone` to copy the file from www to R2 dist-staging
 
 During the v22.22.1 release:
-1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both www and R2
+1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2
 2. The job was re-run, producing a new signed package at 21:00 UTC
-3. The second run successfully copied the new package to www
+3. The second run successfully copied the new package to direct
 4. The `rclone` step to R2 failed with `kex_exchange_identification: Connection closed by remote host`
-5. The Jenkins job marked the build as failed but did not roll back the www upload
+5. The Jenkins job marked the build as failed but did not roll back the direct upload
 
-This left www with the correct file (matching SHASUMS256.txt) while R2 served the outdated file, creating a checksum mismatch for most users.
+This left `direct.` with the correct file (matching SHASUMS256.txt) while R2 served the outdated file, creating a checksum mismatch for most users.
 
 ## Fix
 
From ca0945f5f7341d8b3f60d430f3d4695ffdc7c481 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:37:52 -0400
Subject: [PATCH 06/18] Update 2026-03-08.md

---
 incidents/2026-03-08.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-08.md
index 2e0583a..0aedc2d 100644
--- a/incidents/2026-03-08.md
+++ b/incidents/2026-03-08.md
@@ -9,7 +9,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 - **2026-03-08 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
 
-- **2026-03-08 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (`www.`) with the outdated file.
+- **2026-03-08 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (serving most users at `www.`) with the outdated file.
 
 - **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
@@ -38,7 +38,7 @@ A workflow issue in the Jenkins release process allowed files to become out of s
 The release process works as follows:
 1. Jenkins builds the macOS package and signs it
 2. The package is copied to direct.nodejs.org via `scp`
-3. Jenkins SSHs into www and uses `rclone` to copy the file from www to R2 dist-staging
+3. Jenkins SSHs into direct and uses `rclone` to copy the file from www to R2 dist-staging
 
 During the v22.22.1 release:
 1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2
From 9c266efe4331a11f6d2f3bdbd97d00aea266a727 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 21:55:24 -0400
Subject: [PATCH 07/18] Update 2026-03-08.md

---
 incidents/2026-03-08.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-08.md
index 0aedc2d..c76a624 100644
--- a/incidents/2026-03-08.md
+++ b/incidents/2026-03-08.md
@@ -19,7 +19,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 - **2026-03-09 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
 
-- **2026-03-09 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to www but failed to sync to R2, causing version mismatch.
+- **2026-03-09 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to direct but failed to sync to R2, causing version mismatch.
 
 - **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted.
 
From b86b5669cd985ceced03d7f777ec207fd22ac1f8 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:03:11 -0400
Subject: [PATCH 08/18] Update 2026-03-08.md

---
 incidents/2026-03-08.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-08.md
index c76a624..d511f3b 100644
--- a/incidents/2026-03-08.md
+++ b/incidents/2026-03-08.md
@@ -1,4 +1,4 @@
-# 2026-03-08 Incident Report
+# 2026-03-04 Incident Report
 
 - Incident Commander: @ryanaslett
 - Severity Level: P1
@@ -7,9 +7,9 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 ## Timeline
 
-- **2026-03-08 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
+- **2026-03-04 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
 
-- **2026-03-08 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (serving most users at `www.`) with the outdated file.
+- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (serving most users at `www.`) with the outdated file.
 
 - **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
From e10ed51c41031e2e28816f0dcd7eb2572bffec55 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:03:48 -0400
Subject: [PATCH 09/18] Rename 2026-03-08.md to 2026-03-04.md

---
 incidents/{2026-03-08.md => 2026-03-04.md} | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 rename incidents/{2026-03-08.md => 2026-03-04.md} (100%)

diff --git a/incidents/2026-03-08.md b/incidents/2026-03-04.md
similarity index 100%
rename from incidents/2026-03-08.md
rename to incidents/2026-03-04.md
From 370cb5cab5630f0ae755ce67ec54d6ca31fbc0e3 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:05:57 -0400
Subject: [PATCH 10/18] Update 2026-03-04.md

---
 incidents/2026-03-04.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index d511f3b..8710442 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -15,7 +15,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 - **2026-03-08 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
 
-- **2026-03-08 11:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
+- **2026-03-08 23:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.
 
 - **2026-03-09 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).
 
From b51c4c2f872636d8b8532fcd60b153a31403dfc8 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:13:54 -0400
Subject: [PATCH 11/18] Apply suggestions from code review

Co-authored-by: Matt Cowley
---
 incidents/2026-03-04.md | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index 8710442..a7bf3ea 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -7,10 +7,11 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 ## Timeline
 
-- **2026-03-04 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.
+- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct and the R2 dist-staging bucket.
 
-- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org, but rclone step to R2 failed, leaving R2 (serving most users at `www.`) with the outdated file.
+- **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
 
+- **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket, including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
 - **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
 - **2026-03-08 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
@@ -37,27 +38,30 @@ A workflow issue in the Jenkins release process allowed files to become out of s
 
 The release process works as follows:
 1. Jenkins builds the macOS package and signs it
-2. The package is copied to direct.nodejs.org via `scp`
-3. Jenkins SSHs into direct and uses `rclone` to copy the file from www to R2 dist-staging
+2. The package is copied to direct via `scp`
+3. Jenkins SSHs into direct and uses `rclone` to copy the file to the R2 dist-staging bucket
+4. Releaser runs script which SSHs into direct and copies files from the R2 dist-staging bucket to the R2 dist-prod bucket
+5. Script generates `SHASUMS256.txt` based on files on direct, not R2, and writes this to the R2 dist-prod bucket
 
 During the v22.22.1 release:
-1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2
+1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2 staging
 2. The job was re-run, producing a new signed package at 21:00 UTC
 3. The second run successfully copied the new package to direct
-4. The `rclone` step to R2 failed with `kex_exchange_identification: Connection closed by remote host`
+4. The `rclone` step to R2 staging failed with `kex_exchange_identification: Connection closed by remote host`
 5. The Jenkins job marked the build as failed but did not roll back the direct upload
+6. Releaser ran script, which promoted the original package from R2 staging to prod, but generated `SHASUMS256.txt` based on the regenerated package on direct
 
-This left `direct.` with the correct file (matching SHASUMS256.txt) while R2 served the outdated file, creating a checksum mismatch for most users.
+This left direct with matching package and `SHASUMS256.txt` files, but the R2 prod bucket with the outdated package file, creating a checksum mismatch for most users.
 
 ## Fix
 
-The immediate fix was to manually sync the correct file from direct.nodejs.org to the R2 dist-staging bucket using `rclone copyto`.
+The immediate fix was to manually sync the correct file from direct to the R2 dist-staging bucket using `rclone copyto`, and then to the R2 dist-prod bucket.
 
 ## Follow-up Work
 
 - Improve Jenkins workflow to prevent partial uploads when rclone fails
-  - Either roll back www uploads if R2 sync fails, or upload to both destinations atomically
-  - Add verification step to compare checksums between www and R2 before marking build as complete
+  - Either roll back direct uploads if R2 sync fails, or upload to both destinations atomically
+  - Add verification step to compare checksums between direct and R2 before marking build as complete
 - Add monitoring/alerting for checksum mismatches between distribution sources
 - Investigate why the rclone SSH connection failed mid-release
 - Consider adding checksum verification as part of the promotion workflow
From 76099bf7307698280ff5d53aa631f13ddb4192b0 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:14:23 -0400
Subject: [PATCH 12/18] Update incident report for March 2026

---
 incidents/2026-03-04.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index a7bf3ea..b9c3419 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -12,6 +12,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 - **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
 
 - **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket, including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
+
 - **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
 - **2026-03-08 12:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.
From 82b7e0f1112536ba7b3433cff41f5fd1f52cd6da Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:22:09 -0400
Subject: [PATCH 13/18] Update 2026-03-04.md

---
 incidents/2026-03-04.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index b9c3419..ca0f118 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -7,11 +7,11 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 ## Timeline
 
-- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct and the R2 dist-staging bucket.
+- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct (`direct.nodejs.org`) and the R2 dist-staging bucket.
 
 - **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
 
-- **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket, including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
+- **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket (`r2.nodejs.org`), including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
 
 - **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
From 95682bd1d5856e676818f1866b43a2d4b2df251f Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:22:28 -0400
Subject: [PATCH 14/18] Update 2026-03-04.md

Co-authored-by: Matt Cowley
---
 incidents/2026-03-04.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index ca0f118..1fa7b52 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -66,4 +66,5 @@ The immediate fix was to manually sync the correct file from direct to the R2 di
 - Add monitoring/alerting for checksum mismatches between distribution sources
 - Investigate why the rclone SSH connection failed mid-release
 - Consider adding checksum verification as part of the promotion workflow
+- Generate checksums based on R2 dist-prod contents rather than direct
 - Add better logging/auditing for release builds to track which artifacts were uploaded where and when
From 1a0db65f0a9e47207ca567e89d483a445b451b78 Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:23:30 -0400
Subject: [PATCH 15/18] Update 2026-03-04.md

---
 incidents/2026-03-04.md | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index 1fa7b52..caf8f09 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -23,9 +23,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 - **2026-03-09 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to direct but failed to sync to R2, causing version mismatch.
 
-- **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted.
-
-- **2026-03-09 01:29 UTC**: Cache purged. Impact resolved.
+- **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted. Impact resolved shortly after.
 
 ## Impact
 
From 7bcd57a975e4e9100ac25f1b609cf6e52a002d9e Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:59:44 -0400
Subject: [PATCH 16/18] Apply suggestions from code review

Co-authored-by: Matt Cowley
Co-authored-by: flakey5 <73616808+flakey5@users.noreply.github.com>
---
 incidents/2026-03-04.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index caf8f09..b3e2e1f 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -11,7 +11,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 - **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
 
-- **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket (`r2.nodejs.org`), including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
+- **2026-03-05 14:30 UTC**: Start of impact. Promotion script ran, copying the release assets from the R2 dist-staging bucket to the R2 dist-prod bucket (which serves `nodejs.org`), including `node-v22.22.1.pkg` from the first Jenkins run. `SHASUMS256.txt` generated based on assets on direct, including `node-v22.22.1.pkg` from the second Jenkins run.
 
 - **2026-03-08 10:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.
 
@@ -66,3 +66,4 @@ The immediate fix was to manually sync the correct file from direct to the R2 di
 - Consider adding checksum verification as part of the promotion workflow
 - Generate checksums based on R2 dist-prod contents rather than direct
 - Add better logging/auditing for release builds to track which artifacts were uploaded where and when
+- Create or make known what documentation/sources of truth to point to for any further incidents like this
From 008249dd67af11465f1d65756396868545d2c64b Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 22:59:57 -0400
Subject: [PATCH 17/18] Update incidents/2026-03-04.md

Co-authored-by: Matt Cowley
---
 incidents/2026-03-04.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index b3e2e1f..88ca8bd 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -7,7 +7,7 @@ For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22
 
 ## Timeline
 
-- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct (`direct.nodejs.org`) and the R2 dist-staging bucket.
+- **2026-03-04 17:14 UTC**: First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to direct (backup origin server) and the R2 dist-staging bucket.
 
 - **2026-03-04 21:00 UTC**: Second Jenkins build completed, uploading recreated `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct, but failing to write to the R2 dist-staging bucket.
 
From 1c07d926354fdcbf82980cb2ba2398f9c7f9585a Mon Sep 17 00:00:00 2001
From: Aviv Keller
Date: Sun, 8 Mar 2026 23:02:32 -0400
Subject: [PATCH 18/18] Update wording on time period

---
 incidents/2026-03-04.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/incidents/2026-03-04.md b/incidents/2026-03-04.md
index 88ca8bd..3afebcf 100644
--- a/incidents/2026-03-04.md
+++ b/incidents/2026-03-04.md
@@ -3,7 +3,7 @@
 - Incident Commander: @ryanaslett
 - Severity Level: P1
 
-For a brief period of time, the macOS installer package (`.pkg`) for Node.js v22.22.1 served a duplicate file with a mismatched SHA256 checksum due to a failed rclone upload step during a Jenkins job re-run. While having a different hash, this file has been generated and signed legitimately by Node.js' CI and was safe to run.
+For several days following the release announcement, the macOS installer package (`.pkg`) for Node.js v22.22.1 served a duplicate file with a mismatched SHA256 checksum due to a failed rclone upload step during a Jenkins job re-run. While having a different hash, this file has been generated and signed legitimately by Node.js' CI and was safe to run.
 
 ## Timeline
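
Note on the follow-up items about checksum verification: the report proposes comparing checksums before a release is considered complete, but does not say how. The Python below is only a minimal sketch of such a check, not the project's actual tooling. It assumes `SHASUMS256.txt` uses the usual `sha256sum` layout (hex digest, whitespace, file name) and that the assets to verify have already been downloaded into a local directory. The function names `sha256_of`, `parse_shasums`, and `find_mismatches` are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_shasums(text):
    """Parse sha256sum-style text into a {file name: digest} mapping."""
    entries = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            checksum, name = parts
            # sha256sum marks binary-mode entries with a leading "*".
            entries[name.strip().lstrip("*")] = checksum.lower()
    return entries


def find_mismatches(shasums_text, directory):
    """Return (name, expected, actual) for every listed file whose digest
    differs from the published one. `actual` is None if the file is missing."""
    mismatches = []
    for name, expected in parse_shasums(shasums_text).items():
        path = Path(directory) / name
        actual = sha256_of(path) if path.is_file() else None
        if actual != expected:
            mismatches.append((name, expected, actual))
    return mismatches
```

Run against assets fetched from the R2 dist-prod bucket, a non-empty result would have flagged `node-v22.22.1.pkg` before promotion. How the assets are fetched (for example with `rclone`) is deliberately left out of the sketch.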