Skip to content

fix(security): block S3 bucket takeover via unsafe example defaults#139

Open
kamir wants to merge 2 commits intoKafScale:mainfrom
kamir:fix/s3-bucket-takeover-cve
Open

fix(security): block S3 bucket takeover via unsafe example defaults#139
kamir wants to merge 2 commits intoKafScale:mainfrom
kamir:fix/s3-bucket-takeover-cve

Conversation

@kamir
Copy link
Copy Markdown
Collaborator

@kamir kamir commented Apr 17, 2026

Summary

A security researcher (Milan Katwal) registered the kafscale-lfs S3 bucket on AWS — the exact name used as an example in our OpenAPI specs. Any deployment using this default silently routes all LFS uploads to an attacker-controlled bucket.

Impact: data exfiltration, content tampering via presigned URLs, integrity bypass (no checksum on download).

This PR fixes the vulnerability with three defense layers:

  • Remove unsafe examples from both OpenAPI specs and Go doc/test files
  • Runtime blocklist that rejects known-compromised bucket names at startup
  • Required field enforcementS3_BUCKET and S3_REGION must now be explicitly set

Why a blocklist?

Removing the examples from the spec fixes future copy-paste. But the bucket name kafscale-lfs has been public in this repo since commit 1b5dbd8. Every deployment that was configured by copying from those docs — Helm values, docker-compose files, CI scripts, customer PoCs — may still have kafscale-lfs hardwired in config files.

The blocklist is a runtime safety net for existing deployments: even if someone upgrades the proxy binary but doesn't touch their config, the proxy refuses to start instead of silently routing data to an attacker-controlled bucket.

The bucket name is permanently compromised — it can never be safe again, regardless of who controls it now. The code must reject it forever. This is the same pattern used by credential scanners that block known-leaked API keys even after rotation.

Changes

File Change
cmd/proxy/lfs.go Add blockedBucketNames slice + validation in initLFSModule(). Require S3_BUCKET and S3_REGION to be non-empty.
cmd/proxy/openapi.yaml Replace 6× kafscale-lfsmy-bucket
api/lfs-proxy/openapi.yaml Replace 6× kafscale-lfsmy-bucket
pkg/lfs/doc.go Replace 2× kafscale-lfsmy-bucket in doc examples
pkg/lfs/envelope_test.go Replace test fixture bucket → test-bucket

Breaking change

Deployments that relied on an empty KAFSCALE_LFS_PROXY_S3_BUCKET will now fail at startup with a clear error message. This is intentional — an empty bucket was never valid, it just failed silently downstream.

Test plan

  • go build ./cmd/proxy/ passes
  • go vet ./pkg/lfs/ passes
  • go test ./pkg/lfs/... passes
  • Start proxy with KAFSCALE_LFS_PROXY_S3_BUCKET=kafscale-lfs → exits with blocklist error
  • Start proxy with empty KAFSCALE_LFS_PROXY_S3_BUCKET → exits with "required" error
  • Start proxy with valid bucket + region → starts normally
  • Grep confirms no remaining kafscale-lfs bucket references (only in blocklist)

Related

  • Reported via email to security@novatechflow.com on 2026-04-17
  • Follows up on broader config hardening plan: scalytics-all-in-one/PLANS/plan-001.md

🤖 Generated with Claude Code

A security researcher registered the `kafscale-lfs` S3 bucket on AWS —
the same name used as an example throughout the OpenAPI specs and docs.
Any deployment using this default would silently route LFS uploads to an
attacker-controlled bucket, enabling data exfiltration and content
tampering.

Three changes:

1. Replace all `kafscale-lfs` example values in both OpenAPI specs
   (cmd/proxy/ and api/lfs-proxy/) with safe placeholder `my-bucket`.

2. Add runtime blocklist in initLFSModule(): the proxy now refuses to
   start if the configured S3 bucket matches any known-unsafe example
   name. This protects existing deployments that upgrade the binary but
   don't update their config.

3. Require KAFSCALE_LFS_PROXY_S3_BUCKET and KAFSCALE_LFS_PROXY_S3_REGION
   at startup when LFS is enabled. Previously empty values were silently
   accepted, causing downstream failures or unsafe fallback behavior.

Reported-by: Milan Katwal <milankatwal2056@gmail.com>

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@kamir
Copy link
Copy Markdown
Collaborator Author

kamir commented Apr 17, 2026

Design Notes (preserved from PR preparation)

Why a blocklist?

Removing the examples from the spec fixes future copy-paste. But the bucket name kafscale-lfs has been public in this repo since commit 1b5dbd8. Every deployment that was configured by copying from those docs — Helm values, docker-compose files, CI scripts, customer PoCs — may still have kafscale-lfs hardwired in config files.

The blocklist is a runtime safety net for existing deployments: even if someone upgrades the proxy binary but doesn't touch their config, the proxy refuses to start instead of silently routing data to an attacker-controlled bucket.

The bucket name is permanently compromised — it can never be safe again, regardless of who controls it now. The code must reject it forever. This is the same pattern used by credential scanners that block known-leaked API keys even after rotation.

Breaking change

Deployments that relied on an empty KAFSCALE_LFS_PROXY_S3_BUCKET will now fail at startup with a clear error message. This is intentional — an empty bucket was never valid, it just failed silently downstream.

Test plan

  • go build ./cmd/proxy/ passes
  • go vet ./pkg/lfs/ passes
  • go test ./pkg/lfs/... passes
  • Start proxy with KAFSCALE_LFS_PROXY_S3_BUCKET=kafscale-lfs → exits with blocklist error
  • Start proxy with empty KAFSCALE_LFS_PROXY_S3_BUCKET → exits with "required" error
  • Start proxy with valid bucket + region → starts normally
  • Grep confirms no remaining kafscale-lfs bucket references (only in blocklist)

Context

This fix addresses a reported S3 bucket takeover vulnerability. A security researcher registered the kafscale-lfs bucket on AWS — the exact name used as examples in our OpenAPI specs. The broader config hardening plan (centralized config package, startup validation for all binaries, console auth fix) is tracked in scalytics-all-in-one/PLANS/plan-001.md.

@kamir
Copy link
Copy Markdown
Collaborator Author

kamir commented Apr 17, 2026

Scope expansion — adding LFS download integrity fix

Per review feedback, consolidating follow-up fixes into this PR rather than opening a separate one.

Additional changes being added:

LFS download integrity verification — closes the tampering vector that complements the exfiltration fix already in this PR.

Current behavior (still vulnerable after just the blocklist fix):

  • Any attacker/insider with S3 write access can tamper with LFS objects post-upload
  • Neither stream nor presign mode verifies the envelope's sha256 checksum

Planned fix:

  • Stream mode becomes default. Hashes bytes on the fly (io.TeeReader), compares to envelope checksum, aborts TCP connection on mismatch (no trailers — poor client support). Emits integrity_failure event to __lfs_ops_state tracker.
  • Presign mode is disabled by default. Opt-in via KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. When enabled, proxy logs a warning banner at startup and every 5 minutes. Response body includes integrity block with sha256/checksum_alg/size so SDK can verify client-side.
  • Advisory response headers on stream mode: X-Kafscale-LFS-Checksum and X-Kafscale-LFS-Content-Length — let advanced clients hash independently and fail-fast without waiting for connection close.
  • Updated OpenAPI spec reflects new default, presign_disabled error code, integrity block.

Trust model being enforced:

Kafka is the authority. S3 is untrusted storage.

The envelope lives in Kafka (attacker can't modify it). Downloaded bytes from S3 are verified against the Kafka-held checksum on every read.

Breaking change (intentional):

Default mode changes from presignstream. Existing clients that hardcoded mode: presign must also have the operator set KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. This is secure-by-default by design.

Full spec

Detailed API behavior, acceptance criteria, and out-of-scope items are in the forthcoming commit. The broader platform-wide integrity audit (broker segment CRC, RecordBatch CRC32C, S3 checksums) is tracked separately in scalytics-all-in-one/PLANS/plan-002.md and will be addressed in follow-up PRs.

Pushing commits shortly.

…hecksum

The LFS envelope records a SHA-256 checksum at upload time, but the download
handler never verified it. A consumer receiving tampered content from a
compromised S3 bucket had no way to detect the modification. This closes
the tampering vector that complements the exfiltration fix already in this
PR.

Trust model enforced:
  Kafka is the authority. S3 is untrusted storage.

The envelope lives in Kafka (attacker cannot modify it). Downloaded bytes
from S3 are verified against the Kafka-held checksum on every read.

Behavior changes:

* **Default mode is now `stream`** (was `presign`). Secure-by-default.
* **Stream mode hashes bytes on the fly** via io.TeeReader with SHA-256,
  compares to the client-supplied envelope checksum. On mismatch the TCP
  connection is aborted (http.ErrAbortHandler / Hijack+Close) so the client
  sees a truncated response and cannot mistake the bytes for verified
  content. HTTP trailers intentionally avoided — poor client support.
* **Presign mode is disabled by default**, opt-in via
  KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. When enabled, proxy logs a
  warning banner at startup and every 5 minutes. Response body includes
  an `integrity` block (sha256, checksum_alg, size) so client SDKs can
  verify downloaded bytes themselves.
* **Integrity claim is required** on all download requests. Missing
  integrity.sha256 → 400 missing_integrity.
* **Advisory response headers** on stream mode: X-Kafscale-LFS-Checksum
  and X-Kafscale-LFS-Content-Length. Set before body flush so advanced
  clients can hash independently and fail-fast.
* **New ops-tracker event** download_integrity_failed emits to
  __lfs_ops_state with expected/actual SHA-256, bytes read, and mode —
  gives operators a Kafka-native alerting signal for tampering attempts.

Breaking change (intentional): clients that did not send an integrity
block must add one. Clients that relied on the default mode being presign
must either request presign explicitly (and have the operator opt in) or
switch to stream.

OpenAPI specs (cmd/proxy/openapi.yaml and api/lfs-proxy/openapi.yaml)
updated with the new schema, IntegrityClaim component, presign_disabled
error code, and safe example values (no real bucket names).

Tests:
* TestHandleHTTPDownload_StreamVerifyMatch — happy path
* TestHandleHTTPDownload_StreamVerifyMismatch — aborts on tampering
* TestHandleHTTPDownload_PresignMode — requires integrity, echoes it back
* TestHandleHTTPDownload_PresignDisabledByDefault — 400 when not opted in
* TestHandleHTTPDownload_MissingIntegrity — 400 when claim absent
* TestHandleHTTPDownload_DefaultModeIsStream — confirms new default

All 26 package test suites pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@novatechflow
Copy link
Copy Markdown
Collaborator

streamDownloadWithVerify writes the full body with io.Copy and sets Content-Length before comparing the digest. If the object is tampered but keeps the same size, the client can still receive a complete 200 OK response and treat it as success. Closing the connection afterward does not reliably turn that into a failed download.

The patch does not guarantee “truncated response on mismatch”, verification has to happen before sending the response, or the protocol needs an explicit post-body failure signal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants