fix(security): block S3 bucket takeover via unsafe example defaults by kamir · Pull Request #139 · KafScale/platform

kamir · 2026-04-17T16:02:12Z

Summary

A security researcher (Milan Katwal) registered the kafscale-lfs S3 bucket on AWS — the exact name used as an example in our OpenAPI specs. Any deployment using this default silently routes all LFS uploads to an attacker-controlled bucket.

Impact: data exfiltration, content tampering via presigned URLs, integrity bypass (no checksum on download).

This PR fixes the vulnerability with three defense layers:

Remove unsafe examples from both OpenAPI specs and Go doc/test files
Runtime blocklist that rejects known-compromised bucket names at startup
Required field enforcement — S3_BUCKET and S3_REGION must now be explicitly set

Why a blocklist?

Removing the examples from the spec fixes future copy-paste. But the bucket name kafscale-lfs has been public in this repo since commit 1b5dbd8. Every deployment that was configured by copying from those docs — Helm values, docker-compose files, CI scripts, customer PoCs — may still have kafscale-lfs hardwired in config files.

The blocklist is a runtime safety net for existing deployments: even if someone upgrades the proxy binary but doesn't touch their config, the proxy refuses to start instead of silently routing data to an attacker-controlled bucket.

The bucket name is permanently compromised — it can never be safe again, regardless of who controls it now. The code must reject it forever. This is the same pattern used by credential scanners that block known-leaked API keys even after rotation.

Changes

File	Change
`cmd/proxy/lfs.go`	Add `blockedBucketNames` slice + validation in `initLFSModule()`. Require `S3_BUCKET` and `S3_REGION` to be non-empty.
`cmd/proxy/openapi.yaml`	Replace 6× `kafscale-lfs` → `my-bucket`
`api/lfs-proxy/openapi.yaml`	Replace 6× `kafscale-lfs` → `my-bucket`
`pkg/lfs/doc.go`	Replace 2× `kafscale-lfs` → `my-bucket` in doc examples
`pkg/lfs/envelope_test.go`	Replace test fixture bucket → `test-bucket`

Breaking change

Deployments that relied on an empty KAFSCALE_LFS_PROXY_S3_BUCKET will now fail at startup with a clear error message. This is intentional — an empty bucket was never valid, it just failed silently downstream.

Test plan

go build ./cmd/proxy/ passes
go vet ./pkg/lfs/ passes
go test ./pkg/lfs/... passes
Start proxy with KAFSCALE_LFS_PROXY_S3_BUCKET=kafscale-lfs → exits with blocklist error
Start proxy with empty KAFSCALE_LFS_PROXY_S3_BUCKET → exits with "required" error
Start proxy with valid bucket + region → starts normally
Grep confirms no remaining kafscale-lfs bucket references (only in blocklist)

A security researcher registered the `kafscale-lfs` S3 bucket on AWS — the same name used as an example throughout the OpenAPI specs and docs. Any deployment using this default would silently route LFS uploads to an attacker-controlled bucket, enabling data exfiltration and content tampering. Three changes: 1. Replace all `kafscale-lfs` example values in both OpenAPI specs (cmd/proxy/ and api/lfs-proxy/) with safe placeholder `my-bucket`. 2. Add runtime blocklist in initLFSModule(): the proxy now refuses to start if the configured S3 bucket matches any known-unsafe example name. This protects existing deployments that upgrade the binary but don't update their config. 3. Require KAFSCALE_LFS_PROXY_S3_BUCKET and KAFSCALE_LFS_PROXY_S3_REGION at startup when LFS is enabled. Previously empty values were silently accepted, causing downstream failures or unsafe fallback behavior. Reported-by: Milan Katwal <milankatwal2056@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

kamir · 2026-04-17T16:04:59Z

Design Notes (preserved from PR preparation)

Why a blocklist?

Removing the examples from the spec fixes future copy-paste. But the bucket name kafscale-lfs has been public in this repo since commit 1b5dbd8. Every deployment that was configured by copying from those docs — Helm values, docker-compose files, CI scripts, customer PoCs — may still have kafscale-lfs hardwired in config files.

The blocklist is a runtime safety net for existing deployments: even if someone upgrades the proxy binary but doesn't touch their config, the proxy refuses to start instead of silently routing data to an attacker-controlled bucket.

The bucket name is permanently compromised — it can never be safe again, regardless of who controls it now. The code must reject it forever. This is the same pattern used by credential scanners that block known-leaked API keys even after rotation.

Breaking change

Deployments that relied on an empty KAFSCALE_LFS_PROXY_S3_BUCKET will now fail at startup with a clear error message. This is intentional — an empty bucket was never valid, it just failed silently downstream.

Test plan

go build ./cmd/proxy/ passes
go vet ./pkg/lfs/ passes
go test ./pkg/lfs/... passes
Start proxy with KAFSCALE_LFS_PROXY_S3_BUCKET=kafscale-lfs → exits with blocklist error
Start proxy with empty KAFSCALE_LFS_PROXY_S3_BUCKET → exits with "required" error
Start proxy with valid bucket + region → starts normally
Grep confirms no remaining kafscale-lfs bucket references (only in blocklist)

Context

This fix addresses a reported S3 bucket takeover vulnerability. A security researcher registered the kafscale-lfs bucket on AWS — the exact name used as examples in our OpenAPI specs. The broader config hardening plan (centralized config package, startup validation for all binaries, console auth fix) is tracked in scalytics-all-in-one/PLANS/plan-001.md.

kamir · 2026-04-17T17:22:47Z

Scope expansion — adding LFS download integrity fix

Per review feedback, consolidating follow-up fixes into this PR rather than opening a separate one.

Additional changes being added:

LFS download integrity verification — closes the tampering vector that complements the exfiltration fix already in this PR.

Current behavior (still vulnerable after just the blocklist fix):

Any attacker/insider with S3 write access can tamper with LFS objects post-upload
Neither stream nor presign mode verifies the envelope's sha256 checksum

Planned fix:

Stream mode becomes default. Hashes bytes on the fly (io.TeeReader), compares to envelope checksum, aborts TCP connection on mismatch (no trailers — poor client support). Emits integrity_failure event to __lfs_ops_state tracker.
Presign mode is disabled by default. Opt-in via KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. When enabled, proxy logs a warning banner at startup and every 5 minutes. Response body includes integrity block with sha256/checksum_alg/size so SDK can verify client-side.
Advisory response headers on stream mode: X-Kafscale-LFS-Checksum and X-Kafscale-LFS-Content-Length — let advanced clients hash independently and fail-fast without waiting for connection close.
Updated OpenAPI spec reflects new default, presign_disabled error code, integrity block.

Trust model being enforced:

Kafka is the authority. S3 is untrusted storage.

The envelope lives in Kafka (attacker can't modify it). Downloaded bytes from S3 are verified against the Kafka-held checksum on every read.

Breaking change (intentional):

Default mode changes from presign → stream. Existing clients that hardcoded mode: presign must also have the operator set KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. This is secure-by-default by design.

Full spec

Detailed API behavior, acceptance criteria, and out-of-scope items are in the forthcoming commit. The broader platform-wide integrity audit (broker segment CRC, RecordBatch CRC32C, S3 checksums) is tracked separately in scalytics-all-in-one/PLANS/plan-002.md and will be addressed in follow-up PRs.

Pushing commits shortly.

…hecksum The LFS envelope records a SHA-256 checksum at upload time, but the download handler never verified it. A consumer receiving tampered content from a compromised S3 bucket had no way to detect the modification. This closes the tampering vector that complements the exfiltration fix already in this PR. Trust model enforced: Kafka is the authority. S3 is untrusted storage. The envelope lives in Kafka (attacker cannot modify it). Downloaded bytes from S3 are verified against the Kafka-held checksum on every read. Behavior changes: * **Default mode is now `stream`** (was `presign`). Secure-by-default. * **Stream mode hashes bytes on the fly** via io.TeeReader with SHA-256, compares to the client-supplied envelope checksum. On mismatch the TCP connection is aborted (http.ErrAbortHandler / Hijack+Close) so the client sees a truncated response and cannot mistake the bytes for verified content. HTTP trailers intentionally avoided — poor client support. * **Presign mode is disabled by default**, opt-in via KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. When enabled, proxy logs a warning banner at startup and every 5 minutes. Response body includes an `integrity` block (sha256, checksum_alg, size) so client SDKs can verify downloaded bytes themselves. * **Integrity claim is required** on all download requests. Missing integrity.sha256 → 400 missing_integrity. * **Advisory response headers** on stream mode: X-Kafscale-LFS-Checksum and X-Kafscale-LFS-Content-Length. Set before body flush so advanced clients can hash independently and fail-fast. * **New ops-tracker event** download_integrity_failed emits to __lfs_ops_state with expected/actual SHA-256, bytes read, and mode — gives operators a Kafka-native alerting signal for tampering attempts. Breaking change (intentional): clients that did not send an integrity block must add one. Clients that relied on the default mode being presign must either request presign explicitly (and have the operator opt in) or switch to stream. OpenAPI specs (cmd/proxy/openapi.yaml and api/lfs-proxy/openapi.yaml) updated with the new schema, IntegrityClaim component, presign_disabled error code, and safe example values (no real bucket names). Tests: * TestHandleHTTPDownload_StreamVerifyMatch — happy path * TestHandleHTTPDownload_StreamVerifyMismatch — aborts on tampering * TestHandleHTTPDownload_PresignMode — requires integrity, echoes it back * TestHandleHTTPDownload_PresignDisabledByDefault — 400 when not opted in * TestHandleHTTPDownload_MissingIntegrity — 400 when claim absent * TestHandleHTTPDownload_DefaultModeIsStream — confirms new default All 26 package test suites pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

novatechflow · 2026-04-18T09:00:54Z

streamDownloadWithVerify writes the full body with io.Copy and sets Content-Length before comparing the digest. If the object is tampered but keeps the same size, the client can still receive a complete 200 OK response and treat it as success. Closing the connection afterward does not reliably turn that into a failed download.

The patch does not guarantee “truncated response on mismatch”, verification has to happen before sending the response, or the protocol needs an explicit post-body failure signal.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(security): block S3 bucket takeover via unsafe example defaults#139

fix(security): block S3 bucket takeover via unsafe example defaults#139
kamir wants to merge 2 commits intoKafScale:mainfrom
kamir:fix/s3-bucket-takeover-cve

kamir commented Apr 17, 2026

Uh oh!

kamir commented Apr 17, 2026

Uh oh!

kamir commented Apr 17, 2026

Uh oh!

novatechflow commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kamir commented Apr 17, 2026

Summary

Why a blocklist?

Changes

Breaking change

Test plan

Related

Uh oh!

kamir commented Apr 17, 2026

Design Notes (preserved from PR preparation)

Why a blocklist?

Breaking change

Test plan

Context

Uh oh!

kamir commented Apr 17, 2026

Scope expansion — adding LFS download integrity fix

Additional changes being added:

Trust model being enforced:

Breaking change (intentional):

Full spec

Uh oh!

novatechflow commented Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants