fix(security): block S3 bucket takeover via unsafe example defaults#139
fix(security): block S3 bucket takeover via unsafe example defaults#139kamir wants to merge 2 commits intoKafScale:mainfrom
Conversation
A security researcher registered the `kafscale-lfs` S3 bucket on AWS — the same name used as an example throughout the OpenAPI specs and docs. Any deployment using this default would silently route LFS uploads to an attacker-controlled bucket, enabling data exfiltration and content tampering. Three changes: 1. Replace all `kafscale-lfs` example values in both OpenAPI specs (cmd/proxy/ and api/lfs-proxy/) with safe placeholder `my-bucket`. 2. Add runtime blocklist in initLFSModule(): the proxy now refuses to start if the configured S3 bucket matches any known-unsafe example name. This protects existing deployments that upgrade the binary but don't update their config. 3. Require KAFSCALE_LFS_PROXY_S3_BUCKET and KAFSCALE_LFS_PROXY_S3_REGION at startup when LFS is enabled. Previously empty values were silently accepted, causing downstream failures or unsafe fallback behavior. Reported-by: Milan Katwal <milankatwal2056@gmail.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Design Notes (preserved from PR preparation)Why a blocklist?Removing the examples from the spec fixes future copy-paste. But the bucket name The blocklist is a runtime safety net for existing deployments: even if someone upgrades the proxy binary but doesn't touch their config, the proxy refuses to start instead of silently routing data to an attacker-controlled bucket. The bucket name is permanently compromised — it can never be safe again, regardless of who controls it now. The code must reject it forever. This is the same pattern used by credential scanners that block known-leaked API keys even after rotation. Breaking changeDeployments that relied on an empty Test plan
ContextThis fix addresses a reported S3 bucket takeover vulnerability. A security researcher registered the |
Scope expansion — adding LFS download integrity fixPer review feedback, consolidating follow-up fixes into this PR rather than opening a separate one. Additional changes being added:LFS download integrity verification — closes the tampering vector that complements the exfiltration fix already in this PR. Current behavior (still vulnerable after just the blocklist fix):
Planned fix:
Trust model being enforced:
The envelope lives in Kafka (attacker can't modify it). Downloaded bytes from S3 are verified against the Kafka-held checksum on every read. Breaking change (intentional):Default Full specDetailed API behavior, acceptance criteria, and out-of-scope items are in the forthcoming commit. The broader platform-wide integrity audit (broker segment CRC, RecordBatch CRC32C, S3 checksums) is tracked separately in Pushing commits shortly. |
…hecksum The LFS envelope records a SHA-256 checksum at upload time, but the download handler never verified it. A consumer receiving tampered content from a compromised S3 bucket had no way to detect the modification. This closes the tampering vector that complements the exfiltration fix already in this PR. Trust model enforced: Kafka is the authority. S3 is untrusted storage. The envelope lives in Kafka (attacker cannot modify it). Downloaded bytes from S3 are verified against the Kafka-held checksum on every read. Behavior changes: * **Default mode is now `stream`** (was `presign`). Secure-by-default. * **Stream mode hashes bytes on the fly** via io.TeeReader with SHA-256, compares to the client-supplied envelope checksum. On mismatch the TCP connection is aborted (http.ErrAbortHandler / Hijack+Close) so the client sees a truncated response and cannot mistake the bytes for verified content. HTTP trailers intentionally avoided — poor client support. * **Presign mode is disabled by default**, opt-in via KAFSCALE_LFS_PROXY_PRESIGN_ENABLED=true. When enabled, proxy logs a warning banner at startup and every 5 minutes. Response body includes an `integrity` block (sha256, checksum_alg, size) so client SDKs can verify downloaded bytes themselves. * **Integrity claim is required** on all download requests. Missing integrity.sha256 → 400 missing_integrity. * **Advisory response headers** on stream mode: X-Kafscale-LFS-Checksum and X-Kafscale-LFS-Content-Length. Set before body flush so advanced clients can hash independently and fail-fast. * **New ops-tracker event** download_integrity_failed emits to __lfs_ops_state with expected/actual SHA-256, bytes read, and mode — gives operators a Kafka-native alerting signal for tampering attempts. Breaking change (intentional): clients that did not send an integrity block must add one. Clients that relied on the default mode being presign must either request presign explicitly (and have the operator opt in) or switch to stream. OpenAPI specs (cmd/proxy/openapi.yaml and api/lfs-proxy/openapi.yaml) updated with the new schema, IntegrityClaim component, presign_disabled error code, and safe example values (no real bucket names). Tests: * TestHandleHTTPDownload_StreamVerifyMatch — happy path * TestHandleHTTPDownload_StreamVerifyMismatch — aborts on tampering * TestHandleHTTPDownload_PresignMode — requires integrity, echoes it back * TestHandleHTTPDownload_PresignDisabledByDefault — 400 when not opted in * TestHandleHTTPDownload_MissingIntegrity — 400 when claim absent * TestHandleHTTPDownload_DefaultModeIsStream — confirms new default All 26 package test suites pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
The patch does not guarantee “truncated response on mismatch”, verification has to happen before sending the response, or the protocol needs an explicit post-body failure signal. |
Summary
A security researcher (Milan Katwal) registered the
kafscale-lfsS3 bucket on AWS — the exact name used as an example in our OpenAPI specs. Any deployment using this default silently routes all LFS uploads to an attacker-controlled bucket.Impact: data exfiltration, content tampering via presigned URLs, integrity bypass (no checksum on download).
This PR fixes the vulnerability with three defense layers:
S3_BUCKETandS3_REGIONmust now be explicitly setWhy a blocklist?
Removing the examples from the spec fixes future copy-paste. But the bucket name
kafscale-lfshas been public in this repo since commit1b5dbd8. Every deployment that was configured by copying from those docs — Helm values, docker-compose files, CI scripts, customer PoCs — may still havekafscale-lfshardwired in config files.The blocklist is a runtime safety net for existing deployments: even if someone upgrades the proxy binary but doesn't touch their config, the proxy refuses to start instead of silently routing data to an attacker-controlled bucket.
The bucket name is permanently compromised — it can never be safe again, regardless of who controls it now. The code must reject it forever. This is the same pattern used by credential scanners that block known-leaked API keys even after rotation.
Changes
cmd/proxy/lfs.goblockedBucketNamesslice + validation ininitLFSModule(). RequireS3_BUCKETandS3_REGIONto be non-empty.cmd/proxy/openapi.yamlkafscale-lfs→my-bucketapi/lfs-proxy/openapi.yamlkafscale-lfs→my-bucketpkg/lfs/doc.gokafscale-lfs→my-bucketin doc examplespkg/lfs/envelope_test.gotest-bucketBreaking change
Deployments that relied on an empty
KAFSCALE_LFS_PROXY_S3_BUCKETwill now fail at startup with a clear error message. This is intentional — an empty bucket was never valid, it just failed silently downstream.Test plan
go build ./cmd/proxy/passesgo vet ./pkg/lfs/passesgo test ./pkg/lfs/...passesKAFSCALE_LFS_PROXY_S3_BUCKET=kafscale-lfs→ exits with blocklist errorKAFSCALE_LFS_PROXY_S3_BUCKET→ exits with "required" errorkafscale-lfsbucket references (only in blocklist)Related
scalytics-all-in-one/PLANS/plan-001.md🤖 Generated with Claude Code