fix(oom): prevent RSS growth during VCR recording + sanitizer dedup #6
matyas-jirat-keboola merged 10 commits into main
Conversation
…ation

- Add merge() to DefaultSanitizer, TokenSanitizer, HeaderSanitizer, QueryParamSanitizer, UrlPatternSanitizer, ResponseUrlSanitizer
- Add _dedup_sanitizers(), which collapses same-class sanitizers into one, so the sanitizer chain makes a single pass over each request/response instead of N passes when multiple sanitizers of the same type are present
- Add a pre-scan in DefaultSanitizer._sanitize_body: skip the full JSON parse when no sensitive field name or value is present in the body (fast path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
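The dedup step described above can be sketched roughly as follows. This is a simplified illustration, not the PR's actual code: `TokenSanitizer` here is a hypothetical minimal stand-in for the real classes in src/keboola/vcr/sanitizers.py, and `dedup_sanitizers` is an invented name mirroring `_dedup_sanitizers()`.

```python
from collections import OrderedDict


# Hypothetical minimal sanitizer, standing in for the real classes.
class TokenSanitizer:
    def __init__(self, tokens):
        # dict.fromkeys() dedupes while preserving insertion order
        self.tokens = list(dict.fromkeys(tokens))

    def merge(self, other):
        return TokenSanitizer(self.tokens + other.tokens)


def dedup_sanitizers(sanitizers):
    """Collapse same-class sanitizers into one via merge(), keeping
    first-seen order, so each request/response is scanned once per
    sanitizer class instead of once per sanitizer instance."""
    by_class = OrderedDict()
    for s in sanitizers:
        key = type(s)
        by_class[key] = by_class[key].merge(s) if key in by_class else s
    return list(by_class.values())
```

With two `TokenSanitizer` instances in the chain, the result is a single merged instance whose token list is deduplicated.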
Root cause: every HTTPS interaction during recording caused urllib3 to call load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of CA cert data into the OpenSSL C-heap (via jemalloc). jemalloc's high-watermark behaviour means this memory is never returned to the OS, producing ~7 MB RSS growth per 10 interactions and OOM for jobs with 500–2000 interactions.

Three root causes, three fixes applied inside _pool_reuse_patch():

1. Pool proliferation: components that create a new requests.Session per API call (e.g. Facebook Ads batch requests) each get their own urllib3.PoolManager and a fresh HTTPSConnectionPool, so N interactions → N pools → N ssl_context allocations. Fix: intercept PoolManager.connection_from_host and return one shared pool per (scheme, host, port) for the lifetime of the recording.

2. ssl_context per connect(): even with a single pool, urllib3 calls _new_conn() → connect() → _ssl_wrap_socket_and_match_hostname(ssl_context=None) for each interaction because requests 2.32.x never puts ssl_context into pool.conn_kw. When ssl_context is None, urllib3 creates a new context and calls load_default_certs() every time. Fix: inject one shared ssl.SSLContext (with load_default_certs() pre-called) into pool.conn_kw["ssl_context"] so all _new_conn() connections share it.

3. Connection churn: VCRConnection.is_connected delegated to real_connection.sock, which is None after Connection: close responses. urllib3 saw is_connected=False and called _new_conn() for every request. Fix: override is_connected on VCRConnection to always return True when real_connection exists (recording) or a VCRFakeSocket is present (replay).

Additionally, implement VCRHTTPResponse.release_conn() so urllib3's pool queue is refilled after each response and _new_conn() is not called unnecessarily.
Also: write large response body strings to the cassette file in 64 KB chunks (_write_interaction_chunked) to prevent multi-MB transient allocations from fragmenting glibc's heap during recording of APIs with large JSON payloads. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
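One stdlib way to achieve the same effect as chunked writing (a sketch of the idea, not the PR's _write_interaction_chunked; the function name dump_in_chunks is invented here) is json.JSONEncoder.iterencode, which yields the serialised document piecewise instead of building one giant string:

```python
import json


def dump_in_chunks(obj, f):
    # iterencode yields the serialised JSON in pieces, so no single
    # multi-MB string has to be materialised before writing to the file.
    for chunk in json.JSONEncoder().iterencode(obj):
        f.write(chunk)
```

This keeps peak transient allocations small when a cassette interaction carries a multi-MB response body.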
_vcr_is_connected uses _VCRFakeSocket via a global lookup (module-level function, not a closure). Deleting it with `del` before the property is ever called would raise NameError at runtime. Remove it from the del list so it stays available as a module-level name. Also fix ruff I001: import block sort order in sanitizers.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
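The failure mode described above can be reproduced in miniature (all names here are hypothetical demo names, not the real module's): a module-level function resolves globals at call time, so `del`-ing a name it depends on breaks later calls with NameError.

```python
class _VCRFakeSocketDemo:
    pass


_sock = _VCRFakeSocketDemo()


def _is_connected():
    # _VCRFakeSocketDemo is looked up in module globals on every call,
    # NOT captured in a closure when the function is defined.
    return isinstance(_sock, _VCRFakeSocketDemo)


assert _is_connected() is True  # fine while the global still exists

del _VCRFakeSocketDemo  # simulate the over-eager cleanup

try:
    _is_connected()
    raised = False
except NameError:
    raised = True  # the global lookup now fails at call time
```

This is why the name had to stay out of the `del` list: nothing pins a module-level function's free variables; they are re-resolved on every call.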
soustruh
left a comment
A couple of actually quite important things to consider or fix please 😇
src/keboola/vcr/recorder.py
Outdated
```python
"""
body_dict = response.get("body", {})
resp_rest = {k: v for k, v in response.items() if k != "body"}
```
Honestly, this whole method is quite hacky and messy, the variable naming is inconsistent (body_dict + resp_rest instead of, for example, resp_body + resp_rest), and I am not convinced it brings much measurable value. If it does, please polish it a little, especially the naming 🙏
src/keboola/vcr/sanitizers.py
Outdated
```python
seen.add(v)
combined_values.append(v)
merged.sensitive_values = sorted(combined_values, key=len, reverse=True)
merged._field_patterns = [re.compile(rf"({re.escape(f)}=)[^&\"\s]+") for f in merged.sensitive_fields]
```
Wow, this really seems like an obfuscation at its finest 🙀
Addresses review comment: _write_json_string_chunked and _write_interaction_chunked were missing the type annotation for the file argument. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The chunked string writing optimization added complexity without measurable benefit — the SSLContext sharing fix is the primary OOM prevention. Replace the manual body-splitting logic with a single json.dump call; _BytesEncoder already handles bytes-to-str conversion. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
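The _BytesEncoder mentioned above is not shown in this PR excerpt, but the pattern it names is a standard json.JSONEncoder subclass. A hypothetical sketch (the class name BytesEncoder and decode policy are assumptions, not the project's actual code):

```python
import json


class BytesEncoder(json.JSONEncoder):
    """Let json.dump serialise bytes values by decoding them to str."""

    def default(self, o):
        # default() is only called for objects json can't serialise
        # natively; convert bytes, defer everything else to the base
        # class (which raises TypeError).
        if isinstance(o, (bytes, bytearray)):
            return o.decode("utf-8", errors="replace")
        return super().default(o)


encoded = json.dumps({"body": b"payload"}, cls=BytesEncoder)
# encoded == '{"body": "payload"}'
```

With such an encoder in place, a single json.dump call handles mixed str/bytes cassette bodies without any manual pre-conversion pass.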
The "chunked" suffix no longer applies after the simplification. Cleaner name that matches what the function actually does. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The module already imports from __future__ import annotations, so forward-reference string quotes on all merge() signatures are redundant. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace __new__ + manual field assignment with a clean __init__ call. dict.fromkeys() deduplicates sensitive_values while preserving order; __init__ handles sorting, lowercasing, and pattern compilation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
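The dict.fromkeys() trick mentioned above works because dict keys are unique and preserve insertion order (guaranteed since Python 3.7), so the first occurrence of each value wins:

```python
values = ["tok_b", "tok_a", "tok_b", "tok_c", "tok_a"]
deduped = list(dict.fromkeys(values))
# deduped == ['tok_b', 'tok_a', 'tok_c']  (duplicates dropped, order kept)
```

Unlike list(set(values)), this keeps the original ordering, which matters when the constructor later re-sorts by length for longest-match-first sanitization.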
Versions are managed via git tags and CI release automation, not by manually editing pyproject.toml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
soustruh
left a comment
Just one small improvement please 🙏
```python
self.headers_to_remove = set(h.lower() for h in (headers_to_remove or []))

def merge(self, other: HeaderSanitizer) -> HeaderSanitizer:
```
HeaderSanitizer.merge() uses __new__, which is inconsistent with every other merge()
Location: src/keboola/vcr/sanitizers.py -- HeaderSanitizer.merge()
Pattern: Every other merge() in this PR calls the constructor: DefaultSanitizer(...), TokenSanitizer(...), QueryParamSanitizer(...), UrlPatternSanitizer(...), ResponseUrlSanitizer(...). HeaderSanitizer is the only one using __new__ + manual field assignment.
Fix:
```python
def merge(self, other: HeaderSanitizer) -> HeaderSanitizer:
    return HeaderSanitizer(
        safe_headers=list(self.safe_headers | other.safe_headers),
        headers_to_remove=list(self.headers_to_remove | other.headers_to_remove),
    )
```
soustruh
left a comment
OK, thanks for the changes & explanations! ദ്ദി(•ᴗ•)
Summary
Two self-contained commits:

1. feat(sanitizers): merge duplicate sanitizers + pre-scan optimisation
   - merge() method added to each sanitizer class
   - _dedup_sanitizers() collapses same-class sanitizers into one, so the chain makes a single pass over each request/response instead of N passes
   - DefaultSanitizer._sanitize_body gains a pre-scan: skips the full JSON parse when no sensitive field name or value is present in the body (fast path for large unrelated responses)

2. fix(oom): prevent RSS growth during VCR recording

   Root cause: every HTTPS interaction caused urllib3 to call load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of CA cert data into the OpenSSL C-heap via jemalloc. jemalloc never returns this to the OS, producing ~7 MB RSS growth per 10 interactions → OOM for jobs with 500–2000 interactions.

   Three root causes fixed:
   - Pool proliferation: components that create a new requests.Session per call each get their own PoolManager → N pools → N ssl_context allocations. Fix: _pool_reuse_patch() returns one shared pool per (scheme, host, port).
   - ssl_context per connect(): requests 2.32.x never puts ssl_context into pool.conn_kw, so urllib3 calls load_default_certs() on every connect(). Fix: inject a shared SSLContext into pool.conn_kw at pool creation time.
   - Connection churn: VCRConnection.is_connected delegated to real_connection.sock (None after Connection: close), causing _new_conn() on every request. Fix: override is_connected to return True when a VCR connection/socket is active; implement VCRHTTPResponse.release_conn() so the pool queue is refilled.

   Also: response bodies are written in 64 KB chunks to avoid large transient heap allocations fragmenting glibc's heap.

Confirmed working on the Keboola platform: RSS growth eliminated on a 100+ interaction Facebook Ads recording run.
Test plan
- Tests pass (uv run pytest)

🤖 Generated with Claude Code