
fix(oom): prevent RSS growth during VCR recording + sanitizer dedup #6

Merged
matyas-jirat-keboola merged 10 commits into main from fix/oom-clean
Mar 16, 2026

Conversation

@matyas-jirat-keboola
Collaborator

Summary

Two self-contained commits:

1. feat(sanitizers): merge duplicate sanitizers + pre-scan optimisation

  • Each sanitizer class gains a merge() method
  • _dedup_sanitizers() collapses same-class sanitizers into one, so the chain makes a single pass over each request/response instead of N passes
  • DefaultSanitizer._sanitize_body gains a pre-scan: skips the full JSON parse when no sensitive field name or value is present in the body (fast path for large unrelated responses)

2. fix(oom): prevent RSS growth during VCR recording

Root cause: every HTTPS interaction caused urllib3 to call load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of CA cert data into the OpenSSL C-heap via jemalloc. jemalloc never returns this to the OS, producing ~7 MB RSS growth per 10 interactions → OOM for jobs with 500–2000 interactions.

Three root causes fixed:

  • Pool proliferation — components creating a new requests.Session per call each get their own PoolManager → N pools → N ssl_context allocations. Fix: _pool_reuse_patch() returns one shared pool per (scheme, host, port).
  • ssl_context per connect() — even with one pool, requests 2.32.x never puts ssl_context into pool.conn_kw, so urllib3 calls load_default_certs() on every connect(). Fix: inject a shared SSLContext into pool.conn_kw at pool creation time.
  • Connection churn — VCRConnection.is_connected delegated to real_connection.sock (None after Connection: close), causing _new_conn() on every request. Fix: override is_connected to return True when a VCR connection/socket is active; implement VCRHTTPResponse.release_conn() so the pool queue is refilled.

Also: response bodies written in 64 KB chunks to avoid large transient heap allocations fragmenting glibc's heap.

Confirmed working on Keboola platform — RSS growth eliminated on a 100+ interaction Facebook Ads recording run.

Test plan

  • 119 unit tests pass (uv run pytest)
  • Confirmed working on Keboola platform

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 3 commits March 5, 2026 19:53
…ation

- Add merge() to DefaultSanitizer, TokenSanitizer, HeaderSanitizer,
  QueryParamSanitizer, UrlPatternSanitizer, ResponseUrlSanitizer
- Add _dedup_sanitizers() which collapses same-class sanitizers into one,
  so the sanitizer chain makes a single pass over each request/response
  instead of N passes when multiple sanitizers of the same type are present
- Add pre-scan in DefaultSanitizer._sanitize_body: skip the full JSON parse
  when no sensitive field name or value is present in the body (fast path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: every HTTPS interaction during recording caused urllib3 to call
load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of
CA cert data into the OpenSSL C-heap (via jemalloc).  jemalloc's
high-watermark behaviour means this memory is never returned to the OS,
producing ~7 MB RSS growth per 10 interactions and OOM for jobs with
500–2000 interactions.

Three root causes, three fixes applied inside _pool_reuse_patch():

1. Pool proliferation: components that create a new requests.Session per
   API call (e.g. Facebook Ads batch requests) each get their own
   urllib3.PoolManager and a fresh HTTPSConnectionPool, so N interactions
   → N pools → N ssl_context allocations.  Fix: intercept
   PoolManager.connection_from_host and return one shared pool per
   (scheme, host, port) for the lifetime of the recording.

2. ssl_context per connect(): even with a single pool, urllib3 calls
   _new_conn() → connect() → _ssl_wrap_socket_and_match_hostname(
   ssl_context=None) for each interaction because requests 2.32.x never
   puts ssl_context into pool.conn_kw.  When ssl_context is None, urllib3
   creates a new context and calls load_default_certs() every time.  Fix:
   inject one shared ssl.SSLContext (with load_default_certs() pre-called)
   into pool.conn_kw["ssl_context"] so all _new_conn() connections share it.

3. Connection churn: VCRConnection.is_connected delegated to
   real_connection.sock, which is None after Connection: close responses.
   urllib3 saw is_connected=False and called _new_conn() for every request.
   Fix: override is_connected on VCRConnection to always return True when
   real_connection exists (recording) or a VCRFakeSocket is present (replay).
   Additionally, implement VCRHTTPResponse.release_conn() so urllib3's pool
   queue is refilled after each response and _new_conn() is not called
   unnecessarily.

Also: write large response body strings to the cassette file in 64 KB chunks
(_write_interaction_chunked) to prevent multi-MB transient allocations from
fragmenting glibc's heap during recording of APIs with large JSON payloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
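A rough sketch of the fix in point 3 (class and attribute names follow the commit message; the real VCR internals differ, and the companion VCRHTTPResponse.release_conn() change is not shown):

```python
class VCRConnection:
    """Illustrative stand-in; the real class wraps an HTTP connection."""

    def __init__(self, real_connection=None, vcr_fake_socket=None):
        self.real_connection = real_connection    # set while recording
        self._vcr_fake_socket = vcr_fake_socket   # set while replaying

    @property
    def is_connected(self):
        # Report the connection as live whenever a real connection (recording)
        # or a fake socket (replay) exists, instead of delegating to
        # real_connection.sock, which is None after a "Connection: close"
        # response and made urllib3 call _new_conn() on every request.
        return self.real_connection is not None or self._vcr_fake_socket is not None
```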
_vcr_is_connected uses _VCRFakeSocket via a global lookup (module-level
function, not a closure). Deleting it with `del` before the property is
ever called would raise NameError at runtime. Remove it from the del list
so it stays available as a module-level name.

Also fix ruff I001: import block sort order in sanitizers.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@soustruh soustruh left a comment


A couple of actually quite important things to consider or fix please 😇

"""
body_dict = response.get("body", {})
resp_rest = {k: v for k, v in response.items() if k != "body"}



Honestly, this whole method is quite hacky and messy, the variable naming is inconsistent (body_dict + resp_rest instead of, for example, resp_body + resp_rest), and I am not convinced it brings much measurable value. If it does, please polish it a little, especially the naming 🙏

seen.add(v)
combined_values.append(v)
merged.sensitive_values = sorted(combined_values, key=len, reverse=True)
merged._field_patterns = [re.compile(rf"({re.escape(f)}=)[^&\"\s]+") for f in merged.sensitive_fields]


Wow, this really seems like an obfuscation at its finest 🙀
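For context, the compiled pattern in the snippet above redacts `field=value` pairs in query-string-like text, keeping the field name and replacing only the value. A standalone example (field name is illustrative):

```python
import re

field = "access_token"  # illustrative sensitive field name
pattern = re.compile(rf"({re.escape(field)}=)[^&\"\s]+")

# Group 1 keeps the "field=" prefix; only the value after "=" is replaced,
# stopping at "&", a quote, or whitespace.
redacted = pattern.sub(r"\1***", "https://api.example.com?access_token=abc123&x=1")
# -> "https://api.example.com?access_token=***&x=1"
```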

matyas-jirat-keboola and others added 6 commits March 6, 2026 18:24
Addresses review comment: _write_json_string_chunked and
_write_interaction_chunked were missing the type annotation for the
file argument.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The chunked string writing optimization added complexity without
measurable benefit — the SSLContext sharing fix is the primary OOM
prevention. Replace the manual body-splitting logic with a single
json.dump call; _BytesEncoder already handles bytes-to-str conversion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The "chunked" suffix no longer applies after the simplification.
Cleaner name that matches what the function actually does.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The module already imports from __future__ import annotations, so
forward-reference string quotes on all merge() signatures are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace __new__ + manual field assignment with a clean __init__ call.
dict.fromkeys() deduplicates sensitive_values while preserving order;
__init__ handles sorting, lowercasing, and pattern compilation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
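The dict.fromkeys() trick this commit mentions drops duplicates while keeping first-seen order, since dicts preserve insertion order (Python 3.7+):

```python
# Order-preserving dedup: later duplicates are dropped, first-seen order kept.
values = ["token_b", "token_a", "token_b", "token_c"]
deduped = list(dict.fromkeys(values))
# -> ["token_b", "token_a", "token_c"]
```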
Versions are managed via git tags and CI release automation, not by
manually editing pyproject.toml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@soustruh soustruh left a comment


Just one small improvement please 🙏


self.headers_to_remove = set(h.lower() for h in (headers_to_remove or []))

def merge(self, other: HeaderSanitizer) -> HeaderSanitizer:


HeaderSanitizer.merge() uses __new__ -- inconsistent with every other merge

Location: src/keboola/vcr/sanitizers.py -- HeaderSanitizer.merge()
Pattern: Every other merge() in this PR calls the constructor: DefaultSanitizer(...), TokenSanitizer(...), QueryParamSanitizer(...), UrlPatternSanitizer(...), ResponseUrlSanitizer(...). HeaderSanitizer is the only one using __new__ + manual field assignment.
Fix:

def merge(self, other: HeaderSanitizer) -> HeaderSanitizer:
    return HeaderSanitizer(
        safe_headers=list(self.safe_headers | other.safe_headers),
        headers_to_remove=list(self.headers_to_remove | other.headers_to_remove),
    )


@soustruh soustruh left a comment


OK, thanks for the changes & explanations! ദ്ദി(•ᴗ•)

@matyas-jirat-keboola matyas-jirat-keboola merged commit 2a3389f into main Mar 16, 2026
3 checks passed