
fix(oom): prevent RSS growth during VCR recording + sanitizer dedup #6

Merged
matyas-jirat-keboola merged 10 commits into main from fix/oom-clean
Mar 16, 2026

Conversation

@matyas-jirat-keboola
Collaborator

Summary

Two self-contained commits:

1. feat(sanitizers): merge duplicate sanitizers + pre-scan optimisation

  • Each sanitizer class gains a merge() method
  • _dedup_sanitizers() collapses same-class sanitizers into one, so the chain makes a single pass over each request/response instead of N passes
  • DefaultSanitizer._sanitize_body gains a pre-scan: skips the full JSON parse when no sensitive field name or value is present in the body (fast path for large unrelated responses)

2. fix(oom): prevent RSS growth during VCR recording

Root cause: every HTTPS interaction caused urllib3 to call load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of CA cert data into the OpenSSL C-heap via jemalloc. jemalloc never returns this to the OS, producing ~7 MB RSS growth per 10 interactions → OOM for jobs with 500–2000 interactions.

Three root causes fixed:

  • Pool proliferation — components creating a new requests.Session per call each get their own PoolManager → N pools → N ssl_context allocations. Fix: _pool_reuse_patch() returns one shared pool per (scheme, host, port).
  • ssl_context per connect() — even with one pool, requests 2.32.x never puts ssl_context into pool.conn_kw, so urllib3 calls load_default_certs() on every connect(). Fix: inject a shared SSLContext into pool.conn_kw at pool creation time.
  • Connection churn — VCRConnection.is_connected delegated to real_connection.sock (None after Connection: close), causing _new_conn() on every request. Fix: override is_connected to return True when a VCR connection/socket is active; implement VCRHTTPResponse.release_conn() so the pool queue is refilled.

Also: response bodies written in 64 KB chunks to avoid large transient heap allocations fragmenting glibc's heap.

Confirmed working on Keboola platform — RSS growth eliminated on a 100+ interaction Facebook Ads recording run.

Test plan

  • 119 unit tests pass (uv run pytest)
  • Confirmed working on Keboola platform

🤖 Generated with Claude Code

matyas-jirat-keboola and others added 3 commits March 5, 2026 19:53
…ation

- Add merge() to DefaultSanitizer, TokenSanitizer, HeaderSanitizer,
  QueryParamSanitizer, UrlPatternSanitizer, ResponseUrlSanitizer
- Add _dedup_sanitizers() which collapses same-class sanitizers into one,
  so the sanitizer chain makes a single pass over each request/response
  instead of N passes when multiple sanitizers of the same type are present
- Add pre-scan in DefaultSanitizer._sanitize_body: skip the full JSON parse
  when no sensitive field name or value is present in the body (fast path)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Root cause: every HTTPS interaction during recording caused urllib3 to call
load_default_certs() on a brand-new ssl.SSLContext, allocating ~480 KB of
CA cert data into the OpenSSL C-heap (via jemalloc).  jemalloc's
high-watermark behaviour means this memory is never returned to the OS,
producing ~7 MB RSS growth per 10 interactions and OOM for jobs with
500–2000 interactions.

Three root causes, three fixes applied inside _pool_reuse_patch():

1. Pool proliferation: components that create a new requests.Session per
   API call (e.g. Facebook Ads batch requests) each get their own
   urllib3.PoolManager and a fresh HTTPSConnectionPool, so N interactions
   → N pools → N ssl_context allocations.  Fix: intercept
   PoolManager.connection_from_host and return one shared pool per
   (scheme, host, port) for the lifetime of the recording.

2. ssl_context per connect(): even with a single pool, urllib3 calls
   _new_conn() → connect() → _ssl_wrap_socket_and_match_hostname(
   ssl_context=None) for each interaction because requests 2.32.x never
   puts ssl_context into pool.conn_kw.  When ssl_context is None, urllib3
   creates a new context and calls load_default_certs() every time.  Fix:
   inject one shared ssl.SSLContext (with load_default_certs() pre-called)
   into pool.conn_kw["ssl_context"] so all _new_conn() connections share it.

3. Connection churn: VCRConnection.is_connected delegated to
   real_connection.sock, which is None after Connection: close responses.
   urllib3 saw is_connected=False and called _new_conn() for every request.
   Fix: override is_connected on VCRConnection to always return True when
   real_connection exists (recording) or a VCRFakeSocket is present (replay).
   Additionally, implement VCRHTTPResponse.release_conn() so urllib3's pool
   queue is refilled after each response and _new_conn() is not called
   unnecessarily.

Also: write large response body strings to the cassette file in 64 KB chunks
(_write_interaction_chunked) to prevent multi-MB transient allocations from
fragmenting glibc's heap during recording of APIs with large JSON payloads.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
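A rough sketch of the fix in point 3 (class and attribute names follow the commit message; the real VCR internals differ, and the companion VCRHTTPResponse.release_conn() change is not shown):

```python
class VCRConnection:
    """Illustrative stand-in; the real class wraps an HTTP connection."""

    def __init__(self, real_connection=None, vcr_fake_socket=None):
        self.real_connection = real_connection    # set while recording
        self._vcr_fake_socket = vcr_fake_socket   # set while replaying

    @property
    def is_connected(self):
        # Report the connection as live whenever a real connection (recording)
        # or a fake socket (replay) exists, instead of delegating to
        # real_connection.sock, which is None after a "Connection: close"
        # response and made urllib3 call _new_conn() on every request.
        return self.real_connection is not None or self._vcr_fake_socket is not None
```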
_vcr_is_connected uses _VCRFakeSocket via a global lookup (module-level
function, not a closure). Deleting it with `del` before the property is
ever called would raise NameError at runtime. Remove it from the del list
so it stays available as a module-level name.

Also fix ruff I001: import block sort order in sanitizers.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@soustruh soustruh left a comment


A couple of actually quite important things to consider or fix please 😇

"""
body_dict = response.get("body", {})
resp_rest = {k: v for k, v in response.items() if k != "body"}



Honestly, this whole method is quite hacky and messy, the variable naming is inconsistent (body_dict + resp_rest instead of, for example, resp_body + resp_rest), and I am not convinced it brings much measurable value. If it does, please polish it a little, especially the naming 🙏

seen.add(v)
combined_values.append(v)
merged.sensitive_values = sorted(combined_values, key=len, reverse=True)
merged._field_patterns = [re.compile(rf"({re.escape(f)}=)[^&\"\s]+") for f in merged.sensitive_fields]


Wow, this really seems like an obfuscation at its finest 🙀
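For context, the compiled pattern in the snippet above redacts `field=value` pairs in query-string-like text, keeping the field name and replacing only the value. A standalone example (field name is illustrative):

```python
import re

field = "access_token"  # illustrative sensitive field name
pattern = re.compile(rf"({re.escape(field)}=)[^&\"\s]+")

# Group 1 keeps the "field=" prefix; only the value after "=" is replaced,
# stopping at "&", a quote, or whitespace.
redacted = pattern.sub(r"\1***", "https://api.example.com?access_token=abc123&x=1")
# -> "https://api.example.com?access_token=***&x=1"
```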

matyas-jirat-keboola and others added 6 commits March 6, 2026 18:24
Addresses review comment: _write_json_string_chunked and
_write_interaction_chunked were missing the type annotation for the
file argument.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The chunked string writing optimization added complexity without
measurable benefit — the SSLContext sharing fix is the primary OOM
prevention. Replace the manual body-splitting logic with a single
json.dump call; _BytesEncoder already handles bytes-to-str conversion.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The "chunked" suffix no longer applies after the simplification.
Cleaner name that matches what the function actually does.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The module already imports from __future__ import annotations, so
forward-reference string quotes on all merge() signatures are redundant.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace __new__ + manual field assignment with a clean __init__ call.
dict.fromkeys() deduplicates sensitive_values while preserving order;
__init__ handles sorting, lowercasing, and pattern compilation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
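The dict.fromkeys() trick this commit mentions drops duplicates while keeping first-seen order, since dicts preserve insertion order (Python 3.7+):

```python
# Order-preserving dedup: later duplicates are dropped, first-seen order kept.
values = ["token_b", "token_a", "token_b", "token_c"]
deduped = list(dict.fromkeys(values))
# -> ["token_b", "token_a", "token_c"]
```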
Versions are managed via git tags and CI release automation, not by
manually editing pyproject.toml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

@soustruh soustruh left a comment


Just one small improvement please 🙏


self.headers_to_remove = set(h.lower() for h in (headers_to_remove or []))

def merge(self, other: HeaderSanitizer) -> HeaderSanitizer:


HeaderSanitizer.merge() uses __new__ -- inconsistent with every other merge

Location: src/keboola/vcr/sanitizers.py -- HeaderSanitizer.merge()
Pattern: Every other merge() in this PR calls the constructor: DefaultSanitizer(...), TokenSanitizer(...), QueryParamSanitizer(...), UrlPatternSanitizer(...), ResponseUrlSanitizer(...). HeaderSanitizer is the only one using __new__ + manual field assignment.
Fix:

def merge(self, other: HeaderSanitizer) -> HeaderSanitizer:
    return HeaderSanitizer(
        safe_headers=list(self.safe_headers | other.safe_headers),
        headers_to_remove=list(self.headers_to_remove | other.headers_to_remove),
    )


@soustruh soustruh left a comment


OK, thanks for the changes & explanations! ദ്ദി(•ᴗ•)

@matyas-jirat-keboola matyas-jirat-keboola merged commit 2a3389f into main Mar 16, 2026
3 checks passed