fix: scan_full_page captures all content on virtual-scroll pages (#731)#1853
hafezparast wants to merge 1 commit into unclecode:develop
Conversation
…n virtual-scroll pages (unclecode#731)

scan_full_page previously captured HTML only once after scrolling completed, so pages that recycle DOM elements (Twitter/X, Xiaohongshu, etc.) only returned the last visible batch. VirtualScrollConfig had a similar gap: it used container.scrollTop, which has no effect when the window is the scrollable element.

_handle_full_page_scan:
- Install a MutationObserver before scrolling that captures every element removed from the DOM (i.e. recycled by virtual scroll).
- After scrolling, deduplicate the accumulated elements against what is still visible, then re-inject them into a hidden container so that the subsequent page.content() call includes everything.
- Clean up the observer on both success and error paths.

_handle_virtual_scroll:
- Fall back to window.scrollBy() when container.scrollTop has no effect (window-level scrolling, e.g. Twitter).
- Update end-of-scroll detection to check window position when using the window fallback.

Tested with a local virtual-scroll page (50 items, 10 visible at a time):
- Before: 10/50 captured (only last batch)
- After: 50/50 captured

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unclecode
left a comment
Maysam, good instinct going with MutationObserver; that's the natural first thought for this problem. But I tested it, and there's a fundamentally better approach. Let me walk you through why, and also through how to validate performance claims on PRs like this.
The Core Problem with MutationObserver Here
MutationObserver fires a callback on every single DOM mutation. On a Twitter feed, one scroll step triggers 20-40 mutations (remove 10 nodes, add 10 nodes, subtree changes). Each callback runs synchronously before the browser can paint, which actually slows down the site's own rendering. And you're cloning nodes inside that hot callback path.
But here's the thing: we control the scroll timing. We know exactly when to look. We don't need real-time observation at all.
Better Approach: Scroll-and-Snapshot
Replace the observer with a simple Map:
1. Create a Map<fingerprint, outerHTML>
2. Before first scroll: capture all container children into Map
3. Scroll one step, wait for content to settle
4. Capture children again (Map deduplicates inline, zero extra memory)
5. Repeat until bottom
6. If Map.size > container.children.length, replace innerHTML
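The loop above can be modeled in a few lines of Python. This is a pure-Python sketch, not the crawler's code: in the real implementation this logic would run as JavaScript via `page.evaluate()`. Here, each simulated "scroll step" yields the currently visible batch of `(fingerprint, outerHTML)` pairs, mimicking a virtual-scroll list that keeps only 10 of 50 items in the DOM at a time.

```python
def snapshot_scroll(batches):
    """Accumulate unique items across scroll steps; the dict deduplicates inline."""
    seen = {}  # fingerprint -> outerHTML (Python analogue of Map<fingerprint, outerHTML>)
    for visible in batches:          # steps 3-5: scroll, settle, capture, repeat
        for fp, html in visible:     # step 4: dedup at insert time, no extra memory for repeats
            seen.setdefault(fp, html)
    return seen

# 50 items, a visible window of 10, scrolling 5 items at a time (overlapping batches).
items = [(f"item-{i}", f"<li data-id='item-{i}'>row {i}</li>") for i in range(50)]
batches = [items[i:i + 10] for i in range(0, 50, 5)]

captured = snapshot_scroll(batches)
print(len(captured))  # 50: every unique item survives despite the overlap
```

The point of the model: memory holds exactly one string per unique item, no matter how many times an item reappears across scroll steps.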
Why this wins:
| | MutationObserver (your PR) | Scroll-and-Snapshot |
|---|---|---|
| CPU | Callback fires 20-40x per scroll step, blocks main thread | Runs once per scroll step |
| Memory | Unbounded `__c4ai_removed` array, dedup only at end | Map deduplicates inline, memory = unique items only |
| Noise | Captures ALL removed nodes (tooltips, ads, banners) | Only reads container children |
| Detached nodes | `innerText` can return empty on removed nodes | Reads from live DOM, no issue |
| Code | Setup, teardown, callback, error cleanup | Just a loop with a Map |
Specific Bugs in Current PR
1. 200-char fingerprint silently drops unique items
`t.substring(0, 200)` as the dedup key is dangerous. I tested this: 100 product cards with identical text but different `data-item-id` values. Your approach would collapse them to ~10. The fix scored 100/100 using attribute-based fingerprinting (`id` / `data-*-id` first, content-hash fallback).
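Here is a minimal repro of the collapse, modeled in Python (the items are plain dicts standing in for DOM elements; the text and attribute names are illustrative): every card shares the same visible text, so a text-prefix key can't tell them apart, while an attribute-based key can.

```python
# 100 cards with identical visible text but distinct data-item-id values.
text = "Great product, five stars! Would absolutely buy this again. " * 4
items = [{"text": text, "data_item_id": str(i)} for i in range(100)]

substring_keys = {it["text"][:200] for it in items}   # PR-style dedup key: text prefix
attr_keys = {it["data_item_id"] for it in items}      # attribute-based fingerprint

print(len(substring_keys), len(attr_keys))  # 1 100 -> 99 unique items silently dropped
```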
2. innerText on detached nodes can be empty
`innerText` is layout-dependent: nodes removed from the DOM (exactly when your observer fires) can return an empty string. Multiple detached nodes then all produce the same fingerprint, and dedup keeps only one. Use `textContent` if you stick with text, but better to avoid text fingerprinting entirely.
3. Observer on document.body captures everything
`subtree: true` on `document.body` means every tooltip close, cookie-banner dismiss, ad rotation, and dropdown close ends up in `__c4ai_removed`. That noise gets re-injected into the hidden div and pollutes extraction. Scope the observer to the container at minimum.
4. Unbounded memory
`window.__c4ai_removed` stores full `outerHTML` for every removed node with no cap. A Twitter feed with 5000 tweets means 5000 HTML strings in memory before dedup runs. Always cap accumulators and deduplicate inline.
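A capped, inline-deduplicating accumulator can look like the sketch below. This is an assumed design, not the library's API: duplicates cost nothing at insert time, and growth stops at `max_items` instead of storing one HTML string per removed node.

```python
def make_accumulator(max_items=10_000):
    """Return an add() function plus its backing store, capped at max_items."""
    seen = {}
    def add(fingerprint, outer_html):
        if fingerprint in seen:
            return True            # duplicate: dedup happens at insert time, not at the end
        if len(seen) >= max_items:
            return False           # cap reached: refuse instead of growing without bound
        seen[fingerprint] = outer_html
        return True
    return add, seen

add, store = make_accumulator(max_items=3)
for fp in ["a", "b", "a", "c", "d"]:   # "a" repeats; "d" would exceed the cap
    add(fp, f"<div id='{fp}'></div>")
print(sorted(store))  # ['a', 'b', 'c']
```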
How to Test Performance Claims (Important for Future PRs)
When a PR touches scrolling, accumulation, or anything that scales with page size, you need to prove it works under stress. Here's the testing pattern:
1. Build adversarial test pages, not just happy path
Don't just test "50 items, 10 visible." Test:
- 1000 items (stress test, does it still work at scale?)
- Items with near-identical text (does your dedup collapse unique items?)
- Items without `data-id` attributes (does your fallback fingerprint work?)
- Append-only scroll (does your recycling fix break normal pages?)
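A small generator makes these variants cheap to produce. This is a hypothetical helper (the function name and knobs are mine): it emits the static skeleton of a test page; a real adversarial page would also include a script that recycles rows on scroll, which is omitted here.

```python
def build_test_page(n_items, with_data_id=True, near_identical_text=False):
    """Generate a local HTML test page exercising one adversarial pattern."""
    rows = []
    for i in range(n_items):
        attr = f" data-id='row-{i}'" if with_data_id else ""
        text = "Same text every time" if near_identical_text else f"Item number {i}"
        rows.append(f"<li{attr}>{text}</li>")
    return "<html><body><ul id='feed'>" + "".join(rows) + "</ul></body></html>"

# 1000 items, no data-id, identical text: the worst case for text-prefix dedup.
page = build_test_page(1000, with_data_id=False, near_identical_text=True)
```

Write the string to a file and point the crawler at the `file://` URL (or serve it from a local HTTP server).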
2. Measure memory, not just correctness
Check the output HTML size. If 1000 items at ~500B each should be ~500KB, and your output is 2MB, you have duplicate accumulation. In my test: 276KB for 1000 rich items. No bloat.
```python
html_size_kb = len(result.html) / 1024
assert html_size_kb < 2000, f"HTML too large ({html_size_kb:.0f}KB), duplicate accumulation"
```

3. Time it
```python
t0 = time.time()
result = await crawler.arun(url=url, config=config)
elapsed = time.time() - t0
print(f"unique={count}/1000, html={html_kb:.0f}KB, time={elapsed:.1f}s")
```

My 1000-item stress test: 5.4s crawl, 276KB output, 1000/1000 captured.
4. Make tests real pytest, not standalone scripts
`test_repro_731.py` uses `asyncio.run()` under `__name__ == "__main__"`; pytest won't discover it. Use `@pytest.mark.asyncio` and proper assertions. The "13 adversarial tests" mentioned in the PR description aren't in the diff.
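The converted shape would look roughly like this. It is a sketch assuming pytest-asyncio is installed; the crawler call is replaced with a stub coroutine (`fake_arun` is mine, not the library's API) so the structure is visible without real browser setup:

```python
import asyncio
import pytest

# Stand-in for crawler.arun(); the real test would construct the crawler and
# config from crawl4ai and crawl the local virtual-scroll page instead.
async def fake_arun(url):
    return {"html": "<li data-id='row-0'>x</li>" * 50}

@pytest.mark.asyncio
async def test_virtual_scroll_captures_all_items():
    result = await fake_arun("file:///tmp/virtual_scroll.html")
    # Proper assertion: count captured items, fail loudly if any were dropped.
    assert result["html"].count("data-id=") == 50, "virtual scroll dropped items"
```

pytest discovers this under the standard `test_*` naming convention, and a failed assertion reports the item count instead of silently passing.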
My Test Results (for reference)
I built and ran 9 tests with the scroll-and-snapshot approach:
| Test | Captured | Time |
|---|---|---|
| 50 items, recycling | 50/50 | 1.0s |
| 200 items, recycling | 200/200 | 2.3s |
| 1000 items, stress | 1000/1000 | 5.4s |
| 80 items, append-only | 80/80 | 2.6s |
| 100 items, similar text | 100/100 | 1.8s |
| 50 items, no data-id | 50/50 | 1.1s |
| full_page_scan + recycling | 100/100 | 2.3s |
| normal page, no recycling | 4/4 | 1.2s |
| 1000 items, memory stress | 1000/1000, 276KB | 8.1s |
What to Do
- Drop the MutationObserver, use scroll-and-snapshot with a Map
- Fingerprint by
id/data-*attrs first, hash oftagName + href + textContentas fallback - Cap the accumulator (10K elements max)
- Add the window.scrollBy fallback from your PR (that part is solid)
- Write real pytest tests with the adversarial patterns above
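The fingerprint recommendation above can be sketched as follows. `el` is a plain dict standing in for a DOM element (in the browser this runs as JS over real elements; all names here are illustrative): explicit identity attributes win, and the content hash is only a fallback.

```python
import hashlib

def fingerprint(el):
    """Stable attributes first, layout-independent content hash as fallback."""
    # Prefer explicit identity: id, then any data-*-id attribute.
    if el.get("id"):
        return "id:" + el["id"]
    for attr, value in el.get("attrs", {}).items():
        if attr.startswith("data-") and attr.endswith("-id") and value:
            return f"{attr}:{value}"
    # Fallback: hash of tagName + href + textContent (textContent, not the
    # layout-dependent innerText, so detached nodes still fingerprint).
    raw = "|".join([el.get("tag", ""), el.get("href", ""), el.get("text", "")])
    return "hash:" + hashlib.sha1(raw.encode()).hexdigest()

a = {"tag": "article", "attrs": {"data-tweet-id": "123"}, "text": "same text"}
b = {"tag": "article", "attrs": {"data-tweet-id": "456"}, "text": "same text"}
print(fingerprint(a) == fingerprint(b))  # False: attributes win over identical text
```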
The `window.scrollBy` fallback in `_handle_virtual_scroll` is clean, by the way; that part is good. Focus the rework on the accumulation strategy.
Summary
_handle_full_page_scan: installs a `MutationObserver` before scrolling to capture DOM elements that are removed during scroll (virtual scroll recycling). After scrolling, deduplicates against still-visible elements and re-injects accumulated content into a hidden `<div>` so `page.content()` includes everything.

_handle_virtual_scroll: falls back to `window.scrollBy()` when `container.scrollTop` has no effect (window-level scrolling, e.g. Twitter/X). Updates end-of-scroll detection accordingly.

Before fix: `scan_full_page=True` on a 50-item virtual-scroll page → only 10 items (last batch)
After fix: all 50 items captured
Fixes #731, related to #1087
Test plan
Repro test (`tests/test_repro_731.py`): a local virtual-scroll page with 50 items, verifies all 50 are captured.

🤖 Generated with Claude Code