fix(JSONL): honor start/end offsets for string input in parseChunk #28025
Jarred-Sumner wants to merge 1 commit into main from
Conversation
Previously `start`/`end` were silently ignored for string input and only worked for typed arrays. This broke error-recovery loops that pass an offset to skip past a bad line: the offset was ignored, so the same bad position was re-parsed forever.

- Extract `start`/`end` clamping into a shared lambda that takes `length`; the code was previously referencing `length` before it was defined
- For strings: pass `view->length()` (UTF-16 code-unit count), use `view->substring(start, end - start)` to slice, and return `read = start + charactersConsumed` so offsets are absolute
- Collapse the two `.d.ts` overloads; document byte-vs-char semantics
- Add tests for string offsets, an error-recovery loop regression test, and `NaN`/`Infinity`/`-Infinity`/negative/fractional/huge offset handling on both string and Buffer paths
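As an illustration of the intended semantics, here is a minimal TypeScript model of the string path. `parseChunkString` is a hypothetical stand-in, not Bun's implementation: it assumes every complete line is valid JSON, and it mirrors only the clamping and absolute-`read` behavior described above.

```typescript
// Hypothetical model of the fixed string path (not Bun's actual code).
// Offsets are UTF-16 code units; NaN becomes 0, negatives clamp to 0,
// fractional values truncate, and everything clamps to [0, length].
function parseChunkString(input: string, start = 0, end = input.length) {
  const length = input.length;
  const clamp = (v: number) =>
    Math.min(Math.max(Math.trunc(v) || 0, 0), length);
  start = clamp(start);
  end = Math.max(clamp(end), start); // enforce start <= end

  const window = input.substring(start, end);
  const values: unknown[] = [];
  let consumed = 0;
  for (const m of window.matchAll(/(.*)\n/g)) {
    values.push(JSON.parse(m[1]));
    consumed = m.index! + m[1].length; // position just after the last value
  }
  // read is absolute into the original string, in UTF-16 code units.
  return { values, read: start + consumed };
}
```

Under this model, `parseChunkString('{"a":1}\n{"b":2}\n', 8)` returns `{ values: [{ b: 2 }], read: 15 }`, matching the new string-offset tests.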
Updated 5:48 PM PT - Mar 11th, 2026
❌ @Jarred-Sumner, your commit b20fe73 has 2 failures in
🧪 To try this PR locally: `bunx bun-pr 28025`. That installs a local version of the PR as `bun-28025`; run it with `bun-28025 --bun`.
Walkthrough

The JSONL `parseChunk` API was expanded to accept mixed input types (strings, TypedArrays, DataView, ArrayBufferLike) with optional `start` and `end` offset parameters. The implementation was updated to handle byte offsets for typed arrays and character offsets for strings, with comprehensive test coverage added.
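The byte-vs-character distinction in that walkthrough can be seen with nothing but the standard `TextEncoder`/`TextDecoder` APIs (not Bun-specific): the same logical position needs a different offset depending on the input type as soon as the data leaves ASCII.

```typescript
const input = '{"é":1}\n{"b":2}\n';
const bytes = new TextEncoder().encode(input);
// "é" is 1 UTF-16 code unit but 2 UTF-8 bytes, so the second line starts
// at character offset 8 in the string, but at byte offset 9 in the bytes.
const fromString = input.slice(8);
const fromBytes = new TextDecoder().decode(bytes.subarray(9));
console.log(fromString === fromBytes); // true: same logical slice
```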
🚥 Pre-merge checks: ✅ 2 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/js/bun/jsonl/jsonl-parse.test.ts`:
- Around line 738-786: Add a new test exercising Bun.JSONL.parseChunk with a
start offset that skips a prefix containing a non-BMP (astral) character so we
verify offsets are measured in UTF-16 code units, not bytes/code points; e.g.
mirror the "start offset works for string input (character offsets)" test but
use a prefix like a single astral emoji (U+1F600 or any 2-unit UTF-16 char)
before the first JSONL line, call Bun.JSONL.parseChunk with the correct
character offset (accounting for the surrogate pair), and assert the parsed
values and read position match the expected results (use the same helpers/result
shape as the existing tests).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: bea1d441-21c6-43cb-8b58-892a5db569da
📒 Files selected for processing (3)
- packages/bun-types/bun.d.ts
- src/bun.js/bindings/BunObject.cpp
- test/js/bun/jsonl/jsonl-parse.test.ts
```ts
test("start offset works for string input (character offsets)", () => {
  const result = Bun.JSONL.parseChunk('{"a":1}\n{"b":2}\n', 8);
  expect(result.values).toStrictEqual([{ b: 2 }]);
  expect(result.read).toBe(15);
});

test("end offset works for string input", () => {
  const input = '{"a":1}\n{"b":2}\n{"c":3}\n';
  const result = Bun.JSONL.parseChunk(input, 0, 16);
  expect(result.values).toStrictEqual([{ a: 1 }, { b: 2 }]);
});

test("start+end window works for string input", () => {
  const input = '{"a":1}\n{"b":2}\n{"c":3}\n';
  const result = Bun.JSONL.parseChunk(input, 8, 16);
  expect(result.values).toStrictEqual([{ b: 2 }]);
  expect(result.read).toBe(15);
});

test("string: start equals end returns empty", () => {
  const result = Bun.JSONL.parseChunk('{"a":1}\n', 5, 5);
  expect(result.values).toStrictEqual([]);
  expect(result.read).toBe(5);
  expect(result.done).toBe(true);
});

test("string: error-recovery loop with start offset terminates", () => {
  // Regression: previously start was ignored for strings, so an
  // error-recovery loop that advanced the offset would re-parse the
  // same position forever. Now offsets are honored and each iteration
  // makes progress.
  const input = '{"a":1}\n{bad\n{"c":3}\n';
  const collected: unknown[] = [];
  let offset = 0;
  let iterations = 0;
  while (offset < input.length) {
    iterations++;
    if (iterations > 10) throw new Error("infinite loop");
    const r = Bun.JSONL.parseChunk(input, offset);
    collected.push(...r.values);
    if (!r.error || r.done || r.read >= input.length) break;
    const nl = input.indexOf("\n", r.read);
    if (nl === -1) break;
    offset = nl + 1;
    if (offset <= r.read) offset = r.read + 1; // guarantee progress
  }
  expect(collected).toStrictEqual([{ a: 1 }, { c: 3 }]);
  expect(iterations).toBeLessThanOrEqual(4);
});
```
Add one non-BMP string-offset case.
These new string-offset assertions are all ASCII. Since the implementation measures string offsets in UTF-16 code units, please add a case where the skipped prefix contains an astral character so this path can't silently regress to byte/code-point indexing later.
Suggested coverage addition
```ts
test("string offsets use UTF-16 code units for astral chars", () => {
  const input = '{"e":"🎉"}\n{"a":1}\n';
  const start = '{"e":"🎉"}\n'.length;
  const result = Bun.JSONL.parseChunk(input, start);
  expect(result.values).toStrictEqual([{ a: 1 }]);
  expect(result.read).toBe('{"e":"🎉"}\n{"a":1}'.length);
});
```
LGTM — clean, well-tested bug fix that extends existing offset-clamping logic to the string path.
Extended reasoning...
Overview
This PR fixes Bun.JSONL.parseChunk to honor start/end offsets when the input is a string. Previously, offsets were silently ignored for strings, causing infinite loops in error-recovery patterns. Three files are touched: BunObject.cpp (core fix), bun.d.ts (type consolidation), and jsonl-parse.test.ts (new tests).
Security risks
The C++ change involves input parsing and substring operations, but the bounds-clamping logic is carried over directly from the existing typed-array path — start and end are clamped to [0, length] with start <= end enforced. The view->substring(start, end - start) call is guarded by the clamping, so there is no risk of out-of-bounds access. No new attack surface is introduced.
Level of scrutiny
The C++ diff is small (~20 lines net change, mostly refactoring existing code into a lambda). The logic is straightforward: extract shared clamping, apply to string path, adjust the returned readBytes to be absolute. The test suite is comprehensive, covering edge cases like NaN, Infinity, negative offsets, error-recovery loops, and start/end boundary conditions for both strings and buffers. This warrants moderate scrutiny but the change is well-contained.
Other factors
No outstanding reviewer comments. No CODEOWNERS conflicts. The .d.ts change is a clean consolidation of two overloads into one with updated JSDoc. The test file adds ~140 lines of new tests that mirror the existing typed-array offset suite, providing strong regression coverage. The bug fix addresses a real user-facing issue (infinite loop with O(n²) heap growth).
What

`Bun.JSONL.parseChunk(input, start, end)` now honors `start`/`end` when `input` is a string. Previously these offsets only worked for typed arrays and were silently ignored for strings.

Why

This broke error-recovery loops that call `parseChunk(data, offset)` to skip past a malformed line: with string input the offset was ignored, so the same bad position was re-parsed forever, causing an infinite loop with O(n²) heap growth from repeated `concat` calls.

How

- Extracted the `start`/`end` clamping logic into a shared lambda `parseOffsets(length)`. This also fixes a latent bug where the code referenced `length` before it was defined.
- For strings: call `parseOffsets(view->length())`, use `view->substring(start, end - start)` when the offsets differ from the defaults, and return `read = start + charactersConsumed` so the offset is absolute into the original string (UTF-16 code-unit positions).
- Collapsed the two `.d.ts` overloads into one; the JSDoc now documents byte-vs-character offset semantics.

Tests

- `test.each` suites for `NaN`, `Infinity`, `-Infinity`, negative, fractional, and `Number.MAX_SAFE_INTEGER` offsets on both string and Buffer inputs; all clamp correctly without crashing
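As a footnote on the UTF-16 code-unit semantics documented above, a quick plain-JS check (nothing here is Bun-specific) of why astral characters matter: they occupy two code units but one code point, and four UTF-8 bytes.

```typescript
const prefix = '{"e":"🎉"}\n';
console.log(prefix.length);                           // 11 UTF-16 code units
console.log([...prefix].length);                      // 10 code points
console.log(new TextEncoder().encode(prefix).length); // 13 UTF-8 bytes
// A string start offset that skips this prefix must therefore be 11,
// not 10 (code points) and not 13 (bytes).
```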