
fix(JSONL): honor start/end offsets for string input in parseChunk#28025

Open
Jarred-Sumner wants to merge 1 commit into main from jarred/fix-jsonl

Conversation

@Jarred-Sumner
Collaborator

What

Bun.JSONL.parseChunk(input, start, end) now honors start/end when input is a string. Previously these offsets only worked for typed arrays and were silently ignored for strings.

Why

This broke error-recovery loops that call parseChunk(data, offset) to skip past a malformed line — with string input the offset was ignored, so the same bad position was re-parsed forever, causing an infinite loop with O(n²) heap growth from repeated concat calls.

How

  • Extracted the start/end clamping logic into a shared lambda parseOffsets(length). This also fixes a latent bug where the code referenced length before it was defined.
  • String path: call parseOffsets(view->length()), use view->substring(start, end - start) when offsets differ from defaults, return read = start + charactersConsumed so the offset is absolute into the original string (UTF-16 code-unit positions).
  • Collapsed the two .d.ts overloads into one; JSDoc now documents byte-vs-character offset semantics.

Tests

  • String offset tests mirroring the existing typed-array offset suite
  • Error-recovery loop regression test (terminates in ≤4 iterations instead of spinning forever)
  • test.each suites for NaN, Infinity, -Infinity, negative, fractional, Number.MAX_SAFE_INTEGER offsets on both string and Buffer — all clamp correctly without crashing

@Jarred-Sumner Jarred-Sumner requested a review from alii as a code owner March 12, 2026 00:03
@robobun
Collaborator

robobun commented Mar 12, 2026

Updated 5:48 PM PT - Mar 11th, 2026

@Jarred-Sumner, your commit b20fe73 has 2 failures in Build #39333 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 28025

That installs a local version of the PR into your bun-28025 executable, so you can run:

bun-28025 --bun

@coderabbitai
Contributor

coderabbitai bot commented Mar 12, 2026

Walkthrough

The JSONL parseChunk API was expanded to accept mixed input types (strings, TypedArrays, DataView, ArrayBufferLike) with optional start and end offset parameters. Implementation was updated to handle byte offsets for typed arrays and character offsets for strings, with comprehensive test coverage added.

Changes

  • API Type Definitions — packages/bun-types/bun.d.ts
    Widened the parseChunk overload to accept string | NodeJS.TypedArray | DataView | ArrayBufferLike, with optional start/end offset parameters for byte/character windowing.
  • Native Implementation — src/bun.js/bindings/BunObject.cpp
    Added a parseOffsets helper and logic to handle start/end offsets for both TypedArray (byte offsets) and String (character offsets) inputs, including ArrayBuffer validation and readBytes calculation adjustments.
  • Test Coverage — test/js/bun/jsonl/jsonl-parse.test.ts
    Added comprehensive tests for offset handling, including edge cases (NaN, Infinity, negative/fractional values), consistent behavior across string and Uint8Array paths, and BOM/UTF-8 boundary interactions.
🚥 Pre-merge checks: ✅ 2 passed
  • Title check — ✅ Passed. The title accurately describes the main change: fixing JSONL parseChunk to honor start/end offsets for string input, which is the core issue addressed in this PR.
  • Description check — ✅ Passed. The description comprehensively covers both required sections: "What" explains the fix clearly and "How" provides implementation details. A "Why" section is also included explaining the motivation and impact.


@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/js/bun/jsonl/jsonl-parse.test.ts`:
- Around line 738-786: Add a new test exercising Bun.JSONL.parseChunk with a
start offset that skips a prefix containing a non-BMP (astral) character so we
verify offsets are measured in UTF-16 code units, not bytes/code points; e.g.
mirror the "start offset works for string input (character offsets)" test but
use a prefix like a single astral emoji (U+1F600 or any 2-unit UTF-16 char)
before the first JSONL line, call Bun.JSONL.parseChunk with the correct
character offset (accounting for the surrogate pair), and assert the parsed
values and read position match the expected results (use the same helpers/result
shape as the existing tests).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: bea1d441-21c6-43cb-8b58-892a5db569da

📥 Commits

Reviewing files that changed from the base of the PR and between 9d51af8 and b20fe73.

📒 Files selected for processing (3)
  • packages/bun-types/bun.d.ts
  • src/bun.js/bindings/BunObject.cpp
  • test/js/bun/jsonl/jsonl-parse.test.ts

Comment on lines +738 to +786
test("start offset works for string input (character offsets)", () => {
  const result = Bun.JSONL.parseChunk('{"a":1}\n{"b":2}\n', 8);
  expect(result.values).toStrictEqual([{ b: 2 }]);
  expect(result.read).toBe(15);
});

test("end offset works for string input", () => {
  const input = '{"a":1}\n{"b":2}\n{"c":3}\n';
  const result = Bun.JSONL.parseChunk(input, 0, 16);
  expect(result.values).toStrictEqual([{ a: 1 }, { b: 2 }]);
});

test("start+end window works for string input", () => {
  const input = '{"a":1}\n{"b":2}\n{"c":3}\n';
  const result = Bun.JSONL.parseChunk(input, 8, 16);
  expect(result.values).toStrictEqual([{ b: 2 }]);
  expect(result.read).toBe(15);
});

test("string: start equals end returns empty", () => {
  const result = Bun.JSONL.parseChunk('{"a":1}\n', 5, 5);
  expect(result.values).toStrictEqual([]);
  expect(result.read).toBe(5);
  expect(result.done).toBe(true);
});

test("string: error-recovery loop with start offset terminates", () => {
  // Regression: previously start was ignored for strings, so an
  // error-recovery loop that advanced the offset would re-parse the
  // same position forever. Now offsets are honored and each iteration
  // makes progress.
  const input = '{"a":1}\n{bad\n{"c":3}\n';
  const collected: unknown[] = [];
  let offset = 0;
  let iterations = 0;
  while (offset < input.length) {
    iterations++;
    if (iterations > 10) throw new Error("infinite loop");
    const r = Bun.JSONL.parseChunk(input, offset);
    collected.push(...r.values);
    if (!r.error || r.done || r.read >= input.length) break;
    const nl = input.indexOf("\n", r.read);
    if (nl === -1) break;
    offset = nl + 1;
    if (offset <= r.read) offset = r.read + 1; // guarantee progress
  }
  expect(collected).toStrictEqual([{ a: 1 }, { c: 3 }]);
  expect(iterations).toBeLessThanOrEqual(4);
});

⚠️ Potential issue | 🟡 Minor

Add one non-BMP string-offset case.

These new string-offset assertions are all ASCII. Since the implementation measures string offsets in UTF-16 code units, please add a case where the skipped prefix contains an astral character so this path can't silently regress to byte/code-point indexing later.

Suggested coverage addition
+      test("string offsets use UTF-16 code units for astral chars", () => {
+        const input = '{"e":"🎉"}\n{"a":1}\n';
+        const start = '{"e":"🎉"}\n'.length;
+        const result = Bun.JSONL.parseChunk(input, start);
+        expect(result.values).toStrictEqual([{ a: 1 }]);
+        expect(result.read).toBe('{"e":"🎉"}\n{"a":1}'.length);
+      });

@claude bot left a comment

LGTM — clean, well-tested bug fix that extends existing offset-clamping logic to the string path.


Overview

This PR fixes Bun.JSONL.parseChunk to honor start/end offsets when the input is a string. Previously, offsets were silently ignored for strings, causing infinite loops in error-recovery patterns. Three files are touched: BunObject.cpp (core fix), bun.d.ts (type consolidation), and jsonl-parse.test.ts (new tests).

Security risks

The C++ change involves input parsing and substring operations, but the bounds-clamping logic is carried over directly from the existing typed-array path — start and end are clamped to [0, length] with start <= end enforced. The view->substring(start, end - start) call is guarded by the clamping, so there is no risk of out-of-bounds access. No new attack surface is introduced.

Level of scrutiny

The C++ diff is small (~20 lines net change, mostly refactoring existing code into a lambda). The logic is straightforward: extract shared clamping, apply to string path, adjust the returned readBytes to be absolute. The test suite is comprehensive, covering edge cases like NaN, Infinity, negative offsets, error-recovery loops, and start/end boundary conditions for both strings and buffers. This warrants moderate scrutiny but the change is well-contained.

Other factors

No outstanding reviewer comments. No CODEOWNERS conflicts. The .d.ts change is a clean consolidation of two overloads into one with updated JSDoc. The test file adds ~140 lines of new tests that mirror the existing typed-array offset suite, providing strong regression coverage. The bug fix addresses a real user-facing issue (infinite loop with O(n²) heap growth).
