fix(core): preserve whitespace edge cases but collapse html formatting newlines (BLO-1065) #2551

Open
YousefED wants to merge 3 commits into main from copy-paste-from-ms-word-is-broken-blo-1065

@YousefED YousefED (Collaborator) commented Mar 10, 2026

Summary

Fixes the issue where copying and pasting from MS Word (or other sources with HTML source formatting) introduced extra, unintended line breaks. (closes #2548, closes #2356)

Rationale

When HTML is pasted, source-code line breaks (`\n`) inside text elements such as paragraphs were being converted into hard breaks in BlockNote. Standard HTML rendering collapses these into spaces, but PR #2230 introduced `preserveWhitespace: true` to prevent leading and trailing spaces from being stripped during AI diffing. That global flag disabled all whitespace collapsing, so every formatting newline appeared as a visible hard break. This PR introduces a targeted preprocessing step that satisfies both needs.

Changes

  • Added `normalizeTextNodeWhitespace` utility to collapse whitespace (including newlines) into single spaces for text nodes.
  • Added `isNotionHTML` to detect Notion documents.
  • Added `preprocessHTMLWhitespace` to conditionally normalize HTML, explicitly skipping Notion HTML (which deliberately relies on `\n` for hard breaks).
  • Integrated `preprocessHTMLWhitespace` into `HTMLToBlocks` before ProseMirror parsing.
  • Added dedicated `msWordPaste` test to `parseTestInstances` using the exact HTML from the bug report.
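
The preprocessing described in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `SketchNode`, `collapseWhitespace`, `looksLikeNotionHTML`, `normalizeTree`, and `preprocessSketch` are hypothetical names, the plain-object tree stands in for real DOM nodes so the sketch runs without a browser, and the `notionvc` comment marker used for detection is an assumption about how Notion HTML might be recognized.

```typescript
// Hypothetical stand-in for DOM nodes (not BlockNote's or the browser's types).
type SketchNode =
  | { kind: "text"; value: string }
  | { kind: "element"; tag: string; children: SketchNode[] };

// Elements whose whitespace is left untouched.
const PRESERVED_TAGS = new Set(["PRE", "CODE"]);

// Collapse each run of spaces, tabs, and newlines into one space.
// Nothing is trimmed, so a single leading or trailing space survives.
function collapseWhitespace(text: string): string {
  return text.replace(/[ \t\r\n]+/g, " ");
}

// Assumption: Notion clipboard HTML carries a "notionvc" comment marker.
function looksLikeNotionHTML(html: string): boolean {
  return /<!--\s*notionvc:/.test(html);
}

// Return a copy of the tree with text nodes normalized outside PRE/CODE.
function normalizeTree(node: SketchNode, insidePreserved = false): SketchNode {
  if (node.kind === "text") {
    return insidePreserved
      ? node
      : { kind: "text", value: collapseWhitespace(node.value) };
  }
  const preserved =
    insidePreserved || PRESERVED_TAGS.has(node.tag.toUpperCase());
  return {
    kind: "element",
    tag: node.tag,
    children: node.children.map((child) => normalizeTree(child, preserved)),
  };
}

// Skip normalization entirely for Notion HTML, which relies on "\n" for hard breaks.
function preprocessSketch(html: string, root: SketchNode): SketchNode {
  return looksLikeNotionHTML(html) ? root : normalizeTree(root);
}
```

In this sketch the Notion check runs on the raw HTML string before any tree work, so a Notion document is returned untouched, while other input has its formatting newlines collapsed everywhere except inside `pre` and `code`.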

Impact

Fixes pasting from MS Word and preserves the behavior required by PR #2230 for AI diffing. No negative impacts expected; Notion pasting behavior is fully preserved.

Testing

Screenshots/Video

N/A

Checklist

  • Code follows the project's coding standards.
  • Unit tests covering the new feature have been added.
  • All existing tests pass.
  • The documentation has been updated to reflect the new feature

Additional Notes

See PR comment for more technical details on why native ProseMirror `preserveWhitespace` settings were insufficient.

Summary by CodeRabbit

  • Bug Fixes

    • Improved HTML whitespace normalization during paste operations to handle formatting more consistently, with special handling for code and pre-formatted text blocks.
  • Tests

    • Added test coverage for Microsoft Word paste scenarios to ensure proper handling of Word-formatted content.

…g newlines (BLO-1065)

- Added a targeted DOM preprocessing step before ProseMirror parsing.
- Explicitly replicates CSS white-space: normal behavior internally to fix MS Word line breaks.
- Retains preserveWhitespace: true in PM to satisfy AI diffing constraints from PR #2230.
- Skips Notion HTML to preserve intentional hard breaks.
@YousefED YousefED (Collaborator, Author)

Background Context: Why DOM Preprocessing?

This PR resolves a conflict between two parsing requirements:

  1. Pasting from MS Word: Requires HTML newlines (`\n`) inside text nodes to be collapsed into spaces, matching standard browser `white-space: normal` behavior.
  2. AI HTML diffing (PR #2230, "fix: html diff error with whitespace"): Requires leading/trailing spaces (e.g., `<p>hello, </p>`) to be strictly preserved. If they are stripped, the AI diff algorithm throws an error due to length mismatches.

To support requirement 2, PR #2230 set ProseMirror's `preserveWhitespace: true`. However, this global setting disables all HTML whitespace collapsing, which breaks requirement 1 by converting every source formatting `\n` into a hard break.

Why not change preserveWhitespace to "full" or false?

  • `false` (default): Fixes Word pasting but strips leading/trailing spaces, breaking 5 test cases from PR #2230 ("fix: html diff error with whitespace").
  • `"full"`: Preserves both spaces and newlines, but keeps `\n` as literal text characters. Browsers visually collapse these on screen, yet exporting the document to Markdown bakes the HTML source wrapping into the Markdown output as hard breaks.

Conclusion:
Since ProseMirror's native parser cannot selectively preserve edge spaces while simultaneously collapsing internal newlines, the targeted manual DOM preprocessing step introduced in this PR is the most technically robust solution. It explicitly replicates CSS `white-space: normal` newline collapsing on the DOM to fix MS Word HTML, but leaves spaces intact so ProseMirror's `preserveWhitespace: true` can preserve them.
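
To make the trade-off concrete, here is a small string-level sketch of the three behaviors discussed above. It only approximates them on plain strings, based on the descriptions in this comment; it is not ProseMirror's parser, and all function names are hypothetical.

```typescript
// Rough analogue of `preserveWhitespace: false` as described above:
// whitespace runs collapse and edge whitespace is dropped.
function collapseAndTrim(text: string): string {
  return text.replace(/[ \t\r\n]+/g, " ").trim();
}

// Rough analogue of keeping all whitespace: newlines survive as literal characters.
function keepEverything(text: string): string {
  return text;
}

// Targeted approach: collapse runs (including newlines) but never trim.
function collapseKeepEdges(text: string): string {
  return text.replace(/[ \t\r\n]+/g, " ");
}

const aiDiffInput = "hello, "; // trailing space must survive
const wordInput = "Bonjour\nle monde"; // newline is only source formatting

console.log(JSON.stringify(collapseAndTrim(aiDiffInput))); // "hello,"  (edge space lost)
console.log(JSON.stringify(keepEverything(wordInput))); // "Bonjour\nle monde"  (newline kept)
console.log(JSON.stringify(collapseKeepEdges(aiDiffInput))); // "hello, "
console.log(JSON.stringify(collapseKeepEdges(wordInput))); // "Bonjour le monde"
```

Only the last strategy satisfies both inputs at once, which is the property the preprocessing step is meant to provide before the text reaches the parser.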

@coderabbitai coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 7014fd0 and 15ae625.

📒 Files selected for processing (1)
  • tests/src/unit/core/formatConversion/parse/parseTestInstances.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/src/unit/core/formatConversion/parse/parseTestInstances.ts

Walkthrough

The PR introduces HTML whitespace preprocessing functionality to normalize non-Notion HTML before parsing. A new utility module detects Notion-specific HTML markers and collapses whitespace in text nodes while preserving content within code elements.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| HTML Whitespace Preprocessing: `packages/core/src/api/parsers/html/parseHTML.ts`, `packages/core/src/api/parsers/html/util/normalizeWhitespace.ts` | Added `preprocessHTMLWhitespace()` utility that detects Notion HTML via comment markers and normalizes text node whitespace (except in PRE/CODE elements). Integrated into the `parseHTML` pipeline immediately after node creation. |
| Test Import Reordering: `tests/src/unit/core/clipboard/paste/pasteTestInstances.ts` | Reordered import statements for schema definitions; no functional changes. |
| Test Snapshots: `tests/src/unit/core/formatConversion/parse/__snapshots__/html/mixedTextTableCell.json`, `tests/src/unit/core/formatConversion/parse/__snapshots__/html/msWordPaste.json` | Updated `mixedTextTableCell` snapshot to reflect whitespace normalization behavior (collapsed multiline content). Added new `msWordPaste` snapshot for a Word-formatted HTML paste scenario with styled French content. |
| Test Cases: `tests/src/unit/core/formatConversion/parse/parseTestInstances.ts` | Added new `msWordPaste` test case to the HTML parsing test suite with Office Word namespace and metadata simulation. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: fixing MS Word paste whitespace handling while preserving AI diffing requirements. |
| Description check | ✅ Passed | The description covers all major sections of the template with comprehensive details including summary, rationale, changes, impact, testing, and checklist status. |
| Linked Issues check | ✅ Passed | The PR description explicitly links to two GitHub issues (#2548, #2356) in the Summary section with 'closes' keywords, properly connecting this change to tracked work. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to the whitespace normalization fix; only the import reordering in pasteTestInstances.ts is tangential but remains within scope of test organization. |

@pkg-pr-new pkg-pr-new bot commented Mar 10, 2026

@blocknote/ariakit

npm i https://pkg.pr.new/@blocknote/ariakit@2551

@blocknote/code-block

npm i https://pkg.pr.new/@blocknote/code-block@2551

@blocknote/core

npm i https://pkg.pr.new/@blocknote/core@2551

@blocknote/mantine

npm i https://pkg.pr.new/@blocknote/mantine@2551

@blocknote/react

npm i https://pkg.pr.new/@blocknote/react@2551

@blocknote/server-util

npm i https://pkg.pr.new/@blocknote/server-util@2551

@blocknote/shadcn

npm i https://pkg.pr.new/@blocknote/shadcn@2551

@blocknote/xl-ai

npm i https://pkg.pr.new/@blocknote/xl-ai@2551

@blocknote/xl-docx-exporter

npm i https://pkg.pr.new/@blocknote/xl-docx-exporter@2551

@blocknote/xl-email-exporter

npm i https://pkg.pr.new/@blocknote/xl-email-exporter@2551

@blocknote/xl-multi-column

npm i https://pkg.pr.new/@blocknote/xl-multi-column@2551

@blocknote/xl-odt-exporter

npm i https://pkg.pr.new/@blocknote/xl-odt-exporter@2551

@blocknote/xl-pdf-exporter

npm i https://pkg.pr.new/@blocknote/xl-pdf-exporter@2551

commit: cb74fcf

@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
wordcopy.txt (1)

11-21: Local file paths in test fixture.

This file contains local filesystem paths (file:////Users/yousef/Library/...) which are not functionally relevant and could be removed or anonymized for cleaner test data. These paths won't affect the parsing since they're in <link> elements that don't impact the text content extraction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@wordcopy.txt` around lines 11 - 21, The test fixture contains literal local
file URIs in the <link> elements (e.g., the file:////Users/yousef/... values in
the href attributes) which should be removed or anonymized; edit wordcopy.txt to
either delete those <link rel=File-List>, <link rel=themeData> and <link
rel=colorSchemeMapping> lines or replace their href values with neutral/dummy
paths (e.g., relative or placeholder URIs) so the fixture no longer contains
host-specific local filesystem paths while preserving the structure for parsing.

📥 Commits

Reviewing files that changed from the base of the PR and between a69bba9 and 7014fd0.

📒 Files selected for processing (7)
  • packages/core/src/api/parsers/html/parseHTML.ts
  • packages/core/src/api/parsers/html/util/normalizeWhitespace.ts
  • tests/src/unit/core/clipboard/paste/pasteTestInstances.ts
  • tests/src/unit/core/formatConversion/parse/__snapshots__/html/mixedTextTableCell.json
  • tests/src/unit/core/formatConversion/parse/__snapshots__/html/msWordPaste.json
  • tests/src/unit/core/formatConversion/parse/parseTestInstances.ts
  • wordcopy.txt

Development

Successfully merging this pull request may close these issues.

  • Copy paste from MS Word is broken
  • Broken copy paste from word (backspace created)
