Skip to content

Json.cs: Catastrophic regex backtracking in Regex_String causes extreme slowdown with JSON arrays #45

@tschinagl

Description

@tschinagl

Description

The JSON language definition in the underlying ColorCode-Universal dependency (ColorCode.Core/Compilation/Languages/Json.cs) contains a catastrophic backtracking bug in its Regex_String pattern. This causes exponential processing time (O(2^n)) when colorizing JSON that contains arrays with string values, making the application appear completely frozen.

The pattern is:

"[^"\\]*(?:\\[^\r\n]|[^"\\]*)*"

The repeating group (?:\\[^\r\n]|[^"\\]*) has two alternatives:

  • \\[^\r\n] — backslash escape, always consumes 2 characters ✓
  • [^"\\]*can match an empty string

The key-matching rule [,{]\s*("...")\s*: attempts to match JSON keys (string followed by :). For every , inside a JSON array, the engine tries this rule, matches the string value (e.g. "John Doe"), but fails when no : follows. On failure, the engine backtracks into the string subpattern and tries all possible ways to partition the n non-special characters between the outer [^"\\]* and the inner (?:[^"\\]*) repetition — resulting in O(2^n) backtracking states. For a 16-character string value this is ~65,000 states; for longer strings the number grows exponentially.

The fix is to replace the pattern with Friedl's Unrolled Loop:

"[^"\\]*(?:\\.[^"\\]*)*"

Each iteration of the group now must consume at least 2 characters via \\., eliminating the empty alternative and all backtracking.

Expected Behaviour

JSON arrays with string values are colorized in milliseconds, identical to simple JSON objects without arrays.

Actual Behaviour

Colorizing JSON containing arrays with string values causes extreme slowdown. Processing time grows exponentially with the length of string values inside arrays. Simple JSON objects (no arrays) are not affected.

Affected Version

3.0.1

Steps to Reproduce

  1. Set up a Markdig pipeline with UseColorCode()
  2. Call Markdig.Markdown.ToHtml() on a Markdown document containing the following fenced code block:
```json
{
  "Project": "My Project",
  "Tags": [
    "first long string value",
    "another long string value"
  ]
}
```
  1. Observe that processing takes several seconds (or longer) instead of being near-instantaneous
  2. Note that removing the array (keeping only simple key-value pairs) restores normal performance

Workaround (until the underlying bug is fixed in ColorCode-Universal): pass a corrected JSON language definition via additionalLanguages:

var pipeline = new MarkdownPipelineBuilder()
    .UseAdvancedExtensions()
    .UseColorCode(additionalLanguages: new[] { new JsonLanguageFixed() })
    .Build();
// Friedl's Unrolled Loop – no backtracking, correct escape handling
private sealed class JsonLanguageFixed : ILanguage
{
    private const string RegexString = @"""[^""\\]*(?:\\.[^""\\]*)*""";
    private const string RegexNumber = @"-?(?:0|[1-9][0-9]*)(?:\.[0-9]*)?(?:[eE][-+]?[0-9]+)?";

    public string Id => LanguageId.Json;
    public string Name => "JSON";
    public string CssClassName => "json";
    public string FirstLinePattern => null;

    public IList<LanguageRule> Rules => new List<LanguageRule>
    {
        new LanguageRule($@"[,\{{]\s*({RegexString})\s*:", new Dictionary<int, string> { { 1, ScopeName.JsonKey } }),
        new LanguageRule(RegexString,                       new Dictionary<int, string> { { 0, ScopeName.JsonString } }),
        new LanguageRule(RegexNumber,                       new Dictionary<int, string> { { 0, ScopeName.JsonNumber } }),
        new LanguageRule(@"\b(true|false|null)\b",          new Dictionary<int, string> { { 1, ScopeName.JsonConst } }),
    };

    public bool HasAlias(string lang) => false;
}

Checklist

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions