perf: replace regex findall with str.count in parser advance() #629

Open
matheusvir wants to merge 1 commit into theskumar:main from matheusvir:optimization/str-count-newline-advance

@matheusvir
What was done

Replaced re.findall with str.count() in the advance() method of the parser to eliminate unnecessary list allocations.

Previously, for each token read, re.findall created a Python list with all newline matches found, only to get its length.
With str.count(), counting is done directly in C, without allocating any intermediate data structure.
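The two approaches can be compared with a minimal sketch. The helper names below are hypothetical and are not taken from the parser; the `\n` pattern follows the description in this PR.

```python
import re


def count_newlines_findall(text: str) -> int:
    # Builds a list holding every matched "\n" string, then takes its length.
    return len(re.findall(r"\n", text))


def count_newlines_str_count(text: str) -> int:
    # Counts occurrences directly; no intermediate list is created.
    return text.count("\n")


sample = "A=1\nB=2\n\nC=3"
# Both helpers agree on the three newlines in the sample.
assert count_newlines_findall(sample) == count_newlines_str_count(sample) == 3
```

For a plain `\n` pattern the two helpers should return the same count for any input, which is what makes the replacement safe under the assumptions stated here.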

No new test files were added. The advance() method is exercised on every parse call, so the entire tests/test_parser.py suite (parametrized over ~30 input cases) provides coverage. All existing tests pass with no regressions.


Performance

All benchmarks were executed inside Docker containers to isolate the runtime environment and reduce host-specific variance from CPU scheduling, OS caching, and library versions.

Methodology

  • Input: a .env file with 24,999 variables, parsed in full on each run.
  • Baseline: 39 valid runs after outlier filtering.
  • Optimized: 32 valid runs after outlier filtering.
  • Timing: time.perf_counter_ns() with GC disabled during measurement.
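A timing loop along these lines could look like the sketch below. It is not the harness from the research repository: `make_env_text`, `time_runs`, and the stand-in workload are hypothetical, the input is shrunk for illustration, and the outlier filtering step is not shown because its method is not described here.

```python
import gc
import statistics
import time
from typing import Callable, List


def make_env_text(n_vars: int) -> str:
    # Synthetic input: one KEY=value pair per line (assumed layout).
    return "".join(f"VAR_{i}=value_{i}\n" for i in range(n_vars))


def time_runs(fn: Callable[[], object], runs: int) -> List[int]:
    """Return one duration in nanoseconds per run, with GC disabled while timing."""
    durations = []
    for _ in range(runs):
        gc_was_enabled = gc.isenabled()
        gc.disable()
        try:
            start = time.perf_counter_ns()
            fn()
            durations.append(time.perf_counter_ns() - start)
        finally:
            # Restore the collector state that was in effect before the run.
            if gc_was_enabled:
                gc.enable()
    return durations


text = make_env_text(1_000)
# Stand-in workload; the real benchmark parses the whole file instead.
samples = time_runs(lambda: text.count("\n"), runs=5)
mean_ms = statistics.mean(samples) / 1e6
stdev_ms = statistics.stdev(samples) / 1e6
print(f"mean={mean_ms:.4f} ms, stdev={stdev_ms:.4f} ms over {len(samples)} runs")
```

Disabling the collector only around the timed call keeps setup and teardown from being affected, and restoring the previous state avoids changing behavior for any code that runs afterwards.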

Rationale

The advance() method is called for every token during parsing. In the original implementation, each call to re.findall(r'\n', ...) creates a Python list, populates it with the matched strings, and then discards it, all just to count newlines. str.count('\n') performs the same counting in C without building an intermediate list, which makes it cheaper per call and adds up at scale.
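The per-call cost can be checked in isolation with `timeit`. This is a rough micro-benchmark sketch, not part of the benchmark suite above; the `token` value is a hypothetical example of the text passed to advance(), and absolute timings will vary by machine and Python version.

```python
import re
import timeit

token = "KEY=value\n"  # hypothetical token text, assumed for illustration

# Time 100,000 calls of each counting strategy on the same short string.
findall_s = timeit.timeit(lambda: len(re.findall(r"\n", token)), number=100_000)
count_s = timeit.timeit(lambda: token.count("\n"), number=100_000)

print(f"re.findall: {findall_s:.4f} s, str.count: {count_s:.4f} s")
```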

Results

| Variant | Mean (ms) | Std dev (ms) | Runs |
|---|---|---|---|
| Baseline | 11,403.40 | 1,227.85 | 39 |
| Optimized | 8,528.51 | 132.97 | 32 |
| Improvement | 25.21% | | |

[Figure: str.count newline benchmark]

Analysis

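The change reduces mean parse time by 25.21%, a difference of more than two baseline standard deviations. More notable is the roughly 9x reduction in standard deviation (from 1,227.85 ms to 132.97 ms), which suggests that allocation pressure in the original code also contributed to timing spikes. Because the garbage collector was disabled during measurement, these spikes are more likely due to allocator overhead than to GC pauses.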
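The headline figures follow directly from the Results table. The short calculation below reproduces them; the variable names are illustrative only.

```python
# Values copied from the Results table.
baseline_mean_ms = 11_403.40
optimized_mean_ms = 8_528.51
baseline_std_ms = 1_227.85
optimized_std_ms = 132.97

# Relative reduction in mean parse time.
improvement_pct = (baseline_mean_ms - optimized_mean_ms) / baseline_mean_ms * 100
# How many times smaller the optimized standard deviation is.
std_ratio = baseline_std_ms / optimized_std_ms

print(f"improvement: {improvement_pct:.2f}%")  # improvement: 25.21%
print(f"std dev ratio: {std_ratio:.1f}x")      # std dev ratio: 9.2x
```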

The change is minimal in scope, a single-line replacement, with no behavioral difference and no impact on correctness.

Reproducing the benchmark

The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.

Relevant files:

To run inside Docker:

# From the root of eda-oss-performance
docker build -t dotenv-perf ./setup/python-dotenv/

# Run baseline
docker run --rm -e EXPERIMENT=str_count_parser_test -e VARIANT=baseline dotenv-perf

# Run optimized
docker run --rm -e EXPERIMENT=str_count_parser_test -e VARIANT=optimized dotenv-perf

Results are written to results/python-dotenv/result_python-dotenv_str-count-newline-advance.json.


This is a targeted, low-risk improvement with measurable impact on large .env file parsing.


Relates to #504.

Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>
@matheusvir matheusvir force-pushed the optimization/str-count-newline-advance branch from 306c140 to 2c29354 Compare March 12, 2026 01:38
@matheusvir matheusvir changed the title from "perf(python-dotenv): replace regex findall with str.count in parser advance()" to "perf: replace regex findall with str.count in parser advance()" Mar 12, 2026