perf: replace regex findall with str.count in parser advance()#629
Open
matheusvir wants to merge 1 commit into theskumar:main from
Co-authored-by: Matheus Virgolino <matheus.virgolino.abilio.da.silva@ccc.ufcg.edu.br>
Co-authored-by: Manoel Netto <manoel.da.nobrega.eustaqueo.netto@ccc.ufcg.edu.br>
Co-authored-by: Pedro <pedroalmeida1896@gmail.com>
Co-authored-by: Lucaslg7 <lucasmoizinholg7@gmail.com>
Co-authored-by: RailtonDantas <railtondantas.code@gmail.com>
Co-authored-by: João Pereira <joao.pereira.de.oliveira@ccc.ufcg.edu.br>
What was done
Replaced re.findall with str.count() in the advance() method of the parser to eliminate unnecessary list allocations.

Previously, for each token read, re.findall created a Python list with all newline matches found, only to get its length.

With str.count(), counting is done directly in C, without allocating any intermediate data structure.

No new test files were added. The advance() method is exercised on every parse call, so the entire tests/test_parser.py suite (parametrized over ~30 input cases) provides coverage. All existing tests pass with no regressions.

Performance
All benchmarks were executed inside Docker containers to isolate the runtime environment and eliminate host-specific variance from CPU scheduling, OS caching, and library versions.
Methodology
A .env file with 24,999 variables, parsed in full on each run.
time.perf_counter_ns() with GC disabled during measurement.

Rationale

The advance() method is called for every token during parsing. In the original implementation, each call to re.findall(r'\n', ...) creates a Python list object, populates it with one matched string per newline, and then discards it, all to count newlines. str.count('\n') performs the same counting entirely in C without building an intermediate list, making it cheaper per call and highly effective at scale.

Results
Analysis
The change reduces mean parse time by 25.21% and is statistically confirmed. More notable is the roughly 9x reduction in standard deviation (from 1,227 ms to 132 ms), which suggests that the original allocation pressure was also responsible for inconsistent GC pauses and timing spikes.
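As a rough illustration of how mean and standard deviation figures like these can be collected, the sketch below times the two counting strategies in isolation using time.perf_counter_ns() with GC disabled. It is not the benchmark from the research repository: the helper names, the synthetic input, and the run count of 30 are all assumptions made for this example, and it measures only the newline count on one large string rather than a full parse.

```python
import gc
import re
import statistics
import time


def count_with_findall(text: str) -> int:
    # Builds a list of matched strings, then takes its length.
    return len(re.findall(r"\n", text))


def count_with_str_count(text: str) -> int:
    # Counts occurrences directly, without an intermediate list.
    return text.count("\n")


def time_ns(func, text: str, runs: int = 30) -> list[int]:
    """Return per-run durations in nanoseconds, measured with GC disabled."""
    gc_was_enabled = gc.isenabled()
    gc.disable()
    try:
        samples = []
        for _ in range(runs):
            start = time.perf_counter_ns()
            func(text)
            samples.append(time.perf_counter_ns() - start)
        return samples
    finally:
        # Restore the collector to whatever state it was in before.
        if gc_was_enabled:
            gc.enable()


# Hypothetical input: 24,999 KEY=value lines, mirroring the file size above.
sample_env = "".join(f"KEY_{i}=value_{i}\n" for i in range(24_999))

for name, func in [("re.findall", count_with_findall),
                   ("str.count", count_with_str_count)]:
    samples = time_ns(func, sample_env)
    print(name, statistics.mean(samples), statistics.stdev(samples))
```

Absolute numbers from a micro-benchmark like this will not match the parse-level results reported here; it only shows the measurement pattern.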
The change is minimal in scope (a single-line replacement), with no behavioral difference and no impact on correctness.
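The equivalence claim can be spot-checked with a small sketch like the following. It assumes the pattern being replaced is the literal r'\n' as described in the rationale; the sample strings and helper names are made up for illustration. The last lines use a hypothetical broader pattern, not taken from this PR, to show that the equivalence would not hold if the regex also matched bare carriage returns.

```python
import re


def newlines_via_findall(chunk: str) -> int:
    # Original approach as described: materialize matches, then count them.
    return len(re.findall(r"\n", chunk))


def newlines_via_count(chunk: str) -> int:
    # Replacement approach: count the substring directly.
    return chunk.count("\n")


# Illustrative inputs: empty, no newline, trailing newline, multiple lines,
# blank lines, a quoted multi-line value, and Windows line endings.
samples = [
    "",
    "A=1",
    "A=1\n",
    "A=1\nB=2\n",
    "\n\n\n",
    "A='multi\nline'\n",
    "A=1\r\nB=2\r\n",
]

for chunk in samples:
    assert newlines_via_findall(chunk) == newlines_via_count(chunk)

# Hypothetical broader pattern (an assumption, not from the PR): it also
# matches a bare "\r", so it diverges from str.count("\n") on such input.
broad_newline = re.compile(r"\r\n|\n|\r")
cr_only = "A=1\rB=2\r"
assert len(broad_newline.findall(cr_only)) == 2
assert cr_only.count("\n") == 0
```

If the parser's actual pattern is broader than r'\n', inputs using "\r" alone as a line terminator would be the case worth testing explicitly.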
Reproducing the benchmark
The full benchmark infrastructure is available in the research repository at matheusvir/eda-oss-performance.
Relevant files:
setup/python-dotenv/Dockerfile
experiments/python-dotenv/str_count_parser_test/baseline_pythondotenv_str-count-newline-advance.py
experiments/python-dotenv/str_count_parser_test/experiment_pythondotenv_str-count-newline-advance.py
experiments/python-dotenv/str_count_parser_test/merge_results.py
experiments/python-dotenv/str_count_parser_test/run.sh

To run inside Docker:

Results are written to results/python-dotenv/result_python-dotenv_str-count-newline-advance.json.

This is a targeted, low-risk improvement with measurable impact on large .env file parsing.

Relates to #504.