
perf: buffer accumulation in _write_query_params() reduces f.write() calls #790

Draft
mykaul wants to merge 1 commit into scylladb:master from mykaul:perf/buffer-accum-write-params

@mykaul mykaul commented Apr 4, 2026

Summary

Replace per-parameter write_value(f, param) loops with buffer accumulation (list.append + b"".join + single f.write()), reducing f.write() calls from (2*N + 1) to 1 for N query parameters in the execute/query path.

This supersedes the closed PR #788 (inlining approach). Buffer accumulation is strictly superior: it achieves equal or better speedups in every scenario while producing a smaller, cleaner diff.

Motivation

Every CQL query/execute call serializes query parameters via write_value(f, param), which does 2 f.write() calls per parameter (length prefix + data). For queries with vector embeddings (128–1536 dimensions), this creates many small writes per message.

Buffer accumulation collects all bytes in a Python list and writes once, eliminating per-parameter function call overhead and replacing many small f.write() calls with a single larger one.
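To make the write-count claim concrete, here is a minimal sketch of the two shapes, not the driver's actual code. It assumes each parameter is encoded as a 4-byte big-endian length followed by the raw bytes, preceded by a 2-byte parameter count. `CountingBuffer`, `write_params_per_value`, and `write_params_buffered` are hypothetical names used only for illustration.

```python
import io
import struct

# Assumption: 2-byte count, then per parameter a 4-byte signed length + data.
uint16_pack = struct.Struct(">H").pack
int32_pack = struct.Struct(">i").pack


class CountingBuffer(io.BytesIO):
    """BytesIO that counts write() calls (illustrative helper only)."""

    def __init__(self):
        super().__init__()
        self.write_calls = 0

    def write(self, data):
        self.write_calls += 1
        return super().write(data)


def write_params_per_value(f, params):
    # Baseline shape: one write for the count, then two writes per parameter.
    f.write(uint16_pack(len(params)))
    for param in params:
        f.write(int32_pack(len(param)))
        f.write(param)


def write_params_buffered(f, params):
    # Buffer accumulation: collect every piece, join once, write once.
    parts = [uint16_pack(len(params))]
    parts_append = parts.append  # cache the bound method outside the loop
    for param in params:
        parts_append(int32_pack(len(param)))
        parts_append(param)
    f.write(b"".join(parts))


params = [b"alice", b"\x00" * 512, b""]
baseline, buffered = CountingBuffer(), CountingBuffer()
write_params_per_value(baseline, params)
write_params_buffered(buffered, params)

print(baseline.write_calls, buffered.write_calls)  # prints: 7 1
print(baseline.getvalue() == buffered.getvalue())  # prints: True
```

With N = 3 parameters the baseline issues 2*3 + 1 = 7 writes and the buffered version issues 1, while the bytes on the wire stay the same.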

What changed

cassandra/protocol.py (2 hunks)

  1. _QueryMessage._write_query_params() — Buffer accumulation for the parameter loop. Local variable caching (_int32_pack, _parts_append) for Cython-friendly tight loop.

  2. ExecuteMessage._write_query_params() — Removed unnecessary super() pass-through override (now inherited directly from _QueryMessage).
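As an illustration of the second hunk, the sketch below shows why a pure pass-through override can be dropped without changing behavior. The class names are hypothetical stand-ins, not the real classes in cassandra/protocol.py, and the method body is a placeholder.

```python
class QueryMessageSketch:
    """Hypothetical stand-in for the shared base class."""

    def write_query_params(self, out, protocol_version):
        out.append(("params", protocol_version))


class ExecuteWithOverride(QueryMessageSketch):
    # Redundant: this only forwards to the parent, costing one extra call.
    def write_query_params(self, out, protocol_version):
        super().write_query_params(out, protocol_version)


class ExecuteInherited(QueryMessageSketch):
    # No override: attribute lookup resolves straight to the base method.
    pass


with_override, inherited = [], []
ExecuteWithOverride().write_query_params(with_override, 4)
ExecuteInherited().write_query_params(inherited, 4)

print(with_override == inherited)  # prints: True
print("write_query_params" in vars(ExecuteInherited))  # prints: False
```

Both subclasses produce the same result; the inherited version simply skips the intermediate frame.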

tests/unit/test_protocol.py

Added 14 new test methods in WriteQueryParamsBufferAccumulationTest:

  • Normal, NULL, UNSET, mixed, empty bytes, empty list, None params
  • Large vector (768D), many params (50), cross-protocol (v3 vs v4)
  • Full encode_message round-trip through ProtocolHandler
  • Single NULL and single UNSET regression tests
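The following is a sketch of what a byte-identity test for these cases could look like; it is not the test code from this PR. It assumes the native protocol v4 convention that NULL is encoded as length -1 and UNSET as length -2. `UNSET`, `encode_reference`, and `encode_buffered` are hypothetical names, and the real tests exercise the message classes rather than standalone functions.

```python
import io
import struct
import unittest

int32_pack = struct.Struct(">i").pack
UNSET = object()  # hypothetical sentinel standing in for the driver's unset marker


def encode_reference(params):
    # Per-value shape: separate writes for the length prefix and the data.
    f = io.BytesIO()
    for p in params:
        if p is None:
            f.write(int32_pack(-1))  # assumption: NULL is length -1
        elif p is UNSET:
            f.write(int32_pack(-2))  # assumption: UNSET is length -2
        else:
            f.write(int32_pack(len(p)))
            f.write(p)
    return f.getvalue()


def encode_buffered(params):
    # Buffer accumulation shape: append pieces, join once.
    parts = []
    append = parts.append
    for p in params:
        if p is None:
            append(int32_pack(-1))
        elif p is UNSET:
            append(int32_pack(-2))
        else:
            append(int32_pack(len(p)))
            append(p)
    return b"".join(parts)


class BufferedEncodingSketchTest(unittest.TestCase):
    CASES = {
        "normal": [b"a", b"bc"],
        "mixed": [b"a", None, UNSET, b""],
        "empty list": [],
        "large vector": [b"\x00" * (768 * 4)],
        "many params": [b"x"] * 50,
    }

    def test_bytes_identical(self):
        for name, params in self.CASES.items():
            with self.subTest(name):
                self.assertEqual(encode_reference(params), encode_buffered(params))

    def test_single_null(self):
        self.assertEqual(encode_buffered([None]), b"\xff\xff\xff\xff")

    def test_single_unset(self):
        self.assertEqual(encode_buffered([UNSET]), b"\xff\xff\xff\xfe")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(BufferedEncodingSketchTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```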

benchmarks/bench_execute_write_params.py (new)

Standalone benchmark script for reproducibility.
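For readers who want a feel for the methodology without opening the script, here is a rough sketch of a best-of-several timing loop. It is not the contents of benchmarks/bench_execute_write_params.py: the two functions are simplified stand-ins for the old and new write paths, the iteration count is much smaller than the real runs, and absolute numbers will vary by machine.

```python
import io
import struct
import timeit

int32_pack = struct.Struct(">i").pack


def per_value(params):
    # Simplified baseline: two writes per parameter.
    f = io.BytesIO()
    for p in params:
        f.write(int32_pack(len(p)))
        f.write(p)
    return f.getvalue()


def buffered(params):
    # Simplified buffer accumulation: one write in total.
    f = io.BytesIO()
    parts = []
    append = parts.append
    for p in params:
        append(int32_pack(len(p)))
        append(p)
    f.write(b"".join(parts))
    return f.getvalue()


def ns_per_call(func, params, iterations=20_000, repeats=5):
    # Best of `repeats` runs, reported as nanoseconds per call.
    timings = timeit.repeat(lambda: func(params), number=iterations, repeat=repeats)
    return min(timings) / iterations * 1e9


scenarios = {
    "128D float32 vector (1 param)": [struct.pack(">128f", *([0.5] * 128))],
    "10 text columns": [b"text-value-%d" % i for i in range(10)],
}

for name, params in scenarios.items():
    base = ns_per_call(per_value, params)
    accum = ns_per_call(buffered, params)
    print(f"{name}: {base:.0f} ns -> {accum:.0f} ns ({base / accum:.2f}x)")
```

Taking the minimum over repeats is a common way to reduce noise from other processes when comparing two small code paths.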

Benchmark results

Environment: Python 3.14, Cython .so compiled, 500K iterations, best of 5 runs.

| Scenario | Baseline (ns/call) | Buffer accum (ns/call) | Speedup |
|---|---|---|---|
| 128D float32 vector (1 param) | 794 | 634 | 1.25x |
| 768D float32 vector (1 param) | 858 | 766 | 1.12x |
| 1536D float32 vector (1 param) | 924 | 834 | 1.11x |
| 10 text columns | 1222 | 940 | 1.30x |

Comparison with PR #788 (inlining)

| Scenario | Inlining (PR #788) | Buffer accum (this PR) |
|---|---|---|
| 128D vector | 1.05x | 1.25x |
| 768D vector | 1.06x | 1.12x |
| 1536D vector | 1.05x | 1.11x |
| 10 text columns | 1.20x | 1.30x |

Implementation notes

  • list.append + b"".join benchmarked faster than bytearray +=
  • Local variable caching avoids repeated attribute lookups in the hot loop
  • Fully backward-compatible — serialized bytes are identical (verified by 14 tests)
  • protocol.py is Cython-compiled; optimization benefits both pure Python and Cython paths
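The two accumulation strategies mentioned in the first note can be compared with a small sketch like the one below. The function names are hypothetical, and the timing result depends on the interpreter and payload sizes, so no particular outcome is asserted here.

```python
import struct
import timeit

int32_pack = struct.Struct(">i").pack


def join_list(params):
    parts = []
    parts_append = parts.append  # local alias avoids an attribute lookup per iteration
    for p in params:
        parts_append(int32_pack(len(p)))
        parts_append(p)
    return b"".join(parts)


def extend_bytearray(params):
    buf = bytearray()
    for p in params:
        buf += int32_pack(len(p))
        buf += p
    return bytes(buf)


params = [b"text-value-%d" % i for i in range(10)]

for func in (join_list, extend_bytearray):
    best = min(timeit.repeat(lambda: func(params), number=20_000, repeat=5))
    print(f"{func.__name__}: {best / 20_000 * 1e9:.0f} ns/call")
```

Both functions return identical bytes, so the choice between them is purely a performance question.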

@mykaul mykaul marked this pull request as draft April 4, 2026 14:23
@mykaul mykaul force-pushed the perf/buffer-accum-write-params branch 2 times, most recently from bc1545f to 9b21d5b Compare April 4, 2026 14:36

mykaul commented Apr 4, 2026

Just spitting this here:
Is there any value in ~0.1us improvement?

Honest answer: on its own, ~100ns per call is tiny. But context matters.

Where it adds up:

  • High-throughput vector INSERT pipelines doing 100K+ ops/sec: that's ~10ms/sec of CPU saved per core
  • The write path is single-threaded per connection, so it's on the critical path
  • Multiply by batch sizes (BatchMessage calls write_value per param per query)

Where it doesn't matter:

  • Network round-trip to Scylla is typically 100-500us minimum
  • Serialization of the vector bytes themselves dwarfs the write_value overhead
  • Real workloads are rarely bottlenecked on parameter serialization

The real argument for merging is code quality, not the 100ns:

  • Removing the redundant super() pass-through is a clear cleanup
  • The buffer accumulation is a well-known pattern (fewer writes = better) and the code isn't harder to read
  • It establishes a pattern for future BatchMessage optimization where the savings would be larger (N queries × M params)

But if the question is "would a user ever notice this in production?": no, almost certainly not. It's a micro-optimization. Whether it's worth the diff churn and review cost is a judgment call for the maintainers.
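For reference, the back-of-the-envelope numbers in this comment work out as follows. This is just arithmetic on the figures quoted above, taking the lower end of the round-trip range.

```python
ops_per_sec = 100_000     # throughput figure quoted in the comment
saved_ns_per_call = 100   # approximate per-call saving quoted in the comment
network_rtt_us = 100      # lower end of the quoted 100-500us round-trip range

saved_ms_per_sec = ops_per_sec * saved_ns_per_call / 1_000_000
share_of_rtt = saved_ns_per_call / (network_rtt_us * 1_000)

print(f"{saved_ms_per_sec:.0f} ms of CPU saved per second per core")  # prints: 10 ms ...
print(f"{share_of_rtt:.1%} of one network round trip")  # prints: 0.1% ...
```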

@mykaul mykaul force-pushed the perf/buffer-accum-write-params branch from 9b21d5b to f2be2a8 Compare April 4, 2026 14:47
Replace the per-parameter write_value(f, param) loop in
_QueryMessage._write_query_params() with a buffer accumulation approach:
list.append + b"".join + single f.write().

This reduces the number of f.write() calls from 2*N+1 to 1, which is
significant for vector workloads with large parameters.

Also removes the redundant ExecuteMessage._write_query_params()
pass-through override to avoid extra MRO lookup per call.

Includes 14 unit tests covering normal, NULL, UNSET, empty, large vector,
and mixed parameter scenarios for both ExecuteMessage and QueryMessage.

Includes a benchmark script (benchmarks/bench_execute_write_params.py).