tests: improve integration test performance (~28% faster)#771
tests: improve integration test performance (~28% faster)#771mykaul wants to merge 12 commits intoscylladb:masterfrom
Conversation
|
Nice, I'll try to look into this soon (let me know when it is ready to review). |
My poor (pathetic?) laptop is reviewing the change while I'm in parallel asking it to do the same for the gocql driver. There's so much it can do in parallel... And it might skew the results too. I'll check this later. I also need another AI review on this. |
93ab1b6 to
4e5f8e7
Compare
CI Performance Results: PR #771 vs BaselineAll 11 CI jobs passed. Compared against the 4 most recently merged PRs (#759, #729, #728, #706): Per-Job Timing
Summary
Also validated locally: 10x10 runs across ScyllaDB 2026.1, master, 2025.4, and 2025.1 — all passed with no regressions. |
Cluster Setup/Teardown Analysis (from CI logs)Compared the Cluster transitions (teardown/setup cycles)
Standard tests wall-clock timing
Per-file timing (biggest savings)
*0s means the file completed within the same second as the previous file's timestamp — the tests themselves are fast, the baseline duration was dominated by cluster setup. Where the savings come from
Note: The |
|
@mykaul , i think it worth to split it into into 2-3 PRs:
|
4e5f8e7 to
d5b9611
Compare
v2 changesForce-pushed with two changes:
Separately, the investigation of the two pre-existing test failures is complete:
|
ba41af4 to
ca431dd
Compare
|
@mykaul , could you please rebase |
…r message The error message said 'Failed after 100 attempts' but the retry limit is 10 (while tries < 10). This was a copy-paste error from execute_until_pass() which does retry 100 times.
Change test_cluster.py from --smp 1 to --smp 2 to match the standard configuration used by other test files. This enables cluster topology consolidation in a follow-up commit.
Merge cluster names for test files with identical configurations: - test_shard_aware.py: 'shard_aware' -> 'cluster_tests' (same --smp 2, 3 nodes as test_cluster.py) - test_client_routes.py: 'test_client_routes' -> 'shared_aware' (same --smp 2 --memory 2048M, 3 nodes as test_use_keyspace.py) This allows the CCM cluster to be reused when these tests run sequentially, avoiding a full cluster teardown and restart. Also update conftest.py cleanup list to include 'cluster_tests' and 'test_client_routes_replacement' which were previously missing.
Add pytest_collection_modifyitems hook that sorts test modules by their cluster configuration group. This ensures tests sharing the same CCM cluster (same name, same node count, same ext opts) run adjacently, avoiding unnecessary cluster teardown/restart cycles between modules. Groups: default singledc -> cluster_tests -> shared_aware -> single_node -> destructive/special clusters.
These test files don't require multiple nodes for their test logic (they test data types, protocol handlers, row factories, UDTs, and client warnings). Using a single node reduces resource usage and cluster startup time. Files switched from use_singledc() to use_single_node(): - test_types.py - test_cython_protocol_handlers.py - test_custom_protocol_handler.py - test_row_factories.py - test_udts.py - test_client_warnings.py
Move remove_cluster() from setUp (which ran before every test) to only the destructive test methods that actually need a fresh cluster. Read-only tests (test_token_aware_is_used_by_default, test_token_aware_composite_key, test_token_aware_with_local_table, test_dc_aware_roundrobin_two_dcs, test_dc_aware_roundrobin_two_dcs_2) can now reuse an existing cluster, avoiding 5 unnecessary cluster teardown/startup cycles.
The test_can_connect_with_sslauth test asserted exact equality between auth warning count and ReadyMessage count. With --smp 2, shard-aware connections produce additional ReadyMessages, breaking the equality. Drop the exact equality check and assert a lower bound of >= 3 (one per node connection in a 3-node cluster). The control connection and shard-aware connections may produce additional warnings, so the actual count varies between runs.
The cluster name 'test_concurrent_schema_change_and_node_kill' (43 chars) causes the maintenance socket path to exceed the 107-byte sun_path limit on Linux when the working directory is deep enough. Shorten to 'test_schema_kill' to stay well within the limit for all environments.
Several test modules set SCYLLA_EXT_OPTS in setup_module() but never restore it in teardown_module(). When tests are reordered to share clusters, stale values can leak into subsequent modules and cause misconfigured clusters. Save the original value before overwriting and restore it on teardown in: - test_cluster.py - test_shard_aware.py - test_use_keyspace.py - test_ip_change.py - test_client_routes.py (module-level and TestFullNodeReplacementThroughNlb) - test_authentication.py
Add an actions/cache step for ~/.ccm/repository keyed on the Scylla version and runner OS. On cache hit the 'Download Scylla' step becomes a near-instant no-op. On miss (or version bump) CCM re-downloads as before, so there is no regression risk.
The routes_visible() polling function in TestSslThroughNlb creates a new TestCluster with SSL on every retry attempt. Under resource pressure (--smp 2 --memory 2048M shared across 3 nodes), the SSL handshake plus CQL negotiation can exceed the default 5-second connect_timeout, causing intermittent OperationTimedOut failures. Fix by passing connect_timeout=30 to TestCluster (matching the generous timeout recommended for slow-starting clusters) and increasing the wait_until_not_raised parameters from (0.5, 10) to (1, 30), consistent with other wait_until_not_raised calls in this file (lines 773, 855).
The test_tablets.py file uses @pytest.mark.last to ensure the decommission test runs last. Register this mark in pyproject.toml to eliminate the PytestUnknownMarkWarning.
|
@mykaul , still not rebased |
ca431dd to
cb56bdc
Compare
It is now. The thumbs up was 'I will', not 'I did'. |
Summary
Reduces integration test wall-clock time by ~28% (from ~33 min to ~24 min per run against ScyllaDB 2026.1) through a combination of sleep elimination, cluster topology consolidation, and test ordering optimizations.
Motivation
Integration tests are the primary bottleneck in the CI feedback loop. The majority of wasted time comes from:
time.sleep()calls — tests sleeping for fixed durations instead of polling for the actual conditionChanges (11 commits, each independent and cherry-pickable)
Bug fixes (commits 1-2)
execute_with_long_wait_retrysaid "Failed after 100 attempts" but the retry limit is 10setup_keyspace(): The cluster is already confirmed ready by three prior condition-based waitsSleep elimination (commits 3, 9)
wait_until())wait_until_not_raised), metrics cluster recovery (15s→polling), tablets metadata refresh (13s→polling), simulacron connection pool (20s→polling)Cluster topology optimization (commits 4-7)
test_cluster.pyto--smp 2: Consistent with the rest of the test suitetest_shard_aware.pysharescluster_testsconfig;test_client_routes.pymodule-level sharesshared_awareconfigconftest.py):pytest_collection_modifyitemshook groups tests by their CCM cluster requirements, minimizing cluster restarts across filestest_types.py,test_cython_protocol_handlers.py,test_custom_protocol_handler.py,test_row_factories.py,test_udts.py,test_client_warnings.py— these tests don't exercise multi-node behaviorTest churn reduction (commit 8)
LoadBalancingPolicyTests: Moveremove_cluster()fromsetUp(ran before every test) to only the destructive test methods that actually modify cluster topologyCompatibility fixes (commits 10-11)
--smp 2: Shard-aware connections produce additionalReadyMessages, breaking the exact equality assertion. Simplified to>= 4lower bound (the test's actual purpose)test_tablets.pyby reordering destructive tests before non-destructive ones. Uses original collection index instead.Test Results
Timing comparison (ScyllaDB 2026.1)
Integration test matrix (10 consecutive green runs each)
unstable/master:latest)Cassandra 4 and 5 could not be tested on the development machine (requires Java 8/11; only Java 21/25 available). These should be validated in CI.
Pre-existing failures (deselected, not caused by this PR)
TestFullNodeReplacementThroughNlb::test_should_survive_full_node_replacement_through_nlb—RuntimeError: The process is dead, returncode=1during node bootstrap. Root cause: cluster nametest_client_routes_replacementcreates Unix socket paths exceeding the 108-char limit.TestConcurrentSchemaChangeAndNodeKill(master only) — same Unix socket path length issue with cluster nametest_concurrent_schema_change_and_node_kill.Unit tests
651 passed, 16 skipped, 0 failures.