tests: fix load balancing policy tests for Scylla Raft topology #779

Draft
mykaul wants to merge 4 commits into scylladb:master from mykaul:fix/test-roundrobin-decommission-dead-node

@mykaul mykaul commented Mar 31, 2026

Summary

Four integration tests in tests/integration/long/test_loadbalancingpolicies.py fail against modern Scylla (>= 2026.1, and likely earlier Raft-enabled versions). This PR fixes all four, each in a separate commit.

Problem

Commits 1–3: Raft rejects topology changes with dead nodes

Scylla's Raft topology coordinator rejects decommission and bootstrap operations when there are dead (unreachable) nodes in the cluster. Three tests trigger this by calling force_stop() on a node and then immediately attempting a topology change while that node is still down:

  • test_roundrobin: force_stop(3) → decommission(1); the decommission fails because node 3 is dead.
  • test_roundrobin_two_dcs: force_stop(1) → bootstrap(5, 'dc3'); the bootstrap fails because node 1 is dead.
  • test_roundrobin_two_dcs_2: force_stop(1) → bootstrap(5, 'dc1'); the bootstrap fails because node 1 is dead.

Fix: Reorder the operations so the topology change (decommission/bootstrap) happens before the node is killed, or the dead node is restarted before the topology change:

  • test_roundrobin: Restart node 3 (start(3) + wait_for_up) before decommissioning node 1.
  • test_roundrobin_two_dcs: Bootstrap node 5 (bootstrap(5, 'dc3') + wait_for_up) before force-stopping node 1.
  • test_roundrobin_two_dcs_2: Bootstrap node 5 (bootstrap(5, 'dc1') + wait_for_up) before force-stopping node 1.

The test semantics are preserved — the same nodes end up alive/dead/decommissioned, and the same query distribution assertions hold.
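
The reordering can be illustrated with a toy model. This is only a sketch: `ToyCluster`, `TopologyChangeRejected`, and the four `*_order` functions are hypothetical names made up for illustration, not the ccm or driver test helpers, and the four-node starting cluster in the two-DC case is an assumption. The model encodes a single rule taken from the description above: a decommission or bootstrap is refused while any node is down.

```python
class TopologyChangeRejected(RuntimeError):
    """Raised by the toy cluster when a topology change is refused."""


class ToyCluster:
    """Hypothetical stand-in for a cluster; not the ccm or driver test API."""

    def __init__(self, node_ids):
        self.up = set(node_ids)
        self.down = set()
        self.decommissioned = set()

    def force_stop(self, node):
        self.up.remove(node)
        self.down.add(node)

    def start(self, node):
        self.down.remove(node)
        self.up.add(node)

    def _require_no_dead_nodes(self, operation):
        # The one rule being modelled: any dead node blocks the change.
        if self.down:
            raise TopologyChangeRejected(
                f"{operation} rejected, dead nodes: {sorted(self.down)}")

    def decommission(self, node):
        self._require_no_dead_nodes("decommission")
        self.up.remove(node)
        self.decommissioned.add(node)

    def bootstrap(self, node):
        self._require_no_dead_nodes("bootstrap")
        self.up.add(node)


def roundrobin_old_order(cluster):
    cluster.force_stop(3)
    cluster.decommission(1)  # refused: node 3 is still down


def roundrobin_new_order(cluster):
    cluster.force_stop(3)
    cluster.start(3)         # bring node 3 back first
    cluster.decommission(1)  # accepted: no dead nodes remain


def two_dcs_old_order(cluster):
    cluster.force_stop(1)
    cluster.bootstrap(5)     # refused: node 1 is down


def two_dcs_new_order(cluster):
    cluster.bootstrap(5)     # accepted: every node is still up
    cluster.force_stop(1)
```

In this model the old orderings raise `TopologyChangeRejected`, while the new orderings finish with node 1 decommissioned and node 3 up (test_roundrobin) or node 5 up and node 1 down (the two-DC tests).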

Commit 4: Shard-aware routing distributes across replicas

test_token_aware_with_rf_2 hardcodes the expectation that all 12 TokenAwarePolicy queries with RF=2 go to a single node (node 2). With Scylla's shard-aware routing, queries may be distributed across both replicas (e.g., {node2: 5, node3: 7}), since the driver can route to the specific shard owning the token on either replica.

Fix: Instead of asserting node2 == 12, node3 == 0, assert that node1 == 0 (not a replica) and node2 + node3 == 12 (all queries go to replicas). The second assertion block (after stopping node 2) remains unchanged — with only one replica alive, all 12 queries correctly go to node 3.
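
The relaxed check can be sketched as a standalone helper. The function name `assert_only_replicas_queried` and the plain dict of per-node counts are assumptions made for illustration; the actual test uses the suite's own coordinator statistics object, which is not reproduced here.

```python
def assert_only_replicas_queried(counts, replicas, non_replicas, expected_total):
    """Check that no query hit a non-replica and that the replicas
    together served exactly expected_total queries.

    counts maps a node number to the number of queries it coordinated;
    nodes missing from the mapping are treated as having served zero.
    """
    for node in non_replicas:
        served = counts.get(node, 0)
        assert served == 0, f"node {node} is not a replica but served {served}"

    total = sum(counts.get(node, 0) for node in replicas)
    assert total == expected_total, (
        f"replicas {sorted(replicas)} served {total}, expected {expected_total}")


# Either split across the two replicas is accepted.
assert_only_replicas_queried({2: 5, 3: 7}, replicas={2, 3},
                             non_replicas={1}, expected_total=12)
assert_only_replicas_queried({2: 12, 3: 0}, replicas={2, 3},
                             non_replicas={1}, expected_total=12)
```

The check deliberately says nothing about how the 12 queries are split between node 2 and node 3, so it holds whether the driver always picks one replica or spreads requests across both.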

Test results

Full suite: 16 passed, 1 skipped (the skipped test is test_token_aware_with_transient_replication, gated on Cassandra 4.0+).

tests/...::test_black_list_with_host_filter_policy PASSED
tests/...::test_dc_aware_roundrobin_one_remote_host PASSED
tests/...::test_dc_aware_roundrobin_two_dcs PASSED
tests/...::test_dc_aware_roundrobin_two_dcs_2 PASSED
tests/...::test_roundrobin PASSED
tests/...::test_roundrobin_two_dcs PASSED
tests/...::test_roundrobin_two_dcs_2 PASSED
tests/...::test_token_aware PASSED
tests/...::test_token_aware_composite_key PASSED
tests/...::test_token_aware_is_used_by_default PASSED
tests/...::test_token_aware_prepared PASSED
tests/...::test_token_aware_with_local_table PASSED
tests/...::test_token_aware_with_rf_2 PASSED
tests/...::test_token_aware_with_shuffle_rf2 PASSED
tests/...::test_token_aware_with_shuffle_rf3 PASSED
tests/...::test_token_aware_with_transient_replication SKIPPED
tests/...::test_white_list PASSED
============ 16 passed, 1 skipped =============

Tested with: Scylla release:2026.1, Python 3.14, EVENT_LOOP_MANAGER=asyncio, PROTOCOL_VERSION=4.

Copilot AI left a comment

Pull request overview

Fixes integration load balancing policy tests to be compatible with modern Raft-enabled Scylla behavior, where certain topology changes are rejected if any nodes are down, and shard-aware routing can distribute token-aware traffic across multiple replicas.

Changes:

  • Reorders node start/stop vs. decommission/bootstrap operations in RoundRobinPolicy tests to avoid Raft topology-change rejection when a node is dead.
  • Updates test_token_aware_with_rf_2 to accept shard-aware routing distributing requests across both replicas (while still ensuring all requests go to replicas only).

mykaul added 4 commits April 3, 2026 23:11

  • Scylla's Raft topology coordinator rejects decommission when there are dead nodes in the cluster. Restart node 3 before decommissioning node 1.
  • Scylla's Raft topology coordinator rejects bootstrap when there are dead nodes in the cluster. Bootstrap node 5 before force-stopping node 1.
  • Scylla's Raft topology coordinator rejects bootstrap when there are dead nodes in the cluster. Bootstrap node 5 before force-stopping node 1.
  • Scylla's shard-aware routing may distribute TokenAwarePolicy queries across both replicas instead of always picking the first one. Assert that the total query count across both replicas equals 12.

@mykaul mykaul force-pushed the fix/test-roundrobin-decommission-dead-node branch from 17d215f to 69611fa Compare April 3, 2026 20:20