LWT routing: RackAwareRoundRobinPolicy demotes Paxos leader #780
Description
Problem
When TokenAwarePolicy wraps RackAwareRoundRobinPolicy, LWT queries may be sent to the wrong replica (not the Paxos leader), adding an unnecessary network hop and increasing latency.
Root Cause
In TokenAwarePolicy.make_query_plan() (cassandra/policies.py:496-529), LWT queries correctly skip replica shuffling (lines 517-518). However, replicas are still passed through yield_in_order(), which buckets them by distance:

    def yield_in_order(hosts):
        for distance in [HostDistance.LOCAL_RACK, HostDistance.LOCAL, HostDistance.REMOTE]:
            for replica in hosts:
                if replica.is_up and child.distance(replica) == distance:
                    yield replica

With RackAwareRoundRobinPolicy, replicas in the same rack as the client get LOCAL_RACK distance, while replicas in other racks get LOCAL. This causes the Paxos leader (first natural replica in token-ring order) to be demoted if it's in a different rack.
Example
3 replicas in DC1, client in rack1:
- Replica 1 (Paxos leader, first in ring order) → rack2 → distance LOCAL
- Replica 2 → rack1 → distance LOCAL_RACK
- Replica 3 → rack2 → distance LOCAL
Result: yield_in_order yields Replica 2 first (same rack), then Replica 1 (Paxos leader). The query goes to Replica 2, which must forward the Paxos proposal to Replica 1 — an extra network hop.
Note: With DCAwareRoundRobinPolicy, all local DC replicas get LOCAL distance, so ring order is preserved and this bug does not manifest.
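
The reordering in the example can be reproduced without a cluster. The sketch below is a minimal simulation, not driver code: `HostDistance`, `Replica`, and `rack_aware_distance` are hypothetical stand-ins for the driver's types and for what the child policy's `distance()` would return, and `yield_in_order` takes the distance function as a parameter instead of closing over `child`.

```python
from dataclasses import dataclass
from enum import Enum


class HostDistance(Enum):
    # Stand-in for the driver's HostDistance constants (values are arbitrary here).
    LOCAL_RACK = "local_rack"
    LOCAL = "local"
    REMOTE = "remote"


@dataclass(frozen=True)
class Replica:
    # Hypothetical minimal host: only the fields the bucketing logic reads.
    name: str
    rack: str
    is_up: bool = True


def rack_aware_distance(replica, client_rack="rack1"):
    # Assumption: every replica is in the client's DC, so only the rack matters.
    if replica.rack == client_rack:
        return HostDistance.LOCAL_RACK
    return HostDistance.LOCAL


def dc_aware_distance(replica):
    # Assumption: a DC-aware child reports LOCAL for every local-DC replica.
    return HostDistance.LOCAL


def yield_in_order(hosts, distance_fn):
    # Same bucketing as in the issue: nearest distance class first.
    for distance in (HostDistance.LOCAL_RACK, HostDistance.LOCAL, HostDistance.REMOTE):
        for replica in hosts:
            if replica.is_up and distance_fn(replica) == distance:
                yield replica


# Token-ring order; replica1 is the Paxos leader.
ring_order = [
    Replica("replica1", "rack2"),
    Replica("replica2", "rack1"),
    Replica("replica3", "rack2"),
]

rack_plan = [r.name for r in yield_in_order(ring_order, rack_aware_distance)]
dc_plan = [r.name for r in yield_in_order(ring_order, dc_aware_distance)]

print(rack_plan)  # ['replica2', 'replica1', 'replica3'] -> leader demoted
print(dc_plan)    # ['replica1', 'replica2', 'replica3'] -> ring order kept
```

With the rack-aware stand-in the leader drops to second place, while the DC-aware stand-in leaves ring order untouched, which matches the note above.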
Impact
- Extra network hop per LWT query when the Paxos leader is in a different rack
- Increased Paxos latency and potential contention
- Only affects users of RackAwareRoundRobinPolicy (or any child policy that distinguishes LOCAL_RACK from LOCAL)
Proposed Fix
For LWT queries, bypass yield_in_order and yield replicas in their natural token-ring order (filtering only down/ignored hosts):

    if query.is_lwt():
        for replica in replicas:
            if replica.is_up and child.distance(replica) != HostDistance.IGNORED:
                yield replica
    else:
        yield from yield_in_order(replicas)

Reference
gocql handles this correctly — for LWT queries, replicas are yielded in natural token-ring order without distance-based reordering.
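
As a rough illustration of the behaviour the proposed fix is aiming for, here is a self-contained sketch that could serve as the basis for a regression test. It is not the driver's implementation: `HostDistance` and `Host` are hypothetical stand-ins, the distance is stored on each host rather than computed by a child policy, and `replica_plan` is an invented helper that takes an `is_lwt` flag instead of a query object.

```python
from collections import namedtuple
from enum import Enum


class HostDistance(Enum):
    # Stand-in for the driver's HostDistance constants (values are arbitrary here).
    IGNORED = "ignored"
    LOCAL_RACK = "local_rack"
    LOCAL = "local"
    REMOTE = "remote"


# Hypothetical host record; a real policy would ask child.distance(host).
Host = namedtuple("Host", ["name", "distance", "is_up"])


def replica_plan(replicas, is_lwt):
    if is_lwt:
        # LWT path: keep token-ring order, drop only down or ignored hosts.
        for replica in replicas:
            if replica.is_up and replica.distance != HostDistance.IGNORED:
                yield replica
    else:
        # Non-LWT path: bucket by distance, nearest class first.
        for distance in (HostDistance.LOCAL_RACK, HostDistance.LOCAL, HostDistance.REMOTE):
            for replica in replicas:
                if replica.is_up and replica.distance == distance:
                    yield replica


# Token-ring order from the example; replica1 is the Paxos leader.
replicas = [
    Host("replica1", HostDistance.LOCAL, True),
    Host("replica2", HostDistance.LOCAL_RACK, True),
    Host("replica3", HostDistance.LOCAL, True),
]

lwt_plan = [h.name for h in replica_plan(replicas, is_lwt=True)]
regular_plan = [h.name for h in replica_plan(replicas, is_lwt=False)]

# If the leader is down, the next replica in ring order leads the LWT plan.
leader_down = [replicas[0]._replace(is_up=False)] + replicas[1:]
fallback_plan = [h.name for h in replica_plan(leader_down, is_lwt=True)]

print(lwt_plan)       # ['replica1', 'replica2', 'replica3']
print(regular_plan)   # ['replica2', 'replica1', 'replica3']
print(fallback_plan)  # ['replica2', 'replica3']
```

Non-LWT queries keep the rack-local preference, so the change is confined to the LWT path.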