LWT routing: RackAwareRoundRobinPolicy demotes Paxos leader #780

@mykaul

Problem

When TokenAwarePolicy wraps RackAwareRoundRobinPolicy, LWT queries may be sent to the wrong replica (not the Paxos leader), adding an unnecessary network hop and increasing latency.

Root Cause

In TokenAwarePolicy.make_query_plan() (cassandra/policies.py:496-529), LWT queries correctly skip replica shuffling (lines 517-518). However, the replicas are still passed through yield_in_order(), which buckets them by distance:

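def yield_in_order(hosts):
    # `child` is the wrapped child policy, captured from make_query_plan().
    # Each pass re-scans hosts, so closer distance buckets are always yielded first.
    for distance in [HostDistance.LOCAL_RACK, HostDistance.LOCAL, HostDistance.REMOTE]:
        for replica in hosts:
            if replica.is_up and child.distance(replica) == distance:
                yield replica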

With RackAwareRoundRobinPolicy, replicas in the same rack as the client get LOCAL_RACK distance, while replicas in other racks get LOCAL. This causes the Paxos leader (first natural replica in token-ring order) to be demoted if it's in a different rack.

Example

3 replicas in DC1, client in rack1:

  • Replica 1 (Paxos leader, ring order first) → rack2 → distance LOCAL
  • Replica 2 → rack1 → distance LOCAL_RACK
  • Replica 3 → rack2 → distance LOCAL

Result: yield_in_order yields Replica 2 first (same rack), then Replica 1 (Paxos leader). The query goes to Replica 2, which must forward the Paxos proposal to Replica 1 — an extra network hop.

Note: With DCAwareRoundRobinPolicy, all local DC replicas get LOCAL distance, so ring order is preserved and this bug does not manifest.
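The example above can be reproduced without a cluster. The following is a minimal sketch, not driver code: `Host`, the local `HostDistance` enum, `rack_aware_distance` and `dc_aware_distance` are hypothetical stand-ins for the driver's classes and for the child policy's distance() method, and `yield_in_order` takes the distance function as a parameter instead of closing over `child`.

```python
from collections import namedtuple
from enum import Enum


class HostDistance(Enum):
    # Stand-in for the driver's HostDistance; values are illustrative only.
    LOCAL_RACK = 0
    LOCAL = 1
    REMOTE = 2


# Hypothetical minimal host record: name, rack, and liveness flag.
Host = namedtuple("Host", ["name", "rack", "is_up"])


def rack_aware_distance(host, local_rack="rack1"):
    # Mimics a rack-aware child policy: same rack -> LOCAL_RACK, else LOCAL.
    return HostDistance.LOCAL_RACK if host.rack == local_rack else HostDistance.LOCAL


def dc_aware_distance(host):
    # Mimics a DC-aware child policy: every local-DC host is LOCAL.
    return HostDistance.LOCAL


def yield_in_order(hosts, distance_fn):
    # Same bucketing logic as in the issue, with the child policy's
    # distance() replaced by an explicit function argument.
    for distance in [HostDistance.LOCAL_RACK, HostDistance.LOCAL, HostDistance.REMOTE]:
        for replica in hosts:
            if replica.is_up and distance_fn(replica) == distance:
                yield replica


# Replicas in token-ring order; replica1 is the Paxos leader.
replicas = [
    Host("replica1", "rack2", True),
    Host("replica2", "rack1", True),
    Host("replica3", "rack2", True),
]

rack_aware_order = [h.name for h in yield_in_order(replicas, rack_aware_distance)]
dc_aware_order = [h.name for h in yield_in_order(replicas, dc_aware_distance)]

print(rack_aware_order)  # ['replica2', 'replica1', 'replica3']
print(dc_aware_order)    # ['replica1', 'replica2', 'replica3']
```

With the rack-aware stand-in, replica2 jumps ahead of the leader; with the DC-aware stand-in, ring order survives because every replica lands in the same bucket.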

Impact

  • Extra network hop per LWT query when the Paxos leader is in a different rack
  • Increased Paxos latency and potential contention
  • Only affects users of RackAwareRoundRobinPolicy (or any child policy that distinguishes LOCAL_RACK from LOCAL)

Proposed Fix

For LWT queries, bypass yield_in_order and yield replicas in their natural token-ring order, filtering out only hosts that are down or whose distance is IGNORED:

if query.is_lwt():
    for replica in replicas:
        if replica.is_up and child.distance(replica) != HostDistance.IGNORED:
            yield replica
else:
    yield from yield_in_order(replicas)
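To illustrate the effect of the proposed branch, here is a small standalone sketch under stated assumptions: `Host`, the local `HostDistance` enum, `rack_aware_distance` and `plan_replicas` are hypothetical names for this illustration only, and `is_lwt` is passed as a plain boolean rather than read from a query object. It is not a patch against cassandra/policies.py.

```python
from collections import namedtuple
from enum import Enum


class HostDistance(Enum):
    # Stand-in for the driver's HostDistance; values are illustrative only.
    IGNORED = -1
    LOCAL_RACK = 0
    LOCAL = 1
    REMOTE = 2


# Hypothetical minimal host record: name, rack, and liveness flag.
Host = namedtuple("Host", ["name", "rack", "is_up"])


def rack_aware_distance(host, local_rack="rack1"):
    # Mimics a rack-aware child policy: same rack -> LOCAL_RACK, else LOCAL.
    return HostDistance.LOCAL_RACK if host.rack == local_rack else HostDistance.LOCAL


def plan_replicas(replicas, distance_fn, is_lwt):
    if is_lwt:
        # Proposed behaviour: keep token-ring order, drop only down/ignored hosts.
        for replica in replicas:
            if replica.is_up and distance_fn(replica) != HostDistance.IGNORED:
                yield replica
    else:
        # Existing behaviour: bucket replicas by distance, closest first.
        for distance in [HostDistance.LOCAL_RACK, HostDistance.LOCAL, HostDistance.REMOTE]:
            for replica in replicas:
                if replica.is_up and distance_fn(replica) == distance:
                    yield replica


# Replicas in token-ring order; replica1 is the Paxos leader, replica3 is down.
replicas = [
    Host("replica1", "rack2", True),
    Host("replica2", "rack1", True),
    Host("replica3", "rack2", False),
]

lwt_order = [h.name for h in plan_replicas(replicas, rack_aware_distance, is_lwt=True)]
regular_order = [h.name for h in plan_replicas(replicas, rack_aware_distance, is_lwt=False)]

print(lwt_order)      # ['replica1', 'replica2']
print(regular_order)  # ['replica2', 'replica1']
```

In this sketch the LWT path keeps the Paxos leader first while still skipping the down replica, and non-LWT queries keep the existing rack-local preference.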

Reference

gocql handles this correctly — for LWT queries, replicas are yielded in natural token-ring order without distance-based reordering.
