Showcase subscription partition_by unexpected behavior #321

Open
jdewar wants to merge 1 commit into commanded:master from merchant-ly:jd/assert-we-not-crazy
Conversation


@jdewar jdewar commented Mar 15, 2026

This PR is a test that 'passes' by showing how the eventstore behaves when a concurrent subscriber crashes. Subscribers that are not assigned a particular event by partition_by may still handle that event. Given both the documentation and the BEAM's attitude to crashing processes, this was surprising.

partition_by is presented as a guarantee in both the eventstore docs

  You can use a `partition_by` function to guarantee ordering of events within a particular group 
  (e.g. per stream) but still allow events for different groups to be processed concurrently.

and the commanded docs

  If you need to enforce an order, such as per stream or by using a field from
  an event, you can define a `c:partition_by/2` callback function in the event
  handler module. The function will receive each event and its metadata and must
  return a consistent term indicating the event's partition. Events which return
  the same term are guaranteed to be processed in order by the same event
  handler instance.

But eventstore cannot hold to the partition_by guarantee if a subscriber crashes. It falls back to round-robining events to the remaining subscribers; it prefers to deliver events as fast as possible.

We can see that eventstore's subscription FSM has machinery that will not advance the subscription's overall 'last_seen' until the lowest 'in_flight' event is acked. These 'at least once' semantics unfortunately fail when coupled with a concurrent Commanded EventHandler's own 'last_seen' counter; see the # Ignore already seen event. branch of the handle fn:

when not is_nil(last_seen_event) and event_number <= last_seen_event

If an event handler that would not normally handle the event receives it, it can conclude from its own last_seen that the event is old, and therefore must already have been handled, so it acks it without doing any work. 💔
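The failure mode above can be sketched outside of Elixir. Below is a small Python simulation (mechanics and names are illustrative stand-ins, not the actual eventstore/Commanded code) of a crashed subscriber's in-flight event being round-robined to a surviving handler whose last_seen is already past it:

```python
# Illustrative sketch only: models the described interaction between
# eventstore's round-robin fallback and Commanded's per-handler last_seen.

class Handler:
    """A stand-in for a concurrent Commanded EventHandler instance."""

    def __init__(self, name):
        self.name = name
        self.last_seen = None   # mirrors the handler's last_seen counter
        self.handled = []

    def deliver(self, event_number):
        # the "# Ignore already seen event." branch: ack without handling
        if self.last_seen is not None and event_number <= self.last_seen:
            return "acked_without_handling"
        self.handled.append(event_number)
        self.last_seen = event_number
        return "handled"

a = Handler("a")
b = Handler("b")

# partition_by routes stream X -> a and stream Y -> b
a.deliver(1)            # event 1 (stream X)
a.deliver(3)            # event 3 (stream X); a.last_seen is now 3

# b crashes while event 2 (stream Y) is in flight; eventstore falls back
# to round-robin and redelivers event 2 to the surviving subscriber a
result = a.deliver(2)   # acked trivially; event 2 is never processed
```

In the real system the subscription's overall 'last_seen' then advances past event 2 because it was acked, so no handler ever processes it.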

A minor note: we do see some hedging in the documentation, but it wasn't enough to stop us from assuming that partition_by acts like a hash ring[1]

  While events with different partitions may be processed
  concurrently by another instance. An attempt will be made to distribute
  events as evenly as possible to all running event handler instances.

This PR isn't meant to be merged; it's just an easy way to provide code that runs against eventstore. The PR code was produced by Claude and reviewed by me. This post was human-written, and rewritten to remove as much of my usual rambling as I can manage.


  1. Unless you phash2 the key in your partition_by, it's not, which surprised us too, because the documentation shows bare uuids being used. If you have two events partitioned by the same uuid, but they are far enough apart in the event stream that your concurrent handlers have had to handle some other uuid in between, you are not guaranteed to land back on the same handler. Round robin again takes over, because 'in_partition?' compares uuid to uuid and will eventually return 'false', letting the event fall where it may. If you use phash2, then you can "guarantee" stable assignment (provided nothing crashes, as above).
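The footnoted difference can be shown with a toy Python model (illustrative only; this is not the real 'in_partition?' implementation, and crc32 stands in for :erlang.phash2) comparing sticky uuid-to-uuid assignment against hash-based assignment:

```python
import zlib  # crc32 stands in for :erlang.phash2 in this sketch

def hash_partition(key, n):
    # phash2-style stable mapping: the same key always maps to the
    # same subscriber index
    return zlib.crc32(key.encode()) % n

def sticky_partition(keys, n):
    # Toy model of the in_partition? behavior described above: a
    # subscriber "owns" a key only while it is its current key;
    # otherwise the event round-robins to the next subscriber,
    # which adopts the new key.
    current = [None] * n
    rr = 0
    out = []
    for key in keys:
        if key in current:
            sub = current.index(key)
        else:
            sub = rr % n
            rr += 1
            current[sub] = key
        out.append(sub)
    return out

keys = ["u1", "u2", "u3", "u1"]   # u1's two events are far apart
sticky = sticky_partition(keys, 2)
stable = [hash_partition(k, 2) for k in keys]

print(sticky[0] == sticky[3])     # sticky: u1 lands on different subscribers
print(stable[0] == stable[3])     # hashed: u1 is pinned to one subscriber
```

With two subscribers, the sticky model assigns u1's first event to subscriber 0 but its second to subscriber 1, while the hashed mapping trivially returns the same index for the same key.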

@jdewar jdewar changed the title Showcase subscription partition_by misbehavior Showcase subscription partition_by unexpected behavior Mar 15, 2026

yordis commented Mar 17, 2026

For what it's worth, I've restricted myself from using this feature due to the complexity of getting it right. I hope it gets resolved still.


jdewar commented Mar 19, 2026

We're facing the fact that concurrency with partition_by simply cannot be used correctly; round robin is the only safe expectation, because it is the fallback. partition_by cannot guarantee that an 'update' event will be handled after a 'create' event. If you only have idempotent inserts, you can use concurrency, but then you don't need partition_by.
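As a tiny illustration of that last point, a sketch in plain Python (hypothetical names, not Commanded code) of why idempotent inserts tolerate out-of-order, round-robined delivery while create-then-update handlers do not:

```python
# Illustrative projection store: a dict stands in for a database table.
store = {}

def idempotent_insert(event):
    # setdefault: the result is the same whatever order (and however
    # many times) the events arrive
    store.setdefault(event["id"], event["data"])

def update(event):
    # an update only makes sense after its create -- ordering matters
    if event["id"] not in store:
        raise KeyError("update arrived before create")
    store[event["id"]] = event["data"]

# deliberately out of order and duplicated: the projection is still correct
for ev in [{"id": 2, "data": "b"}, {"id": 1, "data": "a"}, {"id": 1, "data": "a"}]:
    idempotent_insert(ev)

# an update whose create was never handled (e.g. skipped after a crash) fails
try:
    update({"id": 3, "data": "c"})
    order_violation = None
except KeyError as err:
    order_violation = err
```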

And even to get there, we edited the core Commanded EventHandler to not skip events that it thinks it has already seen, because it may skip an unhandled event if another concurrent handler is unlucky enough to crash. (We also edited commanded_ecto_projections to write out a composite-keyed projection_version, to correctly track each handler's output.)

Crashing can come easily, too; it's not just broken events or code. If you have too many projectors trying to quickly ack events that they don't care about, subscriptions can fail while talking to their local eventstore, because of the 5-second limit on the :ack GenServer call.

We can't see an obvious way to resolve these things, so we're also using concurrency 1, and only projectors, in order to deal with the 5s :ack problem.


yordis commented Mar 19, 2026

@jdewar could you fan out to something like Oban, RabbitMQ, or another component that allows you to do concurrent processing? Unless you can't afford the extra ms of latency, I would strongly suggest doing that.


jdewar commented Mar 19, 2026

Yes, sorry, this wasn't meant as an 'I need a solution.' It was meant as 'we finally looked deeply at a problem and traced it to partition_by not being able to do what it claims if the system has a hiccup, and the system can hiccup relatively easily if you press it.'

The fix is unfortunately not as simple as 'use Oban or RabbitMQ' at this point; we're sort of past the point of no return for the time being. One small fix that has helped was adding a config value to raise that 5s timeout for :acking; as best we can tell, more backpressure there is not a problem with single EventHandlers.

We've got Oban involved in some parts, and in other parts we're just doing concurrency 1 for now. The system as a whole is fine, but we have years of data to import from a CRUD app into a Commanded app. We hadn't tried a full import until recently, because we hadn't yet built out the whole system to handle one. The sudden influx of events overwhelmed our dozens of projectors, :acking started failing repeatedly, and then we saw hints of data loss. Any more rambling is best done in a chatroom; where are chatty Commanded folk these days?
