Showcase subscription partition_by unexpected behavior#321
Showcase subscription partition_by unexpected behavior#321jdewar wants to merge 1 commit intocommanded:masterfrom
Conversation
|
For what is worth, I restricted myself from using such feature due to the complexity of getting it right. I hope it gets resolved still. |
|
We're facing the fact that concurrency with partition_by simply cannot be used right; round robin is the only safe expectation because it is the fallback. partition_by cannot guarantee that an 'update' event will be handled after a 'create' event. If you only have idempotent inserts, you can do concurrency, but then you don't need partition_by. And even to get there we edited the core commanded EventHandler to not skip events that it thinks it has seen, because it may skip an unhandled event if another concurrent handler is unlucky enough to crash. (We also edited commanded_ecto_projections to write out a composite keyed projection_version to correctly track each handler's output.) Crashing can be easy, too, it's not just busted events/code. If you have too many projectors trying to quickly ack events that they don't care about, subscriptions can fail trying to talk to their local eventstore with their 5 second limit on the :ack GenServer call. We can't see an obvious way to resolve these things so we're also using concurrency 1 and only projectors in order to deal with the 5s :ack problem. |
|
@jdewar could you fanout to something like Oban, RabbitMQ or other component that allows you to do concurrent processing? Unless you can not afford the extra ms latency, I will strong suggest you to do that |
|
Yes, sorry, this wasn't meant as a 'I need a solution.' This was meant as 'We finally looked deeply at a problem and we traced it to The fix is unfortunately not as simple as 'use Oban or RabbitMQ' at this point, we're sorta past the point of no return for the time being. One small fix that has helped was adding a config value to up that 5s timeout for :acking; as best as we can tell, more backpressure there is not a problem with single EventHandlers. We've got Oban involved in some parts, and other parts we're just doing concurrency 1 for now. The system as a whole is fine, but we have years of data to import from a CRUD app into a Commanded app. We hadn't tried a full import until recently, because we hadn't yet built out the whole system to handle a whole import. The sudden influx of events overwhelmed our dozens of projectors, and :acking started failing repeatedly and then we saw hints of data loss. Any more rambling is best done in a chatroom, where are chatty Commanded folk these days? |
This PR is a test that 'passes' by showing the behavior of the eventstore in the face of a concurrent subscriber crashing. Subscribers that do not
partition_byto a particular event may still handle that event. Given both the documentation and the BEAM's attitude to crashing processes, this was surprising.The partition_by is presented as a guarantee in both eventstore docs
and commanded docs
But eventstore cannot hold to the guarantee of partition_by if a subscriber crashes. It falls back to round robining to the remaining subscribers; it prefers to deliver events as fast as possible.
We see eventstore's FSM has some machinery that will not advance the subscription's overall 'last_seen' until the lowest event that's 'in_flight' is acked. These 'at least once' semantics unfortunately fail when coupled with a concurrent Commanded EventHandler's 'last_seen' counter, see the
# Ignore already seen event.branch of thehandlefn:If an event handler that would-not-normally-handle-the-event handles it, it can decide it's an old event based on its own last_seen, so it must've handled it, so it acks it trivially. 💔
Minor note, we do see some hedging in the documentation, but it wasn't enough to make us assume that partition_by was not acting like a hash ring[1]
This PR isn't meant to be merged, it's an easy way to provide code to run on eventstore. The PR code was a product of Claude, reviewed by me. This post was human written and rewritten to remove as much of my usual rambling as I can manage.