One thing missing here is to avoid synchronous communication. Sync comms tie client state to server state; if the server fails, the client will be responsible for handling it.
If you use queue-based services, your clients can 'fire and forget', and your error-handling logic can be encapsulated by the queue and its consumers.
This means that if you deploy broken code, you get a queue backup rather than a cascading failure across all of your systems. Queue backups are also really easy to monitor, and they make a great smoke-signal alert.
The other way to go, for sync comms, would be circuit breakers.
My current project uses queue-based communications exclusively and it's great. I have retry queues, which use over-provisioned compute, and a dead-letter queue for manually investigating messages that caused persistent failures.
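Roughly, the consumer side looks like this (a minimal in-process sketch using stdlib queues in place of a real broker; handle and the queue names are placeholders):

import queue

MAX_ATTEMPTS = 3
work_q, retry_q, dead_letter_q = queue.Queue(), queue.Queue(), queue.Queue()

def handle(msg):
    ...  # business logic; raises on failure

def consume(q):
    msg, attempts = q.get()
    try:
        handle(msg)
    except Exception:
        if attempts + 1 >= MAX_ATTEMPTS:
            dead_letter_q.put((msg, attempts + 1))  # park for manual investigation
        else:
            retry_q.put((msg, attempts + 1))        # picked up by over-provisioned retry workers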
Isolation of state is probably the #1 suggestion I have for building scalable, resilient, self-healing services.
100% agree with and would echo the content in the article, otherwise.
edit: Also, idempotency. It's worth taking the time to write idempotent services.
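To sketch what I mean by that (illustrative only; a real service would keep the processed-IDs set in a durable store, not process memory):

processed_ids = set()  # durable storage in practice

def handle_once(message_id, payload):
    # Redeliveries and retries are safe: a message we've already applied is a no-op.
    if message_id in processed_ids:
        return
    apply_change(payload)      # the actual side effect
    processed_ids.add(message_id)

def apply_change(payload):
    ...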
Queues introduce entire other dimensions of complexity. Now you've got to monitor your queue size (and ideally autoscale when the queue backlog grows), and have a dead letter queue for messages that failed processing and monitor that. Tracing requests is harder b/c now your logs are scattered around the worker fleet, so debugging becomes harder. You need more APIs for the client to poll the async state, and you need some data store to track the async state (and now you've got to worry about maintaining and monitoring that data store). It's a can of worms that should be avoided when possible.
The only way to know whether or not to accept this kind of complexity is to think about your use cases. Quite often it's fine (and desirable) to fail fast and make the client retry.
> Queues introduce entire other dimensions of complexity. Now you've got to monitor your queue size (and ideally autoscale when the queue backlog grows), and have a dead letter queue for messages that failed processing and monitor that.
Wouldn't you need similar mechanisms without a queue? It seems to me queues give more visibility and more hooks for autoscaling without adding additional instrumentation to the app itself.
The queue sits behind a service. If you don't do the work in the service, and do it in a queue instead, you've got more infrastructure to manage, monitor, and autoscale.
> Now you've got to monitor your queue size (and ideally autoscale when the queue backlog grows), and have a dead letter queue for messages that failed processing and monitor that.
These are both trivial things to do though. I don't see how it's any more complex than monitoring a circuit breaker, or setting up CI/CD.
> Tracing requests is harder b/c now your logs are scattered around the worker fleet, so debugging becomes harder.
Correlation IDs work just as well in a queue-based system as in a sync system.
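E.g. (a sketch; send and handle stand in for your broker client and business logic):

import json, logging, uuid

def publish(queue_name, body, correlation_id=None):
    cid = correlation_id or str(uuid.uuid4())
    send(queue_name, json.dumps({"correlation_id": cid, "body": body}))  # send = broker client

def on_message(raw):
    msg = json.loads(raw)
    logging.info("processing cid=%s", msg["correlation_id"])  # every log line carries the ID
    handle(msg["body"])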
> You need more APIs for the client to poll the async state, and you need some data store to track the async state (and now you've got to worry about maintaining and monitoring that data store).
Not sure what you mean. Again, in the Amazon Cart example, your state is just the cart - regardless of sync or async. You don't add any new state management at all.
It's certainly fairly simple to use queues for straightforward, independent actions, such as sending off an email when someone says they forgot their password. It's less obvious to me how your proposal lines up with things that are less so, such as a user placing an order.
So I'm having trouble envisioning how your system actually works. At least in the stuff I work on, realistically, very few things are "fire and forget". Most things are initiated by a user and they expect to see something as a result of their actions, regardless of how the back end is implemented.
> It's less obvious to me how your proposal lines up with things that are less so.
Usually you have a sync wrapper around async work, maybe poll based.
As an example, I believe that Amazon's "place in cart" is completely async with queues in the background. But, of course, you may want to synchronously wait on the client side for that event to propagate around.
You get all of the benefits in your backend services - retry logic is encapsulated, failures won't cascade, scaling is trivial, etc. The client is tied to service state, but so be it.
You'll want to ensure idempotency, certainly. Actually, yeah, that belongs in the article too. Idempotent services are so much easier to reason about.
So, assuming an idempotent API, the client would "send", then poll "check", and call "send" again upon a timeout. Or, more likely, a simple backend service handles that for you, providing a sync API.
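In sketch form (send and check being whatever your API's 'send' and 'check' calls are):

import time

def send_and_confirm(send, check, attempts=3, timeout=5.0, interval=0.5):
    for _ in range(attempts):
        send()                            # idempotent, so re-sending on timeout is safe
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            if check():                   # has the async work become visible yet?
                return True
            time.sleep(interval)
    return False                          # give up and surface an error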
Going from Sync to Async usually means splitting up your states explicitly.
For example:
Given two communication types:
Sync (<->)
Async (->)
We might have a sync diagram like this:
A <-> B
A calls B, and B 'calls back' into A (via a response).
The async diagram would look like one of these two diagrams:
A -> B -> A
or:
A -> B -> C
Whatever code in A happens after what the sync call would have been gets split out into its own handler.
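A sketch of that split (call_b, publish, and notify_user are placeholders):

# Sync version: A calls B and the follow-up runs inline.
def place_order_sync(order):
    result = call_b(order)        # blocking request/response
    notify_user(result)

# Async version: A just fires the message; the follow-up lives in its own handler.
def place_order_async(order):
    publish("b-requests", order)  # A -> B

def on_b_reply(result):           # B -> A (or B -> C), invoked by the queue consumer
    notify_user(result)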
If your system is extremely simple this may not be worth it. But you could say that about anything in the article, really.
> If your system is extremely simple this may not be worth it. But you could say that about anything in the article, really.
I don't know. The level of complexity this introduces seems to be way higher than anything in the original article.
E.g. for placing something in the cart, it's not only the next page that relies on it, but anything that deals with the cart - things like checkout, removing from cart, updating quantities, etc. Adding to cart has to be mindful of queued checkout attempts, and vice versa. It sounds way messier than the comparatively isolated concepts such as CI, DI, and zero-downtime deploys.
Async communication certainly seems desirable across subsystems that are only loosely connected. E.g. shopping, admin, warehouse, accounting, and reporting subsystems. But by using asynchronous comms you're actually introducing more state into your application than with synchronous comms. State you should be testing - both in unit & integration tests (somewhat easy) and full end-to-end tests (much more expensive).
I'm sure Amazon has all sorts of complexities that are required at their scale. But you can heavily benefit from the techniques in the OP even if you aren't Amazon scale.
> The level of complexity this introduces seems to be way higher than anything in the original article.
I don't find it very complex at all. You send a message to a service. You want to get some state after, you query for it.
> but anything that deals with the cart - things like checkout, removing from cart, updating quantities, etc. Adding to cart has to be mindful of queued checkout attempts.
How so? Those pages just query to get the cart's state. You'd do this even in a sync system. The only difference is that on the backend this might be implemented via a poll. On subsequent pages you'd only poll the one time, since the 'add-to-cart' call was synchronous.
> But by using asynchronous comms you're actually introducing more state into your application than synchronous comms.
I don't see how. Again, with the cart example, there is always the same state - the 'cart'. You mutate the cart, and then you query for its state. If you have an expectation of its state, due to that mutation, you just poll it. You can trivially abstract that into a sync comm at your edge.
import time

def make_sync(mutation, query, timeout=5.0, interval=0.1):
    mutation()                               # fire the async mutation
    deadline = time.monotonic() + timeout
    while not expected_state(query()):       # expected_state: caller-defined check
        if time.monotonic() > deadline:
            raise TimeoutError("timed out waiting for expected state")
        time.sleep(interval)
Your solution seems to assume only one thing will be accessing what is being mutated at once. If another thread comes in and gets the cart (e.g. maybe the user reloads the page), it isn't waiting on the operation to be processed anymore, so it just sees whatever state happens to be there. If you remove it from the queue after a few seconds of failure then fine. But if the point is "self healing" it presumably hangs around for a while.
You have to deal with this to some extent in any webapp that has more than 1 OS thread or process. But if you're keeping actions around for minutes or hours instead of seconds, you're going to have to account for a lot of weird stuff you normally wouldn't.
If you really wanted something like this, I would think you would want a concept of "stale" data and up-to-date data. If a process is OK with stale data, the service can just return whatever it sees. But if a process isn't OK with it (like, say, checkout), you probably need to wait on the queue to finish processing.
And since the front end may care about these states, you probably need to expose this concept to clients. It seems like a client should be able to know if it's serving stale data so you can warn the user.
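Something like this, say (load_cart and count_pending_mutations are hypothetical helpers):

def get_cart(cart_id):
    cart = load_cart(cart_id)                       # materialized state as of now
    pending = count_pending_mutations(cart_id)      # queued changes not yet applied
    return {"items": cart, "stale": pending > 0}    # client can warn the user if stale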
Yeah. And this kind of system often leads to large delays without clear causes, so people re-tap many times and get into weird states.
On the extreme end of doing this well, you have stuff like Redux, which effectively does this locally plus a cache so you don't notice it. Redux has some super nice attributes and some definite advantages, but it is many times more complicated than a sync call.
> I don't know. The level of complexity this introduces seems to be way higher than anything in the original article.
HTTP isn't synchronous; we just often pretend it is. You can pretend messages are synchronous using exactly the same semantics, and get exactly the same terrible failure modes when requests or responses are lost or delayed.