
If I could get away with a vendor cloud queue I wouldn't move to Kafka for the hell of it, but when I've needed higher-volume data shipping I've never found the infra as hard as people make it out to be. Unless you're doing insane volumes in single clusters, most of the pieces around it can run on defaults for a surprisingly long time.

You can cost-footgun yourself with cross-AZ traffic like the blog here talks about (though that doesn't feel like the right layer to solve that at for most cases anyway), and any time you're doing events or streaming data you're going to run into some really interesting semantic problems compared to traditional services, but also new capabilities that are rarely even attempted in that world, like replaying failed messages from hours ago. So it's good to know exactly what you're getting into, but I've spent far less time fighting ZK than Kafka, and far less time fighting either than getting the application semantics right.
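
Since "replay from hours ago" is the part people rarely believe until they see it, here's a minimal sketch of rewinding a consumer, assuming the confluent-kafka client; the broker address, topic, and group id are all placeholders:

  # Minimal sketch: replay everything from ~3 hours ago.
  import time
  from confluent_kafka import Consumer, TopicPartition

  consumer = Consumer({
      "bootstrap.servers": "localhost:9092",
      "group.id": "replay-tool",       # throwaway group for the replay
      "enable.auto.commit": False,     # don't move the real group's offsets
  })

  topic = "orders"                     # placeholder topic
  meta = consumer.list_topics(topic, timeout=10)
  three_hours_ago_ms = int((time.time() - 3 * 3600) * 1000)

  # offsets_for_times takes TopicPartitions whose .offset field holds a
  # timestamp in ms, and returns them with .offset set to the first offset
  # at or after that time (or the end if the partition has nothing newer).
  query = [TopicPartition(topic, p, three_hours_ago_ms)
           for p in meta.topics[topic].partitions]
  start = consumer.offsets_for_times(query, timeout=10)
  consumer.assign(start)

  while True:
      msg = consumer.poll(1.0)
      if msg is None:
          break                        # caught up; good enough for a sketch
      if msg.error():
          continue
      print(msg.partition(), msg.offset(), msg.value())  # reprocess here

  consumer.close()

The separate group id plus disabled auto-commit is the load-bearing bit: the replay reads the log without touching the offsets of whatever group is consuming the topic for real.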

I imagine a lot of the pain comes from "I want events, I know nothing about events, I don't know how to select a tool, and now I'm learning both the tool and the semantics of events and queues on the fly while making painful decisions along the way," which I've seen at several places (and helped avoid at some of the later ones after learning some hard, not-well-discussed-online lessons). I think the space just lets you do so many more things that figuring out what's best for YOU is way more difficult the first time you, as a traditional backend online-service developer, start asking questions like "but what if we reprocess the stuff we otherwise would've just black-hole-500'd during that outage after all?" and then have to deal with things like ordering and time in all its glory.
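
And since "ordering and time" sounds abstract until it bites you, a tiny self-contained sketch (field names invented) of the discipline replay forces on handlers: replayed events arrive again and possibly out of order, so applying them has to be idempotent and keyed on event time, not arrival time:

  state = {}     # key -> current value
  last_ts = {}   # key -> event-time of the last update we applied

  def apply_event(event):
      key, ts = event["key"], event["event_ts"]
      if last_ts.get(key, -1) >= ts:
          return                       # duplicate or stale: skip it
      last_ts[key] = ts
      state[key] = event["value"]

  # Replaying the same events, in any order, converges to the same state:
  for e in [{"key": "a", "event_ts": 2, "value": "new"},
            {"key": "a", "event_ts": 1, "value": "old"},    # out of order
            {"key": "a", "event_ts": 2, "value": "new"}]:   # duplicate
      apply_event(e)
  assert state == {"a": "new"}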


