To me a technology company is not just a company that uses tech (every company does that) but one whose core value proposition is fundamentally technical. And I think most serious companies doing that have a need for highly available data storage, for which Kafka is the least bad option.
What are the alternatives? Cassandra is just as operationally complex and harder to fit your dataflow into. The various efforts to build proper master-master HA on MySQL or PostgreSQL or similar tend to be flaky, expensive, and riddled with vendor lock-in. BigTable can work if you're all-in on Google Cloud, but that's quite a risk.
As far as I can tell there are mostly companies that use Kafka and companies that have a SPOF PostgreSQL/MySQL database (with some read replicas, and maybe some untested Perl scripts that are supposed to be able to promote a replica to master) and stick their fingers in their ears.
> As far as I can tell there are mostly companies that use Kafka and companies that have a SPOF PostgreSQL/MySQL database
I haven't seen that at all, across the many companies I've worked at, consulted with, and talked with others about.
Kafka is usually an ancillary system added to companies with a strong culture around one or more pre-existing datastores (from PG/MySQL to Dynamo/Cassandra to Mongo/Elastic). When Kafka's actually needed, it handles things those pre-existing stores can't do efficiently at high volumes.
Are you really seeing companies use Kafka for their main persistence layer? As in, like, KQL or the equivalent for all/most business operations?
Even the CQRS/ES zealots are still consuming from Kafka topics into (usually relational) databases for reads.
> Are you really seeing companies use Kafka for their main persistence layer?
I'm seeing Kafka Streams-style event processing as the primary data layer used by most business operations, though only in the last couple of years.
> As in, like, KQL or the equivalent for all/most business operations?
> Even the CQRS/ES zealots are still consuming from Kafka topics into (usually relational) databases for reads.
Yeah, I'm not seeing KQL, and I'm still seeing relational databases used for a lot of secondary views and indices. But the SQL database is populated from Kafka, not vice versa, and can be wiped and regenerated if needed; at least in theory it isn't used for live processing. So an SQL outage would take down the management UI and mean customers couldn't change their settings - a big deal that would need fixing quickly, but not an outage in the primary system.
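To make the "wipe and regenerate" point concrete, here's a toy sketch in plain Python (no real Kafka client; event names and fields are made up for illustration). The event log is the source of truth, and the relational read model is just a disposable projection rebuilt by replaying the log:

```python
import sqlite3

# Stand-in for a Kafka topic: an ordered, replayable log of events.
event_log = [
    {"offset": 0, "type": "customer_created", "id": "c1", "plan": "free"},
    {"offset": 1, "type": "plan_changed", "id": "c1", "plan": "pro"},
    {"offset": 2, "type": "customer_created", "id": "c2", "plan": "free"},
]

def rebuild_read_model(log):
    """Replay the full log into a fresh relational view."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE customers (id TEXT PRIMARY KEY, plan TEXT)")
    for event in log:
        if event["type"] == "customer_created":
            db.execute("INSERT INTO customers VALUES (?, ?)",
                       (event["id"], event["plan"]))
        elif event["type"] == "plan_changed":
            db.execute("UPDATE customers SET plan = ? WHERE id = ?",
                       (event["plan"], event["id"]))
    return db

# "Wiping" the view is just throwing it away and replaying the log:
db = rebuild_read_model(event_log)
rows = sorted(db.execute("SELECT id, plan FROM customers"))
print(rows)  # → [('c1', 'pro'), ('c2', 'free')]
```

An SQL outage in this setup loses only the projection, never the history - which is the availability argument being made above.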
I think if you dismiss HA setups of SQL DBs with "you won't get around to operating it properly", the same ops culture will also end up with many fewer 9s of availability than it aspires to with Kafka.
(But also, of course, lots of applications are fine with the availability you get from fate-sharing with a single DB server.)
> I think if you dismiss HA setups of SQL DBs with "you won't get around to operating it properly", the same ops culture will also end up with many fewer 9s of availability than it aspires to with Kafka.
Up to a point. IME Kafka is a lot easier to operate in true HA form than SQL DBs, and is a lot more commonly operated that way; Kafka has a reputation for being harder to operate than a typical datastore, but that usually comes from comparing an HA Kafka setup with a single-node SQL DB. And I don't know why, but many otherwise high-quality ops teams seem to have a bizarre blind spot around SQL DBs, where they'll tolerate a much lower level of resilience/availability than they would for any other part of the stack.
We standardized on Clickhouse for everything. (With its own set of surprising and/or horrifying ops issues.)
But at least it is a proper high-load, high-availability solution, unlike Kafka, Cassandra, et al.
> We standardized on Clickhouse for everything. (With its own set of surprising and/or horrifying ops issues.)
With ClickHouse, admittedly, I haven't personally seen quite as much operational unpleasantness as with Greenplum or Galera, but at this point I'm dubious of anything in that bucket.
> But at least it is a proper high-load, high-availablity solution, unlike Kafka, Cassandra, et al.
What went wrong with those for you? In my experience the setup stage is cumbersome, but once you've got them running they work well and do what you expect; most complaints you see come down to the fact that they're not relational/not SQL/not ACID (true, but IME more of an advantage than a disadvantage).
Not the parent, but I have some ClickHouse experience. ClickHouse is surprisingly easy to deploy and set up, speaks both the MySQL and PostgreSQL wire protocols (so you can query it with your existing relational tools), its query language is SQL (including joins with external data sources such as S3 files, external relational databases, and other ClickHouse tables), and it is ACID for specific operations. It assumes your dataset is (mostly) append-only, and inserts work well when done in batches. It is also blazingly fast, and very compact when using the MergeTree family of storage engines.
Development is very active, and some features are experimental. One common mistake is running the latest releases in production environments - you will certainly hit odd bugs in specific usage scenarios. Stay away from the bleeding edge and you're fine. Clustering (table replication and sharding of queries) is also a can of worms of its own, and requires good knowledge of your workload and your data structures to understand all the tradeoffs. Thing is, when designing from scratch, you can often design in such a way that you don't need (clustered) table replication or sharding - again, this has a learning curve, for both devs and devops.
You can easily spin it up on a VM or on your laptop, load a dataset, and see for yourself how powerful ClickHouse can be. Honestly, the data compression alone is enough to save a s**load of money on storage in an enterprise, compared to most solutions. Couple this with tiered storage - your hot data lives on, e.g., SSD while your historical data is stored on S3, with rotation done automatically - plus automated ingestion from Kafka, and you have a data warehousing system at a fraction of the price of many common alternatives.
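The tiered-storage setup described above can be sketched as a single DDL statement. This is only an illustration: the table schema is made up, and the storage policy name (`tiered`) and its hot/cold volumes are assumed to already be defined in the server's storage configuration (they are not created by this DDL):

```sql
-- Hypothetical table; 'tiered' is an assumed storage_policy whose
-- 'cold' volume is backed by an S3 disk in the server config.
CREATE TABLE events
(
    ts    DateTime,
    user  String,
    value Float64
)
ENGINE = MergeTree
ORDER BY (user, ts)
PARTITION BY toYYYYMM(ts)
-- after 30 days, parts move from the hot (SSD) volume to the S3-backed one:
TTL ts + INTERVAL 30 DAY TO VOLUME 'cold'
SETTINGS storage_policy = 'tiered';
```

The `TTL ... TO VOLUME` clause is what makes the hot-to-historical rotation automatic; no external job has to shuffle the data.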