"A license that doesn’t allow reselling the service is good enough
For example, I saw this with Tailwind UI"
FYI, we discussed it in this thread https://github.com/ClickHouse/ClickHouse/issues/44767.
For the moment, the only way to get the new features is to use the cloud version (which is likely to be a no-go for most companies managing their own ClickHouse infrastructure).
As a user of ClickHouse since 2018, I'm fully aligned with the content of this article. This technology is one of the best I've used in my career.
Choosing ClickHouse for a new project at my company has always been a no-brainer, but the recent move by ClickHouse, Inc. toward a closed-source version has made this choice less straightforward.
It's a good read.
I have a few questions/comments:
- Given the description of the rumor-mongering approach vs the anti-entropy approach, it looks like the anti-entropy approach:
-- has significant overhead in terms of network/messages sent (since nodes are always chatting, even when there are no changes in the cluster);
-- is slower to propagate a change of cluster state to all the nodes. Does this mean that in case of a node failure/shutdown, the cluster will be unstable for longer (since dead nodes will receive queries)?
- The article mentions a "seed node" but doesn't define what this is.
- On a dynamic Quickwit setup (one that often downscales/upscales depending on the load), it seems that the size of the metadata on each node will keep increasing, since the state of dead nodes is kept, unless the same node_unique_id can be reused after a downscale/upscale (but I don't know enough about how Kubernetes works to tell whether that's the case).
This is correct, yes.
Rumor-mongering is more efficient both in terms of detecting/propagating a node failure and in terms of network overhead.
We stopped using SWIM because it is too hard to get right, and because Scuttlebutt allows all nodes to expose a bunch of meta-information about themselves.
It took HashiCorp a lot of time and effort to patch memberlist into what it is today.
- In terms of message overhead, there is not much difference. One detail is that each node keeps incrementing its heartbeat counter continuously, making it a state change that needs to be propagated.
- "Slower" here depends on the number of nodes selected to gossip with. If you choose a long gossip period, yes; but in real systems the gossip period is short enough that you won't notice it. So "slower" means slower compared to the rumor-mongering style, not as in having an inconsistent cluster state.
- (Thanks!) I will probably amend this as a note: a seed node can be any node in the cluster that we know is reliable and will almost always be there, kind of like the ACE within your cluster; there can be many.
- Garbage collection: "... We mitigate this by periodically garbage collecting obsolete nodes."
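To make the heartbeat and garbage-collection mechanics above concrete, here is a minimal Python sketch of scuttlebutt-style anti-entropy state. This is not Quickwit's actual implementation; the names (NodeState, GC_GRACE, etc.) and timing constants are invented for illustration:

```python
import time

GOSSIP_INTERVAL = 1.0   # seconds between gossip rounds (illustrative value)
GC_GRACE = 3600.0       # how long a dead node's state is kept before GC

class NodeState:
    """Per-node state replicated through gossip."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.heartbeat = 0       # monotonically increasing counter/version
        self.metadata = {}       # arbitrary key/values the node exposes
        self.last_seen = time.monotonic()

    def bump(self):
        # Called once per GOSSIP_INTERVAL: a live node increments its own
        # heartbeat, which is itself a state change that must be propagated
        # (hence gossip traffic never fully goes quiet, as noted above).
        self.heartbeat += 1

def merge(local, remote):
    """Anti-entropy merge: for each node, keep the newest version seen."""
    for node_id, state in remote.items():
        mine = local.get(node_id)
        if mine is None or state.heartbeat > mine.heartbeat:
            state.last_seen = time.monotonic()
            local[node_id] = state

def garbage_collect(cluster):
    """Drop nodes whose heartbeat stopped advancing long ago, so the
    replicated state doesn't grow forever on a cluster that keeps
    scaling up and down."""
    now = time.monotonic()
    for node_id in list(cluster):
        if now - cluster[node_id].last_seen > GC_GRACE:
            del cluster[node_id]
```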
FYI, we've been using ClickHouse since 2018 at ContentSquare.
I did a few POCs to compare ClickHouse against other databases on ContentSquare's use case. One of them was MemSQL.
Although MemSQL was very good, since we don't need to JOIN big datasets or need killer features like full-text search, ClickHouse gave a better performance/cost ratio for us (I don't remember the exact numbers, but it was at least twice as cheap).
There is also a video recording of the 2019 presentation available here.
https://www.youtube.com/watch?v=lwYSYMwpJOU
nb: The video quality is not great because the camera often loses focus, but it's still understandable.
If you liked this article, maybe you'll like my article on Shazam. I used the same pattern: I start from the basics of sound processing and computer science and finish with an in-depth explanation of Shazam.
You're right. In fact, in the optimizer part I say (in a simple way) that big O (i.e. asymptotic complexity) is not the same as CPU cost, but it's easier for me because the real cost of an operation depends on the CPU architecture.
Someone told me the same on the article comments and here is the answer I gave him:
You’re right and I agree with you.
When I wrote this part, I REALLY hesitated to give the real asymptotic definition and what it means for the number of operations but I chose a simpler explanation since the aim of this post is not to become an expert but to have a good idea. I hope that this won’t mislead people but I thought the real definition was too hard for a “newcomer” and not important to understand a database.
This is also why I added in this part “The time complexity doesn’t give the exact number of operations but a good idea.” and said at the end of the part “I didn’t give you the real definition of the big O notation but just the idea” with a link to the real definition.
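For readers who want it, the formal definition being linked to is the standard one:

$$f(n) = O(g(n)) \iff \exists\, c > 0,\ \exists\, n_0 \text{ such that } \forall n \ge n_0,\ |f(n)| \le c \cdot |g(n)|$$

i.e. the bound only has to hold up to a constant factor and only for large enough inputs, which is exactly why it gives "a good idea" of the number of operations rather than an exact count.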
(I'm the author of the article) I'm 28 and I’m currently a Big Data developer (I use Hadoop, HBase, Hive …) and I don’t understand the buzz surrounding Big Data and NoSQL.
With a relational database the complexity is hidden (more or less…) whereas with Big Data and NoSQL the developer needs to deal with this complexity himself/herself. As a result, most of the Big Data applications I’ve seen don’t work well.
I really like Big Data because it's more complex, but to be honest, most of the time my work doesn't require "Big Data scale".
At Couchbase we did a survey of developers (this was ages ago) and the biggest motivator for NoSQL was schema flexibility. Not having to coordinate migrations is seen as a productivity boost. [1]
The other thing document databases can offer that relational databases struggle with is taking subsets (which we use for offline sync.) [2]
I'm in the opposite camp. I don't like NoSQL because of the flexible schema. I have to build tools to make sure my data is consistent, or add error handling. Migrations ensure that whenever I pull down a version of the code, the database is in the right state.
I don't see too many use cases where being schemaless is a good thing outside of infrastructure ease of use. If you want to store arbitrary data in a table, JSONB in Postgres is very efficient and flexible in that regard, while allowing you to have a schema for the rest of the data (e.g. if you are collecting data you probably want a schema for things like name, email, timestamp, etc., and then have a JSONB field to jam in whatever you need).
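A minimal sketch of that hybrid approach (table and column names are invented for illustration, and it assumes a reachable Postgres plus the psycopg2 driver):

```python
import psycopg2  # assumes a local Postgres and the psycopg2 driver

DDL = """
CREATE TABLE IF NOT EXISTS signups (
    id         BIGSERIAL PRIMARY KEY,
    name       TEXT        NOT NULL,   -- fixed, enforced schema...
    email      TEXT        NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    extra      JSONB                   -- ...plus a free-form bag for the rest
);
"""

with psycopg2.connect("dbname=app") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(
        "INSERT INTO signups (name, email, extra) VALUES (%s, %s, %s)",
        ("Ada", "ada@example.com", '{"utm_source": "newsletter", "plan": "beta"}'),
    )
    # The free-form part stays queryable: filter on a key inside the JSONB.
    cur.execute("SELECT name FROM signups WHERE extra->>'plan' = %s", ("beta",))
    print(cur.fetchall())
```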
Having worked in ERP, e-commerce, financial tech, and general SaaS tech stacks, I agree with you on the "not too many use cases where being schemaless is a good thing". In most cases it's a shortcut and an unnecessary tradeoff made by people to avoid a few technical issues like DB migrations (also, in many ways, a solved problem as long as you don't attempt to reinvent the wheel). The only time I saw a good use for a NoSQL db was to store products in e-commerce. Managing taxonomy and attributes was always a nightmare and everyone was constantly afraid of performance issues (being on Magento and battling the EAV system didn't help). It would have been great to have only the products stored on a NoSQL instance and the rest of the data on the traditional relational data store.
As someone who spent several years studying programming languages, the thing that drives me crazy about traditional relational databases is the assumption that all data is tuple-structured. Much data is structured as unions of alternates or more complex things like maps. Shoehorning your data model into a tuple-based system is always possible, but often unnatural.
The place NoSQL shines is the acknowledgement that most data is complex. Of course it often also does away with ACIDity, which is a huge disaster (EDIT: the doing-away-with is a disaster, I mean).
I don't think this is correct. You can have a relational, columnar-stored key-value map that stores any values you want. Bonus: it's super easy to make these kinds of updates ACID. Of course, if you're maxing out the storage space, you're gonna have a rough time with indexes unless you take the EXACT SAME approach as you would with NoSQL.
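For what it's worth, here is a toy sketch of such a relational key-value map with an ACID upsert (SQLite for brevity; table and key names invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A key-value "map" is itself a perfectly ordinary relation.
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v BLOB)")

# Upserting into it is trivially ACID: the block below either fully
# commits or fully rolls back.
with conn:
    conn.execute(
        "INSERT INTO kv VALUES (?, ?) "
        "ON CONFLICT(k) DO UPDATE SET v = excluded.v",
        ("user:42", b'{"name": "ada"}'),
    )

print(conn.execute("SELECT v FROM kv WHERE k = ?", ("user:42",)).fetchone())
```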
I don't think there are any "inherent" problems to relational or NoSQL databases, but there are many tradeoffs. The tradeoff of NoSQL databases is that complexity gets very, very difficult to pull off in a distributed fashion. So throw 99% of the indices out the window, dumb your queries down, and cache any joins or scans as much as possible. The upside, I guess, is that the "schema" is pretty irrelevant if it's not your primary key (or secondary, in some databases). But, you lose joins, schemas, subqueries, orderings, many types of transactions, etc, etc, and a lot of "free" stuff that is really only "free" for small numbers of rows per table or strong assumptions about the data.
The relational model is extremely general; it's been argued (fairly successfully) by Date and Codd to be THE most general model (graph being a close second). It's a rigorous approach to managing data with integrity.
I used to be a programming language oriented person, was big into data structures and objects, but then I read Date and my mind was blown at how beautiful and expressive the relational model is -- for its intended purpose (managing data for logical integrity and ad hoc queriability).
The main issues are:
1. Many implementations don't include some features, such as unions.
2. Certain things (e.g. tree traversal) have also been hard to express in older versions of SQL or older versions of Tutorial D (Chris Date's language, which is closer to the model).
3. Sometimes you don't care about long-term data management (i.e. ad hoc queriability and integrity); you just want programmatic data persistence with pre-baked access paths that are FAST.
4. Relational integrity features are often crude implementations that slow things down too much or require custom triggers.
5. Queriability in practice requires decent knowledge of the physical layout and indexing if you're going to make it performant.
6. Most relational databases were not built in a cloud-native era where we assume distribution across ephemeral disks and compute.
So... great mathematical model, great way to think through and organize your data for no ambiguity, but the practical implementations leave a lot to be desired.
The problem is that "my data is too complex for the relational model" often means "I haven't thought through my data". Things like maps, unions, ordered sets, N-ary relationships, graphs and trees, are actually quite straight forward to represent in relations. The challenge is many of the lessons and arguments for this are trapped in books from the 70s-90s, not on the Web.
Agree, 100%. Early in my career I too was very enamored with object and graph databases (this was pre the 'nosql' buzz), but once I started reading Date and Codd (and the inflammatory Fabian Pascal), some lights started turning on in my brain.
Firstly, it should be made clear to people that SQL is not truly relational, and a lot of the things people dislike about it are nothing to do with the relational model and more to do with its late 70s, early 80s heritage. It was thrown together at a time when business systems were still very focused around COBOL.
The second thing that people are not picking up on is that the industry _already_ went through a pre-SQL "nosql" phase in the 60s and 70s, when network and hierarchical databases were the norm, and the relational model was developed to deal with the perceived faults of those systems: an enforced topology which could not easily be rearranged at query time, lack of rigor in modeling, lack of standardized modeling concepts and notation...
Finally, I did find in previous jobs certain uses for nosql systems -- very low-latency, high-throughput, quasi-realtime systems that deal with very small bits of simply structured data and need to distribute it widely across a cluster. For that I used Cassandra (though I understand now that there are successors that are better).
What I don't get is the point of systems like Redis or MongoDB which don't offer a compelling distribution implementation and simply replace the fairly well understood quasi-relational model of SQL with their own ultimately inferior graph/network/hierarchical models.
As a fan of Cassandra (though not for its relational model ;), what successors do you believe are better? Riak is the only one that comes to mind.
Btw, to me the point of Redis and Mongo is a very fast distributed dictionary. They're data structure servers, for when you want to persist and share data structures across processes, not "manage information". It depends on your goal.
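A tiny sketch of that "data structure server" usage, assuming a local Redis and the redis-py client (key names invented):

```python
import redis  # redis-py client, assumes a Redis server on localhost

r = redis.Redis()

# Persist and share a dict-like structure across processes: no schema,
# no joins, just a named hash that any process can read or update.
r.hset("session:42", mapping={"user": "ada", "cart_items": 3})
r.expire("session:42", 1800)  # sessions expire on their own after 30 min

print(r.hgetall("session:42"))  # {b'user': b'ada', b'cart_items': b'3'}
```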
Oh, this is true, and I think I came off too harshly. The relational model is general enough to support (almost?) any data model, it certainly has a lot of advantages in terms of efficient implementation, and the math is elegant.
I just don't think that it naturally reflects the data structures people use, and we should be willing to make the computer do the work, rather than humans.
The buzz around NoSQL is you don't have to worry about scaling the database. There are many, many more options now for e.g. multi-master, sharding, no-downtime copy-on-write migrations, etc., but just the idea of being able to run a tiny subset of queries or writes without having to worry about running out of resource capacity is a HUGE plus.
But having data corruption baked into the system design is a huge minus. Even the big shots at Google and Amazon are constantly firefighting data corruption in their NoSQL systems.
I'm not quite sure what you're referring to. But at certain scales of data (petabytes, probably?) data corruption is inevitable. I do not know if they use ECC.
"With a relational database the complexity is hidden"
That is my main issue. I use Cassandra over relational firstly for its linear scalability and multi-master-esque HA. But even ignoring those, I understand exactly what is being scanned and what is not, I don't have to fight with an optimizer at runtime based on several parameters.
I understand your point (and it's a good one) and here is mine: unless you're working in a team with a lot of good IT guys, you're likely to end up with worse performance and more problems.
For example, when I started in Big Data, within 3 weeks I was able to optimize some batches just because I had read the documentation of the framework we used (Pig in this case) and read a small part of the source code to dig deeper. And these were not touchy optimizations: I used in-memory joins and reduced the number of relations in the scripts to reduce the number of Hadoop jobs generated (which made the batches 4 times faster).
There are often problems with our HBase database because it's often overloaded (I'm not an IT operator so I can't give more details) and no one really masters this database, even though it has been in production since 2014.
I do understand that in some cases a NoSQL database is mandatory and like you I like to understand what I’m doing. But:
- I’m not working in Silicon Valley
- Most of my co-workers are not geeks (and I respect that)
- It's VERY hard to find guys with real Big Data or NoSQL skills (this comes from a French technical recruiter)
So, if the geek part of me loves Big Data and NoSQL, the rational part prefers using well-known technologies. If NoSQL and Big Data become mainstream and better known, then the rational part will love them too.
While I agree with a couple of your premises, I don't think they all apply to Cassandra as much, and it's too broad a brush to use.
I don't believe NoSQL means no validation. In fact, I've found things like Cassandra CQL actively prevent me from running expensive queries unless I opt in (e.g. ALLOW FILTERING). Validation is DB specific, but I don't think it's fair to say it's a footgun in NoSQL any more than in SQL databases.
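For example (a sketch, not production code: keyspace/table names are invented, and it assumes a local Cassandra plus the DataStax Python driver):

```python
from cassandra.cluster import Cluster  # DataStax Python driver

session = Cluster(["127.0.0.1"]).connect("shop")

# Querying by partition key: cheap and always allowed.
session.execute("SELECT * FROM orders WHERE customer_id = %s", ("c42",))

# Filtering on a non-key, non-indexed column is rejected with an
# InvalidRequest error because it implies scanning the whole table...
# session.execute("SELECT * FROM orders WHERE status = 'shipped'")

# ...unless you explicitly opt in to the expensive query:
session.execute("SELECT * FROM orders WHERE status = 'shipped' ALLOW FILTERING")
```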
As for choosing what is known by the employee market, I personally don't choose technologies that way (but I do choose based on maturity of course). I rarely look for skills as much as the ability to learn new ones, but I understand it can be a pipe dream when in the market for juniors.
Those optimizers you are fighting with have thousands of man hours of research behind them. For every silly choice they make, they make hundreds or thousands of correct ones.
Teradata, Oracle, PostgreSQL for example are reasonably complex databases to cluster and manage yourself. Just as easy/hard as setting up HDFS and installing Hive. In all cases people who are at big data scale are buying OTS solutions e.g. Cloudera appliance. They aren't rolling their own.
And if you are using Hive then I can understand why you are not feeling the buzz. But play around with Spark for a while and it's easy to see the future. Being able to write Scala/Python/SQL/R against a data set that can be anywhere from 100MB to 100PB without any changes is pretty compelling.
While it can handle 100 MB easily, there are probably faster ways to handle that small amount of data. But yes, Spark can handle many PB and doesn't require a ton of changes in the code as you scale up from, say, 10 TB to 100 PB. The underlying cluster would change, and the performance profile would change a lot (10 TB can be done in-memory ... many PB, not so much).
Which really isn't intended for 100 MB (I bet I could write a unix pipe & filter script that's faster than Spark), but is intended for 10 TB through several PB.
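To illustrate that scale-independence point (paths and column names invented), the same PySpark code runs unchanged whether the input is 100 MB on a laptop or petabytes on a cluster; only the deployment underneath differs:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("demo").getOrCreate()

# Identical code for 100 MB or 100 PB; only the cluster config changes.
events = spark.read.parquet("s3://bucket/events/")   # hypothetical path

daily = (events
         .filter(F.col("status") == "ok")            # hypothetical column
         .groupBy("day")
         .agg(F.count("*").alias("n_events")))

daily.write.parquet("s3://bucket/daily_counts/")
```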
I know. And I'm sorry, bitterly sorry, but I know that... no apologies I can make can alter the fact that in our thread you have been given a dirty, filthy, smelly piece of technical argument.