Do you know about IBM's Spark+GPU Hackathon?[1] There are some pretty substantial prizes (I thought they were offering cash, but I'm not seeing it now).
Agree on HDBSCAN/DBSCAN, which are able to find the number of clusters in a large class of problems (unlike K-means, which requires that the number of clusters/centroids be provided as a hyperparameter, or found via some kind of search).
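For anyone who hasn't used them, here is roughly the difference in scikit-learn (toy data and parameter values are purely illustrative): DBSCAN infers the number of clusters from density, while K-means has to be told up front.

    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_blobs

    # toy data: three dense blobs
    X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=0.5, random_state=0)

    # DBSCAN: no cluster count supplied; eps/min_samples describe density instead
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    found = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)  # label -1 means noise
    print("DBSCAN found", found, "clusters")

    # K-means: the number of clusters/centroids is a mandatory hyperparameter
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("KMeans was told to use", km.n_clusters, "clusters")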
Otherwise, I just want to say to vmarkovtsev: thank you for this -- I will add it to my arsenal of tools, and many others surely will as well.
Thanks. Actually, I like DBSCAN a lot and use it often, though I am not very familiar with its internals. It looks like it is iterative and thus does not map very well onto a GPU. The only way I see is to pick several seed points at the start...
This paper claims a "97x improvement" over traditional (non-parallelized) DBSCAN algorithms, but that's not a very helpful claim, because it does not indicate what the computational costs are as a function of, say, the number of data points or dimensions.
Hi there, very nice work and thanks for sharing/open sourcing!
My questions to you would be:
1. The main problem with k-means is its scalability [1]. Could you comment on what values of k and n you'd expect your implementation to handle?
2. Trees (kd-trees, etc.) with blacklisting (Pelleg-Moore) are a more efficient algorithmic approach to k-means clustering than k-means++ and the like (which you are showing in the benchmarks on your blog). Do you know how Yinyang fares against a correct Pelleg-Moore implementation of k-means (the Yinyang algorithm's authors unfortunately did not compare against one... :-( )?
1. The number of samples must not exceed UINT32_MAX, that is, about 4.3*10^9. The number of clusters is likewise limited to UINT32_MAX. The number of dimensions must not exceed 12288 (a GPU shared memory constraint). We successfully tested 4M samples with 480 dimensions against potential overflows (large enough that any 32-bit overflow would have shown up). Practically speaking, it will not take ages as long as the product of samples, clusters, and dimensions stays below 10^14.
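To make those limits concrete, here is a tiny sanity-check sketch; the limits are the ones stated above, and the helper itself is only an illustration, not part of kmcuda's API:

    UINT32_MAX = 2**32 - 1      # ~4.3 * 10^9
    MAX_DIMS = 12288            # GPU shared memory constraint
    PRACTICAL = 10**14          # above this, expect the run to take ages

    def fits_kmcuda(samples, clusters, dims):
        assert samples <= UINT32_MAX, "too many samples"
        assert clusters <= UINT32_MAX, "too many clusters"
        assert dims <= MAX_DIMS, "too many dimensions"
        return samples * clusters * dims <= PRACTICAL

    print(fits_kmcuda(4_000_000, 10_000, 480))  # True: 1.92 * 10^13 < 10^14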
2. No, unfortunately I don't. Implementing trees on a GPU is a real pain and the performance would still be bad, so I didn't even consider them.
The common problem with advanced approaches is the memory overhead. A high-end GPU has only 12 GB, and everything has to fit. For example, Yinyang becomes inapplicable at 500,000 samples, 10,000 clusters, and 480 dimensions (though it is mostly the product of samples and clusters that matters).
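A rough back-of-the-envelope with 32-bit floats (the exact constants depend on the data structures, so take this as illustrative): anything stored per (sample, cluster) pair already blows past a 12 GB card at that size.

    samples, clusters, dims = 500_000, 10_000, 480
    GB = 1024 ** 3
    # the samples themselves: ~0.9 GB
    print("data: %.1f GB" % (samples * dims * 4 / GB))
    # anything kept per (sample, cluster) pair: ~18.6 GB, already over 12 GB
    print("per-pair: %.1f GB" % (samples * clusters * 4 / GB))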
Thanks for explaining! So I conclude that for the data sizes you mention, Yinyang on a GPU is possibly the best approach, beyond which Pelleg-Moore on CPUs is (still) the go-to solution. Or can you see a way to distribute this across graphics cards?
I am working on the multi-GPU branch at the moment, but the memory constraints will remain the same (I am optimizing for speed at this stage). I do see a way to distribute the memory across cards though, and that will be the next step.
So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to run, say, the first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on the CPU.
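Roughly, the recipe looks like this (parameter values are illustrative; a loose tolerance stands in for a hard 5-iteration cap, and scikit-learn's KMeans, which is Lloyd/Elkan rather than Pelleg-Moore, is just a stand-in for the CPU stage):

    import numpy as np
    from libKMCUDA import kmeans_cuda
    from sklearn.cluster import KMeans

    X = np.random.rand(1_000_000, 64).astype(np.float32)
    k = 1000

    # stage 1: a few rough Lloyd iterations on the GPU (yinyang_t=0 disables Yinyang);
    # the loose tolerance makes it stop early with half-baked centroids
    rough_centroids, _ = kmeans_cuda(X, k, yinyang_t=0.0, tolerance=0.05)

    # stage 2: refine on the CPU, seeded with those centroids
    # (swap in a tree-based Pelleg-Moore implementation here if you have one)
    refined = KMeans(n_clusters=k, init=rough_centroids, n_init=1, max_iter=100).fit(X)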