Do you know about IBM's Spark+GPU Hackathon?[1] There are some pretty substantial prizes (I thought they were offering cash, but I'm not seeing it now).
Agree on HDBSCAN/DBSCAN, which are able to find the number of clusters in a large class of problems (unlike K-means, which requires that the number of clusters/centroids be provided as a hyperparameter, or found via some kind of search).
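For anyone who hasn't used them, here is roughly the difference in scikit-learn (toy data and parameter values are purely illustrative): DBSCAN infers the number of clusters from density, while K-means has to be told up front.

    from sklearn.cluster import DBSCAN, KMeans
    from sklearn.datasets import make_blobs

    # toy data: three dense blobs
    X, _ = make_blobs(n_samples=1000, centers=3, cluster_std=0.5, random_state=0)

    # DBSCAN: no cluster count supplied; eps/min_samples describe density instead
    db = DBSCAN(eps=0.3, min_samples=10).fit(X)
    found = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)  # label -1 means noise
    print("DBSCAN found", found, "clusters")

    # K-means: the number of clusters/centroids is a mandatory hyperparameter
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    print("KMeans was told to use", km.n_clusters, "clusters")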
Otherwise, I just want to say to vmarkovtsev: thank you for this -- I will add it to my arsenal of tools, and many others surely will as well.
Thanks. Actually, I like DBSCAN a lot and use it often, though I am not very familiar with its internals. It looks like it is iterative and thus does not map very well onto a GPU. The only way I see is to pick several seed points at the start...
This paper claims a "97x improvement" over traditional (non-parallelized) DBSCAN algorithms, but that's not a very helpful claim, because it does not indicate what the computational costs are as a function of, say, the number of data points or dimensions.
Hi there, very nice work and thanks for sharing/open sourcing!
My questions to you would be:
1. The main problem with k-means is its scalability [1]. Could you comment on what values of k and n you'd expect your implementation to handle?
2. Trees (kd-trees, etc.) with blacklisting (Pelleg-Moore) are a more efficient algorithmic approach to k-means clustering than k-means++ and the like (which you are showing in the benchmarks on your blog). Do you know how Yinyang fares against a correct Pelleg-Moore implementation of k-means (the Yinyang algorithm's authors unfortunately did not compare against one... :-( )?
1. The number of samples must not exceed UINT32_MAX, that is, about 4.3*10^9. The number of clusters is likewise limited to UINT32_MAX. The number of dimensions must not exceed 12288 (a GPU shared memory constraint). We successfully tested 4M samples with 480 dimensions against potential overflows (large enough that any 32-bit overflow would have shown up). Practically speaking, it will not take ages as long as the product of samples, clusters, and dimensions stays below 10^14.
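To make those limits concrete, here is a tiny sanity-check sketch; the limits are the ones stated above, and the helper itself is only an illustration, not part of kmcuda's API:

    UINT32_MAX = 2**32 - 1      # ~4.3 * 10^9
    MAX_DIMS = 12288            # GPU shared memory constraint
    PRACTICAL = 10**14          # above this, expect the run to take ages

    def fits_kmcuda(samples, clusters, dims):
        assert samples <= UINT32_MAX, "too many samples"
        assert clusters <= UINT32_MAX, "too many clusters"
        assert dims <= MAX_DIMS, "too many dimensions"
        return samples * clusters * dims <= PRACTICAL

    print(fits_kmcuda(4_000_000, 10_000, 480))  # True: 1.92 * 10^13 < 10^14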
2. No, unfortunately I don't. Implementing trees on a GPU is a real pain and the performance would still be bad, so I didn't even consider them.
The common problem with advanced approaches is the memory overhead. A high-end GPU has only 12 GB, and everything has to fit. For example, Yinyang becomes inapplicable at 500,000 samples, 10,000 clusters, and 480 dimensions (though it is mostly the product of samples and clusters that matters).
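A rough back-of-the-envelope with 32-bit floats (the exact constants depend on the data structures, so take this as illustrative): anything stored per (sample, cluster) pair already blows past a 12 GB card at that size.

    samples, clusters, dims = 500_000, 10_000, 480
    GB = 1024 ** 3
    # the samples themselves: ~0.9 GB
    print("data: %.1f GB" % (samples * dims * 4 / GB))
    # anything kept per (sample, cluster) pair: ~18.6 GB, already over 12 GB
    print("per-pair: %.1f GB" % (samples * clusters * 4 / GB))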
Thanks for explaining! So I conclude that for the data sizes you mention, Yinyang on a GPU is possibly the best approach, beyond which Pelleg-Moore on CPUs is (still) the go-to solution. Or can you see a way to distribute this across graphics cards?
I am working on the multi-GPU branch at the moment, but the memory constraints will remain the same (I am optimizing for speed at this stage). I do see a way to distribute the memory across cards though, and that will be the next step.
So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to run, say, the first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on the CPU.
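Roughly, the recipe looks like this (parameter values are illustrative; a loose tolerance stands in for a hard 5-iteration cap, and scikit-learn's KMeans, which is Lloyd/Elkan rather than Pelleg-Moore, is just a stand-in for the CPU stage):

    import numpy as np
    from libKMCUDA import kmeans_cuda
    from sklearn.cluster import KMeans

    X = np.random.rand(1_000_000, 64).astype(np.float32)
    k = 1000

    # stage 1: a few rough Lloyd iterations on the GPU (yinyang_t=0 disables Yinyang);
    # the loose tolerance makes it stop early with half-baked centroids
    rough_centroids, _ = kmeans_cuda(X, k, yinyang_t=0.0, tolerance=0.05)

    # stage 2: refine on the CPU, seeded with those centroids
    # (swap in a tree-based Pelleg-Moore implementation here if you have one)
    refined = KMeans(n_clusters=k, init=rough_centroids, n_init=1, max_iter=100).fit(X)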