
1. The number of samples must not exceed UINT32_MAX, i.e. roughly 4*10^9, and the same limit applies to the number of clusters. The number of dimensions must not exceed 12288 (a GPU shared memory constraint). We successfully tested against potential overflows with 4M samples and 480 dimensions (whose product is greater than UINT32_MAX). Practically speaking, it will not take ages as long as the product of samples, clusters and dimensions does not exceed 10^14.
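The limits above can be collected into a quick feasibility check. This is just a sketch of the stated bounds; `fits_kmcuda` is a hypothetical helper, and the 10^14 figure is the practical runtime heuristic from the comment, not a hard limit.

```python
UINT32_MAX = 2**32 - 1
MAX_DIMS = 12288  # GPU shared-memory constraint

def fits_kmcuda(samples, clusters, dims):
    """Rough check of the stated kmcuda problem-size limits."""
    if samples > UINT32_MAX or clusters > UINT32_MAX:
        return False
    if dims > MAX_DIMS:
        return False
    # Heuristic from the comment: runtime stays reasonable below ~10^14.
    return samples * clusters * dims <= 10**14

# The tested configuration: 4M samples, 10k clusters, 480 dimensions.
print(fits_kmcuda(4_000_000, 10_000, 480))  # True
```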

2. No, unfortunately I don't. Implementing a tree on a GPU is a real pain and the performance would still be bad, so I didn't even consider it.

The common problem with advanced approaches is the memory overhead. A high-end GPU has only 12 GB, and everything has to fit. E.g. Yinyang becomes inapplicable at 500,000 samples, 10,000 clusters and 480 dimensions (though it is mostly the product of samples and clusters that matters).
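A back-of-the-envelope estimate shows why the samples-times-clusters product dominates. This is a hypothetical memory model, not kmcuda's actual layout: it assumes float32 throughout and a per-sample, per-cluster auxiliary array, which is the term that blows up.

```python
def rough_gpu_gb(samples, clusters, dims):
    """Hypothetical GPU memory model: data + per-sample/per-cluster aux."""
    f32 = 4  # bytes per float32
    data = samples * dims * f32          # the samples themselves
    aux = samples * clusters * f32       # per-sample, per-cluster bounds
    centroids = clusters * dims * f32
    return (data + aux + centroids) / 1e9

# The configuration from the comment: well past a 12 GB card,
# and the aux term (samples * clusters) is what does it.
print(rough_gpu_gb(500_000, 10_000, 480) > 12)  # True
```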



Thanks for explaining! So I conclude that for the data sizes you mention, Yinyang on a GPU is possibly the best approach, beyond which Pelleg-Moore on CPUs is (still) the go-to solution. Or can you see a way to distribute this among graphics cards?


I am working on the multi-GPU branch at the moment, but the memory constraints will remain the same (I am optimizing for speed at this time). I do see a way to distribute the memory across cards, though, and that will be the next step. So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to run, say, the first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on CPU.
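The hybrid scheme can be sketched as follows. This is an illustrative CPU stand-in, not kmcuda's API: `lloyd` plays the role of kmcuda's GPU Lloyd mode, and the resulting "half-baked" centroids would seed any CPU refinement (Pelleg-Moore or otherwise) as its initialization.

```python
import numpy as np

def lloyd(samples, centroids, iterations=5):
    """A few plain Lloyd iterations (stand-in for kmcuda's Lloyd mode)."""
    for _ in range(iterations):
        # Assign each sample to its nearest centroid.
        d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster emptied.
        for k in range(len(centroids)):
            members = samples[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)
init = X[rng.choice(len(X), 16, replace=False)].copy()
warm = lloyd(X, init, iterations=5)
# `warm` would then seed the CPU-side implementation, e.g. a
# Pelleg-Moore k-means that accepts initial centroids.
print(warm.shape)  # (16, 8)
```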



