
1. The number of samples must not exceed UINT32_MAX, i.e. roughly 4*10^9, and the same limit applies to the number of clusters. The number of dimensions must not exceed 12288 (a GPU shared memory constraint). We successfully tested against potential overflows with 4M samples and 480 dimensions (whose product is greater than UINT32_MAX). Practically speaking, it will not take ages as long as the product of samples, clusters and dimensions does not exceed 10^14.
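The limits above can be collected into a quick feasibility check. This is just a sketch of the stated bounds; `fits_kmcuda` is a hypothetical helper, and the 10^14 figure is the practical runtime heuristic from the comment, not a hard limit.

```python
UINT32_MAX = 2**32 - 1
MAX_DIMS = 12288  # GPU shared-memory constraint

def fits_kmcuda(samples, clusters, dims):
    """Rough check of the stated kmcuda problem-size limits."""
    if samples > UINT32_MAX or clusters > UINT32_MAX:
        return False
    if dims > MAX_DIMS:
        return False
    # Heuristic from the comment: runtime stays reasonable below ~10^14.
    return samples * clusters * dims <= 10**14

# The tested configuration: 4M samples, 10k clusters, 480 dimensions.
print(fits_kmcuda(4_000_000, 10_000, 480))  # True
```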

2. No, unfortunately I don't. Implementing a tree on a GPU is a real pain and the performance would still be bad, so I didn't even consider it.

The common problem with advanced approaches is the memory overhead. A high-end GPU has only 12 GB, and everything has to fit. E.g. Yinyang becomes inapplicable at 500,000 samples, 10,000 clusters and 480 dimensions (though it is mostly the product of samples and clusters that matters).
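A back-of-the-envelope estimate shows why the samples-times-clusters product dominates. This is a hypothetical memory model, not kmcuda's actual layout: it assumes float32 throughout and a per-sample, per-cluster auxiliary array, which is the term that blows up.

```python
def rough_gpu_gb(samples, clusters, dims):
    """Hypothetical GPU memory model: data + per-sample/per-cluster aux."""
    f32 = 4  # bytes per float32
    data = samples * dims * f32          # the samples themselves
    aux = samples * clusters * f32       # per-sample, per-cluster bounds
    centroids = clusters * dims * f32
    return (data + aux + centroids) / 1e9

# The configuration from the comment: well past a 12 GB card,
# and the aux term (samples * clusters) is what does it.
print(rough_gpu_gb(500_000, 10_000, 480) > 12)  # True
```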



Thanks for explaining! So I conclude that for the data sizes you mention, Yinyang on a GPU is possibly the best approach, beyond which Pelleg-Moore on CPUs is (still) the go-to solution. Or can you see a way to distribute this among graphics cards?


I am working on the multi-GPU branch at the moment, but the memory constraints will remain the same (I am optimizing for speed at this time). I do see a way to distribute the memory across cards, though, and that will be the next step. So yes, if your problem size fits Yinyang, then kmcuda looks like an optimal choice. If it is bigger, the best way is to run, say, the first 5 iterations with kmcuda in Lloyd mode and then pass the half-baked centroids to some Pelleg-Moore implementation on CPU.
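The hybrid scheme can be sketched as follows. This is an illustrative CPU stand-in, not kmcuda's API: `lloyd` plays the role of kmcuda's GPU Lloyd mode, and the resulting "half-baked" centroids would seed any CPU refinement (Pelleg-Moore or otherwise) as its initialization.

```python
import numpy as np

def lloyd(samples, centroids, iterations=5):
    """A few plain Lloyd iterations (stand-in for kmcuda's Lloyd mode)."""
    for _ in range(iterations):
        # Assign each sample to its nearest centroid.
        d = np.linalg.norm(samples[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster emptied.
        for k in range(len(centroids)):
            members = samples[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)
init = X[rng.choice(len(X), 16, replace=False)].copy()
warm = lloyd(X, init, iterations=5)
# `warm` would then seed the CPU-side implementation, e.g. a
# Pelleg-Moore k-means that accepts initial centroids.
print(warm.shape)  # (16, 8)
```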



