Hacker News

For large calculations, the magic of an FPGA is in its throughput. Imagine that you have some mess of additions, multiplications, ... laid out as a pipeline. The time for the first calculation hardly matters. Even if the FPGA is slower getting the first result out — say the calculation takes 100 clock cycles on a CPU vs 10 pipeline stages on the FPGA, with the FPGA running at a much lower clock rate — what happens on the next clock cycle? The FPGA has cranked an entire second calculation through the pipeline while the CPU is only a few steps into its second. Next clock cycle? Another whole result drops out of the pipeline.
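A back-of-the-envelope sketch of that trade-off. All numbers here are made-up assumptions, not from any real device: a 3 GHz CPU needing 100 cycles per calculation vs a 200 MHz FPGA with a 10-stage pipeline that accepts one new input per clock.

```python
# Toy latency-vs-throughput model of the pipelining argument above.
# Every constant is an illustrative assumption, not a real device spec.
CPU_HZ, CPU_CYCLES_PER_CALC = 3_000_000_000, 100
FPGA_HZ, FPGA_PIPELINE_DEPTH = 200_000_000, 10

def cpu_seconds(n):
    # Fully serial: every calculation pays the full cycle count.
    return n * CPU_CYCLES_PER_CALC / CPU_HZ

def fpga_seconds(n):
    # Pay the pipeline latency once, then one result per clock.
    return (FPGA_PIPELINE_DEPTH + (n - 1)) / FPGA_HZ

# First result: the FPGA is actually slower (50 ns vs ~33 ns) ...
print(cpu_seconds(1), fpga_seconds(1))
# ... but over a million results the pipeline wins by ~6.7x.
print(cpu_seconds(10**6) / fpga_seconds(10**6))
```

The crossover comes almost immediately: the pipeline's latency is paid once, while the CPU pays its full cycle count on every single calculation.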


My favourite personal example of this is using a GPU to perform an all-pairs nearest neighbour lookup in an image (for all pixels, find the nearest keypoint). That's something like 2 billion comparisons per image. A decent CPU (parallelised) took minutes to do that, at the time.

By far the simplest solution was to brute-force it on a GPU. Each individual lookup probably took longer than on the CPU, since there was no optimisation at all, but spread over ~10^6 pixels, with the list of ~10^3 keypoints sitting in shared memory, it was basically instant.
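The brute force itself is tiny. Here's a NumPy sketch of the same idea — not the original OpenCL, and with the sizes shrunk so it runs in a blink — for every pixel, compare against every keypoint and keep the closest:

```python
import numpy as np

# Hypothetical sizes, scaled down from the ~10^6 pixels / ~10^3
# keypoints in the anecdote so the full distance matrix fits in memory.
h, w = 100, 100
rng = np.random.default_rng(0)
keypoints = rng.uniform(0, 100, size=(50, 2))   # 50 keypoints as (y, x)

# All pixel coordinates as an (h*w, 2) array of (y, x) pairs.
ys, xs = np.mgrid[0:h, 0:w]
pixels = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)

# The brute force: an (num_pixels, num_keypoints) squared-distance
# matrix via broadcasting, then argmin along the keypoint axis.
d2 = ((pixels[:, None, :] - keypoints[None, :, :]) ** 2).sum(axis=-1)
nearest = d2.argmin(axis=1).reshape(h, w)       # keypoint index per pixel

print(nearest.shape)   # (100, 100)
```

On a GPU the structure is the same — one thread per pixel, keypoints in shared memory — which is why it fits in under 100 lines of kernel code.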

It was a great lesson in premature optimisation. I could have spent days tweaking the CPU method with heuristics, sorting the inputs, etc. In the end it was less than 100 lines of OpenCL I think.



