Hacker News

For large calculations, the magic of an FPGA is in its throughput. Imagine that you have some mess of additions, multiplications, ... laid out as a pipeline. The time for the first calculation hardly matters. Even if the FPGA is slower getting the first result out — say the calculation takes 100 clock cycles on a CPU vs 10 pipeline stages on the FPGA, with the FPGA running at a much lower clock rate — what happens on the next clock cycle? The FPGA has cranked an entire second calculation through the pipeline while the CPU is only a few steps into its second. Next clock cycle? Another whole result drops out of the pipeline.
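A back-of-the-envelope sketch of that trade-off. All numbers here are made-up assumptions, not from any real device: a 3 GHz CPU needing 100 cycles per calculation vs a 200 MHz FPGA with a 10-stage pipeline that accepts one new input per clock.

```python
# Toy latency-vs-throughput model of the pipelining argument above.
# Every constant is an illustrative assumption, not a real device spec.
CPU_HZ, CPU_CYCLES_PER_CALC = 3_000_000_000, 100
FPGA_HZ, FPGA_PIPELINE_DEPTH = 200_000_000, 10

def cpu_seconds(n):
    # Fully serial: every calculation pays the full cycle count.
    return n * CPU_CYCLES_PER_CALC / CPU_HZ

def fpga_seconds(n):
    # Pay the pipeline latency once, then one result per clock.
    return (FPGA_PIPELINE_DEPTH + (n - 1)) / FPGA_HZ

# First result: the FPGA is actually slower (50 ns vs ~33 ns) ...
print(cpu_seconds(1), fpga_seconds(1))
# ... but over a million results the pipeline wins by ~6.7x.
print(cpu_seconds(10**6) / fpga_seconds(10**6))
```

The crossover comes almost immediately: the pipeline's latency is paid once, while the CPU pays its full cycle count on every single calculation.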


My favourite personal example of this is using a GPU to perform an all-pairs nearest neighbour lookup in an image (for all pixels, find the nearest keypoint). That's something like 2 billion comparisons per image. A decent CPU (parallelised) took minutes to do that, at the time.

By far the simplest solution was to brute-force it on a GPU. Each individual lookup probably took longer than on the CPU, since there was no optimisation at all, but spread over ~10^6 pixels, with the list of ~10^3 keypoints sitting in shared memory, it was basically instant.
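The brute force itself is tiny. Here's a NumPy sketch of the same idea — not the original OpenCL, and with the sizes shrunk so it runs in a blink — for every pixel, compare against every keypoint and keep the closest:

```python
import numpy as np

# Hypothetical sizes, scaled down from the ~10^6 pixels / ~10^3
# keypoints in the anecdote so the full distance matrix fits in memory.
h, w = 100, 100
rng = np.random.default_rng(0)
keypoints = rng.uniform(0, 100, size=(50, 2))   # 50 keypoints as (y, x)

# All pixel coordinates as an (h*w, 2) array of (y, x) pairs.
ys, xs = np.mgrid[0:h, 0:w]
pixels = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)

# The brute force: an (num_pixels, num_keypoints) squared-distance
# matrix via broadcasting, then argmin along the keypoint axis.
d2 = ((pixels[:, None, :] - keypoints[None, :, :]) ** 2).sum(axis=-1)
nearest = d2.argmin(axis=1).reshape(h, w)       # keypoint index per pixel

print(nearest.shape)   # (100, 100)
```

On a GPU the structure is the same — one thread per pixel, keypoints in shared memory — which is why it fits in under 100 lines of kernel code.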

It was a great lesson in premature optimisation. I could have spent days tweaking the CPU method with heuristics, sorting the inputs, etc. In the end it was less than 100 lines of OpenCL I think.



