
"Intel does offer an emulator, so testing for correctness does not require this long step, but determining and optimizing performance does require these overnight compile phases."

In my experience with FPGAs, the emulation step is not enough to test for correctness. Most of the problems happen while synchronizing signals with inputs/outputs, and with the other weird timing problems, glitches, and unintuitive behavior that FPGAs provide (and that the emulator handles differently). I was using VHDL, however. Does anybody have experience with OpenCL on FPGAs who can explain which difficulties persist (are the timing problems easier to solve)?



There are a number of real issues with the emulation approach even now. Firstly, emulation isn't accurate - if you do floating point math in your application it will give you different results (within the tolerances of the OpenCL spec) on FPGA vs. CPU. So you can't test for correctness in the emulator.

A second, more serious issue is that getting performance that justifies using an FPGA requires tuning very carefully to the architecture. This may mean adopting pipeline architectures that destroy emulator performance (there are still issues with the emulator recognizing that the design patterns for shift registers on FPGA should be pointer manipulation on CPU rather than mammoth memcpys). So for a huge part of the design stage the emulator is basically useless, because it tells you nothing about what you care about: the performance of the emulator is often negatively correlated with performance on FPGA. This is made worse if you're doing hardcore FPGA tricks like mixed-precision arithmetic.
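To make the shift-register point concrete, here's a hypothetical Python sketch of the sliding-window idiom that FPGA OpenCL compilers are taught to infer (the `sliding_sum` function and `TAPS` size are my own illustration, not code from the SDK). On FPGA the inner copy loop unrolls into a chain of registers that all shift in one cycle; a naive CPU emulation actually executes the whole copy loop (effectively a memcpy of the window) on every iteration, which is why emulator speed says little about FPGA speed:

```python
# Sketch of the FPGA shift-register design pattern. On FPGA the inner
# loop becomes parallel register-to-register wiring; on a naive CPU
# emulation it is a full pass over the array per input sample.

TAPS = 4  # window length; an arbitrary choice for illustration


def sliding_sum(samples):
    shift_reg = [0] * TAPS
    out = []
    for s in samples:
        # Shift every element down one slot (the part that's "free"
        # on FPGA but a mammoth copy when emulated literally).
        for i in range(TAPS - 1):
            shift_reg[i] = shift_reg[i + 1]
        shift_reg[TAPS - 1] = s
        out.append(sum(shift_reg))
    return out
```

A software-oriented emulator could recognize this as a rotating pointer instead of a copy, but as noted above that recognition is still hit-and-miss.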

As you say though, the fact that timing is lost in the emulator also means you don't get a true idea of whether you have buffer overflows, lock-ups, and so on. Adding debug logic into the actual design impacts the implementation on FPGA in a way that it doesn't for software - and is sometimes unintuitive.


> if you do floating point math in your application it will give you different results

That seems like a gross weakness in the emulator; floating point isn't actually nondeterministic!


Actually it's not as clear-cut as you'd expect. Obviously you can't represent every number in floating point, so you have to choose a way to round numbers - and for simple operations like add you can correctly round the results. For transcendental operations like x^y it's actually unknown how many resources you'd need to correctly round x^y for every valid value of x and y [1]. Since you can't calculate these numbers to correct rounding, you have to choose an accuracy level for your approximation - say, 3 units in the last place (ULPs) at the output. Of course we all need to know how accurate these are, so the OpenCL standard specifies it [2] - exp is required to be correct to 3 ULPs for single/double and 2 ULPs for half.
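A ULP tolerance like this is easy to measure in practice. A minimal sketch (my own helper, assuming positive finite IEEE-754 doubles, where the raw bit patterns are monotonically ordered, so the bit-pattern difference counts representable values between two results):

```python
import math
import struct


def ulp_distance(a: float, b: float) -> int:
    """Count the representable doubles between two positive finite
    floats. For same-sign positive IEEE-754 values the bit patterns
    are monotonic, so their integer difference is the ULP distance.
    (A sketch, not a general-purpose float comparison.)"""
    (ia,) = struct.unpack("<q", struct.pack("<d", a))
    (ib,) = struct.unpack("<q", struct.pack("<d", b))
    return abs(ia - ib)
```

Two exp implementations can each sit within the spec's tolerance of the true value and still report a nonzero `ulp_distance` from each other - both are "correct", yet they disagree.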

Now if you have 3 ULP to play with, the maker of an Intel CPU is going to design an exp instruction to best make use of the existing Intel functional units. But an Intel FPGA dev is going to design an exp instruction to best make use of Lookup Tables and 18x18 multiplies - because that's what they have on the FPGA.
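For a feel of what a lookup-table-plus-multiplier style exp looks like, here's a toy Python sketch. It is emphatically not a compliant few-ULP implementation - the table size, polynomial degree, and range-reduction details are all assumptions chosen for brevity - but it shows the shape: split x into an integer power of two, a table index, and a small residual finished by a short polynomial:

```python
import math

# Toy table-based exp: x = k*ln2 + hi + lo, so exp(x) = 2^k * T[hi] *
# poly(lo). On FPGA, T is a small LUT and poly uses a few multipliers.
TABLE_BITS = 6  # 64-entry table; an arbitrary sketch-level choice
LN2 = math.log(2.0)
TABLE = [math.exp(i / 2**TABLE_BITS) for i in range(2**TABLE_BITS)]


def exp_lut(x: float) -> float:
    k = math.floor(x / LN2)            # integer power-of-two part
    r = x - k * LN2                    # residual in [0, ln 2)
    idx = int(r * 2**TABLE_BITS)       # high bits of r -> table index
    hi = idx / 2**TABLE_BITS
    lo = r - hi                        # |lo| < 2^-6
    # Short Taylor polynomial for exp(lo); fine at sketch accuracy.
    poly = 1.0 + lo + lo * lo / 2.0 + lo * lo * lo / 6.0
    return math.ldexp(TABLE[idx] * poly, k)
```

A real FPGA implementation would size the table and polynomial to hit the spec's ULP bound with the device's 18x18 multipliers; this sketch only gets within about 1e-9 relative error.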

So whilst you'll get the same answer for x^y on Intel CPU and Intel FPGA to within 3 ULPs, those rounding errors are going to be different between the two architectures. Now, if you compute a normal distribution on Intel FPGA vs. CPU, you'll get up to 3 ULPs of discrepancy in your exponential, and that will carry forward into the rest of the equation.
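The carry-forward is easy to demonstrate. In this sketch (my own illustration), a spec-permitted few-ULP rounding difference in exp's result is modeled by nudging the value with `math.nextafter`; the disturbed value flows straight through the rest of the normal-density formula, so the two platforms' final answers differ even though both exp implementations are compliant:

```python
import math


def normal_pdf(x: float, exp_error_ulps: int = 0) -> float:
    """Standard normal density, with an optional modeled rounding
    difference (in ULPs) on the exp result to mimic a different but
    still spec-compliant exp implementation."""
    e = math.exp(-x * x / 2.0)
    for _ in range(abs(exp_error_ulps)):
        e = math.nextafter(e, math.inf if exp_error_ulps > 0 else -math.inf)
    return e / math.sqrt(2.0 * math.pi)
```

Anything computed downstream of the two densities - a likelihood, a cumulative sum - inherits and can amplify the discrepancy, which is why bit-exact CPU/FPGA comparison fails.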

So now you have a choice: do you use the built-in function for exp on the Intel CPU - which is OpenCL compliant just like the FPGA's, giving unknown rounding differences in what is probably a mathematically sensitive task - or do you emulate the actual sub-operations the FPGA does? In the latter case, the hardy RTL designer who wrote that exponent function is going to have to write an implementation in C that emulates the hardware. And they don't only have to do that for exp - they have to do it for hundreds of mathematical functions, and it'll run dog slow on the CPU compared to using the native functions.

[1]https://en.wikipedia.org/wiki/Rounding#Table-maker's_dilemma

[2]https://www.khronos.org/registry/OpenCL/specs/opencl-2.1-env...


> do you emulate the actual sub-operations the FPGA does?

Yes. It's an emulator.

> ... is going to have to write an implementation in C that emulates the hardware.

Makes sense. It's an emulator.

> ... and it'll run dog slow on the CPU compared to using the native functions.

Isn't that to be expected? It's an emulator. This isn't like games where it just has to look close. If it's a dev tool for testing correctness, exactness matters.


Well, last time I looked, the answer to that first question for the Intel OpenCL SDK is actually no.

And while yes, it's expected to be slow compared to the native functions, that's not the problem. It's slow compared to simulation.


Funny how some people believe that brain simulations are relatively nearby, but we can't even simulate an FPGA well enough to trust the model.


All sorts of simulators are nearby at once if you consider quantum computers to be nearby; it's not linear development. I agree, however, since I don't consider a quantum computer to be nearby.


I wouldn't think the standard issues with glitches and synchronizing signals would happen with OpenCL. The synthesizer should have enough information to handle all of that and spit out max timing on the other side.



