> You can link the exact same intrinsics in a rust binary to get the same intrinsics.
Intrinsics are not library functions. You don’t link them anywhere. They’re processed by the compiler, not the linker, and for SIMD math each one usually becomes a single instruction. Linked functions are too slow for that.
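To be concrete, here’s a minimal sketch (the exact output depends on compiler and flags, but the point is there’s no call anywhere):

    #include <xmmintrin.h>

    /* With optimizations on, this typically compiles to a single
       addps instruction plus the return -- the compiler expands
       the intrinsic in place, no library call involved. */
    __m128 add4(__m128 a, __m128 b) {
        return _mm_add_ps(a, b);
    }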
> This is nothing about rust or not.
When I code C and write y=_mm_rsqrt_ps(x) I know I’ll get my rsqrtps instruction. When I write y=_mm_div_ps(_mm_set1_ps(1), _mm_sqrt_ps(x)) I know I’ll get the slower, more precise version. I don’t want the compiler to choose one for me while converting a formula into machine code.
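Side by side, as a sketch:

    #include <xmmintrin.h>

    /* Fast version: a single rsqrtps instruction, accurate to
       roughly 12 bits (max relative error 1.5 * 2^-12). */
    __m128 rsqrt_fast(__m128 x) {
        return _mm_rsqrt_ps(x);
    }

    /* Precise version: sqrtps followed by divps -- full single
       precision, but considerably higher latency. */
    __m128 rsqrt_precise(__m128 x) {
        return _mm_div_ps(_mm_set1_ps(1.0f), _mm_sqrt_ps(x));
    }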
Sorry to disappoint, but Rust can’t include C++ headers. Even if it could, they wouldn’t work, because intrinsics are not library functions.
> See the explicit section of the faster project.
These aren’t C intrinsics, they are library functions exported from the stdsimd crate, which in turn forwards them to LLVM. That requires Rust nightly. Also, I’m not sure that many levels of indirection are good for performance. You usually want these m128/m256 values to stay in registers. In C++, I sometimes have to write __forceinline to achieve that, or the compiler breaks performance by making function calls or referencing RAM.
It looks like significant overhead over C intrinsics: two calls to transmute() for every instruction, plus other calls for every instruction, stuff like as_i32x4.
It’s technically possible that every last one of them compiles into nothing at all and emits just the single desired instruction. I don’t believe these optimizations are 100% reliable, however. They aren’t reliable in clang or VC++; I sometimes have to use trickery to force those compilers to inline stuff, keep data in registers instead of loads/stores, and otherwise not screw up the performance.
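The kind of trickery I mean looks like this (a sketch; FORCE_INLINE is just my own macro name here):

    #include <xmmintrin.h>

    #ifdef _MSC_VER
    #define FORCE_INLINE __forceinline
    #else
    #define FORCE_INLINE inline __attribute__((always_inline))
    #endif

    /* Forcing the inline keeps v in an xmm register at the call
       site; left as an out-of-line function, the calling
       convention may bounce the value through memory. */
    static FORCE_INLINE __m128 scale(__m128 v, float s) {
        return _mm_mul_ps(v, _mm_set1_ps(s));
    }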
Right, I know there’s some support.
> You can use intrinsics from a lot of languages.
Yes. However, Intel (the guys making the CPUs that actually implement these instructions) only supports them for C/C++. Just because you can use them from other languages (e.g. modern .NET has them as well, in System.Numerics.Vectors) doesn’t necessarily mean it’s a good idea to do so.
> In fact, in Rust, they are easier to use.
That’s not “in fact”, that’s your opinion. Personally, I don’t think simpler is automatically better.
When I code at that level of abstraction, I want to get whatever instructions are implemented by the CPU. No more, no less.
I’ve looked at the example on the front page. There are two ways to compute rsqrt in SSE/AVX: a fast approximate one (rsqrtps) and a precise one (sqrtps + divps). There are several ways to compute ceil/floor, again with different tradeoffs. Do you know which instructions their example compiles into? Neither do I.
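To show what I mean by floor tradeoffs, a sketch of two of the options:

    #include <emmintrin.h>  /* SSE2 */
    #include <smmintrin.h>  /* SSE4.1 */

    /* With SSE4.1: one exact roundps instruction. */
    __m128 floor_sse41(__m128 x) {
        return _mm_floor_ps(x);
    }

    /* SSE2-only fallback: truncate toward zero through an int
       conversion, then subtract 1 where truncation went the wrong
       way. Only valid while values fit in a 32-bit integer. */
    __m128 floor_sse2(__m128 x) {
        __m128 t = _mm_cvtepi32_ps(_mm_cvttps_epi32(x));
        __m128 m = _mm_cmplt_ps(x, t);  /* all-ones where x < t */
        return _mm_sub_ps(t, _mm_and_ps(m, _mm_set1_ps(1.0f)));
    }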
Also, one tricky part of CPU SIMD is cross-lane operations (shuffle, movelh/movehl, unpack, etc.). Another is integers: the instruction set doesn’t correspond to any programming language. There are saturated versions of + and -, relatively high-level operations like psadbw, pmaddubsw and palignr, and gaps where something simple is missing (e.g. you can’t compare unsigned bytes for greater/less, only signed ones — see the workaround sketched below).
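The usual workaround for that last one, as a sketch: flip the sign bit of both operands, which maps unsigned order onto signed order, then use the signed compare that does exist.

    #include <emmintrin.h>

    /* SSE2 only has pcmpgtb, a signed byte compare. XORing both
       operands with 0x80 flips the sign bit, turning an unsigned
       greater-than into a signed one. */
    __m128i cmpgt_epu8(__m128i a, __m128i b) {
        const __m128i flip = _mm_set1_epi8((char)0x80);
        return _mm_cmpgt_epi8(_mm_xor_si128(a, flip),
                              _mm_xor_si128(b, flip));
    }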
For trivially simple algorithms that compute the same math over wide vectors of float values, you’re better off using OpenCL and running on the GPU. It will likely be faster.
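For comparison, that rsqrt example as an OpenCL C kernel (a sketch with the host setup code omitted; native_rsqrt is the fast approximate built-in, roughly analogous to rsqrtps):

    /* One work-item per element; the GPU supplies the parallel
       width instead of a fixed 4- or 8-wide register. */
    __kernel void rsqrt_all(__global const float* in,
                            __global float* out) {
        size_t i = get_global_id(0);
        out[i] = native_rsqrt(in[i]);
    }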