Can you explain how to build your project and how to run the benchmarks? I just spent a few hours disproving another poster's claim of getting OpenBLAS-like performance and I don't want to waste more time (https://news.ycombinator.com/item?id=38867009). While I don't know Nim very well, I dare claim that you don't get anywhere near OpenBLAS performance.
First, we can use Laser, which was my initial BLAS experiment in 2019. At the time, OpenBLAS in particular didn't properly use the AVX-512 VPUs (see this thread in BLIS: https://github.com/flame/blis/issues/352). It has made progress since then; still, on my current laptop, performance is in the same range.
Reproduction:
- Assuming x86 and preferably Linux.
- Install Nim
- Install a C compiler with OpenMP support (not the default macOS Clang)
- Install git
The repo submodules MKLDNN (now Intel oneDNN) to benchmark against Intel's JIT compiler.
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(77, 8) Warning: use `std/os` instead; ospaths is deprecated [Deprecated]
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(101, 8) template/generic instantiation of `bench` from here
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(106, 21) template/generic instantiation of `gemm_nn_fallback` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm.nim(85, 34) template/generic instantiation of `newBlasBuffer` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm_data_structure.nim(30, 6) Error: signature for '=destroy' must be proc[T: object](x: var T) or proc[T: object](x: T)
Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.
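The size effect is easy to see empirically. Here's a minimal sketch (Python/NumPy for brevity; it assumes NumPy is linked against an optimized BLAS such as OpenBLAS, which is typical for pip wheels) that measures GEMM throughput at a few sizes. The fixed packing cost means small sizes land well below the large-size plateau:

```python
import time
import numpy as np

def gemm_gflops(n: int, repeats: int = 3) -> float:
    """Time an n x n single-precision GEMM and return GFLOP/s."""
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up run so one-time setup cost isn't timed
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        a @ b
        best = min(best, time.perf_counter() - t0)
    # One (n x n) * (n x n) GEMM performs 2*n^3 floating-point operations.
    return 2.0 * n**3 / best / 1e9

for n in (64, 256, 1920):
    print(f"n={n:5d}: {gemm_gflops(n):8.1f} GFLOP/s")
```

On a typical machine the small sizes report a fraction of the throughput of the large one, even though it's the same BLAS underneath.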
It was from an older implementation that wasn't compatible with Nim v2. I've commented it out.
If you pull again it should work.
> Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.
It defaults to 1920x1920 * 1920x1920. Note that if you activate the benchmarks against PyTorch Glow, in the past it didn't support dimensions that weren't a multiple of 16 or something like that; I'm not sure whether that's still the case.
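For what it's worth, the default size would satisfy that old multiple-of-16 constraint either way, a quick check:

```python
# 1920 is divisible by 16 (and by larger power-of-two tile sizes too),
# so the default benchmark dimension meets a multiple-of-16 requirement.
divisors = [d for d in (16, 32, 64, 128) if 1920 % d == 0]
print(divisors)  # → [16, 32, 64, 128]
```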
OK, I will benchmark this more when I have time. My gut feeling is that it is impossible to reach OpenBLAS-like performance without carefully tuned kernels and explicit SIMD code. Clearly, it's not impossible to be as fast as OpenBLAS, otherwise OpenBLAS itself wouldn't be that fast, but it is very difficult and takes a lot of work. There is a reason much of OpenBLAS is implemented in assembly rather than C.
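To put a rough number on that intuition, here is a hypothetical comparison (Python for brevity, assuming NumPy's BLAS backend): a textbook triple-loop GEMM against the library call. Most of the naive version's cost below is interpreter overhead, so it overstates the gap, but a similar, if smaller, gap shows up for a naive C loop versus OpenBLAS, and closing it is exactly what tuned packing plus hand-written SIMD microkernels buy.

```python
import time
import numpy as np

def naive_gemm(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Textbook i-j-k triple loop: no tiling, no packing, no explicit SIMD."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    c = np.zeros((n, m), dtype=np.float32)
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            c[i, j] = acc
    return c

n = 64
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

t0 = time.perf_counter(); c_naive = naive_gemm(a, b); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); c_blas = a @ b; t_blas = time.perf_counter() - t0

# Same result, wildly different speed.
assert np.allclose(c_naive, c_blas, rtol=1e-3)
print(f"naive: {t_naive * 1e3:.1f} ms, BLAS: {t_blas * 1e3:.3f} ms")
```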
I have benches on an i5-5257U (dual core from an old MBP15), an i9-9980XE (Skylake-X, 18 cores), a dual Xeon Gold 6132, and an AMD 7840U.
See: https://github.com/mratsim/laser/blob/master/benchmarks%2Fge...
And using my own threadpool instead of OpenMP - https://github.com/mratsim/weave/issues/68#issuecomment-5692... - https://github.com/mratsim/weave/pull/94