
The code is open-source

I have benchmarks on an i5-5257U (dual core, from an old MBP15), an i9-9980XE (Skylake-X, 18 cores), a dual Xeon Gold 6132, and an AMD 7840U.

See: https://github.com/mratsim/laser/blob/master/benchmarks%2Fge...

And using my own threadpool instead of OpenMP - https://github.com/mratsim/weave/issues/68#issuecomment-5692... - https://github.com/mratsim/weave/pull/94



Can you explain how to build your project and how to run the benchmarks? Because I just spent a few hours disproving another poster's claim of getting OpenBLAS-like performance and I don't want to waste more time (https://news.ycombinator.com/item?id=38867009). While I don't know Nim very well, I dare claim that you don't get anywhere near OpenBLAS performance.


First we can use Laser, which was my initial BLAS experiment in 2019. At the time, OpenBLAS didn't properly use the AVX512 VPUs (see this thread in BLIS: https://github.com/flame/blis/issues/352). It has made progress since then; still, on my current laptop the performance is in the same range.

Reproduction:

- Assuming x86 and preferably Linux.

- Install Nim

- Install a C compiler with OpenMP support (not the default MacOS Clang)

- Install git

The repo includes MKL-DNN (now Intel oneDNN) as a submodule, to benchmark against Intel's JIT compiler.

```
git clone https://github.com/mratsim/laser
cd laser
git submodule update --init --recursive
nim cpp -r --outdir:build -d:danger -d:openmp benchmarks/gemm/gemm_bench_float32.nim
```

This should output something like this:

```
Laser production implementation
Collected 10 samples in 0.230 seconds
Average time: 22.684 ms
Stddev time: 0.596 ms
Min time: 21.769 ms
Max time: 23.603 ms
Perf: 624.037 GFLOP/s

OpenBLAS benchmark
Collected 10 samples in 0.216 seconds
Average time: 21.340 ms
Stddev time: 3.334 ms
Min time: 19.346 ms
Max time: 27.502 ms
Perf: 663.359 GFLOP/s

MKL-DNN JIT AVX512 benchmark
Collected 10 samples in 0.201 seconds
Average time: 19.775 ms
Stddev time: 8.262 ms
Min time: 15.625 ms
Max time: 43.237 ms
Perf: 715.855 GFLOP/s
```

Note: the theoretical peak limit is hardcoded and uses my previous machine, the i9-9980XE.
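For reference, here's a back-of-the-envelope FP32 peak for that chip (my sketch, not the repo's code; the all-core AVX512 frequency is an assumption, it varies with turbo licences and thermals):

```nim
# Rough theoretical FP32 peak for an i9-9980XE (Skylake-X).
# The 3.0 GHz all-core AVX-512 frequency is an assumption, not a measured value.
let
  cores = 18.0
  ghzAvx512 = 3.0        # assumed all-core AVX-512 turbo
  fmaUnitsPerCore = 2.0  # Skylake-X has two AVX-512 FMA units per core
  flopPerFma = 32.0      # one FMA = mul + add over 16 FP32 lanes
echo cores * ghzAvx512 * fmaUnitsPerCore * flopPerFma, " GFLOP/s peak"
```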

It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...
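For illustration, a Nim BLAS binding typically selects the shared library through the dynlib pragma, along these lines (a minimal sketch, not the repo's actual file; the enum values are the standard CBLAS ones):

```nim
# Minimal sketch of a CBLAS sgemm binding via Nim's dynlib pragma.
# Swap the library name for whatever your distro ships (e.g. "libblas.so.3").
type
  CblasOrder = enum cblasRowMajor = 101, cblasColMajor = 102
  CblasTranspose = enum cblasNoTrans = 111, cblasTrans = 112

const blasLib = "libopenblas.so"

proc cblas_sgemm(order: CblasOrder; transA, transB: CblasTranspose;
                 m, n, k: cint; alpha: cfloat;
                 a: ptr cfloat; lda: cint;
                 b: ptr cfloat; ldb: cint;
                 beta: cfloat; c: ptr cfloat; ldc: cint)
  {.dynlib: blasLib, importc: "cblas_sgemm".}
```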

Implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...

In particular, tiling, cache, and register optimization: https://github.com/mratsim/laser/blob/master/laser/primitive...
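To give a flavor of the blocking idea behind that file (a simplified scalar sketch, not the repo's implementation; the block sizes are illustrative, real ones are tuned per CPU):

```nim
# Simplified scalar sketch of GEMM blocking: C[M,N] += A[M,K] * B[K,N].
# MC/KC/NR are illustrative; production kernels tune them per cache level
# and replace the inner loops with a SIMD register-tiled microkernel.
const
  MC = 64
  KC = 128
  NR = 8

proc gemmTiled(M, N, K: int; A, B: seq[float32]; C: var seq[float32]) =
  for ic in countup(0, M - 1, MC):          # block rows of A into L2
    for pc in countup(0, K - 1, KC):        # block the depth into L1
      for jc in countup(0, N - 1, NR):      # register-tile width of C
        for i in ic ..< min(ic + MC, M):
          for j in jc ..< min(jc + NR, N):
            var acc = C[i*N + j]            # keep the accumulator in a register
            for p in pc ..< min(pc + KC, K):
              acc += A[i*K + p] * B[p*N + j]
            C[i*N + j] = acc
```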

AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...

And generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (this is Nim macros to generate code at compile-time): https://github.com/mratsim/laser/blob/master/laser/primitive...
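As a toy illustration of that compile-time generation idea (my sketch, not the repo's macro; real kernels emit SIMD intrinsics rather than scalar updates):

```nim
import std/macros

# Toy macro that unrolls `nr` scalar updates at compile time; the repo's
# generator does the same in spirit but emits SIMD intrinsics.
macro unrolledUpdate(nr: static int; c, a, b: untyped): untyped =
  result = newStmtList()
  for j in 0 ..< nr:
    let idx = newLit(j)
    result.add quote do:
      `c`[`idx`] += `a` * `b`[`idx`]

var acc: array[4, float32]
let bvec = [1'f32, 2, 3, 4]
unrolledUpdate(4, acc, 2'f32, bvec)  # expands to four += statements
echo acc                             # [2.0, 4.0, 6.0, 8.0]
```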

I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...). As a side bonus it also has parallel nqueens implemented.
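In the meantime, here's a taste of Weave's fork-join API, adapted from its README (a minimal example, not the GEMM integration):

```nim
import weave

# Classic fork-join example from Weave's README: `spawn` forks a task,
# `sync` blocks on its Flowvar result.
proc fib(n: int): int =
  if n < 2: return n
  let x = spawn fib(n - 1)
  let y = fib(n - 2)
  result = sync(x) + y

proc main() =
  init(Weave)
  echo fib(20)   # 6765
  exit(Weave)

main()
```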


The compilation command errors out for me:

```
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(77, 8) Warning: use `std/os` instead; ospaths is deprecated [Deprecated]
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(101, 8) template/generic instantiation of `bench` from here
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(106, 21) template/generic instantiation of `gemm_nn_fallback` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm.nim(85, 34) template/generic instantiation of `newBlasBuffer` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm_data_structure.nim(30, 6) Error: signature for '=destroy' must be proc[T: object](x: var T) or proc[T: object](x: T)
```

Anyway, the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles, which doesn't really pay off until they become really huge.


Ah, that was from an older implementation that wasn't compatible with Nim v2. I've commented it out.

If you pull again it should work.

> Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.

I don't get why you think it's impossible to reach BLAS speed. The matrix sizes are configured here: https://github.com/mratsim/laser/blob/master/benchmarks/gemm...

It defaults to 1920x1920 * 1920x1920. Note: if you activate the benchmarks versus PyTorch Glow, in the past it didn't support dimensions that weren't a multiple of 16 or something; not sure about today.
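As a sanity check on the numbers above: an MxNxK matmul does 2*M*N*K FLOPs, and at the default 1920^3 size that lines up with the printed GFLOP/s:

```nim
# Sanity check: 2*M*N*K FLOPs for GEMM; at the default 1920^3 size and
# the 22.684 ms Laser time above, this reproduces ~624 GFLOP/s.
let
  m, n, k = 1920
  flop = 2.0 * float(m) * float(n) * float(k)  # ~14.16 GFLOP
  seconds = 22.684e-3
echo flop / seconds / 1e9, " GFLOP/s"          # ~624.04
```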

Packing is done here: https://github.com/mratsim/laser/blob/master/laser/primitive...
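For context, packing means copying a panel of the input into a contiguous, microkernel-friendly layout, roughly along these lines (an illustrative sketch, not the repo's code):

```nim
# Illustrative sketch of B-panel packing: a KC x NR slice of B is copied so
# that each k-step hands the microkernel NR contiguous, unit-stride elements.
proc packPanelB(B: seq[float32]; N, pc, jc, kc, nr: int;
                packed: var seq[float32]) =
  var idx = 0
  for p in pc ..< pc + kc:
    for j in jc ..< jc + nr:
      packed[idx] = B[p*N + j]
      inc idx
```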

It also supports pre-packing, which is useful for reimplementing batched matmul (like what cuBLAS provides) and is quite useful for convolution via matmul.


Ok, I will benchmark this more when I have time. My gut feeling is that it is impossible to reach OpenBLAS-like performance without carefully tuned kernels and explicit SIMD code. Clearly, it's not impossible to be as fast as OpenBLAS, otherwise OpenBLAS itself wouldn't be that fast, but it is very difficult and takes a lot of work. There is a reason much of OpenBLAS is implemented in assembly and not C.



