I love a good benchmark, thanks for putting this together! However, I have a bit of feedback.
First, the graph is misleading: by stacking times, languages that have only half of the implementations appear faster, until you dig in. I'd suggest producing an alternate graph that shows only the puzzles implemented in every language, or a separate graph for each language:puzzle pair.
Second, the examples are taken from Rosetta Code and are not necessarily the best implementations, or even close to the best, for benchmarking purposes.
Finally, those examples should be reproduced across various hardware platforms. I'm on arm64 Darwin myself, but you might find different results on Intel platforms due to the different compiler optimizations available for each architecture.
More benchmarks would be interesting to see, such as actual real-world operations, e.g. opening a file, reading it, parsing JSON, opening a socket server, etc.
Also, it may or may not be relevant to include the warm-up time for JIT languages. If it is a calculation you will run continuously, including the warm-up time may bias the results materially (I know that for .NET it can be significant).
Clarification: the examples are not taken from Rosetta Code. Only the n-queens algorithm is inspired by (but still different from) a Rosetta Code implementation; otherwise this benchmark has nothing to do with Rosetta Code.
> It is obvious that c[i], b[k] and a[i][k] can be moved out of the inner loop to reduce the frequency of matrix access. [...] However, most other languages cannot optimize this nested loop. If we manually move a[i][k] to the loop above it, we can often improve their performance.
This is only true when the three matrices are independent of each other, which is also why C has a `restrict` qualifier that lets the compiler make that assumption. The benchmark itself makes no such assumption because all three variables are declared as plain `double **`, which can be verified from the assembly output. Clang's excellent performance in matmul is probably due much more to autovectorization.
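To make the hoisting concrete, here is a minimal sketch (not the benchmark's actual code; the function name and the use of `restrict` are my additions) of the manually hoisted inner loop the quoted paragraph describes, assuming c is zero-initialized:

    /* Illustrative only: the manual hoists make the reuse explicit, so the
     * compiler doesn't have to prove the matrices don't alias each other. */
    void matmul_hoisted(int n, double **restrict a, double **restrict b,
                        double **restrict c)
    {
        for (int i = 0; i < n; ++i) {
            double *ci = c[i];                /* hoist c[i] out of the k/j loops */
            for (int k = 0; k < n; ++k) {
                const double aik = a[i][k];   /* hoist a[i][k] out of the j loop */
                const double *bk = b[k];      /* hoist b[k] as well */
                for (int j = 0; j < n; ++j)
                    ci[j] += aik * bk[j];
            }
        }
    }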
In some cases there are other reasons for hoisting the dereference too. For example, Crystal checks whether each array access is out of bounds, and by hoisting variables that check is performed far less often, which can have a huge effect on code that does a lot of indexing, like matmul.
The branch does, but you still have the comparison itself, and also the branch instruction to decide and skip (although thanks to branch prediction, your pipeline doesn’t get flushed).
If you have enough idle execution units, you might not see a difference in wall clock time. But with many algorithms you can put those units to good use.
Doesn’t include Prolog, which has decent constraint-solver answers for n-queens and sudoku; they are pretty fast, but I don’t know how they would compare in benchmarks:
The Julia matmul implementation has its rows and columns flipped though: unlike C, Julia uses column-major matrices. This has large implications for speed.
Also, the code could be much faster if SIMD were enabled in the function. It is effectively disabled in the code because a) the code unnecessarily checks bounds at every index instead of once at the top of the function, and b) floating-point SIMD is opt-in, since SIMD changes the rounding.
Yes. For this type of test, Julia should be nearly the exact same speed as C unless there is a mistake in the code. Which, to be fair, can happen very easily, especially with implicit typing.
For one-time problem runs, the JIT languages in practice will not be given time to warm up. All that matters for a user is how fast the application is in practice. It's not about "making it fair" for languages, it's about measuring how fast they go from nothing to results.
It doesn't make sense to allow "warmup" time for them unless your expected application is a server that will be running "warm" most of the time (and even then, with scalable containers, that assumption may not hold in some cases). For servers, however, what matters is mostly how fast the HTTP library is and how good the async IO is; check the TechEmpower benchmarks for that: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...
Code compilation is a common example. In many languages, when you do a clean build of a large project, you run a large number of short-lived processes.
With command line tools, the dominant design philosophy is that a program completes one task and exits. The tasks may be large or small, and the same tools are often expected to handle both.
I strongly suspect that the author may have confused JIT warmup (hard to measure, as you need to ensure that the performance figures have reached a stable point) with startup overhead (easy to measure).
I wondered why Rust is so far behind C/Nim/Zig; they should have similar behaviour.
The difference is mostly in matmul.
I see that C, for an n x n matrix, does 2 allocations, while Rust does n+1. C's matrix rows are right next to each other; Rust's are probably scattered all over the heap. I didn't look at Nim or Zig.
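For reference, a minimal sketch (the function names are mine, not the benchmark's) of the two-allocation layout described above: one contiguous block for the data plus one array of row pointers, so rows sit next to each other in memory:

    #include <stdlib.h>

    /* Two allocations for an n x n matrix: the data block and the row pointers.
     * Contrast with a Vec<Vec<f64>>, where every row is a separate allocation. */
    double **mat_alloc(int n)
    {
        double *data  = calloc((size_t)n * n, sizeof *data);  /* allocation 1 */
        double **rows = malloc((size_t)n * sizeof *rows);     /* allocation 2 */
        if (!data || !rows) { free(data); free(rows); return NULL; }
        for (int i = 0; i < n; ++i)
            rows[i] = data + (size_t)i * n;
        return rows;
    }

    void mat_free(double **m)
    {
        if (m) { free(m[0]); free(m); }  /* m[0] points at the data block */
    }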
Maybe a slice of slices of doubles would perform better than a Vec of Vecs of doubles? Then again, an argument can be made that Rust pushes you towards Vec, so this implementation is more honest about how a beginner would do it. On the other hand, C has the optimization, so why doesn't Rust?
The Rust code is very unidiomatic, and not only because of the Vec of Vecs, which, even if it is the obvious naive approach, no experienced Rust programmer would choose over a flat slice; the implementation itself is also very naive and unidiomatic.
Also it's using checked indexing, which, apart from being unidiomatic, is also going to slow things down. A fairer comparison would be to use the unchecked indexing variants.
Yeah, it's a similar problem in C# - ported code 1:1 is not representative since it now can't do out of bounds accesses and is therefore safe but comes at a cost of bounds checks which can be manually elided.
Zig had a similar issue which I made an MR to fix. I didn't actually notice Rust had the same issue, but that's probably just me forgetting how the vec! macro expands when nested.
Doesn't your PR monomorphize the function every time N is changed? I realize it's simpler since it keeps the structure, simplifies the allocation of arrays, and elides bounds checks. But it explodes the generated code size, and the matrix size can't be changed at runtime, which doesn't really match C.
Very interesting. My two cents: because of the stacked bars for php, ruby, perl and py:cpy, it's impossible to compare the other languages in that first chart. All the chart says is that the benchmark runs much slower in those 4 languages relative to "all other languages we tested".
It would be nice to see those other languages in a chart that doesn't include the slower four. Alternatively, you could also show those slower four with "broken" columns like this https://peltiertech.com/broken-y-axis-in-excel-chart/
Ah, duh, I see that now. That's what I get for commenting too quickly ;-)
The choice of stacked bar charts is also strange because the languages are not all comparable
> Every language has nqueen and matmul implementations. Some languages do not have sudoku or bedcov implementations. In addition, I implemented most algorithms in plb2 and adapted a few contributed matmul and sudoku implementations in plb. As I am mostly a C programmer, implementations in other languages may be suboptimal and there are no implementations in functional languages. Pull requests are welcomed!
So the point still stands that many charts doing one thing each would be better than fewer charts doing many things each
It would if you have some summary charts at the top that don't include all of the data, but just the key takeaways, with other charts providing more detail for discussion elsewhere
Given that it's become increasingly common for CPUs to have both performance and efficiency cores … how do benchmarks ensure they are only being run on the P-cores?
A sibling comment answered for macOS: its scheduler prefers to run everything on the performance cores by default and reserves the efficiency cores for background OS tasks. On Linux you could pin the CPU affinity, either upfront with taskset or programmatically at runtime (e.g. I have 32 cores, where 0-15 are performance cores and the rest are efficiency cores).
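For the programmatic route on Linux, a minimal sketch (the core range 0-15 matches the example above and is only illustrative; adjust for your topology):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the current process to cores 0-15, roughly what
     * `taskset -c 0-15 ./bench` would do from the shell. */
    int pin_to_p_cores(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int cpu = 0; cpu <= 15; ++cpu)
            CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof set, &set) != 0) {  /* pid 0 = self */
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }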
Having checked the C implementations of the algorithms, I'm a little skeptical because the implementations aren't optimized. Tiled matrix multiplication exploiting SIMD could easily improve performance 100-fold or more. At those speeds the cost of memory transfers usually dominates, so the languages that give you the most fine-grained control over how data is laid out in memory tend to win. And those may not be the same languages that are now winning in your benchmark.
I think a benchmark of "naive" implementations is interesting too, because it shows you how fast your code is usually going to run, not how fast it theoretically could run at its best.
> shows you how fast your code is usually going to run
Why would you think that?
Maybe the "naive" implementation just shows how fast the easily removed hotspot in your code is going to run.
"Swap the order of two statements and see the Java code slow down … Swap globals for local variables in a function and see the Python code speed up. Swap language implementations and see the C code speed up."
These are not the best benchmarks, but Python is indeed as slow as Perl, which I find insane considering that Python has 100-1000x more people working on the interpreter and performance has been a big emphasis over the last few years.
Well, for a long time speed was not a priority at all for the team developing cpython. In fact, Python 3 was still a bit slower than Python 2 until a few versions ago.
Recently there have been some decent improvements to CPython's speed, but there are real upper limits to how fast you can make an interpreter. CPython will need JIT compilation if it is ever to break out of its current speed bracket.
JavaScript has had a feature complete JIT reference implementation since 2008 which is a major part of the reason JS applications exploded so much in the 2010s.
The benchmark is not representative of how Python is actually used in practice. It ignores the existence of libraries (explicitly). For example, if NumPy or PyTorch were used for matmul, the results would be completely different.
Perl people would use PDL or Math::GSL.
I look into these benchmarks because I'm interested in VMs, and yeah the Python VM can be quite slow if you are not calling C code.
PHP results: I was stupid enough to write some scientific code in PHP once, so I know how slow it can be - mostly around array access and manipulation. But if you're going to try, use the HHVM interpreter. It's much faster and is a drop-in replacement for the PHP interpreter. Hack (https://hacklang.org/) uses it under the hood by default.
HHVM is not a drop-in replacement for a PHP interpreter. The semantics of Hack and PHP have diverged, typically in the direction of eliminating dynamic behavior from Hack that existed in PHP (examples: string -> function coercion, the PHP dual vector-hash-table array type, and non-throwing out-of-bounds array accesses are all gone from Hack). The semantics changes both simplify static analysis and make it easier to JIT fast code.
I put the pure math code in a C extension for PHP. Building a C extension for PHP is easier than in most other high-level languages. And then things get blazingly fast.
The state of PHP 8.3 and especially 8.4 is a lot better than HHVM. They have diverged quite a lot to a point that it's no longer a drop-in replacement.
PHP added a JIT in 8.0, and these math-heavy tasks can take advantage of it. It's not trivial to fine-tune the JIT configuration though.
In PHP 8.4 (scheduled for Nov 2024), there is a major upgrade to the JIT as well.
Also, since PHP was built for web requests, it builds the cache on the first run. It would be interesting to see all languages in the benchmark measured on a second run.
The C# code should be using the official package for GEMM, System.Numerics.Tensors: it is pure C# but runs at maximum hardware efficiency (it is also idiomatic). Or it should at least use Vector<T> instead of scalar operations. I'd expect this applies to most other popular languages here too.
System.Numerics.Tensors is disqualified because it uses a different algorithm. If you have a faster matmul implementation in Vector<T>, a PR will be much appreciated. Thank you in advance.
A curious thing about Swift: after https://github.com/attractivechaos/plb2/pull/23, the matrix multiplication example is comparable to C and Rust. However, I don’t see a way to idiomatically optimise the sudoku example, whose main overhead is allocating several arrays each time solve() is called. Apparently, in Swift there is no such thing as static array allocation. That’s very unfortunate.
Figure updated. Now Swift is pretty fast on nqueen+matmul, but it has the longest green bar (i.e. the longest running time for sudoku). This looks ... interesting.
There are only 20 thousand array allocations in total, not a lot. JavaScript also allocates and deallocates this many arrays, but it is 4 times as fast.
This is an area where "solvers" are heavily underutilized. For the absolute fastest solution, bespoke implementations are almost certainly required.
However, translating Sudoku and N-Queens into a similar problem that you can feed into a solver can get you a long way. Even better, you can run that solver in whatever language gives you the best optimizations, and common solvers almost certainly already contain optimizations that you wouldn't want to deal with implementing on your own.
A benchmark I would like to see is a comparison of languages in terms of how fast they are for beginners vs experts. I've been thinking about how to design it to get that result. What I think would work is taking something like these simple puzzles and having maybe a hundred people write up different solutions, so we can compare them using the programmer's level of expertise as one of the factors.
I'm surprised how well JS (Node and others) seems to come out, given that I've had firsthand experience switching from JS to Go to speed up an algorithm-type question, and the Go version cranked through a brute force much, much faster.
Maybe I made an accidental optimization in my language translation, or maybe there are some operations that are much slower in JS and these benchmarks didn't hit any of them.
Go's AOT compiler is actually not that sophisticated (comparable to -O1 in most C compilers, possibly even worse). Was your program running only for a fraction of a second? Then the JIT may not have fully warmed up, and even a basic AOT compiler has a better chance to win.
One thing I always wonder with these benchmarks is: do we want to know the best possible runtime for a language or the runtime of idiomatic/average implementations? Even with C there are loads of compiler flags (and compilers) to choose from.
The data has nothing definite to say about language implementations that were not measured. Let's focus on what we might say about those that were measured.
Nowadays I would add lines of code and syntax complexity to the benchmark. It is an apples-and-oranges comparison, but you might prefer clarity over performance depending on the circumstances. Also, it is relatively easy to do with today's language tools.
Yes, I saw that and still consider the methodology questionable. A fairer approach might be: time from the CLI, including compilation, for compiled languages. (Or warm up the JIT-compiled code.)
IMO, uncompressed bytes is a better representation, because it can be used to compare relative expressive power for the particular problem. I'd bet Python cleans house here, but the write-only languages are a wild card.
"Expressive power" is a very subjective term, and uncompressed size is a bad proxy as it includes too many variables specific to coding conventions. Compressed size with a stupid enough algorithm (here gzip) is meant to reduce these variables. The true Kolmogorov complexity in comparison can't be computed, and too smart algorithms can start to infer enough about the language itself.
What I'm asking about is how much energy and time (computation) it takes a human brain to emit or ingest each program, because a human working 40-hour weeks from 18 to 65 will have 100,000 hours of working time, which at a typing speed of 250 characters per minute and a reading rate of 1,500 characters per minute is a total career budget of something like a billion characters emitted and 10 billion characters ingested.
For emitting programs, if we assume the program is already fully formed in my brain and I'm just transcribing it, then we would like an accurate physical model of my hands moving over a QWERTY keyboard that can tell us how many joules and milliseconds I will use for each keystroke to type the given sequence of symbols. Ideally we would have a complete model or simulation of my brain so we can measure how many neurons need to fire for each keystroke as well. We don't have that, so we could measure my average milliseconds per character and multiply it by the total count of characters, but a language made up of one-symbol function names is probably harder to type than one made out of English words, and there are also all the typing mistakes I will make. This is what the simple compression algorithm (gzip) is an attempt to normalize for: a more verbose but more predictable language is about as fast to write and read as an overly terse one. Gzipping is an imitation of a complex model.
Jokes aside, humans don't read each (uncompressed) byte anyway. The number of tokens would have been much better than the number of bytes, but even this is unclear because a single token can contain multiple perceived words (e.g. someLongEnoughIdentifier) and multiple tokens can even be perceived as a single word in some cases (e.g. C/C++ `#define` is technically two tokens long, but no human would perceive it as such). I would welcome a more realistic estimate than the gzipped size, but I'm confident that it won't be the number of uncompressed bytes.
Why would uncompressed bytes be better? Using a good compression algorithm better approximates the statistical entropy of the code which is at least correlated with e.g., Kolmogorov complexity.
I was just popping in to make a similar comment. Fortran is probably doing the heavy lifting for most of the entries here, you might as well show how much your language-specific overhead is by including it.
Great post. I'd be interested to see the difference between different C compilers, and to see what the difference is if the C code is compiled with a C++ compiler, if possible.
matmul is old and useful -- there's a lot of hardware on a chip that makes it run much faster (prefetch, vectorization, instruction parallelism) and some of these languages have optimizations to expose those things automatically.
Those alone will not give you close to optimal performance (>~90% of peak), not even, say, POWER10's 4x4 matmul instruction. You need a structure matching the micro-architecture -- the cache and register structure [1]. That's not a triply-nested loop. Any remotely decent compiler will unroll and vectorize the micro-kernel appropriately, but you may still have to resort to assembler-level prefetch fiddling for the last tens of percent of performance (specifically on AVX-512). Compilers may recognize the loop structure and replace it with an optimal-ish implementation; I have an idea that FORTRAN H Extended did that, and Polly does it in clang.
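As a rough illustration of "not a triply-nested loop", here is a minimal cache-blocking sketch (not a tuned implementation: the block size B is arbitrary and would need tuning per architecture, and real BLAS-class code adds a register-blocked micro-kernel inside the innermost block; c is assumed zero-initialized):

    enum { B = 64 };  /* illustrative block size */

    void matmul_blocked(int n, double **restrict a, double **restrict b,
                        double **restrict c)
    {
        for (int ii = 0; ii < n; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < n; jj += B)
                    /* work on one B x B tile so it stays resident in cache */
                    for (int i = ii; i < ii + B && i < n; ++i)
                        for (int k = kk; k < kk + B && k < n; ++k) {
                            const double aik = a[i][k];
                            for (int j = jj; j < jj + B && j < n; ++j)
                                c[i][j] += aik * b[k][j];
                        }
    }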
And some languages perform an array bounds check on every indexing operation, which basically stops some of the above even when the optimizer could do all of it. Crystal is an example of that, where it would be possible and quite straightforward to write a specialized matrix class instead of using the very generic dynamic array implementation.
Another alternative here is to plot the reciprocal (Op / Sec) which has the benefit of both the more natural "Bigger is Better" and easier comparison of the fastest runtimes.
I believe it's because the C# version has been written using rectangular arrays. This requires every array access to use a multiplication. The Java version uses arrays-of-arrays and hoists the inner array before accessing it in the inner loop.
C# also has arrays-of-arrays, and could (should) be written in the same manner.
I've just done this and it has been merged. The benchmarks table and image haven't been updated yet. But this should bring the C# result to ~2s instead of 4.67s
Unfortunately it doesn't. The newest and hottest way to do this is to either use the bespoke matmul from System.Numerics.Tensors or at least to use Vector<T> for SIMD (which is trivial and not "the last mile" optimization it often seems to be).
How come not all benchmarks appear for all languages? For example, Zig's bar appears to be lower than C's by virtue of sudoku and bedcov being missing.
There appears to be roughly the same code structure, ported to every language, while for some languages, arbitrary optimizations are introduced (such as using `array` instead of `list` in Python).
But nobody working in Python uses matrix multiplication code written in Python. They use NumPy, which is a de facto standard library for people working in the relevant fields. It's as much part of "Python" as list comprehensions.
Without taking such real-world conventions into account, such comparisons say essentially nothing about the languages involved (and their all-important ecosystems).
Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries. Including external libraries is pointless since any language can call any library, ultimately, so at best you'd be comparing the FFI overhead.
So this shouldn't be taken as "how fast does a real-world Python program do at matrix multiplication", since of course no one writes real-world programs doing matrix multiplication in pure Python. But it can show the relative speed of pure Python at purely computational tasks.
Everyone uses pure Python for purely computational tasks. NumPy or PyTorch have far too few operations to cover every computational task; e.g. most of the operations in pandas are written in pure Python, and at times I have found that using specialised libraries could give a 10x improvement, but at a blow to developer experience compared to plain Python.
No one said they don't. It's just that there are, and always will be, parts which are computationally heavy but written in Python; e.g. pandas map is 300x slower than numpy map [1] for the same operation. But not everything can be vectorized, and even if some complex function could be vectorized, it would involve multiple passes over memory for the same array.
Every time you use a for loop, addition or multiplication operators, array indexing, or assignment to a numerical variable in Python, you are using exactly what is benchmarked in matmul.
> Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries.
If this were the only concern, it would be valid to create a binary blob and call into it for optimal performance. (Python's ctypes makes this very easy, for example.) So you want an idiomatic solution instead, and NumPy for matrix computation is considered idiomatic in Python, even more so than pure Python code.
No, creating a binary blob is not a realistic option, and it's still not a Python/Java/C++ whatever solution. It would be handwritten assembly that you use the FFI facilities to call into. So, it's not fair game in any honest benchmark.
And even if using a C library is idiomatic Python, it still has no place in a language benchmark. It's a C library, not a Python implementation.
> No, creating a binary blob is not a realistic option, and it's still not a Python/Java/C++ whatever solution.
Once again, "realistic" is subjective and I would say no "realistic" user will try to multiply arbitrarily-sized matrices in pure Python. (I can see small enough matrices, like 3x3 or 4x4, might be different.) And...
> So, it's not fair game in any honest benchmark. And even if using a C library is idiomatic Python, it still has no place in a language benchmark. It's a C library, not a Python implementation.
...you have correctly figured out that it's unfair for anyone using your (vague) definition of "realistic". It's also true that your proposal is unfair for anyone using my definition of "idiomatic", however. I could try to defend my definition with quantitative arguments, but I don't feel like doing so. In fact I tend to ignore most "language benchmarks" because it is virtually impossible to make them reasonably fair. This one is no exception.
Best "language benchmarks" tend to be more like language showcases with useful commentaries, there will be no single winner but you will get a good sense of pros and cons of each language-implementation-strategy combination. They are generally not advertised as "benchmarks", of course.
It demonstrates that Python needs libraries like NumPy. Few problems are more heavily optimized than matrix multiplication in practice, so comparing matrix multiplication benchmarks across languages with NumPy is not representative of real-world performance for most programming use cases.
It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).
> It demonstrates that Python needs libraries like NumPy.
People use matrix multiplication libraries (often written in Assembly) from every language if they really care about performance. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.
> It also means that adding performance to an existing Python program requires dropping into a different language
As stated above, this applies to all languages. BLAS routines used for serious numerical work are hand-vectorized Assembly fine-tuned for each processor architecture, written by a few hyper-experts who do nothing else.
Nobody who needs performant matrix multiplication from C thinks "hey, let me just write two nested loops".
This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of inputs.
If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... if you are using a language like C++. If you are using python, you have no chance.
It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.
If you need a really fast compiled inner loop then you can often implement it in Python using Numba. Using that you can easily implement something like sparse matrix multiplication in Python.
Of course you do. Every special-case multiplication algorithm you might need already has an optimized implementation that you can just `pip install`, and move on with what you're actually working on.
The whole scientific computing world runs on Python. Straightforward numerics code using NumPy tends to murder C/C++ code in regard to performance, unless that code is written by people who make a living hand-optimizing computational routines.
> This is really overstating how hard it is to compete with matrix multiply libraries.
I'll file this under "talk is cheap". :) I tried it last year and got within 50% of BLAS. Getting above that is tons of work. Which you have to repeat for every processor model, NUMA, and every combination of matrix type (long thin, short wide, etc).
But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.
I don't believe you. OpenBLAS is multithreaded but the code you posted is single-threaded. The inner kernel isn't very well optimized either. So, no way.
I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.
As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...
On my machine single-threaded OpenBLAS (called via NumPy) multiplies two single precision 4096x4096 matrices in 0.95 seconds. Your code takes over 30 seconds when compiled with clang++. And yes I used -O3, -march=native, and all that jazz. Btw, your code crashes g++, which doesn't necessarily mean that it is incorrect, but it indicates that the code may be difficult for the compiler to optimize. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...) run in single-threaded mode takes 0.89 seconds. Which actually beats OpenBLAS, but OpenBLAS retakes the lead for larger arrays when multi-threading is added. You can look at my code for how to write a decent inner kernel. Writing it in pure C without intrinsics and hoping that the compiler will optimize it definitely will not work.
It also is not true that "Parallelism would be easy to add". Unless your algorithm is designed from the start to exploit multi-threading, attempting to bolt it on later will not yield good performance.
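To make the "explicit SIMD" point concrete, here is a minimal sketch (assuming AVX2+FMA and single precision; this is not either commenter's actual kernel) of one explicitly vectorized step of the inner loop:

    #include <immintrin.h>

    /* One step of C[i][j..j+7] += A[i][k] * B[k][j..j+7].
     * Build with something like -O3 -mavx2 -mfma. */
    static inline void fma_step(float *c_row, const float *b_row, float a_ik, int j)
    {
        __m256 a = _mm256_set1_ps(a_ik);        /* broadcast A[i][k] to 8 lanes */
        __m256 b = _mm256_loadu_ps(b_row + j);  /* 8 floats of row k of B       */
        __m256 c = _mm256_loadu_ps(c_row + j);  /* 8 floats of row i of C       */
        _mm256_storeu_ps(c_row + j, _mm256_fmadd_ps(a, b, c));  /* c = a*b + c  */
    }

A real kernel would unroll this over several rows and columns to keep the FMA units fed from registers, which is where most of the hand-tuning effort goes.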
-O2 did improve performance significantly, but it's still 0.7 s for NumPy and 5.1 seconds for your code on 4096x4096 matrices. Either you're using a slow version of BLAS or you are benchmarking with matrices that are comparatively tiny (384x1536 is nothing).
BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane), it's not slow. I mean, just look at the profiler output...
You're probably now comparing parallel code to single threaded code.
I dunno man. My claim was that for specific cases with unique properties, it's not hard to beat BLAS, without getting too exotic with your code. BLAS doesn't have routines for multiplies with non-contiguous data, various patterns of sparsity, mixed precision inputs/outputs, etc. The example I gave is for a specific case close-ish to the case I cared about.
You're changing it to a very different case, presumably one that you cared about, although 4096x4096 is oddly square and a very clean power of 2... I said right at the beginning of this long digression that what is hard about reproducing BLAS is its generality.
When I run your benchmark with matrices larger than 1024x1024 it errors out in the verification step. Since your implementation isn't even correct I think my original point about OpenBLAS being extremely difficult to replicate still stands.
You need enough experience to know how to be "a little careful", though. This is generally true for most languages; loop unrolling works even better in Python, for example, but many Python programmers aren't even aware of the possibility.
> People use matrix multiplication libraries (often written in Assembly) from every language if they really care about performance. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.
That's a bold claim. Do you have any benchmarks to back it up? Even if it was as fast as OpenBLAS on your machine when you benchmarked it that doesn't mean it will be as fast on others' machines.
Can you explain how to build your project and how to run the benchmarks? Cause I just spent a few hours disproving another poster's claim of getting OpenBLAS-like performance and I won't want to waste more time (https://news.ycombinator.com/item?id=38867009). While I don't know Nim very well, I dare claim that you don't get anywhere near OpenBLAS performance.
First, we can use Laser, which was my initial BLAS experiment in 2019. At the time, OpenBLAS didn't properly use the AVX-512 VPUs (see this thread in BLIS: https://github.com/flame/blis/issues/352). It has made progress since then; still, on my current laptop the performance is in the same range.
Reproduction:
- Assuming x86 and preferably Linux.
- Install Nim
- Install a C compiler with OpenMP support (not the default MacOS Clang)
- Install git
The repo submodules MKLDNN (now Intel oneDNN) to benchmark against the Intel JIT compiler.
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(77, 8) Warning: use `std/os` instead; ospaths is deprecated [Deprecated]
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(101, 8) template/generic instantiation of `bench` from here
/home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(106, 21) template/generic instantiation of `gemm_nn_fallback` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm.nim(85, 34) template/generic instantiation of `newBlasBuffer` from here
/home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm_data_structure.nim(30, 6) Error: signature for '=destroy' must be proc[T: object](x: var T) or proc[T: object](x: T)
Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.
It was from an older implementation that wasn't compatible with Nim v2. I've commented it out.
If you pull again it should work.
> Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.
It defaults to 1920x1920 * 1920x1920. Note: if you activate the benchmarks against PyTorch Glow, in the past it didn't support dimensions that aren't a multiple of 16 or something; not sure about today.
Ok, I will benchmark this more when I have time. My gut feeling is that it is impossible to reach OpenBLAS-like performance without carefully tuned kernels and without explicit SIMD code. Clearly, it's not impossible to be as fast as OpenBLAS, otherwise OpenBLAS itself wouldn't be that fast, but it is very difficult and takes a lot of work. There is a reason much of OpenBLAS is implemented in assembly and not C.
Anything fast in Python needs to be written in another language, and if your day job isn't multiplying matrices, then you're probably SoL.
Assume the author knows about BLAS, and that the point is to benchmark the language not the FFI. Most people don't actually spend their time bashing vectors together.
You don't need any C knowledge to use numpy. In fact, its conceptual similarity with Matlab is possibly the single most important reason for its popularity. Many other problems do need specialized treatments that would indeed require other languages, but numpy is not a good counterexample.
You're missing the point. The point is that, for any application, Python needs an underlying C library to be fast. So if you need to solve problems where no such library exists, Python is slow.
In other words, Python IS slow, but it can call fast code written in other languages.
That's true but irrelevant here, especially given the original claim:
> adding performance to an existing Python program requires dropping into a different language
...is demonstrably false for a significant class of programs that can be rewritten into the array paradigm. The benchmark should have picked another numerical problem to avoid this issue. The Computer Language Benchmarks Game, for example, uses the `n-body` problem for this purpose.
> It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).
It's actually not that bad. I think it's part of the reason Python became so popular, it's fairly easy to write C code and expose it via python.
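As a minimal sketch of what that looks like (the module and function names here are made up for illustration, not taken from any real project), a complete CPython extension can be this small:

    /* fastmath.c: build as a shared library (e.g. via setuptools), then
     * `import fastmath` and call fastmath.sum_squares(n) from Python. */
    #define PY_SSIZE_T_CLEAN
    #include <Python.h>

    /* A trivial C-speed function: sum of i*i for i in range(n). */
    static PyObject *sum_squares(PyObject *self, PyObject *args)
    {
        long n;
        if (!PyArg_ParseTuple(args, "l", &n))
            return NULL;
        long long acc = 0;
        for (long i = 0; i < n; ++i)
            acc += (long long)i * i;
        return PyLong_FromLongLong(acc);
    }

    static PyMethodDef methods[] = {
        {"sum_squares", sum_squares, METH_VARARGS, "Sum of i*i for i < n."},
        {NULL, NULL, 0, NULL}
    };

    static struct PyModuleDef moduledef = {
        PyModuleDef_HEAD_INIT, "fastmath", NULL, -1, methods
    };

    PyMODINIT_FUNC PyInit_fastmath(void)
    {
        return PyModule_Create(&moduledef);
    }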
> It demonstrates that Python needs libraries like NumPy.
You need libraries to do _anything_ in Python. It's interpreted, so literally any call you make in Python will eventually make it back to something written in a compiled language (like a call to NumPy commands).
It is supposed to show the performance of a language when you have to implement a new algorithm in the language. This may happen if you can't find the algorithm in existing libraries. If you don't like matmul, ignore it and focus on nqueen, sudoku and bedcov.
So you'd be measuring the speed of loops over calls out to C or Fortran (e.g. the NAG library), or their vectorisation, and only testing the cost of data representation changes.
I think this shows the value of programmer productivity over performance at all costs. Python is one of the most popular languages despite having performance issues for complex algorithms. Users value clarity and ease of expression over performance. That's why Python is primarily used as glue code in these complex tasks.
Reliably benchmarking optimizing VMs like Java and JavaScript is a research problem.
See “Virtual machine warmup blows hot and cold”. The authors run a variety of benchmarks (with a pristine methodology) and find 43% of them provide “bad inconsistent” results.
2x slower than C, while not optimal at all, is not something I would call "slow" in this context. It's still pretty strong considering how much friendlier it is for the average user of the language to write code in, in the context discussed here.