Benchmarking 20 programming languages on N-queens and matrix multiplication (github.com/attractivechaos)
163 points by attractivechaos on Jan 3, 2024 | 186 comments


I love a good benchmark, thanks for putting this together! However, I have a bit of feedback.

First, the graph is misleading: by stacking times, languages that have only half the implementations appear faster until you dig in. I'd suggest producing an alternate graph that shows only the puzzles implemented in every language, or making a unique graph for every language:puzzle pair.

Second, the examples are taken from Rosetta Code and are not necessarily the best implementation, or even close to the best implementation, for benchmarking purposes.

Finally, those examples should be reproduced across various hardware platforms, I'm on arm64 Darwin myself, but you might find different results on Intel platforms due to the various compiler optimizations available based on the hardware.

More benchmarks would be interesting to see, such as actual real world operations, e.g. opening a file, reading it, parsing json, opening a socket server, etc.


Also, it may or may not be relevant whether the warm-up time for JIT languages is excluded. If it is for a calculation you will run continuously, including the warm-up time may bias the results materially (I know that for .NET it can be significant).


Clarification: the examples are not taken from Rosetta Code. Only the n-queens algorithm is inspired by (but still different from) a Rosetta Code implementation; otherwise this benchmark has nothing to do with Rosetta Code.


There is one for data processing here: https://github.com/jinyus/related_post_gen


> It is obvious that c[i], b[k] and a[i][k] can be moved out of the inner loop to reduce the frequency of matrix access. [...] However, most other languages cannot optimize this nested loop. If we manually move a[i][k] to the loop above it, we can often improve their performance.

This is only true when the three matrices are independent of each other, which is also why C has a `restrict` qualifier that enables this assumption. The benchmark itself makes no such assumption because all three variables are defined as `double **`, and this can be verified from the assembly output. Clang's excellent performance in matmul is probably due much more to autovectorization.
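For illustration, here's a minimal sketch in C of the manual hoist being discussed (the function and variable names are mine, not necessarily the repo's):

    /* Naive triple loop: with double** arguments and no `restrict`,
       the compiler must assume c, a and b may alias, so a[i][k] and
       the row pointers get re-read inside the inner loop. */
    void matmul_naive(int n, double **a, double **b, double **c) {
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < n; ++k)
                for (int j = 0; j < n; ++j)
                    c[i][j] += a[i][k] * b[k][j];
    }

    /* Manually hoisted: c[i], b[k] and a[i][k] are loaded once per
       middle-loop iteration, as described in the quoted passage. */
    void matmul_hoisted(int n, double **a, double **b, double **c) {
        for (int i = 0; i < n; ++i) {
            double *ci = c[i];
            for (int k = 0; k < n; ++k) {
                double aik = a[i][k];
                double *bk  = b[k];
                for (int j = 0; j < n; ++j)
                    ci[j] += aik * bk[j];
            }
        }
    }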


In some cases there are other reasons for hoisting the dereference too. For example, Crystal will check whether the array access is out of bounds, and by hoisting variables that check will be done a lot less often, which can have huge effects for code that does a lot of that, like matmul.


Doesn’t that get mostly optimised away by the cpu branch prediction?


The branch does, but you still have the comparison itself, and also the branch instruction to decide and skip (although thanks to branch prediction, your pipeline doesn’t get flushed).

If you have enough idle execution units, you might not see a difference in wall clock time. But with many algorithms you can put those units to good use.


The branch does, but not the optimizations that could otherwise be done, like vectorization.


Doesn’t include Prolog which has decent constraint solver answers for n-queens and sudoku, which are pretty fast but I don’t know how they would compare in benchmarks:

https://www.metalevel.at/queens/

https://www.metalevel.at/sudoku/


Make a pull request.


Neat!

The Julia matmul implementation has its rows and columns flipped though - unlike C, Julia uses column-major matrices. This has large implications for speed.

Also, the code may be much faster if you enable SIMD in the function. It is disabled in the code because a) the code unnecessarily checks bounds at every index instead of once at the top of the function, and b) float SIMD is opt-in since SIMD changes the rounding.


Just came here to say this has been fixed in PR #2. The figure has been updated accordingly.


Yes. For this type of test, Julia should be nearly the exact same speed as C unless there is a mistake in the code. Which, to be fair, can happen very easily, especially with implicit typing.


Aren’t JIT languages at a disadvantage since they are benchmarked through the CLI rather than using a benchmarking library to allow JIT to warmup?


For one-time problem runs, the JIT languages in practice will not be given time to warm up. All that matters for a user is how fast the application is in practice. It's not about "making it fair" for languages, it's about measuring how fast they go from nothing to results.

It doesn't make sense to allow "warmup" time for them unless your expected application is a server which for most of the time will be running "warm" (and even then with scalable containers that assumption may not even be true in some cases). For servers, however, what matters is mostly how fast its HTTP library is and how good the async IO is... check Techempower benchmarks for that: https://www.techempower.com/benchmarks/#hw=ph&test=fortune&s...


From nothing to result rarely happens in real life. I rarely see someone start and stop a program for each unit task (like piping commands).


Code compilation is a common example. In many languages, when you do a clean build of a large project, you run a large number of short-lived processes.

With command line tools, the dominant design philosophy is that a program completes one task and exits. The tasks may be large or small, and the same tools are often expected to handle both.


Some language implementations have runtime startup costs (e.g. the bytecode verifier) that are not required at run time by other language implementations.

https://www.oracle.com/java/technologies/security-in-java.ht...

Why should that run time startup cost be ignored?


Yes, but the author claims the longest JIT warmup is 0.3 seconds, so it's not an important issue in these benchmarks that take several seconds.


I strongly suspect that the author may have confused JIT warmup (hard to measure, as you need to ensure that the performance figures have reached a stable point) with the startup overhead (easy to measure).


...and many JITted VMs do not even reach a stable point at all; there was a big paper on this a couple of years ago.



Yes, this is the one. Time flies! And VM jits didn't change all that much.


0.3s is significant for a task that only takes 1.14s


We don't know that OpenJDK is what was meant by "Some JIT-based language runtimes take up to ~0.3 second to compile and warm-up".

Maybe that was PyPy, with program times >5 seconds.


What's CLI?


Presumably command line interface/terminal. I can type `dotnet run` in the terminal and provide some options to run a .NET program, for example.


I wondered why Rust is so far behind C/Nim/Zig; they should have similar behaviour.

The difference is mostly in matmul.

I see how C, for an n x n matrix, does 2 allocations, while Rust does n+1. C's matrix rows are right next to each other; Rust's are probably scattered all over the place. I didn't look at Nim or Zig.

Maybe a slice of slice of double would perform better than a vec of vec of double? Then again, an argument can be made that rust pushes you to vec, so this impl is more honest for how a beginner would do it. Otoh, C has the optimization so why doesn't rust?
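For reference, the 2-allocation layout the C side presumably uses looks roughly like this (a sketch, not the repo's exact code): one contiguous block for all n*n elements plus one block of row pointers into it.

    #include <stdlib.h>

    /* Two allocations total: the element block and the row-pointer
       block. All rows end up adjacent in memory. */
    double **mat_alloc(int n) {
        double **m = malloc((size_t)n * sizeof(double *));
        m[0] = calloc((size_t)n * n, sizeof(double));
        for (int i = 1; i < n; ++i)
            m[i] = m[0] + (size_t)i * n;
        return m;
    }

    void mat_free(double **m) {
        free(m[0]);  /* the contiguous element block */
        free(m);     /* the row pointers */
    }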


The Rust code is very unidiomatic, and not only because of the Vec of Vecs, which, even if it's the obvious naive approach, no one experienced would choose over a flat slice; the implementation itself is also very naive.


Also it's using checked indexing, which apart from being not idiomatic, is also going to slow things down. A fairer comparison would be to use the unchecked indexing variants.


Yeah, it's a similar problem in C#: code ported 1:1 is not representative, since it now can't do out-of-bounds accesses and is therefore safe, but that comes at the cost of bounds checks, which can be manually elided.


Unchecked indexing is unidiomatic. A fairer comparison would be using iterators if possible, but otherwise you have to live with that.


FYI: Rust has been updated to avoid unnecessary bounds checks in PR #4. It now matches the C version in performance.


Zig had a similar issue which I made an MR to fix. I didn't actually notice Rust had the same issue but that's probably just my forgotten knowledge with how the vec! macros expand when nested.


Rust is equivalent to C when using static allocation.

I sent a PR to fix it.


Doesn't your PR monomorphize the function every time N is changed? I realize it's simpler since it keeps the structure, simplifies allocation of arrays, and elides bounds checks. But it explodes the generated code size, and the matrix size can't be changed at runtime, which doesn't really match C.

Edit: I have tried making an iterator-based version to elide bound checks, but had to resort to unsafe, and it's barely 50% faster than the original rust version (not as fast as C): https://gist.github.com/anisse/6b580628206293ef242faa7db6219...

Edit 2: updated, and my Rust iterator version is now roughly equivalent to C with no unsafe.

Edit 3: too late, the repo has been updated with another iterator-based version that is just as fast.


C, Nim and Zig all benefit from static allocation and it's available in rust, so seems fair?

also, `const N` is used in rust's nqueen test, so figured it would be fine.

someone posted an `.iter().zip()` solution which benchmarks only a little slower (+10ms) than static allocation:

https://github.com/attractivechaos/plb2/pull/4#issuecomment-...


I only looked at the C version, and it didn't use a generic over the allocation size, although yes, technically it was fixed at compile time.

I saw the update to the PR, but I still prefer my collect()-based version :-)


Very interesting. My two cents: because of the stacked bars for php, ruby, perl and py:cpy, it's impossible to compare the other languages in that first chart. All the chart says is that the benchmark is much slower in those 4 languages relative to "all other languages we tested".

It would be nice to see those other languages in a chart that doesn't include the slower four. Alternatively, you could also show those slower four with "broken" columns like this https://peltiertech.com/broken-y-axis-in-excel-chart/


The second chart is basically the first chart with the slow 4 removed (well, not removed but at such a scale that they're irrelevant).


Ah, duh, I see that now. That's what I get for commenting too quickly ;-)

The choice of stacked bar charts is also strange because the languages are not all comparable

> Every language has nqueen and matmul implementations. Some languages do not have sudoku or bedcov implementations. In addition, I implemented most algorithms in plb2 and adapted a few contributed matmul and sudoku implementations in plb. As I am mostly a C programmer, implementations in other languages may be suboptimal and there are no implementations in functional languages. Pull requests are welcomed!

So the point still stands that many charts doing one thing each would be better than fewer charts doing many things each


> many charts doing one thing each

Would not be a summary.


It would if you have some summary charts at the top that don't include all of the data, but just the key takeaways, with other charts providing more detail for discussion elsewhere


My apologies. One chart for nqueens and one chart for matmul would be fine.


> ” Timing on Apple M1 Macbook Pro”

Given that it's become increasingly common for CPUs to have both performance and efficiency cores … how do benchmarks ensure they are only being run on the P-cores?


Sibling answered for macOS: its scheduler by default prefers to run everything on the performance cores and reserves the efficiency cores for background OS tasks. On Linux you could pin the CPU affinity when running, either upfront with taskset or programmatically at runtime (e.g. I have 32 cores and 0-15 are performance while the rest are efficiency).


It actually takes a bit of effort to run on the efficiency cores.

I believe Game Mode will push background processes to the E-cores to keep gameplay consistent without thermal throttling.


Having checked the C implementations of the algorithms, I'm a little skeptical because the implementations aren't optimized. Tiling the matrix multiplication and exploiting SIMD could easily improve performance 100-fold or more. At those speeds the cost of memory transfer usually dominates, so the languages that give you the most fine-grained control over how data is laid out in memory tend to win. And it may not be the same languages that are now winning in your benchmark.
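As a rough sketch of what that means, here is a cache-tiled (not yet SIMD-tuned) matmul over flat row-major arrays in C; the block size is arbitrary and would need tuning for a given machine:

    #include <stddef.h>

    #define TILE 64  /* arbitrary; pick so the working set fits in cache */

    /* Cache-tiled matmul: each (ii, kk, jj) block is reused while it is
       still resident in cache, instead of streaming whole rows/columns. */
    void matmul_tiled(size_t n, const double *a, const double *b, double *c) {
        for (size_t ii = 0; ii < n; ii += TILE)
            for (size_t kk = 0; kk < n; kk += TILE)
                for (size_t jj = 0; jj < n; jj += TILE)
                    for (size_t i = ii; i < ii + TILE && i < n; ++i)
                        for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                            double aik = a[i * n + k];
                            for (size_t j = jj; j < jj + TILE && j < n; ++j)
                                c[i * n + j] += aik * b[k * n + j];
                        }
    }

A real implementation would additionally pack the tiles and vectorize the innermost loop; the sketch above only captures the cache-blocking part.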


I think a benchmark of "naive" implementations is interesting too, because it shows you how fast your code is usually going to run, not how fast it theoretically could run at its best.


> shows you how fast your code is usually going to run

Why would you think that?

Maybe the "naive" implementation just shows how fast the easily removed hotspot in your code is going to run.

"Swap the order of two statements and see the Java code slow down … Swap globals for local variables in a function and see the Python code speed up. Swap language implementations and see the C code speed up."

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


See also https://benchmarksgame-team.pages.debian.net/benchmarksgame/... for more algorithms and code hand tuned as well as plain.


Thanks, and they know: "Plb2 complements the Computer Language Benchmark Games."


These are not the best benchmarks, but Python is indeed as slow as Perl, which I find insane considering that Python has 100-1000 more people working on the interpreter and performance has been a big emphasis the last few years.


Well, for a long time speed was not a priority at all for the team developing cpython. In fact, Python 3 was still a bit slower than Python 2 until a few versions ago.

Recently there have been some decent improvements to CPython's speed, but there are real upper limits to how fast you can make an interpreter. CPython will need JIT compilation if it is ever to break out of its current speed bracket.

JavaScript has had a feature complete JIT reference implementation since 2008 which is a major part of the reason JS applications exploded so much in the 2010s.


The benchmark is not representative of how Python is actually used in practice. It ignores the existence of libraries (explicitly). For example, if NumPy or PyTorch were used for matmul, the results would be completely different.


Perl people would use PDL or Math::GSL. I look into these benchmarks because I'm interested in VMs, and yeah the Python VM can be quite slow if you are not calling C code.


PHP results: I was stupid enough to write some scientific code in PHP once, so I know how slow it can be - mostly around array access and manipulation. But if you're going to try, use the HHVM interpreter. It's much faster and is a drop-in replacement for the PHP interpreter. Hack (https://hacklang.org/) uses it under the hood by default.


HHVM is not a drop-in replacement for a PHP interpreter. The semantics of Hack and PHP have diverged, typically in the direction of eliminating dynamic behavior from Hack that existed in PHP (examples: string -> function coercion, the PHP dual vector-hash-table array type, and non-throwing out-of-bounds array accesses are all gone from Hack). The semantics changes both simplify static analysis and make it easier to JIT fast code.


I put the pure math code in a C extension of PHP. Building a C extension for PHP is easier than in most other high-level languages. And then things get blazingly fast.


doesn't look that easy... zend_parse_parameters? pre-baked configure + make scripts?

check out how a modern language deals with this stuff https://bun.sh/docs/api/ffi#usage


Checkout what? PHP also has support for FFI. Or am I missing something?


The state of PHP 8.3 and especially 8.4 is a lot better than HHVM. They have diverged quite a lot to a point that it's no longer a drop-in replacement.

PHP added JIT in 8.0, and these math-heavy tasks can take advantage of it. It's not trivial to fine-tune the JIT configuration though.

In PHP 8.4 (scheduled Nov 2024), there is a major upgrade to JIT as well.


Also, since PHP was built for web requests, it builds the cache on the first run. It would be interesting to see all languages of the benchmark with a second run.


C# code should be using the official package for GEMM, which is System.Numerics.Tensors; it is pure C# but runs at max hardware efficiency (it is also idiomatic). Or at least use Vector<T> instead of scalar operations. I'd expect this applies to most other popular languages here too.


System.Numerics.Tensors is disqualified because it uses a different algorithm. If you have a faster matmul implementation in Vector<T>, a PR will be much appreciated. Thank you in advance.


Nice to see Go performing well. Would be interesting to see the results with PGO enabled.


Why not run C# as AOT? It’s a one line change.


Do people usually run C# code AOT compiled?


If it's for something stupid like benchmark comparison - yes.


I don't know about people, but I do pretty much everywhere I can. It makes such a difference that AWS is heavily investing in Lambdas that use AOT[0].

[0] - https://docs.aws.amazon.com/lambda/latest/dg/dotnet-native-a...


It’s gaining popularity for CLI and serverless, some people use it to replace C++ for writing native libraries and plugins as well.


Yes, everywhere I possibly can. Particularly for CLI programs and Linux daemons.


Or use BenchmarkDotNet which, among other things to get an accurate benchmark, does JIT warmup outside of measurement.

( https://github.com/dotnet/BenchmarkDotNet ).


A curious thing about Swift: after https://github.com/attractivechaos/plb2/pull/23, the matrix multiplication example is comparable to C and Rust. However, I don’t see a way to idiomatically optimise the sudoku example, whose main overhead is allocating several arrays each time solve() is called. Apparently, in Swift there is no such thing as static array allocation. That’s very unfortunate.


Figure updated. Now swift is pretty fast on nqueen+matmul but it has the longest green bar (i.e. longest running time for sudoku). This looks ... interesting.

There are only 20 thousand array allocations in total, not a lot. JavaScript also has this many arrays allocated/deallocated, but it is 4 times as fast.


This is an area where "solvers" are heavily underutilized. For the absolute fastest solution, bespoke implementations are almost certainly required.

However, translating Sudoku and N-Queens into a similar problem that you can feed into a solver can get you a long way. Even better, you can move that solver into whatever language gives you the best optimizations you can work with. Better still, there are almost certainly optimizations in common solvers that you don't want to deal with implementing on your own.


A benchmark I would like to see is a comparison of languages in terms of how fast they are to beginners vs experts. I've been thinking about how to design it to get that result. What I think would work is taking something like these simple puzzles and have maybe a hundred people write up different solutions, so we can compare them using the programmer's level of expertise as one of the factors.


How would you decide who were beginners and who were experts?

Here are naive line-by-line transliterations from an original C program:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

Here exhaustively-optimised + multicore + vector-instruction programs are included:

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


Surprised how well JS (node and others) seem to come out when I've had firsthand experience of switching from JS to Go to speed up an algorithm type question and had the Go version crank through a bruteforce much much faster.

Maybe I made an accidental optimization in my language translation, or maybe there are some operations that are much slower in JS and these benchmarks didn't hit any of them.


Go's AOT compiler is actually not that sophisticated (comparable to -O1 in most C compilers, possibly even worse). Was your program running only for a fraction of a second? Then the JIT may not have fully warmed up, and even a basic AOT compiler has a better chance to win.


Given the times we're in, it would be interesting to test some libraries too, like numpy and pytorch


One thing I always wonder with these benchmarks is: do we want to know the best possible runtime for a language or the runtime of idiomatic/average implementations? Even with C there are loads of compiler flags (and compilers) to choose from.


The data has nothing definite to say about language implementations that were not measured. Let's focus on what we might say about those that were measured.


Nowadays I would add lines of code and syntax complexity to the benchmark. It is an apples-and-oranges comparison, but you could prefer clarity over performance depending on the circumstances. Also, it is relatively easy to do with today's language tools.



I wonder if JRuby or Truffle Ruby would have a significant impact on this one for Ruby?


Please make and publish those measurements!


The reality is that it would be very hard to find Python code that does not use NumPy (or some tensor lib), for matmul.

Including time to JIT-compile is questionable; why not also include time to compile the compiled languages?


"Nonetheless, because most benchmarks run for several seconds, including the startup time does not greatly affect the results."

https://github.com/attractivechaos/plb2#startup-time


Yes, I saw that and still consider the methodology questionable. A fairer approach might be: time from the cli, including compilation, for compiled languages. (Or warmup the jit compiled code.)


Those are unfounded conjectures until you show that the results for those programs have been greatly affected.

As-in "Wtf kind of benchmark counts the jvm startup time?"

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


You should add a chart of the number of gzip'd bytes of source code.


IMO, uncompressed bytes is a better representation, because it can be used to compare relative expressive power for the particular problem. I'd bet Python cleans house here, but the write-only languages are a wild card.


"Expressive power" is a very subjective term, and uncompressed size is a bad proxy as it includes too many variables specific to coding conventions. Compressed size with a stupid enough algorithm (here gzip) is meant to reduce these variables. The true Kolmogorov complexity in comparison can't be computed, and too smart algorithms can start to infer enough about the language itself.


Kolmogorov complexity is absolutely the wrong metric. It doesn't account for big-O, timing, and many other production requirements.

You'll never have a perfect metric here, but human readable size of code base is well justified. Do you write minified javascript?


What I'm asking about is how much energy and time (computation) by a human brain it takes to emit or ingest each program because a human working 40 hour weeks from 18-65 will have 100,000 hours of working time, which at a typing speed of 250 characters per minute and a reading rate of 1500 characters per minute is a total career budget of like a billion characters emitted and 10 billion characters ingested.

For emitting programs, if we assume the program is already fully formed in my brain and I'm just transcribing it, then we would like an accurate physical model of my hands moving over a QWERTY keyboard that can tell us how many joules and milliseconds I will use for each keystroke to type the given sequence of symbols. Ideally we would have a complete model or simulation of my brain so we can measure how many neurons need to fire for each keystroke as well. We don't have that, so we could measure my average milliseconds per character and multiply it by the total count of characters, but a language that is just made up of one-symbol function names is probably harder to type than one made out of English words, and there are also all the typing mistakes I will make. This is what the simple compression algorithm (gzip) is an attempt to normalize for: that a more verbose but more predictable language is as fast to write and read as an overly terse one. Gzipping is an imitation of that more complex model.



> Do you write minified javascript?

Yes [1] [2] [3]?

[1] https://js1024.fun/demos/2020/46/readme

[2] https://js1024.fun/demos/2022/18/readme

[3] https://github.com/lifthrasiir/roadroller/blob/442caa4/index...

Jokes aside, humans don't read each (uncompressed) byte anyway. The number of tokens would have been much better than the number of bytes, but even this is unclear because a single token can contain multiple perceived words (e.g. someLongEnoughIdentifier) and multiple tokens can even be perceived as a single word in some cases (e.g. C/C++ `#define` is technically two tokens long, but no human would perceive it as such). I would welcome a more realistic estimate than the gzipped size, but I'm confident that it won't be the number of uncompressed bytes.


Why would uncompressed bytes be better? Using a good compression algorithm better approximates the statistical entropy of the code which is at least correlated with e.g., Kolmogorov complexity.


Because humans read and write the uncompressed code. Gzip will hide problems like copy paste.


And hide differences due to label-length personal-preferences.


because it often sits uncompressed on the drive


N-queens and matrix benchmark without Fortran and Wolfram Mathematica?


I was just popping in to make a similar comment. Fortran is probably doing the heavy lifting for most of the entries here, you might as well show how much your language-specific overhead is by including it.


No, it isn't. OP is benchmarking language-native implementations of all algos.


Fortran the king of matrices :-)


Great post. I'd be interested to see the difference between different C compilers, and to see what the difference is if the C code is compiled with a C++ compiler, if possible.


Glad to see dart in there, it's normally overlooked. Not a bad result. I wonder what speed it would be if it had been AOT compiled instead of JIT.



And LuaJit surprised me quite a bit!


Odd that nqueens and sudoku have a high correlation but matmul seems to be largely doing its own thing.

nqueen vs. sudoku: 0.531

matmul vs. sudoku: 0.362

matmul vs. nqueen: 0.127


matmul is old and useful -- there's a lot of hardware on a chip that makes it run much faster (prefetch, vectorization, instruction parallelism) and some of these languages have optimizations to expose those things automatically.


Those will not simply give you close to optimal performance (>~90% of peak), not even, say, POWER10's 4x4 matmul instruction. You need a structure matching the micro-architecture -- the cache and register structure [1]. That's not a triply-nested loop. Any remotely decent compiler will unroll and vectorize the micro-kernel appropriately, but you may still have to resort to assembler-level prefetch fiddling for the last tens of percent of performance (specifically on AVX-512). Compilers may recognize the loop structure and replace it with an optimal-ish implementation. I've an idea that FORTRAN H Extended did that, but Polly does it in clang.

1. https://www.cs.utexas.edu/users/flame/pubs/blis1_toms_rev3.p...
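To make that concrete, here is a toy register-blocked micro-kernel in plain C -- just the structural idea (accumulators kept live across the whole k loop), without the packing, prefetching and explicit vector instructions a real BLIS/OpenBLAS kernel would add:

    /* Toy 4x4 micro-kernel: C[4x4] += A[4xK] * B[Kx4].
       The 16 accumulators are intended to stay in registers for the
       whole k loop; a decent compiler can vectorize the j loop. */
    static void microkernel_4x4(int K, const double *A, int lda,
                                const double *B, int ldb,
                                double *C, int ldc) {
        double acc[4][4] = {{0}};
        for (int k = 0; k < K; ++k)
            for (int i = 0; i < 4; ++i)
                for (int j = 0; j < 4; ++j)
                    acc[i][j] += A[i * lda + k] * B[k * ldb + j];
        for (int i = 0; i < 4; ++i)
            for (int j = 0; j < 4; ++j)
                C[i * ldc + j] += acc[i][j];
    }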


And some languages perform a bounds check on each array indexing operation, which basically blocks some of the above even if the optimizer could do all of it. Crystal is an example of that, where it would be possible and quite straightforward to write a specialized matrix class instead of using the very generic dynamic array implementation.


It has .NET generic.

Would .NET with F# make a big difference here?

I'm a little surprised Java beat .NET. Is that typical these days?


> I'm little surprised Java beat .NET. Is that typical these days?

Not at all; this is just a bad benchmark. It measures from the CLI run with no specific flags, which is just terrible for cold starts.


+1 curious about F#


This is great, thanks for posting this.

OP - are you interested in pull requests adding support for other languages?


Of course! Please implement at least nqueen and matmul as they have been implemented in every language in the benchmark.


For the love of god, log graph tiny values with large values please. :)


Another alternative here is to plot the reciprocal (Op / Sec) which has the benefit of both the more natural "Bigger is Better" and easier comparison of the fastest runtimes.


"Bigger is Better" may be "more natural" for Op / Sec but secs is "more natural" for "comparison of the fastest runtimes".


Or just make two charts ;-)


Why is c# so slow for Matmul compared to java?


I believe it's because the C# version has been written using rectangular arrays. This requires every array access to use a multiplication. The Java version uses arrays-of-arrays and hoists the inner array out before accessing it in the inner loop.

C# also has arrays-of-arrays, and could (should) be written in the same manner.


I've just done this and it has been merged. The benchmarks table and image haven't been updated yet. But this should bring the C# result to ~2s instead of 4.67s


Thanks for the explanation. Yes, this makes sense to me.


The implementation isn’t using any modern C# features?


Unfortunately it doesn't. The newest and hottest way to do this is to either use bespoke matmul from System.Numerics.Tensors or at least using Vector<T> for SIMD (which is trivial and not "the last mile" optimization it often seems to be).


Which modern features do you mean? Java and C# code looks similar to me.


How come not all benchmarks appear for all languages? For example, Zig's bar appears to be lower than C's by virtue of sudoku and bedcov being missing.


Someone needs to write sudoku and bedcov for Zig.


What is this supposed to demonstrate?

There appears to be roughly the same code structure, ported to every language, while for some languages, arbitrary optimizations are introduced (such as using `array` instead of `list` in Python).

But nobody working in Python uses matrix multiplication code written in Python. They use NumPy, which is a de facto standard library for people working in the relevant fields. It's as much part of "Python" as list comprehensions.

Without taking such real-world conventions into account, such comparisons say essentially nothing about the languages involved (and their all-important ecosystems).


Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries. Including external libraries is pointless since any language can call any library, ultimately, so at best you'd be comparing the FFI overhead.

So this shouldn't be taken as "how fast does a real-world Python program do at matrix multiplication", since of course no one writes real-world programs doing matrix multiplication in pure Python. But it can show the relative speed of pure Python at purely computational tasks.


> But it can show the relative speed of pure Python at purely computational tasks.

But that's irrelevant if nobody uses "pure Python" for computational tasks.

It's like asking "how well do these languages run on a Lisp machine from 1979?". It simply has no relevance to real-world considerations today.


Everyone uses pure Python for purely computational tasks. NumPy or PyTorch have far too few operations to cover every computational task. E.g. many of the operations in pandas are written in pure Python, and at times I've found that using specialised libraries could give a 10x improvement, but with a blow to developer experience compared to plain Python.


Pandas/NumPy use a lot of C, and SciPy a lot of Fortran.


No one said they don't. It's just that there are, and always will be, parts which are computationally heavy but written in Python. E.g. pandas map is 300x slower than numpy map [1] for the same operation. And not everything can be vectorized, and even if some complex function could be vectorized, it would involve multiple passes over memory for the same array.

[1]: https://www.exxactcorp.com/blog/Deep-Learning/how-to-speed-u...


Every time you use a for loop, addition or multiplication operators, array indexing, or assignment to a numerical variable in Python, you are using exactly what is benchmarked in matmul.


Nobody uses python for numerical tasks because it is atrociously slow at those tasks. This benchmark reiterates that.


> Generally, the point of language benchmarks is to show how the languages compare at solving the same problem, without external libraries.

If this were the only concern, it should be valid to create a blob of binary and call that function for the optimal performance. (Python's ctypes makes this very easy, for example.) So you want an idiomatic solution instead, and Numpy for matrix computation is considered idiomatic in Python, even more than the pure Python code.


No, creating a binary blob is not a realistic option, and it's still not a Python/Java/C++ whatever solution. It would be handwritten assembly that you use the FFI facilities to call into. So, it's not fair game in any honest benchmark.

And even if using a C library is idiomatic Python, it still has no place in a language benchmark. It's a C library, not a Python implementation.


> No, creating a binary blob is not a realistic option, and it's still not a Python/Java/C++ whatever solution.

Once again, "realistic" is subjective and I would say no "realistic" user will try to multiply arbitrarily-sized matrices in pure Python. (I can see small enough matrices, like 3x3 or 4x4, might be different.) And...

> So, it's not fair game in any honest benchmark. And even if using a C library is idiomatic Python, it still has no place in a language benchmark. It's a C library, not a Python implementation.

...you have correctly figured out that it's unfair for anyone using your (vague) definition of "realistic". It's also true that your proposal is unfair for anyone using my definition of "idiomatic", however. I can try to defend my definition with quantitative arguments, but I don't feel like doing so. In fact I tend to ignore most "language benchmarks" because it is virtually impossible to make them reasonably fair. This one is no exception.

Best "language benchmarks" tend to be more like language showcases with useful commentaries, there will be no single winner but you will get a good sense of pros and cons of each language-implementation-strategy combination. They are generally not advertised as "benchmarks", of course.


How else would you find the difference between using that library from C and using that library from Python?

pidigits gcc #1

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...

pidigits Python3 #3

https://benchmarksgame-team.pages.debian.net/benchmarksgame/...


It demonstrates that Python needs libraries like NumPy. Few problems are more heavily optimized than matrix multiplication in practice, so comparing matrix multiplication benchmarks across languages with NumPy is not representative of real-world performance for most programming use cases.

It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).


> It demonstrates that Python needs libraries like NumPy.

People use matrix multiplication libraries (often written in Assembly) from every language if they really care about performance. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.

> It also means that adding performance to an existing Python program requires dropping into a different language

As stated above, this applies to all languages. BLAS routines used for serious numerical work are hand-vectorized Assembly fine-tuned for each processor architecture, written by a few hyper-experts who do nothing else.

Nobody who needs performant matrix multiplication from C thinks "hey, let me just write two nested loops".


This is really overstating how hard it is to compete with matrix multiply libraries. The main reason those libraries are so big and have had so much work invested in them is their generality: they're reasonably fast for almost any kind of inputs.

If you have a specific problem with constraints you can exploit (e.g. known fixed dimensions, sparsity patterns, data layouts, type conversions, etc.), it's not hard at all to beat MKL, etc... if you are using a language like C++. If you are using python, you have no chance.

It isn't even necessarily that different from a few nested loops. Clang is pretty damn good at autovectorizing, you just have to be a little careful about how you write the code.


If you need a really fast compiled inner loop then you can often implement it in Python using Numba. Using that you can easily implement something like sparse matrix multiplication in Python.


> If you are using python, you have no chance.

Of course you do. Every special-case multiplication algorithm you might need already has an optimized implementation that you can just `pip install`, and move on with what you're actually working on.

The whole scientific computing world runs on Python. Straightforward numerics code using NumPy tends to murder C/C++ code in regard to performance, unless that code is written by people who make a living hand-optimizing computational routines.


> The whole scientific computing world runs on Python

If you ignore the majority of scientific code running on supercomputers doing most of science in C++ and Fortran.

Even in areas where python is used, the majority of the compute runs on C/C++/Fortran, with a little python as glue.

If you think numpy (written in c/c++) murders c/c++ code, you should learn about HPC, where really high performance happens. They don't use numpy.


> This is really overstating how hard it is to compete with matrix multiply libraries.

I'll file this under "talk is cheap". :) I tried it last year and got within 50% of BLAS. Getting above that is tons of work. Which you have to repeat for every processor model, NUMA, and every combination of matrix type (long thin, short wide, etc).


This gets to 90% of BLAS: https://github.com/dsharlet/array/blob/38f8ce332fc4e26af0832...

The less involved versions still get ~70%.

But this is also quite general. I’m claiming you can beat BLAS if you have some unique knowledge of the problem that you can exploit. For example, some kinds of sparsity can be implemented within the above example code yet still far outperform the more general sparsity supported by MKL and similar.


I don't believe you. OpenBLAS is multithreaded but the code you posted is single-threaded. The inner kernel isn't very well optimized either. So, no way.


I should have mentioned somewhere, I disabled threading for OpenBLAS, so it is comparing one thread to one thread. Parallelism would be easy to add, but I tend to want the thread parallelism outside code like this anyways.

As for the inner loop not being well optimized... the disassembly looks like the same basic thing as OpenBLAS. There's disassembly in the comments of that file to show what code it generates, I'd love to know what you think is lacking! The only difference between the one I linked and this is prefetching and outer loop ordering: https://github.com/dsharlet/array/blob/master/examples/linea...


So I actually tested your code: https://gist.github.com/bjourne/c2d0db48b2e50aaadf884e4450c6...

On my machine single-threaded OpenBLAS (called via NumPy) multiplies two single precision 4096x4096 matrices in 0.95 seconds. Your code takes over 30 seconds when compiled with clang++. And yes I used -O3, -march=native, and all that jazz. Btw, your code crashes g++, which doesn't necessarily mean that it is incorrect, but it indicates that the code may be difficult for the compiler to optimize. For comparison, my own matrix multiplication code (https://github.com/bjourne/c-examples/blob/master/libraries/...) run in single-threaded mode takes 0.89 seconds. Which actually beats OpenBLAS, but OpenBLAS retakes the lead for larger arrays when multi-threading is added. You can look at my code for how to write a decent inner kernel. Writing it in pure C without intrinsics and hoping that the compiler will optimize it definitely will not work.

It also is not true that "Parallelism would be easy to add". Unless your algorithm is designed from the start to exploit multi-threading, attempting to bolt it on later will not yield good performance.


The makefile asks for -O2 with clang. I find that -O3 almost never helps in clang. (In gcc it does.)

Here's what I see:

   $ clang++ --version
   clang version 18.0.0

   $ time make bin/matrix
   mkdir -p bin
   clang++ -I../../include -I../ -o bin/matrix matrix.cpp  -O2 -march=native -ffast-math -fstrict-aliasing -fno-exceptions -DNDEBUG -DBLAS  -std=c++14 -Wall -lstdc++ -lm -lblas
   1.25user 0.29system 0:02.74elapsed 56%CPU (0avgtext+0avgdata 126996maxresident)k
   159608inputs+120outputs (961major+25661minor)pagefaults 0swaps

   $ bin/matrix
   ...
   reduce_tiles_z_order time: 3.86099 ms, 117.323 GFLOP/s
   blas time: 0.533486 ms, 849.103 GFLOP/s

   $ OMP_NUM_THREADS=1 bin/matrix
   ...
   reduce_tiles_z_order time: 3.89488 ms, 116.303 GFLOP/s
   blas time: 3.49714 ms, 129.53 GFLOP/s
My inner loop in perf: https://gist.github.com/dsharlet/5f51a632d92869d144fc3d6ed6b... BLAS inner loop in perf (a chunk of it, it is unrolled massively): https://gist.github.com/dsharlet/5b2184a285a798e0f0c6274dc42...

Despite being on a current-ish version of clang, I've been getting similar results from clang for years now.

Anyways, I'm not going to debate any further. It works for me :) If you want to keep writing code the way you have, go for it.


-O2 did improve performance significantly, but it's still 0.7 s for NumPy and 5.1 seconds for your code on 4096x4096 matrices. Either you're using a slow version of BLAS or you are benchmarking with matrices that are comparatively tiny (384x1536 is nothing).


BLAS is getting almost exactly 100% of the theoretical peak performance of my machine (CPU frequency * 2 fmadd/cycle * 8 lanes * 2 ops/lane); it's not slow. I mean, just look at the profiler output...

You're probably now comparing parallel code to single threaded code.


No, multi-threaded OpenBLAS improves performance to 0.15s.


I dunno man. My claim was that for specific cases with unique properties, it's not hard to beat BLAS, without getting too exotic with your code. BLAS doesn't have routines for multiplies with non-contiguous data, various patterns of sparsity, mixed precision inputs/outputs, etc. The example I gave is for a specific case close-ish to the case I cared about.

You're changing it to a very different case, presumably one that you cared about, although 4096x4096 is oddly square and a very clean power of 2... I said right at the beginning of this long digression that what is hard about reproducing BLAS is its generality.


When I run your benchmark with matrices larger than 1024x1024 it errors out in the verification step. Since your implementation isn't even correct I think my original point about OpenBLAS being extremely difficult to replicate still stands.


You need to have enough experience to be able to be a little careful, though. This is generally true for most languages; loop unrolling works even better in Python, for example, but many Python programmers aren't even aware of this possibility.


> People use matrix multiplication libraries (often written in Assembly) from every language if they really care about performance. That's because such libraries incorporate 100 PhD theses' worth of tricks that no individual can hope to reinvent in the course of solving another problem. There is absolutely nothing special about Python in this context.

You don't have to use Assembly.

Case in point, this is as fast as OpenBLAS: https://github.com/mratsim/weave/tree/master/benchmarks/matm...


That's a bold claim. Do you have any benchmarks to back it up? Even if it was as fast as OpenBLAS on your machine when you benchmarked it that doesn't mean it will be as fast on others' machines.


The code is open-source

I have benches on i5-5257U (dual core from old MBP15), i9-9980XE (Skylake-X 18 cores), Dual Xeon Gold 6132, AMD 7840U.

See: https://github.com/mratsim/laser/blob/master/benchmarks%2Fge...

And using my own threadpool instead of OpenMP - https://github.com/mratsim/weave/issues/68#issuecomment-5692... - https://github.com/mratsim/weave/pull/94


Can you explain how to build your project and how to run the benchmarks? Cause I just spent a few hours disproving another poster's claim of getting OpenBLAS-like performance and I won't want to waste more time (https://news.ycombinator.com/item?id=38867009). While I don't know Nim very well, I dare claim that you don't get anywhere near OpenBLAS performance.


First we can use Laser, which was my initial BLAS experiment in 2019. At the time in particular, OpenBLAS didn't properly use the AVX512 VPUs. (See thread in BLIS https://github.com/flame/blis/issues/352 ), It has made progress since then, still, on my current laptop perf is in the same range

Reproduction:

- Assuming x86 and preferably Linux.

- Install Nim

- Install a C compiler with OpenMP support (not the default MacOS Clang)

- Install git

The repo submodules MKLDNN (now Intel oneDNN) to bench vs Intel JIT Compiler

    git clone https://github.com/mratsim/laser
    cd laser
    git submodule update --init --recursive
    nim cpp -r --outdir:build -d:danger -d:openmp benchmarks/gemm/gemm_bench_float32.nim

This should output something like this:

    Laser production implementation
    Collected 10 samples in 0.230 seconds
    Average time: 22.684 ms
    Stddev time: 0.596 ms
    Min time: 21.769 ms
    Max time: 23.603 ms
    Perf: 624.037 GFLOP/s

    OpenBLAS benchmark
    Collected 10 samples in 0.216 seconds
    Average time: 21.340 ms
    Stddev time: 3.334 ms
    Min time: 19.346 ms
    Max time: 27.502 ms
    Perf: 663.359 GFLOP/s

    MKL-DNN JIT AVX512 benchmark
    Collected 10 samples in 0.201 seconds
    Average time: 19.775 ms
    Stddev time: 8.262 ms
    Min time: 15.625 ms
    Max time: 43.237 ms
    Perf: 715.855 GFLOP/s

Note: the Theoretical peak limit is hardcoded and used my previous machine i9-9980XE.

It may be that your BLAS library is not named libopenblas.so; you can change that here: https://github.com/mratsim/laser/blob/master/benchmarks/thir...

Implementation is in this folder: https://github.com/mratsim/laser/tree/master/laser/primitive...

in particular, tiling, cache and register optimization: https://github.com/mratsim/laser/blob/master/laser/primitive...

AVX512 code generator: https://github.com/mratsim/laser/blob/master/laser/primitive...

And generic Scalar/SSE/AVX/AVX2/AVX512 microkernel generator (this is Nim macros to generate code at compile-time): https://github.com/mratsim/laser/blob/master/laser/primitive...

I'll come back later with details on how to use my custom HPC threadpool Weave instead of OpenMP (https://github.com/mratsim/weave/tree/master/benchmarks/matm...). As a side bonus it also has parallel nqueens implemented.


The compilation command errors out for me:

    /home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(77, 8) Warning: use `std/os` instead; ospaths is deprecated [Deprecated]
    /home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(101, 8) template/generic instantiation of `bench` from here
    /home/bjourne/p/laser/benchmarks/gemm/gemm_bench_float32.nim(106, 21) template/generic instantiation of `gemm_nn_fallback` from here
    /home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm.nim(85, 34) template/generic instantiation of `newBlasBuffer` from here
    /home/bjourne/p/laser/benchmarks/gemm/arraymancer/blas_l3_gemm_data_structure.nim(30, 6) Error: signature for '=destroy' must be proc[T: object](x: var T) or proc[T: object](x: T)

Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.


Ah,

It was from an older implementation that wasn't compatible with Nim v2. I've commented it out.

If you pull again it should work.

> Anyway the reason for your competitive performance is likely that you are benchmarking with very small matrices. OpenBLAS spends some time preprocessing the tiles which doesn't really pay off until they become really huge.

I don't get why you think it's impossible to reach BLAS speed. The matrix sizes are configured here: https://github.com/mratsim/laser/blob/master/benchmarks/gemm...

It defaults to 1920x1920 * 1920x1920. Note, if you activate the benchmarks versus PyTorch Glow, in the past it didn't support non-multiple of 16 or something, not sure today.

Packing is done here: https://github.com/mratsim/laser/blob/master/laser/primitive...

And it also support pre-packing which is useful to reimplement batch_matmul like what CuBLAS provides and is quite useful for convolution via matmul.


Ok, I will benchmark this more when I have time. My gut feeling is that it is impossible to reach OpenBLAS-like performance without carefully tuned kernels and without explicit SIMD code. Clearly, it's not impossible to be as fast as OpenBLAS, otherwise OpenBLAS itself wouldn't be that fast, but it is very difficult and takes a lot of work. There is a reason much of OpenBLAS is implemented in assembly and not C.


Anything fast in Python needs to be written in another language, and if your day job isn't multiplying matrices, then you're probably SoL.

Assume the author knows about BLAS, and that the point is to benchmark the language not the FFI. Most people don't actually spend their time bashing vectors together.


You don't need any C knowledge to use numpy. In fact, its conceptual similarity with Matlab is possibly the single most important reason for its popularity. Many other problems do need specialized treatments that would indeed require other languages, but numpy is not a good counterexample.


You're missing the point. The point is that, for any application, Python needs an underlying C library to be fast. So if you need to solve problems where no such library exists, Python is slow.

In other words, Python IS slow, but it can call fast code written in other languages.


That's true but irrelevant here, especially given the original claim:

> adding performance to an existing Python program requires dropping into a different language

...is demonstrably false for a significant class of programs that can be rewritten into the array paradigm. The benchmark should have picked other numerical problem to avoid this issue. The Computer Language Benchmarks Game, for example, uses the `n-body` problem for this purpose.


> It also means that adding performance to an existing Python program requires dropping into a different language, which is not only complicated, but also requires engineers capable in both Python and C (or similar).

It's actually not that bad. I think it's part of the reason Python became so popular, it's fairly easy to write C code and expose it via python.


> It demonstrates that Python needs libraries like NumPy.

You need libraries to do _anything_ in Python. It's interpreted, so literally any call you make in Python will eventually make it back to something written in a compiled language (like a call to NumPy commands).


> What is this supposed to demonstrate?

It is supposed to show the performance of a language when you have to implement a new algorithm in the language. This may happen if you can't find the algorithm in existing libraries. If you don't like matmul, ignore it and focus on nqueen, sudoku and bedcov.


So you'd be measuring the speed of loops over calls out to C or Fortran (the NAG library), or their vectorisation, and only testing the cost of data-representation changes.


I think this shows the value of programmer productivity over performance at all costs. Python is one of the most popular languages despite having performance issues for complex algorithms. Users value clarity and ease of expression over performance. That's why Python is primarily used as glue code in these complex tasks.


[flagged]


Reliably benchmarking optimizing VMs like Java and JavaScript is a research problem.

See “Virtual machine warmup blows hot and cold”. The authors run a variety of benchmarks (with a pristine methodology) and find 43% of them provide “bad inconsistent” results.

Paper summary: https://blog.acolyer.org/2017/11/07/virtual-machine-warmup-b...

Paper: https://dl.acm.org/doi/pdf/10.1145/3133876


I would expect Mojo to perform better in these settings.


Mojo is on the list and performed well. Have a look


It's 2x slower than C in matmul, which is something I wouldn't expect Mojo to be slow at.


2x slower than C, while not optimal at all, is not something I would call 'slow' in this context. Still pretty strong considering how much friendlier it is to write code in, for the average user of the languages discussed here.


IMO it looks way too verbose compared to the (faster) Julia implementation in this repo.



