I think to many programmers, assembly is the "GOTO" of programming languages: from the day you start learning to program, you are told that all this fancy high-level-language stuff exists so you do not have to deal with assembly. So most people never go there.
I did go there, briefly, about ten years ago. It was wicked fun. But all in all, I may have written maybe 20 or 30 instructions of assembly in total. I did try to rewrite a few small, heavily-used functions from our code base in assembly, only to discover that the code I came up with was practically identical to what the compiler emitted. At that point I figured that the people who told me that "you can't beat the compiler" were probably right and called it a day[0]. Alas, I never had the hardcore performance requirements that would make me go back there. But it was fun to get a taste of it.
[0] At the same time, I was kind of proud that I did not get beat by the compiler. Then again, those functions were fairly trivial.
> At that point I figured that the people who told me that "you can't beat the compiler" were probably right and called it a day
On that subject...my understanding [0] is:
These days, the best bet for beating the compiler is to use vendor intrinsics (for SIMD, encryption, bit-twiddling, etc). Shaving an instruction off the inner loop might give you a few percent; using SIMD lets you operate on 256 or 512 bits per instruction instead of 8, 16, 32, or 64. You might be able to show your inner loop is memory-bound (and thus prove further improvements have to come from algorithmic improvements / better cache locality, rather than continuing to fiddle with instructions).
The compiler automatically uses SIMD sometimes, but it can't do so reliably:
* The transformations require things the compiler isn't allowed to do, like increasing alignment of key variables or altering the larger algorithm.
* Code that might run on older processor revisions needs multiple implementations selected at runtime. I think gcc has some magic extension ("target_clones"?) to do this relatively easily; otherwise you might need to write your own logic to decide which function pointer to use.
Note that each "vendor intrinsic" matches one assembly instruction, and it's valuable to understand assembly while writing them, but the actual code you check in can end in .cc (C++) or .rs (Rust) or whatever. Doing so means it can be inlined into functions written in the higher-level language, you don't have to encode knowledge about the platform's calling convention into your code, etc.
[0] Not from personal experience. Corrections welcome.
Your understanding is very good for someone without personal experience.
Automatic compiler use of SIMD is rarely that great unless you're in a nice big loop doing nice regular things. I've pretty much never seen it on the stuff I do.
Using intrinsics gets you 95% of the way there. I reach for asm only when I absolutely have to. It is a huge PITA. My irritation at the "bro, just write a .s file" people peaks when I'm trying to write a 200LOC function with 10 different variants based on (say) pipeline depth and unroll width. Yeah, because I'd like to spend the next year doing register allocation by hand.
The compiler is really good at doing routine stuff, and when I hand-edit the asm to do things that better fit my idea of regalloc and scheduling I usually make things worse. Where the compiler falls down is instruction selection and stuff that borders on algorithm design.
For example, I built a shift-or string matcher in SIMD where a first-stage was OK to have false positives (positives in shift-or are represented by zeros in the bit vector). I was able to get a big performance boost by tolerating these false positives when shifting SIMD bits and bringing in some zeros, but no compiler is going to know that a few false positives are OK in that circumstance.
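For readers who haven't met shift-or: here is the plain scalar version (the classic bitap algorithm, my sketch, not the commenter's SIMD code). The inverted convention, where a 0 bit means "still matching", is exactly what makes "zeros shifted in are only false positives" a tolerable property:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Scalar shift-or (bitap) exact matcher. A 0 bit in `state` at position i
 * means "the first i+1 pattern characters match ending here". Returns the
 * index of the leftmost match, or -1. Patterns up to 63 chars only, to
 * keep the sketch in one 64-bit word. */
static int shift_or_find(const char *text, const char *pat) {
    size_t m = strlen(pat), n = strlen(text);
    if (m == 0 || m > 63) return -1;
    uint64_t mask[256];
    for (int i = 0; i < 256; i++) mask[i] = ~0ULL;
    for (size_t i = 0; i < m; i++)
        mask[(unsigned char)pat[i]] &= ~(1ULL << i);
    uint64_t state = ~0ULL;
    for (size_t i = 0; i < n; i++) {
        /* the shift brings in a 0: "a match could start here" */
        state = (state << 1) | mask[(unsigned char)text[i]];
        if ((state & (1ULL << (m - 1))) == 0)
            return (int)(i - m + 1);
    }
    return -1;
}
```

In a SIMD version, shifting across lane boundaries brings in extra zeros; as the comment notes, those only create false positives, which a second stage can filter.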
IMO the best way to work is with intrinsics, a tiny bit of embedded asm for things that you can't get intrinsics for (I had to resort to gcc asm blocks to make a cmov happen) and close inspection of your object file (at least on the hotspots) to ensure that the code you're getting is what you think you're getting. It's possible to make minor screwups and suddenly see dozens of extra instructions pushing everything in and out of memory for no good reason.
The other place you can beat the compiler is by doing deeper/wider pipelining of branch-free code. This is a dark art. Often going branch-free is 10-20% worse than branchy when you have 1 iteration happening at a time, but it scales better when you are doing lots of stuff at once - if you have (say) 12 different copies of your loop body happening in one iteration, and there's a mildly unpredictable branch per loop body, the branch miss on one iteration stops all the others from progressing too!
I occasionally blog on these things at branchfree.org and have some more low-level stuff brewing shortly.
> These days, the best bet for beating the compiler is to use vendor intrinsics (for SIMD, encryption, bit-twiddling, etc)
Well, I would not count that as "beating" the compiler so much, but as "using it", "helping it", or something like that. ;-)
Anyway, when I still wrote C code for a living, we were using the OpenWatcom compiler, and as far as I could figure out at the time, it made no effort whatsoever to support SIMD, so the only way to use those instructions was to drop to assembly. (OpenWatcom's support for inline assembly and even inline binary code was very nice, though.)
When I hobby-program, I use assembly. I find it extremely relaxing; doing most things in assembly requires attention and concentration, so it's like solving a puzzle or like physically building something; I don't think you go into 'problem solving' mode very much, so maybe it's a break from that.
Higher level languages let you skip most of the 'menial' work of laying out the code, so your brain power gets spent a pretty different way. I like each. Obviously some languages are better suited to certain tasks.
I don't have a point; I'm just sharing what sprang to mind when I read your comment.
I will say that for something like C, having some experience with assembly makes it a lot easier to get a feel for pointers and the stack. I genuinely think everybody should try it at least once; it's just enjoyable talking almost-directly to the machine.
I always describe it as: assembly is dead simple; it is only hard to write nontrivial programs in it. Manual register allocation, all the other micromanagement, and the cascade of edits required whenever you change the logic: that is hell.
During my programming socialization, assembly was always characterized as this terrible thing they did back in the 1960s, so I was reluctant to even try it. On top of that, I had read quite a bit about how compilers were so good nowadays that it was not worth the effort.
My initial motivation to even give assembly a try was not performance, but accessing CPU-specific features (RDTSC).
Another old chestnut: learning some assembly lets you read it, even if you never need to write any. There is real value in understanding which instructions your compiler is emitting.
Beating the compiler is made easier if you can look at the compiler's answers.