
Here's a godbolt link: https://godbolt.org/z/Z6vYAS

Looking at the disassembly the machine code is ~2x the size for the exception versions, but most of it is on the cold path.

The exception version has a conditional branch to do the "== errorInt" part. The non-exception version manages to avoid the conditional branch by using a conditional move, which would avoid a pipeline stall on a branch mis-prediction.

Edit: I think this disproves desc's point ("If your application is slow <because of exceptions> it'll show up in a profiler"). I.e. there's probably a small cost to exceptions even when they are not thrown, and it will be spread across your entire program rather than showing up as a single spike in a profiler.
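For anyone not clicking through, here's a minimal sketch of the kind of pair being compared on godbolt (the function and constant names here are mine, not from the link):

```cpp
#include <stdexcept>

// Error-code style: a sentinel value signals failure,
// and the caller must compare against it.
constexpr int errorInt = -1;

int parseOrError(int x) {
    if (x < 0)
        return errorInt;
    return x * 2;
}

// Exception style: the failure path leaves the function entirely,
// so the hot path carries no result-checking code in the caller.
int parseOrThrow(int x) {
    if (x < 0)
        throw std::runtime_error("negative input");
    return x * 2;
}
```

The "== errorInt" comparison mentioned above is the check a caller of parseOrError has to do on every call.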



> The non-exception version manages to avoid the conditional branch by using a conditional move, which would avoid a pipeline stall on a branch mis-prediction.

Branches are usually superior to conditional moves for predictable conditions, as they break dependency chains. And in the case where the exceptional code path is taken, the cost of the misprediction is dwarfed by the cost of unwinding the stack.

This is interesting, actually: the fact that the compiler uses a conditional move in the error-checking case could mean it has no useful branch-probability model for that branch. But even with __builtin_expect, the compiler still prefers the conditional move.
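For reference, the hint in question looks like this (a GCC/Clang builtin; the unlikely wrapper macro and function name are my own):

```cpp
#include <stdexcept>

// Wrap the GCC/Clang builtin; !!(x) normalizes the value to 0 or 1.
#define unlikely(x) __builtin_expect(!!(x), 0)

int checkedDouble(int x) {
    if (unlikely(x < 0))   // hint to the compiler: this branch is cold
        throw std::runtime_error("negative input");
    return x * 2;
}
```

The hint only influences block layout and branch-probability estimates; as noted above, it doesn't force the compiler away from a cmov in the error-code case.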


> branches are usually superior to conditional moves for predictable conditions as they break dependency chains.

Interesting, not heard that before. Do you know of somewhere I can read about this?


Agner Fog is the usual go-to reference. For this specific case, you can also google any of Linus's rants on conditional moves (they used to be very high latency, although today they are not so much of an issue). This one, for example: https://yarchive.net/comp/linux/cmov.html


It is complicated to describe when cmov is slow and when it is fast. As a rule of thumb, if data operations in the next loop iteration depend on a cmov in this one, cmov will be slow. If not, it is very, very fast. Use of cmov can make quicksort 2x as fast.

Gcc absolutely won't generate two cmov instructions in a basic block. Clang, for its part, abandons practically all optimization of loops that could conceivably generate a throw.


Nice. Like every other topic, there's more complexity if you keep looking harder.


The problem with benchmarks is that I never see any that estimate the impact of the extra code size on programs the size of, say, Photoshop. It takes annoyingly long to load such a program. Is code size part of that problem? Probably. Is the bloat added by exceptions significant? I'd like to know.


When a program takes too long to load, it is because the program is doing too much non-exception work. The exception-handling code isn't even loaded unless something throws during loading, which would just be bad design.


I think the exception code _is_ loaded. It is only a theoretical possibility that loading it could be avoided.

I just built the following code with g++ v7.4 (from MSYS64 on Windows):

    #include <stdlib.h>
    #include <stdexcept>

    void exitWithMessageException() {
        if (random() == 4321)
            throw std::runtime_error("Halt! Who goes there?");
    }

    int main() {
        exitWithMessageException();
        return 1234;
    }
The generated code mixed the exception handlers with the hot-path code. Here are the address ranges of relevant chunks:

    100401080 - Hot path of exitWithMessageException
    100401099

    10040109a - Cold path of exitWithMessageException
    10040113f

    100401140 - Start of main


Interesting, GCC 7.x seems to simply put the cold branch on a separate nop-padded cache line.

GCC 9 [1] instead moves the exception-throwing branch into a cold clone of the exitWithMessageException function. The behaviour seems to have changed starting from GCC 8.x.

[1] https://godbolt.org/z/PKKZ8m


Ooo, fancy. It is still a long way from just that to actually getting the savings in a real program running on a real operating system. For example, if I have thousands of source files, each with a few hundred bytes of cold exception handlers, do they get coalesced into a single block for the whole program?


Coalescing functions in the same section should be the linker's job, yes.


Code paths introduced in order to execute any potential stack unwinding are inefficient and make your code slow, especially in tight loops. This was common knowledge back in the 2000s.


Common knowledge, but not correct. Code to destroy objects has to be generated for regular function returns, and is jumped into by the exception handler too. Managing resources by hand, instead, would also require code, but you have to write it. Its expense arises from its fragility.
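The point can be seen in a tiny RAII example: the destructor below is the same code whether the function returns normally or unwinds, so the cleanup has to exist in some form either way (class and function names here are illustrative):

```cpp
#include <cstdio>
#include <stdexcept>

struct Resource {
    // The same destructor body serves the normal-return path
    // and the exception-unwinding path.
    ~Resource() { std::puts("released"); }
};

void mayThrow(bool fail) {
    Resource r;          // destroyed on return *and* on unwind
    if (fail)
        throw std::runtime_error("boom");
}
```

With manual error handling you would duplicate that release call at every early-return site, which is exactly the fragility the comment above refers to.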


What I meant was the stack-frame bookkeeping inserted into the assembly, which is not the same thing as calling free, and which slows things down.


But the relevant comparison is the cost of exception handling vs the cost of manual error checking.

Of course if you don't check for or otherwise handle errors, the program will be faster. It's literally doing less work.


OK, another Godbolt link: https://godbolt.org/z/SiRvBR

This one adds functions that call the exception-based and error code based functions in a simple for loop. Both handle the error.

Unless I've screwed up somewhere, I think the result is that in the exception case, the body of the inner loop contains 13 instructions, while the error code case contains 5.
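The godbolt link has the real code; a minimal sketch of the kind of driver loops meant here, with names of my own, would be:

```cpp
#include <stdexcept>

constexpr int errorInt = -1;

int stepOrError(int x) { return x < 0 ? errorInt : x + 1; }

int stepOrThrow(int x) {
    if (x < 0) throw std::runtime_error("negative");
    return x + 1;
}

// Error-code loop: a compare-and-branch on every iteration.
long sumWithErrorCode(int n) {
    long total = 0;
    for (int i = 0; i < n; ++i) {
        int r = stepOrError(i);
        if (r == errorInt) return -1;   // handle the error inline
        total += r;
    }
    return total;
}

// Exception loop: the handler sits entirely outside the loop body.
long sumWithExceptions(int n) {
    long total = 0;
    try {
        for (int i = 0; i < n; ++i)
            total += stepOrThrow(i);
    } catch (const std::runtime_error&) {
        return -1;                      // handle the error out of line
    }
    return total;
}
```

Both loops handle the error; the instruction-count difference quoted above comes from what the compiler has to keep live inside the loop body in each case.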

Also, the generated code for the exception case is harder to read and understand. When writing performance critical code I like to eye-ball the disassembly just to make sure the compiler didn't do anything unexpected. This task is hard enough already in non-trivial functions, I certainly don't want it getting any harder.



