You generally want to start with the widest frames as close to the bottom of the graph as you have appropriate context for, then work upward from there to arrive at an explanation.
Some recent-ish examples:
- I was optimizing some code whose main tasks should have been networking and ML. There was a suspiciously wide chunk with a name indicating something about date times. A bit later we had a solid 10% improvement win.
- I had some code with strange behavior under load, undergoing some kind of performance phase transition before the CPU (or other normal resources) was anywhere near maxed out. I grabbed a flamegraph under normal conditions and under load. The `main` loop was wider, but that's not helpful. Walking up the graph a little, next to some function the code was supposed to be calling there was a huge block named `sched_yield` which didn't exist in the normal trace. The root cause was just a strange (broken) concurrency mechanism in some underlying logging code, causing logging to pile up and hog all the resources past a certain request threshold.
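As a toy sketch of that second failure mode (not the actual code from the story, and the class name here is made up): a "lock" implemented by spinning and yielding rather than blocking. Under contention, the waiting threads burn their time inside the yield call, which is exactly what surfaces as a wide `sched_yield` block in a flamegraph.

```python
import threading
import time

# Hypothetical illustration: a spin-then-yield lock. Each failed acquire
# attempt yields the CPU (time.sleep(0) is roughly analogous to
# sched_yield(2)), so under heavy contention most samples land in the
# yield, not in useful work.
class SpinYieldLock:
    def __init__(self):
        self._flag = threading.Lock()  # used only as a test-and-set flag

    def acquire(self):
        while not self._flag.acquire(blocking=False):
            time.sleep(0)  # give up the timeslice and try again

    def release(self):
        self._flag.release()

log_lock = SpinYieldLock()
lines = []

def log(msg):
    # Every logging call funnels through the one spin lock.
    log_lock.acquire()
    try:
        lines.append(msg)
    finally:
        log_lock.release()

threads = [
    threading.Thread(target=lambda i=i: [log(f"t{i} {j}") for j in range(100)])
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(lines))  # 400: correct output, but throughput collapses as threads are added
```

The result is still correct, which is why nothing looked wrong until the profile showed where the time actually went.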
The colors are a red herring. They exist just to make it easier to keep track of where you are in the graph, much like how ragged text is easier to read than justified.
Height is sometimes interesting. It represents a deep call stack. I find that happens most frequently in error handling code, and if something is called enough to make its way into a sampling-based flamegraph it's often worth taking a peek at for other reasons. Runtime is a function of width though, and height doesn't play a role.
Another point worth keeping in mind is that a lot of the benefit is in being able to quickly find a plausible explanation for the performance issue. If you find that your application is spending most of its time in `read` calls, perhaps you'll want to do less of that, perhaps you'll want to submit a perf improvement to the kernel (unlikely?), but what you definitely want to do is look at the next level above the offending code, then the level above that, .... What you'll find depends on your application, but being slow because of locking and synchronization is very different from being slow because of error handling or just executing the happy path a ton of times, and that insight will help govern your next actions.
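To make the "do less of that" option concrete for the `read` case, here's a small self-contained sketch (file path and chunk sizes are arbitrary choices for illustration): reading the same data with tiny unbuffered reads versus one large read, counting the syscall-level reads each takes.

```python
import os
import tempfile

# Write a 64 KiB scratch file to read back.
data = b"x" * 65536
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name

def count_reads(chunk_size):
    """Read the whole file via os.read and count the non-empty reads."""
    fd = os.open(path, os.O_RDONLY)
    n = 0
    while os.read(fd, chunk_size):
        n += 1
    os.close(fd)
    return n

small = count_reads(64)      # 1024 reads of 64 bytes each
large = count_reads(65536)   # the whole file in one (or very few) reads
os.unlink(path)

print(small, large)
```

A flamegraph of the first variant would show a wide `read` tower; the frames above it (here, `count_reads`) tell you which caller to fix, and buffering fixes it without touching the kernel.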