Binary Search Is a Pathological Case for Caches (2012) (pvk.ca)
72 points by ot on Nov 24, 2014 | 10 comments


This article demonstrates why performance-intensive software systems (like database engines) implement many of their own data structures and algorithms, often not even reusing those implementations internally, and avoid standard libraries for many use cases. A general-purpose implementation can be quite suboptimal for any particular use case relative to one purpose-built for that spot in the software. It is a lot more work, but it can also dramatically improve real-world throughput and variance because you have visibility into exactly what the access pattern will be.

If you measure these things at large scales, you will see surprisingly smooth and regular constructive interference patterns that emerge from the interaction of boundary constants in the software and hardware. This is a difficult type of variance to eliminate. The obvious solution is to inject pseudo-randomization into scheduling and boundary constants. This dampens the patterns, but even if you turn the randomization knobs pretty high, the patterns do not go away! And you can't go too crazy, because that randomization has its own performance costs as the algorithms move further away from their theoretical optimum.
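As a rough sketch of what randomizing a boundary constant can look like (the line size and jitter range here are assumptions, not numbers from any particular system): shift each hot allocation by a pseudo-random number of cache lines so that identical buffers in different workers stop mapping to the same cache sets.

    #include <stdlib.h>

    #define CACHE_LINE 64           /* assumed line size */
    #define MAX_JITTER_LINES 64     /* up to 4 KiB of padding; an arbitrary choice */

    /* Allocate a buffer whose start is shifted by a pseudo-random number of
       cache lines. The caller frees *raw, not the returned pointer. */
    static void *alloc_jittered(size_t size, void **raw)
    {
        size_t jitter = (size_t)(rand() % MAX_JITTER_LINES) * CACHE_LINE;
        *raw = malloc(size + MAX_JITTER_LINES * CACHE_LINE);
        if (*raw == NULL)
            return NULL;
        return (char *)*raw + jitter;
    }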

Some latency-hiding computing architectures, like barrel processors, actually hash memory addresses in hardware, deliberately destroying locality so that you do not see these kinds of aliasing effects. However, most modern CPUs are specifically designed to exploit locality, so this technique generalizes poorly.


This to me signals that our hardware is just not designed for the kinds of workloads we actually want to run. Trying to control cache lines is akin to trying to drive a car from the back seat using a hanger, a pool cue, and rope. You can do it, but why not get in the driver's seat? Sadly, the modern CPU doesn't give you the kind of explicit control you really want over this very fast memory. You are always at the mercy of the specifics of how the automagic works.

This is where a GPU or a co-processor would be nice. When you can dedicate hardware and memory (cache and RAM) to specific processes, you get to write algorithms that can fully take advantage of the hardware. This can be very useful where, for example, you run a database server: you don't care about random other tasks being extremely fast (bash, cron, the SMTP server, etc. can all take a long time), but the all-important database should be very fast. And that's when you can run its sorting algorithms on a few dozen dedicated cores where you are allowed to directly place data into the on-die cache (which would no longer be a cache, but just another tier of RAM).


Manually managing the cache would defeat the purpose of a cache. Computing what to fill the cache with, and actually filling it, would likely take more time than handling the cache misses.

Even specifying the caching strategy (whether it be associativity, eviction policy, or something else) seems very unlikely to be beneficial to me. Caches are fast because they are hard-wired and close to the registers. For them to be useful, they need to be able to decide in the blink of an eye whether a value is already present or not. Doing that in software would be absurd. That is why CPU vendors try to find a strategy that performs well in a broad range of scenarios.

I have no idea what GPUs have to do with this, but co-processors are nothing new. What you are describing is a NUMA system with all the background stuff pinned to one node and your actual workload to the other (or both).

EDIT: Also, 99.9% of the time, the CPU is better at caching than a human could hope to be. You don't know the input data in advance, so how would you know what caching strategy is optimal? From what information do you draw conclusions as to how the cache should behave for optimal performance? There's a reason programmers don't manage registers manually (the compiler is almost always better at it); what gives you the impression that managing the cache manually would have a positive outcome?


While what you are saying is true for caches, I am talking about a CPU architecture where the near-die memory is not a cache at all. If I can request that my process get direct access to what used to be the L2 cache, with, let's say, a capacity of 4 MB, I could then place my 2 MB array that needs sorting into it, and sort it without ever leaving L2. I then relinquish control over this fast hardware buffer, so the next process can use it.

There are of course a lot of issues with this model in a time-sharing, multi-process OS: how do you request access, how do you ensure clean allocation, etc.? Those are problems that would need to be resolved. This does work well in PICs: you get some small amount of RAM (something like 64-512 bytes), and then you can attach a separate RAM chip via I2C or SPI. Perhaps if a CPU die were surrounded by a reasonable number of such buffers (something like 16 x 8 MB), we could write a completely different type of code.


The Intel architecture certainly has instructions both to preload data into the cache and to perform writes that bypass it. But some are a bit new, and I've no idea how to really use them from a high-level language. It feels like they'd be very useful. If I need to search some large blob for a single word, caching everything while I read is obviously very suboptimal.
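For what it's worth, they are exposed as compiler intrinsics in C/C++. A minimal sketch (the buffer, needle, and prefetch distance are made up; _MM_HINT_NTA asks for a non-temporal prefetch, and _mm_stream_si128 is a store that bypasses the cache):

    #include <stddef.h>
    #include <stdint.h>
    #include <xmmintrin.h>   /* _mm_prefetch, _mm_sfence */
    #include <emmintrin.h>   /* _mm_stream_si128, _mm_set1_epi32 */

    /* Scan a large buffer for one value, hinting that the data will not be reused. */
    static ptrdiff_t find_word(const uint64_t *buf, size_t n, uint64_t needle)
    {
        for (size_t i = 0; i < n; i++) {
            if ((i & 7) == 0)   /* once per 64-byte line: prefetch ~8 lines ahead */
                _mm_prefetch((const char *)&buf[i + 64], _MM_HINT_NTA);
            if (buf[i] == needle)
                return (ptrdiff_t)i;
        }
        return -1;
    }

    /* Fill a buffer with stores that bypass the cache (dst must be 16-byte aligned;
       tail handling omitted). */
    static void fill_streaming(uint32_t *dst, size_t n, uint32_t value)
    {
        __m128i v = _mm_set1_epi32((int)value);
        for (size_t i = 0; i + 4 <= n; i += 4)
            _mm_stream_si128((__m128i *)&dst[i], v);
        _mm_sfence();   /* order the streamed stores before later accesses */
    }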


This article is a great example of the very tangible effects caused by hardware properties that are often disregarded. Very interesting read!

Also, I can't help but note the (probably non-existent) connection to 3-way partitioning for quicksort (which is consistently ~10-15% faster than traditional quicksort, iirc). It is interesting how we seem to assume that surely factors of two must be the best choice, computers working in binary and all. But sometimes this is true only in theory and not in practice (binary search), sometimes it's good but not optimal (quicksort), and sometimes it is outright the provably worst choice (dynamic arrays)!
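If I read the article right, the pathology boils down to set-index arithmetic. Assuming a typical 32 KiB, 8-way, 64-byte-line L1 (my numbers, not the article's), there are 64 sets, so addresses 4 KiB apart land in the same set, and the large power-of-two strides of the first few binary-search probes over a big array produce exactly such addresses:

    #include <stdint.h>

    /* Which L1 set an address maps to, assuming 64-byte lines and 64 sets
       (32 KiB, 8-way); typical numbers, not taken from the article. */
    static unsigned l1_set(uintptr_t addr)
    {
        return (unsigned)((addr >> 6) & 63);
    }

    /* l1_set(p) == l1_set(p + 4096) for any p, so the midpoints probed early
       in a binary search over a large power-of-two-sized array keep landing
       in the same handful of sets. */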


I wish the author had included data sorted in a van Emde Boas layout, which should be the cache-friendly equivalent of a sorted array searched with binary search.
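Not the vEB layout itself, but its simpler cousin would also be interesting: the Eytzinger (BFS) layout stores the sorted array in breadth-first order of the implicit search tree, so the first few probes share a few cache lines instead of jumping across the whole array. A minimal C sketch, 1-indexed with t[0] unused (the array size and names are mine):

    #include <strings.h>   /* ffs() */

    #define N 1000
    static int a[N];       /* sorted input */
    static int t[N + 1];   /* Eytzinger (BFS) layout, 1-indexed */

    /* Fill t by an in-order walk of the implicit tree rooted at k = 1. */
    static int build(int i, int k)
    {
        if (k <= N) {
            i = build(i, 2 * k);
            t[k] = a[i++];
            i = build(i, 2 * k + 1);
        }
        return i;
    }

    /* Index in t of the smallest element >= x, or 0 if every element is < x. */
    static int lower_bound(int x)
    {
        int k = 1;
        while (k <= N)
            k = 2 * k + (t[k] < x);
        return k >> ffs(~k);
    }

    /* Usage: build(0, 1) once, then lower_bound(x) per query. */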


I believe the vEB layout is rather memory hungry.


Personally, I think trying to automagically control the cache at the hardware level has... issues. It far too often is, or becomes, a leaky abstraction.

The problem is that hardware is too inflexible. And trying to explicitly manage things in software, that is, from the operating system, has other issues: it's expensive, both in terms of time taken for the context switch and so on, and it tends to wipe out a large chunk of the cache too.

Modern processors are mainly cache, in terms of die area. There's plenty of room for adding new bits and pieces if it helps speed things up.

So, what's between the hardware and the OS? Well, what about an FPGA? You have the hardware read the cache if the data is present, but have the FPGA make the decisions about how to manage the cache: what to keep, what to toss, that sort of thing. This FPGA could also be used for basic system calls and scheduling, and perhaps even for specialized disk access (read: something like "this process can read in this range without consulting the OS"), since OS time can be most of the latency when reading from an SSD, with a system to fall back to a full syscall if necessary.

You could perhaps even implement the branch prediction unit in the FPGA.




