
Manually managing the cache would defeat the purpose of a cache. Computing what to fill the cache with, and then filling it, would likely take more time than simply handling the cache misses.

Even specifying the caching strategy (whether it be associativity, eviction strategy, or something else) seems very unlikely to be beneficial to me. Caches are fast because they are hard-wired and close to the registers. For them to be useful, they need to be able to decide in the blink of an eye whether a value is already present or not. Doing that in software would be absurd. That is why the CPU vendor tries to find a strategy that performs well in a broad range of scenarios.

I have no idea what GPUs have to do with this, but co-processors are nothing new. What you are describing is a NUMA system with all the background stuff pinned to one node and your actual workload to the other (or both).

EDIT: Also, 99.9% of the time, the CPU is better at caching than a human could hope to be. You don't know the input data in advance, so how would you know what caching strategy is optimal? From what information do you draw conclusions as to how the cache should behave for optimal performance? There's a reason programmers don't manage registers manually (the compiler is almost always better at it), so what gives you the impression that managing the cache manually would have a positive outcome?



While what you are saying is true for caches, I am talking about a CPU architecture where the near-die memory is not a cache at all. If I can request that my process get direct access to what used to be the L2 cache, with, let's say, a capacity of 4 MB, I could then place my 2 MB array that needs sorting into it and sort it without ever leaving L2. I then relinquish control over this fast hardware buffer so the next process can use it.
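Roughly the programming model I have in mind, as a minimal sketch. The scratchpad_acquire/scratchpad_release calls are a hypothetical OS interface (stubbed out with malloc here so the sketch compiles), not anything that exists today:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical OS interface: map the near-die SRAM (formerly L2) into
     * this process's address space, or fail if another process holds it.
     * Stubbed with malloc/free purely so the sketch is self-contained. */
    void *scratchpad_acquire(size_t bytes) { return malloc(bytes); }
    void  scratchpad_release(void *buf)    { free(buf); }

    static int cmp_u32(const void *a, const void *b)
    {
        uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
        return (x > y) - (x < y);
    }

    /* Sort a 2 MB array entirely inside the on-die buffer. */
    int sort_in_scratchpad(const uint32_t *src, uint32_t *dst, size_t n)
    {
        size_t    bytes = n * sizeof(uint32_t);
        uint32_t *spad  = scratchpad_acquire(bytes);  /* fast SRAM, not a cache */
        if (!spad)
            return -1;                                /* buffer busy: fall back to DRAM */

        for (size_t i = 0; i < n; i++)                /* copy in once */
            spad[i] = src[i];

        qsort(spad, n, sizeof(uint32_t), cmp_u32);    /* every access stays on-die */

        for (size_t i = 0; i < n; i++)                /* copy the result out once */
            dst[i] = spad[i];

        scratchpad_release(spad);                     /* next process can use the buffer */
        return 0;
    }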

There are of course a lot of issues with this model in a time-sharing multiprocess OS: how do you request access, how do you ensure clean allocation, etc.? Those are problems that would need to be resolved. This does work well on PICs: you get some small amount of RAM (something like 64-512 bytes), and then you can attach a separate RAM chip via I2C or SPI, managing it explicitly (see the sketch below). Perhaps if a CPU die were surrounded by a reasonable number of such buffers (something like 16 x 8 MB), we could write a completely different type of code.
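For the PIC case the external RAM really is managed explicitly by the program. A rough sketch of what that looks like, assuming placeholder spi_select/spi_deselect/spi_transfer helpers for whatever the chip's SPI driver provides; the command bytes follow the 23LC1024-style SPI SRAM family and are illustrative only:

    #include <stddef.h>
    #include <stdint.h>

    /* Placeholder board-support functions (assumed, not a real API). */
    void    spi_select(void);            /* pull the SRAM's CS line low       */
    void    spi_deselect(void);          /* release CS                        */
    uint8_t spi_transfer(uint8_t out);   /* clock one byte out, one byte in   */

    /* Typical SPI SRAM command bytes (23LC1024-style). */
    #define SRAM_CMD_WRITE 0x02
    #define SRAM_CMD_READ  0x03

    static void sram_write(uint32_t addr, const uint8_t *data, size_t len)
    {
        spi_select();
        spi_transfer(SRAM_CMD_WRITE);
        spi_transfer((addr >> 16) & 0xFF);   /* 24-bit address, MSB first */
        spi_transfer((addr >> 8) & 0xFF);
        spi_transfer(addr & 0xFF);
        for (size_t i = 0; i < len; i++)
            spi_transfer(data[i]);
        spi_deselect();
    }

    static void sram_read(uint32_t addr, uint8_t *data, size_t len)
    {
        spi_select();
        spi_transfer(SRAM_CMD_READ);
        spi_transfer((addr >> 16) & 0xFF);
        spi_transfer((addr >> 8) & 0xFF);
        spi_transfer(addr & 0xFF);
        for (size_t i = 0; i < len; i++)
            data[i] = spi_transfer(0x00);    /* clock dummy bytes to read */
        spi_deselect();
    }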



