People have had the same ideas you've had, and it's gotten as far as commercial products, but the general experience has been that "sufficiently smart compiler" tends to be very lackluster in practice. Any such effort needs to be extremely cognizant of what hardware is fundamentally good and bad at, and what compilers are fundamentally good and bad at.
Let's start with branch prediction. Modern hardware branch predictors are pretty phenomenal. They can predict a basic loop with 100% accuracy (i.e., they can not only predict that the backedge is normally taken, but can predict when the loop is about to stop). In the past decade, we've also seen big improvements in indirect branch predictors. The classic "branch taken/branch not taken" hint is fundamentally coarser and less precise than the hardware branch predictor: it cannot take any advantage of dynamic information. Replacing the hardware branch predictor with a compiler's static branch predictor is only going to make the situation worse.
Now you might be thinking of somehow encoding branch predictor state into the instruction stream. This runs into the issue of encoding microarchitecture in your ISA. If you ever need to change the microarchitecture, you face a dilemma: make a new, incompatible ISA, or emulate old microarchitectural details that no longer work well. MIPS got burned by this with delay slots. In general, the rule of thumb hardware designers use is "never expose your microarchitecture in your ISA."
A related topic is the issue of memory size. Memory is relatively expensive compared to ALUs: we can stamp out more ALUs than we know how to feed. We have hit physical limits in sizing our register files and L1 caches (electrons move at finite speed!), and the technologies we use to make them fast are also prohibitively power-hungry. This leads to the memory hierarchy of progressively larger, but cheaper and more power-efficient, memories. The rule of thumb for performance engineers is that you count cycles only if you fit in L1 cache; otherwise, you're counting cache misses, since those cache misses will dwarf any cycle-count savings you're likely to scrounge up.