It's older than POWER8. The same thing exists on Cell, Xenon, and IIRC on PPC970/POWER4 as well.
And it's actually pretty common for CPUs to encode hints in unused NOP equivalents. RISC-V does it, and x86 does it too, with Intel's Control-Flow Enforcement Technology being an example.
Ideally the architecture has the foresight to carve out a specific 'hint' space as "is a NOP today, may do something magical on tomorrow's CPU" rather than having to retrospectively borrow encodings that theoretically did something nop-like.
Sometimes "specified but useless" encoding space gets taken back for other reasons -- Arm originally blew 1/16th of the entire encoding space on NOPs by allowing and documenting the instruction condition code 0xF to mean "never execute" because it happened to fall neatly out of the original design. That later got taken back somewhere around Armv4 or v5 to be used for new instructions...
I'd argue that PowerPC and most modern archs have done this. For a couple of decades all of the major archs have documented 'preferred NOPs'. They then document that other NOPs will give the same results, but hint that those might cause your code to run very slowly, comparable to the worst case of code not optimized for newer uarchs.
That's not quite the same as a true hint space, though, because it documents that code can rely on the NOP behaviour even if not the performance. Arm's HINT space is specifically "must NOP now, must not be used by software": https://developer.arm.com/documentation/ddi0602/2022-06/Base...
> That's not quite the same as a true hint space, though, because it documents that code can rely on the NOP behaviour even if not the performance.
But that's what a hint is. By definition a hint can't affect execution results, only performance.
And ori r0,r0,0 has been defined as the one true Power NOP for decades now, with these semantics documented in the main Power ISA doc (including or r1,r1,r1 being a hint for SMT priority). Whoever was being cute in the golang compiler simply didn't read the relevant docs in the first place.
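The distinction is visible in the raw encodings. A minimal sketch, assuming the Power ISA X-form layout for `or` (primary opcode 31, extended opcode 444, Rc=0), shows how the SMT-priority hints are just `or rX,rX,rX` with particular register numbers, while the architected NOP is a different instruction entirely:

```python
# Sketch of Power ISA encodings (X-form "or": primary opcode 31, XO 444, Rc=0).
def or_same(r):
    """Encode 'or rN,rN,rN' -- same register as source, source, and dest."""
    return (31 << 26) | (r << 21) | (r << 16) | (r << 11) | (444 << 1)

PREFERRED_NOP = 0x60000000     # ori r0,r0,0 -- the architected no-op
SMT_LOW       = or_same(1)     # or r1,r1,r1 -> 0x7C210B78, "low priority" hint
SMT_NORMAL    = or_same(2)     # or r2,r2,r2 -> restores normal priority
SMT_VERY_LOW  = or_same(31)    # or r31,r31,r31
```

So `or r1,r1,r1` assembles to 0x7C210B78, which is why it looks like an innocuous register-to-itself move in a disassembly even though the ISA assigns it a priority-hint meaning.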
Some of the insns later defined in the Arm hint space are definitely not NOPs, though -- the pointer-authentication ones will write results to registers, for instance. The requirement is "if code that knows about the newly architected instruction uses it correctly but runs on a CPU where it's a NOP the behaviour is acceptable", not "code written before the insn in hint space is allocated can use this assuming it is always going to be a pure NOP and will never alter registers, undef, etc". The contract between the code and the ISA is subtly different.
I agree that in this specific case the software side was in the wrong (though if the official disassembly had been "hint insn 1" rather than "or r1,r1,r1" I suspect the bug wouldn't have got into the code in the first place).
The official x86 NOP is at 90h (and actually originally meant "xchg ax, ax" --- https://news.ycombinator.com/item?id=10437070). All the other new sequences meant to be NOPs on older processors were allocated from the reserved space (e.g. segment override prefixes on intrasegment conditional jumps becoming branch hints on the P4.)
There are actually several official NOPs on x86, each of a different instruction length. To be fair, I think they backported this behavior into the official architecture after Microsoft relied on it. AMD defines sequences up to 11 bytes, Intel up to 9 bytes.
That's basically combinations of "regular NOP", "NOP with an ignored prefix", and "NOP with a ModRM". The regular one has been there since the 8086, the prefixed NOPs are 386+ (I believe on 8086 the 6x row aliases to 7x conditional jumps, and the 186 and 286 will cause a #UD exception), and the latter is P6+ but has a very interesting history: https://www.jookia.org/wiki/Nopl
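Those combinations produce the recommended multi-byte NOP table from the Intel SDM; each length is just `0F 1F /0` dressed up with ModRM, SIB, displacement, and `66` prefix bytes:

```python
# Intel-recommended multi-byte NOP sequences, keyed by length in bytes.
# 90           = classic one-byte NOP (originally xchg ax,ax)
# 0F 1F /0     = "NOPL" with a ModRM-addressed (and ignored) memory operand
NOPS = {
    1: bytes([0x90]),
    2: bytes([0x66, 0x90]),                                      # prefixed NOP
    3: bytes([0x0F, 0x1F, 0x00]),                                # nopl (eax)
    4: bytes([0x0F, 0x1F, 0x40, 0x00]),                          # + disp8
    5: bytes([0x0F, 0x1F, 0x44, 0x00, 0x00]),                    # + SIB, disp8
    6: bytes([0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00]),
    7: bytes([0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00]),        # + disp32
    8: bytes([0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),  # + SIB, disp32
    9: bytes([0x66, 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00]),
}
```

Compilers emit these for alignment padding so the decoder swallows the gap in one instruction instead of a run of single-byte NOPs.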
If you look at the linked Raymond Chen article, the number does communicate the priority to be used, although it seems that not all 32 are used (and 0 is already used as the "true NOP".)
Thanks for pointing this out! I'd read through all of Raymond Chen's architecture articles a couple of years ago, so I didn't bother clicking through. The linked article does indeed go into detail on this! I guess it's time to dive back into them.
Because then you can only use that instruction if you're sure the target CPU supports it... or else your program will crash.
Sequences like "or r1, r1, r1" are going to be harmless when run on a processor that is unaware of its special meaning, which is exactly what you want for a "hint" instruction.
I wonder how design decisions like this happen. Surely there's existing reserved/invalid opcode space they could've used, instead of space that's "valid and theoretically useless, but may possibly be emitted by a sufficiently stupid compiler"?
On x86, Intel has both reserved-invalid space (generates an exception) and reserved-valid space (acts as a NOP) for future instruction extensions: the former for instructions intended to run only on CPUs that support them, the latter for instructions that serve as hints on supporting CPUs but are harmless otherwise.
It is useful to use an existing legal op code for this, because the code is still compatible with older (or less optimized) cpus that don’t exhibit the behavior.
The reason you do this is because you have an instruction that means nothing. You save space by recycling it. Why bother making a new instruction? That's against the whole idea of RISC.
It's not their problem if the compiler writer doesn't read the documentation. I mean, the documentation is literally written for them.
I haven't done a ton of PowerPC assembly but I knew this. Frankly, it seems that the people writing the PowerPC output didn't know PPC that well, which is always a problem. There are probably more bugs than this.
Many processors have a feature called SMT that can run multiple "threads" of execution on a core at once: https://en.wikipedia.org/wiki/Multithreading_(computer_archi.... For example, you might schedule another thread to run when waiting for a memory load to complete. On POWER processors you can actually set priorities for these threads, much like you might do for those running in software.
> What is a hardware thread? I thought threads were a software abstraction?
This is a great question! Threads are a programming abstraction, but not just in software. Processors present an "architected state" to the programmer. Internally, the machine itself has a much more complex set of logic managing that presented state. The feature we're talking about here is generally called Simultaneous Multi-threading or SMT.[1] POWER was the first architecture to support this in commercial devices back in 2004.[2]
What's important to understand is that the processor core itself is made up of multiple units, each of which have their own internal state, task management, and methods of communication with each other. They don't just run a linear "fetch/decode/execute/store" cycle, even in a pipeline. If you were to do that, you'd end up with an extreme under-utilization of the silicon (or put a different way, someone else would figure out how to get better performance than you).
With SMT enabled, multiple real programming threads can execute at the same time, while the core "figures out" how to share resources between them, increasing the effective throughput of the device at the expense of silicon.
With the newest POWER cores you can get up to SMT-8, and (if you want) even tune cores down to SMT-4, SMT-2, or just turn the feature off and dedicate all resources to a single thread while the operating system is running.
All of that is presented to the programmer by way of the architected registers and behaviors in the processor. In reality there's a huge number of other registers and machinery running behind the scenes that isn't visible to user-space, the operating system, or in some cases even the hypervisor (if present).
Source: I worked in firmware for a couple of generations of these devices, was lead for a simulator team, and was functional owner for the launch vehicle for POWER7.