An instruction oddity in the ppc64 (PowerPC 64-bit) architecture (utcc.utoronto.ca)
123 points by zdw on Jan 21, 2023 | hide | past | favorite | 29 comments


It's older than POWER8. The same thing exists on Cell, Xenon, and IIRC the PPC970/POWER4 too.

And it's actually pretty common for CPUs to encode hints in unused NOP equivalents. RISC-V does it too, and so does x86: Intel's Control-Flow Enforcement Technology is one example, since its ENDBR64 marker instruction decodes as a NOP on older CPUs.


Ideally the architecture has the foresight to carve out a specific 'hint' space as "is a NOP today, may do something magical on tomorrow's CPU" rather than having to retrospectively borrow encodings that theoretically did something nop-like.

Sometimes "specified but useless" encoding space gets taken back for other reasons -- Arm originally blew 1/16th of the entire encoding space on NOPs by allowing and documenting the instruction condition code 0xF to mean "never execute" because it happened to fall neatly out of the original design. That later got taken back somewhere around Armv4 or v5 to be used for new instructions...


I'd argue that PowerPC and most modern archs have done this. For a couple of decades all of the major archs have documented 'preferred NOPs'. They then document that other NOPs will give the same results, but hint that they might cause your code to run very slowly, on the order of the worst case for code not optimized for newer uarchs.


That's not quite the same as a true hint space, though, because it documents that code can rely on the NOP behaviour even if not the performance. Arm's HINT space is specifically "must NOP now, must not be used by software": https://developer.arm.com/documentation/ddi0602/2022-06/Base...


> That's not quite the same as a true hint space, though, because it documents that code can rely on the NOP behaviour even if not the performance.

But that's what a hint is. By definition a hint can't affect execution results, only performance.

And or r0, r0, r0 has been defined as the one true Power NOP for decades now, with these semantics documented in the main Power ISA doc (including or r1,r1,r1 being a hint for SMT priority). Whoever was being cute in the golang compiler simply didn't read the relevant docs in the first place.
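For the curious, the reason the "or rN,rN,rN" forms make such a convenient hint space falls out of the encoding. A quick sketch (mine, not from the thread) of the Power X-form "or" encoding, per the Power ISA field layout (primary opcode 31, RS, RA, RB, extended opcode 444, Rc):

```python
# Encoding sketch for the Power X-form "or RA,RS,RB" instruction.
# Field layout (big-endian bit numbering): primary opcode 31 in the top
# 6 bits, then RS, RA, RB (5 bits each), extended opcode 444, Rc.

def encode_or(ra: int, rs: int, rb: int, rc: int = 0) -> int:
    """Encode "or RA,RS,RB". Assembler order puts the destination RA
    first, but RS occupies the higher bit field in the encoding."""
    return (31 << 26) | (rs << 21) | (ra << 16) | (rb << 11) | (444 << 1) | rc

def hint_register(word: int):
    """If `word` is an "or rN,rN,rN" form, return N, else None."""
    if word & 0xFC0007FF != (31 << 26) | (444 << 1):
        return None
    rs, ra, rb = (word >> 21) & 31, (word >> 16) & 31, (word >> 11) & 31
    return rs if rs == ra == rb else None

print(hex(encode_or(1, 1, 1)))    # 0x7c210b78 -- the SMT low-priority hint
print(hint_register(0x7C210B78))  # 1
```

Every N from 0 to 31 yields a distinct, architecturally side-effect-free word, so the hardware can pattern-match the register number out of an instruction that old CPUs already execute harmlessly.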


Some of the insns later defined in the Arm hint space are definitely not NOPs, though -- the pointer-authentication ones will write results to registers, for instance. The requirement is "if code that knows about the newly architected instruction uses it correctly but runs on a CPU where it's a NOP the behaviour is acceptable", not "code written before the insn in hint space is allocated can use this assuming it is always going to be a pure NOP and will never alter registers, undef, etc". The contract between the code and the ISA is subtly different.

I agree that in this specific case the software side was in the wrong (though if the official disassembly had been "hint insn 1" rather than "or r1,r1,r1" I suspect the bug wouldn't have got into the code in the first place).


Thanks; that also explains why x86 would bother to champion specific NOP sequences specifically for NOP purposes.


The official x86 NOP is at 90h (and actually originally meant "xchg ax, ax" --- https://news.ycombinator.com/item?id=10437070). All the other new sequences meant to be NOPs on older processors were allocated from the reserved space (e.g. segment override prefixes on intrasegment conditional jumps becoming branch hints on the P4.)


There are actually several official NOPs on x86, one for each instruction length. To be fair, I think they back-ported this behavior into the official arch after Microsoft relied on it. AMD defines sequences up to 11 bytes, Intel up to 9 bytes.

    90                              NOP
    6690                            66 NOP
    0f1f00                          NOP DWORD ptr [EAX]
    0f1f4000                        NOP DWORD ptr [EAX + 00H]
    0f1f440000                      NOP DWORD ptr [EAX + EAX*1 + 00H]
    660f1f440000                    66 NOP DWORD ptr [EAX + EAX*1 + 00H]
    0f1f8000000000                  NOP DWORD ptr [EAX + 00000000H]
    0f1f840000000000                NOP DWORD ptr [EAX + EAX*1 + 00000000H]
    660f1f840000000000              66 NOP DWORD ptr [EAX + EAX*1 + 00000000H]
    66660f1f840000000000            66 66 NOP DWORD ptr [EAX + EAX*1 + 00000000H]
    6666660f1f840000000000          66 66 66 NOP DWORD ptr [EAX + EAX*1 + 00000000H]
https://stackoverflow.com/questions/25545470/long-multi-byte...
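The table above can be captured directly as data. A small sketch (assuming the recommended byte sequences quoted above) that also shows the usual trick of padding a gap with as few instructions as possible:

```python
# The recommended multi-byte NOP sequences from the listing above,
# keyed by instruction length in bytes.
RECOMMENDED_NOPS = {
    1:  bytes.fromhex("90"),
    2:  bytes.fromhex("6690"),
    3:  bytes.fromhex("0f1f00"),
    4:  bytes.fromhex("0f1f4000"),
    5:  bytes.fromhex("0f1f440000"),
    6:  bytes.fromhex("660f1f440000"),
    7:  bytes.fromhex("0f1f8000000000"),
    8:  bytes.fromhex("0f1f840000000000"),
    9:  bytes.fromhex("660f1f840000000000"),
    10: bytes.fromhex("66660f1f840000000000"),
    11: bytes.fromhex("6666660f1f840000000000"),
}

def nop_pad(nbytes: int) -> bytes:
    """Fill `nbytes` of padding using as few NOP instructions as possible,
    greedily taking the longest recommended sequence each time."""
    out = b""
    while nbytes > 0:
        n = min(nbytes, max(RECOMMENDED_NOPS))
        out += RECOMMENDED_NOPS[n]
        nbytes -= n
    return out
```

Fewer, longer NOPs decode faster than a run of single-byte 90h's, which is why assemblers use these sequences for alignment padding.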


That's basically combinations of "regular NOP", "NOP with an ignored prefix", and "NOP with a ModRM". The regular one has been there since the 8086, the prefixed NOPs are 386+ (I believe on 8086 the 6x row aliases to 7x conditional jumps, and the 186 and 286 will cause a #UD exception), and the latter is P6+ but has a very interesting history: https://www.jookia.org/wiki/Nopl


The part of me that loves low-level programming is wondering why they didn't overload all of

    or rN, rN, rN    (for all N 0..31)
so that a thread could communicate not only that its priority should be lowered, but also an additional number alongside that request.


If you look at the linked Raymond Chen article, the number does communicate the priority to be used, although it seems that not all 32 are used (and 0 is already used as the "true NOP".)
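For reference, the Power ISA (Book II, in the Program Priority Register material) documents a handful of these register numbers as priority levels. The mapping below is from memory and should be checked against the spec before relying on it; the higher levels are restricted to privileged code:

```python
# Priority levels signalled by "or rN,rN,rN" on SMT-aware Power cores,
# per my recollection of the Power ISA Book II -- treat as an assumption
# to verify against the spec, not a definitive table.
OR_PRIORITY_HINTS = {
    31: "very low",
    1:  "low",
    6:  "medium low",
    2:  "medium (normal)",
    5:  "medium high",
    3:  "high",
    7:  "very high",
}

def describe(n: int) -> str:
    """Describe what "or rN,rN,rN" means for thread priority."""
    if n == 0:
        return "plain NOP (no priority hint)"
    return OR_PRIORITY_HINTS.get(n, "no architected meaning; executes as a NOP")

print(describe(1))  # low
```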


Thanks for pointing this out! I'd read through all of Raymond Chen's architecture articles a couple of years ago, so I didn't bother clicking through. The linked article does indeed go into detail on this! I guess it's time to dive back into them.


The single comment to TFA further points out that some of the values (registers) trigger other operations e.g. pause, yield, …


Why overload an existing instruction at all? Why not introduce a new instruction to accomplish this? It seems very hacky.


Because then you can only use that instruction if you're sure the target CPU supports it... or else your program will crash.

Sequences like "or r1, r1, r1" are going to be harmless when run on a processor that is unaware of its special meaning, which is exactly what you want for a "hint" instruction.


I wonder how design decisions like this happen. Surely there's existing reserved/invalid opcode space they could've used, instead of existing "valid and theoretically useless, but may possibly be emitted by a sufficiently stupid compiler" opcode space?

On x86, Intel has both reserved-invalid (generates an exception) and reserved-valid (acts as a NOP) space for future instruction extensions: the former is intended for code that runs exclusively on CPUs supporting it, the latter for code that only serves as a hint on supporting CPUs but is harmless otherwise.


Opcode space is limited on fixed-width instruction CPUs and expensive on CPUs with potentially ‘unlimited’ numbers of instructions such as the x86.

Why waste part of it if you can instead take it from an instruction that already wasted some opcode space?


It is useful to use an existing legal op code for this, because the code is still compatible with older (or less optimized) cpus that don’t exhibit the behavior.


The reason you do this is because you have an instruction that means nothing. You save space by recycling it. Why bother making a new instruction? That's against the whole idea of RISC.

It's not their problem if the compiler writer doesn't read the documentation. I mean, the documentation is literally written for them.


I haven't done a ton of PowerPC assembly but I knew this. Frankly, it seems that the people writing the PowerPC output didn't know PPC that well, which is always a problem. There are probably more bugs than this.


I wonder if the valid NOPs' "payload" is used by debuggers?


What is a hardware thread? I thought threads were a software abstraction?


Many processors have a feature called SMT that can run multiple "threads" of execution on a core at once: https://en.wikipedia.org/wiki/Multithreading_(computer_archi.... For example, you might schedule another thread to run when waiting for a memory load to complete. On POWER processors you can actually set priorities for these threads, much like you might do for those running in software.


> What is a hardware thread? I thought threads were a software abstraction?

This is a great question! Threads are a programming abstraction, but not just in software. Processors present an "architected state" to the programmer. Internally, the machine itself has a much more complex set of logic managing that presented state. The feature we're talking about here is generally called Simultaneous Multi-threading or SMT.[1] POWER was the first architecture to support this in commercial devices back in 2004.[2]

What's important to understand is that the processor core itself is made up of multiple units, each of which has its own internal state, task management, and methods of communication with the others. They don't just run a linear "fetch/decode/execute/store" cycle, even in a pipeline. If you were to do that, you'd end up with extreme under-utilization of the silicon (or, put a different way, someone else would figure out how to get better performance than you).

With SMT enabled, multiple real programming threads can execute at the exact same time, while the core "figures out" how to share resources between them, increasing the effective throughput of the device at the expense of silicon.

With the newest POWER cores you can get up to SMT-8, and (if you want) even tune cores down to SMT-4, SMT-2, or just turn the feature off and dedicate all resources to a single thread while the operating system is running.

All of that is presented to the programmer by way of the architected registers and behaviors in the processor. In reality there's a huge number of other registers and machinery running behind the scenes that isn't visible to user-space, the operating system, or in some cases even the hypervisor (if present).

Source: I worked in firmware for a couple of generations of these devices, was lead for a simulator team, and was functional owner for the launch vehicle for POWER7.

[1] https://en.wikipedia.org/wiki/Simultaneous_multithreading

[2] https://en.wikipedia.org/wiki/POWER5


Then there is 'microthreading' in hardware, as implemented in an academic proof-of-concept, D-RISC (http://apple-core.info/), via many small cores arranged in so-called 'microgrids':

https://web.archive.org/web/20161121135317/svp-home.org/micr...


POWER can be configured for SMT4 or SMT8.

SMT8 was the default for P9; however, for SAP on P10, it appears that the recommendation is now to use SMT4 instead.

Interestingly, Google just found an older comment that I made a while ago. https://news.ycombinator.com/item?id=30354228


Physically, the POWER8 & 9 chips are built in SMT4 and SMT8 configurations (the resulting dies are slightly different).

Then, on top of that, you can configure the actual SMT level, from 1 up to the chip maximum, at runtime.


I think it's Power's version of SMT (or Hyperthreading as it is known in the Intel world)



