Well actually, on the type of CPU OP refers to (128 threads, i.e. an AMD Threadripper), L3 cache is only shared within a CCX (pairs of which form a CCD on Zen 2). If you launch a program with 32 threads, they may have 1, 2, 3 or 4 distinct L3 caches to work with.
Moreover, unless thread pinning is enforced, a given thread will bounce around between different cores during execution, so the number of distinct L3 caches in action will not be constant.
Of course you have the same story with memory: accessing memory allocated by another thread is slower when that thread lives on another CCD.
TL;DR NUMA makes life hard if you want to get consistent performance from parallelism.