
Worth noting - IIRC the stream mentioned that it could do up to 16 simultaneous projections at little additional performance cost. This is important for VR... a big part of the cost, when you are dumping many vertices to the GPU, is performing a transform on each vertex (a four-component vector multiplied by a 4x4 matrix)+. An even bigger cost comes from filling the resulting polygons, which, if done in two passes (as is fairly common), results in cache-unfriendly access across the tiles that get filled. So, in other words, it's expensive to render something twice, as is needed for each eye in VR - and from what they have shown, their new architecture largely eliminates this problem.

+ This is a "small" part of the cost, but doing 5M polygons at 60 fps can result in about 30 GFLOPS of compute for that single matrix operation alone (and in reality there are many vertex operations and often many more fragment operations).
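
Rough arithmetic behind that figure, assuming ~3 unshared vertices per polygon: a 4x4 matrix times a 4-vector is 16 multiplies + 12 adds = 28 FLOPs, so 5M polygons x 3 vertices x 28 FLOPs x 60 fps is ~25 GFLOPS - in the ballpark of the quoted 30.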



I was thinking about this earlier - the transition from a warp of 32 to a warp of 64 that Pascal supposedly made sounds exactly like how you would accelerate multiple projections (by at least a factor of 4).

edit: apparently Pascal still has a warp-size of 32.


I am not sure I follow - how does processing 64 vertices at once instead of 32 accelerate anything, provided the total number of ALUs is the same?


A major concept in GPGPU programming is "warp coalescing".

Threads are executed an entire warp at a time (32 or 64 threads). All threads execute all paths through the code block - e.g. if ANY thread takes an if-branch, ALL threads step through that branch, and the threads for which the condition is false execute NOP instructions until the combined control flow resumes. This applies recursively to nested branches (this is why "warp divergence"/"branch divergence" absolutely murders performance on GPUs).
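
A minimal CUDA sketch of the pitfall (the kernel and names are made up for illustration):

    // Within any warp, even lanes take the if-branch and odd lanes the
    // else-branch, so the warp executes BOTH paths serially, with half
    // its lanes masked off (effectively NOPs) during each path.
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even lanes active here...
        else
            out[i] = in[i] + 1.0f;   // ...then odd lanes active here
    }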

When threads execute a memory load/store, they do so as a bank. The warp controller is designed to combine these requests if at all possible. If 32 threads each issue a request for one of 32 sequential words, it will combine them into a single request for the whole block. However, it cannot do anything if the request isn't either contiguous (a sequential block of the warp size) or strided (thread N wants index X, thread N+1 wants index X+S, thread N+2 wants index X+2S, etc.). In other words - it doesn't have to be contiguous, but it does have to be uniform. The resulting memory access is then serviced once and distributed to all units within the warp, and this is a huge factor in accelerating compute loads.
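
For illustration, the friendly case as a hypothetical CUDA kernel (bounds checks omitted):

    // Coalesced: lane k of each warp reads element base+k, so the warp's
    // 32 four-byte loads fall in one contiguous 128-byte segment and are
    // combined into a single memory transaction.
    __global__ void coalesced_copy(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }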

Having a warp-size of 64 hugely accelerates certain patterns of math, particularly wide linear algebra.

edit: apparently Pascal still has a warp-size of 32.


Wow, memory access on NVidia is pretty bad. AMD has a separate unit that coalesces memory requests and goes through the cache, so if you do strided loads, for example, the next load will likely be reading data cached by the previous one, and it does not matter how many lanes are active. AMD has 64-wide "warps" (wavefronts) btw, and it does not seem superior to NV in computation on the same number of ALUs.


I did my grad research on disease transmission simulations on GPUs, so this is super interesting to me. Could you please hit me with some papers or presentations?

The NVIDIA memory model also goes through the L1 cache - but that's obviously not very big on a GPU (also true on AMD, IIRC) - like <128 bytes per thread. It's great if your threads hit it coalesced; otherwise it's pretty meaningless.


I program AMD chips in game consoles, so I use a different set of manuals, but AMD has a lot of docs available to the public at http://developer.amd.com/resources/documentation-articles/de...

At a glance there is a lot of legacy stuff, so I'd look at anything related to GCN, Sea Islands and Southern Islands. Evergreen, R600-800, etc. are legacy VLIW ISAs as far as I know.


There's also the fairly recent GCN3 ISA document from 2015 available, which sheds light on their modern hardware architecture.


Well, sheds light on their Compute Unit architecture.


A friend of mine will be starting an epidemiology grad program this fall. Do you have some good basic pointers on the use of GPUs in the field? Also .. what is your opinion of the field in general?


the hardware performs some coalescing, but it's complicated...

memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.

thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.
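
e.g., that bad case as a sketch in CUDA (kernel name is made up, bounds checks omitted):

    // each lane loads 4 bytes at a stride of 128 bytes, so every lane of
    // the warp touches a different 128-byte cache line: 32 x 128 = 4096
    // bytes move through the cache but only 32 x 4 = 128 are used (~3%).
    __global__ void strided_load(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * 32];   // 32 floats = 128-byte stride
    }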


Thanks, this is about how I imagined all GPUs work. With most games using interleaved vertex buffers it would be a very strange decision to penalize this access pattern.


Actually... that might be changing on GCN. Check out Graham Wihlidal's presentation from this past GDC - it advocates some benefits of de-interleaved VBs:

http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf


It might, but most likely it won't: splitting position data from the rest of the vertex has been around since at least the PS3 (for similar reasons - culling prims), and yet how many games did Edge's tri-culling in that generation? And even if you do go ahead and split off position, the rest of the vertex is still better off interleaved.
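
A sketch of the two layouts being discussed, as plain structs (names are illustrative):

    // Fully interleaved: one stream; all of a vertex's attributes are
    // adjacent in memory, friendly to an ordinary vertex-fetch pattern.
    struct VertexInterleaved {
        float pos[3];
        float normal[3];
        float uv[2];
    };

    // Split position: positions get their own tightly packed stream, so
    // a culling or depth-only pass touches just 12 bytes per vertex; the
    // remaining attributes stay interleaved in a second stream.
    struct VertexRest {
        float normal[3];
        float uv[2];
    };
    // stream 0: float positions[3 * N];
    // stream 1: struct VertexRest rest[N];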


It wouldn't, because, as you say, the ALUs are the same; it would just change the unit that the warp scheduler deals with (and entail the corresponding changes to kernel occupancy).


Unless you halve the precision going into the ALU (FP16 vs FP32), of course...


the warp scheduler can dual-issue instructions; it has to, because otherwise 192 fp32 alus can't be used effectively by the 4 warp schedulers (4 x 32). increased ipc can do this for you. this was kepler though; the ratio of resources to schedulers per sm has changed a lot in maxwell and pascal.

look up "dual-issue":

http://on-demand.gputechconf.com/gtc/2013/presentations/S346...


Warp size is still 32 in Pascal.


I've found the application to multi-screen setups more exciting.


That's exciting as well, but for whatever reason I've always personally found multi-monitor gaming to be a little annoying (even with compensated projection). Probably ok if you have super small monitor bezels though.



