
Worth noting - IIRC the stream mentioned that it could do up to 16 simultaneous projections at little additional performance cost. This is important for VR... a big part of the cost, when you are dumping many vertices to the GPU, is performing a transform on each vertex (a four-component vector multiplied by a 4x4 matrix)+. An even bigger cost comes from filling the resulting polygons, which, if done in two passes (as is fairly common), results in cache-unfriendly access across the tiles that get filled. So, in other words, it's expensive to render something twice, as is needed for each eye in VR - and from what they have shown, their new architecture largely eliminates this problem.

+ This is a "small" part of the cost, but doing 5M polygons at 60 fps can result in about 30 GFLOPS of compute for that single matrix operation alone (and in reality there are many vertex operations and often many more fragment operations).
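
Rough arithmetic behind that figure, assuming ~3 unshared vertices per polygon: a 4x4 matrix times a 4-vector is 16 multiplies + 12 adds = 28 FLOPs, so 5M polygons x 3 vertices x 28 FLOPs x 60 fps is ~25 GFLOPS - in the ballpark of the quoted 30.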



I was thinking about this earlier - the transition from a warp of 32 to a warp of 64 that Pascal supposedly made sounds exactly like how you would accelerate multiple projections (by at least a factor of 4).

edit: apparently Pascal still has a warp-size of 32.


I am not sure I follow - how does processing 64 vertices at once instead of 32 accelerate anything, provided the total number of ALUs is the same?


A major concept in GPGPU programming is "warp coalescing".

Threads are executed an entire warp at a time (32 or 64 threads). All threads execute all paths through the code block - e.g. if ANY thread takes an if-branch, ALL threads step through that branch, and the threads for which the condition is false execute NOP instructions until the combined control flow resumes. This applies recursively to nested branches (this is why "warp divergence"/"branch divergence" absolutely murders performance on GPUs).
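
A minimal CUDA sketch of the pitfall (the kernel and names are made up for illustration):

    // Within any warp, even lanes take the if-branch and odd lanes the
    // else-branch, so the warp executes BOTH paths serially, with half
    // its lanes masked off (effectively NOPs) during each path.
    __global__ void divergent(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        if (i % 2 == 0)
            out[i] = in[i] * 2.0f;   // even lanes active here...
        else
            out[i] = in[i] + 1.0f;   // ...then odd lanes active here
    }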

When threads execute a memory load/store, they do so as a bank. The warp controller is designed to combine these requests if at all possible. If 32 threads each issue a request for one of 32 sequential words, it will combine them into a single request for the whole block. However, it cannot do anything if the request isn't either contiguous (a sequential block of the warp size) or strided (thread N wants index X, thread N+1 wants index X+S, thread N+2 wants index X+2S, etc.). In other words - it doesn't have to be contiguous, but it does have to be uniform. The resulting memory access is then serviced once and distributed to all units within the warp, and this is a huge factor in accelerating compute loads.
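
For illustration, the friendly case as a hypothetical CUDA kernel (bounds checks omitted):

    // Coalesced: lane k of each warp reads element base+k, so the warp's
    // 32 four-byte loads fall in one contiguous 128-byte segment and are
    // combined into a single memory transaction.
    __global__ void coalesced_copy(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i];
    }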

Having a warp-size of 64 hugely accelerates certain patterns of math, particularly wide linear algebra.

edit: apparently Pascal still has a warp-size of 32.


Wow, memory access on NVidia is pretty bad. AMD has a separate unit that coalesces memory requests and goes through the cache, so if you do strided loads, for example, the next load will likely be reading data cached by the previous one, and it does not matter how many lanes are active. AMD has 64-wide "warps" (wavefronts) btw, and it does not seem superior to NV in computation on the same number of ALUs.


I did my grad research on disease transmission simulations on GPUs, so this is super interesting to me. Could you please hit me with some papers or presentations?

The NVIDIA memory model also goes through the L1 cache - but that's obviously not very big on a GPU (also true on AMD, IIRC) - like <128 bytes per thread. It's great if your threads hit it coalesced; otherwise it's pretty meaningless.


I program AMD chips in game consoles, so I use a different set of manuals, but AMD has a lot of docs available to the public at http://developer.amd.com/resources/documentation-articles/de...

At a glance there is a lot of legacy stuff, so I'd look at anything related to GCN, Sea Islands and Southern Islands. Evergreen, R600-800, etc. are legacy VLIW ISAs as far as I know.


There's also the fairly recent GCN3 ISA document from 2015 available, which sheds light on their modern hardware architecture.


Well, sheds light on their Compute Unit architecture.


A friend of mine will be starting an epidemiology grad program this fall. Do you have some good basic pointers on the use of GPUs in the field? Also .. what is your opinion of the field in general?


the hardware performs some coalescing, but it's complicated...

memory accesses in a warp do not necessarily have to be contiguous, but it does matter how many 32 byte global memory segments (and 128 byte l1 cache segments) they fall into. the memory controller can load 1, 2 or 4 of those 32 byte segments in a single transaction, but that's read through the cache in 128 byte cache lines.

thus, if every lane in a warp loads a random word in a 128 byte range, then there is no penalty; it's 1 transaction and the reading is at full efficiency. but, if every lane in a warp loads 4 bytes with a stride of 128 bytes, then this is very bad: 4096 bytes are loaded but only 128 are used, resulting in ~3% efficiency.
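
e.g., that bad case as a sketch in CUDA (kernel name is made up, bounds checks omitted):

    // each lane loads 4 bytes at a stride of 128 bytes, so every lane of
    // the warp touches a different 128-byte cache line: 32 x 128 = 4096
    // bytes move through the cache but only 32 x 4 = 128 are used (~3%).
    __global__ void strided_load(float *out, const float *in)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = in[i * 32];   // 32 floats = 128-byte stride
    }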


Thanks, this is about how I imagined all GPUs work. With most games using interleaved vertex buffers it would be a very strange decision to penalize this access pattern.


Actually... that might be changing on GCN. Check out Graham Wihlidal's presentation from this past GDC - it advocates some benefits of de-interleaved VBs:

http://www.wihlidal.ca/Presentations/GDC_2016_Compute.pdf


It might, but most likely it won't: splitting position data from the rest of the vertex has been around since at least the PS3 (for similar reasons - culling prims), and yet how many games did Edge's tri-culling in that generation? And even if you do go ahead and split off position, the rest of the vertex is still better off interleaved.
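
A sketch of the two layouts being discussed, as plain structs (names are illustrative):

    // Fully interleaved: one stream; all of a vertex's attributes are
    // adjacent in memory, friendly to an ordinary vertex-fetch pattern.
    struct VertexInterleaved {
        float pos[3];
        float normal[3];
        float uv[2];
    };

    // Split position: positions get their own tightly packed stream, so
    // a culling or depth-only pass touches just 12 bytes per vertex; the
    // remaining attributes stay interleaved in a second stream.
    struct VertexRest {
        float normal[3];
        float uv[2];
    };
    // stream 0: float positions[3 * N];
    // stream 1: struct VertexRest rest[N];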


It wouldn't, because, as you say, the ALUs are the same; it would just change the unit that the warp scheduler deals with (and entail the corresponding changes to kernel occupancy).


Unless you halve the precision going into the ALU (FP16 vs FP32), of course...


the warp scheduler can dual-issue instructions; it has to, because otherwise 192 fp32 alus can't be used effectively by the 4 warp schedulers (4 x 32). increased ipc can do this for you. this was kepler though; the ratio of resources to schedulers per sm has changed a lot in maxwell and pascal.

look up "dual-issue":

http://on-demand.gputechconf.com/gtc/2013/presentations/S346...


Warp size is still 32 in Pascal.


I've found the application to multi-screen setups more exciting.


That's exciting as well, but for whatever reason I've always personally found multi-monitor gaming to be a little annoying (even with compensated projection). Probably ok if you have super small monitor bezels though.



