In the coarse graining code, you use an @parameter-for. Doesn’t that lead to some pretty large code size unrolling that? Or is that less of an issue on GPU?
It doesn’t. The batch size is just 8. This is a very good trick and often needed to archive peak performance in memory bound kernels. You can checkout the equivalent code in cuda aswell :)
Great write up! I learned a lot!