I think this is what George Hotz is doing with tiny corp, but I have to admit I have little hope. Making asynchronous SIMD code fast is very difficult to begin with, let alone without an inside view of decisions like “why does this cause a sync?” or even “will this unnecessary copy ever get fixed?”. Unfortunately AMD and especially Intel don’t “develop in the open”, so even if the drivers are open source, without that context it’ll be an uphill battle.

To give some perspective, see @ngimel’s comments and PRs on GitHub. That’s what AMD and Intel are competing against, along with the confidence that optimizing for ML customers will pay off (clearly NVIDIA can already justify the investment).