Reading comments in this thread, it feels like people still can't believe that in some cases neural networks can be faster on a CPU than on a GPU.
In fact, there is already a real-world use for neural networks optimized only for the CPU: chess engines. More specifically, chess engines that use NNUE (Efficiently Updatable Neural Networks) [1], like Stockfish 12 [2]. It runs much faster while consuming fewer watts than a GPU, can run on an average CPU, and managed to beat GPU-based neural networks! [3]
This model (NNUE) existed well before the model discussed in this thread, yet there is almost no discussion of it on HN or Reddit's r/MachineLearning.
NNUE is weird. No one outside the chess/shogi community talks about it, because that community seems to have a strong case of not-invented-here syndrome. It's hand-optimised CPU code that doesn't (yet) run on the GPU (which is why it is "faster").
To be fair, they do want to embed it into a consumer-friendly application, and embedding TF, or anything that can run PyTorch models on a GPU without Python, is non-trivial.
There is a PyTorch port available, but no benchmarks unfortunately. It does seem to be fairly widely used for training though, which is indicative of the speed gains available.
GPUs have high latency. For a chess engine like Stockfish, which is designed to search as many positions as possible, the latency of a GPU is a big problem.
Engines like LC0 that do use the GPU work by searching fewer positions but with a heavier eval function. This makes the latency less relevant because it is a smaller percentage of the GPU time.
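To make the latency point concrete, here is a back-of-envelope sketch. All the numbers (launch overhead, per-eval costs) are made up for illustration, not measured figures for any real engine or GPU:

```python
# Hypothetical costs, in microseconds. A GPU call pays a fixed launch/transfer
# overhead regardless of how many positions it evaluates.
GPU_LAUNCH_US = 50.0     # fixed cost per GPU round-trip (assumed)
GPU_PER_EVAL_US = 1.0    # marginal cost per position on the GPU (assumed)
CPU_PER_EVAL_US = 2.0    # cost per NNUE-style eval on the CPU (assumed)

def evals_per_second_gpu(batch_size):
    # amortize the fixed launch cost over the batch
    return batch_size / (GPU_LAUNCH_US + GPU_PER_EVAL_US * batch_size) * 1e6

def evals_per_second_cpu():
    return 1e6 / CPU_PER_EVAL_US

# One position at a time (alpha-beta style): the launch cost dominates.
print(evals_per_second_gpu(1))      # ~19,600 evals/s
print(evals_per_second_cpu())       # 500,000 evals/s
# Large batches (MCTS style, as in LC0) amortize the launch cost away.
print(evals_per_second_gpu(1024))   # ~953,000 evals/s
```

This is why an engine that batches thousands of positions per call (fewer, heavier evals) suits the GPU, while one that evaluates positions one by one deep in a search tree does not.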
Board game AI works by searching through the state space, evaluating each state with the neural network, and then picking the move with the best expected outcome.
So it needs to load the comparatively tiny game state (a chess board) into the GPU for each evaluation. The more game states it can evaluate per move, the better it plays, and that number can be on the order of millions.
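The explosion in eval calls is easy to see with a toy search. The sketch below uses a fake game with a fixed, chess-like branching factor of 30 and a stand-in eval function; it only counts how many leaf evaluations a plain negamax search makes:

```python
# Toy "game": a position is just the tuple of moves that led to it.
# This is purely illustrative; the branching factor and eval are made up.
BRANCHING = 30

def toy_moves(position):
    # every position has the same 30 legal "moves" in this toy game
    return range(BRANCHING)

def toy_eval(position):
    # stand-in for a cheap CPU-side network forward pass
    return hash(position) % 201 - 100

def negamax(position, depth):
    """Plain negamax; returns (score, number of leaf eval calls)."""
    if depth == 0:
        return toy_eval(position), 1       # one tiny eval per leaf
    best, calls = float("-inf"), 0
    for move in toy_moves(position):
        score, c = negamax(position + (move,), depth - 1)
        best = max(best, -score)
        calls += c
    return best, calls

score, calls = negamax((), 4)
print(calls)  # 30**4 = 810,000 leaf evals for just a 4-ply search
```

Even a shallow 4-ply search needs 810,000 eval calls; real engines search far deeper with pruning, hence millions of evaluations per move, each one a round-trip if the network lives on the GPU.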
Was the NNUE trained on the CPU though? It’s intuitive that with the small size of a chessboard and the ability to use incremental update of evaluations, it will be faster to do neural network evaluation on CPU. Training, on the other hand, is typically highly parallelizable, so it would be a big development if we are able to do it faster on CPU.
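The "efficiently updatable" part refers to the network's first layer: its input features are sparse (roughly piece-type x square indicators), so a move toggles only a few features, and the first-layer accumulator can be patched instead of recomputed. A minimal sketch with made-up dimensions and random weights (not Stockfish's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, HIDDEN = 768, 256   # e.g. 12 piece types x 64 squares; toy hidden size
W = rng.standard_normal((N_FEATURES, HIDDEN)).astype(np.float32)

def full_accumulator(active_features):
    # full first-layer pass: sum the weight rows of every active feature
    return W[sorted(active_features)].sum(axis=0)

def incremental_update(acc, removed, added):
    # a move toggles only a handful of features, so this is O(k * HIDDEN)
    # instead of O(n_active * HIDDEN)
    return acc - W[sorted(removed)].sum(axis=0) + W[sorted(added)].sum(axis=0)

# Hypothetical move: one feature turns off, one turns on.
before = {6, 100, 200}          # made-up active feature indices
after = {21, 100, 200}
acc = full_accumulator(before)
acc = incremental_update(acc, removed={6}, added={21})
assert np.allclose(acc, full_accumulator(after), atol=1e-4)
```

That incremental trick only pays off during move-by-move inference inside a search; training still processes whole positions in large batches, which is why GPU training alongside CPU inference is the natural split.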
You will find most developers don't actually understand how fast a single x86 core can be when used appropriately. Most are too busy hiding from real hardware concerns to think about such things.
[1] https://www.chessprogramming.org/NNUE
[2] https://www.chessprogramming.org/Stockfish_NNUE
[3] https://www.chess.com/blog/the_real_greco/evolution-of-a-che...