Reading comments in this thread, it feels like people still can't believe that in some cases neural networks can be faster on a CPU than on a GPU.
In fact, there is already a real-world use for neural networks optimized only for the CPU: chess engines. More specifically, chess engines that use NNUE (Efficiently Updatable Neural Networks) [1], like Stockfish 12 [2]. It runs much faster while consuming fewer watts than a GPU, can run on an average CPU, and managed to beat GPU-based neural networks! [3]
This model (NNUE) existed well before the model discussed in this thread, yet there is almost no discussion of it on HN or Reddit's r/MachineLearning.
NNUE is weird. No one outside the chess/shogi community talks about it, because that community seems to have a strong case of not-invented-here syndrome. It's hand-optimised CPU code that doesn't (yet) run on the GPU (which is why it is "faster").
To be fair, they do want to embed it into a consumer-friendly application, and embedding TF, or anything that can run PyTorch models on a GPU without Python, is non-trivial.
There is a PyTorch port available, but no benchmarks unfortunately. It does seem to be fairly widely used for training though, which is indicative of the speed gains available.
GPUs have high latency. For a chess engine like Stockfish, which is designed to search as many positions as possible, the latency of a GPU is a big problem.
Engines like LC0 that do use the GPU work by searching fewer positions but with a heavier eval function. This makes the latency less relevant because it is a smaller percentage of the GPU time.
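To make the latency point concrete, here is a back-of-envelope sketch. All the numbers (launch overhead, per-eval costs) are made up for illustration, not measured figures for any real engine or GPU:

```python
# Hypothetical costs, in microseconds. A GPU call pays a fixed launch/transfer
# overhead regardless of how many positions it evaluates.
GPU_LAUNCH_US = 50.0     # fixed cost per GPU round-trip (assumed)
GPU_PER_EVAL_US = 1.0    # marginal cost per position on the GPU (assumed)
CPU_PER_EVAL_US = 2.0    # cost per NNUE-style eval on the CPU (assumed)

def evals_per_second_gpu(batch_size):
    # amortize the fixed launch cost over the batch
    return batch_size / (GPU_LAUNCH_US + GPU_PER_EVAL_US * batch_size) * 1e6

def evals_per_second_cpu():
    return 1e6 / CPU_PER_EVAL_US

# One position at a time (alpha-beta style): the launch cost dominates.
print(evals_per_second_gpu(1))      # ~19,600 evals/s
print(evals_per_second_cpu())       # 500,000 evals/s
# Large batches (MCTS style, as in LC0) amortize the launch cost away.
print(evals_per_second_gpu(1024))   # ~953,000 evals/s
```

This is why an engine that batches thousands of positions per call (fewer, heavier evals) suits the GPU, while one that evaluates positions one by one deep in a search tree does not.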
Board game AI works by searching through the state space, evaluating each state with the neural network, and then picking the move with the best expected outcome.
So it needs to load the comparatively tiny game state (a chess board) into the GPU for each evaluation. The more game states it can evaluate per move, the better it plays, and that number can be on the order of millions.
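The explosion in eval calls is easy to see with a toy search. The sketch below uses a fake game with a fixed, chess-like branching factor of 30 and a stand-in eval function; it only counts how many leaf evaluations a plain negamax search makes:

```python
# Toy "game": a position is just the tuple of moves that led to it.
# This is purely illustrative; the branching factor and eval are made up.
BRANCHING = 30

def toy_moves(position):
    # every position has the same 30 legal "moves" in this toy game
    return range(BRANCHING)

def toy_eval(position):
    # stand-in for a cheap CPU-side network forward pass
    return hash(position) % 201 - 100

def negamax(position, depth):
    """Plain negamax; returns (score, number of leaf eval calls)."""
    if depth == 0:
        return toy_eval(position), 1       # one tiny eval per leaf
    best, calls = float("-inf"), 0
    for move in toy_moves(position):
        score, c = negamax(position + (move,), depth - 1)
        best = max(best, -score)
        calls += c
    return best, calls

score, calls = negamax((), 4)
print(calls)  # 30**4 = 810,000 leaf evals for just a 4-ply search
```

Even a shallow 4-ply search needs 810,000 eval calls; real engines search far deeper with pruning, hence millions of evaluations per move, each one a round-trip if the network lives on the GPU.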
Was the NNUE trained on the CPU though? It’s intuitive that with the small size of a chessboard and the ability to use incremental update of evaluations, it will be faster to do neural network evaluation on CPU. Training, on the other hand, is typically highly parallelizable, so it would be a big development if we are able to do it faster on CPU.
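The "efficiently updatable" part refers to the network's first layer: its input features are sparse (roughly piece-type x square indicators), so a move toggles only a few features, and the first-layer accumulator can be patched instead of recomputed. A minimal sketch with made-up dimensions and random weights (not Stockfish's actual architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, HIDDEN = 768, 256   # e.g. 12 piece types x 64 squares; toy hidden size
W = rng.standard_normal((N_FEATURES, HIDDEN)).astype(np.float32)

def full_accumulator(active_features):
    # full first-layer pass: sum the weight rows of every active feature
    return W[sorted(active_features)].sum(axis=0)

def incremental_update(acc, removed, added):
    # a move toggles only a handful of features, so this is O(k * HIDDEN)
    # instead of O(n_active * HIDDEN)
    return acc - W[sorted(removed)].sum(axis=0) + W[sorted(added)].sum(axis=0)

# Hypothetical move: one feature turns off, one turns on.
before = {6, 100, 200}          # made-up active feature indices
after = {21, 100, 200}
acc = full_accumulator(before)
acc = incremental_update(acc, removed={6}, added={21})
assert np.allclose(acc, full_accumulator(after), atol=1e-4)
```

That incremental trick only pays off during move-by-move inference inside a search; training still processes whole positions in large batches, which is why GPU training alongside CPU inference is the natural split.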
You will find most developers don't actually understand how fast a single x86 core can be when used appropriately. Most are too busy hiding from real hardware concerns to think about such things.
[1] https://www.chessprogramming.org/NNUE
[2] https://www.chessprogramming.org/Stockfish_NNUE
[3] https://www.chess.com/blog/the_real_greco/evolution-of-a-che...