NNUE

Joost Buijs · Post by **Joost Buijs** » Sat May 01, 2021 12:24

Rein Halbersma wrote: Sat May 01, 2021 11:06
Joost Buijs wrote: Fri Apr 30, 2021 08:14 Like Alpha Zero you can start with a random network and have the engine play a number of games against itself and update the network on the fly depending upon the outcome of the games. This is called 'reinforcement learning'. The problem with this method is that you have to play a huge number of games before the network reaches an acceptable level of play. I never read the Alpha Zero paper, but I think for chess they used something like 40 million games.
Reinforcement learning should be strictly superior compared to supervised learning, even for NNUE or Scan-pattern draughts programs. Mathematically, supervised learning is just a single iteration in the RL loop, so if you stop after one round, they are equivalent. If you continue the loop and only pick a new network when it improves, you can never get worse.

The question is whether you will gain much from continuous playing and retraining. For Scan-based patterns, I highly doubt this. The eval is almost completely optimized as soon as you have ~100M positions and cannot be made to overfit after that in my experience. For NNUE, there might be more capacity for overfitting that you might then try to reduce by adding more positions and retraining.

The AlphaZero neural networks are a few orders of magnitude more expensive and require many more positions and much more data to reach the limit of their predictive power. That's why it was so expensive to train and generate all these games (~1700 years in terms of single PC years). IIRC, the training games used Monte Carlo Tree Search with only 1600 nodes per search; that's just a tiny amount of search and a huge amount of CPU/GPU cycles spent on the eval. The AlphaZero eval also picks up a big part of pattern-based tactics because it's such a large neural network. For Scan-patterns, it's the reverse: a huge amount of search and a tiny amount of eval cycles.

I agree.

It's a pity that it is not worthwhile to train larger/deeper networks for 10x10 Draughts, because 12-bit Scan-based patterns already seem to cover 99% of the important features. In fact NNUE for Draughts is a step backwards because it is slower; it's just the fun of trying to get it to perform at the same level.

For training, the huge draw tendency of Draughts is a problem too. Above a certain level engines don't lose anymore, even when you give the opponent a 100-fold speed advantage, and from draws the network won't learn anything. It looks like 10x10 Draughts is getting a bit trivial on current hardware. Maybe it needs other rules, like removing compulsory captures; this could make the game more complex.

Joost Buijs · Post by **Joost Buijs** » Tue May 04, 2021 19:19

With enough time spent on it, an NNUE-style network can probably get on par with pattern-based evaluation; it depends solely on the data used for training.

My last attempt was to train the network on positions derived from a single 3-move ballot: quiet positions taken from the leaves of the quiescence search. I removed all duplicates, randomly kept 1 out of every 5 positions, labeled them with the score of a 4-ply search based on the evaluation of the previous network, and trained the network on this newly generated data. The result clearly got better.
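
The data preparation described above could be sketched roughly like this. It is only an illustration under assumptions, not the code used in Ares: `build_training_set` and `search_score` are made-up names, `search_score` stands in for a 4-ply search with the previous network, and positions are assumed to be hashable values (plain strings here).

```python
import random

def build_training_set(positions, search_score, keep_one_in=5, depth=4, seed=0):
    """Deduplicate, subsample and label quiet positions.

    positions    -- iterable of hashable position keys (hypothetical format)
    search_score -- callable(position, depth) -> score; a stand-in for a
                    shallow search that uses the previous network's eval
    """
    rng = random.Random(seed)
    unique = list(dict.fromkeys(positions))       # drop duplicates, keep order
    # Keep each unique position with probability 1 / keep_one_in.
    sampled = [p for p in unique if rng.randrange(keep_one_in) == 0]
    # Label every kept position with the score of the shallow search.
    return [(p, search_score(p, depth)) for p in sampled]


if __name__ == "__main__":
    # Toy stand-ins: strings as positions, string length as the "score".
    raw = ["pos%d" % (i % 1000) for i in range(5000)]
    data = build_training_set(raw, lambda pos, depth: len(pos))
    print(len(data), "labeled positions from", len(set(raw)), "unique ones")
```

Sampling each position independently keeps roughly one in five on average; taking exactly every fifth position after a shuffle would be an equally plausible reading of the description.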

The last match against KR (90 moves in 1 minute) ended: Kingsrow 1.62 vs Ares v1.2: 2 wins, 0 losses, 156 draws, 0 unknowns, a difference of about 4 Elo.
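
For reference, the Elo figure can be reproduced from the match score with the usual logistic Elo formula. This is a minimal sketch, assuming the wins are counted for the first-named program (Kingsrow); `elo_difference` is a name made up for this example.

```python
import math

def elo_difference(wins, losses, draws):
    """Elo difference implied by a match result under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games           # fraction of points scored
    if score <= 0.0 or score >= 1.0:
        raise ValueError("Elo difference is unbounded for a 0% or 100% score")
    return 400.0 * math.log10(score / (1.0 - score))

# The match quoted above: 2 wins, 0 losses, 156 draws.
print(round(elo_difference(2, 0, 156), 1))
```

With so many draws the result rests on only two decisive games, so the error margin on such a figure is large compared to the figure itself.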

This is not so bad considering the slow speed of the network. Quantization will make it faster, but I haven't the slightest idea what it will do to the quality of the network.

Sidiki · Post by **Sidiki** » Thu May 06, 2021 00:17

Very, very good result coming from NNUE. I just downloaded the PDN of the DXP game.

Good work.

Friendly, Sidiki

Joost Buijs · Post by **Joost Buijs** » Thu May 06, 2021 08:54

Thanks!

Yesterday I found out that in the last match I forgot to enable Kingsrow's EGTB while in the previous match it was enabled. Maybe this explains the better result.

There is still a lot of work to do. As I already told Bert, I was set back by a very subtle bug in the hash table that cost me a lot of time, because I assumed it was a problem with the network and not something in the search.

I'm still busy optimizing the network by using better data, and I may try some other architecture(s). When I'm satisfied with the result I want to convert it from 32-bit floating point to 8-bit integer, which should make it about 4 times as fast. Currently it runs at max 1.4 MN/s on a single core, and it could go to 5.6 MN/s (possibly even faster, because the incremental update is not optimized very well yet). My goal is to reach 8 MN/s.

Sidiki · Post by **Sidiki** » Thu May 06, 2021 10:35

Reaching 8 MN/s should give an even better level, given that we already get these results at 1.4 MN/s.
I think that we are slowly discovering the power of NNUE.

Thanks again to all the programmers of this forum for the fun that you give to us engine users. God bless you all.

Friendly, Sidiki.

Krzysztof Grzelak · Post by **Krzysztof Grzelak** » Thu May 20, 2021 13:59

Excuse me, I would like to ask Bert about the Damage GUI.

Joost Buijs · Post by **Joost Buijs** » Sat Jul 03, 2021 15:42

With very short time controls (6 seconds per game) and a single core, NNUE networks remain somewhat weaker than N-tuple networks. This is basically a speed issue; a speed difference of a factor of 3 is difficult to overcome. However, with somewhat longer time controls (1 minute per game) and multiple cores, NNUE gets reasonably on par with the N-tuple programs.

My last NNUE network consists of 128 inputs and 256x32x32x1 neurons. Currently only 100 inputs are used: 2x 45 for men plus 2x 5 for the number of kings. In practice the number of kings for one color never seems to exceed 5; for safety reasons I've put a restriction on it. Positions with black to move are flipped, so the network always looks from white's point of view. The drawback is that I have to use 2 separate accumulators, one for each side to move; the positive effect is that it seems to give the network a notion of tempo.
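
To illustrate what an accumulator and its incremental update look like in general, here is a small sketch. It assumes the standard NNUE idea (first-layer sums kept up to date by subtracting and adding weight columns) and is not the code used in Ares; the feature indices, the random integer weights and the function names are made up for the example.

```python
import random

N_INPUTS, N_HIDDEN = 128, 256    # sizes taken from the post

def full_refresh(weights, bias, active):
    """Recompute the accumulator from scratch: bias + sum of active columns."""
    acc = list(bias)
    for f in active:
        col = weights[f]
        for i in range(N_HIDDEN):
            acc[i] += col[i]
    return acc

def move_feature(acc, weights, removed, added):
    """Incremental update in place: one input switches off, another on."""
    old, new = weights[removed], weights[added]
    for i in range(N_HIDDEN):
        acc[i] += new[i] - old[i]

rng = random.Random(1)
# weights[f] is the column of first-layer weights belonging to input f.
weights = [[rng.randint(-64, 64) for _ in range(N_HIDDEN)] for _ in range(N_INPUTS)]
bias = [rng.randint(-64, 64) for _ in range(N_HIDDEN)]

active = {3, 17, 42}                          # hypothetical active inputs
acc = full_refresh(weights, bias, active)
move_feature(acc, weights, removed=17, added=22)    # e.g. a man moves
active = (active - {17}) | {22}
assert acc == full_refresh(weights, bias, active)   # both routes agree
```

In a setup with flipped positions, one such accumulator would presumably be kept per side to move, each indexed from its own point of view.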

Quantization currently is 16 bits. The accumulator has to be 16 bits anyway because each of its 256 entries is a sum of up to 128 input weights. The input weights seem to be by far the most important weights; using 8 bits would make the resolution too low. I tried 8-bit quantization for the remainder of the network: it worked, but it wasn't much faster and the results weren't better either, so for simplicity I decided to leave it at 16 bits.
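
A quick worked check of why the accumulator cannot be 8 bits wide, reading the remark above as "each accumulator entry sums up to 128 input weights". The sketch assumes int8 input weights and ignores the bias; `bits_needed` is a helper made up for this example.

```python
def bits_needed(lo, hi):
    """Smallest two's-complement width that holds every integer in [lo, hi]."""
    bits = 1
    while not (-(1 << (bits - 1)) <= lo and hi <= (1 << (bits - 1)) - 1):
        bits += 1
    return bits

N_INPUTS = 128
W_MIN, W_MAX = -128, 127     # assumption: int8 input weights

worst_lo = N_INPUTS * W_MIN  # every input active with the most negative weight
worst_hi = N_INPUTS * W_MAX  # every input active with the most positive weight
print(worst_lo, worst_hi, bits_needed(worst_lo, worst_hi))
```

Even with 8-bit weights the worst-case sum needs 15 bits, so a 16-bit accumulator is the smallest standard integer type that fits. Whether 16-bit input weights can overflow it depends on their actual magnitudes, which the post does not give.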

The network has been trained on approx. 1.7 billion semi-quiet positions labeled with the scores of a 4-ply search. That search used an older version of the network, which had been trained on positions labeled with the outcome of 8 million games played at 66 msec per move. The training software uses libTorch v1.8.1 and CUDA v11.1. Using libTorch makes it easier to keep data preparation and training in a single C++ program.

Training on ~1.7 billion positions takes about 5 minutes per epoch on an RTX 3090 (not using sparsity). After ~30 epochs the network usually gets reasonably good; after ~85 epochs it doesn't improve anymore. The optimizer is AdamW with AMSGrad and weight decay; when the MSE gets below a certain threshold the program automatically switches to SGD with momentum. Somehow this always gives me the best results.
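
The optimizer switch can be illustrated with a toy example. This is a hand-written sketch in plain Python on a two-parameter linear fit, not the libTorch training code; the learning rates, the weight decay, the threshold `switch_below` and all function names are assumptions chosen for the example.

```python
import math

def mse_and_grads(w, b, xs, ys):
    """Mean squared error of y = w*x + b and its gradient w.r.t. (w, b)."""
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    mse = sum(e * e for e in errs) / n
    gw = sum(2 * e * x for e, x in zip(errs, xs)) / n
    gb = sum(2 * e for e in errs) / n
    return mse, [gw, gb]

def adamw_amsgrad_step(p, g, st, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, wd=1e-3):
    """One AdamW step with the AMSGrad variant (running max of v)."""
    st["t"] += 1
    t = st["t"]
    for i in range(len(p)):
        st["m"][i] = b1 * st["m"][i] + (1 - b1) * g[i]
        st["v"][i] = b2 * st["v"][i] + (1 - b2) * g[i] * g[i]
        st["vmax"][i] = max(st["vmax"][i], st["v"][i])    # AMSGrad
        p[i] -= lr * wd * p[i]                            # decoupled decay
        m_hat = st["m"][i] / (1 - b1 ** t)
        v_hat = st["vmax"][i] / (1 - b2 ** t)
        p[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)

def sgd_momentum_step(p, g, buf, lr=0.05, mu=0.9):
    """One step of SGD with classical momentum."""
    for i in range(len(p)):
        buf[i] = mu * buf[i] + g[i]
        p[i] -= lr * buf[i]

def train(xs, ys, switch_below=1e-2, epochs=3000):
    p = [0.0, 0.0]
    st = {"t": 0, "m": [0.0, 0.0], "v": [0.0, 0.0], "vmax": [0.0, 0.0]}
    buf = [0.0, 0.0]
    switched_at = None
    for epoch in range(epochs):
        mse, g = mse_and_grads(p[0], p[1], xs, ys)
        if switched_at is None and mse < switch_below:
            switched_at = epoch              # from now on use SGD only
        if switched_at is None:
            adamw_amsgrad_step(p, g, st)
        else:
            sgd_momentum_step(p, g, buf)
    return p, switched_at

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [3 * x + 1 for x in xs]
params, switched_at = train(xs, ys)
print(params, switched_at)
```

The switch here is one-way and made per epoch; whether the real program checks per batch or per epoch, and which threshold it uses, is not stated in the post.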

A few days ago I had the program play a match against Kingsrow 1.62 with the 6P EGDB (16 vs 16 threads on my 10980XE), 1 minute for 90 moves, with an equal result: Kingsrow 1.62 vs Ares v1.2: 1 wins, 1 losses, 156 draws, 0 unknowns. Maybe a lucky shot, but at least it shows that the NNUE network isn't much worse.

Sidiki · Post by **Sidiki** » Mon Jul 05, 2021 20:54

Hi Joost,

This is a great improvement, good work. It shows that NNUE has a chance of earning a place in the draughts world.

Joost Buijs · Post by **Joost Buijs** » Tue Jul 06, 2021 08:32

It's certainly interesting. However, I don't think that it will improve the level of play. In my opinion programs like Kingsrow and Scan are already at the highest level reachable.

Sidiki · Post by **Sidiki** » Tue Jul 06, 2021 19:37

That's true, Kingsrow and Scan are at the top level. They have come a long way to reach this level, and they are based on another programming style.

Seeing the results of the NNUE DXP games against the best engines, I'm just saying that with another programming style it's possible for Ares, Damage and the others to reach such a level.

Congratulations and many thanks again to all programmers, you Joost, Ed, Bert, Fabien, Rein and the others, for taking the time to raise the level of draughts and give us fun. GOD BLESS YOU ALL!!!!

Friendly, Sidiki.

Joost Buijs · Post by **Joost Buijs** » Thu Jul 08, 2021 09:45

By clipping the network weights and biases to the range -0.99 to +0.99 after each batch and retraining the network (which took about 1 day), the network quality with int8 quantization improved quite a bit. Not clipping the biases might be better, because I use 32 bits for these in the inference code anyway.
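
A minimal sketch of why clipping helps int8 quantization: once every weight lies strictly inside (-1, 1), a fixed scale maps it into the int8 range without saturation. The scale of 127, the example weights and the function names are assumptions for illustration; the post does not say which scale is used.

```python
def clip(values, lo=-0.99, hi=0.99):
    """Clamp each weight into [lo, hi], as described for after each batch."""
    return [min(hi, max(lo, v)) for v in values]

def quantize_int8(values, scale=127):
    """Map floats to integers with a fixed scale (assumed value)."""
    q = [int(round(v * scale)) for v in values]
    if not all(-128 <= x <= 127 for x in q):
        raise ValueError("value outside the int8 range; clip first")
    return q

def dequantize(q, scale=127):
    """Convert quantized integers back to approximate float weights."""
    return [x / scale for x in q]

weights = [1.7, -0.25, 0.5, -3.2, 0.99]      # made-up example weights
clipped = clip(weights)
q = quantize_int8(clipped)
print(clipped)
print(q)
print(dequantize(q))
```

Without the clipping step a weight such as 1.7 would map to 216 and no longer fit in 8 bits, so it would have to be saturated at inference time, a distortion the network never saw during training.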

This morning I added 8-bit dot-product SIMD code, which improved the speed by 20 to 40%, depending a little on the position. Because the network remains slow I use hashing in quiescence (like most chess programs do); this adds considerable overhead speed-wise, but the tree remains somewhat smaller.

If everything goes as expected I will have the final version of the program ready within a month or so. This will give Krzysztof the opportunity to test it several weeks before the tournament starts.

Sidiki · Post by **Sidiki** » Thu Jul 08, 2021 17:44

Hi Joost

This is already good news. We hope to see more and more strong engines.

Friendly, Sidiki

Joost Buijs · Post by **Joost Buijs** » Mon Jul 12, 2021 09:13

Hi Sidiki,

I have no clue how strong the program will play. After changing to int8 quantization the speed went up, but the quality of the evaluation went down. Since this is more or less a training problem, I want to leave it as it is and try to improve the training algorithm.

Multi-threaded (16 threads) with a time control of 1 minute per game, the level currently fluctuates between -18 and 0 Elo vs Kingsrow 1.62. I have never tried playing against other programs, so I have no idea how it will perform against a variety of opponents.

Since I've been experimenting with different network sizes and architectures and didn't want to change the code each time, the current version has no incremental update for the accumulator and does only 2.5 MN/s on a single core. Of course I would like to fix this before Krzysztof's tourney starts, but vacation is coming and I don't know whether I can find time for it.

With 384 neurons in the first layer the network clearly gets better but also somewhat slower; it is all a trade-off between quality and speed. It takes an enormous amount of time to check everything; this is not something you can do in a couple of days or weeks. It took at least 4 years before pattern evaluation reached its current level, so it is likely that there is still room for improvement for neural nets as well.

Sidiki · Post by **Sidiki** » Mon Jul 12, 2021 19:14

Hi Joost,

This is already very good. You know, a program is like a painting: it can never be perfect. The author must add or remove something all the time, and this is a question of time.

We will be glad to accept all the updates, if you give us this chance.
Scan has had many versions since 1.0.

I don't know if you know that the latest Kingsrow version has been 1.63 since July 2nd.

Just to say that we stand with each of you who allow us to have the updates.

A program will never be perfect.

Thanks to you and the others.

Friendly, Sidiki.

Joost Buijs · Post by **Joost Buijs** » Tue Jul 13, 2021 06:24

Hi Sidiki,

As soon as I'm satisfied with the performance of the program you will get a copy, as promised. The current version is slower than it could be; this is something I want to address first. Another thing is that it needs fast AVX2: not all computers have this, and some, like the AMD Zen 1, have a slow implementation.

Thanks for letting me know about Kingsrow 1.63. I don't look at this forum very often and missed it completely.

Like you said, a program will never be perfect.

World Draughts Forum
