NNUE

Discussion about the development of draughts in the era of computers and the Internet.
Sidiki
Posts: 315
Joined: Thu Jan 15, 2015 16:28
Real name: Coulibaly Sidiki

Re: NNUE

Post by Sidiki » Sun Jan 03, 2021 03:30

BertTuyt wrote:
Sat Jan 02, 2021 12:26
Sidiki, here they are.

There were 2 unknowns, which were both a win for Kingsrow (game 48 and game 84), for which I corrected the scores in the file.
Result: 22W for Kingsrow, 136 draws, Elo difference 49.

Bert
Thanks very much, Bert.

Sidiki

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 03, 2021 11:09

Sidiki, you are welcome.

As this is the last day of my Xmas-holiday, from now on progress (if any) will be much slower.
Sorry for that.

Bert

Krzysztof Grzelak
Posts: 1315
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: NNUE

Post by Krzysztof Grzelak » Sun Jan 03, 2021 16:13

BertTuyt wrote:
Tue Dec 29, 2020 23:47
With some optimization in the AVX2 (256-bit) code (thanks also to Joost), I was able to increase the Damage search speed with NNUE to 6.0 MN/sec.
This yielded a small improvement in strength.
Unfortunately my processor does not support AVX-512, which could further improve SIMD performance.

Recent DXP match against Kingsrow (with the same settings), from the perspective of Kingsrow: 31W, 127D, which yields an Elo difference of 69.

So small steps....
Will keep you posted.

Bert
Please take a look at AVX-512.

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 03, 2021 18:01

Krzysztof,

I did, and I'm certain this will have a positive impact on strength.
But unfortunately my (current) processor does not support AVX-512.

Bert

Krzysztof Grzelak
Posts: 1315
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: NNUE

Post by Krzysztof Grzelak » Sun Jan 03, 2021 18:21

BertTuyt wrote:
Sun Jan 03, 2021 18:01
Krzysztof,

I did, and I'm certain this will have a positive impact on strength.
But unfortunately my (current) processor does not support AVX-512.

Bert
I apologise, I forgot about it.

Sidiki
Posts: 315
Joined: Thu Jan 15, 2015 16:28
Real name: Coulibaly Sidiki

Re: NNUE

Post by Sidiki » Thu Jan 07, 2021 07:28

BertTuyt wrote:
Sun Jan 03, 2021 11:09
Sidiki, you are welcome.

As this is the last day of my Xmas-holiday, from now on progress (if any) will be much slower.
Sorry for that.

Bert
Hi Bert,

Happy New Year to you and the other programmers.

Understood, we will wait for further progress from Damage.
What explains this gain in strength? The loss rate has decreased since the introduction of NNUE.

Is Damage faster, or has it learned better positions?

I looked at all 158 games and saw great progress.

Thanks

Sidiki.

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Tue Jan 12, 2021 20:49

In a previous post I shared some results with NNUE.
It is obvious that NNUE, compared with the Scan pattern method, has a negative impact on search speed,
and that often, after a more-or-less equal game, Damage is out-searched in the late middle-game.

So far I was not able to test further options with AVX-512 VNNI, as I don't have an up-to-date processor for this purpose.
To evaluate the effect of a faster search speed, I played 2 additional matches in which Damage had 2x and 3x the available time for a game (65 moves).

All games were played with 1 core, and both Damage and Kingsrow had only the 6p DB.
Also, for an apples-to-apples comparison, KR did not use an opening book.

The network Damage used was trained on 80M positions (excluding EGDB positions), with 191 inputs and NN architecture 256x32x32x1.
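
As an illustration of this architecture, below is a minimal plain-float sketch of such a forward pass. The names, layout and ReLU activation are illustrative assumptions, not the actual Damage code (which uses int16 weights and an incrementally updated first layer):

Code:

#include <algorithm>

// Illustrative 191:256x32x32x1 network in plain float. The weight and
// bias arrays are placeholders; allocate Net on the heap in practice.
struct Net {
    float w1[191][256], b1[256];
    float w2[256][32],  b2[32];
    float w3[32][32],   b3[32];
    float w4[32],       b4;
};

static float relu(float x) { return std::max(0.0f, x); }

float forward(const Net& net, const float in[191]) {
    float a1[256], a2[32], a3[32];
    for (int j = 0; j < 256; ++j) {            // first (widest) layer
        float s = net.b1[j];
        for (int i = 0; i < 191; ++i) s += in[i] * net.w1[i][j];
        a1[j] = relu(s);
    }
    for (int j = 0; j < 32; ++j) {             // second layer
        float s = net.b2[j];
        for (int i = 0; i < 256; ++i) s += a1[i] * net.w2[i][j];
        a2[j] = relu(s);
    }
    for (int j = 0; j < 32; ++j) {             // third layer
        float s = net.b3[j];
        for (int i = 0; i < 32; ++i) s += a2[i] * net.w3[i][j];
        a3[j] = relu(s);
    }
    float s = net.b4;                          // single output neuron
    for (int i = 0; i < 32; ++i) s += a3[i] * net.w4[i];
    return s;                                  // evaluation score
}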

Match results (perspective KR):
2x think time (so 2 min/game, KR 1 min/game): 10W, 2L, 146D, Elo = 18
3x think time (so 3 min/game, KR 1 min/game): 3W, 0L, 155D, Elo = 7
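
These Elo differences follow from the match score (draws counting half) via the standard logistic model; a minimal sketch of the calculation:

Code:

#include <cmath>
#include <cstdio>

// Elo difference implied by a match result; draws count as half a point.
double elo_diff(int wins, int losses, int draws) {
    double games = wins + losses + draws;
    double score = (wins + 0.5 * draws) / games;      // fractional score
    return 400.0 * std::log10(score / (1.0 - score)); // logistic model
}

int main() {
    std::printf("%.0f\n", elo_diff(10, 2, 146)); // ~18 (2x think time match)
    std::printf("%.0f\n", elo_diff(3, 0, 155));  // ~7  (3x think time match)
}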

This clearly shows the potential: a 3x speed increase for NNUE is unlikely in the coming years, but 2x seems doable.

Another remark: we are just at the beginning, so further optimization of the NN (different architectures), better training, and multiple NNs are all things which still need to be studied, but which will surely reveal further improvements.

I want to give everyone the opportunity to experiment with NNUE, as these days (most likely) only Joost and the undersigned are working on this (at least for 10x10 international draughts).
For this reason I want to prepare a proof of concept in which I implement NNUE in the Scan code, based upon the principles I now use (and still some code traces from Jonathan, so thanks for that), so everyone can have a better look at it and does not need to completely reinvent the wheel.

Not sure, but if all works out well I hope to provide these sources during the forthcoming weekend.

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Wed Jan 13, 2021 09:11

Inspired by Bert's enthusiasm about NNs and ML in general, and by the good results I've got with an NNUE-style NN in my chess engine, I started 2 weeks ago to implement a fully connected NN in the draughts engine I'm sometimes working on. Contrary to Bert's implementation with int16 SIMD code, I decided to start with float32 to avoid all the hassles of quantization, just as a proof of principle. I still have some 22M draughts games from previous experiments, which translate into a database of 1.4B unique positions I could use for training. In practice 1.4B positions are difficult to handle (one needs vast amounts of memory for this), so I used only 240M.

The network structure is borrowed from the Shogi and chess NNUE implementations, with a smaller input layer: 192:256x32x32x1. Contrary to Bert, I use 2 separate inputs for the side to move, also to make the size of the input layer a multiple of 16.

First of all, using float32 for inference is slow; Clang is very good at vectorizing float32 multiply-add loops with fused SIMD instructions, but it still remains 32-bit floating point. My implementation also still lacks incremental update of the first network layer, which makes it even slower: for each evaluation I have to calculate 256 inner products with a vector length of 192 for the first layer alone. As an example: on my Core i9-10980XE the engine does 9 mnps with NN evaluation and 180 mnps with pattern evaluation (the NN is 20 times slower). From experience with my chess engine I know that adding incremental update of the first layer and using int16 SIMD instructions makes the NPS around 7.5 times higher, and with int8 SIMD code almost 15 times higher. I don't know if this translates one-to-one to the draughts engine, but I expect at least a 10 times higher NPS after I've implemented incremental update, int8 quantization and fused SIMD code for the calculation of the inner products.
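
To make the incremental-update idea concrete: with 0/1 inputs, only the weight columns of the features that change on a move have to be added to or subtracted from a persistent accumulator, instead of recomputing all 256 inner products. A minimal sketch with hypothetical names and data layout (shown in int16, as in the planned quantized version):

Code:

#include <cstdint>

// First-layer accumulator: pre-activation sums of the 256 neurons.
struct Accumulator {
    int16_t acc[256];
};

// Weights stored feature-major, w1[feature][neuron], so one feature's
// column is contiguous. 192 features as in the input layer above.
extern int16_t w1[192][256];

// When a move switches feature f on or off, update the accumulator
// in O(256) instead of recomputing the full 192x256 first layer.
void add_feature(Accumulator& a, int f) {
    for (int j = 0; j < 256; ++j) a.acc[j] += w1[f][j];
}

void remove_feature(Accumulator& a, int f) {
    for (int j = 0; j < 256; ++j) a.acc[j] -= w1[f][j];
}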

The first result, using Kingsrow as a baseline, was not very good (-74 Elo); by relabeling the 240M training positions with the score of a 4-ply full-width search I was able to get the performance up to -32 Elo. The games it lost were always lost in the end-game, not in the mid-game; the network clearly lacked some end-game knowledge. The culprit was that the set of positions I used for training didn't contain positions with fewer than 6 pieces. After generating a new set of 20M (fully random) games, extracting all positions with fewer than 6 pieces (about 50M), labeling them with a 4-ply search and adding them to the position database, I was able to get the Elo up to -11. This is where it currently stands.

Things I still have to do: implementing incremental update of the first layer (this is straightforward), adding int16 quantization (easy) or int8 quantization (difficult to get right), and of course SIMD code. On my i9-10980XE I can use AVX-512 VNNI to calculate the inner products, e.g. for int16:

Code:

#include <immintrin.h>
#include <cstdint>

// int16 dot product over one 512-bit vector (32 elements) with AVX-512 VNNI:
// vpdpwssd multiplies int16 pairs, adds adjacent products and accumulates
// into the int32 lanes of 'offset'; the lanes are then reduced to one sum.
int vnni_dot_product(int16_t* w1, int16_t* w2, int32_t* offset)
{
    __m512i v1 = _mm512_loadu_si512(w1);
    __m512i v2 = _mm512_loadu_si512(w2);
    __m512i v3 = _mm512_loadu_si512(offset);
    __m512i vresult = _mm512_dpwssd_epi32(v3, v1, v2);
    return _mm512_reduce_add_epi32(vresult);
}
Unfortunately this doesn't work on AMD, so I have to stick with AVX2 for the time being.
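
For AVX2 the same reduction can be written with _mm256_madd_epi16 (vpmaddwd); an untested sketch, processing 16 int16 elements per call:

Code:

#include <immintrin.h>
#include <cstdint>

// AVX2 int16 dot product over 16 elements: vpmaddwd multiplies int16
// pairs and sums adjacent products into 8 int32 lanes, then we reduce.
int avx2_dot_product(const int16_t* w1, const int16_t* w2)
{
    __m256i v1 = _mm256_loadu_si256((const __m256i*)w1);
    __m256i v2 = _mm256_loadu_si256((const __m256i*)w2);
    __m256i prod = _mm256_madd_epi16(v1, v2);          // 8 x int32 partial sums
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(prod),
                                _mm256_extracti128_si256(prod, 1));
    sum = _mm_hadd_epi32(sum, sum);                    // 4 -> 2 lanes
    sum = _mm_hadd_epi32(sum, sum);                    // 2 -> 1 lane
    return _mm_cvtsi128_si32(sum);
}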

Kingsrow came in very handy for all the tests I did; without it, things would not have been so easy. Thanks to Ed!

Edit: As a remark I have to add that KR played with the 6p EGDB while my engine played without an EGDB. Using an EGDB can also compensate for the lack of endgame knowledge in the network, as Bert's results show.

Attached are the games of the latest test-match the engine played against Kingsrow.

Joost
Attachments
dxpgames.pdn
Latest test result
(155.69 KiB) Downloaded 213 times

Ed Gilbert
Posts: 854
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: NNUE

Post by Ed Gilbert » Fri Jan 15, 2021 12:51

Joost, very interesting, thanks for posting this update. Your results with floating point and non-incremental eval are already very good.

It wasn't clear to me, are your training positions labeled as WLD, or with an eval score from a quick search? Also, what time controls did you use for the matches used to measure performance?

-- Ed

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Fri Jan 15, 2021 15:41

After some challenges, I was able to incorporate NNUE in Scan.
The first test seemed to work.

Game result with the usual parameters (1 min/65 moves per game, 6p DB each, no book, 1 core): 10W KR, 148D, which translates into an Elo difference of 22.
The NN was the previous one: 191:256x32x32x1.
In my case I can only use AVX2 (256-bit), as I have an older processor, and all weights are int16.

I will also check parallel performance, to make sure that this works too. By design the implementation should be thread-safe, but the proof of the pudding is in the eating. With a low nps I assume NNUE will benefit more from the additional cores, but the actual match will provide the final answer.

Next to that, I also want to test with double think time for Scan NNUE, to compensate for the missing AVX-512 VNNI and to get an idea of the network's potential. This test was the easiest to do (as I trust the 1-core implementation, so I don't need to verify that eval scores stay constant).
The test (DXP match) is so far so good, with 50 games and 50 draws. But as the football saying goes, in the end KR will win :D .

If all works well, I hope to share the Scan sources with NNUE this weekend (I need to improve the readability of some changes).
Keep in mind that this work would have been impossible without the base from Jonathan, and the support and exchange of ideas with Joost.
I really hope that others will embark on the NNUE voyage, and share results and new insights in this forum.

I'm not 100% sure that NNUE will bring a revolution like the pattern-based evals did, but it is really interesting, and we are only starting. Next to that, I really like this black-box approach where the NN has no pre-defined features at all, unlike the current pattern-based evals.

As I personally believe that we are close to the performance optimum, I don't expect NNUE to surpass the current state-of-the-art evals by a huge margin (if any margin); I would already applaud on-par behavior.

On the other hand, I'm sure we will see much progress in CPU NN hardware acceleration (like AVX-512 VNNI), similar to the way 64-bit processors and progress in instruction sets enabled efficient bitboard implementations.

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Fri Jan 15, 2021 17:39

Ed Gilbert wrote:
Fri Jan 15, 2021 12:51
Joost, very interesting, thanks for posting this update. Your results with floating point and non-incremental eval are already very good.

It wasn't clear to me, are your training positions labeled as WLD, or with an eval score from a quick search? Also, what time controls did you use for the matches used to measure performance?
Ed, I used positions labeled with the score of a 4-ply full-width search with pattern evaluation (no hash table, no extra pruning, just plain alpha-beta).
For the match that I posted, the time control was 1 minute for 90 moves. I have to add that with shorter time controls the performance gets a lot worse.
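
For concreteness, such a labeler boils down to a plain fixed-depth negamax; a minimal sketch with a hypothetical Position/eval interface (not the real engine code):

Code:

#include <vector>
#include <algorithm>

constexpr int INF = 30000;

// Hypothetical engine interface; the real types differ.
struct Position;
int pattern_eval(const Position&);                 // static pattern evaluation
std::vector<Position> successors(const Position&); // positions after each legal move

// Plain fixed-depth alpha-beta (negamax form), no hash table, no
// extra pruning; used only to label training positions.
int label_search(const Position& pos, int depth, int alpha, int beta) {
    if (depth == 0) return pattern_eval(pos);
    auto moves = successors(pos);
    if (moves.empty()) return -INF + 1;            // no legal moves: loss in draughts
    for (const Position& next : moves) {
        int score = -label_search(next, depth - 1, -beta, -alpha);
        alpha = std::max(alpha, score);
        if (alpha >= beta) break;                  // beta cutoff
    }
    return alpha;
}

// Label for training: label_search(pos, 4, -INF, +INF)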

It's very difficult to get the training set right: which positions to use and which not. Too much randomness seems to saturate the network. Yesterday I tried a different data set, and the result was a total disaster.

Edit: I suspect it is not optimal to label the positions with evaluation scores; WLD would probably work better (like I do with my chess engine). The problem is that I don't have an infinite amount of computing resources; labeling e.g. 500M positions with WLD by playing out a game from each position would take weeks, even with short time controls of, let's say, 66 msec per move.

Joost

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Fri Jan 15, 2021 20:40

The last match Kingsrow - Scan NNUE, with the latter (= Scan) at 2 min/game and KR at 1 min/game, ended (from the perspective of KR) with 7W, 0L, 151D, which yields an Elo difference of 15.

Bert

Rein Halbersma
Posts: 1720
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: NNUE

Post by Rein Halbersma » Fri Jan 15, 2021 22:00

Joost Buijs wrote:
Wed Jan 13, 2021 09:11
In practice 1.4B positions are difficult to handle (one needs vast amounts of memory for this), so I used only 240M.
I have gotten distracted with many other interesting machine learning projects (non-draughts) that kind of had to happen right now before anything else. But I will soon start working again on my Keras/Tensorflow pipeline for Scan-like pattern evals.

One improvement that is possible for any eval training pipeline (NNUE, patterns, whatever) using Keras/TensorFlow (or PyTorch, for that matter) is to stream the data from disk to the optimizer. That way, you need much less RAM than the entire database. E.g. for Kingsrow, Ed supplied me with a ~10 GB file. Loading that into memory and feeding it to Keras expanded it temporarily to ~48 GB of RAM during optimization. When done from disk in batches, it should be configurable to get below ~16 GB without much speed loss, for *arbitrarily* large files on disk.
Last edited by Rein Halbersma on Fri Jan 15, 2021 22:02, edited 1 time in total.

Rein Halbersma
Posts: 1720
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: NNUE

Post by Rein Halbersma » Fri Jan 15, 2021 22:01

[hit quote instead of edit by accident]

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Fri Jan 15, 2021 23:29

Rein Halbersma wrote:
Fri Jan 15, 2021 22:00
Joost Buijs wrote:
Wed Jan 13, 2021 09:11
In practice 1.4B positions are difficult to handle (one needs vast amounts of memory for this), so I used only 240M.
I have gotten distracted with many other interesting machine learning projects (non-draughts) that kind of had to happen right now before anything else. But I will soon start working again on my Keras/Tensorflow pipeline for Scan-like pattern evals.

One improvement that is possible for any eval training pipeline (NNUE, patterns, whatever) using Keras/TensorFlow (or PyTorch, for that matter) is to stream the data from disk to the optimizer. That way, you need much less RAM than the entire database. E.g. for Kingsrow, Ed supplied me with a ~10 GB file. Loading that into memory and feeding it to Keras expanded it temporarily to ~48 GB of RAM during optimization. When done from disk in batches, it should be configurable to get below ~16 GB without much speed loss, for *arbitrarily* large files on disk.
Maybe a coincidence: I just told Bert this afternoon that I'm busy modifying my PyTorch DataLoader in such a way that it can read chunks of data from disk. On my AMD PC with 128 GB RAM I can load at most 625M positions in memory, because each position is 192 bytes. There is a way to store positions as 192 bits and translate them on the fly to 192 bytes when needed by the training pipeline, but in Python this will be slow. Reading chunks from disk seems easier, and with an SSD it will be fast enough.
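
For illustration, that on-the-fly translation is just unpacking 24 packed bytes (192 bits) into 192 feature bytes; a C++ sketch of the inner loop (a per-position Python loop doing the same is what would be slow):

Code:

#include <cstdint>

// Unpack one position stored as 192 bits (24 bytes) into 192 feature
// bytes (0 or 1), the form the training pipeline consumes.
void unpack_position(const uint8_t packed[24], uint8_t features[192]) {
    for (int i = 0; i < 192; ++i)
        features[i] = (packed[i >> 3] >> (i & 7)) & 1;
}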
