NNUE

Discussion about development of draughts in the time of computer and Internet.
Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Wed Oct 12, 2022 10:24

Another possibility I'm thinking about is to use small overlapping patterns, e.g. 3 bits or 5 bits, and to combine the indices by means of a NN. With one-hot encoding you will get a huge number of inputs, but with NNUE this is not a big problem. The incremental update will be slower because for most squares more than 1 pattern has to be updated; however, the number of neurons in the first layer can probably be smaller, and this could compensate for the loss of speed.

Calculating the dot-products and copying the accumulator (which for each update has to be done twice due to ReLU) takes a lot of time. If we can decrease the number of neurons in the first layer by a factor of 4 (my guess), this could totally compensate for the loss of speed caused by the increased number of updates.
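
For reference, a minimal NumPy sketch of the kind of incremental accumulator update being discussed here, with made-up sizes (feature count and accumulator width are illustrative assumptions). With overlapping patterns a single square change touches several feature indices, so both the removed and added lists can hold more than one entry, but the per-update cost still scales with the accumulator width, which is why a narrower first layer can pay for the extra updates:

```python
import numpy as np

# Illustrative sizes only: total one-hot features over all pattern indices
# and first-layer (accumulator) width are assumptions, not the real values.
NUM_FEATURES = 8 * 1024
HIDDEN = 256

W1 = np.random.randn(NUM_FEATURES, HIDDEN).astype(np.float32)  # input weights
b1 = np.zeros(HIDDEN, dtype=np.float32)                        # first-layer bias

def refresh(active_features):
    """Full rebuild of the accumulator from scratch (e.g. at the root)."""
    acc = b1.copy()
    for f in active_features:
        acc += W1[f]
    return acc

def update(acc, removed, added):
    """Incremental update: with overlapping patterns a single square change
    removes/adds several feature indices, so both lists may hold >1 entry."""
    acc = acc.copy()              # keep the parent's accumulator intact
    for f in removed:
        acc -= W1[f]
    for f in added:
        acc += W1[f]
    return acc

def first_layer_output(acc):
    return np.maximum(acc, 0.0)   # ReLU applied on top of the raw accumulator
```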

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Thu Dec 29, 2022 13:21

Between Christmas and New Year I experimented a little with NNUE and endgame databases.

In this case I focused on the 4m - 4m positions. With random games around 4M positions were generated, all with white to move and with neither white nor black having a capture. With the endgame databases from Ed, all 4M positions were WDL-labeled.
Training was done with Python, and the network had 90 inputs (45 white, 45 black), 256 neurons in the 1st layer, 32 in the 2nd, 32 in the 3rd, and 3 in the 4th layer. The last layer used softmax activation, and the 3 outputs indicated the probability of Loss, Win or Draw.
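
A minimal PyTorch sketch of a network with this shape; ReLU between the hidden layers is an assumption, since the post only specifies the softmax output:

```python
import torch
import torch.nn as nn

class WDLNet(nn.Module):
    """90 -> 256 -> 32 -> 32 -> 3 network as described above.
    ReLU between the hidden layers is an assumption; only the softmax
    output layer is specified in the post."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(90, 256), nn.ReLU(),
            nn.Linear(256, 32), nn.ReLU(),
            nn.Linear(32, 32), nn.ReLU(),
            nn.Linear(32, 3),
            nn.Softmax(dim=1),   # probabilities for Loss, Win, Draw
        )

    def forward(self, x):
        return self.layers(x)

net = WDLNet()
batch = torch.zeros(16, 90)      # 16 dummy positions (one-hot man placement)
print(net(batch).shape)          # torch.Size([16, 3])
```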

I used the total training set and did not separate it into training and test sets.
The accuracy in the end was 92%.

I did not study all specific errors yet (which, due to the large numbers, is also difficult).
But one case I already examined a little further.
In total 473 positions were found where the database (from Ed) indicated a win, but the neural network thought it was a loss (for white).
As far as I checked, this was due to a small combination for white, so apparently when using these tables one needs to do a proper Qsearch() first (see the sketch below).
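
A minimal sketch of that idea, with hypothetical helper names (generate_captures(), make(), unmake(), wdl_eval()): resolve the forced captures first and only ask the network for a verdict once the position is quiet, so a pending combination cannot distort the label.

```python
# Hypothetical helpers: generate_captures(), make(), unmake(), wdl_eval().
# wdl_eval() is assumed to return -1/0/+1 from the side to move's perspective.
def qsearch_wdl(pos):
    captures = generate_captures(pos)        # draughts: captures are forced
    if not captures:
        return wdl_eval(pos)                 # quiet position: probe the network
    best = -1                                # -1 = loss, 0 = draw, +1 = win
    for move in captures:
        make(pos, move)
        best = max(best, -qsearch_wdl(pos))  # negamax over the forced captures
        unmake(pos, move)
    return best
```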

What I also need to verify is to what extent this specially trained NN is much better compared with the "generally" trained NN.
If this is the case, one could use (in the future) dedicated NNs for specific position clusters.

I'm also planning to do a similar exercise for 5m - 5m, 6m - 6m, and 7m - 7m, for which databases do not exist.
For these positions labeling will be more challenging, but not impossible.

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Thu Apr 27, 2023 12:13

The difficulty with NNUE when using different networks for different position clusters is how to switch between them. Switching at the root of the search is easy and not very time consuming, but switching in the leaves (what you actually would like to do) is very time consuming, because you have to reinitialize your NNUE accumulator each time you make a switch (which of course happens numerous times during the search). The other option is to permanently keep 2 accumulators, one for 40 to 15 pieces and one for 14 to 2 pieces, but this causes a slowdown too. When this entire process is faster than probing an EGDB (too large to fit in memory) it could mean an improvement. For positions for which no EGDB exists (e.g. 9 to 14 pieces) it could mean an improvement too. On the other hand, I don't expect much improvement anymore: on a single core 99% of the games between strong programs are already drawn, and on multiple cores this figure is even higher. As usual, the proof of the pudding is in the eating.
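
A minimal NumPy sketch of the "two permanent accumulators" option (sizes and the piece-count boundary are taken as illustrative assumptions): both first layers are updated on every make/unmake, and the piece count decides which accumulator feeds the rest of the evaluation, so no expensive refresh is needed when crossing the boundary.

```python
import numpy as np

HIDDEN = 256
NUM_FEATURES = 8 * 1024          # illustrative one-hot feature count

# Two independent first-layer weight sets: one network for 40..15 pieces,
# one for 14..2 pieces (both shapes are assumptions for the sketch).
W_mid = np.random.randn(NUM_FEATURES, HIDDEN).astype(np.float32)
W_end = np.random.randn(NUM_FEATURES, HIDDEN).astype(np.float32)

class DualAccumulator:
    def __init__(self):
        self.acc_mid = np.zeros(HIDDEN, dtype=np.float32)
        self.acc_end = np.zeros(HIDDEN, dtype=np.float32)

    def update(self, removed, added):
        # Keep both accumulators in sync on every move; this is the
        # permanent slowdown mentioned above, paid instead of a refresh.
        for f in removed:
            self.acc_mid -= W_mid[f]
            self.acc_end -= W_end[f]
        for f in added:
            self.acc_mid += W_mid[f]
            self.acc_end += W_end[f]

    def active(self, piece_count):
        # Select the accumulator belonging to the network for this phase.
        return self.acc_end if piece_count <= 14 else self.acc_mid
```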

Another problem is how to get enough positions to train the network on. An EGDB generator can output all relevant positions with their corresponding score (this will be a huge number), but for the remaining positions it is quite difficult: when I select all positions with 9 to 14 pieces from my position database I get roughly 8M unique positions, which is in my opinion not enough to train a strong network on. Generating random positions doesn't work either; in my experience training positions have to have some relevance to the games you play. To get a larger number of unique positions I have to generate games with more spread, but I always find that training a network on positions taken from games with too much spread also makes it play weaker.

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Tue May 23, 2023 13:58

Joost,

you are right, creating NNUE endgame databases is not without its challenges. Here are some answers:

* In my case I get the positions from actual game-play. For example, I run a game (or match) against another program, and I collect all the endgame positions (in the current test 5 man x 5 man) which are "found" during the search. For a game of between 5 and 10 minutes, I find (on average) 2M - 3M positions.
* For this purpose I use a second table, and the hash-value is used as an index into this table.
* The score of the search (so an exact value, or a lower/upper bound) is also stored. Here I need to study further, as I'm not yet able to tell the difference between a database draw and a zero score.
* Some positions are found frequently, and the final score is only available when the search depth is sufficient.
* In some cases the score is very high or low, but did not pass the win or loss threshold. Maybe for these positions I need to start a post-search to determine the final outcome.
* So making sure that the labels for the positions are valid remains a challenge.
* Mixing with actual scores is (I guess) no real problem, as we already do that when we include endgame databases. In the current design the NN has 3 (softmax activation) outputs (win, draw, loss), and the one with the highest value is chosen.
* It would be possible to derive a score from the 3 (W, D, L) outputs (see the sketch below this list), but I'm not sure this is better.
* I intend to completely re-calculate the NN each time; as the number of inputs is limited (for 5x5 this is 10), this might not be a huge issue. Calculating the db-score with compression (or when you first need to load a db-block into memory) is also time-consuming, so maybe the time penalty is acceptable.
* Finally, I also don't expect a huge Elo gain, or any Elo gain at all. But anyway, it is fun :D
* And the main paradox: I'm enjoying retirement, but I seem to have less time for programming these days. I guess this sounds familiar.....
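
As a follow-up to the point about deriving a score from the three outputs, a minimal sketch of one common way to collapse (W, D, L) probabilities into a single scalar evaluation; the scaling constant is purely an illustrative assumption:

```python
def wdl_to_score(p_win, p_draw, p_loss, scale=300):
    """Collapse softmax WDL probabilities into a single scalar score.
    The expected game result lies in [-1, +1]; 'scale' maps it onto a
    conventional evaluation range and is an arbitrary illustrative choice."""
    expected = p_win - p_loss            # draws contribute 0
    return int(round(scale * expected))

print(wdl_to_score(0.70, 0.25, 0.05))    # clearly good for the side to move
print(wdl_to_score(0.10, 0.80, 0.10))    # about 0: likely draw
```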

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Sun May 28, 2023 09:05

Bert,

Your method of collecting positions from game-play is not very different from what I am doing. Instead of using a linear hash-table I use a binary tree with the 64-bit Zobrist-hash of the position as key. The problem with this is that it gets rather slow when the tree gets very large (by very large I mean > 64 GB). In your case you have to keep a list of positions for every index; I have no idea which system will work faster in practice.
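
A minimal sketch of the collection step in Python, keyed on the 64-bit Zobrist hash; the stored fields and the helper names in the comment are assumptions for illustration:

```python
# Hypothetical collector: one entry per unique position, keyed on its
# 64-bit Zobrist hash, keeping the deepest (most reliable) label seen so far.
positions = {}

def record(zobrist_key, wdl_label, depth):
    entry = positions.get(zobrist_key)
    if entry is None or depth > entry["depth"]:
        positions[zobrist_key] = {"wdl": wdl_label, "depth": depth}

# During (quiescence) search, whenever a qualifying endgame position is seen:
# record(hash_of(pos), label_from_search(pos), current_depth)
```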

I always use positions that appear during games with normal time-control, when I use positions extracted from games with very fast time-control the results are usually a lot worse, mainly because the labels are less accurate with short thinking times.

I only collect positions from the tips of the quiescence-search and never from the regular search-tree. With the 6P EGDB that I currently use it doesn't take a very deep search to perfectly label positions with <= 14 discs, there are some exceptions though, but I assume you are able to label most positions with <= 10 discs within a reasonable amount of time. Labeling positions with > 14 discs accurately with WDL is a time-consuming process; it will take ages, even on a large computer. It is just a question of how much time and energy you want to spend on it.

Despite having a very large number of positions, some over-fitting always occurs. I try to mitigate this problem by using AdamW() with weight decay as the optimizer; weight decay is comparable to L2 regularization.

As activation function for WDL labels I never tried Softmax(); I solely use tanh(). tanh() accepts [-1, 0, +1] labels and has a steeper derivative than sigmoid(), so it trains somewhat faster. Of course the learning rate also influences the time it takes to train a network.
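
A minimal PyTorch sketch of this style of training setup; the network shape, learning rate and weight-decay value are illustrative assumptions, not the actual ones: a single tanh() output trained on [-1, 0, +1] WDL targets with AdamW.

```python
import torch
import torch.nn as nn

# Illustrative network: single tanh() output in [-1, +1] for loss/draw/win.
model = nn.Sequential(
    nn.Linear(90, 256), nn.ReLU(),
    nn.Linear(256, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Tanh(),
)

# AdamW applies decoupled weight decay, which acts like L2 regularization.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
loss_fn = nn.MSELoss()

# Dummy batch: 64 positions with WDL targets in {-1, 0, +1}.
x = torch.rand(64, 90)
y = torch.randint(-1, 2, (64, 1)).float()

optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```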

In the end, anything that has to do with neural networks is trial and error; there is no recipe that tells you what will and what won't work, although the longer I'm busy with it, the better my intuition gets for what works and what doesn't.

Joost
