NNUE

Discussion about the development of draughts in the era of computers and the Internet.
Rein Halbersma
Posts: 1721
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: NNUE

Post by Rein Halbersma » Tue Jan 19, 2021 15:41

BertTuyt wrote:
Tue Jan 19, 2021 14:34
Rein, interesting posts.

I tend to agree that in the end an NN or NNUE is a better abstraction for an evaluation function.
However, the current hardware (although improving) imposes too big an nps penalty for NNUE, which still favors the Scan-like patterns.
I'm convinced, however, that the balance will change in the coming years, especially with progress in self-learning frameworks as you described in your first post, and with next-generation processors that will contain standardized NN engines.

I get a déjà vu feeling, as I already thought about bitboards in the seventies, when this was pioneered (to my knowledge) by the chess program Kaissa. But the limited power of 8-bit processors at that time still favored the traditional mailbox approach for board representation.

So if you want to write the best draughts program in the world, I would propose sticking to Scan patterns.
However, if you want to prepare for the future and co-write history, embark on the NN and NNUE train.
The good news: the train has already left the station.

Bert
My highly speculative money is on PN networks (which connect each Piece to its Neighbors in a small rectangular area) as the next best thing. It's a step up in complexity from Jonathan Kreuzer's raw piece networks ("P" type). Keras already has a special layer for it (I linked to it earlier in this thread). The drawback is that I don't think it's efficiently updateable.
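
For concreteness: a layer that connects each square only to its neighbors in a small window, with untied weights (Keras's LocallyConnected2D works this way), could look roughly as follows. This is a purely illustrative C++ sketch; the 10x5 grid of playable squares, the 3x3 window, the single input plane and the zero padding are all assumptions.

Code: Select all

	#include <array>

	// Purely illustrative sketch of a locally connected ("PN"-style) layer:
	// each output unit sees only a small rectangular window of the board and
	// has its own, untied weights.
	using Plane = std::array<std::array<float, 5>, 10>;

	struct LocallyConnected3x3
	{
		float weight[10][5][3][3] = {};  // one 3x3 kernel per output position
		float bias[10][5] = {};

		Plane forward(const Plane& input) const
		{
			Plane out{};
			for (int r = 0; r < 10; ++r)
				for (int c = 0; c < 5; ++c)
				{
					float sum = bias[r][c];
					for (int dr = -1; dr <= 1; ++dr)
						for (int dc = -1; dc <= 1; ++dc)
						{
							const int rr = r + dr, cc = c + dc;
							if (rr < 0 || rr >= 10 || cc < 0 || cc >= 5)
								continue;  // zero padding outside the board
							sum += weight[r][c][dr + 1][dc + 1] * input[rr][cc];
						}
					out[r][c] = sum > 0.0f ? sum : 0.0f;  // ReLU
				}
			return out;
		}
	};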

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sat Jan 23, 2021 09:31

I also ran a somewhat larger match with tmgr (the tool developed by Ed) to get a better baseline.
Match conditions: Scan_31nnue vs Scan_31, played on an Intel Core i7 8700K, 2-move start positions, TC 75 moves in 1 minute, books off, 6-piece DBs, 1 search thread.

Code: Select all

[ 1]: 0.473 score,   632 games,    1 wins,   35 losses,   595 draws,   1 unk
[ 2]: 0.475 score,   632 games,    2 wins,   34 losses,   595 draws,   1 unk
[ 3]: 0.474 score,   632 games,    1 wins,   34 losses,   594 draws,   3 unk
[ 4]: 0.468 score,   632 games,    2 wins,   42 losses,   588 draws,   0 unk
[ 5]: 0.481 score,   632 games,    0 wins,   24 losses,   605 draws,   3 unk
[ 6]: 0.470 score,   632 games,    0 wins,   38 losses,   590 draws,   4 unk
total 0.473 score,  3792 games,    6 wins,  207 losses,  3567 draws,  12 unk
elo diff -18.5
This result was slightly better than the one reported by Ed.
Most likely this is because my processor runs at 4.3 GHz, whereas Ed's clock speed was between 3.0 and 3.2 GHz (at least, that's my assumption).

I now have a good benchmark to measure future performance improvements against.

Bert

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 24, 2021 14:42

Herewith some NNUE eval details (first part).

The 3 main nnue_eval functions are:
eval_nnue_position_init()
eval_nnue_position_increment()
eval_nnue_position()

eval_nnue_position_init() should be called at the root of the search.
Its main task is to fill the accumulator, which contains the incremental values of the 256 neurons of the input layer.

Both the init and increment functions need a translation from the internal bitboard representation to the actual board squares, which is then mapped onto the input vector.
See below the code (as an example) for the white men.

Code: Select all

	Bit wm = position.wm();  // white man
	while (wm)
	{
		_BitScanForward64(&ubit, wm);

		wm ^= bit::bit((Square)ubit);

		sq = square_to_std((Square)ubit) - 6; // -5 -1

		nnue_draughts[iphase]->layers[0].layer_addinput(sq, accumulator);
	}
In the case of Scan, the 2 engine-specific interfaces used for this purpose are position.wm(), which returns the white-man bitboard, and square_to_std(), which converts an internal bit number to an external board square.

The other part of eval_nnue_position_init() is (most likely) self-explanatory.
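
For completeness, below is a minimal sketch of how the pieces above could fit together in eval_nnue_position_init(). Only the per-piece loop follows the code shown above; the bias copy, the 256-entry accumulator and names such as biases are assumptions.

Code: Select all

	// Hypothetical sketch of eval_nnue_position_init(); only the per-piece loop
	// is taken from the snippet above, the rest is an assumption.
	void eval_nnue_position_init(const Pos& position, int16_t* accumulator)
	{
		// start from the biases of the 256 first-layer neurons (assumed layout)
		for (int n = 0; n < 256; ++n)
			accumulator[n] = nnue_draughts[iphase]->layers[0].biases[n];

		// add the weight column of every active input feature,
		// shown here for the white men (black men and kings are handled alike)
		unsigned long ubit;
		Bit wm = position.wm();
		while (wm)
		{
			_BitScanForward64(&ubit, wm);
			wm ^= bit::bit((Square)ubit);
			int sq = square_to_std((Square)ubit) - 6;
			nnue_draughts[iphase]->layers[0].layer_addinput(sq, accumulator);
		}
	}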

Bert

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 31, 2021 11:23

Over the last week(s) I focused on studying AVX2 and whether it was possible to accelerate the NN layer calculations.
I implemented a new approach, which increased the search speed from 4.0 mnps to 5.0 mnps.
With this version I did another base test with tmgr.

See results below.

Code: Select all

Match stats Scan 3.1 nnue vs. Scan 3.1

[ 1]: 0.490 score,   632 games,    2 wins,   14 losses,   615 draws,   1 unk
[ 2]: 0.486 score,   632 games,    3 wins,   21 losses,   607 draws,   1 unk
[ 3]: 0.481 score,   632 games,    2 wins,   26 losses,   602 draws,   2 unk
[ 4]: 0.491 score,   632 games,    2 wins,   13 losses,   615 draws,   2 unk
[ 5]: 0.478 score,   632 games,    1 wins,   29 losses,   602 draws,   0 unk
[ 6]: 0.483 score,   632 games,    1 wins,   23 losses,   607 draws,   1 unk
total 0.485 score,  3792 games,   11 wins,  126 losses,  3648 draws,   7 unk
elo diff -10.6
Bert

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 31, 2021 11:31

For those interested, herewith the code to calculate layer 2 (256x32) and layer 3 (32x32).

Bert

Code: Select all

	// Fully connected layer with clipped ReLU: reads the xinput int16 activations
	// as xinput/2 int32 pairs and multiply-accumulates them against 32 neurons
	// (weight holds 64 int16 values per input pair, i.e. 2 weights per neuron),
	// then clamps, shifts and packs the 32 sums to int16 in output[].
	static inline void relu_layer_vec16_v0(const int xinput, const int xoutput, int32_t* input, int16_t* output, int32_t* bias, const int16_t* weight)
	{
		__m256i temp0;

		const int32_t* input_end = input + (xinput >> 1);

		__m256i zero = _mm256_setzero_si256();
		__m256i max = _mm256_set1_epi32(kfixedmaxshift);

		__m256i ds1 = _mm256_load_si256((__m256i*)bias);
		__m256i ds2 = _mm256_load_si256((__m256i*)(bias + 8));
		__m256i ds3 = _mm256_load_si256((__m256i*)(bias + 16));
		__m256i ds4 = _mm256_load_si256((__m256i*)(bias + 24));

		for (; input < input_end; input++, weight += 64) // 4 *  8 neuron block
		{
			temp0 = _mm256_set1_epi32(*input); // load the 2 input values & fill the vector !

			ds1 = _mm256_add_epi32(ds1, _mm256_madd_epi16(temp0, _mm256_load_si256((__m256i*)weight))); // ds1 += madd()
			ds2 = _mm256_add_epi32(ds2, _mm256_madd_epi16(temp0, _mm256_load_si256((__m256i*)(weight + 16)))); // ds2 += madd()
			ds3 = _mm256_add_epi32(ds3, _mm256_madd_epi16(temp0, _mm256_load_si256((__m256i*)(weight + 32)))); // ds3 += madd()
			ds4 = _mm256_add_epi32(ds4, _mm256_madd_epi16(temp0, _mm256_load_si256((__m256i*)(weight + 48)))); // ds4 += madd()
		}

		ds1 = _mm256_min_epi32(ds1, max); // clamp to <= fixedmax
		ds1 = _mm256_max_epi32(ds1, zero); // clamp to >= 0
		ds1 = _mm256_srai_epi32(ds1, kqshift); // shift right 

		ds2 = _mm256_min_epi32(ds2, max); // clamp to <= fixedmax
		ds2 = _mm256_max_epi32(ds2, zero); // clamp to >= 0
		ds2 = _mm256_srai_epi32(ds2, kqshift); // shift right 

		ds3 = _mm256_min_epi32(ds3, max); // clamp to <= fixedmax
		ds3 = _mm256_max_epi32(ds3, zero); // clamp to >= 0
		ds3 = _mm256_srai_epi32(ds3, kqshift); // shift right 

		ds4 = _mm256_min_epi32(ds4, max); // clamp to <= fixedmax
		ds4 = _mm256_max_epi32(ds4, zero); // clamp to >= 0
		ds4 = _mm256_srai_epi32(ds4, kqshift); // shift right 

		__m256i result1 = _mm256_packus_epi32(ds1, ds2); // pack into 16 values (16-bit)
		result1 = _mm256_permute4x64_epi64(result1, 0xD8);

		_mm256_store_si256((__m256i*)output, result1); // and write into output

		__m256i result2 = _mm256_packus_epi32(ds3, ds4); // pack into 16 values (16-bit)
		result2 = _mm256_permute4x64_epi64(result2, 0xD8);

		_mm256_store_si256((__m256i*)(output +16), result2); // and write into output
	}
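
As a usage illustration (not the actual call sites), the same kernel can serve both hidden layers; the buffer and network field names below are assumptions, and the cast reflects that the function reads the int16 activations as int32 pairs.

Code: Select all

	// Hypothetical call sites for the two hidden layers (names are assumed)
	alignas(32) int16_t hidden1[32];   // output of layer 2 (256 -> 32)
	alignas(32) int16_t hidden2[32];   // output of layer 3 (32 -> 32)

	// layer 2: the 256 accumulator values -> 32 neurons
	relu_layer_vec16_v0(256, 32, (int32_t*)accumulator, hidden1, net.bias2, net.weight2);

	// layer 3: 32 values -> 32 neurons
	relu_layer_vec16_v0(32, 32, (int32_t*)hidden1, hidden2, net.bias3, net.weight3);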


BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 31, 2021 11:46

A short explanation of one of the other eval routines, eval_nnue_position_increment().

As stated in a previous post, this routine does an incremental update of the 256 outputs of layer 1.
The function is called after every move, and uses the previous (from) and new (to) bitboard positions.

It consists of 4 loops, in which the position differences are found for white men, black men, white kings and black kings.

These differences are calculated with an XOR.
If the bit is part of the to bitboard, layer_addinput() is called; otherwise layer_subinput() is called.

See the code below.

Code: Select all

	Bit wm = from.wm() ^ to.wm();  // white man
	while (wm)
	{
		_BitScanForward64(&ubit, wm);

		wm ^= bit::bit((Square)ubit);

		sq = square_to_std((Square)ubit) - 6; // -5 -1

		if (bit::bit((Square)ubit) & to.wm())
			nnue_draughts[iphase]->layers[0].layer_addinput(sq, accumulator);
		else
			nnue_draughts[iphase]->layers[0].layer_subinput(sq, accumulator);
	}
The 2 functions layer_addinput() and layer_subinput() are also (relatively) straightforward:

Code: Select all

inline void layer_addinput(uint32_t i, int16_t outputs[])
	{
		assert(i >= 0 && i < inputcount);

		simd::add_vec16(outputs, &weights[weightstart + i * outputcount], outputcount);
	}

	inline void layer_subinput(uint32_t i, int16_t outputs[])
	{
		assert(i >= 0 && i < inputcount);

		simd::sub_vec16(outputs, &weights[weightstart + i * outputcount], outputcount);
	}

Code: Select all

// (16 bit) input = input + weight
	static inline void add_vec16(int16_t* input, const int16_t* weight, size_t count)
	{
		assert(((int64_t)input & 31) == 0);
		assert(((int64_t)weight & 31) == 0);
		assert((count % 16) == 0);

		const int16_t* input_end = input + count;

		for (; input < input_end; input += 16, weight += 16)
		{
			_mm256_store_si256((__m256i*)input,
				_mm256_add_epi16(
					_mm256_load_si256((__m256i*)input),
					_mm256_load_si256((__m256i*)weight))); // sum all 16 and store in input
		}
	}

	// (16 bit) input = input - weight
	static inline void sub_vec16(int16_t* input, const int16_t* weight, size_t count)
	{
		assert(((int64_t)input & 31) == 0);
		assert(((int64_t)weight & 31) == 0);
		assert((count % 16) == 0);

		const int16_t* input_end = input + count;

		for (; input < input_end; input += 16, weight += 16)
		{
			_mm256_store_si256((__m256i*)input,
				_mm256_sub_epi16(
					_mm256_load_si256((__m256i*)input),
					_mm256_load_si256((__m256i*)weight))); // subtract all 16 and store in input
		}
	}

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: NNUE

Post by BertTuyt » Sun Jan 31, 2021 12:36

The disadvantage of NNUE, compared with a pattern-based eval, is the calculation cost.

I did some timings (based upon a 26-ply search, base search around 47.952 seconds) with the following results:
* incremental update 7.26 sec (15%)
* layer 2 (256x32), 12.467 sec (26%)
* layer 3 (32x32), 2.233 sec (5%)
* layer 4 (32x1), 0.0 sec (0%)

So the eval takes approximately 46% of the total search time (which is huge compared with a pattern-based eval).
Maybe with 8-bit weights and/or AVX-512 we could halve this, which would yield a speed of 6.5 mnps.
My expectation is that with this speed we could approach Scan to within 5 Elo.
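
For reference, an 8-bit kernel typically relies on the _mm256_maddubs_epi16 instruction (unsigned 8-bit activations times signed 8-bit weights). Below is a minimal, purely illustrative sketch of that idea for a 32x32 layer; the names, layout and quantization ranges are assumptions, not the actual engine code.

Code: Select all

	#include <immintrin.h>
	#include <cstdint>

	// Hypothetical int8 sketch: 32 uint8 activations (kept <= 127 to avoid
	// saturation in maddubs) times 32 neurons with int8 weights, one weight
	// row per neuron, accumulated in int32.
	static inline void layer_vec8_sketch(const uint8_t* input, const int8_t* weight,
	                                     const int32_t* bias, int32_t* output)
	{
		const __m256i ones = _mm256_set1_epi16(1);
		const __m256i act = _mm256_loadu_si256((const __m256i*)input); // 32 x uint8

		for (int n = 0; n < 32; ++n, weight += 32)
		{
			const __m256i w = _mm256_loadu_si256((const __m256i*)weight); // 32 x int8

			// unsigned * signed multiply, adjacent pairs summed to 16 bit
			__m256i prod16 = _mm256_maddubs_epi16(act, w);
			// widen the 16 partial sums to 8 x int32
			__m256i prod32 = _mm256_madd_epi16(prod16, ones);

			// horizontal sum of the 8 int32 lanes
			__m128i sum = _mm_add_epi32(_mm256_castsi256_si128(prod32),
			                            _mm256_extracti128_si256(prod32, 1));
			sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0x4E));
			sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 0xB1));

			output[n] = bias[n] + _mm_cvtsi128_si32(sum);
		}
	}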

Further improvements must then come from better networks (multiple networks for different game phases, a different topology, or better-quality weights).

So the quest continues :D

Bert

Joost Buijs
Posts: 467
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Tue Feb 02, 2021 07:11

Over the last week I've done many experiments with the NN; my impression is that its disc play is very good, but that it has difficulties understanding the value of kings. The fact that a single king is worth ~3.5 men and that additional kings are worth a lot less is difficult for the NN to grasp. Maybe adding an extra 'one king' input (for the side to move and for the opponent) to make the NN extra aware of this situation could help. Normally you would expect the NN to be able to deduce this from the board position, but somehow it struggles with it.

Instead of adding extra inputs, it's possible to reuse the inputs currently used for the side to move; these are not necessary anyway when you rotate the board and flip the colors depending upon the side to move.

Joost

CheckersGuy
Posts: 20
Joined: Mon Oct 17, 2016 09:05
Real name: Robin Messemer

Re: NNUE

Post by CheckersGuy » Wed Feb 03, 2021 12:45

I've now dived a little into the topic and implemented NNUE for my checkers engine. It works surprisingly well, even if you train only on the game result and not on shallow searches or a static eval. However, I think much of the advantage in chess comes from using different input features (HalfKP and so on) and not plain and simple P-type networks. Even though I like the simplicity of P-type networks, there has to be a better input feature for which the ratio input_size/num_incr_updates is much higher.

Joost Buijs
Posts: 467
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Thu Feb 04, 2021 07:04

The major difference between a simple feed-forward NN and the NNUE network used in chess is that the chess one has a different set of input weights for each king location. The other difference is that it has 2 sets of inputs, one for the side to move and one for the other side, into which (according to the original article) the position from 1 ply earlier is fed.
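
To make the king-conditioned input idea concrete, here is a sketch of how a HalfKP-style feature index is commonly computed in chess NNUE. The constants are the chess ones (64 squares, 10 non-king piece types per perspective) and the function is only illustrative; it is not the exact Stockfish indexing.

Code: Select all

	// Illustrative HalfKP-style feature index: one input per
	// (own king square, piece type, piece square) combination, per perspective.
	inline int halfkp_index(int king_sq,     // 0..63, own king as seen from this perspective
	                        int piece_type,  // 0..9, non-king pieces of both colors
	                        int piece_sq)    // 0..63
	{
		return king_sq * 10 * 64 + piece_type * 64 + piece_sq; // ~40960 inputs per perspective
	}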

In chess you have 6 different piece types with tens of different interactions, and you can add inputs for things like en passant or castling rights. I think that most features for checkers/draughts, like structure, balance and tempo, can be encoded pretty well in a simple feed-forward network. I'm not a draughts player myself; to me each disc looks the same, so it's not easy to find other features that could be added.

Joost Buijs
Posts: 467
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Mon Feb 08, 2021 08:11

After many tests last week with my draughts NN, I came to the conclusion that it has serious difficulties modeling the value of a king when I use the locations of the kings as input to the NN. However, using the number of kings instead of their locations as input seems to give the NN a better understanding of the true value of a king. Kings are very sparse and jump all over the place, so I can imagine that it is very difficult for the NN to conclude anything from this.

Using the number of kings has the additional advantage that the total number of inputs drops by 60; I suppose that with incremental updates this won't make much of a difference speed-wise, but it gives some extra room to add other input parameters.
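
As an illustration only (the names and the exact feature choice are hypothetical, not my actual implementation), king-count inputs could look something like this:

Code: Select all

	#include <algorithm>

	// Hypothetical sketch: a few king-count features per side instead of
	// one input per king square.
	inline void add_king_count_inputs(int own_kings, int opp_kings, float* inputs, int offset)
	{
		inputs[offset + 0] = own_kings > 0 ? 1.0f : 0.0f;        // "has a king" flag
		inputs[offset + 1] = opp_kings > 0 ? 1.0f : 0.0f;
		inputs[offset + 2] = (float)std::min(own_kings, 3);      // capped count
		inputs[offset + 3] = (float)std::min(opp_kings, 3);
	}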

Yesterday I let my program run another match against Kingsrow, 1 minute per 90 moves each, this time on a single core instead of 16 to get a better grip on the true difference in strength. The match ended at +51 Elo in favor of Kingsrow, which was to be expected because my program still uses float32 inference and lacks incremental updates. My program ran at 0.55 mnps vs 18 mnps for Kingsrow, a speed difference of roughly 32 times. For chess this speed difference would imply a difference of ~360 Elo; for draughts it seems to have less impact.

As Bert already showed, it's possible to reach a node speed of ~5 mnps with incremental updates of the input layer and int16 quantization. With int8 quantization it could be even faster. The coming week I will focus on adding incremental updates and int16 quantization.
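
For reference, int16 quantization of trained float weights usually amounts to scaling by a fixed-point factor, rounding and clamping to the int16 range. A minimal sketch, with the shift value and names assumed purely for illustration:

Code: Select all

	#include <algorithm>
	#include <cmath>
	#include <cstdint>

	constexpr int kQuantShift = 6;                  // assumed fixed-point shift, 64 = 1.0
	constexpr float kQuantScale = 1 << kQuantShift;

	// Hypothetical sketch: convert one trained float weight to int16 fixed point.
	inline int16_t quantize16(float w)
	{
		const float scaled = std::round(w * kQuantScale);
		return (int16_t)std::clamp(scaled, -32768.0f, 32767.0f);
	}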

Another thing I want to do is build a better database of training positions. My last network was trained for 100 epochs on 1.48 billion distinct positions labeled with the score of a 4-ply search, but I still have the impression that the network sometimes encounters positions it doesn't fully understand, probably due to a lack of such positions in the training data.

Joost

Joost Buijs
Posts: 467
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Mon Feb 08, 2021 19:04

This morning I analyzed some of the losses from yesterday's match, and I found that they were not all caused by bad evaluation by the NN. There was a problem with my quiescence search: several weeks ago I did some experiments with my q-search and simply forgot to remove that code.

So I reran the match, and this time the Elo came out at -35. Kingsrow 1.62 vs Ares v1.2: 17 wins, 1 loss, 140 draws, 0 unknowns.
This is not so bad considering the large speed difference. With a tenfold speed increase and multiple cores the difference will get a lot smaller.
At least I can use the -35 Elo as a baseline to compare future developments against.

Joost

Edit: I can't attach the PDN file of the match; the board's attachment quota seems to have been reached.

Sidiki
Posts: 320
Joined: Thu Jan 15, 2015 16:28
Real name: Coulibaly Sidiki

Re: NNUE

Post by Sidiki » Tue Feb 09, 2021 03:29

Joost Buijs wrote:
Mon Feb 08, 2021 19:04
This morning I analyzed some of the losses from yesterday's match, and I found that they were not all caused by bad evaluation by the NN. There was a problem with my quiescence search: several weeks ago I did some experiments with my q-search and simply forgot to remove that code.

So I reran the match, and this time the Elo came out at -35. Kingsrow 1.62 vs Ares v1.2: 17 wins, 1 loss, 140 draws, 0 unknowns.
This is not so bad considering the large speed difference. With a tenfold speed increase and multiple cores the difference will get a lot smaller.
At least I can use the -35 Elo as a baseline to compare future developments against.

Joost

Edit: I can't attach the PDN file of the match; the board's attachment quota seems to have been reached.
Hi Joost and the others,

I read all the recent messages, and one thing still seems important: nodes/s (speed) vs. the eval.

If the eval or the training database (in the case of NNUE) is powerful, can a low nodes/s still hurt it?

Just a question: if Kingsrow runs at 32 mnps with 1 min / 90 moves against Damage running at 10 mnps with 10 min / 90 moves,
and Kingsrow wins the whole series of DXP games, does that mean Kingsrow won because of its good eval or because of its powerful speed?

Joost Buijs
Posts: 467
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: NNUE

Post by Joost Buijs » Tue Feb 09, 2021 08:45

Sidiki wrote:
Tue Feb 09, 2021 03:29
Hi Joost and the others,

I read all the recent messages, and one thing still seems important: nodes/s (speed) vs. the eval.

If the eval or the training database (in the case of NNUE) is powerful, can a low nodes/s still hurt it?

Just a question: if Kingsrow runs at 32 mnps with 1 min / 90 moves against Damage running at 10 mnps with 10 min / 90 moves,
and Kingsrow wins the whole series of DXP games, does that mean Kingsrow won because of its good eval or because of its powerful speed?
Hi Sidiki,

This is not so easy to answer, because with draughts you also have to take the huge draw tendency into account.

The best you can do to determine which engine evaluates better is to play a large number of high-speed games and give the NN engine extra time to compensate for the loss in node speed.

In the future NNs will get faster too; NNs are such a hype that most processor manufacturers are working on new instructions to improve inference speed.

Personally I think that NN evaluation will get the upper hand once engines learn from self-play with reinforcement learning. Supervised learning (as we currently use) has the problem that it also learns the bad habits present in the data set we use for training.

On the other hand, pattern evaluation is already so good that with a fast computer, multiple cores and longer thinking times, all games between strong engines end in a draw anyway.

Maybe NNs won't add much for draughts, but it is simply fun to play with them.

Joost

Sidiki
Posts: 320
Joined: Thu Jan 15, 2015 16:28
Real name: Coulibaly Sidiki

Re: NNUE

Post by Sidiki » Tue Feb 09, 2021 22:23

Joost Buijs wrote:
Tue Feb 09, 2021 08:45
Hi Sidiki,

This is not so easy to answer, because with draughts you also have to take the huge draw tendency into account.

The best you can do to determine which engine evaluates better is to play a large number of high-speed games and give the NN engine extra time to compensate for the loss in node speed.

In the future NNs will get faster too; NNs are such a hype that most processor manufacturers are working on new instructions to improve inference speed.

Personally I think that NN evaluation will get the upper hand once engines learn from self-play with reinforcement learning. Supervised learning (as we currently use) has the problem that it also learns the bad habits present in the data set we use for training.

On the other hand, pattern evaluation is already so good that with a fast computer, multiple cores and longer thinking times, all games between strong engines end in a draw anyway.

Maybe NNs won't add much for draughts, but it is simply fun to play with them.

Joost
Hi Joost,

Thanks for taking the time to answer so thoroughly.

So we will wait for further improvements of NNUE, which we hope, with good supervision, will give the best results.

In chess, the best engines at the moment are those that use the NNUE concept.

Thanks again,

Sidiki.
