Eval tuning

Discussion about the development of draughts in the era of computers and the Internet.
Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Eval tuning

Post by Rein Halbersma » Wed Aug 12, 2015 21:20

Very interesting experiments, Michel!

I am just curious how much better a fit the Dragon-ML eval is compared to Scan. So if you ran the Scan eval() on your 900M training set, how well would it predict the game scores compared to your own 50M-parameter eval()?

Not sure if you use log-likelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the Dragon-ML one, and for how many more iterations did the optimization of the Dragon-ML version go on after that hypothetical point to full convergence?

MichelG
Posts: 244
Joined: Sun Dec 28, 2003 20:24
Contact:

Re: Eval tuning

Post by MichelG » Thu Aug 13, 2015 09:14

Rein Halbersma wrote:Very interesting experiments, Michel!

I am just curious how much better a fit the Dragon-ML eval is compared to Scan. So if you ran the Scan eval() on your 900M training set, how well would it predict the game scores compared to your own 50M-parameter eval()?

Not sure if you use log-likelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the Dragon-ML one, and for how many more iterations did the optimization of the Dragon-ML version go on after that hypothetical point to full convergence?
You can't really do that, because the training set is tightly coupled to the engine and its search algorithm. I am not quite sure what you mean by 'optimization trace'.

When you increase the training set, at some point you reach a limit where adding more data no longer improves the eval function. What you can do at that point is increase the number of features in your eval function. The eval accuracy will go up, but its speed will go down.

I think this is a major difference between Scan and Dragon. Scan has few features, but runs fast. Dragon has many features (including global mobility and breakthrough tables), but it runs slower. There is always a tradeoff.

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Eval tuning

Post by Rein Halbersma » Thu Aug 13, 2015 10:51

MichelG wrote:
Rein Halbersma wrote:Very interesting experiments, Michel!

I am just curious how much better a fit the Dragon-ML eval is compared to Scan. So if you ran the Scan eval() on your 900M training set, how well would it predict the game scores compared to your own 50M-parameter eval()?

Not sure if you use log-likelihood or least squares, so the metric is a bit tricky. But e.g. if you store the full optimization trace: after how many iterations would the Scan eval() be as good a fit as the Dragon-ML one, and for how many more iterations did the optimization of the Dragon-ML version go on after that hypothetical point to full convergence?
You can't really do that, because the training set is tightly coupled to the engine and its search algorithm. I am not quite sure what you mean by 'optimization trace'.
I meant the output that numerical optimizers usually give: for each iteration of the solver they print out the objective function (sum of squares, log-likelihood, whatever), the gradient and/or the step size.

If you took the Scan eval(), fed it your training positions, and evaluated your optimization criterion, how high would that be? After how many iterations does the Dragon-ML eval() reach that level of prediction? And how much does Dragon-ML improve on that by the time it reaches convergence? So I'm interested in the relative predictive power of the Scan and Dragon-ML evals.

MichelG
Posts: 244
Joined: Sun Dec 28, 2003 20:24
Contact:

Re: Eval tuning

Post by MichelG » Thu Aug 13, 2015 12:56

Rein Halbersma wrote: I meant the output that numerical optimizers usually give: for each iteration of the solver they print out the objective function (sum of squares, log-likelihood, whatever), the gradient and/or the step size.

If you took the Scan eval(), fed it your training positions, and evaluated your optimization criterion, how high would that be? After how many iterations does the Dragon-ML eval() reach that level of prediction? And how much does Dragon-ML improve on that by the time it reaches convergence? So I'm interested in the relative predictive power of the Scan and Dragon-ML evals.
I think I understand what you mean. But you can't compare two eval functions this way. Scan and Dragon are trained on different datasets. For example, Dragon trains on 'balanced' positions only. Running the Scan eval on the Dragon dataset would give poor results, as would running Dragon on the Scan dataset.

Apart from that, the internal iterations of the solver don't have any real meaning. At the end of the process, you end up with the answer to the linear algebra problem you posed. How it got there, and whether it used internal iterations, is irrelevant in my opinion.

I think a better question is to look at the correlation coefficient between the eval score of a position and the outcome of the game played from it.

But then there is no objective way of determining the game outcome of a position. As long as we don't have a 40-piece database, the game outcome depends on which program, and with which settings, produced the game.

I don't have the numbers available, but I know the correlation coefficient will be a small number.
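
For concreteness, here is a minimal sketch (not Dragon's actual code) of the correlation idea: the Pearson correlation between the eval scores of a set of positions and the results of the games they came from. All names are illustrative.

Code:

#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between eval scores and game results
// (0 = loss, 0.5 = draw, 1 = win). Illustrative sketch only.
double correlation(const std::vector<double>& eval,
                   const std::vector<double>& result)
{
    const std::size_t n = eval.size();
    double se = 0, sr = 0, see = 0, srr = 0, ser = 0;
    for (std::size_t i = 0; i < n; ++i) {
        se  += eval[i];
        sr  += result[i];
        see += eval[i] * eval[i];
        srr += result[i] * result[i];
        ser += eval[i] * result[i];
    }
    const double cov   = ser / n - (se / n) * (sr / n);
    const double var_e = see / n - (se / n) * (se / n);
    const double var_r = srr / n - (sr / n) * (sr / n);
    return cov / std::sqrt(var_e * var_r);
}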

In the end, the only objective test is to have programs play against each other :-)

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Kingsrow with auto tuned eval

Post by Ed Gilbert » Tue Dec 29, 2015 15:17

I put a link to Kingsrow beta test version 1.57a on the Kingsrow download web page. This version uses an eval function that was created automatically using logistic regression. Against the classic Kingsrow it scores 0.608 in engine matches. Against Dragon it scores essentially equal.

Right now the eval uses patterns in rectangular 4x4 board regions, plus some non-pattern material features. My next step is to see if I can get an improvement by using 5x5 board patterns.

The eval was created using a training set of 0.97M games, 72.2M positions. The training games are a combination of classic kr vs classic kr, classic kr vs auto-tuned kr, and auto kr vs auto kr. I found that using only classic vs classic games resulted in an eval with rather weak performance. I also tried a large number of Scan vs Kingsrow games, but using these for training did not give good results either. I used 5-move and 6-move start positions in an attempt to get good coverage of the board patterns and avoid duplicate games. Even so, I removed 80K games because they were too similar to other games. I think it might be useful to add some randomness to the eval during training games, but I didn't try it.

Every time I increased the number of training positions, I got an increase in the strength of the resulting eval. With 72M training positions I imagine I have reached the point of diminishing returns, but I don't really know. The regression takes only a few hours to converge, but it takes a while to play all the training games. I used DXP to play the games, but I don't like how the time controls hard-wired into DXP encourage using most of the time in the early part of the game, so I created a new DXP message type, DXP_GAMEREQ_FISCHER. All the training games were played with either a 0.1 s or 0.12 s per-move time increment.
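
Ed's DXP_GAMEREQ_FISCHER message layout is not published, but the Fischer clock rule itself is simple. A minimal sketch, with illustrative names:

Code:

// Fischer time control: after every completed move the mover's clock
// gains a fixed increment (0.1 s or 0.12 s in the training games).
// This struct and its names are assumptions for illustration.
struct FischerClock {
    double remaining_s;   // time left on the clock
    double increment_s;   // added after each completed move

    // Returns false if the player overstepped on this move.
    bool on_move_played(double elapsed_s) {
        remaining_s -= elapsed_s;
        if (remaining_s < 0.0) return false;
        remaining_s += increment_s;
        return true;
    }
};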

I'm using gradient descent to optimize the weights. I'm also adding L2 regularization to the gradients to try to prevent some of the weights from getting too large. This was a problem with some of the patterns that appear in only a small number of training positions. Regularization is added only to the pattern gradients, not to the material and other non-pattern features.
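
A sketch of what such an update step could look like (not Kingsrow's actual code); the weight layout, with pattern weights first, is an assumption:

Code:

#include <cstddef>
#include <vector>

// One gradient-descent step with L2 regularization applied only to the
// pattern weights, as described above. 'grad' is the gradient of the
// cost; the first 'num_pattern' entries of 'w' are assumed to be the
// pattern weights, the rest material and other non-pattern features.
void descent_step(std::vector<double>& w, const std::vector<double>& grad,
                  std::size_t num_pattern, double lr, double lambda)
{
    for (std::size_t i = 0; i < w.size(); ++i) {
        double g = grad[i];
        if (i < num_pattern)
            g += lambda * w[i];   // pulls rarely-seen pattern weights toward 0
        w[i] -= lr * g;
    }
}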

I started off using a cross-entropy cost function, because this is supposed to be the preferred method for logistic regression. But I never got good results with it. The evals were not as good as the classic kingsrow eval. When I tried using a mean squared error cost function I got better results. I still don't know why cross-entropy didn't work well for me. The optimization seemed to converge ok but the weights just didn't play well.
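
One known mathematical difference between the two cost functions, shown as a sketch for comparison only: with p = sigmoid(x), the cross-entropy gradient with respect to the raw score x is p - y, while the squared-error gradient carries an extra p(1 - p) factor that vanishes for saturated predictions. Whether that explains the observation above is an open question.

Code:

#include <cmath>

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

// Derivative of each cost w.r.t. the raw eval score x, for a game
// result y in {0, 0.5, 1}. Multiplying by the feature values of the
// position gives the weight gradients.
double dcost_cross_entropy(double x, double y) {
    return sigmoid(x) - y;                 // p - y
}

double dcost_squared_error(double x, double y) {
    const double p = sigmoid(x);
    return 2.0 * (p - y) * p * (1.0 - p);  // extra p(1-p) factor
}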

-- Ed
Last edited by Ed Gilbert on Tue Dec 29, 2015 16:42, edited 1 time in total.

Krzysztof Grzelak
Posts: 1368
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: Eval tuning

Post by Krzysztof Grzelak » Tue Dec 29, 2015 16:23

Thank you for the information and the new versions of the program Ed.

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Re: Eval tuning

Post by BertTuyt » Tue Dec 29, 2015 18:10

Ed, I also just started in my holiday period to code some logistic regression stuff.
So far I only have 100K games, which I generated with a 4-ply search and a material-only evaluation.
The only other thing used, next to the material-only evaluation, was a 7-piece DB.
To get some randomness I used a random shuffle of the root moves, and in the end nodes of the search I also added a small random score (between -2 and 2).

From the position set, I only used positions with equal material, no kings, and no capture or capture threat.
I also did not differentiate between game phases (yet).

I also started with the Scan approach: 16 regions of 4x4 board squares (= 8 playable squares each).
So, with three states per playable square, 16 * 3^8 = 16 * 6561 = 104976 parameters.
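
A sketch of how such a pattern index can be computed (assuming, as 6561 = 3^8 suggests, three states per square: empty, white man, black man):

Code:

#include <array>

// Index of one 4x4 region: its 8 playable squares each contribute a
// base-3 digit (0 = empty, 1 = white man, 2 = black man), giving an
// index in [0, 6561). With 16 regions the weight table then holds
// 16 * 6561 = 104976 entries.
int pattern_index(const std::array<int, 8>& squares) {
    int index = 0;
    for (int s : squares)
        index = 3 * index + s;
    return index;
}

The position's pattern score is then the sum over all regions of weights[region * 6561 + pattern_index(...)].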

Now I'm working on the gradient descent (learning rate 0.1), but so far convergence is very poor, so I might be doing something wrong.
After 10,000 iterations the cost function has only decreased from 0.6931 to 0.6891.
I have not included a regularization term yet.
(Question: what initial cost do you see, after how many iterations do you see convergence, and what cost level do you normally reach?)

The cost function I use is Σ( -Result * log(sigmoid(position)) - (1 - Result) * log(1 - sigmoid(position)) ), summed over all positions.
In the end I divide the cost by the number of positions.

The initial weight vector is all zeros.
Result is 0 for a lost position, 0.5 for a draw, and 1.0 for a win (all from White's perspective).
So the initial loss for every position (lost, drawn, or won) is -log(0.5), and the average loss is therefore also -log(0.5) ≈ 0.6931, the aforementioned number.
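
A sketch of that cost computation (score[i] being the linear eval of position i); with all weights zero every sigmoid is 0.5, so every term is -log(0.5) = ln 2 ≈ 0.6931, matching the number above:

Code:

#include <cmath>
#include <cstddef>
#include <vector>

// Averaged cross-entropy cost over the training set, as given above.
// result[i] is 0, 0.5 or 1; score[i] is the linear eval of position i.
double cost(const std::vector<double>& score,
            const std::vector<double>& result)
{
    double sum = 0.0;
    for (std::size_t i = 0; i < score.size(); ++i) {
        const double p = 1.0 / (1.0 + std::exp(-score[i]));
        sum += -result[i] * std::log(p)
               - (1.0 - result[i]) * std::log(1.0 - p);
    }
    return sum / score.size();  // all-zero weights give ln 2 ~ 0.6931
}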

Bert

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: Eval tuning

Post by Ed Gilbert » Tue Dec 29, 2015 22:25

So far I only have 100K games, which I generated with a 4-ply search and a material-only evaluation.
The only other thing used, next to the material-only evaluation, was a 7-piece DB.
To get some randomness I used a random shuffle of the root moves, and in the end nodes of the search I also added a small random score (between -2 and 2).
I think it's probably good to start, as you did, with only material in the evaluation. Using the classic kr eval, the evaluation is strongly biased toward certain types of formations, and that affects the quality of the training samples.

I used 6-piece DBs for all matches. I removed positions that have <= 6 pieces. But even before that, I removed any games where more than the first 35% of the moves were identical to another training game.

Following the Texel tuning method, I did a qsearch of each game position and used the PV leaf position from the qsearch as the training position. I removed identical qsearch positions that came from the same game. I also removed qsearch positions that have > 38 pieces.
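
A sketch of that extraction step; Position, Move and qsearch() stand in for the engine's own types and are assumptions here:

Code:

#include <vector>

// Texel-style sample extraction: run a quiescence search from the game
// position and walk down the principal variation to its quiet leaf,
// which becomes the training position. Position, Move, qsearch() and
// play() are assumed engine-provided names, not a real API.
struct QResult { int score; std::vector<Move> pv; };

Position training_position(const Position& pos) {
    QResult r = qsearch(pos);        // engine's quiescence search
    Position leaf = pos;
    for (const Move& m : r.pv)
        leaf = leaf.play(m);         // follow the PV to its quiet leaf
    return leaf;
}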
From the position set, I only used positions with equal material, no kings, and no capture or capture threat.
I auto-tuned the weights of men and kings, so I had to use positions where these features are non-zero. 22% of the training positions have an imbalance in men, and 6% have a king imbalance.
I also did not differentiate between game phases (yet).
I used the same scheme as the chess guys, so I have one set of weights for the beginning phase and another for the end phase. The eval uses phase interpolation between the two sets, similar to Scan. My definition of phase is coarser than Scan's, which I think goes from 0 to 300; mine goes from 0 to 80. It's a function of the number of men on the board and which half of the board they occupy.
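
A sketch of such a phase interpolation, using the 0..80 range from above; the names, and the direction of the scale, are assumptions:

Code:

// Tapered eval: blend the beginning-phase and end-phase scores by the
// game phase. Here 0 = opening and MAX_PHASE = endgame (an assumption;
// the text does not specify the direction).
constexpr int MAX_PHASE = 80;

int tapered_score(int score_begin, int score_end, int phase) {
    return (score_begin * (MAX_PHASE - phase) + score_end * phase)
           / MAX_PHASE;
}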
I also started with the Scan approach: 16 regions of 4x4 board squares (= 8 playable squares each).
So, with three states per playable square, 16 * 3^8 = 16 * 6561 = 104976 parameters.
But half of those regions are just mirror images of the other half, so you only really need weights for 8 regions. In other words, the same set of weights for the rectangular region of squares 1,2,16,17 can be used for the region of squares 34,35,49,50.
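
A sketch of that symmetry: rotating the 10x10 board 180 degrees maps playable square s (numbered 1..50) to 51 - s and swaps the colors, which sends the region 1,2,16,17 exactly onto 34,35,49,50:

Code:

// 180-degree board rotation on the international 10x10 board.
int mirror_square(int s) { return 51 - s; }     // 1 <-> 50, 2 <-> 49, ...

// Colors swap under the rotation (0 = empty, 1 = white, 2 = black).
int mirror_piece(int piece) { return piece == 0 ? 0 : 3 - piece; }

A mirrored region can therefore reuse the other region's weights by computing its pattern index from the rotated squares with colors swapped.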
(Question: what initial cost do you see, after how many iterations do you see convergence, and what cost level do you normally reach?)
You're using a cross-entropy cost function, which has different magnitudes than the mean squared error I'm using, so you really can't compare. As for the number of iterations: I do about 9 passes of mini-batches, then another 20 or more full iterations. Each mini-batch is 1/50 of an iteration, so 450 mini-batches before switching to full iterations. Using mini-batches is a nice time saver.
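
As a sketch, that schedule could look like the following; the two helper functions are assumptions standing in for the update step described above:

Code:

// ~9 passes of mini-batches (each 1/50 of the training set, so about
// 450 mini-batch steps), then 20 or more full-batch iterations.
// descent_step_on_batch() and descent_step_on_full_set() are assumed
// helpers, not real API.
void train() {
    const int batches_per_pass = 50;
    for (int step = 0; step < 9 * batches_per_pass; ++step)
        descent_step_on_batch(step % batches_per_pass);
    for (int it = 0; it < 20; ++it)
        descent_step_on_full_set();
}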
The initial weight vector is all zeros.
Result is 0 for a lost position, 0.5 for a draw, and 1.0 for a win (all from White's perspective).
I'm doing the same, except that everything is from Black's perspective, in keeping with my preference for English checkers conventions :-)

-- Ed

Walter
Posts: 96
Joined: Fri Jul 13, 2012 19:23
Real name: Walter Thoen

Re: Eval tuning

Post by Walter » Fri Jan 08, 2016 22:33

As most of you know, I am really interested in a program that supports the Killer rule. If I have understood past conversations correctly, this would require 1) modification of the move generator, 2) a new evaluation function, and 3) new endgame databases.

I believe that changing the move generator was never the big issue. Would it be right to say that, with several people now working on eval learning, the second hurdle has also become a lot less difficult to overcome?

Am I right in thinking that, with eval learning in place, it would now take days or weeks instead of months to create a good-quality eval specifically for Killer?

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Eval tuning

Post by Rein Halbersma » Sat Jan 09, 2016 01:28

Walter wrote:As most of you know, I am really interested in a program that supports the Killer rule. If I have understood past conversations correctly, this would require 1) modification of the move generator, 2) a new evaluation function, and 3) new endgame databases.

I believe that changing the move generator was never the big issue. Would it be right to say that, with several people now working on eval learning, the second hurdle has also become a lot less difficult to overcome?

Am I right in thinking that, with eval learning in place, it would now take days or weeks instead of months to create a good-quality eval specifically for Killer?
First hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases, and you already get a very good Killer engine.

Second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high draw percentage of regular draughts does not give the fitting procedure enough information). This would also apply to humans: study Killer middle games to become better at regular draughts.

Krzysztof Grzelak
Posts: 1368
Joined: Thu Jun 20, 2013 17:16
Real name: Krzysztof Grzelak

Re: Eval tuning

Post by Krzysztof Grzelak » Sat Jan 09, 2016 09:23

Sorry for asking, but where can I get the Killer program?

Walter
Posts: 96
Joined: Fri Jul 13, 2012 19:23
Real name: Walter Thoen

Re: Eval tuning

Post by Walter » Sat Jan 09, 2016 17:30

Krzysztof Grzelak wrote:Sorry for asking, but where can I get the Killer program?
I tried to modify Scan today. You can download the result via the link below. I give absolutely no guarantees that it works properly. You can let it play against itself, e.g. with the commands

time 1
2

I tried it a few times, and White seems to have a tendency to win. It could be a bug in my modification, or it could be that in Killer the first player to move has an advantage.

http://www.killerdraughts.org/killerscan.zip

Walter
Posts: 96
Joined: Fri Jul 13, 2012 19:23
Real name: Walter Thoen

Re: Eval tuning

Post by Walter » Sat Jan 09, 2016 20:12

Rein Halbersma wrote: First hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases, and you already get a very good Killer engine.

Second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high draw percentage of regular draughts does not give the fitting procedure enough information). This would also apply to humans: study Killer middle games to become better at regular draughts.
Your first hypothesis might be right. Today I changed the move generator of Scan to introduce the Killer rule. This seems to work, and I am playing a DXP match using the Damage version that can use Scan as an engine (thanks Bert!!). After watching some of the games, I do think that Killer databases are more important than a Killer eval. With neither of these available, the search alone does not seem to steer towards Killer wins. The potential wins are beyond the search depth, and neither the eval nor the database (I only kept the 2-piece DB, which I am assuming still applies to Killer) is able to guide the search in maximizing the potential. I am guessing that a 6-piece Killer DB would provide more 'direction' than a Killer eval. Unfortunately, Scan does not include the code for generating databases, so I cannot explore this further.

Your second hypothesis would be my best guess as well. I am sure that one day we will find out :D

Walter
Posts: 96
Joined: Fri Jul 13, 2012 19:23
Real name: Walter Thoen

Re: Eval tuning

Post by Walter » Sun Jan 10, 2016 08:37

KilleScan 1min 80 moves.pdn
(216.95 KiB) Downloaded 308 times
Rein Halbersma wrote: First hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases, and you already get a very good Killer engine.

Second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high draw percentage of regular draughts does not give the fitting procedure enough information). This would also apply to humans: study Killer middle games to become better at regular draughts.
I played two DXP matches with the KillerScan version. In about 50% of the games one of the Scan engines wins. With identical search, identical eval, no databases, and identical resources, it seems fair to conclude that even though the eval score is close to 0.00, one side actually has an interesting positional advantage. Learning from this would probably make the program stronger in regular draughts as well. It could result in having the better position more often, but against a strong opponent it would of course still all end in draws.
Attachments
KillerScan 5min 80 moves.pdn
(24.77 KiB) Downloaded 320 times

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Eval tuning

Post by Rein Halbersma » Sun Jan 10, 2016 13:43

Walter wrote:
KilleScan 1min 80 moves.pdn
Rein Halbersma wrote: First hypothesis: just plug a regular draughts eval into a Killer engine with Killer databases, and you already get a very good Killer engine.

Second hypothesis: a specifically tuned Killer eval is not just better at Killer, but also at regular draughts (because the enormously high draw percentage of regular draughts does not give the fitting procedure enough information). This would also apply to humans: study Killer middle games to become better at regular draughts.
I played two DXP matches with the KillerScan version. In about 50% of the games one of the Scan engines wins. With identical search, identical eval, no databases, and identical resources, it seems fair to conclude that even though the eval score is close to 0.00, one side actually has an interesting positional advantage. Learning from this would probably make the program stronger in regular draughts as well. It could result in having the better position more often, but against a strong opponent it would of course still all end in draws.
The old (~1998) version of the Dragon program generated the Dam 2.2 / Moby Dam / Scan compatible databases. The source is available from http://hjetten.home.xs4all.nl/DragonDra ... .Win32.zip You could modify the move generator there as well and build 2-6 piece DBs for Killer. These can be plugged into both Moby Dam and Scan. It would be interesting to modify Moby Dam as well and see whether a Killer match between Moby Dam and Scan gives a bigger score difference than a regular draughts match.
