Eval tuning
Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
Eval tuning
I am racking my brain about how Dragon / Scan tune their eval. I understand logistic regression and how to do the optimization. What I can't really grasp is the sheer number of parameters. If I understand e.g. Fabien's explanation correctly, Scan has 16 patterns of 8 squares each, so that is 3^8 * 16 ≈ 100K parameters. Michel is talking about close to a million? (I can't find the reference right now.)
Now I am assuming that estimating these parameters needs at the very least 1 game per parameter, and perhaps a dozen more. I did some simulations in R (the statistics programming language) with randomly generated data, and it was not possible to estimate an eval function with 100K parameters on my 32 GB machine.
Now, R is a very memory-hungry environment. OTOH, writing Newton's method in C++ is doable (and I'm looking at some libraries for it), but I wonder how to get around having to compute the 100K x 100K matrix inverse that Newton's method uses in its root finding. Perhaps by using quasi-Newton methods?
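(As a quick back-of-the-envelope check of why the dense approach fails, with numbers that are mine rather than from the post: at this scale the Hessian alone is far beyond 32 GB.)

Code: Select all
#include <cstdio>

int main() {
    const double n = 1e5;                       // ~3^8 * 16 pattern weights
    const double hessian_gb = n * n * 8 / 1e9;  // dense double-precision Hessian
    std::printf("dense Hessian: %.0f GB\n", hessian_gb);  // ~80 GB, well over 32 GB
    // ...and a Newton step would still need an O(n^3) solve on top of the storage.
}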
@Fabien/Michel: how did you tune your evals? Some numerical libraries or self-written optimizers? Any details are appreciated!
Re: Eval tuning
Dragon uses a conjugate gradient algorithm for this. I found it in the 'Numerical Recipes in C' book. I made it run multi-threaded and minimized the memory footprint.
Game phase 3 (the endgame) is currently the biggest problem: it learns 17 million parameters from 900 million examples. This takes about a week to run on my computer and uses about 17 GB of memory. The main limits are lack of memory and the slowness of the random memory accesses.
Smaller problems run much faster; learning 10,000 weights from 1 million examples takes just a couple of seconds.
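(For readers who want the shape of the method Michel names: a minimal nonlinear conjugate gradient sketch, in the generic Polak-Ribière textbook form with a fixed step standing in for a proper line search; this is an illustration, not Dragon's actual code. Note it only ever stores a few vectors of size n, which is what keeps the memory footprint small.)

Code: Select all
#include <algorithm>
#include <functional>
#include <vector>

// Minimise an error function given only its gradient; w is updated in place.
// grad(w, g) fills g with the gradient of the error at w.
void nonlinear_cg(std::vector<double>& w,
                  const std::function<void(const std::vector<double>&,
                                           std::vector<double>&)>& grad,
                  int iters, double step) {
    const size_t n = w.size();
    std::vector<double> g(n), g_new(n), d(n);
    grad(w, g);
    for (size_t i = 0; i < n; ++i) d[i] = -g[i];      // start with steepest descent
    for (int it = 0; it < iters; ++it) {
        for (size_t i = 0; i < n; ++i) w[i] += step * d[i];  // stand-in for a line search
        grad(w, g_new);
        double num = 0.0, den = 0.0;
        for (size_t i = 0; i < n; ++i) {
            num += g_new[i] * (g_new[i] - g[i]);      // Polak-Ribiere numerator
            den += g[i] * g[i];
        }
        const double beta = std::max(0.0, num / den); // beta = 0 restarts the direction
        for (size_t i = 0; i < n; ++i) d[i] = -g_new[i] + beta * d[i];
        g.swap(g_new);
    }
}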
Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
Re: Eval tuning
MichelG wrote:
> Dragon uses a conjugate gradient algorithm for this. I found it in the 'Numerical Recipes in C' book. I made it run multi-threaded and minimized the memory footprint.
> Game phase 3 (the endgame) is currently the biggest problem: it learns 17 million parameters from 900 million examples. This takes about a week to run on my computer and uses about 17 GB of memory. The main limits are lack of memory and the slowness of the random memory accesses.
> Smaller problems run much faster; learning 10,000 weights from 1 million examples takes just a couple of seconds.

Thanks for the explanation. Numerical analysis classes were over 20 years ago for me (I've been using libraries ever since); time to brush it up :)
- Posts: 299
- Joined: Tue Jul 07, 2015 07:48
- Real name: Fabien Letouzey
Re: Eval tuning
Rein Halbersma wrote:
> Now I am assuming that estimating these parameters needs at the very least 1 game per parameter, and perhaps a dozen more.

That would be true only with "old statistics", which almost universally relied on maximum likelihood. Historically though, ANN "weight decay" showed the power of what is now called "regularisation": https://en.wikipedia.org/wiki/Regulariz ... thematics). The theory behind it is that we apply an a-priori probability distribution (a bias) on the weights, in addition to the data: https://en.wikipedia.org/wiki/Maximum_a ... estimation. In practice, we add an error term that penalises "complex" weight vectors. In the mid-90's, Othello programs used ad-hoc solutions against overfitting, which was seen in reverse as the "rare pattern" problem. Some even generated artificial examples (as in Backgammon).
So in conclusion, there is no longer a simple relation between the number of weights and the number of required examples. It should be noted, though, that a large enough number of examples will always fix problems, even without regularisation.
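(In code, the penalty term is one extra loop: for an L2 / "weight decay" prior, each weight adds 0.5*lambda*w^2 to the error and lambda*w to its gradient. A minimal generic sketch; lambda is the hyper-parameter you have to pick.)

Code: Select all
#include <vector>

// Add an L2 ("weight decay") penalty to an already-computed error and gradient.
// This is the MAP view: a Gaussian prior centred on 0 for every weight.
double add_l2_penalty(const std::vector<double>& w,
                      std::vector<double>& grad, double lambda) {
    double penalty = 0.0;
    for (size_t i = 0; i < w.size(); ++i) {
        penalty += 0.5 * lambda * w[i] * w[i];
        grad[i] += lambda * w[i];  // pulls every weight toward 0
    }
    return penalty;  // add this to the data error term
}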
Rein Halbersma wrote:
> I did some simulations in R (the statistics programming language) with randomly generated data, and it was not possible to estimate an eval function with 100K parameters on my 32 GB machine.
> Now, R is a very memory-hungry environment. OTOH, writing Newton's method in C++ is doable (and I'm looking at some libraries for it), but I wonder how to get around having to compute the 100K x 100K matrix inverse that Newton's method uses in its root finding. Perhaps by using quasi-Newton methods?

As you have seen, any method that stores the matrix directly (inverse or not) is dead in the water. All applications I have seen use gradient methods, though quasi-Newton ones might also be applicable.
Additionally, the system that we want to solve is very sparse, in the following way: for any example (position), the number of non-zero attributes (features) is bounded by a small constant. With 16 patterns, only 16 configurations can occur in a single position. The other configurations have no impact and should therefore be skipped during the computation for that position. Another way of seeing this is that each pattern is a single (discrete) feature, and we represent the data with 1-of-C encoding because the methods that we use require continuous attributes (http://www.faqs.org/faqs/ai-faq/neural- ... ion-7.html). You can see for example in the Buro papers that all the maths look very complicated for this reason, while they simply represent scalar products of a sparse vector (features) by a dense one (weights).
What we use in practice is called "list of lists" (LIL) on this page: https://en.wikipedia.org/wiki/Sparse_matrix. The learning problem pretty much requires that we use this representation. R might or might not have it available. I remember that Weka uses this universally, because text mining was a hot subject in the early 2000's.
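(To make the sparsity concrete: a position can be stored as just its 16 active configuration indices, one per pattern, and the "scalar product of a sparse vector by a dense one" collapses to a 16-term sum. A sketch under that layout; Scan's real indexing is of course its own.)

Code: Select all
#include <array>
#include <vector>

constexpr int NUM_PATTERNS = 16;
constexpr int CONFIGS = 6561;  // 3^8 configurations of an 8-square pattern

// One example (position): for each pattern, which of its 3^8 configurations occurs.
using SparsePosition = std::array<int, NUM_PATTERNS>;

// w is the dense weight vector: one block of 3^8 weights per pattern (1-of-C encoding).
double eval(const SparsePosition& pos, const std::vector<double>& w) {
    double score = 0.0;
    for (int p = 0; p < NUM_PATTERNS; ++p)
        score += w[p * CONFIGS + pos[p]];  // only 16 of the ~100K weights are touched
    return score;
}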
Rein Halbersma wrote:
> @Fabien/Michel: how did you tune your evals? Some numerical libraries or self-written optimizers? Any details are appreciated!

I use basic gradient descent in my own code. It's fast enough for my needs: a few hours on a single core. CG is better, but more complicated. I suggest you read ANN resources instead of statistics: linear and logistic regression are equivalent to a single neuron.
- Posts: 299
- Joined: Tue Jul 07, 2015 07:48
- Real name: Fabien Letouzey
Re: Eval tuning
Fabien Letouzey wrote:
> I use basic gradient descent in my own code. It's fast enough for my needs: a few hours on a single core. CG is better, but more complicated. I suggest you read ANN resources instead of statistics: linear and logistic regression are equivalent to a single neuron.

I should add that I use sampling to compute the gradient. I think it's called "mini-batches" in ANN terminology. I only compute the full gradient in the last few passes.
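(Putting the pieces together, a minimal sketch of what sampled, "mini-batch" gradient descent on the logistic loss looks like over the sparse representation sketched earlier; an illustration in the spirit of the description, not Fabien's actual code.)

Code: Select all
#include <array>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

constexpr int NUM_PATTERNS = 16;
constexpr int CONFIGS = 6561;  // 3^8
using SparsePosition = std::array<int, NUM_PATTERNS>;

double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

double eval(const SparsePosition& pos, const std::vector<double>& w) {
    double s = 0.0;
    for (int p = 0; p < NUM_PATTERNS; ++p) s += w[p * CONFIGS + pos[p]];
    return s;
}

// One epoch of mini-batch SGD on the logistic loss; y is the game result in {0, 1/2, 1}.
void sgd_epoch(std::vector<double>& w, const std::vector<SparsePosition>& xs,
               const std::vector<double>& ys, double lr, int batch_size,
               std::mt19937& rng) {
    std::uniform_int_distribution<size_t> pick(0, xs.size() - 1);
    for (size_t step = 0; step < xs.size() / batch_size; ++step) {
        std::vector<std::pair<int, double>> grad;  // (weight index, d loss / d weight)
        for (int b = 0; b < batch_size; ++b) {
            const size_t i = pick(rng);            // sample the mini-batch
            const double err = sigmoid(eval(xs[i], w)) - ys[i];  // d loss / d eval
            for (int p = 0; p < NUM_PATTERNS; ++p)
                grad.emplace_back(p * CONFIGS + xs[i][p], err);
        }
        for (const auto& [idx, g] : grad)          // update only the active weights
            w[idx] -= lr * g / batch_size;
    }
}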
Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
Re: Eval tuning
Fabien Letouzey wrote:
> That would be true only with "old statistics", which almost universally relied on maximum likelihood. Historically though, ANN "weight decay" showed the power of what is now called "regularisation". The theory behind it is that we apply an a-priori probability distribution (a bias) on the weights, in addition to the data. In practice, we add an error term that penalises "complex" weight vectors. In the mid-90's, Othello programs used ad-hoc solutions against overfitting, which was seen in reverse as the "rare pattern" problem. Some even generated artificial examples (as in Backgammon).
> So in conclusion, there is no longer a simple relation between the number of weights and the number of required examples. It should be noted, though, that a large enough number of examples will always fix problems, even without regularisation.

Thanks for this explanation: I have to process all that. It sounds plausible, but I need to do some reading up on the ANN literature you mentioned in order to be able to comment on it.

Fabien Letouzey wrote:
> Additionally, the system that we want to solve is very sparse, in the following way: for any example (position), the number of non-zero attributes (features) is bounded by a small constant. With 16 patterns, only 16 configurations can occur in a single position. The other configurations have no impact and should therefore be skipped during the computation for that position. Another way of seeing this is that each pattern is a single (discrete) feature, and we represent the data with 1-of-C encoding because the methods that we use require continuous attributes.

Yes, thanks, this was familiar to me from statistics: for a C-valued categorical variable, you have C-1 "dummy" parameters (with C=0 in this case the empty 4x4 squares as "baseline"). I think R packages typically expand this into C-1 extra columns of 0s and 1s, and do the loop over all pattern values. Of course, just selecting parameter[x] for pattern value C=x and ignoring the other 3^8 - 2 parameters is much faster, as you pointed out.

I have one other question: I read about Texel's tuning and I am puzzled by the encoding of the game results as 0, 1/2 and 1. In draughts, draws occur very frequently, and it's probably not entirely correct to use a continuous scale for a 3-valued ordinal response variable. I am used to doing "ordered logit" regression, where the results are (from White's perspective) L < D < W and you estimate meta parameters for the draw_W and draw_B thresholds. So in terms of likelihood (with eval denoting the evaluation of the relevant features with all their parameters):

Code: Select all
Lik(win_W) = 1 - F(w_dW - eval)
Lik(draw)  = F(w_dW - eval) - F(w_dB - eval)
Lik(win_B) = F(w_dB - eval) - 0

where F(x) is equal to the usual sigmoid used in 2-valued logit, and LogLik = sum_{res, pos} LogLik(res, eval(pos)). I think you can also add the parameter regularization terms to this log-likelihood and apply the numerical optimization to this objective function. Did you try to treat draws in this manner? Or is this somehow taken care of by your ANN approach (sorry, not familiar with it yet)?
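(The three likelihoods translate directly into code; a small sketch under the stated setup, with F the logistic sigmoid and thresholds w_dB < w_dW.)

Code: Select all
#include <cmath>

double F(double x) { return 1.0 / (1.0 + std::exp(-x)); }

struct Outcome { double win_w, draw, win_b; };

// Ordered-logit outcome probabilities for one position (White's perspective).
// w_dW and w_dB are the two draw-threshold meta parameters, with w_dB < w_dW.
Outcome likelihoods(double eval, double w_dW, double w_dB) {
    Outcome o;
    o.win_w = 1.0 - F(w_dW - eval);             // eval well above the upper threshold
    o.draw  = F(w_dW - eval) - F(w_dB - eval);  // eval between the thresholds
    o.win_b = F(w_dB - eval);                   // eval well below the lower threshold
    return o;                                   // the three probabilities sum to 1
}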
- Posts: 299
- Joined: Tue Jul 07, 2015 07:48
- Real name: Fabien Letouzey
Re: Eval tuning
Rein Halbersma wrote:
> Yes, thanks, this was familiar to me from statistics: for a C-valued categorical variable, you have C-1 "dummy" parameters (with C=0 in this case the empty 4x4 squares as "baseline"). I think R packages typically expand this into C-1 extra columns of 0s and 1s, and do the loop over all pattern values. Of course, just selecting parameter[x] for pattern value C=x and ignoring the other 3^8 - 2 parameters is much faster, as you pointed out.

For me all this stuff is part of "old statistics" and to be avoided if you only need to get things done (as opposed to drawing precise conclusions). They use C-1 to avoid the linear dependency that makes matrix methods fail. But not only do gradient methods not share this problem (other than becoming slower in extreme cases), the "problem" actually disappears when you add a regularisation term, i.e. the solution becomes unique again, AFAIK. For these reasons and a few others, I recommend shying away from statistics books and instead embracing the IMO more uniform view of ANNs. It's also a lot more "hands on" and therefore accessible to more programmers.

Rein Halbersma wrote:
> I have one other question: I read about Texel's tuning and I am puzzled by the encoding of the game results as 0, 1/2 and 1. In draughts, draws occur very frequently, and it's probably not entirely correct to use a continuous scale for a 3-valued ordinal response variable. I am used to doing "ordered logit" regression, where the results are (from White's perspective) L < D < W and you estimate meta parameters for the draw_W and draw_B thresholds. So in terms of likelihood:
>
> Lik(win_W) = 1 - F(w_dW - eval)
> Lik(draw)  = F(w_dW - eval) - F(w_dB - eval)
> Lik(win_B) = F(w_dB - eval) - 0
>
> where F(x) is equal to the usual sigmoid used in 2-valued logit, and LogLik = sum_{res, pos} LogLik(res, eval(pos)). Did you try to treat draws in this manner? Or is this somehow taken care of by your ANN approach (sorry, not familiar with it yet)?

I have always avoided ordered statistics, but what you're suggesting looks like multinomial logistic regression (https://en.wikipedia.org/wiki/Multinomi ... regression). The ANN equivalent consists of selecting a Softmax output (and the corresponding error function), so it's considered an output encoding. I actually find the Softmax maths easier to understand, even in the binary case. There's also a connection with data compression. In practice, the rare people who do handle draws separately use ad-hoc variants, probably because complicated features are not needed for that. One example is a paper by the authors of Junior: "Automatic Learning of Evaluation, with Applications to Computer Chess".

I think you are right that it would be more accurate. Nonetheless, I didn't use that as a first attempt. The uniform (Softmax) way would require doubling the number of parameters (not tripling, thanks to black/white symmetry), and the ad-hoc way (normal eval * draw factor) complicates the derivatives.
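(For comparison, the Softmax output Fabien mentions looks roughly like this: one score per outcome, normalised exponentials. A generic sketch; how the per-outcome scores are built from the features, and how black/white symmetry halves the parameter count, is left open.)

Code: Select all
#include <cmath>

struct Probs { double win, draw, loss; };

// Softmax over three per-position scores; generalises the sigmoid to 3 outcomes.
Probs softmax3(double z_win, double z_draw, double z_loss) {
    const double m = std::fmax(z_win, std::fmax(z_draw, z_loss));  // for stability
    const double ew = std::exp(z_win - m);
    const double ed = std::exp(z_draw - m);
    const double el = std::exp(z_loss - m);
    const double sum = ew + ed + el;
    return {ew / sum, ed / sum, el / sum};
}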
Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
Re: Eval tuning
Fabien Letouzey wrote:
> For me all this stuff is part of "old statistics" and to be avoided if you only need to get things done (as opposed to drawing precise conclusions). They use C-1 to avoid the linear dependency that makes matrix methods fail. But not only do gradient methods not share this problem (other than becoming slower in extreme cases), the "problem" actually disappears when you add a regularisation term, i.e. the solution becomes unique again, AFAIK. For these reasons and a few others, I recommend shying away from statistics books and instead embracing the IMO more uniform view of ANNs. It's also a lot more "hands on" and therefore accessible to more programmers.

Well, maybe I'm too "old school", although I think we are in the same age category :)
Unfortunately, the book is completely(!) silent on neural networks, so thanks for those tips. But a more advanced book (by the same authors) does discuss them. I had kind of hoped to avoid having to read that book, but I looked over the NN chapter this morning. My current understanding is like this:
- getting a good evaluation function from self-play is supervised learning / classification
- we are not interested here in statistical confidence or other interpretation; goodness of fit is king
- there are several methods to get the best predictions: logistic regression, discriminant analysis, trees/forests, support vector machines, neural networks
- each method has an objective function that is being optimized
- regularization (lasso/ridge) can help to constrain the huge set of parameters being tuned (this was the biggest insight I got so far from connecting your remarks to what I read in the book)
- each optimization problem can be solved by a variety of numerical back-end solvers (Newton, gradient, etc.)
As far as I can see now (just rough intuition, not a strongly held opinion), it's an empirical question which classification method performs best for this type of eval tuning. The numerical back-end solver has to be chosen to keep memory use within available RAM and otherwise to converge as fast as possible. And regularization is a must to avoid having to generate tens of millions of positions. Is this more or less how you see things?
BTW, do you also use cross-validation (k-fold or otherwise) on your training data to avoid over-fitting or other bias?
Fabien Letouzey wrote:
> I have always avoided ordered statistics, but what you're suggesting looks like multinomial logistic regression (https://en.wikipedia.org/wiki/Multinomi ... regression).

Multinomial regression is a more general category. It's used to model N-valued but *non-hierarchical* outcomes. It's most often used in consumer surveys where the N answers are e.g. different products (say different brands of beer). The *ordered* multinomial logit is a special case where there is a hierarchy in the choices (think different percentages of alcohol, or loss/draw/win).
Fabien Letouzey wrote:
> I think you are right that it would be more accurate. Nonetheless, I didn't use that as a first attempt. The uniform (Softmax) way would require doubling the number of parameters (not tripling, thanks to black/white symmetry), and the ad-hoc way (normal eval * draw factor) complicates the derivatives.

For N outcomes, there are N-1 meta parameters to be fit in ordered multinomial logit. It's not multiplying the complexity by N.
Because of the high drawing margin in draughts, the meta parameters are pretty high, and they mask large differences in eval. The same happens when you compute ratings. In a sense, the eval in a logistic framework is just the "rating" of a position. Because endgames are so drawish, the official draughts federation has introduced plus and minus draws (you need a 2-piece material advantage for 10 moves in the endgame to get such a score). It would be interesting to label the training data with 5-valued instead of 3-valued scores. I think you get a lot more information from this to tune the eval.
Another thing I worry about (maybe too soon) is serial correlation in the positions. It seems a pity not to exploit the fact that positions form a sequence leading from opening to endgame. A mistake in a certain pattern can be present in several sequential positions, and this could lead to bias. I would also think that positions closer to the endgame should receive larger weight in the objective function, because having an accurate eval in the late middle game is more decisive than in the opening.
- Posts: 299
- Joined: Tue Jul 07, 2015 07:48
- Real name: Fabien Letouzey
Re: Eval tuning
Rein Halbersma wrote:
> - getting a good evaluation function from self-play is supervised learning / classification

That's the way I use it, but reinforcement learning (RL) is also a possibility.

Rein Halbersma wrote:
> - we are not interested here in statistical confidence or other interpretation; goodness of fit is king

We ignore weight variance, as regularisation is supposed to make this a non-issue. It's hard to compute in large spaces anyway. In any case, we can play games to validate hypotheses, so we don't need the usual statistical measures (including cross-validation, etc.).

Rein Halbersma wrote:
> - there are several methods to get the best predictions: logistic regression, discriminant analysis, trees/forests, support vector machines, neural networks

Yes. You'll find all of them, plus some less general ones, in ML books or online resources. I would focus on numerical methods for our domain. I don't know of any success stories with trees (for full eval), except that they allow laziness ("Bootstrap Learning of alpha-beta-Evaluation Functions").

Rein Halbersma wrote:
> - each method has an objective function that is being optimized

Yes. The theory is that they all have a hypothesis space (sometimes with a hierarchical structure) and an error function. Once you have that, it becomes an optimisation problem.

Rein Halbersma wrote:
> - regularization (lasso/ridge) can help to constrain the huge set of parameters being tuned

Yes. Some methods already have this baked in, such as tree pruning.

Rein Halbersma wrote:
> - each optimization problem can be solved by a variety of numerical back-end solvers (Newton, gradient, etc.)

It depends on the method. For regression and ANNs, yes, you have a lot of choices. Trees are symbolic in nature and enumerating all of them is not possible, so it's the kingdom of heuristics. SVMs are notoriously hard to train and require quadratic programming IIRC, not to mention the kernel stuff => use a library.

Rein Halbersma wrote:
> As far as I can see now (just rough intuition, not a strongly held opinion), it's an empirical question which classification method performs best for this type of eval tuning. The numerical back-end solver has to be chosen to keep memory use within available RAM and otherwise to converge as fast as possible. And regularization is a must to avoid having to generate tens of millions of positions. Is this more or less how you see things?

Yes. As a rule of thumb, I would say that SVM >= ANN > regression (single neuron) in pure accuracy. The problem is run-time speed, and it's usually inversely related. Most of us consider (large) ANNs too slow, although I gave the example of Hannibal, and IIUC SVMs are even slower. This complicates the decision.

Rein Halbersma wrote:
> BTW, do you also use cross-validation (k-fold or otherwise) on your training data to avoid over-fitting or other bias?

I make a distinction: regularisation fights overfitting, but requires a hyper-parameter. Cross-validation is one way to select a value. If I wanted to try a few values, I wouldn't bother to leave examples out, as data is plentiful in our case (intuitively: you can cheat, but not too much). Cross-validation was designed at a time when 1000 examples were "plentiful". Furthermore, we can play games for important choices; this takes search/eval interaction into account, so it's better.

Rein Halbersma wrote:
> Multinomial regression is a more general category. It's used to model N-valued but *non-hierarchical* outcomes. The *ordered* multinomial logit is a special case where there is a hierarchy in the choices (think different percentages of alcohol, or loss/draw/win).

I haven't thought about it carefully, but I wouldn't bother for only 3 possible outcomes. In our case, a good way is probably to have a win/loss function (which we already know how to design) and a draw/non-draw one (which seems much easier anyway).

Rein Halbersma wrote:
> For N outcomes, there are N-1 meta parameters to be fit in ordered multinomial logit. It's not multiplying the complexity by N.

I see. In that case I can look at it in two ways. One is as a means to re-use existing methods with minimal changes; maybe that was your intent. Yes, it looks like a cheap way to try to improve accuracy, but it might require slower games. Another interpretation is to actually predict draws, in which case I don't think any constants are going to cut it; we have to take the position into account. Softmax would do that "automatically", with the drawback that it would exploit the same features that we use for win/loss scoring (which feels like overkill to me).

Rein Halbersma wrote:
> Because of the high drawing margin in draughts, the meta parameters are pretty high, and they mask large differences in eval. Because endgames are so drawish, the official draughts federation has introduced plus and minus draws. It would be interesting to label the training data with 5-valued instead of 3-valued scores.

Your ideas all have serious improvement potential; the sky is the limit. But, as you can feel, the "basic" method is already not so easy to get working. Using different rules during learning and while playing games is certainly going to complicate things; there is most likely a limit to what can be accomplished that way. Now, if the tournament (or game-testing) rules are going to change, we are in business!

Rein Halbersma wrote:
> Another thing I worry about (maybe too soon) is serial correlation in the positions. It seems a pity not to exploit the fact that positions form a sequence leading from opening to endgame. A mistake in a certain pattern can be present in several sequential positions, and this could lead to bias.

I'm not sure what you are aiming at. If you are saying that we routinely violate the independence assumption (IID), then yes, we just ignore that. Nobody wants to take a single position per game to avoid this concern. You can also consider that the training set is composed of games instead of positions. If they are of comparable length, it shouldn't matter much (depending on how long the patterns stay).

Rein Halbersma wrote:
> I would also think that positions closer to the endgame should receive larger weight in the objective function, because having an accurate eval in the late middle game is more decisive than in the opening.

All ML methods can be adapted to handle weights (very easily for numerical ones). Boosting requires this, so it was a big thing in the early 2000's. Weka also handles weights everywhere, as I recall. But then you have an additional hyper-parameter on your hands, AND you need more data, since this also artificially reduces the variance.
Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
Re: Eval tuning
Fabien Letouzey wrote:
> I make a distinction: regularisation fights overfitting, but requires a hyper-parameter. Cross-validation is one way to select a value. If I wanted to try a few values, I wouldn't bother to leave examples out, as data is plentiful in our case (intuitively: you can cheat, but not too much). Cross-validation was designed at a time when 1000 examples were "plentiful". Furthermore, we can play games for important choices; this takes search/eval interaction into account, so it's better.

With k-fold cross-validation you don't throw any data away. You just average K sets of parameters that are fitted by leaving 1/K of the data out in each "fold" for validation purposes. But you are probably right that you can always re-fit after you get more games anyway.
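(The fold bookkeeping is a few lines; a generic sketch, with the caveat, tied to the serial-correlation point below, that one would typically assign whole games rather than single positions to a fold.)

Code: Select all
#include <vector>

// Assign n training games to K folds; in round k, fold k is the validation set
// and the other K-1 folds form the training set, so no data is thrown away.
std::vector<int> fold_of(size_t n_games, int K) {
    std::vector<int> fold(n_games);
    for (size_t i = 0; i < n_games; ++i) fold[i] = static_cast<int>(i % K);
    return fold;
}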
Fabien Letouzey wrote:
> Your ideas all have serious improvement potential; the sky is the limit. But, as you can feel, the "basic" method is already not so easy to get working. Using different rules during learning and while playing games is certainly going to complicate things; there is most likely a limit to what can be accomplished that way.

Actually, it doesn't matter if one uses different rules in learning and playing: scoring plus/minus draws in the tuning will just magnify the variance in the response variable to be classified. It's about distinguishing bad draws from good draws. You might as well score the training data with the material balance and tune the eval pattern parameters to fit the material difference (mapped to the sigmoid scale). This gives the response even more variance (and it is almost perfectly correlated with the L/D/W scoring).
Fabien Letouzey wrote:
> I'm not sure what you are aiming at. If you are saying that we routinely violate the independence assumption (IID), then yes, we just ignore that. Nobody wants to take a single position per game to avoid this concern. You can also consider that the training set is composed of games instead of positions. If they are of comparable length, it shouldn't matter much (depending on how long the patterns stay).

In the usual linear regression, serial correlation is a big problem and leads to biased parameters. The way to handle it is to introduce a time label (the move number here). In linear regression you would impose a panel data structure to account for the correlation:

Code: Select all
Y_it = sum_j X_ijt * b_j + epsilon_it + epsilon_i

which means (in eval tuning) modelling each game i as t = 1..T positions. Each position outcome Y_it in a game is actually the same value Y_i. So being wrong about a specific long-term pattern would penalize the rest of the game with it, which loses information on the other coefficients. By decomposing the error into a position-specific part (epsilon_it) and a game-specific part (epsilon_i), you can compensate for that. I don't know the benefit of this here, but in linear regressions the overall fit can improve enormously.
Re: Eval tuning
Fabien, just for my understanding:
I think you started by generating games with a material-only evaluation (so really bootstrap learning), and based on the outcome of these games you optimized the features.
How many games did you use for calculating the final weights? I think Michel used 1M games, but I guess as you have fewer parameters, 10K - 100K might also do.
To avoid a high draw level (if all games end in a draw, there is little to learn), you need the right search depth. If it is too high it takes ages, so did you go for a fixed depth, or just limit the game time?
Last but not least, did you use all positions? In the opening you have many overlapping positions but with different game outcomes, so you could omit the first x moves to deal with this; the same holds for the endgame, where the table-bases could take over. I assume that the relevant positions are between move 10 and move 40/50.
Bert
- Posts: 299
- Joined: Tue Jul 07, 2015 07:48
- Real name: Fabien Letouzey
Re: Eval tuning
BertTuyt wrote:
> I think you started by generating games with a material-only evaluation (so really bootstrap learning), and based on the outcome of these games you optimized the features.

Yes.

BertTuyt wrote:
> How many games did you use for calculating the final weights? I think Michel used 1M games, but I guess as you have fewer parameters, 10K - 100K might also do.

I used 600k, and patience was the main criterion. At some point I noted a gain from 200k to 400k, so quantity matters. I think the reason is that features are far from being uniformly distributed, so we can't estimate the required number of examples by theory alone. I also think that Michel used much more than 1M games.

BertTuyt wrote:
> To avoid a high draw level (if all games end in a draw, there is little to learn), you need the right search depth. If it is too high it takes ages, so did you go for a fixed depth, or just limit the game time?

I don't know whether 50% is "high". I used fixed depth for the first batch (I think it was 12 plies full-width). I quickly switched to an increment TC for all games (generation and testing).

You might be right that playing faster games would be better, but I find danger in the "draws are bad" attitude, as they are part of the game. They might only slow down testing, but things are less clear for learning.

BertTuyt wrote:
> Last but not least, did you use all positions? In the opening you have many overlapping positions but with different game outcomes, so you could omit the first x moves to deal with this; the same holds for the endgame, where the table-bases could take over. I assume that the relevant positions are between move 10 and move 40/50.

I simply skip positions that will never (or very rarely) be evaluated. So that's the opening (probably not important), captures (critical), and the endgame (6 pieces or fewer, probably not important). Your endgame move limit worries me though: we are learning eval, not search.
- Posts: 859
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
Re: Eval tuning
Fabien, it seems as if logistic regression could be used to tune more traditional draughts eval functions, in addition to the kind of eval that you used, which is mostly regions or subsets of the full board squares. Did you use this to tune the other terms in Scan's eval, like king mobility, left-right balance, etc., or are these manually tuned? Would it make sense to try to tune an eval such as Kingsrow's using logistic regression? I imagine I might find that some terms don't correlate strongly with statistical game results, in which case that might mean they aren't actually helpful and could be omitted, etc.
-- Ed
- Posts: 299
- Joined: Tue Jul 07, 2015 07:48
- Real name: Fabien Letouzey
Re: Eval tuning
Ed Gilbert wrote:
> Fabien, it seems as if logistic regression could be used to tune more traditional draughts eval functions, in addition to the kind of eval that you used, which is mostly regions or subsets of the full board squares. Did you use this to tune the other terms in Scan's eval, like king mobility, left-right balance, etc., or are these manually tuned? Would it make sense to try to tune an eval such as Kingsrow's using logistic regression?

Indeed, all of Scan's eval features were handled by the same maths as the patterns. It's probably more accurate to handle them separately, but I have no interest in doing that.

Yes, it makes perfect sense to use logistic regression (or any other method) with classical features. I guess the benefit is smaller because the drawbacks are more evident (no search, for instance). From my point of view, increasing the number of parameters is the main way to compensate. It's also very fast compared to tuning with test games, so you can focus on the features and not worry about the weights (though they won't be optimal). Details are important, so don't feel discouraged if the preliminary results are disappointing.

You can draw conclusions about features by trying various combinations or, since you presumably don't have too many weights, use standard statistics to help you with that, because testing and selecting features is a classical problem.
Re: Eval tuning
I have integrated the Scan evaluation function into Dragon and done some testing.

I have tested against 2 different eval functions of Dragon:

1) DragonBS
This evaluation function is the original eval function of Dragon. It is a manually tuned "classic" eval. It includes a breakthrough table and is actively maintained; while it may not be great, I think it is OK. I use this eval function to bootstrap the machine-learned eval.

2) DragonML
This evaluation is based on DragonBS, but includes 50M weights trained by machine learning. DragonML greatly outperforms DragonBS, as you can see in these match results (4000 games/match). This is the default eval.

For comparison, I played a match between these two versions:

Code: Select all
dragonML - dragonBS          sigma:  (score-1)/sigma:
 6 ply       win:2659 lose:116 draw:1225 score:1.636 (+/- 0.009, 74.7)
 8 ply       win:2282 lose: 73 draw:1645 score:1.552 (+/- 0.008, 65.6)
 10sec/game  win:1545 lose: 56 draw:2399 score:1.372 (+/- 0.008, 46.0)

In order to test the strength of the Scan eval, I played 1000-game matches against both my eval functions, at different search depths. I kept the same Dragon functions (endgame database, search, pruning, etc.) and only replaced the eval with that of Scan.

Code: Select all
DragonBS - Scan
 6 ply   win:153 lose:361 draw:486 score:0.792 (+/- 0.022,  -9.6)
 8 ply   win: 93 lose:322 draw:585 score:0.771 (+/- 0.019, -12.0)
 10 ply  win: 62 lose:276 draw:662 score:0.786 (+/- 0.017, -12.5)

DragonML - Scan
 6 ply   win:461 lose: 72 draw:467 score:1.389 (+/- 0.020,  19.9)
 8 ply   win:359 lose: 53 draw:588 score:1.306 (+/- 0.018,  17.1)
 10 ply  win:241 lose: 43 draw:716 score:1.198 (+/- 0.016,  12.6)

As you can see, the Scan eval wins from DragonBS, but loses from DragonML.

I realise it is a bit tricky to just swap eval functions, because there is some interaction between the search process and the eval function. For instance, some trees may be depth-reduced depending on the eval value. To be sure, I created a new, very simple alpha-beta searcher, with no pruning, depth reduction, hashtables or move sorting, but with capture extensions and quiescence search, and played that against the DragonML eval.

Code: Select all
DragonBS - Scan, trivial ab-searcher
 6 ply  win: 86 lose:297 draw:617 score:0.789 (+/- 0.018, -11.5)

DragonML - Scan, trivial ab-searcher
 6 ply  win:270 lose: 69 draw:661 score:1.201 (+/- 0.017,  11.6)

This confirms that the accuracy of the Scan eval is between DragonBS and DragonML. This means that the Scan eval() is quite a good evaluation: it clearly outperforms Dragon's bootstrapping (manually tuned) eval. However, it appears less accurate than Dragon's machine-learned eval.

To compensate for the weaker eval of Scan, you would need to give it 2 ply of extra search depth:

Code: Select all
DragonML 6 ply vs Scan at 8 ply: win:159 lose:198 draw:643 score:0.961 (+/- 0.019, -2.1)

One thing to note is that Scan is much faster than Dragon in terms of evals/second: Scan runs at 22 Mnodes/sec, Dragon at 4 Mnodes/sec. That should be enough to compensate for these 2 ply.

With Dragon losing from Scan, this should mean that Scan's search algorithm must be a lot better than Dragon's. Not only does it compensate for its weaker eval, it then wins convincingly.

For now, my conclusion is that Scan has a fairly good eval function, which runs at high speed. Add to this an excellent search algorithm, and you have the result we have seen. I'll investigate further by absorbing some of the ideas in Scan's source code, most notably the probcut/LMR combination.

Michel