Evaluation Function Fuss
After playing a 158-game match against multiple opponents like Kingsrow, Flits and Truus, I was confronted with the sheer impossible task: what can I (or Damage) learn from the games lost?
Some easy quick fixes were possible, but to a large extent it was almost impossible.
Some reasons for this uncomfortable situation:
* When I retrieve a score, I don't have the backup position on which the score was based.
* Even when I have a specific position, it takes me a lot of effort to single-step through the evaluation function to understand what is happening.
* Although I started simple (years ago), with some hints from the Kats textbooks, over the years, through additions for specific patterns and other modifications, my evaluation code has become almost unreadable, and therefore difficult to change/optimize/manage.
* Even when I find a correction, self-play (again a 158-game match) almost never shows an improvement over the previous version, and often new unresolved riddles pop up.
I hope I'm not alone here (is anybody out there, Pink Floyd), and that some of you have clues on how to improve on this.
Sometimes I think we should go back to the drawing board, to when we had a mission related to Artificial Intelligence and self-learning systems (I still think we have to start this one day), but unfortunately advances in hardware have made us lazy.
Anyway, I'm open to hints, and if someone has suggestions, weird ideas, documents (or book recommendations) about real learning (the AI drosophila), then I'm interested...
Bert
Re: Evaluation Function Fuss
Hi Bert,
I'm not sure if I could give any useful hints to an experienced programmer like you, since I am just about starting with my own search and eval (I have a move generator + hash table so far), but here goes.
First, I would make sure to always retrieve the PV all the way down to the quiescence search, and log this information in a file after each game. This gives you the exact position the root score is calculated from. Second, I would make a print_eval_contributions() function that breaks down the eval score into a table of four columns, with for each row a) the size of the parameter, b) the value for white, c) the value for black and d) the net result. Also log this information after each move you make or ponder on. If you think Damage made a bad move somewhere, you can then, after the game, list the eval contributions for a few alternative moves as well and compare. This should give you a good way to find out which features make the engine decide on a particular move. If you think it wrongly overrates one feature over another, you can adapt.
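Just to illustrate the kind of breakdown I mean (only a sketch; EvalTerm and print_eval_contributions are names I made up, not code from Damage or any other engine):
Code:
#include <cstdio>
#include <string>
#include <vector>

// One row of the breakdown table: a named eval feature with its weight and
// the raw contribution it produced for each color.
struct EvalTerm {
    std::string name;
    int weight;   // size of the parameter
    int white;    // contribution for white
    int black;    // contribution for black
};

void print_eval_contributions(const std::vector<EvalTerm>& terms) {
    std::printf("%-16s %8s %8s %8s %8s\n", "feature", "weight", "white", "black", "net");
    int total = 0;
    for (const EvalTerm& t : terms) {
        int net = t.white - t.black;   // positive = good for white
        total += net;
        std::printf("%-16s %8d %8d %8d %8d\n",
                    t.name.c_str(), t.weight, t.white, t.black, net);
    }
    std::printf("%-16s %35d\n", "total", total);
}

int main() {
    // Dummy numbers, just to show the output format.
    print_eval_contributions({{"material", 100, 1100, 1000},
                              {"tempo", 4, 36, 28},
                              {"breakthrough", 50, 0, 50}});
}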
A basic draughts eval needs at least (in rough order of importance) material balance, tempo advancement, left/right balance, center control and back rank/breakthrough detection. I'm sure you have those things already. Finally, the most difficult part: patterns that capture long-term cramps which are not easily detected by shallow searches. For these, use Katz or the e-books of Tjalling Goedemoed. They are very didactic but also hard to translate into code, because each pattern has many exceptions.
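For the simpler terms in that list, plain bitboard counting is enough. A minimal sketch of a tempo count, assuming a 50-square bitboard where bit i stands for square i+1 (your layout may well differ):
Code:
#include <bit>
#include <cstdint>

// Tempo = total number of rows each man has advanced from its own back rank.
// Assumes bit 0 = square 1 ... bit 49 = square 50, white moving toward square 1.
inline int tempo(uint64_t men, bool white) {
    int t = 0;
    for (int row = 0; row < 10; ++row) {
        uint64_t row_mask = 0x1FULL << (5 * row);   // the five squares of this row
        int advancement = white ? (9 - row) : row;  // rows crossed so far
        t += std::popcount(men & row_mask) * advancement;
    }
    return t;
}

// tempo(white_men, true) - tempo(black_men, false) is the tempo difference,
// which the eval would then weight by game phase.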
If readability of your code is a problem: make a small function for each contribution and collect them in the main eval(). I used to have many manually copy-pasted unrolled loops and other speed hacks in my move generator. Not any more: overall readability of the code greatly speeds up development, with hardly any speed penalty. __forceinline is your best friend. The Stockfish chess engine is a great source of inspiration to learn from (it gave me the idea for a C++ template engine).
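In that spirit the skeleton can be as simple as this (a sketch with made-up feature functions and a placeholder Position type, not code from Damage, Kingsrow or Stockfish):
Code:
#include <cstdint>

struct Position { uint64_t white_men, black_men, white_kings, black_kings; };

// Portable spelling of the forced-inline hint mentioned above.
#if defined(_MSC_VER)
  #define EVAL_INLINE __forceinline
#else
  #define EVAL_INLINE inline __attribute__((always_inline))
#endif

// Each feature is a tiny function that can be read, tested and logged on its own.
EVAL_INLINE int material(const Position& p)         { /* ... */ return 0; }
EVAL_INLINE int tempo_difference(const Position& p) { /* ... */ return 0; }
EVAL_INLINE int wing_balance(const Position& p)     { /* ... */ return 0; }
EVAL_INLINE int breakthrough(const Position& p)     { /* ... */ return 0; }

int eval(const Position& p) {
    return material(p)
         + tempo_difference(p)
         + wing_balance(p)
         + breakthrough(p);
}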
Automated learning is an interesting topic, as is statistical data mining on game databases, but it's getting late...
Rein
Re: Evaluation Function Fuss
The problem is that after a while, you optimize and add to your code, and it becomes a fast but completely unreadable mess.
I don't see any good solution to it, and I haven't changed the basic evaluation in Dragon for years, except for adding more stuff and exceptions to it.
A couple of years ago I had an idea: create a new evaluation function and make it both readable and self-explanatory, with code like this:
Code:
} else if (moveability == 1) {
    // a threat causes a forced move: penalize
    posScore -= 150;
    if (evalDebugger) {
        // record a human-readable explanation of this eval term
        sprintf(evalExplain[debugCnt].reason, "difficult threat on %i", fieldNum(thrPos, color));
        evalExplain[debugCnt].score = -150;
        evalExplain[debugCnt].side = OWN;
        debugCnt++;
    }
    if (evalProcCounter) evalProc[gamefase].threatDiff++;
However, since the new evaluator was written from scratch, it doesn't perform well when you play it against the old evaluator (it loses even when given the same number of nodes). With enough fine-tuning there might be hope: its increased understandability might allow you to tune it so well that it compensates for the poorer performance.
But I am not doing that at the moment, and I'll probably stick with the old evaluator forever.
Michel
Re: Evaluation Function Fuss
Hi Bert,
Yes, designing and improving the eval is the most interesting part of writing a checkers program, and certainly the most time consuming. I don't have any easy answers, but I can describe some things that seem to work for me.
The kingsrow eval function is the sum of a few dozen terms. I keep track of these terms separately and then add them up at the end. I have a button on the user interface toolbar which calls the eval function with the present board position and displays all these eval terms. I find this to be my most useful tool for debugging and understanding.
I don't think I have sacrificed readability in the eval function for speed. Other than using table lookups in a few places to speed up some of the calculations, the code is fairly well structured and easy to read, at least by my standards. I use _forceinline in a few places to get the readability of function calls without the overhead.
I find that I get many of my ideas for improvements and corrections by watching the automated games and reviewing the losses. When you see your program get stuck in the same lock a few times, or give up the same runaway path, you know where the problem is. Seeing patterns and themes in the games gives me ideas.
I don't think 158 games is nearly enough to find out if a change has helped or not, unless it is a really major change. I used to have a half dozen slow computers lying around (leftovers from building databases), and I would run matches on all these, and sometimes my wife's and my daughter's computers as well, to get a large number of games. Now that I have an 8-core machine I can run 8 matches simultaneously on that. For evaluating changes in the eval function I think blitz games are ok, and I've been using time controls like one or two minutes per 75 moves. Still, even with over 1000 games, you will often not get a conclusive answer, and then you have to go with your instincts.
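To put some numbers on that (a back-of-the-envelope sketch, not anything from kingsrow): with a score fraction p over n games the standard error is roughly sqrt(p(1-p)/n), so even a fairly long match leaves a margin of several percentage points of score.
Code:
#include <cmath>
#include <cstdio>

// Rough two-sigma error margin on a match score (treating draws as 0.5,
// which slightly overestimates the variance - good enough for a sanity check).
int main() {
    double wins = 30, draws = 110, losses = 18;   // a hypothetical 158-game match
    double n = wins + draws + losses;
    double p = (wins + 0.5 * draws) / n;          // score fraction
    double se = std::sqrt(p * (1.0 - p) / n);     // standard error
    std::printf("score %.3f +/- %.3f (2 sigma over %.0f games)\n", p, 2.0 * se, n);
    return 0;
}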
-- Ed
Re: Evaluation Function Fuss
Another thing I use a lot is a menu command that runs a test for eval symmetry errors. It generates millions of random board setups and passes each of these to the eval function twice. On the second call it passes the color-reversed mirror of the board and checks that the two eval scores are symmetric. I run this every time I make a change in the eval code, and I can't tell you how many times this has caught bugs after I thought the code should be correct. If you have never done this before you might be surprised to find some symmetry bugs that have existed for years in your code.
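The test itself can be very small. A sketch, assuming a 50-square bitboard layout and an eval() that returns the score from white's point of view (mirror(), eval() and the layout here are placeholders, not the kingsrow code):
Code:
#include <cstdint>
#include <cstdio>
#include <random>

// Rotate the board 180 degrees: square s (1..50) maps to square 51-s.
uint64_t mirror(uint64_t b) {
    uint64_t m = 0;
    for (int i = 0; i < 50; ++i)
        if (b >> i & 1) m |= 1ULL << (49 - i);
    return m;
}

int eval(uint64_t wm, uint64_t bm, uint64_t wk, uint64_t bk); // link against your eval

void symmetry_test(long trials) {
    std::mt19937_64 rng(12345);
    for (long t = 0; t < trials; ++t) {
        // Random setups need not be reachable positions for this test.
        uint64_t wm = rng() & ((1ULL << 50) - 1);
        uint64_t bm = rng() & ((1ULL << 50) - 1) & ~wm;   // no shared squares
        int s1 = eval(wm, bm, 0, 0);
        // Same position with colors swapped and the board mirrored:
        // a white-point-of-view eval should return exactly the negated score.
        int s2 = eval(mirror(bm), mirror(wm), 0, 0);
        if (s1 != -s2)
            std::printf("asymmetry at trial %ld: %d vs %d\n", t, s1, s2);
    }
}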
-- Ed
Re: Evaluation Function Fuss
So far I recognize much in the comments being made:
- I have split the evaluation function into several parts based on concepts like material, general patterns, locks (left/right wings), breakthrough patterns and outposts (in Dutch: voorpost). I learned most from the two Kats textbooks.
- Like Ed, I have the option to display the eval sub-function scores after every move.
- And I am also working on tracing the end position (next to the value), although there are still some small issues which sometimes frustrate this.
- I like Ed's idea of testing symmetry, as my eval is (or should be) also symmetrical.
But still I don't like this approach.
I really think we should, next to the other creative ideas, focus more on automatic learning.
Especially as programs become smarter, we could reach the situation where humans are no longer capable of understanding the underlying principles of the game and/or the mistakes being made.
Some other ideas:
- Automatic generation of an evaluation function based on a know-how database (describing patterns and situations), and subsequent automatic parameter tuning/optimization. I think something similar has happened in GWD (at least the pattern-input part). Maybe we need something like an evaluation draughts language (EDL) to describe all features.
- Make the eval board-implementation independent, so we could exchange evals (or sub-evals in the future), by using things like whiteman(B35) to describe a white man on square 35 (a sketch follows below). I'm not sure it is possible to write an eval completely this way (in my case it isn't, as I use/construct some generic patterns by shift, and, or, which is bitboard-implementation dependent).
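Something like the following is what I have in mind (just a sketch; the Board type and the outpost example are only for illustration, and a bitboard engine would implement the accessors as cheap inline bit tests):
Code:
#include <cstdint>

// Hypothetical board-representation-independent accessors; squares are 1..50.
struct Board {
    uint64_t white_men, black_men, white_kings, black_kings;
    bool whiteman(int sq)  const { return white_men   >> (sq - 1) & 1; }
    bool blackman(int sq)  const { return black_men   >> (sq - 1) & 1; }
    bool whiteking(int sq) const { return white_kings >> (sq - 1) & 1; }
    bool blackking(int sq) const { return black_kings >> (sq - 1) & 1; }
    bool empty(int sq)     const { return !(whiteman(sq) || blackman(sq) ||
                                            whiteking(sq) || blackking(sq)); }
};

// An exchangeable sub-eval written purely in terms of square numbers,
// e.g. a white outpost on 24 supported from 29 or 30 (illustrative value only).
inline int outpost_24(const Board& b) {
    if (b.whiteman(24) && (b.whiteman(29) || b.whiteman(30)))
        return 20;
    return 0;
}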
I'm curious whether others have ideas related to auto-learning, or experience with it. Otherwise, isn't this something for a university student?
Bert
Re: Evaluation Function Fuss
Coding patterns and their evaluation functions is a tedious and error-prone task, as I know from my chess program ZZZZZZ. Because draughts lends itself better to bitboard-based pattern evaluation, GWD uses an easy-to-use editor to define patterns and their corresponding (dynamic) evaluation functions.
The patterns are compiled into bitboards and the evaluation functions into reverse-polish notation at program startup, for efficient evaluation. The patterns do not have to be defined twice (once from white's point of view and once from black's), which could introduce errors; they are defined from white's point of view only and automatically mirrored to derive the black versions. In fact, GWD does not mirror the patterns but mirrors the board and evaluates the patterns for the mirrored board as well.
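To sketch the idea (this is only an illustration, not GWD's actual data structures): a compiled pattern boils down to a few masks that are matched with bitwise tests, and the same pattern is applied for black by mirroring the board and swapping the colors.
Code:
#include <cstdint>

// Illustrative compiled pattern: which squares must hold white men, black men,
// or be empty for the pattern to match.
struct Pattern {
    uint64_t white_mask;
    uint64_t black_mask;
    uint64_t empty_mask;
    int score;            // or an index into a compiled RPN expression
};

inline bool matches(const Pattern& p, uint64_t white, uint64_t black) {
    uint64_t occupied = white | black;
    return (white & p.white_mask) == p.white_mask
        && (black & p.black_mask) == p.black_mask
        && (occupied & p.empty_mask) == 0;
}

// Defined once from white's point of view; applied for black as
//   matches(p, mirror(black), mirror(white))
// where mirror() maps square s to square 51-s.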
Now, to understand the score GWD shows: GWD always returns a full PV (which means GWD does not derive the PV from the transposition table but maintains it itself, and does not use transposition-table scores in non-minimal-window searches, to avoid transposition-table score cutoffs), and it ensures that the score at the end of the PV is (apart from the sign) always equal to the root score (which means that GWD handles alpha- and beta-dependent fractional extensions in a certain way in non-minimal-window searches). GWD prints all matched patterns and their corresponding scores at the end of the PV:
Code:
18 29592 (0.00) 16.28 15295362 ==== 39-33 12-18 33x24x29 20x29x24 31-27 11-16 40-34 29x40x34 35x44x40 07-11 47-41 03-09 38-33 09-14 42-38 14-20 43-39 08-12 44-40 20-24 39-34 12-17 29592
.. .. .. .. ..
.. .. .. .. ..
bO .. bO .. bO
bO bO bO bO ..
bO .. bO bO ..
bO wO wO .. ..
.. wO wO wO ..
wO wO wO .. wO
wO .. .. .. wO
.. .. .. .. ..
white to move
pattern Kroonschijf for white has score 30000
pattern Kroonschijf for black has score 30000
pattern Olympische formatie for white has score 30000
pattern KV-zwak-1 for white has score 0
pattern KV-zwak-1 for black has score 0
pattern LV-zwak-1 for white has score 0
pattern LV-zwak-1 for black has score 0
pattern Tempo1 for white has score 0
pattern Tempo1 for black has score 0
pattern Tempo2 for white has score 4
pattern Tempo2 for black has score 0
pattern Tempo3 for white has score 12
pattern Tempo3 for black has score 9
pattern Tempo4 for white has score 12
pattern Tempo4 for black has score 16
pattern Tempo5 for white has score 10
pattern Tempo5 for black has score 15
pattern Tempo6 for white has score 0
pattern Tempo6 for black has score 6
pattern Tempo7 for white has score 0
pattern Tempo7 for black has score 0
pattern Tempo8 for white has score 0
pattern Tempo8 for black has score 0
pattern Tempo9 for white has score 0
pattern Tempo9 for black has score 0
pattern Vleugelverdeling for white has score -200
pattern Vleugelverdeling for black has score 200
pattern Centrumverdeling for white has score 0
pattern Centrumverdeling for black has score 0
As you can see, the root score (29592) is equal to the score of the position at the end of the PV.
Due to lack of time Klaas and I hardly play test games with GWD (except during computer draughts tournaments). I do have a lot of experience testing my chess program ZZZZZZ, and I have found the following approach to be very effective.
I play test games against a very strong chess program (Rybka). Most of the time Rybka would have played a different move than the one ZZZZZZ actually plays, but the difference between the scores is only small. Of course, as the game progresses the differences gradually build up to a real disadvantage. Analyzing the small differences was not very helpful, one of the reasons being that in chess openings many moves are playable. However, I noticed that in certain positions the (usually already positive) score of Rybka's position suddenly went up by more than half a pawn after the move ZZZZZZ played, and that in certain positions the score of the move Rybka suggested for ZZZZZZ was positive (so the score of Rybka's position after that move would even have been negative), while the score of Rybka's position was positive by more than half a pawn after the move ZZZZZZ actually played. Analyzing these few positions greatly improved the evaluation of ZZZZZZ.
This is not easy, however. Patterns that should be evaluated as positive but are evaluated as negative are especially hard to find, as positions containing these patterns are pruned from the alpha-beta search and do not show up in the PV. Conversely, patterns that should be evaluated as negative but are evaluated as positive are easier to find, as positions containing these patterns do show up in the PV. The position above, for example, shows that the reason GWD evaluates it to 29592 is apparently the pattern 'Olympic formation'.
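In code the idea is no more than something like this (just a sketch, the data layout is made up): flag the moves after which the strong opponent's score for its own position jumps by more than half a pawn.
Code:
#include <cstddef>
#include <cstdio>
#include <vector>

// scores[i] = the strong opponent's evaluation of its own position after our
// i-th move, in centipawns. Flag moves after which that score jumps sharply;
// those are the positions worth analyzing by hand.
void flag_suspect_moves(const std::vector<int>& scores, int threshold_cp = 50) {
    for (std::size_t i = 1; i < scores.size(); ++i) {
        int jump = scores[i] - scores[i - 1];
        if (jump > threshold_cp)
            std::printf("after our move %zu: opponent score %d -> %d (+%d)\n",
                        i + 1, scores[i - 1], scores[i], jump);
    }
}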
Regards,
Gijsbert
Re: Evaluation Function Fuss
As an example of the above analysis: during the last tournament at Culemborg, GWD played a game against Damage. After Damage's move 41... 21-27, the following position was on the board:
GWD's position is not good, and it did not improve after 48.42-38, because GWD lost a man after 48... 12-17.
Even a non-draughts player like me can see that 48.39-34, with the idea 34-30, 30-25, 20-14, at least gives some counterplay.
So the first question was why GWD did not see that its position was bad. After some trial and error, adding the following pattern seemed to help:
together with one or more white men on any of the squares 25, 30, 34, 35. (You should mirror the pattern to apply it to the position above.)
Although GWD now noticed sooner that its position was bad, it would still play 48-42 under tournament conditions (say about 20 seconds per move), but there was a fail high after 30 seconds on the move 39-34. I noticed that it took quite some time to search the remaining moves after 48-42. GWD uses two search processes, and I happened to have the terminal that the second process writes to open, and I noticed that it had stopped searching and that it had searched the move 48-42 only. That was not expected, as the second process should search the current move indefinitely, interrupted by the first process. However, a small code change a couple of years ago caused the second process to search with a time limit of 30 seconds and, worse, to search only the first move, not the other moves.
With that also fixed, GWD now plays 48.39-34 under tournament conditions.
Gijsbert
Re: Evaluation Function Fuss
A pity that GWD (and the other programs) are not yet able to detect these patterns themselves.
Does someone have positive examples (from chess, for example) where programs have been successful at this?
My vision is that one day we will be able to connect a completely dumb program via DXP to Kingsrow, and that after a week (and zillions of games) the program will beat Kingsrow in a match.
If a computer can win Jeopardy (IBM with Watson), why not this task...
Bert
Re: Evaluation Function Fuss
Interesting to read old posts (10 years ago).
Especially when we see the progress that has been made in evaluation functions (and the like).
Bert