Rein Halbersma wrote:TAILLE wrote:
In the situation
v1 = +50 and n1 = 500 000
v2 = +30 and n2 = 100 000
I will decide p2 the most promising move
but in the situation
v1 = +350 and n1 = 500 000
v2 = +330 and n2 = 100 000
I will continue to use p1 because with v1 = +350 I am pretty confident to find quicky the win.
In other words when v becomes high I diminsh the influence of n.
I guess it could not be solid for a mathematical point of view but at least it is in accordance to my knowledge of the game ... and it works quite well doesn'it?
Interesting trade-off! I think I disagree. v2 = +330 is also an almost decisive advantage, not that much worse than v1 = +350. Your confidence in finding a quick win for p1 seems questionable. After, you already spent 500'000 effort on it. Seems equally logical to try p2 and see if the +330 converges to a win even quicker!
Your conclusion that it work "quite well" is 100% hindsight-bias (as Fabien put it: showing the lottery winner) because 24-30 actually did win in this example, and 13-19 not. In any case, single examples can only be used as encouragement that something good is being discovered, but to fine tune it you need large samples and statistical testing. At least in the bandit literature, sqrt(ln N / n) has been shown to be optimal ("minimal regret"). sqrt(N/n) leads to too much exploration, and sqrt(1/n) to too little exploration.
I would be interested in seeing how a sqrt(ln N/n) approach would work in your example. E.g. it could be possible that 13-19 (which you showed to be the early "leader") would be refuted quicker by 24-30. I wouldn't be surprised if 24-30 would be explored even more with a UCB term compared to your current version. Of course, there's no way of knowing until you do the experiment
Also: my confidence is increasing rapidly that your algorithm is sound now!
Hi Rein,
First of all I am happy to see your confidence in Damy increasing.
Let me explain why I take v into account in the handling of uncertainty.
Suppose that after a certain number of iterations of your search algorithm you find that you can gain a man. Certainly your are happy because you can expect a win but if two moves are promising (here 13-19 and 24-30) the difficulty may become very great because we know that an advantage of a man is not automatcally sufficient to win.
The problem is then the following: in a position with an advantage of one man how can we recognise a really winning position?
This question may be strange but in fact it is crucial in my point view.
Diag1
White to play
For a human no question exists, it is a win, full stop.
On the other with the following position
Diag2
white to play
A human will be very great difficulties to decide what could be the result
What about programs?
Curiously it is exactly the contrary:
In Diag1 a lot of moves may take place before the first exchange which will reach the egdb and a program has great difficulty to prove a win.
In the other hand, in Diag2 it will take far less moves to reach the egdb and the program can solve such problem.
What it is the best strategy for a program to prove a win in position like the diag1?
You see my point?
If you handle uncertainty as "usual" you will never solve the problem because you will explore for a very long time all possible moves because the value of all variations is always very near from +100 and in such situation you will choose the move which is less explorer than the others.
I have not finished the analyse of such situation but limiting exploration (against exploitation) is one possibility I will continue to test deeper.