statistics

MichelG · Post by **MichelG** » Fri Jan 28, 2011 09:24

BertTuyt wrote: With the modified evaluation function (+ voorpost) and the symmetry bugs corrected (as detected and fixed by Ed), I ran another 158-game Horizon vs. Damage test.
The end result: 33+ 3- 122=. Compared with the previous match with the bugged evaluation (25+ 5- 128=), that is not really better (in reality somewhat worse).

But I think this is a statistical fluctuation (at least I hope so), as others have also observed.
If I remember correctly, Ed also did not find a huge (or any) difference with the modified eval.
Bert

158 games seems far too few to determine whether one version of a program is better than another.

I usually compute a match score as

Code:

score=(2*win+draw)/(number of games)

This gives a score between 0 (player 1 loses every game) and 2 (player 1 wins every game), with 1 meaning the players are equal.
If you compute the standard deviation of this score (the square root of its variance), you get

Code:

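n1 = win + draw + lose;  /* total number of games */
/* variance of the per-game score (0, 1 or 2) divided by n1; floating point avoids integer division */
sigma = sqrt(((4.0*win + draw)/n1 - (2.0*win + draw)*(2.0*win + draw)/(n1*n1)) / n1);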

For 1000 games between players of equal strength, I typically get score=1.000 and sigma=0.020. For 158 games, sigma would be about 0.06 (estimated).
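As a rough check of the two formulas above, here is a minimal Python sketch applied to the results quoted in this thread. The function name `match_stats` and the 200+ 200- 600= split for the 1000-game case are assumptions made for illustration; they are not from the original post.

```python
import math

def match_stats(win, draw, lose):
    """Return (score, sigma) on the 0..2 scale described in the post."""
    n = win + draw + lose
    score = (2 * win + draw) / n            # mean points per game
    second_moment = (4 * win + draw) / n    # mean of squared points per game
    sigma = math.sqrt((second_moment - score ** 2) / n)  # std. dev. of the mean
    return score, sigma

# Results quoted in the thread, plus an assumed 1000-game split (hypothetical).
matches = {
    "bugged eval (25+ 5- 128=)": (25, 128, 5),
    "fixed eval (33+ 3- 122=)": (33, 122, 3),
    "assumed 200+ 200- 600=": (200, 600, 200),
}
for label, (w, d, l) in matches.items():
    score, sigma = match_stats(w, d, l)
    print(f"{label}: score={score:.3f} sigma={sigma:.3f}")
```

With draw rates near 80%, as in the two 158-game matches, sigma comes out at roughly 0.03 to 0.035, which is smaller than the 0.06 estimate and shows how much the result depends on the draw rate.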

Or in other words, if you want to be 95% sure that player 1 is better than player 2 in a 158-game match, it needs to score at least 56%-44% or so. Any score between 44 and 56 percent doesn't mean much in such a short match.
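The threshold can be sketched as a function of match length and draw rate. This is only an illustration under stated assumptions: both players are taken to be equally strong (so the per-game variance on the 0..2 scale is 1 minus the draw rate), the cutoff is about two sigma, and the name `min_winning_percentage` and its parameters are hypothetical.

```python
import math

def min_winning_percentage(n_games, draw_rate, z=2.0):
    """Percentage score needed to exceed z sigma above 50%.

    Assumes equal-strength players, so wins and losses each occur with
    probability (1 - draw_rate) / 2 and the per-game variance on the
    0..2 scale is 1 - draw_rate. z=2.0 is an assumed two-sigma cutoff.
    """
    sigma = math.sqrt((1.0 - draw_rate) / n_games)  # std. dev. of the mean score
    threshold_score = 1.0 + z * sigma               # on the 0..2 scale
    return 50.0 * threshold_score                   # convert to a percentage

for draw_rate in (0.6, 0.8):
    pct = min_winning_percentage(158, draw_rate)
    print(f"draw rate {draw_rate:.0%}: need about {pct:.1f}%")
```

A higher draw rate lowers the threshold, since drawn games add no variance to the score.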

I am doing the calculations in my head, so they may be off a little, and they also depend on the draw rate.

Michel

Rein Halbersma · Post by **Rein Halbersma** » Fri Jan 28, 2011 12:34

Michel,

There was a post by Remi Coulom (author of BayesElo) on Talkchess that explains that likelihood-of-superiority between two programs does not depend on the drawing margin, nor on the overlap in confidence intervals of both programs' ratings.

See http://talkchess.com/forum/viewtopic.ph ... hlight=los for more information
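To illustrate why draws drop out, here is a minimal sketch of a commonly used normal approximation for likelihood of superiority, which uses only the win and loss counts. This is an assumption chosen for illustration and not BayesElo's actual computation; BayesElo uses its own model and prior, so its reported numbers can differ. The function name `los_approx` is hypothetical.

```python
import math

def los_approx(wins, losses):
    """Approximate likelihood of superiority from wins and losses only.

    Normal approximation: LOS = Phi((wins - losses) / sqrt(wins + losses)).
    Draws do not appear in the formula at all.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    z = (wins - losses) / math.sqrt(decisive)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Win/loss counts quoted in the thread; the draw counts are not needed.
for wins, losses in ((25, 5), (33, 3)):
    print(f"{wins}+ {losses}-: LOS is about {los_approx(wins, losses):.4f}")
```

Adding or removing drawn games leaves this value unchanged, which is the point made above.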

Rein

Ed Gilbert · Post by **Ed Gilbert** » Sat Jan 29, 2011 00:48

Here's what Remi's own program Bayeselo says about the 2 matches.

First match:

Code:

Rank Name      Elo    +    - games score oppo. draws
   1 damage     15   20   20   158   56%   -15   81%
   2 horizon   -15   20   20   158   44%    15   81%

LOS:
         da ho
damage      93
horizon   6

Second match:

Code:

Rank Name      Elo    +    - games score oppo. draws
   1 damage     23   20   20   158   59%   -23   77%
   2 horizon   -23   20   20   158   41%    23   77%

LOS:
         da ho
damage      99
horizon   0

For each match, the first Bayeselo table gives each program's Elo relative to some nominal value, and, IIRC, the + and - columns give the Elo margins for a 95% confidence level. The second table gives the likelihood of superiority (LOS) in percent. So Damage has a 93% LOS from the first match and 99% from the second.

I agree with Michel that 158 games is usually not enough to establish LOS with a high probability, but when the results are this lopsided, perhaps it is.

-- Ed

BertTuyt · Post by **BertTuyt** » Sat Feb 05, 2011 13:53

I played 13 fixed-depth matches of 158 games each between Horizon and Damage, to test the effect of search depth and of the different evaluation functions (Horizon and Damage).
Will post some results here on the Forum.

And maybe some of you can derive some statistics from it....

Bert

World Draughts Forum
