Truus DXP-Server

Discussion about development of draughts in the time of computer and Internet.
FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Sat Mar 07, 2009 16:23

First of all: congratulations to Ed.

Two observations:
1) If you go down from three threads to one thread, your rating only drops a few points (and it is not clear what the cause is). I vaguely remember an article on rating as a function of search depth, and I think one extra ply of search depth gave a substantial contribution to the rating. That is not the case here.

2) Yes Rein, if you want to order the programs by their rating, you have to play a lot of games. Currently I am trying to figure out when you can say that one program is stronger than the other, as a function of the number of games, the score and the number of draws.

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Post by Rein Halbersma » Sat Mar 07, 2009 16:36

FeikeBoomstra wrote:First of all: congratulations to Ed.

2) Yes Rein, if you want to order the programs by their rating, you have to play a lot of games. Currently I am trying to figure out when you can say that one program is stronger than the other, as a function of the number of games, the score and the number of draws.
It's not only useful for title matches, but also for testing whether an improvement is statistically meaningful. I have a little program in the statistical programming language R that does this sort of calculation. Here's how to calculate it yourself (quoted from an old post of mine on Talkchess).

If you play a 100 game match with 51 wins and 49 losses, then the mean result is that you score 51 points, with a variance of 24.99 (= 51*(1-0.51)^2 + 49*(0-0.51)^2). The average result per game is 0.51, with a standard error on the mean of 0.04999 = sqrt(24.99)/100. If we assume that games are statistically independent, then the outcome of a 100 game match between two opponents that score 51-49 against each other is normally distributed around a mean of 51% with a standard deviation of 0.04999.
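The arithmetic above can be reproduced in a few lines of Python (a sketch of the same calculation, not Rein's actual R program):

```python
import math

# Per-game scores for a 100-game match: 51 wins (1 point), 49 losses (0 points).
scores = [1.0] * 51 + [0.0] * 49
n = len(scores)

mean = sum(scores) / n                        # 0.51, the average score per game
ss = sum((x - mean) ** 2 for x in scores)     # 24.99, the variance of the total score
var_per_game = ss / n                         # 0.2499
se_mean = math.sqrt(var_per_game / n)         # 0.04999, standard error on the mean

print(mean, ss, se_mean)
```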

If you want to test whether the 51-49 match was a significant improvement over a 50-50 match (again without draws), then you have to calculate the one-sided 95% confidence interval under the null hypothesis of a 50-50 result. That turns out to be a score of 58.2% or more *in a 100 game match*. Hence, a 51-49 result in a 100 game match is not a significant improvement.

How many games do you have to play for a 51-49 result to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean it turns out that you would need to score 51% in a 6700 game match in order to show a significant improvement!

Now, that was only the significance, i.e. the control of Type I errors (a 95% chance that an accepted improvement is genuine). But what do you do when the result of a 100 game match is not significant? You can't conclude that there is no improvement either, since you might be committing a Type II error (falsely rejecting a genuine improvement).

So besides significance, you also want a test with high statistical power, so that you give yourself a shot at finding, let's say, 95% of all genuine improvements. If you want to simultaneously accept 95% of the genuine improvements and also have 95% of the accepted improvements be genuine, then you need even more games. For the example at hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude, with 95% probability, that you have found a genuine improvement.
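The power calculation can be sketched the same way: with 95% significance and 95% power the normal quantile enters twice, which is the standard normal-approximation sample-size formula (assumed here; the post does not state which exact method was used):

```python
import math

Z95 = 1.6449  # one-sided 95% normal quantile, used for both error rates here

def games_for_power(edge, z_alpha=Z95, z_beta=Z95):
    """Games needed so that a true per-game edge over 50% is both accepted
    at the z_alpha significance level and detected with z_beta power,
    with a per-game standard deviation of 0.5."""
    return ((z_alpha + z_beta) * 0.5 / edge) ** 2

print(games_for_power(0.01))  # ~27,000 games for a true 51% score
```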

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Post by Rein Halbersma » Sat Mar 07, 2009 16:56

Rein Halbersma wrote:
It's not only useful for title matches, but also for testing whether an improvement is statistically meaningful. I have a little program in the statistical programming language R that does this sort of calculation.
Based on the results Ed gave yesterday, Kingsrow has an 80 point ELO advantage (a 61% score) over Truus. Here are the expected match results for Kingsrow (parallel, 9 pc) - Truus over a traditional 20 game match:

Code:

21-19  0.2%
22-18  8.5%
23-17 45.3%
24-16 40.0%
25-15  5.8%
26-14  0.1%
So 20 games should be enough to be confident in the match result if the difference is larger than 80 ELO points. In other words: if you improve your program by at least 80 ELO points, then you will already get a decisive match outcome after 20 games. If the match is even after 20 games, the improvement is less than 80 points.

If you want to test for smaller improvements, you need more games. For a 35 ELO point improvement (a 55% score, i.e. 22-18 on average) you need about 270 games to be 95% sure of such an improvement.
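For intuition, here is a simplified Python sketch of such a match prediction. It ignores draws entirely (each game is independently won with probability 0.61 or lost), so its numbers will not match Rein's table, which presumably accounts for draws:

```python
from math import comb

def match_score_dist(n_games=20, p_win=0.61):
    """Probability of each number of points in an n-game match when every
    game is independently won with probability p_win and lost otherwise."""
    return {k: comb(n_games, k) * p_win**k * (1 - p_win)**(n_games - k)
            for k in range(n_games + 1)}

dist = match_score_dist()
# chance that the stronger program scores more than half the points (11+ of 20)
p_match_win = sum(pr for k, pr in dist.items() if k > 10)
```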

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Sat Mar 07, 2009 17:17

FeikeBoomstra wrote:1) If you go down from three threads to one thread, your rating only drops a few points (and it is not sure what is the cause). I (vaguely) remember an article about rating as a function of the search depth, and I think one ply extra search depth gave a substantial contribution to the rating. That's not the case here.
I don't think this holds when the results are as lopsided as in these matches. Draughts is a drawn game, and in order to score a win the opponent has to make a mistake. Suppose that truus only makes mistakes in, say, 40 games out of 158. Then even if truus is playing a perfect opponent it will not lose more than 40 games. It may be that in these matches kingsrow was already taking advantage of almost all of truus' mistakes when it used a single search thread and the 6pc db, so adding more power had little effect on the outcome. Also, as Rein pointed out, there is a lot of variation in results with "only" 158 games played. I could probably run these 3 matches again and get +/- 5 on the number of kingsrow wins in each match.

-- Ed
Last edited by Ed Gilbert on Sat Mar 07, 2009 17:20, edited 1 time in total.

FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Sat Mar 07, 2009 17:17

Rein, if you take into account the fact that there are draws, you can decrease the number of games.
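Feike's point can be made concrete: under a 50-50 null hypothesis where a fraction d of the games are drawn (a draw scoring half a point), the per-game variance shrinks from 0.25 to 0.25*(1-d), and the number of games required shrinks by the same factor. A sketch (the 70% draw rate is just an illustrative assumption):

```python
import math

Z95 = 1.6449  # one-sided 95% normal quantile

def games_needed(edge, draw_rate, z=Z95):
    """Games needed for a score of 0.5 + edge to be significant when a
    fraction draw_rate of the games are drawn (scored as half a point).
    Under a 50-50 null the per-game variance is 0.25 * (1 - draw_rate)."""
    sd = 0.5 * math.sqrt(1 - draw_rate)
    return (z * sd / edge) ** 2

print(games_needed(0.01, 0.0))  # ~6765 games with no draws
print(games_needed(0.01, 0.7))  # ~2030 games when 70% of games are drawn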

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sat Mar 07, 2009 18:59

Interesting discussion.
I am also working on getting the 158-game match mode working in Damage.
In the past I did all kinds of tests with selfplay; it would be good to repeat this with Damage and Truus.

A question for the audience:
- Suppose you have 2 programs: A with a huge search depth, B with huge knowledge (so a very good evaluation function).

If these programs play a match with time setting X (as an example, 10 min/game), and you then play further matches where you increase the search time to 2X (20 min) and 4X (40 min) per game, which of the 2 programs will benefit the most? Or will the end result (the relative difference in strength) stay the same, only (most likely) with somewhat more draws?

Bert

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sat Mar 07, 2009 19:22

The background for my previous question is a selfplay test which I ran some time ago.

There I had 2 versions of Damage, one with a simple eval and the other with a complex evaluation.
The two versions played matches against each other with a delta in search depth (the version with the basic eval was "allowed" to think 1 - 2 - 3 ply deeper).

The outcome of that test was that the more intelligent version "appeared" to benefit more from the increased time or search depth.
Otherwise stated: a more knowledgeable program will benefit more from hardware improvements/innovation.
It could be that the result was biased by the relatively low search depths at the time, so that direct complex combinations were beyond the horizon of the "intelligent" program.

It is evident that when search depth increases further, ultimately all programs will draw (based on the assumption that Draughts is a draw when played "perfectly").

Maybe interesting to get other opinions out of this community.

Bert

FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Sat Mar 07, 2009 19:23

I suppose the law of diminishing returns applies here: the knowledge gained from each next ply of search depth decreases as a function of the depth.

So the "smarter" program will keep its advantage.

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sat Mar 07, 2009 19:41

Feike, so your theory/assumption is:

* 2 programs A and B, with different knowledge levels, both searching to the same depth d, with outcome/result X

* when you increase the search depth to d+1, the outcome remains X.

Is that the right interpretation of your thoughts (basically the concept of diminishing returns at higher search depths)? So as long as the search depths increase in step, the result for both programs will stay the same.


Bert

FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Sat Mar 07, 2009 19:57

What I thought was: both programs get the same time, so the "intelligent" one reaches a smaller search depth. Now you increase the time (or the processor speed). The relative increase in depth for the intelligent program will be bigger, so that program benefits more from the extra depth.

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Server program available.

Post by Ed Gilbert » Tue Mar 10, 2009 02:21

I'm finished (for now) making changes to the server program. If anyone would like to have the executable file, send me an email (eyg_at_prodigy_dot_net).

-- Ed

FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Thu Mar 12, 2009 14:16

With Ed's help I have truus_dxp up and running.
As you may know, I am running 64-bit Ubuntu; this is how I did it.

I used VirtualBox to run Windows XP inside Ubuntu, and in this Windows environment I run truus_dxp (which calls Truus). You have to do special things to get TCP/IP traffic between the host and the guest OS, but in the end it was quite simple for this case: you can simply forward a certain port to the guest OS, that's all.

The next step is to set up the environment for the 158-game match.

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Fri Mar 13, 2009 00:06

Feike, I'm curious how your program will score against Truus.
If you would like your 158 game-match code with me , i would appreciate this.

Bert

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Fri Mar 13, 2009 00:08

Feike, sorry, my previous mail was not clear.
I meant: if you want to share the source code for the part controlling the 158-game match, I would appreciate it.

Bert

FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Fri Mar 13, 2009 00:39

Of course I am willing to share, but all my administration is programmed in Python.

The engine is programmed in C(++), to be honest not really C++, just C using some features of C++. Originally my idea was: just give it a position and a time, and the engine will give the next move. But later on it appeared I needed some game information in the engine too, so now the engine knows about the time used and the number of moves.

But all the other things are programmed in Python: the display, the communication with the remote player and with the engine, saving/reading PDN files, various debug options; that's all Python.

What I am thinking of now is creating a file with all 79 start positions, plus an administration to keep track of the progress and recover from hiccups. That's all.
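Such an administration might look roughly like this in Python (a hypothetical sketch; the file names, formats and game identifiers are invented, not Feike's actual code):

```python
# Sketch of a match administration: read the start positions, record each
# result in a progress file, and on restart skip the games already played.
import json
import os

POSITIONS_FILE = "start_positions.txt"  # hypothetical: one position per line
PROGRESS_FILE = "match_progress.json"   # hypothetical: results recorded so far

def load_progress():
    """Return the results recorded so far, or an empty dict on a fresh start."""
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f)
    return {}

def save_result(progress, game_id, result):
    """Record one finished game and rewrite the progress file immediately,
    so a hiccup loses at most the game in progress."""
    progress[game_id] = result
    with open(PROGRESS_FILE, "w") as f:
        json.dump(progress, f)

def pending_games(positions, progress):
    """Yield the games not yet played: each of the 79 start positions
    is played twice, once with each color."""
    for i, pos in enumerate(positions):
        for color in ("white", "black"):
            game_id = f"{i}-{color}"
            if game_id not in progress:
                yield game_id, pos, color
```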
