Truus DXP-Server
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
First of all: congratulations to Ed.
Two observations:
1) If you go down from three threads to one thread, your rating only drops a few points (and it is not clear what the cause is). I (vaguely) remember an article about rating as a function of search depth, and I think one ply of extra search depth gave a substantial contribution to the rating. That's not the case here.
2) Yes Rein, if you want to order the programs by their rating, you have to play a lot of games. Currently I am busy figuring out when you can say that one program is stronger than the other, as a function of the number of games, the score and the number of draws.
- Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
- Contact:
FeikeBoomstra wrote:First of all: congratulations to Ed.
2) Yes Rein, if you want to order the programs by their rating, you have to play a lot of games. Currently I am busy figuring out when you can say that one program is stronger than the other, as a function of the number of games, the score and the number of draws.
It's not only useful for title matches, but also for testing whether an improvement is statistically meaningful. I have a little program in the statistical programming language R that does these sorts of calculations. Here's how to calculate it yourself (quoted from an old post of mine on Talkchess):
If you play a 100 game match with 51 wins and 49 losses, then the mean result is that you score 51 points, with a variance of 24.99 (= 51*(1-.51)^2 + 49*(0-.51)^2). The average result per game is 0.51 with a standard error on the mean of 0.04999 = sqrt(24.99)/100. If we assume that games are statistically independent, then the outcome of a 100 game match between two opponents that score 51-49 against each other is normally distributed around a mean of 51% with a standard error of 0.04999.
If you want to test whether the 51-49 match was a significant improvement over a 50-50 match (no draws again), then you have to calculate the 95% confidence interval (one-sided) under the null hypothesis of a 50-50 result. That turns out to be a score of 58.2% or more *in a 100 game match*. Hence, a 51-49 result in a 100 game match is not a significant improvement.
How many games do you have to play for a 51% score to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean it turns out that you would need to score 51% in a 6700 game match in order to show a significant improvement!
Now, that was only the significance, i.e. the chance of avoiding Type I errors (a 95% chance that an accepted improvement was genuine). But what do you do when the result of a 100 game match is not significant? You can't conclude that there is no improvement either, since you might commit a Type II error (falsely rejecting a genuine improvement).
So besides significance, you also want a test with high statistical power, so that you give yourself a shot at finding, let's say, 95% of all genuine improvements. If you want to simultaneously accept 95% of the genuine improvements and also have 95% of the accepted improvements be genuine, then you need even more games. For the example at hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you had found a genuine improvement with 95% probability.
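These numbers can be reproduced with a few lines of Python (standard library only) instead of R; this is a sketch of the same no-draw calculation described above, not Rein's actual program.

```python
# Significance and power calculation for an engine match, following the
# no-draw model above: per-game results are 0 or 1, so under the 50-50
# null hypothesis the per-game standard deviation is sqrt(0.25) = 0.5.
from math import ceil, sqrt
from statistics import NormalDist

z = NormalDist().inv_cdf(0.95)        # one-sided 95% quantile, ~1.645

# Score needed for significance in a 100 game match:
n = 100
threshold = 0.5 + z * 0.5 / sqrt(n)   # -> ~0.582, i.e. the 58.2% above

# Games needed before a 51% score is significant (one-sided, 95%):
delta = 0.01                          # 51% - 50%
n_sig = ceil((z * 0.5 / delta) ** 2)  # -> 6764, roughly the 6700 quoted

# Games needed for 95% significance AND 95% power at a 51% score:
n_power = ceil((2 * z * 0.5 / delta) ** 2)  # -> 27056, roughly 27,000
```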
- Rein Halbersma
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
- Contact:
Rein Halbersma wrote:It's not only useful for title matches, but also for testing whether an improvement is statistically meaningful. I have a little program in the statistical programming language R that does these sorts of calculations.
Based on the results Ed gave yesterday, Kingsrow has an 80 point ELO (61% score) advantage over Truus. Here are the expected match results for Kingsrow (parallel, 9 pc) - Truus over a traditional 20 game match:
21-19 0.2%
22-18 8.5%
23-17 45.3%
24-16 40.0%
25-15 5.8%
26-14 0.1%
If you want to test for smaller improvements, you need more games. For a 35 ELO point improvement (55% score, i.e. 22-18 on average) you need about 270 games to be 95% sure of such an improvement.
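The 270-game figure follows from the same standard-error scaling; here is a Python sketch (standard library), again assuming the no-draw model with a per-game standard deviation of 0.5:

```python
# Games needed for a 55% score to be a significant improvement
# (one-sided, 95%) over an even match, assuming no draws.
from math import ceil
from statistics import NormalDist

z = NormalDist().inv_cdf(0.95)    # one-sided 95% quantile, ~1.645
delta = 0.05                      # 55% - 50%
n = ceil((z * 0.5 / delta) ** 2)  # -> 271, close to the ~270 quoted
```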
-
- Posts: 859
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
FeikeBoomstra wrote:1) If you go down from three threads to one thread, your rating only drops a few points (and it is not clear what the cause is). I (vaguely) remember an article about rating as a function of search depth, and I think one ply of extra search depth gave a substantial contribution to the rating. That's not the case here.
I don't think this holds when the results are as lopsided as these matches. Draughts is a drawn game, and in order to score a win the opponent has to make a mistake. Suppose that Truus only makes mistakes in, say, 40 games out of 158. Then even if Truus is playing a perfect opponent, it will not lose more than 40 games. It may be that in these matches Kingsrow was already taking advantage of almost all of Truus' mistakes when it used a single search thread and the 6pc db, so adding more power had little effect on the outcome. Also, as Rein pointed out, there is a lot of variation in results with "only" 158 games played. I could probably run these 3 matches again and get +/-5 on the number of Kingsrow wins in each match.
-- Ed
Last edited by Ed Gilbert on Sat Mar 07, 2009 17:20, edited 1 time in total.
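Ed's "+/- 5" estimate is consistent with a simple binomial sketch; note that the per-game win probability below is a made-up illustration, not a number taken from the actual matches.

```python
# Run-to-run spread of the win count in a 158 game match, treating each
# game as an independent win/no-win trial. With a hypothetical 25% win
# rate, the standard deviation of the number of wins is about 5.
from math import sqrt

games = 158
p_win = 0.25                            # assumed, for illustration only
sd = sqrt(games * p_win * (1 - p_win))  # -> ~5.4 wins
```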
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
Interesting discussion.
I am also working to get the 158 game match mode working in Damage.
In the past I did all kinds of tests with self-play; it would be good to repeat this with Damage and Truus.
A question for the audience:
- Suppose you have two programs: A with a huge search depth, B with huge knowledge (so a very good evaluation function).
If these programs play a match with time setting X (as an example, 10 min/game), and you then play more matches where you increase the search time to 2X (20 min) and 4X (40 min) per game, which of the two programs will benefit the most?
Or will the end result (the relative difference in strength) stay the same, only (most likely) with somewhat more draws?
Bert
The background for my previous question is related to a self-test which I executed some time ago.
There I had two versions of Damage, one with a simple eval and the other with a complex evaluation.
Both versions played a match against each other with a delta in search depth (the version with the basic eval was "allowed" to think 1 - 2 - 3 ply deeper).
The outcome of that test was that it "appeared" that the more intelligent version benefits more from the increased time or search depth.
Otherwise stated, a more knowledgeable program will benefit more from hardware improvements/innovation.
It could be that the result was biased by the relatively low search depths at that time, so direct complex combinations were beyond the horizon for the "intelligent" program.
It is evident that when search depth increases further, ultimately all programs will draw (based on the assumption that Draughts is a draw when played "perfectly").
It would be interesting to get other opinions from this community.
Bert
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
Feike, so your theory/assumption is:
* two programs A and B, with different knowledge levels, but searching to the same depth d, with outcome/result X;
* when you increase the search depth to d+1, the outcome remains X.
Is that the right interpretation of your thoughts (basically the concept of diminishing returns at higher search depths)? In other words, as long as the search depths stay the same for both programs, the result will stay the same.
Bert
-
- Posts: 859
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Server program available.
I'm finished (for now) making changes to the server program. If anyone would like to have the executable file, send me an email (eyg_at_prodigy_dot_net).
-- Ed
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
With Ed's help I have truus_dxp up and running.
As you may know, I am running 64-bit Ubuntu; this is how I did it.
I used VirtualBox to run Windows XP inside Ubuntu, and in this Windows environment I run truus_dxp (which calls Truus). You have to do some special things to get TCP/IP traffic between the host and the guest OS, but in the end it was quite simple in this case: you can simply forward a port to the guest OS, that's all.
The next step is to set up the environment for the 158 game matches.
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
Of course I am willing to share, but all my administrative code is written in Python.
The engine is written in C(++); to be honest not really C++, just C using some features of C++. Originally my idea was: just give it a position and a time, and the engine will return the next move. But later on it appeared I needed some game information in the engine too, so now the engine knows about the time used and the number of moves.
Everything else is written in Python: the display, communication with the remote player and with the engine, saving/reading PDN files, various debug options. That's all Python.
What I am thinking of now is creating a file with all 79 start positions and some bookkeeping to keep track of the progress and recover from hiccups. That's all.
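A minimal sketch of that bookkeeping might look like the following in Python; the file name and the play_game() hook are placeholders for illustration, not Damage's real interfaces.

```python
# Match bookkeeping sketch: each start position is played twice (colours
# reversed), e.g. 79 positions -> 158 games. Progress is saved to a JSON
# file after every game, so an interrupted run can resume where it
# stopped instead of replaying finished games.
import json
import os

PROGRESS_FILE = "match_progress.json"   # placeholder name

def load_progress():
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return json.load(f)
    return {"results": {}}              # game id -> result string

def save_progress(progress):
    with open(PROGRESS_FILE, "w") as f:
        json.dump(progress, f)

def run_match(positions, play_game):
    """positions: list of start positions; play_game(pos, colour) -> result."""
    progress = load_progress()
    for i, pos in enumerate(positions):
        for colour in ("white", "black"):
            game_id = "%d-%s" % (i, colour)
            if game_id in progress["results"]:
                continue                # already played before the hiccup
            progress["results"][game_id] = play_game(pos, colour)
            save_progress(progress)     # persist after every game
    return progress["results"]
```

With 79 start positions this yields the 158 game schedule; after a crash, rerunning run_match() skips every game already recorded in the progress file.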