Parallel perft (with hash table)

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Parallel perft (with hash table)

Post by jj » Thu Sep 09, 2010 18:15

Hi all,

Over the last couple of weeks I've implemented a parallel version of perft, using the YBWC algorithm. To test this, I used the three 'standard' perft positions - just with a greater depth. Here are the extended counts:

Code: Select all

start position
perft(11) total    1665861398 nodes
perft(12) total   10749771911 nodes
perft(13) total   70612863532 nodes
perft(14) total  465829118473 nodes

random position [FEN "B:W6,9,10,11,20,21,22,23,30,33,37,41,42,43,44,46,K31:BK17,K24."]
perft( 9) total    1216917193 nodes
perft(10) total   13106503411 nodes
perft(11) total  183848651744 nodes
perft(12) total 2026297454724 nodes

Woldouby position [FEN "W:W25,27,28,30,32,33,34,35,37,38:B12,13,14,16,18,19,21,23,24,26."]
perft(15) total     346184885 nodes
perft(16) total    2272406115 nodes
perft(17) total   14962263728 nodes
perft(18) total  106965577859 nodes
Tests were run on (my brother's) Intel i7-920 @ 2.67 GHz, a quad-core with hyperthreading, running Windows Vista and the Java HotSpot 64-bit VM. The results using bulk counting (times in seconds) are:

Code: Select all

          start(12) random(10) Woldouby(16) avg speedup
1 thread     200.51     205.96        58.01  
2 threads    101.82      95.92        30.08         2.0
4 threads     63.70      67.60        19.32         3.1
6 threads     45.00      45.98        13.78         4.4
8 threads     39.27      40.73        11.83         5.0
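For those curious about the overall structure: the scheme is easy to sketch with Java's fork/join framework. This is only a simplified illustration, not my actual code - the Board interface and the dummy tree are placeholders for my real position and move generator, and it forks every child instead of doing a proper YBWC split, where the eldest child is searched before the remaining moves are handed out - but it shows the shape of a parallel perft:

Code: Select all

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Simplified parallel perft sketch. Not the actual YBWC implementation:
// YBWC searches the eldest child first and only then splits the
// remaining children among idle threads.
public class ParallelPerftSketch {

    // Minimal stand-in for a real draughts position + move generator.
    interface Board {
        List<Board> successors();
    }

    static class PerftTask extends RecursiveTask<Long> {
        private final Board board;
        private final int depth;

        PerftTask(Board board, int depth) {
            this.board = board;
            this.depth = depth;
        }

        @Override
        protected Long compute() {
            if (depth == 0) return 1L;
            List<Board> children = board.successors();
            if (depth == 1) return (long) children.size();   // bulk counting

            List<PerftTask> tasks = new ArrayList<>();
            for (Board child : children) {
                PerftTask task = new PerftTask(child, depth - 1);
                task.fork();                                  // hand work to the pool
                tasks.add(task);
            }
            long total = 0;
            for (PerftTask task : tasks) total += task.join();
            return total;
        }
    }

    public static void main(String[] args) {
        // Dummy board with a uniform branching factor of 3, so the sketch
        // runs stand-alone; a real engine plugs in its own position here.
        Board dummy = new Board() {
            public List<Board> successors() {
                List<Board> list = new ArrayList<>();
                for (int i = 0; i < 3; i++) list.add(this);
                return list;
            }
        };
        long leaves = new ForkJoinPool(8).invoke(new PerftTask(dummy, 10));
        System.out.println("perft(10) = " + leaves);          // 3^10 = 59049
    }
}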
I also ran a series of tests with a (shared) hash table of 4M positions. To keep things interesting, 2 more plies were added to the search depth.

Code: Select all

          start(14) random(12) Woldouby(18) avg speedup
1 thread     194.10     129.15       239.87
2 threads     98.79      75.10       145.90         1.8
4 threads     68.33      52.92        87.08         2.7
6 threads     50.09      41.45        78.01         3.4
8 threads     44.99      35.73        51.33         4.2
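Schematically, the hashed perft just caches subtree counts keyed on position and remaining depth, so transpositions are counted only once. A sketch (Position, hashKey() and successors() stand for my real classes; a real table is a fixed-size array with a replacement scheme rather than a map, and keying on the hash alone admits rare collisions unless the full position is stored in the entry):

Code: Select all

import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Perft with a shared transposition table, sketched with a ConcurrentHashMap:
// identical subtrees reached in a different move order are looked up
// instead of being searched again.
public class HashedPerftSketch {

    private final ConcurrentHashMap<Long, Long> table = new ConcurrentHashMap<>();

    interface Position {
        long hashKey();               // e.g. a Zobrist key
        List<Position> successors();
    }

    long perft(Position position, int depth) {
        if (depth == 0) return 1L;
        if (depth == 1) return position.successors().size();  // bulk counting

        long key = position.hashKey() * 31L + depth;  // mix the depth into the key
        Long cached = table.get(key);
        if (cached != null) return cached;            // hash hit

        long nodes = 0;
        for (Position child : position.successors()) {
            nodes += perft(child, depth - 1);
        }
        table.put(key, nodes);
        return nodes;
    }
}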
It would be nice if some people can confirm the perft counts and report their multithread speedups.

Jan-Jaap
www.maximusdraughts.org

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Parallel perft (with hash table)

Post by jj » Tue Sep 14, 2010 19:41

Hi, I have to add that I found out that the i7-920 was running at 3.0 GHz instead of 2.67 GHz in the above tests.

Anyway, I thought using perft is a nice way to prove the correctness of both the YBWC and (shared) hash table implementations, since the output should always be equal to the normal perft output (without parallelism and hash table). And doing this before trying my hand at a real search.

I suppose YBWC perft shows the upper limit for the speedup that is feasible for a search. Do you guys think that x 5.0 is about the maximum speedup for 8 threads on a quad core with hyperthreading?

I would appreciate some feedback on my hash table performance: a perft speedup of roughly x 50, with 20+ % hash table hits for positions 1 (start position) and 3 ('Woldouby'), and 90+ % for position 2 ('random').

Just for fun I also calculated perft at the standard depths + 4:

Code: Select all

start   (15) =  3.120.929.044.364    (2:54)
random  (13) = 28.502.670.196.300    (7:56)
woldouby(19) =    767.973.855.727    (4:59)
Nice to see those big numbers :-)

Jan-Jaap
www.maximusdraughts.org

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Parallel perft (with hash table)

Post by Rein Halbersma » Wed Sep 15, 2010 20:48

jj wrote:

Code: Select all

start position
perft(14) total  465829118473 nodes

random position [FEN "B:W6,9,10,11,20,21,22,23,30,33,37,41,42,43,44,46,K31:BK17,K24."]
perft(12) total 2026297454724 nodes

Woldouby position [FEN "W:W25,27,28,30,32,33,34,35,37,38:B12,13,14,16,18,19,21,23,24,26."]
perft(18) total  106965577859 nodes
It would be nice if some people can confirm the perft counts and report their multithread speedups.

Jan-Jaap
Confirmed on a single core P4 @ 3GHz with two 16M entry hash tables (one for each color) and the 3 bitboards as hash signature (so guaranteed hash correctness).
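In code, such an entry looks roughly like this (an illustrative sketch, not my actual implementation; field names and the index mixing are made up). Because the lookup compares the complete bitboards, a hit can never be a false match:

Code: Select all

// Perft hash entry that stores the full position (three bitboards) as its
// own signature: a matching entry is guaranteed to be the same position,
// so no false hits are possible, unlike with a truncated Zobrist signature.
final class PerftEntry {
    long black;      // all black men and kings
    long white;      // all white men and kings
    long kings;      // kings of both colors
    int  depth;      // remaining depth the count was computed for
    long count;      // perft result for (position, depth)
}

final class PerftTable {
    private final PerftEntry[] entries;
    private final int mask;

    PerftTable(int size) {                    // size is a power of two, e.g. 16 * 1024 * 1024
        entries = new PerftEntry[size];
        for (int i = 0; i < size; i++) entries[i] = new PerftEntry();
        mask = size - 1;
    }

    // Any mixing of the bitboards will do for the index; correctness
    // comes from the full compare in probe(), not from the hash.
    private int indexOf(long black, long white, long kings) {
        long h = black * 0x9E3779B97F4A7C15L ^ white * 0xC2B2AE3D27D4EB4FL ^ kings;
        return (int) (h & mask);
    }

    Long probe(long black, long white, long kings, int depth) {
        PerftEntry e = entries[indexOf(black, white, kings)];
        if (e.black == black && e.white == white && e.kings == kings && e.depth == depth) {
            return e.count;                   // guaranteed to be the right position
        }
        return null;
    }

    void store(long black, long white, long kings, int depth, long count) {
        PerftEntry e = entries[indexOf(black, white, kings)];
        e.black = black; e.white = white; e.kings = kings;
        e.depth = depth; e.count = count;
    }
}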

Rein

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Parallel perft (with hash table)

Post by Rein Halbersma » Wed Sep 15, 2010 20:53

jj wrote:

Code: Select all

          start(12) random(10) Woldouby(16) avg speedup
1 thread     200.51     205.96        58.01  
2 threads    101.82      95.92        30.08         2.0
4 threads     63.70      67.60        19.32         3.1
6 threads     45.00      45.98        13.78         4.4
8 threads     39.27      40.73        11.83         5.0
I also ran a series of tests with a (shared) hash table of 4M positions. To keep things interesting, 2 more plies were added to the search depth.

Code: Select all

          start(14) random(12) Woldouby(18) avg speedup
1 thread     194.10     129.15       239.87
2 threads     98.79      75.10       145.90         1.8
4 threads     68.33      52.92        87.08         2.7
6 threads     50.09      41.45        78.01         3.4
8 threads     44.99      35.73        51.33         4.2
Because the hashed perft has less parallel speedup than the normal perft, it could be that your program is at the memory bandwidth limit and/or having caching problems. Could you run the following two experiments? Run 1,2,4 and 8 simultaneous instances of your serial perft code (one on each core of course) and see how the times-to-depth change. If the serial speeds drop as you increase the number of utilized cores, you are memory-bound. As a second experiment, try and run the multi-threaded code with 4M, 8M, 16M and 32M globally shared hash entries. Finally you could try and run with thread-local hash tables (depending on how much memory your system has).

Rein

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Parallel perft (with hash table)

Post by jj » Sun Sep 19, 2010 19:54

Hi Rein,
Rein Halbersma wrote:Confirmed on a single core P4 @ 3GHz with two 16M entry hash tables (one for each color) and the 3 bitboards as hash signature (so guaranteed hash correctness).
Thanks for confirming the counts. What speedup do you get for normal perft vs. perft-with-hash-table (for lesser depths)?
Rein Halbersma wrote:Because the hashed perft has less parallel speedup than the normal perft, it could be that your program is at the memory bandwidth limit and/or having caching problems. Could you run the following two experiments? Run 1,2,4 and 8 simultaneous instances of your serial perft code (one on each core of course) and see how the times-to-depth change. If the serial speeds drop as you increase the number of utilized cores, you are memory-bound. As a second experiment, try and run the multi-threaded code with 4M, 8M, 16M and 32M globally shared hash entries. Finally you could try and run with thread-local hash tables (depending on how much memory your system has).
I conducted the experiments you suggested. To make testing a bit easier, I 'defined' a new function, ptest (simply implemented as the elapsed system time for the complete run including console output):

Code: Select all

ptest( extra_depth, bulk?, threads, hash table settings ) =

      (SUM depth : 1 <= depth <= 11 + extra_depth : time(start   .perft(depth, bulk)) )
    + (SUM depth : 1 <= depth <=  9 + extra_depth : time(random  .perft(depth, bulk)) )
    + (SUM depth : 1 <= depth <= 15 + extra_depth : time(woldouby.perft(depth, bulk)) )
So ptest( +0, false/true, 1, no hash table ) is the same as 'classical' perft, except that all times are added yielding the total time. Hash table settings are: use hash table or not, use one shared or multiple thread-local hash tables (no synchronization issues), and finally the hash table size(s) in number of entries. Below I always use bulk = true.
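In Java terms the harness is nothing more than a timing loop around the existing perft calls; a sketch (the perft functions, with their bulk/thread/hash-table settings, are supplied by the engine and are just stand-ins here):

Code: Select all

import java.util.function.IntToLongFunction;

// Sketch of the ptest harness defined above: run perft from depth 1 up to
// (base depth + extra_depth) for each of the three standard positions and
// add up the elapsed wall-clock time.
public class PTestSketch {

    static double ptest(IntToLongFunction[] perfts, int[] baseDepths, int extraDepth) {
        long t0 = System.nanoTime();
        for (int i = 0; i < perfts.length; i++) {
            for (int depth = 1; depth <= baseDepths[i] + extraDepth; depth++) {
                long leaves = perfts[i].applyAsLong(depth);
                System.out.printf("perft(%2d) = %d%n", depth, leaves);
            }
        }
        return (System.nanoTime() - t0) / 1.0e9;               // total time in seconds
    }

    // Usage with the base depths from the definition above (start, random, Woldouby):
    //   double seconds = ptest(
    //       new IntToLongFunction[] { d -> start.perft(d), d -> random.perft(d), d -> woldouby.perft(d) },
    //       new int[] { 11, 9, 15 },
    //       0 /* extra_depth */);
}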

First I should mention that the i7-920 test system I use has 12 GB of RAM. I changed the clock frequency to 2.8 GHz because at 3.0 GHz the system becomes unstable. Also, as mentioned before on this forum, running times can vary considerably. Differences of 5% are common; differences of 10% or more are rare. In my first post I used single runs (measurements), giving less reliable results. For the tests in this post I repeated each run three times and used the average time.

In addition to perft counting leaves (the return value), I also count the number of nodes visited. This makes it possible to measure the speedup in time-to-depth and in Mnodes/sec separately. I calculate Mnodes/sec by averaging the Mnodes/sec of the last (deepest) perft for each of the three test positions.
When using no hash table, these speedups are almost equal. Using a hash table, however, shows less speedup in time-to-depth than in Mnodes/sec. Rephrased, the test from my first post (repeated) now gives:

Code: Select all

A. ptest( +1, true, *, no hash table ) :

threads   time  speedup  Mnodes/sec  Mn/s speedup
1       477.37                  8.7    
2       241.35     1.98        
4       143.06     3.34        
6       109.13     4.37        
8       100.03     4.77        41.9          4.82
                
B. ptest( +3, true, *, shared hash table 4M ) :

threads   time  speedup  Mnodes/sec  Mn/s speedup
1       626.69                  8.5    
2       332.71     1.88        
4       199.11     3.15        
6       157.45     3.98        
8       150.33     4.17        37.6          4.42
1) Running multiple instances of serial perft. In my case, this means running multiple instances of the Java virtual machine. The time-to-depth changes considerably:

Code: Select all

ptest( +0, true, 1, no hash table ) :

instances   time   x time  speedup | x cores = speedup work?
1          65.42                   |
2          67.73     1.04     0.97 |       2           1.93
4          77.93     1.19     0.84 |       4           3.36
6          88.26     1.35     0.74 |       6           4.45
8         105.84     1.62     0.62 |       8           4.94
I'm not sure whether this makes sense, but multiplying the relative speed by the number of cores may give an upper limit for parallel speedup on this machine. Or does Java use too much memory? I wonder if compiled C/C++ programs perform better when tested like this. By the way, using a hash table or not makes no difference. Running multiple instances of ptest( +2, true, 1, shared hash table 4M ) gives the same speedup (slowdown) results.

2) Bigger hash tables. As expected, the bigger the hash table the better:

Code: Select all

ptest( +3, true, 8, shared hash table * M ) :

hash entries   time  speedup
1 M          277.64
4 M          161.06     1.72
8 M          131.50     2.11
16 M         113.86     2.44
32 M         108.64     2.56
The total speedup of a 32 M hash table vs. no hash table for deep perft is about x 80. I don't know if this is normal. Run B repeated, with a 32 M hash table:

Code: Select all

ptest( +3, true, *, shared hash table 32 M ) :

threads   time  speedup  Mnodes/sec  speedup Mn/s
1       437.15                  8.3          1.00
2       237.69     1.84               
4       146.66     2.98               
6       111.77     3.91               
8       107.65     4.06        36.4          4.39
Like in a real search, I suppose, more nodes are searched but less efficiently. This explains why perft with hash table gives less parallel time-to-depth speedup than perft without hash table. The impact of using a (big) hash table is much greater than using multiple threads.

3) Shared hash table vs. thread local hash tables. As expected, local hash tables give a slightly better speed(up) for Mnodes/sec. But they also result in way longer times-to-depth.

I'm still interested to know if other people get similar parallel and hash table results. Though completely different from a real search, at least it's easy to reproduce and it gives some useful data. I hope some of you can find the time to comment on this and maybe repeat some (p)tests. I apologise for the long running times. I wanted the fastest runs to be reliable and then I started adding times and things got a bit out of hand ;-)

Jan-Jaap
www.maximusdraughts.org

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Parallel perft (with hash table)

Post by jj » Tue Sep 21, 2010 20:57

To learn a bit more about optimal parallel performance, I wrote a little C test program: crunch. It bears some resemblance to perft, but it needs no move generator and uses almost no memory. It takes two parameters, width and depth, and simply generates (and times) a 'tree' of width ^ depth function calls. Intended use is running parallel instances on a multi-core system to see what happens to the running time. In case somebody is interested to use it, here is the complete listing:

Code: Select all

#include <stdio.h>
#include <time.h>

/* Generate (and time) a 'tree' of width^depth recursive calls.
   The body is intentionally empty: an optimizing compiler may remove
   the recursion entirely, so compile without optimizations. */
void crunch(int width, int depth) {

    if (depth == 0) return;

    for (int i = 0; i < width; i++) {
        crunch(width, depth - 1);
    }
}

int main(int argc, char** argv) {

    if (argc < 3) {
        printf("usage: crunch width depth\n");
        return 1;
    }

    int depth, width;
    sscanf(argv[1], "%d", &width);
    sscanf(argv[2], "%d", &depth);
    printf("width = %d, depth = %d\n", width, depth);

    clock_t t0 = clock();

    crunch(width, depth);

    /* divide by CLOCKS_PER_SEC rather than a hard-coded 1000, for portability */
    printf("done in %.2f seconds\n", (double)(clock() - t0) / CLOCKS_PER_SEC);
    printf("Press [ENTER]");
    char c; scanf("%c", &c);

    return 0;
}
After testing the C program, I also made a Java version. The C version shows very poor performance, especially with hyperthreading. I suppose this is due to the Cygwin/gcc 32-bit compiler I used (no optimizations tried). Rather arbitrarily, I use crunch 100/150 5, which will keep a processor busy for some time. Here are the results on the i7-920:

Code: Select all

C (crunch.exe*32): crunch 100 5          Java (JCrunch.jar): crunch 150 5

HyperThreading OFF (4 real cores)
  
instances   time  x time  speedup  work  instances   time  x time  speedup  work
        1  39.35    1.00     1.00  1.00          1  61.68    1.00     1.00  1.00
        2  39.30    1.00     1.00  2.00          2  61.68    1.00     1.00  2.00
        3  39.41    1.00     1.00  3.00          3  61.77    1.00     1.00  3.00
        4  39.62    1.01     0.99  3.97          4  62.39    1.01     0.99  3.95

HyperThreading ON (8 virtual cores)

instances   time  x time  speedup  work  instances   time  x time  speedup  work
        1  40.11    1.00     1.00  1.00          1  61.92    1.00     1.00  1.00
        2  42.14    1.05     0.95  1.90          2  64.77    1.05     0.96  1.91
        4  52.13    1.30     0.77  3.08          4  77.21    1.25     0.80  3.21
        6  61.69    1.54     0.65  3.90          6  91.87    1.48     0.67  4.04
        8  80.38    2.00     0.50  3.99          8 113.37    1.83     0.55  4.37

HyperThreading gain: 10%
At least now I know where the slowdown factors come from: this seems normal hyperthreading behaviour. Without hyperthreading there is hardly any slowdown, with hyperthreading there is. I guess I could have known this before, looking at Bert's parallel panic figures. My figures more or less confirm those. Of course I don't have search and eval overhead yet.

I also repeated my previous tests with hyperthreading switched off.

Code: Select all

A. ptest( +1, true, *, no hash table ) :

threads    time  speedup  Mnodes/sec  speedup Mn/s
      1  470.28                 8.81  
      2  229.07     2.05       18.18          2.06
      3  161.83     2.91       25.74          2.92
      4  121.86     3.86       34.17          3.88

B. ptest( +3, true, *, shared hash table 32 M ) :

threads    time  speedup  Mnodes/sec  speedup Mn/s
      1  431.31                 8.41  
      2  224.89     1.92       16.56          1.97
      3  155.67     2.77       24.10          2.87
      4  123.85     3.48       30.90          3.67
This shows less thread slowdown than the hyperthreading ptest, especially in Mnodes/sec. It also shows no memory-bound problems (yet). For test A the ptest hyperthreading gain is 22%; for test B it is 15%.

So I guess my figures look ok, taking into account some overhead for split() etc. Also, Java looks competitive compared to C so far. Maybe Ed (and others) can try some of this (crunch) in C on 8 real cores?

Jan-Jaap
www.maximusdraughts.org

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Parallel perft (with hash table)

Post by Rein Halbersma » Tue Sep 21, 2010 20:59

jj wrote: I would appreciate some feedback on my hash table performance: a perft speedup of roughly x 50, with 20+ % hash table hits for positions 1 (start position) and 3 ('Woldouby'), and 90+ % for position 2 ('random').
Jan-Jaap
Hi Jan-Jaap,

Here are 3 perft runs on the 3 standard positions. The first run is perft with bulk-counting at depth==1. I use a specialized move_count() function that simply counts the number of set bits in the various bit patterns, rather than trying to generate all the moves into their respective structs. I should probably also explain that I calculate the nodes per second as the number of recursive calls to the perft routine, not as the number of leafs per second.
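The idea is simply that every quiet-move direction becomes a shift, a mask against the empty squares and a population count, so no move structs are built. Roughly (the shift offsets and edge masks depend entirely on the bitboard layout, and captures, which are mandatory, are handled separately):

Code: Select all

// Schematic bulk counting of quiet man moves: shift the men bitboard in
// each movement direction, keep only the destinations that are empty,
// and count the bits.  The shift offsets below are placeholders -- the
// real values (and edge masks) depend on the board mapping -- and a real
// move_count() first checks for captures, which take precedence.
public final class MoveCountSketch {

    static int countQuietManMoves(long men, long empty, int[] directionShifts) {
        int moves = 0;
        for (int shift : directionShifts) {
            long targets = (shift > 0 ? men << shift : men >>> -shift) & empty;
            moves += Long.bitCount(targets);     // one quiet move per destination bit
        }
        return moves;
    }
}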

Rein

PS the runs are hidden in spoilers so as not to annoy people with limited screen width (yes, I am an Android phone user)

This first run does not maintain a Zobrist hash key.
Spoiler:

Code: Select all

   b   b   b   b   b
 b   b   b   b   b
   b   b   b   b   b
 b   b   b   b   b
   .   .   .   .   .
 .   .   .   .   .
   w   w   w   w   w
 w   w   w   w   w
   w   w   w   w   w
 w   w   w   w   w

"W:B1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20:W31,32,33,34,35,36,37,38,
39,40,41,42,43,44,45,46,47,48,49,50"

Searching to nominal depth=11

perft[ 1/ 0.0] =            9 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.0] =           81 leafs,           10 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 0.0] =          658 leafs,           91 nodes,   0.00s,   0.09 Mnps
perft[ 4/ 0.0] =         4265 leafs,          749 nodes,   0.00s,   0.75 Mnps
perft[ 5/ 0.0] =        27117 leafs,         5014 nodes,   0.00s,   5.01 Mnps
perft[ 6/ 0.0] =       167140 leafs,        32131 nodes,   0.02s,   2.01 Mnps
perft[ 7/ 0.0] =      1049442 leafs,       199271 nodes,   0.03s,   6.23 Mnps
perft[ 8/ 0.0] =      6483961 leafs,      1248713 nodes,   0.22s,   5.68 Mnps
perft[ 9/ 0.0] =     41022423 leafs,      7732674 nodes,   1.38s,   5.62 Mnps
perft[10/ 0.0] =    258895763 leafs,     48755097 nodes,   8.85s,   5.51 Mnps
perft[11/ 0.0] =   1665861398 leafs,    307650860 nodes,  57.13s,   5.39 Mnps
   .   .   .   .   .
 w   .   .   w   w
   w   .   .   .   .
 .   B   .   .   w
   w   w   w   B   .
 .   .   .   .   w
   W   .   w   .   .
 .   w   .   .   .
   w   w   w   w   .
 w   .   .   .   .

"B:BK17,K24:W6,9,10,11,20,21,22,23,30,K31,33,37,41,42,43,44,46"

Searching to nominal depth=9

perft[ 1/ 0.0] =           14 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.0] =           55 leafs,           15 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 0.0] =         1168 leafs,           70 nodes,   0.00s,   0.07 Mnps
perft[ 4/ 0.0] =         5432 leafs,         1238 nodes,   0.00s,   1.24 Mnps
perft[ 5/ 0.0] =        87195 leafs,         6670 nodes,   0.00s,   6.67 Mnps
perft[ 6/ 0.0] =       629010 leafs,        93865 nodes,   0.03s,   2.84 Mnps
perft[ 7/ 0.0] =      9041010 leafs,       722875 nodes,   0.31s,   2.31 Mnps
perft[ 8/ 0.0] =     86724219 leafs,      9763885 nodes,   3.13s,   3.12 Mnps
perft[ 9/ 0.0] =   1216917193 leafs,     96488104 nodes,  41.81s,   2.31 Mnps
   .   .   .   .   .
 .   .   .   .   .
   .   b   b   b   .
 b   .   b   b   .
   b   .   b   b   w
 b   w   w   .   w
   .   w   w   w   w
 .   w   w   .   .
   .   .   .   .   .
 .   .   .   .   .

"W:B12,13,14,16,18,19,21,23,24,26:W25,27,28,30,32,33,34,35,37,38"

Searching to nominal depth=15

perft[ 1/ 0.0] =            6 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.0] =           12 leafs,            7 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 0.0] =           30 leafs,           19 nodes,   0.00s,   0.02 Mnps
perft[ 4/ 0.0] =           73 leafs,           49 nodes,   0.00s,   0.05 Mnps
perft[ 5/ 0.0] =          215 leafs,          122 nodes,   0.00s,   0.12 Mnps
perft[ 6/ 0.0] =          590 leafs,          337 nodes,   0.00s,   0.34 Mnps
perft[ 7/ 0.0] =         1944 leafs,          927 nodes,   0.00s,   0.93 Mnps
perft[ 8/ 0.0] =         6269 leafs,         2871 nodes,   0.00s,   2.87 Mnps
perft[ 9/ 0.0] =        22369 leafs,         9140 nodes,   0.00s,   9.14 Mnps
perft[10/ 0.0] =        88050 leafs,        31509 nodes,   0.02s,   1.97 Mnps
perft[11/ 0.0] =       377436 leafs,       119559 nodes,   0.03s,   3.74 Mnps
perft[12/ 0.0] =      1910989 leafs,       496995 nodes,   0.11s,   4.48 Mnps
perft[13/ 0.0] =      9872645 leafs,      2407984 nodes,   0.58s,   4.16 Mnps
perft[14/ 0.0] =     58360286 leafs,     12280629 nodes,   3.11s,   3.95 Mnps
perft[15/ 0.0] =    346184885 leafs,     70640915 nodes,  19.16s,   3.69 Mnps
The second run maintains a Zobrist hash key. Since I have a P4 with 32-bit Windows, I do not have an intrinsic BitScan CPU instruction. I use the De Bruijn table-lookup method (see the chess programming wiki). This is rather slow of course, and you can see that the perft times take a considerable hit.
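For reference, the table-lookup scan isolates the lowest set bit, multiplies by a De Bruijn constant and uses the top six bits of the product as an index into a small precomputed table. A 64-bit sketch of the idea in Java (the code discussed here is 32-bit C, so the details differ):

Code: Select all

// De Bruijn table-lookup bit scan (see the bitscan page on the chess
// programming wiki).  Multiplying the isolated lowest bit by the De
// Bruijn constant places a unique 6-bit pattern in the top bits of the
// product, which indexes a precomputed table of bit positions.
public final class BitScanSketch {

    private static final long DEBRUIJN64 = 0x03F79D71B4CB0A89L;
    private static final int[] INDEX = new int[64];

    static {
        // Bit i ends up in table slot (DEBRUIJN64 << i) >>> 58.
        for (int i = 0; i < 64; i++) {
            INDEX[(int) ((DEBRUIJN64 << i) >>> 58)] = i;
        }
    }

    // Index of the least significant set bit; bb must be non-zero.
    static int lsb(long bb) {
        return INDEX[(int) (((bb & -bb) * DEBRUIJN64) >>> 58)];
    }

    public static void main(String[] args) {
        System.out.println(lsb(0b1000L));     // prints 3
        System.out.println(lsb(1L << 63));    // prints 63
    }
}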
Spoiler:

Code: Select all

   b   b   b   b   b
 b   b   b   b   b
   b   b   b   b   b
 b   b   b   b   b
   .   .   .   .   .
 .   .   .   .   .
   w   w   w   w   w
 w   w   w   w   w
   w   w   w   w   w
 w   w   w   w   w

"W:B1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20:W31,32,33,34,35,36,37,38,
39,40,41,42,43,44,45,46,47,48,49,50"

Searching to nominal depth=11

perft[ 1/ 0.0] =            9 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.0] =           81 leafs,           10 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 0.0] =          658 leafs,           91 nodes,   0.00s,   0.09 Mnps
perft[ 4/ 0.0] =         4265 leafs,          749 nodes,   0.00s,   0.75 Mnps
perft[ 5/ 0.0] =        27117 leafs,         5014 nodes,   0.00s,   5.01 Mnps
perft[ 6/ 0.0] =       167140 leafs,        32131 nodes,   0.02s,   2.01 Mnps
perft[ 7/ 0.0] =      1049442 leafs,       199271 nodes,   0.05s,   4.15 Mnps
perft[ 8/ 0.0] =      6483961 leafs,      1248713 nodes,   0.30s,   4.19 Mnps
perft[ 9/ 0.0] =     41022423 leafs,      7732674 nodes,   1.94s,   3.99 Mnps
perft[10/ 0.0] =    258895763 leafs,     48755097 nodes,  12.27s,   3.97 Mnps
perft[11/ 0.0] =   1665861398 leafs,    307650860 nodes,  78.31s,   3.93 Mnps

   .   .   .   .   .
 w   .   .   w   w
   w   .   .   .   .
 .   B   .   .   w
   w   w   w   B   .
 .   .   .   .   w
   W   .   w   .   .
 .   w   .   .   .
   w   w   w   w   .
 w   .   .   .   .

"B:BK17,K24:W6,9,10,11,20,21,22,23,30,K31,33,37,41,42,43,44,46"

Searching to nominal depth=9

perft[ 1/ 0.0] =           14 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.0] =           55 leafs,           15 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 0.0] =         1168 leafs,           70 nodes,   0.00s,   0.07 Mnps
perft[ 4/ 0.0] =         5432 leafs,         1238 nodes,   0.00s,   1.24 Mnps
perft[ 5/ 0.0] =        87195 leafs,         6670 nodes,   0.00s,   6.67 Mnps
perft[ 6/ 0.0] =       629010 leafs,        93865 nodes,   0.05s,   1.96 Mnps
perft[ 7/ 0.0] =      9041010 leafs,       722875 nodes,   0.36s,   2.00 Mnps
perft[ 8/ 0.0] =     86724219 leafs,      9763885 nodes,   4.05s,   2.41 Mnps
perft[ 9/ 0.0] =   1216917193 leafs,     96488104 nodes,  51.70s,   1.87 Mnps

   .   .   .   .   .
 .   .   .   .   .
   .   b   b   b   .
 b   .   b   b   .
   b   .   b   b   w
 b   w   w   .   w
   .   w   w   w   w
 .   w   w   .   .
   .   .   .   .   .
 .   .   .   .   .

"W:B12,13,14,16,18,19,21,23,24,26:W25,27,28,30,32,33,34,35,37,38"

Searching to nominal depth=15

perft[ 1/ 0.0] =            6 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.0] =           12 leafs,            7 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 0.0] =           30 leafs,           19 nodes,   0.00s,   0.02 Mnps
perft[ 4/ 0.0] =           73 leafs,           49 nodes,   0.00s,   0.05 Mnps
perft[ 5/ 0.0] =          215 leafs,          122 nodes,   0.00s,   0.12 Mnps
perft[ 6/ 0.0] =          590 leafs,          337 nodes,   0.00s,   0.34 Mnps
perft[ 7/ 0.0] =         1944 leafs,          927 nodes,   0.00s,   0.93 Mnps
perft[ 8/ 0.0] =         6269 leafs,         2871 nodes,   0.00s,   2.87 Mnps
perft[ 9/ 0.0] =        22369 leafs,         9140 nodes,   0.00s,   9.14 Mnps
perft[10/ 0.0] =        88050 leafs,        31509 nodes,   0.02s,   1.97 Mnps
perft[11/ 0.0] =       377436 leafs,       119559 nodes,   0.03s,   3.74 Mnps
perft[12/ 0.0] =      1910989 leafs,       496995 nodes,   0.16s,   3.17 Mnps
perft[13/ 0.0] =      9872645 leafs,      2407984 nodes,   0.84s,   2.85 Mnps
perft[14/ 0.0] =     58360286 leafs,     12280629 nodes,   4.30s,   2.86 Mnps
perft[15/ 0.0] =    346184885 leafs,     70640915 nodes,  26.91s,   2.63 Mnps
The third run is with the dual hash table lookup with a total of 32M entries. Total node reduction is 96%, 95% and 81% for the 3 standard positions. Total time reduction is 90%, 95% and 62% respectively.
Spoiler:

Code: Select all

   b   b   b   b   b
 b   b   b   b   b
   b   b   b   b   b
 b   b   b   b   b
   .   .   .   .   .
 .   .   .   .   .
   w   w   w   w   w
 w   w   w   w   w
   w   w   w   w   w
 w   w   w   w   w

"W:B1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20:W31,32,33,34,35,36,37,38,
39,40,41,42,43,44,45,46,47,48,49,50"

Searching to nominal depth=11

perft[ 1/ 0.0] =            9 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.9] =           81 leafs,           10 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 1.9] =          658 leafs,           91 nodes,   0.00s,   0.09 Mnps
perft[ 4/ 2.9] =         4265 leafs,          749 nodes,   0.00s,   0.75 Mnps
perft[ 5/ 3.8] =        27117 leafs,         3524 nodes,   0.00s,   3.52 Mnps
perft[ 6/ 4.7] =       167140 leafs,        15630 nodes,   0.02s,   0.92 Mnps
perft[ 7/ 5.7] =      1049442 leafs,        60990 nodes,   0.02s,   3.81 Mnps
perft[ 8/ 6.7] =      6483961 leafs,       239244 nodes,   0.11s,   2.17 Mnps
perft[ 9/ 7.6] =     41022423 leafs,       875738 nodes,   0.41s,   2.15 Mnps
perft[10/ 8.6] =    258895763 leafs,      3283861 nodes,   1.50s,   2.19 Mnps
perft[11/ 9.6] =   1665861398 leafs,     11686263 nodes,   5.56s,   2.10 Mnps

   .   .   .   .   .
 w   .   .   w   w
   w   .   .   .   .
 .   B   .   .   w
   w   w   w   B   .
 .   .   .   .   w
   W   .   w   .   .
 .   w   .   .   .
   w   w   w   w   .
 w   .   .   .   .

"B:BK17,K24:W6,9,10,11,20,21,22,23,30,K31,33,37,41,42,43,44,46"

Searching to nominal depth=9

perft[ 1/ 0.0] =           14 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.9] =           55 leafs,           15 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 1.8] =         1168 leafs,           70 nodes,   0.00s,   0.07 Mnps
perft[ 4/ 2.9] =         5432 leafs,         1238 nodes,   0.00s,   1.24 Mnps
perft[ 5/ 3.7] =        87195 leafs,         4737 nodes,   0.00s,   4.74 Mnps
perft[ 6/ 4.9] =       629010 leafs,        44954 nodes,   0.02s,   2.64 Mnps
perft[ 7/ 5.6] =      9041010 leafs,       129113 nodes,   0.06s,   2.05 Mnps
perft[ 8/ 6.7] =     86724219 leafs,       823612 nodes,   0.33s,   2.50 Mnps
perft[ 9/ 7.1] =   1216917193 leafs,      4856167 nodes,   1.78s,   2.73 Mnps

   .   .   .   .   .
 .   .   .   .   .
   .   b   b   b   .
 b   .   b   b   .
   b   .   b   b   w
 b   w   w   .   w
   .   w   w   w   w
 .   w   w   .   .
   .   .   .   .   .
 .   .   .   .   .

"W:B12,13,14,16,18,19,21,23,24,26:W25,27,28,30,32,33,34,35,37,38"

Searching to nominal depth=15

perft[ 1/ 0.0] =            6 leafs,            1 nodes,   0.00s,   0.00 Mnps
perft[ 2/ 0.9] =           12 leafs,            7 nodes,   0.00s,   0.01 Mnps
perft[ 3/ 1.6] =           30 leafs,           19 nodes,   0.00s,   0.02 Mnps
perft[ 4/ 2.4] =           73 leafs,           49 nodes,   0.00s,   0.05 Mnps
perft[ 5/ 3.4] =          215 leafs,          122 nodes,   0.00s,   0.12 Mnps
perft[ 6/ 4.4] =          590 leafs,          323 nodes,   0.00s,   0.32 Mnps
perft[ 7/ 5.4] =         1944 leafs,          849 nodes,   0.00s,   0.85 Mnps
perft[ 8/ 6.4] =         6269 leafs,         2417 nodes,   0.00s,   2.42 Mnps
perft[ 9/ 7.5] =        22369 leafs,         7272 nodes,   0.00s,   7.27 Mnps
perft[10/ 8.5] =        88050 leafs,        23053 nodes,   0.00s,  23.05 Mnps
perft[11/ 9.6] =       377436 leafs,        78388 nodes,   0.03s,   2.45 Mnps
perft[12/10.6] =      1910989 leafs,       273261 nodes,   0.14s,   1.94 Mnps
perft[13/11.6] =      9872645 leafs,       996580 nodes,   0.55s,   1.82 Mnps
perft[14/12.6] =     58360286 leafs,      3564962 nodes,   1.97s,   1.81 Mnps
perft[15/13.6] =    346184885 leafs,     13086344 nodes,   7.34s,   1.78 Mnps

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Parallel perft (with hash table)

Post by jj » Wed Sep 22, 2010 14:45

Rein Halbersma wrote:Hi Jan-Jaap,

Here are 3 perft runs on the 3 standard positions. The first run is perft with bulk-counting at depth==1. I use a specialized move_count() function that simply counts the number of set bits in the various bit patterns, rather than trying to generate all the moves into their respective structs. I should probably also explain that I calculate the nodes per second as the number of recursive calls to the perft routine, not as the number of leafs per second.

Rein
Hi Rein,

Thank you for your reply. I use the same method to calculate the nodes per second, and I get the same node counts. I don't use something like move_count(); I just generate all the moves.
Rein Halbersma wrote: This first run does not maintain a Zobrist hash key.

The third run is with the dual hash table lookup with a total of 32M entries. Total node reduction is 96%, 95% and 81% for the 3 standard positions. Total time reduction is 90%, 95% and 62% respectively.
I always maintain the Zobrist hash key, even when not using the hash table. My results for standard perft (nominal depth) without hash table vs. with a hash table of 32M entries are: total node reduction is 94%, 94% and 76% for the 3 standard positions. Total time reduction is 92%, 93% and 74% respectively.

So, similar to your hash table results.

Jan-Jaap
www.maximusdraughts.org

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Parallel perft (with hash table)

Post by jj » Thu Sep 23, 2010 22:31

Forget crunch. The chess (and CPU) community has Fritz Chess Benchmark: http://www.jens.tauchclub-krems.at/dive ... marks.html

A very nice tool to benchmark multicore systems (and hyperthreading), and very easy to use. You can select the number of threads to use, and after one minute it gives the Knodes/sec and a number (in my case 22.23) indicating the "power" of your machine as compared to a Pentium 3 single core @ 1.0 GHz. The output can be used to check your parallel algorithm performance.

Code: Select all

Fritz Chess Benchmark  hyperthreading OFF  hyperthreading ON

              threads      Kn/s  speedup      Kn/s  speedup
                    1      2184     1.00      2170     1.00
                    2      4380     2.01      4147     1.91
                    3      6564     3.01      5836     2.69
                    4      8536     3.91      7344     3.38
                    5                         8624     3.97
                    6                         9578     4.41
                    7                        10214     4.71
                    8                        10646     4.91
Jan-Jaap
www.maximusdraughts.org

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Parallel perft (with hash table)

Post by jj » Fri Oct 01, 2010 14:33

Some more perft counts for the three standard positions, calculated with parallel-perft-with-hash-table.
Spoiler:

Code: Select all

----------------------------------------------------------------
depth       game start                random          Woldouby
----------------------------------------------------------------
   10                            13106503411 c 
   11                           183848651744 c 
   12      10749771911 c       2026297454724 c 
   13      70612863532 c      28502670196300  
   14     465829118473 c     313784387653341  
   15    3120929044364      4430269571799228  
   16   21043709381506     48503253678553099        2272406115 c
   17  143945499227130    687472692301956095       14962263728 c
   18                    7487262800279254994      106965577859 c
   19                                             767973855727
   20                                            5911822663724
   21                                           45548306142726
----------------------------------------------------------------
c = confirmed
Jan-Jaap
www.maximusdraughts.org
