Perft(1) N = 9 0.00 sec. KN/sec = 0
Perft(2) N = 81 0.00 sec. KN/sec = 0
Perft(3) N = 658 0.00 sec. KN/sec = 0
Perft(4) N = 4265 0.00 sec. KN/sec = 0
Perft(5) N = 27117 0.00 sec. KN/sec = 0
Perft(6) N = 167140 0.00 sec. KN/sec = 83570
Perft(7) N = 1049442 0.01 sec. KN/sec = 131180
Perft(8) N = 6483961 0.05 sec. KN/sec = 144088
Perft(9) N = 41022423 0.25 sec. KN/sec = 165413
Perft(10) N = 258895763 1.55 sec. KN/sec = 166814
Perft(11) N = 1665861398 9.74 sec. KN/sec = 170997
Perft(1) N = 6 0.00 sec. KN/sec = 0
Perft(2) N = 12 0.00 sec. KN/sec = 0
Perft(3) N = 30 0.00 sec. KN/sec = 0
Perft(4) N = 73 0.00 sec. KN/sec = 0
Perft(5) N = 215 0.00 sec. KN/sec = 0
Perft(6) N = 590 0.00 sec. KN/sec = 0
Perft(7) N = 1944 0.00 sec. KN/sec = 0
Perft(8) N = 6269 0.00 sec. KN/sec = 0
Perft(9) N = 22369 0.00 sec. KN/sec = 22369
Perft(10) N = 88050 0.00 sec. KN/sec = 88050
Perft(11) N = 377436 0.01 sec. KN/sec = 53919
Perft(12) N = 1910989 0.03 sec. KN/sec = 70777
Perft(13) N = 9872645 0.13 sec. KN/sec = 75363
Perft(14) N = 58360286 0.55 sec. KN/sec = 106496
Perft(15) N = 346184885 3.22 sec. KN/sec = 107611
Perft(1) N = 14 0.00 sec. KN/sec = 0
Perft(2) N = 55 0.00 sec. KN/sec = 0
Perft(3) N = 1168 0.00 sec. KN/sec = 0
Perft(4) N = 5432 0.00 sec. KN/sec = 0
Perft(5) N = 87195 0.00 sec. KN/sec = 87195
Perft(6) N = 629010 0.00 sec. KN/sec = 125802
Perft(7) N = 9041010 0.07 sec. KN/sec = 132956
Perft(8) N = 86724219 0.48 sec. KN/sec = 179182
Perft(9) N = 1216917193 6.52 sec. KN/sec = 186615
Your perft() runs very fast. On what kind of CPU and at what clock frequency did you measure this? Which compiler did you use?
It was not my goal to make the fastest perft() per se, but to make a move-generator that performs a few times better than the one in my old mailbox program.
The perft() in my old program runs at ~33.5 Mnps at the starting position (on the same hardware), which is about 4 times slower. With fewer pieces on the board the difference seems to get smaller, though.
The evaluation function and probing the hash table are far more time-consuming; a speed difference of a few percent in move generation and make-move won't be noticeable in the complete program at all.
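For readers who have not met the term: perft() simply walks the move tree to a fixed depth and counts the leaf nodes, so it exercises move generation and make/unmake and nothing else. Below is a minimal sketch of that recursion. The `ToyPosition` type and its `generate_moves`, `make_move` and `unmake_move` members are hypothetical stand-ins (a trivial take-1-or-2-stones game), not the interface of any engine discussed here; a real draughts engine would plug in its own board type.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical toy position, used only so the sketch is self-contained:
// a pile of stones from which a move removes 1 or 2 stones.
struct ToyPosition {
    int stones;

    // Generate all legal moves (each move is the number of stones taken).
    std::vector<int> generate_moves() const {
        std::vector<int> moves;
        for (int take = 1; take <= 2 && take <= stones; ++take)
            moves.push_back(take);
        return moves;
    }
    void make_move(int take)   { stones -= take; }
    void unmake_move(int take) { stones += take; }
};

// Count the leaf nodes reachable in exactly `depth` plies.
template <typename Position>
std::uint64_t perft(Position& pos, int depth) {
    if (depth == 0)
        return 1;  // a leaf: count this position once
    std::uint64_t nodes = 0;
    for (const auto& move : pos.generate_moves()) {
        pos.make_move(move);
        nodes += perft(pos, depth - 1);
        pos.unmake_move(move);  // restore the position for the next sibling
    }
    return nodes;
}
```

With `ToyPosition p{5};`, `perft(p, 1)` counts the 2 legal first moves and `perft(p, 3)` counts 7 leaves; `p` is left unchanged afterwards because every move is unmade.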
I have an 8-core Intel i7-5960X, but for Perft I only use 1 core.
The processor is water-cooled and overclocked, and runs at 4 GHz.
I use Microsoft Visual Studio 2015, which nowadays is available as a free download.
BertTuyt wrote:
I have an 8-core Intel i7-5960X, but for Perft I only use 1 core.
The processor is water-cooled and overclocked, and runs at 4 GHz.
I use Microsoft Visual Studio 2015, which nowadays is available as a free download.
Bert
This is exactly the same setup as I have over here: an i7-5960X. Normally I run it at 3.6 GHz; for tournaments I overclock it to 4.0 or 4.2 GHz.
I use Visual Studio 2015 (update 3). For final builds I use the Intel C++ (v16) compiler which gives a small boost compared to MSVC.
Running my computer at 4.0 GHz will probably add about 11% to my nps figures. When I have some time later today I will check my latest perft() at 4.0 GHz to see what it does.
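The 11% estimate is just the ratio of the two clock speeds. A small sketch of that arithmetic follows; it assumes nps scales linearly with core clock, which is only an approximation since memory speed does not change with the overclock. The function names are illustrative.

```cpp
// Expected gain in percent when moving from one core clock to another,
// assuming (as a rough approximation) that nps scales linearly with clock.
double clock_gain_percent(double from_ghz, double to_ghz) {
    return (to_ghz / from_ghz - 1.0) * 100.0;
}

// Scale a measured nps figure to a different core clock under the same assumption.
double scale_nps(double nps, double from_ghz, double to_ghz) {
    return nps * (to_ghz / from_ghz);
}
```

`clock_gain_percent(3.6, 4.0)` comes out at a little over 11, and `scale_nps` can be used to put figures measured at 3.6 GHz next to figures measured at 4.0 GHz.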
Joost
Last edited by Joost Buijs on Sat Jul 23, 2016 19:44, edited 1 time in total.
m m m m m
m m m m m
m m m m m
m m m m m
- - - - -
- - - - -
M M M M M
M M M M M
M M M M M
M M M M M
perft( 1) nodes 9 time 0.0000000 nps 0
perft( 2) nodes 81 time 0.0000003 nps 237717342
perft( 3) nodes 658 time 0.0000034 nps 193108656
perft( 4) nodes 4265 time 0.0000228 nps 186818586
perft( 5) nodes 27117 time 0.0001523 nps 178036876
perft( 6) nodes 167140 time 0.0009875 nps 169261375
perft( 7) nodes 1049442 time 0.0062778 nps 167166929
perft( 8) nodes 6483961 time 0.0390295 nps 166129855
perft( 9) nodes 41022423 time 0.2463491 nps 166521483
perft(10) nodes 258895763 time 1.5084528 nps 171630011
perft(11) nodes 1665861398 time 9.8068545 nps 169867046
- - - - -
M - - M M
M - - - -
- k - - M
M M M k -
- - - - M
K - M - -
- M - - -
M M M M -
M - - - -
perft( 1) nodes 14 time 0.0000221 nps 632107
perft( 2) nodes 55 time 0.0000133 nps 4138795
perft( 3) nodes 1168 time 0.0000187 nps 62324098
perft( 4) nodes 5432 time 0.0000515 nps 105574409
perft( 5) nodes 87195 time 0.0003997 nps 218157133
perft( 6) nodes 629010 time 0.0032401 nps 194132635
perft( 7) nodes 9041010 time 0.0384673 nps 235031343
perft( 8) nodes 86724219 time 0.3928527 nps 220755060
perft( 9) nodes 1216917193 time 5.1363549 nps 236922333
- - - - -
- - - - -
- m m m -
m - m m -
m - m m M
m M M - M
- M M M M
- M M - -
- - - - -
- - - - -
perft( 1) nodes 6 time 0.0000003 nps 17608692
perft( 2) nodes 12 time 0.0000010 nps 11739128
perft( 3) nodes 30 time 0.0000010 nps 29347820
perft( 4) nodes 73 time 0.0000024 nps 30605584
perft( 5) nodes 215 time 0.0000061 nps 35054341
perft( 6) nodes 590 time 0.0000157 nps 37641769
perft( 7) nodes 1944 time 0.0000412 nps 47150547
perft( 8) nodes 6269 time 0.0001216 nps 51535430
perft( 9) nodes 22369 time 0.0003830 nps 58405817
perft(10) nodes 88050 time 0.0012628 nps 69726809
perft(11) nodes 377436 time 0.0047445 nps 79552742
perft(12) nodes 1910989 time 0.0198608 nps 96219331
perft(13) nodes 9872645 time 0.0974842 nps 101274265
perft(14) nodes 58360286 time 0.5050651 nps 115550024
perft(15) nodes 346184885 time 2.9987416 nps 115443385
Maybe I can get a few more percent out of it by tweaking but I don't think this is relevant.
Perft(2) seems to run faster than Perft(1). I could not find a bug in my code, so for now I assume this is due to cache effects.
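Besides cache effects, timer granularity may be worth ruling out: the shallowest depths finish in well under a microsecond, and the reported times (0.0000003, 0.0000007, 0.0000010) look like one, two and three steps of the same clock tick, so a run that completes within a single tick can show up as time 0 and nps 0. The sketch below is one way to check this; it uses `std::chrono::steady_clock`, which is an assumption on my side and not necessarily the timer used for the output above.

```cpp
#include <chrono>
#include <cstdint>

// Nodes per second; reports 0 when the elapsed time was below the timer
// resolution, because the quotient is meaningless in that case.
double nodes_per_second(std::uint64_t nodes, double seconds) {
    return seconds > 0.0 ? static_cast<double>(nodes) / seconds : 0.0;
}

// Measure the smallest step the clock is observed to make, in seconds.
// Any perft time of the same order as this value should not be trusted.
double observed_tick_seconds() {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    auto next = clock::now();
    while (next == start)  // spin until the clock visibly advances
        next = clock::now();
    return std::chrono::duration<double>(next - start).count();
}
```

If `observed_tick_seconds()` turns out to be around a third of a microsecond, the nps figures for the first few plies are dominated by rounding, and repeating the shallow perft in a loop and dividing the total time gives a more meaningful number.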
Positions with kings seem to do particularly well. This is probably because I have to scan less, thanks to the magics I use for generating king moves.
Since the run times seemed a little short to me for optimizing with PGO, I added some extra depth: 1 ply for the first two positions and 2 plies for the last position.
m m m m m
m m m m m
m m m m m
m m m m m
- - - - -
- - - - -
M M M M M
M M M M M
M M M M M
M M M M M
perft( 1) nodes 9 time 0.0000003 nps 26413002
perft( 2) nodes 81 time 0.0000003 nps 237717018
perft( 3) nodes 658 time 0.0000027 nps 241385491
perft( 4) nodes 4265 time 0.0000211 nps 201884325
perft( 5) nodes 27117 time 0.0001448 nps 187252647
perft( 6) nodes 167140 time 0.0009098 nps 183714904
perft( 7) nodes 1049442 time 0.0058877 nps 178244070
perft( 8) nodes 6483961 time 0.0363438 nps 178406222
perft( 9) nodes 41022423 time 0.2282231 nps 179747060
perft(10) nodes 258895763 time 1.4253068 nps 181642132
perft(11) nodes 1665861398 time 9.1216300 nps 182627601
perft(12) nodes 10749771911 time 57.2861119 nps 187650576
- - - - -
M - - M M
M - - - -
- k - - M
M M M k -
- - - - M
K - M - -
- M - - -
M M M M -
M - - - -
perft( 1) nodes 14 time 0.0000215 nps 652173
perft( 2) nodes 55 time 0.0000181 nps 3045524
perft( 3) nodes 1168 time 0.0000215 nps 54409852
perft( 4) nodes 5432 time 0.0000497 nps 109189823
perft( 5) nodes 87195 time 0.0003997 nps 218156835
perft( 6) nodes 629010 time 0.0031069 nps 202457196
perft( 7) nodes 9041010 time 0.0370157 nps 244247671
perft( 8) nodes 86724219 time 0.3636623 nps 238474619
perft( 9) nodes 1216917193 time 4.9510586 nps 245789291
perft(10) nodes 13106503411 time 52.2715701 nps 250738659
- - - - -
- - - - -
- m m m -
m - m m -
m - m m M
m M M - M
- M M M M
- M M - -
- - - - -
- - - - -
perft( 1) nodes 6 time 0.0000003 nps 17608668
perft( 2) nodes 12 time 0.0000007 nps 17608668
perft( 3) nodes 30 time 0.0000010 nps 29347780
perft( 4) nodes 73 time 0.0000027 nps 26779849
perft( 5) nodes 215 time 0.0000061 nps 35054293
perft( 6) nodes 590 time 0.0000164 nps 36073313
perft( 7) nodes 1944 time 0.0000412 nps 47150483
perft( 8) nodes 6269 time 0.0001135 nps 55249619
perft( 9) nodes 22369 time 0.0003568 nps 62701097
perft(10) nodes 88050 time 0.0011824 nps 74468935
perft(11) nodes 377436 time 0.0043697 nps 86376393
perft(12) nodes 1910989 time 0.0181785 nps 105123308
perft(13) nodes 9872645 time 0.0888282 nps 111143159
perft(14) nodes 58360286 time 0.4710588 nps 123891722
perft(15) nodes 346184885 time 3.0538893 nps 113358690
perft(16) nodes 2272406115 time 17.3304478 nps 131122181
perft(17) nodes 14962263728 time 113.5997428 nps 131710366
I used the PGO from the Intel compiler; I guess the PGO from MSVC works in about the same way.
First you instrument your program for optimization, then you let the instrumented program run for some time so that it can collect branch statistics etc., and after that you run the optimization pass.
I have no idea whether the PGO of MSVC is as good as the one from Intel; I never tried.
It also depends on the program you are optimizing: sometimes PGO does almost nothing, and in other cases it makes a difference of 10 to 15%.
Out of curiosity I also tried MSVC PGO. It is not as efficient as the one from Intel, but it still does something.
I compared against a build with the standard optimization settings: Maximize speed (/O2), Intrinsics Yes (/Oi), Favor fast code (/Ot) and Omit frame pointers (/Oy).
With MSVC PGO:
position 1 nps +6.5%
position 2 nps +0.0% (approx. equal)
position 3 nps +1.7%
This is only one run; for more accurate statistics it should be repeated several times.
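A sketch of how such a comparison could be made less sensitive to a single noisy run: collect the nps of several runs per build, take the median of each set, and express the difference as a percentage. The helper names are illustrative, and the choice of the median over the mean is my own assumption, made because it is less affected by one run that was disturbed by background activity.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Relative gain of an optimized build over a baseline build, in percent.
double gain_percent(double baseline_nps, double optimized_nps) {
    return (optimized_nps / baseline_nps - 1.0) * 100.0;
}

// Median of a set of measurements (taken by value so the caller's data is
// left untouched). Returns 0 for an empty set.
double median(std::vector<double> samples) {
    if (samples.empty())
        return 0.0;
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1
               ? samples[mid]
               : 0.5 * (samples[mid - 1] + samples[mid]);
}
```

As a check on the arithmetic, feeding in the two perft(11) figures for the starting position posted above, `gain_percent(169867046.0, 182627601.0)`, gives roughly 7.5.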
Your results seem to be quite comparable to mine.
Don't forget that I'm using the Intel compiler, which optimizes somewhat better than MSVC.
Anyway, the differences are so small that they are not relevant for game play at all.
I'm off now to retrograde analysis, which is full of pitfalls when you have never done that before, but it is a nice exercise.