Perft

BertTuyt · Post by **BertTuyt** » Mon Aug 15, 2016 09:53

Joost, thanks.
I would also like to see an as-is comparison.
So just run the executable on your (faster) computer, then I also better know , and can quantify what the effect of the Intel compiler is.

Bert

Rein Halbersma · Post by **Rein Halbersma** » Mon Aug 15, 2016 14:19

Joost Buijs wrote:
Rein Halbersma wrote:
Joost Buijs wrote: To determine king captures I basically do the following: ((magic_mask_for_current_direction & opposite_pieces & ~edge_squares) >> current_direction) & empty_squares;
This gives you the first empty square after the opposite piece you are allowed to capture.
Why do you need the edge_squares masking? If you have the "ghost" squares at each edge, then the opposite_pieces & (empty_squares << current_direction) would be the cheapest way to isolate all opposite pieces with an empty piece behind them.
To be honest, I never looked at it in very great detail, I use ghost squares at 10,21,32, and 43 and I assumed that when there sits an enemy piece on the edge this test can fail that's why I mask out the edge-squares.
The funny thing is that the test fails indeed when I remove that mask and I still believe it is needed, but I will take another look at it.
My empty-square mask is the inverse of the occupied square-mask and it is possible that I should mark the ghost bits as being not empty, that will take time as well.
It is not so important anyway, with or without that edge-mask there won't be much difference in speed.

I suspect that your empty_squares contains bits on the ghost squares. What I have are two masks called "squares" and "ghosts". They satisfy the following

bitcount(squares) == 50
bitcount(ghosts) == 14
squares ^ ghosts == ~(0ULL) (all 64 bits set to 1)
squares == ~ghosts

If you have your board class with bitboards for black, white and kings, then occupied == black ^ white. However, if you now define empty_squares == ~occupied, you have introduced 14 bits on the ghosts squares for empty_squares. This means that in patterns as opponent_pieces & (empty_squares << dir) you can also find that jumps over pieces located on the board edge are possible. Hence, you need edge mask detection. Is this what is the case in your code?

To avoid this, the only time that I have an explicit use of the ghosts squares is when I compute empty_squares

empty_squares = squares ^ occupied (equivalent to ~ghosts ^ occupied)

I think that if you compute your empty_squares in the above way, then you don't need any edge_mask when computing move or jump patterns.

Joost Buijs · Post by **Joost Buijs** » Mon Aug 15, 2016 14:52

Rein Halbersma wrote:
Joost Buijs wrote:
Rein Halbersma wrote:
Why do you need the edge_squares masking? If you have the "ghost" squares at each edge, then the opposite_pieces & (empty_squares << current_direction) would be the cheapest way to isolate all opposite pieces with an empty piece behind them.
To be honest, I never looked at it in very great detail, I use ghost squares at 10,21,32, and 43 and I assumed that when there sits an enemy piece on the edge this test can fail that's why I mask out the edge-squares.
The funny thing is that the test fails indeed when I remove that mask and I still believe it is needed, but I will take another look at it.
My empty-square mask is the inverse of the occupied square-mask and it is possible that I should mark the ghost bits as being not empty, that will take time as well.
It is not so important anyway, with or without that edge-mask there won't be much difference in speed.
I suspect that your empty_squares contains bits on the ghost squares. What I have are two masks called "squares" and "ghosts". They satisfy the following

bitcount(squares) == 50
bitcount(ghosts) == 14
squares ^ ghosts == ~(0ULL) (all 64 bits set to 1)
squares == ~ghosts

If you have your board class with bitboards for black, white and kings, then occupied == black ^ white. However, if you now define empty_squares == ~occupied, you have introduced 14 bits on the ghosts squares for empty_squares. This means that in patterns as opponent_pieces & (empty_squares << dir) you can also find that jumps over pieces located on the board edge are possible. Hence, you need edge mask detection. Is this what is the case in your code?

To avoid this, the only time that I have an explicit use of the ghosts squares is when I compute empty_squares

empty_squares = squares ^ occupied (equivalent to ~ghosts ^ occupied)

I think that if you compute your empty_squares in the above way, then you don't need any edge_mask when computing move or jump patterns.

Yes, this is indeed what is happening. This morning I already figured it out.
It is very easy to change in my code, but I expect that it doesn't have much influence on the performance (if any).
This evening I will modify it and let you know the results.

Joost

Joost Buijs · Post by **Joost Buijs** » Mon Aug 15, 2016 15:34

BertTuyt wrote:Joost, thanks.
I would also like to see an as-is comparison.
So just run the executable on your (faster) computer, then I also better know , and can quantify what the effect of the Intel compiler is.
Bert

Bert,

These are the results for the MSVC compiler @ 3600MHz. I took the liberty to add 1 or 2 ply to the depth.
At first glance your results are a few percent better than mine with the same compiler.
I will post the results for the Intel compiler somewhat later.

Code: Select all

Perft(1)        N = 9      0.00 sec.    KN/sec = 0
Perft(2)        N = 81     0.00 sec.    KN/sec = 0
Perft(3)        N = 658    0.00 sec.    KN/sec = 0
Perft(4)        N = 4265           0.00 sec.    KN/sec = 0
Perft(5)        N = 27117          0.00 sec.    KN/sec = 0
Perft(6)        N = 167140         0.00 sec.    KN/sec = 0
Perft(7)        N = 1049442        0.00 sec.    KN/sec = 0
Perft(8)        N = 6483961        0.04 sec.    KN/sec = 162099
Perft(9)        N = 41022423       0.29 sec.    KN/sec = 141456
Perft(10)       N = 258895763      1.81 sec.    KN/sec = 143036
Perft(11)       N = 1665861398    11.56 sec.    KN/sec = 144105
Perft(12)       N = 10749771911   75.53 sec.    KN/sec = 142324


Perft(1)        N = 14     0.00 sec.    KN/sec = 0
Perft(2)        N = 55     0.00 sec.    KN/sec = 0
Perft(3)        N = 1168           0.00 sec.    KN/sec = 0
Perft(4)        N = 5432           0.00 sec.    KN/sec = 0
Perft(5)        N = 87195          0.00 sec.    KN/sec = 0
Perft(6)        N = 629010         0.00 sec.    KN/sec = 0
Perft(7)        N = 9041010        0.05 sec.    KN/sec = 180820
Perft(8)        N = 86724219       0.45 sec.    KN/sec = 192720
Perft(9)        N = 1216917193     6.56 sec.    KN/sec = 185505
Perft(10)       N = 13106503411   65.77 sec.    KN/sec = 199277


Perft(1)        N = 6      0.00 sec.    KN/sec = 0
Perft(2)        N = 12     0.00 sec.    KN/sec = 0
Perft(3)        N = 30     0.00 sec.    KN/sec = 0
Perft(4)        N = 73     0.00 sec.    KN/sec = 0
Perft(5)        N = 215    0.00 sec.    KN/sec = 0
Perft(6)        N = 590    0.00 sec.    KN/sec = 0
Perft(7)        N = 1944           0.00 sec.    KN/sec = 0
Perft(8)        N = 6269           0.00 sec.    KN/sec = 0
Perft(9)        N = 22369          0.00 sec.    KN/sec = 0
Perft(10)       N = 88050          0.00 sec.    KN/sec = 0
Perft(11)       N = 377436         0.00 sec.    KN/sec = 0
Perft(12)       N = 1910989        0.02 sec.    KN/sec = 95549
Perft(13)       N = 9872645        0.11 sec.    KN/sec = 89751
Perft(14)       N = 58360286       0.60 sec.    KN/sec = 97267
Perft(15)       N = 346184885      3.52 sec.    KN/sec = 98347
Perft(16)       N = 2272406115    21.93 sec.    KN/sec = 103620
Perft(17)       N = 14962263728  144.16 sec.    KN/sec = 103789

Joost Buijs · Post by **Joost Buijs** » Mon Aug 15, 2016 15:45

These are the results for the Intel compiler without PGO.

Code: Select all

Perft(1)        N = 9      0.00 sec.    KN/sec = 0
Perft(2)        N = 81     0.00 sec.    KN/sec = 0
Perft(3)        N = 658    0.00 sec.    KN/sec = 0
Perft(4)        N = 4265           0.00 sec.    KN/sec = 0
Perft(5)        N = 27117          0.00 sec.    KN/sec = 0
Perft(6)        N = 167140         0.00 sec.    KN/sec = 0
Perft(7)        N = 1049442        0.00 sec.    KN/sec = 0
Perft(8)        N = 6483961        0.04 sec.    KN/sec = 162099
Perft(9)        N = 41022423       0.27 sec.    KN/sec = 151934
Perft(10)       N = 258895763      1.67 sec.    KN/sec = 155027
Perft(11)       N = 1665861398    10.78 sec.    KN/sec = 154532
Perft(12)       N = 10749771911   67.70 sec.    KN/sec = 158785


Perft(1)        N = 14     0.00 sec.    KN/sec = 0
Perft(2)        N = 55     0.00 sec.    KN/sec = 0
Perft(3)        N = 1168           0.00 sec.    KN/sec = 0
Perft(4)        N = 5432           0.00 sec.    KN/sec = 0
Perft(5)        N = 87195          0.00 sec.    KN/sec = 0
Perft(6)        N = 629010         0.00 sec.    KN/sec = 0
Perft(7)        N = 9041010        0.05 sec.    KN/sec = 180820
Perft(8)        N = 86724219       0.43 sec.    KN/sec = 201684
Perft(9)        N = 1216917193     6.47 sec.    KN/sec = 188086
Perft(10)       N = 13106503411   62.73 sec.    KN/sec = 208935


Perft(1)        N = 6      0.00 sec.    KN/sec = 0
Perft(2)        N = 12     0.00 sec.    KN/sec = 0
Perft(3)        N = 30     0.00 sec.    KN/sec = 0
Perft(4)        N = 73     0.00 sec.    KN/sec = 0
Perft(5)        N = 215    0.00 sec.    KN/sec = 0
Perft(6)        N = 590    0.00 sec.    KN/sec = 0
Perft(7)        N = 1944           0.00 sec.    KN/sec = 0
Perft(8)        N = 6269           0.00 sec.    KN/sec = 0
Perft(9)        N = 22369          0.00 sec.    KN/sec = 0
Perft(10)       N = 88050          0.00 sec.    KN/sec = 0
Perft(11)       N = 377436         0.00 sec.    KN/sec = 0
Perft(12)       N = 1910989        0.02 sec.    KN/sec = 95549
Perft(13)       N = 9872645        0.10 sec.    KN/sec = 98726
Perft(14)       N = 58360286       0.57 sec.    KN/sec = 102386
Perft(15)       N = 346184885      3.41 sec.    KN/sec = 101520
Perft(16)       N = 2272406115    21.36 sec.    KN/sec = 106386
Perft(17)       N = 14962263728  141.06 sec.    KN/sec = 106070

Joost Buijs · Post by **Joost Buijs** » Mon Aug 15, 2016 16:09

I will post the Intel PGO results somewhat later, installing VS2015 update 3 broke something in the Intel VS2015 integration.
I tried with /QProf-gen and /QProf-use and that didn't work, I guess I have to repair or reinstall the Intel compiler.
Later...

Ok, here are the Intel results with PGO. It is somewhat faster, the last position didn't change much.

It is very peculiar that the first position at depth 8 always gives 162099 kns.??? Somehow a low resolution of your timing routine?

Code: Select all

Perft(1)        N = 9      0.00 sec.    KN/sec = 0
Perft(2)        N = 81     0.00 sec.    KN/sec = 0
Perft(3)        N = 658    0.00 sec.    KN/sec = 0
Perft(4)        N = 4265           0.00 sec.    KN/sec = 0
Perft(5)        N = 27117          0.00 sec.    KN/sec = 0
Perft(6)        N = 167140         0.00 sec.    KN/sec = 0
Perft(7)        N = 1049442        0.00 sec.    KN/sec = 0
Perft(8)        N = 6483961        0.04 sec.    KN/sec = 162099
Perft(9)        N = 41022423       0.24 sec.    KN/sec = 170926
Perft(10)       N = 258895763      1.54 sec.    KN/sec = 168114
Perft(11)       N = 1665861398     9.76 sec.    KN/sec = 170682
Perft(12)       N = 10749771911   62.24 sec.    KN/sec = 172714


Perft(1)        N = 14     0.00 sec.    KN/sec = 0
Perft(2)        N = 55     0.00 sec.    KN/sec = 0
Perft(3)        N = 1168           0.00 sec.    KN/sec = 0
Perft(4)        N = 5432           0.00 sec.    KN/sec = 0
Perft(5)        N = 87195          0.00 sec.    KN/sec = 0
Perft(6)        N = 629010         0.00 sec.    KN/sec = 0
Perft(7)        N = 9041010        0.04 sec.    KN/sec = 226025
Perft(8)        N = 86724219       0.42 sec.    KN/sec = 206486
Perft(9)        N = 1216917193     6.38 sec.    KN/sec = 190739
Perft(10)       N = 13106503411   60.54 sec.    KN/sec = 216493


Perft(1)        N = 6      0.00 sec.    KN/sec = 0
Perft(2)        N = 12     0.00 sec.    KN/sec = 0
Perft(3)        N = 30     0.00 sec.    KN/sec = 0
Perft(4)        N = 73     0.00 sec.    KN/sec = 0
Perft(5)        N = 215    0.00 sec.    KN/sec = 0
Perft(6)        N = 590    0.00 sec.    KN/sec = 0
Perft(7)        N = 1944           0.00 sec.    KN/sec = 0
Perft(8)        N = 6269           0.00 sec.    KN/sec = 0
Perft(9)        N = 22369          0.00 sec.    KN/sec = 0
Perft(10)       N = 88050          0.00 sec.    KN/sec = 0
Perft(11)       N = 377436         0.00 sec.    KN/sec = 0
Perft(12)       N = 1910989        0.02 sec.    KN/sec = 95549
Perft(13)       N = 9872645        0.10 sec.    KN/sec = 98726
Perft(14)       N = 58360286       0.55 sec.    KN/sec = 106109
Perft(15)       N = 346184885      3.26 sec.    KN/sec = 106191
Perft(16)       N = 2272406115    20.65 sec.    KN/sec = 110043
Perft(17)       N = 14962263728  135.13 sec.    KN/sec = 110724

Joost Buijs · Post by **Joost Buijs** » Mon Aug 15, 2016 19:28

Rein Halbersma wrote: If you have your board class with bitboards for black, white and kings, then occupied == black ^ white. However, if you now define empty_squares == ~occupied, you have introduced 14 bits on the ghosts squares for empty_squares. This means that in patterns as opponent_pieces & (empty_squares << dir) you can also find that jumps over pieces located on the board edge are possible. Hence, you need edge mask detection. Is this what is the case in your code?

This problem was very easy to fix. My position structure contains besides the pieces also a field for the empty squares.
The only thing I had to change when I setup the position is to mask all the squares that do not belong to the actual board as being not empty, this is a one time instruction, the rest of the logic is already ok.
The improvement in speed is between 1 to 3%, not very shocking but it helps.

It is not my goal to make the fastest perft(), it is more a means of checking my move-generator.
I do bulk counting at the last ply but I still store the moves as usual, when I omit storing the moves and only count them it will run a lot faster.
Maybe for captures this is difficult because it makes detecting redundant moves somewhat difficult but for sliders it is not necessary to store them on the last ply.

Rein Halbersma · Post by **Rein Halbersma** » Mon Aug 15, 2016 20:01

Joost Buijs wrote: It is not my goal to make the fastest perft(), it is more a means of checking my move-generator.

The fastest code is removed code, and it's also easiest to maintain

What's not to like?

I do bulk counting at the last ply but I still store the moves as usual, when I omit storing the moves and only count them it will run a lot faster.
Maybe for captures this is difficult because it makes detecting redundant moves somewhat difficult but for sliders it is not necessary to store them on the last ply.

Yes for redundant captures you need to store the move. If not, you can only store the max number of jumps so far. Even better, for checkers (no majority capture rule) you can only increment a counter.

There some more tricks that people use in chess. If e.g. a white move (in draughts, e.g. a backrank pawn move) would not influence any of the black moves, then you can just do a null move at depth = 1 and count the number of black moves. It complicates the logic quite a bit, but saves a lot of potential work. Also hashing is a sure way of speeding up perft.

BertTuyt · Post by **BertTuyt** » Mon Aug 15, 2016 20:14

Joost, thanks for testing.
If I multiply the traditional results (optimized Intel) with 3.6/4.0 I yield next timings:

1. Perft(11) 8.78s
2. Perft(9) 5.7s
3. Perft(15) 2.9s

So 1. and 3. very similar to your results.
As these positions are mainly based on Man Capture, and Man Move, I guess we might have reached a limit for efficiency.

For 2. which is King dominant I need to do some work.
Initially for the King Moves the Magic did not yield faster results, so it might be beneficial for the King Capture.
I will also do a test with leaving out the KingMove Bitmaps which are constructed during King Capture detection, not sure they are faster.

You are also right regarding a faster Perft() when you dont include move storage at ply 1.
As you stated this can be done easily for Moves, but Captures is tricky.
For the moment I will leave Perft() as is.

Which timing function did you use?
I might need another one

Bert

Joost Buijs · Post by **Joost Buijs** » Mon Aug 15, 2016 20:32

BertTuyt wrote: Which timing function did you use?
I might need another one
Bert

I use QueryPerformanceFrequency() and QueryPerformanceCounter().

BertTuyt · Post by **BertTuyt** » Mon Aug 15, 2016 21:17

I have removed the KingMove Mapping in the capture routine.
It seems that The Perft(9) position 2 is around 8% - 9% faster.
8.51s -> 7.69s

But hope Joost can confirm, and hope i approach now his results.

See the link as I have updated the Dropbox.

Bert

https://www.dropbox.com/sh/9yx1f7z3x1q6 ... h2mVa?dl=0

Joost Buijs · Post by **Joost Buijs** » Tue Aug 16, 2016 06:53

BertTuyt wrote:I have removed the KingMove Mapping in the capture routine.
It seems that The Perft(9) position 2 is around 8% - 9% faster.
8.51s -> 7.69s

But hope Joost can confirm, and hope i approach now his results.

See the link as I have updated the Dropbox.

Bert

https://www.dropbox.com/sh/9yx1f7z3x1q6 ... h2mVa?dl=0

Bert, I did one more run with your last version and Intel PGO, these are the results:

Code: Select all

Perft(1)        N = 9      0.00 sec.    KN/sec = 0
Perft(2)        N = 81     0.00 sec.    KN/sec = 0
Perft(3)        N = 658    0.00 sec.    KN/sec = 0
Perft(4)        N = 4265           0.00 sec.    KN/sec = 0
Perft(5)        N = 27117          0.00 sec.    KN/sec = 0
Perft(6)        N = 167140         0.00 sec.    KN/sec = 0
Perft(7)        N = 1049442        0.00 sec.    KN/sec = 0
Perft(8)        N = 6483961        0.04 sec.    KN/sec = 162099
Perft(9)        N = 41022423       0.24 sec.    KN/sec = 170926
Perft(10)       N = 258895763      1.58 sec.    KN/sec = 163858
Perft(11)       N = 1665861398     9.90 sec.    KN/sec = 168268
Perft(12)       N = 10749771911   63.72 sec.    KN/sec = 168703


Perft(1)        N = 14     0.00 sec.    KN/sec = 0
Perft(2)        N = 55     0.00 sec.    KN/sec = 0
Perft(3)        N = 1168           0.00 sec.    KN/sec = 0
Perft(4)        N = 5432           0.00 sec.    KN/sec = 0
Perft(5)        N = 87195          0.00 sec.    KN/sec = 0
Perft(6)        N = 629010         0.00 sec.    KN/sec = 0
Perft(7)        N = 9041010        0.04 sec.    KN/sec = 226025
Perft(8)        N = 86724219       0.40 sec.    KN/sec = 216810
Perft(9)        N = 1216917193     5.77 sec.    KN/sec = 210904
Perft(10)       N = 13106503411   57.65 sec.    KN/sec = 227346


Perft(1)        N = 6      0.00 sec.    KN/sec = 0
Perft(2)        N = 12     0.00 sec.    KN/sec = 0
Perft(3)        N = 30     0.00 sec.    KN/sec = 0
Perft(4)        N = 73     0.00 sec.    KN/sec = 0
Perft(5)        N = 215    0.00 sec.    KN/sec = 0
Perft(6)        N = 590    0.00 sec.    KN/sec = 0
Perft(7)        N = 1944           0.00 sec.    KN/sec = 0
Perft(8)        N = 6269           0.00 sec.    KN/sec = 0
Perft(9)        N = 22369          0.00 sec.    KN/sec = 0
Perft(10)       N = 88050          0.00 sec.    KN/sec = 0
Perft(11)       N = 377436         0.00 sec.    KN/sec = 0
Perft(12)       N = 1910989        0.02 sec.    KN/sec = 95549
Perft(13)       N = 9872645        0.10 sec.    KN/sec = 98726
Perft(14)       N = 58360286       0.56 sec.    KN/sec = 104214
Perft(15)       N = 346184885      3.38 sec.    KN/sec = 102421
Perft(16)       N = 2272406115    20.88 sec.    KN/sec = 108831
Perft(17)       N = 14962263728  138.16 sec.    KN/sec = 108296

Your results are almost exactly the same as mine, I guess there is a certain upper limit to the speed you can reach.
I use exactly the same code for perft() as I'm going to use in my program for my search(), I'm sure that a specialized perft() program can be made faster but that is not the goal here.

I think I will leave perft() behind for the time being, I still have a lot of other things to do before my program is able to play games.

Joost

BertTuyt · Post by **BertTuyt** » Tue Aug 16, 2016 12:08

Joost, think further optimization does not make sense, as the Movegenerator is already very fast.
As I want to make some changes to the overall architecture of Damage, I want to spent still somewhat time on the Movegenerator.
I still believe that the King Capture (even without Magic) can be faster.
I will to do some tests with RAYMASK based solutions, as is also embedded in the program of Harm Jetten (Moby Dam).

Another question think the Intel Compiler is not for free, and ballpark figure for costs?

Bert

Rein Halbersma · Post by **Rein Halbersma** » Tue Aug 16, 2016 12:09

Joost Buijs wrote: To determine king captures I basically do the following: ((magic_mask_for_current_direction & opposite_pieces & ~edge_squares) >> current_direction) & empty_squares;
This gives you the first empty square after the opposite piece you are allowed to capture.
It is possible to generate another set of magic_masks to avoid masking the edge-squares but that is bean counting, you only have to mask the opposite_pieces and edge_squares once for all directions.

Hi Joost,

Could you show us your code how you use magics to generate king captures?

I think you can do it with only one magic multiplication mask for all 4 directions simultaneously, combined with 4 post-processing steps (1 for each direction). Basically you re-use your king move magic generation function

Code: Select all

passing_sqrs = king_magic_moves(from, ~empty_sqrs); // 1 magic multiplication, gives all squares a king can travel: "from" until blocked by the first bit along a direction in "~empty_sqrs"
target_LU = (passing_sqrs & diag_LU[from] >> dir_LU) & opponent_pieces & (empty_sqrs << dir_LU); // at most 1 bit
recursive_find_king_jumps(from, target_LU, dir_LU); 
// ...
// same for LD, RU and RD directions

It takes only an extra 4 bitboard tables (called diag_LU etc. above) per square (for each half-diagonal from each square). So 64 * 8 * 4 = 2Kb.

BertTuyt · Post by **BertTuyt** » Tue Aug 16, 2016 13:06

Rein/Joost, still believe that it can be done faster and without MAGIC

.
See attached code from the program of Harm Jetten (Moby Dam), which I also want to test.
Here you dont need an array , and array memory acces, which should be faster.

Bert

Code: Select all

static void kingcapt_main(movelist *listptr)
{
    u64 ray, nearest, pcbit;

    /* starting position of capturing king */
    pcbit = listptr->frombit;

    /* find non-empty squares in nw direction */
    ray = (RAYMASK_NW >> __builtin_clzll(pcbit)) & ~listptr->empty;
    /* get MS1B */
    nearest = 1ULL << (63 ^ __builtin_clzll(ray | 1ULL));
    /* is it an opponent piece followed by an empty square? */
    if ((nearest & listptr->oppbits & (listptr->empty << 6)) != 0)
    {
        kingcapt_nw(listptr, nearest >> 6, nearest);
    }

World Draughts Forum

Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft

Re: Perft