Perft
-
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
- Contact:
Re: Perft
I suspect that your empty_squares contains bits on the ghost squares. What I have are two masks called "squares" and "ghosts". They satisfy the followingJoost Buijs wrote:To be honest, I never looked at it in very great detail, I use ghost squares at 10,21,32, and 43 and I assumed that when there sits an enemy piece on the edge this test can fail that's why I mask out the edge-squares.Rein Halbersma wrote:Why do you need the edge_squares masking? If you have the "ghost" squares at each edge, then the opposite_pieces & (empty_squares << current_direction) would be the cheapest way to isolate all opposite pieces with an empty piece behind them.Joost Buijs wrote: To determine king captures I basically do the following: ((magic_mask_for_current_direction & opposite_pieces & ~edge_squares) >> current_direction) & empty_squares;
This gives you the first empty square after the opposite piece you are allowed to capture.
The funny thing is that the test fails indeed when I remove that mask and I still believe it is needed, but I will take another look at it.
My empty-square mask is the inverse of the occupied square-mask and it is possible that I should mark the ghost bits as being not empty, that will take time as well.
It is not so important anyway, with or without that edge-mask there won't be much difference in speed.
bitcount(squares) == 50
bitcount(ghosts) == 14
squares ^ ghosts == ~(0ULL) (all 64 bits set to 1)
squares == ~ghosts
If you have your board class with bitboards for black, white and kings, then occupied == black ^ white. However, if you now define empty_squares == ~occupied, you have introduced 14 bits on the ghosts squares for empty_squares. This means that in patterns as opponent_pieces & (empty_squares << dir) you can also find that jumps over pieces located on the board edge are possible. Hence, you need edge mask detection. Is this what is the case in your code?
To avoid this, the only time that I have an explicit use of the ghosts squares is when I compute empty_squares
empty_squares = squares ^ occupied (equivalent to ~ghosts ^ occupied)
I think that if you compute your empty_squares in the above way, then you don't need any edge_mask when computing move or jump patterns.
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
Yes, this is indeed what is happening. This morning I already figured it out.Rein Halbersma wrote:I suspect that your empty_squares contains bits on the ghost squares. What I have are two masks called "squares" and "ghosts". They satisfy the followingJoost Buijs wrote:To be honest, I never looked at it in very great detail, I use ghost squares at 10,21,32, and 43 and I assumed that when there sits an enemy piece on the edge this test can fail that's why I mask out the edge-squares.Rein Halbersma wrote:
Why do you need the edge_squares masking? If you have the "ghost" squares at each edge, then the opposite_pieces & (empty_squares << current_direction) would be the cheapest way to isolate all opposite pieces with an empty piece behind them.
The funny thing is that the test fails indeed when I remove that mask and I still believe it is needed, but I will take another look at it.
My empty-square mask is the inverse of the occupied square-mask and it is possible that I should mark the ghost bits as being not empty, that will take time as well.
It is not so important anyway, with or without that edge-mask there won't be much difference in speed.
bitcount(squares) == 50
bitcount(ghosts) == 14
squares ^ ghosts == ~(0ULL) (all 64 bits set to 1)
squares == ~ghosts
If you have your board class with bitboards for black, white and kings, then occupied == black ^ white. However, if you now define empty_squares == ~occupied, you have introduced 14 bits on the ghosts squares for empty_squares. This means that in patterns as opponent_pieces & (empty_squares << dir) you can also find that jumps over pieces located on the board edge are possible. Hence, you need edge mask detection. Is this what is the case in your code?
To avoid this, the only time that I have an explicit use of the ghosts squares is when I compute empty_squares
empty_squares = squares ^ occupied (equivalent to ~ghosts ^ occupied)
I think that if you compute your empty_squares in the above way, then you don't need any edge_mask when computing move or jump patterns.
It is very easy to change in my code, but I expect that it doesn't have much influence on the performance (if any).
This evening I will modify it and let you know the results.
Joost
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
Bert,BertTuyt wrote:Joost, thanks.
I would also like to see an as-is comparison.
So just run the executable on your (faster) computer, then I also better know , and can quantify what the effect of the Intel compiler is.
Bert
These are the results for the MSVC compiler @ 3600MHz. I took the liberty to add 1 or 2 ply to the depth.
At first glance your results are a few percent better than mine with the same compiler.
I will post the results for the Intel compiler somewhat later.
Code: Select all
Perft(1) N = 9 0.00 sec. KN/sec = 0
Perft(2) N = 81 0.00 sec. KN/sec = 0
Perft(3) N = 658 0.00 sec. KN/sec = 0
Perft(4) N = 4265 0.00 sec. KN/sec = 0
Perft(5) N = 27117 0.00 sec. KN/sec = 0
Perft(6) N = 167140 0.00 sec. KN/sec = 0
Perft(7) N = 1049442 0.00 sec. KN/sec = 0
Perft(8) N = 6483961 0.04 sec. KN/sec = 162099
Perft(9) N = 41022423 0.29 sec. KN/sec = 141456
Perft(10) N = 258895763 1.81 sec. KN/sec = 143036
Perft(11) N = 1665861398 11.56 sec. KN/sec = 144105
Perft(12) N = 10749771911 75.53 sec. KN/sec = 142324
Perft(1) N = 14 0.00 sec. KN/sec = 0
Perft(2) N = 55 0.00 sec. KN/sec = 0
Perft(3) N = 1168 0.00 sec. KN/sec = 0
Perft(4) N = 5432 0.00 sec. KN/sec = 0
Perft(5) N = 87195 0.00 sec. KN/sec = 0
Perft(6) N = 629010 0.00 sec. KN/sec = 0
Perft(7) N = 9041010 0.05 sec. KN/sec = 180820
Perft(8) N = 86724219 0.45 sec. KN/sec = 192720
Perft(9) N = 1216917193 6.56 sec. KN/sec = 185505
Perft(10) N = 13106503411 65.77 sec. KN/sec = 199277
Perft(1) N = 6 0.00 sec. KN/sec = 0
Perft(2) N = 12 0.00 sec. KN/sec = 0
Perft(3) N = 30 0.00 sec. KN/sec = 0
Perft(4) N = 73 0.00 sec. KN/sec = 0
Perft(5) N = 215 0.00 sec. KN/sec = 0
Perft(6) N = 590 0.00 sec. KN/sec = 0
Perft(7) N = 1944 0.00 sec. KN/sec = 0
Perft(8) N = 6269 0.00 sec. KN/sec = 0
Perft(9) N = 22369 0.00 sec. KN/sec = 0
Perft(10) N = 88050 0.00 sec. KN/sec = 0
Perft(11) N = 377436 0.00 sec. KN/sec = 0
Perft(12) N = 1910989 0.02 sec. KN/sec = 95549
Perft(13) N = 9872645 0.11 sec. KN/sec = 89751
Perft(14) N = 58360286 0.60 sec. KN/sec = 97267
Perft(15) N = 346184885 3.52 sec. KN/sec = 98347
Perft(16) N = 2272406115 21.93 sec. KN/sec = 103620
Perft(17) N = 14962263728 144.16 sec. KN/sec = 103789
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
These are the results for the Intel compiler without PGO.
Code: Select all
Perft(1) N = 9 0.00 sec. KN/sec = 0
Perft(2) N = 81 0.00 sec. KN/sec = 0
Perft(3) N = 658 0.00 sec. KN/sec = 0
Perft(4) N = 4265 0.00 sec. KN/sec = 0
Perft(5) N = 27117 0.00 sec. KN/sec = 0
Perft(6) N = 167140 0.00 sec. KN/sec = 0
Perft(7) N = 1049442 0.00 sec. KN/sec = 0
Perft(8) N = 6483961 0.04 sec. KN/sec = 162099
Perft(9) N = 41022423 0.27 sec. KN/sec = 151934
Perft(10) N = 258895763 1.67 sec. KN/sec = 155027
Perft(11) N = 1665861398 10.78 sec. KN/sec = 154532
Perft(12) N = 10749771911 67.70 sec. KN/sec = 158785
Perft(1) N = 14 0.00 sec. KN/sec = 0
Perft(2) N = 55 0.00 sec. KN/sec = 0
Perft(3) N = 1168 0.00 sec. KN/sec = 0
Perft(4) N = 5432 0.00 sec. KN/sec = 0
Perft(5) N = 87195 0.00 sec. KN/sec = 0
Perft(6) N = 629010 0.00 sec. KN/sec = 0
Perft(7) N = 9041010 0.05 sec. KN/sec = 180820
Perft(8) N = 86724219 0.43 sec. KN/sec = 201684
Perft(9) N = 1216917193 6.47 sec. KN/sec = 188086
Perft(10) N = 13106503411 62.73 sec. KN/sec = 208935
Perft(1) N = 6 0.00 sec. KN/sec = 0
Perft(2) N = 12 0.00 sec. KN/sec = 0
Perft(3) N = 30 0.00 sec. KN/sec = 0
Perft(4) N = 73 0.00 sec. KN/sec = 0
Perft(5) N = 215 0.00 sec. KN/sec = 0
Perft(6) N = 590 0.00 sec. KN/sec = 0
Perft(7) N = 1944 0.00 sec. KN/sec = 0
Perft(8) N = 6269 0.00 sec. KN/sec = 0
Perft(9) N = 22369 0.00 sec. KN/sec = 0
Perft(10) N = 88050 0.00 sec. KN/sec = 0
Perft(11) N = 377436 0.00 sec. KN/sec = 0
Perft(12) N = 1910989 0.02 sec. KN/sec = 95549
Perft(13) N = 9872645 0.10 sec. KN/sec = 98726
Perft(14) N = 58360286 0.57 sec. KN/sec = 102386
Perft(15) N = 346184885 3.41 sec. KN/sec = 101520
Perft(16) N = 2272406115 21.36 sec. KN/sec = 106386
Perft(17) N = 14962263728 141.06 sec. KN/sec = 106070
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
I will post the Intel PGO results somewhat later, installing VS2015 update 3 broke something in the Intel VS2015 integration.
I tried with /QProf-gen and /QProf-use and that didn't work, I guess I have to repair or reinstall the Intel compiler.
Later...
Ok, here are the Intel results with PGO. It is somewhat faster, the last position didn't change much.
It is very peculiar that the first position at depth 8 always gives 162099 kns.??? Somehow a low resolution of your timing routine?
I tried with /QProf-gen and /QProf-use and that didn't work, I guess I have to repair or reinstall the Intel compiler.
Later...
Ok, here are the Intel results with PGO. It is somewhat faster, the last position didn't change much.
It is very peculiar that the first position at depth 8 always gives 162099 kns.??? Somehow a low resolution of your timing routine?
Code: Select all
Perft(1) N = 9 0.00 sec. KN/sec = 0
Perft(2) N = 81 0.00 sec. KN/sec = 0
Perft(3) N = 658 0.00 sec. KN/sec = 0
Perft(4) N = 4265 0.00 sec. KN/sec = 0
Perft(5) N = 27117 0.00 sec. KN/sec = 0
Perft(6) N = 167140 0.00 sec. KN/sec = 0
Perft(7) N = 1049442 0.00 sec. KN/sec = 0
Perft(8) N = 6483961 0.04 sec. KN/sec = 162099
Perft(9) N = 41022423 0.24 sec. KN/sec = 170926
Perft(10) N = 258895763 1.54 sec. KN/sec = 168114
Perft(11) N = 1665861398 9.76 sec. KN/sec = 170682
Perft(12) N = 10749771911 62.24 sec. KN/sec = 172714
Perft(1) N = 14 0.00 sec. KN/sec = 0
Perft(2) N = 55 0.00 sec. KN/sec = 0
Perft(3) N = 1168 0.00 sec. KN/sec = 0
Perft(4) N = 5432 0.00 sec. KN/sec = 0
Perft(5) N = 87195 0.00 sec. KN/sec = 0
Perft(6) N = 629010 0.00 sec. KN/sec = 0
Perft(7) N = 9041010 0.04 sec. KN/sec = 226025
Perft(8) N = 86724219 0.42 sec. KN/sec = 206486
Perft(9) N = 1216917193 6.38 sec. KN/sec = 190739
Perft(10) N = 13106503411 60.54 sec. KN/sec = 216493
Perft(1) N = 6 0.00 sec. KN/sec = 0
Perft(2) N = 12 0.00 sec. KN/sec = 0
Perft(3) N = 30 0.00 sec. KN/sec = 0
Perft(4) N = 73 0.00 sec. KN/sec = 0
Perft(5) N = 215 0.00 sec. KN/sec = 0
Perft(6) N = 590 0.00 sec. KN/sec = 0
Perft(7) N = 1944 0.00 sec. KN/sec = 0
Perft(8) N = 6269 0.00 sec. KN/sec = 0
Perft(9) N = 22369 0.00 sec. KN/sec = 0
Perft(10) N = 88050 0.00 sec. KN/sec = 0
Perft(11) N = 377436 0.00 sec. KN/sec = 0
Perft(12) N = 1910989 0.02 sec. KN/sec = 95549
Perft(13) N = 9872645 0.10 sec. KN/sec = 98726
Perft(14) N = 58360286 0.55 sec. KN/sec = 106109
Perft(15) N = 346184885 3.26 sec. KN/sec = 106191
Perft(16) N = 2272406115 20.65 sec. KN/sec = 110043
Perft(17) N = 14962263728 135.13 sec. KN/sec = 110724
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
This problem was very easy to fix. My position structure contains besides the pieces also a field for the empty squares.Rein Halbersma wrote: If you have your board class with bitboards for black, white and kings, then occupied == black ^ white. However, if you now define empty_squares == ~occupied, you have introduced 14 bits on the ghosts squares for empty_squares. This means that in patterns as opponent_pieces & (empty_squares << dir) you can also find that jumps over pieces located on the board edge are possible. Hence, you need edge mask detection. Is this what is the case in your code?
The only thing I had to change when I setup the position is to mask all the squares that do not belong to the actual board as being not empty, this is a one time instruction, the rest of the logic is already ok.
The improvement in speed is between 1 to 3%, not very shocking but it helps.
It is not my goal to make the fastest perft(), it is more a means of checking my move-generator.
I do bulk counting at the last ply but I still store the moves as usual, when I omit storing the moves and only count them it will run a lot faster.
Maybe for captures this is difficult because it makes detecting redundant moves somewhat difficult but for sliders it is not necessary to store them on the last ply.
-
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
- Contact:
Re: Perft
The fastest code is removed code, and it's also easiest to maintain What's not to like?Joost Buijs wrote: It is not my goal to make the fastest perft(), it is more a means of checking my move-generator.
Yes for redundant captures you need to store the move. If not, you can only store the max number of jumps so far. Even better, for checkers (no majority capture rule) you can only increment a counter.I do bulk counting at the last ply but I still store the moves as usual, when I omit storing the moves and only count them it will run a lot faster.
Maybe for captures this is difficult because it makes detecting redundant moves somewhat difficult but for sliders it is not necessary to store them on the last ply.
There some more tricks that people use in chess. If e.g. a white move (in draughts, e.g. a backrank pawn move) would not influence any of the black moves, then you can just do a null move at depth = 1 and count the number of black moves. It complicates the logic quite a bit, but saves a lot of potential work. Also hashing is a sure way of speeding up perft.
Re: Perft
Joost, thanks for testing.
If I multiply the traditional results (optimized Intel) with 3.6/4.0 I yield next timings:
1. Perft(11) 8.78s
2. Perft(9) 5.7s
3. Perft(15) 2.9s
So 1. and 3. very similar to your results.
As these positions are mainly based on Man Capture, and Man Move, I guess we might have reached a limit for efficiency.
For 2. which is King dominant I need to do some work.
Initially for the King Moves the Magic did not yield faster results, so it might be beneficial for the King Capture.
I will also do a test with leaving out the KingMove Bitmaps which are constructed during King Capture detection, not sure they are faster.
You are also right regarding a faster Perft() when you dont include move storage at ply 1.
As you stated this can be done easily for Moves, but Captures is tricky.
For the moment I will leave Perft() as is.
Which timing function did you use?
I might need another one
Bert
If I multiply the traditional results (optimized Intel) with 3.6/4.0 I yield next timings:
1. Perft(11) 8.78s
2. Perft(9) 5.7s
3. Perft(15) 2.9s
So 1. and 3. very similar to your results.
As these positions are mainly based on Man Capture, and Man Move, I guess we might have reached a limit for efficiency.
For 2. which is King dominant I need to do some work.
Initially for the King Moves the Magic did not yield faster results, so it might be beneficial for the King Capture.
I will also do a test with leaving out the KingMove Bitmaps which are constructed during King Capture detection, not sure they are faster.
You are also right regarding a faster Perft() when you dont include move storage at ply 1.
As you stated this can be done easily for Moves, but Captures is tricky.
For the moment I will leave Perft() as is.
Which timing function did you use?
I might need another one
Bert
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
I use QueryPerformanceFrequency() and QueryPerformanceCounter().BertTuyt wrote: Which timing function did you use?
I might need another one
Bert
Re: Perft
I have removed the KingMove Mapping in the capture routine.
It seems that The Perft(9) position 2 is around 8% - 9% faster.
8.51s -> 7.69s
But hope Joost can confirm, and hope i approach now his results.
See the link as I have updated the Dropbox.
Bert
https://www.dropbox.com/sh/9yx1f7z3x1q6 ... h2mVa?dl=0
It seems that The Perft(9) position 2 is around 8% - 9% faster.
8.51s -> 7.69s
But hope Joost can confirm, and hope i approach now his results.
See the link as I have updated the Dropbox.
Bert
https://www.dropbox.com/sh/9yx1f7z3x1q6 ... h2mVa?dl=0
-
- Posts: 471
- Joined: Wed May 04, 2016 11:45
- Real name: Joost Buijs
Re: Perft
Bert, I did one more run with your last version and Intel PGO, these are the results:BertTuyt wrote:I have removed the KingMove Mapping in the capture routine.
It seems that The Perft(9) position 2 is around 8% - 9% faster.
8.51s -> 7.69s
But hope Joost can confirm, and hope i approach now his results.
See the link as I have updated the Dropbox.
Bert
https://www.dropbox.com/sh/9yx1f7z3x1q6 ... h2mVa?dl=0
Code: Select all
Perft(1) N = 9 0.00 sec. KN/sec = 0
Perft(2) N = 81 0.00 sec. KN/sec = 0
Perft(3) N = 658 0.00 sec. KN/sec = 0
Perft(4) N = 4265 0.00 sec. KN/sec = 0
Perft(5) N = 27117 0.00 sec. KN/sec = 0
Perft(6) N = 167140 0.00 sec. KN/sec = 0
Perft(7) N = 1049442 0.00 sec. KN/sec = 0
Perft(8) N = 6483961 0.04 sec. KN/sec = 162099
Perft(9) N = 41022423 0.24 sec. KN/sec = 170926
Perft(10) N = 258895763 1.58 sec. KN/sec = 163858
Perft(11) N = 1665861398 9.90 sec. KN/sec = 168268
Perft(12) N = 10749771911 63.72 sec. KN/sec = 168703
Perft(1) N = 14 0.00 sec. KN/sec = 0
Perft(2) N = 55 0.00 sec. KN/sec = 0
Perft(3) N = 1168 0.00 sec. KN/sec = 0
Perft(4) N = 5432 0.00 sec. KN/sec = 0
Perft(5) N = 87195 0.00 sec. KN/sec = 0
Perft(6) N = 629010 0.00 sec. KN/sec = 0
Perft(7) N = 9041010 0.04 sec. KN/sec = 226025
Perft(8) N = 86724219 0.40 sec. KN/sec = 216810
Perft(9) N = 1216917193 5.77 sec. KN/sec = 210904
Perft(10) N = 13106503411 57.65 sec. KN/sec = 227346
Perft(1) N = 6 0.00 sec. KN/sec = 0
Perft(2) N = 12 0.00 sec. KN/sec = 0
Perft(3) N = 30 0.00 sec. KN/sec = 0
Perft(4) N = 73 0.00 sec. KN/sec = 0
Perft(5) N = 215 0.00 sec. KN/sec = 0
Perft(6) N = 590 0.00 sec. KN/sec = 0
Perft(7) N = 1944 0.00 sec. KN/sec = 0
Perft(8) N = 6269 0.00 sec. KN/sec = 0
Perft(9) N = 22369 0.00 sec. KN/sec = 0
Perft(10) N = 88050 0.00 sec. KN/sec = 0
Perft(11) N = 377436 0.00 sec. KN/sec = 0
Perft(12) N = 1910989 0.02 sec. KN/sec = 95549
Perft(13) N = 9872645 0.10 sec. KN/sec = 98726
Perft(14) N = 58360286 0.56 sec. KN/sec = 104214
Perft(15) N = 346184885 3.38 sec. KN/sec = 102421
Perft(16) N = 2272406115 20.88 sec. KN/sec = 108831
Perft(17) N = 14962263728 138.16 sec. KN/sec = 108296
I use exactly the same code for perft() as I'm going to use in my program for my search(), I'm sure that a specialized perft() program can be made faster but that is not the goal here.
I think I will leave perft() behind for the time being, I still have a lot of other things to do before my program is able to play games.
Joost
Re: Perft
Joost, think further optimization does not make sense, as the Movegenerator is already very fast.
As I want to make some changes to the overall architecture of Damage, I want to spent still somewhat time on the Movegenerator.
I still believe that the King Capture (even without Magic) can be faster.
I will to do some tests with RAYMASK based solutions, as is also embedded in the program of Harm Jetten (Moby Dam).
Another question think the Intel Compiler is not for free, and ballpark figure for costs?
Bert
As I want to make some changes to the overall architecture of Damage, I want to spent still somewhat time on the Movegenerator.
I still believe that the King Capture (even without Magic) can be faster.
I will to do some tests with RAYMASK based solutions, as is also embedded in the program of Harm Jetten (Moby Dam).
Another question think the Intel Compiler is not for free, and ballpark figure for costs?
Bert
-
- Posts: 1722
- Joined: Wed Apr 14, 2004 16:04
- Contact:
Re: Perft
Hi Joost,Joost Buijs wrote: To determine king captures I basically do the following: ((magic_mask_for_current_direction & opposite_pieces & ~edge_squares) >> current_direction) & empty_squares;
This gives you the first empty square after the opposite piece you are allowed to capture.
It is possible to generate another set of magic_masks to avoid masking the edge-squares but that is bean counting, you only have to mask the opposite_pieces and edge_squares once for all directions.
Could you show us your code how you use magics to generate king captures?
I think you can do it with only one magic multiplication mask for all 4 directions simultaneously, combined with 4 post-processing steps (1 for each direction). Basically you re-use your king move magic generation function
Code: Select all
passing_sqrs = king_magic_moves(from, ~empty_sqrs); // 1 magic multiplication, gives all squares a king can travel: "from" until blocked by the first bit along a direction in "~empty_sqrs"
target_LU = (passing_sqrs & diag_LU[from] >> dir_LU) & opponent_pieces & (empty_sqrs << dir_LU); // at most 1 bit
recursive_find_king_jumps(from, target_LU, dir_LU);
// ...
// same for LD, RU and RD directions
Re: Perft
Rein/Joost, still believe that it can be done faster and without MAGIC .
See attached code from the program of Harm Jetten (Moby Dam), which I also want to test.
Here you dont need an array , and array memory acces, which should be faster.
Bert
See attached code from the program of Harm Jetten (Moby Dam), which I also want to test.
Here you dont need an array , and array memory acces, which should be faster.
Bert
Code: Select all
static void kingcapt_main(movelist *listptr)
{
u64 ray, nearest, pcbit;
/* starting position of capturing king */
pcbit = listptr->frombit;
/* find non-empty squares in nw direction */
ray = (RAYMASK_NW >> __builtin_clzll(pcbit)) & ~listptr->empty;
/* get MS1B */
nearest = 1ULL << (63 ^ __builtin_clzll(ray | 1ULL));
/* is it an opponent piece followed by an empty square? */
if ((nearest & listptr->oppbits & (listptr->empty << 6)) != 0)
{
kingcapt_nw(listptr, nearest >> 6, nearest);
}