perft for 8x8 checkers (depth 25)

Discussion about the development of draughts in the age of computers and the Internet.
AartBik
Posts: 103
Joined: Wed Mar 11, 2009 01:30
Location: Mountain View

Re: perft for 8x8 checkers (depth 25)

Post by AartBik »

Rein Halbersma wrote: Without hashing, the numbers are in the same ballpark as Aart's, but apparently my hash table (as did Murray's) has a big performance hit (>2x :( ) compared to Aart's.

@Aart: can you tell us a bit about your hash table specs? How much memory, how many entries, and what kind of data structure and replacement scheme?

@Murray, Aart: (assuming you have an incremental Zobrist hash) did you also turn off hash updating for the plain / bulk-counting perft() runs?
Rein,
Glad to give some more details.

My move generator always does incremental Zobrist updates, even in all the cases I reported that don't use a transposition table at all (I simply did not feel like removing this intricate code, and I suspect removing it would not have helped much anyway).
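
For readers following along, here is a minimal sketch of what incremental Zobrist updating looks like; the table name, piece encoding, and square indexing below are my own illustrative assumptions, not Aart's actual code.

[code]
#include <cstdint>

// Hypothetical layout: 4 piece kinds (white/black man/king) on a 64-square
// board index; only the 32 dark squares ever occur in 8x8 checkers.
// The table would be filled once at startup with random 64-bit values.
extern uint64_t zobrist[4][64];

struct Position {
    uint64_t hash;  // incrementally maintained Zobrist key
    // ... bitboards etc. omitted ...
};

// XOR-ing a piece's random value toggles it in and out of the key, so the
// same call both places and removes a piece; make and unmake are symmetric.
inline void togglePiece(Position& pos, int piece, int square) {
    pos.hash ^= zobrist[piece][square];
}

// A quiet move is two toggles; each captured man adds one more toggle.
inline void makeSimpleMove(Position& pos, int piece, int from, int to) {
    togglePiece(pos, piece, from);  // lift from origin square
    togglePiece(pos, piece, to);    // drop on destination square
}
[/code]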

My hash entries in the transposition table consist of a 64-bit perft count, a 97-bit full position and an 8-bit depth, padded up to 24 bytes total. And, yes, I am aware that this does not line up nicely with cache line sizes. But since I could not reduce the size of hash entries further to the next smaller power of two, and since I found more padding beyond a multiple of 4 rather wasteful, I preferred this compromise with more entries over avoiding cache line splits.
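
As a rough illustration of such a layout (not Aart's actual declaration), a 24-byte entry could be packed like this; the field names and the split of the 97 position bits are my assumptions:

[code]
#include <cstdint>

// Hypothetical 24-byte entry: 64-bit perft count, 97-bit position, 8-bit
// depth. The 97 position bits are assumed to sit in one full 64-bit word
// plus the low 33 bits of a second word, which also holds the depth.
struct TTEntry {
    uint64_t count;    // perft result for this position/depth
    uint64_t posLow;   // lower 64 bits of the position encoding
    uint64_t posHigh;  // bits 0..32: upper position bits, bits 33..40: depth
};
static_assert(sizeof(TTEntry) == 24, "three 8-byte words, no padding");
[/code]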

I use a very simple replace-always scheme for the transposition table. That is, I simply overwrite hash entries with the most recently computed perft results, regardless of their depth. I tried a few other schemes, such as only overwriting hash entries with deeper perft results, but found that the simple replace-always scheme performed surprisingly well. I did not investigate this in much detail, though, and I am sure better schemes exist.
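
A minimal sketch of what a replace-always probe/store pair could look like, reusing the hypothetical TTEntry above; the power-of-two table size and low-bits indexing are also my assumptions:

[code]
#include <cstdint>
#include <vector>

struct TTEntry {  // hypothetical 24-byte entry, as sketched above
    uint64_t count, posLow, posHigh;
};

std::vector<TTEntry> table(1u << 22);    // size illustrative; power of two
const uint64_t mask = table.size() - 1;  // so indexing is a cheap mask

// Hit only if the full stored position (with its packed depth) matches.
bool probe(uint64_t key, uint64_t posLow, uint64_t posHigh, uint64_t& count) {
    const TTEntry& e = table[key & mask];
    if (e.posLow == posLow && e.posHigh == posHigh) {
        count = e.count;
        return true;
    }
    return false;
}

// Replace-always: overwrite whatever sits in the slot, no depth comparison.
void store(uint64_t key, uint64_t posLow, uint64_t posHigh, uint64_t count) {
    table[key & mask] = TTEntry{count, posLow, posHigh};
}
[/code]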

Aart
Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04

Re: perft for 8x8 checkers (depth 25)

Post by Rein Halbersma »

AartBik wrote: My move generator always does incremental Zobrist updates, even for all cases I reported that don't use transposition tables at all (I simply did not feel like removing this intricate code; I suspect it would not have helped a lot anyway).
Hi Aart,

A slightly slower response than usual, because of some chores I had to take care of.

Thanks for your explanation. For me, adding hash updating made quite a bit of difference (~30%), mainly because I didn't use the intrinsic BitScanForward64() routine to compute the Zobrist indices... :oops: Fixing that indeed made the bulk-count almost insensitive to the hash updating. It also reduced the hashed perft() by the same percentage, from 75 to 58 seconds. Still looking for the remaining ~150% gap compared to your code...
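
For reference, the pattern in question scans the set bits of the changed bitboards directly instead of looping over every square; the MSVC intrinsic below and the table shape are my assumptions about the setup.

[code]
#include <cstdint>
#include <intrin.h>  // MSVC; GCC/Clang would use __builtin_ctzll instead

extern uint64_t zobrist[4][64];

// XOR the Zobrist values of every square set in 'bits' into 'hash',
// jumping from one set bit to the next with the bit-scan intrinsic.
inline void xorSquares(uint64_t& hash, uint64_t bits, int piece) {
    unsigned long sq;
    while (_BitScanForward64(&sq, bits)) {  // index of the lowest set bit
        hash ^= zobrist[piece][sq];
        bits &= bits - 1;                   // clear that bit and continue
    }
}
[/code]
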
AartBik wrote: My hash entries in the transposition table consist of a 64-bit perft count, a 97-bit full position and an 8-bit depth, padded up to 24 bytes total. And, yes, I am aware that this does not line up nicely with cache line sizes. But since I could not reduce the size of hash entries further to the next smaller power of two, and since I found more padding beyond a multiple of 4 rather wasteful, I preferred this compromise with more entries over avoiding cache line splits.

I use a very simple replace-always scheme for the transposition table. That is, I simply overwrite hash entries with the most recently computed perft results, regardless of their depth. I tried a few other schemes, such as only overwriting hash entries with deeper perft results, but found that the simple replace-always scheme performed surprisingly well. I did not investigate this in much detail, though, and I am sure better schemes exist.
I still have to try your exact layout, but I suspect that using N-way associative buckets only makes sense if one really makes sure that the buckets are cache-aligned. My entry size nicely divides 64, but I haven't made sure that the std::vector starts its elements at a cache-aligned memory address. I should write an allocator to do that. (I recently also included Howard Hinnant's stack allocator in my move generator, in order to have a safe dynamic move array that only goes to the heap if the stack array runs out of space.)
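
A minimal sketch of such a cache-aligned allocator, using C++17's aligned operator new; the 64-byte line size is an assumption about the target CPU.

[code]
#include <cstddef>
#include <new>
#include <vector>

// Hands out 64-byte-aligned storage so that bucket boundaries coincide
// with cache-line boundaries.
template <typename T, std::size_t Align = 64>
struct AlignedAllocator {
    using value_type = T;
    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Align>&) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(
            ::operator new(n * sizeof(T), std::align_val_t(Align)));
    }
    void deallocate(T* p, std::size_t) {
        ::operator delete(p, std::align_val_t(Align));
    }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

// Usage: the first bucket now starts on a cache-line boundary, e.g.
// std::vector<Bucket, AlignedAllocator<Bucket>> table(1u << 20);
[/code]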

My entries with position info are 32 bytes, but I also experimented with using only the upper 32 bits of the hash keys as the signature (this gives correct results up to depth = 22), and the speed was almost the same (less than 5% difference). Perhaps one reason is that with the smaller TT entries I have 4-way buckets instead of 2-way ones, so the find/insert then looks at twice the number of elements, even though each element takes half the storage. I will test 1-way entries with always-replace soon.
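
To make the trade-off concrete, here is a rough sketch of a 4-way bucket probe keyed on a 32-bit signature; the exact entry layout is my assumption, not my actual code. Four 16-byte entries fill one 64-byte cache line, so a probe touches a single line but compares four slots.

[code]
#include <cstdint>

// Hypothetical 16-byte entry: 32-bit signature (upper hash bits), depth,
// padding, 64-bit count. Four of these make one 64-byte bucket.
struct SmallEntry {
    uint32_t sig;      // upper 32 bits of the Zobrist key
    uint8_t  depth;    // remaining perft depth for this count
    uint8_t  pad[3];
    uint64_t count;
};

struct Bucket { SmallEntry slot[4]; };  // one cache line when 64-byte aligned

// Probe all four ways; with only 32 signature bits, unrelated positions can
// collide, which is consistent with correctness holding only up to depth 22.
bool probe(const Bucket& b, uint32_t sig, uint8_t depth, uint64_t& count) {
    for (const SmallEntry& e : b.slot) {
        if (e.sig == sig && e.depth == depth) {
            count = e.count;
            return true;
        }
    }
    return false;
}
[/code]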

Rein