Perft

Discussion about development of draughts in the time of computer and Internet.
Post Reply
Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: Perft

Post by Joost Buijs » Fri Sep 02, 2016 10:38

Rein Halbersma wrote:
Joost Buijs wrote: I guess I have to revert back to Zobrist hashing, the problem is that the capture sequences in draughts are rather lengthy, because of this the engine will slow down considerably when updating the hash incrementally.
I never noticed this as a bottleneck. Is it really that bad? E.g. a regular move has 3 bits to update (color, from, dest), and a capture with N pieces only 1 bit per piece extra (1 more if it's a king). It's been a while since I measured the average number of captured pieces, but IIRC, it was around 2.
Maybe I'm just to picky, the loss in speed with Zobrist hashing is ~10%.
And I still have collisions, I suspect that there is a problem (bug) somewhere else in the TT code.
In my chess engine I never observe collisions with a 64 bit key.

Edit:

I found the problem, it was not a bug in the TT code but a stupid oversight in the code I use to test it.
Now I want to revert back to a faster way of calculating the hash, and measure the collisions over a longer period of time.

Ed Gilbert
Posts: 854
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: Perft

Post by Ed Gilbert » Fri Sep 02, 2016 13:47

In kingsrow I use a non-incremental hashing function that is a series of shifts and XORs. It is very similar to Thomas Wang's 64-bit mix function shown here: http://opensource.apple.com/source/Java ... unctions.h. I tuned the shifts using a program that tests that a single bit change in the key can affect every bit in the hash. I don't know how it compares to Zobrist because I never tried that. Until recently I never had make/unmake move functions because my movelist is just an array of new board positions.

-- Ed

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: Perft

Post by Joost Buijs » Fri Sep 02, 2016 15:02

The last days I tried several hash functions including mix type functions, I thought they weren't working properly but the problem was inside the program I use to test the hash-table, this is solved now, so I have to retest everything.

When you use a good random generator Zobrist hashing is very reliable albeit somewhat slow.
I optimized the engine for speed, on a single core with full evaluation it does 18.4 mnps, when I add incremental Zobrist hashing it slows down to 16.5 mnps, when I add the hash-table in the main search it slows down to 11.2 mnps, this is with a 1024 MB table, with a larger table it even gets slower.
This is measured with the Woldouby position at depth 24.
Memory speed is the major bottleneck, using large-pages helps a bit by lowering the stress on the TLB.

At the moment I pack a whole entry in 128 bits, there was no room for an age field, I don't find this very useful anyway. A move is 44 bits in size.

Joost

Code: Select all

	uint64_t key;
	uint64_t data;

	void pack(uint64_t hash, uint64_t move, int depth, int value, uint type)
	{
		key = (hash & 0xFFFFFFFFFFFFFF00) | static_cast<uint8_t>(depth);
		data = move << 20 | type << 16 | static_cast<uint16_t>(value);
		key ^= data; // lockless entry
	}

	void unpack(xentry_t *pxe)
	{
		pxe->depth = static_cast<int8_t>(key ^ data);
		pxe->move = data >> 20;
		pxe->type = data >> 16 & 0xF;
		pxe->value = static_cast<int16_t>(data);
	}

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: Perft

Post by BertTuyt » Fri Sep 02, 2016 15:55

Joost Buijs wrote:The last days I tried several hash functions including mix type functions, I thought they weren't working properly but the problem was inside the program I use to test the hash-table, this is solved now, so I have to retest everything.

When you use a good random generator Zobrist hashing is very reliable albeit somewhat slow.
I optimized the engine for speed, on a single core with full evaluation it does 18.4 mnps, when I add incremental Zobrist hashing it slows down to 16.5 mnps, when I add the hash-table in the main search it slows down to 11.2 mnps, this is with a 1024 MB table, with a larger table it even gets slower.
This is measured with the Woldouby position at depth 24.
Memory speed is the major bottleneck, using large-pages helps a bit by lowering the stress on the TLB.

At the moment I pack a whole entry in 128 bits, there was no room for an age field, I don't find this very useful anyway. A move is 44 bits in size.

Joost

Code: Select all

	uint64_t key;
	uint64_t data;

	void pack(uint64_t hash, uint64_t move, int depth, int value, uint type)
	{
		key = (hash & 0xFFFFFFFFFFFFFF00) | static_cast<uint8_t>(depth);
		data = move << 20 | type << 16 | static_cast<uint16_t>(value);
		key ^= data; // lockless entry
	}

	void unpack(xentry_t *pxe)
	{
		pxe->depth = static_cast<int8_t>(key ^ data);
		pxe->move = data >> 20;
		pxe->type = data >> 16 & 0xF;
		pxe->value = static_cast<int16_t>(data);
	}
Joost impressive mnps numbers, for my understanding (and again think this is the discusssion we should continue in the search section) , is it plain alpha-beta (so without any refinements)? And what is your definition of a full evaluation, the material including the Fabien Function ?

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: Perft

Post by Joost Buijs » Fri Sep 02, 2016 16:45

Bert,

Yes, at the moment it is still plain alpha-beta, no refinements yet, I still have to add a null-window search, history, killers etc.

Full evaluation is not full yet, it is very basic, but it is indeed material + the 20 structures Fabien uses, I use this because I want to compare apples to apples, it is coded entirely different though. The tables are a lot larger because there are holes in the index function, fortunately this doesn't seem to hurt speed by much.
The evaluation runs in ~100ns., of course it only needs to be called when the position is quiet.

When my search is ready and before I start with automatic learning I will probably switch to different structures more in line with what is known about Draughts theory.

Code: Select all

int ep_8bi[8][65536];
int ep_9bi[2][262144];

inline int v8i(int n, bb_t mb, bb_t mw, bb_t pf, bb_t pr)
{
	return ep_8bi[n][(PEXT(mw, pf) << 8) + PEXT(mb, pf)] - ep_8bi[n][(r_8b[PEXT(mb, pr)] << 8) + r_8b[PEXT(mw, pr)]];
}

inline int v9i(int n, bb_t mb, bb_t mw, bb_t pf, bb_t pr)
{
	return ep_9bi[n][(PEXT(mw, pf) << 9) + PEXT(mb, pf)] - ep_9bi[n][(r_9b[PEXT(mb, pr)] << 9) + r_9b[PEXT(mw, pr)]];
}

int evaluate(position_t *pos)
{
	const bb_t mb = pos->men(Black);
	const bb_t mw = pos->men(White);
	int value[2];

	*value =
		+ 100 * POPCOUNT(mb)
		- 100 * POPCOUNT(mw)
		+ 300 * POPCOUNT(pos->kings(Black))
		- 300 * POPCOUNT(pos->kings(White));

	// 8 bit patterns
	*value += v8i(0, mb, mw, 0x0000000000031863ULL, 0x0031863000000000ULL);
	*value += v8i(1, mb, mw, 0x00000000000630c6ULL, 0x0018c31800000000ULL); 
	*value += v8i(2, mb, mw, 0x00000000000c618cULL, 0x000c618c00000000ULL);
	*value += v8i(3, mb, mw, 0x000000000018c318ULL, 0x000630c600000000ULL);
	*value += v8i(4, mb, mw, 0x0000000018c31800ULL, 0x00000630c6000000ULL);
	*value += v8i(5, mb, mw, 0x0000000031863000ULL, 0x0000031863000000ULL);
	*value += v8i(6, mb, mw, 0x00000000630c6000ULL, 0x0000018c31800000ULL);
	*value += v8i(7, mb, mw, 0x00000000c618c000ULL, 0x000000c618c00000ULL);

	// 9 bit patterns
	*value += v9i(0, mb, mw, 0x0000000018430861ULL, 0x0021843086000000ULL);
	*value += v9i(1, mb, mw, 0x000000008610c218ULL, 0x000610c218400000ULL);

	value[White] = -value[Black];

	return value[pos->own()];
}

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: Perft

Post by BertTuyt » Sun Sep 04, 2016 13:10

Slightly off record.....

But the new Intel Kaby Lake i7-7700K (4 cores, 8 threads), has a base clock of 4.2 Ghz and Turbo Boost to 4.5 Ghz.
And (I guess) the K means overclocking, which must be possible as the TDP is 95W.

So good news for Perft addicts (like me) :)

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: Perft

Post by Joost Buijs » Sun Sep 04, 2016 13:29

BertTuyt wrote:Slightly off record.....

But the new Intel Kaby Lake i7-7700K (4 cores, 8 threads), has a base clock of 4.2 Ghz and Turbo Boost to 4.5 Ghz.
And (I guess) the K means overclocking, which must be possible as the TDP is 95W.

So good news for Perft addicts (like me) :)

Bert
Yes K means overclocking, I assume they will be very fast.

I'm thinking about building a machine with 24 cores, recently I sold 2 old machines with i7-3960X, the plan was to build a machine with i7-6950X but that CPU is so expensive that I might as well go for a dual Xeon.
Maybe AMD will have some impact on the market with their Zen CPU when it comes out early next year, hopefully it is good enough to force Intel to lower their prices, the currently low value of the Euro doesn't help either.
In the mean time I can do almost everything with my i7-5960X which is still one heck of a machine.

Joost

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: Perft

Post by BertTuyt » Sun Sep 04, 2016 15:05

I also hope that AMD will be back in the competitive arena.
Intel has a huge dominance in the high end, and as you stated the 10-core i7-6950 has an insane price tag.

It was (always) my wish to get towards the 100 mnps, as this was close to the Deep Blue figures.
Asssume this requires (whith a speed optimized program) around (max) 16 cores.
If possible in a single processor and not dual core.

If Im right, the total memory the 8-core i7-5960X can address, is 64G.
Im not sure if the i7-6950 has the same limitation?

On my wishlist however is something like 256G.
Then I can preload the 8P DB, a huge opening book, and a midgame database (see previous discussions with Michel).

And if all have access to this amount of power, we might reach the boring phase of Computer Draughts (like checkers) where we will mainly see draw games.

Bert

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: Perft

Post by Joost Buijs » Sun Sep 04, 2016 15:51

The 6950X has a limitation of 128GB, the Xeon counterparts can address 1536GB, you need a hefty mainboard for this.

In the past I build several dual CPU machines, the last 5 years I switched back to single CPU because it is cheaper.
Most chess-programs don't scale very well above 8 cores, I guess with draughts it will be even worse because the branching factor is much lower.
It is my goal to get >50 mnps on my 5960X, I'm very confident that this is going to work, I was a bit hold up by the TT stuff but I hope to have my SMP search ready the coming week, I still have to add move ordering but that can also be done after I have added SMP.

Joost

Joost Buijs
Posts: 460
Joined: Wed May 04, 2016 11:45
Real name: Joost Buijs

Re: Perft

Post by Joost Buijs » Fri Sep 09, 2016 06:55

Bert,

Before adding the move-generator to my engine I refactored it and found some additional optimizations that I overlooked earlier.
After this I will leave it like it is, trying to improve it any further is not very sensible.

Joost

Code: Select all


    m  m  m  m  m
  m  m  m  m  m
    m  m  m  m  m
  m  m  m  m  m
    -  -  -  -  -
  -  -  -  -  -
    M  M  M  M  M
  M  M  M  M  M
    M  M  M  M  M
  M  M  M  M  M

perft( 1)  nodes            9  time   0.0000003  nps  26367138
perft( 2)  nodes           81  time   0.0000044  nps  18254172
perft( 3)  nodes          658  time   0.0000024  nps 275390108
perft( 4)  nodes         4265  time   0.0000218  nps 195235840
perft( 5)  nodes        27117  time   0.0001232  nps 220066999
perft( 6)  nodes       167140  time   0.0008090  nps 206610569
perft( 7)  nodes      1049442  time   0.0051265  nps 204709457
perft( 8)  nodes      6483961  time   0.0320854  nps 202084509
perft( 9)  nodes     41022423  time   0.1971822  nps 208043287
perft(10)  nodes    258895763  time   1.2322286  nps 210103677
perft(11)  nodes   1665861398  time   7.7659500  nps 214508385

    -  -  -  -  -
  M  -  -  M  M
    M  -  -  -  -
  -  k  -  -  M
    M  M  M  k  -
  -  -  -  -  M
    K  -  M  -  -
  -  M  -  -  -
    M  M  M  M  -
  M  -  -  -  -

perft( 1)  nodes           14  time   0.0000212  nps    661541
perft( 2)  nodes           55  time   0.0000137  nps   4028313
perft( 3)  nodes         1168  time   0.0000160  nps  72805714
perft( 4)  nodes         5432  time   0.0000474  nps 114489443
perft( 5)  nodes        87195  time   0.0002840  nps 307035603
perft( 6)  nodes       629010  time   0.0026576  nps 236681130
perft( 7)  nodes      9041010  time   0.0274149  nps 329784283
perft( 8)  nodes     86724219  time   0.2878647  nps 301267304
perft( 9)  nodes   1216917193  time   3.7375490  nps 325592306

    -  -  -  -  -
  -  -  -  -  -
    -  m  m  m  -
  m  -  m  m  -
    m  -  m  m  M
  m  M  M  -  M
    -  M  M  M  M
  -  M  M  -  -
    -  -  -  -  -
  -  -  -  -  -

perft( 1)  nodes            6  time   0.0000000  nps         0
perft( 2)  nodes           12  time   0.0000010  nps  11718728
perft( 3)  nodes           30  time   0.0000010  nps  29296820
perft( 4)  nodes           73  time   0.0000024  nps  30552398
perft( 5)  nodes          215  time   0.0000061  nps  34993424
perft( 6)  nodes          590  time   0.0000157  nps  37576356
perft( 7)  nodes         1944  time   0.0000386  nps  50400901
perft( 8)  nodes         6269  time   0.0001123  nps  55824245
perft( 9)  nodes        22369  time   0.0003458  nps  64693047
perft(10)  nodes        88050  time   0.0011534  nps  76341669
perft(11)  nodes       377436  time   0.0041947  nps  89980263
perft(12)  nodes      1910989  time   0.0171585  nps 111372617
perft(13)  nodes      9872645  time   0.0817375  nps 120784709
perft(14)  nodes     58360286  time   0.4290722  nps 136015083
perft(15)  nodes    346184885  time   2.5032157  nps 138296066


BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: Perft

Post by BertTuyt » Fri Sep 09, 2016 19:08

Joost, I didnt thought further speed increase was possible, but you proved I was wrong.
I also want to add first some more features to Dwarf, but at one point I will further have a look at my MoveGenerator()
As I like challences :)

Bert

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Perft

Post by jj » Sat Aug 12, 2017 13:23

Joost Buijs wrote:

Code: Select all

perft(11)  nodes   1665861398  time   7.7659500  nps 214508385
perft( 9)  nodes   1216917193  time   3.7375490  nps 325592306
perft(15)  nodes    346184885  time   2.5032157  nps 138296066
That is really fast! I can't do (much) better than

Code: Select all

perft(11) time 14.91
perft( 9) time  7.99
perft(15) time  4.61
in Java. I no longer think Java can be as fast as C++. Loops, arithmetic, bit operations and function calls: yes; heavy array access: no. Also, Java uses too much memory (and garbage collection time) unless you use some tricks to work around that. For a complete engine the speed difference may be about a factor 2. (25 Elo?, depending on the time schedule.) I think this is consistent with what some chess programmers found.
However, changing to C++ now would be a big step, and later updates to my app (written in Java, using Codename One) would mean a lot of extra work.

Jan-Jaap
www.maximusdraughts.org

Rein Halbersma
Posts: 1720
Joined: Wed Apr 14, 2004 16:04
Contact:

Re: Perft

Post by Rein Halbersma » Sun Aug 13, 2017 22:05

jj wrote:
Joost Buijs wrote:

Code: Select all

perft(11)  nodes   1665861398  time   7.7659500  nps 214508385
perft( 9)  nodes   1216917193  time   3.7375490  nps 325592306
perft(15)  nodes    346184885  time   2.5032157  nps 138296066
That is really fast! I can't do (much) better than

Code: Select all

perft(11) time 14.91
perft( 9) time  7.99
perft(15) time  4.61
in Java. I no longer think Java can be as fast as C++. Loops, arithmetic, bit operations and function calls: yes; heavy array access: no. Also, Java uses too much memory (and garbage collection time) unless you use some tricks to work around that. For a complete engine the speed difference may be about a factor 2. (25 Elo?, depending on the time schedule.) I think this is consistent with what some chess programmers found.
However, changing to C++ now would be a big step, and later updates to my app (written in Java, using Codename One) would mean a lot of extra work.

Jan-Jaap
After quite a bit of tuning and refactoring, I now get the fastest timings :)

Xeon E5-1650v4 @3.6 Ghz with a gcc-7 PGO build within a Linux VM on Windows 10 using full bulk-counting

Code: Select all

info depth 11 leafs   1665861398 time   7387 225513 knps
info depth  9 leafs   1216917193 time   3539 343859 knps
info depth 15 leafs    346184885 time   2511 137867 knps
This is virtually identical to what Joost got, on very similar hardware (an i7-5960X overclocked to 3.6 Ghz).
Note that Joost and I do "full" bulk-counting: returning the sum of popcounts of the bitboards of valid moves at depth=1. The usual "semi" bulk-counting generates all those moves and returns the size of the move list. This can add up to ~30%.

Xeon E5-1650v4 @3.6 Ghz with a gcc-7 PGO build within a Linux VM on Windows 10 using semi bulk-counting

Code: Select all

info depth 11 leafs   1665861398 time  11521 144593 knps
info depth  9 leafs   1216917193 time   6287 193561 knps
info depth 15 leafs    346184885 time   3493 99108.2 knps
In any case, the fastest move generator that I have tested is Moby Dam.

Moby Dam: Xeon E5-1650v4 @3.6 Ghz with a reguler gcc-7 build within a Linux VM on Windows 10 using semi bulk-counting

Code: Select all

perft(11) 1665861398 nodes, 8.51 sec, 195828 kN/s, bulk
perft(9) 1216917193 nodes, 5.05 sec, 241196 kN/s, bulk
perft(15) 346184885 nodes, 2.60 sec, 132919 kN/s, bulk
My guess is (yet to be tested) that this is because Moby Dam generates successor positions, instead of moves. Kingsrow also does this and with semi bulk-counting it is also faster than my engine (but a little slower than Moby Dam).

Kingsrow: Xeon E5-1650v4 @3.6 Ghz on Windows 10 using semi bulk-counting

Code: Select all

depth 11 1665861398 nodes, 9.08 sec, 183384 knodes/sec
depth 9 1216917193 nodes, 5.45 sec, 223165 knodes/sec
depth 15 346184885 nodes, 2.80 sec, 123549 knodes/sec

jj
Posts: 190
Joined: Sun Sep 13, 2009 23:33
Real name: Jan-Jaap van Horssen
Location: Zeist, Netherlands

Re: Perft

Post by jj » Mon Aug 14, 2017 18:42

Rein Halbersma wrote:Note that Joost and I do "full" bulk-counting: returning the sum of popcounts of the bitboards of valid moves at depth=1. The usual "semi" bulk-counting generates all those moves and returns the size of the move list. This can add up to ~30%.
Indeed, I saw that later. Thanks for the overview. I do semi bulk-counting and I generate moves, to copy/make. So I have to compare to Moby Dam, which is "only" 1.7 x faster. I use an Intel i7-3930K @3.2 GHz, by the way. Task manager claims the speed is about 3.7 GHz during perft, for what it's worth.
When I profile perft, makeMove() consumes about 0% of the time. All time is consumed by generator routines (and perft), mainly the loop that adds man moves (almost 30%). So I'm not sure generating successor positions will be much faster. Which also interferes with using a transposition table (store move) and a history table (indexed by move). I think I'll leave things the way they are and move on to more important stuff.

BertTuyt
Posts: 1573
Joined: Wed Sep 01, 2004 19:42

Re: Perft

Post by BertTuyt » Sat Aug 26, 2017 03:07

A weekend in Holland.
Full bulk counting, but should still be a little beter :D

Code: Select all

pos XW XB 20 20
Perft(1)        N = 9      0.00 sec.    KN/sec = 0
Perft(2)        N = 81     0.00 sec.    KN/sec = 0
Perft(3)        N = 658    0.00 sec.    KN/sec = 0
Perft(4)        N = 4265           0.00 sec.    KN/sec = 0
Perft(5)        N = 27117          0.00 sec.    KN/sec = 0
Perft(6)        N = 167140         0.00 sec.    KN/sec = 0
Perft(7)        N = 1049442        0.00 sec.    KN/sec = 0
Perft(8)        N = 6483961        0.04 sec.    KN/sec = 162099
Perft(9)        N = 41022423       0.26 sec.    KN/sec = 157778
Perft(10)       N = 258895763      1.27 sec.    KN/sec = 203854
Perft(11)       N = 1665861398     7.35 sec.    KN/sec = 226647
pos XW XB 17 2
Perft(1)        N = 14     0.00 sec.    KN/sec = 0
Perft(2)        N = 55     0.00 sec.    KN/sec = 0
Perft(3)        N = 1168           0.00 sec.    KN/sec = 0
Perft(4)        N = 5432           0.00 sec.    KN/sec = 0
Perft(5)        N = 87195          0.00 sec.    KN/sec = 0
Perft(6)        N = 629010         0.00 sec.    KN/sec = 0
Perft(7)        N = 9041010        0.05 sec.    KN/sec = 180820
Perft(8)        N = 86724219       0.34 sec.    KN/sec = 255071
Perft(9)        N = 1216917193     3.95 sec.    KN/sec = 308080
pos XW XB 10 10
Perft(1)        N = 6      0.00 sec.    KN/sec = 0
Perft(2)        N = 12     0.00 sec.    KN/sec = 0
Perft(3)        N = 30     0.00 sec.    KN/sec = 0
Perft(4)        N = 73     0.00 sec.    KN/sec = 0
Perft(5)        N = 215    0.00 sec.    KN/sec = 0
Perft(6)        N = 590    0.00 sec.    KN/sec = 0
Perft(7)        N = 1944           0.00 sec.    KN/sec = 0
Perft(8)        N = 6269           0.00 sec.    KN/sec = 0
Perft(9)        N = 22369          0.00 sec.    KN/sec = 0
Perft(10)       N = 88050          0.00 sec.    KN/sec = 0
Perft(11)       N = 377436         0.00 sec.    KN/sec = 0
Perft(12)       N = 1910989        0.01 sec.    KN/sec = 191098
Perft(13)       N = 9872645        0.07 sec.    KN/sec = 141037
Perft(14)       N = 58360286       0.41 sec.    KN/sec = 142342
Perft(15)       N = 346184885      2.38 sec.    KN/sec = 145455
Bert

Post Reply