Parallel search

Ed Gilbert · Post by **Ed Gilbert** » Wed Jan 30, 2008 18:19

I have a multi-threaded parallel search working in Kingsrow, and would like to compare notes with others that have tried this. Gerard and Feike, I think you also have parallel search implemented, and Bert I know is working on it but not sure if it is fully functional yet. Today I ran some tests on an Intel dual core E6700 comparing node counts, search speed in nodes/sec, and time to depth between a single-threaded search and a parallel search. The tests were a suite of 67 test positions taken from a game. Comparing the averages of searches of these positions, I find the ratios of (2-threads/1-thread) are:

Speed (nodes/sec): 1.76
Time to same depth: 0.62
Node counts: 1.09

I really have no idea if these results are reasonable, or if I should be able to get a lot closer to 2x in speed and 0.5x in time to depth. How do these compare with other parallel search engines?

-- Ed

BertTuyt · Post by **BertTuyt** » Wed Jan 30, 2008 21:12

Ed, I am working also on parallel (64bit) search on my Quad Q6600.
My algorithm in the end would be scalable to any number of processor (so 2, 4, 8 ,16).

I think it is good to define a test-suit with positions to compare results.
As i have a work-overload in February, I think my parallel implementation is finished tested and debugged in March.

If you could sent me the test-set it would already be helpful.
The speed increase, i guess, is also dependant on the search-depth. So did you search to a specific depth (for example 15 ply), or do we need to define a test search-depth for these positions.

Although i don't have any numbers to compare, your results look quite good to me.

By the way i know that Michel Grimminck (Dragon), also has implemented a parallel search in his engine (even in 64bit, but im not 100% sure).

Bert

Ed Gilbert · Post by **Ed Gilbert** » Thu Jan 31, 2008 01:52

Bert, I can send you the test positions, but I don't think there is anything special about them. I just wanted to have enough of them to get a good average, because there is some variation in the ratio of parallel/single search times. I also see about the same ratios regardless of the depth or time of search. I chose a nominal search time of 10 seconds for the test positions, and the search depths varied from 15 to 21 ply, but I made sure to compare the statistics at the same search depths for parallel vs. single for each position. I wrote a little test driver to run these tests and gather the results automatically, so that I can easily run it many times and try tuning the parallel search parameters for better performance. With only 2 cores there are not that many parameters to tune. When I can try this on a quad core there will be a few more settings to play with.

-- Ed

TAILLE · Post by **TAILLE** » Fri Feb 01, 2008 21:54

Ed Gilbert wrote:I have a multi-threaded parallel search working in Kingsrow, and would like to compare notes with others that have tried this. Gerard and Feike, I think you also have parallel search implemented, and Bert I know is working on it but not sure if it is fully functional yet. Today I ran some tests on an Intel dual core E6700 comparing node counts, search speed in nodes/sec, and time to depth between a single-threaded search and a parallel search. The tests were a suite of 67 test positions taken from a game. Comparing the averages of searches of these positions, I find the ratios of (2-threads/1-thread) are:

Speed (nodes/sec): 1.76
Time to same depth: 0.62
Node counts: 1.09

I really have no idea if these results are reasonable, or if I should be able to get a lot closer to 2x in speed and 0.5x in time to depth. How do these compare with other parallel search engines?

-- Ed

Yes I have implemented a 2 treaded parallel search in order to take advantage of my CORE 2 DUO T7200 processor (only a 32bits processor).
The only figure I have concernes the speed ratio : 1.81
But it is important to note that this ratio is correct only in the beginning and in the middle of the game.
When Damy begins to use the disk endgame database I have to discover some new ideas because the two processors are always waiting for the disk and I gain nothing with my current algorithm.
Gérard

Ed Gilbert · Post by **Ed Gilbert** » Sat Feb 02, 2008 14:37

Gerard, your results are similar to mine. I also see that the ratios drop when hitting the endgame databases hard. The problem is that my endgame db caching code cannot be re-entrant, so it has to be protected with a mutual exclusion primitive. Yours is probably the same. I have an idea for solving this problem, but it adds some complications. When a lookup of an endgame position occurs, the cache buffer is checked first, and normally the position is found in cache and the result can be returned quickly. A cache miss occurs only once for every few hundred lookups. But a miss means that a new block must be loaded from disk, which takes milliseconds, and all the while the other threads are blocked from doing lookups. I think the problem can be solved this way: when a cache miss occurs, release the mutual exclusion primitive, and read the new data from the file into a local buffer. Then acquire the exclusion primitive, copy the data block into the cache, and finish the lookup. In this way other threads can do lookups while the file I/O is taking place. The complication is that file handles cannot be shared by multiple threads, so each one has to have its own file handles. My database currently has 110 data files, so I will need 440 open file handles to work with a 4-core processor.

My basic parallel search algorithm is what the chess programmers call "younger brothers wait". At any position in the search tree, except for root nodes and nodes within 5 plies of a leaf, I can split the search to allow other threads to help after the first move is searched. From what I have read, some chess programs only allow the search to split at nodes where it is expected there will not be a beta cutoff. I'm not doing that as I think it is difficult to predict this in kingsrow's search which is MTD-f and all searches use a null window. My tests showed that the parallel search is only searching 9% more nodes than with a single-thread which I think is acceptable.

-- Ed

FeikeBoomstra · Post by **FeikeBoomstra** » Fri Feb 08, 2008 20:21

I didn't study the alpha-beta parallel literature, and just implemented the parallelism at the top level. The gain from working parallel is compensated by the overhead introduced for multi threading, so the net result was around 1.

So I removed the code. I studied the literature and had the feeling that implementing the "youngest brother waiting" algorithm with the cut-off requirement was rather complicated. I now understand that Ed has skipped the cut-off requirement.

So maybe I will give it another try.

Feike

Ed Gilbert · Post by **Ed Gilbert** » Sun Feb 10, 2008 19:59

Hi Feike,

I have been working on improving the parallel search, and I have found that the performance was really being badly affected by having too many shared variables between the threads. When I first tried it with 3 and 4 CPUs, there was so much contention between the threads that 3 cores were only slightly faster than 2, and with 4 cores the performance was about the same as with 2! It took me a while to understand what was going on, and I don't have a profiler so it was difficult to see where the bottleneck was. I set up an infinite search running under the debugger, and then stopped it at 20 random intervals and each time observed how many of the 4 cores were actively searching vs. just sitting idle waiting for work. The utilization of the 4 cores was about 95%, so the slow performance was not due to that. I took a guess that it might be contentions between the L1 caches of the 4 cores, and I identified a half dozen shared variables that were being read and written by all 4 cores at leaf nodes and did not really need to be shared. When I changed the code so that each thread had its own private instance of these variables the speed was much better. I measure these ratios of (nCPUs/1CPU):

2 CPUs:
nodes: 1.07
time: 0.54 (1/t = 1.85)
speed: 1.98

3 CPUs:
nodes: 1.08
time: 0.42 (1/t = 2.38)
speed: 2.57

4 CPUs:
nodes: 1.04
time: 0.36 (1/t = 2.75)
speed: 2.85

These are the averages measured on 67 test positions, each position searched to the same depth for the 4 cases. It looks like a plateau has almost been reached and probably more than 4 CPUs would not be any additional benefit. Still this is a nice improvement over single-threaded. It's kind of fun to see the search speed reading 7500Knodes/sec.

There is a good example of YBW parallel search in the open source chess program named Glaurung, and a lot of my parallel search is leveraged from that.

-- Ed

TAILLE · Post by **TAILLE** » Mon Feb 11, 2008 14:38

Hi,
I noted that Ed. use MTD-f algorithms in Kingrow.
As far as I concerned, I am exploring the MTD-best algorithms witch looks like a very promising approach for a game like draughts.
In addition it "seems" easy to use such algorithm in a multithreading environment.
Did somebody have experience with this algorithms ?
Gérard

FeikeBoomstra · Post by **FeikeBoomstra** » Mon Feb 11, 2008 18:43

Hi Gérard,

does this means that you have an evaluation procedure with say an upperlimit and you stop evaluating that position at the moment you are sure you can't come above that limit any more?

Do you have a reference for MTD-Best, for I am not sure I completely understood it from the very short description of v.d. Plaat.

And about shared variables: I my case there are only a very few shared variables: the current search window and some housekeeping. But also the hashtable is shared! I guess that is the biggest problem. But I couldn't get my profiler running to obtain the hitrate degradation. I hope that problem is solved now (I use intel's vtec).

So enough things to do.

TAILLE · Post by **TAILLE** » Mon Feb 11, 2008 23:16

Hi Feike
I think there not a lot of article on MTD-best algorithms and I have to study it myself. You can at least look at the following article :
http://www.mimuw.edu.pl/~awojna/SID/referaty/MTD-f.pdf
Gérard

Ed Gilbert · Post by **Ed Gilbert** » Tue Feb 12, 2008 01:20

Gerard, I am using MTD-f, not -best, as I usually want to know the exact search score. You are right that it is relatively simple to convert it to parallel search, because there is only one search function, unlike PVS search where the first move gets handled differently from the others. Still, most of the effort to make a parallel search is in writing the functions that manage the split points and you have to do this in either case. My experience with MTD is that it is very slightly faster than windowed alphabeta search using an initial window size of about 1/4 of the value of a checker. One drawback though is that you have no exact values in the hashtable, and so if you extract your PV from the hashtable after the search is done it will not be as reliable.

-- Ed

TAILLE · Post by **TAILLE** » Tue Feb 12, 2008 11:04

Ed Gilbert wrote:Gerard, I am using MTD-f, not -best, as I usually want to know the exact search score. You are right that it is relatively simple to convert it to parallel search, because there is only one search function, unlike PVS search where the first move gets handled differently from the others. Still, most of the effort to make a parallel search is in writing the functions that manage the split points and you have to do this in either case. My experience with MTD is that it is very slightly faster than windowed alphabeta search using an initial window size of about 1/4 of the value of a checker. One drawback though is that you have no exact values in the hashtable, and so if you extract your PV from the hashtable after the search is done it will not be as reliable.

-- Ed

Hi Ed,
I am not sure to understand your point. When using PVS search or MTD-f you have of course the exact value in the hashtable concerning the Principal Variation which represents very few nodes but what about the other nodes ? If the PV is made of only 20 nodes I can imagine that you have maybe the exact value for 1000 or 10000 nodes which seems negligeagle regarding the millions of nodes explored.
As I said I am only exploring this algorithms and I have not yet finshed to program it.
Concerning this algorithms I am particularly interested by the two following points :
1) The algorithms is particularly efficient in forcing move situation
2) The algorithms is rather easy to parallelise

Concerning the first point lets take the following position :

white to move.
By a very quickly analysis, depth 2 or 3, you can remark that 40-35 leads to a position evaluated to a value near from 0,000 and you can detect that the next best move is evaluated to a value near from -1,000.
As a consequence, in the MTD-best algorithms, you can begin the exploration of the tree with the test value -0,500. You can see that you will confirm, at each level of depth, that 40-35 is the only move that is better than -0,500. You don't know what is the value of this move but you know it is the best move and you can see that the exploration is very efficient due to the high number of cutoffs. It seems to me far better than exploring the tree with the 0,000 test value in order to find the exact value of the best move, ... but maybe I am completly wrong.

Concerning the second point the algorithms seems also very efficient because the test value is not changing as often as in MTD-f algorithms. As a consequence I see a good chance of limiting the overhead and I see a better reuse of hashtable information.

I hope I will be able to confirm that when I finish to program this MTD-best algorithms

Gérard

Ed Gilbert · Post by **Ed Gilbert** » Tue Feb 12, 2008 14:26

TAILLE wrote:I am not sure to understand your point. When using PVS search or MTD-f you have of course the exact value in the hashtable concerning the Principal Variation which represents very few nodes but what about the other nodes ? If the PV is made of only 20 nodes I can imagine that you have maybe the exact value for 1000 or 10000 nodes which seems negligeagle regarding the millions of nodes explored.

MTD-f always uses a null window search, meaning that beta == (alpha+1). So at all nodes you either have an upper bound or a lower bound, never an exact value. You converge on an exact value only at the root node by proving some lower bound V and then an upper bound of V. The hashtable never has any exact values, which is why I made the comment about extracting the PV.

Concerning this algorithms I am particularly interested by the two following points :
1) The algorithms is particularly efficient in forcing move situation
...
By a very quickly analysis, depth 2 or 3, you can remark that 40-35 leads to a position evaluated to a value near from 0,000 and you can detect that the next best move is evaluated to a value near from -1,000.
As a consequence, in the MTD-best algorithms, you can begin the exploration of the tree with the test value -0,500. You can see that you will confirm, at each level of depth, that 40-35 is the only move that is better than -0,500. You don't know what is the value of this move but you know it is the best move and you can see that the exploration is very efficient due to the high number of cutoffs. It seems to me far better than exploring the tree with the 0,000 test value in order to find the exact value of the best move, ... but maybe I am completly wrong.

In your example an alphabeta search using an aspiration window of 1/4 of a checker would also have a high number of cutoffs. But I agree that MTD-best should be particularly efficient when there is only one good move such as this case. In order to do MTD-best, you have to make an initial guess of the best move, find a lower bound on only this move, and then find an upper bound on all the other moves that is not greater than the lower bound of the best guess. But if you guess wrong about the best move, then I think MTD-best is less efficient than MTD-f, because you have to do some extra searches. In any case I think having an exact search score is usually important for analysis, but maybe not for tournament matches, so you could choose to turn this behavior on or off as needed. It only affects a few lines of code where the root search is called.

2) The algorithms is rather easy to parallelise
...
Concerning the second point the algorithms seems also very efficient because the test value is not changing as often as in MTD-f algorithms. As a consequence I see a good chance of limiting the overhead and I see a better reuse of hashtable information.

If you use the most popular parallel search algorithm (YBW), then I don't see any difference in these two MTD search methods as far as ease to do parallel search. As I said, all the search code is identical except for a few lines where the root search is called. In my parallel search all the splitting occurs below the root node, so it would not make a difference if I used MTD-f or MTD-best. MTD in general might be a little simpler to parallelise when compared to PVS search, because with PVs you have to write code to split where the search is using a non-null window and then usually in a different part of the code where the remaining moves are being searched with a null window.

-- Ed

BertTuyt · Post by **BertTuyt** » Mon Oct 06, 2008 19:24

After some time, I have time (again limited) to work on Computer Draughts. So far I allow myself only to work on the Parallel Search.
Although the program does not collapse anymore, anyway a result, I am still not satisfied.
What I dislike, but I would like to hear the experiences of others, is that i not always get the same move and/or score.
Im not sure if this is bug, or others face the same issues.

Bert

Ed Gilbert · Post by **Ed Gilbert** » Tue Oct 07, 2008 01:37

Hi Bert,

I have not noticed this with kingsrow, but I don't often compare the single threaded to parallel search results, so it might be the case and I just don't see it. It does not surprise me that the best move is sometimes different. This can be a normal consequence of taking different paths through the tree. I expect this could also change the search scores, so this by itself is not necessarily an indication of an software bug. What percentage of searches show different score values between single and multi-threaded? If you want to run an experiment, take the moves from some random game, do a search at each position with a single thread, and then do it again with the multi-threaded and record how many searches had different best moves and different scores at the same search depths. Send me the pdn so I can do the same and then we can compare results.

-- Ed

World Draughts Forum

Parallel search

Parallel search

Re: Parallel search

MTD-f or MTD-best

MTD-best