Parallel search

BertTuyt · Post by **BertTuyt** » Tue Oct 07, 2008 21:41

Ed, i use the game you forwarded me (hope you remember).

1. 34-30 20-25
2. 32-28 25x34
3. 40x29 ........

In the total game 10 moves & scores are different from the single-thread (normal) search.
In one instance, score is different but move is the same.
I expected moves to be different at equal score, as move-order is different.
But in most cases move & score is different.
As I don't understand this yet, and Im still not 10% convinced that the implementation is bug-free, i will go into details.
Im curious what others encountered so far.
And next to that, any root-cause analysis would be welcomed.

Bert

FeikeBoomstra · Post by **FeikeBoomstra** » Wed Oct 08, 2008 00:00

This is what I observed: I am using MTD with zero window.
In the single thread situation you get the best move by looking at the best move obtained in the root node using in alphabetha:

if (z>g) node->best_move = current_move

But in case of multiple threads you are not sure any more that the root node has obtained the real best move. This has to do with the z>g condition I think now.

What I did is to check whether the upperbound and the lowerbound of the target of the best move are equal.

I tried it in computer-computer games with first single thread and then multi thread for the same depth.

Ed Gilbert · Post by **Ed Gilbert** » Wed Oct 08, 2008 04:25

Bert, I didn't get a chance to run the experiment today, but I will do it before Friday.

How are you making out with the 7-piece db build? I imagine you are well into it by now.

-- Ed

BertTuyt · Post by **BertTuyt** » Wed Oct 08, 2008 22:44

Ed, unfortunately no progress.
When I encountered all kinds of problems with the parallel search, I focussed on this one as I need to have all cores.
Nevertheless, I hope to restart anytime, keep you posted.

Bert

Ed Gilbert · Post by **Ed Gilbert** » Thu Oct 09, 2008 00:50

Hi Bert,

I stopped the test after comparing 60 searches of each engine. Results:

7 different moves only
6 different scores only
2 different moves and scores

In the 8 total cases with different scores, the difference was the smallest possible in my eval.

I am not concerned about this at all, I think it is a normal consequence of alphabeta and searching the tree in a different move order. I have played enough games with the parallel search engine to be convinced that it does not make bad moves and it plays much better with 4 cores than 1 because of the deeper searches. Your results look somewhat similar to mine.

Have you tested how much improvement you get in time-to-depth using 4 cores vs 1?

-- Ed

BertTuyt · Post by **BertTuyt** » Sat Oct 18, 2008 16:05

Although I still don't trust my Multi-Core implementation for 100% (in one case I get a score difference of 0.6 Man-Value), most scores and moves are all the same, so im making progress.

I started with the. PV-Split. This implementation has one major drawback (at least my implementation), in case the PV on the root-level changes only one core will be assigned. It is obvious that in the next iteration I will try to create a work-around.
Also I use still a "huge" number of additional code in the parallel routines, for debugging, and monitoring purposes, which has a slight negative impact on speed.

Nevertheless:Results on the game Ed also used for comparison, and on a Q6600 Quad-core, 4Giga and Vista64.

Speed Nodes/Sec: 3.89M (normal algorithm 1.96M)
Time to Depth: 0.58
Node Counts: 1.16

I will try to get into this strange score i only observed once, try to improve and get the modified PV-Split working.

My guess is that with small optimizations I could reach a Time to Depth of 0.5. Also I doubt if a speedup (for 4 cores) beyond 3 is realistic.

Bert

Ed Gilbert · Post by **Ed Gilbert** » Mon Oct 20, 2008 23:58

Bert, I realized after reading your results is that you are probably using 2 search threads where I was using 4. I imagine that using 4 parallel threads might tend to give more differences between the searches.

Your one strange search result of 0.6man different from single-threaded should certainly be looked into, but it's possible that it is ok, especially with your very agressive futility pruning in the latest Damage.

-- Ed

TAILLE · Post by **TAILLE** » Tue Dec 16, 2008 21:50

Hi Bert,

BertTuyt wrote:Feike,
My question what is the best approach:
- A Master-Slave where the search-thread invites idle treads to join.
- Or a approach where the idle threads themselves are looking for work, and then take the initiative.
Bert

I studied in detail the YBW procedure and as a consequence I could be able to identify some very interesting advantages but also some drawbacks:
1) It happens that a thread may effectively be put in an "idle" state. Of course it is a pity because you do not use all the CPU available.
2) In addition, the "idle" thread may disturb the others if it waits in a loop with with lock instructions or with simply the access on a "volatile" data
3) If you split a node in the low part of the tree (Ed proposal was to keep a distance from the leaves greater than 5) it could be relatively costly
4) The efficiency of the procedure depends on the quality of the tree exploration algorithm and the quality of your evaluation function.

Just a small comment on this last point. Let'us take a position in which a strong player considers that, for the next moves, you have only 1 or 2 interesting moves (all other moves are either tactical or strategical mistakes). If you are confident with your evaluation fonction then you can look at each depth at these 2 moves and you can quickly eliminate the other moves. As a consequence the tree will become very narrow but very deep. Now what about YBW in this situation ? The answer is : it does not work because only one thread works on these interestings moves.; the other threads could just eliminate quickly the bad moves.

My conclusion is the following : the YBW works very well in a game like chess were the number of potentiel good moves at each depth is large but it is less interesting in draughts, especially if your evaluation function is efficient enough to allow reducing the number of potentiel good moves at each depth.

It is a little disturbing to know that the YBW becomes less efficient when you improve your evaluation function!

Bert, your question is of course very interesting but the answer is difficult because it depends on the way you explore the tree and the way you use the evaluation fuction to eliminate bad moves as quickly as possible.

Gérard

FeikeBoomstra · Post by **FeikeBoomstra** » Wed Dec 17, 2008 00:00

It isn't my observation that only one core is active if you narrow the search tree. As soon as a core is in the idle state, it is allocated for the next potential split point. Due to the very low overhead to check if a thread is available, this check is done on every level above a certain threshold. I use >4 as a threshold.

TAILLE · Post by **TAILLE** » Wed Dec 17, 2008 10:09

Hi Feike,

FeikeBoomstra wrote:It isn't my observation that only one core is active if you narrow the search tree. As soon as a core is in the idle state, it is allocated for the next potential split point. Due to the very low overhead to check if a thread is available, this check is done on every level above a certain threshold. I use >4 as a threshold.

I agree with you. This kind of improvement of the procedure is necessary. That means that you use not only "pure split points" but also "potential split points" in order to avoid to stay in an "idle" state.

It seems that you chose the master-slave approach. Is it true ?

Gérard

BertTuyt · Post by **BertTuyt** » Wed Dec 17, 2008 19:35

Feike, just a few questions for clarification:

Do the "masters" check for idle threads, or do the "slaves" (idle threads), poll for the "most interesting" split-point.

Next to that, if several thread offer potential splitpoints and both starting sending requests, how do you guarantee that the "most interesting" split point is chosen?

Bert

FeikeBoomstra · Post by **FeikeBoomstra** » Wed Dec 17, 2008 20:07

Hi Bert,

we are not talking about loosely coupled systems, it is 4 cores with shared memory. In my implementation (just copied glaurung) the master tests if there are idle threads as simple as:
for each thread:
if (!(thread.running))

So yes, the master tries to find slaves, if not available he goes to the next depth.
I don't do any qualifications of a split-point, so I do the left-most-tree first in a single thread, and after reaching sufficient distance from the leaves I start splitting.

If more masters are competing for a slave, just the first wins. So you need a lock around this action. As I suppose most of the times there is no thread available, maybe you can first have a lookup, if successful then lock and try again.

Hope this answers your questions.

FeikeBoomstra · Post by **FeikeBoomstra** » Wed Dec 17, 2008 22:07

update: it is implemented this way. Checking for availability is without a lock, claiming is (of course) within the critical region.

Ed Gilbert · Post by **Ed Gilbert** » Thu Dec 18, 2008 14:17

BertTuyt wrote:Do the "masters" check for idle threads, or do the "slaves" (idle threads), poll for the "most interesting" split-point.

Next to that, if several thread offer potential splitpoints and both starting sending requests, how do you guarantee that the "most interesting" split point is chosen?

At the start of a new search there is only one search thread, and the others are spinning on locks waiting for work. When the search arrives at a node which has remaining depth >= some threshold (I use 4 ply) and there is more than one remaining move to be searched, then a split point is created at that node. The original search thread becomes the master of that split point. It may engage several idle threads to start searching the remaining unsearched moves from this node. In kingsrow I limit the number of threads working on a split point to 2, as this seemed to help slightly in my tests. As the threads descend into the search tree, each node visited is potentially a candidate for being another split point, if it meets the remaining depth and number of moves criteria mentioned. So a thread that was a slave at some predecessor node can also become a master at some lower node in the tree. Similarly, if the master returns to its split node and finds that there are no more moves to be searched but that the slave thread is still searching its last move, the master can become a slave at some successor position to that split node -- but only in that subtree, it cannot be a slave to any position not a successor of the original split point.

One other small point: when the search is completely idle, I set all the search threads to block waiting on a semaphore (instead of spinning on a spinlock), so they don't waste cpu resources when not searching.

-- Ed

TAILLE · Post by **TAILLE** » Thu Dec 18, 2008 16:37

Hi Ed,

Ed Gilbert wrote:
BertTuyt wrote: When the search arrives at a node which has remaining depth >= some threshold (I use 4 ply) and there is more than one remaining move to be searched, then a split point is created at that node.

That was my understanding of the original procedure. Correct me if I am wrong : if each node of the tree has at a maximum 2 successors, does that mean that only one thread is running ?

Of course it is another problem to know if it could be realistic to consider that a node has no more than 2 successors. My question is only to verify my understanding.

Gérard