
Parallel search
Ed, i use the game you forwarded me (hope you remember).
1. 34-30 20-25
2. 32-28 25x34
3. 40x29 ........
In the total game 10 moves & scores are different from the single-thread (normal) search.
In one instance, score is different but move is the same.
I expected moves to be different at equal score, as move-order is different.
But in most cases move & score is different.
As I don't understand this yet, and Im still not 10% convinced that the implementation is bug-free, i will go into details.
Im curious what others encountered so far.
And next to that, any root-cause analysis would be welcomed.
Bert
1. 34-30 20-25
2. 32-28 25x34
3. 40x29 ........
In the total game 10 moves & scores are different from the single-thread (normal) search.
In one instance, score is different but move is the same.
I expected moves to be different at equal score, as move-order is different.
But in most cases move & score is different.
As I don't understand this yet, and Im still not 10% convinced that the implementation is bug-free, i will go into details.
Im curious what others encountered so far.
And next to that, any root-cause analysis would be welcomed.
Bert
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
This is what I observed: I am using MTD with zero window.
In the single thread situation you get the best move by looking at the best move obtained in the root node using in alphabetha:
if (z>g) node->best_move = current_move
But in case of multiple threads you are not sure any more that the root node has obtained the real best move. This has to do with the z>g condition I think now.
What I did is to check whether the upperbound and the lowerbound of the target of the best move are equal.
I tried it in computer-computer games with first single thread and then multi thread for the same depth.
In the single thread situation you get the best move by looking at the best move obtained in the root node using in alphabetha:
if (z>g) node->best_move = current_move
But in case of multiple threads you are not sure any more that the root node has obtained the real best move. This has to do with the z>g condition I think now.
What I did is to check whether the upperbound and the lowerbound of the target of the best move are equal.
I tried it in computer-computer games with first single thread and then multi thread for the same depth.
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Hi Bert,
I stopped the test after comparing 60 searches of each engine. Results:
7 different moves only
6 different scores only
2 different moves and scores
In the 8 total cases with different scores, the difference was the smallest possible in my eval.
I am not concerned about this at all, I think it is a normal consequence of alphabeta and searching the tree in a different move order. I have played enough games with the parallel search engine to be convinced that it does not make bad moves and it plays much better with 4 cores than 1 because of the deeper searches. Your results look somewhat similar to mine.
Have you tested how much improvement you get in time-to-depth using 4 cores vs 1?
-- Ed
I stopped the test after comparing 60 searches of each engine. Results:
7 different moves only
6 different scores only
2 different moves and scores
In the 8 total cases with different scores, the difference was the smallest possible in my eval.
I am not concerned about this at all, I think it is a normal consequence of alphabeta and searching the tree in a different move order. I have played enough games with the parallel search engine to be convinced that it does not make bad moves and it plays much better with 4 cores than 1 because of the deeper searches. Your results look somewhat similar to mine.
Have you tested how much improvement you get in time-to-depth using 4 cores vs 1?
-- Ed
Although I still don't trust my Multi-Core implementation for 100% (in one case I get a score difference of 0.6 Man-Value), most scores and moves are all the same, so im making progress.
I started with the. PV-Split. This implementation has one major drawback (at least my implementation), in case the PV on the root-level changes only one core will be assigned. It is obvious that in the next iteration I will try to create a work-around.
Also I use still a "huge" number of additional code in the parallel routines, for debugging, and monitoring purposes, which has a slight negative impact on speed.
Nevertheless:Results on the game Ed also used for comparison, and on a Q6600 Quad-core, 4Giga and Vista64.
Speed Nodes/Sec: 3.89M (normal algorithm 1.96M)
Time to Depth: 0.58
Node Counts: 1.16
I will try to get into this strange score i only observed once, try to improve and get the modified PV-Split working.
My guess is that with small optimizations I could reach a Time to Depth of 0.5. Also I doubt if a speedup (for 4 cores) beyond 3 is realistic.
Bert
I started with the. PV-Split. This implementation has one major drawback (at least my implementation), in case the PV on the root-level changes only one core will be assigned. It is obvious that in the next iteration I will try to create a work-around.
Also I use still a "huge" number of additional code in the parallel routines, for debugging, and monitoring purposes, which has a slight negative impact on speed.
Nevertheless:Results on the game Ed also used for comparison, and on a Q6600 Quad-core, 4Giga and Vista64.
Speed Nodes/Sec: 3.89M (normal algorithm 1.96M)
Time to Depth: 0.58
Node Counts: 1.16
I will try to get into this strange score i only observed once, try to improve and get the modified PV-Split working.
My guess is that with small optimizations I could reach a Time to Depth of 0.5. Also I doubt if a speedup (for 4 cores) beyond 3 is realistic.
Bert
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Bert, I realized after reading your results is that you are probably using 2 search threads where I was using 4. I imagine that using 4 parallel threads might tend to give more differences between the searches.
Your one strange search result of 0.6man different from single-threaded should certainly be looked into, but it's possible that it is ok, especially with your very agressive futility pruning in the latest Damage.
-- Ed
Your one strange search result of 0.6man different from single-threaded should certainly be looked into, but it's possible that it is ok, especially with your very agressive futility pruning in the latest Damage.
-- Ed
Hi Bert,
1) It happens that a thread may effectively be put in an "idle" state. Of course it is a pity because you do not use all the CPU available.
2) In addition, the "idle" thread may disturb the others if it waits in a loop with with lock instructions or with simply the access on a "volatile" data
3) If you split a node in the low part of the tree (Ed proposal was to keep a distance from the leaves greater than 5) it could be relatively costly
4) The efficiency of the procedure depends on the quality of the tree exploration algorithm and the quality of your evaluation function.
Just a small comment on this last point. Let'us take a position in which a strong player considers that, for the next moves, you have only 1 or 2 interesting moves (all other moves are either tactical or strategical mistakes). If you are confident with your evaluation fonction then you can look at each depth at these 2 moves and you can quickly eliminate the other moves. As a consequence the tree will become very narrow but very deep. Now what about YBW in this situation ? The answer is : it does not work because only one thread works on these interestings moves.; the other threads could just eliminate quickly the bad moves.
My conclusion is the following : the YBW works very well in a game like chess were the number of potentiel good moves at each depth is large but it is less interesting in draughts, especially if your evaluation function is efficient enough to allow reducing the number of potentiel good moves at each depth.
It is a little disturbing to know that the YBW becomes less efficient when you improve your evaluation function!
Bert, your question is of course very interesting but the answer is difficult because it depends on the way you explore the tree and the way you use the evaluation fuction to eliminate bad moves as quickly as possible.
Gérard
I studied in detail the YBW procedure and as a consequence I could be able to identify some very interesting advantages but also some drawbacks:BertTuyt wrote:Feike,
My question what is the best approach:
- A Master-Slave where the search-thread invites idle treads to join.
- Or a approach where the idle threads themselves are looking for work, and then take the initiative.
Bert
1) It happens that a thread may effectively be put in an "idle" state. Of course it is a pity because you do not use all the CPU available.
2) In addition, the "idle" thread may disturb the others if it waits in a loop with with lock instructions or with simply the access on a "volatile" data
3) If you split a node in the low part of the tree (Ed proposal was to keep a distance from the leaves greater than 5) it could be relatively costly
4) The efficiency of the procedure depends on the quality of the tree exploration algorithm and the quality of your evaluation function.
Just a small comment on this last point. Let'us take a position in which a strong player considers that, for the next moves, you have only 1 or 2 interesting moves (all other moves are either tactical or strategical mistakes). If you are confident with your evaluation fonction then you can look at each depth at these 2 moves and you can quickly eliminate the other moves. As a consequence the tree will become very narrow but very deep. Now what about YBW in this situation ? The answer is : it does not work because only one thread works on these interestings moves.; the other threads could just eliminate quickly the bad moves.
My conclusion is the following : the YBW works very well in a game like chess were the number of potentiel good moves at each depth is large but it is less interesting in draughts, especially if your evaluation function is efficient enough to allow reducing the number of potentiel good moves at each depth.
It is a little disturbing to know that the YBW becomes less efficient when you improve your evaluation function!
Bert, your question is of course very interesting but the answer is difficult because it depends on the way you explore the tree and the way you use the evaluation fuction to eliminate bad moves as quickly as possible.
Gérard
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
It isn't my observation that only one core is active if you narrow the search tree. As soon as a core is in the idle state, it is allocated for the next potential split point. Due to the very low overhead to check if a thread is available, this check is done on every level above a certain threshold. I use >4 as a threshold.
Hi Feike,
It seems that you chose the master-slave approach. Is it true ?
Gérard
I agree with you. This kind of improvement of the procedure is necessary. That means that you use not only "pure split points" but also "potential split points" in order to avoid to stay in an "idle" state.FeikeBoomstra wrote:It isn't my observation that only one core is active if you narrow the search tree. As soon as a core is in the idle state, it is allocated for the next potential split point. Due to the very low overhead to check if a thread is available, this check is done on every level above a certain threshold. I use >4 as a threshold.
It seems that you chose the master-slave approach. Is it true ?
Gérard
Feike, just a few questions for clarification:
Do the "masters" check for idle threads, or do the "slaves" (idle threads), poll for the "most interesting" split-point.
Next to that, if several thread offer potential splitpoints and both starting sending requests, how do you guarantee that the "most interesting" split point is chosen?
Bert
Do the "masters" check for idle threads, or do the "slaves" (idle threads), poll for the "most interesting" split-point.
Next to that, if several thread offer potential splitpoints and both starting sending requests, how do you guarantee that the "most interesting" split point is chosen?
Bert
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
Hi Bert,
we are not talking about loosely coupled systems, it is 4 cores with shared memory. In my implementation (just copied glaurung) the master tests if there are idle threads as simple as:
for each thread:
if (!(thread.running))
So yes, the master tries to find slaves, if not available he goes to the next depth.
I don't do any qualifications of a split-point, so I do the left-most-tree first in a single thread, and after reaching sufficient distance from the leaves I start splitting.
If more masters are competing for a slave, just the first wins. So you need a lock around this action. As I suppose most of the times there is no thread available, maybe you can first have a lookup, if successful then lock and try again.
Hope this answers your questions.
we are not talking about loosely coupled systems, it is 4 cores with shared memory. In my implementation (just copied glaurung) the master tests if there are idle threads as simple as:
for each thread:
if (!(thread.running))
So yes, the master tries to find slaves, if not available he goes to the next depth.
I don't do any qualifications of a split-point, so I do the left-most-tree first in a single thread, and after reaching sufficient distance from the leaves I start splitting.
If more masters are competing for a slave, just the first wins. So you need a lock around this action. As I suppose most of the times there is no thread available, maybe you can first have a lookup, if successful then lock and try again.
Hope this answers your questions.
- FeikeBoomstra
- Posts: 306
- Joined: Mon Dec 19, 2005 16:48
- Location: Emmen
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
At the start of a new search there is only one search thread, and the others are spinning on locks waiting for work. When the search arrives at a node which has remaining depth >= some threshold (I use 4 ply) and there is more than one remaining move to be searched, then a split point is created at that node. The original search thread becomes the master of that split point. It may engage several idle threads to start searching the remaining unsearched moves from this node. In kingsrow I limit the number of threads working on a split point to 2, as this seemed to help slightly in my tests. As the threads descend into the search tree, each node visited is potentially a candidate for being another split point, if it meets the remaining depth and number of moves criteria mentioned. So a thread that was a slave at some predecessor node can also become a master at some lower node in the tree. Similarly, if the master returns to its split node and finds that there are no more moves to be searched but that the slave thread is still searching its last move, the master can become a slave at some successor position to that split node -- but only in that subtree, it cannot be a slave to any position not a successor of the original split point.BertTuyt wrote:Do the "masters" check for idle threads, or do the "slaves" (idle threads), poll for the "most interesting" split-point.
Next to that, if several thread offer potential splitpoints and both starting sending requests, how do you guarantee that the "most interesting" split point is chosen?
One other small point: when the search is completely idle, I set all the search threads to block waiting on a semaphore (instead of spinning on a spinlock), so they don't waste cpu resources when not searching.
-- Ed
Hi Ed,
Of course it is another problem to know if it could be realistic to consider that a node has no more than 2 successors. My question is only to verify my understanding.
Gérard
That was my understanding of the original procedure. Correct me if I am wrong : if each node of the tree has at a maximum 2 successors, does that mean that only one thread is running ?Ed Gilbert wrote:BertTuyt wrote: When the search arrives at a node which has remaining depth >= some threshold (I use 4 ply) and there is more than one remaining move to be searched, then a split point is created at that node.
Of course it is another problem to know if it could be realistic to consider that a node has no more than 2 successors. My question is only to verify my understanding.
Gérard