Parallel Panic

BertTuyt · Post by **BertTuyt** » Wed Aug 18, 2010 15:07

After completing the 7p DB, i started with the debugging of the parallel code.
I basically implemented the code already 1 year ago, but so far i focused more on generating the 7P DB then full debugging the parallel algorithms.

As some people already have mentioned before, if programming a draughts engine is already a challence, programming a parallel search is a nightmare.
Although there are good examples available to learn from ( i really recommend all to browse the Glaurung, or nowadays Stockfish code), debugging is not straightforward.

In normal cases when you restart a sequential (bug-free) search process and cleared all tables (like hash-table history-table whatever) the outcome is always the same (number nodes , score and best move).
In the case of a parallel search not only every time the number of nodes is different , but even the score and best move could be different (this has already been mentioned in previous posts in this forum).

Therefore I decided to use a different tree and search-algorithm to test if the parallel implementation is "bug-free", in my case I based all on the YBWC (Young Brothers wait Concept).

The search I used for testing is the minmax-search where i search to a given depth and stop the search when the side to move has no capture (if the opposite does have a capture option, i still stop). And if the movecount is 1, i don't 'count" this as an additional ply.
With a "slightly" modified perft i calculated the end -nodes of these search-trees, see table below (maybe some-one can check this, if the count is OK).

Code: Select all

Ply    Number of End-Nodes
7      4,202,204
8      37,413,597
9      351,706,838
10     3,396,182,697

I modified my search-algorithm (PVS) to also enable minmax search, and i added on-off switches to all features like hash-tables, history-tables, pruning options, extension mechanisms..........

In this way I'm sure that the search-tree is always the same.
Although not relevant for normal game-play, the advantage is that whatever parallel implementation is used, the number of nodes should be always the same.
It is also my assumption that the resulting speed-up is most likely a upper-limit for the "real" case, as the parallel search in general searches bigger trees.

The good news is, that after some initial fail-starts, the parallel code now examines exactly the same tree, and node-count is always the same as the sequential version (note this doesn't mean that the implementation is completely bug-free, as i (for example) don't test the abort-mechanism in case at a splitpoint one of the threads find a beta-cutoff).
In the table below , I have provided an overview of search-times (in seconds) as a function of number of threads/cores used (C1 - C4).

Code: Select all

Ply      Time     C1        C2       C3       C4
7                 4,99      3,23     2,65     2,74
8                 42,26     23,09    16,94    16,43
9                 383,07    198,67   160,43   123,75
10                3606,93   1871,99  1469,43  1087,12

In the next table I provide speed-up factors based on the Core 1 timing:

Code: Select all

Ply     Speed     C1       C2       C3       C4
7                 1,00     1,54     1,88     1,82
8                 1,00     1,83     2,49     2,57
9                 1,00     1,93     2,39     3,10
10                1,00     1,93     2,45     3,32

So far I did the test with a constant minimal splitdepth setting.
So there might be optimization possible.
If the splidepth parameter is to small the number of splitpoints will increase, but the search-effort for every thread could be so little (as trees as small, with limited node-count), that the overhead is far bigger then the gain due to more cores working on the search-tree.
On the other hand if the splitdepth is large, the number of splitpoints are small, so there might be not enough work for all threads involved, and the cores are most of the time "idle".

It might be that more speed-up is possible (in the current implementation , i don't do splitting at the root-level, as my root-code is slightly different, so i did not implement this yet).

But anyway the upper-bound for speed-up for my (i740 2.93 Ghz Quad-core system) seems to be 3,3.

As i have a holiday i most likely post more results in the next days.
In the mean time I really like some feedback from the other draughts programming on this forum.

* Do you get the same Perft numbers for the trees as I described here.
* What parallel search-algorithm do you use.
* I guess that Ed and Gerard use 8-core systems (based on 2 processors), so do you also observe diminishing returns when using more cores? (this is interesting as most likely next year Intel will go to 8-core processors (think they already have one for server-systems), and there are already processors available for dedicated testers containing more then 30 cores)
* Do you have good examples, suggestions for parallel-debugging, or whatever....

If you have any questions, or whatever, please let me know.

Keep you all posted....

Bert

BertTuyt · Post by **BertTuyt** » Thu Aug 19, 2010 00:21

As my processor has a hyper-threading option I can increase the number of threads to 8 (4 real cores and 4 virtual cores). My OS detects all 8 cores.
I run the minmax test also with cores 5 - 8.
The resulted speed-increase (again with the 1 core algorithm as a basis) for a 10-ply minmax tree , is depicted in the next table.

Code: Select all

Although the relatively speed-increase for the hyper-threaded virtual cores is less (when you make a graph of this data you clearly see the 2 core intervals), it is overall still an interesting +/- 20% improvement (for free

)

Is there anybody out there with a 6-core CPU, then we can continue till 12 threads.

Bert

Ed Gilbert · Post by **Ed Gilbert** » Thu Aug 19, 2010 00:43

Bert,

I don't have any better suggestions for debugging and testing than what you have already done. The only speedup numbers I have for up to 8 cores are these from the kingsrow help file. They are search speed at the game start position to depth 18 ply.

Code: Select all

Search threads  Search speed in KN/sec
1                     2827
2                     5276
4                     9890
6                    13734
8                    17129

-- Ed

BertTuyt · Post by **BertTuyt** » Thu Aug 19, 2010 01:13

Ed, I assume the stats are from your 2-processor system with 8 real cores???

Bert

Ed Gilbert · Post by **Ed Gilbert** » Thu Aug 19, 2010 03:53

BertTuyt wrote:Ed, I assume the stats are from your 2-processor system with 8 real cores???

Yes.

BertTuyt · Post by **BertTuyt** » Thu Aug 19, 2010 14:48

Shit happens.
Whereas the minmax search performed according to expectations, the first test with alpha-beta reveals a problem.
Although with all options/mechanisms (other then plain alpha-beta) switched of, which should yield always the same score , but sometimes with a different move(this can happen in parallel search as the search of the tree can be sometimes a little different) but then the score should be the same.

I can not quantify the differences (will do a separate test later) but i already found as an outcome of the search a different move with a different score as the same move with a different score.
So it looks like a race-condition for failure, but will find out.
Anyway will keep all here posted, maybe I can find something which will also help others who are digging in the parallel mud.

Bert

TAILLE · Post by **TAILLE** » Thu Aug 19, 2010 14:54

Hi Bert,

BertTuyt wrote:In normal cases when you restart a sequential (bug-free) search process and cleared all tables (like hash-table history-table whatever) the outcome is always the same (number nodes , score and best move).
In the case of a parallel search not only every time the number of nodes is different , but even the score and best move could be different (this has already been mentioned in previous posts in this forum).

Therefore I decided to use a different tree and search-algorithm to test if the parallel implementation is "bug-free", in my case I based all on the YBWC (Young Brothers wait Concept).
Bert

For me parallel implementation specific problems are typically the access in read/write mode to common ressources like the hash table, the endgame db cache, the harddisk, the splitpoints or whatever shared variable.

It is true that when using a parallel implementation you can get different results when restarting the process : the score and best move could be different. But in my view this is not due to the parallel implementation itself. It is only a natural consequence of the use of a hashtable when the moves can be searched in different order.
I am pretty sure that you can also obtained different resultats by using only one thread but by simply modifying your move generator in order to change at random the order of the generated moves.
If you switch off your hashtable I guess then you will obtain always the same results (score and best move).

BertTuyt wrote: programming a draughts engine is already a challence, programming a parallel search is a nightmare
Bert

I do not feel that programming a parallel search is a nightmare.
What is a nightmare for me is to be able to use efficiently a hastable. I have in mind especially the two following points:
What about a position already reached but with different parameters (depth) ?
What about using the hashtable when draw by repetition is encountered ?
This last point needs an example just to show the point that can cause difficulties

diag.1

Black to move

Suppose the search begins by 06-28 01-40 28-19 40-35 19-41 30-24 41-46
diag.2

White to move

Suppose in the above position that you begin by analysing all king moves and then the men moves. What may happen ?
On any white king move the black king can wait on the long diagonal (avaiding square 46) and wait for a move of the white king again on square 35. As soon as white king moves on square 35 the black king moves on square 46 in order to build repetition. As a consequence the search algorithm can no more find a white win due to this loop opportunity. Of course after having demonstrated that any white king move in the above diagram lead only to a loop you will try a man move and discover the win.
Now is the point : while failing finding a win by a white king move from diag.2 what will you store in the hashtable? Suppose you mark in the hashtable that the following position (and thousands of others) is a draw or at least is not a win even with very large depth! You can see that the hashtable values are not correct.

White to play

Now return to the diag.1
06-28 have been proved to be a losing move but what about 06-39 01-34 39-28 ? We all know that this sequence is a loss for black but due to incorrectness of the hashtable you can see that white cannot reach diag.2 above while avoiding all positions incorrectly populated.

BertTuyt · Post by **BertTuyt** » Thu Aug 19, 2010 15:22

Gerard, thanks for your reply.
In my case i even have switched off the hash-table for the initial alpha-beta test so i guess that in this situation at least the score should be the same.
I recognize the issues related to draw and hash-table, i didn't pay much attention, but on my things to do list is also to study (and find solutions) to the issues related to this.

Last but not least, for curiosity.
As I remember you also have a dual-processor system (i forgot if this was with 2 quad or 2 dual-core processors).
What type of parallel implementation did you use/choose?
And do you have some figures for speed-up as a function of number of threads?

I'm interested to see how parallel search scales (or non-scaling?) as a function of the number of threads.
Especially as the frequency of processors seems to stabilize around 3 Ghz - 4 Ghz (max), and therefore the processor improvement is related to core-increase.

I don't know if your processor has hyper-threading, but if this is the case and you have 2 quad-processors your system could recognize 16 virtual processors.
I'm really interested what this would bring to parallel search.

Bert

Rein Halbersma · Post by **Rein Halbersma** » Thu Aug 19, 2010 19:39

TAILLE wrote: What is a nightmare for me is to be able to use efficiently a hastable. I have in mind especially the two following points:
What about a position already reached but with different parameters (depth) ?
What about using the hashtable when draw by repetition is encountered ?

Hi Gérard,

I'm not as advanced as you and Bert with parallel stuff, but I agree on the nightmares that a hash table can give with repetitions and issues with different depths

There is a PhD thesis by Akihiro Kishimoto on the web http://www.is.titech.ac.jp/~kishi/pdf_f ... thesis.pdf
that develops an algorithm that stores a hash key of the path that reached the position. Ed and I discussed this a year ago in emails but we couldn't fully understand how it should be implemented. I asked for Kishimoto's source code recently but he didn't have it or wasn't allowed to share it (it was part of the Chinook code to proof the solution of checkers).

In any case, there is a much older paper by Murray Campbell (of Deep Blue) that gives the following shortcut:

While within a draw cycle, the following set of rules can be used to store values in the hash table’:
1) values less than the draw value can be stored, but only as upper bounds
2) values greater than the draw value can be stored, but only as lower bounds
3) values equal to the draw score can never be safely stored
4) if a value to be stored is supposed to be a (say) lower bound, but the above rules state it is an upper bound, the value cannot be safely used

This requires that you have already detected that you are in a draw cycle (what is called the "draw-first" case) and that you propagate this upwards in your alpha-beta function. The other case ("draw-last") is impossible to detect unless you use a path encoded hash key. According to Campbell, iterative deepening eliminates much of the errors in practice.

Rein

TAILLE · Post by **TAILLE** » Thu Aug 19, 2010 23:13

BertTuyt wrote: As I remember you also have a dual-processor system (i forgot if this was with 2 quad or 2 dual-core processors).
Bert

Yes I have a dual-processor system : 2 x xeon E5530 2.4GHz

BertTuyt wrote: What type of parallel implementation did you use/choose?
Bert

My implementation is based on YBWC but I introduced some personnal functionnalities in order to have a parallel implementation working as efficiently as possible with my basic search algorithm which is a kind of mixture of MTD-f and MTD-best algorithms.

BertTuyt wrote: And do you have some figures for speed-up as a function of number of threads?
I'm interested to see how parallel search scales (or non-scaling?) as a function of the number of threads.
Especially as the frequency of processors seems to stabilize around 3 Ghz - 4 Ghz (max), and therefore the processor improvement is related to core-increase.
Bert

I made some measurement almost one years ago but I do not remenber the exact figures.

BertTuyt wrote: I don't know if your processor has hyper-threading, but if this is the case and you have 2 quad-processors your system could recognize 16 virtual processors.
I'm really interested what this would bring to parallel search.
Bert

Yes Bert I try also to use hyperthreading. If I remember correctly the gain between 8 threads and 16 threads with hyperthreading were something like 15 or 20%. A point to note with the windows task manager with 16 threads : 8 threads was running at 100% and 8 other threads was running at only 20%.

MichelG · Post by **MichelG** » Fri Aug 27, 2010 23:37

Dragon uses multithreading, but only using a very simple algorithm. It's only splitting the root node, resulting in very poor speedup:

Code: Select all

Cores Speed
1     1,00
2     1,54
4     1,96
6     2,12
6+ht  2.09

(based on average time to reach a certain level in 50 test-positions)

My next project is to make some major improvements on this, and i am trying to choose an efficient algorithm.

What did make you choose the YBWC algorithm? Did you consider other algorithms, such as ABDADA, APHID or DTS?

I agree multithreading is very difficult to get right, especially with things like hash-tables and tree-culling, you get different results every run. I don't think you can do much about it, other that just play matches between your single core and multicore version. If the singlecore program wins, you have too many bugs

.

Michel

BertTuyt wrote:As my processor has a hyper-threading option I can increase the number of threads to 8 (4 real cores and 4 virtual cores). My OS detects all 8 cores.
I run the minmax test also with cores 5 - 8.
The resulted speed-increase (again with the 1 core algorithm as a basis) for a 10-ply minmax tree , is depicted in the next table.

...

Is there anybody out there with a 6-core CPU, then we can continue till 12 threads.

Bert

BertTuyt · Post by **BertTuyt** » Sat Aug 28, 2010 14:18

Michel,

I will react on your questions related to the choice for a specific algorithm somewhat later.
From your post it appear to me that you have a six-core processor?
Is this the "ultra-expensive" i7 980X, or another "animal"?

Do you have a speed comparison (nodes/second) compared to other processors which you used?

Bert

MichelG · Post by **MichelG** » Sat Aug 28, 2010 23:09

That's the one i have. The combination of high core count, high clock speed and hyperthreading makes it very fast compared to my other systems

.

As a test, i used the endgame database computation of 4 kings vs 2 kings. i get the following:

Code: Select all

system                      timeToSolve      Solved Positions(milions/sec)
2 duo 7450          2.13 Ghz     765.5       0.62
4 cores  6700       2.13 Ghz     454         1.05
4 cores  6700       2.66 Ghz     404        
6 core i980:        3.4 Ghz      134.3       3.55

I am not sure about the 4 core clock speed though, as it should be 2.66 but for some reason it seems to run at 2.13. I'll look into it later.

BertTuyt wrote:Michel,

I will react on your questions related to the choice for a specific algorithm somewhat later.
From your post it appear to me that you have a six-core processor?
Is this the "ultra-expensive" i7 980X, or another "animal"?

Do you have a speed comparison (nodes/second) compared to other processors which you used?

Bert

BertTuyt · Post by **BertTuyt** » Mon Aug 30, 2010 00:10

Michel,

basically my choice for YBWC is mainly based on popularity and availability of information.
For example the Glaurung program and later Stockfish have their source code available for everyone, and it is also supported by many comments, so quite self-explaining (although i expect that there was an initial flaw in the Glaurung YBWC, which was corrected by Stockfish later, and I felt into the same trap

).

I also had a look a DTS , which for me was to complicated to start with.
The other 2 APHID and ABDADA (though this was based on a shared hash-table , but I'm not 100% sure) i didn't study in detail.

Also , as in your case, I started by distributing work on root-level, which is not really an optimal way of using resources.

I also started like you, if the parallel version looses to much there might be something wrong with the implementation. But also here i took a different approach in the end.
Due to the nature of parallel search node count is always different, and even score and moves could be different.

To have more insight in what is really happening, and trying to include some constant factors, I am now working on 2 simple search-trees.
The traditional minimax, where in the end the node count and the score should be the same (if 2 moves result in equal scores, there could be differences in a parallel implementation).

Keep in mind that I switched off everything in my program like hash-table, extensions, move-ordering and pruning.

Next to that I also built very simple trees, based on multiple NULL-MOVES.
So i can set the width (W) of every node (number of moves) and also the depth (D) to search.

In this way i always end with the initial position (basically all positions are the same, but the search does not know). In this way I get perfect ordered trees, and I'm able to use the well known formulas for the number of end-nodes N (different for even search depth : N= 2 * W^(D/2) - 1 and odd : N = W^(D+1)/2 + W^(D-1)/2 - 1 ).

Also here the YBWC algorithm will always find the same number of nodes !!
I have this working now, and it is always a pleasure after many calculations, with 1 - 4 cores, that the end-node count is always correct.
At least this gives me confidence that at least a part of the algorithm is correct (with this approach not everything is tested, like an abort in case of a beta-cut).

I will post (maybe next week, as holiday is over) some results for these perfect ordered trees.
At least it gives some indication about the stability of the algorithm, and also the speed-increase obtained as a function of cores/threads.
It is also a perfect testbed for optimization.

In a later stage I will add hash-tables, etc.. and measure / evaluate the impact.
And so step by step i will re-activate my program.

Bert

Ed Gilbert · Post by **Ed Gilbert** » Mon Aug 30, 2010 13:13

Bert, what is the bug that you found in the Glaurung parallel search?

-- Ed

World Draughts Forum

Parallel Panic

Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic

Re: Parallel Panic