I basically implemented the code already 1 year ago, but so far i focused more on generating the 7P DB then full debugging the parallel algorithms.
As some people already have mentioned before, if programming a draughts engine is already a challence, programming a parallel search is a nightmare.
Although there are good examples available to learn from ( i really recommend all to browse the Glaurung, or nowadays Stockfish code), debugging is not straightforward.
In normal cases when you restart a sequential (bug-free) search process and cleared all tables (like hash-table history-table whatever) the outcome is always the same (number nodes , score and best move).
In the case of a parallel search not only every time the number of nodes is different , but even the score and best move could be different (this has already been mentioned in previous posts in this forum).
Therefore I decided to use a different tree and search-algorithm to test if the parallel implementation is "bug-free", in my case I based all on the YBWC (Young Brothers wait Concept).
The search I used for testing is the minmax-search where i search to a given depth and stop the search when the side to move has no capture (if the opposite does have a capture option, i still stop). And if the movecount is 1, i don't 'count" this as an additional ply.
With a "slightly" modified perft i calculated the end -nodes of these search-trees, see table below (maybe some-one can check this, if the count is OK).
Code: Select all
Ply Number of End-Nodes
7 4,202,204
8 37,413,597
9 351,706,838
10 3,396,182,697
In this way I'm sure that the search-tree is always the same.
Although not relevant for normal game-play, the advantage is that whatever parallel implementation is used, the number of nodes should be always the same.
It is also my assumption that the resulting speed-up is most likely a upper-limit for the "real" case, as the parallel search in general searches bigger trees.
The good news is, that after some initial fail-starts, the parallel code now examines exactly the same tree, and node-count is always the same as the sequential version (note this doesn't mean that the implementation is completely bug-free, as i (for example) don't test the abort-mechanism in case at a splitpoint one of the threads find a beta-cutoff).
In the table below , I have provided an overview of search-times (in seconds) as a function of number of threads/cores used (C1 - C4).
Code: Select all
Ply Time C1 C2 C3 C4
7 4,99 3,23 2,65 2,74
8 42,26 23,09 16,94 16,43
9 383,07 198,67 160,43 123,75
10 3606,93 1871,99 1469,43 1087,12
Code: Select all
Ply Speed C1 C2 C3 C4
7 1,00 1,54 1,88 1,82
8 1,00 1,83 2,49 2,57
9 1,00 1,93 2,39 3,10
10 1,00 1,93 2,45 3,32
So there might be optimization possible.
If the splidepth parameter is to small the number of splitpoints will increase, but the search-effort for every thread could be so little (as trees as small, with limited node-count), that the overhead is far bigger then the gain due to more cores working on the search-tree.
On the other hand if the splitdepth is large, the number of splitpoints are small, so there might be not enough work for all threads involved, and the cores are most of the time "idle".
It might be that more speed-up is possible (in the current implementation , i don't do splitting at the root-level, as my root-code is slightly different, so i did not implement this yet).
But anyway the upper-bound for speed-up for my (i740 2.93 Ghz Quad-core system) seems to be 3,3.
As i have a holiday i most likely post more results in the next days.
In the mean time I really like some feedback from the other draughts programming on this forum.
* Do you get the same Perft numbers for the trees as I described here.
* What parallel search-algorithm do you use.
* I guess that Ed and Gerard use 8-core systems (based on 2 processors), so do you also observe diminishing returns when using more cores? (this is interesting as most likely next year Intel will go to 8-core processors (think they already have one for server-systems), and there are already processors available for dedicated testers containing more then 30 cores)
* Do you have good examples, suggestions for parallel-debugging, or whatever....
If you have any questions, or whatever, please let me know.
Keep you all posted....
Bert