Parallel search

Discussion about the development of draughts in the age of computers and the Internet.
Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Thu Dec 18, 2008 17:10

TAILLE wrote:That was my understanding of the original procedure. Correct me if I am wrong: if each node of the tree has at most 2 successors, does that mean that only one thread is running?
Yes, because a split point is not created until the search is done on the first move, and then if there is only one other move there is no need for another thread.
Of course, it is another question whether it is realistic to assume that a node has no more than 2 successors. My question is only to verify my understanding.
I guess if you only calculate the moves incrementally this might be a problem. In my case I build a movelist of all the moves before searching any of them. Actually, even if you do it incrementally with a function like get_next_move(), you could just call this function twice after the first move is searched to see if the node can be split. It would be a little messier though, as you have to consider that maybe you cannot split for some other reason, and then you might want to 'push back' the last move calculated so that get_next_move() produces it again as you iterate through them.
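The split test Ed describes can be sketched roughly as below, assuming a precomputed movelist. All names here (MoveList, can_split, the Move struct) are illustrative, not taken from kingsrow or any actual engine:

```cpp
#include <vector>

// Hypothetical move type; a real engine would pack squares and captures.
struct Move { int from, to; };

struct MoveList {
    std::vector<Move> moves;   // all moves generated up front
    int next = 0;              // index of the next move to search
    Move get_next_move() { return moves[next++]; }
    void push_back_last() { --next; }  // "un-get" the last move produced
};

// Young Brothers Wait: a node may only be split after the first move
// has been fully searched, and only if at least two moves remain --
// with a single remaining move there is no work for a second thread.
bool can_split(const MoveList& ml, int moves_searched) {
    int remaining = static_cast<int>(ml.moves.size()) - ml.next;
    return moves_searched >= 1 && remaining >= 2;
}
```

With a full movelist the remaining-move count is a simple subtraction; with an incremental generator, `push_back_last()` stands in for the "pushback" Ed mentions.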

-- Ed

FeikeBoomstra
Posts: 306
Joined: Mon Dec 19, 2005 16:48
Location: Emmen

Post by FeikeBoomstra » Thu Dec 18, 2008 17:22

Thanks Ed,

Just by reading your answer, I realized I overlooked something. I was testing on nr_of_valid_moves > 1, but you are right, you have to test on nr_of_valid_moves - actual_move_index > 1.

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Thu Dec 18, 2008 18:35

Ed Gilbert wrote:I guess if you only calculate the moves incrementally this might be a problem. In my case I build a movelist of all the moves before searching any of them.
I also have a movelist and, as a consequence, I do not have the problem you mentioned. The issue I have lies elsewhere.
Suppose that, after calculating the first successor, I do not find a cutoff. That is a situation where I would like to create a split point. Suppose now that, before launching the calculation of the second successor, I look up all moves of the movelist in the hashtable, hoping to find some interesting information. It may happen that all moves but one are considered very bad, in which case the situation can be treated as if the node had only 2 successors.
Of course, it is just an idea for discussion; the current Damy implementation is different.

Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Dec 19, 2008 14:55

TAILLE wrote:Suppose that, after calculating the first successor, I do not find a cutoff. That is a situation where I would like to create a split point. Suppose now that, before launching the calculation of the second successor, I look up all moves of the movelist in the hashtable, hoping to find some interesting information. It may happen that all moves but one are considered very bad, in which case the situation can be treated as if the node had only 2 successors.
Of course, it is just an idea for discussion; the current Damy implementation is different.
This sounds like what Schaeffer calls Enhanced Transposition Cutoffs: looking up the immediate successor positions in the hashtable to try to find a quick cutoff. If you use ETC then it makes sense to do this before splitting. I tried ETC in the English 8x8 version of kingsrow and it was about a wash -- it lowered the node counts but also slowed the search speed by about the same percentage, so time to depth was the same.
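ETC as described here can be sketched as follows. This is a simplified illustration under assumed names (TT, probe, etc_cutoff) and a toy table that stores only upper bounds, not kingsrow's or Damy's actual code:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Toy transposition table: position key -> a stored *upper* bound on
// that position's score, from the side to move in that position.
struct TT {
    std::unordered_map<uint64_t, int> upper_bound;
    std::optional<int> probe(uint64_t key) const {
        auto it = upper_bound.find(key);
        if (it == upper_bound.end()) return std::nullopt;
        return it->second;
    }
};

// ETC: before searching the successors (and before creating a split
// point), probe each successor position.  If a child is known to score
// at most ub for the opponent, then in negamax terms the move scores at
// least -ub for us; if -ub >= beta we can fail high with no search.
std::optional<int> etc_cutoff(const TT& tt,
                              const std::vector<uint64_t>& child_keys,
                              int beta) {
    for (uint64_t key : child_keys) {
        if (auto ub = tt.probe(key)) {
            if (-*ub >= beta) return -*ub;   // immediate cutoff
        }
    }
    return std::nullopt;                     // no quick cutoff; search normally
}
```

This also shows why Gérard's idea fits naturally before the split decision: moves whose table bounds already refute them need not count toward the "enough successors to split" test.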

-- Ed

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Thu Jan 15, 2009 21:51

During Xmas I worked on the parallel search.
The algorithm Damage now uses is based on the weak YBWC (without the Helpful Master).

The non-optimized results are based on the "standard" test game also used by Ed.
The test system is a quad-core Q6600 at 2.4 GHz.

Herewith the results

Code:

Core	Time		1/Time		Nodes		Speed		Speed (Mnodes/sec)
1		1.00		1.00		1.00		1.00		1.92
2		0.62		1.62		1.06		1.71		3.28
3		0.48		2.08		1.18		2.44		4.69
4		0.41		2.42		1.25		3.03		5.81
The first 4 columns compare the multi-core timing with the single-core timing (the single-core algorithm is basically the same); the last column gives the absolute speed in MNodes/sec.
So Damage reaches a speed of around 6 MNodes/sec.
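The table's columns are related: the "Speed" column is the relative node rate, i.e. Nodes x (1/Time), and the last column is that ratio times the single-core rate of 1.92 MNodes/sec. A small helper (illustrative only; values reproduce the table up to rounding) makes the relation explicit:

```cpp
// Effective parallel speedup: relative node count divided by relative
// time.  It is below the raw time-based speedup would suggest because
// the parallel search visits extra nodes (search overhead), so nodes
// searched per second grows faster than time-to-depth shrinks.
double effective_speedup(double time_ratio, double node_ratio) {
    return node_ratio / time_ratio;   // the "Speed" column
}
```

For the 4-core row: 1.25 / 0.41 is about 3.05, matching the table's 3.03 up to rounding of the inputs, and 3.03 x 1.92 MNodes/sec gives the reported 5.81 MNodes/sec.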

The challenge is to get beyond a 3.0 "time" improvement, which I have set as my "development and implementation" target.

Bert

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

singlecore/multicore algorithm

Post by TAILLE » Thu Jan 15, 2009 23:47

BertTuyt wrote:During Xmas I worked on the parallel search.
The algorithm Damage now uses is based on the weak YBWC (without the Helpful Master).
Bert
For your information, I now also have a multicore version of Damy. After having studied the YBWC algorithm, I finally created a different algorithm, though one not very far from YBWC.
BertTuyt wrote: The first 4 columns compare the multi-core timing with the single-core timing (the single-core algorithm is basically the same).
Bert
Here I have a question: isn't it possible to improve on the single-core algorithm when working in a multicore environment?

Let's take an obvious position to illustrate my point:
[position diagram]
black to move

Suppose that you decide to spend 10 seconds deciding between 19x30 and 20x29. What is the best strategy to use these 10 seconds as efficiently as possible?

With a "classical" single-core algorithm, it appears that you will spend more than 90% of the CPU on the 20x29 move and less than 10% on 19x30. Is that what you want? Isn't it preferable to spend 50% of the CPU on each move in order to demonstrate that 20x29 is far better than 19x30?

By the way, it is quite easy to achieve this time split with 2 cores, ... but not by using the YBWC!

Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Jan 16, 2009 01:36

Hi Bert,

Your parallel search speedups look very good, and a little better than mine. I need to repeat the benchmarks, because I afterward discovered that one of my cores spends 30% of its time servicing some interrupt, which the other cores do not do, and so my benchmarks are skewed. Also, last weekend I built a new computer, and now I have a machine with 8 cores, so I can repeat the test with that. But given the diminishing returns we both see going from 3 cores to 4, I don't expect more than 4 cores to be much additional benefit for parallel searches. The 8-core machine is for building databases and running many engine matches in parallel.

-- Ed

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: singlecore/multicore algorithm

Post by Ed Gilbert » Fri Jan 16, 2009 01:42

TAILLE wrote:With a "classical" single-core algorithm, it appears that you will spend more than 90% of the CPU on the 20x29 move and less than 10% on 19x30. Is that what you want? Isn't it preferable to spend 50% of the CPU on each move in order to demonstrate that 20x29 is far better than 19x30?
Hi Gerard,

Why is it better to split the time 50/50 than to search each move to the same nominal depth? I always felt it was better to explore each alternative to the same nominal depth, which then gets altered by truncations and extensions. But I'm not following your point. Can you explain?

-- Ed

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Re: singlecore/multicore algorithm

Post by TAILLE » Fri Jan 16, 2009 14:55

Ed Gilbert wrote: Why is it better to split the time 50/50 than to search each move to the same nominal depth? I always felt it was better to explore each alternative to the same nominal depth, which then gets altered by truncations and extensions. But I'm not following your point. Can you explain?
-- Ed
I agree 100% with you that it is better to explore each alternative to the same nominal depth, but my point is different.
[position diagram]
In the above diagram, if, in an MTD(f) approach, you explore the tree with a test value = 0, then you will find a lot of cutoffs when exploring 20x29 (because the real value of 20x29 is far above 0) as well as when exploring 19x30 (because the real value of 19x30 is far below 0).
If now you discover that the value of 20x29 is about 2,000, then you will probably launch a new tree exploration with a test value = 2,000, and here is my point: in this exploration you will have very few cutoffs for the 20x29 move, because the test value is very close to the real value of the move. On the other hand, you will have a lot of cutoffs when exploring 19x30. That is the reason why I claim that you will spend 90% of the CPU on 20x29 and 10% on 19x30.
My feeling is that the optimum test value is the one for which I spend 50% of the CPU time on each move, but ... it is only a feeling.
Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Re: singlecore/multicore algorithm

Post by Ed Gilbert » Sat Jan 17, 2009 20:42

TAILLE wrote: My feeling is that the optimum test value is the one for which I spend 50% of the CPU time on each move, but ... it is only a feeling.
But in MTD(f) the test value is driven by finding the upper and lower bounds at the root position and thus converging on the exact value, so I don't follow how that relates to the number of nodes visited with other test values.
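The convergence Ed describes is the standard MTD(f) driver (after Plaat et al.), sketched below. The lambda passed for `search` is a stand-in for a real zero-window alpha-beta search; the driver itself leaves no free choice of test value, which is Ed's point:

```cpp
#include <algorithm>
#include <climits>
#include <functional>

// MTD(f): converge on the minimax value by repeated zero-window
// searches.  `search(beta)` must return v >= beta on fail-high and
// v < beta on fail-low; `first_guess` seeds the first test value.
int mtdf(const std::function<int(int)>& search, int first_guess) {
    int g = first_guess;
    int lower = INT_MIN / 2, upper = INT_MAX / 2;
    while (lower < upper) {
        int beta = std::max(g, lower + 1);  // next test value, forced by bounds
        g = search(beta);
        if (g < beta) upper = g;            // fail-low: g is an upper bound
        else          lower = g;            // fail-high: g is a lower bound
    }
    return g;                               // bounds met: exact value
}
```

Gérard's 50/50 idea would amount to choosing a test value by a different criterion (balancing CPU time across root moves) rather than letting the bound convergence dictate it.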

Somewhat related to this, I read that some chess programs use the node counts to do move ordering at the root position. I tried this with kingsrow but found that it was inferior to the root move ordering scheme that I was already using.

-- Ed

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sun Jan 18, 2009 12:03

Ed, the 8-core machine you built is quite interesting.
Can you say something about the hardware used (processors, amount of memory, motherboard; is the memory shared, so you can really address all cores directly)?

I'm planning to buy a new computer based on the new Intel i7 processor (the 2.93 GHz version) with 12 GB of memory. I hope to have this hardware for the next tournament in France. This processor also implements hyper-threading again, so to the outside world there are 8 threads (I don't think this is really efficient, but let's see).

Regarding the YBWC speedup, I'm now running a profile to get a better understanding of where I have improvement opportunities. Your remark is interesting, that because the branching factor in draughts is lower, diminishing returns start earlier than in chess. In chess, speedups beyond 3 were reported for YBWC, whereas "we" stabilize below 3. Maybe I will do some simulations with artificial trees to verify this.

That would be bad news for the draughts programming community, to be honest, as we will get 8-core (real cores) processors this year (and more cores in the future, as the frequency seems to stay around 3 GHz). Although for database generation it is helpful.

Bert

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Sun Jan 18, 2009 15:01

BertTuyt wrote:Ed, the 8-core machine you built is quite interesting.
Can you say something about the hardware used (processors, amount of memory, motherboard; is the memory shared, so you can really address all cores directly)?

I'm planning to buy a new computer based on the new Intel i7 processor (the 2.93 GHz version) with 12 GB of memory. I hope to have this hardware for the next tournament in France. This processor also implements hyper-threading again, so to the outside world there are 8 threads (I don't think this is really efficient, but let's see).
Hi Bert,

The new machine uses two Intel E5420 processors. These are quad-core Xeons running at 2.5 GHz with 12 MB of L2 cache. The machine is built around a Supermicro X7DCA-L server motherboard. Other components are 16 GB of ECC RAM and two 1 TB hard drives. I may put two more RAM sticks in to bump it to 24 GB, as these are now fairly inexpensive. Yes, all the memory is shared, and one process can address all 16 GB. I looked at Nehalem, but none of its chipsets support ECC memory now, and I would have to wait close to a year before anything suitable is available. Having 8 CPUs in one small box (14.7" x 10.9" x 10.3") is much nicer than all the separate boxes I used to have under my desk. I intend to use it for engine matches and for building opening books and endgame databases. Presently it is running 8 instances of the endgame db builder working on the 8pc db, which is 18% complete.

-- Ed

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sat Jan 24, 2009 00:26

While testing the parallel algorithm, I (= Damage) tried to solve the position in the next diagram.

[img]http://fmjd.org/dias2/save/12327526722.png[/img]

White to Move, and the position is a win for White.

My question to all is: what time does your program need to find the absolute win path, using only 6-piece WDL (win-draw-loss) databases (so no databases containing distance to win or distance to conversion)?
It is evident that the program already recognizes the correct move at the root, but it will take much longer to find the move sequence leading to the capture of all the black kings.

Bert

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Sat Jan 24, 2009 00:49

Hi Bert,

6.5 seconds to see the end of the game. The score -3977 indicates approximately 23 moves to the end of the game. How does damage do on this test?

Code:

best 31-42, value -2720, depth 1/1.7/4, nodes 123, time 0.00, 123 kN/s, db 47, pv 31-42
best 31-42, value -2720, depth 3/3.6/7, nodes 464, time 0.00, 464 kN/s, db 221, pv 31-42 47-41 38-32
best 31-42, value -2720, depth 5/4.5/8, nodes 1035, time 0.00, 1035 kN/s, db 513, pv 31-42 47-41 38-32 41-47 42-15 47-41
best 31-42, value -2720, depth 7/8.3/13, nodes 3072, time 0.00, 3072 kN/s, db 1642, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-20
best 31-42, value -2720, depth 9/11.5/15, nodes 8889, time 0.02, 556 kN/s, db 5132, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -2720, depth 11/13.4/17, nodes 18637, time 0.02, 1165 kN/s, db 10813, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -2720, depth 13/15.2/20, nodes 39322, time 0.02, 2458 kN/s, db 22287, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -2720, depth 15/16.9/22, nodes 75122, time 0.03, 2348 kN/s, db 42477, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -2720, depth 17/18.5/24, nodes 132619, time 0.06, 2105 kN/s, db 74534, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -2720, depth 19/19.6/26, nodes 198056, time 0.08, 2539 kN/s, db 109594, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -2750, depth 21/22.3/29, nodes 340133, time 0.13, 2721 kN/s, db 188331, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -3977, depth 23/26.2/33, nodes 26700178, time 6.55, 4078 kN/s, db 13196075, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 
best 31-42, value -3977, depth 25/28.8/33, nodes 41801038, time 10.09, 4141 kN/s, db 21055657, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 48-25 
best 31-42, value -3977, depth 27/29.8/37, nodes 148908880, time 32.58, 4571 kN/s, db 61514957, pv 31-42 47-41 38-32 41-36 23-5 36-9 42-48 9-4 48-25 

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sat Jan 24, 2009 19:45

Ed, Damage takes 29.2 sec (examining around 144406604 positions, 4.9 MNodes/sec).

So room for optimization.

I'm interested: what are the times of the other programs?

Bert
