Software issue

TAILLE · Post by **TAILLE** » Wed Jul 30, 2008 23:40

Hi,
Just a small question.
Does somebody know the difference between the two following assembly instructions :

bts qword ptr [Var],1
lock bts qword ptr [Var],1

These 2 instructions are respectively generated by the 2 following intrinsec functions :
_bittestandset64(&Var, 1);
_interlockedbittestandset64(&Var, 1);

I guess that the first one might to be not thread safe if Var is a global variable used by several threads but I do not see clearly where is the risk.
Who can help me to understand it ?

Gérard

Ed Gilbert · Post by **Ed Gilbert** » Thu Jul 31, 2008 14:49

Hi Gerard,

A common use of these interlocked instructions is to implement a spin lock. I am using one of these in the parallel search to protect critical regions where globals are accessed that are shared among several search threads. In my case I am using _InterlockedExchange(), but it could be done just as easily with _interlockedbittestandset(). It looks something like this (from memory, I hope I have it right):

Code: Select all

long spinlock;

while (_InterlockedExchange(&spinlock, 1))
    ;

/* Start of critical region. */


/* end of critical region. */

spinlock = 0;

For parallel search, a spin lock is more efficient than using a Windows semaphore because it avoids the overhead of a context switch, and the critical regions are typically very short.

If your bit test and set operation is not a critical region, then you could use the non-interlocked version, but then there is not much point in using the instrinsic at all, just use the standard C binary operators.

-- Ed

TAILLE · Post by **TAILLE** » Thu Jul 31, 2008 16:13

Hi Ed,

Ed Gilbert wrote:

Code: Select all

long spinlock;

while (_InterlockedExchange(&spinlock, 1))
    ;

/* Start of critical region. */


/* end of critical region. */

spinlock = 0;

-- Ed

Effectively your implementation is quite similar as mine.
Just a little (theoritical) comment :
The instruction :
spinlock = 0;
seems not sufficent to end the critical region because you do not assure an effective writting in memory. To solve such problem you have 2 solutions :

First solution :

Code: Select all

volatile long spinlock; 

while (_InterlockedExchange(&spinlock, 1)) ; 
/* Start of critical region. */ 

/* end of critical region. */ 
spinlock = 0;

Second solution :

Code: Select all

long spinlock; 

while (_InterlockedExchange(&spinlock, 1)) ; 
/* Start of critical region. */ 

/* end of critical region. */ 
spinlock = 0;
_ReadWriteBarrier();

As far as I am concerned I prefer the second solution

Coming back to my initial question (using of _bittestandset instead of _interlockedbittestandset) what about the following sequence :

Code: Select all

long spinlock; 

while (_bittestandset(&spinlock, 0));
/* Start of critical region. */ 


/* end of critical region. */ 
spinlock = 0;
_ReadWriteBarrier();

My understanding is the following :
This sequence is safe in a multi-thread environment but only in a mono-processor environment; this sequence is not safe in a multi-processor environment.
Do you agree ?
Are 2 processors able to run in parallel the corresponding
bts qword ptr [spinlock ],0
assembly instruction and enter both the critical region ?
Gérard

Ed Gilbert · Post by **Ed Gilbert** » Thu Jul 31, 2008 17:39

Gerard, you're right that something must be done to prevent the optimizer from delaying or omitting the clearing of the spinlock. In my actual code the spinlock is declared volatile.

My understanding is the following :
This sequence is safe in a multi-thread environment but only in a mono-processor environment; this sequence is not safe in a multi-processor environment.
Do you agree ?
Are 2 processors able to run in parallel the corresponding
bts qword ptr [spinlock ],0
assembly instruction and enter both the critical region ?

Is it safe in a single processor environment with multi-threading? The non-atomic version can be preempted between the test and the set operations, and that seems to allow multiple threads to be in the critical region (through context switching, not simultaneously of course).

-- Ed

TAILLE · Post by **TAILLE** » Fri Jan 07, 2011 15:09

Hi,

I just begin some tests in order to try and use the 8 db as efficiently as possible.

First of all, because the disk blocks used by Damy are not sector-aligned I do not use the FILE_FLAG_NO_BUFFERING flag when using the Create() function. As a consequence Damy uses its own file cache but it also uses the file cache offered by windows.
All that is fine but I have a problem for testing two program versions and compare the impact on disk I/O.
I would like to use a given position and run these two program versions and compare the time needed in each case to resolve the position.
The problem is the following : after I have been running the first version the windows cache has been populated and I do not know how to clear this windows cache in order to run the second version in the same conditions.
I tried the FlushFileBuffers() function but I realised that the purpose of this function is only for writing the buffer on the disk but not for clearing the buffer to avoid the use of the cache for the next reading function.
For the moment the only solution is to restart my computer between the 2 tests ! Horrible !

Can somebody help me (I am not a expert on windows file cache system) ?

Ed Gilbert · Post by **Ed Gilbert** » Fri Jan 07, 2011 15:28

Hi Gerard,

I am not aware of any way to clear the Windows file cache other than rebooting. I wonder if it makes sense for you to read a disk block into a temporary buffer that has the proper alignment, then memcopy the block into the intended destination? That would allow you to use FILE_FLAG_NO_BUFFERING. I assume that the memcopy is going to be much faster than reading the block from disk, which might take milliseconds, so the additional few microseconds of doing the memcopy might not be noticeable. I think you really want FILE_FLAG_NO_BUFFERING for your driver anyway (not just for the purpose of benchmarking), because it seems inefficient for both you and Windows to be caching the data.

-- Ed

TAILLE · Post by **TAILLE** » Fri Jan 07, 2011 16:21

Ed Gilbert wrote:Hi Gerard,

I am not aware of any way to clear the Windows file cache other than rebooting. I wonder if it makes sense for you to read a disk block into a temporary buffer that has the proper alignment, then memcopy the block into the intended destination? That would allow you to use FILE_FLAG_NO_BUFFERING. I assume that the memcopy is going to be much faster than reading the block from disk, which might take milliseconds, so the additional few microseconds of doing the memcopy might not be noticeable. I think you really want FILE_FLAG_NO_BUFFERING for your driver anyway (not just for the purpose of benchmarking), because it seems inefficient for both you and Windows to be caching the data.

-- Ed

Thank you Ed. I will then try this solution. At least it looks quite simple.

Ed Gilbert · Post by **Ed Gilbert** » Sat Jan 08, 2011 21:22

Hi Gerard,

Here's a good test position for using the 8pc db that I came across this afternoon.

B:W24,27,28,30,32,34,35,37:B11,13,14,16,18,19,25,26

Kingsrow sees a white database win after 11 minutes of searching on my Q6600 quad-core 8gb machine.

-- Ed

TAILLE · Post by **TAILLE** » Mon Jan 31, 2011 17:20

Ed Gilbert wrote:Hi Gerard,

Here's a good test position for using the 8pc db that I came across this afternoon.

B:W24,27,28,30,32,34,35,37:B11,13,14,16,18,19,25,26

Kingsrow sees a white database win after 11 minutes of searching on my Q6600 quad-core 8gb machine.

-- Ed

Hi Ed,

After a lot of change in my search algorithm I am now able to use the 8 pieces db but I have to run many tests in order to tune my access to the disk.

Taking the following position you proposed I already can see some advantages :

Black to move

First of all remember I do not use an alpha-beta search but some mixure of mtd-f and mtd-f-best search. As a consequence Damy is not optimised in order to give the best evaluation of the position but it is optimised in order to give the best move whhich is very different.

In the above position and with my laptop (core 2 duo, 4Gb ram) Damy is able to prove in less than 20" that 1...18-23! is the only move because all the other moves are losing moves. Damy then stops searching. Without the 8 db Damy need about 50" for the same result.
As you see Damy as not proved the win but it has proved that 1.18-23! is the only move and here is a big difference between alpha-beta seach and mtd-f-best search.

After 1...18-23 the position is the following

Finding the best move is more difficult here. After having quickly eliminating 2.24-20 the Damy search algorithm try to separate as much as possible the two moves 2...28-22 and 2...27-22
With the 8 db the results are (1000 = 1 man):
after 3' :
28-22 => +2400
27-22 => +250
after 9'
28-22 => +2900
27-22 => +50
after 15'
28-22 => +4350
27-22 => +50
Of course the result is very clear. Strictly speaking Damy has not proved the win but as a result of my algorithm choice Damy spent a lot of time on 2.27-22 and the other moves and has almost proved the draw with these moves.
Without the 8 db the results are :
after 3' :
28-22 => +1850
27-22 => +400
after 9'
28-22 => +2150
27-22 => +250
after 15'
28-22 => +2150
27-22 => +50

Taking this exemple you can see that the 8 db allow to gain a lot of time : after 3' the move 2.28-22 is already evaluated to 2400, and without the 8db the 2.28-22 is still evaluated to 2150 after 15'.

Good news indeed

Ed or any other guy, do you have a test position in which only one move is winning but this move is quite difficult to choose comparing to an "apparently" good alternative ? This would be surely a more significant test for my algorithm.

World Draughts Forum

Software issue

Software issue

Re: Software issue

Re: Software issue

Re: Software issue

Re: Software issue

Re: Software issue