Hi,
Just a small question.
Does somebody know the difference between the two following assembly instructions :
bts qword ptr [Var],1
lock bts qword ptr [Var],1
These 2 instructions are respectively generated by the 2 following intrinsec functions :
_bittestandset64(&Var, 1);
_interlockedbittestandset64(&Var, 1);
I guess that the first one might to be not thread safe if Var is a global variable used by several threads but I do not see clearly where is the risk.
Who can help me to understand it ?
Gérard

Software issue
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Hi Gerard,
A common use of these interlocked instructions is to implement a spin lock. I am using one of these in the parallel search to protect critical regions where globals are accessed that are shared among several search threads. In my case I am using _InterlockedExchange(), but it could be done just as easily with _interlockedbittestandset(). It looks something like this (from memory, I hope I have it right):
For parallel search, a spin lock is more efficient than using a Windows semaphore because it avoids the overhead of a context switch, and the critical regions are typically very short.
If your bit test and set operation is not a critical region, then you could use the non-interlocked version, but then there is not much point in using the instrinsic at all, just use the standard C binary operators.
-- Ed
A common use of these interlocked instructions is to implement a spin lock. I am using one of these in the parallel search to protect critical regions where globals are accessed that are shared among several search threads. In my case I am using _InterlockedExchange(), but it could be done just as easily with _interlockedbittestandset(). It looks something like this (from memory, I hope I have it right):
Code: Select all
long spinlock;
while (_InterlockedExchange(&spinlock, 1))
;
/* Start of critical region. */
/* end of critical region. */
spinlock = 0;
If your bit test and set operation is not a critical region, then you could use the non-interlocked version, but then there is not much point in using the instrinsic at all, just use the standard C binary operators.
-- Ed
Hi Ed,
Just a little (theoritical) comment :
The instruction :
spinlock = 0;
seems not sufficent to end the critical region because you do not assure an effective writting in memory. To solve such problem you have 2 solutions :
First solution :
Second solution :
As far as I am concerned I prefer the second solution
Coming back to my initial question (using of _bittestandset instead of _interlockedbittestandset) what about the following sequence :
My understanding is the following :
This sequence is safe in a multi-thread environment but only in a mono-processor environment; this sequence is not safe in a multi-processor environment.
Do you agree ?
Are 2 processors able to run in parallel the corresponding
bts qword ptr [spinlock ],0
assembly instruction and enter both the critical region ?
Gérard
Effectively your implementation is quite similar as mine.Ed Gilbert wrote:-- EdCode: Select all
long spinlock; while (_InterlockedExchange(&spinlock, 1)) ; /* Start of critical region. */ /* end of critical region. */ spinlock = 0;
Just a little (theoritical) comment :
The instruction :
spinlock = 0;
seems not sufficent to end the critical region because you do not assure an effective writting in memory. To solve such problem you have 2 solutions :
First solution :
Code: Select all
volatile long spinlock;
while (_InterlockedExchange(&spinlock, 1)) ;
/* Start of critical region. */
/* end of critical region. */
spinlock = 0;
Code: Select all
long spinlock;
while (_InterlockedExchange(&spinlock, 1)) ;
/* Start of critical region. */
/* end of critical region. */
spinlock = 0;
_ReadWriteBarrier();
Coming back to my initial question (using of _bittestandset instead of _interlockedbittestandset) what about the following sequence :
Code: Select all
long spinlock;
while (_bittestandset(&spinlock, 0));
/* Start of critical region. */
/* end of critical region. */
spinlock = 0;
_ReadWriteBarrier();
This sequence is safe in a multi-thread environment but only in a mono-processor environment; this sequence is not safe in a multi-processor environment.
Do you agree ?
Are 2 processors able to run in parallel the corresponding
bts qword ptr [spinlock ],0
assembly instruction and enter both the critical region ?
Gérard
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Gerard, you're right that something must be done to prevent the optimizer from delaying or omitting the clearing of the spinlock. In my actual code the spinlock is declared volatile.
-- Ed
Is it safe in a single processor environment with multi-threading? The non-atomic version can be preempted between the test and the set operations, and that seems to allow multiple threads to be in the critical region (through context switching, not simultaneously of course).My understanding is the following :
This sequence is safe in a multi-thread environment but only in a mono-processor environment; this sequence is not safe in a multi-processor environment.
Do you agree ?
Are 2 processors able to run in parallel the corresponding
bts qword ptr [spinlock ],0
assembly instruction and enter both the critical region ?
-- Ed
Re: Software issue
Hi,
I just begin some tests in order to try and use the 8 db as efficiently as possible.
First of all, because the disk blocks used by Damy are not sector-aligned I do not use the FILE_FLAG_NO_BUFFERING flag when using the Create() function. As a consequence Damy uses its own file cache but it also uses the file cache offered by windows.
All that is fine but I have a problem for testing two program versions and compare the impact on disk I/O.
I would like to use a given position and run these two program versions and compare the time needed in each case to resolve the position.
The problem is the following : after I have been running the first version the windows cache has been populated and I do not know how to clear this windows cache in order to run the second version in the same conditions.
I tried the FlushFileBuffers() function but I realised that the purpose of this function is only for writing the buffer on the disk but not for clearing the buffer to avoid the use of the cache for the next reading function.
For the moment the only solution is to restart my computer between the 2 tests ! Horrible !
Can somebody help me (I am not a expert on windows file cache system) ?
I just begin some tests in order to try and use the 8 db as efficiently as possible.
First of all, because the disk blocks used by Damy are not sector-aligned I do not use the FILE_FLAG_NO_BUFFERING flag when using the Create() function. As a consequence Damy uses its own file cache but it also uses the file cache offered by windows.
All that is fine but I have a problem for testing two program versions and compare the impact on disk I/O.
I would like to use a given position and run these two program versions and compare the time needed in each case to resolve the position.
The problem is the following : after I have been running the first version the windows cache has been populated and I do not know how to clear this windows cache in order to run the second version in the same conditions.
I tried the FlushFileBuffers() function but I realised that the purpose of this function is only for writing the buffer on the disk but not for clearing the buffer to avoid the use of the cache for the next reading function.
For the moment the only solution is to restart my computer between the 2 tests ! Horrible !
Can somebody help me (I am not a expert on windows file cache system) ?
Gérard
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Re: Software issue
Hi Gerard,
I am not aware of any way to clear the Windows file cache other than rebooting. I wonder if it makes sense for you to read a disk block into a temporary buffer that has the proper alignment, then memcopy the block into the intended destination? That would allow you to use FILE_FLAG_NO_BUFFERING. I assume that the memcopy is going to be much faster than reading the block from disk, which might take milliseconds, so the additional few microseconds of doing the memcopy might not be noticeable. I think you really want FILE_FLAG_NO_BUFFERING for your driver anyway (not just for the purpose of benchmarking), because it seems inefficient for both you and Windows to be caching the data.
-- Ed
I am not aware of any way to clear the Windows file cache other than rebooting. I wonder if it makes sense for you to read a disk block into a temporary buffer that has the proper alignment, then memcopy the block into the intended destination? That would allow you to use FILE_FLAG_NO_BUFFERING. I assume that the memcopy is going to be much faster than reading the block from disk, which might take milliseconds, so the additional few microseconds of doing the memcopy might not be noticeable. I think you really want FILE_FLAG_NO_BUFFERING for your driver anyway (not just for the purpose of benchmarking), because it seems inefficient for both you and Windows to be caching the data.
-- Ed
Re: Software issue
Thank you Ed. I will then try this solution. At least it looks quite simple.Ed Gilbert wrote:Hi Gerard,
I am not aware of any way to clear the Windows file cache other than rebooting. I wonder if it makes sense for you to read a disk block into a temporary buffer that has the proper alignment, then memcopy the block into the intended destination? That would allow you to use FILE_FLAG_NO_BUFFERING. I assume that the memcopy is going to be much faster than reading the block from disk, which might take milliseconds, so the additional few microseconds of doing the memcopy might not be noticeable. I think you really want FILE_FLAG_NO_BUFFERING for your driver anyway (not just for the purpose of benchmarking), because it seems inefficient for both you and Windows to be caching the data.
-- Ed
Gérard
-
- Posts: 860
- Joined: Sat Apr 28, 2007 14:53
- Real name: Ed Gilbert
- Location: Morristown, NJ USA
- Contact:
Re: Software issue
Hi Gerard,
Here's a good test position for using the 8pc db that I came across this afternoon.
B:W24,27,28,30,32,34,35,37:B11,13,14,16,18,19,25,26
Kingsrow sees a white database win after 11 minutes of searching on my Q6600 quad-core 8gb machine.
-- Ed
Here's a good test position for using the 8pc db that I came across this afternoon.
B:W24,27,28,30,32,34,35,37:B11,13,14,16,18,19,25,26
Kingsrow sees a white database win after 11 minutes of searching on my Q6600 quad-core 8gb machine.
-- Ed
Re: Software issue
Hi Ed,Ed Gilbert wrote:Hi Gerard,
Here's a good test position for using the 8pc db that I came across this afternoon.
B:W24,27,28,30,32,34,35,37:B11,13,14,16,18,19,25,26
Kingsrow sees a white database win after 11 minutes of searching on my Q6600 quad-core 8gb machine.
-- Ed
After a lot of change in my search algorithm I am now able to use the 8 pieces db but I have to run many tests in order to tune my access to the disk.
Taking the following position you proposed I already can see some advantages :

Black to move
First of all remember I do not use an alpha-beta search but some mixure of mtd-f and mtd-f-best search. As a consequence Damy is not optimised in order to give the best evaluation of the position but it is optimised in order to give the best move whhich is very different.
In the above position and with my laptop (core 2 duo, 4Gb ram) Damy is able to prove in less than 20" that 1...18-23! is the only move because all the other moves are losing moves. Damy then stops searching. Without the 8 db Damy need about 50" for the same result.
As you see Damy as not proved the win but it has proved that 1.18-23! is the only move and here is a big difference between alpha-beta seach and mtd-f-best search.
After 1...18-23 the position is the following

Finding the best move is more difficult here. After having quickly eliminating 2.24-20 the Damy search algorithm try to separate as much as possible the two moves 2...28-22 and 2...27-22
With the 8 db the results are (1000 = 1 man):
after 3' :
28-22 => +2400
27-22 => +250
after 9'
28-22 => +2900
27-22 => +50
after 15'
28-22 => +4350
27-22 => +50
Of course the result is very clear. Strictly speaking Damy has not proved the win but as a result of my algorithm choice Damy spent a lot of time on 2.27-22 and the other moves and has almost proved the draw with these moves.
Without the 8 db the results are :
after 3' :
28-22 => +1850
27-22 => +400
after 9'
28-22 => +2150
27-22 => +250
after 15'
28-22 => +2150
27-22 => +50
Taking this exemple you can see that the 8 db allow to gain a lot of time : after 3' the move 2.28-22 is already evaluated to 2400, and without the 8db the 2.28-22 is still evaluated to 2150 after 15'.
Good news indeed
Ed or any other guy, do you have a test position in which only one move is winning but this move is quite difficult to choose comparing to an "apparently" good alternative ? This would be surely a more significant test for my algorithm.
Gérard