In a previous post I mentioned that the 158-game matches with variable ply depth with Damage were initially played in DEBUG mode, to test the (updated) parallel search routine.
On two occasions an ASSERT occurred. One was related to a bug in the Principal Variation; the other one I could not trace, and it is the one I want to discuss now.
As the EndGame DB is too large to fit in memory, I use an internal DB cache, which for Damage is 4 GByte, consisting of 1M 4-KByte cache blocks.
When a specific 4-KByte block is not loaded yet, the cache handler replaces the LRU (Least Recently Used) cache block.
To optimize transfer to main memory, the EndGame DB is stored on a Solid State Disk (SSD) instead of a (normal) Hard Disk.
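In (C++-style) pseudo-code the cache idea looks roughly like this; the names, the linear LRU scan and the stubbed SSD read are only illustrative, not Damage's actual code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t BLOCK_SIZE = 4096;      // 4 KByte per cache block
constexpr std::size_t NUM_BLOCKS = 1 << 20;   // 1M blocks -> 4 GByte cache

struct CacheBlock {
    std::int64_t  block_id  = -1;             // which DB block is stored here
    std::uint64_t last_used = 0;              // "timestamp" for LRU replacement
    std::uint8_t  data[BLOCK_SIZE];
};

// Placeholder for the SSD read of one 4-KByte block (hypothetical).
void load_block_from_disk(std::int64_t /*block_id*/, std::uint8_t* dest)
{
    std::memset(dest, 0, BLOCK_SIZE);
}

// Returns the requested DB block, loading it from the SSD (and evicting the
// Least Recently Used block) if it is not in the cache yet.
const std::uint8_t* get_block(std::vector<CacheBlock>& cache,
                              std::int64_t block_id,
                              std::uint64_t& clock)
{
    CacheBlock* lru = &cache[0];
    for (CacheBlock& b : cache) {
        if (b.block_id == block_id) {         // cache hit
            b.last_used = ++clock;
            return b.data;
        }
        if (b.last_used < lru->last_used) lru = &b;
    }
    load_block_from_disk(block_id, lru->data);  // cache miss: replace LRU block
    lru->block_id  = block_id;
    lru->last_used = ++clock;
    return lru->data;
}
```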
Every 4-KByte block starts with a small table (2 bytes per table entry) containing the address offsets of the next 4K entries.
As the WDL values are compressed, this table slightly improves the decompression speed (and because compression is used, a 4-KByte block will in general contain far more than 4K positions).
During the initial DEBUG of this routine I included an ASSERT to test whether the offset was smaller than 4 KByte, as the situation that the DB position is not included in this block should not be possible (under the boundary condition that the pre-processing and block identification were correct so far).
The specific ASSERT which stopped the program was related to this test.
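To make the check concrete, a minimal sketch of such a block lookup and the ASSERT (the layout and names are hypothetical; only the idea that the offset must stay inside the 4-KByte block is taken from the description above):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t BLOCK_SIZE = 4096;

// 'index' selects an entry of the small table at the start of the block; the
// table stores, per entry, the 2-byte offset where decompression of the
// corresponding WDL values starts.
std::uint16_t wdl_offset_in_block(const std::uint8_t* block, std::size_t index)
{
    const std::uint16_t* offsets = reinterpret_cast<const std::uint16_t*>(block);
    std::uint16_t offset = offsets[index];

    // The position must lie inside this 4-KByte block; otherwise the
    // pre-processing, the block identification or the block content itself
    // is wrong. This corresponds to the ASSERT that stopped the program.
    assert(offset < BLOCK_SIZE);

    return offset;   // decompression of the WDL values would start here
}
```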
So the possible options are:
* Pre-process was wrong.
* The processor did a wrong memory read (although the memory value was ok).
* The block was corrupted by another part of the program.
* The transfer from SSD to memory went wrong.
* The information on the SSD was already corrupted.
As the part where Damage does a DB read is a critical section, the situation that another part of the process modifies the block while a read is executed cannot take place (I hope so).
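For completeness, the locking idea in a few lines (a std::mutex here purely for illustration; Damage may use a Windows CRITICAL_SECTION or something else entirely):

```cpp
#include <mutex>

std::mutex db_mutex;   // protects the block cache and the block contents

int probe_endgame_db(/* position arguments */)
{
    std::lock_guard<std::mutex> guard(db_mutex);   // enter critical section
    // ... identify the block, fetch it via the cache (loading from the SSD on
    // a miss), read the offset table and decompress the WDL value ...
    return 0;   // placeholder result
}
```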
As I was able to pin down the specific database position, I could do a re-read later, which indicated no problem, so it seems logical to assume that the content on the SSD was not the reason for the ASSERT.
Unfortunately (with hindsight one always knows what to do better) I did not write down the specific 2-byte index value when the ASSERT took place, and therefore was not able to compare it with the second read, which could already have revealed some clues.
At this point I guess the most plausible explanation is that (for whatever reason) an error occurred during SSD access.
I guess these errors are not frequent, but when the computer is running day and night for 2 weeks, the chance might not be zero.
Basically I don't care that once in a while an error is introduced this way, as the chance that it will impact the tree result is small.
On the other hand, I don't want the program to crash when this situation occurs.
As the ASSERTs are not compiled in RELEASE mode, and I haven't seen crashes lately, I have no insight into the frequency.
So I'm thinking of also including some tests in RELEASE mode (especially in the DB handler), to avoid these surprises.
Given the example, one could for instance read twice if the index is outside the 4-KByte range, or re-read the specific block; a sketch of that idea follows below.
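A possible shape of such a RELEASE-mode check (all names are hypothetical; this only sketches the "re-read the specific block" idea, not Damage's actual DB handler):

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t BLOCK_SIZE = 4096;

// Hypothetical helpers assumed to exist in the DB handler:
const std::uint8_t* get_block(std::int64_t block_id);                 // cached fetch
int  decompress_wdl_at(const std::uint8_t* block, std::uint16_t off); // WDL lookup
void invalidate_block(std::int64_t block_id);                         // drop cached copy

// Validate the offset at run time (also in RELEASE) and re-read the block
// from the SSD once before giving up, instead of asserting.
bool probe_with_retry(std::int64_t block_id, std::size_t index, int& wdl)
{
    for (int attempt = 0; attempt < 2; ++attempt) {
        const std::uint8_t*  block   = get_block(block_id);
        const std::uint16_t* offsets =
            reinterpret_cast<const std::uint16_t*>(block);
        std::uint16_t offset = offsets[index];

        if (offset < BLOCK_SIZE) {                   // sanity check kept in RELEASE
            wdl = decompress_wdl_at(block, offset);
            return true;
        }
        invalidate_block(block_id);                  // force a fresh read from SSD
    }
    return false;   // still out of range after the re-read: let the caller decide
}
```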
My question, for those who don't have ECC memory: do you know/recognize these types of random failures (cosmic rays, quantum mechanics or whatever), did you also see them in your program, and did you include some extra test mechanism in the search to detect and avoid them?
Bert