I suspect that the results are an effect of the Windows system cache and its subsequent caching behavior.
Maybe I should get into the details of CreateFile (for example, use the FILE_FLAG_NO_BUFFERING flag, or something similar).
Bert, this is no doubt the reason that it went so quickly the second time. I use the NO_BUFFERING flag (always, not just for this benchmark) because it is inefficient for both my driver and Windows to cache this large data set. With the NO_BUFFERING flag set I get repeatable results every time I run the benchmark. IIRC there are some restrictions when you use this flag, like your buffers having to be aligned on 4k boundaries, but this is easy to achieve using VirtualAlloc, and the 4k size is perfect for this application.
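Roughly, the open and read look like this (a simplified, self-contained sketch rather than my actual driver code; the file name and the block offset are just placeholders):

    #include <windows.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096   /* db cache block size */

    int main(void)
    {
        /* Bypass the Windows system cache; the driver does its own caching. */
        HANDLE h = CreateFileA("dbfile.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            printf("open failed: %lu\n", GetLastError());
            return 1;
        }

        /* VirtualAlloc returns page-aligned (4k) memory, which satisfies the
           buffer alignment restriction that FILE_FLAG_NO_BUFFERING imposes. */
        void *buf = VirtualAlloc(NULL, BLOCK_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        /* With no buffering, reads must start on a sector boundary and be a
           multiple of the sector size; whole 4k blocks satisfy both on
           typical drives. */
        LARGE_INTEGER offset;
        offset.QuadPart = 0;     /* placeholder: block_number * BLOCK_SIZE */
        SetFilePointerEx(h, offset, NULL, FILE_BEGIN);

        DWORD bytes_read = 0;
        ReadFile(h, buf, BLOCK_SIZE, &bytes_read, NULL);
        printf("read %lu bytes\n", bytes_read);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }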
First Pass: 245.9 sec
Next 100 passes: 16.3 sec
I used a cache size of 256K 4KByte blocks, so the cache is not fully loaded after processing the 233,879 positions from the test set.
Ok, so your times should be compared to my "1GB" numbers. I think you said that you are not preloading your cache buffers at program startup, so they are all empty. That explains why your time for the first pass is so long. I preload the buffers with positions that are in general often arrived at during typical games. This makes a big difference as you can see. BTW, although there are more than 200k positions in the test file, the benchmark program only reads the first 200k of them.
Was this test run on your Q6600 or on the i7?
Ed, how many blocks does the 7GB cache contain? I assume that when you have 256K 4KByte blocks (and you only load 4KByte per cache read), a further increase in cache size does not yield any further speed increase, as we "only" process fewer than 256K positions.
In fact, even when I am using only 500MB of memory for the endgame db driver, which is less than 200k cache blocks, all the positions are in cache after the first pass. This is because of the locality of reference of the data: many of the positions differ from one another in only small details, and when you load a 4k block for a particular position you are also bringing into cache many other positions that will be looked up.
Another reason is that there are a lot of repeated positions in the test data. I simply logged every position as it was sent to the endgame db driver for a lookup, and did not eliminate duplicates. There was no reason to do so, since these are the actual positions as they occurred during the search. Not all positions are stored in the hashtable, so many positions get looked up multiple times.
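To illustrate why so few disk reads are needed, here is a small self-contained sketch (the direct-mapped cache, the sizes, and the offset arithmetic are simplifications for illustration, not the real driver): lookups whose byte offsets fall in the same 4k block, whether nearby or exact repeats, cost only one disk read.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096
    #define NUM_SLOTS  8                       /* tiny cache, just for illustration */

    static uint8_t cache_data[NUM_SLOTS][BLOCK_SIZE];
    static int64_t cache_tag[NUM_SLOTS];       /* which file block each slot holds; -1 = empty */
    static long    disk_reads = 0;

    /* Stand-in for a 4k ReadFile from the db file. */
    static void read_block_from_disk(int64_t block, uint8_t *dest)
    {
        memset(dest, 0, BLOCK_SIZE);
        disk_reads++;
    }

    /* Look up the db value stored at a given byte offset in the file.
       Positions that differ only in small details map to nearby offsets,
       so they share a block. */
    static uint8_t lookup(int64_t byte_offset)
    {
        int64_t block = byte_offset / BLOCK_SIZE;
        int slot = (int)(block % NUM_SLOTS);   /* direct-mapped for brevity */
        if (cache_tag[slot] != block) {
            read_block_from_disk(block, cache_data[slot]);
            cache_tag[slot] = block;
        }
        return cache_data[slot][byte_offset % BLOCK_SIZE];
    }

    int main(void)
    {
        memset(cache_tag, 0xFF, sizeof cache_tag);    /* mark all slots empty */
        /* 1000 lookups over 200 nearby (and repeated) offsets: one disk read. */
        for (int i = 0; i < 1000; i++)
            lookup(100000 + (i % 200));
        printf("lookups: 1000, disk reads: %ld\n", disk_reads);
        return 0;
    }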
Another question, how many different cache-blocks do you need for the position set?
I don't know; I did not instrument that statistic.
And if you start with the 7GByte cache first (a fresh start), do you then also get the 1.0 second speed?
Yes.
-- Ed