EndGame Databases

Discussion about the development of draughts in the age of computers and the Internet.
BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Fri Jul 24, 2009 00:34

In the meantime I have reduced the time to generate the 4K - 2K and 2K - 4K databases to 1408.1 seconds.
That is more than 400 KPos/second.

Yes, this is still apples versus non-apples, as it is based on non-compressed files, and I do not compress the result at the end either.
I also don't apply symmetry principles.

I will continue the quest; the next goal is to go beyond 500 KPos/second.

I'll keep you all posted.....

Bert

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Jul 24, 2009 00:46

I do not clearly understand how you run this binary search. My intention here is to try to use a hash access, with the idea that a given block can be loaded in only 2 or 3 different places. Have you also tried such an approach?
You can use a hashtable, but this requires more memory, because you need to store the block number along with the starting index of the block in each hashtable entry. An ordered array is more space efficient, as you can use the block number as the index into the array of starting position indices. If you figure out how many blocks exist for a 365GB database, and how big your hashtable nodes are, you will see that this starts to add up.
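As a minimal sketch of the ordered-array lookup described here (illustrative names only, not the actual driver code): the array is indexed by block number and holds each block's first position index, and a binary search returns the block whose range contains a given position index. The hashtable alternative would have to store a (block number, start index) pair per entry, which is the extra memory referred to above.

```c
/* Minimal sketch (illustrative, not the actual driver): start_index[b] holds
 * the first position index stored in block b; the array is sorted by
 * construction.  Find the block whose range contains pos_index. */
#include <stdint.h>

static int find_block(const uint64_t *start_index, int num_blocks, uint64_t pos_index)
{
    int lo = 0, hi = num_blocks - 1;

    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;   /* upper middle, avoids an infinite loop */
        if (start_index[mid] <= pos_index)
            lo = mid;                       /* block mid starts at or before pos_index */
        else
            hi = mid - 1;                   /* block mid starts after pos_index */
    }
    return lo;                              /* largest b with start_index[b] <= pos_index */
}
```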
Do you have to reorganise this list of starting indices each time you replace a block with another one? Maybe you consider that the time required for rearranging this list is not significant compared to loading the block into memory?
The cache blocks are arranged in a doubly linked list, so rearranging simply means changing the link pointers.
Other questions: when you do not find the position in the cache, how do you find the identity of the block you have to load from disk? Is it also a binary search, based on a table of the starting indices of all blocks of a given slice (in your definition)?
Yes.
If yes, is this table in memory or in another cache?
In memory, read from disk at startup.

-- Ed

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Fri Jul 24, 2009 13:49

Ed Gilbert wrote: The cache blocks are arranged in a doubly linked list, so rearranging simply means changing the link pointers.
That was my understanding, but my question is about the binary search algorithm. How do you find the middle element between two elements in a doubly linked list?
Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Jul 24, 2009 14:39

That was my understanding, but my question is about the binary search algorithm. How do you find the middle element between two elements in a doubly linked list?
You are referring to two different data structures.
1) An array of start indices for each block of the entire database.
2) A cache of database blocks arranged as a doubly linked list.

The binary search operates on 1). The LRU replacement strategy operates on 2).
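A minimal sketch of the second structure, the LRU list of cached blocks (field and function names are assumptions, not the actual driver's internals): on a cache hit the block is moved to the head of the list by rewiring a few pointers, and on a miss the block at the tail would be evicted and refilled from disk.

```c
/* Sketch of an LRU cache-block list (illustrative, not the actual driver). */
struct cache_block {
    struct cache_block *prev, *next;
    int block_number;               /* which db block this 4 KB buffer holds */
    unsigned char data[4096];
};

static struct cache_block *lru_head;    /* most recently used */
static struct cache_block *lru_tail;    /* least recently used, next to be evicted */

static void lru_touch(struct cache_block *b)
{
    if (b == lru_head)
        return;

    /* unlink b from its current place in the list */
    b->prev->next = b->next;
    if (b->next != NULL)
        b->next->prev = b->prev;
    else
        lru_tail = b->prev;

    /* relink b at the head */
    b->prev = NULL;
    b->next = lru_head;
    lru_head->prev = b;
    lru_head = b;
}
```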

-- Ed

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Fri Jul 24, 2009 14:57

Ed,

for my understanding: do you cache only the existing database blocks?
And is the new database kept fully in memory?

Bert

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Fri Jul 24, 2009 17:30

Ed Gilbert wrote:
That was my understanding, but my question is about the binary search algorithm. How do you find the middle element between two elements in a doubly linked list?
You are referring to two different data structures.
1) An array of start indices for each block of the entire database.
2) A cache of database blocks arranged as a doubly linked list.

The binary search operates on 1). The LRU replacement strategy operates on 2).

-- Ed
Thank you Ed, it is clearer now.
I will try a completely different approach. As you know, I handle blocks of different sizes, but I discovered that I did not take advantage of all the benefits of such an approach, and I have some changes in mind to make it more efficient. The first major change will be the following: for a given slice (e.g. 3wk, 2wm, 2bk, 1bm) I will dispatch all positions into N blocks with the same number of positions per block. Of course some blocks will be quite small and others rather large, but that does not seem harmful, because very small and very large blocks should be exceptional. The first advantage is that, for a given position, I can directly calculate the block in which I will find the result of that position. You can see that this way I avoid storing the starting index of each block, and I avoid the corresponding binary search.
Defining the number of blocks I will use is not a big problem. Taking your figures for the 8-piece db, I understood that you need less than 400 GB. For blocks of 4 KB on average this means about 100M blocks; with roughly 16.5 trillion positions, that gives about 165K positions per block on average.
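A short sketch of the direct calculation for one slice, with the per-block count taken as the rough average from the figures above purely for illustration (not Gérard's actual code):

```c
/* Fixed number of positions per block within a slice: the block number and
 * the offset inside it follow from a single division, so no per-block start
 * index and no binary search are needed.  The constant is only illustrative. */
#include <stdint.h>

#define POSITIONS_PER_BLOCK 165000ULL

static void locate(uint64_t pos_index, uint64_t *block, uint64_t *offset)
{
    *block  = pos_index / POSITIONS_PER_BLOCK;   /* which block to fetch         */
    *offset = pos_index % POSITIONS_PER_BLOCK;   /* which position in that block */
}
```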
I also see some very interesting additional advantages in dispatching the positions into the blocks in a skillful way.
Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Jul 24, 2009 18:29

BertTuyt wrote:Ed,

for my understanding: do you cache only the existing database blocks?
And is the new database kept fully in memory?

Bert
Bert, I'm sorry but I don't fully understand your questions. As general info, if this helps: I give the db lookup driver a chunk of memory to manage as a cache. Typically it is most of the memory in the machine, minus whatever is needed for the rest of the program, hashtables, etc., leaving a few hundred megabytes for Windows to run with, but I can make this cache size anything I want, up to the virtual memory limits of the machine. The driver divides it up into 4 KB blocks and manages it as a cache of db blocks using an LRU replacement strategy. For example, the laptop that I brought to France has 4 GB of RAM, and I used about 2.8 GB of it as cache for the endgame databases.
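A sketch of how such a startup step might look, carving a user-supplied memory area into 4 KB cache buffers and chaining them into a doubly linked list (sizes and names are illustrative assumptions, not the actual driver):

```c
/* Illustrative only: split cache_bytes of memory into 4 KB cache buffers and
 * link them so buf[0] is the list head and buf[n-1] the tail. */
#include <stdlib.h>
#include <stddef.h>

#define CACHE_BLOCK_SIZE 4096

struct cache_block {
    struct cache_block *prev, *next;
    int block_number;                        /* -1 until a db block is loaded */
    unsigned char data[CACHE_BLOCK_SIZE];
};

static struct cache_block *init_cache(size_t cache_bytes, size_t *num_buffers)
{
    size_t n = cache_bytes / sizeof(struct cache_block);
    struct cache_block *buf = malloc(n * sizeof(struct cache_block));
    size_t i;

    if (buf == NULL)
        return NULL;
    for (i = 0; i < n; i++) {
        buf[i].block_number = -1;
        buf[i].prev = (i > 0)     ? &buf[i - 1] : NULL;
        buf[i].next = (i + 1 < n) ? &buf[i + 1] : NULL;
    }
    *num_buffers = n;
    return buf;
}
```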

-- Ed

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Jul 24, 2009 18:38

Gerard, I see how you can do it without needing a binary search, which is certainly an advantage. The main concern with the driver is how the cache works: how efficiently you can operate when you have a cache miss and you have to go to disk, because disk I/O is so much slower than anything else involved. The position-to-index calculation, binary search, and LRU maintenance are all very fast in comparison. I used 4 KB blocks because that is the DMA block size that the disk interface uses.

-- Ed

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Fri Jul 24, 2009 19:13

Ed Gilbert wrote:Gerard, I see how you can do it without needing a binary search, which is certainly an advantage. The main concern with the driver is how the cache works: how efficiently you can operate when you have a cache miss and you have to go to disk, because disk I/O is so much slower than anything else involved. The position-to-index calculation, binary search, and LRU maintenance are all very fast in comparison. I used 4 KB blocks because that is the DMA block size that the disk interface uses.

-- Ed
Yes Ed, I know that, but I am not able to say which approach is the best for a game. Anyway, because my current implementation already uses blocks of different sizes, deciding on a fixed number of positions per block is an improvement for me.
Another important point to take into account is the opportunity to put into the same block positions that have a good chance of being looked up along with the current position.
[diagram of the example position referred to below]
In the example above:
1) If it is white to move, we will reach a position with black to move in the 5p-3p db, and it will be quite interesting to have the 4 generated positions in the same block => with black to move it is interesting to have a block with all positions of the white kings.
2) On the other hand, if it is black to move, we will reach a position with white to move in the 4p-4p db, and here also it will be quite interesting to have the 4 generated positions in the same block => with white to move it is interesting to have a block with all positions of the black kings.
I hope this is another advantage of my approach: especially in positions with kings, a judicious organization of the block contents can certainly limit the number of disk accesses.
This example shows that it could be a good idea to have an index calculation which depends on who is to move.
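One possible reading of this idea, sketched under the assumption of a single king per side and a simple mixed-radix index (purely illustrative, not Gérard's actual indexing): make the king of the side that is not to move, i.e. the piece that has just moved, the fastest-varying component of the index, so the positions generated by its moves get consecutive indices and usually share a block.

```c
/* Illustrative side-to-move-dependent index; assumes one king per side and
 * 50 playable squares on the 10x10 board.  rest_index covers all other pieces. */
#include <stdint.h>

static uint64_t position_index(uint64_t rest_index,
                               int white_king_sq, int black_king_sq,
                               int black_to_move)
{
    if (black_to_move)   /* white just moved: vary the white king fastest */
        return (rest_index * 50 + black_king_sq) * 50 + white_king_sq;
    else                 /* black just moved: vary the black king fastest */
        return (rest_index * 50 + white_king_sq) * 50 + black_king_sq;
}
```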
Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Fri Jul 24, 2009 23:44

Gerard, in this topic http://laatste.info/bb/viewtopic.php?t=2349 I posted some results of benchmarks of db access performance. If you are interested you could easily run the same benchmarks and see how your lookups compare to mine. I posted the source code at that link, and can email you the data file of 200k test positions.

-- Ed

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Sat Jul 25, 2009 00:17

Ed Gilbert wrote:Gerard, in this topic http://laatste.info/bb/viewtopic.php?t=2349 I posted some results of benchmarks of db access performance. If you are interested you could easily run the same benchmarks and see how your lookups compare to mine. I posted the source code at that link, and can email you the data file of 200k test positions.

-- Ed
Yes Ed, having such 200K positions for benchmark purposes is of course interesting. Thank you for sending me the corresponding file.
Remember that what I presented above are only new ideas for the future. I intend to program and test them before beginning the 8-piece generation, but I am not in a hurry.
Gérard

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Sat Jul 25, 2009 10:07

Ed Gilbert wrote:Gerard, in this topic http://laatste.info/bb/viewtopic.php?t=2349 I posted some results of benchmarks of db access performance. If you are interested you could easily run the same benchmarks and see how your lookups compare to mine. I posted the source code at that link, and can email you the data file of 200k test positions.

-- Ed
Thinking twice about this file of 200K positions, I am wondering what you are trying to benchmark.
How was this file built?
If these positions come from 200K different games, that means they are totally independent and you will probably measure only the performance of your disk.
In order to test your caching strategy and your choices for building blocks, you need many positions that can be encountered by your engine within a single game. You also need to put some positions in the file several times.
Maybe another good benchmark would be to run a kind of perft function which stops either at a given depth or at a position in the db.
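A rough sketch of such a perft-style probe; all types and helpers here (struct position, struct move, gen_moves, do_move, undo_move, piece_count, db_lookup) are hypothetical stand-ins for engine code, not an existing API:

```c
/* Placeholder engine types and helpers (hypothetical stand-ins). */
struct position { int dummy; };
struct move     { int from, to; };

extern int  piece_count(const struct position *p);
extern int  gen_moves(const struct position *p, struct move *moves);
extern void do_move(struct position *p, const struct move *m);
extern void undo_move(struct position *p, const struct move *m);
extern int  db_lookup(const struct position *p);

#define MAX_DB_PIECES 8
#define MAX_MOVES     128

static long db_probes;

/* Walk the tree to a fixed depth, but stop and probe the db as soon as a
 * position has few enough pieces, so the cache sees a correlated stream of
 * lookups like those produced by a real search. */
static void perft_db(struct position *p, int depth)
{
    struct move moves[MAX_MOVES];
    int i, n;

    if (piece_count(p) <= MAX_DB_PIECES) {
        (void)db_lookup(p);
        db_probes++;
        return;
    }
    if (depth == 0)
        return;

    n = gen_moves(p, moves);
    for (i = 0; i < n; i++) {
        do_move(p, &moves[i]);
        perft_db(p, depth - 1);
        undo_move(p, &moves[i]);
    }
}
```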
What is your feeling?
Gérard

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Sat Jul 25, 2009 12:55

Thinking twice about this file of 200K positions, I am wondering what you are trying to benchmark.
How was this file built?
If these positions come from 200K different games, that means they are totally independent and you will probably measure only the performance of your disk.
In order to test your caching strategy and your choices for building blocks, you need many positions that can be encountered by your engine within a single game. You also need to put some positions in the file several times.
Maybe another good benchmark would be to run a kind of perft function which stops either at a given depth or at a position in the db.
What is your feeling?
The file was built the way you suggested. A single practice game was run, and every endgame db lookup position was logged to a file during the game. Many of the positions are repeated several times, because they were not in the hashtable and that's what happened during the search. So I think the test using this file is a good simulation of real db lookup performance.

The first pass of the test measures the time to look up the values of the positions when presumably many of them are not in cache and have to be loaded from disk. Although this is partly measuring your disk speed, the speed is affected by how much data you need to read to load a single position, how many seeks are required, and how good the locality of reference is in your db. I.e., when you read a block of positions to get one value, how many other positions in that block were subsequently looked up.

The second part of the test measures how long it takes to make 100 additional passes on the test file. At that point the positions should all be cached in ram, so it is measuring the speed of non-I/O parts of lookup. This is where the times for the binary search and the LRU maintenance should show up. I found it useful for comparing several different compression schemes that I was evaluating.
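A sketch of that two-phase measurement; dblookup(), the positions array and the constants are placeholders for the benchmark source posted at the link above, not the actual code:

```c
/* Cold pass followed by 100 warm passes over the 200K test positions. */
#include <stdio.h>
#include <time.h>

#define NUM_POSITIONS 200000
#define WARM_PASSES   100

struct position { int dummy; };                   /* placeholder type */
extern struct position positions[NUM_POSITIONS];  /* read from the test file */
extern int dblookup(const struct position *p);    /* placeholder db probe */

static void run_benchmark(void)
{
    clock_t start;
    int i, pass;

    /* Pass 1: cold cache, so the time is dominated by disk reads, seeks
     * and the locality of reference in the db. */
    start = clock();
    for (i = 0; i < NUM_POSITIONS; i++)
        dblookup(&positions[i]);
    printf("cold pass:       %.1f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);

    /* Passes 2..101: everything should now be cached in RAM, so this times
     * the non-I/O work: indexing, binary search, LRU maintenance, decompression. */
    start = clock();
    for (pass = 0; pass < WARM_PASSES; pass++)
        for (i = 0; i < NUM_POSITIONS; i++)
            dblookup(&positions[i]);
    printf("100 warm passes: %.1f s\n", (double)(clock() - start) / CLOCKS_PER_SEC);
}
```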

I have emailed you the file containing the positions. I ran the test several times using different cache sizes. You can change the CACHE_SIZE macro at the top of the C source file to do this, assuming your driver allows this setting to be changed.

-- Ed

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Sat Jul 25, 2009 14:59

Ed Gilbert wrote:The file was built the way you suggested. A single practice game was run, and every endgame db lookup position was logged to a file during the game. Many of the positions are repeated several times, because they were not in the hashtable and that's what happened during the search. So I think the test using this file is a good simulation of real db lookup performance.

The first pass of the test measures the time to look up the values of the positions when presumably many of them are not in cache and have to be loaded from disk. Although this is partly measuring your disk speed, the speed is affected by how much data you need to read to load a single position, how many seeks are required, and how good the locality of reference is in your db. I.e., when you read a block of positions to get one value, how many other positions in that block were subsequently looked up.

The second part of the test measures how long it takes to make 100 additional passes on the test file. At that point the positions should all be cached in ram, so it is measuring the speed of non-I/O parts of lookup. This is where the times for the binary search and the LRU maintenance should show up. I found it useful for comparing several different compression schemes that I was evaluating.

I have emailed you the file containing the positions. I ran the test several times using different cache sizes. You can change the CACHE_SIZE macro at the top of the C source file to do this, assuming your driver allows this setting to be changed.

-- Ed
Thank you Ed for this interesting file, which I have just received; it will surely help me to compare different possible implementations.
Gérard

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sun Jul 26, 2009 17:25

In the (absolutely nonsensical, but very challenging) apples versus non-apples competition, I have now reduced the time to calculate the 4K - 2K and 2K - 4K databases to 1114.0 seconds (on a Q6600).
So I crossed the 500 KPos/sec threshold.

As I have some ideas/options to further improve, I guess the 600 KPos/sec limit is also within reach.

For those who missed previous messages: the time is based on non-compressed databases, a single-core implementation, no symmetry/mirror principles, and no "intelligent" cache mechanism (so all required sub-databases are fully loaded on demand).

I'll keep you posted.....

Bert
