EndGame Databases

Discussion about the development of draughts in the age of computers and the Internet.
TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Sun Jul 26, 2009 22:07

Ed Gilbert wrote:The first pass of the test measures the time to look up the values of the positions when presumably many of them are not in cache and have to be loaded from disk. Although this is partly measuring your disk speed, the speed is affected by how much data you need to read to load a single position, how many seeks are required, and how good the locality of reference is in your db. I.e., when you read a block of positions to get one value, how many other positions in that block were subsequently looked up.

The second part of the test measures how long it takes to make 100 additional passes on the test file. At that point the positions should all be cached in ram, so it is measuring the speed of non-I/O parts of lookup. This is where the times for the binary search and the LRU maintenance should show up. I found it useful for comparing several different compression schemes that I was evaluating.
-- Ed
Though I understand the meaning of the second part of the test, I do not see the interest of this test in practice. As soon as you reach a position with 7 pieces or fewer, that means the previous position P is a capture position. As a consequence, all successors of P are in the db and P can be completely resolved. If you then put P in the hashtable with the information that the position is resolved, P becomes a leaf for the next tree exploration and you will not reach the db. With such an implementation, what could be the interest of this second test?
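A minimal sketch of this idea in C++, with hypothetical helpers (generate_moves, make_move, egdb_lookup, tt_store) standing in for an engine's real routines:

    // If every successor of P is in the endgame db (P is a capture into
    // 7 pieces or fewer), resolve P once, store the exact value in the
    // hash table, and treat P as a leaf in later tree searches.
    enum Value { LOSS = -1, DRAW = 0, WIN = 1 };

    Value resolve_via_db(const Position& P) {
        Value best = LOSS;
        for (const Move& m : generate_moves(P)) {
            Position succ = make_move(P, m);
            Value v = Value(-egdb_lookup(succ));   // negamax: flip side to move
            if (v > best) best = v;
        }
        tt_store(P, best);   // exact value: P is resolved for good
        return best;
    }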
Gérard

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sun Jul 26, 2009 22:14

Ed,

I am still trying to understand your DB principles.
As far as I have understood, you only load 4K blocks for the already pre-calculated databases (which are stored on disk).
Does this also imply that you use a compression scheme which generates these 4K block boundaries?

Are the "to be calculated" databases then fully loaded into memory, or do you also apply "swapping" for these?

Bert

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Mon Jul 27, 2009 00:02

Though I understand the meaning of the second part of the test, I do not see the interest of this test in practice. As soon as you reach a position with 7 pieces or fewer, that means the previous position P is a capture position. As a consequence, all successors of P are in the db and P can be completely resolved. If you then put P in the hashtable with the information that the position is resolved, P becomes a leaf for the next tree exploration and you will not reach the db. With such an implementation, what could be the interest of this second test?
The second part of the test measures the non-IO speed of the db lookup. If you are evaluating several different schemes for compression/decompression, or cache maintenance, it can be useful to know how the non-IO speeds compare. One of the things I tested used Huffman codes for compression. This resulted in the smallest db sizes, but the non-IO portion of lookup speed was 3x slower than the other schemes I tested. This helped me decide not to ultimately use the Huffman compressor. Today you can easily outfit a computer with 24gb of ram and hold the entire 7pc db in ram, so this is not an irrelevant statistic. Also, even if your cache is smaller than the total db size (which is more typical), if most of your lookups are cache hits then there is little IO and your lookup speed can be dominated by the non-IO time.
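For illustration only, a rough C++ harness for this two-part test; Position, load_test_positions() and egdb_lookup() are placeholders, not Ed's code:

    #include <chrono>
    #include <cstdio>
    #include <vector>

    // One cold pass (dominated by disk I/O), then 100 warm passes in which
    // every position is already cached, so only the non-I/O part of the
    // lookup (indexing, binary search, decompression, LRU upkeep) is timed.
    void run_lookup_benchmark(const char* testfile) {
        std::vector<Position> positions = load_test_positions(testfile);
        using clock = std::chrono::steady_clock;

        auto t0 = clock::now();
        for (const Position& p : positions)
            egdb_lookup(p);                        // cold: blocks come from disk
        auto t1 = clock::now();

        for (int pass = 0; pass < 100; ++pass)     // warm: everything in cache
            for (const Position& p : positions)
                egdb_lookup(p);
        auto t2 = clock::now();

        std::printf("cold pass %.1f s, 100 warm passes %.1f s\n",
                    std::chrono::duration<double>(t1 - t0).count(),
                    std::chrono::duration<double>(t2 - t1).count());
    }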

About the other point -- even if a position is in the db, are you not still interested in its successors? In my case not all egdb positions are leaf nodes in the search. There are good reasons to keep descending into egdb successors. In drawn positions, you may want to find the draw that has the best heuristic value (although I don't do that yet). In won or lost positions, you may need to continue the search to find the shortest path to a conversion, or try to see a path all the way to the end of the game.

-- Ed
Last edited by Ed Gilbert on Mon Jul 27, 2009 00:17, edited 1 time in total.

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Mon Jul 27, 2009 00:09

Ed,

I am still trying to understand your DB principles.
As far as I have understood, you only load 4K blocks for the already pre-calculated databases (which are stored on disk).
Does this also imply that you use a compression scheme which generates these 4K block boundaries?

Are the "to be calculated" databases then fully loaded into memory, or do you also apply "swapping" for these?

Bert
I use a form of run-length compression in which all the positions of a given subdivision (and therefore db file) are stored as one large contiguous block of data. The Grimminck compressed format is similar in that respect. Whenever I look up a position and that position has to be read from disk, I never read anything other than the one 4k block which holds the position of interest.
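As a rough illustration of that access pattern (invented names, not the actual kingsrow code): given a table of the first position index stored in each 4 KB block, a lookup seeks to exactly one block, reads it, and decodes the run-length codes only from the start of that block:

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <vector>

    static const std::size_t BLOCK_SIZE = 4096;

    // Hypothetical run-length decoder: walks the codes in buf[] until it
    // reaches the position lying 'offset' positions past the block start.
    int decode_runlength(const char* buf, std::size_t len, uint32_t offset);

    int lookup_on_disk(std::ifstream& db, const std::vector<uint32_t>& block_start,
                       uint32_t pos_index) {
        // Last block whose first position index is <= pos_index.
        std::size_t blk = std::upper_bound(block_start.begin(), block_start.end(),
                                           pos_index) - block_start.begin() - 1;
        char buf[BLOCK_SIZE];
        db.seekg(std::streamoff(blk * BLOCK_SIZE));   // one seek ...
        db.read(buf, BLOCK_SIZE);                     // ... and one 4 KB read
        return decode_runlength(buf, std::size_t(db.gcount()),
                                pos_index - block_start[blk]);
    }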

-- Ed

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Mon Jul 27, 2009 18:17

Are the "to be calculated" databases then fully loaded into memory, or do you also apply "swapping" for these?
Bert, yes, the db being built is kept in ram using 2 bits per position for 'complete' dbs and 4 bits per position for 'incomplete' ones such as 5 men vs. 4 men.
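For illustration, the kind of 2-bit packing this implies: four win/loss/draw values per byte, indexed directly by position number. The exact encoding kingsrow uses is not specified here, so this layout is only an assumption:

    #include <cstdint>
    #include <vector>

    enum : uint8_t { UNKNOWN = 0, WIN = 1, LOSS = 2, DRAW = 3 };  // 2-bit codes

    struct WdlArray {
        std::vector<uint8_t> bytes;                    // (npos + 3) / 4 bytes

        explicit WdlArray(uint64_t npos) : bytes((npos + 3) / 4, 0) {}

        uint8_t get(uint64_t i) const {
            return (bytes[i >> 2] >> ((i & 3) * 2)) & 3;
        }
        void set(uint64_t i, uint8_t v) {
            unsigned shift = unsigned(i & 3) * 2;
            bytes[i >> 2] = uint8_t((bytes[i >> 2] & ~(3u << shift)) | (v << shift));
        }
    };
    // A 4-bit variant (two positions per byte) works the same way and leaves
    // room for the extra states an 'incomplete' slice such as 5 men vs. 4 men needs.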

-- Ed

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Mon Jul 27, 2009 20:39

Ed,

So is the DB being built kept completely in RAM, or do you also apply caching here?
For the 8p databases, some slices can become quite big, I guess.
What was the maximum slice size you had (in positions and memory)?

To make slices smaller I included the most advanced rank of the men.
I think you did the same.
I need to re-read your paper on the internet, but I think you added an additional constraint for further reduction?

Bert

Ed Gilbert
Posts: 859
Joined: Sat Apr 28, 2007 14:53
Real name: Ed Gilbert
Location: Morristown, NJ USA
Contact:

Post by Ed Gilbert » Mon Jul 27, 2009 21:46

Ed,

So is the DB being built kept completely in RAM, or do you also apply caching here?
For the 8p databases, some slices can become quite big, I guess.
What was the maximum slice size you had (in positions and memory)?

To make slices smaller I included the most advanced rank of the men.
I think you did the same.
I need to re-read your paper on the internet, but I think you added an additional constraint for further reduction?

Bert
The db being built is always in ram. Except for the 4k-4k and 5k-3k slices, all the other subdivisions fit in 1.4gb or less. To build the subdivisions with all kings I used mirror symmetry and also used all 8gb of ram in the quad-core, so there was no parallel building during those first few days. After that, each instance is given ~1.7gb of ram and all cores build in parallel.

To make the subdivisions small I used an indexing scheme that subdivides by more than just piece counts and leading ranks. There is a detailed description of it here: http://pages.prodigy.net/eyg/Checkers/10-pieceBuild.htm

-- Ed

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Tue Jul 28, 2009 11:24

Next episode in the apples versus non-apples competition: I have now reduced the time to calculate the 4K-2K and 2K-4K databases to 933.3 seconds (on a Q6600).
So I crossed the 600 KPos/sec threshold.

I now have really limited ideas/options for further improvement, so I guess the 700 KPos/sec limit is out of reach for the moment (although the i7 will achieve it).

For those who missed the previous messages: the time is based on non-compressed databases, a single-core implementation, no symmetry/mirror principles, and no "intelligent" cache mechanisms (all required sub-databases are fully loaded on demand).

Keep you posted.....

Bert

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Tue Jul 28, 2009 12:52

Hi Bert,
BertTuyt wrote:Next episode in the apples versus non-apples competition: I have now reduced the time to calculate the 4K-2K and 2K-4K databases to 933.3 seconds (on a Q6600).
So I crossed the 600 KPos/sec threshold.
Bert
Be careful not to optimise this db too much. Most of the positions in this particular db are draws, and an algorithm optimised for finding the draws may not be optimised for finding the wins. In addition, there are no men here, whereas the majority of the positions of the complete db have men.
In any case, that is a very good result, Bert.
Gérard

BertTuyt
Posts: 1592
Joined: Wed Sep 01, 2004 19:42

Post by BertTuyt » Sat Aug 01, 2009 16:23

Ed/Gérard, I have a question related to the DB cache implementation.

So far (with the 6p databases) I preload all the databases during start-up, and all are directly accessible in main memory.

Although I also do not use a cache for the 7p database generation (so far), it is evident that I need to implement a cache mechanism in my program Damage.

So far I understood that you use a doubly linked list of 4K data blocks.
I assume that you use the double link to easily replace the least-recently-used data block.

What I still do not completely understand is how you find a specific block number based on an index.

In the preloaded implementation, I have a list of memory pointers where I can find the location of a specific database (based on white/black man/king counts). The index for every specific database starts at 0.

When using a cache, I think I need a unique index for a position (so that the specific database does not have to be tracked separately). Do you do something similar, or do you keep track of the specific database information (like man count) for all blocks?

If you only use the doubly linked list, you have to go linearly through the blocks, which is quite time consuming.

On the other hand, if you keep a parallel sorted list (sorted on index) with the associated block numbers, the sorting will cost quite some time.

So I assume I overlooked something trivial...

Bert

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Sat Aug 01, 2009 17:45

Hi Bert,
BertTuyt wrote:Ed/Gérard, I have a question related to the DB cache implementation.

So far (with the 6p databases) I preload all the databases during start-up, and all are directly accessible in main memory.

Although I also do not use a cache for the 7p database generation (so far), it is evident that I need to implement a cache mechanism in my program Damage.

So far I understood that you use a doubly linked list of 4K data blocks.
I assume that you use the double link to easily replace the least-recently-used data block.

What I still do not completely understand is how you find a specific block number based on an index.

In the preloaded implementation, I have a list of memory pointers where I can find the location of a specific database (based on white/black man/king counts). The index for every specific database starts at 0.

When using a cache, I think I need a unique index for a position (so that the specific database does not have to be tracked separately). Do you do something similar, or do you keep track of the specific database information (like man count) for all blocks?

If you only use the doubly linked list, you have to go linearly through the blocks, which is quite time consuming.

On the other hand, if you keep a parallel sorted list (sorted on index) with the associated block numbers, the sorting will cost quite some time.

So I assume I overlooked something trivial...

Bert
First of all, you do not necessarily need a db cache for the 7-piece endgame db. If you have a little more than 20Gb of memory available on your computer, you can put the whole db in memory as you did for the 6-piece db. If you do not have this memory available, or if you anticipate building the 8-piece db, then you need such a cache.

Concerning the handling of the cache, you have to know that Ed and I have very different approaches, with advantages and drawbacks for each. In my implementation I do not have a fixed size for a block, I do not have links (single or double) between blocks, and I do not handle an LRU mechanism. BTW, I am working on some new improvements that I will program and test in the coming months, before beginning the generation of the 8-piece db.
Gérard

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Post by Rein Halbersma » Sat Aug 01, 2009 19:12

BertTuyt wrote:Ed/Gérard, I have a question related to the DB cache implementation.

So far (with the 6p databases) I preload all the databases during start-up, and all are directly accessible in main memory.

Although I also do not use a cache for the 7p database generation (so far), it is evident that I need to implement a cache mechanism in my program Damage.

So far I understood that you use a doubly linked list of 4K data blocks.
I assume that you use the double link to easily replace the least-recently-used data block.

What I still do not completely understand is how you find a specific block number based on an index.

In the preloaded implementation, I have a list of memory pointers where I can find the location of a specific database (based on white/black man/king counts). The index for every specific database starts at 0.

When using a cache, I think I need a unique index for a position (so that the specific database does not have to be tracked separately). Do you do something similar, or do you keep track of the specific database information (like man count) for all blocks?

If you only use the doubly linked list, you have to go linearly through the blocks, which is quite time consuming.

On the other hand, if you keep a parallel sorted list (sorted on index) with the associated block numbers, the sorting will cost quite some time.

So I assume I overlooked something trivial...

Bert
Bert, take a look at the Chinook db access code: http://games.cs.ualberta.ca/~chinook/databases/code.c

For the larger databases (>6 pieces), they have for every database subdivision (a wm, wk, bm, bk, wr, br tuple) a plain text index file with, for every 4K block of data in the database, the 32-bit index of the position that starts the block. Suppose you build the full 8 pc databases and ultimately have a 512 Gb database on disk. That means 128M blocks, and at 4 bytes per starting index about 512 Mb of indexing data to be in RAM at all times. Suppose you have enough memory left to use 16Gb for caching. That means that you can cache 4M blocks of 4K each. During a search, once you have computed the index of the current position, you can do a binary search on the indexing data (29 steps) to find the location on disk, and then read in the specific 4K block that contains the value of the current position.

Your question is how to locate this block in RAM if it is already among the cached blocks? For this, you need another array of 128M entries, this time with 4-byte indices into the array of cached 4K blocks. That means you have another 512 Mb of indexing data to be in RAM at all times. After every disk read of a 4K block not in cache, you simply update the location of the block in the cache array, using the LRU entry (that's where the double-linked list comes in) as the one to replace.
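A compact C++ sketch of that bookkeeping, using the sizes quoted above; all names are invented and read_block_from_disk() is a placeholder for the actual file access:

    #include <cstdint>
    #include <vector>

    static const uint32_t NONE = 0xffffffffu;
    static const std::size_t BLOCK = 4096;

    void read_block_from_disk(uint32_t block_no, unsigned char* buf);  // placeholder

    struct Slot {
        uint32_t block_no = NONE;          // which global 4K block lives here
        uint32_t prev = NONE, next = NONE; // doubly linked LRU list (slot numbers)
        unsigned char data[BLOCK];         // the cached 4 KB of compressed data
    };

    struct BlockCache {
        std::vector<uint32_t> slot_of;  // per block on disk: cache slot or NONE
        std::vector<Slot>     slots;    // e.g. 4M slots for a 16 Gb cache
        uint32_t head, tail;            // most ... least recently used

        BlockCache(uint64_t blocks_on_disk, uint32_t nslots)
            : slot_of(blocks_on_disk, NONE), slots(nslots) {
            for (uint32_t s = 0; s < nslots; ++s) {       // initial LRU order
                slots[s].prev = (s == 0) ? NONE : s - 1;
                slots[s].next = (s + 1 == nslots) ? NONE : s + 1;
            }
            head = 0; tail = nslots - 1;
        }

        void move_to_front(uint32_t s) {                  // mark slot s as MRU
            if (s == head) return;
            if (slots[s].prev != NONE) slots[slots[s].prev].next = slots[s].next;
            if (slots[s].next != NONE) slots[slots[s].next].prev = slots[s].prev;
            if (s == tail) tail = slots[s].prev;
            slots[s].prev = NONE; slots[s].next = head;
            slots[head].prev = s; head = s;
        }

        unsigned char* fetch(uint32_t block_no) {
            uint32_t s = slot_of[block_no];
            if (s == NONE) {                              // miss: evict LRU tail
                s = tail;
                if (slots[s].block_no != NONE) slot_of[slots[s].block_no] = NONE;
                read_block_from_disk(block_no, slots[s].data);
                slots[s].block_no = block_no;
                slot_of[block_no] = s;
            }
            move_to_front(s);                             // hit or miss: now MRU
            return slots[s].data;
        }
    };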

In total you have 1 Gb of RAM overhead for 16Gb of caching. Note that this first overhead is proportional to the size of the db on disk, not to the size of the cache! However, there is also the overhead of the double linked list, and more importantly, of a mini-cache of 64 4-byte indices of 64-byte mini-blocks per cached 4K block (this is used to speed up locating the position within the 4K block). That stuff is about 2*4 bytes for the list pointers + 256 bytes for the mini-cache, so about 1/16 of the 4K data block itself, so another 1 Gb in total for 16Gb of cached blocks. This overhead, however, scales with the number of cached blocks, i.e. with the cache size. But still, 18 Gb of memory to have 16 Gb in cache is not bad.
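And a sketch of the per-block mini-cache, with an assumed layout: for each cached 4 KB block, mini_start[k] holds the index of the first position encoded in its k-th 64-byte mini-block, so locating a position costs a 6-step binary search plus decoding at most 64 bytes instead of decoding from the start of the whole block:

    #include <cstdint>

    struct CachedBlock {
        unsigned char data[4096];
        uint32_t mini_start[64];   // 256 bytes of overhead per cached block
    };

    // Returns the mini-block whose first position index is the largest one
    // <= pos_index; decoding then starts at byte offset 64 * k within data[].
    uint32_t mini_block_for(const CachedBlock& b, uint32_t pos_index) {
        uint32_t lo = 0, hi = 63;
        while (lo < hi) {
            uint32_t mid = (lo + hi + 1) / 2;
            if (b.mini_start[mid] <= pos_index) lo = mid; else hi = mid - 1;
        }
        return lo;
    }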

This is how I understand the Chinook code, and also from what I have learned from correspondence with Ed and his online descriptions of building the checkers 10 pc and draughts 8 pc dbs.

TAILLE
Posts: 968
Joined: Thu Apr 26, 2007 18:51
Location: FRANCE

Post by TAILLE » Sat Aug 01, 2009 19:28

Rein Halbersma wrote:For the larger databases (>6 pieces), they have for every database subdivision (a wm, wk, bm, bk, wr, br tuple) a plain text index file with, for every 4K block of data in the database, the 32-bit index of the position that starts the block. Suppose you build the full 8 pc databases and ultimately have a 512 Gb database on disk. That means 128M blocks, and at 4 bytes per starting index about 512 Mb of indexing data to be in RAM at all times.
How do you manage to have only 4 bytes for the starting index in the case of the 8-piece db? Each slice of this db has far more than 4G positions.
Gérard

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Post by Rein Halbersma » Sat Aug 01, 2009 19:32

TAILLE wrote:
Rein Halbersma wrote:For the larger databases (>6 pieces), they have for every database subdivision (a wm, wk, bm, bk, wr, br tuple) a plain text index file with, for every 4K block of data in the database, the 32-bit index of the position that starts the block. Suppose you build the full 8 pc databases and ultimately have a 512 Gb database on disk. That means 128M blocks, and at 4 bytes per starting index about 512 Mb of indexing data to be in RAM at all times.
How do you manage to have only 4 bytes for the starting index in the case of the 8-piece db? Each slice of this db has far more than 4G positions.
This is how Ed describes it:
http://pages.prodigy.net/eyg/Checkers/10-pieceBuild.htm

"subdivide each slice into fixed size subdivisions of size 2^32. To find which subdivision of a slice a position is in, the 64-bit slice index is computed for the position, and this number is divided by 2^32. The integer quotient identifies which subdivision contains the index, and the 32-bit remainder from the division becomes the index of the position within that subdivision."

So you use 64-bit indices during building, but 32-bit indices in the index file.
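Written out as code, that split is just a shift and a mask (an illustrative sketch, not Ed's code):

    #include <cstdint>

    struct SubdivIndex {
        uint32_t subdivision;   // which fixed-size 2^32 chunk of the slice
        uint32_t index;         // 32-bit index of the position within it
    };

    SubdivIndex split_slice_index(uint64_t slice_index) {
        return { uint32_t(slice_index >> 32),            // integer quotient
                 uint32_t(slice_index & 0xffffffffu) };  // 32-bit remainder
    }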

Rein Halbersma
Posts: 1722
Joined: Wed Apr 14, 2004 16:04
Contact:

Post by Rein Halbersma » Sat Aug 01, 2009 19:41

The LRU cache scheme that Ed & Chinook use has the following scaling properties (with 4K blocks and 4-byte addresses):

total Gb RAM needed = [Gb on disk]/512 + [Gb LRU cache] * (1 + 1/16)

If one uses 64-bit addressing everywhere, the formula becomes

total Gb RAM needed = [Gb on disk]/256 + [Gb LRU cache] * (1 + 1/8)

For the example I gave before, the total overhead would be 4 Gb for 16 Gb of LRU cache and 512 Gb on disk. With RAM prices as low as they are, one might choose to throw in the extra 2 Gb just to simplify some indexing formulas.
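A quick check of both formulas with the numbers from this example (512 Gb on disk, 16 Gb of LRU cache), as a small C++ snippet:

    #include <cstdio>

    int main() {
        double disk = 512.0, cache = 16.0;                   // Gb
        double ram32 = disk / 512 + cache * (1 + 1.0 / 16);  // 4-byte addresses
        double ram64 = disk / 256 + cache * (1 + 1.0 / 8);   // 8-byte addresses
        std::printf("4-byte: %.0f Gb total (%.0f Gb overhead)\n", ram32, ram32 - cache);
        std::printf("8-byte: %.0f Gb total (%.0f Gb overhead)\n", ram64, ram64 - cache);
        // prints 18 Gb total (2 Gb overhead) and 20 Gb total (4 Gb overhead)
    }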
