I suspect that the results are an effect of the Windows system cache and its subsequent caching behavior.
Maybe I should get into the details of CreateFile (for example, use the FILE_FLAG_NO_BUFFERING flag, or something similar).
Bert, this is no doubt the reason that it went so quickly the second time. I use the NO_BUFFERING flag (always, not just for this benchmark) because it is inefficient for both my driver and Windows to cache this large data set. With the NO_BUFFERING flag set I get repeatable results every time I run the benchmark. IIRC there are some restrictions when you use this flag, like your buffers having to be aligned on 4k boundaries, but this is easy to achieve using VirtualAlloc, and the 4k size is perfect for this application.
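Roughly, the open and read look like this (a simplified, self-contained sketch rather than my actual driver code; the file name and the block offset are just placeholders):

    #include <windows.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096   /* db cache block size */

    int main(void)
    {
        /* Bypass the Windows system cache; the driver does its own caching. */
        HANDLE h = CreateFileA("dbfile.bin", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL | FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) {
            printf("open failed: %lu\n", GetLastError());
            return 1;
        }

        /* VirtualAlloc returns page-aligned (4k) memory, which satisfies the
           buffer alignment restriction that FILE_FLAG_NO_BUFFERING imposes. */
        void *buf = VirtualAlloc(NULL, BLOCK_SIZE, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);

        /* With no buffering, reads must start on a sector boundary and be a
           multiple of the sector size; whole 4k blocks satisfy both on
           typical drives. */
        LARGE_INTEGER offset;
        offset.QuadPart = 0;     /* placeholder: block_number * BLOCK_SIZE */
        SetFilePointerEx(h, offset, NULL, FILE_BEGIN);

        DWORD bytes_read = 0;
        ReadFile(h, buf, BLOCK_SIZE, &bytes_read, NULL);
        printf("read %lu bytes\n", bytes_read);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }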
First Pass: 245.9 sec
Next 100 passes: 16.3 sec
I used a cache size of 256K 4KByte blocks, so the cache is not fully loaded after processing the 233,879 positions from the test set.
Ok, so your times should be compared to my "1GB" numbers. I think you said that you are not preloading your cache buffers at program startup, so they are all empty. That explains why your time for the first pass is so long. I preload the buffers with positions that are in general often arrived at during typical games. This makes a big difference as you can see. BTW, although there are more than 200k positions in the test file, the benchmark program only reads the first 200k of them.
Was this test run on your Q6600 or on the i7?
Ed, how many blocks does the 7GB cache contain? I assume that when you have 256K 4KByte blocks (and you only load 4KByte per cache read), a further increase in cache size does not yield any further speed increase, as we "only" process fewer than 256K positions.
In fact, even when I am using only 500MB of memory for the endgame db driver, which is less than 200k cache blocks, all the positions are in cache after the first pass. This is because of the locality of reference of the data: many of the positions differ from one another in only small details, and when you load a 4k block for a particular position you are also bringing into cache many other positions that will be looked up.
Another reason is that there are a lot of repeated positions in the test data. I simply logged every position as it was sent to the endgame db driver for a lookup, and did not eliminate duplicates. There was no reason to do so, since these are the actual positions as they occurred during the search. Not all positions are stored in the hashtable, so many positions get looked up multiple times.
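To illustrate why so few disk reads are needed, here is a small self-contained sketch (the direct-mapped cache, the sizes, and the offset arithmetic are simplifications for illustration, not the real driver): lookups whose byte offsets fall in the same 4k block, whether nearby or exact repeats, cost only one disk read.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096
    #define NUM_SLOTS  8                       /* tiny cache, just for illustration */

    static uint8_t cache_data[NUM_SLOTS][BLOCK_SIZE];
    static int64_t cache_tag[NUM_SLOTS];       /* which file block each slot holds; -1 = empty */
    static long    disk_reads = 0;

    /* Stand-in for a 4k ReadFile from the db file. */
    static void read_block_from_disk(int64_t block, uint8_t *dest)
    {
        memset(dest, 0, BLOCK_SIZE);
        disk_reads++;
    }

    /* Look up the db value stored at a given byte offset in the file.
       Positions that differ only in small details map to nearby offsets,
       so they share a block. */
    static uint8_t lookup(int64_t byte_offset)
    {
        int64_t block = byte_offset / BLOCK_SIZE;
        int slot = (int)(block % NUM_SLOTS);   /* direct-mapped for brevity */
        if (cache_tag[slot] != block) {
            read_block_from_disk(block, cache_data[slot]);
            cache_tag[slot] = block;
        }
        return cache_data[slot][byte_offset % BLOCK_SIZE];
    }

    int main(void)
    {
        memset(cache_tag, 0xFF, sizeof cache_tag);    /* mark all slots empty */
        /* 1000 lookups over 200 nearby (and repeated) offsets: one disk read. */
        for (int i = 0; i < 1000; i++)
            lookup(100000 + (i % 200));
        printf("lookups: 1000, disk reads: %ld\n", disk_reads);
        return 0;
    }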
Another question, how many different cache-blocks do you need for the position set?
I don't know; I did not instrument that statistic.
And if you start with the 7GByte cache first (a fresh start), do you then also get the 1.0 second speed?
Yes.
-- Ed