Thanks very much Bert.

Thanks,
Sidiki
BertTuyt wrote: ↑Tue Dec 29, 2020 23:47
With some optimization in the AVX2 (256-bit) code (thanks also to Joost), I was able to increase the Damage search speed with NNUE to 6.0 MN/sec.
This yielded a small improvement in strength.
Unfortunately my processor does not support AVX-512, which could further improve SIMD performance.
A recent DXP match against Kingsrow (with the same settings) ended, from Kingsrow's perspective, in 31 wins and 127 draws, which yields an Elo difference of 69 (31 wins plus 127 draws out of 158 games is a 59.8% score, and -400*log10(1/0.598 - 1) ≈ 69).
So small steps....
Will keep you posted.
Bert
Hi Bert,
Please take a look at AVX-512:
Code:
#include <immintrin.h> // AVX-512 intrinsics; needs AVX512F + AVX512_VNNI
#include <stdint.h>

// Dot product of 32 signed 16-bit weights, accumulated on top of
// 16 signed 32-bit offsets, using the VNNI instruction vpdpwssd.
int vnni_dot_product(const int16_t* w1, const int16_t* w2, const int32_t* offset)
{
    __m512i v1 = _mm512_loadu_si512(w1);                // 32 x int16
    __m512i v2 = _mm512_loadu_si512(w2);                // 32 x int16
    __m512i v3 = _mm512_loadu_si512(offset);            // 16 x int32
    __m512i vresult = _mm512_dpwssd_epi32(v3, v1, v2);  // v3[i] += pairwise int16 dot
    return _mm512_reduce_add_epi32(vresult);            // horizontal sum to a scalar
}
Ed Gilbert wrote: ↑Fri Jan 15, 2021 12:51
Joost, very interesting, thanks for posting this update. Your results with floating point and non-incremental eval are already very good.
It wasn't clear to me: are your training positions labeled as WLD, or with an eval score from a quick search? Also, what time controls did you use for the matches used to measure performance?

Ed, I used positions labeled with the score of a 4-ply full-width search with pattern evaluation (without hash or pruning, just plain alpha-beta).
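For concreteness, the labeling search described here is just a plain fixed-depth alpha-beta. A minimal sketch, where evaluate() and successors() are hypothetical helpers standing in for the engine's pattern eval and move generator:

Code:
# Hypothetical helpers: evaluate(pos) returns the pattern-eval score from the
# side to move's point of view; successors(pos) returns the legal child positions.

INF = 10**9

def alpha_beta(pos, depth, alpha=-INF, beta=INF):
    # Plain fixed-depth negamax alpha-beta: no hash table, no forward
    # pruning, no quiescence -- just the bare search used for labeling.
    if depth == 0:
        return evaluate(pos)
    best = -INF  # a position without moves counts as lost for the side to move
    for child in successors(pos):
        score = -alpha_beta(child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # beta cutoff
    return best

# label = alpha_beta(position, depth=4)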
Joost Buijs wrote: ↑Wed Jan 13, 2021 09:11
In practice 1.4B positions are difficult to handle (one needs vast amounts of memory for this), so I used only 240M.

I have gotten distracted with many other interesting machine learning projects (non-draughts) that kind of had to happen right now before anything else. But I will soon start working again on my Keras/Tensorflow pipeline for Scan-like pattern evals.
Rein Halbersma wrote: ↑Fri Jan 15, 2021 22:00
Joost Buijs wrote: ↑Wed Jan 13, 2021 09:11
In practice 1.4B positions are difficult to handle (one needs vast amounts of memory for this), so I used only 240M.
I have gotten distracted with many other interesting machine learning projects (non-draughts) that kind of had to happen right now before anything else. But I will soon start working again on my Keras/Tensorflow pipeline for Scan-like pattern evals.

Maybe a coincidence, but I just told Bert this afternoon that I'm busy modifying my PyTorch DataLoader so that it can read chunks of data from disk. On my AMD PC with 128 GB of RAM I can load at most 625M positions in memory, because each position is 192 bytes. There is a way to store positions as 192 bits and translate them on the fly to 192 bytes when needed by the training pipeline, but in Python this will be slow. Reading chunks from disk seems easier, and with an SSD it will be fast enough.
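A minimal sketch of that chunked-reading idea as a PyTorch IterableDataset; the file name, chunk size, and the flat uint8 decode are placeholder assumptions, since the actual record layout isn't given here:

Code:
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

RECORD_BYTES = 192  # one position = 192 bytes, as described above

class ChunkedPositions(IterableDataset):
    """Streams fixed-size records from disk one chunk at a time,
    so only the current chunk is resident in RAM."""

    def __init__(self, path, positions_per_chunk=1_000_000):
        self.path = path
        self.chunk_bytes = positions_per_chunk * RECORD_BYTES

    def __iter__(self):
        with open(self.path, 'rb') as f:
            while True:
                chunk = f.read(self.chunk_bytes)
                if not chunk:
                    break
                # One row per position; splitting a row into features and
                # label depends on the actual record layout (not shown).
                rows = np.frombuffer(chunk, dtype=np.uint8).reshape(-1, RECORD_BYTES)
                for row in rows:
                    yield torch.from_numpy(row.copy())

# Usage (hypothetical file name); the DataLoader forms batches on the fly:
# loader = DataLoader(ChunkedPositions("positions.bin"), batch_size=4096)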
One improvement that is possible for any eval training pipeline (NNUE, patterns, whatever) using Keras/Tensorflow (or PyTorch, for that matter) is to stream the data from disk to the optimizer. That way you need much less RAM than the size of the entire database. E.g. for Kingsrow, Ed supplied me with a ~10 GB file. Loading that into memory and feeding it to Keras expanded it temporarily to ~48 GB of RAM during optimization. When done from disk in batches, it should be configurable to get this below ~16 GB, without much speed loss, for *arbitrarily* large files on disk.
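A sketch of what that streaming setup could look like with tf.data, which Keras consumes directly; the 192-byte record size is taken from Joost's post above, while the file name and the feature/label split are placeholder assumptions:

Code:
import tensorflow as tf

RECORD_BYTES = 192  # fixed-size records, per the description above

def parse(record):
    # Placeholder layout: first 188 bytes = position features,
    # last 4 bytes = int32 eval score; adjust to the real file format.
    x = tf.cast(tf.io.decode_raw(tf.strings.substr(record, 0, 188), tf.uint8), tf.float32)
    y = tf.cast(tf.io.decode_raw(tf.strings.substr(record, 188, 4), tf.int32), tf.float32)
    return x, tf.squeeze(y)

ds = (tf.data.FixedLengthRecordDataset("positions.bin", RECORD_BYTES)
      .shuffle(1_000_000)  # shuffles within a bounded buffer, not the whole file
      .map(parse, num_parallel_calls=tf.data.AUTOTUNE)
      .batch(4096)
      .prefetch(tf.data.AUTOTUNE))

# model.fit(ds, epochs=...)  # Keras pulls batches from disk; RAM use stays bounded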