In TF you can do pretty neat stuff, like directly streaming over gzip-compressed txt/csv files. You can also write a data preprocessing layer that applies a function to e.g. a FEN or a set of bitboards. Then everything is part of the model pipeline, and it becomes much easier to experiment (a sketch follows below the quotes).

Joost Buijs wrote: ↑Fri Jan 15, 2021 23:29
Maybe a coincidence, but I just told Bert this afternoon that I'm busy modifying my PyTorch DataLoader in such a way that it can read chunks of data from disk. On my AMD PC with 128 GB RAM I can load at most 625M positions in memory, because each position is 192 bytes. There is a way to load positions as 192 bits and translate them on the fly to 192 bytes when needed for the training pipeline, but in Python this will be slow. Reading a chunk from disk seems easier, and with an SSD it will be fast enough.

Rein Halbersma wrote: ↑Fri Jan 15, 2021 22:00
I have gotten distracted with many other interesting machine learning projects (non-draughts) that kind of had to happen right now before anything else. But I will soon start working again on my Keras/Tensorflow pipeline for Scan-like pattern evals.

Joost Buijs wrote: ↑Wed Jan 13, 2021 09:11
In practice 1.4B positions are difficult to handle (one needs vast amounts of memory for this), so I used only 240M.
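Here is a minimal sketch of the streaming + preprocessing idea, assuming gzip-compressed CSV files with lines of the form <fen>,<score>; fen_to_planes is a hypothetical parser that turns a FEN string into an input tensor (the real encoding depends on your network):

Code: Select all

import numpy as np
import tensorflow as tf

def fen_to_planes(fen_bytes):
    # Hypothetical: decode the FEN and fill an input tensor for the net.
    fen = fen_bytes.numpy().decode("utf-8")
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    # ... fill planes from the FEN string ...
    return planes

def parse_line(line):
    # Split "<fen>,<score>" and run the Python-side FEN parser.
    fen, score = tf.io.decode_csv(line, record_defaults=["", 0.0])
    planes = tf.py_function(fen_to_planes, [fen], tf.float32)
    planes.set_shape((12, 8, 8))
    return planes, score

dataset = (
    tf.data.TextLineDataset(["positions.csv.gz"], compression_type="GZIP")
    .map(parse_line, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(4096)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset, epochs=...) then streams straight from the compressed file.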
One improvement that is possible for any eval training pipeline (NNUE, patterns, whatever) using Keras/Tensorflow (or PyTorch, for that matter) is to stream the data from disk to the optimizer. That way, you need much less RAM than the size of the entire database. E.g. for Kingsrow, Ed supplied me with a ~10 GB file. Loading that into memory and feeding it to Keras expanded it temporarily to ~48 GB of RAM during optimization. When done from disk in batches, memory use should be configurable to stay below ~16 GB, without much speed loss, for *arbitrarily* large files on disk.
https://www.tensorflow.org/tutorials/load_data/csv
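And for fixed-size binary records like the 192-byte positions mentioned above, a sketch along the same lines with tf.data; decode_record and the byte layout are hypothetical placeholders for the real format. RAM use stays at roughly one shuffle buffer plus a batch, regardless of file size:

Code: Select all

import tensorflow as tf

RECORD_BYTES = 192  # one position per record

def decode_record(raw):
    # Hypothetical decoding: reinterpret the raw bytes and split into
    # features and a target; adapt the slicing to the real layout.
    data = tf.io.decode_raw(raw, tf.uint8)
    features = tf.cast(data[:191], tf.float32)
    target = tf.cast(data[191], tf.float32)  # placeholder layout
    return features, target

dataset = (
    tf.data.FixedLengthRecordDataset(["positions.bin"], record_bytes=RECORD_BYTES)
    .shuffle(1_000_000)  # ~192 MB buffer of raw records, not the whole file
    .map(decode_record, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(8192)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset) now only ever touches a sliding window of the file.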