I'll try your idea of totally separate hash tables per thread and see how performance goes.
You could also use some spinlocks for mutual exclusion with a single hashtable. A single spinlock would be bad because only one thread could access the hashtable at a time. But if you used say 100 locks, with each lock protecting 1/100th of the table nodes, then the hashtable would usually be available for every thread simultaneously.
-- Ed
Hi Ed,
It just happens that I received this book http://www.amazon.com/C-Concurrency-Act ... d_sxp_f_pt in the mail earlier this week. It's a terrific and really pedagogical introduction to the new C++11 threading standard. Section 6.3.1 contains a worked example of a hash table with a boost::shared_mutex per bucket (where each bucket contains a list of data entries). Each bucket can be concurrently read by an arbitrary number of threads, but only written to by a single thread. Of course, your example of partitioning the table in a 1000 segments would have similar effects.
I'll try your idea of totally separate hash tables per thread and see how performance goes.
You could also use some spinlocks for mutual exclusion with a single hashtable. A single spinlock would be bad because only one thread could access the hashtable at a time. But if you used say 100 locks, with each lock protecting 1/100th of the table nodes, then the hashtable would usually be available for every thread simultaneously.
-- Ed
I've tried spinlocks using _InterlockedCompareExchange128() x64 API using MSVC 2010 and it completely solves the problems and only costs about 14% of extra time. I've discarded the lock-less code now, which is no good for perft.
rein wrote:I think having a single hash table per core is the only way to guarantee correctness (barring memory errors of course).
murraycash wrote:I'll try your idea of totally separate hash tables per thread and see how performance goes.
No surprises that using totally separate hash tables per thread vs shared hash table, was magnitudes slower in performance, so with the interlocking read/write solution which guarantees correctness, I am about 14% slower than the original lock-less solution, using 4 threads.
Rein Halbersma wrote:I'm not sure if this old xor-trick (i.e. not using std::atomic_fetch_xor) is guaranteed to be portable and race-free by the C++11 Standard, or that it "only" happens to work on x86/x86-64 architectures or with a small number of threads.
Rein, I stand corrected. I agree the lock-less implementation does increase the chance of perft count errors and that for a given hash table size, increasing the number of threads accessing the hash table will further increase the chances of race conditions that could cause a random hash key (key ^ data1) ^ data2 when decoded, to be a false match.
I'll try your idea of totally separate hash tables per thread and see how performance goes.