Rein Halbersma wrote: Actually, I have a strong feeling that parallel search *will* scale to 50+ cores, but it requires advanced programming environments such as the one used by Cilk-chess. One problem that Cilk (currently available as a special branch in gcc maintained by Intel) solves, compared to most existing YBW implementations (such as the one in Stockfish), is the split-point handling and load-balancing over the various cores. The tree itself contains enough parallelism as search depth increases; it's "only" a matter of keeping all the processors busy. The Cilk scheduler does so in a provably optimal way (by randomized work-stealing rather than the deterministic work-pushing found in almost every hand-made implementation of YBW).
There are some technical obstacles because a Xeon Phi is a co-processor. The cilkplus.org website carries the following warning about this:
However, you need to be aware that Cilk Plus assumes that your application is executing in a single, unified address space. Tasks that are offloaded to the Xeon Phi coprocessor are executing in a different address space than the application running on the host processor, and the thread pools on the host processor and the Xeon Phi coprocessor are totally separate. Work cannot be stolen between the host processor and the Xeon Phi coprocessor.
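To make the warning concrete, here is a toy, single-threaded simulation of randomized work-stealing inside one pool. It is only a sketch under my own assumptions, not the Cilk Plus runtime: `run_pool`, the unit-sized tasks and the step-by-step model are all hypothetical. The point is that an idle worker can only pick a victim from its own pool, so a host pool and a coprocessor pool never balance load between each other.

```python
import random
from collections import deque


def run_pool(task_counts, seed=0):
    """Simulate randomized work-stealing inside ONE pool (one address space).

    task_counts[i] is the number of unit tasks initially queued on worker i.
    Returns (steps, executed), where executed[i] counts the tasks worker i ran.
    Hypothetical toy model, not the Cilk Plus scheduler.
    """
    rng = random.Random(seed)
    queues = [deque(range(n)) for n in task_counts]
    executed = [0] * len(queues)
    steps = 0
    while any(queues):
        steps += 1
        # Workers whose deque is empty at the start of the step will try to steal.
        idle = [w for w, q in enumerate(queues) if not q]
        # Busy workers run one task from the bottom of their own deque.
        for w, q in enumerate(queues):
            if q:
                q.pop()
                executed[w] += 1
        # Idle workers pick a random victim in the SAME pool and steal from the top.
        for w in idle:
            victim = rng.randrange(len(queues))
            if victim != w and queues[victim]:
                queues[w].append(queues[victim].popleft())
    return steps, executed


# Host pool: all work starts on one worker and spreads out by stealing.
host_steps, host_done = run_pool([40, 0, 0, 0])
# Coprocessor pool: a separate pool with nothing queued stays idle,
# because it cannot steal from the host pool.
phi_steps, phi_done = run_pool([0, 0, 0, 0])
print(host_steps, host_done, phi_steps, phi_done)
```

In this model the only way to give the second pool something to do is to place work in its queues explicitly, which corresponds to offloading work to the coprocessor by hand.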
The reason I am not convinced that parallel search performs well on 50+ cores is the set of test results that Bert has published so far in this thread. No doubt more nodes per second can be achieved with 50+ cores. To make good use of the extra nodes you might search deeper, or you might prune less to gain Elo. Or you do not increase the number of nodes, but instead spend more evaluation time per node. Finding a single (hardcoded) optimal configuration for the search and evaluation seems to be a daunting task. The optimal configuration might also depend on game type and game phase. Hence my interest in the idea of using multiple configurations in parallel and then arbitrating between them (no idea how to do this cleverly).
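As a starting point for discussion, here is a deliberately naive sketch of the "multiple configurations, then arbitrate" idea. Everything in it is an assumption for illustration: `toy_search` is a stand-in that returns canned (move, score) pairs, the configuration names are made up, and the arbitration rule (most votes, ties broken by the best reported score) is just the simplest thing I could think of, not a claim that it is the clever way. Note also that scores from different configurations may not be comparable. A real engine would run each search on its own core or process; the thread pool here only shows the structure.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def arbitrate(results):
    """Pick a move from (move, score) pairs: most votes, ties broken by best score."""
    votes = Counter(move for move, _ in results)
    best_score = {}
    for move, score in results:
        best_score[move] = max(score, best_score.get(move, score))
    return max(votes, key=lambda m: (votes[m], best_score[m]))


def search_all(position, configs, search):
    """Run one search per configuration concurrently, then arbitrate."""
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        results = list(pool.map(lambda cfg: search(position, cfg), configs))
    return arbitrate(results)


def toy_search(position, config):
    """Hypothetical placeholder for a real search: returns a canned (move, score)."""
    canned = {
        "deep": ("32-28", 0.10),       # searches deeper
        "wide": ("32-28", 0.05),       # prunes less
        "slow_eval": ("31-27", 0.30),  # spends more time per node
    }
    return canned[config]


chosen = search_all("start position", ["deep", "wide", "slow_eval"], toy_search)
print(chosen)
```

With these canned results two of the three configurations agree on a move, so the vote decides; the score only matters when the vote is tied.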