BertTuyt wrote: ↑Wed Mar 25, 2020 17:29
Also find out if I can improved the speed for the AMD (so replace PEXT).
Scan doesn't use PEXT. Not only it's crippled on AMD, as you just noticed, but it won't scale in a world that will increasingly use ARM (which has bit reverse BTW).
Here is an explanation that I sent to Rein a few years ago; hopefully you can make sense of it (fixed-width font mandatory):
---
It's easier with 4x4 patterns, as no special numbering is needed. It is about copying regions around by means of mask + shift.
This 4x8 region is already in place:
A B - - -
C D - - -
E F - - -
G H - - -
I J - - -
K L - - -
M N - - -
O P - - -
- - - - -
- - - - -
Each letter represents a bit; the rest is masked out. This region contains 3 4x4 patterns: A-H, E-L, and I-P.
Now this overlapping 4x8 one needs to be moved to the north-east (not an exact diagonal):
- - - - -
- - - - -
E F - - -
G H - - -
I J - - -
K L - - -
M N - - -
O P - - -
Q R - - -
S T - - -
The last pattern of the left 4 files appears: M-T.
By combination (OR), we now obtain this:
A B E F -
C D G H -
E F I J -
G H K L -
I J M N -
K L O P -
M N Q R -
O P S T -
- - - - -
- - - - -
The left 4 files are from the first region and the other 4, the second region. Now read the first two lines (consecutive bits). They contain the letters A-H in some permutation, plus a one-bit hole. This corresponds to one 4x4 pattern. The next two lines: E-L, another pattern. Same for I-P and M-T.
In other words, I exploit the regularity of 4x4 patterns.
There is an extra bit on every line; I have two things to say about this:
- it brings some hole in the index; that is acceptable
- we can see that there's not enough room for a third copy
The last point is why I need an extra bit on every rank for 4x6 patterns. With the 13x10 board, the left 4 files are:
A B - - - -
C D - - - - -
E F - - - -
G H - - - - -
I J - - - -
K L - - - - -
M N - - - -
O P - - - - -
Q R - - - -
S T - - - -
Only 3 patterns this time: A-L, E-P (optional), and I-T. If I make two copies with the same general idea I obtain this:
A B E F I J
C D G H K L -
E F I J M N
G H K L O P -
I J M N Q R
K L O P S T -
M N Q R - -
O P S T - - -
Q R - - - -
S T - - - -
Now read the first 3 groups of 2 lines each: A-L, E-P, and I-T. No hole, which is pretty much mandatory for large patterns.
I think this method is more or less isomorphic to magic multiplication with very few set bits. Multiplication seems overkill for only 2 or 3 bits, and I'm not even sure it is faster. Also I choose right shifts, which cannot be accomplished by that method. The idea also works with left shifts and therefore multiplication; I have checked.
Fabien.