The Perfect Shuffle Optimization
LLVM and the Perfect Shuffle
A shuffle is a specific type of instruction on many CPU architectures that re-arranges elements in short vectors. These are not permutations since we allow for the mappings to not be injective. Each one of these instructions has a specific latency cost. Determining the shortest max latency chain of microarchitecture specific shuffle for a specific “abstract shuffle” is the “Perfect Shuffle Problem”. Today, there is not a general solution to this problem, in fact, the only place in LLVM where perfect shuffles are attempted are in the ARM backend. In this article, I propose a solution to the perfect shuffle problem.
Representing Shuffles
In this document, we represent shuffles as NxN matrices with the condition that:
- Any row must contain exactly one element set to 1 and all other elements must be 0.
- Columns are allowed to have as many elements that are 1.
So for example the matrix:
\[\begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}\]is a valid shuffle, as is:
\[\begin{bmatrix} 1 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 0 & 0 & 1\\ 0 & 0 & 0 & 1 \end{bmatrix}\]