The big picture
The benefit of executing vector statements is hardware support. SSE/AVX/NEON can execute a vector statement nearly as efficiently as the same computation on a single element of the vector, which speeds up numerical applications.
However, if your application only spends a fraction of its time executing vector statements, the gain in speed will be insignificant.
Let's start
The idea is based on work done in these two sources: Exploiting superword level parallelism with multimedia instruction sets and Compiler optimizations for processors with SIMD instructions (see full references at the end).
To understand the following paragraphs the notion of a dependency graph might help. A dependency graph has a node for each instruction of a basic block or trace. If instruction A at position X depends on instruction B at position Y (Y < X), then B must be executed before A.
Using the graph it is possible to verify whether two instructions can be executed at the same time (neither depends on the other).
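To make the idea concrete, here is a tiny sketch in Python; the representation (a list of (target, sources) tuples) is made up for illustration and is not PyPy's actual data structure:

# one instruction per entry: (value it defines, values it uses)
trace = [
    ("bi", ["b", "i"]),   # 0: bi = load(b, i)
    ("ci", ["c", "i"]),   # 1: ci = load(c, i)
    ("ai", ["bi", "ci"]), # 2: ai = bi + ci
]

def depends_on(later, earlier):
    # the later instruction uses the value the earlier one defines
    target, _ = trace[earlier]
    _, sources = trace[later]
    return target in sources

def independent(x, y):
    # only direct dependencies are checked here; the real algorithm
    # also follows transitive edges
    return not depends_on(x, y) and not depends_on(y, x)

assert independent(0, 1)      # the two loads may run simultaneously
assert not independent(2, 0)  # the addition needs the loaded value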
Don't despair! The description below contains a lot of domain-specific terminology. It is not easy to describe the optimization technique without it, and if you are not into programming compilers/VMs you might have a hard time understanding it.
Single Instruction Multiple Data (SIMD)
As the name SIMD suggests, such an instruction applies the same operation to multiple data elements. SSE or AVX are examples of instruction set extensions that allow e.g. arithmetic to be executed on multiple data elements at once. The size of a vector register is usually 16 to 64 bytes, depending on which instruction set the x86 CPU supports.
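A quick back-of-the-envelope calculation in Python shows how many elements a single register can carry:

# bytes per vector register for common x86 extensions
register_bytes = {"SSE": 16, "AVX": 32, "AVX-512": 64}
element_bytes = 4  # one 32-bit float

for isa, size in register_bytes.items():
    print(isa, "holds", size // element_bytes, "32-bit floats")
# SSE holds 4, AVX holds 8, AVX-512 holds 16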
The classical approach
One of the best sources for data elements are (you guessed it) arrays. Loops that iterate over arrays are the target of vectorizing algorithms in compiler backends. Let's consider an example:
a, b, c = ... # arrays of 32-bit floats
for i in range(30):
    a[i] = b[i] + c[i]
This is one of the most basic examples to show the power of vectorization. If a vector register could hold 30 elements of 32-bit floats, the loop could be replaced with a single vector statement.
a(0:30) = b(0:30) + c(0:30) # array slice notation (elements 0 through 29)
If the hardware is able to execute 30 floating point operations at the same time, the loop finishes 30 times faster.
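In Python, this vector statement is what a whole-array operation in NumPy expresses; a small sanity check (NumPy is used here purely for illustration):

import numpy as np

b = np.random.rand(30).astype(np.float32)
c = np.random.rand(30).astype(np.float32)
a = np.empty(30, dtype=np.float32)

# the scalar loop, element by element
for i in range(30):
    a[i] = b[i] + c[i]

# the equivalent vector statement
assert np.allclose(a, b + c)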
That said, this is the theoretically optimal speedup. In reality things are not that simple, but vectorization can still yield a substantial speedup for a numerical application.
Preserving semantics
How does an optimizer transform such a loop? This is where dependencies between statements, both within one iteration and across iterations, come into play. Let's adapt the loop from the previous example:
for i in range(1, 29):
    a[i] = b[i-1] + c[i]  # S_1
    b[i] = a[i] * 2       # S_2
Let's construct the dependencies at the statement level and across iterations.
The solid dependency edge (denoted with delta) is a true dependency: a[i] is written in S_1 and used in the right-hand side expression of S_2. The dashed line is a loop-carried dependency: at iteration i > 1, the value of b[i-1] written in the previous iteration is read. Each step has a previous step that needs to be completed first, making this example impossible to vectorize (if the loop is unrolled once, it is easier to see that S_1 depends on S_2 of the previous iteration).
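Unrolling by hand makes that chain visible; a plain Python sketch with i = 1 picked for concreteness:

b = [1.0] * 30; c = [2.0] * 30; a = [0.0] * 30
i = 1
a[i]   = b[i-1] + c[i]  # S_1, iteration i
b[i]   = a[i] * 2       # S_2, iteration i   (reads a[i] written by S_1)
a[i+1] = b[i] + c[i+1]  # S_1, iteration i+1 (reads b[i] written by S_2)
b[i+1] = a[i+1] * 2     # S_2, iteration i+1

Each statement after the first reads a value written by the one directly before it, so no two of them can execute simultaneously. Let's relax the previous example: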
for i in range(1, 29):
    a[i] = b[i-1] + c[i]  # S_1
    b[i] = c[i] * 2       # S_2
Now S_2 does not depend on S_1, so it is possible to swap the two statements in the loop body and precompute all values of b[1:29] using a vector instruction, yielding the result:
b(1:29) = c(1:29) * 2 # S_2
a(1:29) = b(0:28) + c(1:29) # S_1
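A quick check (again with NumPy, purely for illustration) that the reordered, sliced version computes the same values as the scalar loop:

import numpy as np

c = np.random.rand(30).astype(np.float32)
b0 = np.random.rand(30).astype(np.float32)
a0 = np.zeros(30, dtype=np.float32)

# reference: the original scalar loop
a_ref, b_ref = a0.copy(), b0.copy()
for i in range(1, 29):
    a_ref[i] = b_ref[i-1] + c[i]  # S_1
    b_ref[i] = c[i] * 2           # S_2

# swapped and vectorized
a, b = a0.copy(), b0.copy()
b[1:29] = c[1:29] * 2             # S_2
a[1:29] = b[0:28] + c[1:29]       # S_1

assert np.allclose(a, a_ref) and np.allclose(b, b_ref)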
Vectorization on a loop basis needs a dependency graph that may contain cycles, and a way to find those cycles. Graph algorithms that operate on cyclic graphs often have worse complexity than their counterparts on acyclic graphs or trees.
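For completeness, here is a minimal depth-first cycle check of the kind a loop vectorizer needs; this is a generic textbook sketch, not the code of the branch:

def has_cycle(graph):
    # graph: node -> list of successor nodes
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / done
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph[node]:
            if color[succ] == GRAY:  # back edge: a cycle
                return True
            if color[succ] == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)

# S_1 -> S_2 (true dep.) plus S_2 -> S_1 (loop carried) is a cycle:
assert has_cycle({"S_1": ["S_2"], "S_2": ["S_1"]})
assert not has_cycle({"S_1": ["S_2"], "S_2": []})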
A new approach to vectorize basic blocks
The algorithm chosen and already (partly) implemented in the PyPy JIT backend must be able to operate on a trace. A trace represents a sequence of instructions that were executed quite frequently in the user program. Let's have a look at a trace that could have been generated from the first loop snippet (unrolled once):
# loop
for i in range(30):
    a[i] = b[i] + c[i]
# trace:
label(a, b, c, i)
j = i + 1
guard_true(j <= 30)
bi = load(b, i)   # S_3
ci = load(c, i)   # S_4
ai = bi + ci      # S_5
store(a, i, ai)   # S_6
k = j + 1
guard_true(k <= 30)
bj = load(b, j)   # S_9
cj = load(c, j)   # S_10
aj = bj + cj      # S_11
store(a, j, aj)   # S_12
jump(a, b, c, k)
The trick used to find vectorizable statements is very simple: we know that the indices i and j are used to load from and store into arrays, and that j = i + 1. Thus statements S_3 and S_9 access two elements that are adjacent in the array b.
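A sketch of this pairing step in Python; the (array, index) description of each memory operation and the offset table are made up for illustration:

# memory operations of the trace: name -> (array, index variable)
mem_ops = {
    "S_3": ("b", "i"), "S_4": ("c", "i"), "S_6": ("a", "i"),
    "S_9": ("b", "j"), "S_10": ("c", "j"), "S_12": ("a", "j"),
}
offset = {"i": 0, "j": 1}  # known, because the trace contains j = i + 1

def adjacent(x, y):
    # pair candidates access the same array at consecutive indices
    (arr_x, idx_x), (arr_y, idx_y) = mem_ops[x], mem_ops[y]
    return arr_x == arr_y and offset[idx_y] - offset[idx_x] == 1

pairs = [(x, y) for x in mem_ops for y in mem_ops if adjacent(x, y)]
print(pairs)  # [('S_3', 'S_9'), ('S_4', 'S_10'), ('S_6', 'S_12')]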
We record them as the pair (S_3,S_9). The same is true for (S_4,S_10) and (S_6,S_12). The last missing piece of the puzzle are the addition statements S_5 and S_11. Let's consider the dependency graph (slightly more complicated):
The sky blue lines from the load operations provide a lead to the two addition operations. The algorithm adds new pairs by following the definition-use and use-definition edges of the dependency graph, which yields the pair (S_5,S_11). Having found all pairs, the algorithm tries to merge pairs that are adjacent to each other. In the example above there is nothing to merge, but if the loop were unrolled once more there would be. The result of merging pairs is called a pack: given two pairs (a,b) and (b,c), they are merged into the pack (a,b,c).
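The merge step itself is small; a sketch under the assumption that packs are plain tuples of statement names:

def combine_packs(pairs):
    # merge packs that share an end: (a, b) + (b, c) -> (a, b, c)
    packs = [tuple(p) for p in pairs]
    changed = True
    while changed:
        changed = False
        for left in packs:
            for right in packs:
                if left is not right and left[-1] == right[0]:
                    packs.remove(left)
                    packs.remove(right)
                    packs.append(left + right[1:])
                    changed = True
                    break
            if changed:
                break
    return packs

print(combine_packs([("a", "b"), ("b", "c")]))  # [('a', 'b', 'c')]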
Operations that are independent and adjacent in memory have now been merged into packs. The last step is to schedule the instructions using the acyclic dependency graph (see picture above). Scheduling tries to emit grouped operations: for the pair (S_3,S_9) it schedules all other nodes until neither load has an unscheduled predecessor, then emits a single v_bi = vec_load(b, i, 2) instead of two separate load instructions. The resulting trace looks similar to:
label(a, b, c, i)
j = i + 1
guard_true(j <= 30)
k = j + 1
guard_true(k <= 30)
v_bi = vec_load(b, i, 2)
v_ci = vec_load(c, i, 2)
v_ai = vec_add(v_bi, v_ci, 2)
vec_store(a, i, v_ai, 2)
jump(a, b, c, k)
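The scheduler can be sketched as a topological walk that emits a whole pack as a single instruction once the dependencies of all its members are satisfied; the following is a strong simplification of what the branch does, with made-up names:

def schedule(deps, packs):
    # deps: op -> set of ops it depends on
    # packs: tuples of ops that are emitted as one vector instruction
    in_pack = {op: tuple(p) for p in packs for op in p}
    emitted, order = set(), []
    while len(emitted) < len(deps):
        for op in deps:
            if op in emitted:
                continue
            group = in_pack.get(op, (op,))
            # every member's dependencies must already be emitted
            if all(deps[member] <= emitted for member in group):
                order.append(group)  # becomes one (vector) instruction
                emitted.update(group)
                break
        else:
            raise ValueError("dependency cycle, cannot schedule")
    return order

deps = {"S_3": set(), "S_4": set(), "S_9": set(), "S_10": set(),
        "S_5": {"S_3", "S_4"}, "S_11": {"S_9", "S_10"},
        "S_6": {"S_5"}, "S_12": {"S_11"}}
packs = [("S_3", "S_9"), ("S_4", "S_10"), ("S_5", "S_11"), ("S_6", "S_12")]
print(schedule(deps, packs))
# [('S_3', 'S_9'), ('S_4', 'S_10'), ('S_5', 'S_11'), ('S_6', 'S_12')]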
If we are also able to get rid of the redundant index calculation and the weaker guard_true(j <= 30) (it is implied by guard_true(k <= 30)), the algorithm has made the loop finish twice as fast. This will be the subject of the next post.
Implementation progress
Things are partly working already and some problems have been solved. The PyPy backend in my vecopt branch can build dependency graphs of traces, unroll them, and apply the algorithm to the unrolled trace. It is thus able to group instructions and emit vector operations. In addition I adapted the x86 backend to emit SSE2 vector instructions, tested it in the test suite, and built a sample interpreter in which a loop of 32-bit integer arithmetic suddenly ran 4x faster (wow!).
Still, there is a lot to do this summer! I have started to test more complicated NumPy traces and am currently working on constant and variable expansion.
Summary
In this post I outlined how a classical ahead-of-time compiler vectorizes loops. This approach is not feasible for a tracing JIT: traces do not carry explicit loop information in the loop header, the dependency graph is cyclic so cycle checking is needed, and it is hard to apply loop distribution or loop fission to a trace (it is not easy to reconstruct the resume state of a guard in a custom-tailored trace).
Basic blocks are also suitable for vectorizing statements: even if the loop is not unrolled, the algorithm will find parallelism at the statement level if it is present. In fact, only a few tricks are needed to reschedule a trace in vectorized form: unroll the loop, find pairs, extend them to packs, and reschedule.
References
Larsen, Samuel, and Saman Amarasinghe. "Exploiting superword level parallelism with multimedia instruction sets." ACM SIGPLAN Notices 35.5 (2000).
Pryanishnikov, Ivan, Andreas Krall, and Nigel Horspool. "Compiler optimizations for processors with SIMD instructions." Software: Practice and Experience 37.1 (2007): 93-113.