PyPy Vectorization: PyPy unrolling trace loops

Note: Please see my last post "GSoC the first two weeks"

I already outlined the big picture of the final optimization. In many places I have assumed certain conditions and added simplifications to more easily explain the concept. In this and the next posts I will present much more detail about all the different parts that I have added to RPython. Note that I might recapitulate some basics of the terminology to make it better understandable for those who are not familiar with tracing- or method- JIT compilers.

RPython's optimizations

The basic data structure is a list of operations called "trace". A trace has no entry points other then its header instruction. In addition it might only be exited at a special instruction call "guard" instruction. Note that a trace must not necessarily have a backwards jump, but in the following only trace loops are considered. Here is an example of a trace. The red edge shows what is not allowed to happen. Jumps are only valid to the label instruction.

The optimizer takes this list of operations and transforms it to an equivalent trace. Many of the optimizations only pass through the list of operations once and gather information and emit, transform or leave out the current instruction. Optimizations that are done in other compilers are easier to implement because it is a single basic block instead of a region or control flow graph.

There is one optimization (one of the most powerful) that is different though. The unrolling optimization.

Trace loop unrolling

The proposed algorithm (henceforth called VecOpt) needs a way to unroll the loop body several times. So why not reuse this optimization already present in PyPy? I tried to reuse the optimization (and did in a certain way), but the unrolling itself I have duplicated for two reasons:

Unrolling does not only unroll the trace loop, but it gathers information about the traces and applies many more optimizations than just peepling one loop iteration. This includes guard strength reduction, guard implication, loop invariant code motion and more.
The factor is not easily adjustable. It has the sole purpose of peel the loop body just once.

The only requirement VecOpt has: It needs a way to unroll the loop X times and correctly rename the variables. This is implemented in less then 200 lines of code and it is specifically tailored to VecOpt.

Renaming a loop iteration needs a mapping function (dict) of the jump arguments to the label arguments and must instantiate new variables for operations that create variables. In a loop unrolling iteration the mapping function is used to rename the arguments of operations. The idea is also described in the loop unrolling optimization article here (chapter 5).

Benefit

VecOpt benefits from the unrolling optimization. Without it, chances are very very low that a vectorizable loop can be found and transformed to a faster trace loop. There are two reasons for that:

The peeled loop in most cases contains significantly less guards than the original trace
Most of the loop overhead has been removed

Guards increase the dependency edge count in a trace. In many cases they make two instructions dependent even though without the guard they could in theory be executed in parallel.

If loop overhead is not removed, then the gain might not be that significant when using SIMD instructions.

There are many loops that after the unrolling optimization has been applied, only contain less than half of the instructions.

Summary

VecOpt is applied to the peeled loop the unrolling optimization outputs. This smaller loop is then unrolled X times. X is determined by the smallest type that is loaded in the trace loop. Without the unrolling already done by PyPy there would only be little chance to find parallel instructions and group them into vector statements. In a nutshell one could say that: VecOpt unrolls an already unrolled loop.

PyPy Vectorization

Thursday, June 4, 2015

PyPy unrolling trace loops

No comments:

Post a Comment