PyPy Vectorization: July 2015

Bilbao

I have never attended a programming conference before. Some thoughts and impressions:

The architecture of conference center is impressive.
Python is heavily used in numerical computation, data analysis and processing (I thought it to be less).
Pick any bar and a vegetarian dish (if there is any): It will most certainly contain meat/fish
PyPy is used, but most people are unaware of the fact that there is a JIT compiler for Python, that speeds up computations and reduces memory

It was a good decision to come to the EuroPython, meet with people (especially with the PyPy dev team) and see how things work in the Python community. See you next time :)

I did as well work on my proposal all along. Here are some notes what I have been working on (before Bilbao).

ABC Optimization

One "roadblock" I did not tackle is vectorization of "user code". The vecopt branch at it's current shape was not able to efficiently transform the most basic Python for loop accessing array instances of the array module. (Micro)NumPy kernels work very well (which is the main use case), but for Python loops this is a different story. Obviously, it is not that easy to vectorize these, because it is much more likely that many guards and state modifying operations (other than store) are present.

In the worst case the optimizer just bails out and leaves the trace as it is.
But I think at least for the simplest loops should work as well.

So I evaluated what needs to be done to make this happen: Reduce the number of guards, especially Array Bound Checks (ABC). PyPy does this already, but the ones I would like to remove need a slightly different treatment. Consider:

i = 0
while i < X:
a[i] = b[i] + c[i]
i += 1

There are four guards in the resulting trace, one protecting the index to be below X, and three protecting the array access. You cannot omit them, but you can move them outside the loop. The idea is to introduce guards that make the checks (but the index guard) redundant. Here is an example:

guard(i < X) # (1) protects the index
guard(i < len(a)) # (2) leads ot IndexError
load(..., i, ...) # actual loading instruction

Assume X < len(a). Then (1) implies (2) is redundant and guard(X < len(a)) can be done before the loop is entered. That works well for a well behaved program and in order to pick the right guard as a reference (the index guard might not be the first guard), I'll take a look at the runtime value. The minimum value is preferable, because it is the strongest assumption.

I'm not yet sure if this is the best solution, but it is certainly simple and yields the desired result.

Off by one

Some commits ago the last few "off by one" iterations of a NumPy call were always handled by the blackhole interpreter, eventually compiling a bridge out of the guard. This makes the last iteration unnecessary slow. Now, PyPy has the bare essentials to create trace versions and immediately stitch them to an guarding instructions. The original version of the loop is compiled to machine code at the same time as the optimized version and attached to the guards within the optimized version.

Ultimately a vectorized trace exits an index guard immediately leading into a trace loop to handle the last remaining elements.

Now the first half of the GSoC program 2015 is over and it has been a great time. I compared the time line just recently and I have almost finished all the work that needed to be done for the whole proposal. Here is a list what I have implemented.

The tracing intermediate representation has now operations for SIMD instructions (named vec_XYZ) and vector variables
The proposed algorithm was implemented in the optimization backend of the JIT compiler
Guard strength reduction that handles arithmetic arguments.
Routines to build a dependency graph and reschedule the trace
Extended the backend to emit SSE4.1 SIMD instructions for the new tracing operations
Ran some benchmark programs and evaluated the current gain

I even extended the algorithm to be able handle simple reduction patterns. I did not include this in my proposal. This means that numpy.sum or numpy.prod can be executed with SIMD instructions.

Here is a preview of trace loop speedup the optimization currently achieves.

Note that the setup for all programs is the following: Create two vectors (or one for the last three) (10.000 elements) and execute the operation (e.g. multiply) on the specified datatype. It would look something similar to:

a = np.zeros(10000,dtype='float')
b = np.ones(10000,dtype='float')
np.multiply(a,b,out=a)

After about 1000 iterations of multiplying the tracing JIT records and optimizes the trace. Before jumping to and after exiting the trace the time is recorded. The difference you see in the plot above. Note there is still a problem with any/all and that this is only a micro benchmark. It does not necessarily tell anything about the whole runtime of the program.

For multiply-float64 the theoretical maximum speedup is nearly achieved!

Expectations for the next two months

One thing I'm looking forward to is the Python conference in Bilbao. I have not met my mentors and other developers yet. This will be awesome!
I have also been promised that we will take a look at the optimization so that I can further improve the optimization.

To get even better result I will also need to restructure some parts in the Micro-NumPy library in PyPy.
I think I'm quite close to the end of the implementation (because I started in February already) and I expect that the rest of the GSoC program I will extend, test, polish, restructure and benchmark the optimization.

PyPy Vectorization

Friday, July 24, 2015

GSoC: Bilbao, ABC and off by one

Wednesday, July 1, 2015

GSOC Mid Term