Monday, August 24, 2015

The End of Summer, PyPy ♥ SIMD


The Summer of Code is approaching its end and it has been an amazing experience for me. Not only that, but I am also approaching the end of my master's thesis, which was one of my main goals for the last five years. The sad part is that I will most likely not be able to participate as a student in the future.

It has been a really great time for me, and to anyone unsure about applying, I can warmly recommend giving it a try.

PyPy's vectorizing optimizer

So what is the outcome of my GSoC? Let's quickly have a look at what I proposed:

The goal of this project is to enhance the trace optimizer of PyPy. By definition NumPy arrays are unboxed, homogeneous (one data type) and contiguous in memory (despite certain exceptions). The same is true for the array module in the standard library. The new optimizer will use this opportunity and exploit a SIMD instruction set (such as SSE, AVX) present on modern CISC processors (e.g. x86). This should lead to enhanced execution speed for arithmetic-intensive applications that run on top of PyPy.
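
To illustrate what unboxed, homogeneous and contiguous mean in practice, here is a small snippet (not from the project itself, just an illustration): a one-dimensional float64 NumPy array stores its elements back to back as raw doubles, eight bytes apart.

import numpy

a = numpy.array([1.0, 2.0, 3.0])  # homogeneous: every element is a float64
print(a.dtype)    # float64 -- unboxed machine doubles, no PyObject wrappers
print(a.strides)  # (8,) -- contiguous: the next element is 8 bytes away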

I have already shown that individual traces get faster when the optimization is turned on. But that does not necessarily mean whole programs get faster. The optimization only helps if your program spends a significant fraction of its time in a trace loop that is vectorizable.

The following benchmarks stress basic NumPy operations: in a loop, each NumPy operation is executed 1000 times. This sample program shows the basic setup of multiply-float64:

import numpy

def bench(vector_a, vector_b):
    for i in range(1000):
        numpy.multiply(vector_a, vector_b, out=vector_a)
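
A minimal driver for the benchmark above might look as follows; the vector size and the timing code here are my assumptions for illustration, not the exact harness used for the measurements.

import time

# hypothetical input vectors; the real benchmark varies the data type
vector_a = numpy.ones(1024, dtype=numpy.float64)
vector_b = numpy.ones(1024, dtype=numpy.float64)

start = time.time()
bench(vector_a, vector_b)
print('multiply-float64: %.4f seconds' % (time.time() - start))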


The speedup chart (bigger is better) shows the theoretical maximum speedup and what is achieved by the optimization. The maximum is bounded by the way SIMD works: a 128-bit SSE register holds two float64 values (or four float32 values), so the best case for a float64 operation is 2x. The baseline is the portable PyPy version 2.6.0 (none of the code I changed is included in this version). The vectorized version is revision a026d96015e4.

Considering that aligned memory loads cannot currently be used, I think this is a pretty good result. For float64 multiply the maximum speedup is nearly reached. float32 performs very poorly because the JIT currently does not operate on float32 directly, but always casts to float64, executes the arithmetic and casts back to float32.

Let's run some real programs. SOM (Self-Organizing Maps) is an algorithm that maps N dimensions onto a 2D grid. It contains a lot of vector distances and vector operations. Dot is the matrix dot product and wdist is the weighted Euclidean distance. The number in the benchmark name indicates the size of the vector register.
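
For reference, a weighted Euclidean distance kernel of the kind stressed here could be written as follows; this is my own sketch of the general shape, not the exact benchmark code.

import numpy

def wdist(vector, weights, coefficients):
    # weighted Euclidean distance between a sample vector (n,)
    # and each row of a weight matrix (m, n)
    diff = weights - vector
    return numpy.sqrt(numpy.sum(coefficients * diff * diff, axis=1))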




As you can see, bigger vectors are better for the optimization. The cool thing is that PyPy is now able to connect trace trees that use SIMD operations, which is not possible if the numerical kernels are written in e.g. C. Again I think the results are pretty cool, and the speedup should get even crazier once AVX is implemented in the JIT backend.

Non-NumPy traces

At the beginning it was a bit unclear whether this would be possible, but let me show you a very basic example of a Python program where the optimization creates a very good piece of assembler.

# arrayloop.py
import array

w = array.array('d', [0] * 10000)
l = array.array('d', [1] * 10000)

def f():
    # element-wise addition over two unboxed double arrays
    i = 0
    while i < 10000:
        l[i] = l[i] + w[i]
        i += 1

for i in range(100000):
    f()

$ time ./pypy-c --jit vec_all=1 ~/arrayloop.py 
0.67user 0.01system 0:00.69elapsed 99%CPU
$ time ./pypy-c --jit vec_all=0 ~/arrayloop.py 
1.65user 0.01system 0:01.67elapsed 99%CPU

Boom. 1.65 / 0.67 ≈ 2.46 times faster. The speedup goes beyond the 2x SIMD bound not because of some caching effect, but because less guard checking is done (all array bound checks are only checked once).

If you happen to be into PyPy, you know that the list strategy (e.g. for [42.0] * 44) will not store Python objects, but stores double floating point values directly in memory. Only when the first non-float value is stored into the list is it transformed into a list of Python objects. Perfect! This would make it possible to let the optimization run on loops that manipulate plain Python lists as well.
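
In other words, a loop like the following operates on unboxed doubles behind the scenes and has the same shape as the array.array example above (a sketch, assuming the float list strategy stays in effect):

w = [0.0] * 10000   # stored as raw doubles thanks to the list strategy
l = [1.0] * 10000

def f():
    i = 0
    while i < 10000:
        l[i] = l[i] + w[i]
        i += 1

l[0] = 'spam'  # a non-float forces the list back to generic Python objects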

Unfortunately this is currently not possible, because there are some fragments in the trace that the optimizer cannot transform. We are working on it, and the next major PyPy release might contain this as well!

Stats

The optimizer spans at least seven files and comprises more than 3000 lines of code. The test suite covers more than 4000 lines in another five files. This is the newly added code in grand total; it excludes every line that I changed in the existing code base.

Does this only work for PyPy?

If I had put the same effort into a different virtual machine, it would have been tied to that language and that virtual machine only.

That would be lame right?

But (you might be aware of this) any language interpreter written in RPython can reuse the tracing JIT compiler. That means that "magically" this optimization is added to the final executable by default.
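
To make this concrete, here is a minimal sketch of an RPython interpreter loop in the style of the official RPython tutorial; the toy opcode and the names are mine, but the JitDriver hints are the standard way any RPython interpreter gets the tracing JIT (and with it, this optimization) for free.

from rpython.rlib.jit import JitDriver

jitdriver = JitDriver(greens=['pc', 'program'], reds=['acc'])

def interpret(program):
    # a toy bytecode interpreter: the JIT hooks in at jit_merge_point
    pc = 0
    acc = 0.0
    while pc < len(program):
        jitdriver.jit_merge_point(pc=pc, program=program, acc=acc)
        if program[pc] == 'i':   # 'i' increments the accumulator
            acc += 1.0
        pc += 1
    return acc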

At this point it should be said that programs will not automatically get faster; there are some rules your traces must obey. But this is easy. Much easier than writing assembler or using compiler intrinsics.
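
The exact rules are beyond the scope of this post, but judging from the examples above, straight-line loops over homogeneous arrays are the target shape. As an assumption on my part, a data-dependent branch like the following would likely prevent the optimizer from forming SIMD operations:

import array

l = array.array('d', [1.0] * 10000)

def g():
    total = 0.0
    i = 0
    while i < 10000:
        if l[i] > 0.5:   # the branch depends on the data itself
            total += l[i]
        i += 1
    return total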

This is great, because if you want to write an array processing language in RPython, you gain all the features RPython already provides and get to use SSE4.1 (for now) to speed up your programs.

Final words

I plan to continue working on PyPy, and I think this project will soon be merged into default. We talked about including it in the upcoming 2.6.1 release, but since there are other great changes coming to enhance the trace compiler, it was postponed.

Thank you PyPy Team, thank you Google & PSF! I had a great time tinkering on my proposal.

Friday, August 7, 2015

GSoC: Vec, the little brother of NumPy array

Back from Bilbao, I have been busy testing and hunting bugs. It took me quite some time to resolve all the issues, because before that I had not run all tests regularly.

Since then I have been working on vectorizing "user code". The primary goal of this project is to speed up trace loops that iterate over NumPy arrays, but as I said in the last post, it might (or might not) make sense to optimize traces found in the user program as well.

Vec, the little brother of NumPy array


Using the following snippet one can build a class for vectors in Python and let the optimization speed up the computation. 

import array

class Vec(object):
    # ...
    def __init__(self, content, type='d'):
        self.size = len(content)
        self.type = type
        self.array = array.array(type, content)

    def __add__(self, other):
        return self.add(other)

    def add(self, other, out=None):
        # ensure that other is of the right type and size;
        # out must be allocated if it is None
        assert isinstance(other, Vec) and other.size == self.size
        if out is None:
            out = Vec([0] * self.size, self.type)

        i = 0
        #
        # execute pypy with --jit vectorize_user=1 to
        # enable the optimization
        #
        while i < self.size:
            out.array[i] = self.array[i] + other.array[i]
            i += 1
        return out

    # ...
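
For completeness, a quick usage sketch (the addition allocates the output vector as described in the comments above):

a = Vec([1.0, 2.0, 3.0])
b = Vec([4.0, 5.0, 6.0])
c = a + b
print(list(c.array))  # [5.0, 7.0, 9.0]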

After tracing the loop in the add function, a slightly better vector loop is generated. Let's run the program:

# jit warmup
a, b, c = Vec(...), Vec(...), Vec(...)
# start time.time()
for i in range(500):
    c = a * b
    a = c - b
# stop time.time()

The following results (in seconds) were obtained (after the JIT has warmed up):
PyPy (vecopt):    ~0.005
PyPy (no vecopt): ~0.008
CPython:          ~0.040

PyPy shows a higher variance in execution time; the garbage collector might be the reason for that. The program was run 5 times and the mean value is shown above.

What about Python lists?


Honestly, I'm unsure if there is a real benefit. But since PyPy stores lists of integers/floats (that are fully homogeneous) without the overhead of wrapping each element in a PyObject, SIMD operations could be used for normal Python lists as well.

The problem with this optimization is that the program must run for a very long time and spend a significant fraction of that time in the optimized trace loop. The short evaluation above shows that there might be potential. I will investigate further, because this is also a great way to find bugs in the implementation.

Traces emitted by user programs are much more complex than the ones in the NumPy library. Working on this over the last week, I found many edge cases and was even reminded that I had left some TODOs in the source code.