Breakthrough
I have measured speedup on my sample interpreter already, but not in the NumPy library. I have tested and hardened the edge cases and it is now possible to measure speedup using the NumPy library.
Micro benchmark
a = np.arange(1000.0)
b = np.arange(1000.0)
for i in range(10000):
a = a + b
Invoking this program one can measure as speedup of ~1.33 faster program execution.
Well, that is not quite the theoretical maximum of 2.00 (SSE4)
I have then spent time to analyze the behavior using several profiling utilities. The included Python profiler did not do the job, because it is unaware of the underlying JIT. Thus I used the brand new vmprof and gprof.
Sidenote: I used gprof only to verify, but if a statistical profiler is enough for your python program, go for vmprof! The overhead is minimal and it is possible to get live profiling feedback of your application! In combination with the jitviewer you can find out where your time is spent.
It helped me a lot and the above loop spends about half of the time copying memory. So if the loop body is exchanged with ufunc.add(a, b, out=a) speedup increases up to 1.70-1.80.
That is better, but where is the rest of the time spent? Sadly the profiling says in the loop around the NumPy call. One of my mentors has suggested that there might be possibilities to improve the register allocation. And I'm currently evaluating a way to exchange and add some heuristics to improve the allocator.
The loop itself is a magnitude faster than the scalar loop. So I'm quite happy that my idea really worked out.
Accumulation
That is another big thing that I have been working on. I did not suggest this improvement in my GSoC proposal. Still I want to include it.
Frequently used functions in scientific computing are sum, prod, any, all, max, min, ...
Some of them consider the whole array, some of them bail out if an element has been found. There is potential to use SIMD instructions for these operations.
Let's consider sum(...). The addition is commutative.
x+y = y+x f.a. R
Thus I have added a temporary vector register for summation, the accumulator. Instead of resolving the dependency using a horizontal add (supported by x86 SSE4) the loop partially sums the array. At every guard exit the accumulator is then horizontally added. Again the theoretical speedup is a factor 2 when using float64 on SSE4.
I have not yet managed to compile a version that fully works on sum, but I'm quite close to it. Other functions like all or any are more complex. It is not so easy to recognize the reduction pattern if more than one operation is involved. I will add a pattern matcher for those instructions. Let's have a look at the following example (for all):
d = int_and(a,b)
guard_true(d)
e = int_and(c,d)
guard_true(e)
And output the following vector statements (excluding guard compensation code)
v = vec_int_and([a,c], accum)
guard_true(v)
I did not expect...
I have evaluated the possibility to vectorize arbitrary PyPy traces using the array module. This does not work for PyPy traces. It works in my test toy language (located here). Let's take a look at the following user program:
while i < 100:
a[i] = b[i] + c[i] * 3.14
i += 1
a,b and c are array objects of the Python array module. Their elements are homogeneous and adjacent in memory. The resulting trace could be transformed into a vectorized form.
The current two limitations make it impossible to vectorize the user program: 1) Python checks array boundaries (also negative) for each load/store operation. This adds an additional guard to the trace.
2) The boxing of integer values. The index variable will be recreated at the end of the loop and incremented. This includes several non pure operations in the trace including memory allocation of an integer box every iteration.
I do not yet know how I will come around these problems, but for the second limitation I'm quite sure that the creation of the integer box can be moved to the guard exit.