Monday, August 24, 2015

The End of Summer, PyPy ♥ SIMD


The Summer of Code is approaching its end and it has been an amazing experience for me. Not only that, I am also approaching the end of my master's thesis, which has been one of my main goals for the last five years. The sad part is that I will most likely not be able to participate as a student in the future.

It has been a really great time for me, and to anyone unsure whether to apply or not: I can warmly recommend giving it a try.

PyPy's vectorizing optimizer

So what is the outcome of my GSoC? Let's quickly have a look at what I proposed:

The goal of this project is to enhance the trace optimizer of PyPy. By definition NumPy arrays are unboxed, homogeneous (one data type) and contiguous in memory (despite certain exceptions). The same is true for the array module in the standard library. The new optimizer will use this opportunity and exploit a SIMD instruction set (such as SSE, AVX) present on modern CISC processors (e.g. x86). This should lead to enhanced execution speed for arithmetic intensive applications that run on top of PyPy.

I have already shown that individual traces get faster when the optimization is turned on. But that does not necessarily mean programs get faster: the optimization only helps if your program spends a significant fraction of its time in a vectorizable trace loop.

The following shows the basic NumPy operations stressed; in a loop, each NumPy operation is executed 1000 times. This sample program shows the basic setup for multiply-float64:

import numpy

def bench(vector_a, vector_b):
    for i in range(1000):
        numpy.multiply(vector_a, vector_b, out=vector_a)


The speedup (bigger is better) shows the theoretical maximum speedup (this is bounded because of the way SIMD works) and what is achieved by the optimization. The baseline is the portable PyPy version 2.6.0 (none of the code I changed is included in that version). The vectorized version is revision a026d96015e4.

Considering that aligned memory loads cannot currently be used, I think this is a pretty good result. For float64 multiply the maximum speedup is nearly reached. float32 performs very poorly because the JIT currently does not operate on float32 directly, but always casts to float64, executes the arithmetic and casts back to float32.

Let's run some real programs. SOM (Self-Organizing Maps) is an algorithm to map N dimensions onto a 2D grid. It contains a lot of vector distances and vector operations. Dot is the matrix dot product and wdist is the weighted Euclidean distance. The number in the benchmark name indicates the size of the vector register.




As you can see, bigger vectors are better for the optimization. The cool thing is that PyPy is now able to connect trace trees that use SIMD operations, which is not possible if the numerical kernels are written in e.g. C. Again, I think the results are pretty cool, and the speedup should get even crazier once AVX is implemented in the JIT backend.

Non-NumPy traces

At the beginning it was a bit unclear whether this would be possible, but let me show you a very basic example of a Python program where the optimization creates a very good piece of assembler.

# arrayloop.py
import array
w = array.array('d', [0] * 10000)
l = array.array('d', [1] * 10000)
def f():
    i = 0
    while i < 10000:
        l[i] = l[i] + w[i]
        i += 1

for i in range(100000): 
    f()

$ time ./pypy-c --jit vec_all=1 ~/arrayloop.py 
0.67user 0.01system 0:00.69elapsed 99%CPU
$ time ./pypy-c --jit vec_all=0 ~/arrayloop.py 
1.65user 0.01system 0:01.67elapsed 99%CPU

Boom. 1.65 / 0.67 = ~2.46 times faster. The speedup exceeds the theoretical SIMD factor of two not because of caching effects, but because the vectorized trace does less guard checking (all array bound checks are performed only once).

If you happen to be into PyPy, you know that the list strategy (e.g. for [42.0] * 44) will not store Python objects, but stores double floating point values directly in memory. Only when the first non-floating-point value is stored into the list is it transformed into a list of Python objects. Perfect! This makes it even possible to let the optimization run on loops that manipulate plain Python lists.
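
A small sketch of what this looks like from the user's side (the strategy switch happens inside PyPy and is not directly observable; the code itself runs on any Python implementation):

l = [42.0] * 44    # PyPy stores this as a flat array of C doubles
l[3] = 1.5         # still all floats: the unboxed representation is kept
l[3] = "spam"      # first non-float store: converted to a list of boxed Python objects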

Unfortunately this is currently not possible, because there are some fragments in the trace that the optimizer cannot transform. We are working on it and the next major PyPy update might contain this as well!

Stats

The newly added code spans at least seven files and totals more than 3000 lines; the test suite covers more than 4000 lines in five files. That is the grand total of new code, excluding every line I changed in the existing source base.

Does this only work for PyPy?

If I had put the same effort into a different virtual machine, it would have been tied to that language and that virtual machine only.

That would be lame, right?

But (you might be aware of this) any language interpreter written in RPython can reuse the tracing JIT compiler. That means that "magically" this optimization is added to the final executable by default.

At this point it should be said that programs will not automatically get faster; there are some rules your traces must obey. But this is easy. Much easier than writing assembler or using compiler intrinsics.

This is great because if you want to write an array processing language in RPython, you gain all the features it already provides and get to use SSE4.1 (for now) to speed up your programs.

Final words

I plan to keep working on PyPy and I think this project will soon be merged into default. We talked about including it in the upcoming 2.6.1 release, but since other great changes to enhance the trace compiler are coming, it was postponed.

Thank you PyPy Team, thank you Google & PSF! I had a great time tinkering on my proposal.

Friday, August 7, 2015

GSoC: Vec, the little brother of NumPy array

Back from Bilbao, I have been busy testing and hunting bugs. It took me quite some time to resolve all issues, because before that I did not run all tests regularly.

Since then I have been working on vectorizing "user code". The primary goal of this project is to speed up trace loops that iterate over NumPy arrays. But as I said in the last post, it might (or might not) make sense to optimize traces found in the user program as well.

Vec, the little brother of NumPy array


Using the following snippet one can build a class for vectors in Python and let the optimization speed up the computation. 

import array
class Vec(object):
    # ...
    def __init__(self, content, type='d'):
        self.size = len(content)
        self.type = type
        self.array = array.array(type, content)

    def __add__(self, other):
        return self.add(other)

    def add(self, other, out=None):
        # ...
        # Ensure that other has the right type and size;
        # out must be allocated if it is None
        # ...
        i = 0
        #
        # execute pypy with --jit vectorize_user=1 to
        # enable the optimization
        #
        while i < self.size:
            out.array[i] = self.array[i] + other.array[i]
            i += 1
        return out

    # ...

After tracing the loop in the add function, a slightly better, vectorized loop is generated. Let's run the program:

# jit warmup
a, b, c = Vec(...), Vec(...), Vec(...)
# start time.time()
for i in range(500):
    c = a * b
    a = c - b
# stop time.time()

The following results can be obtained (after the JIT has warmed up; times in seconds):
PyPy (vecopt):    ~0.005
PyPy (no vecopt): ~0.008
CPython:          ~0.040

PyPy shows higher variance in execution time; the garbage collector might be the reason for that. The program has been run 5 times and the mean values are shown above.

What about Python lists?


Honestly, I'm unsure if there is a real benefit. Since PyPy stores lists of integers/floats (as long as they are fully homogeneous) without the overhead of embedding each element in a PyObject, SIMD operations could be used for normal Python lists.

The problem with this optimization is that the program must run for a very long time and spend a significant fraction of time in the trace loop that has been optimized. The short evaluation above shows that there might be potential. I will further investigate, because this is a great way to find bugs in the implementation as well.

Traces emitted by the user program are much more complex than the ones in the NumPy library. Working on this over the last week I found many edge cases, and was even reminded that I had left some TODOs in the source code.

Friday, July 24, 2015

GSoC: Bilbao, ABC and off by one

Bilbao



I have never attended a programming conference before. Some thoughts and impressions:
  • The architecture of the conference center is impressive.
  • Python is heavily used in numerical computation, data analysis and processing (I thought it was less).
  • Pick any bar and a vegetarian dish (if there is any): it will most certainly contain meat/fish.
  • PyPy is used, but most people are unaware of the fact that there is a JIT compiler for Python that speeds up computations and reduces memory usage.
It was a good decision to come to EuroPython, meet people (especially the PyPy dev team) and see how things work in the Python community. See you next time :)

I worked on my proposal all along as well. Here are some notes on what I have been working on (before Bilbao).

ABC Optimization

One "roadblock" I did not tackle is vectorization of "user code". The vecopt branch at it's current shape was not able to efficiently transform the most basic Python for loop accessing array instances of the array module.  (Micro)NumPy kernels work very well (which is the main use case), but for Python loops this is a different story. Obviously, it is not that easy to vectorize these, because it is much more likely that many guards and state modifying operations (other than store) are present.

In the worst case the optimizer just bails out and leaves the trace as it is.
But I think at least the simplest loops should work as well.

So I evaluated what needs to be done to make this happen: Reduce the number of guards, especially Array Bound Checks (ABC). PyPy does this already, but the ones I would like to remove need a slightly different treatment. Consider:

i = 0
while i < X:
    a[i] = b[i] + c[i]
    i += 1

There are four guards in the resulting trace: one protecting the index to stay below X, and three protecting the array accesses. You cannot omit them, but you can move them outside the loop. The idea is to introduce guards that make the checks (all but the index guard) redundant. Here is an example:

guard(i < X) # (1) protects the index
guard(i < len(a)) # (2) leads to IndexError
load(..., i, ...) # actual loading instruction

Assume X < len(a). Then (1) implies (2), so (2) is redundant and guard(X < len(a)) can be checked once before the loop is entered. That works well for a well-behaved program. In order to pick the right guard as a reference (the index guard might not be the first guard), I take a look at the runtime value; the minimum value is preferable, because it is the strongest assumption.
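
A sketch of the transformed trace (pseudocode in the style of the fragment above, not actual optimizer output; note that guard(X <= len(a)) would already suffice):

guard(X <= len(a))    # hoisted bound check, done once before entering the loop

guard(i < X)          # (1) protects the index; combined with the hoisted
                      # guard it implies i < len(a), so (2) is dropped
load(..., i, ...)     # actual loading instruction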

I'm not yet sure if this is the best solution, but it is certainly simple and yields the desired result.

Off by one

A few commits ago, the last few "off by one" iterations of a NumPy call were always handled by the blackhole interpreter, eventually compiling a bridge out of the guard. This made the last iterations unnecessarily slow. Now PyPy has the bare essentials to create trace versions and immediately stitch them to a guarding instruction. The original version of the loop is compiled to machine code at the same time as the optimized version and attached to the guards within the optimized version.

Ultimately, a vectorized trace exits at an index guard, immediately leading into a trace loop that handles the last remaining elements.

Wednesday, July 1, 2015

GSOC Mid Term

Now the first half of the GSoC program 2015 is over and it has been a great time. I checked the timeline just recently: I have almost finished all the work that needed to be done for the whole proposal. Here is a list of what I have implemented.
  • The tracing intermediate representation has now operations for SIMD instructions (named vec_XYZ) and vector variables
  • The proposed algorithm was implemented in the optimization backend of the JIT compiler
  • Guard strength reduction that handles arithmetic arguments.
  • Routines to build a dependency graph and reschedule the trace
  • Extended the backend to emit SSE4.1 SIMD instructions for the new tracing operations
  • Ran some benchmark programs and evaluated the current gain
I even extended the algorithm to be able to handle simple reduction patterns, which I did not include in my proposal. This means that numpy.sum or numpy.prod can be executed with SIMD instructions.

Here is a preview of the trace loop speedup the optimization currently achieves.


Note that the setup for all programs is the following: create two vectors (or one for the last three) of 10,000 elements and execute the operation (e.g. multiply) on the specified datatype. It would look something like this:

import numpy as np

a = np.zeros(10000, dtype='float')
b = np.ones(10000, dtype='float')
np.multiply(a, b, out=a)

After about 1000 iterations of multiplying, the tracing JIT records and optimizes the trace. The time is recorded before jumping to and after exiting the trace; the difference is what you see in the plot above. Note that there is still a problem with any/all, and that this is only a micro benchmark: it does not necessarily tell anything about the whole runtime of the program.

For multiply-float64 the theoretical maximum speedup is nearly achieved!

Expectations for the next two months

One thing I'm looking forward to is the Python conference in Bilbao. I have not met my mentors and other developers yet. This will be awesome!
I have also been promised that we will take a look at the optimization together so that I can improve it further.

To get even better results I will also need to restructure some parts of the Micro-NumPy library in PyPy.
I think I'm quite close to the end of the implementation (because I started in February already), and I expect to spend the rest of the GSoC program extending, testing, polishing, restructuring and benchmarking the optimization.

Thursday, June 18, 2015

It is ... alive!!!

I have been quite busy the last weeks improving my solution. Most of that time I dedicated to the accumulation of values. But first I have to tell you about the ...

Breakthrough


I had already measured speedup in my sample interpreter, but not in the NumPy library. I have since tested and hardened the edge cases, and it is now possible to measure speedup using the NumPy library.

Micro benchmark


import numpy as np

a = np.arange(1000.0)
b = np.arange(1000.0)
for i in range(10000):
    a = a + b

Invoking this program, one can measure a speedup of ~1.33x.

Well, that is not quite the theoretical maximum of 2.00 (SSE4).

I then spent time analyzing the behavior with several profiling utilities. The included Python profiler did not do the job, because it is unaware of the underlying JIT. Thus I used the brand new vmprof, and gprof.

Sidenote: I used gprof only to verify, but if a statistical profiler is enough for your Python program, go for vmprof! The overhead is minimal and it is possible to get live profiling feedback of your application! In combination with the jitviewer you can find out where your time is spent.

It helped me a lot: the above loop spends about half of its time copying memory. So if the loop body is exchanged with ufunc.add(a, b, out=a), the speedup increases to 1.70-1.80.
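
Concretely, the benchmark loop is rewritten to use the add ufunc with an out argument, so no temporary array is allocated and copied in each iteration:

import numpy as np

a = np.arange(1000.0)
b = np.arange(1000.0)
for i in range(10000):
    np.add(a, b, out=a)   # the result is written into a, no intermediate copy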

That is better, but where is the rest of the time spent? Sadly, the profile says it is in the loop around the NumPy call. One of my mentors suggested that there might be possibilities to improve the register allocation, and I'm currently evaluating ways to adjust the allocator and add some heuristics to improve it.

The loop itself is an order of magnitude faster than the scalar loop, so I'm quite happy that my idea really worked out.

Accumulation


That is another big thing I have been working on. I did not suggest this improvement in my GSoC proposal; still, I want to include it.
Frequently used functions in scientific computing are sum, prod, any, all, max, min, ...

Some of them consider the whole array, some of them bail out if an element has been found. There is potential to use SIMD instructions for these operations.

Let's consider sum(...). Addition is commutative:

x + y = y + x    for all x, y ∈ ℝ

Thus I have added a temporary vector register for the summation, the accumulator. Instead of resolving the dependency with a horizontal add (supported by x86 SSE4) inside the loop, the loop only partially sums the array; at every guard exit the accumulator is then horizontally added. Again, the theoretical speedup is a factor of 2 when using float64 on SSE4.
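
Here is a conceptual model of the partial summation in plain Python, assuming a 2-lane float64 accumulator (the real work of course happens in the generated SIMD instructions):

def simd_sum(a):
    acc0, acc1 = 0.0, 0.0            # the two lanes of the accumulator register
    for i in range(0, len(a) - 1, 2):
        acc0 += a[i]                 # both lane additions correspond to a
        acc1 += a[i + 1]             # single SIMD instruction per iteration
    total = acc0 + acc1              # horizontal add at the guard exit
    if len(a) % 2:                   # scalar tail for an odd element count
        total += a[-1]
    return total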

I have not yet managed to compile a version that fully works for sum, but I'm quite close. Other functions like all or any are more complex: it is not so easy to recognize the reduction pattern if more than one operation is involved. I will add a pattern matcher for those instructions. Let's have a look at the following example (for all):

d = int_and(a,b)
guard_true(d)
e = int_and(c,d)
guard_true(e)

This should output the following vector statements (excluding guard compensation code):

v = vec_int_and([a,c], accum)
guard_true(v)

I did not expect...


I have evaluated the possibility of vectorizing arbitrary PyPy traces that use the array module. This does not work for PyPy traces yet; it does work in my test toy language (located here). Let's take a look at the following user program:

while i < 100:
    a[i] = b[i] + c[i] * 3.14
    i += 1

a, b and c are array objects of the Python array module. Their elements are homogeneous and adjacent in memory. The resulting trace could be transformed into a vectorized form.

Two limitations currently make it impossible to vectorize the user program:
1) Python checks array boundaries (also negative indices) for each load/store operation. This adds an additional guard to the trace.
2) The boxing of integer values. The index variable is recreated and incremented at the end of the loop. This adds several non-pure operations to the trace, including the memory allocation of an integer box every iteration.
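
A sketch of what the second limitation looks like in a trace (the operation names are illustrative pseudocode, not verbatim PyPy output):

i5 = getfield_gc(p2, descr=intval)   # unbox the index from its integer box
i6 = int_add(i5, 1)                  # the actual increment
p7 = new_with_vtable(W_IntObject)    # a fresh box is allocated every iteration
setfield_gc(p7, i6, descr=intval)    # store the incremented value back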

I do not yet know how I will get around these problems, but for the second limitation I'm quite sure that the creation of the integer box can be moved to the guard exit.

Thursday, June 4, 2015

PyPy unrolling trace loops

Note: Please see my last post "GSoC the first two weeks"

I have already outlined the big picture of the final optimization. In many places I assumed certain conditions and added simplifications to explain the concept more easily. In this and the next posts I will present much more detail about the different parts I have added to RPython. Note that I might recapitulate some basics of the terminology to make it easier to understand for those who are not familiar with tracing or method JIT compilers.

RPython's optimizations

The basic data structure is a list of operations called a "trace". A trace has no entry points other than its header instruction, and it may only be exited at a special instruction called a "guard". Note that a trace does not necessarily have a backwards jump, but in the following only trace loops are considered. Here is an example of a trace; the red edge shows what is not allowed to happen, since jumps are only valid to the label instruction.
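
A minimal sketch of such a trace loop in pseudocode (illustrative, not actual RPython output):

label(i0, p0)              # header: the only valid entry point
i1 = int_lt(i0, 10000)
guard_true(i1)             # a guard: the only way to leave the trace
f2 = raw_load(p0, i0)
...
i3 = int_add(i0, 1)
jump(i3, p0)               # backwards jump, always to the label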



The optimizer takes this list of operations and transforms it into an equivalent trace. Many of the optimizations pass over the list only once, gathering information and emitting, transforming or dropping the current instruction. Optimizations known from other compilers are easier to implement here, because a trace is a single basic block instead of a region or a control flow graph.

There is one optimization (one of the most powerful) that is different though: the unrolling optimization.

Trace loop unrolling

The proposed algorithm (henceforth called VecOpt) needs a way to unroll the loop body several times. So why not reuse the unrolling optimization already present in PyPy? I tried to reuse it (and did in a certain way), but the unrolling itself I duplicated, for two reasons:
  • Unrolling does not only unroll the trace loop; it also gathers information about the trace and applies many more optimizations than just peeling off one loop iteration. This includes guard strength reduction, guard implication, loop invariant code motion and more.
  • The unrolling factor is not easily adjustable; it has the sole purpose of peeling the loop body just once.
The only requirement VecOpt has: it needs a way to unroll the loop X times and correctly rename the variables. This is implemented in less than 200 lines of code and is specifically tailored to VecOpt.

Renaming a loop iteration needs a mapping (a dict) between the jump arguments and the label arguments, and it must instantiate new variables for operations that create one. In each unrolling iteration the mapping is used to rename the arguments of the operations. The idea is also described in the loop unrolling optimization article here (chapter 5).
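
A minimal sketch of the renaming step (the Op structure and the helper names here are hypothetical, not RPython's actual classes):

from collections import namedtuple
from itertools import count

Op = namedtuple('Op', ['name', 'args', 'result'])
_counter = count()

def new_var():
    return 'v%d' % next(_counter)

def peel_iteration(label_args, jump_args, ops):
    # map each label argument to the corresponding jump argument
    rename = dict(zip(label_args, jump_args))
    peeled = []
    for op in ops:
        args = [rename.get(a, a) for a in op.args]
        result = new_var() if op.result is not None else None
        if result is not None:
            rename[op.result] = result   # later uses see the fresh variable
        peeled.append(Op(op.name, args, result))
    return peeled, rename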

Benefit

VecOpt benefits from the unrolling optimization. Without it, chances are very low that a vectorizable loop can be found and transformed into a faster trace loop. There are two reasons for that:
  • The peeled loop in most cases contains significantly fewer guards than the original trace.
  • Most of the loop overhead has been removed.
Guards increase the dependency edge count in a trace. In many cases they make two instructions dependent even though, without the guard, they could in theory be executed in parallel.

If loop overhead is not removed, then the gain might not be that significant when using SIMD instructions.

Many loops, after the unrolling optimization has been applied, contain less than half of the instructions they started with.

Summary

VecOpt is applied to the peeled loop that the unrolling optimization outputs. This smaller loop is then unrolled X times, where X is determined by the smallest type loaded in the trace loop. Without the unrolling already done by PyPy there would be little chance to find parallel instructions and group them into vector statements. In a nutshell one could say: VecOpt unrolls an already unrolled loop.
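
For example, the unroll factor follows directly from the vector register width and the smallest element type loaded in the loop (a sketch, assuming 16-byte SSE registers):

def unroll_factor(smallest_element_bytes, vector_bytes=16):
    # 16-byte register / 8-byte float64 -> unroll 2x
    # 16-byte register / 2-byte int16   -> unroll 8x
    return vector_bytes // smallest_element_bytes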






GSoC the first two weeks

The community bonding phase ended almost two weeks ago and I have been very busy working on my proposal.

My situation might be quite different from other GSoC participants. I started my proposal in early February and also wrote code at that time. Thus I did not have a rough start with the project setup.

The major thing I have been working on is ironing out some problems in the dependency graph construction. I use a dependency graph to reschedule the PyPy trace, and there were some corner cases I did not think of (trace operations with side effects, and guards). Missing edges are fatal, because instructions might end up in the wrong order.
In addition, I designed and added a cost model to decide whether or not it is reasonable to vectorize a trace loop. This prevents the optimizing backend from spending time on traces that will not get faster.
I already compared the idea to classical vectorization in the last post. The biggest problem is that without a cost model it can sometimes happen that non-vectorizable loops end up with a single vector load statement that later unpacks the vector elements again. This is not desirable, and the cost model prevents it from happening.
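
A toy sketch of such a cost model (the weights and operation kinds are made up for illustration; PyPy's real heuristics differ):

from collections import namedtuple

# kind is one of 'vec_arith', 'pack', 'unpack', 'scalar'
TraceOp = namedtuple('TraceOp', ['kind', 'lanes'])

def worth_vectorizing(ops):
    profit = 0
    for op in ops:
        if op.kind == 'vec_arith':
            profit += op.lanes - 1    # lanes computed in parallel are pure gain
        elif op.kind in ('pack', 'unpack'):
            profit -= 2               # moving elements in and out of vectors costs
    return profit > 0

# a loop that merely packs and unpacks around one vector operation is rejected:
assert not worth_vectorizing([TraceOp('pack', 2),
                              TraceOp('vec_arith', 2),
                              TraceOp('unpack', 2)])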

Looking at the schedule of my proposal I'm quite far ahead and already working on things I proposed for the second term of GSoC. This is great because there is lots of time to fine-tune, and I have already discussed with my mentors and project members where we could improve even further.

My next post explains the setup I started working on back in February/March and shows how I unroll loops.