Thursday, June 18, 2015

It is ... alive!!!

I have been quite busy the last few weeks improving my solution. Most of my time has been dedicated to the accumulation of values. But first I have to tell you about the ...

Breakthrough


I had already measured a speedup in my sample interpreter, but not in the NumPy library. I have since tested and hardened the edge cases, and it is now possible to measure a speedup using the NumPy library.

Micro benchmark


import numpy as np

a = np.arange(1000.0)
b = np.arange(1000.0)
for i in range(10000):
    a = a + b

Running this program, one can measure a speedup of ~1.33x.

Well, that is not quite the theoretical maximum of 2.00 (SSE4).

I then spent time analyzing the behavior using several profiling utilities. The built-in Python profiler did not do the job, because it is unaware of the underlying JIT. Thus I used the brand new vmprof and gprof.

Sidenote: I used gprof only for verification, but if a statistical profiler is enough for your Python program, go for vmprof! The overhead is minimal and it is possible to get live profiling feedback from your application! In combination with the jitviewer you can find out where your time is spent.

This helped me a lot: the above loop spends about half of its time copying memory. So if the loop body is exchanged with np.add(a, b, out=a), the speedup increases to 1.70-1.80.
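For reference, the rewritten micro benchmark looks like this (same loop as above, using NumPy's add ufunc with its out= argument so the result is written in place):

import numpy as np

a = np.arange(1000.0)
b = np.arange(1000.0)
for i in range(10000):
    np.add(a, b, out=a)  # result is written into a, no temporary array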

That is better, but where is the rest of the time spent? Sadly, the profile says it is in the loop around the NumPy call. One of my mentors has suggested that there might be possibilities to improve the register allocation, and I'm currently evaluating ways to exchange parts of the allocator and add some heuristics to improve it.

The loop itself is an order of magnitude faster than the scalar loop. So I'm quite happy that my idea really worked out.

Accumulation


That is the other big thing I have been working on. I did not suggest this improvement in my GSoC proposal, but I still want to include it.
Frequently used functions in scientific computing are sum, prod, any, all, max, min, ...

Some of them consider the whole array; others bail out early once a deciding element has been found. There is potential to use SIMD instructions for these operations.

Let's consider sum(...). Addition on the reals is commutative (and associative), so the elements may be summed in any order:

x + y = y + x   for all x, y ∈ ℝ

Thus I have added a temporary vector register for the summation, the accumulator. Instead of resolving the dependency with a horizontal add (supported by x86 SSE4), the loop builds partial sums in the accumulator's lanes. At every guard exit the accumulator is then summed horizontally. Again, the theoretical speedup is a factor of 2 when using float64 on SSE4.
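The following pure-Python sketch illustrates the idea for two lanes (float64 on SSE4). The function and variable names are made up for illustration; this is not RPython code:

def lane_sum(values, lanes=2):
    accum = [0.0] * lanes                 # the vector accumulator
    n = len(values) - len(values) % lanes
    for i in range(0, n, lanes):          # one vector addition per step
        for l in range(lanes):
            accum[l] += values[i + l]     # lane-wise partial sums
    total = sum(accum)                    # the horizontal add at the exit
    for v in values[n:]:                  # scalar tail for the rest
        total += v
    return total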

I have not yet managed to compile a version that fully works for sum, but I'm quite close to it. Other functions like all or any are more complex: it is not so easy to recognize the reduction pattern when more than one operation is involved, so I will add a pattern matcher for those instructions. Let's have a look at the following example (for all):

d = int_and(a,b)
guard_true(d)
e = int_and(c,d)
guard_true(e)

This should be transformed into the following vector statements (guard compensation code omitted):

v = vec_int_and([a,c], accum)
guard_true(v)
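A rough sketch of how such a pattern matcher could recognize the chain is shown below. The tuple encoding of trace operations is invented for illustration and is not RPython's real representation:

def match_and_reduction(trace):
    operands, acc = [], None
    i = 0
    while i + 1 < len(trace):
        op, guard = trace[i], trace[i + 1]
        if op[0] != 'int_and' or guard != ('guard_true', None, (op[1],)):
            return None                   # pattern broken
        a, b = op[2]
        if acc is None:
            operands += [a, b]            # first link of the chain
        elif acc == b:
            operands.append(a)            # chained through the accumulator
        else:
            return None
        acc = op[1]
        i += 2
    return operands, acc

trace = [('int_and', 'd', ('a', 'b')),
         ('guard_true', None, ('d',)),
         ('int_and', 'e', ('c', 'd')),
         ('guard_true', None, ('e',))]
print(match_and_reduction(trace))         # (['a', 'b', 'c'], 'e')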

I did not expect...


I have evaluated the possibility of vectorizing arbitrary PyPy traces that use the array module. This does not work for PyPy traces yet, but it does work in my test toy language (located here). Let's take a look at the following user program:

while i < 100:
    a[i] = b[i] + c[i] * 3.14
    i += 1

a, b and c are array objects of the Python array module. Their elements are homogeneous and adjacent in memory, so the resulting trace could be transformed into a vectorized form.
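For completeness, here is a runnable version of that program; the initialization of the three arrays is assumed, since only the loop is shown above:

from array import array

a = array('d', [0.0] * 100)
b = array('d', range(100))
c = array('d', range(100))

i = 0
while i < 100:
    a[i] = b[i] + c[i] * 3.14
    i += 1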

Two limitations currently make it impossible to vectorize this user program:
1) Python checks array boundaries (also negative indices) for each load/store operation. This adds an additional guard to the trace.
2) The boxing of integer values. The index variable is recreated at the end of each loop iteration and incremented. This introduces several non-pure operations into the trace, including the memory allocation of an integer box every iteration.

I do not yet know how I will get around these problems, but for the second limitation I'm quite sure that the creation of the integer box can be moved to the guard exit.

Thursday, June 4, 2015

PyPy unrolling trace loops

Note: Please see my last post "GSoC the first two weeks"

I already outlined the big picture of the final optimization. In many places I assumed certain conditions and added simplifications to explain the concept more easily. In this and the next posts I will present much more detail about all the different parts I have added to RPython. Note that I might recapitulate some basics of the terminology to make it easier to understand for those who are not familiar with tracing or method JIT compilers.

RPython's optimizations

The basic data structure is a list of operations called a "trace". A trace has no entry points other than its header instruction. In addition, it may only be exited at a special instruction called a "guard". Note that a trace does not necessarily have a backward jump, but in the following only trace loops are considered. Here is an example of a trace. The red edge shows what is not allowed to happen: jumps are only valid to the label instruction.



The optimizer takes this list of operations and transforms it into an equivalent trace. Many of the optimizations pass through the list of operations only once, gathering information and emitting, transforming, or leaving out the current instruction. Optimizations known from other compilers are easier to implement here, because a trace is a single basic block instead of a region or control flow graph.
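As a toy example of such a single-pass optimization, here is a constant-folding pass over a trace modeled as a list of tuples. The encoding is invented for illustration and is not RPython's actual data structure:

def fold_constants(trace):
    consts = {}                           # variable -> known constant
    out = []
    for op in trace:                      # a single pass over the trace
        if op[0] == 'int_const':
            consts[op[1]] = op[2]
            out.append(op)
        elif op[0] == 'int_add' and op[2] in consts and op[3] in consts:
            folded = consts[op[2]] + consts[op[3]]
            consts[op[1]] = folded
            out.append(('int_const', op[1], folded))  # emit folded op
        else:
            out.append(op)                # leave the instruction as-is
    return out

trace = [('int_const', 'a', 2), ('int_const', 'b', 3),
         ('int_add', 'c', 'a', 'b')]
print(fold_constants(trace))              # the add becomes ('int_const', 'c', 5)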

There is one optimization (one of the most powerful) that is different, though: the unrolling optimization.

Trace loop unrolling

The proposed algorithm (henceforth called VecOpt) needs a way to unroll the loop body several times. So why not reuse the unrolling optimization already present in PyPy? I tried to reuse it (and did, in a certain way), but the unrolling itself I have duplicated for two reasons:
  • Unrolling does not only unroll the trace loop; it also gathers information about the trace and applies many more optimizations than just peeling one loop iteration. This includes guard strength reduction, guard implication, loop invariant code motion and more.
  • The factor is not easily adjustable. It has the sole purpose of peeling the loop body just once.
The only requirement VecOpt has: it needs a way to unroll the loop X times and correctly rename the variables. This is implemented in fewer than 200 lines of code and is specifically tailored to VecOpt.

Renaming a loop iteration needs a mapping (a dict) from the jump arguments to the label arguments, and new variables must be instantiated for operations that create variables. In each unrolling iteration the mapping is used to rename the arguments of the operations. The idea is also described in the loop unrolling optimization article here (chapter 5).
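A minimal sketch of this renaming during unrolling is given below, modeling trace operations as (opname, result, args) tuples; this is illustrative only, not RPython's real classes:

def unroll(label_args, jump_args, body, factor):
    counter = [0]
    def fresh(v):
        counter[0] += 1
        return '%s.%d' % (v, counter[0])  # a new name for each copy

    ops = list(body)
    cur_jump = list(jump_args)
    for _ in range(factor - 1):
        # map label arguments to the previous copy's jump arguments
        rename = dict(zip(label_args, cur_jump))
        for opname, result, args in body:
            args = tuple(rename.get(a, a) for a in args)
            if result is not None:
                rename[result] = fresh(result)
                result = rename[result]
            ops.append((opname, result, args))
        cur_jump = [rename.get(a, a) for a in jump_args]
    return ops, cur_jump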

Benefit

VecOpt benefits from the unrolling optimization. Without it, the chances are very, very low that a vectorizable loop can be found and transformed into a faster trace loop. There are two reasons for that:
  • The peeled loop in most cases contains significantly fewer guards than the original trace
  • Most of the loop overhead has been removed
Guards increase the dependency edge count in a trace. In many cases they make two instructions dependent even though, without the guard, they could in theory be executed in parallel.

If loop overhead is not removed, then the gain might not be that significant when using SIMD instructions.

Many loops contain less than half of their original instructions after the unrolling optimization has been applied.

Summary

VecOpt is applied to the peeled loop that the unrolling optimization outputs. This smaller loop is then unrolled X times, where X is determined by the smallest type loaded in the trace loop. Without the unrolling already done by PyPy, there would be little chance to find parallel instructions and group them into vector statements. In a nutshell: VecOpt unrolls an already unrolled loop.
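As an illustration of how X could be derived (assuming a 16-byte SSE4 vector register; the function is made up for this example):

VECTOR_BYTES = 16                         # SSE4 register width in bytes

def unroll_factor(loaded_type_sizes):
    smallest = min(loaded_type_sizes)     # e.g. 8 bytes for float64
    return VECTOR_BYTES // smallest       # elements that fit one register

print(unroll_factor([8]))                 # float64 only -> unroll 2x
print(unroll_factor([4, 8]))              # int32 and float64 -> unroll 4x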


GSoC the first two weeks

The community bonding phase ended almost two weeks ago and I have been very busy working on my proposal.

My situation might be quite different from that of other GSoC participants. I started my proposal in early February and also wrote code at that time, so I did not have a rough start with the project setup.

The major thing I have been working on is ironing out some problems in the dependency construction. I use a dependency graph to reschedule the PyPy trace, and there were some corner cases I did not think of (trace operations with side effects and guards). Missing edges are fatal, because instructions might end up in the wrong order.
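To give an idea, here is a simplified sketch of the edge construction; the tuple encoding and the side-effect flag are invented for illustration:

def build_dependencies(trace):
    edges = []                            # (producer index, consumer index)
    def_site = {}                         # variable -> defining op index
    last_effect = None                    # index of last side-effecting op
    for i, (opname, result, args, effect) in enumerate(trace):
        for a in args:                    # read-after-write dependencies
            if a in def_site:
                edges.append((def_site[a], i))
        # guards and side-effecting ops must not move across earlier
        # side effects -- one of the corner cases mentioned above
        if (effect or opname.startswith('guard')) and last_effect is not None:
            edges.append((last_effect, i))
        if effect:
            last_effect = i
        if result is not None:
            def_site[result] = i
    return edges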
In addition to that, I designed and added a cost model to decide whether or not it is reasonable to vectorize a trace loop. This prevents the optimizing backend from spending time on traces that will not be faster.
I already compared the idea to classical vectorization in the last post. The biggest problem is that, without a cost model, a non-vectorizable loop can sometimes end up with a single vector load statement that later unpacks the vector elements again. This is not desirable, and the cost model prevents it from happening.
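A toy version of such a cost check might look like this; the cost table and the per-op costs are made up for illustration:

VECTOR_COST = {'vec_load': 1, 'vec_store': 1, 'vec_add': 1,
               'vec_pack': 2, 'vec_unpack': 2}

def profitable(scalar_ops, vector_ops):
    scalar_cost = len(scalar_ops)         # assume one unit per scalar op
    vector_cost = sum(VECTOR_COST.get(op, 1) for op in vector_ops)
    return vector_cost < scalar_cost      # vectorize only if cheaper

A loop that only manages a single vec_load and then has to unpack every element again would fail this test and stay scalar.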

Looking at the schedule of my proposal, I'm quite far ahead already and am working on things I proposed to do in the second term of GSoC. This is great because there is lots of time to fine-tune, and I have already discussed with my mentors and project members where we could improve even further.

My next post explains the setup I started working on back in February/March and shows how I unroll loops.