Unladen Swallow Retrospective

malkia · on March 26, 2011

Somewhat relevant to the topic - Mike Pall discusses (2009) the usage of llvm for lua - he's not ditching the approach, just pointing out the difficulties, and why he took on making luajit the way it is:

http://lua-users.org/lists/lua-l/2009-06/msg00071.html

kingkilr · on March 26, 2011

A big one Mike misses (because it isn't as critical for Lua AFAIK) is escape analysis. For PyPy our aggressive escape analysis can bring huge wins on things like numeric code and some types of string processing, but no static compilers really do this type of optimization, because it's not very important in C-type languages.

haberman · on March 27, 2011

Out of curiosity what makes this such a big win for Python, but not so much for Lua or C-type languages?

kingkilr · on March 27, 2011

I honestly don't know why it's not such a big deal for Lua, but given LuaJIT's performance, and knowing that it doesn't do escape analysis I know it must not be a big deal :) In Python it's a big deal because everything is boxed, and your inner loops just get bogged down with allocations, which are expensive compared to arithmetic operations.

rayiner · on March 27, 2011

It's not a big deal for LuaJIT because it uses a clever tagging format that can hold a boxed 64-bit float in a register.

beagle3 · on March 27, 2011

LuaJIT being a trace compiler, escape analysis in implicit. As long as a value stays (say) int or float, the code to handle it stays int or float. If for whatever reason it switches type, the code will change to accommodate that -- but as long as the types (and/or values) stay the same, the machine code to handle them will take advantage of that.

kingkilr · on March 27, 2011

It's absolutely not true that escape analysis is implicit in tracing: https://bitbucket.org/pypy/extradoc/raw/63e4617062b2/talk/pe...

scott_s · on March 27, 2011

I can't speak for Lua, but for C, it's because all memory management is handled explicitly by the programmer - that precludes optimizations where dynamic allocation is replaced with stack allocation. The C compiler isn't free to change such things, and most C programmers already perform that optimization without thinking about it. (I don't mean to imply that state of affairs is a good thing, it's just so ingrained in a C programmer that they don't think of it as an optimization.)

Also, since C allows access to raw-memory by design, it can't guarantee that, say, my memory allocation in function f() isn't touched by my funky pointer-arithmetic in function g(). (Perhaps in theory it could, but in practice this problem is mind-blowingly difficult.)

See http://en.wikipedia.org/wiki/Escape_analysis

pbiggar · on March 28, 2011

That's interesting. In Java, escape analysis wasn't really a win from a memory perspective when there were good GCs involved. Why is it different in Python/PyPy?

malkia · on March 27, 2011

Mike said he'll be working on escape analysis to decrease the garbage collection if I remember correctly...

malkia · on March 29, 2011

I take this back. I've read all Mike's post in the lua mailing list, and found the reference where he talks about it, and in fact as not considering it (yet)

it was from old posting in 3/14/2006 5:50PM:

"* Garbage collection and heap allocation put Lua at a speed disadvantage to languages with manual memory management. The impact is less in Lua than other dynamic languages because of typed-value storage and immutable shared strings. Adding a custom memory allocator to the Lua core could be beneficial. Complex solutions like escape analysis are not on my radar for LuaJIT (yet). "

jnoller · on March 27, 2011

See also Collin's comment on the post: http://qinsb.blogspot.com/2011/03/unladen-swallow-retrospect...

rayiner · on March 27, 2011

The points about LLVM being designed for "static C-like languages" isn't totally on-point. There as an impedance mismatch between Python and LLVM, but it's less about dynamic versus static than it is about the nature of the stack frame.

In LLVM, most optimizations operate on SSA values (the Value class in the LLVM IR). There is some support for CSE-ing loads, etc, but the effectiveness of that depends heavily on your alias information. So to get good optimization out of LLVM, you've got to represent things as Value's.

This is hard to do in Python. Python's lexical scoping semantics are a little bit wonky and there are lots of scenarios in which the thread's stack frame is reified as an object. So without some heavy analysis, you end up keeping your local variables in an activation frame object on the heap, and at that point you're toast. The local variables themselves won't be represented as SSA Value's, and most of the LLVM optimizations won't do anything with them.

This is not a "dynamic language" thing per se. Lisp, which is also a dynamic language, actually maps quite cleanly to LLVM. Every binding introduced by LET is already in SSA form unless it is either closed-over, assigned-to, or both.

1) If the value is just closed-over, you demote it to the function's environment and replace uses of the variable with a load from the environment vector.

2) If the value is just assigned-to, you just demote it to a stack slot via ALLOCA and LLVM's mem2reg pass will take care of re-promoting it to a Value. This latter technique is exactly what Clang does for all C local variables, so LLVM is highly-tuned for handling this scenario. In C, variables that have their address taken cannot be promoted, but this cannot happen in Lisp so every assigned-to value demoted to an ALLOCA should be promotable.

3) If a value is both assigned-to and closed over, you demote it to a heap-allocated box and replace all uses with heap references.

After this transformation, nearly every Lisp variable is an SSA Value, and the optimizers can work with them. Even if you use function calls to do generic arithmetic, etc, LLVM will happily CSE them for you as long as you mark those functions as readnone (ie: pure).

Now, there are some things LLVM won't do for you. It can't const-propagate generic arithmetic because it does't know the semantics. It can't do reassociation, etc, because you're not using the ADD/SUB, etc instructions. I don't see anything that would prevent you from doing it yourself in a custom pass, however.

In short, the criticism isn't so much that LLVM has an impedance mismatch with dynamic languages as it is that it only handles the bottom of the optimization stack. You still need to do the high-level language-specific optimizations before handing things to LLVM.

joevandyk · on March 27, 2011

Huh, first time I've seen the word "reified".

http://en.wikipedia.org/wiki/Reification_(computer_science)

beza1e1 · on March 27, 2011

I think the deeper problem is not that stuff like arithmetic is done by function calls, but that it is done by method calls. This requires points-to-analysis and a lowering to function calls first. Since this analysis will be very conservative for performance reasons, you want to steal more tricks. Tracing seems to be en vogue today.

rayiner · on March 27, 2011

Global type inference (which is what you'd be doing if you used a points-to analysis to figure out what methods would be called) is a fragile optimization.

Runtime type feedback (observing the actual type at call sites and doing optimistic compilation based on that) is more robust and predictable (your points-to analysis doesn't suddenly fail just because there is a single polymorphic use somewhere in the code). That in itself is orthogonal to tracing. Runtime type feedback was developed on whole-method compilers, after all. Tracing is neat, and it's in vogue because it gets you to a "good" optimizer with relatively little development cost, but you can do all the things tracing does in a whole-method compiler with more engineering work. On-stack replacement (necessary for doing optimistic compilation) is arguably easier in a tracing compiler, and is definitely the major thing missing in LLVM for dynamic language implementations.

But all of this is one step beyond the threshold problem. Even all these tricks don't help you if you can't reify your local variables as SSA values --- all that fancy LLVM machinery won't see anything to optimize.

jpr · on March 27, 2011

> Python's lexical scoping semantics are a little bit wonky and there are lots of scenarios in which the thread's stack frame is reified as an object.

Could you give an example of such a scenario? I'm having a hard time imagining what this means in code. I know that Python's scoping is a little weird, so this isn't really surprising...

rayiner · on March 27, 2011

There are various aspects of Python's local variable semantics that seem to harken to days when stack frames were implemented with dictionaries (eg: del). There may be a way to map this cleanly to register variables but it's not clear to me how you'd do this.

Beyond that there are several places where you can introspect the activation. See: locals(), some of the fields in the code object, etc. Again, there may be implementation techniques you can use to get around this, but there isn't an obvious and clean mapping to SSA values as there is for Lisp-like languages.

mitchellh · on March 26, 2011

The blog post touched on it briefly with regards to PyPy but I wanted to ask here since I'm sure someone knows a lot more than I do on this. I didn't follow unladen swallow very closely, but it seemed like an exciting project from the get-go. Its sad to see it go, but it appears its not in vain.

I want to ask: What lasting effects did the work that went into Unladen Swallow have on any Python implementations (or any other languages)? What "legacy" was left?

kingkilr · on March 26, 2011

The primary legacy of Unladen Swallow (this was something that everyone agreed on at the VM summit at PyCon) were: a) improvements to LLVM for all users of its JIT, things like valgrind and GDB integration, lifting that 16 MB of machine code, and fixing various bugs, b) developing a cohesive set of benchmarks for the Python community, and c) drawing more attention to the idea of a fast Python.

cdavid · on March 27, 2011

B is especially important for python people - it is my understanding that a lot of pypy current benchmarks are coming from unladen swallow.

tianyicui · on March 27, 2011

> Our potential customers eventually found other ways of solving their performance problems that they felt more comfortable deploying.

Just curious, what are the "other ways"?

nikcub · on March 27, 2011

"Most Python code at Google isn't performance critical."

What about AppEngine? The primary 'cloud' platform at Google.

jjm · on March 27, 2011

You would think that such a product would in fact be an excellent candidate for high performance but I believe that group strives more toward process scalability than any particular low level optimization.

rbanffy · on March 27, 2011

Most of the time in GAE is lost to latencies between the various subsystems (much like most web frameworks). Unless you are doing something processor-heavy (and you probably shouldn't do that on GAE), chances are you are waiting more for the network (or the datastore) than anything else.

jbarham · on March 27, 2011

"Our potential customers [within Google] eventually found other ways of solving their performance problems that they felt more comfortable deploying."

I wonder how much this includes teams within Google that have since started using Go for writing performance sensitive applications...

jnoller · on March 27, 2011

Not very many, as far as I've been told.

18pfsmt · on March 27, 2011

Are we talking about an african or european swallow?

rbanffy · on March 27, 2011

It's definitely unfair to mod this down. It's a very pertinent question when Python is discussed.

rbanffy · on March 28, 2011

sigh... There was a time most HNers would get the joke.