Inlining Branches and Sharing Metatables

In this post, we continue the quest to make Brat faster (which essentially equates to making LuaJIT happier).

Inline Branches

Previously we removed inner function creation by lifting functions out into the global scope and faking closures when needed.

One place where a lot of functions show up is in conditionals. true?/false?/null? are all just functions that take a condition and two branches. In the Tak benchmark, you can see the branches are wrapped in functions:

tak = { x, y, z |
  true? y < x
    { tak tak(x - 1, y, z), tak(y - 1, z, x), tak(z - 1, x, y) }
    { z }
}

This is standard Brat style to delay execution of the branches.

Before the lifting of functions, these functions would be created everytime the tak function was called. That’s pretty bad! Now the functions will be lifted and fake closures will be used instead.

However, what’s better than lifting a function? Not calling a function at all! Since the conditionals are used all the time and are core functions in the language, it makes sense to optimize them to just be regular Lua if statements.

When can we do this? Any time the branch is a liftable function. That’s convenient, since we already have the logic to figure that out.

To inline the branches, they are treated almost exactly like functions. A new scope is created and what would have been the body of the method is output in a do...end block. Instead of a return value, the result is just passed back in a variable. The condition and the branches are then put into a reguar if statement with guards just in case someone decides to override true?/false?/null? (which is possible but unlikely. If it happens, the original code without inlining is used.)

What are the results?

Tak benchmark before inline branches: 0.751 seconds

Tak benchmark after inline branches: 0.431 seconds

A nice 43% improvement!

Shared Metatables

Metatables are Lua’s way of overriding behavior of a table. For example, you can set the method to be called when brackets [] are used, or what method to use for conversion to a string. Brat sets up these two methods for every new object.

In wandering the web looking for nuggets of LuaJIT wisdom, I found this email from Mike Pall. In it, he notes that the parent post was creating new metatables for each new object, but the methods were the same.

Looking at Brat’s object creation, it already factored out the methods, but a new metatable was created for each new object. It was a simple change to always use the same one, and the change had no ill side effects on existing code.

Results

Tak benchmark after metatable change: 0.225 seconds. Another 48% improvement! These two changes together reduced the runtime for the Tak benchmark by 70%.

Similarly, the Kaprekar benchmark went from 86 seconds in our last blog post to just 21 seconds - another 70% improvement. Fibonacci (king of microbenchmarks) runs in just 0.043 seconds.

For more real-world use, these two optimizations reduced parsing time of peg.brat (ironically the current largest Brat file) by 42%.

While Brat is still not (nor will ever be) particularly fast in general, it is fun to continue pushing it.

Optimizing with Lifting Functions and Faking Closures

Brat uses functions all over the place. Everything between curly braces is a function, and every function is also a closure (meaning it saves its environment).

For example, here is the standard recursive definition of the Fibonacci sequence:

fibonacci = { x |
  true? x < 2, x, { fibonacci(x - 1) + fibonacci(x - 2}
}

fibonacci 25

Of course fibonacci itself is a function, but so is the block passed to the call to true?.

Unfortunately, this means every call to fibonacci will create a new inner function/closure. Even more unfortunately, LuaJIT does not currently compile creation of closures.

Below is the output of luatrace showing how the JIT performed.

TRACE SUMMARY
=============
Trace Status                 Traces       Bytecodes           Lines
------------                 ------       ---------           -----
Success                   13 ( 39%)      244 (  4%)       63 (  5%)
NYI: bytecode FNEW        18 ( 54%)     5098 ( 91%)     1080 ( 90%)
blacklisted                1 (  3%)      138 (  2%)       38 (  3%)
NYI: bytecode UCLO         1 (  3%)       62 (  1%)       11 (  0%)
------------------  --------------- --------------- ---------------
Total                     33 (100%)     5542 (100%)     1192 (100%)
==================  =============== =============== ===============

39% of attempted traces were successfully compiled to native code, but 54% were aborted because of the closure creation. The running time (average of 5 runs) was 0.894 seconds.

Naturally, it would be good to get rid of the closure creation so LuaJIT can compile more code. But how?

Faking Closures

The first step is to figure out how we can make a closure without making a closure. All we need is the function itself and somewhere to put the variables it needs to access but are outside its own scope.

This is accomplished with a simple data structure that stores a function and a table with variable names and their values.

The creation of this data structure looks like this:

local _temp13 = _lifted_call(_lifted1, {})
_temp13.arg_table['_temp1'] = _temp1
_temp13.arg_table['_temp2'] = _temp2

For reasons covered below, this is called a “lifted call”. _lifted1 is the name of the function being stored. After creating the new stored call, the variables are stored into the table. For simplicity, the keys are the same as the variable names.

Now we have a package with the function and the values it would normally capture as a closure. For convenience, the package can be called just like a function. Unfortunately, it is not a function, so the compiled Brat code must check if a variable is a function or one of these packaged up calls (which just look like Lua tables otherwise). Either way, it can be invoked the same way.

Lifting Functions

The next step is to move the function creation outside of any other functions, essentially “lifting” or “hoisting” it up and away. This is so it only gets created once.

The lifted function accepts the table of variables, self, and then any normal arguments. In our example, there are no normal arguments, so the function starts like this:

_lifted1 = function(_argtable, _self)
  local _temp1 = _argtable['_temp1']
  local _temp2 = _argtable['_temp2']

At the beginning of the function it reads the variables back out of the table and into local variables. These have the same names as before, so the function can be compiled the same as if it had not been lifted.

The Dirty Details

Not all inner functions can be lifted. In our implementation, the function cannot really access the outside variables, only their values have been copied into the local scope. This means any function which sets the value of a variable outside its own scope (an “upvar”) cannot be lifted.

Unfortunately, it gets worse, though. If any inner function sets an upvar, none of its outer functions can be lifted, either.

Even worse, if a function at the same level or lower sets an upvar, none of the functions at the same level can be lifted.

For example:

f = {
  x = 1
  while { x < 10 } { x = x + 1 }
}

None of the two inner functions can be lifted. If { x < 10 } is lifted, it will get a snapshot of x and the later assignment will not affect it.

In theory f could be lifted although as a top-level function it would not do any good.

Results

Going back to Fibonacci, how does the JIT trace look after lifting out the inner function?

TRACE SUMMARY
=============
Trace Status                         Traces       Bytecodes           Lines
------------                         ------       ---------           -----
Success                           20 ( 90%)    11541 ( 70%)     1114 ( 83%)
down-recursion, restarting         1 (  4%)     3753 ( 22%)      125 (  9%)
call unroll limit reached          1 (  4%)     1101 (  6%)       97 (  7%)
--------------------------  --------------- --------------- ---------------
Total                             22 (100%)    16395 (100%)     1336 (100%)
==========================  =============== =============== ===============

Nice! 90% of the attempted traces were compiled, and the aborted traces were only due to recursion. The average of five runs was 0.678 seconds, which is 24% faster.

In this case, even the overhead of our own fake closures was worth it to get the JIT-compiled code.

A different example, calculating Kaprekar numbers went from 122 seconds to just 86, ~30% faster.

Brat is Now 'Self-Hosting'

Unnecessary Backstory

When development of Brat began in 2009, it used the Ruby Treetop PEG library to generate a parser which compiled Brat to Neko. Some time around May 2010 I started working on moving from Neko to Lua, which involved a total rewrite of the compiler part of the parser. Except I kept using Treetop PEG (and therefore Ruby). By September the rewrite was done and Brat was off Neko and onto Lua. At the beginning of 2011, plain Lua was replaced by LuaJIT..

In April of 2011, I started implementing a PEG in Brat. Over four years later, the PEG library, parser, and (now separate) compiler are complete enough that I have gotten rid of the Treetop and Ruby dependencies. Hooray!

The New Compiler

Let’s get this out of the way up front: Brat still compiles to Lua and runs on LuaJIT. But the parser and compiler are now written in Brat. Yes, this is actually slower (at the moment) than using Treetop and Ruby. However, I am hopeful this will improve because the new compiler doesn’t have all the optimizations implemented that the old one did. In particular, lifting functions to avoid trace aborts in LuaJIT (something I’ll have to write up some day).

First, there is a generic PEG implementation in Brat’s standard library. Along with that, there is a hand-coded PEG to parse a PEG language very similar to Treetop.

This allowed me to mostly copy the old gammar to a new one. Unlike the old one, though, the parser grammar is separated out from the implementation. This makes it much easier to read! The grammar is compiled to Brat, which is then compiled to Lua.

Unlike the old parser/compiler which converted straight from the source code to Lua, the new one first converts the source to an abstract syntax tree (AST) made up of s-expressions.

The AST can then pass through any number of transformers. There is only one right now. It determines information around whether variables are local or not, which simplifies the compiled code later on.

Finally, the AST goes to the compiler (also written in Brat). The compiler converts the AST to Lua.

Bootstrapping

Management of compilation (compiling files, compiling from STDIN, and interactive mode) has been handled by bin/brat. This is still the case, except it calls brat2lua which handles the source->AST->compiler->Lua transformation.

However, everything from brat2lua on is written in Brat. So how are those files compiled? That is the essential problem of bootstrapping a language. Fortunately, we have the entire previous parser/compiler to use. The repo contains the compiled Brat files necessary to run the parser and compiler.

This means Brat’s dependencies are down to a C compiler to compile LuaJIT and the other C libs - everything else is included in the source repository!

More posts…

Fork me on GitHub