A Note on Using LuaJIT 2.0 beta4

LuaJIT 2.0 performs well on the Alioth benchmarks, but initially it did not perform as well as I expected on my real code.
Here is my experience of tuning the LuaJIT parameters to bring the performance up to the level I was hoping for.
The regular Lua interpreter completed my script in 100s.
LuaJIT 2.0 with default settings ran it in 30s, not far from the 40s it took with the JIT compiler turned off (luajit -joff).
After tuning, the script completed in 8s.

  1. Increase -Omaxmcode. The default is -Omaxmcode=512 (in KBytes), which is rather small: it fits small benchmark programs well, but as soon as you have branchy code or a large code base, many traces end up competing for the machine code cache. In my case the gains were spread fairly evenly across the traces, so losing any of them had a noticeable impact. I went up to 4096; I explain why below.
  2. Use -jv! It gives a lot of information about what is going on and proved to be quite a useful tool for figuring out what happened.
    When a limit is reached, it prints an explicit message that can help with tuning the parameters. One message that got my attention was about PHIs.
  3. Patch LuaJIT: increase LJ_MAX_PHI!
    This was suggested to me by Mike Pall when I asked him what the message meant. I will quote him so as not to distort his explanation:
    "> [TRACE --- (41/0) xxxxxxxx.lua:528 -- too many PHIs]
    Umm, this means you have a loop with more than 32 loop-carried
    variables. That's rather unusual. Maybe it happens indirectly
    through load-forwarding. Bumping the LJ_MAX_PHI define would get
    rid of the message, but that won't necessarily help performance."

    So I increased LJ_MAX_PHI to 64 in lj_def.h, but didn't see an immediate gain, even after increasing maxmcode. So, luajit -jv again...
  4. -Omaxsnap=128. Once I had changed LJ_MAX_PHI, I started getting more snapshots per trace, probably because the combinations of variables were bigger. This is just a guess, but maybe someone can study the relation between the number of PHIs and the number of snapshots per trace? Of course, having more snapshots doesn't help if the traces can't fit in the cache, so I increased the cache size again, up to -Omaxmcode=4096. At this point my script was running in 10s, a rather impressive speed for this code.
  5. Don't try more loop unrolling. It doesn't help. What may actually help is reducing it. I guess fetching the code of large traces has a price in the CPU cache.
  6. Actually, don't try to change the other default parameters. I made a few experiments and even started patching the Acovea framework, but didn't take the time to complete it. My feeling is that further changes to the settings are probably not needed, as the defaults already fit most program shapes quite well.
  7. Remove logging/debug traces in production... In my case this gave another 20% speed increase, down to 8s (from 100s with the regular Lua 5.1 interpreter). Here is a fragment of my discussion with Mike Pall:
    "> So my question is: could it be possible to imagine a trace
    > framework allowing almost no penalty when disabled in jit?

    Since a lot of the code isn't compiled in your benchmark, the
    interpreter has to deal with it. If everything was compiled, it
    wouldn't matter much which kind of trace framework you use, as the
    compiler could usually hoist it.

    If you absolutely need maximum performance then I suggest to use
    pre-processing. Load the whole file with fp:read"*a", eliminate
    all trace code with string.gsub and then pass it to loadstring()."
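
The pre-processing idea quoted in tip 7 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code I actually used: it assumes every trace statement is a call to a function named TRACE that starts its own line, and the helper names strip_trace and load_without_trace are made up for the example.

```lua
-- Sketch only. Assumption: trace statements look like  TRACE(...)  and each
-- one starts its own line. TRACE, strip_trace and load_without_trace are
-- hypothetical names chosen for this example.

-- Remove every TRACE(...) call that starts a line. The newline is kept, so
-- line numbers in error messages stay the same as in the original file.
local function strip_trace(source)
  -- Prepend "\n" so that a TRACE call on the very first line matches too.
  -- %b() matches a balanced pair of parentheses and everything between.
  local stripped, count = ("\n" .. source):gsub("\n[ \t]*TRACE%s*%b()", "\n")
  return stripped:sub(2), count
end

-- Read a whole file, strip the trace calls and compile the result.
-- Returns the compiled chunk, or nil plus an error message.
local function load_without_trace(path)
  local fp, err = io.open(path, "r")
  if not fp then return nil, err end
  local source = fp:read("*a")
  fp:close()
  local compile = loadstring or load  -- loadstring in Lua 5.1 / LuaJIT
  local stripped = strip_trace(source)
  return compile(stripped, "@" .. path)
end

-- Small in-memory demonstration: the TRACE line disappears before compiling.
local demo = "local x = 1\nTRACE('x is', x)\nreturn x + 1\n"
local stripped, removed = strip_trace(demo)
local chunk = assert((loadstring or load)(stripped))
print(removed, chunk())  --> 1   2
```

The pattern is deliberately naive: a parenthesis inside a string argument, such as TRACE("("), would confuse %b(), and a call spread over several lines would shift the line numbers of the code after it.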

Conclusion

LuaJIT is truly fast, but real-world code may still require some exploration and tuning to get the full performance. Note that this was still a pure Lua script. I will probably experiment later to see whether the same performance can be obtained in an embedded context.
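
In an embedded context there is no luajit command line to pass -O and -jv to, so the same parameters would have to be applied from Lua before the script runs. A minimal sketch, which I have not benchmarked: it relies on jit.opt.start, which takes the same strings that follow -O, and it assumes the jit.v module is reachable through package.path. The helper name apply_tuning is made up for the example. The LJ_MAX_PHI change is a compile-time define and still requires rebuilding LuaJIT.

```lua
-- Sketch only: applies the two -O parameters from this note from inside Lua.
-- apply_tuning is a hypothetical helper name, not part of LuaJIT.
local unpack = unpack or table.unpack  -- Lua 5.1 / LuaJIT vs. Lua 5.2+

local function apply_tuning(options, verbose)
  -- Under a plain Lua interpreter there is no jit library: do nothing.
  if not (jit and jit.opt) then return false end
  -- jit.opt.start takes the same strings that follow -O on the command line.
  jit.opt.start(unpack(options))
  if verbose then
    -- Assumption: jit.v is on package.path and exposes start(), which is
    -- what -jv relies on. pcall keeps a missing module from being fatal.
    local ok, v = pcall(require, "jit.v")
    if ok and type(v) == "table" and v.start then v.start() end
  end
  return true
end

-- The values used in this note; pass true as second argument for -jv output.
local applied = apply_tuning({ "maxmcode=4096", "maxsnap=128" }, false)
```

Calling this once at startup, before loading the real script, should be equivalent to passing -Omaxmcode=4096 -Omaxsnap=128 on the command line.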