When this function is first called, we parse it and execute it. On the next calls, we keep executing it slowly a few times. At some point, we get one extra-slow iteration, which is where the JIT magic happens. After that, all subsequent executions are faster.
If the JIT were eager, we would parse and then compile right away, such that all executions are fast. As you might see, this is not always the fastest option: it depends on whether you prefer throughput over responsiveness, and on the number of times the function is expected to run.
Now we know why Just-In-Time compilers choose to compile on demand instead of compiling ahead of time. It is time for us to look at what Just-In-Time compilers are actually doing.
Before jumping into more detailed explanations for the next 16 minutes, we have to understand control flow graphs. The following blue block is a sequence of code. We represent branches with edges between these blocks. For an if statement, the code prior to the if and the condition go in the first block, which can branch into the then-part or the else-part, both of which join back after the if statement. If the else branch breaks the control flow, by returning, then there is no edge back.
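As a rough sketch, the if/else case above can be written as a tiny graph of blocks and successor edges. The block names and dictionary layout here are made up purely for illustration:

```python
# A minimal sketch (hypothetical names) of a control flow graph for:
#   ...before if...
#   if cond: then_part() ...after if...
#   else: return
blocks = {
    "entry": {"code": ["...before if...", "cond"], "succ": ["then", "else"]},
    "then":  {"code": ["then_part"],               "succ": ["join"]},
    "else":  {"code": ["return"],                  "succ": []},  # no edge back
    "join":  {"code": ["...after if..."],          "succ": []},
}

# Only the then-part joins back after the if statement.
assert "join" in blocks["then"]["succ"]
# The return in the else branch breaks the control flow: no outgoing edge.
assert blocks["else"]["succ"] == []
```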
A JIT is all about specializing some code based on its inputs: removing what is useless and optimizing what remains. To know what is removed, let's first have a look at what is executed.
From this, a simple method JIT consists of removing the cost induced by the interpreter. Instead of having an interpreter dispatch every opcode, the JIT replaces each opcode of our bytecode with the corresponding assembly code. For each opcode, we copy the interpreter path used to interpret it, and reconstruct the control flow of the bytecode.
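To make the dispatch cost concrete, here is a toy bytecode interpreter. The `while` loop and the test on `op` run for every single opcode; a simple method JIT pastes the body of each arm, in bytecode order, directly into generated code, so the loop and the dispatch disappear. The opcode names are invented for this sketch:

```python
# Toy interpreter: the dispatch loop is the overhead a simple method JIT
# removes by stitching the per-opcode bodies together in bytecode order.
def interpret(bytecode, stack):
    pc = 0
    while pc < len(bytecode):        # dispatch loop: paid on every opcode
        op, arg = bytecode[pc]
        if op == "push":             # a method JIT copies this body...
            stack.append(arg)
        elif op == "add":            # ...and this one, without the dispatch
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        pc += 1
    return stack[-1]

program = [("push", 1), ("push", 2), ("add", None)]
print(interpret(program, []))  # → 3
```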
This small modification has a huge impact! Not only do we remove the cost of branching to execute each opcode, but we also specialize the code extracted from the interpreter with the opcode parameters.
Anyhow, as you might guess, most of the time a plus operator is either summing numbers, or concatenating strings, but rarely both. Ideally we should only cherry-pick the pieces of code which are useful and discard the rest, but this is premature for a Simple JIT.
There is a technique called Inline Caches, or Trace Compilation. This technique is about generating code on demand: instead of compiling an entire function, which can be time consuming, we compile only the branches as they are requested. Thus, if the first time we visit this code we detect a double addition, we execute, compile and simplify the code corresponding to a double addition. If we later detect a string concatenation, we execute, compile and simplify the branches dedicated to string concatenation. However, we still need a fallback path to handle the cases which are not yet generated.
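The mechanism can be sketched as a call site that grows a list of specialized stubs as cases are observed, with a fallback that handles anything not yet compiled. The class and method names here are hypothetical, and Python lambdas stand in for generated machine code:

```python
# Sketch of an inline cache for the `+` operator: stubs are generated only
# for the cases actually observed; everything else goes to the fallback.
class AddSite:
    def __init__(self):
        self.stubs = []   # (guard, specialized_code) pairs, built on demand

    def run(self, a, b):
        for guard, code in self.stubs:     # fast paths compiled so far
            if guard(a, b):
                return code(a, b)
        return self.fallback(a, b)         # slow path: may compile a stub

    def fallback(self, a, b):
        if isinstance(a, float) and isinstance(b, float):
            self.stubs.append((lambda x, y: isinstance(x, float) and isinstance(y, float),
                               lambda x, y: x + y))   # double addition stub
        elif isinstance(a, str) and isinstance(b, str):
            self.stubs.append((lambda x, y: isinstance(x, str) and isinstance(y, str),
                               lambda x, y: x + y))   # concatenation stub
        return a + b                                   # generic semantics

site = AddSite()
print(site.run(1.5, 2.5))   # first visit: fallback, compiles the double stub
print(site.run("a", "b"))   # new case: compiles the concatenation stub
print(len(site.stubs))      # → 2
```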
But how does removing unused branches improve execution speed? When a CPU sees code, it tries to predict which branch is going to be executed. However, branches are not equal in assembly code: one falls through, while the other discards part of the instruction pipeline. By moving the unlikely code away, we help the CPU make the right choice, and we load fewer instructions since branches are no longer interleaved.
Inline Caches are great for extracting the existing code, but they are not restricted by it. When dealing with pure operations, one possibility is to use Inline Caches as a caching mechanism. When we have a complex function such as a property lookup, which is mostly pure, we can generate much simpler code. By specializing a property lookup for each object type, we can generate code which checks the object type and right away returns the location of the property.
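A rough sketch of this caching use, assuming a hypothetical layout where each object carries a "shape" (the ordered list of its property names) and a flat slot array. The cache maps a shape to a slot index, so repeated lookups skip the full search:

```python
# Sketch of a property-lookup inline cache: keyed on the object's shape,
# the cache records the slot offset so the next lookup with the same
# shape is a type check followed by a direct slot read.
class GetPropSite:
    def __init__(self, name):
        self.name = name
        self.cache = {}   # shape -> slot index

    def run(self, obj):
        shape = tuple(obj["layout"])          # stands in for a hidden class
        if shape in self.cache:               # hit: check shape, read slot
            return obj["slots"][self.cache[shape]]
        index = obj["layout"].index(self.name)   # slow, mostly pure lookup
        self.cache[shape] = index             # specialize for this shape
        return obj["slots"][index]

point = {"layout": ["x", "y"], "slots": [3, 4]}
get_y = GetPropSite("y")
print(get_y.run(point))   # slow path: performs the lookup, caches slot 1
print(get_y.run(point))   # fast path: shape check, then direct slot read
```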
Good news! We have reached what simple JIT compilers use as a compilation target: removing the dispatching cost of the interpreter, and using Inline Caches for code which has a lot of variations. This mode of compilation is what SpiderMonkey has used since 2013, and what the V8 team recently announced under the codename Sparkplug.
So far JITs are just warming up, but they are much more powerful than that. The optimizing JIT learns from what the simple JIT is doing. The simple JIT will generate Inline Caches each time a code path is necessary. The Optimizing JIT piggybacks on this acquired knowledge to specialize the code accordingly, based on the hypothesis that this is unlikely to change.
Inline Caches also provide valuable information when they hold a single entry. In that case, we can take the code compiled by the Inline Cache and reuse it in our compilation graph.
For example, knowing that the previous instruction is an addition and not a concatenation removes the need to check whether we have a number as operand. While this one sounds trivial, there are many other cases offering many other opportunities.
By specializing, we can do more optimizations such as constant propagation, folding consecutive operations, replacing object fields with stack variables, and computing the range of math operations to use integer math when possible.
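As one small example of folding consecutive operations, here is a sketch of a constant-folding pass over a toy bytecode (the opcode names are invented for illustration): two pushes of known constants followed by an addition collapse into a single push computed at compile time.

```python
# Sketch of constant folding: `push a; push b; add` becomes `push (a+b)`,
# so the addition is computed once at compile time instead of at run time.
def fold_constants(bytecode):
    out = []
    for op, arg in bytecode:
        if op == "add" and len(out) >= 2 and out[-1][0] == out[-2][0] == "push":
            _, b = out.pop()
            _, a = out.pop()
            out.append(("push", a + b))   # folded at compile time
        else:
            out.append((op, arg))
    return out

print(fold_constants([("push", 1), ("push", 2), ("add", None)]))
# → [('push', 3)]
```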
Another optimization is Unreachable Code Elimination. While it sounds similar to Branch Pruning, it is actually complementary, as some code might be unreachable based on the parameters provided by the caller. So when inlining the code of one function into another, we might discover that a branch can never be reached in this context, while it has been used in different contexts. Unreachable Code Elimination follows other optimizations such as constant folding and Range Analysis, which help pre-compute whether a condition always evaluates to true or false.
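The interplay can be sketched as follows: once constant folding (or range analysis) has decided that a condition is always true, always false, or still unknown, the eliminator drops the branch that can never run in this context. The function and value encoding are hypothetical:

```python
# Sketch of unreachable code elimination: `cond_value` is the condition
# after constant folding — True, False, or None when still unknown at
# compile time. A known condition lets us drop the dead branch entirely.
def eliminate_unreachable(cond_value, then_code, else_code):
    if cond_value is True:
        return then_code      # the else branch can never be reached here
    if cond_value is False:
        return else_code      # the then branch can never be reached here
    return [("branch", then_code, else_code)]   # unknown: keep both

# After inlining, suppose folding proved the condition is always False in
# this caller (while other callers may still take the other branch):
print(eliminate_unreachable(False, [("call", "slow_path")], [("ret", 0)]))
# → [('ret', 0)]
```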
The optimizing JIT performs profile-guided optimization, but it is much more capable than a static compiler at this task. When a static compiler performs profile-guided optimization, it does not have the ability to remove anything; it just moves the code far away to hint to the CPU that the code should not be loaded into its instruction cache.
An optimizing JIT keeps the condition and simply removes branches. The reason we can do that is that we have the option to resume the execution in the simple JIT or in the interpreter. Resuming in a lower tier is called a deoptimization, and this is the nemesis of optimizing JITs.
A deoptimization is quite complex, as the frame of the optimizing JIT is removed from the execution stack and replaced by what the frame of the simple JIT would look like. The locations of the variables of the optimizing JIT have to be mapped to the locations of the variables of the simple JIT. This On-Stack Replacement is quite costly, not so much because of the replacement itself, but because we have to wait in the simple JIT until the optimizing JIT is confident enough to re-compile.
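A very rough sketch of the guard-and-bail-out shape, with Python exceptions standing in for the stack surgery (all names are invented; a real engine rewrites the stack frame in place rather than unwinding):

```python
# Sketch of a guard and deoptimization: the optimized code assumes float
# operands; on guard failure, the optimized frame's variables are mapped
# into a lower-tier frame, which resumes the execution generically.
class Deoptimize(Exception):
    def __init__(self, frame):
        self.frame = frame     # state needed to rebuild the lower-tier frame

def optimized_sum(a, b):
    if not (isinstance(a, float) and isinstance(b, float)):
        raise Deoptimize({"a": a, "b": b})   # bail out to the lower tier
    return a + b                              # specialized fast path

def generic_sum(frame):
    return frame["a"] + frame["b"]            # interpreter-like semantics

def call(a, b):
    try:
        return optimized_sum(a, b)
    except Deoptimize as deopt:               # on-stack replacement, roughly
        return generic_sum(deopt.frame)

print(call(1.0, 2.0))   # → 3.0 (stays in the optimized code)
print(call("a", "b"))   # → ab  (deoptimizes, resumes in the generic path)
```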
So far, I have only mentioned how JITs cut branches while keeping the conditions. JITs do not have time to prove that all properties hold within the program. Moreover, code could be added dynamically, invalidating any proof. Still, there is a trick which consists of moving the conditions into the other functions which produce the value being checked.
To understand how this works, we have to look at the data and where it comes from, such as a property read. The optimizing JIT moves the condition out of the generated code, to be executed in all the sources of the data, such as the property setter. Thus, when the value is used in the optimized code, we know the condition holds.
However, if the condition fails while setting the property, then the generated code is no longer valid, and if the invalid property value were to flow into the compiled code, it could misbehave. The generated code has to be invalidated and discarded. A fuse is triggered and converts all edges into deoptimizations.
Invalidating code forces us to deoptimize, resume the execution in the simple JIT, and wait for another compilation to get back to our fastest running speed.
Ok, let’s summarize …
JITs are all about making trade-offs between compilation speed and the speed of the generated code. While there is a lot of logic in JIT compiler optimizations, their goal is to reduce the number of instructions to be executed in the most frequently executed functions.
Removing branches is the most effective way to enable more optimization in a JIT, either by removing obvious overhead, by recording paths as they are needed, or by filtering branches when they are unused or unreachable.