I’m excited to announce Off-Main-Thread Painting (OMTP), our new Firefox graphics performance effort! It ships soon in Firefox 58, our next release, directly on the heels of Advanced Layers, the new compositor in Firefox 57.
To understand OMTP, and why it’s a big deal for us, it helps to understand how Firefox renders a webpage down to pixels on your screen. There are four main steps involved:
- Making a Display List: Here we collect the visible elements on the page and create high-level primitives to encapsulate rendering each one. These primitives are called “display items.”
- Assigning Layers: Here we try to group display items together into “layers”, based on how they are scrolled or animated. There are different types of layers. Display items will usually be grouped into “Painted” layers, which have a texture (or bitmap) that is updated when items are added, removed, or changed.
- Rasterization: This is where each display item is asked to render itself into its assigned layer. For example, a “table” item might issue a series of API calls to draw borders and lines.
- Compositing: Finally, the layers are composited into a single final image, which is then sent to the monitor. This step uses Direct3D or OpenGL when available.
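To make the four steps above a little more concrete, here is a toy sketch of the pipeline. It is only an illustration: the class names, the `scroll_group` key, and the string "textures" are all invented for this example and do not correspond to Firefox's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class DisplayItem:
    name: str
    scroll_group: str  # hypothetical key for "how this item is scrolled or animated"

@dataclass
class PaintedLayer:
    scroll_group: str
    items: list = field(default_factory=list)
    texture: list = field(default_factory=list)  # stand-in for a bitmap

def build_display_list(visible_elements):
    """Step 1: wrap each visible element in a display item."""
    return [DisplayItem(name, group) for name, group in visible_elements]

def assign_layers(display_list):
    """Step 2: group display items that scroll or animate together."""
    layers = {}
    for item in display_list:
        if item.scroll_group not in layers:
            layers[item.scroll_group] = PaintedLayer(item.scroll_group)
        layers[item.scroll_group].items.append(item)
    return list(layers.values())

def rasterize(layers):
    """Step 3: each item renders itself into its layer's texture."""
    for layer in layers:
        layer.texture = [f"draw {item.name}" for item in layer.items]

def composite(layers):
    """Step 4: flatten all layer textures into one final image."""
    return [op for layer in layers for op in layer.texture]

elements = [("background", "root"), ("table", "content"),
            ("text", "content"), ("toolbar", "fixed")]
layers = assign_layers(build_display_list(elements))
rasterize(layers)
frame = composite(layers)
print(len(layers), frame)
```

Here "table" and "text" land in the same layer because they share a scroll group, so a change to either one only requires re-rasterizing that layer.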
These steps occur across two threads, like so:
The Compositing step already happens off the main thread, but the other major steps do not. And while rasterization is not always expensive, it can be, and it is very much affected by resolution. Rasterizing on a 4K monitor requires computing roughly 10 times as many pixels as, say, a 1024×768 screen.
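The "roughly 10 times" figure is easy to check. The snippet below assumes "4K" means 3840×2160 (UHD), which is a guess at the intended resolution rather than something stated above.

```python
# Assumption: "4K" is taken to mean 3840x2160 (UHD).
UHD_4K = (3840, 2160)
XGA = (1024, 768)

def pixel_count(resolution):
    width, height = resolution
    return width * height

ratio = pixel_count(UHD_4K) / pixel_count(XGA)
print(f"{pixel_count(UHD_4K):,} vs {pixel_count(XGA):,} pixels, ratio {ratio:.1f}x")
# prints "8,294,400 vs 786,432 pixels, ratio 10.5x"
```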
Off-Main-Thread Painting is our answer to rasterization costs. As the name suggests, we simply do it on another thread! It turned out to be surprisingly easy, with an asterisk.
Normally, our display items render through an API we call Moz2D. Moz2D was already designed to support multiple backends – Skia, Cairo, Direct2D, et cetera. We added an additional “Capture” backend, where instead of immediately issuing commands, we can record them in a list. That list then gets replayed on a painting thread. Voilà! Rasterization is now asynchronous.
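The record-and-replay idea can be sketched in a few lines. This is not Moz2D's interface: `RasterTarget`, `CaptureTarget`, `fill_rect`, and `stroke_line` are hypothetical names chosen for the example, and the "real" backend here just logs the calls it receives.

```python
import queue
import threading

class RasterTarget:
    """Stand-in for a real backend: logs each call it executes."""
    def __init__(self):
        self.executed = []

    def fill_rect(self, x, y, w, h, color):
        self.executed.append(("fill_rect", x, y, w, h, color))

    def stroke_line(self, x0, y0, x1, y1, color):
        self.executed.append(("stroke_line", x0, y0, x1, y1, color))

class CaptureTarget:
    """Same drawing interface, but records commands instead of running them."""
    def __init__(self):
        self.commands = []

    def fill_rect(self, *args):
        self.commands.append(("fill_rect", args))

    def stroke_line(self, *args):
        self.commands.append(("stroke_line", args))

    def replay(self, target):
        # Re-issue every recorded command, in order, against a real target.
        for name, args in self.commands:
            getattr(target, name)(*args)

def paint_thread_main(work_queue, target):
    # Replay recordings until the None sentinel arrives.
    while True:
        recording = work_queue.get()
        if recording is None:
            break
        recording.replay(target)

work_queue = queue.Queue()
real_target = RasterTarget()
painter = threading.Thread(target=paint_thread_main, args=(work_queue, real_target))
painter.start()

# "Main thread": drawing is cheap because nothing is rasterized yet.
capture = CaptureTarget()
capture.fill_rect(0, 0, 100, 50, "white")
capture.stroke_line(0, 0, 100, 0, "black")
work_queue.put(capture)

work_queue.put(None)  # tell the paint thread to exit
painter.join()
print(real_target.executed)
```

The display items do not need to know which kind of target they are drawing into, which is why adding one more backend was enough to make rasterization asynchronous.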
The new diagram looks like this:
What happens if painting takes multiple frames to complete? Say the paint thread is going to take 100ms to rasterize a very complex recording. Will the main thread keep piling up new frames and sending them to the paint thread? The answer is: no. Because Firefox double buffers, we currently cannot allow more than one frame of slack. If we begin rendering a new frame, we will wait for the previous frame to finish. Luckily, since we only render on vertical sync (every 16ms on a 60Hz display), this affords us a full 32ms (minus whatever time we spent preparing and recording the previous frame, of course) before we start delaying work on the main thread.
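One way to picture the "one frame of slack" rule is a main thread that refuses to queue a new recording until the previous one has finished rasterizing. The sketch below is an assumption about the shape of that logic, not Firefox's code: `FramePacer` is a made-up name, and `time.sleep` stands in for real rasterization work.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

class FramePacer:
    """Allows at most one recording to be outstanding on the paint thread."""

    def __init__(self):
        self._paint_thread = ThreadPoolExecutor(max_workers=1)
        self._previous = None
        self._lock = threading.Lock()
        self._outstanding = 0
        self.max_outstanding = 0
        self.completed = []

    def _rasterize(self, frame_id, cost_s):
        time.sleep(cost_s)  # pretend to rasterize
        with self._lock:
            self._outstanding -= 1
            self.completed.append(frame_id)

    def submit(self, frame_id, cost_s):
        # Block the main thread until the previous frame has finished.
        if self._previous is not None:
            self._previous.result()
        with self._lock:
            self._outstanding += 1
            self.max_outstanding = max(self.max_outstanding, self._outstanding)
        self._previous = self._paint_thread.submit(self._rasterize, frame_id, cost_s)

    def shutdown(self):
        self._paint_thread.shutdown(wait=True)

pacer = FramePacer()
for frame_id in range(5):
    # Display list building and recording for this frame would happen here,
    # overlapping with the previous frame's rasterization.
    pacer.submit(frame_id, 0.002)
pacer.shutdown()
print(pacer.completed, pacer.max_outstanding)
```

However slow a single rasterization is, the paint thread never has more than one recording queued, so the main thread stalls instead of piling up frames.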
To see why this is beneficial, imagine a series of frames before and after OMTP. If each frame exceeds the frame budget – even if rasterization was not the biggest component (like it is in the diagram below) – our composite will be delayed until the next vsync. In the diagram below, not only are we missing frames, but we’re spending a good deal of time doing nothing.
Now, imagine the same content being rendered with OMTP. The main thread is now recording commands and sending them to the paint thread. We can resume processing the next frame up until another rasterization needs to be queued. As long as neither thread exceeds its frame budget, we’ll always be able to composite on time. And even if we blow the frame budget, at least we’ll get a few more frames in than in the previous diagram.
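A back-of-the-envelope timeline model shows the same effect in numbers. Everything here is a simplifying assumption: frames start on a vsync tick, vsync is exactly 16ms, and the 10ms preparation and 10ms rasterization costs are hypothetical values picked so that the serial frame (20ms) exceeds the budget. They are not measurements.

```python
VSYNC_MS = 16

def next_vsync(t_ms):
    """First vsync tick at or after t_ms (integer milliseconds)."""
    return ((t_ms + VSYNC_MS - 1) // VSYNC_MS) * VSYNC_MS

def composite_times_serial(frames, prep_ms, raster_ms):
    """Main thread prepares and rasterizes each frame back to back."""
    times = []
    start = 0
    for _ in range(frames):
        done = start + prep_ms + raster_ms
        tick = next_vsync(done)
        times.append(tick)
        start = tick  # the next frame begins on that same vsync
    return times

def composite_times_omtp(frames, prep_ms, raster_ms):
    """Main thread only records; a paint thread rasterizes in parallel."""
    times = []
    main_start = 0
    paint_free = 0
    for _ in range(frames):
        recorded = main_start + prep_ms
        # Queue only once the previous rasterization has finished.
        raster_start = max(recorded, paint_free)
        paint_free = raster_start + raster_ms
        times.append(next_vsync(paint_free))
        # The main thread is free again as soon as the recording is queued.
        main_start = next_vsync(raster_start)
    return times

print(composite_times_serial(4, 10, 10))  # [32, 64, 96, 128]
print(composite_times_omtp(4, 10, 10))    # [32, 48, 64, 80]
```

In this toy model the serial version composites every other vsync, while the OMTP version composites on every vsync after the first frame, because each thread individually stays within the 16ms budget.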
When we started planning for future Graphics team work last year, we set out by instrumenting Firefox with Telemetry. We wanted to know how much painting was affecting frame time, and in addition, we wanted to know more about slow paints. When painting exceeded a certain threshold (set to 15ms), how was that time divided between different phases of the painting process?
We had a gut feeling that rasterization was less of a cost than expected. Partly because it’s incremental (we rarely have to re-rasterize an entire page), and partly because we use Microsoft’s high-performance rasterization library, Direct2D. And indeed, our gut feelings were confirmed: for most “slow” paints, the costs were in the preparatory steps. Rasterization was sometimes a large chunk, but usually, it was somewhere between 10-20% of the entire paint cycle. Immediately, this data kicked off another project: Retained Display Lists, which the layout team will be talking about soon.
Even though rasterization was usually fast, we had enough evidence that it did consume precious frame cycles, and that was motivation enough to embark on this project.
A nice side effect of having instrumented Firefox is that we were pretty quickly able to see the effects of OMTP. The two graphs below are unfortunately a bit difficult to read or condense, but they are straight from our public Telemetry dashboard. On the left is data from Firefox 57, and on the right is data from Firefox 58. The horizontal axis is how expensive rasterization was as a percentage of the total frame time. The vertical axis is how often that particular weighting occurred.
In Firefox 57, “cheap” rasterizations (those less than ~10% of the paint cycle) occur 51% of the time. In Firefox 58, they occur 80% of the time! That means in Firefox 58, rasterization will consume less of the frame budget on average. Similarly, in Firefox 57, rasterization is a significant slice – 50% of the paint cycle or more – 21% of the time. In Firefox 58, that scenario occurs only 4% of the time!
And indeed, we do see benefits! With Direct2D, FPS on our microbenchmark improved by 30%, and with Skia, by 25%. We expect Skia wins to be even greater in the future as we experiment with parallel painting.
Earlier we mentioned this was easy to implement. If that’s the case, why didn’t we just do it a long time ago? There are a few reasons, but in actuality it was a super long road to get here, and it was only made simple by years of precursor work. This project required Off Main Thread Compositing and significant work to simplify and reduce complexity in both Layers and Moz2D. Some of that work was not even motivated until Electrolysis took off. We even had an earlier OMTP project (spearheaded by Jerry Shih for FirefoxOS), but it found roadblocks in our IPC layer. We were only able to overcome those roadblocks with the knowledge learned from past efforts, combined with later refactorings.
There were also some thread-safety complications, of course. Our 2D API is “copy on write”. You can issue many draw calls to a Moz2D surface, and even create copies of the surface, but usually no actual computations are performed until the contents of a surface are read. So, a copy of a surface is just a pointer. When the original surface is about to be mutated, any outstanding copies are told to immediately duplicate the underlying pixels, so they reflect the image as it was when the “copy” was created.
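Here is a minimal single-threaded sketch of that copy-on-write behavior. `Surface` and `Snapshot` are hypothetical classes for illustration only and are much simpler than Moz2D's real surface types.

```python
class Snapshot:
    """A 'copy' of a surface that starts out as just a pointer to it."""

    def __init__(self, source):
        self._source = source
        self._own_pixels = None  # filled in only if the source is mutated

    @property
    def is_shallow(self):
        return self._own_pixels is None

    def _detach(self):
        # Called by the source right before it changes: take a deep copy.
        self._own_pixels = bytes(self._source.pixels)

    def read(self):
        if self._own_pixels is not None:
            return self._own_pixels
        return bytes(self._source.pixels)

class Surface:
    def __init__(self, width, height):
        self.pixels = bytearray(width * height)
        self._snapshots = []

    def snapshot(self):
        snap = Snapshot(self)  # no pixels are copied here
        self._snapshots.append(snap)
        return snap

    def fill(self, value):
        # Copy on write: outstanding snapshots duplicate the pixels first.
        for snap in self._snapshots:
            snap._detach()
        self._snapshots.clear()
        self.pixels[:] = bytes([value]) * len(self.pixels)

surface = Surface(2, 2)
surface.fill(1)
snap = surface.snapshot()
shallow_before = snap.is_shallow  # True: still just a pointer
surface.fill(9)                   # forces the snapshot to deep-copy
print(shallow_before, snap.is_shallow, snap.read(), bytes(surface.pixels))
```

The snapshot keeps showing the old pixels after the second `fill`, even though no copy was made at the moment the snapshot was taken.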
Why did this pose problems for OMTP? Well, it turns out we copy Moz2D surfaces a lot. Those copies can be sent from the main thread to the paint thread. If the main thread happens to mutate the original surface while the paint thread tries to read from a shallow copy, there will be a race. We definitely don’t want to deep-copy all of our temporary surfaces on the main thread, so instead, we added per-surface synchronization to Moz2D.
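The sketch below shows one possible shape for per-surface synchronization: every surface carries its own lock, and both mutation and snapshot reads take it. This is an assumption about the general approach, not a description of the Moz2D change, and `SyncedSurface` and `SyncedSnapshot` are made-up names.

```python
import threading

class SyncedSnapshot:
    def __init__(self, source):
        self._source = source
        self._own = None  # deep copy, made only when the source mutates

    def read(self):
        # Take the source surface's lock so a read never overlaps a write.
        with self._source._lock:
            if self._own is not None:
                return self._own
            return bytes(self._source._pixels)

class SyncedSurface:
    def __init__(self, size, value=0):
        self._lock = threading.Lock()  # one lock per surface, not a global one
        self._pixels = bytearray([value]) * size
        self._snapshots = []

    def snapshot(self):
        with self._lock:
            snap = SyncedSnapshot(self)
            self._snapshots.append(snap)
            return snap

    def fill(self, value):
        with self._lock:
            # Detach outstanding snapshots before touching any pixels.
            for snap in self._snapshots:
                snap._own = bytes(self._pixels)
            self._snapshots.clear()
            # Deliberately slow, pixel-by-pixel write to widen the race window.
            for i in range(len(self._pixels)):
                self._pixels[i] = value

surface = SyncedSurface(4096, value=1)
snap = surface.snapshot()
results = []

def paint_thread():
    for _ in range(50):
        results.append(snap.read())

reader = threading.Thread(target=paint_thread)
reader.start()
for value in range(2, 20):  # "main thread" keeps mutating the original
    surface.fill(value)
reader.join()
print(all(r == bytes([1]) * 4096 for r in results))
```

Without the lock, a read that lands in the middle of `fill` could see a mix of old and new pixels. With it, every read of the snapshot returns the image as it was when the snapshot was taken, and surfaces that are never shared across threads pay only for an uncontended lock.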
Finally, another issue we ran into was the Direct2D global lock. Rather than completely audit or overhaul how Direct2D is used on both threads, we decided to enable Direct2D thread safety. When this is enabled, Direct2D will hold a global lock during certain critical sections. For example, this lock is held during surface destruction/allocation, when surfaces are “flushed” to the GPU, and when surfaces are copied. A good deal of the work involved hitting these contention points for various reasons and addressing them, sometimes by moving more code off the main thread, and sometimes by fixing silly mistakes.
What’s left to do? We have a few follow-up projects in mind. Now that we have asynchronous painting, it makes sense to explore parallel painting as well. We already support “tiled” rendering on Mac, and now we can explore both asynchronous tiling and painting tiles and layers in parallel. We also want to explore how well this works on Windows, both with Skia and with Direct2D. Our “slow rasterization” benchmarks suggest that parallel painting will be a huge win for Skia.
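As a rough picture of what painting tiles in parallel could look like, the sketch below splits a framebuffer into tiles and hands each one to a worker pool. The tile size, function names, and the `color_at` callback are all invented for the example. Note that CPython's global interpreter lock means this particular code shows the structure only and will not actually run faster than a serial loop.

```python
from concurrent.futures import ThreadPoolExecutor

TILE = 4  # hypothetical tile edge length, in pixels

def tile_rects(width, height, tile=TILE):
    """Yield (x, y, w, h) rectangles that cover the surface without overlap."""
    for y in range(0, height, tile):
        for x in range(0, width, tile):
            yield (x, y, min(tile, width - x), min(tile, height - y))

def paint_tile(framebuffer, width, rect, color_at):
    """Rasterize one tile; tiles never touch each other's pixels."""
    x0, y0, w, h = rect
    for y in range(y0, y0 + h):
        for x in range(x0, x0 + w):
            framebuffer[y * width + x] = color_at(x, y)
    return rect

def paint_parallel(width, height, color_at, workers=4):
    framebuffer = bytearray(width * height)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(paint_tile, framebuffer, width, rect, color_at)
                   for rect in tile_rects(width, height)]
        for future in futures:
            future.result()  # re-raise any error from a worker
    return framebuffer

def checkerboard(x, y):
    return (x + y) % 2

framebuffer = paint_parallel(10, 6, checkerboard)
expected = bytearray(checkerboard(x, y) for y in range(6) for x in range(10))
print(framebuffer == expected, len(list(tile_rects(10, 6))))
```

Because each tile owns a disjoint region of the framebuffer, the workers need no locking among themselves, which is part of what makes tiling attractive for parallel rasterization.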
There are also just some missing features in OMTP right now. For example, we do not support rasterizing “mask” layers on the paint thread. We would like to move some of this functionality out of the renderer into Advanced Layers, where masking can be done in our new, much-more-intelligent batching compositor.
I’d like to thank Mason Chang, Ryan Hunt, Bas Schouten, Jerry Shih, Matt Woodrow, and our 2017 intern Dominic Farolino for contributing to these projects and getting them out the door!