Improving texture atlas allocation in WebRender

This is going to be a rather technical dive into a recent improvement that went into WebRender.

Texture atlas allocation

In order to submit work to the GPU efficiently, WebRender groups as many drawing primitives as it can into what we call batches. A batch is submitted to the GPU as a single drawing command and has a few constraints. for example a batch can only reference a fixed set of resources (such as GPU buffers and textures). So in order to group as many drawing primitives as possible in a single batch we need to place as many drawing parameters as possible in few resources. When rendering text, WebRender pre-renders the glyphs before compositing them on the screen so this means packing as many pre-rendered glyphs as possible into a single texture, and the same applies for rendering images and various other things.

For a moment let’s simplify the case of images and text and assume that it is the same problem: input images (rectangles) of various rectangular sizes that we need to pack into a larger textures. This is the job of the texture atlas allocator. Another common name for this is rectangle bin packing.

Many in game and web development are used to packing many images into fewer assets. In most cases this can be achieved at build time Which means that the texture atlas allocator isn’t constrained by allocation performance and only needs to find a good layout for a fixed set of rectangles without supporting dynamic allocation/deallocation within the atlas at run time. I call this “static” atlas allocation as opposed to “dynamic” atlas allocation.

There’s a lot more literature out there about static than dynamic atlas allocation. I recommend reading A thousand ways to pack the bin which is a very good survey of various static packing algorithms. Dynamic atlas allocation is unfortunately more difficult to implement while keeping good run-time performance. WebRender needs to maintain texture atlases into which items are added and removed over time. In other words we don’t have a way around needing dynamic atlas allocation.

A while back

A while back, WebRender used a simple implementation of the guillotine algorithm (explained in A thousand ways to pack the bin). This algorithm strikes a good compromise between packing quality and implementation complexity.
The main idea behind it can be explained simply: “Maintain a list of free rectangles, find one that can hold your allocation, split the requested allocation size out of it, creating up to two additional rectangles that are added back to the free list.”. There is subtlety in which free rectangle to choose and how to split it, but the overall, the algorithm is built upon reassuringly understandable concepts.

Deallocation could simply consist of adding the deallocated rectangle back to the free list, but without some way to merge back neighbor free rectangles, the atlas would quickly get into a fragmented stated with a lot of small free rectangles and can’t allocate larger ones anymore.

Lots of free space, but too fragmented to host large-ish allocations.

To address that, WebRender’s implementation would regularly do a O(n²) complexity search to find and merge neighbor free rectangles, which was very slow when dealing with thousands of items. Eventually we stopped using the guillotine allocator in systems that needed support for deallocation, replacing it with a very simple slab allocator which I’ll get back to further down this post.

Moving to a worse allocator because of the run-time defragmentation issue was rubbing me the wrong way, so as a side project I wrote a guillotine allocator that tracks rectangle splits in a tree in order to find and merge neighbor free rectangle in constant instead of quadratic time. I published it in the guillotiere crate. I wrote about how it works in details in the documentation so I won’t go over it here. I’m quite happy about how it turned out, although I haven’t pushed to use it in WebRender, mostly because I wanted to first see evidence that this type of change was needed and I already had evidence for many other things that needed to be worked on.

The slab allocator

What replaced WebRender’s guillotine allocator in the texture cache was a very simple one based on fixed power-of-two square slabs, with a few special-cased rectangular slab sizes for tall and narrow items to avoid wasting too much space. The texture is split into 512 by 512 regions, each region is split into a grid of slabs with a fixed slab size per region.

The slab allocator in action. This is a debugging view generated from a real browsing session.

This is a very simple scheme with very fast allocation and deallocation, however it tends to waste a lot of texture memory. For example allocating an 8×10 pixels glyph occupies a 16×16 slot, wasting more than twice the requested space. Ouch!
In addition, since regions can allocate a single slab size, space can be wasted by having a region with few allocations because the slab size happens to be uncommon.

Improvements to the slab allocator

Images and glyphs used to be cached in the same textures. However we render images and glyphs with different shaders, so currently they can never be used in the same rendering batches. I changed image and glyphs to be cached into a separate set of textures which provided with a few opportunities.

Not mixing images and glyphs means the glyph textures get more room for glyphs which reduces the number of textures containing glyphs overall. In other words, less chances to break batches. The same naturally applies to images. This is of course at the expense of allocating more textures on average, but it is a good trade-off for us and we are about to compensate the memory increase by using tighter packing.

In addition, glyphs and images are different types of workloads: we usually have a few hundred images of all sizes in the cache, while we have thousands of glyphs most of which have similar small sizes. Separating them allows us to introduce some simple workload-specific optimizations.

The first optimization came from noticing that glyphs are almost never larger than 128px. Having more and smaller regions, reduces the amount of atlas space that is wasted by partially empty regions, and allows us to hold more slab sizes at a given time so I reduced the region size from 512×512 to 128×128 in the glyph atlases. In the unlikely event that a glyph is larger than 128×128, it will go into the image atlas.

Next, I recorded the allocations and deallocations browsing different pages, gathered some statistics about most common glyph sizes and noticed that on a low-dpi screen, a quarter of the glyphs would land in a 16×16 slab but would have fit in a 8×16 slab. In latin scripts at least, glyphs are usually taller than wide. Adding 8×16 and 16×32 slab sizes that take advantage of this helps a lot.
I could have further optimized specific slab sizes by looking at the data I had collected, but the more slab sizes I would add, the higher the risk of regressing different workloads. This problem is called over-fitting. I don’t know enough about the many non-latin scripts used around the world to trust that my testing workloads were representative enough, so I decided that I should stick to safe bets (such as “glyphs are usually small”) and avoid piling up optimizations that might penalize some languages. Adding two slab sizes was fine (and worth it!) but I wouldn’t add ten more of them.

The original slab allocator needed two textures to store a workload that the improved allocator can fit into a single one.

At this point, I had nice improvements to glyph allocation using the slab allocator, but I had a clear picture of the ceiling I would hit from the fixed slab allocation approach.

Shelf packing allocators

I already had guillotiere in my toolbelt, in addition to which I experimented with two algorithms derived from the shelf packing allocation strategy, both of them released in the Rust crate etagere. The general idea behind shelf packing is to separate the 2-dimensional allocation problem into a 1D vertical allocator for the shelves and within each shelf, 1D horizontal allocation for the items.

The atlas is initialized with no shelf. When allocating an item, we first find the shelf that is the best fit for the item vertically, if there is none or the best fit wastes too much vertical space, we add a shelf. Once we have found or added a suitable shelf, an horizontal slice of it is used to host the allocation.

At a glance we can see that this scheme is likely to provide much better packing than the slab allocator. For one, items are tightly packed horizontally within the shelves. That alone saves a lot of space compared to the power-of-two slab widths. A bit of waste happens vertically, between an item and the top of its shelf. How much the shelf allocator wastes vertically depends on how the shelve heights are chosen. Since we aren’t constrained to power-of-two size, we can also do much better than the slab allocator vertically.

The bucketed shelf allocator

The first shelf allocator I implemented was inspired from Mapbox’s shelf-pack allocator written in JavaScript. It has an interesting bucketing strategy: items are accumulated into fixed size “buckets” that behave like a small bump allocators. Shelves are divided into a certain number of buckets and buckets are only freed when all elements are freed. The trade-off here is to keep atlas space occupied for longer in order to reduce the CPU cost of allocating and deallocating. Only the top-most shelf is removed when empty so consecutive empty shelves in the middle aren’t merged until they become the top-most shelves, which can cause a bit of vertical fragmentation for long running sessions. When the atlas is full of (potentially empty) shelves the chance that a new item is too tall to fit into one of the existing shelves depends on how common the item height is. Glyphs tend to be of similar (small) heights so it works out well enough.

I added very limited support for merging neighbor empty shelves. When an allocation fails, the atlas iterates over the shelves and checks if there is a sequence of empty shelves that in total would be able to fit the requested allocation. If so, the first shelf of the sequence becomes the size of the sum, and the other shelves are squashed to zero height. It sounds like a band aid (it is) but the code is simple and it is working within the constraints that make the rest of the allocator very simple and fast. It’s only a limited form of support for merging empty shelves but it was an improvement for workloads that contain both small and large items.

Image generated from the glyph cache in a real borwsing session via a debugging tool. We see fewer/wider boxes rather than many thin boxes because the allocator internally doesn’t keep track of each item rectangle individually, so we only see buckets filling up instead.

This allocator worked quite well for the glyph texture (unsurprisingly, as Mapbox’s implementation it was inspired from is used with their glyph cache). The bucketing strategy was problematic, however, with large images. The relative cost of keeping allocated space longer was higher with larger items. Especially with long running sessions, this allocator was good candidate for the glyph cache but not for the image cache.

The simple shelf allocator

The guillotine allocator was working rather well with images. I was close to just using it for the image cache and move on. However, having spent a lot of time looking at various allocations patterns, my intuition was that we could do better. This is largely thanks to being able to visualize the algorithms via our integrated debugging tool that could generate nice SVG visualizations.

It motivated experimenting with a second shelf allocator. This one is conceptually even simpler: A basic vertical 1D allocator for shelves with a basic horizontal 1D allocator per shelf. Since all items are managed individually, they are deallocated eagerly which is the main advantage over the bucketed implementation. It is also why it is slower than the bucketed allocator, especially when the number of items is high. This allocator also has full support for merging/splitting empty shelves wherever they are in the atlas.

This was generated from the same glyph cache wokrload as the previous image.

Unlike the Bucketed allocator, this one worked very well for the image workloads. For short sessions (visiting only a handful of web pages) it was not packing as tightly as the guillotine allocator, but after browsing for longer period of time, it had a tendency to better deal with fragmentation.

The simple shelf allocator used on the image cache. Notice how different the image workloads look (using the same texture size), with much fewer items and a mix of large and small items sizes.

The implementation is very simple, scanning shelves linearly and then within the selected shelf another linear scan to find a spot for the allocation. I expected performance to scale somewhat poorly with high number of glyphs (we are dealing in the thousands of glyphs which arguably isn’t that high), but the performance hit wasn’t as bad I had anticipated, probably helped by mostly cache friendly underlying data-structure.

A few other experiments

For both allocators I implemented the ability to split the atlas into a fixed number of columns. Adding columns means more (smaller) shelves in the atlas, which further reduces vertical fragmentation issues, at the cost of wasting some space at the end of the shelves. Good results were obtained on 2048×2048 atlases with two columns. You can see in the previous two images that the shelf allocator was configured to use two columns.

The shelf allocators support arranging items in vertical shelves instead of horizontal ones. It can have an impact depending on the type of workload, for example if there is more variation in width than height for the requested allocations. As far as my testing went, it did not make a significant difference with workloads recorded in Firefox so I kept the default horizontal shelves.

The allocators also support enforcing specific alignments in x and y (effectively, rounding up the size of allocated items to a multiple of the x and y alignment). This introduces a bit of wasted space but avoids leaving tiny holes in some cases. Some platforms also require a certain alignment for various texture transfer operations so it is useful to have this knob to tweak at our disposal. In the Firefox integration, we use different alignments for each type of atlas, favoring small alignments for atlases that mostly contain small items to keep the relative wasted space small.

Conclusion

Various visualizations generated while I was working on this. It’s been really fun to be able “look” at the algorithms at each step of the process.

The guillotine allocator is the best at keeping track of all available space and can provide the best packing of the algorithms I tried. The shelf allocators waste a bit of space by simplifying the arrangement into shelves, and the slab allocator wastes a lot of space for the sake of simplicity. On the other hand the guillotine allocator is the slowest when dealing with multiple thousands of items and can suffer from fragmentations in some of the workloads I recorded. Overall the best compromise was the simple shelf allocator which I ended up integrating in Firefox for both glyph and image textures in the cache (in both cases configured to have two columns per texture). The bucketed allocator is still a very reasonable option for glyphs and we could switch to it in the future if we feel we should trade some packing efficiency for allocation/deallocation performance. In other parts of WebRender, for short lived atlases (a single frame), the guillotine allocation algorithm is used.

These observations are mostly workload-dependent, though. Workloads are rarely completely random so results may vary.

There are other algorithms I could have explored (and maybe will someday, who knows), but I had found a satisfying compromise between simplicity, packing efficiency, and performance. I wasn’t aiming for state of the art packing efficiency. Simplicity was a very important parameter and whatever solutions I came up with would have to be simple enough to ship it in a web browser without risks.

To recap, my goals were to:

  • allow packing more texture cache items into fewer textures,
  • reduce the amount of texture allocation/deallocation churn,
  • avoid increasing GPU memory usage, and if possible reduce it.

This was achieved by improving atlas packing to the point that we more rarely have to allocate multiple textures for each item type . The results look pretty good so far. Before the changes in Firefox, glyphs would often be spread over a number of textures after having visited a couple of websites, Currently the cache eviction is set so that we rarely need more than than one or two textures with the new allocator and I am planning to crank it up so we only use a single texture. For images, the shelf allocator is pretty big win as well. what used to fit into five textures now fits into two or three. Today this translates into fewer draw calls and less CPU-to-GPU transfers which has a noticeable impact on performance on low end Intel GPUs, in addition to reducing GPU memory usage.

The slab allocator improvements landed in bug 1674443 and shipped in Firefox 85, while the shelf allocator integration work went in bug 1679751 and will make it hit the release channel in Firefox 86. The interesting parts of this work are packaged in a couple of rust crates under permissive MIT OR Apache-2.0 license:

One thought on “Improving texture atlas allocation in WebRender

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s