Efficiently submitting work to GPUs isn’t as straightforward as one might think. It is not unusual for a CPU renderer to go through each graphic primitive (a blue filled circle, a purple stroked path, and image, etc.) in z-order to produce the final rendered image. While this isn’t the most efficient way, greater care needs to be taken in optimizing the inner loop of the algorithm that renders each individual object than in optimizing the overhead of alternating between various types of primitives. GPUs however, work quite differently, and the cost of submitting small workloads is often higher than the time spent executing them.
I won’t go into the details of why GPUs work this way here, but the big takeaway is that it is best to not think of a GPU API draw call as a way to draw one thing, but rather as a way to submit as many items of the same type as possible. If we implement a shader to draw images, we get much better performance out of drawing many images in a single draw call than submitting a draw call for each image. I’ll call a “batch” any group of items that is rendered with a single drawing command.
So the solution is simply to render all images in a draw call, and then all of the text, then all gradients, right? Well, it’s a tad more complicated because the submission order affects the result. We don’t want a gradient to overwrite text that is supposed to be rendered on top of it, so we have to maintain some guarantees about the order of the submissions for overlapping items.
In the 29th newsletter intro I talked about culling and the way we used to split the screen into tiles to accelerate discarding hidden primitives. This tiling system was also good at simplifying the problem of batching. In order to batch two primitives together we need to make sure that there is no primitive of a different type in between. Comparing all primitives on screen against every other primitive would be too expensive but the tiling scheme reduced this complexity a lot (we then only needed to compare primitives assigned to the same tile).
In the culling episode I also wrote that we removed the screen space tiling in favor of using the depth buffer for culling. This might sound like a regression for the performance of the batching code, but the depth buffer also introduced a very nice property: opaque elements can be drawn in any order without affecting correctness! This is because we store the z-index of each pixel in the depth buffer, so if some text is hidden by an opaque image we can still render the image before the text and the GPU will be configured to automatically discard the pixels of the text that are covered by the image.
In WebRender this means we were able to separate primitives in two groups: the opaque ones, and the ones that need to perform some blending. Batching opaque items is trivial since we are free to just put all opaque items of the same type in their own batch regardless of their painting order. For blended primitives we still need to check for overlaps but we have less primitives to consider. Currently WebRender simply iterates over the last 10 blended primitives to see if there is a suitable batch with no other type of primitive overlapping in between and defaults to starting a new batch. We could go for a more elaborate strategy but this has turned out to work well so far since we put a lot more effort into moving as many primitives as possible into the opaque passes.
In another episode I’ll describe how we pushed this one step further and made it possible to segment primitives into the opaque and non-opaque parts and further reduce the amount of blended pixels.