It is. There's a terminology problem in play here. Throughout this article "2D" does not mean 2D; it means "arbitrary-complexity paths, e.g. Bézier curves". That's a small subset of the 2D that makes up a UI. It'd be like saying 3D exclusively means infinitely detailed tessellated shapes with path-traced rendering. That's definitely an area of 3D that exists, but of course it's not at all the entirety of 3D in practice in e.g. games or movies. Rather, it's more like the holy grail.
Same thing here with e.g. SVGs. GPU-accelerating SVGs is stupidly complicated because it's an inherently serial algorithm, and GPUs are poop at that. But how much of your 2D UI is made up of that? Text is in that same category, but how much else? Typically very little. Maybe a few icons, but that tends to be about it. Instead you have higher-level shapes, like rounded rectangles, and those you can do with a GPU quite easily. Similarly, images are usually just a textured quad; again, trivial for a GPU. You could describe them as paths if you had a fully generic, fully accelerated path-rendering system. But nobody has that, so nobody actually describes them like that.
So very nearly all 2D/UI systems are GPU accelerated. It'd be perhaps more accurate to call them hybrid renderers. Things like fonts are just CPU rendered because CPUs are better at it, but the GPU is doing all the fills, gradients, "simple" shapes, texturing, etc...
I respectfully disagree that CPUs are better than GPUs at font rendering :)
There are a few related things that can be said. Doing fast, high quality font rendering on a GPU is hard; it's much easier on a CPU. Further, the traditional rasterization pipeline of a GPU is not good at rendering fonts. Fortunately, modern GPUs also have compute shaders, which are programmed somewhat like regular computers but just with an astonishingly high number of threads.
This is the topic of my research, and I intend to publish quantitative measurements backing up these assertions before long. Early results look promising.
> which are programmed somewhat like regular computers but just with an astonishingly high number of threads.
But they aren't that; they are actually wide vector processors, which means groups of threads need to be doing the same thing for it to perform properly! Branches and divergent control flow kill GPU performance.
I'm sure you already know this, but I'm just pointing out for other folks reading. If GPUs were just CPUs with stupidly high core counts then things would be way easier, but it's more complicated than that.
But any Turing-complete operation can be mapped mechanistically into a branchless ISA, can’t it? One of those “one-instruction” ISAs, for example, where every instruction is also a jump. Vector processors would compute on those just fine, just like they compute matrix-multiplication problem isomorphisms just fine.
Or, for a more obvious/less arcane restatement: can't the shader cores just be given a shader that's an interpreter, and a texture that's a spritesheet of bytecode programs?
Yes, we can make GPU programs that render vector images this way, but they tend to be slower than an equivalent CPU program. Branches are not the problem, GPUs handle those just fine now actually. The problem is duplicated work. GPUs have cores that are individually much, much slower than a CPU, but make up for this by having lots and lots of them running in parallel. Having those cores all run the same serial interpreter does not give you increased parallelism, so the result is slower.
Designing algorithms for the GPU requires rethinking your dataflow and structure to exploit the parallel nature of the GPU. GPUs are not just a "go fast" button.
> Branches are not the problem, GPUs handle those just fine now actually
Worth noting that's only kinda true. If all threads take the same branch in a thread group, then it's mostly fine. But divergent branches are basically equivalent to all cores taking both branches and just masking off all the writes with whether or not the conditional was true. This can be incredibly slow depending on the complexity of the code being branched.
Also, not all GPUs can even optimize branches effectively; some of them just always take both branches and mask off the results.
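To make the masking concrete, here's a toy CPU-side sketch of SIMT predication (all names hypothetical, not any real GPU ISA): every lane computes both branch results, and the conditional only selects which one is written back.

```c
#include <assert.h>

/* Toy model of SIMT predication: a "warp" of 8 lanes has already
 * evaluated BOTH sides of an if/else for every lane; the condition
 * only masks which result gets written back. All names here are
 * illustrative, not a real GPU ISA. */
#define LANES 8

void warp_select(const int cond[LANES], const int then_val[LANES],
                 const int else_val[LANES], int out[LANES])
{
    for (int lane = 0; lane < LANES; lane++) {
        /* Work for both paths was spent regardless of cond[lane]. */
        out[lane] = cond[lane] ? then_val[lane] : else_val[lane];
    }
}
```

The cost is paid up front: if the two branch bodies take T1 and T2 cycles, a divergent group pays roughly T1 + T2, not max(T1, T2).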
Well, sure; but the problem of font rendering specifically is an "embarrassingly parallel" one, isn't it? If you've got 1000 glyphs at a specific visual size to pre-cache into alpha-mask textures; and you've got 1000 GPU shader cores to compute those glyphs on; then each shader core only needs to compute one glyph once.
Can a CPU really be so much faster than these cores that it can run this Turing-complete font rendering program (which, to be clear, is already an abstract machine run through an interpreter either way, whether implemented on the CPU or the GPU) consisting of O(N) interpreted instructions, O(N) times, for a total of O(N^2) serial CPU computation steps; in less than the time it takes the O(N) GPU cores to run only O(N) serial computation steps each? Especially on a modern low-power system (e.g. a cheap phone), where you might only have 2-4 slow CPU cores, but still have a bounty of (equally slow) GPU cores sitting there doing mostly nothing? If so, CPUs are pretty amazing.
But even if it were true that it'd be faster in some sense (time to first pixel, where the first rendered glyph becomes available?) to render on the CPU — accelerators don't just exist to make things faster, they also exist to offload problems so the CPU can focus on things that are its comparative advantage.
Analogies:
- An apprentice tradesperson doesn't have to be better at a delegated task than their mentor is; they only need to be good enough at the task to free up some time for the mentor to focus on getting something higher-priority done, that the mentor can do and the apprentice (currently) cannot. For example, the apprentices working for master oil painters did the backgrounds, so the master could focus on portrait details + anatomy. The master could have done the backgrounds faster! But then that time would be time not spent working on the foreground.
- Ethernet cards. CPUs are fast enough to "bit bang" even 10GbE down a wire just fine; but except in very specific situations (e.g. dedicated network switches where the CPU wants to process every packet synchronously as it comes in), it's better that they don't, leaving the (slower!) Ethernet MCU to parse Ethernet frames, discard L2-misdirected ones, and DMA the rest into kernel ring-buffer memory.
- Audio processors in old game consoles like the SNES's S-SMP and the C64's SID — yes, the CPU could do everything these could do, and faster; but if the CPU had to keep music samples playing in realtime, it wouldn't have much time to do things like gameplay (which usually goes together with playing music samples!)
Offloading font (or generalized implicit-shape) rendering to the GPU might not make sense if you're just computing letterforms for billboard textures in a static 3D scene (rather the opposite!) but in a game that wants to do things like physics and AI on the CPU, load times can likely be shorter with the GPU tasked with the font rendering, no? Especially since the rendered glyph-textures then don't have to be loaded into VRAM, because they're already there.
Having a queue of 1,000 independent work items to do doesn't mean something is "embarrassingly parallel". Operating systems are a classic example of something that's hard to parallelize, and they have 1,000 independent processes they need to schedule and manage. Heterogeneous tasks make parallelism hard!
Cores in GPUs do not operate independently; they have hierarchies of memory and command structure. They are good at sharing some things and terrible at sharing others.
Exploiting the parallelism of a GPU in the context of curve rasterization is still an active research problem (Raph Levien, who has posted elsewhere in this thread, is one of the people doing the research), and it's not easy.
I refrained from commenting on the specifics of how curves are rasterized, but if you want to imagine it, think about a letter, maybe a large "g"; think about the points that make up its outline; and then come up with an algorithm to decide whether a specific point is inside or outside that outline. What you'll quickly realize is that there's no local solution, only global ones. You have to test the intersection of all the curves to know whether a given pixel is inside or outside the outline, and that sort of problem is serial.
The work division you want (do a bit of work for each curve), is exactly backwards from the work division a normal GPU might give you (do a bit of work for each pixel), pushing you towards things like compute shaders.
I could go on, but this comment thread is already too deep.
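To make the "global, not local" point concrete, here's a minimal C sketch of the inside/outside test, simplified to straight edges (real outlines use Bezier segments and usually the non-zero winding rule, but the shape of the computation is the same). Note that the loop has to visit every edge of the outline before it can answer for a single pixel.

```c
#include <assert.h>

/* Even-odd (crossing-count) point-in-outline test over straight
 * edges. The loop must consult EVERY edge before answering for one
 * pixel -- the "global" nature described above. */
typedef struct { double x, y; } pt;

int inside_outline(const pt *poly, int n, double px, double py)
{
    int crossings = 0;
    for (int i = 0; i < n; i++) {
        pt a = poly[i], b = poly[(i + 1) % n];
        /* Does a rightward horizontal ray from (px,py) cross a->b? */
        if ((a.y > py) != (b.y > py)) {
            double xint = a.x + (py - a.y) * (b.x - a.x) / (b.y - a.y);
            if (xint > px) crossings++;
        }
    }
    return crossings & 1; /* odd crossing count = inside */
}
```

A fragment shader evaluating this would do the whole loop once per pixel, which is exactly the duplicated work discussed upthread.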
> The work division you want (do a bit of work for each curve), is exactly backwards from the work division a normal GPU might give you (do a bit of work for each pixel)
Doesn't this mean that you could:
1. entirely "offline", at typeface creation time:
1a. break glyphs into their component "convex curved region tiles" (where each region is either full, empty, or defined by a curve with zero inflection points)
1b. deduplicate those tiles (anneal glyph boundaries to minimize distinct tiles; take advantage of symmetries) to form a minimal set of such curve-tiles, and assign those sequence numbers, forming a "distinct curves table" for the typeface;
1c. restate each glyph as a grid of paint-by-numbers references (a "name table", to borrow the term from tile-based consoles) where each grid position references its tile + any applied rotation+reflection+inversion
2. Then, at scene-load time,
2a. take each distinct curve from the typeface's distinct-curves table, at the chosen size;
2b. generate a (rather large, but helpfully at most 8bpp) texture as so: for all distinct-curve tiles (U pos), for all potential angled-vector-line intersections (V pos), copy the distinct-curve tile, and serialize the intersection data into pixels beside it
2c. run a compute shader to operate concurrently over the workload tiles in this texture to generate an output texture of the same dimensions, that encodes, for each workload, the alpha-mask for the painted curve for the specified angle, iff the intersection test was good (otherwise generating a blank alpha-mask output);
2d. (this is the part I don't know whether GPUs can do) parallel-reduce the UxV tilemap into a Ux1 tilemap, by taking each horizontal strip, and running a pixel-shader that ORs the tiles together (where, if step 2c is done correctly, at most one tile should be non-zero per strip!)
2e. treat this Ux1 output texture as a texture atlas, and each typeface nametable as a UV map for said texture atlas, and render the glyphs.
To be clear, I'm not expecting that I came up with an off-the-cuff solution to an active "independent research problem" here; I'm just curious why it doesn't work :)
If you allow yourself to do this work offline, that's one thing, but keep in mind that realtime 2D graphics are a requirement. People still need to render SVGs, HTML5 canvas, the CSS drawing model, etc. Grid fitting might eventually go out of favor for fonts, but for now it means you need different outlines for different font sizes. See Behdad's excellent document on the difficulties of text subpixel rendering and layout [0]. Also, there are things like variable fonts which we might want to support.
Breaking shapes into region tiles such that each tile has at most one region might be too fine-grained (think about tiger.svg), and is probably equivalent in work to rasterizing on the CPU, so there's not much of a gain there. That said, tiled approaches are very popular, so you're definitely on to something, though tiles often contain multiple elements.
Going down this road lies ideas like Pathfinder 3, Massively Parallel Vector Graphics (Gan et al.), and my personal favorite, the work of adamjsimmons. I have to read this comment [1] a bit between the lines, but I think it's basically that a quadtree or other form of BVH is computed on the CPU, recording which curves are in which parts of the glyph, and then the pixel shader only evaluates the curves it knows are necessary for that pixel. Similar in a lot of ways to Behdad's GLyphy.
I have my own ideas I eventually want to try on top of this as well, but I think using a BVH is my preferred way to solve this problem.
EDIT: You changed this comment between when I was writing and when I posted it, so it's not a reply to the new scheme. The new scheme doesn't seem particularly helpful for me. If you want to talk about this further to learn why, contact information is in my HN profile.
> If you've got 1000 glyphs at a specific visual size to pre-cache into alpha-mask textures;
How often does that happen? There are definitely languages where that is a plausible scenario (e.g. Chinese), but for the majority of written languages you have well under 100 glyphs in common use for any given font style.
And then, as you noted, you cache these to an alpha texture. So for this to matter you'd need all of those 1000 glyphs to show up in the same frame, even.
> Especially on a modern low-power system (e.g. a cheap phone), where you might only have 2-4 slow CPU cores, but still have a bounty of (equally slow) GPU cores sitting there doing mostly nothing?
But the GPU isn't doing nothing. It's already doing all the things it's actually good at like texturing from that alpha texture glyph cache to the hundreds of quads across the screen, filling solid colors, and blitting images.
Rather, typically it's the CPU that is consistently under-utilized. Low end phones still tend to have 6 cores (even up to 10 cores), and apps are still generally bad at utilizing them. You could throw an entire CPU core at doing nothing but font rendering and you probably wouldn't even miss it.
The places where GPU rendering of fonts becomes interesting are when glyphs get huge, or for things like smoothly animating across font sizes (especially with things like variable-width fonts). High-end hero features, basically. For the simple task of text as used on e.g. this site? Simple CPU-rendered glyphs in an alpha texture are easily implemented and plenty fast.
Absolutely, and I don't want to claim I'm the first or only one doing font rendering on GPU. There's Slug as you pointed out, Pathfinder and Spinel as Jasper cited, and also interesting experimental work including GLyphy by Behdad and algorithms by Evan Wallace and Will Dobbie, plus a whole series of academic papers including "Massively Parallel Vector Graphics," "Random Access Vector Graphics," and others.
However, I would say that a common thread is that doing this well is hard. There's no straightforward cookbook scheme that people can just implement, and there are always tradeoffs. Slug is used in a number of games (and congrats to Eric for winning those licenses), but not as far as I know in any UI toolkits, and there are reasons for that.
> Slug is used in a number of games (and congrats to Eric for winning those licenses), but not as far as I know in any UI toolkits, and there are reasons for that
Presumably because its antialiasing is crap? But there's nothing inherent to fragment-oriented approaches that prevents you from doing good AA, and they slot nicely into the existing rasterization pipeline (which is why Slug has lower feature-level requirements than Pathfinder). They also permit arbitrary domain transformations (with some caveats, as you still have to calculate a bounding box), and given appropriate space partitioning they should not be significantly slower than scanline algorithms.
Also: UI toolkits are not known for being on the leading edge of graphics research. I think fastuidraw demonstrates this rather well. Insofar as there is exciting work happening in industry, it is mainly happening in web browsers; and I would expect Mozilla and Google to devote their efforts to Pathfinder and Skia, respectively.
No, Slug's technique can handle AA and do it well. The problem with Slug for general-purpose UI frameworks is that it needs to do a lot of pre-processing on its data to do the good job it does.
Slug aa is only 1-dimensional. From the paper (end of section 2):
> Adding and subtracting these fractions from the winding number has the effect of antialiasing in the direction of the rays. Averaging the final coverages calculated for multiple ray directions antialiases with greater isotropy, but at a performance cost. Considering only rays parallel to the coordinates axes is a good compromise, especially when combined with supersampling, as discussed later.
I.e., you don't get a true 2D coverage result, only an amalgamation of a number of 1D coverage results; and you must trade off performance against quality. Other approaches do not require such a tradeoff.
Analytic 2D coverage can be done more cheaply than n 1D samples (n is probably in the neighborhood of 4-6), and produces better (mathematically ideal, albeit with uncomfortable caveats) results. (Note that 4-6 samples doesn't mean 4-6x slower, due to space partitioning, buffers, and other fixed costs, as well as locality. And I think Slug takes 2 samples by default as it is.)
Oh I wasn’t pointing it out as a critical response to it being your thesis. I’m actually very interested to see how it turns out, because I’m digging into this space at the moment.
I’m trying to build a platform-agnostic styling language specifically for UI/UX designers, and it’s leading me down the path of “render everything via WebGPU”.
Is there a way I can follow your progress? Very keen on hearing more about your research if/when it’s ready.
I don't know if I'd go that far -- icons and text are vector paths. Strokes and drop shadows (aka blurs) are all things that GPUs aren't particularly great at. Simple shapes like rounded rectangles GPUs can be OK at, but you'd have overdraw problems if done naively.
I've worked on 2D rendering engines, so I've seen the content thrown at it in the wild. Very rarely do you have a simple case. GitHub's buttons are maybe the simplest example I can think of, and they have strokes (GPUs: ugh) on a filled rounded border (GPUs: ugh), with text inside (GPUs: ugh), sometimes with a text shadow (GPUs: ugh).
It can be done, but you basically have to get away from triangles and move into research methods which are exceptionally more tricky, aka the stuff in Pathfinder and piet-gpu.
> Strokes and drop shadows (aka blurs) are all things that GPUs aren't particularly great at.
They can handle those just fine. Blurs are just inherently very expensive, but GPUs are no worse at them than CPUs. In fact GPUs are way faster at blurs than CPUs.
Same with filled shapes. It's not really a challenge. You have a fragment shader that knows how to essentially 'clip' to a round rect, which isn't hard, and then filling it any which way with anything is trivial.
"Dedicated graphics hardware for 2D" is just a blitter and sprite engine, and modern GPUs do that just fine. The mouse pointer in many recent systems is a 2D hardware sprite.
Programming graphics, especially 2D stuff, is far more ergonomic and convenient in your native language already executing on the CPU. If you can get away with it performance wise, there's really no incentive to incur the myriad obnoxious bullshit inherent in GPU programming.
But with the advent of high dpi displays it's become problematic to do even simple 2D/UI rendering on the CPU just because of the enormous quantity of pixels.
When you pull in GPU support, now you're stuck having to pick a backend (gl/vulkan/d3d/metal) or some compatibility layer to make some/all of them work. You have to write shaders, you have to constantly move state in/out of the GPU across this GPU:CPU API boundary. It's just a total clusterfuck best avoided if possible.
I'm not familiar with modern game engines, but I'd be very surprised if any of them managed to eliminate the utterly unnatural reality of writing shaders vs. writing classical 2D rendering algorithms operating on a linear buffer of pixels in memory.
For concurrency reasons shaders logically run on a single pixel. Gone are your longstanding algorithms for doing simple things like bresenham line-drawing, or something as simple as drawing a filled box like this:
for (int y = box.y; y < box.y + box.h; y++)
    for (int x = box.x; x < box.x + box.w; x++)
        FB[y * FB_STRIDE + x] = box.color;
Nope, not happening in a shader. Every shader basically executes in isolation on a pixel, and you have to operate from a sort of dead-reckoning perspective. No more sequential loops iterating over rows and columns, the fashion in which literally decades of graphics-programming publications explain how to do things. Not to mention how natural it is to think about things that way, since it closely resembles drawing on paper.
In shaders you often end up doing things that feel utterly absurd in terms of overhead because of this "you run on an arbitrary pixel" perspective. Oftentimes you're writing some kind of distance function, where previously you would have written a loop iterating across lines and rows advancing some state as you step through the pixels. In a shader it's like the paper is covered with thousands of pencils that don't move, and the shader program just determines what color the pencil should be based on its location.
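The "fixed pencils" model translates to C roughly like this; a hypothetical sketch where every pixel independently evaluates a shape-membership function, with no state carried between pixels (squared distance is used just to keep it dependency-free):

```c
#include <assert.h>

/* Shader-style thinking in C: every pixel independently answers
 * "what color am I?" from its own location alone, with no state
 * carried over from neighboring pixels. Here the shape is a filled
 * circle; comparing squared distances avoids needing sqrt(). */
unsigned char shade_pixel(double px, double py,
                          double cx, double cy, double r)
{
    double dx = px - cx, dy = py - cy;
    /* Inside the circle iff distance^2 <= radius^2. */
    return (dx * dx + dy * dy <= r * r) ? 255 : 0;
}
```

A CPU rasterizer would call this in a nested loop; on a GPU, one invocation runs per pixel in parallel, which is exactly why the sequential formulations stop applying.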
GPU programming is plain annoying, even without the GPU API fragmentation clusterfuck. Especially if you've been writing 2D stuff on the CPU for decades.
What you've described is a blit operation (copying a block of pixels from a source such as a texture; in your case, a solid color). Probably you wouldn't write this out, but would write:
blit(box.x, box.y, box.w, box.h, RED)
In shader-land, this is equivalent to rendering a rectangle with a texture or solid color as the source. Sure, it's more involved to implement this abstraction, since you need a mesh, need to write a small shader, have to be familiar with the render pipeline state, etc., but it also gives you some stuff trivially, like anti-aliasing and scaling support for the blit operation.
Many libraries already implement this stuff on top of WebGL, like pixi.js
While I don't doubt the creative possibilities of working with pixels directly, once you figure out how GPUs work, a lot of 2D stuff is actually pretty easy.
Both of the sibling comments describe quadratic Bezier curves (used often in font rendering because TrueType only supports quadratic), while graphics APIs and CFF font outlines often mandate support for cubic Beziers. Cubics are a lot more challenging to build a closed-form solution for, and also have things like self-intersection which makes it a lot more challenging.
Most production renderers, sometimes even ones on the CPU, approximate cubic Bezier curves with a number of quadratic Bezier curves. This is a preprocessing step which needs to be done on the CPU. While it could be done on the GPU, doing it in the pixel shader would be really wasteful.
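As a sketch of what that preprocessing looks like, here's the well-known "midpoint" approximation that replaces one cubic segment (p0, c1, c2, p3) with a single quadratic whose control point is (3(c1 + c2) - p0 - p3) / 4. Production code first subdivides the cubic so each piece stays within an error tolerance; this shows only the control-point formula.

```c
#include <assert.h>

/* Midpoint approximation of a cubic Bezier by one quadratic:
 *   q = (3*c1 + 3*c2 - p0 - p3) / 4
 * Exact when the cubic is a degree-elevated quadratic; otherwise an
 * approximation, so real converters subdivide the cubic first. */
typedef struct { double x, y; } pt;

pt cubic_to_quad_ctrl(pt p0, pt c1, pt c2, pt p3)
{
    pt q;
    q.x = (3.0 * (c1.x + c2.x) - p0.x - p3.x) / 4.0;
    q.y = (3.0 * (c1.y + c2.y) - p0.y - p3.y) / 4.0;
    return q;
}
```

A quick sanity check: elevating the quadratic (0,0), (3,6), (6,0) to a cubic gives control points (2,4) and (4,4), and the formula recovers (3,6) exactly.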
Browsers largely make use of the GPU for UI rendering. Direct2D, Cocoa, QT and GTK(4) are all hardware accelerated as well. So not really sure what you mean?
The linked article covers the major challenges that makes it difficult to adapt the GPU to 2D vector graphics rendering.
tl;dr 2D cares about shapes and curves and proper antialiased coverage of super small shapes, the traditional rasterization GPU pipeline is very good at triangles and textures and have limited coverage options.
Direct2D, Qt and GTK+ still do a good portion of the graphics work on the CPU, and only use the GPU for composition. Some limited graphics can be done on the GPU, usually with quality tradeoffs. Font rasterization is still done on the CPU, and uploaded to the GPU as a texture.
Newer libraries like Pathfinder, Spinel, piet-gpu all work by not using the triangle rasterization parts but instead treating the GPU as a general-purpose parallel processor with compute shaders.
I think it is. I see all kinds of 2D projects claiming to be "GPU accelerated" - for example GTK, KDE, web browsers. I'm not sure how much of the actual processing is done on the GPU, but it's enough to call it "accelerated"!
GPUs want to draw triangles, and in fact only know how to draw triangles[0]. Pretty much all graphics API innovation has been around either feeding more triangles to the GPU faster, letting the GPU create more triangles after they've been sent, or finding cool new ways to draw things on the surface of those triangles.
2D/UI breaks down into drawing curves, either as filled shapes or strokes. The preferred representation is a Bezier spline, which is a series of degree-three[1] polynomials that GPUs have zero support for rasterizing. Furthermore, the strokes of those splines are not themselves polynomial curves, but an even more bizarre type called algebraic curves. You cannot just offset the control points to derive a stroke curve; you either have to approximate the stroke itself with Beziers, or actually draw the line sequentially in a way that GPUs are really not capable of doing.
The four things you can do to render 2D/UI on a GPU is:
- Tessellate the Bezier spline into a series of triangles. Lyon does this. Bezier curves make this rather cheap to do, but it requires foreknowledge of what scale the Bezier will be rendered at, and you cannot adjust stroke widths at all without retessellating.
- Send the control points to the GPU and use hardware tessellation to do the above per-frame. No clue if anyone does this.
- Don't tessellate at all, but send the control points to the GPU as a polygonal mesh, and draw the actual Beziers in the fragment shaders for each polygon. For degree-two/quadratics there are a series of coordinate transforms that you can do which conveniently map all curves to one UV coordinate space; degree-three/cubics require a lot more attention in order to render correctly. If I remember correctly Mozilla Pathfinder does this[2].
- Send a signed distance field and have the GPU march it to render curves. I don't know much about this but I remember hearing about this a while back.
All of these approaches have downsides. Tessellation is the approach I'm most familiar with, because it's used heavily in Ruffle, so I'll just explain its downsides to give you a good idea of why this is a huge problem:
- We can't support some of Flash's weirder rendering hacks, like hairline strokes. Once we have a tessellated stroke, it will always be that width regardless of how we scale the shape. But hairlines require that the stroke get proportionally bigger as the shape gets smaller. In Flash, they were rendering on CPU, so it was just a matter of saying "strokes are always at least 1px".
- We have to sort of guess what scale we want to render at and hope we have enough detail that the curves look like curves. There's one particular Flash optimization trick that consistently breaks our detail estimation and causes us to generate really lo-fi polygons.
- Tessellation requires the curve shape to actually make sense as a sealed hull. We've exposed numerous underlying bugs in lyon purely by throwing really complicated or badly-specified Flash art at it.
- All of this is expensive, especially for complicated shapes. For example, pretty much any Homestuck SWF will lock up your browser for multiple minutes as lyon tries to make sense of all of Hussie's art. This also precludes varying strokes by retessellating per-frame, which would otherwise fix the hairline stroke problem I mentioned above.
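A minimal sketch of the flattening step behind the first approach (uniform stepping for brevity; lyon actually uses adaptive, error-bounded subdivision). The segment count has to be picked from the expected render scale up front, which is exactly where the detail-estimation problem above comes from:

```c
#include <assert.h>

/* Flatten a quadratic Bezier into line segments by uniform sampling
 * of B(t) = (1-t)^2*p0 + 2(1-t)t*ctrl + t^2*p1. A mesh baked with
 * n_segs chosen for one zoom level looks faceted at a higher one. */
typedef struct { double x, y; } vec2;

static double qbez(double a, double b, double c, double t)
{
    double u = 1.0 - t;
    return u * u * a + 2.0 * u * t * b + t * t * c;
}

/* Writes n_segs+1 points into out; returns the point count. */
int flatten_quad(vec2 p0, vec2 ctrl, vec2 p1, int n_segs, vec2 *out)
{
    for (int i = 0; i <= n_segs; i++) {
        double t = (double)i / n_segs;
        out[i].x = qbez(p0.x, ctrl.x, p1.x, t);
        out[i].y = qbez(p0.y, ctrl.y, p1.y, t);
    }
    return n_segs + 1;
}
```

The polyline's points would then be paired into triangles for the GPU; the curve itself never reaches the hardware.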
[0] AFAIK quads are emulated on most modern GPUs, but they are just as useless for 2D/UI as triangles are.
[1] Some earlier 2D systems used degree-two Bezier splines, including most of Adobe Flash.
[2] We have half a PR to use this in Ruffle, but it was abandoned a while back.
> For degree-two/quadratics there are a series of coordinate transforms that you can do which conveniently map all curves to one UV coordinate space; degree-three/cubics require a lot more attention in order to render correctly. If I remember correctly Mozilla Pathfinder does this[2].
That's interesting! Do you have any references (or links to sample fragment shader code) for that quadratic case coordinate transformation?
Basically, all quadratic Bezier curves are just linear transformations[0] of the canonical curve u^2 = v. The fragment shader just evaluates u^2 - v to decide which side of the curve the pixel is on, and texture mapping does all the rest. As long as you're careful to ensure that your fill-surface polygon actually makes sense, you get back perfectly rendered Beziers at any zoom factor or angle.
[0] Scale/shear/rotate - all the things you can do by matrix multiplication against a vector. Notably, that excludes translations; though GPUs happen to use homogeneous coordinates, which let you express translations as matrix multiplications if you follow some conventions.
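A sketch of the per-fragment test itself, which is tiny once the triangle's vertices carry canonical (u, v) coordinates ((0,0), (1/2,0) and (1,1) at the three control points) that the rasterizer interpolates. This C function stands in for the one-line fragment shader:

```c
#include <assert.h>

/* Loop-Blinn style test for a quadratic Bezier: in canonical space
 * the curve is u^2 = v, so a fragment is on the fill side iff
 * u^2 - v <= 0. (Which side counts as "inside" depends on the
 * triangle's orientation; this sketch picks one convention.) */
int on_fill_side(double u, double v)
{
    return (u * u - v) <= 0.0;
}
```

Everything hard lives in setting up the triangles and their (u, v) attributes on the CPU; the per-pixel work is just this sign test.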
I found a reference on HN for anyone else following along:
> In 2005 Loop & Blinn [0] found a method to decide if a sample / pixel is inside or outside a bezier curve (independently of other samples, thus possible in a fragment shader) using only a few multiplications and one subtraction per sample.
- Integral quadratic curve: One multiplication
- Rational quadratic curve: Two multiplications
- Integral cubic curve: Three multiplications
- Rational cubic curve: Four multiplications