Rasterizer engines don’t have to worry about the thread-pixel scheduling problem as its handled behind the scenes by the fixed function rasterizer hardware. With rasterization, GPU threads are mapped to the object data first (vertex vectors), and then scanned into pixel vector work queues, whose many to one mapping to output pixels is synchronized by dedicated hardware.

A tracing engine on the other hand explicity allocates threads to output pixels, and then loops through the one to many mapping of object data which intersects the pixel’s ray, which imposes some new performance pitfalls.

But if you boil it down to simplicity, the rasterization vs ray tracing divide is really just the difference in a loop ordering and mapping:

for each object

for each pixel ray intersecting object

if ray forms closest intersection

store intersection

vs

for each pixel

for each object intersecting pixel’s ray

if ray forms closest intersection

store intersection

The real meat of course is in the data structures, which determine exactly what these ‘for each ..’ entail. Typically, pixels are stored in a regular grid, and there is much more total object data than pixels, so the rasterization approach is simpler. Mapping from objects to rays is thus typically easier than mapping from rays to objects. Conversely, if your scene is extremely simple, such as a single sphere or a regular grid of objects, the tracing approach is equally simple. If the pixel rays do not correspond to a regular grid, rasterization becomes more complex. If you want to include reflections and secondary ray effects, then the mapping from objects to rays becomes complex.

Once the object count becomes very large, much larger than the number of pixels, the two schemes become increasingly similar. Why? Because the core problem becomes one of dataset management, and the optimal solution is output sensitive. So the problem boils down to finding and managing a minimal object working set: the subset of the massive object database that is necessary to render a single frame.

A massively complex scene is the type of scene you have in a full unique dataset. In a triangle based engine, this is perhaps a surface quadtree or ABBTree combined with a texture quadtree for virtual textures, ala id tech 5. In a voxel engine, this would be an octree with voxel bricks. But the data structure is somewhat independent on whether you trace or rasterize, and you could even mix the two. Either scheme will require a crucial visibility step which determines which subsets of the tree are required for rendering the frame. Furthermore, wether traced or rasterized, the dataset should be about the same, and thus the performance limit is about the same – proportional to the working dataset size.

Which gets to an interesting question: What is the theoretical working set size? If you ignore multi-sample anti-aliasing and anistropic sampling, you need about one properly LOD-filtered object primitive(voxel, triangle+texel, whatever) per pixel. Which is simple, suprisingly small, and of course somewhat useless, for naturally with engines at that level of complexity anti-aliasing and anistropic sampling are important. Anti-aliasing doesnt by itself add much to the requirement, but the anisotropy-isotropy issue turns out to be a major problem.

Consider even the ‘simple’ case of a near-infinite ground plane. Sure its naturally representable by a few triangles, but lets assume it has tiny displacements all over and we want to represent it exactly without cheats. A perfect render. The octree or quadtree schemes are both isotropic, so to get down to pixel-sized primitives, they must subdivide down to the radius of a pixel cone. Unfortunately, each such pixel cone will touch many primitives – as the cone has a near infinite length, and when nearly perpendicular to the surface, will intersect a near infinite number of primitives. But whats the real worst case?

The solution actually came to me from shadow mapping, which has a similar sub problem in mapping flat regular grids to pixels. Consider a series of cascade shadow maps which perfectly cover the ground plane. They line up horizontally with the screen along one dimension, and align with depth along the other dimension – near perfectly covering the set of pixels. How many such cascades do you need? It turns out you need log(maxdist), where maxdist is the extent of the far plane, in relation to the near plane. Assuming a realistic far plane of 16 kilometers and a 1 meter near plane, this works out to 14 cascades. So in coarse approximation, anistropy increases the required object density cost for this scene by a factor of roughly ~10x-20x. Ouch!

This is also gives us the worst case possible scene, which is just that single flat ground plane scaled up: a series of maximum length planes each perpendicular to a slice of eye rays, or alternatively, a vicious series of pixel tunnels aligned to the pixel cones. The worst case now is much much worse than ~10-20x the number of pixels. These scenes are easier to imagine and encounter with an orthographic projection, and thankfully won’t come up with a perspective projection very often, but still are frightening.

It would be nice if we could ‘cheat’, but its not exactly clear how to do that. Typical triangle rasterizers can cheat in anistropic texture sampling, but its not clear how to do that in a tree based subdivision system, wether quadtree or octree or whatever. There may be some option with anistropic trees like KD-trees, but they would have to constantly adapt as the camera moves. Detecting glancing angles in a ray tracer and skipping more distance is also not a clear win as it doesn’t reduce the working set size and breaks memory coherency.