Category Archives: graphics

Overdue Update

I need to somehow enforce a mental pre-commitment to blog daily. It’s been almost half a year and I have a huge backlog of thoughts I would like to commit to permanent long-term storage.

Thus, a commitment plan for some upcoming posts:

  •  In October/November of last year (2010), I researched VR HMDs and explored the idea of a next-generation interface. I came up with a novel hardware idea that could potentially solve the enormous resolution demands of a full-FOV, optic-nerve-saturating near-eye display (an effective resolution of say 8k x 4k per eye or higher). After a little research I found that the type of approach I discovered already has a name: a foveal display, although current designs in the space are rather primitive. The particular approach I have in mind, if viable, could solve the display problem once and for all. If an optimized foveal display could be built into eyewear, you would never need any other display – it would replace monitors, TVs, smartphone screens and so on. Combine a foveal HMD with a set of cameras spread out in your room like stereo speakers and some software for real-time vision/scene voxelization/analysis, and we could have a Snow Crash interface (and more).
  • Earlier this year I started researching super-resolution techniques. Super-resolution is typically used to enhance old image/video data and has found a home in upconverting SD video. I have a novel application in mind: take a near-flawless super-res filter and use it as a general optimization for the entire rendering problem. This is especially useful for near-future high-end server-based rendering solutions. Instead of doing expensive ray tracing and video compression on full 1080p frames, you run the expensive code on a 540p frame and then do a fast super-res upconversion to 1080p (potentially a 4x savings across your entire pipeline!). It may come as a surprise that current state-of-the-art super-res algorithms can do a 2x upsample from 540p to 1080p at very low error rates: well below the threshold of visual perception. I have come up with what may be the fastest, simplest super-res technique that still achieves upsampling to 1080p with imperceptible visual error. A caveat is that your 540p image must be quite good, which has implications for rendering accuracy, anti-aliasing, and thus rendering strategy choices.
  • I have big, grandiose plans for next-generation cloud-based gaming engines. Towards that end, I’ve been chugging away at a voxel ray tracing engine. This year I more or less restarted my codebase, designing for Nvidia’s Fermi and beyond along with a somewhat new set of algorithms/structures. Over the summer I finished some of the principal first pipeline tools, such as a triangle voxelizer and some new tracing loops, and made some initial progress towards a fully dynamic voxel scene database.
  • Along the way to Voxeland Nirvanah I got completely fed up with Nvidia’s new debugging path for CUDA (they removed the CPU emulation path) and ended up writing my own CUDA emulation path via a complete metaparser in C++ templates that translates marked-up ‘pseudo-cuda’ to either actual CUDA or a scalar CPU emulation path (a rough sketch of the idea appears after this list). I built most of this in a week and it was an interesting crash course in template-based parsing. Now I can run any of my CUDA code on the CPU. I can also mix and match both paths, which is really useful for pixel-level debugging. In this respect the new path I’ve built is actually more powerful and useful than Nvidia’s old emulation path, as that required a full separate recompilation. Now I can run all my code on the GPU, but on encountering a problem I can copy the data back to the CPU and re-run functions on the CPU path with full debugging info. This ends up being better for me than using Nvidia’s Parallel Nsight for native GPU debugging, because Nsight’s debug path is rather radically different from the normal compilation/execution path and you can’t switch between them dynamically.
  • In the realm of AI, I foresee two major hitherto unexploited/unexplored application domains related to Voxeland Nirvanah. The first is what we could call an Artificial Visual Cortex. Computer vision is the inverse of computer graphics: the latter is concerned with transforming a 3+1D physical model M into a 2+1D viewpoint image sequence I, while the former is concerned with plausibly reconstructing the physical model M given a set of examples of viewpoint image sequences I. Imagine if we had a powerful AVC trained on a huge video database that could then extract plausible 3D scene models from video. Cortical models feature inversion and inference, so a powerful enough AVC could amplify rough 2D image sketches into complete 3D scenes. In some sense this would be an artificial 3D artist, but it could take advantage of more direct and efficient sensor and motor modalities. There are several aspects of this application domain that make it much simpler than a full AGI. Computational learning is easier if one side of the mapping transform is already known; in this case we can prime the learning process by using ray tracing directly as the reverse transformation pathway (M->I). This is a multi-billion dollar application area for AI in the field of computer graphics and visualization.
  • If we can automate artists, why not programmers? I have no doubt that someday in the future we will have AGI systems that can conceive and execute entire technology businesses all on their own, but well before that I foresee a large market role for more specialized AI systems that can help automate more routine programming tasks. Imagine a programming AI that has some capacity for natural language understanding and an ontology that combines knowledge of common-sense English, programming, and several programming languages. Compilation is the task of translating between two precise machine languages expressed in some context-free grammar, and there are deterministic algorithms for such translations. For the more complex unconstrained case of translation between two natural languages we have AI systems that use probabilistic context-sensitive grammars and semantic language ontologies. Translating from a natural language to a programming language should have intermediate complexity. There are now a couple of research systems in natural language programming that can do exactly this (such as sEnglish). But imagine combining such a system with an automated ontology builder such as TEXTRUNNER, which crawls the web to expand its knowledge base. Take such a system and add an inference engine and suddenly it starts getting much more interesting. Imagine building entire programs in pseudo-code, with your AI using its massive ontology of programming patterns and technical language to infer entire functions and subroutines. Before full translation, compilation and test, the AI could even perform approximate simulation to identify problems. Imagine writing short descriptions of data structures and algorithms and having the AI fill in the details, potentially even handling translation to multiple languages, common optimizations, automatic parallelization, and so on. Google itself could become an algorithm/code repository. Reversing the problem, an AI could read a codebase and begin learning likely mappings from code structures to high-level English concept categories, learning what the code is likely to do. Finally, there are many sub-problems in research where you really want to explore a design space and try N variations in certain dimensions. An AI system with access to a bank of machines along with compilation and test procedures could explore permutations at very high speed indeed. At first I expect these types of programming-assistant AIs to have wide but shallow knowledge, and thus to amplify and assist rather than replace human programmers. They will be able to do many simple programming tasks much faster than a human. Eventually such systems will grow in complexity, and then you can combine them with artificial visual cortices to expand their domain of applicability and eventually get a more complete replacement for a human engineer.
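Here is the promised rough sketch of the pseudo-cuda idea. This is a minimal macro-based illustration, not the actual template metaparser described above: the kernel is written once against a few markers, and defining EMULATE_ON_CPU (a name invented for this example) compiles it as plain serial C++ that can be stepped through in a debugger, while the other path compiles it as real CUDA.

// Build with nvcc for the GPU path, or any C++ compiler with -DEMULATE_ON_CPU for the CPU path.
#ifdef EMULATE_ON_CPU
    #define KERNEL                        // strip __global__ on the CPU path
    struct EmuDim3 { unsigned x, y, z; };
    static EmuDim3 emuThreadIdx{0,0,0}, emuBlockIdx{0,0,0}, emuBlockDim{1,1,1};
    #define THREAD_IDX emuThreadIdx
    #define BLOCK_IDX  emuBlockIdx
    #define BLOCK_DIM  emuBlockDim
#else
    #define KERNEL __global__
    #define THREAD_IDX threadIdx
    #define BLOCK_IDX  blockIdx
    #define BLOCK_DIM  blockDim
#endif

// The kernel is written once in the marked-up dialect.
KERNEL void Scale(float* data, int n, float s)
{
    int i = BLOCK_IDX.x * BLOCK_DIM.x + THREAD_IDX.x;
    if (i < n) data[i] *= s;
}

#ifdef EMULATE_ON_CPU
// CPU launcher: runs the kernel serially so any element can be inspected in a debugger.
void LaunchScale(float* data, int n, float s, unsigned block = 256)
{
    unsigned grid = (unsigned(n) + block - 1) / block;
    emuBlockDim = {block, 1, 1};
    for (unsigned b = 0; b < grid; ++b)
        for (unsigned t = 0; t < block; ++t) {
            emuBlockIdx  = {b, 0, 0};
            emuThreadIdx = {t, 0, 0};
            Scale(data, n, s);
        }
}
#endif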

Latency & Human Response Time in current and future games

I’m still surprised at how many gamers and even developers aren’t aware that typical games today have total response latencies ranging anywhere from 60-200ms. We tend to think of latency in terms of pings, and the idea that the response time or ‘ping’ of a computer or console five feet away can be comparable to the ping of a server a continent away seems unnatural.

Yet even though it seems odd, it’s true.

I just read through “Console Gaming: The Lag Factor”, a recent blog article on Eurogamer which follows up on Mick West’s original Gamasutra article that pioneered measuring the actual response times of games using a high-speed digital camera. For background, I earlier wrote a GDM article (Gaming in the Cloud) that referenced that data and showed how remotely rendered games running in the cloud have the potential to at least match the latency of local console games, primarily by running at a higher FPS.

The Eurogamer article alludes to this idea:

In-game latency, or the level of response in our controls, is one of the most crucial elements in game-making, not just in the here and now, but for the future too. It’s fair to say that players today have become conditioned to what the truly hardcore PC gamers would consider to be almost unacceptably high levels of latency to the point where cloud gaming services such as OnLive and Gaikai rely heavily upon it.

The average videogame runs at 30FPS, and appears to have an average lag in the region of 133ms. On top of that is additional delay from the display itself, bringing the overall latency to around 166ms. Assuming that the most ultra-PC gaming set-up has a latency less than one third of that, this is good news for cloud gaming in that there’s a good 80ms or so window for game video to be transmitted from client to server


It’s really interesting to me that the author assumes an “ultra-PC” gaming setup has a latency less than one third of a console’s – even though the general model developed in the article posits that there is no fundamental difference between PCs and consoles in terms of latency, other than framerate.

In general, the article shows that games have inherent delay measured in frames – the minimum seems to be about 3 frames of delay, but it can go up to 5 for some games. The total delay in time units is simply N/F, the number of frames of delay divided by the frame rate. A simple low-delay app will typically have the minimum of about 3 frames, which maps to around 50ms at 60fps and 100ms at 30fps.

There is no fundamental difference between consoles and PCs in this regard other than framerate – the PC version of a game running at 60fps will have the same latency as its console sibling running at 60fps. Of course, take a 30fps console game and run it at 60fps and you halve the latency – and yes, this is exactly what cloud gaming services can exploit.
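As a quick sanity check on the model, here is the arithmetic spelled out (a minimal sketch; the frame counts are the ones assumed above, and the ~33ms display figure is the one quoted from the Eurogamer article):

#include <cstdio>

// total delay = frames of internal delay / frame rate
double DelayMs(int framesOfDelay, double fps) { return 1000.0 * framesOfDelay / fps; }

int main() {
    std::printf("3 frames @ 60fps: %.0f ms\n", DelayMs(3, 60)); // ~50 ms
    std::printf("3 frames @ 30fps: %.0f ms\n", DelayMs(3, 30)); // ~100 ms
    std::printf("4 frames @ 30fps: %.0f ms\n", DelayMs(4, 30)); // ~133 ms; ~33 ms of display lag brings it to ~166 ms
    return 0;
}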


The Eurogamer article was able to actually measure just that – confirming this model with some real-world data. The author was able to use the vsync feature in BioShock to measure the response difference between 59fps and 30fps, and as expected, the 59fps mode had just about half the latency.

The other assertion of the article – or rather the whole point of the article – was that low response times are really important for the ‘feel’ of a game. So I’d like to delve into this in greater detail. As a side note though, the fact that delay needs to be measured for most games to make any sort of guess about its response time tells you something.

Firstly, on the upper end of the spectrum, developers and gamers know from first-hand experience that there definitely is an upper limit to tolerable latency, although it depends on the user action. For most games, controlling the camera with a joypad or mouse feels responsive with a latency of up to 150ms. You might think that mouse control would be considerably more demanding in this regard, but the data does not back that up – I assert that PC games running at 30fps have latencies in the same 133-150ms window as 30fps console games, and are quite playable at that framerate (and some have even shipped capped at 30fps).


There is a legitimate reason for a PC gamer to try to minimize their system latency as much as possible for competitive multiplayer gaming, especially twitch shooters like Counter-Strike. A system running with vsync off at 100fps might have latencies under 50ms and will give you a considerable advantage over an opponent running at 30fps with 133-150ms of base system latency – no doubt about that.


But what I’m asserting is that most gamers will barely – if at all – be able to notice differences in delay times under 100ms in typical scenarios in FPS and action games, whether using a gamepad or mouse and keyboard. As delay times exceed some threshold they become increasingly noticeable – 200ms of delay is noticeable to most users, and 300ms becomes unplayable. That being said, variation in the delay is much more noticeable. The difference between a perfectly consistent 30fps and 60fps is difficult to perceive, but an inconsistent 30fps is quite noticeable – the spikes and changes in response time from frame to frame are themselves neurologically relevant and quite detectable. This is why console developers spend a good deal of time optimizing the spike frames and hitting a smooth 30fps.

There is however a class of actions that does have a significantly lower latency threshold – the simple action of moving a mouse cursor around on the screen! Here I have some first-hand data. A graphics app which renders its own cursor, has little buffering, and runs at 60fps will have about 3 frames or about 50ms of lag, and at that low level of delay the cursor feels responsive. However, if you take that same app and slow it down to 30fps, or even add just a few more frames of latency at 60fps, the cursor suddenly seems to lag behind. The typical solution is to use the hardware cursor feature, which short-circuits the whole rendering pipeline and provides a direct fast path to pipe mouse data to the display – which seems to be under 50ms. For the low-latency app running at 60fps, the hardware cursor isn’t necessary, but it becomes suddenly important at some threshold around 70-90ms.

I think that this is the real absolute lower limit of human ability to perceive delay.

Why is there such a fundamental limit? In short: the limitations of the brain.

Ponder for a second what it actually means for the brain to notice delay in a system. The user initiates an action and sometime later this results in a response, and if that response is detected as late, the user will notice delay. Somewhere, a circuit (or potentially circuits, but for our purposes this doesn’t matter) in the brain makes a decision to initiate an action; this message must then propagate down to muscles in the hand, where it enters the game system through the input device. Meanwhile in the brain, the decision circuit must also send messages to the visual circuits of the form “I have initiated an action and am expecting a response – please look for this and notify me immediately on detection”. It’s easier to imagine the brain as a centralized system like a single CPU, but it is in fact the exact opposite – massively distributed, a network really on the scale of the internet itself – and curiously for our discussion, with comparable latencies.

Neurons typically take about 10ms to fire, perhaps as little as 5ms in some regions. The fastest neural conduits – myelinated fibers – can send signals from the brain to the fingertip (one way) in about 20ms. So now imagine using these slow components to build a circuit that can detect a timing delay as short as 60ms.

Let’s start with the example of firing a gun. At a minimum, we have some computation to decide to fire, and once this happens the message can be sent down to the fingertip to pull the trigger and start the process. At the same time, for the brain to figure out if the gun actually fired in time, a message must also be sent to the visual circuits, which must process the visual input stream and determine if the expected response exists (a firing gun). This information can then be sent to some higher circuit which computes whether the visual response (a gun-firing pattern exists or not at this moment in time) matches the action initiated (the brain sent a firing signal to the finger at this moment in time).

Built out of slow 10ms neurons, this circuit is obviously going to have a lot of delay of its own, which is going to place some limits on its response time and ability to detect delay. Thinking of the basic neuron firing time as the ‘clock rate’ and the brain as a giant computer (which it is in the abstract sense), it appears that the brain can compute some of these quick responses in as little as around a dozen ‘clock cycles’. This is pretty remarkable, even given that the brain has trillions of parallel circuits. But anyway, the brain could detect even instantaneous responses if it had the equivalent of video buffering. In other words, if the brain could compensate for its own delay, it could detect delays in the firing response on timescales shorter than its own response time. For this to happen, though, the incoming visual data would need to be buffered in some form. The visual circuits, instead of being instructed to signal upon detection of a firing gun, could be instructed to search for a gun firing X ms in the past. However, to do this they would need some temporal history – the equivalent of a video buffer. There are reasons to believe some type of buffering does exist in the brain, but with limitations – it’s nothing like a computer video buffer.

The other limitation on the brain’s ability to detect delays is the firing time of neurons themselves, which makes it difficult to detect timings on scales approaching the neuron firing time.
But getting back to the visual circuits, the brain did not evolve to detect lag in video games or other systems. Just because it’s theoretically possible that a neural circuit built out of relatively slow components could detect fast responses by compensating for its own processing delay does not mean that the brain actually does this. The quick ‘twitch’ circuits we are talking about evolved to make rapid decisions – things like: detect creature, identify it as prey or predator, and initiate fight or flight. These quick responses involve rapid pattern recognition, classification, and decision making, all in real time. However, the quick response system is not especially concerned with detecting exactly when an event occurred; it is optimized for the problem of reacting to events rapidly and correctly. Detecting whether your body muscles reacted to the run command at the right time is not the primary function of these circuits – it is to detect the predator threat and initiate the correct run response rapidly. The insight and assertion I’m going to make is that our ability to detect delays in other systems (such as video games) is only as good as our brain’s own quick response time – because it uses the same circuits. Psychological tests put the measured response time around ~200ms for many general tasks, probably getting a little lower for game-like tasks with training. A lower bound of around 100-150ms for complex actions like firing guns and moving cameras seems reasonable for experienced players.

For moving a mouse cursor, the response time appears to be lower, perhaps 60-90ms. From this brain model, we can expect that for a few reasons. Firstly, the mouse cursor is very simple and very small, and once the visual system is tracking it we can expect that detecting changes in its motion (to verify that it’s moving as intended) is computationally simple and can be performed in the minimal number of steps. Detecting that the entire scene moved in the correct direction, or that the gun is in its firing animation state, are far more complex pattern recognition tasks, and we can expect they would take more steps. So detecting mouse motion represents the simplest and fastest type of visual pattern recognition.

There is another factor at work here as well: rapid eye saccades. The visual system actually directs the eye muscles on frame-by-frame timescales that we don’t consciously perceive. When recognizing a face, you may think you are looking at someone right in the eye, but if you watched a high-res video feed of yourself and zoomed in on your eyes in slow motion, you’d see that your eyes are actually making many rapid jumps – leaping from the eyebrow to the lips to the nose and so on. Presumably when moving a mouse cursor around, some of these saccades are directed to predicted positions of the cursor to make it easy for the visual system to detect its motion (and thus detect if it’s lagging).

So in summary, experimental data (from both games and psychological research) leads us to expect that the threshold for human delay detection is around:

300ms+ – games become unpleasant, even unplayable
200ms+ – delay becomes palpable
100-150ms – limit of delay detection for full-scene actions, such as camera panning
50-60ms – absolute limit of delay detection, for small object tracking such as mouse cursors

Delay is a strongly non-linear phenomenon: undetectable below a certain threshold, then ramping up to annoying and soon after to deal-breaking. It’s not a phenomenon where less is always better – below a certain point, less doesn’t matter from a user experience point of view. (Of course, for competitive twitch gaming, having less delay is definitely advantageous even when you can’t notice it – but this isn’t relevant for console-type systems where everyone has the same delay.)

So getting back to the earlier section of this post, if we run a game on a remote PC, what can we expect the total delay to be?

The cloud system has several additional components that can add delay on top of the game itself: video compression, the network, and the client which decompresses the video feed.

Without getting into specifics, what can we roughly expect? Well, even a simple client which just decompresses video is likely to exhibit the typical minimum of roughly 3 frames of lag. Let’s assume the video compression can be done in a single frame and the network and buffering add another; we are looking then at roughly 5 frames of additional lag with a low ping to the server – with some obvious areas that could be trimmed further.

If everything is running at 60fps, a low-latency game (3 frames of internal lag) might exhibit around 8/60 of a second, or 133ms of latency, and a higher-latency game (5 frames of internal lag) might exhibit 10/60, or 166ms. So it seems reasonable to expect that games running at 60fps remotely can have latencies similar to local games running at 30fps. Ping to the server then does not represent even the majority of the lag, but it obviously can push the total delay into the unplayable range as the ping grows – and naturally every frame of delay saved allows the game to be playable at the same quality at increasingly greater distances from the server.

What are the next obvious areas of improvement? You could squeeze and save additional frames here and there (the client perhaps could be optimized down to 2 frames of delay – something of a lower limit though), but the easiest way to further lower the latency is just to double the FPS again.

120fps may seem like a lot, but it also happens to be a sort of requirement for 3D gaming, and is the direction all new displays are moving. At 120fps, the base lag in such an example would be around 8/120 to 10/120 of a second, or around 66ms to 83ms of latency, comparable to 60fps console games running locally. This also hints that a remotely rendered mouse cursor would be viable at such high FPS. At 120fps, you could have a ping as high as 100ms and still get an experience comparable to a local console.
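Putting the whole budget in one place (a minimal sketch; the encode, network and client frame counts are the assumed figures from above, not measurements):

#include <cstdio>

// game-internal frames + pipeline frames (1 encode + 1 network/buffering + 3 client), all at the same fps
double CloudDelayMs(int gameFrames, double fps, double pingMs) {
    const int pipelineFrames = 1 + 1 + 3;
    return 1000.0 * (gameFrames + pipelineFrames) / fps + pingMs;
}

int main() {
    std::printf("low-latency game,  60fps:   %.0f ms\n", CloudDelayMs(3, 60.0, 0.0));    // ~133 ms
    std::printf("high-latency game, 60fps:   %.0f ms\n", CloudDelayMs(5, 60.0, 0.0));    // ~166 ms
    std::printf("low-latency game,  120fps:  %.0f ms\n", CloudDelayMs(3, 120.0, 0.0));   // ~66 ms
    std::printf("low-latency game,  120fps + 100ms ping: %.0f ms\n", CloudDelayMs(3, 120.0, 100.0)); // ~166 ms, like a local 30fps console
    return 0;
}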

This leads to some interesting rendering directions if you start designing for 120fps and 3D, instead of the 30fps that games are typically designed for now. The obvious optimization for 120fps and 3D is to take advantage of the greater inter-frame coherence. Reusing shading, shadowing, lighting and all that jazz from frame to frame has proportionately greater advantage at high FPS, as the scene changes proportionately less between frames. Likewise, the video compression work and bitrate scale sublinearly, and actually increase surprisingly slowly as you double the framerate.



Unique Voxel Storage

How much memory does a unique voxelization of a given scene cost? Considering anisotropic filtering and translucency, a pixel will in general be covered by more than one voxel. An upper bound is rather straightforward to calculate. For a single viewport with a limited nearZ and farZ range, there are a finite number of pixel-radius voxels extending out to fill the projection volume. The depth dimension of this volume is given by viewportDim * log2(farZ/nearZ). For a 1024×1024 viewport, a nearZ of 1 meter and a view distance of 16 kilometers, this works out to about log2(16000)*1024, or roughly 14,000 voxels per pixel, 14 billion voxels for the frustum’s projection volume, and around ~100 billion voxels for the entire spherical viewing volume. This represents the maximum possible data size of any unique scene when sampled at the proper pixel sampling rate with unlimited translucency and AA precision.
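For reference, here is the arithmetic above spelled out (this just reproduces the formula; the numbers are order-of-magnitude estimates):

#include <cmath>
#include <cstdio>

int main() {
    const double dim   = 1024.0;    // viewport is dim x dim pixels
    const double nearZ = 1.0;       // meters
    const double farZ  = 16000.0;   // 16 km view distance

    // pixel-radius voxels along the depth of one pixel's frustum column
    const double depthVoxelsPerPixel = dim * std::log2(farZ / nearZ);   // ~14,000
    // total for the camera frustum's projection volume
    const double frustumVoxels = depthVoxelsPerPixel * dim * dim;       // ~1.5e10, i.e. ~14 billion

    std::printf("depth voxels per pixel: %.0f\n", depthVoxelsPerPixel);
    std::printf("frustum volume voxels : %.2e\n", frustumVoxels);
    return 0;
}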

Now obviously, this is the theoretical worst case, which is interesting to know but wouldn’t come up in reality. A straightforward, tighter bound can be reached if we use discrete multi-sampling for the AA and anisotropic filtering, which means that each sub-sample hits just one voxel, and we only need to store the visible (closest) voxels. In this case, considering occlusion, the voxel cost is dramatically lower, being just ScreenArea*AAFactor. For an average of 10 sub-samples and the same viewport setup as above, this is just around 100 million voxels for the entire viewing volume. Anisotropic filtering quickly hits diminishing returns by around 16 samples per pixel maximum, and most pixels need much less, so a 10x average is quite reasonable.
For translucent voxels, a 10x coverage multiplier is quite generous, as the contribution of high frequencies decreases with decreasing opacity (which current game rasterizers exploit by rendering translucent particles at lower resolution). This would mean that voxels at around 10% opacity would get full pixel resolution, and voxels at about 1.5% or lower would get half-pixel resolution, roughly.
The octree subdivision can be guided by the z occlusion information. Ideally we would update a node’s visibility during the ray traversal, but due to the inefficiency of scattered memory writes it will probably be better to write out some form of z-buffer and then back-project the nodes to determine visibility.
A brute-force multi-sampling approach sounds expensive, but would still be feasible on future hardware, as Nvidia’s recent SIGGRAPH paper “Alternative Rendering Pipelines with Nvidia CUDA” demonstrates in the case of implementing a Reyes micropolygon rasterizer in CUDA. With enough multi-samples, you don’t even need bilinear filtering – simple point sampling will suffice. But for voxel tracing, discrete multi-sampling isn’t all that efficient compared to the more obvious and desirable path, which is simply to accumulate coverage/alpha directly while tracing. This is by far the fastest route to high-quality AA & filtering. However, it does pose a problem for the visibility determination mentioned above – without a discrete z-buffer, you don’t have an obvious way of calculating voxel visibility for subdivision.
One approach would be to use an alpha-to-coverage scheme, which would still be faster than true multi-sampled tracing. This would require updating a number of AA z-samples inside the tracing inner loop, which is still much more work than just alpha blending. A more interesting alternative is to store an explicit depth function. One scheme would be to store a series of depths representing equal alpha intervals; or better yet, store arbitrary piecewise segments of the depth/opacity function. In the hierarchical tracing scheme, these could be written out and stored at a lower-resolution mip level, such as the quarter-res level, and then be used both to accelerate tracing for the finer levels and for determining octree node visibility. During the subdivision step, nodes would project to the screen and sample their visibility from the appropriate depth interval in this structure.
I think the impact of anisotropy and translucency can be limited or capped just as in the discrete z-buffer case by appropriate node reweighting based on occlusion or opacity contribution. A node which finds that it is only 25% visible would only get slightly penalized, but a 5% visible node more heavily so, effectively emulating a maximum effective voxel-per-pixel limit, after which resolution is lost (which is fine, as the less a node contributes, the less important the loss of its high-frequency content). Or more precisely, node scores would decrease in proportion to their screen coverage when it falls below the threshold 1/AA, where AA is the super-sampling limit you want to emulate.

Hierarchical Cone Tracing (half-baked rough sketch)

High-quality ray tracing involves the computation of huge numbers of ray-scene intersections. As most scenes are not random, the set of rays traced for a particular frame is highly structured and spatially correlated. Just as real-world images feature significant local spatial coherence which can be exploited by image compression, real-world scenes feature high spatial coherence which ray tracers can exploit. Current real-time ray tracers exploit spatial coherence at the algorithm level through hierarchical packet/cone/frustum tracing, and at the hardware level through wide SIMD lanes, wide memory packet transactions, and so on. Coherence is rather easy to maintain for primary rays and shadow rays, but becomes increasingly difficult with specular reflection and refraction rays in the presence of high-frequency surface normals, or with the wide dispersion patterns of diffuse tracing. However, taking inspiration from both image compression and collision detection, it should be possible both to significantly reduce the number of traces per pixel and to trace all rays (or cones) in a structured, coherent and even hierarchical manner.

A set of rays with similar origins and directions can be conservatively bounded or approximated by a frustum or cone. Using these higher order shapes as direct intersection primitives can replace many individual ray traces, but usually at the expense of much more complex intersections for traditional primitives such as triangles. However, alternative primitives such as voxels (or distance fields) permit cheap sphere intersections, and thus fast approximate cone tracing through spherical stepping. Cones have the further advantage that they permit fairly quick bounding and clustering operations.
Building on cones as the base primitive, we can further improve on ray tracing in several directions. First, we can treat the entire set of cone traces for a frame as one large collision detection problem, intersecting a set of cones with the scene. Instead of intersecting each cone individually, HCT builds up a cone hierarchy on the fly and uses this to quickly exclude large sections of potential sphere-scene intersections. The second area of improvement is to use clustering approximations at the finer levels to reduce the total set of cones, replacing clusters of similar cones with approximations. Finally, the hierarchy exposed can be navigated in a coherent fashion which is well suited to modern GPUs.
In short sketch, hierarchical cone tracing amounts to taking a cluster of cone segments (which in turn are clusters of points+rays), building up some sort of hierarchical cone tree organization, and then using this tree to more efficiently trace and find intersections. As tracing a cone involves testing a set of spheres along the way, testing enclosing spheres further up the hierarchy can be used to skip steps lower down in the hierarchy. However, instead of building the hierarchy from the bottom up (which would require the solution to already be known), the cone tree is built from the top down, using adaptive subdivision. Starting with a single cone (or small set of cones) from the camera, it calculates an intersection slice (or ranges of slices) with the scene. These intersection spheres are tested for bounds on the normals, which allow computation of secondary cones, with origins bounding the intersection volumes and angular widths sufficient to bound the secondary illumination of interest. This space forms a 3D (2 angular + depth) dependency tree structure, which can then be adaptively subdivided, eventually down to the level of near pixel-width primary cones which typically intersect a small slice of the scene and have similar normals (a small-radius normal bounding cone). Refraction secondary cones have a width similar to the incoming cone’s width. Specular secondary cones have a width ranging from the incoming width to much wider, depending on the glossiness term. Diffuse bounce secondary cones are essentially equivalent to the widest specular, expanding the cone to maximum width. Subdivision can proceed in several ways at each step: in primary cone direction (2D), intersection depth, secondary cone direction (2D), or secondary intersection depth (considering more bounces in one step would add additional dimensions). The subdivision is error-guided, terminating roughly when a low-error approximation is reached. This framework is complete, and can simultaneously handle all illumination effects.
First, consider just the simple case of primary rays. Here, hierarchical cone tracing is pretty straightforward. You trace the image pyramid from coarsest to finest. Each mip trace finds the first intersection or a conservative near-z for that mip level, and the next mip level trace uses this coarse near-z to start tracing, instead of starting at the camera. Even just this basic idea can result in more than a 2x speedup vs regular tracing. Further speedup is possible if the lower-res mip traces go farther in, possibly outputting a set of segments instead of just the first intersection. This can be done rather easily with alpha tracing in a voxel system. Considering a range of segments, and not just the first intersection, amounts to treating it as a 3D subdivision problem instead of 2D. The global bounding cone is approximated by a few dozen spheres, each of which subdivides into about 8 child spheres, and so on. Testing a sphere high up in the hierarchy can exclude that entire subtree.
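A minimal sketch of just this primary-ray case, written as ordinary C++ (coneTrace is an assumed callable standing in for a sphere-stepping trace through the voxel scene; the point is only the coarse-to-fine flow of conservative near-z values):

#include <algorithm>
#include <cstddef>
#include <vector>

// coneTrace(x, y, level, startZ) -> conservative hit depth for that pixel's cone at that mip level.
// Level (levels-1) is a single 1x1 cone covering the whole view; level 0 is full resolution.
template <class ConeTraceFn>
std::vector<float> TracePyramid(int levels, ConeTraceFn coneTrace)
{
    std::vector<float> nearZ(1, 0.0f);          // coarsest level starts at the camera
    int dim = 1;
    for (int level = levels - 1; level >= 0; --level) {
        std::vector<float> next(std::size_t(dim) * dim);
        int parentDim = std::max(dim / 2, 1);
        for (int y = 0; y < dim; ++y)
            for (int x = 0; x < dim; ++x) {
                // start this cone at the conservative near-z found by its parent cone
                float startZ = nearZ[std::size_t(y / 2) * parentDim + (x / 2)];
                next[std::size_t(y) * dim + x] = coneTrace(x, y, level, startZ);
            }
        nearZ.swap(next);
        dim *= 2;
    }
    return nearZ;                               // full-resolution conservative hit depths
}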
Next let’s consider a directional or point light. Starting with a very large cone for the entire scene, we get a set of intersection slices along the entire frustum, each of which has a range of normals. Since these regions are so huge, the normals will probably cover the full sphere, so the cones for any secondary effects will basically be larger spheres emanating out across the entire world. Not surprisingly, the origin of the cone hierarchy doesn’t tell you much other than that you have a camera frustum, it hits a bunch of stuff, and the secondary illumination could come from anywhere in the scene. Subdivision can proceed in several different ways. You could next subdivide the screen, or you could even subdivide in 3D (screen + depth), splitting into 8 child traces, where the farther depth slices are provisional (they may not accumulate anything). However, another dimension of subdivision which makes sense here is to subdivide the 2D direction of the secondary (illumination) cones. This should be error-guided, subdividing cones that contain high-frequency illumination (stored in the octree) and contribute the most error. A bright point light will be found this way, or a directional light will just show up as a huge illumination spike in one cone direction. Subdivision of direction will then find the incoming light direction logarithmically. If there is no indirect illumination (all zeroed out in the octree), then the error guidance will expand in the direction dimension until it finds the primary light direction, and ignore the other directions.
How would the error guidance work more specifically? Its goal would be to minimize the final tree size (and thus total computation). It would do this in a greedy, approximate fashion using the local information available at each step. When subdividing a cone, there are several splitting options to choose from, and one dimension (depth, i.e. the cone’s length) is rather special as it involves a temporal dependency. The closer cone section potentially masks the farther section, so if the near section is completely opaque (blocks all light), the far section can be cut. Tracing a cone as a sphere approximation amounts to fully subdividing along just the depth dimension, and reveals the full alpha interval function along that depth (which could be saved as just a min and max, or in a more explicit format). Subdividing a cone along either the spatial dimension or the angular dimension would then depend on the outcome of the trace, the illumination info found, and the current angle relative to the spherical origin bound.
Careful error guidance will result in an efficient traversal for the directional light case. Once the secondary cone direction is sufficiently subdivided, finding that a secondary cone trace towards the light direction results in 0 (fully shadowed) will automatically stop further subdivision of that entire subtree. Likewise, subdividing in secondary cone depth will reveal entire cone subsections that are empty and do not need to be traced. The fully lit regions will then end up exploring increasingly short cone traces near the surface. Full cone traces down to pixel level are only required near shadow edges, and only when the shadow is near pixel resolution, as the error guidance can terminate softer shadows earlier. Secondary-bounce diffuse illumination is similar, but the energy is smeared out (lower frequency), so it explores a broader subtree in cone direction, but can usually terminate much earlier in the spatial dimension. It again ends up terminating with just a few shallow traces at the finest resolution, representing local searches. Specular and reflection traces are handled in a similar fashion, and really aren’t that different.
Surface Clustering
Further improvement comes from clustering approximation. The coarse level tests can use bounds on the normals and depths in an intersection region to approximate the intersection and terminate early (for example, finding that the intersection for a 4×4 pixel block is a smooth, nearly coplanar surface section allows early termination and computation of intersection through simple interpolation for the 2 finer levels). This is related to the concept of shading in a compressed space. Consider a simple block based compression scheme, such as DXTC, which essentially uses a PCA approach to compress a set of values (a 4×4 block) by a common 1D line segment through the space (low frequency component) combined with a per pixel distance along the line (high frequency and lower accuracy). The scheme compresses smooth regions of the image with little error, and the error in noisy regions of the image is masked by the noise.
Now let’s first apply this compression scheme in the context of traditional shading. In fact, it is directly applicable in current raster engines for complex but lower-frequency shading effects, like screen-space AO or GI. Downsampling the depth buffer to compute AO on fewer samples, and then upsampling with a bilateral filter, can be considered a form of compression that exploits the dominant low-frequency component of AO. A related scheme, closer to DXTC, is to perform a min/max depth downsampling, evaluate the AO on the min & max samples per block, and then use these to upsample – with a dual bilateral filter or even without. The min/max scheme better represents noisy depth distributions and works much better in those more complex cases (although similar results could be obtained by storing only one z-sample per block and stipple-alternating min and max).
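A minimal sketch of the min/max variant (my reading of the scheme above, assuming the framebuffer dimensions are multiples of the block size; the bilateral weight here is a simple depth-difference weight, just to show the flow):

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct MinMaxZ { float zMin, zMax; };

// Keep only the min and max depth of each 4x4 block.
std::vector<MinMaxZ> DownsampleDepth4x4(const std::vector<float>& depth, int w, int h)
{
    const int bw = w / 4, bh = h / 4;
    std::vector<MinMaxZ> blocks(std::size_t(bw) * bh);
    for (int by = 0; by < bh; ++by)
        for (int bx = 0; bx < bw; ++bx) {
            MinMaxZ mm{1e30f, -1e30f};
            for (int y = 0; y < 4; ++y)
                for (int x = 0; x < 4; ++x) {
                    float z = depth[std::size_t(by * 4 + y) * w + (bx * 4 + x)];
                    mm.zMin = std::min(mm.zMin, z);
                    mm.zMax = std::max(mm.zMax, z);
                }
            blocks[std::size_t(by) * bw + bx] = mm;
        }
    return blocks;
}

// AO is evaluated only at the two exemplar depths per block; each full-res pixel
// then blends those two results, weighted by how close its own depth is to each exemplar.
float UpsampleAO(float pixelZ, MinMaxZ mm, float aoAtMin, float aoAtMax)
{
    float wMin = 1.0f / (std::fabs(pixelZ - mm.zMin) + 1e-4f);
    float wMax = 1.0f / (std::fabs(pixelZ - mm.zMax) + 1e-4f);
    return (aoAtMin * wMin + aoAtMax * wMax) / (wMin + wMax);
}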
So the generalized form of this spatial clustering reduction is to: 1. reduce the set of samples into a compressed, smaller exemplar set, often reducing the number of dimensions; 2. run your per-sample algorithm on the reduced exemplar set of samples; and 3. use a bilateral-type filter to interpolate the results from the exemplars back onto the originals. If the exemplars are chosen correctly (and are sufficient) to capture the structure and frequency of the function on the sample set, there will be little error.
As another example, let’s consider somewhat blurry, lower-frequency reflections on a specular surface. After primary ray hit points are generated for a block of pixels (or rasterized), you have a normal buffer. Then you run a block compression on this buffer to find a best-fit min and max normal for the block, and the per-pixel interpolators. Using this block information, you can compute specular illumination for the two per-block normal directions, and then interpolate the results for all of the pixels in the block. In regions where the normal is smooth, such as the hood of a car, the reduced block normals are a very close approximation. In regions with high-frequency noise in the normals, the noise breaks up any coherent pattern in the specular reflection and hides any error due to the block compression. Of course, on depth edges there is additional error due to the depth/position. To handle this, we need to extend the idea to also block-compress the depth buffer, resulting in a multi-dimensional clustering based on two dimensions along a sphere for the normal, and one dimension of depth. This could still be approximated by two points along a line, but there are other possibilities, such as 3 points (forming a triangle), or even storing 2 depth clusters with 2 normals each (4 points in the 3D space). It would be most efficient to make the algorithm adaptive, using perhaps 1-3 candidate exemplars per block. The later bilateral filtering would pick the appropriate neighbor candidates and weights for each full-res sample.
Integrating this concept into the hierarchical cone tracer amounts to adding a new potential step when considering expansion sub-tasks. As I described earlier for primary rays, you can stop subdivision early if you know that the hit intersection is simple and reducible to a plane. Generalizing that idea, the finer levels of the tree expansion can perform an approximation substitution instead of fully expanding and evaluating in each dimension, terminating early. A plane fit is one simple example, with an exemplar set being the more general case. The 5D cone subtree at a particular lower-level node (like 4×4), which may have many dozens of fully expanded children, can be substituted with a lower-dimensional approximation and a few candidate samples. This opens up a whole family of algorithms which adaptively compress and reduce the data and computation space. Triangle-like planar primitives can be considered a spot optimization that may be cheaper than subdividing and tracing additional spheres.
It’s certainly complex, or potentially complex, but I think it represents a sketch of what an optimal or near-optimal tracer would look like. Just as in image compression, increasing code complexity hits an asymptotic wall at some point. I’ve only considered a single frame here; going further would require integrating these ideas with temporal coherence. Temporal coherence is something I’ve discussed earlier, although there are many options for how to approach it in a hierarchical cone tracer.
What are the potential advantages? Perhaps an order of magnitude speedup or more if you really add a lot of the special-case optimizations. Of course, this speedup is only at the algorithmic level, and really depends very much on implementation details, which are complex. I think the queue/task-generation models now being explored on GPUs are the way to go here, and could implement an adaptive tree subdivision system like this efficiently. Coherence can be maintained by organizing expansions in spatially related clusters, and enforcing this through clustering.
But the real advantage would be combining all lighting effects into a single unified framework. Everything would cast, receive, bounce and bend light, with no special cases. Lights would just be objects in the scene, and everything in the voxelized scene would store illumination info. A real system would have to cache this and what not, but it can be considered an intermediate computation in this framework, one of many things you could potentially cache.
I’ve only explored the baby steps of this direction so far, but it’s interesting to ponder. Of course, the streaming and compression issues in a voxel system are a far higher priority. Complex illumination can come later.

More on grid computing costs

I did a little searching recently to see how my conjectured cost estimates for cloud gaming compared to the current market for grid computing. The prices quoted for server rentals vary tremendously, but I found this NewServers ‘Bare Metal Cloud’ service as an interesting example of raw compute server rental by the hour or month (same rate, apparently no bulk discount).

Their ‘Jumbo’ option at 38 cents per hour is within my previous estimate of 25-50 cents per hour. It provides dual quad cores and 8GB of RAM. It doesn’t have a GPU of course, but instead has two large drives. You could substitute those drives for a GPU and keep the cost roughly the same (using a shared network drive for every 32 or 64 servers or whatever – which they also offer). Nobody needs GPUs in server rooms right now, which is the biggest difference between a game service and anything else you’d run in the cloud, but I expect that to change in the years ahead with Larrabee and upcoming more general GPUs (and, coming from the other angle, CPU rendering is becoming increasingly viable). These will continue to penetrate the grid space, driven by video encoding, film rendering, and yes, cloud gaming.

What about bandwidth?
Each server includes 3 GB of Pure Internap bandwidth per hour
So adequate bandwidth for live video streaming is already included. What’s missing, besides the GPU? Fast, low-latency video compression, of course. It’s interesting that x264, the open-source encoder, can do real-time software encoding using 4 Intel cores (and it’s certainly not the fastest out there). So if you had a low-latency H.264 encoder, you could just use 4 of the cores for encoding and 4 to run the game. Low-latency H.264 encoders do exist, of course, and I suspect that is the route Dave Perry’s Gaikai is taking.
Of course, in the near term, datacenters for cloud gaming will be custom built, such as what OnLive and OToy are attempting. Speaking of which, the other interesting trend is the adoption of GPUs for feature film use, as seen recently in the latest Harry Potter film. OToy is banking on this trend, as their AMD-powered datacenters will provide computation for both film and games. This makes all kinds of sense, because the film rendering jobs can often run at night and use otherwise idle capacity. From an economic perspective, film render farms are already well established, and charge significantly more per server hour – usually measured per GHz-hour. Typical prices are around 6-12 cents per GHz-hour in bulk, which would come to roughly a dollar or two per hour for the server example given above (eight cores at roughly 2-3GHz is on the order of 20GHz). I imagine that this is mainly due to the software expense, which for a render server can add up to many times the hardware cost.
So, here are the key trends:
– GPU/CPU convergence, leading to a common general server platform that can handle film/game rendering, video compression, or anything really
– next gen game rendering moving into ray tracing and the high end approaches of film
– bulk bandwidth already fairly inexpensive for 720p streaming, and falling 30-40% per year
– steadily improving video compression tech, with H.265 on the horizon, targeting a further 50% improvement in bitrate
Will film and game rendering systems eventually unify? I think this is the route we are heading. Both industries want to simulate large virtual worlds from numerous camera angles. The difference is that games are interested in live simulation and simultaneous broadcast of many viewpoints, while films aim to produce a single very-high-quality two-hour viewpoint. However, live simulation and numerous camera angles are also required during a film’s production, as large teams of artists each work on small pieces of the eventual film (many of which are later cut) and need to be able to preview quickly (even at reduced detail). So the rendering needs of a film production are similar to those of a live game service.
Could we eventually see unified art pipelines and render packages between games and films? Perhaps. (Indeed, the art tools are largely unified already, except that world editing is usually handled by proprietary game tools.) The current software model for high-end rendering packages is not well suited to cloud computing, but the software-as-a-service model would make a lot of sense. As a gamer logs in (through a laptop, cable box, microconsole, whatever) and starts a game, that would connect to a service provider to find a host server nearby, possibly installing the rendering software as needed and streaming the data (cached at each datacenter, of course). The hardware and the software could both be rented on demand. Eventually you could even create games without licensing an engine in the traditional sense, but simply by using completely off-the-shelf software.

Some thoughts on metaprogramming, reflection, and templates

The thought struck me recently that C++ templates really are a downright awful metaprogramming system. Don’t get me wrong, they are very powerful and I definitely use them, but recently I’ve realized that whatever power they have is solely due to enabling metaprogramming, and there are numerous other ways of approaching metaprogramming that actually make sense and are more powerful. We use templates in C++ because that’s all we have, but they are an ugly, ugly feature of the language. It would be much better to combine full reflection (like Java or C#) with the capability to invoke reflective code at compile time to get all the performance benefits of C++. Templates do allow you to invoke code at compile time, but through a horribly obfuscated functional style that is completely out of sync with the imperative style of C++. I can see how templates probably evolved into such a mess: starting as a simple extension of the language that allowed a programmer to bind a whole set of function instantiations at compile time, then someone realizing it’s Turing complete, and finally resulting in a metaprogramming abomination that never should have been.

Look at some typical simple real-world metaprogramming cases. For example, take a generic container, like std::vector, where you want to have a type-specialized function such as a copy routine that uses copy constructors for complex types, but uses an optimized memcpy routine for types where that is equivalent to invoking the copy constructor. For simple types, this is quite easy to do with C++ templates. But using it with more complex user-defined structs requires a type function such as IsMemCopyable which can determine if the copy constructor is equivalent to a memcpy. Abstractly, this is simple: the type is mem-copyable if it has a default copy constructor and all of its members are mem-copyable. However, it’s anything but simple to implement with templates, requiring all kinds of ugly functional code.
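(For comparison, here is what the template-land version of that dispatch looks like with later standards: C++11 ships the type function as std::is_trivially_copyable and C++17’s if constexpr makes the branch compile-time. Hand-rolling the recursive member check yourself is exactly the ugly part described above. The memcpy branch assumes the destination elements are already constructed.)

#include <cstddef>
#include <cstring>
#include <type_traits>

template <class T>
void SmartCopy(T* dst, const T* src, std::size_t n)
{
    if constexpr (std::is_trivially_copyable_v<T>) {
        // copy construction/assignment is equivalent to a raw memory copy
        std::memcpy(dst, src, n * sizeof(T));
    } else {
        // fall back to element-wise copy assignment
        for (std::size_t i = 0; i < n; ++i)
            dst[i] = src[i];
    }
}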
Now keep in mind I haven’t used Java in many years, and then only briefly, I’m not familiar with its reflection, and I know almost nothing of C#, although I understand both have reflection. In my ideal C++-with-reflection language, you could do this very simply and naturally with an imperative meta-function using reflection instead of templates (maybe this is like C#, but I digress):
struct vector {
    generic* start, end;
    generic* begin() {return start;}
    generic* end() {return end;}
    int size() {return end-start;}
    type vector(type datatype) {start::type = end::type = datatype*;}
};

void SmartCopy(vector& output, vector& input)
{
    if ( IsMemCopyable( typeof( *input.begin() ) ) ) {
        memcpy(output.begin(), input.begin(), input.size());
    }
    else {
        for_each(output, input) {output[i] = input[i];}
    }
}

bool IsMemCopyable(type dtype) {
    bool copyable = (dtype.CopyConstructor == memcpy);
    for_each(dtype.members) {
        copyable &= IsMemCopyable(dtype.members[i]);
    }
    return copyable;
}
The idea is that using reflection, you can unify compile-time and run-time metaprogramming into the same framework, with compile-time metaprogramming just becoming an important optimization. In my pseudo-C++ syntax, the reflection is accessible through type variables, which actually represent types themselves: PODs, structs, classes. Generic types are specified with the ‘generic’ keyword, instead of templates. Classes can be constructed simply through functions, and I added a new type of constructor, a class constructor which returns a type. This allows full metaprogramming, but all your metafunctions are still written in the same imperative language. Most importantly, the meta-functions are accessible at runtime, but can be evaluated at compile time as well, as an optimization. For example, to construct a vector instantiation, you would do so explicitly, by invoking a function:
vector(float) myfloats;
Here vector(float) actually calls a function which returns a type, which is more natural than templates. This type constructor for vector assigns the actual types of the two data pointers, and is the largest deviation from C++:
type vector(type datatype) {start::type = end::type = datatype*;}
Everything has a ::type, which can be set and manipulated just like any other data. Also, anything can be made a pointer or reference by adding the appropriate * or &.
if ( IsMemCopyable( typeof( *input.begin() ) ) ) {
There the * is used to get past the pointer returned by begin() to the underlying data.
When the compiler sees a static instantiation, such as:
vector(float) myfloats;
It knows that the type generated by vector’s type constructor is static and it can optimize the whole thing, compiling a particular instantiation of vector, just as in C++ templates. However, you could also do:
type dynamictype = figure_out_a_type();
vector(dynamictype) mystuff;
Where dynamictype is a type not known at compile time, which could be determined by other functions, loaded from disk, or whatever. It’s interesting to note that in this particular example, the unspecialized version is not all that much slower, as the branch in the copy function is invoked only once per copy, not once per copy constructor.
My little example is somewhat contrived and admittedly simple, but the power of reflective metaprogramming can make formerly complex big-systems tasks much simpler. Take for example the construction of a game’s world editor.
The world editor of a modern game engine is a very complex beast, but at its heart is a simple concept: it exposes a user interface to all of the game’s data, as well as tools to manipulate and process that data, which crunch it into an optimized form that must be streamed from disk into the game’s memory and thereafter parsed, decompressed, or what have you. Reflection allows the automated generation of GUI components from your code itself. Consider a simple example where you want to add dynamic light volumes to an engine. You may have something like this:
struct ConeLight {
    HDRcolorRGB intensity_;
    BoundedFloat(0,180) angleWidth_;
    WorldPosition pos_;
    Direction dir_;
    TextureRef cookie_;
    static HelpComment description_ = "A cone-shaped light with a projected texture.";
};
The editor could then automatically generate a GUI for creating and manipulating ConeLights just based on analysis of the type. The presence of a WorldPosition member would allow it to be placed in the world, the Direction member would allow a rotational control, and the intensity would use an HDR color picker control. BoundedFloat is actually a type constructor function, which sets custom min and max static members. The cookie_ member (a projected texture) would automatically get a texture locator control and would know about asset dependencies, and so on. Furthermore, custom annotations are possible through the static members. Complex data processing, compression, disk packing and storage, and so on could happen automatically, without having to write any custom code for each data type.
This isn’t revolutionary; in fact our game editor and generic database system are based on similar principles. The difference is that they are built on a complex, custom infrastructure that has to parse specially formatted C++ and Lua code to generate everything. I imagine most big game editors have a similar custom reflection system. It’s just a shame, because it would be so much easier and more powerful if it were built into the language.
Just to show how powerful metaprogramming could be, let’s go a step further and tackle the potentially hairy task of a graphics pipeline, from art assets down to the GPU command buffer. For our purposes, art packages expose several special asset structures, namely geometry, textures, and shaders. Materials, segments, meshes and the like are just custom structures built out of these core concepts. On the other hand, a GPU command buffer is typically built out of fundamental render calls which look something like this (again somewhat simplified):
error GPUDrawPrimitive(VertexShader* vshader, PixelShader* pshader, Primitive* prim, vector<Sampler> samplers, vector<float4> vconstants, vector<float4> pconstants);
Let’s start with a simpler example, that of a 2D screen-pass effect (which, these days, encompasses a lot).
Since this hypothetical reflexive C language could also feature JIT compilation, it could function as our scripting language as well; the effect could be coded completely in the editor or art package if desired.
struct RainEffect : public FullScreenEffect {
    function(RainPShader) pshader;
};

float4 RainPShader(RenderContext rcontext, Sampler(wrap) fallingRain, Sampler(wrap) surfaceRain, AnimFloat density, AnimFloat speed)
{
    // … do pixel shader stuff
}

// where the RenderContext is the typical global collection of stuff
struct RenderContext {
    Sampler(clamp) Zbuffer;
    Sampler(clamp) HDRframebuffer;
    float curtime;
    // etc …
};
The ‘function’ keyword specifies a function object, much like a type object with the parameters as members. The function is statically bound to RainPShader in this example. The GUI can display the appropriate interface for this shader, and it can be controlled from the editor by inspecting the parameters, including those of the function object. The base class FullScreenEffect has the quad geometry and the other glue. The pixel shader itself would be written in this reflexive C language, with a straightforward metaprogram to convert it into HLSL/Cg and compile as needed for the platform.
Now here is the interesting part: all the code required to actually render this effect on the GPU can be generated automatically from the parameter type information embedded in the RainPShader function object. The generation of the appropriate GPUDrawPrimitive function instance is thus just another metaprogram task, which uses reflection to pack all the samplers into the appropriate state, set the textures, pack all the float4s and floats into registers, and so on. For a screen effect, invoking this translator function at runtime wouldn’t be too much of a performance hit, but for lower level draw calls you’d want to instantiate (optimize) it offline for the particular platform.
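A rough sketch of what such a translator boils down to, written with ordinary overload resolution standing in for reflection (DrawCallState, bind_param, and pack_params are placeholders of my own invention, not a real engine API):

#include <vector>

struct Sampler { int id; };
struct float4  { float x, y, z, w; };

struct DrawCallState {
    std::vector<Sampler> samplers;
    std::vector<float4>  pconstants;
};

// Each parameter type knows which GPU state bucket it packs into.
inline void bind_param(DrawCallState& s, const Sampler& t) { s.samplers.push_back(t); }
inline void bind_param(DrawCallState& s, const float4& c)  { s.pconstants.push_back(c); }
inline void bind_param(DrawCallState& s, float f)          { s.pconstants.push_back({f, 0, 0, 0}); }

// With reflection, this loop over a shader function's parameters would be
// generated from the function object itself; a variadic template stands in here.
template <typename... Params>
DrawCallState pack_params(const Params&... ps)
{
    DrawCallState s;
    (bind_param(s, ps), ...);   // fold over the parameter list in order
    return s;
}

The difference is that with real reflection the pack_params call would never be spelled out by hand at each call site; it would be derived from the shader’s parameter list automatically.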
I use that example because I actually created a very similar automatic draw call generator for 2D screen effects, but all done through templates. It ended up looking more like how CUDA is implemented, and it also allowed compilation of the code as HLSL or C++ for debugging. It was doable, but involved a lot of ugly templates and macros. I built that system to simplify procedural surface operators for geometry image terrain.
But anyway, you get the idea: going from a screen effect, you could then tackle 3D geometry and make a completely generic, data-driven art pipeline, all based on reflective functions that parse data and translate or reorganize it. Some art pipelines are actually built on this principle already, but oh my, wouldn’t it be easier in a more advanced, reflective language.

The Next Generation of Gaming

The current (seventh) home game console generation will probably be the last. I view this as a very good thing, as it really was a tough one, economically, for most game developers. You could blame that in part on the inordinate success of Nintendo this round with its essentially sixth-generation hardware, funky controller, and fun mass-market games. But that wouldn’t be fair. If anything, they contributed the most to the market’s expansion, and although they certainly took away a little end revenue from the traditional consoles and developers, the 360 and PS3 are doing fine, in both hardware and software sales. No, the real problem is our swollen development budgets, as we spend more and more money just to keep up with the competition, all fighting for a revenue pie which hasn’t grown much, if at all.

I hope we can correct that over the upcoming years with the next generation. It’s not that we’ll spend much less on the AAA titles, but we’ll spend it more efficiently, produce games more quickly, and make more total revenue as we further expand the entire industry. By gaining back much of the efficiency lost in the transition to the 7th generation, and more to boot, we’ll be able to produce far more games and reach much higher quality bars. We can accomplish all of this by replacing the home consoles with dumb terminals and moving our software out onto data centers.

How will moving computation out into the cloud change everything? Really it comes down to simple economics. In a previous post, I analyzed some of these economics from the perspective of an on-demand service like OnLive. But let’s look at it again in a simpler fashion, and imagine a service that rented out servers on demand, by the hour or minute. This is the more general form of cloud computing, sometimes called grid computing, where the idea is to simply turn computation into a commodity, like power or water. A data center would then rent out its servers to the highest bidder. Economic competition would push the price of computation to settle on the cost to the data center plus a reasonable profit margin. (Unlike power, water, and internet commodities, there would be less inherent monopoly risk, as no fixed lines are required beyond the internet connection itself.)

So in this model, the developer could make their game available to any gamer and any device around the world by renting computation from data centers near customers just as it is needed. The retailer, of course, is cut out. The publisher is still important as the financier and marketer, although the larger developers could take this on themselves, as some already have. Most importantly, the end consumer can play the game on whatever device they have, as the device only needs to receive and decompress a video stream. The developer/publisher then pays the data center for the rented computation, and pays only as needed, as each customer comes in and jumps into a game. So how does this compare to our current economic model?

A server in a data center can be much more efficient than a home console. It only needs the core computational system: CPU/GPU (which are soon merging anyway) and RAM. Storage can be shared amongst many servers, so it is negligible (some per game instance is required, but it’s reasonably minimal). So a high end server core could be had for around $1,000 or so at today’s prices. Even if active only 10 hours per day on average, that generates roughly 3,600 hours of active computation per year. Amortize that over three years of lifespan (still much less than a console generation), and you get about ten cents per hour of computation. Even if it burns 500 watts of power (insane) and another 500 watts to cool, those together add only another ten cents per hour. So it’s under 25 cents per hour in terms of intrinsic cost (and this is for a state-of-the-art rig, dual GPU, etc; much less for lower end hardware). This cost will hold steady into the future as games use more and more computation. Obviously the cost of running old games will decrease exponentially, but new games will always want to push the high end.

The more variable cost is the cost of bandwidth, and the extra computation to compress the video stream in real time. These used to be high, but are falling exponentially as video streaming comes of age. Yes, we will want to push the resolution up from 720p to 1080p, but this will happen slowly, and further resolution increases are getting pointless for typical TV setups (yes, for a PC monitor the diminishing return is a little farther off, but still). But what is this cost right now? Bulk bandwidth costs about $10 per megabit/s of dedicated bandwidth per month, or just three cents per hour in our model assuming 300 active server hours in a month. To stream 720p video with H.264 compression, you need about 2 megabits per second of average bandwidth (which is what matters for the data center). The peak bandwidth requirement is higher, but that completely smooths out when you have many users. So that’s just $0.06/hour for a 720p stream, or $0.12/hour for a 1080p stream. The crazy interesting thing is that these bandwidth prices ($10/Mbps per month) are as of the beginning of this year, and are falling by about 30-40% per year. So the bandwidth only became economically feasible this year, and it’s only going to get cheaper. By 2012 these prices will probably have fallen by half again, and streaming even 1080p will be dirt cheap. This is critical for making any predictions or plans about where this is all heading.

So adding up all the costs today, we get somewhere around $0.20-0.30 per hour for a high end rig streaming 720p, and 1080p would only be a little more. This means that a profitable data center could charge just $0.50 per hour to rent out a high end computing slot, and $0.25 per hour or a little less for more economical hardware (but still many times faster than the current consoles). So twenty hours of a high end graphics blockbuster shooter would cost $10 in server infrastructure costs. That’s pretty cheap. I think it would be a great thing for the industry if these costs were simply passed on to the consumer, and they were given some choice. Without the retailer taking almost half of the revenue, the developer and publisher stand to make a killing. And from the consumer’s perspective, the game could cost about the same, but you don’t have any significant hardware cost, or even better, you pay for the hardware cost as you see fit, hourly or monthly or whatever. If you are playing 40 hours a week of an MMO or serious multiplayer game, that $0.50 per hour might be a bit much, but you could then choose to run it on lower end hardware if you want to save some money. But actually, as I’ll get to some other time, MMO engines designed for the cloud could be super efficient, so much more so than single player engines that they could use far less hardware power per player. But anyway, it’d be the consumer’s choice, ideally.
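Collecting the arithmetic above in one place; a quick sketch using the rough numbers from the text (all of the constants are the post’s 2009-era assumptions, not measured prices):

#include <cstdio>

int main()
{
    // Rough assumptions from the text, circa 2009.
    const double server_cost     = 1000.0;       // $ for a high end server core
    const double active_hours    = 300.0 * 36;   // ~300 active hours/month over 3 years
    const double power_per_hour  = 0.10;         // ~1 kW of power + cooling at ~$0.10/kWh
    const double bw_price_month  = 10.0;         // $ per Mbps of dedicated bandwidth per month
    const double hours_per_month = 300.0;        // active server hours per month
    const double stream_mbps     = 2.0;          // average H.264 720p stream

    double hw_per_hour = server_cost / active_hours + power_per_hour;
    double bw_per_hour = bw_price_month / hours_per_month * stream_mbps;

    std::printf("hardware: $%.2f/h  bandwidth: $%.2f/h  total: ~$%.2f/h\n",
                hw_per_hour, bw_per_hour, hw_per_hour + bw_per_hour);
    return 0;
}

That lands in the $0.20-0.30 per hour range quoted above for a 720p stream; doubling the stream rate for 1080p adds roughly another six cents.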

This business model makes more sense from all kinds of angles. It allows big budget, high profile, story driven games to release more like films, where you play them on crazy super-high end hardware, even hardware that could never exist at home (like 8 GPUs or something stupid), maybe paying $10 for the first two hours of the game to experience something insanely unique. There’s so much potential, and even at the low price of $0.25-$0.50 per hour for a current mid-2009 high end rig, you’d have an order of magnitude more computation than we are currently using on the consoles. This really is going to be a game changer, but to take advantage of it we need to change as developers.

The main opportunity I see with cloud computing here is to reduce our costs, or rather, to improve our efficiency. We need our programmers and designers to develop more systems with less code and effort in less time, and our artists to build super detailed worlds rapidly. I think that redesigning our core tech and tools premises is the route to achieve this.

The basic server setup we’re looking at for this first cloud generation a few years out is going to be some form of multi-teraflop, massively multi-threaded, general GPU-ish device, with gigs of RAM and, perhaps more importantly, fast access to many terabytes of shared RAID storage. If Larrabee or the rumours about NVidia’s GT300 are any indication, this GPU will really just be a massively parallel CPU with wide SIMD lanes that are easy to use (or even automatic). It will probably also have a smaller number of traditional cores, possibly with access to even more memory, like a current PC. Most importantly, each of these servers will be on a very high speed network, densely packed in with dozens or hundreds of similar nearby units. Each of these capabilities by itself is a major upgrade from what we are used to, but taken all together it becomes a massive break from the past. This is nothing like our current hardware.

Most developers have struggled to get game engines pipelined across just the handful of hardware threads on current consoles. Very few have developed toolchains that embrace or take much advantage of many cores. From a programming standpoint, the key to this next generation is embracing the sea-of-threads model across your entire codebase, from your gamecode to your rendering engine to your tools themselves, and using all of this power to speed up your development cycle.

From a general gameplay codebase standpoint, I could see (or would like to see) traditional C++ giving way to something more powerful. At the very least, I’d like to see general databases, full reflection, and at least some automatic memory management, such as reference counting. Reflection alone could pretty radically alter the way you design a codebase, but that’s another story for another day. We don’t need these little 10% speedups anymore; we’ll need the single mega 10,000% speedup you get from using hundreds or thousands of threads. Obviously, data parallelization is the only logical option. Modifying C++, or outright moving to a language with these features that also has dramatically faster compilation and link efficiency, could be an option.

In terms of the core rendering and physics tech, more general purpose algorithms will replace the many specialized systems that we currently have. For example, in physics, a logical upcoming direction is to unify rigid body physics with particle fluid simulation in a system that simulates both rigid and soft bodies as large collections of connected spheres, running a massively parallel grid simulation. Even without that, just partitioning space amongst many threads is a pretty straightforward way to scale physics.
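A minimal sketch of that space-partitioning idea, assuming a uniform grid sliced into slabs and ignoring the hard parts (bodies crossing slab boundaries, load imbalance):

#include <algorithm>
#include <thread>
#include <vector>

// Partition a uniform simulation grid into horizontal slabs, one per worker
// thread. Each thread steps only the cells in its own slab; migration between
// slabs and boundary interactions are deliberately ignored in this sketch.
void StepPhysicsParallel(int gridCellsZ, int numThreads,
                         void (*stepSlab)(int zBegin, int zEnd))
{
    std::vector<std::thread> workers;
    int slab = (gridCellsZ + numThreads - 1) / numThreads;
    for (int t = 0; t < numThreads; ++t) {
        int zBegin = t * slab;
        int zEnd   = std::min(gridCellsZ, zBegin + slab);
        workers.emplace_back(stepSlab, zBegin, zEnd);
    }
    for (auto& w : workers) w.join();
}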

For rendering, I see the many specialized subsystems of modern rasterizers, such as terrain, foliage, shadow maps, water, decals, LOD chains, cubemaps, etc., giving way to a more general approach like octree volumes that simultaneously handles many phenomena.

But more importantly, we’ll want to move to data structures and algorithms that support rapid art pipelines. This is one of the biggest current challenges in production, and where we can get the most advantage in this upcoming generation. Every artist or designer’s click and virtual brush stroke costs money, and we need to allow them to do much more with less effort. This is where novel structures like octree volumes will really shine, especially combined with terabytes of server side storage, allowing more or less unlimited control of surfaces, object densities, and so on without any of the typical performance considerations. Artists will have far fewer (if any) technical constraints to worry about and can just focus on shaping the world where and how they want.

Winning my own little battle against Cuda

I’ve won a nice little battle in my personal struggle with Cuda recently by hacking in dynamic texture memory updates.

Unfortunately for me and my little Cuda graphics prototypes, Cuda was designed for general (non-graphics) computation. Texture memory access was put in, but as something of an afterthought. Textures in Cuda are read-only: kernels cannot write to them (yes, they added limited support for 2D texturing from pitch-linear memory, but that’s useless here), which is just plain backwards. Why? Because Cuda allows general memory writes! There is nothing sacred about texture memory, and the hardware certainly supports writing to it from a shader/kernel, and has since render-to-texture appeared in what, DirectX 7? But not in Cuda.
Cuda’s conception of writing to texture memory involves writing to regular memory and then calling a Cuda function to perform a sanctioned copy from the temporary buffer into the texture. That’s fine for a lot of little demos, but not for large scale dynamic volumes. For an octree/voxel tracing app like mine, you basically fill up your GPU’s memory with a huge volume texture and accessory octree data, broken up into chunks which can be fully managed by the GPU. You then need to be able to modify these chunks as the view changes or animation changes sections of the volume. Cuda would have you do this by double buffering the entire thing with a big scratch buffer and doing a copy from the linear scratchpad to the cache-tiled 3D texture every frame. That copy function, by the way, achieves a whopping 3 GB/s on my 8800. Useless.
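For reference, the sanctioned update path looks roughly like this; a sketch with made-up dimensions and a single monolithic volume rather than managed chunks:

#include <cuda_runtime.h>

// The official way to update a 3D texture: write into linear scratch memory,
// then ask the driver to copy it into the tiled cudaArray backing the texture.
void UpdateVolumeSanctioned(cudaArray* volumeArray, void* d_scratch,
                            int W, int H, int D)
{
    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr(d_scratch, W * sizeof(unsigned char), W, H);
    p.dstArray = volumeArray;
    p.extent   = make_cudaExtent(W, H, D);
    p.kind     = cudaMemcpyDeviceToDevice;
    cudaMemcpy3D(&p);   // this is the copy that crawls along at a few GB/s
}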
So, after considering various radical alternatives that would avoid doing what I really want to do (which is use native 3D trilinear filtering in a volume texture I can write to anywhere), I realized I should just wait and port to DX11 compute shaders, which will hopefully allow me to do the right thing (and should also allow access to DXTC volumes, which will probably be important).
In the meantime, I decided to hack my way around the Cuda API and write to my volume texture anyway. This isn’t as bad as it sounds, because the GPU doesn’t have any fancy write-protection page faults, so a custom kernel can write anywhere in memory. But you have to know what you’re doing. The task was thus to figure out exactly where it would allocate my volume in GPU memory, and exactly how the GPU’s tiled addressing scheme works.
I did this with a brute force search. The drivers do extensive bounds checking even in release and explode when you attempt to circumvent them, so I wrote a memory-laundering routine to shuffle illegitimate, arbitrary GPU memory into legitimate allocations the driver would accept. Then I used this to snoop the GPU memory, which allowed a brute force search: use Cuda’s routines to copy from CPU linear memory into the tiled volume texture, then snoop the GPU memory to find out exactly where my magic byte ended up, revealing one mapping of an XYZ coordinate and linear address to a tiled address on the GPU (or really, the inverse).
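The snooping step itself is conceptually simple once you have a raw device pointer covering the suspect region (getting that pointer past the driver is the laundering part I’m glossing over); a sketch of the search kernel, assuming the magic word occurs exactly once in the scanned range:

#include <cuda_runtime.h>

// Scan raw GPU memory for the magic word that was copied into the volume
// texture, and record the (word) offset where it was found.
__global__ void FindMagicWord(const unsigned int* mem, unsigned int numWords,
                              unsigned int magic, unsigned int* foundOffset)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numWords && mem[i] == magic)
        *foundOffset = i;   // unique hit, so the unsynchronized write is fine
}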
For the strangely curious, here is my currently crappy function for the inverse mapping (from GPU texture address to 3D position):

inline int3 GuessInvTile(uint outaddr)
{
    int3 outpos = int3_(0, 0, 0);

    // low 16 address bits: swizzle within a 64x32x32 chunk
    outpos.x |= ((outaddr>>0) & (16-1)) << 0;
    outpos.y |= ((outaddr>>4) & 1) << 0;
    outpos.x |= ((outaddr>>5) & 1) << 4;
    outpos.y |= ((outaddr>>6) & 1) << 1;
    outpos.x |= ((outaddr>>7) & 1) << 5;
    outpos.y |= ((outaddr>>8) & 1) << 2;
    outpos.y |= ((outaddr>>9) & 1) << 3;
    outpos.y |= ((outaddr>>10) & 1) << 4;
    outpos.z |= ((outaddr>>11) & 1) << 0;
    outpos.z |= ((outaddr>>12) & 1) << 1;
    outpos.z |= ((outaddr>>13) & 1) << 2;
    outpos.z |= ((outaddr>>14) & 1) << 3;
    outpos.z |= ((outaddr>>15) & 1) << 4;

    // remaining bits: linear chunk fill in X, then Y, then Z
    // (volume-size specific, see the caveat below)
    outpos.x |= ((outaddr>>16) & 1) << 6;
    outpos.x |= ((outaddr>>17) & 1) << 7;

    outpos.y |= ((outaddr>>18) & 1) << 5;
    outpos.y |= ((outaddr>>19) & 1) << 6;
    outpos.y |= ((outaddr>>20) & 1) << 7;

    outpos.z |= ((outaddr>>21) & 1) << 5;
    outpos.z |= ((outaddr>>22) & 1) << 6;
    outpos.z |= ((outaddr>>23) & 1) << 7;

    return outpos;
}

I’m sure the parts after the 15th output address bit are specific to the volume size and thus are wrong as stated (they should be computed with division). So really it does a custom swizzle within a 64x32x32 chunk, and then fills the volume with these chunks in a plain X, Y, Z linear order. It’s curious that it tiles 16 over in X first, and then fills in a 64×32 2D tile before even starting on Z. This means writing in spans of 16 aligned to the X direction is most efficient for scatter, which is actually kind of annoying; a Z-curve tiling would be more convenient. The X-Y alignment is also a little strange: it means that general 3D fetches are memory-equivalent to two 2D fetches in terms of bandwidth and cache.
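For completeness, here is the corresponding forward mapping implied by the bit layout above (3D position to tiled address), using the same int3/uint conventions, and just as likely to be wrong past bit 15:

inline uint GuessTile(int3 pos)
{
    uint addr = 0;
    // within-chunk swizzle (inverse of the low 16 bits above)
    addr |= ((pos.x >> 0) & (16-1)) << 0;
    addr |= ((pos.y >> 0) & 1) << 4;
    addr |= ((pos.x >> 4) & 1) << 5;
    addr |= ((pos.y >> 1) & 1) << 6;
    addr |= ((pos.x >> 5) & 1) << 7;
    addr |= ((pos.y >> 2) & 1) << 8;
    addr |= ((pos.y >> 3) & 1) << 9;
    addr |= ((pos.y >> 4) & 1) << 10;
    addr |= ((pos.z >> 0) & (32-1)) << 11;
    // chunk fill (again volume-size specific)
    addr |= ((pos.x >> 6) & 1) << 16;
    addr |= ((pos.x >> 7) & 1) << 17;
    addr |= ((pos.y >> 5) & 1) << 18;
    addr |= ((pos.y >> 6) & 1) << 19;
    addr |= ((pos.y >> 7) & 1) << 20;
    addr |= ((pos.z >> 5) & 1) << 21;
    addr |= ((pos.z >> 6) & 1) << 22;
    addr |= ((pos.z >> 7) & 1) << 23;
    return addr;
}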

A little idea about compressing Virtual Textures

I’ve spent a good deal of time working on virtual textures, but took the approach of procedural generation, using the quadtree management system to get a large (10-30x) speedup through frame coherence vs having to generate the entire surface every frame, which would be very expensive.
However, I’ve also always been interested in compressing and storing virtual texture data on disk, not as a complete replacement for procedural generation, but as a complement (if a particular quadtree node gets too expensive in terms of the procedural ops required to generate it, you could then store its explicit data). But compression is an interesting challenge.
Lately it seems that a lot of what I do at work is geared towards finding ways to avoid writing new code, and in that spirit, this morning on the way to work I started thinking about applying video compression to virtual textures.
Take something like x264 and ‘trick’ it into compressing a large 256k x 256k virtual texture. The raw data is roughly comparable to a movie, and you could tile out pages from 2D to 1D to preserve locality, organizing it into virtual ‘frames’. Most of the code wouldn’t even know the difference. The motion compensation search code in x264 is more general than ‘motion compensation’ would imply: it simply searches for matching macroblocks which can be used for block prediction. A huge virtual surface texture exhibits excessive spatial correlation, and properly tiled into, say, a 512x512x100000 3D (or video) layout, that spatial correlation becomes temporal correlation, and would probably be easier to compress than most videos. So you could get an additional ~30x benefit on top of raw image compression, fitting that massive virtual texture into a gigabyte or less on disk.
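A sketch of the tiling step: choose a page ordering that turns spatial neighbors into temporal neighbors before handing the ‘frames’ to the codec. The page size and the Morton ordering here are my own choices of illustration, not a claim about how it should be done:

#include <cstdint>

// Map a (pageX, pageY) coordinate of the virtual texture to a video frame
// index along a Morton/Z-curve, so that spatially adjacent pages land on
// nearby frames and spatial correlation shows up to the codec as temporal
// correlation between "frames".
uint32_t PageToFrame(uint32_t pageX, uint32_t pageY, uint32_t bitsPerAxis)
{
    uint32_t frame = 0;
    for (uint32_t b = 0; b < bitsPerAxis; ++b) {
        frame |= ((pageX >> b) & 1u) << (2 * b);
        frame |= ((pageY >> b) & 1u) << (2 * b + 1);
    }
    return frame;
}

// e.g. a 256k x 256k texture in 512x512 pages is 512 pages per side,
// so bitsPerAxis = 9 and the "video" is 512*512 = 262,144 frames long.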
Even better, the decompression and compression are already really fast and solid, and maybe you could even modify some bits of a video system to get fast live edits, where it quickly recompresses a small cut of video (corresponding to a local 2D texture region) without having to rebuild the whole thing. And even if you did have to rebuild the whole thing, you could do that in less than two hours using x264 right now on a 4-core machine, and in much less time using a distributed farm or a GPU.
I’m curious how this compares to the compression used in Id tech 5. It’s also interesting to think how this scheme exposes the similarity in data requirements between a game with fully unique surface texturing and a film. Assuming a perfect 1:1 pixel/texel ratio and continuous but smooth scenery change for the game, they become very similar indeed.
I look forward to seeing how the Id tech 5 stuff turns out. I imagine their terrains will look great. On the other hand, a lot of modern console games now have great looking terrain environments, but are probably using somewhat simpler techniques. (I say somewhat only because all the LOD, stitching, blending and so on issues encountered when using the ‘standard’ hodgepodge of techniques can get quite complex.)

Rasterization vs Tracing and the theoretical worst case scene

Rasterizer engines don’t have to worry about the thread-pixel scheduling problem, as it’s handled behind the scenes by the fixed function rasterizer hardware. With rasterization, GPU threads are mapped to the object data first (vertex vectors), and then scanned into pixel vector work queues, whose many-to-one mapping to output pixels is synchronized by dedicated hardware.

A tracing engine, on the other hand, explicitly allocates threads to output pixels, and then loops through the one-to-many mapping of object data which intersects the pixel’s ray, which imposes some new performance pitfalls.
But if you boil it down to simplicity, the rasterization vs ray tracing divide is really just a difference in loop ordering and mapping:
for each object
    for each pixel ray intersecting object
        if ray forms closest intersection
            store intersection

vs

for each pixel
    for each object intersecting pixel's ray
        if ray forms closest intersection
            store intersection
The real meat, of course, is in the data structures, which determine exactly what these ‘for each’ loops entail. Typically, pixels are stored in a regular grid, and there is much more total object data than pixels, so the rasterization approach is simpler: mapping from objects to rays is typically easier than mapping from rays to objects. Conversely, if your scene is extremely simple, such as a single sphere or a regular grid of objects, the tracing approach is equally simple. If the pixel rays do not correspond to a regular grid, rasterization becomes more complex. If you want to include reflections and secondary ray effects, the mapping from objects to rays becomes complex.
Once the object count becomes very large, much larger than the number of pixels, the two schemes become increasingly similar. Why? Because the core problem becomes one of dataset management, and the optimal solution is output sensitive. So the problem boils down to finding and managing a minimal object working set: the subset of the massive object database that is necessary to render a single frame.
A massively complex scene is the type of scene you have with a fully unique dataset. In a triangle based engine, this is perhaps a surface quadtree or AABB tree combined with a texture quadtree for virtual textures, a la Id tech 5. In a voxel engine, this would be an octree with voxel bricks. But the data structure is somewhat independent of whether you trace or rasterize, and you could even mix the two. Either scheme will require a crucial visibility step which determines which subsets of the tree are required for rendering the frame. Furthermore, whether traced or rasterized, the dataset should be about the same, and thus the performance limit is about the same: proportional to the working dataset size.
Which gets to an interesting question: what is the theoretical working set size? If you ignore multi-sample anti-aliasing and anisotropic sampling, you need about one properly LOD-filtered object primitive (voxel, triangle+texel, whatever) per pixel. Which is simple, surprisingly small, and of course somewhat useless, for naturally with engines at that level of complexity anti-aliasing and anisotropic sampling are important. Anti-aliasing doesn’t by itself add much to the requirement, but the anisotropy-isotropy issue turns out to be a major problem.
Consider even the ‘simple’ case of a near-infinite ground plane. Sure, it’s naturally representable by a few triangles, but let’s assume it has tiny displacements all over and we want to represent it exactly, without cheats. A perfect render. The octree or quadtree schemes are both isotropic, so to get down to pixel-sized primitives, they must subdivide down to the radius of a pixel cone. Unfortunately, each such pixel cone will touch many primitives: the cone has near infinite length, and at a grazing angle, nearly parallel to the surface, it will intersect a near infinite number of primitives. But what’s the real worst case?
The solution actually came to me from shadow mapping, which has a similar subproblem in mapping flat regular grids to pixels. Consider a series of cascaded shadow maps which perfectly cover the ground plane. They line up horizontally with the screen along one dimension, and align with depth along the other, near-perfectly covering the set of pixels. How many such cascades do you need? It turns out you need about log2(maxdist), where maxdist is the extent of the far plane relative to the near plane. Assuming a realistic far plane of 16 kilometers and a 1 meter near plane, log2(16000) works out to about 14 cascades. So in coarse approximation, anisotropy increases the required object density cost for this scene by a factor of roughly 10x-20x. Ouch!
This also gives us the worst possible scene, which is just that single flat ground plane scaled up: a series of maximum-length planes, each lying edge-on along a slice of eye rays, or alternatively, a vicious series of pixel tunnels aligned to the pixel cones. The worst case now is much, much worse than ~10-20x the number of pixels. These scenes are easier to imagine and encounter with an orthographic projection, and thankfully won’t come up with a perspective projection very often, but they are still frightening.
It would be nice if we could ‘cheat’, but it’s not exactly clear how to do that. Typical triangle rasterizers can cheat in anisotropic texture sampling, but it’s not clear how to do that in a tree-based subdivision system, whether quadtree or octree or whatever. There may be some option with anisotropic trees like KD-trees, but they would have to constantly adapt as the camera moves. Detecting glancing angles in a ray tracer and skipping more distance is also not a clear win, as it doesn’t reduce the working set size and breaks memory coherency.