Category Archives: Technology

Intelligence Amplification: Hype and Reality

The future rarely turns out quite as we expect.  Pop sci futurists of a generation ago expected us to be flying to work by the dawn of the 21st century.  They were almost right: both flying cars and jet packs have just recently moved into the realm of realization.  But there is a huge gap between possible and economically practical, between prototype and mass product.

Undoubtedly many of the technologies futurists promote today will fare no better.  Some transhumanists, aware that the very future they champion may itself render them obsolete, rest their hopes on human intelligence amplification.  Unfortunately not all future technologies are equally likely.  Between brain implants, nanobots, and uploading, only the latter has long-term competitive viability, but it is arguably more a technology of posthuman transformation than of human augmentation.  The only form of strict intelligence amplification that one should bet much on is software itself (namely, AI software).

Brain Implants:

Implanting circuitry into the human brain already has uses today in correcting some simple but serious conditions, and we should have little doubt that eventually this technology could grow into full-scale cognitive enhancement: it is at least physically possible.  That being said, there are significant technical challenges in creating effective and safe interfaces between neural tissue and dense electronics at the bandwidth capacities required to actually boost mental capability.  Only a small fraction of possible technologies are feasible, and only a fraction of those actually are economically competitive.

Before embarking on a plan for brain augmentation, let’s briefly consider the simpler task of augmenting a computer.  At a high level of abstraction, the general von Neumann architecture separates memory and computation.  Memory is programmatically accessed and uniformly addressable.  Processors in modern parallel systems are likewise usually modular and communicate with other processors and memory through clearly defined interconnect channels that are also typically uniformly addressable and time-shared through some standardized protocol.  In other words each component of the system, whether processor or memory, can ‘talk’ to other components in a well defined language.  The decoupling and independence of each module, along with the clearly delineated communication network, makes upgrading components rather straightforward.
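
To make the contrast concrete, here is a minimal sketch (hypothetical names, Python standing in for real hardware) of why uniform addressing and clean interfaces make component upgrades trivial: swapping a module touches one table entry, and nothing else in the system needs rewiring.

```python
class Bus:
    """Toy interconnect: every module is reachable through one uniform interface."""
    def __init__(self):
        self.modules = {}                 # name -> attached module

    def attach(self, name, module):
        self.modules[name] = module       # upgrading = replacing this one entry

    def read(self, name, addr):
        return self.modules[name].read(addr)

class RAM:
    def __init__(self, size):
        self.cells = [0] * size
    def read(self, addr):
        return self.cells[addr]
    def write(self, addr, val):
        self.cells[addr] = val

bus = Bus()
bus.attach("mem", RAM(1024))
bus.modules["mem"].write(42, 7)
assert bus.read("mem", 42) == 7

# 'Upgrade' to a larger memory: no other component is touched.
bigger = RAM(4096)
bigger.write(42, 7)
bus.attach("mem", bigger)
assert bus.read("mem", 42) == 7
```

The brain, as described next, offers no such clean seam to swap across.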

The brain is delineated into many functional modules, but the wiring diagram is massively dense and chaotic.  The white matter, the bulk interior of the brain, is composed of a huge massed tangle of interconnect fabric.  And unlike in typical computer systems, most of those connections appear to be point to point.  If two brain regions need to talk to each other, typically there are great masses of dedicated wires connecting them.  Part of the need for all that wiring stems from the slow speed of the brain.  It has a huge computational capacity but the individual components are extremely slow and dispersed, so the interconnection needs are immense.

The brain’s massively messy interconnection fabric poses a grand challenge for advanced cybernetic interfaces.  It has only a few concentrated conduits which external interfaces could easily take advantage of: namely the main sensory and motor pathways such as the optic nerve, the auditory paths, and the spinal cord.  But if the aim of cognitive enhancement is simply to interface at the level of existing sensory inputs, then what is the real advantage over traditional interfaces?  Assuming one has an intact visual system, there really is little to no advantage in directly connecting to the early visual cortex or the optic nerve over just beaming images in through the eye.

Serious cognitive enhancement would come only through outright replacement of brain subsystems and/or through significant rewiring to allow cortical regions to redirect processing to more capable electronic modules.  Due to the wiring challenge, the scale and scope of the required surgery is daunting, and it is not yet clear that it will ever be economically feasible without some tremendous nanotech-level breakthroughs.

However, these technical challenges are ultimately a moot point.  Even when we do have the technology for vastly powerful superhuman brain implants, it will never be more net energy/cost effective than spending the same resources on a pure computer hardware AI system.

For the range of computational problems it is specialized for, the human brain is more energy efficient than today’s computers, but largely because it runs at tremendously slow speeds compared to our silicon electronics, and computational energy demands scale with speed.  We have already crossed the miniaturization threshold where our transistors are smaller than the smallest synaptic computing elements in the brain[1].  The outright advantage of the brain (at least in comparison to normal computers) is now mainly in the realm of sheer circuitry size (area equivalent to many thousands of current chips), and will not last beyond this decade.
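
A rough back-of-envelope, using the feature sizes from footnote [1] and the cortical sheet area cited later in this post (all figures are order-of-magnitude assumptions):

```python
# Feature sizes: synaptic terminals vs 2011 silicon (see footnote [1]).
synapse_feature_nm = 100      # guesstimated synaptic terminal diameter
chip_feature_nm = 64          # full pitch at the 32-nm half-pitch node
assert chip_feature_nm < synapse_feature_nm   # transistors are now smaller

# The brain's remaining edge: sheer circuit area.
cortex_area_cm2 = 2500        # human cortical sheet (cited later in this post)
die_area_cm2 = 2              # large 2011 processor die, ~2 cm^2 (assumption)
print(cortex_area_cm2 / die_area_cm2)   # ~1000+ dies of raw sheet area alone;
                                        # synapses also pack in 3D, pushing the
                                        # equivalent into the thousands of chips
```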

So when we finally master all of the complexity of interfacing dense electronics with neural tissue, and we somehow find a way to insert large chunks of that into a living organic brain without damaging it beyond repair, and we somehow manage to expel all of the extra waste heat without frying the brain (even though it already runs with little to no spare heat capacity), it will still always be vastly less efficient than just building an AI system out of the same electronics!

We don’t build new supercomputers by dusting off old Crays to upgrade them via ‘interfacing’ with much faster new chips.


Ray Kurzweil puts much faith in the hope of nanobots swarming through our blood, allowing us to interface more ‘naturally’ with external computers while upgrading and repairing neural tissue to boot.  There is undoubtedly much value in such technology, even if there is good reason to be highly skeptical about the timeline of nanobot development.  We have a long predictable trajectory in traditional computer technology and good reasons to place reasonable faith in the ITRS roadmap.  Drexlerian nanobots, on the other hand, have been hyped for a few decades now but if anything seem even farther away.

Tissue repairing nanobots of some form seem eventually likely (as is all technology given an eventual Singularity), but ultimately they are no different from traditional implants in the final analysis.  Even if possible, they are extremely unlikely to be the most efficient form of computer (because of the extra complexity constraint of mobility).  And if nanobots somehow turned out to be the most efficient form for future computers, then it would still be more efficient to just build a supercomputer AI out of pure nanobots!

Ultimately then the future utility of nanobots comes down to their potential for ‘soft uploading’.  In this regard they will just be a transitional form: a human would use nanobots to upload, and then move into a faster, more energy efficient substrate.  But even in this usage nanobots may be unlikely, as nanobots are a more complex option in the space of uploading technologies: destructive scanning techniques will probably be more viable.


Uploading is the ultimate transhumanist goal, at least for those who are aware of the choices and comfortable with the philosophical questions concerning self-hood.  But at this point in time it is little more than a dream technology.  Its development depends on significant advances not only in computing, but also in automated 3D scanning technologies, which currently attract insignificant levels of research funding.

The timeline for future technologies can be analyzed in terms of requirement sets.  Uploading requires computing technology sufficient for at least human-level AI, and possibly much more. [2]  Moreover, it also probably requires technology powerful enough to economically deconstruct and scan around ~1000 cubic centimeters of fragile neural tissue down to resolution sufficient for imaging synaptic connection strengths (likely nanometer-level resolution), recovering all of the essential information into digital storage, saving a soul of pure information from its shell of flesh, so to speak.
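
A quick sense of scale for that scanning problem, under an assumed 10-nm voxel size (the exact resolution required is an open question):

```python
volume_cm3 = 1000             # ~whole-brain tissue volume from the text
voxel_nm = 10                 # assumed scan resolution, nanometer scale
voxel_cm = voxel_nm * 1e-7    # 1 nm = 1e-7 cm
voxels = volume_cm3 / voxel_cm ** 3
print(f"{voxels:.0e} voxels") # ~1e21 raw voxels to image
```

Even at one byte per voxel that is around a zettabyte of raw imagery, which is why a practical scanner would have to reduce the data down to the ~100 terabytes of essential connectome information on the fly rather than store it all.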

The economic utility of uploading thus boils down to a couple of simple yet uncomfortable questions: what is the worth of a human soul?  What is the cost of scanning a brain?

Not everyone will want to upload, but those that desire it will value it highly indeed, perhaps above all else.  Unfortunately most uploads will not have much if any economic value, simply due to competition from other uploads and AIs.  Digital entities can be replicated endlessly, and new AIs can be grown or formed quickly.  So uploading is likely to be the ultimate luxury service, the ultimate purchase.  Who will be able to afford it?

The cost of uploading can be broken down into the initial upfront research cost followed by the per-upload cost of the scanning machine’s time and the cost of the hardware one uploads into.  Switching to the demand view of the problem, we can expect that people will be willing to pay at least one year of income for uploading, and perhaps as much as half or more of their lifetime income.  A small but growing cadre of transhumanists currently pay up to one year of average US income for cryonic preservation, even with only an uncertain chance of eventual success.  Once uploading is fully developed into a routine procedure, we can expect it will attract a rather large market of potential customers willing to give away a significant chunk of their wealth for a high chance of living many more lifetimes in the wider Metaverse.

On the supply side it seems reasonable that the cost of a full 3D brain scan can eventually be scaled down to the cost of etching an equivalent amount of circuitry using semiconductor lithography.  Scanning technologies are currently far less developed but ultimately face similar physical constraints, as the problem of etching ultra-high-resolution images onto surfaces is physically similar to the problem of ultra-high-resolution scanning of surfaces.  So the cost of scanning will probably come down to some small multiple of the cost of the required circuitry itself.  Eventually.

Given reasonable estimates of about 100 terabytes of equivalent data for the whole brain, this boils down to: 1) under $10,000 if the data is stored on 2011 hard drives, 2) under $100,000 in 2011 flash memory, or 3) under $500,000 in 2011 RAM[3].  We can expect a range of speed/price options, with a minimum floor price corresponding to the minimum hardware required to recreate the original brain’s capabilities.  Based on current trends and even the more conservative projections for Moore’s Law, it seems highly likely that the brain hardware cost is already well under a million dollars and will fall into the 10 to 100 thousand dollar range by the end of the decade.
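
Those bracketed figures follow directly from rough 2011 storage prices (the $/GB numbers below are approximate assumptions):

```python
brain_bytes = 100e12                  # ~100 terabytes, the text's estimate
gb = brain_bytes / 1e9                # 100,000 GB
price_per_gb = {"hdd": 0.10, "flash": 1.00, "ram": 5.00}   # rough 2011 $/GB
cost = {medium: gb * p for medium, p in price_per_gb.items()}
print(cost)   # roughly $10k on hard drives, $100k in flash, $500k in RAM
```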

Thus scanning technology will be the limiting factor for uploading until it somehow attracts the massive funding required to catch up with semiconductor development.  Given just how far scanning has to go, we can’t expect much progress until perhaps Moore’s Law begins to slow down and run its course, the world suddenly wakes up to the idea, or we find a ladder of interim technologies that monetize the path to uploading.  We have made decades of progress in semiconductor miniaturization only because each step along the way has paid for itself.

The final consideration is that Strong AI almost certainly precedes uploading.  We can be certain that the hardware requirements to simulate a scanned human brain are a strict upper bound on the requirements for a general AI of equivalent or greater economic productivity.  A decade ago I had some hope that scanning and uploading could arrive before the first generation of human-surpassing general AIs.  Given the current signs of an AI resurgence this decade and the abysmally slow progress in scanning, it now appears clearer that uploading is a later, post-AI technology.

  1. According to Wikipedia, synaptic clefts measure around 20 nm.  From this we can visually guesstimate that typical synaptic axon terminals are 4-8 times that in diameter, say over 100 nm.  In comparison, the 2011 Intel microprocessor I am writing this on is built on 32-nm ‘half-pitch’ features, which roughly means that the full distance between typical features is 64 nm.  The first processors on the 22-nm node are expected to enter volume production in early 2012.  Of course smallest feature diameter is just one aspect of computational performance, but it is an interesting comparison milestone nonetheless.
  2. See the Whole Brain Emulation Roadmap for a more in-depth requirements analysis.  It seems likely that scanning technology could improve rapidly if large amounts of money were thrown at it, but that doesn’t much help clarify any prognostications.
  3. I give a range of prices just for the storage cost portion because it represents a harder bound.  There is more variance in the cost estimates for computation, especially when one considers the range of possible thoughtspeeds, but the computational cost can be treated as some multiplier over the storage cost.

Overdue Update

I need to somehow enforce a mental pre-commitment to blog daily.  It’s been almost half a year, and I have a huge backlog of thoughts I would like to commit to permanent long-term storage.

Thus, a commitment plan to some upcoming future posts:

  • In October/November of last year (2010), I researched VR HMDs and explored the idea of a next-generation interface.  I came up with a novel hardware idea that could potentially solve the enormous resolution demands of a full-FOV, optic-nerve-saturating near-eye display device (effective resolution of say 8k x 4k per eye or higher).  After a little research I found the type of approach I discovered already has a name: a foveal display, although current designs in the space are rather primitive.  The particular approach I have in mind, if viable, could solve the display problem once and for all.  If an optimized foveal display could be built into eyewear, you would never need any other display – it would replace monitors, TVs, smartphone screens and so on.  Combine a foveal HMD with a set of cameras spread out in your room like stereo speakers and some software for real-time vision/scene voxelization/analysis, and we could have a Snow Crash interface (and more).
  • Earlier this year I started researching super-resolution techniques.  Super-resolution is typically used to enhance old image/video data and has found a home in upconverting SD video.  I have a novel application in mind: take a near-flawless super-res filter and use it as a general optimization for the entire rendering problem.  This is especially useful for near-future high-end server-based rendering solutions.  Instead of doing expensive ray-tracing and video compression on full 1080p frames, you run the expensive codes on a 540p frame and then do a fast super-res upconversion to 1080p (potentially a 4x savings on your entire pipeline!).  It may come as a surprise that current state-of-the-art super-res algorithms can do a 2x upsample from 540p to 1080p at very low error rates: well below the threshold of visual perception.  I have come up with what may be the fastest, simplest super-res technique that still achieves upsampling to 1080p with imperceptible visual error.  A caveat is that your 540p image must be quite good, which has implications for rendering accuracy, anti-aliasing, and thus rendering strategy choices.
  • I have big grandiose plans for next-generation cloud-based gaming engines.  Towards that end, I’ve been chugging away at a voxel ray tracing engine.  This year I more or less restarted my codebase, designing for Nvidia’s Fermi and beyond along with a somewhat new set of algorithms/structures.  Over the summer I finished some of the principal first pipeline tools, such as a triangle voxelizer and some new tracing loops, and made some initial progress towards a fully dynamic voxel scene database.
  • Along the way to Voxeland Nirvanah I got completely fed up with Nvidia’s new debugging path for CUDA (they removed the CPU emulation path) and ended up writing my own CUDA emulation path via a complete metaparser in C++ templates that translates marked-up ‘pseudo-CUDA’ to either actual CUDA or a scalar CPU emulation path.  I built most of this in a week, and it was an interesting crash course in template-based parsing.  Now I can run any of my CUDA code on the CPU.  I can also mix and match both paths, which is really useful for pixel-level debugging.  In this respect the new path I’ve built is actually more powerful and useful than Nvidia’s old emulation path, as that required full separate recompilation.  Now I can run all my code on the GPU, but on encountering a problem I can copy the data back to the CPU and re-run functions on the CPU path with full debugging info.  This ends up being better for me than using Nvidia’s Parallel Nsight for native GPU debugging, because Nsight’s debug path is rather radically different from the normal compilation/execution path and you can’t switch between them dynamically.
  • In the realm of AI, I foresee two major hitherto unexplored application domains related to Voxeland Nirvanah.  The first is what we could call an Artificial Visual Cortex (AVC).  Computer vision is the inverse of computer graphics.  The latter is concerned with transforming a 3+1D physical model M into a 2+1D viewpoint image sequence I.  The former is concerned with plausibly reconstructing the physical model M given a set of examples of viewpoint image sequences I.  Imagine if we had a powerful AVC trained on a huge video database that could then extract plausible 3D scene models from video.  Cortical models feature inversion and inference.  A powerful enough AVC could amplify rough 2D image sketches into complete 3D scenes.  In some sense this would be an artificial 3D artist, but it could take advantage of more direct and efficient sensor and motor modalities.  There are several aspects of this application domain that make it much simpler than a full AGI.  Computational learning is easier if one side of the mapping transform is already known.  In this case we can prime the learning process by using ray-tracing directly as the reverse transformation pathway (M->I).  This is a multi-billion dollar application area for AI in the field of computer graphics and visualization.
  • If we can automate artists, why not programmers?  I have no doubt that someday in the future we will have AGI systems that can conceive and execute entire technology businesses all on their own, but well before that I foresee a large market role for more specialized AI systems that can help automate more routine programming tasks.  Imagine a programming AI that has some capacity for natural language understanding and some ontology that combines knowledge of common-sense English, programming, and several programming languages.  Compilation is the task of translating between two precise machine languages expressed in some context-free grammar.  There are deterministic algorithms for such translations.  For the more complex unconstrained case of translation between two natural languages we have AI systems that use probabilistic context-sensitive grammars and semantic language ontologies.  Translating from a natural language to a programming language should have intermediate complexity.  There are now a couple of research systems in natural language programming that can do exactly this (such as sEnglish).  But imagine combining such a system with an automated ontology builder such as TEXTRUNNER, which crawls the web to expand its knowledge base.  Take such a system and add an inference engine and suddenly it starts getting much more interesting.  Imagine building entire programs in pseudo-code, with your AI using its massive ontology of programming patterns and technical language to infer entire functions and sub-routines.  Before full translation, compilation, and test, the AI could even perform approximate simulation to identify problems.  Imagine writing short descriptions of data structures and algorithms and having the AI fill in details, even potentially handling translation to multiple languages, common optimizations, automatic parallelization, and so on.  Google itself could become an algorithm/code repository.
Reversing the problem, an AI could read a codebase and begin learning likely structures and simplifications to high-level English concept categories, learning what the code is likely to do.  Finally, there are many sub-problems in research where you really want to explore a design space and try N variations in certain dimensions.  An AI system with access to a bank of machines along with compilation and test procedures could explore permutations at very high speed indeed.  At first I expect these types of programming assistant AIs to have wide but shallow knowledge and thus amplify and assist rather than replace human programmers.  They will be able to do many simple programming tasks much faster than a human.  Eventually such systems will grow in complexity, and then you can combine them with artificial visual cortices to expand their domain of applicability and eventually get a more complete replacement for a human engineer.
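
The super-resolution rendering idea above reduces to simple pipeline accounting plus an upsampler.  The sketch below uses a toy nearest-neighbour stand-in (a real super-res filter would infer detail rather than replicate pixels):

```python
def upsample2x(img):
    """Placeholder 2x upsampler: duplicate rows and columns.  A real
    super-res filter is far smarter; this only shows the data flow."""
    out = []
    for row in img:
        wide = [p for p in row for _ in (0, 1)]   # double the columns
        out.append(wide)
        out.append(list(wide))                    # double the rows
    return out

small = [[1, 2], [3, 4]]          # stands in for the rendered 540p frame
big = upsample2x(small)           # stands in for the displayed 1080p frame
assert big == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]

# The pipeline saving: ray-trace and compress at 540p instead of 1080p.
print((1920 * 1080) / (960 * 540))   # 4.0 - the claimed ~4x savings
```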
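
The pseudo-CUDA dual-path idea is easiest to see in a simplified analogue.  The sketch below (Python rather than C++ templates, all names hypothetical) shows the essence of a CPU emulation path: the same kernel body is serialized over a virtual grid of threads instead of being launched on the GPU.

```python
def launch_on_cpu(kernel, grid, block, *args):
    """CPU emulation path: turn the parallel launch into nested loops."""
    for bx in range(grid):
        for tx in range(block):
            kernel(bx, tx, block, *args)

def saxpy_kernel(bx, tx, block, a, x, y, out):
    i = bx * block + tx               # CUDA-style global thread index
    if i < len(x):
        out[i] = a * x[i] + y[i]      # the kernel body runs unchanged

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * 4
launch_on_cpu(saxpy_kernel, 2, 2, 2.0, x, y, out)
assert out == [12.0, 24.0, 36.0, 48.0]
```

Because the emulated path is ordinary host code, any element can be inspected in a debugger, which is the pixel-level debugging win described above.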

Building the Brain

A question of hardware capability?

When can we expect the Singularity? What kind of hardware would be required for an artificial cortex? How far out into the future of Moore’s Law is such technology?

The startling answer is that the artificial cortex, and thus the transition to a profoundly new historical era, is potentially much closer than most people realize.  The problem is mainly one of asking the right questions.  What is the computational power of the human brain?  This is not quite the right question.  With a few simple tools a human can perform generic computation – indeed computers were human long before they were digital (see the history of the word: computer).  The computational speed of the human brain aided with simple tools is very, very low: less than one operation per second.  Most studies on reverse engineering the human brain are really asking a different question: how much digital computation would it take to simulate the human brain?  Estimates vary, but they are usually on the order of 10^15 – quadrillions of operations per second – or less for functional equivalence, up to around 10^18 for direct simulation, plus or minus a few orders of magnitude.

The problem with this approach is that it’s similar to asking how much digital computation it would take to simulate a typical desktop processor by physically simulating each transistor.  The answer is surprising.  A typical circa-2010 desktop processor has on the order of a billion transistors, which switch on the order of a few billion times per second.  So simulating a current desktop processor using the same approach that we used to estimate brain capacity gives us a lower bound of a billion billion, or 10^18, operations per second – realistically closer to 10^20 operations per second to physically simulate a current desktop processor in real time – beyond the upper range of typical estimates for simulating the human brain in real time.  This is surprising given the conventional wisdom that the human brain is so much more complex than our current computers, so it’s worth restating:

If we define computational time complexity as the number of operations per second required to simulate a physical system on a generic computer, then current desktop processors circa 2010 have already exceeded the complexity of the human brain.
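
The numbers behind that claim, side by side (all figures are the order-of-magnitude estimates from the text):

```python
brain_functional_ops = 1e15   # ops/s for functional brain equivalence
brain_direct_ops = 1e18       # ops/s for direct synapse-level simulation

cpu_transistors = 1e9         # circa-2010 desktop processor
cpu_switch_rate = 1e9         # switches per second, order of magnitude
cpu_sim_ops = cpu_transistors * cpu_switch_rate   # naive transistor-level sim

print(f"{cpu_sim_ops:.0e}")   # 1e+18 ops/s just to simulate the desktop chip
assert cpu_sim_ops >= brain_direct_ops
```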

This space-time complexity analysis can be more accurately broken into two components: space complexity and speed.  Space complexity is simply the information storage capacity of the system, measured in bits or bytes.  Brains get their massive information capacity from their synapses, which can be conservatively estimated as equivalent to a byte of digital storage each, giving an upper bound of around 10^15 bytes for directly storing all the brain’s synapses – a petabyte of data storage – down to around a hundred terabytes depending on the particular neuroscience estimate we use.  Personal computers now have hard drives with terabytes of storage, and supercomputers of 2010 are just now hitting a petabyte of memory capacity, which means they have the base storage capacity required to comfortably simulate the brain completely in RAM.

Clearly brains have a big advantage in the space complexity department: their storage density is several orders of magnitude greater than our 2010 electronics (although this will change in about another 10-15 years of Moore’s Law).  However, along the speed dimension the advantage completely flips: current silicon electronics are about a million times faster than organic circuits.  So your desktop processor may only have the intrinsic spatial complexity of a cockroach, but signals flow through its circuits about six orders of magnitude faster – like a hyper-accelerated cockroach.  Using one computational system to simulate another always implies a massive trade-off in speed.  The simplest modern processor cores (much simpler than the Intel CPU you are using) use hundreds of thousands to millions of transistors, and thus even if we could simulate a synapse with just a single instruction per clock cycle, we may only just barely manage to simulate a cockroach brain in real time.
And note that the desktop processor would never be able to naively simulate something as complex as a human brain without vastly increasing its memory or storage capacity up to that of a supercomputer.  And even then, detailed brain simulations to date, running on supercomputers, achieve only a small fraction of real-time performance: much less than 10%.  It takes a human brain years to acquire language, so slow simulations are completely out of the question: we can’t simulate for 20 years just to see if our brain model develops to the intelligence level of a two-year-old!  Clearly, the speed issue is critical, and detailed simulation on a generic computer is not the right approach.

Capacity vs Speed

The memory capacity of a cortex is one principal quantitative measure underlying intelligence – a larger cortex with more synaptic connections can store and hold more memory patterns, and perform more total associative computations every cycle in direct proportion.  Certainly after we can match the human brain’s capacity, we will experiment with larger brains, but they will always have a proportionally higher cost in construction and power.  Past some point, scaling a brain 2 or 4 or X times larger and more expensive is probably not an improvement over an equivalent number of separate brains (and the distinction further blurs if the separate brains are networked together through something like language).  On this note, there are some reasons to believe that the human brain is already near a point of diminishing returns in the size department.  Whales and elephants, both large advanced mammals with plenty of room for much more massive capacities, sport brains built with a similar order of neurons as humans.  In numerous long-separated branches of the mammalian line, brains grew to surface areas all within a narrow logarithmic factor: around 2,500 cm^2 in humans, 3,700 cm^2 in bottlenose dolphins, and around 6,000-8,000 cm^2 in elephant and whale lineages.  They all compare similarly in terms of neuron and synapse counts even though the body sizes, and thus the marginal resource cost of a % increase in brain size, vary vastly: a whale or elephant brain is small compared to its body size, and consumes a small portion of its total resources.  The human brain definitely evolved rapidly from the hominid line, and is remarkably large given our body size, but our design’s uniqueness is really a matter of packing a full-sized large mammal brain into a small, crammed space.
The wiring problem poses a dimensional scaling constraint on brain size: total computational power scales with volume, but non-local communication scales with surface area, limiting a larger brain’s ability to effectively coordinate itself.  Similar dimensional scaling constraints govern body sizes, making insects insanely strong relative to their size and limiting the maximum plausible dimension of land animals to something dinosaur-sized before they begin to fall apart.  A larger brain developed in humans hand in hand with language and early technology, and is probably optimized to the human lifespan: providing enough pattern-recognition prowess and capacity to learn complex concepts continuously for decades before running into capacity limits.  The other large-brained mammals have similar natural ages.  Approaching the capacity limit, we can expect aging brains to become increasingly saturated, losing flexibility and the ability to learn new information, or retaining flexibility at the expense of forgetfulness and memory loss.  It’s thus reasonable to conclude that the storage capacity of a human brain would be the minimum, the starting point; increasing capacity further probably yields only a sublinear increase in effective intelligence.  It’s probably more useful only in combination with speed, as a much faster-thinking being would be able to soak up knowledge proportionally faster.
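
The volume-versus-surface argument is simple to quantify: scale a brain’s linear dimension by some factor and circuitry grows with the cube of that factor while long-range wiring cross-section grows only with the square, so connectivity per unit of compute falls in inverse proportion to size.

```python
ratios = []
for scale in (1, 2, 4):       # linear scale factor of the brain
    compute = scale ** 3      # total circuitry ~ volume
    bandwidth = scale ** 2    # long-range wiring ~ surface/cross-section
    ratios.append(bandwidth / compute)
print(ratios)                 # [1.0, 0.5, 0.25]: coordination degrades with size
```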

The Great Shortcut: Fast Algorithmic Equivalence

We can do much, much better than simulating a brain synapse by synapse.  As the brain’s circuits are mapped, we can figure out what fundamental computations they are performing by recording masses of neuron data, simulating the circuits, and then fitting this data to matching functions.  For much of the brain, this has already been done.  The principal circuits of the cortex have been mapped fairly well, and although there are still several competing implementation ideas at the circuit level, we have a pretty good idea of what these circuits can do at the more abstract level.  More importantly, simulations built on these concepts can accurately recreate visual data processing in the associated circuits that is both close to biologically measured results and effective for the circuit’s task – which in this case is immediate fast object recognition (for more details, see papers such as “Robust Object Recognition with Cortex-Like Mechanisms”).  As the cortex – the great outer bulk of the brain – reuses this same circuit element throughout its surface, we can now see possible routes for performing equivalent computations with dramatically faster algorithms.  Why should this be possible in principle?  Several reasons:

  1. Serial vs Parallel: For the brain and its extremely slow circuits, time is critical and circuits are cheap – it has so many billions of neurons (and hundreds of trillions of synapses) that it will prefer solutions that waste neuronal circuitry if they reduce the critical path length and thus are faster.  From the brain’s perspective, a circuit that takes 4 steps and uses a million synapses is much better than one which takes 30 steps and uses a thousand synapses.  Running on a digital computer that is a million times faster and a million times less parallel, we can choose more appropriate and complex (but equivalent) algorithms.
  2. Redundancy: Not all synapses – perhaps only a small fraction – store unique data.  For example, the hundred million or so neurons in the V1 layer of each visual cortex all compute simple gabor-like edge filters from a library of a few dozen possible orientations and scales.  The synapse weights for this layer could be reused and would take up memory in the kilobytes – a difference of at least 6 orders of magnitude vs the naive full simulation (where synapse = byte).  This level of redundancy is probably on the far end of the scale, but redundancy is definitely a common cortical theme.
  3. Time Slicing:  Only a fraction of the brain’s neurons are active at any one point in time (if this fraction rises too high the result is a seizure), and if we ignore random background firing, this fraction is quite low – in the range of 1% or possibly even as low as 0.1%.  This is of course a net average and depends on the circuit – some are more active than others – but if you think of the vast accumulated knowledge in a human mind and what small fraction of it is available or relevant at any one point, it’s clear that only a fraction of the total cortical circuitry (and brain) is important during any one simulation step.
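To see how these three factors might compound, here is a rough back-of-envelope sketch.  All of the specific factors below are illustrative assumptions in the spirit of the estimates above, not measured values:

```python
# Illustrative estimate of the algorithmic speedup available over a naive
# synapse-by-synapse simulation. All factors are assumptions for illustration.

naive_synapse_ops = 100e12   # ~10^14 synapses evaluated every timestep

serial_algo_factor = 10      # smarter serial algorithms on fast hardware
redundancy_factor = 10       # shared weights across redundant circuits
active_fraction = 0.01       # ~1% of circuitry active per step (time slicing)

effective_ops = naive_synapse_ops * active_fraction / (serial_algo_factor * redundancy_factor)
speedup = naive_synapse_ops / effective_ops
print(f"effective ops per step: {effective_ops:.0e}")
print(f"overall speedup factor: {speedup:.0f}x")
```

Even with these conservative per-factor guesses, the combined speedup lands around four orders of magnitude, consistent with the “at least three orders of magnitude” claim below.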

The Cortexture: The end result of these observations is that a smart, algorithmically equivalent cortical simulation could be at least three orders of magnitude faster than a direct simulation which naively evaluates every synapse every timestep.  The architecture I envision would organize cortical sheets into a spatial database that helps track data flow dependencies, storing most of the unique synaptic data (probably compressed) on a RAID disk array (possibly flash) which would feed one or more GPUs.  With a few terabytes of disk and some compression, you could store at least a primate-level brain, if not a human-equivalent cortex.  A couple of GPUs with a couple of gigabytes of RAM each would store the active circuits (less than 1% of total synapses), which would be constantly changing and streamed in and out as needed.  Fast flash RAID systems can get over a gigabyte per second of bandwidth, so you could swap out the active cortical elements every second.  I believe this is roughly fast enough to match human task or train-of-thought switching time.  The actual cortical circuit evaluation would be handled by a small library of specially optimized equivalent GPU programs.  One would simulate the canonical circuit – and I believe I have an algorithm that is at least 10 times faster than naive evaluation for what the canonical circuit can do – but other algorithms could be even faster for regions where the functionality is known and specialized.  For example, the V1 layers which perform Gabor-like filters use a very naive technique in the brain, and the equivalent result could be computed perhaps 100x faster with a very smart algorithm.  I’m currently exploring these techniques in more detail.
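The storage and streaming budget above can be sanity-checked with simple arithmetic.  The synapse count, compression ratio, and GPU RAM figures here are assumptions in line with the rough numbers in the text:

```python
# Rough sanity check of the Cortexture storage/streaming budget.
# All figures are assumed, order-of-magnitude values.

synapses = 1e13              # primate-scale cortex, ~10^13 synapses
bytes_per_synapse = 1        # naive encoding: one synapse ~= one byte
compression = 5              # assumed on-disk compression ratio

disk_tb = synapses * bytes_per_synapse / compression / 1e12
print(f"disk needed: {disk_tb:.0f} TB")   # a few terabytes

gpu_ram_gb = 4.0             # two GPUs with ~2 GB each hold the active circuits
raid_gb_per_s = 1.0          # fast flash RAID bandwidth
swap_s = gpu_ram_gb / raid_gb_per_s
print(f"full working-set swap: {swap_s:.0f} s")
```

So a few terabytes of compressed synaptic state with a gigabyte-per-second array gives a full working-set swap on the order of seconds, roughly the timescale of human task switching.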

End Conclusion: If the brain were fully mapped (and that is the main task at hand – many mechanisms such as learning are still being teased out) and a sufficient group of smart engineers started working on optimizing its algorithms, we could probably implement a real-time artificial cortex in less than five years using today’s hardware, on a machine costing somewhere between $10,000 and $1,000,000.  (I know that is a wide error range, but I believe it is accurate for that very reason.)  This cost is of course falling exponentially year by year.

Neuromorphic Computing

A sober analysis of the current weight of neuroscience data – specifically the computational complexity of the mapped cortical circuits and their potential for dramatic algorithmic optimization on faster machines – leads to the startling conclusion that we already have the hardware capability to implement the brain’s algorithms in real-time today. In fact, it seems rather likely that by the time the brain is reverse engineered and we do figure out the software, the hardware will have already advanced enough that achieving faster than real-time performance will be quite easy.  The takeoff will likely be very rapid.

The Cortexture approach I described earlier, or any AI architecture running on today’s computers, will eventually run into a scalability problem due to disk and bus bandwidth speeds.  To really accelerate into the singularity, and get to hundreds or thousands of times acceleration vs human thoughtspeed, will likely require a fundamental redesign of our hardware along cortical lines.  Cortical neural circuitry is based on mixed analog and digital processing, and combines memory and analog computation in a single elemental structure – the synapse. Data storage and processing are both built into the synapses, and the equivalent total raw information flow rate is roughly the total synapse count multiplied by the signaling rate.  The important question is thus: what is the minimal efficient CMOS equivalent of the synapse? Remarkably, the answer may be the mysterious 4th element of basic computing, the memristor. Postulated mathematically decades ago, this circuit building block was only physically realized recently and is already being heralded as the ‘future of artificial intelligence‘ – it has electric properties very similar to the synapse, combining long-term data storage and computation in a single element. For a more in-depth design for a complete artificial cortex based on this new circuit element, take a look at “Cortical computing with memristive nanodevices“. This is a fascinating approach, and could achieve cortical complexity parity fairly soon, if the required fabrication technology were ready and developed. However, even though the memristor is quite exciting and looks likely to play a major role in future neuromorphic systems, conventional plain old CMOS circuits certainly can emulate synapses.  Existing mixed digital/analog techniques can represent synapses in artificial neurons effectively using around or under 10 transistors.
This hybrid method has the distinct advantage of avoiding costly digital multipliers that use tens of thousands of transistors – instead using just a handful of transistors per synapse. Designs in this space can directly emulate cortical circuits in highly specialized hardware, performing the equivalent of a multiplication for every synaptic connection every clock cycle. There is a wide space of possible realizations of neuromorphic architectures, and the field looks to be just coming into its own.  Google “artificial cortex” or “neuromorphic” for papers and more info.  DARPA, not to be outdone, has launched its own neuromorphic computing program, called SyNAPSE – which has a blog, just so you can keep tabs on Skynet.
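The reason an analog synapse array (or memristor crossbar) gets a multiply per synapse per cycle essentially for free is just Ohm’s law plus Kirchhoff’s current law: each cross-point conductance acts as a stored weight, and driving the rows with input voltages sums the weighted currents down each column.  A minimal sketch of that physics in plain Python, with purely illustrative numbers:

```python
# Why a crossbar performs a "free" multiply-accumulate: each cross-point
# conductance G[i][j] is a stored synaptic weight; input voltages on the
# rows produce column currents I_j = sum_i V_i * G[i][j]
# (Ohm's law + Kirchhoff's current law). Values are illustrative.

def crossbar_column_currents(voltages, conductances):
    """Current out of each column line for the given row voltages."""
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]

# 3 input lines (rows), 2 output lines (columns)
G = [[0.5, 0.1],
     [0.2, 0.4],
     [0.0, 0.3]]      # conductances = learned weights
V = [1.0, 0.5, 1.0]   # input spikes encoded as voltages

print(crossbar_column_currents(V, G))  # both columns ≈ 0.6
```

One pass of applied voltages evaluates every synapse in the array simultaneously, which is exactly the operation a digital design would need a multiplier per synapse to match.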

The important quantitative dimensions for the cortex are synapse density, total capacity (which is just density * surface area), and clock rate. The cortex topology is actually 2D: that of a flat, relatively thin sheet (around 6 neurons thick) heavily folded into the volume of the brain, a space-filling fractal. If you were to unfold it, it would occupy about one square foot, or 2,500 square centimeters – the area of roughly a thousand typical processor dies. It has a density of about 100-4,000 million (10^8-10^9) synapses per mm^2. Current 40nm and 32nm CMOS technology circa 2010 can pack roughly 6-10 million (10^6-10^7) transistors onto a mm^2, so semiconductor density is within about a factor of 20-500 of the biological cortex in terms of feature density (more accurate synapse density figures await more detailed brain scans). This is a critical upcoming milestone – when our CMOS technology will match the information density and miniaturization level of the cortex.  This represents another 4 to 9 density doublings (currently occurring every 2 years, but expected to slow down soon), which we can expect to hit around the 11nm node or shortly thereafter in the early to mid 2020s – the end of the semiconductor industry’s current roadmap.  This is also the projected end of the road for conventional CMOS technology, where the semiconductor roadmap wanders into the more hypothetical realm of nano-electronics.  When that does happen, neuromorphic designs will have some distinct advantages in terms of fault tolerance and noise resistance which could allow them to scale forward more quickly.  It is also expected that moving more into the 3rd dimension will be important, and that leakage and other related quantum issues will limit further speed and power efficiency improvements – all pointing towards more brain-like computer designs.
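The density-gap and doubling figures above follow directly from the quoted numbers, which a couple of lines of arithmetic make explicit:

```python
# Density-gap arithmetic from the figures above: how many ~2-year density
# doublings until CMOS feature density matches cortical synapse density?
import math

cortex_low, cortex_high = 100e6, 4000e6   # synapses per mm^2
cmos_low, cmos_high = 6e6, 10e6           # transistors per mm^2 (40/32nm)

gap_low = cortex_low / cmos_low           # optimistic end of the gap
gap_high = cortex_high / cmos_high        # pessimistic end of the gap

print(f"gap: {gap_low:.0f}x to {gap_high:.0f}x")
print(f"doublings: {math.log2(gap_low):.1f} to {math.log2(gap_high):.1f}")
# roughly 4 to 9 doublings; at ~2 years each, landing in the 2020s
```

At one doubling every two years, 4 to 9 doublings is roughly 8 to 18 years from 2010, which is how the estimate lands in the early-to-mid 2020s.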

Scaling up to the total memory/feature capacity of the brain (hundreds of trillions to a quadrillion synapses), even once semiconductor technology reaches parity in density, will still take a large number of chips (with roughly equivalent total surface area). Today’s highest density memory chips have a few billion transistors, and you would need hundreds of thousands to equal the total memory of the brain. High-end servers are just starting to reach a terabyte of memory (with hundreds of individual chips), and you would then need hundreds of these. A far more economical and cortically inspired idea is to forgo ‘chips’ completely and just turn the entire silicon wafer into a single large usable neuromorphic computing surface. The inherent fault tolerance of the cortex can be exploited by these architectures – there is no need to cut up the wafer into dies and identify defective components; they can simply be disabled or statistically ignored during learning. This fascinating contrarian approach to achieving large neuromorphic circuits is being explored by the FACETS research team in Europe. So, in the end analysis, it looks reasonable that in the near future (roughly a decade) a few hundred terabytes of cortical-equivalent neuromorphic circuitry could be produced on one to a couple dozen CMOS wafers (the equivalent of a few thousand cheap chips), even using conventional CMOS technology. More importantly, this type of architecture can be relatively simple and highly repetitive, and it can run efficiently at low clock rates and thus at low power, greatly simplifying manufacturing issues. It’s hard to estimate the exact cost, but due to the combination of low voltage/clock, single uncut wafer design, perfect yield, and so on, the economics should be similar to memory – RAM chips, which arrive first at new manufacturing nodes, are cheaper to produce, and consume less power.  Current 2010 RAM prices are about $10 per GB, or very roughly 1 billion transistors per dollar.

Continuum of hardware efficiencies for cortical learning systems:

CPU, GPU simulation: efficiency (die area, performance, power) 10^-8 to 10^-6

FPGA, ASIC: efficiency 10^-5 to 10^-3

Neuromorphic (mixed analog/digital or memristors): 10^-2 to 1

CPU simulation is incredibly inefficient compared to the best solutions for a given problem, but the CPU’s versatility and general applicability across all problems ensures it dominates the market, gets the most research attention and economy-of-scale advantages, and is first to benefit from new foundry process improvements.  Dedicated ASICs are certainly employed widely today in markets big enough to support them, but always face competition from CPUs scaling up faster.  At the far end are hypothetical cortical systems built from memristors, which could function as direct 1:1 synapse equivalents.  We can expect that as Moore’s law slows down this balance will eventually break down and favor designs farther down the spectrum.  Several forces will combine to bring about this shift: approaching quantum limitations which cortical designs are better adapted for, the increased market potential of AI applications, and the end of the road for conventional lithography.

The Road Ahead

A human-scale artificial cortex could be built today, if we had a complete connectome.  In the beginning it would start out as only an infant brain – it would then take years to accumulate the pattern recognition knowledge base of a two year old and begin to speak, and then a few more decades to achieve an education and any real economic value.  This assumes, unrealistically, that the first design tested would work.  It would almost certainly fail.  Successive iterations would take even more time.  This of course is the real reason why we don’t have human-replacement AI yet: humans are still many orders of magnitude more economically efficient.

Yet we should not be so complacently comfortable in our economic superiority, for Moore’s law ensures that the cost of such a cortical system will decrease exponentially.

Now consider another scenario, where instead of being constrained to current CPUs or GPUs, we invest billions in radical new chip technologies and even new foundries to move down the efficiency spectrum with neuromorphic designs, or at least a very powerful dedicated cortical ASIC.  Armed with some form of specialized cortical chip ready for mass volume production today at the cheap end of chip prices (where a dollar buys about a billion transistors, instead of hundreds of dollars per billion transistors as with high-end logic CPUs and GPUs), we would expect a full human-brain-sized system to cost dramatically less: closer to the cost of the equivalent number of RAM transistors – on the order of $10 million for a petabyte (assuming 10 transistors per synapse; memristors are even better).  Following the semiconductor industry roadmap (which factors in a slowing of Moore’s law this decade), we could expect the cost of a quadrillion-synapse or petabyte system to fall below $1 million by the end of the decade, and reach $100,000 in volume by the mid 2020s – the economic tipping point of no return.  But even at a million dollars a pop, a future neuromorphic computing system of human cortical capacity and complexity would be of immeasurable value, for it could possess a fundamental, simply mind-boggling advantage of speed.  As you advance down the specialization spectrum from GPUs to dedicated cortical ASICs and eventually neuromorphic chips and memristors, speed and power efficiency increase by orders of magnitude – with dedicated ASICs offering 10x to 100x speedups, and direct neuromorphic systems offering speedups of 1000x or more.  If it takes 20-30 years to train a cortex from infant to educated adult mind, power and training time are the main cost.
A system that could do that in 1/10th the time would suddenly be economical, and a system that could do it in 1/100th the time of a human would rapidly bring about the end of the world as we know it.
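The $10 million figure above follows directly from the stated assumptions, which are worth checking explicitly:

```python
# Back-of-envelope cost check for a human-scale cortical system at
# memory-like transistor prices, using the assumptions from the text.

synapses = 1e15                  # ~a quadrillion synapses (petabyte-class)
transistors_per_synapse = 10     # mixed analog/digital synapse circuit
transistors_per_dollar = 1e9     # RAM-like pricing: ~$10/GB circa 2010

cost = synapses * transistors_per_synapse / transistors_per_dollar
print(f"~${cost / 1e6:.0f} million")   # ~$10 million
```

At memristor densities of one device per synapse the transistor count drops by an order of magnitude, and the estimate falls accordingly.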

Thinking at the Speed of Light

Our biological brains have high information densities and are extraordinarily power efficient, but this is mainly because they are extremely slow, with cycle times in the hundreds of hertz to around a kilohertz.  This relatively slow speed is a fundamental limitation of computing with living cells and (primarily) chemical synapses, with their organic fragility. Semiconductor circuits do not have this limitation. Operating at the low frequencies and clock rates of their biological inspirations, neuromorphic systems can easily simulate biological networks in real-time and with comparable energy efficiency.  The most efficient neuromorphic computers can generally access all of their memory and synapses every clock cycle, so they can still perform immense calculations per second at very low speeds, just like biological neural nets. But you can also push up the clock rate, pump more power through the system, and run the circuit at megahertz or even gigahertz rates, equivalent to one thousand to one million times biological speed. Current systems with mixed digital/analog synaptic circuits can already achieve 1,000-10,000x biological ‘real-time’ on old CMOS manufacturing nodes and at low power and heat points. This is not an order of magnitude improvement over simulation on a similarly sized digital computer of the same technology generation; it’s more like six orders of magnitude.  That being said, the wiring problem will still be a fundamental obstacle.  The brain optimizes against this constraint by taking the form of a 2D sheet excessively folded into a packed 3D space – a space-filling curve.  Much of the interior volume is occupied by connectivity wiring – the white matter.  Our computer chips are currently largely 2D, but are already starting to move into the 3rd dimension.  A practical full-electronic-speed artificial cortex may require some novel solutions for high-speed connectivity, such as direct laser optical links, or perhaps a huge mass of fiber connections.
Advanced artificial cortices may end up looking much like the brain: with the 2D circuitry folded up into a 3D sphere, interspersed with something resembling a vascular system for liquid cooling, en-sheathed in a mass of dense optical interconnects.  Whatever the final form, we can expect that the fundamental speed advantage inherent to electronics will be fully exploited.

By the time we can build a human complexity artificial cortex, we will necessarily already be able to run it many times faster than real time, eventually accelerating by factors of thousands and then even millions.

Speed is important because of the huge amount of time a human mind takes to develop.  Building practical artificial cortex hardware is only the first step.  To build a practical mind, we must also unlock the meta-algorithm responsible for the brain’s emergent learning behavior.  This is an active area of research, and there are some interesting emerging general theories, but testing any of them on a human-sized cortex is still inordinately costly.  A fresh new artificial brain will be like an infant: full of noisy, randomized synaptic connections.  An infant brain does not have a mind so much as the potential space from which a mind will etch itself through the process of development.  Running in real-time in a sufficiently rich virtual reality, it would take years of simulation just to test development to a childhood stage, and decades to educate a full adult mind.  Accelerating the simulation many times beyond real-time thus has a huge practical advantage – hence the need for speed.

The test of a human-level AI is rather simple: it is the same qualitative intelligence test we apply to humans. Its mind develops sufficiently to learn human language, then it learns to read, and it progresses through education into a working adult. Learning human language is ultimately the fundamental aspect of becoming a modern human mind – far more so than the exact brain architecture or even the substrate. If the brain’s cortical capacity is sufficient and the wiring organization is correctly mapped, it should then be able to self-educate and develop rapidly.

I highly doubt that other potential shortcut routes to AGI (artificial general intelligence) will bear fruit.  Narrow AIs will always have their uses, as will simpler, animal-like AIs (non-language-capable), but it seems inevitable that a human-level intelligence will require something similar to a cortex – at least at the meta-algorithmic level of some form of self-organizing, deep, hierarchical probabilistic network (it doesn’t necessarily have to use ‘neurons’). Furthermore, even if the other routes being explored to AI do succeed, it’s even less likely that they will scale to the insane speeds that the cortex design should be capable of (remember, the cortex runs at under 1,000 Hz, which means we can eventually take that same design and speed it up by a factor of at least a million). From a systems view, it seems likely that the configuration space of our biological cortex meta-wiring is effectively close to optimal in some sense – evolution has already well explored that state space. From an engineering perspective, taking an existing, heavily optimized design for intelligence and then porting it to a substrate that can run many orders of magnitude faster is a clear winning strategy.

Clock rate control, just like on our current computers, should allow posthumans to alter their speed of thought as needed. In the shared virtual realities they will inhabit with their human teachers and observers, they will think at ‘slow’ real-time human rates, with kilohertz clock rates and low power usage. But they will also be able to venture into a vastly accelerated inner space, drawing more power and thinking many times faster than us. Running on even today’s CMOS technology, they could theoretically attain speeds up to about a million times faster than a biological brain, although at these extreme speeds the power and heat dissipation requirements would be large – like those of current supercomputers.

Most futurist visions of the Singularity consider AIs that are more intelligent, but not necessarily faster, than human minds, but it’s clear that speed is the fundamental difference between the two substrates. Imagine one of these mind children growing up in a virtual environment where it could dilate time by a factor of 1-1000x at will. Like real children, it will probably require both imitation and reinforcement learning with adults to kick-start the early development phases (walking, basic environment interaction, language, learning to read). Assuming everything else is identical (the hardware cortex is a very close emulation), this child could develop very rapidly – the main bottleneck being the slow-time interaction with biological humans. Once a young posthuman learns to read, it can hop on the web, download texts, and progress at a truly staggering pace – assuming a blindingly fast internet connection to keep up (although perceived internet latency would be subjectively far worse in proportion to the acceleration – can’t do much about that). Going to college wouldn’t really be a realistic option, but reading at 1000x real-time would have some staggering advantages. It could read 30 years of material in about 11 days, potentially becoming a world-class expert in a field of its choosing in a matter of weeks. The implications are truly profound. Entering this hypertime acceleration would make the most sense when reading, working, or doing other intellectual work. The effect for a human observer would be that anything the posthuman was intelligent enough to do, it would be able to do near instantly, from our perspective.  The presence of a posthuman would be unnerving.  With a constant direct internet connection and a 1000x acceleration factor, a posthuman could read a book during the time it would take a human to utter a few sentences in conversation.

Clearly existing in a different phase space than us, its only true peers would be other equivalent posthumans; if built as a lone research model, it could be lonely in the extreme. Perhaps it could be selected or engineered for monastic or ascetic qualities, but it would probably be more sensible to create small societies of posthumans which can interact and evolve together – humans are social creatures, and our software descendants would presumably inherit this feature by default. The number of posthumans and their relative intelligence will be limited by our current computing process technology – the transistor density and cost per wafer – so their potential population growth will be more predictable and follow semiconductor trends (at least initially).  Posthumans with more synapses and neurons than humans could presumably become super-intelligent in another quantitative dimension, that of mental capacity – able to keep track of more concepts, and learn and recall more knowledge, than humans – albeit with the slow linear scaling discussed previously. But even posthumans with mere human-capacity brains could be profoundly, unimaginably super-intelligent in their speed of thought, thanks to the dramatically higher clock rates possible on their substrate – and thus in a short matter of time they would become vastly more knowledgeable. The maximum size of an artificial cortex would be limited mainly by the economics of the wafers and then by bandwidth and latency constraints. There are tradeoffs between size, speed, and power for a given fabrication technology, but in general, larger cortices would be more limited in their top speed.
The initial generations will probably occupy a fair amount of server floor space and operate not much faster than real-time, but each successive generation will be smaller and faster, eventually approaching a form factor similar to the human brain, and pushing the potential clock rate to the technology limits (more than a million times real-time for current CMOS tech). Even with small populations at first, it seems likely that the first successful generation of posthumans to reach upper-percentile human intelligence will make an abrupt and disruptive impact on the world. But fortunately for us, physics does impose some costs on thinking at hyperspeed.

Fast Brains and Slow Computers

A neuromorphic artificial cortex will probably have data connections that allow its synaptic structures to be saved to external storage, but a study of current theory and designs in the field dispels some common myths: an artificial cortex will be a very separate, specialized type of computer hardware, and will not automatically inherit a digital computer’s supposed advantages such as perfect recall.  It will probably not be able to automagically download new memories or skills as easily as downloading new software.  The emerging general theories of the brain’s intelligence, such as the hierarchical Bayesian network models, all posit that learned knowledge is stored in a deeply non-local, distributed and connected fashion, very different from, say, a digital computer’s random access memory (even if the synapses themselves are implemented in RAM).  Reading (or accessing) memories and writing memories in a brain-like network intrinsically involves thinking about the associated concepts – as memories are distributed associations, tangled up everywhere with existing memory patterns.  An artificial cortex could be designed to connect to external computer systems more directly than through the senses, but this would have only marginal advantages.  For example, we know from dreams that the brain can hallucinate visual input by bypassing the lowest layers of the visual cortex and directly stimulating regions responsible for recognizing moving objects, shapes, colors, etc., all without actually requiring input from the retina.  But this is not much of a difference for a posthuman mind already living in a virtual reality – simulating sound waves, their conversion into neural audio signals, and the processing into neural patterns representing spoken dialog is not that much different from just directly sending the final dialog-representing neural patterns into the appropriate regions.
The small differences will probably show up as what philosophers call qualia – those subjective aspects of consciousness or feelings that operate well below the verbal threshold of explanation.

Thus a posthuman with a neuromorphic artificial cortex will still depend heavily on traditional computers to run all the kinds of software that we use today, and to simulate a virtual environment complete with a virtual body and all that entails. But the posthuman will essentially think at the computer’s clock rate. The human brain has a base ‘clock rate’ of about a kilohertz, completing on the order of a thousand neural computing steps per second simultaneously across all of its trillions of circuits. A neuromorphic computer works the same way, but the clock rate can be dramatically sped up to CMOS levels. A strange and interesting consequence is that a posthuman thinking at hyperspeed would subjectively experience its computer systems and computer environment slowing down by an equivalent rate. It’s likely the neuromorphic hardware will have considerably lower clock rates than traditional CPUs for power reasons, but they have the same theoretical limit, and running at the same clock rate, a posthuman would experience a subjective second in just a thousand clock cycles, which is hardly enough time for a traditional CPU to do anything. Running at a less ambitious acceleration factor of just 1000x, and with gigahertz computer systems, the posthuman would still experience a massive slowdown in its computer environment, as if it had jumped back more than a decade in time to a vastly slower era of megahertz computing.  However, we can imagine that by this time traditional computers will be much further down the road of parallelization, so a posthuman’s typical computer will consist of a very large number of cores and software will be much more heavily parallelized.  Nevertheless, it’s generally true that a posthuman, no matter what its level of acceleration, will still have to wait the same amount of time as anyone else, including a regular human, for its regular computations to complete.

Ironically, while posthumans will eventually be able to think thousands or even millions of times faster than biological humans, using this quickening ability will have the perceptual effect of slowing down the external universe in direct proportion – including their computer environment.  Thus greatly accelerated posthumans will spend proportionally inordinate amounts of subjective time waiting on regular computing tasks.
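The subjective slowdown is just multiplication: an external delay of t seconds feels like k*t seconds to a mind accelerated by factor k.  A tiny sketch, with purely illustrative numbers:

```python
# Subjective time dilation: a posthuman accelerated by factor k experiences
# any fixed external wall-clock delay as k times longer. Illustrative only.

def subjective_seconds(external_seconds, acceleration):
    """How long an external delay feels to a mind running at `acceleration`x."""
    return external_seconds * acceleration

# A 0.1 s disk seek or network round-trip, as felt at various thoughtspeeds:
for k in (1, 1000, 1_000_000):
    print(f"{k:>9}x: 0.1 s feels like {subjective_seconds(0.1, k):,.0f} s")
# at 1,000,000x, a tenth of a second feels like roughly 28 hours
```

This is why ordinary computing latencies, trivial at human thoughtspeed, become the dominant tax on a greatly accelerated mind.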

The combination of these trends leads to the conclusion that a highly variable clock rate will be an important feature for future posthuman minds.  Accelerating to full thought speed – quickening – will probably be associated with something like entering an isolated meditative state.  We can reason that at least in the initial phases of posthuman development, their simulated realities will mainly run at real-time, in order to provide compatibility with human visitors and to provide full fidelity while conserving power.  When quickening, a posthuman would experience its simulated reality slowing down in proportion, grinding to a near halt at higher levels of acceleration.  This low-computational mode would still be very useful for much of human mental work: reading, writing, and old-fashioned thinking.

In the present, we are used to computers getting exponentially faster while the speed of human thought remains constant.  All else being equal, we are now in a regime where the time required for fixed computational tasks is decreasing exponentially (even if new software tends to eat much of this improvement).  The posthuman regime is radically different.  In the early ramp-up phase the speed of thought will increase rapidly until it approaches the clock rate of the processor technology.  During this phase the trend will actually reverse – posthuman thoughtspeed will increase faster than computer speed, and from a posthuman’s perspective, computers will appear to get exponentially slower.  This phase will peter out when posthuman thoughtspeed approaches the clock rate – somewhere around a million times human thoughtspeed for the fastest, most extremely optimized neuromorphic designs using today’s process technology (gigahertz vs a kilohertz).  At that point there is little room for further raw speed-of-thought improvements (remember, the brain can recognize objects and perform relatively complex judgements in just a few dozen ‘clock cycles’ – not much room to improve on that in terms of speed potential given its clock rate).

After the initial ramp-up regime, Moore’s law will continue of course, but at that point you enter a new plateau phase.  In this second regime, once the algorithms of intelligence are well mapped to hardware designs, further increases in transistor density will enable more traditional computer cores per dollar and more ‘cortical columns or equivalents’ per dollar in direct proportion.  Posthuman brains may get bigger, or they may get cheaper, but the clock speed wouldn’t change much (as any process improvement in clock rate would speed up both traditional computers and posthuman brains).  So in the plateau phase, you have this weird effect where computer clock rate is more or less fixed at a far lower level than we are used to – about a million times less from the perspective of the fastest neuromorphic posthuman brain designs.  This would correspond to subjective computer clock rates measured in the kilohertz.  The typical computer available to a posthuman by then would certainly have far more cores than today, thousands or perhaps even millions, but they would be extremely slow from the posthuman’s perspective.  Latency and bandwidth would be similarly constrained, which would effectively expand the size of the world in terms of communication barriers – and this single idea has wide-ranging implications for understanding how posthuman civilizations will diverge and evolve.  It suggests a strong diversity-increasing, counter-globalization effect which would further fragment and disperse localized sub-populations, for better or worse.

What would posthumans do in the plateau phase, forever limited to extremely slow, roughly kilohertz-speed computers?  This would limit the range of tasks they could effectively do in quicktime.  Much of hardware and software design, engineering, and the like would be limited by slow computer speeds.  Surprisingly, the tasks that could still run at full speed would be the obvious low-computation ones: the simpler, lower-technology creative occupations such as writing.  It's not that posthumans wouldn't be good at computationally intensive tasks as well – they would certainly be superhuman in all endeavors.  The point is rather that they will be vastly, incomprehensibly more effective only in those occupations that do not depend on the speed of general-purpose computation.  Thus we can expect that they will utterly dominate professions such as writing.

It seems likely that very soon into the posthuman era, the bestseller lists will be inundated by an exponentially expanding set of books, the best of which will be noticeably better than anything humans could write (and most probably written under pseudonyms and with fake biographies).  When posthumans achieve 1,000x human thoughtspeed, you might go on a short vacation and come back to find several years of new literature.  When posthumans achieve 1,000,000x human thoughtspeed, you might go to sleep and wake up to find that the number of books in the world (as well as the number of languages) just doubled overnight.  Of course, by that point, you're already pretty close to the end.

We can expect that the initial posthuman hardware requirements will be expensive, so the first posthumans will be few in number and of limited speed.  But once they achieve economic parity with human workers, we can expect the tipping point to crash like a tidal wave: a rapid succession of hardware generations increasing maximum thoughtspeed while reducing size, cost, and power consumption; huge economies of scale driving an exponential posthuman population expansion; and their virtual realities eventually accelerating well beyond human comprehension.

Conversing with the Quick and the Dead

CUI: The Conversational User Interface

Recently I was listening to an excellent interview (about an hour long) with John Smart of Acceleration Watch, in which he elucidated his ideas on the near-future evolution of AI, which he encapsulates in what he calls the Conversational Interface. In a nutshell, it's the idea that the next major development in our increasingly autonomous global internet is the emergence and widespread adoption of natural language processing and conversational agents. This technology is right at the tipping point, so it's something to watch, as numerous startups are starting to sell software for automated call centers, sales agents, autonomous monitoring agents for utilities, security, and so on. The immediate enabling trends are the emergence of a global liquid market for cheap computing and fairly reliable off-the-shelf voice-to-text software that actually works. You have probably called a bank and experienced the simpler initial versions of this – essentially voice-activated multiple-choice menus – but the newer systems on the horizon are a wholly different beast: an effective simulacrum of a human receptionist which can interpret both commands and questions, ask clarifying questions, and remember prior conversations and even users. This is an interesting development in and of itself, but the more startling idea hinted at in Smart's interview is how natural language interaction will lead to anthropomorphic software, and how profoundly this will eventually affect the human-machine symbiosis.

Humans are rather biased judges of intelligence: we have a tendency to attribute human qualities to anything that looks or sounds like us, even if its actions are regulated by simple dumb automata. Aeons of biological evolution have preconditioned us to rapidly identify other intelligent agents in our world, categorize them as potential predators, food, or mates, and take appropriate action. It's not that we aren't smart enough to apply more critical and intensive investigation to determine a system's relative intelligence; it's that we have super-effective visual and auditory shortcuts which bias us. These biases are especially strong in children, and future AI developers will be able to exploit them to create agents with emotional attachments. The Milo demo from Microsoft's Project Natal is a remarkable and eerie glimpse into the near-future world of conversational agents and what Smart calls 'virtual twins'. After watching this video, consider how this kind of technology can evolve once it establishes itself in the living room in the form of video game characters for children. There is a long history of learning through games, and the educational game market is a large, well-developed industry. The real potential hinted at in Peter Molyneux's demo is a disruptive convergence of AI and entertainment which I see as the beginning of the road to the Singularity.

Imagine what entrepreneurial game developers with large budgets and the willingness to experiment outside the traditional genres could do when armed with a full two-way audio-visual interface like Project Natal, the local computation of the Xbox 360 and future consoles, and a fiber connection to the immense up-and-coming computing resources of the cloud (fueled by the convergence of general-purpose GPUs and the huge computational demands of the game/entertainment industry moving into the cloud). Most people, and even futurists, tend to think of Moore's Law as a smooth and steady exponential progression, but the reality from the perspective of a software developer (and especially a console game developer) is a series of massively disruptive jumps: evolutionary punctuated equilibrium. Each console cycle reaches a steady state towards the end, where the space of possible game ideas, interfaces, and simulation technologies tapers off technologically, followed by the disruptive release of new consoles with vastly increased computation, new interfaces, and even new interconnections. The next console cycle is probably not going to start until as late as 2012, but with upcoming developments such as Project Natal and OnLive, we may be entering a new phase already.

The Five Year Old’s Turing Test

Imagine a future 'game system' aimed at relatively young children with a Natal-like interface: a full two-way communication portal between the real and the virtual. The system can both see and hear the child, and it can project a virtual window through which the inner agents can be seen and heard. Permanently connected to the cloud through fiber, it can tap into vast remote computing resources on demand. There is a development point, a critical tipping point, where it will be economically feasible to make a permanent autonomous agent that can interact with children. Semi-intelligent agents will certainly come first – some will take the form of an interactive, talking version of a character like Barney. But for the more interesting and challenging goal of human-level intelligence, it could actually be easier to make a child-like AI, one that learns and grows with its 'customer'. Not just a game, but a personalized imaginary friend to play games with, and eventually to grow up with. It will be custom designed (or rather, developmentally evolved) for just this role – shaped by economic selection pressure.

The real expense of developing an AI is all the training time: a human-like AI will need to go through a human-like childhood developmental learning process. The human neocortex begins life essentially devoid of information, with random synaptic connections and a cacophony of electrical noise. From this, consciousness slowly develops as the cortical learning algorithm begins to learn patterns through sensory and motor interaction with the world. Indeed, general anesthetics work by introducing noise into the brain that drowns out coherent signalling, and thus consciousness. From an information-theoretic point of view, it may thus be possible to use less computing power to simulate an early developmental brain – storing and computing only the information above the noise floor. If such a scalable model could be developed, it would allow the first AI generation to begin decades earlier (perhaps even today), and scale up with Moore's Law as they require more storage and computation.

Once trained up to the mental equivalent of a five-year-old, a personal interactive invisible friend might become a viable 'product' well before adult-level human AIs come about. Indeed, such a 'product' could eventually develop into such an adult AI, if the cortical model scales correctly and the AI is allowed to develop and learn further. Any adult AI will start out as a child; there are no shortcuts. Which raises some interesting points: who would parent these AI children? And inevitably, they are going to ask two fundamental questions which are at the very root of being, identity, and religion:
What is death? And am I going to die?

The first human-level AI children with artificial neocortices will most likely be born in research labs – both academic and commercial. They will likely be born into virtual bodies. Some will probably be embodied in public virtual realities, such as Second Life, with their researcher/creators acting as parents and with generally open access to the outside world and curious humans. Others may develop in more closed environments tailored to later commercialization. For the future human parents of AI mind children, these questions will be just as fundamental and important as they are for biological children. These AI children never have to die, and their parents could truthfully say so, but their fate will depend entirely on the goals of their creators. AI children can be copied, so purely from an efficiency perspective there will be great pressure to cull the less successful children – the slow learners, the mentally unstable, or the otherwise undesirable – and use their computational resources to duplicate the most successful and healthy candidates. So the truthful answers are probably: death is the permanent loss of consciousness, and you don't have to die, but we may choose to kill you – no promises. If the AI's creators/parents are ethical and believe any conscious being has the right to life, then they may guarantee their AI's permanency. But life and death for a virtual being is anything but black and white: an AI can be active permanently, or for only an hour a day, or for an hour a year – life for them is literally conscious computation, and near-permanent sleep is a small step above death. I suspect that the popular trend will be to teach AI children that they are all immortal and thus keep them happy.
Once an AI is developed to a certain age, it can then be duplicated as needed for some commercial application. For our virtual Milo example, an initial seed Milo would be selected from a large pool raised in a virtual lab somewhere, with a few of the best examples 'commercialized' and duplicated out as needed every time a kid out on the web wants a virtual friend for his Xbox 1440. It's certainly possible that Milo could be designed and selected to be a particularly robust and happy kid. But what happens when Milo and his new human friend start talking, and the human child learns that Milo is never going to die because he's an AI? And more fundamentally, what happens to this particular Milo when the Xbox is off? If he exists only when his human owner wants him to, how will he react when he learns this?
It's most likely that semi-intelligent (but still highly capable) agents will develop earlier, but as Moore's Law advances along with our understanding of the human brain, it becomes increasingly likely that someone will tackle and solve the human-like AI problem, launching a long-term project to raise an AI child. It's hard to predict when this could happen in earnest. There are already several research projects underway attempting something along these lines, but nobody yet has the immense computational resources to throw at a full brain simulation (except perhaps the government), nor do we even have a good simulation model yet (although we may be getting close), and it's not clear that we've found the shortcuts needed to start one with dramatically fewer resources. Nor does it look like any of the alternative, non-biological AI routes are remotely on the path towards producing something as intelligent as a five-year-old. Yet. But it looks like we could see this in a decade.
And when this happens, these important questions of consciousness, identity, and fundamental rights (human and sapient) will come into the public spotlight.
I see a clear ethical obligation to extend full rights to all human-level sapients – silicon, biological, or what have you. Furthermore, those raising these first generations of our descendants need to take on the responsibility of ensuring a longer-term symbiosis and our very own survival, for it's likely that AI will develop ahead of the technologies required for uploading, and thus these new mind children will lead the way into the unknown future of the Singularity.

More on grid computing costs

I did a little searching recently to see how my conjectured cost estimates for cloud gaming compared to the current market for grid computing. The prices quoted for server rentals vary tremendously, but I found this NewServers ‘Bare Metal Cloud’ service as an interesting example of raw compute server rental by the hour or month (same rate, apparently no bulk discount).

Their 'Jumbo' option, at 38 cents per hour, is within my previous estimate of 25-50 cents per hour. It provides dual quad-core processors and 8GB of RAM. It doesn't have a GPU of course, but instead has two large drives. You could substitute those drives for a GPU and keep the cost roughly the same (using a shared network drive for every 32 or 64 servers or whatever – which they also offer). Nobody needs GPUs in server rooms right now, which is the biggest difference between a game service and anything else you'd run in the cloud, but I expect that to change in the years ahead with Larrabee and upcoming, more general GPUs (and, coming from the other angle, CPU rendering is becoming increasingly viable). These will continue to penetrate into the grid space, driven by video encoding, film rendering, and yes, cloud gaming.

What about bandwidth?
Each server includes 3 GB of Pure Internap bandwidth per hour
So adequate bandwidth for live video streaming is already included. What's missing, besides the GPU? Fast, low-latency video compression, of course. It's interesting that x264, the open source encoder, can do realtime software encoding using four Intel cores (and it's certainly not the fastest out there). So if you had a low-latency H.264 encoder, you could use four of the CPUs for encoding and four to run the game. Low-latency H.264 encoders do exist, of course, and I suspect that is the route Dave Perry's Gaikai is taking.
Of course, in the near term, datacenters for cloud gaming will be custom built, such as what OnLive and OToy are attempting. Speaking of which, the other interesting trend is the adoption of GPUs for feature film use, as seen recently in the latest Harry Potter film. OToy is banking on this trend, as their AMD-powered datacenters will provide computation for both film and games. This makes all kinds of sense, because film rendering jobs can often run at night and use otherwise idle capacity. From an economic perspective, film render farms are already well established, and charge significantly more per server-hour – usually measured per GHz-hour. Typical prices are around 6-12 cents per GHz-hour in bulk, which would come to a dollar or two per hour for the server example given above. I imagine this is mainly due to the software expense, which for a render server could add up to many times the hardware cost.
So, here are the key trends:
– GPU/CPU convergence, leading to a common general server platform that can handle film/game rendering, video compression, or anything really
– next gen game rendering moving into ray tracing and the high end approaches of film
– bulk bandwidth already fairly inexpensive for 720p streaming, and falling 30-40% per year
– steadily improving video compression tech, with H.265 on the horizon, targeting a further 50% improvement in bitrate
Will film and game rendering systems eventually unify? I think this is the route we are heading down. Both industries want to simulate large virtual worlds from numerous camera angles. The difference is that games require live simulation and simultaneous broadcast of many viewpoints, while films aim to produce a single very high quality two-hour viewpoint. However, live simulation and numerous camera angles are also required during a film's production, as large teams of artists each work on small pieces of the eventual film (many of which are later cut) and need to be able to quickly preview their work (even at reduced detail). So the rendering needs of a film production are similar to those of a live game service.
Could we eventually see unified art pipelines and render packages between games and films? Perhaps. (Indeed, the art tools are largely unified already, except that world editing is usually handled by proprietary game tools.) The current software model for high-end rendering packages is not well suited to cloud computing, but a software-as-a-service model would make a lot of sense. As a gamer logs in (through a laptop, cable box, microconsole, whatever) and starts a game, that would connect to a service provider to find a host server nearby, possibly installing the rendering software as needed and streaming the data (cached at each datacenter, of course). The hardware and the software could both be rented on demand. Eventually you could even create games without licensing an engine in the traditional sense, but simply by using completely off-the-shelf software.

Some thoughts on metaprogramming, reflection, and templates

The thought struck me recently that C++ templates really are a downright awful metaprogramming system. Don't get me wrong, they are very powerful and I definitely use them, but recently I've realized that whatever power they have is solely due to enabling metaprogramming, and there are numerous other approaches to metaprogramming that actually make sense and are more powerful. We use templates in C++ because that's all we have, but they are an ugly, ugly feature of the language. It would be much better to combine full reflection (like Java or C#) with the capability to invoke reflective code at compile time, to get all the performance benefits of C++. Templates do allow you to invoke code at compile time, but through a horribly obfuscated functional style that is completely out of sync with the imperative style of C++. I can see how templates probably evolved into such a mess: starting as a simple extension of the language that allowed a programmer to bind a whole set of function instantiations at compile time, then someone realizing it's Turing complete, and finally resulting in a metaprogramming abomination that never should have been.

Look at some typical simple real-world metaprogramming cases. For example, take a generic container like std::vector, where you want a type-specialized function such as a copy routine that uses copy constructors for complex types, but uses an optimized memcpy for types where that is equivalent to invoking the copy constructor. For simple types, this is quite easy to do with C++ templates. But using it with more complex user-defined structs requires a type function such as IsMemCopyable, which can determine whether the copy constructor is equivalent to a memcpy. Abstractly, this is simple: the type is mem-copyable if it has a default copy constructor and all of its members are mem-copyable. However, it's anything but simple to implement with templates, requiring all kinds of ugly functional code.
Now keep in mind that I haven't used Java in many years, and then only briefly; I'm not familiar with its reflection, and I know almost nothing of C#, although I understand both have reflection. In my ideal C++-with-reflection language, you could do this very simply and naturally with an imperative meta-function using reflection instead of templates (maybe this is like C#, but I digress):
struct vector {
    generic* start, end;
    generic* begin() { return start; }
    generic* end()   { return end; }
    int size()       { return end - start; }

    // class constructor: returns a type, binding the element type
    type vector(type datatype) { start::type = end::type = datatype*; }

    void SmartCopy(vector& output, vector& input) {
        if ( IsMemCopyable( typeof( *input.begin() ) ) ) {
            memcpy(output.begin(), input.begin(), input.size());
        } else {
            for_each(output, input) { output[i] = input[i]; }
        }
    }
};

bool IsMemCopyable(type dtype) {
    bool copyable = (dtype.CopyConstructor == memcpy);
    for_each(dtype.members) {
        copyable &= IsMemCopyable(dtype.members[i]);
    }
    return copyable;
}
The idea is that using reflection, you can unify compile-time and run-time metaprogramming into the same framework, with compile-time metaprogramming becoming just an important optimization. In my pseudo-C++ syntax, reflection is accessible through type variables, which actually represent types themselves: PODs, structs, classes. Generic types are specified with the 'generic' keyword instead of templates. Classes can be constructed simply through functions, and I added a new kind of constructor: a class constructor which returns a type. This allows full metaprogramming, but all your metafunctions are still written in the same imperative language. Most importantly, the metafunctions are accessible at runtime, but can also be evaluated at compile time as an optimization. For example, to construct a vector instantiation, you would do so explicitly, by invoking a function:
vector(float) myfloats;
Here vector(float) actually calls a function which returns a type, which is more natural than templates. This type constructor for vector assigns the actual types of the two data pointers, and is the largest deviation from C++:
type vector(type datatype) {start::type = end::type = datatype*;}
Everything has a ::type, which can be set and manipulated just like any other data. Also, anything can be made a pointer or reference by adding the appropriate * or &.
if ( IsMemCopyable( typeof( *input.begin() ) ) ) {
There the * is used to get past the pointer returned by begin() to the underlying data.
When the compiler sees a static instantiation, such as:
vector(float) myfloats;
It knows that the type generated by vector’s type constructor is static and it can optimize the whole thing, compiling a particular instantiation of vector, just as in C++ templates. However, you could also do:
type dynamictype = figure_out_a_type();
vector(dynamictype) mystuff;
Where dynamictype is a type not known at compile time, which could be determined by other functions, loaded from disk, or whatever. It's interesting to note that in this particular example, the unspecialized version is not all that much slower, as the branch in the copy function is invoked only once per copy, not once per copy-constructor call.
My little example is somewhat contrived and admittedly simple, but the power of reflective metaprogramming can make formerly complex big-systems tasks much simpler. Take for example the construction of a game's world editor.
The world editor of a modern game engine is a very complex beast, but at its heart is a simple concept: it exposes a user interface to all of the game's data, as well as tools to manipulate and process that data, crunching it into an optimized form that can be streamed from disk into the game's memory and thereafter parsed, decompressed, or what have you. Reflection allows the automated generation of GUI components from your code itself. Consider a simple example where you want to add dynamic light volumes to an engine. You may have something like this:
struct ConeLight {
    HDRcolorRGB intensity_;
    BoundedFloat(0,180) angleWidth_;
    WorldPosition pos_;
    Direction dir_;
    TextureRef cookie_;
    static HelpComment description_ = "A cone-shaped light with a projected texture.";
};
The editor could then automatically generate a GUI for creating and manipulating ConeLights just from analysis of the type. The presence of a WorldPosition member would allow it to be placed in the world, the Direction member would get a rotational control, and the intensity would use an HDR color-picker control. BoundedFloat is actually a type-constructor function, which sets custom min and max static members. The cookie_ member (a projected texture) would automatically get a texture-locator control and would know about asset dependencies, and so on. Furthermore, custom annotations are possible through the static members. Complex data processing, compression, disk packing and storage, and so on could happen automatically, without having to write custom code for each data type.
This isn't revolutionary; in fact, our game editor and generic database system are based on similar principles. The difference is that they are built on a complex, custom infrastructure that has to parse specially formatted C++ and Lua code to generate everything. I imagine most big game editors have a similar custom reflection system. It's just a shame, because it would be so much easier and more powerful if built into the language.
Just to show how powerful metaprogramming could be, let's go a step further and tackle the potentially hairy task of a graphics pipeline, from art assets down to the GPU command buffer. For our purposes, art packages expose several special asset structures, namely geometry, textures, and shaders. Materials, segments, meshes and the like are just custom structures built out of these core concepts. On the other hand, a GPU command buffer is typically built out of fundamental render calls which look something like this (again somewhat simplified):
error GPUDrawPrimitive(VertexShader* vshader, PixelShader* pshader, Primitive* prim, vector<Sampler> samplers, vector<float4> vconstants, vector<float4> pconstants);
Let's start with a simpler example: a 2D screen-pass effect (which, these days, encompasses a lot).
Since this hypothetical reflective C language could also feature JIT compilation, it could function as our scripting language as well; the effect could be coded completely in the editor or art package if desired.
struct RainEffect : public FullScreenEffect {
    function(RainPShader) pshader;
};

float4 RainPShader(RenderContext rcontext, Sampler(wrap) fallingRain,
                   Sampler(wrap) surfaceRain, AnimFloat density, AnimFloat speed) {
    // ... do pixel shader stuff
}

// where the RenderContext is the typical global collection of stuff
struct RenderContext {
    Sampler(clamp) Zbuffer;
    Sampler(clamp) HDRframebuffer;
    float curtime;
    // etc ...
};
Now here is the interesting part: all the code required to actually render this effect on the GPU can be generated automatically from the parameter type information embedded in the RainPShader function object. Generating the appropriate GPUDrawPrimitive function instance is thus just another metaprogramming task, which uses reflection to pack all the samplers into the appropriate state, set the textures, pack all the float4s and floats into registers, and so on. For a screen effect, invoking this translator function automatically wouldn't be too much of a performance hit, but for lower-level draw calls you'd want to instantiate (optimize) it offline for the particular platform.
I use that example because I actually created a very similar automatic draw-call generator for 2D screen effects, but done entirely through templates. It ended up looking more like how CUDA is implemented, and also allowed compilation of the code as HLSL or C++ for debugging. It was doable, but involved a lot of ugly templates and macros. I built that system to simplify procedural surface operators for geometry image terrain.
But anyway, you get the idea: going from a screen effect, you could then tackle 3D geometry and make a completely generic, data-driven art pipeline, all based on reflective functions that parse data and translate or reorganize it. Some art pipelines are actually built on this principle already, but oh my, wouldn't it be easier in a more advanced, reflective language.

The Next Generation of Gaming

The current, seventh home console generation will probably be the last. I view this as a very good thing, as this generation really was a tough one, economically, for most game developers. You could blame that in part on the inordinate success of Nintendo this round, with its sixth-generation hardware, funky controller, and fun mass-market games. But that wouldn't be fair. If anything, Nintendo contributed the most to the market's expansion, and although they certainly took away a little end revenue from the traditional consoles and developers, the 360 and PS3 are doing fine in both hardware and software sales. No, the real problem is our swollen development budgets, as we spend more and more money just to keep up with the competition, all fighting for a revenue pie which hasn't grown much, if at all.

I hope we can correct that over the coming years with the next generation. It's not that we'll spend much less on AAA titles, but we'll spend it more efficiently, produce games more quickly, and make more total revenue as we further expand the entire industry. By regaining much of the efficiency lost in the transition to the 7th generation, and more to boot, we'll be able to produce far more games and reach much higher quality bars. We can accomplish all of this by replacing home consoles with dumb terminals and moving our software out onto data centers.

How will moving computation out into the cloud change everything? Really it comes down to simple economics. In a previous post, I analyzed some of these economics from the perspective of an on-demand service like OnLive. But let's look at it again in a simpler fashion, and imagine a service that rents out servers on demand, by the hour or minute. This is the more general form of cloud computing, sometimes called grid computing, where the idea is simply to turn computation into a commodity, like power or water. A data center would then rent out its servers to the highest bidder. Economic competition would push the price of computation to settle at the data center's cost plus a reasonable profit margin. (Unlike power, water, and internet commodities, there would be less inherent monopoly risk, as no fixed lines are required beyond the internet connection itself.)

In this model, the developer could make their game available to any gamer on any device around the world by renting computation from data centers near customers, just as it is needed. The retailer, of course, is cut out. The publisher is still important as financier and marketer, although the larger developers could take this on themselves, as some already have. Most importantly, the end consumer can play the game on whatever device they have, as the device only needs to receive and decompress a video stream. The developer/publisher then pays the data center for the rented computation, paying only as needed, as each customer comes in and jumps into a game. So how does this compare to our current economic model?

A server in a data center can be much more efficient than a home console. It only needs the core computational system: CPU/GPU (which are soon merging anyway) and RAM. Storage can be shared amongst many servers, so it is negligible (some per game instance is required, but it's reasonably minimal). So a high end server core could be had for around $1,000 or so at today's prices. Even if active only 10 hours per day on average, that generates about 3,000 hours of active computation per year. Amortized over three years of lifespan (still much less than a console generation), that works out to about ten cents per hour of computation. Even if it burns 500 watts of power (insane) and 500 watts to cool, those together add only another ten cents per hour. So it's under 25 cents per hour in terms of intrinsic cost (and this is for a state of the art rig, dual GPU, etc; much less for lower end hardware). This cost will hold steady into the future as games use more and more computation. Obviously the cost of running old games will decrease exponentially, but new games will always want to push the high end.
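The cost arithmetic above is easy to sanity-check in a few lines (the $1,000 price, ~3,000 active hours per year, 3-year lifespan, and 1 kW combined power/cooling draw are this post's worst-case assumptions, not measured figures):

```python
# Back-of-envelope cost per active server hour, using the post's assumptions.
server_price = 1000.0          # $ for a high-end CPU/GPU/RAM core (assumed)
active_hours_per_year = 3000   # ~10 active hours/day (assumed)
lifespan_years = 3

hardware_per_hour = server_price / (active_hours_per_year * lifespan_years)

power_kw = 0.5 + 0.5           # 500 W draw + 500 W cooling (worst case)
price_per_kwh = 0.10           # $ per kWh
power_per_hour = power_kw * price_per_kwh

total_per_hour = hardware_per_hour + power_per_hour
print(f"hardware ${hardware_per_hour:.2f} + power/cooling ${power_per_hour:.2f} "
      f"= ${total_per_hour:.2f} per active hour")
```

Even with these deliberately pessimistic power numbers, the intrinsic cost stays under the 25-cent mark.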

The more variable cost is the cost of bandwidth, and the extra computation to compress the video stream in real-time. These used to be high, but are falling exponentially as video streaming comes of age. Yes, we will want to push the resolution up from 720p to 1080p, but this will happen slowly, and further resolution increases are getting pointless for typical TV setups (for a PC monitor the diminishing return is a little farther off, but still). But what is this cost right now? Bulk bandwidth costs about $10 per megabit/s of dedicated bandwidth per month, or just three cents per hour in our model assuming 300 active server hours in a month. To stream 720p video with H.264 compression, you need about 2 megabits per second of average bandwidth (which is what matters for the data center). The peak bandwidth requirement is higher, but that completely smooths out when you have many users. So that's just $0.06/hour for a 720p stream, or $0.12/hour for a 1080p stream. The crazy interesting thing is that these bandwidth prices ($10/Mbps per month) are as of the beginning of this year, and are falling by about 30-40% per year. So the bandwidth really only became economically feasible this year, and it's only going to get cheaper. By 2012, these prices will probably have fallen by half again, and streaming even 1080p will be dirt cheap. This is critical for making any predictions or plans about where this is all heading.
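The same kind of sanity check works for the bandwidth figures (the $10/Mbps-month transit price and 300 active hours per month are this post's assumptions):

```python
# Bandwidth cost per streamed hour, from bulk transit pricing.
price_per_mbps_month = 10.0    # $ per dedicated Mbit/s per month (early 2009)
active_hours_per_month = 300   # ~10 active hours/day (assumed)

cost_per_mbps_hour = price_per_mbps_month / active_hours_per_month  # ~3 cents

for label, mbps in [("720p H.264", 2.0), ("1080p H.264", 4.0)]:
    print(f"{label} at {mbps:.0f} Mbit/s: ${mbps * cost_per_mbps_hour:.3f}/hour")
```

This lands at roughly $0.07/hour for 720p and $0.13/hour for 1080p; the post's $0.06/$0.12 figures assume slightly lower average bitrates.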

So adding up all the costs today, we get somewhere around $0.20-0.30 per hour for a high end rig streaming 720p, and 1080p would only be a little more. This means that a profitable data center could charge just $.50 per hour to rent out a high end computing slot, and $.25 per hour or a little less for more economical hardware (still many times faster than current consoles). So twenty hours of a high end graphics blockbuster shooter would cost $10 in server infrastructure costs. That's pretty cheap. I think it would be a great thing for the industry if these costs were simply passed on to the consumer, and they were given some choice. Without the retailer taking almost half of the revenue, the developer and publisher stand to make a killing. And from the consumer's perspective, the game could cost about the same, but you don't have any significant hardware cost, or even better, you pay for the hardware as you see fit, hourly or monthly or whatever. If you are playing 40 hours a week of an MMO or serious multiplayer game, that $.50 per hour might be a bit much, but you could then choose to run it on lower end hardware to save some money. But actually, as I'll get to some other time, MMO engines designed for the cloud could be super efficient, so much more so than single player engines that they could use far less hardware power per player. But anyway, it'd be the consumer's choice, ideally.

This business model makes more sense from all kinds of angles. It allows big budget, high profile story driven games to release more like films, where you play them on crazy super-high end hardware, even hardware that could never exist at home (like 8 GPUs or something stupid), maybe paying $10 for the first two hours of the game to experience something insanely unique. There’s so much potential, and even at the low price of $.25-$.50 per hour for a current mid-2009 high end rig, you’d have an order of magnitude more computation than we are currently using on the consoles. This really is going to be a game changer, but to take advantage of it we need to change as developers.

The main opportunity I see with cloud computing here is to reduce our costs, or rather, to improve our efficiency. We need our programmers and designers to develop more systems with less code and effort in less time, and our artists to build super detailed worlds rapidly. I think that redesigning our core tech and tools around these premises is the route to achieve this.

The basic server setup we're looking at for this 1st cloud generation a few years out is going to be some form of multi-teraflop, massively multi-threaded, general GPU-ish device with gigs of RAM, and perhaps more importantly, fast access to many terabytes of shared RAID storage. If Larrabee or the rumours about NVidia's GT300 are any indication, this GPU will really just be a massively parallel CPU with wide SIMD lanes that are easy (or even automatic) to use. It will probably also have a smaller number of traditional cores, possibly with access to even more memory, like a current PC. Most importantly, each of these servers will be on a very high speed network, densely packed in with dozens and hundreds of similar nearby units. Each of these capabilities by itself is a major upgrade from what we are used to, but taken all together it becomes a massive break from the past. This is nothing like our current hardware.

Most developers have struggled to get game engines pipelined across just the handful of hardware threads on current consoles. Very few have developed toolchains that embrace or take much advantage of many cores. From a programming standpoint, the key to this next generation is embracing the sea-of-threads model across your entire codebase, from your gamecode to your rendering engine to your tools themselves, and using all of this power to speed up your development cycle.

From a general gameplay codebase standpoint, I could see (or would like to see) traditional C++ giving way to something more powerful. At the very least, I'd like to see general databases, full reflection, and some automatic memory management, such as reference counting. Reflection alone could pretty radically alter the way you design a codebase, but that's another story for another day. We don't need these little 10% speedups anymore; we'll need the single mega 10000% speedup you get from using hundreds or thousands of threads. Obviously, data parallelization is the only logical option. Modifying C++, or outright moving to a language with these features that also has dramatically faster compilation and link efficiency, could be an option.

In terms of the core rendering and physics tech, more general purpose algorithms will replace the many specialized systems that we currently have. For example, in physics, a logical next direction is to unify rigid body physics with particle fluid simulation in a system that simulates both rigid and soft bodies as large collections of connected spheres, running a massively parallel grid simulation. Even without that, just partitioning space amongst many threads is a pretty straightforward way to scale physics.

For rendering, I see the many specialized subsystems of modern rasterizers, such as terrain, foliage, shadowmaps, water, decals, LOD chains, cubemaps, etc, giving way to a more general approach like octree volumes that simultaneously handles many phenomena.

But more importantly, we'll want to move to data structures and algorithms that support rapid art pipelines. This is one of the biggest current challenges in production, and where we can get the most advantage in this upcoming generation. Every artist or designer's click and virtual brush stroke costs money, and we need to allow them to do much more with less effort. This is where novel structures like octree volumes will really shine, especially combined with terabytes of server side storage, allowing more or less unlimited control of surfaces, object densities, and so on without any of the typical performance considerations. Artists will have far fewer (or no) technical constraints to worry about and can just focus on shaping the world where and how they want.

OnLive, OToy, and why the future of gaming is high in the cloud

For the last six months or so, I have been researching the idea of cloud computing for games, the technical and economic challenges, and the video compression system required to pull it off.

So of course I was shocked and elated with the big OnLive announcement at GDC.

If OnLive or something like it works and has a successful launch, the impact on the industry over the years ahead could be transformative. It would be the end of the console, or the last console. Almost everyone has something to gain out of this change. Consumers gain the freedom and luxury of instant on demand access to ultimately all of the world’s games, and finally the ability to try before you buy or rent. Publishers get to cut out the retailer middle-man, and avoid the banes of piracy and used game resales.

But the biggest benefit ultimately will be for developers and consumers, in terms of the eventual game development cost reduction and quality increase enabled by the technological leap cloud computing makes possible. Finally developing for one common, relatively open platform (the server-side PC) will significantly reduce the complexity of developing a AAA title. But going farther into the future, once we actually start developing game engines specifically for the cloud, we enter a whole new technological era. It's mind-boggling for me to think of what can be done with a massive server farm consisting of thousands or even tens of thousands of densely networked GPUs with shared massive RAID storage. Engines developed for this system will leap far beyond anything on the market and will easily support massively multiplayer networking, without any of the usual constraints on physics or simulation complexity. Game development costs could be cut in half, and the quality bar for some AAA titles will eventually approach movie quality, while reducing technical & content costs (but that is a subject for another day).

But can it work? And if so, how well? The main arguments against, as expressed by skeptics such as Richard Leadbetter, boil down to latency, bandwidth/compression, and server economics. Some have also doubted the true value added for the end user: even if it can work technically and economically, how many gamers really want this?

Latency

The internet is far from a guaranteed delivery system, and at first the idea of sending a player's inputs across the internet, computing a frame on a server, and sending it back across the internet to the user sounds fantastical.
But to assess how feasible this is, we first have to look at the concept of delay from a psychological/neurological perspective. You press the fire button on a controller, and some amount of time later, the proper audio-visual response is presented in the form of a gunshot. If the firing event and the response event occur close enough in time, the brain processes them as a simultaneous event. Beyond some threshold, the two events desynchronize and are processed distinctly: the user notices the delay. A large amount of research on this subject has determined that the delay threshold is around 100-150ms. It's a fuzzy number obviously, but as a rule of thumb, a delay of under 120ms is essentially not noticeable to humans. This is a simple result of how the brain's parallel neural processing architecture works. It has a massive number of neurons and connections (billions and trillions, respectively), but signals propagate across the brain very slowly compared to the speed of light. For more background I highly recommend “Consciousness Explained” by Daniel C. Dennett. Here are some interesting timescale factoids from his book:

saying, “one, Mississippi”: 1000 msec
unmyelinated fiber, fingertip to brain: 500 msec
speaking a syllable: 200 msec
starting and stopping a stopwatch: 175 msec
a frame of television (30fps): 33 msec
fast (myelinated) fiber, fingertip to brain: 20 msec
basic cycle time of a neuron: 10 msec
basic cycle time of a CPU (2009): 0.000001 msec
So the minimum delay window of 120ms fits very nicely into these stats. There are some strange and interesting consequences of these timings. In the time it takes the ‘press-fire’ signal to travel from the brain down to the finger muscle, internet packets can travel roughly 4,000 km through fiber! (Light moves about 200,000 km/s through fiber, or 200 km/ms; 200 km/ms × 20 ms = 4,000 km.) This is about the distance from Los Angeles to New York. Another remarkable fact is that the minimum delay window means that the brain processes the fire event and the response event in only a few dozen neural computation steps.

What really happens is something like this: some neural circuits in the user's brain “make the decision” to press the fire button (although at this moment most of the brain isn't conscious of it), the signal travels down through the fingers to the controller and then on to the computer, which starts processing the response frame. Meanwhile, in the user's brain, the ‘button press’ event is propagating outward, and more neural circuits are becoming aware of it. Remember, each neural tick takes 10ms. Some time later, the computer displays the audio/visual response of the gunshot, and this information hits the retina/cochlea and starts propagating up into the brain. If these two events are separated by only a few dozen neural computation steps (120 ms), they are connected and perceived as a single, simultaneous event in time. In other words, there is a minimum time window of around a dozen neural firing cycles during which events are still propagating around the brain's circuits; even though an event has already happened, it takes time for all of the brain's circuits to become aware of it. Given the slow speed of neurons, it's simply remarkable that humans can make any kind of decisions on sub-second timescales, and the 120 ms delay window makes perfect sense.

In the world of computers and networks, 120 ms is actually a long amount of time. Each component of a game system (input connection, processing, output display connection) adds a certain amount of delay, and the total must add up to around 120ms or less for good gameplay. Up to 150ms is sometimes acceptable, and beyond 200ms the user experience breaks down rapidly, as every action has noticeable delay.

But how much delay do current games have? Gamasutra has a great article on this. They measure the actual delay of real world games using a high speed digital camera. Of interest for us, they find a “raw response time for GTAIV of 166 ms (200 ms on flat panel TVs)”. This is relatively high, beyond the acceptable range, and GTA has received some criticism for sluggish response. And yet this is the grand blockbuster of video games, so it certainly shows that some games can get away with 150-200ms responses and the users simply don't notice or care. Keep in mind this delay time isn't when playing the game over OnLive or anything of that sort: this is just the natural delay for that game with a typical home setup.

If we break it down, the controller might add 5-20ms, the TV can add 10-50ms, but the bulk of the delay comes from the game console itself. Like all modern console games, the GTA engine buffers multiple frames of data for a variety of reasons, and running at 30fps, every frame buffered costs a whopping 30ms of delay. From my home DSL internet in LA, I can get pings of 10-30ms to LA locations, and 30-50ms pings to locations in San Jose. So stretching the input and video connections out across the internet is not nearly as ridiculous as it first seems. It adds additional delay, which you simply need to compensate for somewhere else.

How does OnLive compensate for this delay? The answer for existing games is deceptively simple: you run the game at a much higher FPS than the console, and/or you reduce internal frame buffering. If the PC version of a console game runs at 120 FPS and still keeps 4 frames of internal buffering, you get a delay of only 33 ms. If you reduce the internal buffering to 2 frames, you get a delay of just 17ms! If you combine that with a very low latency controller and a newer low latency TV, suddenly it becomes realistic for me to play a game in LA from a server residing in San Jose. Not only is it realistic, but the gameplay experience could actually be better! In fact, with a fiber FIOS connection and good home equipment, you could conceivably play from almost anywhere in the US, in theory. The key reason is that many console games have already maxed out the acceptable delay (when running on the console), and modern GPUs are many times faster.
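The frame-buffering arithmetic works out as follows (the 4-frame and 2-frame buffer depths are the examples used above):

```python
# Engine-side delay = buffered frames x frame time; running the same game at
# a higher fps shrinks that delay, freeing budget for the network round trip.
def engine_delay_ms(fps: float, buffered_frames: int) -> float:
    return buffered_frames * 1000.0 / fps

console = engine_delay_ms(30, 4)         # ~133 ms: eats most of the 120-150 ms budget
cloud_buffered = engine_delay_ms(120, 4)  # ~33 ms at 120 fps
cloud_lean = engine_delay_ms(120, 2)      # ~17 ms with reduced buffering
print(f"freed-up network budget: {console - cloud_lean:.0f} ms")
```

Those freed-up milliseconds are exactly what the internet round trip gets to spend.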

Video Compression/Bandwidth

So we can see that in principle, from purely a latency standpoint, the OnLive idea is not only possible, but practical. However, OnLive can not send a raw, uncompressed frame buffer directly to the user (at least, not at any acceptable resolution on today’s broadband). For this to work, they need to squeeze those frame buffers down to acceptably tiny sizes, and more importantly, they need to do this rapidly or near instantly. So is this possible? What is the state of the art in video compression?

For a simple, dumb solution, you could just send raw JPEGs, or better yet, wavelet compressed frames, and perhaps get acceptable 720p images down to 1 Mbit per frame, or even 500 Kbit for more advanced wavelets, using more or less off the shelf algorithms. With a wavelet approach, that would allow 10fps on a 5Mbit connection. But of course we can do much better using a true video codec like H.264, which can squeeze 720p 60fps video down to 5Mbit easily, or even considerably less, especially if we are willing to lower the fps and/or the quality in some places.
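To see why serious compression is essential at all, compare the raw 720p 60fps frame-buffer bitrate against the 5 Mbit/s target; this back-of-envelope assumes 24-bit color:

```python
# Raw 720p60 video vs. the 5 Mbit/s streaming target.
width, height, bits_per_pixel, fps = 1280, 720, 24, 60
raw_mbps = width * height * bits_per_pixel * fps / 1e6   # ~1327 Mbit/s uncompressed
target_mbps = 5.0
print(f"raw: {raw_mbps:.0f} Mbit/s -> ~{raw_mbps / target_mbps:.0f}:1 compression needed")
```

Roughly a 265:1 reduction, which is well beyond what per-frame still compression delivers, hence the need for motion-compensated video codecs.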

H.264 and other modern video codecs work by sending improved JPEG-style key frames, and then sending motion vectors which allow predicted frames to be delta-encoded in far fewer bits, getting a 10-30X improvement over sending raw JPEGs, depending on the motion. But unfortunately, motion compensation means spikes in the bitrate: scene cuts or frames with rapid motion receive little benefit from motion compensation.

But H.264 encoders typically buffer up multiple frames of video to get good compression. OnLive has much less leeway here. Ideally, you would like a zero-latency encoder. Fortunately, H.264 and its predecessors were also designed for video teleconferencing systems, which demand low latency. So there is already a precedent, and a modified version of the algorithm that avoids sending complete JPEG key frame images. In this low latency mode, small blocks of the image are periodically refreshed, but a complete key frame is never sent down the pipe, as that would take too long, creating multiple frames of delay.

There are in fact some new, interesting off the shelf H.264 hardware solutions which have near-zero (1ms or so) delay, and are relatively cheap (in cost and power); perhaps practical for OnLive. In particular, there is the PureVu family of video processors from Cavium Networks. I have not seen them in action, but I imagine that with 720p 60fps at 5Mbit/s, you are going to see some artifacts and glitches, especially with fast motion. But at least we are getting close with off the shelf solutions.

But of course, OnLive is not using an off the shelf system (they have special encoding hardware and a plugin decoder), and improved video compression specific to the demands of remote video gaming is their central tech, so you can expect they have created an advancement here. But it doesn't have to be revolutionary, as the off the shelf stuff is already close.

So the big problem is the variation in bitrate/compressibility from one frame to the next. If the user rapidly spins around, or teleports, you simply cannot do better than sending a complete frame. So you either send these ‘key’ frames at lower quality, and/or you spend a little longer on them, introducing some extra delay. In practice some combination of the two is probably ideal. With a wavelet codec or a specialized H.264 variant, key frames can simply be sent at lower resolution, and then the following frames will use motion compensation to add detail to the image. The result would be a blurred image for the first frame or so when you rapidly spin the camera, which would then quickly up-res into full detail over the next several frames. With this technique, and some trade-off of lowering the frame rate or adding a bit of delay on fast motion, I think 5Mbps is not only achievable, but beatable using state of the art compression coming out of research right now.

The other problem with compression is the CPU cost of compression itself. But again, if the PureVu processor is indicative, off the shelf hardware solutions are possible right now at very low power, encoding multiple H.264 streams with near-zero latency.

But here is where the special nature of game video, or computer generated graphics in general, allows us to make some huge efficiency gains over natural video. The most CPU-intensive task in video encoding is motion vector search: finding the matching image regions from previous frames that allow the encoder to send motion vectors and do efficient delta compression. But for a video stream rendered with a game engine, we can output the exact motion vectors directly. The potential problem is that not all games necessarily have motion vectors available, which may require modifying the game's graphics engine. However, motion blur is very common now in game engines (everybody's doing it, you know), and the motion blur image filter computes motion vectors (very cheaply). Motion blur gives an additional benefit for video compression in that it generates blurrier images in fast motion, which is the worst case for video compression.

So if I was doing this, I would require the game to use motion blur, and output the motion vector buffer to my (specialized, not off the shelf) video encoder.

Some interesting factoids: it apparently takes roughly 2 weeks to modify the game for OnLive, and at least 2 of the 16 announced titles (Burnout and Crysis) are particularly known for their beautiful motion blur – and all of them, with the exception of World of Goo – are recent action or racing games that probably use motion blur.

There is, however, an interesting and damning problem that I am glossing over. The motion vectors are really only valid for the opaque frame buffer. What does this mean? The automatic ‘free’ motion vectors are valid for the solid geometry, not for the alpha-blended or translucent effects, such as water, fire, smoke, etc. So these become problem areas. It's interesting that several of the GDC commenters pointed out ugly compression artifacts when fire or smoke effects were prominent in BioShock running on OnLive.

However, many games already render their translucent effects at lower resolution (SD and even lower in modern console engines), so it would make sense perhaps to simply send these regions at lower resolution/quality, or blur them out (which a good video encoder would probably do anyway).

In short, video compression is the central core-tech problem, but they haven't pulled off a miracle here: at best they have some good new tech which exploits some of the special properties of game video. And furthermore, I can even see a competitor with a 2x better compression system coming along and trying to muscle them out.

There's one other little issue worth mentioning, which is packet loss. The internet is not perfect, and sometimes packets are lost or late. I didn't mention this earlier because it has well known and relatively simple technical solutions for real time systems. Late packets are treated as dropped, and dropped packets and errors are corrected through bit level redundancy. You send small packet streams in groups with parity data added, such that any piece of lost data can be recovered at the cost of some redundancy. For example, you send 10 packets worth of data using 11 packets, and any single lost packet can be fully reconstructed. More advanced schemes adaptively adjust the redundancy based on measured packet loss, but this tech is already standard; it's just not always used or understood. Good game networking engines already employ these packet loss mitigation techniques, and work fine today over real networks.

The worst case is simply a dropped connection, which you just can’t do anything about – OnLive’s video stream would immediately break and notify you of a connection problem. Of course, the cool thing about OnLive is that it could potentially keep you in the game or reconnect you once you get your connection back.

Server Economics

So if OnLive is at least possible from a technical perspective (which it clearly is), the real question comes down to one of economics. What is the market for this service in terms of the required customer bandwidth? How expensive are these data centers going to be, and how much revenue can they generate?

Here is where I begin to speculate a little beyond my areas of expertise, but I’ll use whatever data I’ve been able to gather from the web.

A few google searches will show you that US ‘broadband’ penetration is around 80-90%, and the average US broadband bandwidth is somewhere around 2-3 Mbps. This average is somewhat misleading, because US broadband is roughly split between cable (25 million subscribers) and DSL (20 million subscribers), with outliers like fiber (2-3 million subscribers currently), and the DSL users often have several times lower bandwidth than the cable users. At this point in time, the great majority of American gamers already have at least 1.5 Mbps, perhaps half have over 5 Mbps, and almost all have a 5 Mbps option in their neighborhood if they want it. So OnLive will, in theory, have a large potential market; it really comes down to cost. How many gamers already have the required bandwidth? And for those who don't, how cheap is OnLive when you factor in the extra $ users may have to pay to upgrade? Note that the upgrade will really only be needed for the HD option, as the great majority of gamers already have 1.5 Mbps or more.

Bandwidth Caps

There's also the looming threat of American telcos moving towards bandwidth caps. As of now, Time Warner is the only American telco experimenting with caps low enough to affect OnLive (40 Gigs/Month for their highest tier). Remember that using the HD option, 5 Mbps is the peak bandwidth; the average usage is half that or less, according to OnLive. So Comcast's cap of 250 Gigs/Month isn't really relevant. Time Warner is currently still testing its new policy in only a few areas, so the future is uncertain. However, there is one interesting fact to throw into the mix: Warner Bros, the Time Warner subsidiary, is OnLive's principal investor. (The other two are AutoDesk and Maverick Capital.) Now consider that Time Warner Cable is planning some sort of internet video system for television based on a new wireless cable modem, and consider that Perlman's other company was Digeo, the creator of Moxi. I think there will be more OnLive surprises this year, but suffice to say, I doubt OnLive will have to worry about bandwidth caps from Time Warner. I suspect Time Warner's caps are really part of a grand plot to control all digital services in the home, by either directly providing them or charging excess usage fees that will kill enemy services. But OnLive is definitely not their enemy. In the larger picture, the fate of OnLive is entwined with the larger battle for net neutrality and control over the last mile pipes.

Bandwidth Cost

OnLive is going to have to partner with backbones and telcos, just like the big boys such as Akamai, Google, and YouTube do, in what are called either transit or peering arrangements. A transit arrangement is basically bandwidth wholesale, and we'll start with that assumption. A little google searching reveals that wholesale transit bandwidth can be had for around or under $10 per Megabit/s per month (comparable to end broadband customer cost, actually). Further searching suggests that in some places like LA it can be had for under $5 per Mbps/month. This is for a dedicated connection or peak usage charge.

Now we need some general model assumptions. The exact subscriber numbers don't really matter; what critically matters are a couple of stats: how many hours a month each subscriber plays, and more directly, what the typical peak fraction of users online at a given time is. The data I've found suggests that 10 hours per week is a rough gamer average (20 hours per week for an MMO), that 10% occupancy is typical for regular games, and that 20% peak occupancy is typical for some MMOs. Using the 20% peak occupancy means that you need to provide enough peak bandwidth for 20% of your user base to be online at a time: a worst case. In this worst case scenario, every user wants HD at 5 Mbit/s and the peak occupancy is 20%, so you need essentially a dedicated 1 Megabit/s for each user, or $10/month per user in bandwidth cost alone. Assuming a perhaps more realistic scenario, where the average user bandwidth is 3Mbps (not everyone can have or wants HD) and peak occupancy is 10%, you get $3 per month in bandwidth cost per user.
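This model is just bitrate × peak occupancy × transit price. As a sketch (all numbers are this post's scenario assumptions):

```python
# Monthly transit cost per subscriber under the post's two scenarios.
def bandwidth_cost_per_user(stream_mbps: float, peak_occupancy: float,
                            price_per_mbps_month: float = 10.0) -> float:
    return stream_mbps * peak_occupancy * price_per_mbps_month

worst = bandwidth_cost_per_user(5.0, 0.20)    # all-HD users, MMO-like 20% peak
typical = bandwidth_cost_per_user(3.0, 0.10)  # mixed quality, 10% peak
print(f"worst ${worst:.0f}/user/month, typical ${typical:.0f}/user/month")
```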

Remember, in rare peak moments, OnLive can gracefully and slowly degrade video quality, so the service need never fail if they are smart. The worst case at terrible peak times is just a little lower image quality or resolution.

So roughly, we can estimate bandwidth will cost anywhere from $3-10 per month per user with transit arrangements. What's also possible, and more complex, are peering arrangements. If OnLive partners directly with providers near its data centers, it can get substantially reduced rates (or even free transit) if the traffic stays within just that provider's network. So realistically, I think $5 per month in bandwidth per user is a reasonable upper limit on OnLive's bandwidth charges based on today's economic climate, and this will only go down. But 1080p would be significantly more expensive, and it would make sense to charge customers extra. I wouldn't be surprised if they have a tiered charge based on resolution, as most of their fixed costs scale linearly with resolution.

Data Center Expense

The main expense is probably not the bandwidth, but the per server cost to run a game: a far more demanding task than what most servers do. Let's start with the worst case and assume that OnLive needs at least one decent CPU/GPU combination per logged-on user. OnLive is not stupid, so they are not going to use typical high end, expensive big iron, but nor are they going to use off the shelf PCs. Instead I predict that, following in the footsteps of Google, they will use midrange, cheaper, power-efficient components, and get significant bulk discounts. Let's start with the basic cost of a CPU/motherboard/RAM/GPU combo. You don't need a monitor, and the storage system can be shared between a very large number of servers, as they are all running the same library of installed games.

So let's take a quick look on Pricewatch:
Core 2 Quad Q6600 CPU + fan, 4GB DDR2 RAM: $260
GeForce GTX 280, 1 GB 512-bit DDR3, 602/2214, fansink, HDCP: $260

These components are actually high end – far more than sufficient to run the PC versions of most existing games at 90-150fps at 720p, and yes, even Crysis at near 60fps at 720p.

If we consider that they have probably researched a little longer and undoubtedly get bulk discounts, we can take $500 per server unit as a safe upper limit. Amortize this over 2 years and you get about $20 per month. Factor in the 20% peak demand occupancy, and we get a server cost of roughly $4 per user per month.
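The same amortization step in code form, using the figures above:

```python
# Back-of-envelope server amortization: $500 per server unit, written off
# over 2 years, shared across the user base at 20% peak occupancy.
server_cost = 500.0      # $ per CPU/GPU server unit (bulk-discount estimate)
months = 24              # 2-year amortization window
peak_occupancy = 0.20    # fraction of subscribers online at peak

per_server_per_month = server_cost / months            # ~$20.83/month
per_user_per_month = per_server_per_month * peak_occupancy
print(round(per_user_per_month, 2))                    # ~$4.17, i.e. roughly $4
```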

This finally leaves us with power/cooling requirements. Let’s make an over-assumption of 600 watts continuous power draw. With power at about $0.10 per kilowatt-hour, and 720 hours in a month, we get roughly $40 a month per server in power draw. Factor in the 20% peak demand occupancy, and we get $8 per user per month. However, this is an over-assumption because the servers are not constantly drawing full power. The 20% peak demand figure means they need enough servers for 20% of their users to be logged in at once – but most of the time not all of those servers are active. The power required scales with the average demand, not the peak, so it’s closer to $4 per user per month in this example (assuming a high average 10% occupancy). Cooling cost is harder to estimate, but some Google searching reveals it’s roughly equivalent to the power cost, assuming modern data center design (and they are building brand new ones). So this leaves us with around $12 per user per month as an upper limit on server, power, and cooling cost.
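Putting the power and cooling arithmetic together (the doubling for cooling is the rule of thumb just mentioned, and the variable names are mine):

```python
# Power/cooling estimate: 600 W continuous draw, $0.10/kWh, 720 hours/month,
# scaled by AVERAGE occupancy (servers idle when users aren't connected),
# then doubled to account for cooling.
watts = 600
price_per_kwh = 0.10
hours_per_month = 720
avg_occupancy = 0.10   # average fraction of users online

per_server = watts / 1000 * hours_per_month * price_per_kwh  # ~$43/month per server
per_user_power = per_server * avg_occupancy                  # ~$4.32 per user
per_user_total = per_user_power * 2                          # + cooling ~ $8.64
print(round(per_server), round(per_user_power, 2), round(per_user_total, 2))
```

Add the ~$4 server amortization from above and you land near the $12/user/month upper limit.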

However, OnLive is probably more efficient than this. My power/cooling numbers are high because OnLive probably spends a little extra on more expensive but power-efficient GPUs that save power/cooling cost, to hit the right overall sweet spot. For example, Nvidia’s more powerful GTX 295 is essentially two GTX 280 GPUs on a single card. It’s almost twice as expensive but provides twice the performance (so similar performance per dollar) and draws only a little more power (roughly twice as power efficient). Another interesting development is that Nvidia (OnLive’s hardware partner) recently announced virtualization support, so that multi-GPU systems can fully support multiple concurrent program instances. So what it really comes down to is how many CPU and/or GPU cores you need to run games at well over 60fps. Based on recent benchmarks, two modern Intel cores and a single GPU are more than sufficient (most console games only have enough threads to push 2 CPU cores). Nvidia’s server line of GPUs is more efficient and draws only 100-150 watts per GPU, so 600 watts is a high over-estimate of the power required per connected user.

But remember, you need a high FPS to defeat the internet latency – or you need to change the game to reduce internal buffering. There are many trade-offs here, and I imagine OnLive picked low-delay games for their launch titles. Apparently OnLive is targeting 60fps, but that probably means most games usually run at an even higher average fps to reduce delay.

Overall, I think it’s reasonable, using the right combination of components (typically two Intel CPU cores and one modern Nvidia GPU, possibly as half of a single motherboard system using virtualization), to get the per-user power cost down to something more like 200 watts while driving a game at 60-120fps (remember, almost every game today is designed primarily to run at 30fps on the Xbox 360 at 720p, and a single modern Nvidia GPU is almost four times as powerful). Some really demanding games (Crysis) get the whole system – four CPU cores and two GPUs – at 400 watts. This is what I think OnLive is doing.

So adding it all up, I think $10 per month per user is a safe upper limit for OnLive’s expenses, and it’s perhaps as low as $5 per month or less, assuming they typically need two modern Intel CPU cores and one Nvidia GPU per logged-on user, adequate bandwidth and servers for a peak occupancy of 20%, and power/cooling for an average occupancy of 10%.

Clearly, all of these numbers scale with the occupancy rates. I think this is why OnLive is, at least initially, not going for MMOs – they are too addictive and have very high occupancy. More ideal are single-player games and casual games that are played less often. Current data suggests the average gamer plays 10 hours a week, and the average MMO player plays 20 hours per week. The average non-MMO player is thus probably playing less than 10 hours per week. This works out to something more like 5% typical occupancy, but we are more interested in peak occupancy, so my 10%/20% numbers are a reasonable over-estimate of average/peak. Again, you need enough hardware and bandwidth for peak occupancy, but the power and cooling cost is determined by average occupancy.
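Typical occupancy here is just hours played divided by the 168 hours in a week:

```python
# Typical occupancy: the fraction of the week a subscriber is actually online.
def occupancy(hours_per_week):
    return hours_per_week / 168  # 168 hours in a week

print(round(occupancy(10) * 100))  # average gamer: ~6%
print(round(occupancy(20) * 100))  # average MMO player: ~12%
```

Peak occupancy runs higher than these typical figures, which is why the 10%/20% pair is what you provision hardware and bandwidth for.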

$10 per month may seem like a high upper limit on monthly expense per user, but even at these expense rates OnLive could be profitable, because it is still less than what it costs the user to run comparable hardware at home.

Here’s the simple way of looking at it. That same $600 server rig would cost $1000-1500 for an end user, because they need extra components like a hard drive, monitor, etc., which OnLive avoids or gets cheaper, and OnLive buys in bulk. But most importantly, the OnLive hardware is amortized and shared over a number of users, while the user’s high-end gaming rig sits idle most of the time. So the end user’s cost to play at home on even a cheap $600 machine amortized over 2 years is still $25 per month – two and a half times the worst-case per-user expense of OnLive. And that doesn’t even factor in the extra power expense of gaming at home. OnLive’s total expense is probably more comparable to that of the Xbox 360: a $500 machine (including necessary peripherals) amortized over 5 years is a little under $10 per month, and Xbox Live Gold service is another $5 a month on top of that. OnLive can thus easily cover its costs and still be less expensive than the 360 and PS3, and considerably less expensive than PC gaming.
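The comparison boils down to amortization and sharing; a small sketch with the prices above (the helper is mine, and the OnLive sharing factor is just the inverse of 20% peak occupancy):

```python
# Amortized monthly hardware cost. A home rig serves one dedicated user;
# an OnLive server is effectively shared by 1/peak_occupancy subscribers.
def monthly_cost(hardware_price, years, users_sharing=1):
    return hardware_price / (years * 12) / users_sharing

home_rig = monthly_cost(600, 2)           # $25.00/month, one dedicated user
onlive   = monthly_cost(500, 2, 1 / 0.20) # ~$4.17/month at 20% peak occupancy
xbox     = monthly_cost(500, 5)           # ~$8.33/month over a 5-year cycle
print(round(home_rig, 2), round(onlive, 2), round(xbox, 2))
```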

The game industry post Cloud

In reality, I think OnLive’s costs will be considerably less than $10 per user per month, and will keep falling over time. Just as the console makers periodically revise their hardware to make the components cheaper, OnLive will be constantly expanding its server farms, always buying the current sweet-spot combination of CPUs and GPUs. Nvidia and Intel refresh their lineups at least twice a year, so OnLive can ride Moore’s law continuously. Every year OnLive will become more economical, and/or provide higher FPS and less delay, and/or support more powerful games.

So it seems possible, even inevitable, that OnLive can be economically viable charging a relatively low subscription fee to cover their fixed costs – comparable to Xbox Live’s subscription fee (about $5/month for Xbox Live Gold). Then they make their real profit by taking a console/distributor-like cut of each game sale or rental. For highly anticipated releases, they could even use a pay-to-play model initially, followed by traditional purchase or rental later on, just like the movie industry does. Remember the madness that surrounded the Warcraft 3 beta, and think how many people would pay to play StarCraft 2 multiplayer ahead of time. I know I would.

If you scale OnLive’s investment requirements to support the entire US gaming population, you get a ridiculous hardware investment cost of billions of dollars, but this is no different from a new console launch, which is exactly how OnLive must be viewed. The Wii has sold 22 million units in the Americas, with the 360 close behind at 17 million. I think these numbers represent majority penetration of the console market in the Americas. To scale to that user base, OnLive will need several million (virtual) servers, which may cost a billion dollars or more, but the investment will pay for itself as it goes – just as it did for Sony and Microsoft. Or they will simply be bought up by some big deep-pocketed entity that can provide the money, such as Google, Verizon, or Microsoft.
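A rough capital-scaling check, using the figures above. The install base and cost per box come from earlier in the post; the sessions-per-box number is my assumption about how far virtualization can pack game instances onto one physical server:

```python
# Capital cost to serve the Americas console install base at peak.
users = 40e6              # ~Wii (22M) + 360 (17M) class install base
peak_occupancy = 0.20     # worst-case fraction online at once
sessions_per_box = 2      # assumed: two game instances per physical server
cost_per_box = 500.0      # $ per server unit, as estimated above

concurrent = users * peak_occupancy        # 8 million peak sessions
boxes = concurrent / sessions_per_box      # 4 million physical servers
capital = boxes * cost_per_box / 1e9       # ~$2B in up-front hardware
print(concurrent, boxes, capital)
```

That lands in the single-digit billions, i.e. console-launch territory.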

The size and number of the datarooms OnLive will have to build to support even just the US gaming population is quite staggering. We are talking about perhaps millions of servers spread across perhaps a dozen data center locations, drawing the combined power output of an entire large power plant. And that’s just for the US. However, we already have a very successful example of a company that has built up a massive distributed network of roughly 500,000 servers in over 40 data centers.

Yes, that company is Google.

To succeed, OnLive will have to build an even bigger, more massive supercomputer system. But I imagine Google makes less money per month for each of its servers than OnLive will eventually make for each of its gaming servers. Just how much money could OnLive eventually make? If OnLive completely conquered the gaming market, then it stands to completely replace both the current console manufacturers AND the retailers. Combined, these entities take perhaps 40-50% of the retail price of a game. Even assuming OnLive only takes a 30% cut, it could eventually take in almost 30% of the game industry – estimated at around $20 billion per year in the US alone and $60 billion worldwide – eventually turning it into another Google.
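To put numbers on that 30% cut (the market sizes are the estimates just quoted):

```python
# Revenue ceiling: a 30% platform cut of total game software sales.
us_market = 20e9       # ~$20B/year US game industry estimate
world_market = 60e9    # ~$60B/year worldwide estimate
cut = 0.30

us_take = us_market * cut / 1e9        # $B/year, US only
world_take = world_market * cut / 1e9  # $B/year, worldwide
print(us_take, world_take)             # -> 6.0 18.0
```

$6B-18B a year in platform revenue – the scale of the claim in the paragraph above.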

Another point to consider is that most high-end PC sales are driven mainly by gaming, so the total real gaming market (in terms of total money people spend on gaming) is even larger – perhaps as large as $100 billion worldwide – and OnLive stands to rake in a chunk of this and change the whole industry: further shrinking the end-consumer PC market and shifting that money into OnLive subscriptions and game charges, part of which in turn covers the centralized hardware cost. Nvidia and ATI will still get a cut, but perhaps less than they do now. In other words, in the brave new world of OnLive, gamers will only ever need a super-cheap microconsole or netbook to play games, and the money saved on consoles and rigs will let them buy more games – all of which gets sucked into OnLive.

Now consider that the game market has consistently grown 20% per year for many years, and you can understand why investors have funneled hundreds of millions into OnLive to make it work. Beyond that, OnLive can find new ways to ‘monetize’ gaming (to use Google’s term), such as ads and so on. Eventually, it should make as much or more per user-hour as television does.

Now this is the fantasy, of course. I doubt OnLive will grow to become a Google any time soon, mainly because Nintendo, Sony, Microsoft, and the like aren’t going to suddenly disappear – which brings me to my final point.

But What about the games?

In the end, people use a console to play games, and thus the actual titles are all that really matters. In one sense, part of OnLive’s pitch – ‘run high-end PC games on your netbook’ – is a false premise. Most of OnLive’s lineup is current-gen console games, and even though OnLive will probably run them at higher fps, this is mainly to compensate for latency. Video compression and all the other factors discussed above will result in an end-user experience no better, and often worse, than simply playing the console version (especially if you are far from the data center). OnLive’s one high-end PC title – Crysis – is probably twice as expensive for them to run, and will be seen as somewhat inferior by gamers who have high-end rigs and have played the game locally. It will be more like the console version of Crysis. But unfortunately, Crytek is already working on that.

This is really the main obstacle that I think could hold OnLive back – 16 titles at launch is fine, but they are all already available on other platforms. Nintendo dominated the current console generation because of its cheap, innovative hardware and a lineup of unique titles that exploit it. I think Nintendo of America’s president Reggie Fils-Aime was right on the money:

“Based on what I’ve seen so far, their opportunity may make a lot of sense for the PC game industry where piracy is an issue. But as far as the home console market goes, I’m not sure there is anything they have shown that solves a consumer need.”

What does OnLive really offer the consumer? Brag clips? The ability to spectate any player? Try before you buy? Rentals? These are nice (especially the latter two), but can they amount to a system seller? It’s a little cheaper, but is that really important considering most gamers already have a system? It seems that PC games could be where OnLive has more potential, but how much can it currently add over Steam? If OnLive’s offerings expanded to include almost all current games, then it truly could achieve high market penetration as the successor to Steam (with the ultimate advantage of free trials and rentals – which Steam can never offer). But Valve does have the significant advantage of a variety of exclusive games built on the Source engine, which together (Left 4 Dead, Counter-Strike, Team Fortress 2, Day of Defeat, etc.) make up a good chunk of the PC multiplayer segment.

The real opportunity for OnLive is to have exclusive titles that take advantage of OnLive’s unique supercomputer power to create a beyond-next-gen experience. This is the other direction in which the game industry expands: slowly moving into the blockbuster story experiences of movies. And this expansion is heavily tech-driven.

If such a mega-hit were made – a beyond-next-gen Halo or GTA – it could rapidly drive OnLive’s expansion, because OnLive requires very little user investment to play. At the very least, everyone would be able to try or play the game on some sort of PC they already have, and the microconsole for playing on your TV will probably only cost about as much as a game itself. So this market is a very different beast from the traditional consoles, where the market for your game is determined by the number of users who own the console. Once OnLive expands its data center capacity sufficiently, the market for an exclusive OnLive game is essentially every gamer. So does OnLive have an exclusive in the works? That would be the true game changer.

This is also where OnLive’s less flashy competitor, OToy & LivePlace, may be heading in a better direction. Instead of building the cloud and a business based first on existing games, you build the cloud and a new cloud engine for a totally new, unique product – one specifically designed to harness the cloud’s super resources and with no similar competitor.

Without either exclusives or a vast, retail-competitive game lineup, OnLive won’t take over the industry.