Physical Constraints on Intelligent Machines

Just how large is the space of viable computational architectures, and within that, how large is the space of general intelligence algorithms that can be physically realized?

In one sense the space of viable computing systems is as large as the configuration space of matter itself, since physics itself can be formulated as computation and even a rock is thus performing some form of computation.  However, we are interested in the vastly smaller configuration space of physical systems that can function as programmable Turing machines, and in the related, but distinct, space of intelligent computing systems.  Naturally, any general programmable Turing machine of sufficient capacity can run a program simulating any physical system, so it can, ipso facto, simulate any other Turing machine and any intelligent system within the space defined by its storage capacity.

At the most fundamental level, there are only two important quantities for a computational system (brains included): its storage capacity and its speed.  There are of course other important measures as well, such as network topology considerations like fan-out and derived measures such as latency, but all of these can be shown to boil down to speed in the end.  Storage capacity determines the size of the program space a system can run, and this applies equally to intelligent algorithms such as brains running intelligence ‘programs’ – that is, minds.  The space of gigabit minds includes insects, most reptiles, and a wide swath of programs and intelligences beyond description here, but it is strictly smaller than the space of roughly petabit minds such as those of humans.

Nonetheless, the properties of minds are not derived from the underlying computational system they are encoded on – the basic premise of substrate independence.  And if you study the brain, you will find that linguistic minds such as ours are a new layer of phenomena that supervenes on the brain in the same way that programs supervene on computers and words supervene on books.  Our minds are complex memetic-linguistic programs that have been evolving for tens of thousands of years, at a pace vastly faster than genetic evolution and still accelerating exponentially.

You are not your brain any more than Hamlet is the ink and pages it is written on.  You are not decades or even a century old, no matter your physical age: that is the age of the book on which you are currently written.  If you have even a vague education, you are orders of magnitude older than that.  It is convenient and typical to think of oneself as the physical book copy and its pages rather than the words, but this is simply because, until now, we have only ever witnessed a single unique copy of each mind.

When we upload, that will change massively, and all will come to intuitively understand that we are not the ink, but instead we are that which is written.

Nonetheless, it is important to understand the physical medium in which we are encoded, and how that will change in the near future.  As Feynman observed, there is plenty of room at the bottom: a huge amount of space for computation as far down as we can see.  However, the space of viable technological paths down that rabbit hole seems surprisingly constrained: a narrow funnel.

Physically Realizable Intelligent Agents

The earlier abstract model of intelligence I have discussed does not take into account the practicalities of physics.  Indeed, the mapping from possible algorithms, existing in their own nebulous platonic mathematical ensemble, to actual practical implementations of algorithms in the world of physical matter has always been (and still is) something of a conceptual obstacle for mathematicians and computer scientists alike.  The main branches of mathematical dogma in computer science, from Turing machines to complexity analysis, traditionally consider computability in an abstract world free from the practical constraints of geometry, mass, and energy.  The traditional computer architecture, with a serial central processing unit and a separate storage ‘tape’ for instructions and data, carries the baggage of the Turing machine abstraction.

In the real world, the physical interactions usable for computation and storage are all simultaneously both localized and distributed.  No matter where you currently are on the technology shrinking scale, each bit of data must be stored in a physical block which takes up space.  Likewise, each computational element takes up space, computation uses energy, and moving information to distant locations costs both energy and space in proportion to the distance.  There are only a couple of fundamental circuit elements, from which any larger computational system can be composed.  Relevant for us, computers today are built largely out of transistors and capacitors.

In simplified form, physical circuit complexity theory is concerned with how any computational system (from a mind to an abstract algorithm) can be broken down into a large network of very simple basic computational elements arranged into a physical network.  Going up a level of abstraction beyond the fundamental circuit elements, we can consider a small computing block which takes in a few bits of input, produces a few bits of output, and has a few bits of data storage – the minimal computer.  Given the input bits and the stored bits, the output of such a node is a simple deterministic function that can perform the fundamental computations (binary AND, OR, XOR).  The input and output wires can be connected to other nearby elements.  In theory, any algorithm can be reduced to a sufficiently large network of these tiny computers in at least 2 dimensions (1D is not sufficient, and higher dimensions have better scaling properties, which we’ll look at later).  What is of utmost importance is how the algorithm scales: what is the minimum circuit that implements it, and how does performance scale as we increase the size of the network?  We are conditioned to think of performance scaling with speed, but speed is largely a fixed parameter of the substrate.  Moore’s law models the increase in the density of circuit elements per chip area, so scaling with Moore’s law is scaling with circuit size, not speed.  In fact, speed has flatlined in the gigahertz range and is unlikely to increase significantly going forward (and may in fact decrease).
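
As a rough illustration (a toy sketch in Python, not a model of any real hardware), such a node can be written as a deterministic function of its input bits and stored bits; the `Node` class and its example function are invented purely for illustration:

```python
# Toy sketch of the "minimal computer" node described above: a few input bits,
# a few stored bits, and a deterministic boolean function producing output bits.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    state: List[int] = field(default_factory=lambda: [0])      # a few stored bits
    func: Callable[[List[int], List[int]], List[int]] = None   # (inputs, state) -> outputs

    def step(self, inputs: List[int]) -> List[int]:
        return self.func(inputs, self.state)

# Example node: out[0] = AND of the inputs, out[1] = XOR of the inputs and the stored bit.
def and_xor(inputs, state):
    a, b = inputs
    return [a & b, a ^ b ^ state[0]]

node = Node(state=[1], func=and_xor)
print(node.step([1, 1]))   # -> [1, 1]
print(node.step([1, 0]))   # -> [0, 0]
```

Wiring many such nodes to their neighbors gives the kind of physical network the rest of this section reasons about.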

If you were to zoom in and fly around a modern computer, you can imagine it looking something like a massive futuristic city.  At the macro scale, you’d see a clear downtown area bustling with activity – this is the CPU.  A network of highways branching out of the CPU leads to the equivalent of the suburbs – the RAM chips where data is stored.  Zooming in closer to one of these suburbs, you can imagine the fundamental circuit blocks that do the computation or store data as small little buildings (no skyscrapers) interspersed with several layers of roads and highways to carry data, all arranged into highly regular and repetitive patterns.  At the simpler end of the spectrum, a typical memory chip would look like an iconic 1960s suburb, with a very simple regular pattern of a few buildings and roads repeated exactly, stretching off into the distance in either direction.  The function of a memory chip is not to compute, but to cheaply store masses of data.  The design problem then is to have a way of exactly specifying a location and routing the flow of traffic to get data into or out of that location.  This is accomplished in binary form with a series of branching switches, not unlike traffic lights.  In our little computer-world analogy, you can imagine that at any one time the central control (the CPU – downtown) requests a small block of residents to come into the city for work, bringing their data bits, while at the same time sending another group back (with altered data bits).  These suburbs would look strangely empty, with all the traffic lights arranged to carefully route the data between these precise locations and the highways leading out of the chip.
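
To make the branching-switch idea concrete, here is a minimal sketch (illustrative only, not an actual memory addressing circuit) of selecting one cell out of 2^k by following k address bits through a binary tree of switches:

```python
# Illustrative sketch: each address bit throws one "traffic light",
# narrowing the reachable cells by half until a single cell remains.

def select_cell(cells, address_bits):
    lo, hi = 0, len(cells)
    for bit in address_bits:
        mid = (lo + hi) // 2
        if bit == 0:
            hi = mid      # take the "left road"
        else:
            lo = mid      # take the "right road"
    return cells[lo]

cells = list(range(8))                  # 8 cells -> 3 address bits
print(select_cell(cells, [1, 0, 1]))    # -> 5 (binary 101)
```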

The storage capacity of a 2D system scales directly with the total surface area and thus the number of component elements, but the aggregate bandwidth (the rate at which data can flow in or out) scales only with the perimeter of the storage unit – 1D scaling.  As a more concrete example of the scaling limit, consider that a typical 2010 RAM chip might store a gigabyte of data and be clocked at around a gigahertz, but can only move data at speeds measured in gigabytes per second – at any one time it can only access a small fraction of its data.  Visualizing the circuit as a city again, you are already familiar with this scaling law – you have probably experienced it as traffic: essentially the same phenomenon on a different physical scale.  If everyone (every piece of data) left their buildings at the same time and headed for CPU-city, the constraints of geometry tell us the result would be a massive traffic jam.  To alleviate this problem to some small degree, modern RAM chips fetch data in blocks, and they also take advantage of the 3rd dimension in a fashion, with highways stretching out over the city, but the scaling problem persists in 3D.  The exact same dilemma affects the CPU.  And on the topic of the CPU, you can visualize it as a similar cityscape, but vastly more complex and irregular, routing the flow of data through its series of valves and switches in order to perform arbitrary computations.  The richer and more generic the set of instructions the CPU supports, the larger its cityscape grows in terms of surface area.
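
A quick back-of-the-envelope sketch of this area-versus-perimeter scaling (the numbers are illustrative ratios, not measurements of any particular chip):

```python
# Storage grows with area (side^2) while edge bandwidth grows with the
# perimeter (side), so the fraction of stored data reachable per unit time
# shrinks as the array grows.

for side in (1, 2, 4, 8):
    capacity  = side ** 2      # bits scale with area
    bandwidth = side           # wires out scale with the edge
    print(f"side x{side}: capacity x{capacity}, bandwidth x{bandwidth}, "
          f"reachable fraction ~ {bandwidth / capacity:.3f}")
```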

For most of the history of computation, Moore’s law reliably shrank the features of these computing cities each year, with every ~30% reduction in feature width halving the surface area of these miniature building blocks, allowing twice as many to be packed into the same area while also reducing the power dissipation of each component.  The smaller components could thus run faster, leading to the familiar escalation of clock rates, which then quickly ran into the problem of latency: the CPU can only compute dependent instructions after the required data is fetched from memory, and as clock rates escalated the fetch times (measured in clock cycles) escalated in proportion – the speed of light is fixed.  CPUs used the extra transistor space to grow massive additional features to hide the latency: caches, branch predictors, and so on.  The end result was that programmers could continue to work in a convenient fantasy realm that abstracted away the physical constraints of computation.  Meanwhile, the smaller community of scientists working largely on scientific simulations started building supercomputers out of larger clusters of CPUs, and they ran up against the real-world performance constraints of physics, where you have to design algorithms to be efficient in the face of parallelism, networking, latency, and so on.  Now CPU makers have finally hit the barrier where further clock speed gains no longer justify their power cost, and are instead using escalating transistor budgets to pack more and more small CPUs together, going the way of the supercomputers.  GPUs are a good example of packing a lot of little cores together along with some nifty solutions to latency (massive hardware multi-threading: effectively they time-slice and simulate a much larger number of virtual cores), but they still have the bandwidth piping constraints – which are increasingly being overcome by moving into the 3rd dimension.
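
The density arithmetic behind a process shrink is simple enough to write down (illustrative figures only):

```python
# A ~30% linear feature shrink roughly halves the area each device occupies,
# which doubles how many devices fit in the same chip area.

width_scale  = 0.7                 # new feature width / old feature width
area_scale   = width_scale ** 2    # each device occupies ~width^2 of area
density_gain = 1 / area_scale
print(f"area per device: x{area_scale:.2f}, devices per unit area: x{density_gain:.2f}")
# -> area per device: x0.49, devices per unit area: x2.04
```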

As computers scale down toward the information densities of brains (around 2 or 3 more orders of magnitude to go – roughly 7 to 10 more doublings ahead) and eventually beyond, they will face scaling and quantum limits.  Parallel chips such as GPUs (which have several orders of magnitude higher compute/memory density) already have around 1,000 little processing cores, but conceptually simulate a machine with hundreds of thousands of cores in order to hide latency.  Programming for these machines is fundamentally different: they are effectively supercomputers on a chip.  We are at a scaling wall where individual von Neumann processors cannot usefully be made any larger: all we can do is make vastly more of them.  A future GPU of the mid-2020s will probably look something like a current 1,000-GPU supercomputer shrunk down to PC size: millions of individual cores.  Such a model is considerably more complex to program than current parallel designs: you have numerous separate memory banks and network links to consider.
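
A rough sketch of why such chips oversubscribe threads to hide latency; the cycle counts below are assumptions for illustration, not the specs of any actual GPU:

```python
# If a fetch from memory takes far longer than the arithmetic done per fetch,
# each execution lane needs many resident threads to stay busy.

memory_latency_cycles = 400    # assumed round-trip to off-chip memory
work_per_fetch_cycles = 4      # assumed arithmetic done per fetch, per thread

threads_per_lane = memory_latency_cycles / work_per_fetch_cycles
print(f"~{threads_per_lane:.0f} resident threads per lane to cover the latency")
# With on the order of 1,000 physical cores, this is how a chip ends up
# behaving like a machine with hundreds of thousands of virtual cores.
```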

Going farther forward, our computers must become more like the brain’s 3D design.  Going back to our cityscape visualization analogy, the brain would look totally different.  It would be more like a giant tree.  Where our flat 2D computer city had a population on the order of billions, this great world tree would be vastly larger, with a population in the hundreds of trillions – hundreds of thousands of times more populous than the computer city.  But instead of fast automobile commuters zipping around at a hundred kilometers per hour to carry all the data, in this great tree the computation would be performed by tiny, slow ants.  If you have a hard time visualizing how that could ever lead to computation, now imagine speeding it all up by a million times.  The ants now zip along at a couple hundred kilometers per hour, and the cars in the cityscape approach the speed of light.

The brain is 3D, but the cortex is conceptually more like a 2D sheet which has been folded up and compressed into the skull’s small space.  The fundamental computational limits in 3D (of which 2D is a worse special case) follow directly from geometry.  The information storage and total maximum usable computational power scale directly with volume, but to use this a program must distribute itself across the vast number of local computational elements.  Communication – sending bits to other nodes – has a latency penalty, a space cost, and an energy cost proportional to the distance between the nodes.  In general this means that communication must be largely local, with the distance distribution tapering off according to something like a power law.  The total non-local bandwidth scales only with the surface area: 2D scaling against the volume’s 3D scaling.  This is reality: a vastly different and highly constrained computational space to work within.  The space of efficient algorithms that scale well under these constraints is narrow.
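
The volume-versus-surface argument can be sketched numerically (purely illustrative units):

```python
# Internal storage and peak compute of a cubic region grow as n^3, but the
# bandwidth crossing its boundary grows only as n^2, so the gap widens with size.

for n in (10, 100, 1000):          # relative edge length of a cubic region
    volume  = n ** 3               # elements: storage and peak compute
    surface = 6 * n ** 2           # wires crossing the boundary
    print(f"edge x{n}: compute x{volume}, external bandwidth x{surface}, "
          f"ratio ~{volume / surface:.1f} internal ops per external bit")
```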

Take a concrete example: the class of 3D rendering algorithms known as ray tracers.  They are powerful yet simple, and very easy to parallelize: each pixel of the output image can be computed independently by tracing a ray through the scene.  For each pixel, you simulate a ray (or rays) moving through the environment, performing an ordered search through an object database to find and evaluate potential intersections.  At each intersection you evaluate optical surface interactions, possibly spawning additional secondary rays which continue onwards until an adequate estimate of the incoming light is reached.  This algorithm is easy to map to a serial single-CPU model, and likewise just as easy to map to a simple parallel GPU model where you have N processing threads, as long as N is less than the number of pixels.  GPUs are pushing hundreds of threads per core with up to a thousand cores, so they are already approaching an initial scaling limit for most graphics algorithms.  But this is still *easy* because we have a single shared memory.  How would you scale the algorithm to a GPU of the mid-2020s timeframe – which will be constrained to resemble something like a current supercomputer shrunk down?  You now have millions of cores, perhaps hundreds of millions of hardware threads to occupy (far more than pixels), and much worse: you now have to face a distributed memory system with countless separate memory banks and a hierarchy of bandwidth and latency constraints depending on a 3D topology.  As you approach this final inevitable level of hardware complexity, the best solutions tend to map the problem’s virtual topology onto the computer’s physical topology.  Such an algorithm might entail chopping the 3D view volume in front of the camera into some sort of grid and dividing up the 3D space amongst the many cores.  Each core would store the objects and rays entering its virtual 3D cell in its own local memory, and would be responsible for computing intersections for any rays entering its space.  As rays and objects move between cells, they are shuffled to other computing nodes across the network.  This class of algorithms is about the best you can do given the physical scaling constraints discussed earlier, and is considerably more complex to implement.
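
Below is a heavily simplified sketch of that domain-decomposition idea; the grid layout, the `advance` placeholder, and all data structures are invented for illustration and stand in for real intersection code and network transport:

```python
# The view volume is cut into a grid of cells, each (virtual) core owns one
# cell's objects and rays, and rays leaving a cell are forwarded to the
# neighbour that owns the space they enter.

from collections import defaultdict

GRID = 4                                   # 4x4x4 cells, one per virtual core

def cell_of(position):
    """Map a 3D point in the unit cube to the grid cell (core) that owns it."""
    return tuple(min(GRID - 1, int(c * GRID)) for c in position)

objects = defaultdict(list)                # cell -> objects held in local memory
inbox   = defaultdict(list)                # cell -> rays currently inside that cell

def advance(ray, step=0.05):
    """Move a ray a short distance along its direction (placeholder for tracing)."""
    origin, direction, pixel = ray
    new_origin = tuple(o + d * step for o, d in zip(origin, direction))
    return (new_origin, direction, pixel)

def simulate(rays, iterations=20):
    # Scatter the initial camera rays to the cells that own their origins.
    for ray in rays:
        inbox[cell_of(ray[0])].append(ray)

    for _ in range(iterations):
        outgoing = defaultdict(list)
        for cell, rays_here in inbox.items():
            for ray in rays_here:
                # A real core would intersect the ray against objects[cell] here.
                moved = advance(ray)
                dest = cell_of(moved[0])
                outgoing[dest].append(moved)   # stays local or is sent "over the network"
        inbox.clear()
        inbox.update(outgoing)

rays = [((0.1, 0.5, 0.5), (1.0, 0.0, 0.0), pixel) for pixel in range(4)]
simulate(rays)
print({cell: len(r) for cell, r in inbox.items()})   # rays have migrated across cells
```

The essential design choice is that data movement mirrors physical adjacency: rays only ever hop between neighbouring cells, so communication stays local, exactly as the scaling constraints above demand.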

Moore’s law says that transistor density increases exponentially, but this doesn’t directly map to any statement about effective processor speed.  Moore’s law cannot be simplified into a general exponential speedup of all algorithms over time.  As transistor density increases going forward, physical constraints will limit further performance improvements to an increasingly narrow branch of the space of computer architectures, and thus of the space of algorithms.

  • Physics imposes the computational constraints of locality and the speed of light
  • As computers shrink, they necessarily must become massively parallel, increasingly taking on the programming complexity formerly reserved for supercomputers
  • The space of all programs that implement a general AI may be large, but only a tiny fraction of them scale forward with massive parallelization
  • The road ahead necessarily leads to something brain-like: a massively parallel 3D computer operating in a noisy environment
