AGI is Near(er)

A ‘common’ simple Fermi estimate for the net computation required to produce AGI is something like:

HB * HL * X

Where HB is the compute equivalent of a human brain in ops/s, HL is the human lifetime, and X is some nebulous multiplier representing the number of experimental trials required to discover the correct architecture. Let’s call this the HumanBrain*HumanLifetime*X model.

HL, the human lifetime, shouldn’t be too controversial: it’s about 10^9 seconds, or roughly 32 years.

Joseph Carlsmith from Open Philanthropy has penned an impressively encyclopedic article seeking solely to estimate HB, resulting in a distribution (of sorts) with a median estimate of 10^15 op/s. That’s not far from the simple estimate of average synaptic spike rate * number of cortical synapses (i.e. ~1 Hz * 10^14 synapses ≈ 10^14 op/s). My own estimate is similar, but with less variance (higher confidence).
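As a quick sanity check on the two factors so far, the arithmetic can be sketched in a few lines of Python. The variable names are only illustrative, and the inputs are just the round numbers quoted above, not independent measurements.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # ~3.16e7 seconds

# HL: a 32-year human "training run", in seconds
hl_seconds = 32 * SECONDS_PER_YEAR

# HB, simple estimate: average spike rate (Hz) * number of cortical synapses
hb_simple = 1.0 * 1e14   # op/s
# HB, Carlsmith's median estimate as quoted in the text
hb_median = 1e15         # op/s

print(f"HL             ~ {hl_seconds:.2e} s")
print(f"HB*HL (simple) ~ {hb_simple * hl_seconds:.1e} ops")
print(f"HB*HL (median) ~ {hb_median * hl_seconds:.1e} ops")
```

So a single "brain lifetime" of compute lands somewhere around 10^23 to 10^24 ops, depending on which HB figure you prefer.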

A current high-end GPU, the RTX 3090, has about 3×10^9 transistors that cycle at around 1.5×10^9 Hz, for a maximum total circuit binary op throughput of ~5×10^18 ops/s. Is this surprising? However, in terms of actual useful logic ops, the max throughput (for 1-bit matrix multiplication using tensor cores) is closer to 10^15 op/s, or 10^14 flop/s for half-precision floating point.

Although it may seem surprising at first to suggest that a current consumer GPU has raw compute power equivalent to the human brain, keep in mind that:

  1. Synaptic ops are analog operations, and thus intrinsically closer to (and as expensive as) low-bit flops – i.e. each is equivalent to 10^3 to 10^4 simple bit ops
  2. The brain only uses 10 watts vs 350 for the GPU
  3. Energy cost depends on wiring transit length, which is longer in the brain
  4. As a consequence of 3, the GPU max performance listed here only applies when using registers or smem, corresponding to minimal (and less useful) wiring length
  5. Moore’s law is approaching end game for most relevant dimensions (including the crucial bitop/J)
  6. It’s fairly obvious that the brain pushes physical limits (biological cells are actual practical nanobots operating near the Landauer Limit [1])
  7. The convergence of 5 and 6 in the near term implies AGI

The GPU has about 3 OOM less total memory capacity, but GPUs can be networked together at vastly higher bandwidths, allowing one to spread a large model (virtual brain) across many GPUs (model parallelism) while also running many ‘individual’ AI instances simultaneously on each GPU (data parallelism). A full analysis of how various net bandwidths compare (including compression), and how they will differentially shape AGI vs brain development, is material for another day, but for now, just trust me that memory and interconnect aren’t hard blockers.

The HB*HL component forms a useful 2D product space. For the same given amount of compute, you can scale up your model complexity by some factor k at the cost of reducing training data/time by the same factor k, or vice versa. (This k is just a scaling factor, not the trial multiplier X from above.)
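The tradeoff described above is just a fixed product, which a minimal sketch makes concrete. The function name and the choice of 10^15 op/s for HB are assumptions for illustration only.

```python
def training_seconds(total_ops: float, model_ops_per_s: float) -> float:
    """Seconds of (virtual) training time affordable under a fixed compute budget."""
    return total_ops / model_ops_per_s

HB = 1e15          # op/s, assumed brain-equivalent compute
HL = 1e9           # seconds, human lifetime
budget = HB * HL   # fixed total compute

for scale in (1.0, 0.1, 0.01):  # model size as a fraction of HB
    secs = training_seconds(budget, HB * scale)
    print(f"model = {scale:g} x HB -> {secs / HL:g} x HL of training time")
```

A model 100x smaller than the brain can be trained for 100 human lifetimes of virtual time on the same budget, which is the regime the examples below fall into.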

AlphaZero models are much smaller than the brain, but are trained much faster than real-time, reaching superhuman capability in Go or Chess after <24 hours of wall-clock training time on a few thousand TPUs (~GPUs for our purposes), or the equivalent of < 100,000 GPU-hours. That is fairly close to HB*HL, which at 1 brain ~ 1 GPU is about 10^9 GPU-seconds, or roughly 280,000 GPU-hours. (Note that AlphaZero did not use human knowledge, which makes the comparison more impressive.)

GPT-3 is also smaller than the brain, and was trained for much longer in virtual time/data, but in terms of total compute was trained using just a few hundred GPU-years (i.e., within one OOM of HB*HL). No, GPT-3 is clearly not a proto-AGI, as it wasn’t trained to be that. It was instead trained on the naively narrow task of sequential token prediction (and without the obvious human-like advantage of symbolic grounding through a sensorimotor world model, which yes, obviously, Google and OpenAI are now working on).
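To see how close these two training runs sit to HB*HL, here is a rough sketch that expresses each as an order-of-magnitude gap from one human lifetime on one GPU. The helper name is made up, and the 300 GPU-years for GPT-3 is an assumed stand-in for "a few hundred".

```python
import math

HOURS_PER_YEAR = 365.25 * 24        # 8766 hours
HUMAN_GPU_HOURS = 1e9 / 3600        # HL in hours, taking 1 brain ~ 1 GPU

def oom_gap(gpu_hours: float) -> float:
    """log10 ratio of a training run's GPU-hours to HB*HL (0 means equal)."""
    return math.log10(gpu_hours / HUMAN_GPU_HOURS)

runs = {
    "AlphaZero (upper bound from text)": 1e5,
    "GPT-3 (assumed 300 GPU-years)": 300 * HOURS_PER_YEAR,
}

for name, gpu_hours in runs.items():
    print(f"{name}: {gpu_hours:.2e} GPU-hours, {oom_gap(gpu_hours):+.2f} OOM vs HB*HL")
```

Both land within about one order of magnitude of the human figure, on opposite sides of it.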

EfficientZero achieves superhuman test performance on the Atari 100k Benchmark, after only about 2 hours of virtual playtime (and thus superhuman sample efficiency!), which is especially impressive given that it doesn’t use visual pretraining (whereas humans have years of pretraining before ever trying Atari). It does this by learning an approximate predictive world model (of Atari) – i.e. using model-based RL. (The devil, of course, is in the details, but it’s much less obvious that EfficientZero isn’t a simple proto-AGI.) However, it runs at about 0.3x real-time, so that 2 hours of virtual training takes roughly 6 hours of real time on 4 GPUs – so again within an OOM of the human training compute (ignoring human pretraining).

The general lesson here is that DL systems are approaching human capabilities on interesting tasks when trained using roughly comparable amounts of total compute (with much more flexibility in the 2D space of model size vs training time tradeoffs).

So with those examples fresh in mind, what is a good guess for X – remaining experiments until AGI-GO-BOOM?

One simple prior (which I internally call the Doomsday prior, also known as the Lindy prior) is to assume that the number of remaining experiments is similar to the number of similar experiments conducted to date. We could roughly estimate the number of experiments by equating experiments to DL papers!

Eyeballing the graph above suggests an X estimate of 100k or less. (For comparison, naively applying the doomsday prior directly to years of research, assuming DL is a path ending with AGI, suggests AGI in a decade or two.) If we instead use the total number of papers (in any field) ever uploaded to arXiv as our prior for X, that only gets us up to about 1M.
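For a sense of how loose this kind of prior is, one common formalization is Gott's "delta-t" argument: assume we are observing at a uniformly random point in the total run of experiments. That formalization is an assumption added here for illustration, not something the estimate above depends on, and the function name is hypothetical.

```python
def gott_interval(past: float, confidence: float = 0.5) -> tuple:
    """Interval for the remaining count, given the count so far.

    Assumes the fraction already elapsed is uniform on (0, 1), so with
    probability `confidence` it lies in the central band of that range.
    """
    low = past * (1 - confidence) / (1 + confidence)
    high = past * (1 + confidence) / (1 - confidence)
    return low, high

past_experiments = 1e5  # the eyeballed DL-paper count from the text

for conf in (0.5, 0.95):
    low, high = gott_interval(past_experiments, conf)
    print(f"{conf:.0%} interval for remaining X: {low:,.0f} to {high:,.0f}")
```

The median guess for remaining experiments equals the number done so far, which is the point estimate used above, but the interval around it is wide.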

Plugging 100k for X into HB*HL*X gives an estimate of ‘only’ about 28 billion GPU hours, or about $28 billion using (currently high) prices of $1/hr for an RTX 3090 (recall 1 brain ~ 1 GPU). For comparison, this is similar to the total amount spent on useless Ethereum hashes in just one year. And naturally the larger estimate of $280 billion (from the 1M prior for X) still isn’t so humongous in the grand scheme of nation-states and trillion-dollar megacorps. These cost estimates are also conservative in assuming no further improvements from Moore’s Law.
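The cost figures above follow from a one-line calculation, sketched here so the inputs are easy to vary. The function and parameter names are illustrative; the defaults are the round numbers used in the text (HL of 10^9 seconds, 1 brain ~ 1 GPU, $1 per GPU-hour).

```python
def agi_cost(x_trials: float,
             hl_seconds: float = 1e9,
             gpus_per_brain: float = 1.0,
             usd_per_gpu_hour: float = 1.0) -> tuple:
    """Return (GPU-hours, USD) for the HB*HL*X estimate."""
    gpu_hours = x_trials * gpus_per_brain * hl_seconds / 3600
    return gpu_hours, gpu_hours * usd_per_gpu_hour

for x in (1e5, 1e6):
    gpu_hours, usd = agi_cost(x)
    print(f"X = {x:.0e}: {gpu_hours / 1e9:.0f} billion GPU-hours, "
          f"~${usd / 1e9:.0f} billion")
```

Changing `gpus_per_brain` to 10 (i.e. taking HB as 10^15 flop/s rather than 10^14) or halving `usd_per_gpu_hour` shifts the total proportionally, which is a quick way to see how sensitive the headline number is to each assumption.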

At the end of the day though, what will matter most is net efficiency of spend. There is enormous room for further software improvements, such that these estimates are probably wildly overconservative. Besides further obvious significant improvements to low-level matrix operations on GPUs (leveraging sparsity), we can vastly accelerate progress through advancements in meta-learning, and by doing most of the architecture search over small, fast models/worlds. That last sentence is mostly just a restatement of the current reality, but the intertwined research tracks of sparsity and meta-learning will truly revolutionize research this decade – sharply curving all trendlines. Much more on that later.

[1] As a first hint, the energy currency of biology is ATP, and 1 ATP ~ 0.3 eV, within an OOM of the room-temperature Landauer Limit of ~0.03 eV. Cells only use a few ATP to copy each base pair during DNA replication. Since one electron volt is the energy an electron gains across a potential difference of 1 volt, the 0.03 eV Landauer Limit corresponds to a thermal noise barrier of 30 mV for electric computational systems like neural circuits. The typical neuron membrane resting potential is -70 mV, or only about 2x the limit, and it only cycles by another factor of 2x or so during action potentials. Digital computers operate at 10x or higher voltages, for higher speed and reliability.
