We live in curious times. Deep Learning is “eating software”, or at the very least becoming the latest trend, the buzziest word. Its popularity now permeates even popular culture.

At the same time, GPUs (which power Deep Learning) are in scarce supply due to the fourth great cryptocurrency price bubble/boom (and more specifically due to ether).

### All the world’s compute, wasted?

How much computation is all this crypto-mining using? Almost all of it. As of today the ethereum network is producing around 250 terahashes/second. A single GTX 1070 produces around 27 megahashes/second, so currently *right this moment* there is the equivalent of about 10 million GTX 1070 GPUs dedicated to ether mining alone. This is a 5 teraflop card, so we are talking about around 50 exaflops. If you include the other cryptocurrencies and round up, the world is currently ~~utilizing~~ burning roughly a hundred exaflops of general purpose compute on random hashing. For perspective, the most powerful current supercomputer rates at 1000x less, around 100 petaflops. So the ether network uses a *great deal* of compute. As we will see, this is in fact also most of the world’s compute. The price/cost of this compute is about 3 cents per hour per GTX 1070, or about 600 petaflops/$, or about 8 gigaflops/s/$ amortized over two years.
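These figures are easy to sanity check. A back-of-the-envelope sketch, using only the rough numbers quoted above (all of them estimates, not measurements):

```python
# Sanity check of the figures above. All inputs are rough estimates
# from the text, not measured values.

ETH_HASHRATE = 250e12      # ~250 terahashes/s across the whole network
GPU_HASHRATE = 27e6        # ~27 megahashes/s per GTX 1070
GPU_FLOPS = 5e12           # ~5 teraflops per GTX 1070
COST_PER_HOUR = 0.03       # ~3 cents/hour per GTX 1070

gpu_equivalents = ETH_HASHRATE / GPU_HASHRATE    # ~9.3 million GPUs
total_flops = gpu_equivalents * GPU_FLOPS        # ~5e19 flops/s

flops_per_dollar = GPU_FLOPS * 3600 / COST_PER_HOUR   # ~600 petaflops/$
hours_2y = 2 * 365 * 24
flop_rate_per_dollar = GPU_FLOPS / (hours_2y * COST_PER_HOUR)

print(f"{gpu_equivalents / 1e6:.1f} million GPU-equivalents")
print(f"{total_flops / 1e18:.0f} exaflops on ether alone")
print(f"{flops_per_dollar / 1e15:.0f} petaflops per dollar")
print(f"{flop_rate_per_dollar / 1e9:.1f} gigaflops/s per dollar (2y amortized)")
```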

Buying this much compute up front would cost about $10 billion, so we can be reasonably confident that Google and other private companies don’t have compute on this scale. Firstly, most corporate compute capacity is still CPU-based, which has one to two orders of magnitude less flops/$ efficiency and is thus irrelevant. Secondly, we can more directly estimate corporate GPU compute by just looking at Nvidia’s earnings reports. It is the main provider, it rather forcefully separates its consumer and corporate product lines, it overcharges for the corporate compute by 5x to 10x, and the consumer division still provides most of the revenue.

Google has TPUs, but they are more specialized, accelerating only dense tensor ops, whereas GPUs are now general accelerators for fairly arbitrary parallel C++ code. Each 4-chip TPUv2 accelerator board provides about 180 teraflops for dense matrix multiplication at about $7/hour, and thus about 92 petaflops/$, or about 1.3 gigaflops/s/$ amortized over two years. So Google would need to be overcharging by an order of magnitude more than Nvidia is (very unlikely for various reasons) for the TPUv2 ASIC to be more cost-effective than a general purpose GPU. It’s highly unlikely that Google has produced anything close to the $10-100 billion worth of TPUv2 hardware required to rival the ether GPU network. The same analysis applies to Microsoft’s limited use of FPGAs, and so on. Consumer GPUs utterly dominate ops/$, ASICs aren’t yet close to changing that, and Nvidia is already starting to put a TPU-like ASIC on its GPUs anyway with Volta.
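The price-efficiency gap can be sketched the same way; with the rental prices quoted above (again rough estimates, not benchmarks), the consumer GPU comes out roughly 6-7x ahead before any overcharging is considered:

```python
# Rough price-efficiency comparison between a rented TPUv2 board and a
# mining-network GTX 1070, using the per-hour prices quoted in the text.

def flops_per_dollar(flops_per_sec: float, dollars_per_hour: float) -> float:
    """Total flops purchased per dollar at a given hourly rental rate."""
    return flops_per_sec * 3600 / dollars_per_hour

tpu = flops_per_dollar(180e12, 7.00)   # TPUv2 board: ~92 petaflops/$
gpu = flops_per_dollar(5e12, 0.03)     # GTX 1070:   ~600 petaflops/$

print(f"TPUv2: {tpu / 1e15:.0f} PF/$  GTX 1070: {gpu / 1e15:.0f} PF/$")
print(f"GPU price-efficiency advantage: ~{gpu / tpu:.1f}x")
```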

### Progress in AI ~= Compute spent on AI

“Deep learning requires lots of compute”, although accurate at a surface level, doesn’t quite grasp the actual reality. Many tech business uses of compute have a ‘good enough’ level. There is a certain amount of compute you need to control an elevator, decode a reasonable video stream, parse a web page, and so on. Having dramatically more compute than required is a resource in need of a use. Many non-technical folks seem to think of compute in these terms. A perhaps deeper statement would be that Deep Learning *is* compute. It is computation directed towards a goal, evolving a system towards some target. Learning is a computational process. So there is no ‘good enough’ here; the amount desired is always *more*. We can debate what intelligence is a bit, but general intelligence – you know what I mean, human – requires continual learning. In fact, that is perhaps its most defining characteristic. So intelligence also is (a form of) compute.

If learning is compute, then progress in AI should track growth in compute, which is in fact the case (we could debate what our measure of ‘progress’ is, but let’s not). The typical ‘start date’ for the deep learning revolution is the first deep CNN (AlexNet) trained on the large ImageNet database, which naturally was also when someone actually bothered writing and debugging all the code needed to train neural networks in CUDA on a GPU for the first time. It was a sudden leap only because there was an effective gap in the compute applied to AI/ML, due to the end of Dennard (clockspeed) scaling and the overhead/delay in moving towards highly parallel programming (aka GPU programming). Once everyone migrated to GPUs the gap closed and progress continued.

### But why has nobody done this yet?

Moving to GPUs provided roughly a 10x to 100x one-time boost in compute/$ for AI/ML progress. There is room for perhaps another 10x or more efficiency jump from algorithm/software-level improvements at the tensor math level, but that is another story. There is a ~5x-ish gap between the price of GPU compute on the ether mining network and the low-end price of AWS/Google/etc cloud compute. So in theory that gap is large enough for a decentralized AWS-like service to significantly outcompete the corporate cloud, take us another large step on the path to AI, and also make a bunch of money.

However, a decentralized compute cloud has some obvious disadvantages:

- Privacy becomes . . . hard
- Requires low-overhead decentralized trust/reliability
- Overall less network bandwidth and higher latency per unit compute
- Potentially more headache to interface

First, let us just ignore privacy for the moment. There is of course interesting research on that issue, but much of the training use case workload probably doesn’t really require any strong privacy guarantees. Justifying that statement would entail too many words at the moment, so I’ll leave it for later. I’ll also assume that the interface issue isn’t really a big deal – it should be possible to create something on par with the AWS/GCE front-end interface for renting a real or virtual machine without much end-user hassle, but decentralized under the hood.

The reduced network connectivity between nodes is fundamental, but it also isn’t a showstopper. Most of the main techniques for parallelizing DL today are large-batch data-parallel methods, which require exchanging/reducing the model parameters at a frequency of roughly once per update. However, recent research can achieve up to 500x compression for the distributed communication at the same accuracy, which corresponds to a compute/bandwidth ratio of ~22 million flops per network byte, or about 250 kilobytes/second per 5 teraflop GTX 1070 GPU. So with SOTA gradient compression, it does look feasible to use standard large-batch data-parallel training methods across consumer internet connections. Even a 100 GPU farm would ‘only’ require gigabit fiber, which is now becoming a thing. Of course there are other techniques that can reduce bandwidth even further. Large-batch data parallelism probably makes the most sense over local ethernet for a GPU farm, but the diminishing returns with larger batch size mean that at some point you want to evaluate different model variants using something like population-level training, or other ‘evolutionary stuff’. In general, given some huge amount of parallel compute (huge relative to the compute required to run one model instance), you want to both explore numerous model parameter variations in parallel and train the most promising candidates in parallel. The former form of parallelization uses only tiny amounts of bandwidth, and the latter still uses reasonable amounts. So consumer internet bandwidth limitations are probably not a *fundamental* issue (ignoring for the moment the engineering challenges).
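The bandwidth arithmetic, as a rough sketch (the ~22 Mflop/byte ratio is the one quoted above, derived from the claimed 500x compression; it is an assumption, not a benchmark):

```python
# Bandwidth sketch for compressed data-parallel training, using the
# ~22 Mflop-per-network-byte ratio quoted above.

GPU_FLOPS = 5e12            # GTX 1070, ~5 teraflops
FLOPS_PER_BYTE = 22e6       # compute per byte of (compressed) gradient traffic

bytes_per_sec = GPU_FLOPS / FLOPS_PER_BYTE
print(f"per GPU: ~{bytes_per_sec / 1e3:.0f} KB/s")

farm_mbit = 100 * bytes_per_sec * 8 / 1e6   # a 100-GPU farm, in Mbit/s
print(f"100-GPU farm: ~{farm_mbit:.0f} Mbit/s (fits in gigabit fiber)")
```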

So that leaves us with the issue of trust or reliability. If you rent a TPU/GPU on GCE/AWS, you can reasonably trust that (1) it will compute what you want it to, and (2) it won’t leak your precious code/model/data to unscrupulous rivals/ISIS etc. Again, for the moment let’s leave out issue (2) and focus on reliable computation.

The ideal crypto approach to solving reliability involves trustless automated (dumb) mechanisms: ideally an algorithm running on the blockchain, something something, with some nice proofs that workers can’t find exploits to get paid for fake/forged work, or at least that any such exploit is unprofitable. Truebit is probably a good example of the SOTA for this approach. Truebit’s full solution is rather complex, but basically task providers submit tasks on the blockchain, workers claim solutions, and verifiers and judges arbitrate disputes using log(N) binary search between the Merkle trees of workers and verifiers. The scheme induces extra compute overhead for recomputing tasks redundantly across workers and verifiers, and also considerable network/blockchain bookkeeping overhead. The authors estimate that the overhead for all this is on the order of 500% to 5000% (section 4.2). Unfortunately this is far too much. The decentralized cloud has hope for a 5x compute/cost efficiency advantage in the zero-overhead case, and it has some inherent disadvantages (cough privacy). So really the overhead needs to be less than 200% to be competitive, and ideally much less than that. If anyone is actually planning a business case around a Truebit-based solution competing with the corporate cloud in general, they are in for a rude awakening. End users of bulk compute will not use your system just because it is crypto/decentralized; that is not actually an advantage.
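The core of that dispute game can be sketched in a few lines. This is a toy illustration, not Truebit’s actual protocol: real Truebit bisects over Merkleized VM state commitments on-chain, whereas here the ‘traces’ are plain lists and the step function is a stand-in:

```python
# Toy bisection dispute game: worker and verifier disagree on the final
# state of a long serial computation; binary search over intermediate
# state commitments finds the first divergent step, so a judge only has
# to re-execute that single step.

def first_divergent_step(worker_trace, verifier_trace):
    """Binary search for the first index where the traces disagree.
    Assumes they agree at index 0 (the shared input) and disagree at
    the end (that is what the dispute is about)."""
    lo, hi = 0, len(worker_trace) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if worker_trace[mid] == verifier_trace[mid]:
            lo = mid      # still agree here: divergence is later
        else:
            hi = mid      # already diverged: divergence is earlier
    return hi             # judge replays only step lo -> hi

def step(state):          # stand-in for one expensive serial step
    return (state * 31 + 7) % 1_000_003

def run(x0, n, cheat_at=None):
    trace = [x0]
    for i in range(n):
        nxt = step(trace[-1])
        if i == cheat_at:
            nxt ^= 1      # worker forges this state; later steps
                          # honestly propagate the forged value
        trace.append(nxt)
    return trace

honest = run(42, 1000)
forged = run(42, 1000, cheat_at=613)
print(first_divergent_step(honest, forged))   # 614
```

Log-many comparisons of state hashes replace recomputing the whole task; only the single disputed step needs re-execution by the judge.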

In the following technical part of this post I’ll describe some partially baked ideas for improving over Truebit and moving closer to the grail of epsilon overhead, even in the ideal ‘trustless’ crypto model. However, before getting into that, I should point out that the common sense solution to this problem may actually be the best, business-wise. By common sense here I mean “whatever actually works”, rather than what works subject to some imagined constraints such as in the trustless crypto model. As far as I know, Amazon does not offer a reliability proof for the compute it sells on AWS, but it still works just fine as a business. So the simplest low-overhead (i.e. actually usable) solutions to trust for decentralized computing probably look more like AirBnB than proof-of-work. Instead of trying to get machines to automate something hard that humans can do reasonably easily . . . just have humans do it.

To unpack that idea, consider the conditions leading to clients trusting AWS. As far as I know, AWS has not yet attempted computational fraud on any large scale. Yet it would be easy for them to do so . . . they could sell 10% of a machine’s compute resources and claim the user is getting 50%, for example. What prevents this? Competition and reputation. Competition is easy to achieve in a crypto-system – hard to avoid, actually. Reputation, on the other hand, is a bit trickier. One of the key ideas in crypto land is that everyone is anonymous and nodes can join/depart at will – which obviously prevents any significant build-up of or reliance on reputation. But there is no *fundamental* reason for this in the context of a cloud computing grid. Do the (human) compute providers really need to be anonymous? Probably not. So a reputation solution is probably the most net efficient here. AirBnB and eBay work just fine running reputation and arbitration computations mostly on human minds. Of course, reputation is also something that perhaps can increasingly be automated to various degrees (e.g., uPort). Golem is also apparently relying on a partially automated reputation system.

So, after acknowledging that fully general automatic trustless secure outsourced computation may be impossibly hard, and perhaps unnecessary given workable ‘hacky’ alternatives, let’s give it a shot anyway.

### The Verification Problem

A core problem in outsourced computation is *verification*: automatically determining whether the claimed output of a function is actually correct. If we involve multiple parties (which seems necessary) this just becomes another form of generalized consensus. The setup: we have a number of agents which can communicate facts in some formal logic-like language; agents have limited knowledge; agents can lie (make statements they can locally prove to be false); and we seek efficient algorithms agents can use to determine the truth of a statement, or at least estimate the probability thereof, subject to some convergence constraint. Naturally this is a big, complex problem. A key component of any solution usually (necessarily?) involves a verification game played by two or more agents. The game is a contest over the veracity of a statement, and its outcome provides key evidence agents can use to update their beliefs (assuming the agents are bayesian/probabilistic; in the weaker model where agents are dumb non-probabilistic reasoners, updates are of course binary).

Consider the specific example where Alice hires Bob to render an image using some expensive ray tracing function. We can describe this as:

Alice -> { if (Bob->{ this, #Y } : Y = F(X)) then Alice -> { send(coin3423, Bob) } }

In other words, Alice says that if Bob says (provides) Y, such that Y = F(X), then Alice sends a coin to Bob. This is a contract that Alice signs and sends to Bob who can then compute Y and sign the hash of Y appended to the contract (this) to make a payment claim. Bob can then provide this contract claim to any third party to prove that Bob now owns coin3423. You get the idea.
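A toy version of this claim flow might look as follows; real digital signatures are replaced by hashes over a shared secret, and all names (the contract text, the secrets) are purely illustrative:

```python
# Toy version of the claim flow: real digital signatures are replaced
# by SHA-256 over a shared secret, purely for illustration.

import hashlib

def h(*parts: str) -> str:
    return hashlib.sha256("|".join(parts).encode()).hexdigest()

def sign(secret: str, message: str) -> str:
    return h(secret, message)       # stand-in for a real signature

# Alice signs the conditional contract and sends it to Bob.
contract = "if Bob provides Y : Y = F(X=#img) then send(coin3423, Bob)"
alice_sig = sign("alice-secret-key", contract)

# Bob computes Y = F(X), then signs the contract with #Y appended.
Y = "ray-traced-output-bytes"
claim = h(contract, alice_sig, h(Y))
bob_sig = sign("bob-secret-key", claim)

# Bob can show (contract, alice_sig, #Y, bob_sig) to Charlie to prove
# ownership of coin3423 -- but checking that #Y really hashes the
# output of F(X) still requires Charlie to recompute F.
print(len(claim), len(bob_sig))   # 64 64
```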

Suppose in this example that X is an input image (identified by a hash, of course), F describes the fancy ray tracing function, and Y is the output image. To redeem the coin Bob ‘earned’, Bob needs to send the signed contract to some other party, Charlie. This new party then faces a verification burden: Bob is the legitimate owner of the coin iff #Y is a valid hash of the output of F(X), which requires that Charlie recompute F(X). Assuming we already have some mechanism to prove ownership of the coin prior to this new statement, the verification burden is still quite high: F(X) needs to be recomputed on the order of O(T) times, where T is the subsequent number of transactions, or alternatively O(N) times, where N is the number of peers in a consensus round. This is basically how verification works in systems like ethereum, and why they don’t really scale.

We can do much better in some cases by breaking the function apart. For example, in the ray tracing example, F(X) is highly parallel: it decomposes as Y[i] = F(X[i]), where i is an index over image locations. The global function decomposes simply into a single large parallel loop over independent functions: it is trivially parallelizable.

In this scenario a probabilistic verification game can converge to e accuracy in just one iteration using on the order of -log(e) subcomputations/subqueries, which is a huge improvement: more specifically, the burden is only ~ -log(e) / N, where N is the parallel width and e is an error (false positive) probability. For example, instead of recomputing all of F(X), Alice or Charlie can just pick a handful of i values randomly, and only recompute F(X[i]) at those indices. If Bob faked the entire computation, then verifying any single index/pixel sub-computation will suffice. If Bob faked 60%, then testing k locations will result in a (1.0 – 0.6)^k probability of false positive, and so on. So the parallel parts of a computation graph permit efficient/scalable probabilistic verification games. But this doesn’t work for serial computation. What we need is some way to ‘unchain’ the verification, hence the name of this idea.
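A minimal sketch of such a spot-check game, with a stand-in sub-computation F (all parameters illustrative):

```python
# Spot-check verification game for an embarrassingly parallel job:
# recompute k random indices instead of the whole output. F here is a
# stand-in for one expensive per-pixel sub-computation.

import random

def F(x):
    return x * x + 1

def spot_check(X, Y, k, rng):
    """Recompute k randomly chosen entries; accept only if all match."""
    return all(Y[i] == F(X[i]) for i in rng.sample(range(len(X)), k))

rng = random.Random(0)
X = list(range(10_000))
Y_honest = [F(x) for x in X]
# Bob fakes ~60% of the entries:
Y_forged = [y + 1 if rng.random() < 0.6 else y for y in Y_honest]

print(spot_check(X, Y_honest, 20, rng))   # True
print(spot_check(X, Y_forged, 20, rng))   # False, except with prob ~0.4**20
```

The verifier does 20 sub-computations instead of 10,000, and the cheater slips through only with probability about 1e-8.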

### Unchained Deterministic Trustless Verification

Consider a serial function of the form Y[i+1] = F(Y[i]). Assume we have access to a fast probabilistic micropayment scheme. Instead of one contract conditional on the overall function, Alice decomposes it into a larger number of probabilistic micropayments conditional on each sub-computation in the graph, at some granularity. The probabilistic micropayment can just be a payment conditional on a timelocked random process (something nobody could know now, but which becomes easy to know in the future, such as the hash of some future ether block). Now the weird part: instead of contracting on the ‘correct’ inputs to every subcomputation (which Alice doesn’t know without computing the function in the first place), Alice allows Bob to pick them. For every subcomputation, Bob claims what the inputs and output were at that step, wraps that up in a hash commitment, and exchanges it for a tiny p-payment with Alice. At the time of the exchange neither Alice nor Bob knows which of the p-payments (lottery tickets) will actually cash out in the future. Bob keeps the p-payments and checks them in the future. Some tiny fraction of them are winners, and Bob then ‘claims’ just those coins by using them in a transaction (sending them to Charlie, say), which then requires verification – but crucially – it only requires verification of that one sub-computation, independent of the whole graph. Thus the chain has been broken. The cost of verification can be made rather arbitrarily small, up to the limits of the p-payment scheme, because now only a fraction of the computations matter and require subsequent verification.
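The lottery-ticket mechanic alone can be sketched like this; the ‘beacon’ standing in for a future block hash, the win probability, and the face value are all illustrative assumptions:

```python
# Toy probabilistic micropayment ("lottery ticket"): a ticket pays out
# only if a future random beacon (standing in for the hash of a future
# ether block) hashes below a threshold. A 1-in-1000 ticket with a
# 1000-coin face value is worth ~1 coin in expectation, so it can
# price a single sub-computation.

import hashlib

WIN_PROB = 1 / 1000
FACE_VALUE = 1000    # coins paid out if the ticket wins

def ticket_wins(ticket_id: str, beacon: str) -> bool:
    digest = hashlib.sha256(f"{ticket_id}|{beacon}".encode()).hexdigest()
    return int(digest, 16) < WIN_PROB * 2**256

# Alice hands Bob one ticket per committed sub-computation.
tickets = [f"subjob-{i}" for i in range(100_000)]
beacon = "hash-of-a-future-ether-block"   # unknowable at issue time

winners = [t for t in tickets if ticket_wins(t, beacon)]
print(len(winners), len(winners) * FACE_VALUE)   # ~100 winners in expectation
```

Only the ~0.1% of tickets that win ever hit the chain and need verification; the other commitments are simply discarded.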

Unfortunately this doesn’t quite work as stated, at least not for all functions. Instead of computing the function on the actual inputs, Bob could instead substitute some other inputs. For example, instead of computing and signing Y[i+1] = F(Y[i]), Bob could compute and claim Y[i+1] = F(trash). This is valid because the unchaining necessarily broke the dependencies which constrain the inputs – Bob can pick the inputs. In a typical scenario the sub-computations will repeat heavily, so Bob could get a big savings by computing each newly encountered sub-function once, memoizing the inputs and outputs, and then recycling those over and over. Let us call this the *memoization* attack.

Fortunately there is a modification which seems to make it all work again, at least for certain functions. In machine learning all operations which cost anything are tensor operations on large arrays of real values. In this setting, Alice can submit a special small seed input for a noise function which Bob is required to add to each input. So each contract job is now something like Y[i+1] = F(Y[i] + noise(a[i])), where noise is some known-good fast deterministic noise generator, and a[i] is a seed Alice picks. Note that the seeds can be chained, and even the whole decomposition into subtransactions can be implicitly compressed to save any bandwidth cost.
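A toy sketch of the seeded-noise defence (pure-Python stand-ins for the tensor op and the noise generator; all scales are illustrative):

```python
# Toy seeded-noise defence: Bob must compute F on input + noise(seed)
# and commit to a hash of the (quantized) result, so recycling an
# output computed under a different seed no longer matches.

import hashlib
import random

def noise(seed, n, scale=1e-3):
    rng = random.Random(seed)
    return [rng.gauss(0.0, scale) for _ in range(n)]

def F(xs):                          # stand-in expensive tensor op
    return [x * x + 0.5 for x in xs]

def commit(seed, xs):
    noisy = [x + e for x, e in zip(xs, noise(seed, len(xs)))]
    y = F(noisy)
    # Quantize coarsely enough to be stable across machines, finely
    # enough that the ~1e-3 noise still flips output digits.
    return hashlib.sha256(",".join(f"{v:.6f}" for v in y).encode()).hexdigest()

xs = [1.0] * 64
honest = commit(seed=1, xs=xs)        # the commitment Alice expects
recycled = commit(seed=2, xs=xs)      # memoization attack: stale seed
print(honest != recycled)             # True
```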

The noise should be small enough that it doesn’t hurt the overall learning process (which is easy enough; some amount of noise injection is actually useful/required in DL), but still just large enough to always affect the output hash. The noise inputs from Alice prevent Bob from using memoization and related attacks. In fact, if the function F is ‘hard’ in the sense that the fastest algorithm to compute F for input X + noise takes the same running time for all X, then Bob has nothing to gain by cheating. On the other hand, if there are algorithms for F which are much faster for particular inputs, then the situation is more complex. For example, consider matrix multiplication. If we constrain the outputs to fairly high precision (say at least 32-bit floating point), then it seems likely that only a small noise perturbation is required to force Bob to use dense matrix multiplication (which has the same run time for all inputs). However, as we lower the required output precision, the noise variance must increase for dense matrix multiplication to still be the fastest option. For example, if the hash only requires say 8 bits of output precision and the noise is very small, Bob may be able to compute the output faster by quantizing the input matrices and using a sparse matrix multiply, which may take much less time on zero + noise than on X + noise (X being the correct input). The noise must be large enough that it changes some reasonable number of output bits for even the worst fake inputs Bob could pick (such as all zeroes).

In practice today, dense matrix multiplication with 16 or 32-bit floating point dominates machine learning; but dense matrix multiplication is solved, and the harder sparse case continues to improve, so naturally a scheme that only works for dumb/hash-like subfunctions is not ideal. However, for an early half-baked idea I suspect this line of attack has some merit, and perhaps it could be improved/extended to better handle compression/sparsity etc.

### All payments are bets

Another interesting direction is to replace the basic “p-payment conditioned on correct output” style contract with a bidirectional bet on the output. Alice sends Bob requests which could be in the form of “here is a function I would bet on: F(X)”, and Bob sends back a more concrete “I bet 100 coins to your 1 that F(X)=Y”. Having Bob put up funds at risk (a kind of deposit) strengthens everything. Furthermore, we could just extend bets all the way down as the core verification/consensus system itself.

Good Bob computes Y=F(X); Bad Bob computes Y != F(X) but claims Y=F(X). Bob then tries to claim a resulting coin – spend it with Charlie. Charlie needs to evaluate the claim Y = F(X) . . . or does she? Suppose instead that Charlie is a probabilistic reasoner. Charlie could then just assign a probability to the statement, compute expected values, and use the value of information to guide any verification games. You don’t avoid verification queries, but probabilistic reasoning can probably reduce them dramatically. Instead of recomputing the function F(X), Charlie could instead use some simpler, faster predictive model to estimate p(Y = F(X)). In the case where this p value is large enough, further verification is financially/computationally unjustified. Charlie can just discount the expected value of the coin slightly to account for the probability of fraud. Charlie can pass the coin on with or without further verification, depending on how the expected values work out. Better yet, instead of doing any of the actual expensive verification, Charlie could solicit bets from knowledgeable peers. Verification is then mostly solved by a market over the space of computable facts of interest in the system. Investigation – occasionally verifying stuff in detail by redoing computations – becomes a rare occurrence outsourced to specialists. I suspect that the full general solution to consensus ends up requiring something like this.
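Charlie’s decision rule might be sketched as a simple value-of-information calculation (all numbers illustrative):

```python
# Toy value-of-information rule for Charlie: verify a claimed coin only
# when the expected loss from fraud exceeds the cost of verification;
# otherwise just discount the coin and move on.

def should_verify(coin_value, p_fraud, verification_cost):
    """Verify only if the expected loss from accepting a fraudulent
    coin exceeds the cost of checking it."""
    return coin_value * p_fraud > verification_cost

def discounted_value(coin_value, p_fraud):
    # The no-verification alternative: accept at a fraud discount.
    return coin_value * (1 - p_fraud)

# A cheap coin from a high-reputation worker: discount and pass it on.
print(should_verify(coin_value=1.0, p_fraud=0.001, verification_cost=0.05))    # False
# A large coin from an unknown worker: verification pays for itself.
print(should_verify(coin_value=1000.0, p_fraud=0.02, verification_cost=0.05))  # True
```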

This is more like how the real world works. Purchasing a property does not entail verifying the entire chain of ownership. Verification always bottoms out at some point where the cost is no longer justified. Naturally this leads to a situation where fraud is sometimes profitable or at least goes undetected, but it is far from obvious that this is actually undesirable. One possible concern is black swans: some large fraud that goes undetected for a long time, only to be revealed later, causing a complex unwinding of contracts and payments. A natural solution is to bottom out complex contractual chains with swaps or conditional overwrites. For example, any of the conditional payment mechanisms discussed earlier should have a third option wherein if both parties agree/sign to a particular outcome then that short-circuits the bet condition and thus prevents graph complexity explosion. To this we can append a dispute resolution chain using a hierarchy of increasingly slower/larger/more trustworthy entities, using predesignated arbiters and so on, such that all ownership chains simplify over time. This also happens to be a solution feature we see in the real world (arbitration/judicial review hierarchies).