I’ve always had a strong interest in the brain, and lately I’ve been reading as much as I can to catch up in the fields of AI and computational neuroscience in particular. The end result of my most recent reading is a perspective somewhat different from the one I started with. Consider this, then, the high-level introduction to the brain that I wish I had had years ago.
Before One Begins
Before delving into current data and any particular theories, it’s probably best to understand the general shape of the approaches to understanding intelligence. At a very high level of abstraction, the approaches can roughly be categorized into what I would call the functionalist view and the emergent view. These are strategies for understanding rather than particular classes of theories, although we can then roughly divide the ontology of brain knowledge into computational and biological subcategories that map to the functionalist vs emergent views. There is of course overlap and a huge amount of cross-fertilization, but fundamentally a computer scientist and a neuroscientist understand or ‘see’ the brain in different ways. That doesn’t mean that their theories and knowledge can’t converge; it’s more of an observation about the fundamental differences in the entire methodology and thinking apparatus one uses to analyze the data and form theories. Coming from a computer science background, I naturally aligned more with the functionalist/computationalist camp. After reading and learning a great deal more about the brain, I now have a much stronger appreciation for the biological/emergent approach, and I believe both schools of thought are necessary and mutually supportive. Computer science is important for understanding intelligence in the abstract and the brain in particular, and neuroscience is important for AI.
Functionalist-Computational School: This is the dominant, classical view in the field of AI, exemplified in textbooks such as “Artificial Intelligence: A Modern Approach”. From an economic or utilitarian perspective, the functionalist approach is well grounded: it is focused on finding practical algorithms and techniques for intelligence which can solve real-world business problems on today’s computers. From this perspective the brain is only useful to the extent that it provides inspiration for economically viable AI systems. A persistent trend in the computational school is to view the brain as fundamentally too messy and chaotic, and to place a low value on reverse engineering it. This school of thought has long grossly underestimated (and continues to underestimate) the difficulty of creating true human-equivalent AI. In the old days this school of thought quantitatively underestimated the brain’s computational capacity, but today it is more likely to grossly overestimate it. More recently there appears to be a growing recognition that the problem is more ‘software’ than ‘hardware’, that we probably already have the computational capacity if we only had the right algorithms, and a gradual shift towards the biological school. Much of what’s wrong with this school of thought can be gleaned from one of its persistent analogies: the analogy of flight.
Emergent-Biological School: This school of thought understands the brain as a complex adaptive system, and intelligence and learning in particular as an emergent phenomenon. The brain is understood not only by analyzing the computations it performs (functionalist), but also through understanding the lower-level biological processes, the overall interaction within the environment (physical, social, mental, etc.) and the complete evolutionary history. In other words, to really understand human intelligence, you may have to understand everything. There is something deeply revolting about this statement on one level, but the more I’ve come to learn about intelligence the more I believe it to be largely true.
However, accepting the emergent viewpoint by no means forces one to drop the functionalist approaches, as it turns out the two are quite synergistic. For example, on the purely theoretical side the AIXI agent model appears to be a good framework for formalizing the notion of intelligence. What’s particularly interesting about that formulation is that it takes a systemic and, we could say, almost biological approach: it defines a learning agent in terms of an environment, the agent’s interactions with and within the environment, and learning as some meta-algorithm which allows the agent to simulate the environment (in AIXI’s case by literally exploring the space of environment-simulating programs). AIXI is well loved because it takes numerous philosophical concepts or memes that were already well established in the cybernetics/systems view of intelligence and formalizes them:
- thought is a form of highly efficient simulation
- which, when run over learned knowledge acquired from sensors, allows environment prediction
- and through this allows effective search through the landscape of futures
- and thus guides goal-fulfilling actions
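The four points above can be sketched as a toy planning loop. To be clear, this is not AIXI (which searches over environment-simulating programs); it is a minimal illustration that assumes a hand-written `simulate` function standing in for a learned model, a made-up one-dimensional world, and a hypothetical `reward` function. All names here are invented for the example.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical action set: step left, stay, step right


def simulate(state, action):
    """Stand-in for a learned environment model: predict the next state."""
    return max(0, min(10, state + action))


def reward(state, goal=7):
    """Hypothetical goal signal: being closer to the goal is better."""
    return -abs(goal - state)


def plan(state, horizon=3):
    """Search every action sequence up to `horizon`; return the best first action."""
    best_value, best_first = float("-inf"), None
    for seq in product(ACTIONS, repeat=horizon):
        s, total = state, 0
        for a in seq:
            s = simulate(s, a)   # thought as simulation of the environment
            total += reward(s)   # score the predicted future
        if total > best_value:
            best_value, best_first = total, seq[0]
    return best_first


def run(start=2, steps=8):
    """Act in the world; in this toy the real world equals the model."""
    state, trajectory = start, [start]
    for _ in range(steps):
        state = simulate(state, plan(state))
        trajectory.append(state)
    return trajectory


print(run())
```

The exhaustive search over `len(ACTIONS) ** horizon` futures is only workable because the toy world is tiny, which is the scaling problem any realistic version has to solve.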
The dawning realization from the biological school is that real learning (the murkiest and most mysterious of the above concepts) is an emergent phenomenon of the actual patterns within the data environment itself. In a nutshell, the biological approach says that learning, and neural organization in particular, can emerge spontaneously just from the interaction of relatively simple localized computational elements and the information streaming in from the environment.
Self-organization is the key takeaway principle from real biology, but its impact on AI to date has been rather minimal. I think this will have to change for us to reverse engineer the brain. Thinking about the brain in terms of algorithms is not even the right approach. One needs to think about how the brain’s cortical maps automatically self-organize into efficient algorithm implementations just through the process of being exposed to data. That is what learning is. Real learning is always unsupervised and self-organizing.
A Good Start: the Visual Cortex
The primate visual cortex is a good starting point for understanding the brain. This section is mainly a summary of the work of Poggio et al. at MIT on the feedforward visual stream. If you are already familiar with this, you may want to skip down to “Emergent Theories of Learning”.
The cortex is largely self-similar, so if we can understand how one region works, that same model can then be applied to understanding the rest. The visual cortex is a good place to start mainly because it’s the primary entry point for data coming into the system, so it allows the chain of information processing to be more easily mapped out and understood. As a result we have a great deal of accumulated data, which has led to some larger-scale algorithmic models that seem to be a good fit for how the visual cortex processes information: the models can predict phenomena from the neural level up to even the psychological level (with the algorithmic models performing similarly to humans or monkeys in well-controlled psychological visual tests).
MIT’s aptly named “Center for Biological and Computational Learning” has developed and tested this model; a good overview is “A quantitative theory of immediate visual recognition”. What’s particularly interesting is that in these cases where we have a very accurate model (such as the quick feedforward ventral path), the model performs best in class compared to other known AI approaches. In fact, according to the MIT model and data, their biologically inspired vision system is the benchmark for quick recognition. And this is just one piece of the visual system; once you add in the rest of the components, such as attentive focus, saccades, retinal magnification, motion, texture and color processing, the dorsal stream, etc., you get a full system which is leaps and bounds beyond any current machine vision system.
Now this is all interesting, but what’s far more interesting is that almost none of this complex system appears to be specifically genetically coded – the cortical neurons somehow just self-organize automatically into configurations that perform the desired computation at each step. So it’s not just a clever algorithmic solution, it’s the one clever trick to rule them all: a meta-algorithm which somehow magically produces clever algorithmic solutions.
From a biological perspective, this is actually to be expected, as biological programs are all about maximizing functional output while minimizing explicit information. Our DNA codes for somewhere around only 10,000 to 100,000 proteins, and not much of that is brain specific. The DNA codes first (both in terms of developmental history and evolutionary history – as ontogeny recapitulates phylogeny) for proteins that can self-organize into cells, then the minimal changes to get those cells to self-organize into organs, and then the minimal changes on top of all that to get those organs to self-organize into organisms, and so on.
Now, the really brief summary of the feedforward ventral path: This pathway is like a series of image filters that transform a raw 2D image into an abstracted statistical ‘image’. The final output can be thought of as an ‘image’ of sorts where the activation of small regions (the pixels) corresponds to or represents the presence of actual objects in the scene. It’s not exactly a 1 neuron = 1 pixel = 1 object map, but it’s effectively similar and can be imagined as a map where each pixel (or more accurately, each small local statistical pattern of activation) corresponds to the identification of a particular object in the scene.
For example, in the final output layer, an individual neuron (pixel) may turn on only when there is a car in the image coming in to the retina. This pathway is not concerned with the location of objects, quantity, etc.; it’s only concerned with rapid identification – answering the question – what am I seeing? This information is of obvious importance to organisms. So how does it work? Surprisingly, it doesn’t appear to be all that complex:
Retina/LGN: High Pass / Low Pass Filterbanks: The first stages of processing occur in the retina itself. Each neuron has dendrites which connect to something like a small circular window of the input space. The synapses at each connection have some variable multiplicative effect on signal transmission, and then the dendritic branches and cell body sum these responses. This leads to the familiar simple integrate-and-fire neuron model where the neuron performs essentially a dot product (a weighted sum) of its input data I and its set of synaptic weights W; over a whole population of neurons this becomes a matrix multiplication. This can just as easily be thought of as a customizable filter bank.
In the retina, the synapses arrange to perform simple low or high pass filters. The typical pattern is positive weights in a circular region in the middle surrounded by a larger region of negative weights. This looks like a large black circle with a smaller white circle embedded in it. The other typical pattern is just the reverse. These patterns come in various sizes, from tight small white circles to larger diffuse ones. What does this do to the image? These are basically high-to-low pass filters which essentially break the image up into a set of multi-resolution bands, very similar to the first stages of multiresolution image compression, i.e. wavelet analysis. This is not entirely surprising, as the optic nerve has a much lower bandwidth than the retina’s input – image compression makes sense. The output would look very much like taking an image and band pass filtering it in Photoshop. The output you get is largely the edges at different scales – a sparse encoding of the input and a simple yet effective form of compression.
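The center-surround weight pattern described above is commonly approximated as a difference of Gaussians. The sketch below is an illustration under that assumption; the kernel size, the two sigma values, and the function names are all made up for the example rather than taken from retinal data.

```python
import math


def dog_kernel(size=7, sigma_center=1.0, sigma_surround=2.0):
    """On-center kernel: a narrow positive Gaussian minus a wider one."""
    half = size // 2

    def gauss(x, y, s):
        return math.exp(-(x * x + y * y) / (2 * s * s)) / (2 * math.pi * s * s)

    k = [[gauss(x, y, sigma_center) - gauss(x, y, sigma_surround)
          for x in range(-half, half + 1)]
         for y in range(-half, half + 1)]
    mean = sum(map(sum, k)) / (size * size)
    # Subtract the mean so the weights sum to zero: flat regions give no response.
    return [[v - mean for v in row] for row in k]


def filter_image(image, kernel):
    """Weighted sum over each local window (valid region only, no padding)."""
    n = len(kernel)
    rows, cols = len(image) - n + 1, len(image[0]) - n + 1
    return [[sum(kernel[j][i] * image[r + j][c + i]
                 for j in range(n) for i in range(n))
             for c in range(cols)]
            for r in range(rows)]


# Toy input: dark left half, bright right half (a single vertical edge).
image = [[0.0] * 8 + [1.0] * 8 for _ in range(16)]
response = filter_image(image, dog_kernel())
for value in response[5]:
    print(round(value, 3))
```

Away from the edge the response is essentially zero, and it swings negative then positive as the window crosses the boundary, which is the sense in which the output is "largely the edges". The off-center pattern is the same kernel with the sign flipped.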
V1: The V1 region is the largest single cortical region in primate brains, and it performs another simple image filtering step. The input image coming in to V1 is more or less the edges at various scales, so quite naturally V1 identifies edges. The cortex has a laminar (sheet-like) structure at the large scale. If you zoom in closer you’ll see that it has a layered organization, sort of like a layer cake, with five to six layers depending on how you count them (they are not all that clearly delineated – remember, the brain is stochastic). Neurons in a particular localized region seem to redundantly code the same thing – this small level of scale is called the micro-column. Individual output neurons in a micro-column have nearly identical receptive fields and appear to code equivalent responses (things they respond to, incoming and outgoing connections, etc.). It appears you can thus functionally reduce down to the micro-column level as the fundamental unit of computation in the cortex. Micro-columns are then loosely arranged into macro-columns. Neighboring micro-columns in the larger macro-column have very similar receptive fields but can have quite different responses. These micro-column ‘patches’ in V1 have synaptic weights that correspond to oriented edge filters of several different scales and orientations. The orientations and scales are rather quantized – with something like 4-6 orientations and a similar or smaller number of scales. The output of V1 then is best visualized as a set of NxM smaller subimages. A lit pixel in a subimage (coded as an active micro-column) represents the presence (or likelihood) of a line of a particular direction and size in some small neighborhood of the original image. Each V1 region (one in each hemisphere) has perhaps a million sub-columns, so it’s quite reasonably sized.
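To make the "set of subimages" picture concrete, here is a small sketch with four quantized orientations at a single scale. The 3x3 line-detector kernels are a standard image-processing stand-in chosen for brevity, not measured V1 receptive fields, and `orientation_maps` is a name invented for this example.

```python
# Hypothetical 3x3 line detectors for four quantized orientations (in degrees).
KERNELS = {
    0:   [[-1, -1, -1], [2, 2, 2], [-1, -1, -1]],   # horizontal
    45:  [[-1, -1, 2], [-1, 2, -1], [2, -1, -1]],
    90:  [[-1, 2, -1], [-1, 2, -1], [-1, 2, -1]],   # vertical
    135: [[2, -1, -1], [-1, 2, -1], [-1, -1, 2]],
}


def orientation_maps(image):
    """Return one rectified response map (a 'subimage') per orientation."""
    rows, cols = len(image) - 2, len(image[0]) - 2
    maps = {}
    for angle, k in KERNELS.items():
        maps[angle] = [[max(0, sum(k[j][i] * image[r + j][c + i]
                                   for j in range(3) for i in range(3)))
                        for c in range(cols)]
                       for r in range(rows)]
    return maps


# Toy input: a single vertical line on a dark background.
image = [[1 if c == 4 else 0 for c in range(9)] for r in range(9)]
maps = orientation_maps(image)
strongest = max(maps, key=lambda angle: max(map(max, maps[angle])))
print(strongest)
```

Each entry of `maps` plays the role of one subimage: a lit pixel means "a line of this orientation is present in this small neighborhood". Adding more scales would multiply the number of maps, giving the NxM set described above.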
V2: The input from V1 goes to V2, which performs another simple filtering step. It performs something very similar to taking a set of max filters across the output of V1, effectively a max filter on each of the NxM subimages. Each micro-column in V2 has an orientation and scale preference just like V1, and activates when any edge of its preferred orientation and scale comes in. The response doesn’t change much when there are multiple matching edges in its filter window. It’s not exactly a max operation, but it’s close – Poggio et al. model it as a softmax operation. The output of V2 then is a smaller condensed set of NxM subimages where each pixel represents the presence of an edge of a particular orientation and scale in a wide sub-window of the image.
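A minimal sketch of this pooling step follows. It uses a hard max over non-overlapping windows for simplicity, whereas the text notes the model uses a softmax; the window size and the toy maps are assumptions made for the example.

```python
def max_pool(feature_map, window=2):
    """Take the max over non-overlapping window x window blocks."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [[max(feature_map[r + j][c + i]
                 for j in range(window) for i in range(window)
                 if r + j < rows and c + i < cols)
             for c in range(0, cols, window)]
            for r in range(0, rows, window)]


# Three toy V1 maps for one orientation: one edge, the same edge shifted by a
# pixel, and two matching edges inside the same pooling window.
one_edge = [[6, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
moved    = [[0, 0, 0, 0], [0, 6, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
two_edge = [[6, 0, 0, 0], [0, 6, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]

for v1_map in (one_edge, moved, two_edge):
    print(max_pool(v1_map))
```

All three inputs pool to the same condensed map, which shows the two properties described above: tolerance to small shifts, and a response that does not grow when several matching edges fall in the same window.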
V4/PIT/AIT: At the next and higher stages in V4 and up, the neural responses become somewhat more specific and begin coding for common patterns of edges: basic shapes. According to the theory of Poggio et al., the cortical units can be roughly classified into two types: simple and complex. The simple cells perform the typical synaptic-weighted summation and adjust their synaptic weights over time to match frequently occurring input patterns. The complex cells perform the max-like operation on a local spatial window of similarly tuned simple cell inputs as described for V2 earlier. The simple and complex cell types alternate in layers. After two or three such iterations you will have units which code for particular common patterns of edges appearing anywhere in the image. The layered hierarchy is not strict, and some connections bypass layers. By the time you get to the top of this hierarchy there is enough information for cells in higher decision regions (such as the prefrontal cortex) to make reasonably quick identifications of objects. There is enough information for cells to code for location-dependent arrangements of edges, but this is balanced by the need for invariance to rotations. For example, it’s easy to identify a car shape from numerous angles at a glance, but it’s much more difficult for us to recognize text characters or faces that are flipped 180 degrees – simply because we rarely encounter those patterns at such unusual orientations.
Emergent Theories of Learning
How can this system of edge-filters and shape pattern dictionaries develop automatically?
It appears that it self-organizes based on some simple local rules, very much like a cellular automaton. This was recognized more than a decade ago. The short paper that really put it together for me is called “A Self-Organizing Neural Network Model of the Primary Visual Cortex”. The key idea is rather simple. Take a prototypical 2D laminar neural network like the simple cortical model discussed above. A 2D input pattern flows into the neural array from the bottom, and each neuron forms a bunch of connections across the input grid, forming something like a circular pattern centered around the neuron (with synaptic weights falling off in a Gaussian-like pattern).
Mathematically, the neuron performs something like a dot product of a local patch of the input with its synaptic weights. If you apply an appropriate simple Hebbian learning rule to a random initial configuration of this system (synaptic weights increase in proportion to a presynaptic-postsynaptic coincidence), then these neurons will evolve to represent frequently occurring input patterns.
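Here is a minimal single-neuron sketch of that idea. A pure Hebbian update grows without bound, so this example swaps in Oja's rule (Hebbian growth plus a decay term that keeps the weight vector near unit length); that choice, the learning rate, the noise level, and the function names are assumptions for illustration and are not the rule used in the paper.

```python
import random


def hebbian_train(patterns, n_inputs, steps=500, lr=0.05, seed=0):
    """Train one neuron on noisy presentations of the given input patterns."""
    rng = random.Random(seed)
    w = [rng.uniform(0.0, 0.1) for _ in range(n_inputs)]  # small random start
    for _ in range(steps):
        p = rng.choice(patterns)
        x = [v + rng.gauss(0.0, 0.1) for v in p]          # noisy presentation
        y = sum(wi * xi for wi, xi in zip(w, x))          # postsynaptic activity
        # Hebbian term lr*y*x, plus Oja's decay term -lr*y*y*w to bound growth.
        w = [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]
    return w


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


pattern = [1.0, 1.0, 0.0, 0.0]   # the frequently occurring input
weights = hebbian_train([pattern], n_inputs=4)
print([round(w, 2) for w in weights], round(cosine(weights, pattern), 3))
```

Starting from random weights, the neuron's weight vector rotates toward the pattern it sees most often, which is the sense in which it comes to "represent" that input.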
But now it gets more interesting: if you add an additional set of positive and negative lateral connections between neurons within a layer, then you can get more complex cellular-automaton-like behavior. More specifically, if the random lateral connections are picked from a distribution such that short-range connections are more likely to be positive and long-range connections are more likely to be negative, the neurons will tend to evolve into small column-like pockets where neurons are mutually supportive within columns but are antagonistic between columns. This representation also performs a nice segmentation of the hypothesis space. The model developed in the paper – the RF-LISSOM model – and later follow-ups provide a very convincing account of how V1’s features can be fully explained by the evolution of basic neurons with simple local Hebbian learning rules and a couple of homeostatic self-regulating principles.
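The competitive effect of short-range excitation and long-range inhibition can be shown with a small one-dimensional sketch. This only illustrates the settling dynamics with a fixed, hand-picked lateral profile; in the model described above the lateral weights are themselves random and adapt over time, and every number and name below is an assumption for the example.

```python
def lateral_weight(distance):
    """Hypothetical fixed profile: nearby neurons excite, distant ones inhibit."""
    if 1 <= distance <= 2:
        return 0.3
    if 3 <= distance <= 8:
        return -0.2
    return 0.0


def settle(inputs, steps=20):
    """Iterate activity = clip(input + lateral drive) for a fixed number of steps."""
    n = len(inputs)
    activity = list(inputs)
    for _ in range(steps):
        activity = [
            min(1.0, max(0.0, inputs[i] + sum(
                lateral_weight(abs(i - j)) * activity[j] for j in range(n))))
            for i in range(n)
        ]
    return activity


inputs = [0.0] * 20
inputs[4] = inputs[5] = inputs[6] = 0.6   # a strong, spatially coherent patch
inputs[11] = 0.4                          # a weaker, isolated competitor
final = settle(inputs)
print([round(a, 2) for a in final])
```

The coherent patch reinforces itself and saturates, while the isolated competitor is driven to zero by long-range inhibition: neurons are mutually supportive within a pocket and antagonistic between pockets.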
Can such a simple emergent model explain the rest of the ventral visual pathway?
It seems likely. If you took the output of V1 and fed it to another layer built of the same adapting neurons, you’d probably get something like V2. It wouldn’t be the exact softmax operation described by Poggio et al., but that is something of an idealization anyway. The V2 layer would organize into micro-columns which would tune to frequent output patterns of V1. The presence of an edge of a particular orientation is a good predictor of an edge of the same orientation activating somewhere nearby – both because the edge may be long and because as the image moves across the visual stream edges will move to nearby neuron populations. It thus seems likely that V2 neurons would self-organize into micro-columns tuned to edges of a particular orientation anywhere in their field – similar to the softmax operation description. As you go higher up the hierarchy, the tuning would be more complex, and you would have micro-columns adapting to represent more complex common edge collections.
The self-organizing model discussed so far is missing one important type of connection pattern found in the real cortex: feedback connections which flow from higher regions back down towards the lower regions close to the input. These feedback connections tend to follow the feedforward connections bringing processed visual input up the hierarchy, but they flow in the opposite direction. These feedback connections seem pretty natural if we think of a pathway such as the visual system as a connected 3D region instead of a collection of 2D patches. If you took the various 2D patches of V1, V2, etc. and stacked them on top of each other, you’d get some sort of tapered blob shape – kind of like a truncated pyramid. It would be wide at the base (V1 – the largest region) and would then taper, as the layers get smaller as you go up the hierarchy. If you arranged the visual stream into such a 3D volume, the connections could just be described by some simple 3D distribution. Visual input comes in from the bottom and flows up the hierarchy, but information can also flow laterally within a layer and back down from higher to lower layers.
What is the role of the downward flowing feedback connections?
They help reinforce stable hypotheses in the system. An initial flow of information up the hierarchy may lead to numerous competing theories about the scene. Feedback connections tracing the same paths as the inputs will tend to bias for the supportive components. For example, if the higher regions are expecting to see a building, this would then flow down the feedback connections to bias neurons representing appropriate collections of right angles, corners, horizontal and vertical edges, and numerous other unnameable statistical observations that lead to the building conclusion. If these supporting beliefs are strong enough vs their competition, the ‘building’ pathway will form a stable self-reinforcing loop. This is essentially very similar to Bayesian belief propagation – of course without necessarily simulating it exactly (which could be burdensome).
It’s also interesting to note that the feedback connections will perform something similar to backpropagation. When a neuron fires, the Hebbian learning rule will up-regulate any recently active synapses that contributed. With the feedback connections, this neuron will send a signal back down to the lower layer input neurons. As the system evolves into mutually supportive pathways, the feedback signal is likely to closely associate with the input neurons that activated the higher level synapses. The feedback signal will thus trace back the input and reinforce the contributing connections.
From cortical maps to a full intelligence engine
Having read this far, and if you’ve read my other short bits about the brain or, much better yet, the literature they derive from, you have a pretty good idea of how self-organizing hierarchical cortical maps work in theory and understand their great power. But there’s still a long way to go from there to a full-scale intelligence engine such as a brain. In theory, one of these hierarchical inference networks can also, operating in reverse flow, translate high level abstract commands into detailed motor control sequences, very much like the hierarchical sensor input stream but in reverse. Hawkins gives some believable accounts of how such mechanisms could work.
What’s missing then? A good deal. There is much more to the brain than just a hierarchical probabilistic knowledge engine – although that certainly is a core component. One familiar with computer architecture would next ask, “what performs data routing?”. This is a crucial question, because it’s pretty clear you can’t do much useful computation with a fixed topology – to run any interesting algorithms you need some way for different brain regions to communicate with other brain regions dynamically. A fixed topology is not sufficient.
That functionality appears to be provided by the thalamus, one of the oldest brain regions still part of the core networks. It’s also perhaps the most important. Damage to the thalamus generally results in death or coma, which is to be expected if it is a major routing hub (vaguely equivalent to a CPU). For example, when you focus your attention on a speaker’s words, the first stages of processing probably flow through a fixed topology of layered computation, but once those are translated into the level of abstract thoughts, they need to be routed more widely to many general cortical layers that deal with abstract thinking – and this cannot use a fixed topology.
At this apex level of the hierarchy, it doesn’t much matter whether the words originated as audio signals, visual patterns, or even from internal monologue; they need to eventually reach the same abstract processing regions for semantic parsing, memory recall and the general mechanisms of cognition. This requires at least some basic one-to-many and many-to-one dynamic routing. Selective attention requires similar routing.
The visual system performs selective attention and dynamic routing mechanically by actually moving the eye and thus the fovea, but consider that you need that same mechanism in many domains where the mechanical trick doesn’t apply. For instance, your body’s somatosensory (sense of touch) sensor network also uses selective attention (focusing a large set of general processing resources on a narrow input domain) and this suggests a neural mechanism of dynamic routing.
Internal Monologue and the Core Routing Network
Venturing out of the realm of current literature and into my own theoretical space, I have the beginnings of a meta-theory concerning the brain’s general higher level organization which centers around a serial core routing network. We tend to think of the brain as massively parallel, which is true at the level of the cortical hierarchy described earlier. But the fact is that at the highest level of organization, at the apex of the cortical pyramid, you have a network involving largely the hippocampus, cortex, and thalamus which is functionally serial. We have a serial stream of consciousness, which makes some sense for coordinating actions, language through a serial audible stream, and so on. Our inner monologue is essentially serial at the conscious level.
Note that having a serial top level network is not in any sense preordained. We could have evolved vocal cords which encoded two or more independent audio streams and had a community of voices echoing in our heads. Indeed, the range of human mind space already encompasses such variants on the fringe.
In my current simple model, the (typically) serial inner core routing network would mostly function as a simple broadcast network which connects the highest layers of the cortex, hippocampus, and thalamus. This core network maps to both the task-positive and task-negative networks in the neuroscience literature.
What types of messages are broadcast on the core routing network? Thoughts, naturally.
The neuro-typical experience of a serial inner monologue is the reverberations of symbolic thoughts activating the speech and auditory pathways. For most of us, we first learn to understand and then speak words through the audio interface, and then learn to read well after. As you are reading these words, you are probably hearing a voice in your head. Your projection of my voice to be exact. In a literal sense, I am programming your mind right now. But don’t be alarmed, this happens whenever you read and understand anything.
Perhaps if one learned words first through the visual senses and then later learned to understand speech, one would ‘see’ words in the mind’s eye. I’m not aware of any such examples; this is just a thought experiment.
It’s difficult to imagine pre-linguistic thoughts, raw thoughts that are not connected to words. It’s difficult to project down into that more constrained, primitive realm of mindspace. Certainly some of our thought streams are directly experiential (such as recalling a visual and tactile memory of walking barefoot on a sunny tropical beach), but it’s difficult to imagine a long period of thinking constrained to this domain alone.
The core routing network allows us to take words and translate them into patterns of mental activation which simulate the state of mind which originally generated the words themselves. This sounds interesting; it’s probably worth reading again.
Imagine the following in a little more detail:
You are walking on a deserted jungle beach somewhere in Costa Rica. The sun is blazing but a slight breeze keeps the air pleasant. Your feet sink gently into the wet sand as small waves lap at your ankles. A lone mosquito nibbles on your shoulder and you quickly brush it off.
Those are just words, but in reading them you recreate that scene in your mind as the words activate specific high level cortical patterns which cascade down into the lower levels of the sensory and motor pyramids using the feedback path discussed earlier. The pattern associations were learnt long ago and have been reinforced through numerous rapid replays coordinated by the hippocampus during your sleep. If you were to actually look at your thought patterns as visualized with a high resolution scanner, you would see a trace very similar to the trace of your brain actually experiencing the described scene. It’s different of course, not quite as detailed, and the task-negative network does not activate motor outputs, but at the neural level thinking about performing an action is just a tad shy of performing said action.
This is the power of words.
So for a brain architecture, the high level recipe looks something like this: take a hierarchical feedforward and feedback (dual directional) multi-sensory and motor cortex, combine in a hippo-cortical-thalamic core routing network, add in an offline selective memory optimization process (sleep), and finally some form of widely parallel goal directed search operating in compressed cortical symbolic space, and you have something interesting. This of course is an over-simplification of the brain, which has many more major circuits and pathways, but nonetheless we don’t need all of the specific complexity of the brain. What’s more important are the general mechanisms underlying emergent complexity – such as learning.
Of course, the devil is in the details, but it looks like the main components of a brain architecture are within reasonable reach this decade. I see the outline of a next step where you take the components discussed above and integrate them into an AIXI-like search optimizer – but crucially searching within the extremely compressed abstract symbolic space at the apex of the cortical pyramid.
Simulating and searching in such extraordinarily compressed spaces is the key to computational efficiency in the supremely complex realities the brain operates in, and AIXI can never scale by using actual full blown computer programs as the basis for simulation. The key lesson of the cortex is that intelligence relies on compressing and abstracting away nearly everything. Efficiency comes from destroying most of the information.