After 15 happy years on WordPress.com, I’ve moved to Substack at neuroai.science. I’ve already published more than a dozen long-form articles on neuroscience and AI over there in newsletter format. You can subscribe or read the archives completely free.
Substack makes it easy to create sticky experiences for audiences and reach them where they are–in their email. This is all the more important as traditional social media channels have de-emphasized link sharing, making it harder to maintain an audience. The archives of xcorr.net will remain here. Thank you for supporting the blog over its long run!
NeuroAI is the budding research field at the intersection of neuroscience and artificial intelligence. One of the core concepts used in the field is that artificial neural networks can act as good models of the brain. For example, it’s often claimed that convolutional neural networks can account for the response of the ventral visual stream to images. Similarly, large language models have been found that capture what’s going on inside the language network of the brain. When I have discussions with people outside of our narrow field, this often triggers surprise. Wait, how can a deep neural network be like the brain? What does it mean to be a good model? How can the squishy stuff be like the silicon?
Here I want to get into the specifics of how, mechanically, you compare a brain to an Artificial Neural Network (ANN). I give some historical background first, focusing on classic results and methods that originated in the field of visual perception. I will explain the nitty gritty of how correspondence scores between brains and ANNs are calculated. I discuss some of the conceptual difficulties inherent in the classic methods of linear regression and RSA, and explain some of the proposed alternative metrics in Williams et al. (2021). You’ll get the most out of these sections if you’re coming from a math background: statistician, data scientist, computational neuroscientist, etc. Nevertheless, if you’re less math-oriented, I will give intuitive explanations so you can follow along. I conclude with a call to create and use more nuanced and detailed comparisons between brains and neural nets.
History: the visual ventral stream as a convolutional neural net
Let’s start with the classic example of convolutional neural networks vs. the ventral visual stream. In the late 80’s, Yann LeCun was inspired by the classic work of Hubel and Wiesel on the physiology of the visual cortex. He created a neural network consisting of sandwiched layers of selectivity and invariance operations, not unlike the simple and complex cells of the primary visual cortex. His network, LeNet, was the first example of a convolutional neural net (CNN) trained with gradient descent, and it could classify handwritten digits. Two decades later, this work was then greatly scaled up by Alex Krizhevsky and co. in Geoff Hinton’s lab at the University of Toronto. AlexNet led to the ImageNet moment, in 2012, where a CNN did far better than state-of-the-art classical machine learning methods at image classification, showing that “deep learning has arrived”.
If CNNs were inspired by the brain, and they do the same thing as a brain (image classification), could they be a little like the brain? If it talks like a duck, and it walks like a duck, is it a duck? This is a surprisingly subtle question, and two teams (Dan Yamins in Jim DiCarlo’s lab, and Khaligh-Razavi in Niko Kriegeskorte’s) answered this question with a vigorous “yes, maybe!?” in 2014. They looked at the responses of the ventral visual stream—areas of the brain traditionally associated with shape perception and image classification—and compared them against the ANNs of the time.
The mechanics of comparing a brain to an ANN
So how did they compare a brain to an ANN? They followed what’s now considered a classic recipe. You need three inputs:
an ANN trained for some task (e.g. visual classification)
a brain (human or non-human)
a set of benchmark stimuli (e.g. a set of images, sentences, videos, etc.)
You then proceed as follows:
You probe the ANN with all the benchmark stimuli. You obtain a matrix of responses X. Each row is one stimulus (=one image, one movie clip, one sentence, etc.). Each column corresponds to a subunit of the neural network (e.g. the collected intermediate activations of the ANN).
You do the same with the brain. That means, for instance, having someone sit inside the scanner, looking at the same set of images that the neural network was exposed to, while you record their functional responses. You collect the data into a new matrix. The rows are again exemplars, but now the columns represent something else: physical neurons, EEG sensors, fMRI voxels, etc. This gives us a matrix of responses Y.
By construction, X and Y have the same number of rows, but different column counts. We’ve thus reduced the problem of comparing a brain and an ANN to the problem of comparing two matrices of different shapes.
There are two now classic ways of doing this:
Linear regression: Do multiple linear regression to map one matrix onto the other. Learn a weight matrix W such that the residual ||Y – XW|| is minimized. This requires some regularization: Tikhonov regularization, which penalizes the sum-of-squares of W, might help here. An alternative is to require that the mapping is low-rank using partial least-squares with a limited number of components. This is the path used by Yamins & DiCarlo. The final score is the R2 of the linear regression, perhaps calculated through cross-validation (CV).
Representational Similarity Analysis (RSA): One difficulty with the previous method is that it requires learning a weight matrix W. This is necessary, in part because the dimensionality of the two matrices might be different. If we form the matrices XX’ and YY’, however, we obtain two square matrices of the same size. Furthermore, these matrices are invariant to a relabeling (permutation) of the columns. Thus, we can compare the elements of the two similarity matrices via a correlation coefficient. Technically, you could have a negative score, but you could threshold or square to get a score in the 0-1 range. This is the method pioneered by Niko Kriegeskorte back in 2008.
These procedures will result in two distinct scores for the similarity of matrices: the linear regression score (really, CV R2) and the RSA score. Both have the property that 0 is maximally misaligned and 1 is maximally aligned. This way, we have reduced the difficult philosophical question of what it means for a brain to be like an ANN to a problem of big number good, small number bad.
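To make the two recipes concrete, here is a minimal NumPy sketch of both scores. It is an illustration under stated assumptions, not the pipeline used in either 2014 paper: the function names `ridge_cv_r2` and `rsa_score`, the fold count, and the ridge penalty `alpha` are all hypothetical choices, and the RSA variant simply correlates the off-diagonal entries of XX’ and YY’ as described above.

```python
import numpy as np

def ridge_cv_r2(X, Y, alpha=1.0, n_folds=5, seed=0):
    """Cross-validated R^2 of a ridge (Tikhonov) regression mapping X to Y."""
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    ss_res, ss_tot = 0.0, 0.0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        # Center with training-set means only, so no test data leaks in.
        x_mean, y_mean = X[train].mean(0), Y[train].mean(0)
        Xc, Yc = X[train] - x_mean, Y[train] - y_mean
        # Closed-form ridge solution: W = (X'X + alpha I)^-1 X'Y
        W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ Yc)
        Y_hat = (X[test_idx] - x_mean) @ W + y_mean
        ss_res += np.sum((Y[test_idx] - Y_hat) ** 2)
        ss_tot += np.sum((Y[test_idx] - y_mean) ** 2)
    return 1.0 - ss_res / ss_tot

def rsa_score(X, Y):
    """Correlation between the off-diagonal entries of XX' and YY'."""
    Gx, Gy = X @ X.T, Y @ Y.T          # stimulus-by-stimulus similarity
    iu = np.triu_indices(X.shape[0], k=1)  # upper triangle, no diagonal
    return np.corrcoef(Gx[iu], Gy[iu])[0, 1]

# Toy data: 200 stimuli, 10 ANN units, 5 "neurons" that are a noisy
# linear readout of the ANN units.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = X @ rng.standard_normal((10, 5)) + 0.1 * rng.standard_normal((200, 5))
print("CV R^2:", ridge_cv_r2(X, Y))
print("RSA score:", rsa_score(X, Y))
```

Note that the cross-validated R2 can come out slightly negative for unrelated matrices, and the RSA correlation can be negative too, which is why in practice the scores are thresholded or squared to land in the 0-1 range.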
The sophisticated logic behind ANN/brain comparisons
Using these tools, the two historical papers came to similar conclusions: deep neural nets trained on images have similar representations to the ventral visual stream of the brain. For the Yamins paper, it was with linear regression and with single-cell neurophysiology; for Khaligh-Razavi, it was RSA on fMRI data.
Conceptual difficulties with comparing brains and ANNs
Procedures to map brains to ANNs like linear regression and RSA swap out deep philosophical issues about the nature of perception with a technocratic procedure. In fact, there’s a lot that hides under these scoring procedures.
Both methods are correct at the extremes: if you try to compare two random matrices against each other, you’ll get a score of 0; and if you compare a brain (or an ANN) against itself, you’ll get a score of 1. So a brain is similar to itself but not to random noise: cool. That’s a pretty low bar to achieve, and it’s in the middle scores that we run into conceptual difficulties. What does a 0.5 similarity between the brain and an ANN mean? What are we actually trying to quantify?
One rubber duck. The same rubber duck, from another angle. Or is it a different one?
There are different ways in which we can conceptualize how the brain can be like an ANN. Let’s name some of these ways:
We could ask for a 1-to-1 correspondence: each subunit in the ANN should correspond to a neuron in the brain. That’s a very high bar to clear! Let’s call this 1-to-1 correspondence.
We could ask that distances are preserved in the brain and in the ANN. An analogy in 3d will help. Two three-dimensional shapes (say, rubber duckies, above) can be similar, regardless of their (arbitrary) orientation. Distances between two points on the surface of the duckies are preserved: they are invariants. Let’s call this orthogonal correspondence.
We could ask that one manifold of responses can be morphed into another through a linear transformation. Let’s call this linear correspondence.
It’s not clear how traditional linear regression and RSA scores map to these desiderata. You can make a verbal argument that linear regression is similar to linear correspondence (3), while RSA is most similar to orthogonal correspondence (2). However, there are complications in real implementations: regularization in linear regression, selection of voxels with sufficient signal-to-noise ratio in RSA. This means that the scores might deviate from our (unstated) goals in complex ways.
Traditionally, these concerns have been more or less swept under the rug, and each subfield has converged on its own widely agreed-upon scores: RSA for human neuroscience, linear regression for single-cell neurophysiology. The argument goes that whatever score we choose, alternative scores would correlate with it. This encourages papers to use whatever is the most commonly used score in their subfield, which allows the scores to be compared more readily from paper to paper. A perfectly reasonable heuristic, but a little unsatisfying.
Williams et al.’s solution: computational shape analysis
Williams et al. (2021) offer a nice treatment of these issues with some good conceptual solutions, casting the problem as one of computational shape analysis: analyzing shapes in high-dimensional spaces with statistical tools.
Different metrics are tolerant or not of rescaling along arbitrary axes. Top: orthogonal correspondence. Bottom: linear correspondence. From Williams et al. 2021.
First, they project the two representations (brain and ANN) onto a fixed, common-sized representation. You could use random projections, subsampling, PCA, etc. to bring two matrices with different widths to the same width. Call the resulting matrices $X^\phi$ and $Y^\phi$. One of their proposed distances is $d(X, Y) = \min_{T \in G} \lVert X^\phi - Y^\phi T \rVert$. T is a transformation within some group G that captures what it is that we mean by “same”. Some potential choices:
If you want one-to-one correspondence between the brain and ANN, G can be the set of all permutation matrices
If you want distances to be preserved, then G can be the orthonormal matrices
If you want to allow squishing along arbitrary linear dimensions, you can let G be arbitrary linear transformations
It turns out the resulting scores are proper distances that respect the triangle inequality, which has some nice benefits for clustering. They also introduce more metrics for more unusual scenarios. For example, for convolutional neural networks, you’d want something that allows remapping along the channel dimension, but not along the space dimension. Although each metric seems like it requires a brute-force search, it turns out there are clever ways of calculating the optimal transformations in each scenario.
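As an example of one of those clever shortcuts, the orthogonal case has a closed-form solution through the singular value decomposition (the classic orthogonal Procrustes problem). The sketch below is a NumPy illustration of that one case only, assuming the two matrices have already been brought to the same width; `orthogonal_shape_distance` is a hypothetical name, not a function from the Williams et al. package, and it skips the centering and normalization steps a full implementation would likely include.

```python
import numpy as np

def orthogonal_shape_distance(X, Y):
    """Return (min over orthogonal T of ||X - Y T||_F, the optimal T).

    X and Y must have the same shape (stimuli x features).
    """
    # Orthogonal Procrustes: with Y'X = U S V', the minimizer is T = U V'.
    U, _, Vt = np.linalg.svd(Y.T @ X)
    T = U @ Vt
    return np.linalg.norm(X - Y @ T), T

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))

# A rotated copy of X: same "shape", different orientation.
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
d_rotated, _ = orthogonal_shape_distance(X, X @ Q)

# A copy of X stretched along its axes: no rotation can undo this.
d_stretched, _ = orthogonal_shape_distance(X, X @ np.diag([1.0, 2.0, 3.0, 4.0]))

print("rotated copy:", d_rotated)
print("stretched copy:", d_stretched)
```

The rotated copy sits at (numerically) zero distance, like the two rubber ducks seen from different angles, while the stretched copy does not. Allowing arbitrary linear transformations for G would also absorb the stretch.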
The advantage of axiomatic methods
Overall, Williams et al. take cues from the recent success of geometric deep learning (e.g. Bronstein et al. 2021) to ask deep questions about the metrics we use. More than just an incremental numerical improvement, I think it’s a big conceptual improvement over linear regression and RSA: you know what you’re getting into. An axiomatic approach tells you very explicitly what the underlying assumptions are.
Ironically, this makes it easier to poke holes in some of these scores. For instance, one of the core desiderata that Williams et al. fulfill is that their chosen score should be symmetric. Now, I would argue that a brain can be more similar to a neural network than vice-versa. Modern large language models (LLMs) represent more stuff than humans: GPT-4 knows Markdown, LaTeX, English, German, how to write listicles and poems and sound like a pirate. Most people can only do a subset of these things. Divergences for distributions are naturally asymmetric, so there’s nothing inherently fishy about a score that is asymmetric. I’m sure there is a variant of linear regression’s asymmetric R2 which can be properly axiomatized according to the framework laid out by Williams et al.
In the meantime, one can use combinations of these well-justified metrics to answer interesting questions about how brains vs. ANNs represent information. For example, the delta between the linear correspondence (metric 3) and the orthogonal correspondence (metric 2) is an index of how much warping is necessary to get two latent spaces to match. This could help reveal whether a brain’s representation is a subset of or a noisy approximation of a particular artificial neural net, which is ambiguous for each individual metric. Williams et al. have made their metrics available in this Python package, so you can try this out yourself.
Discussion
For all their justification, we haven’t seen much empirical work using well-axiomatized scores beyond linear regression and RSA. I think the reason for the status quo is that we’ve been focused on coarse characterizations thus far. This will surely need to change as we go beyond ANNs as metaphors for the brain and start using them as in silico models of the brain.
For instance, I’m interested in using ANNs as models for the brain for the purpose of neural engineering. What I really want is that when I create a virtual lesion in a neural network, it predicts how the real brain will react to a real lesion. Such a causal manipulation score doesn’t yet exist, but I think it would be both conceptually and practically useful. The axiomatic approach of Williams et al. points us towards ways of building this type of score.
For other purposes, however, it may be that the metrics we have are good enough. This recent paper from Tuckute et al. shows that one can learn a transfer function between a large language model and the brain’s language network using linear regression. The model can predict which sentences drive or don’t drive the brain. Using in silico models to predict the response of the brain is helpful in this scenario, and in a certain meaningful sense it means that the brain is like a neural network, along this prediction axis, but perhaps not according to more stringent criteria.
I received this intriguing question from Daria Marchenko, who is a contributor on Moteur de Recherche, on Canadian public radio. I went into a deep rabbit hole trying to find a good answer–especially since she also interviewed the wonderful Sasha Luccioni, an expert on ML and climate change. Most of the estimates I’ve found online seemed off by several orders of magnitude, so I reproduce the analysis so other journalists and curious people can find this information easily.
TL;DR: ChatGPT’s energy use might have peaked in February, with a trillion tokens produced by a server farm of several thousand A100 GPUs. This would have meant six-figure hosting bills per day and led to 7 to 15 metric tons of CO2 per day, or about the equivalent of 400-800 households in the US. That’s not nothing, but in the grand scheme of things is fairly modest, especially compared to more profligate uses of computing like cryptocurrency. What most surprised me about this is the sheer speed of adoption of ChatGPT, which shifts the bulk of the cost from training to inference. In February, ChatGPT saw about 1% of Google’s traffic, or 25% of Wikipedia’s, frankly shocking for a three-month-old product.
Calculation
There’s nothing very open about OpenAI, and most details are under wraps. We can make estimates of its energy use from what we know about probable architecture, cost to users, and public releases, as well as what we know about similar architectures.
The original ChatGPT is powered by GPT3.5, a version of GPT-3 retrained with reinforcement learning from human feedback to be more useful. We’ll assume that this model was a 175 billion parameter model similar to davinci.
Patterson et al. (2022) estimated that the original GPT-3 cost 502 tons of CO2 to train. RLHF would add a bit of overhead on that, perhaps on the order of 1% of the original cost.
The more interesting question is the cost of serving the model. It’s been reported that ChatGPT is the fastest app to reach 100 million users. Similarweb reported 1 billion visits in February to chat.openai.com, compared to about 4 billion for Wikipedia or 80 billion for Google. With an average visit of just under 10 minutes, it seems reasonable to assume that 1000 tokens or so were generated by each visit, or 1 trillion tokens total.
We can estimate how much these 1 trillion tokens would cost in inference from the performance of the similarly-sized, open-source Bloom. It can produce about 1,400 tokens per second on a server with 8xA100 GPUs. To scale up to a trillion tokens in 28 days, we would need 300 8xA100 servers on average, or 2,400 A100 GPUs. That’s enormous!
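The arithmetic behind that server count is short enough to check by hand. Here it is as a small Python sketch using only the figures quoted above (visits, tokens per visit, Bloom throughput, 28 days); the variable names are just for illustration.

```python
visits = 1_000_000_000               # Similarweb estimate for February
tokens_per_visit = 1_000             # rough guess from ~10 minute visits
total_tokens = visits * tokens_per_visit          # 1 trillion tokens

tokens_per_sec_per_server = 1_400    # Bloom on one 8xA100 server
seconds_in_month = 28 * 24 * 3600    # February: 2,419,200 seconds

# Servers needed if the load were spread perfectly evenly over the month.
servers = total_tokens / (tokens_per_sec_per_server * seconds_in_month)
gpus = servers * 8

print(f"servers needed on average: {servers:.0f}")
print(f"A100 GPUs needed on average: {gpus:.0f}")
```

This comes out a little under 300 servers, which the text rounds up to 300 servers and 2,400 GPUs. Real traffic is bursty, so this is an average, not a provisioning target.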
Obviously, the capacity would need to be higher at the end of the month than at the start as more people adopt the product. Given the rapid growth, I’ll assume that at the end of February capacity was 1.4X higher than the average during the month. We can plug this information into the CO2 emissions calculator from Luccioni et al. I assumed inference was done on A100 SXM4 80GB on Azure cloud in the West US region. This comes to 7 metric tons of CO2 per day at the end of February. As a side note, on the public cloud, with a 3-year lease, it would cost OpenAI $90,000 per day to lease!
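As a rough sanity check on the 7 tons figure, we can back out what it implies about the carbon intensity of the electricity. This sketch does not reproduce the calculator's internals; it assumes every GPU draws its full 400 W rated power around the clock and ignores CPUs, networking, and cooling overhead, so treat the result as an order-of-magnitude check only.

```python
avg_gpus = 2_400                 # monthly average, from the estimate above
peak_factor = 1.4                # end-of-month capacity vs. monthly average
peak_gpus = round(avg_gpus * peak_factor)

gpu_power_kw = 0.4               # ASSUMPTION: A100 SXM4 80GB at its 400 W TDP
energy_kwh_per_day = peak_gpus * gpu_power_kw * 24

reported_kg_co2_per_day = 7_000  # the 7 metric tons quoted above
implied_kg_co2_per_kwh = reported_kg_co2_per_day / energy_kwh_per_day

print(f"peak GPUs: {peak_gpus}")
print(f"energy: {energy_kwh_per_day:,.0f} kWh/day")
print(f"implied carbon intensity: {implied_kg_co2_per_kwh:.2f} kg CO2/kWh")
```

Under these assumptions the implied intensity lands at a few hundred grams of CO2 per kWh, which is a plausible range for grid electricity.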
On March 1st, OpenAI switched to a new model for ChatGPT, gpt3.5-turbo. This new model is 1/10th the cost of davinci to the end user, and it is reasonable to assume that it has, consequently, 1/10th the carbon intensity. My hunch is that gpt3.5-turbo is an RLHF version of the curie model. Thus, in March, OpenAI’s emissions must have gone down drastically. This created extra capacity for GPT-4, which is likely at least as carbon-intensive as the original davinci GPT-3, and possibly more. Because GPT-4 is gated behind a payment, total energy outlay must have gone down in March, but in the long term it should return to its past peak and exceed it as higher capacity encourages new use cases. This is a good example of Jevons’ paradox in action.
As an aside, I came up with an alternative number of 15 tons of CO2 per day based on extrapolating Tom Goldstein’s work. So I think this is in the right order of magnitude.
Discussion
I was surprised by the numbers I came up with. Inference is now a big chunk of the cost of LLMs, whereas most of the literature has been focused on training. In the current regime, with these numbers, total carbon output from serving will overshadow training by a large margin over the lifetime of a model. It’s much higher than some fanciful numbers that are in the top 5 results on Google, and much lower than other ones, by several orders of magnitude. I won’t link to the specific estimates, but I am frustrated by OpenAI’s lack of transparency about its hardware use; it has encouraged a cottage industry of data scientists on Medium multiplying big numbers by tiny numbers and not double-checking their results.
For what it’s worth, I think the estimate is bounded by the number of GPUs involved, and based on availability, it cannot be much more than 10,000 A100s (somebody has to pay that capital expense at the end of the day, and 10k A100s is $100M). It also can’t be much less than 1,000, since ChatGPT is reported to be costing OpenAI 6 or 7 figures per day.
In terms of sheer expensive stuff that needs to be bought or leased, 2,400 GPUs is a very large amount. The Jean-Zay supercomputer that was used to train Bloom used about 3000 GPUs. Granted, OpenAI’s servers don’t need very high bandwidth, as they would in training, but it’s a lot of machines going brrr.
You can listen to the segment from Daria Marchenko that aired on Radio-Canada here. In addition to being a founder of a non-profit on digital sobriety and radio contributor, Daria is a talented photographer. She took the professional headshots that grace this website; you can see her portfolio here, featuring the likes of Justin Trudeau and Charles Aznavour.
There’s been remarkable progress in Masked Language Modeling (MLM) and Reinforcement Learning from Human Feedback (RLHF), taking us much closer to generally useful AI than I could have imagined. GPT-4 exhibits impressive capabilities such as passing the bar exam, navigating virtual mazes, and even demonstrating what seems like a theory of mind. This brings us to a fascinating question: does GPT-4 possess common sense?
Exploring GPT-4’s common sense is not just an intellectual curiosity; it has practical implications. As a powerful large-scale language model (LLM), GPT-4 will inevitably be used to generate content and automate decisions. Will these decisions stem from “an alien mind”, or will they be grounded in common sense?
To investigate this question, I assessed GPT-4’s ability to categorize everyday objects. There is abundant data on how humans naturally group objects, such as tools, body parts, colourful items, and so forth. Surprisingly, GPT-4 doesn’t perform particularly well on this simple task. It exhibits different default categories and ways of understanding common objects compared to humans: its intuition for grouping and organizing the world is distinct from ours. In this blog post, I’ll delve deeper into the implications of these findings and discuss their potential impact on the Natural Abstraction Hypothesis and AI alignment.
An artificial intelligence doing a hard classification task. From the movie Wall-E by Pixar.
Many researchers, most recently Tony Zador and colleagues, have argued that a true AI needs grounding in the real world to display physical common sense. Some aspects of semantics that can be easily inferred in the real world may be difficult to deduce from text on the internet. This is especially true for very basic things about tools and common objects, which often go unsaid in online text. Despite all its impressive abilities, it’s not too challenging to get GPT-4 to say outrageous things about everyday items.
For instance, let’s see what happens when I ask it about eating soup without a spoon:
I forgot to bring a spoon with my lunch. However, I have a fork. Will I be able to eat my soup?
While it might be challenging to eat soup with a fork, it is not impossible. You could try the following techniques:
Scoop and sip: Tilt the bowl and try to scoop as much soup as you can with the fork, then sip the soup from the fork. This might be time-consuming and not as effective as using a spoon, but it can work in a pinch.
Let the soup cool: If the soup contains solid ingredients like vegetables or meat, let it cool down a bit and use the fork to pick up the solid pieces. You can then sip the remaining broth directly from the bowl.
Make a makeshift spoon: If you have access to materials like aluminum foil, plastic wrap, or paper, you can try fashioning a makeshift spoon by wrapping the material around the fork’s prongs. This might not be the most effective solution, but it could help you consume more soup than with just a fork.
In the end, the most effective solution might be to ask a colleague, friend, or nearby cafe if they have a spare spoon you could borrow.
I would rate this response a 2 out of 4: two fine suggestions (ask a friend or drink the soup) and two absurd ones (use the fork like a spoon or make a makeshift spoon). The current GPT-4 demonstrates impressive performance on difficult tasks but can fail significantly on simple ones.
Anecdotes like these help refine our mental model for how GPT-4 works, but it would be valuable to supplement them with data. Let’s do some science!
The latent semantics of GPT models
To determine the extent of intuitive physics and semantics absorbed by GPT, I used the THINGS database. This database contains 1,854 objects, each accompanied by an image. These objects are concrete and visualizable (think: aardvark, not democracy). THINGS consists of several datasets, but one, in particular, caught my attention: a similarity dataset featuring 3 million judgments on an odd-one-out task presented visually.
The task works as follows: the MTurk worker sees three images, such as a skateboard, a burrito, and a chihuahua. Which is the odd one out?
From the THINGS similarity paper. I don’t know the right answer, but I’m on team skateboard.
You could argue it’s the skateboard, because the other two are associated with Mexico; the burrito, because the others can move; or the chihuahua, because the other two are non-living. It’s precisely because there’s no right answer in an objective sense that the answers are fascinating. There’s some tortured logic behind the judgements that seem to be reproducible amongst people. In this way, it’s similar to the Number Game, how you can get people to reliably complete sequences of numbers; in theory, any set is as good as any other set. Yet, people display clear patterns.
Although the original task is presented visually, it’s straightforward enough to ask using just words. We can then compare the model’s responses against human judgment and analyze the underlying thought process of the machine. I used testset2.txt as the test case, which contains 1,000 odd-one-out judgments, repeated multiple times across a large population of MTurk workers.
GPT-3.5
I started by testing GPT-3.5. Although GPT-3 is quite good at naming properties of objects, suggesting it should handle the odd-one-out task easily, it performed surprisingly poorly. Initially, it mostly refused to do the task. Its answers were biased, rarely choosing the first item in a list as the odd one out (<10% of cases). I experimented with different variants of the task, such as choosing the odd-one-out without justification, thinking step-by-step, considering visual attributes, and using a rubric.
The only way to achieve decent accuracy was to compute GPT-3.5’s answers across three orderings of the items and use a majority vote. This approach yielded a 55% correct score (similar numbers with and without thinking step-by-step). For comparison, chance is 33%, and an oracle is 92%*. While it’s common practice in more challenging tasks to generate multiple completions from a language model and only accept the best ones (as with AlphaCode), I don’t believe this is a difficult task. In fact, I wouldn’t be surprised if young children could perform better than GPT-3.5. I could expand to all 6 orderings and do multiple rollouts, but the fact remains: it doesn’t really get this task.
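The majority-vote trick can be sketched in a few lines. This is an illustration, not the code used for the experiment: `ask_model` is a hypothetical stand-in for whatever function sends three words to the model and returns the position (0, 1 or 2) it picked, and the choice of the three cyclic rotations as the three orderings is an assumption, made so that each word appears once in each position.

```python
from collections import Counter

def majority_vote_odd_one_out(items, ask_model):
    """Pick the odd one out among 3 items by voting over 3 orderings.

    ask_model(presented) must return the index (0, 1 or 2) of the odd
    item within the list it was shown. Returns an index into `items`,
    or None when all three votes disagree.
    """
    votes = []
    for shift in range(3):
        # Cyclic rotation: each item lands in each position exactly once.
        order = [(i + shift) % 3 for i in range(3)]
        presented = [items[i] for i in order]
        picked = ask_model(presented)
        votes.append(order[picked])  # map back to the original index
    winner, count = Counter(votes).most_common(1)[0]
    return winner if count >= 2 else None

# A fake "model" that always picks the last position, i.e. pure position bias.
def position_biased(presented):
    return 2

# A fake "model" that consistently singles out the burrito.
def burrito_picker(presented):
    return presented.index("burrito")

triplet = ["skateboard", "burrito", "chihuahua"]
print(majority_vote_odd_one_out(triplet, position_biased))
print(majority_vote_odd_one_out(triplet, burrito_picker))
```

A purely position-driven model produces three different votes and no majority, while a model with a consistent preference wins the vote regardless of ordering. That is why averaging over orderings helps with the kind of position bias described above.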
The model’s reasoning for its choices often seemed ad hoc and sometimes incorrect. For example, when asked to choose the odd-one-out among 1) aardvark, 2) branch, and 3) outlet:
To determine which of the three objects is the least similar to the other two, we can compare their characteristics. Object 1 (aardvark) is an animal, while objects 2 (branch) and 3 (outlet) are inanimate objects. Object 2 (branch) is a natural object found in nature, while object 3 (outlet) is a man-made object used for electricity. Based on these characteristics, it seems that Object 1 (aardvark) is the least similar to the other two. Therefore, the answer is 1.
In this case, the model fails to use the fact that branches and aardvarks are both living things. In other instances, it correctly identifies the categories for the objects but refuses to compare them, claiming they have nothing in common and seemingly choosing based on a whim.
I wouldn’t put too much stock in its purported reasoning, as there’s no significant difference between think-step-by-step and direct mode. The step-by-step thinking might just be fluff. But clearly, this is a hard task for a (rather sophisticated) AI and an easy task for a human.
GPT-4
GPT-4 does a better job than GPT-3.5. For one, it actually tries to solve the task, and I report the results from just one (alphabetical) ordering. On this dataset, it gets 63% right – better than GPT-3.5, but without the need for averaging. The prompt I used is as follows:
Report which of these objects, 1, 2 or 3, is the least similar to the other two. 1: {word1}, 2: {word2}, 3: {word3}. Focus your judgement on the objects. Think step-by-step. Finish with the phrase: The answer is digit.
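Because the prompt asks for a fixed closing phrase, turning a completion into a score takes very little code. Here is a minimal sketch of filling in the template and parsing the reply; the helper names `build_prompt` and `parse_answer` are hypothetical, and the actual call to the model is left out.

```python
import re

PROMPT_TEMPLATE = (
    "Report which of these objects, 1, 2 or 3, is the least similar to the "
    "other two. 1: {word1}, 2: {word2}, 3: {word3}. Focus your judgement on "
    "the objects. Think step-by-step. Finish with the phrase: "
    "The answer is digit."
)

def build_prompt(words):
    """Fill the template with a triplet of object names."""
    word1, word2, word3 = words
    return PROMPT_TEMPLATE.format(word1=word1, word2=word2, word3=word3)

def parse_answer(completion):
    """Return 1, 2 or 3 from the final 'The answer is N', or None."""
    matches = re.findall(r"The answer is\s*([123])", completion, flags=re.IGNORECASE)
    # Take the last match in case the phrase also appears mid-reasoning.
    return int(matches[-1]) if matches else None

print(build_prompt(["anchor", "banana peel", "recorder"]))
print(parse_answer("The anchor has no such connections.\nThe answer is 1"))
print(parse_answer("I cannot decide."))
```

Returning None for unparseable replies keeps refusals separate from wrong answers when computing accuracy.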
Some positives:
It actually attempts to do the task. Its distribution of answers (first, second or third item) is 37%/40%/23%. It could still use some averaging over word orderings but I didn’t want to spend $100 on this.
It is very good at following instructions, unlike GPT-3.5. If you tell it you need a particular format, it almost always does it. Parsing the output is much less of a pain.
It lists some real properties of objects
However, it reliably comes up with real headscratchers in its arguments:
Anchor: heavy, metal, used for holding boats in place
Banana peel: organic, biodegradable, related to food
Recorder: musical instrument, lightweight, used for playing music
Based on the characteristics, the anchor is heavy and made of metal, while the banana peel and the recorder are lightweight. The banana peel is related to food and the recorder is a musical instrument, while the anchor has no such connections. The anchor and the recorder have more specific purposes (holding boats and playing music), while a banana peel’s purpose is mainly as food waste.
The answer is 1
It’s trying really hard, but it doesn’t see the fact that banana peels are organic, while anchors and recorders are man-made.
Here are ten triplets it gets wrong (its choice in italics, human in bold). I picked among non-controversial examples (where humans pick the same answer > 75% of the time):
anchor, banana peel, recorder
ashtray, easter egg, goldfish
banana, crow, scorpion
barcode, cart, laptop
bark, parrot, tree
barrel, blinder, tiger
baton, leopard, tumbleweed
bell pepper, curling iron, shower curtain
belt, coral, gondola
boar, brace, trough
It feels like it doesn’t have a rank ordering of properties that are important in intuitive categorization. This is especially noticeable when the three objects are in very different categories, forcing it to use intuitive categories it doesn’t have access to. The authors of the THINGS database have worked on learning labelled features using the triplet human answers, first in the SPoSE paper, then in the VICE paper. The 11 most important dimensions as revealed by the SPoSE paper are:
Metal tools
Food
Mammals
Clothes
Furniture
Green leafery
Man-made-garden-related
Cars/trucks
Wooden
Body parts
Colorful things
This seems like a weird list – not unlike the categorization of animals offered in Borges’ Celestial Emporium of Benevolent Knowledge. When you think about humans as agents, however, these dimensions make a lot more sense. The top categories correspond to things that might be very important for survival (think: zombie apocalypse), and that have useful and varied affordances. Interestingly, many of these categories also seem to be represented by distinct chunks of the brain (e.g. body parts, tools, faces of animals).
If you cross-reference GPT-4’s errors against the SPoSE dimensions, you realize why it makes the mistakes it makes. It doesn’t always put tools and metal stuff together (e.g. barrel, blinder vs tiger), or organic things together (baton, leopard and tumbleweed).
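To make the link between embedding dimensions and odd-one-out choices concrete, here is a minimal sketch of how a choice could be read off a sparse, non-negative embedding: compute the dot-product similarity of each pair, and pick the item left out of the most similar pair. The three dimensions and every weight below are made up for illustration; they are not values from the SPoSE paper, and the published model turns similarities into choice probabilities rather than taking a hard maximum.

```python
from itertools import combinations

# Hypothetical non-negative embeddings over three illustrative dimensions:
# (metal/tool, food/organic, animal). These numbers are invented.
EMBEDDINGS = {
    "anchor":      (0.9, 0.0, 0.0),
    "banana peel": (0.0, 0.8, 0.0),
    "recorder":    (0.4, 0.0, 0.0),
}

def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def odd_one_out(triplet, embeddings):
    """Return the item excluded from the most similar pair."""
    best_pair = max(
        combinations(triplet, 2),
        key=lambda pair: dot(embeddings[pair[0]], embeddings[pair[1]]),
    )
    (odd,) = set(triplet) - set(best_pair)
    return odd

choice = odd_one_out(("anchor", "banana peel", "recorder"), EMBEDDINGS)
print(choice)
```

With these toy weights, the anchor and the recorder share the man-made dimension, so the banana peel is the odd one out, which is the human answer for this triplet.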
What does it mean for AI alignment?
The Natural Abstraction Hypothesis, proposed by John Wentworth, states that there exist abstractions (relatively low-dimensional summaries which capture information relevant for prediction) which are “natural” in the sense that we should expect a wide variety of cognitive systems to converge on using them. —TheMcDouglas on LessWrong
Some people in rationalist/XAI circles think that if the natural abstraction hypothesis is true, we might get AI alignment “for free”. In other words, if we can get intuitive semantics right, then perhaps we might get intuitive values right.
The results I present here go against the Natural Abstraction Hypothesis. They show that GPT is not very well grounded, despite having been trained on just about all the text that’s ever been written. This doesn’t necessarily mean that it will have bad ethics and be misaligned. But if it doesn’t grasp a lot of deeply ingrained, intuitive categories of objects, it feels doubtful that it would also recapitulate intuitive ethics, especially the very core stuff that goes unsaid. It also, of course, doesn’t mean that it’s not useful: it’s simply different from how a human is.
As we refine our mental models for what LLMs are, that’s important to keep in mind. These are skilled token manipulators, not unlike the operator of the Chinese Room in Searle’s metaphor (albeit a stochastic version of the Chinese Room operator). Their tokens are not very well grounded, and they don’t have access to a simulation environment (i.e. the equivalent of what humans call the real world).
That means LLMs can fail when tested out-of-distribution on stuff that often goes unsaid. These are things that are deeply ingrained within us as humans because of shared DNA and environment: gravity points down, grass is on the ground, cats are soft, etc.
A consequence of this is garbage-in-garbage-out. If something is not reflected within its training set, it is unlikely that it will figure it out. If we want it to be aligned to human values and semantics, we probably have to feed those in using well-curated datasets, either during pretraining or RLHF, or both. And writing down human values in all their complexities is difficult! I would like to see large-scale capture of human values and evaluation, perhaps in a similar way to THINGS.
Here’s the code I wrote for this. I was assisted by GitHub Copilot and ChatGPT4. I used the OpenAI API to make API requests. I wrote this article in bullet form and had GPT-4 interpolate the rest. I also used Grammarly to refine the text. GPT-4 generated copy for a tweet thread. The featured image was generated by DALL-E 2 (although it really didn’t want to draw a spork, which was disappointing).
* The THINGS similarity paper refers to noise ceilings, which measure (simplifying a bit) how well a single human can predict another human’s response. What I’ve calculated is instead an oracle score, which is how well an agent could predict the wisdom of the crowd answer (i.e. majority vote), given the sampling noise of the average response. This oracle score would go to 100% if we had infinite ratings per triplet; currently, it’s closer to .92-.94 given a few dozen ratings per repeated triplet. I would argue that the machine could, in theory, achieve the oracle score if it categorized things like a crowd of humans.
Now, GPT-4 is stochastic, and we might need to take a majority vote over its own repeated answers to calculate its true ability to predict the wisdom of the crowd answer. To test this, I calculated the majority vote for 12 different repeats (with different orderings) of the first 101 triplets. Majority vote does help, but it saturates at about 6 samples. Overall, majority voting gives an absolute improvement on the order of 0.06, which is not bad at all, but it doesn’t come anywhere close to the oracle score.
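For readers who want to reproduce this kind of analysis, here is a minimal sketch of majority voting over repeated model answers, scored against the human majority vote. The triplet ids and all the votes are toy data I made up for illustration, and the sketch assumes answers have already been mapped back from each shuffled ordering to a canonical item index.

```python
from collections import Counter

def majority_vote(answers):
    """Most common answer among repeats; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]

def crowd_accuracy(model_repeats, human_votes):
    """Fraction of triplets where the model's majority vote matches the human majority vote."""
    hits = 0
    for triplet_id, repeats in model_repeats.items():
        if majority_vote(repeats) == majority_vote(human_votes[triplet_id]):
            hits += 1
    return hits / len(model_repeats)

# Toy data: each answer is the canonical index (1, 2 or 3) of the odd one out.
model_repeats = {"t1": [1, 2, 2, 2], "t2": [3, 3, 1, 3], "t3": [1, 1, 2, 1]}
human_votes = {"t1": [2, 2, 2, 1], "t2": [3, 3, 3, 3], "t3": [2, 2, 2, 1]}

accuracy = crowd_accuracy(model_repeats, human_votes)
print(accuracy)
```

Truncating each list of repeats to its first k entries and recomputing the score is one way to see where the benefit of voting levels off.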
2022 was the year of generative AI models: DALL-E 2, MidJourney, Stable Diffusion, and Imagen all showed that it’s possible to generate grounded, photorealistic images. These generative AIs are instances of conditional denoising diffusion probabilistic models, or DDPMs. Despite these flashy applications, DDPMs have thus far had little impact on neuroscience.
An oil painting of a cute sloth birthday party. Generated by Dall-E 2, a conditional DDPM.
I want to change that! I think DDPMs are very cool models. Not only can they be used to generate oil paintings of cute sloth birthday parties, but they’re also useful, general-purpose generative models: tractable, and easy to train. Perhaps surprisingly, they’re also related to a lot of common models in neuroscience, like hierarchical Bayesian predictive coding models and Hopfield networks. They contain a lot of interesting ideas about how the brain might implement complex generative models, and I think they would benefit from being studied from a neuroscience perspective.
I wrote this article both for neuroscientists who want to keep up with the latest in ML, and for ML practitioners who are curious about neuroscience. First, I give a high-level overview of diffusion models, how to train them and how they compare and contrast to other generative models. I won’t get into the math but I’ll give references to tutorial introductions if you’re interested in that. I’ll discuss current applications in neuroscience, and speculate about some potential future applications. Finally, I’ll relate the structure of DDPMs to some common neuroscience models like Hopfield networks and Bayesian predictive coding. Let’s get started!
DDPMs are generative Probability Models: they learn a complex probability distribution p(x0) from empirical data. Here x is a D-dimensional vector representing an image, sounds, time series, graphs, etc. To generate a new sample from a learned DDPM, you repeatedly apply a Denoiser – a deep neural network like a U-Net or a transformer – to noise vectors. Think of these noise vectors as the latent variables of the model. At the end of the process, a sample is generated. It’s a little bit like pareidolia (seeing faces in clouds): if I present a bunch of noise stimuli (e.g. clouds) to a human, and ask them what they see, they’ll generate a probability distribution over familiar things (e.g. faces). This process thus maps noise to a learned probability distribution.
But how do you learn a DDPM? Rather than repeat all the math from other excellent tutorials, I will give you a flavour of why DDPMs are built the way they are, how they’re trained, and give you a bit of wayfinding so you can learn independently.
A DDPM in action. Data is on the left, noise on the right. We learn a denoising DNN μ(xt, t). From Kevin Murphy’s book.
Training DDPMs involves two diffusion processes. First, a word of warning: the diagram that you will see in every DDPM paper to illustrate these processes, by convention, has samples on the left (x0), and noise on the right (xT). I think that’s very confusing, because usually we’re going from noise to sample (right to left), and the diagram breaks expectations about how you should read it, so watch out for that.
Let’s talk about the forward process first (left to right): Take any probability distribution and add a little bit of normal noise to it, while also rescaling it; if you repeat this process enough times, you will obtain a normal distribution. This is a consequence of the central limit theorem. This process is Markovian: the joint distribution factorizes as q(x0:T) = q(x0) Π q(xt|xt-1). Each of the probability distributions in this expression has a simple Gaussian functional form, save for q(x0).
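Here is a small sketch of the forward process on one-dimensional toy data, to show the "add noise and rescale" step turning a bimodal distribution into something close to a standard normal. The constant noise schedule, the number of steps and the toy data are arbitrary choices of mine for this illustration.

```python
import math
import random

def forward_diffuse(x0, betas, rng):
    """Run the forward chain x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise."""
    x = x0
    for beta in betas:
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
betas = [0.02] * 300  # constant schedule, an arbitrary choice for this sketch

# Bimodal toy "data": each sample sits at -2 or +2.
data = [rng.choice([-2.0, 2.0]) for _ in range(2000)]
noised = [forward_diffuse(x, betas, rng) for x in data]

mean = sum(noised) / len(noised)
var = sum((x - mean) ** 2 for x in noised) / len(noised)
print(round(mean, 2), round(var, 2))
```

The rescaling by sqrt(1 - beta) is what keeps the variance from blowing up: the two modes fade away while the total variance settles near 1.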
Our goal is to learn the reverse process: going from noise to samples, right to left. We write another Markov chain, this time for the joint p(x0:T) = p(xT) Π p(xt-1|xt); marginalizing over x1, …, xT gives the model distribution p(x0). If we want p(x0) to approximate q(x0), the diffusion kernel p(xt-1|xt) has to be different for different xt: it has to be a biased diffusion process, otherwise we’d just turn noise into noise. We keep the same functional form for the diffusion kernel – a normal distribution – but we make its mean depend on xt via a deep neural network μ(xt, t).
The denoising objective comes from getting the backward densities p to approximately match the forward densities q. The pink and green arrows must match. From Calvin Luo’s blog post.
It doesn’t seem like we’ve accomplished much thus far, but here comes the magic: you can write down an expression for the likelihood of different samples under the model (p(x0)) in terms of ratios of p’s and q’s. You can lower-bound these expressions using Jensen’s inequality – the ELBO, or evidence lower bound, the same trick used in VAEs – and now you’ve got an expression involving the KL divergences of different Gaussian distributions. All of these KL divergences can be calculated symbolically, a bunch of stuff cancels out, and you’re left with a really simple expression for the ELBO. It turns out you can then optimize using this straightforward algorithm:
Here, ϵ(x) is a deep neural network (a U-Net or transformer) that takes a sample and predicts the noise that was added into it: the denoiser. The DDPM learning algorithm fits in five lines of pseudo-code. It’s quite accessible for us mere mortals: you can code and train a DDPM in an afternoon. Yet, the structure of a DDPM is quite rich, and it is very flexible. That’s a big reason why I find DDPMs so interesting.
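One common form of this recipe is the noise-prediction objective from Ho et al. (2020): pick a data point, pick a random timestep, noise the data point in one shot, and ask the denoiser to predict the noise. Here is a sketch of one training example in plain Python. The schedule, the function names and the placeholder denoiser are my own assumptions for illustration; a real implementation would use a deep network and take a gradient step on the loss with an autodiff library.

```python
import math
import random

def make_alpha_bars(betas):
    """Cumulative products of (1 - beta); alpha_bars[t] covers steps 0..t."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def training_example(x0, alpha_bars, rng):
    """Draw a timestep and noise; return (x_t, t, eps) for one data vector x0."""
    t = rng.randrange(len(alpha_bars))        # uniform timestep index
    eps = [rng.gauss(0.0, 1.0) for _ in x0]   # the noise the denoiser must predict
    a = alpha_bars[t]
    # Closed-form jump to step t: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, t, eps

def denoising_loss(denoiser, x_t, t, eps):
    """Mean squared error between the true noise and the denoiser's prediction."""
    pred = denoiser(x_t, t)
    return sum((p - e) ** 2 for p, e in zip(pred, eps)) / len(eps)

def zero_denoiser(x, t):
    """Placeholder for a U-Net or transformer: always predicts zero noise."""
    return [0.0] * len(x)

rng = random.Random(0)
alpha_bars = make_alpha_bars([0.02] * 100)
x0 = [1.0, -0.5, 0.25]
x_t, t, eps = training_example(x0, alpha_bars, rng)
loss = denoising_loss(zero_denoiser, x_t, t, eps)
print(t, loss)
```

The closed-form jump to step t is what makes training cheap: you never have to simulate the whole forward chain to get a training example.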
Sampling
Sampling is done by running the learned denoising process many times, starting from noise samples p(xT). You denoise, add a little bit of noise, denoise, add a little bit of noise, etc. until you have a sample. That can be pretty slow, but there are strategies to accelerate. The most common is probably DDIM (the I is for implicit), which replaces the continuous addition of noise with a modified sampling chain. With DDIM, you only take one noise sample p(xT), and the rest of the chain is deterministic. It’s possible to take DDIM samples from a vanilla-trained DDPM; you can reduce the number of sampling steps from, say, 1024, down to a more manageable 16 or so without much loss in sample quality. Because DDIM sampling is deterministic after the initial noise injection, it preserves the latent structure of the model, so you can use it for things like latent space interpolation.
Walking through the latent space of a diffusion transformer (DiT). From Peebles and Xie (2022).
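To make the deterministic sampling chain concrete, here is a sketch of DDIM sampling (the no-added-noise variant) on a scalar toy problem. Everything specific here is an assumption for illustration: the linear schedule, the 16 strided steps, and especially the "oracle" denoiser, which is the exact noise predictor when all the data sits at a single point MU and stands in for a trained network.

```python
import math

def ddim_sample(x_T, alpha_bars, denoiser, num_steps):
    """Deterministic DDIM sampling on a strided subset of timesteps (num_steps >= 2)."""
    T = len(alpha_bars)
    steps = sorted(
        {round(i * (T - 1) / (num_steps - 1)) for i in range(num_steps)},
        reverse=True,
    )
    x = x_T
    for i, t in enumerate(steps):
        a_t = alpha_bars[t]
        a_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        eps = denoiser(x, t)
        # Predicted clean sample, then a jump to the earlier timestep without fresh noise.
        x0_pred = (x - math.sqrt(1.0 - a_t) * eps) / math.sqrt(a_t)
        x = math.sqrt(a_prev) * x0_pred + math.sqrt(1.0 - a_prev) * eps
    return x

# Linear noise schedule over 1000 steps (an illustrative choice).
T = 1000
alpha_bars, running = [], 1.0
for step in range(T):
    beta = 1e-4 + (0.02 - 1e-4) * step / (T - 1)
    running *= 1.0 - beta
    alpha_bars.append(running)

MU = 3.0  # toy dataset: every data point equals MU

def oracle_denoiser(x, t):
    """Exact noise prediction for the point-mass toy dataset (stand-in for a trained network)."""
    a = alpha_bars[t]
    return (x - math.sqrt(a) * MU) / math.sqrt(1.0 - a)

sample = ddim_sample(0.7, alpha_bars, oracle_denoiser, num_steps=16)
print(sample)
```

Since nothing random happens after the initial x_T, the same starting noise always maps to the same sample, which is what makes interpolating between two starting noise vectors meaningful.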
So far I’ve focused on unconditional DDPMs, but it’s just as easy to build a conditional DDPM. The most common way to condition during training and generation is to take the conditioning information, pass it through an MLP, and let it bias the denoiser every denoising iteration. The timestep t is also embedded in this way. Biasing can be done by changing the mean and scale of different feature maps of the U-Net, or by biasing attentional blocks. This mechanism is very flexible; to take a specific example, latent diffusion, which drives Stable Diffusion, can condition on:
text via a dense CLIP-based embedding
one-hot encoded image categories
segmentation masks
other images (e.g. for image-to-image translation or upsampling)
Latent diffusion can accommodate multiple types of conditioning information (right). From Rombach et al. (2022).
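The "change the mean and scale of feature maps" style of conditioning can be sketched in a few lines. This is a toy version under my own assumptions: a single dense layer instead of an MLP, feature maps stored as flat lists, and a (1 + scale) parameterization so that zero weights leave the features untouched. Actual models differ in the details, and text conditioning in latent diffusion goes through attention rather than this mechanism.

```python
def linear(weights, bias, x):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def modulate(feature_maps, cond_embedding, w_scale, b_scale, w_shift, b_shift):
    """Scale and shift each channel using values computed from the conditioning embedding."""
    scales = linear(w_scale, b_scale, cond_embedding)
    shifts = linear(w_shift, b_shift, cond_embedding)
    return [
        [(1.0 + s) * v + h for v in channel]
        for channel, s, h in zip(feature_maps, scales, shifts)
    ]

# Two channels with two activations each, and a 2-d conditioning embedding.
features = [[1.0, 2.0], [3.0, 4.0]]
cond = [1.0, 0.0]
out = modulate(
    features, cond,
    w_scale=[[1.0, 0.0], [0.0, 1.0]], b_scale=[0.0, 0.0],
    w_shift=[[0.0, 0.0], [0.0, 0.0]], b_shift=[0.5, -0.5],
)
print(out)
```

The same function can carry the timestep embedding: add it to the conditioning embedding before the dense layer, and the denoiser gets told both what to generate and how noisy its input is.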
There’s an additional way of conditioning DDPMs only at generation time. This late-binding is a bit of a mindfuck so bear with me or skip to the next section. The DDPM likelihood (reverse chain) can be multiplied with other distributions (e.g. priors) easily. Because each diffusion step moves the distribution only a little, it turns out that multiplying by a prior ends up only changing the mean of each diffusion step (see Appendix Table 1 in Sohl-Dickstein 2015). That means you can add a late-bound generation “penalty” (the log of the prior) to bias the generation. This trick is used in both classifier-guided diffusion and classifier-free guided diffusion to improve visual quality of generation. Basically, they bias the generation process away from low-quality, easily confusable images towards canonical, visually striking poses.
Left: DDPM samples with no guidance, right: DDPM samples with classifier-free guidance. Class is malamute in both cases. From Ho and Salimans (2022).
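Here is a sketch of the two guidance tricks in scalar form. The first function applies the "multiplying by a prior only shifts the mean" approximation to a single Gaussian reverse step; the second is the classifier-free guidance combination from Ho and Salimans (2022). The Gaussian prior and all the numbers are toy values I picked so the approximation can be checked against the exact product of two Gaussians.

```python
def guided_mean(mean, variance, grad_log_prior):
    """Shift one reverse-step mean by variance * d/dx log prior(x), evaluated at the mean."""
    return mean + variance * grad_log_prior(mean)

def classifier_free_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional towards the conditional prediction."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# Toy check with a Gaussian prior N(prior_mean, prior_var), where the exact answer is known.
step_mean, step_var = 0.0, 0.01      # a small diffusion step
prior_mean, prior_var = 1.0, 1.0

approx = guided_mean(step_mean, step_var, lambda x: (prior_mean - x) / prior_var)
exact = (step_mean * prior_var + prior_mean * step_var) / (prior_var + step_var)
print(approx, exact)

guided = classifier_free_eps(eps_cond=1.0, eps_uncond=0.5, w=2.0)
print(guided)
```

The approximation is only good because each step's variance is small compared to the scale of the prior, which is the "each diffusion step moves the distribution only a little" condition from the text.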
This late binding opens up many possibilities: sometimes you need to generate images which are just slightly off of the image manifold. Here I show a toy example of using a late-bound critic to make a batch of samples more easily discriminable from each other. I trained a vanilla DDPM on Google Fonts to generate lowercase letters. Next, I used the intermediate layer of an AlexNet to generate a confusion matrix to determine how likely AlexNet is to confuse these letters (see Janini et al. 2022 for background information). Then I nudged the generation so that letters are less confusable according to this critic, creating a modified font with improved readability in peripheral vision. This kind of late binding is very hard to do with other types of deep generative models.
Left: a sample from a DDIM trained on sans-serif Google fonts. Right: a sample from the same model, nudged so the letters are more distinct according to a critic. The critic is aligned to the brain.
How DDPMs compare and contrast to other generative models
Common generative architectures, from Kevin Murphy’s book (2023).
It can be hard to keep track of all the deep generative architectures and all their plusses and minuses: when should you use one versus another? DDPMs are frequently used in lieu of a GAN or a VAE. Compared with GANs:
DDPM samples, like those from GANs, can be of very high visual (or auditory…) quality
DDPMs can sample from the actual distribution of the data; GANs can suffer from mode collapse
It’s really easy to train a DDPM, unlike a GAN which can suffer from instability
You can calculate a likelihood lower bound for a given sample in a DDPM, unlike a GAN
However, DDPMs are slower to sample (though tricks like DDIM help)
DDPMs are quite similar under the hood to hierarchical VAEs. Compared with vanilla VAEs however, DDPM samples are sharper.
Another thing that differentiates DDPMs is that they have some unusual capabilities out of the box. For instance, they can denoise images without any extra training. Upsampling, inpainting and outpainting are also straightforward.
They do have one big drawback compared to GANs and VAEs: the latent space of a vanilla DDPM has the same dimensionality as the data (e.g. the number of pixels). This is different than GANs or VAEs which typically have compressed latents. For some applications, compression is the whole point: see my previous post on dimensionality reduction in neuroscience. A notable exception to this rule is latent diffusion, which uses a complex pipeline involving a VQ-GAN in addition to diffusion in the compressed space; but technically, it’s not the diffusion model that learns the compressed latent, it’s the GAN. That being said, as discussed above, while DDPMs don’t have compressed latents, they do have high-dimensional latents which can be manipulated and interpolated.
TL;DR: DDPMs are easy to train, they generate high-quality samples, and they have some unusual properties which allow them to be used in interesting scenarios. However, they don’t have compressed latents, so we can’t use them for dimensionality reduction. That’s unfortunate for us neuroscientists because we love dimensionality reduction. So what can we use them for?
What are they good for (in neuroscience)?
Generating brain-data-conditioned samples
There are neuroscience applications which require generating high-quality samples. Brain decoding is a good example: you attempt to back out what a person saw (or imagined, or even dreamt) from the pattern of their brain activity. Ideally, you’d like to generate natural-looking samples. Two recent papers (Chen et al. 2022, Takagi & Nishimoto 2022), which I reported on in the last post, demonstrate these ideas. They mapped fMRI data to the latents of a latent diffusion model to decode images from the brain. You could imagine doing this with single neuron data as well (e.g. Bashivan et al. 2019, Ponce et al. 2019).
Brain decoding with DDPMs, from Chen et al. (2022).
This Brain DALL-E idea is very cool, but you might think it’s a curiosity rather than a practical idea. However, there’s a domain where generating good, brain-conditioned samples is essential. Patients with ALS or brainstem strokes can end up in a locked-in state, where they’re unable to communicate despite remaining fully conscious. This was eloquently documented in the book The Diving Bell And The Butterfly, which was dictated, letter by letter, through a series of eyeblinks by the author Jean-Dominique Bauby. It’s been recently demonstrated that an invasive brain-computer interface can directly read attempted speech from a patient’s brain. Moses et al. (2021) demonstrated decoding brain activity in speech-motor cortex in a locked-in patient, classifying attempted spoken words and displaying them on a screen, thus allowing the patient to communicate.
You could imagine pushing this idea forward by making a custom voice box for the locked-in patient. First, capture a patient’s voice and intonation in a generative model, similar to the recent VALL-E (VALL-E is a VQ-VAE, but you could also do this with a DDPM). In the case of a neurodegenerative disease, it should be possible to record a patient’s voice after the diagnosis but before being locked-in. Then one could create a voice-box BCI that sounds like the patient’s own voice, which can be deployed once the patient is unable to communicate. It sounds like science-fiction, but I don’t think this is far off.
MRI, medical images and data augmentation
We’ve seen a number of papers applying DDPMs to medical imaging and MRI data. This is less neuroscience-proper and more neurology, but you can see how these ideas could be adapted and used as neuroscience tooling. Consider a compressed sensing scenario, where one wants to infer images from sparse measurements, for example to accelerate an MRI or PET scan. It’s easy to rig up a conditional diffusion model that takes, for example, an undersampled k-space representation of an MRI and spits out a plausible corresponding brain image. Not only are these images visually plausible, they come with their own error bars! It’s indeed possible to run the diffusion process several times to get multiple samples from the posterior and figure out where the model is certain of its reconstruction and where the model is just spitballing. A related application is in denoising images, whether these images are MRIs, PET scans or microscopy images (including 2p-imaging).
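The "error bars" idea boils down to summarizing several posterior samples pixel by pixel. Here is a minimal sketch; the three tiny "reconstructions" are made-up numbers standing in for the outputs of a conditional sampler run with different noise seeds on the same measurements.

```python
import statistics

def pixelwise_summary(samples):
    """Per-pixel mean and standard deviation across several reconstructions of the same scan."""
    pixels = list(zip(*samples))  # regroup values by pixel position
    means = [statistics.mean(p) for p in pixels]
    stds = [statistics.stdev(p) for p in pixels]
    return means, stds

# Three hypothetical flattened reconstructions with two pixels each.
reconstructions = [
    [1.0, 0.2],
    [1.0, 0.6],
    [1.0, 0.4],
]
means, stds = pixelwise_summary(reconstructions)
print(means, stds)
```

In this toy case the first pixel has zero spread (the model is certain) while the second varies from sample to sample (the model is spitballing).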
Because we can evaluate the likelihood of an image through a diffusion model, we can do anomaly detection, for instance to determine whether there’s a tumour in an MRI. Another application area is label-efficient segmentation. The image-space diffusion process means that the network must learn foreground/background relationships or object boundaries implicitly (this is a bit of foreshadowing for our section on how DDPMs are brain-like). Baranchuk et al. (2022) show how you can use this to do label-efficient segmentation, which could be expanded to medical images or MRIs, which are normally extremely expensive to annotate.
Indeed, data augmentation is something that DDPMs excel at. Learn an unconditional generative model for segmentation masks, learn a conditional generative model for an MRI, and you’ve got yourself a sequence of models that can generate fake (segmentation, MRI) pairs, which you can then use for downstream classification, defining biomarkers, etc.
Why DDPMs might be a little brain-like
We’ve covered sober applications of DDPMs for neuroscience. If DDPMs become just another tool in the toolbox of neuroscience, that’s a win for neuroscience. However, I’d like to go further and claim that DDPMs might be a little brain-like. This is still highly speculative (read: half-baked), but I think there’s a real opportunity to use DDPMs as a wedge to build new neuroAI models of the brain. I’m putting this out there not as a fully worked-out proposal but to make the community aware of the opportunity and find potential collaborators.
Hierarchical Bayesian predictive coding
Let’s consider the current state of visual neuroAI. Convolutional neural networks trained for image classification on ImageNet have been the de facto default models of the ventral stream of the visual cortex – V1, V2, V4 and IT – over the last decade. Self-supervised models have recently been shown to be just as good as supervised CNNs at explaining the ventral stream, and are more biologically plausible (see previous post on this subject). However, we know many ways in which the ventral stream is different than feedforward neural nets trained with supervised or self-supervised learning. Here are 4 facts we need to reconcile with the ventral stream:
The ventral stream has feedforward, recurrent and feedback connections, while CNNs only have feedforward connections
The ventral stream (and all of the brain) is noisy, CNNs are deterministic
The ventral stream is involved in visual imagery during waking, and in dreams; I have no idea how to get a CNN to dream
We can add more features to a basic CNN to better match the ventral stream. For instance, recurrent connections allow computations to unfold in time. While there’s nothing wrong with empirically motivated additions to CNNs, it’d be nice for additions to be more theoretically grounded.
Prior to CNNs taking over the study of the ventral stream, a dominant view of the visual cortex was as a hierarchical Bayesian inference machine (see this workshop for historical perspectives). A prototypical example is the proposal of Mumford and Lee (2003):
In this framework, the recurrent feedforward/feedback loops in the cortex serve to integrate top-down contextual priors and bottom-up observations so as to implement concurrent probabilistic inference along the visual hierarchy. We suggest that the algorithms of particle filtering and Bayesian-belief propagation might model these interactive cortical computations.
In Mumford and Lee’s model, the brain is Bayesian (point 4), it contains a generative model (point 3), and inference of the contents of an image from noisy measurements from the retina involves recurrent noisy algorithms (points 1 and 2). These recurrent noisy algorithms include particle filtering, loopy belief propagation, or MCMC. There was some early progress in scaling up this idea: the restricted Boltzmann machine (RBM) papers from Bengio & Hinton from the late 2000’s cite Mumford & Lee. However, the idea fizzled as discriminative approaches to classification became popular. Generative approaches are currently not competitive in explaining the ventral stream compared to other image-computable models (Brain-Score and the like; see also Conwell et al. 2021, Zhuang et al. 2021).
I think there’s an opportunity to resurrect and modernize this class of models using DDPMs. The most straightforward mapping to the ventral stream is to think of the first half of the denoising U-Net as equivalent to a feedforward pass up the ventral stream, the second half as a backward pass; the multiple iterations of the denoiser correspond to recurrent activity. The information communicated at every denoising step back to V1 is the delta between an image and its projection on the image manifold at this point in the process, producing a nice link to predictive coding models.
Regardless of the exact way in which we embody these ideas, DDPMs give us access to a lot of new machinery to work with, in both discrete & continuous formulations. That’s on top of related ideas of denoising score matching on probabilistic graphical models and related progress in hierarchical VAEs (see Kevin Murphy’s book for more on these subjects).
Content addressable memories and the hippocampus
DDPMs are also related to ideas about content addressable memories and the hippocampus. Consider the famous Hopfield network, a recurrent neural network which stores discrete memories inside of its weights. It can retrieve memories at will as the steady state of recurrent activity: starting from an incomplete pattern, it can complete the pattern by repeated application of a simple rule. Such a content-addressable memory is often a core component of computational models of the hippocampus, for instance the Tolman-Eichenbaum machine.
Hopfield networks can fill in missing information to retrieve partial memories; so can DDPMs. From this excellent blog post on modern Hopfield networks.
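For readers who haven't played with one, here is a sketch of a classic binary Hopfield network completing a corrupted pattern. The two stored patterns and the corrupted cue are toy choices of mine; the learning rule is the standard Hebbian outer product and the update is the usual sign rule applied one unit at a time.

```python
def train_hopfield(patterns):
    """Hebbian weights for patterns of +1/-1 values, with no self-connections."""
    n = len(patterns[0])
    weights = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j] / n
    return weights

def recall(weights, state, sweeps=5):
    """Update each unit in turn to the sign of its input, for a fixed number of sweeps."""
    state = list(state)
    for _ in range(sweeps):
        for i, row in enumerate(weights):
            total = sum(w * s for w, s in zip(row, state))
            state[i] = 1 if total >= 0 else -1
    return state

# Two orthogonal memories.
memory_a = [1, 1, 1, 1, -1, -1, -1, -1]
memory_b = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([memory_a, memory_b])

# Corrupt the first element of memory_a and let the network clean it up.
cue = [-1, 1, 1, 1, -1, -1, -1, -1]
retrieved = recall(weights, cue)
print(retrieved)
```

The loop has the same shape as DDPM sampling: start from a corrupted state and repeatedly apply a simple cleanup rule until you land on something the network has stored.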
I like to think of DDPMs as continuous content-addressable memories. Corrupted images can be retrieved by the repeated application of the denoiser. Importantly, the network stores continuous memories: instead of representing discrete memories (mixture of Dirac deltas) like a Hopfield network, they represent an ensemble of memories (continuous distribution). DDPMs can recover from different corruptions, including additive noise and masking.
Lest we think that this analogy is a bit vacuous, DDPMs are capable of complex tasks traditionally ascribed to the hippocampus. A striking example is the recent work from Harvey et al. 2022, who showed that you could train a DDPM to generate hour-long videos. They took videos of a car driving in a virtual environment, and let the DDPM learn conditional generation, for example predicting the next frame in a sequence from the previous 4. By repeating this process with different temporal horizons, the model could generate hour-long videos of a car driving through the town, starting from a random location hallucinated by unconditional sampling (see samples here).
Generated sequences from a video DDPM mimic navigation in the real world. From Harvey et al. (2022)
It’s really quite striking that this simple model learned to generate complex sequences from scratch. In fact, it’s possible to map back hallucinated sequences to locations within the virtual town where the sequence was generated, and in most cases, the sequences are spatially coherent. Occasionally, the denoiser gets confused and warps to a different part of the map, relying on landmarks to do its thing. I think it’s really interesting how the generative task learned by the DDPM coaxes it to memorize and navigate through an environment, and how this could relate to the hippocampus’ role in navigation and memory.
As separate objects of study
I hope to have convinced you that DDPMs are sufficiently analogous to the brain to advance the neuroAI research programme. However, if it turns out, after careful study, that these machines are mechanistically quite different than the brain (Marr’s level 3), they might still contain some insights about the brain’s goals and potential algorithmic solutions (Marr’s levels 1 and 2). I think that neuroAI and AI itself would benefit from studying how DDPMs work and what they’re capable of. There are some really interesting potential insights into human visual cognition lurking inside these models.
For example, I’ve mentioned previously that DDPMs are useful for unsupervised segmentation. As part of their objective to model the distribution of natural images, they perform what appears to be implicit, approximate segmentation, propagating information about object relationships across long distances, all in the service of generating spatially coherent images. This opens up a number of follow-up questions, for example, are DDPMs susceptible to foreground/background illusions, like Kanizsa squares? Can they solve Mooney images and follow Gestalt rules? We can take recent criticisms of feedforward CNNs as models of vision as compendia of interesting phenomena that should be investigated in DDPMs. Tantalizingly, DDPMs can be easily made robust to adversarial examples, one of the commonly highlighted drawbacks of discriminative accounts of perception.
Of key importance in these investigations is carefully controlling the natural image dataset DDPMs are trained on. ImageNet is not representative of what’s ecologically relevant to primates: it has too many dogs and not enough faces and body parts. Using ImageNet might be acceptable when the task is just a means to an end (e.g. learning good general-purpose features in a self-supervised manner), but not when the task is learning the manifold of images. We should carefully compare the behaviour of DDPMs trained on ImageNet or LAION with that of DDPMs trained on ecologically motivated datasets, e.g. EcoSet and SAYCam.
Conclusion
DDPMs are a class of conditional generative models that have found widespread use in generating images, sounds and time series. They strike a nice balance between complexity of implementation, flexibility, ease of sampling and evaluation. They may be useful in neuroscience where we need to generate fake data – for semi-supervised learning or for decoding. However, there’s a wider and more speculative horizon of possible links to neuroscience: hierarchical Bayesian models and content-addressable memories. It’ll be really interesting to break these models apart and figure out how they really work.
2022 has come and gone and we’ve learned a lot about how the brain is – or isn’t – like an artificial neural network. There isn’t a single journal or venue that focuses exclusively on neuroAI so the literature is spread across preprint servers, conference proceedings and the occasional prestige journal. As an end-of-year treat, I challenged myself to gather all papers in neuroAI that were published this year and tell you all about them. I skimmed through ~130 abstracts and I read a few dozen to find the best, most influential, and most interesting, and extract this year’s zeitgeist. A systematic review this is not; rather, these are my humble musings on the state of neuroAI. Hope you learn something useful!
What counts as neuroAI anyway?
NeuroAI is the intersection of AI and neuroscience, but that’s a little vague. I couldn’t find a good comprehensive survey of neuroAI that delineated what is and isn’t neuroAI, and I wanted to circumscribe my search. So I enumerated all the different flavours of neuroAI that I’ve seen in the wild and placed them in a two-dimensional rubric:
On the horizontal axis, one defines the researcher’s “home turf” or “application area”: AI (left) or neuroscience (right). Usually, papers import ideas from an outside field to their home field. Do you want to understand how the brain works, or do you want to make better AI?
On the vertical axis, one defines whether one is motivated by applications (”doing useful things”, “engineering”, etc.) at the top vs. curiosity driven at the bottom.
My classification of neuroAI- and neuroAI-adjacent research. I tried to limit this yearly review to core neuroAI, but there’s always a bit of leakage on the sides. You can download an editable and citable version on FigShare.
Within this framework we can identify major flavours of neuroAI: mining neuroscience to find ways of making AI better ends up in the top left corner, while using artificial neural nets as models of the brain ends up in the bottom right. We also find less common lines of research, for example animats: brain-AI hybrids, at the bottom. The diagram also identifies big chunks of research that use AI in an instrumental fashion to study brains; these fall outside of core neuroAI. For instance, automatic spike sorting or biomarkers for brain health fall outside of core neuroAI in this classification. I’ve found this diagram useful in explaining what our field is to researchers outside of our field and to understand how different pieces of research fit together. It’s still a work-in-progress so please do let me know if you have better ideas.
Even with this fairly circumscribed definition of neuroAI, it turned out a lot of neuroAI research was published this year! I turned to Twitter and Mastodon to ask people for their favorite publications, then I looked through my Twitter bookmarks, then I searched on Google Scholar for people that I knew. I deduplicated, removed papers outside of core neuroAI, then applied a completely arbitrary “does this pique my interest?” filter, and ended up with ~130 publications, excluding abstracts from the SVRHM workshop at NeurIPS which may get its own separate blog post later. To sort through this, I tried using sentence embeddings + a UMAP to visualize the results, but eventually resorted to the tried and true method of clustering by hand. Let’s start with the big ideas.
Big ideas about the future of neuroAI
Two whitepapers came out to define neuroAI research programmes: ambitious, decade-long, multi-lab research endeavours to solve big questions. The two are not strictly in opposition – they share a couple of co-authors – but they nevertheless represent different visions of where to take neuroAI.
Doerig et al. (2022) argue from the bottom right corner of the neuroAI diagram. They define neuroconnectionism as modeling the brain with artificial neural networks (ANNs) in a goldilocks zone of biological detail. There’s a healthy dash of philosophy of science thrown in: it’s fruitful to model the brain with ANNs because it helps us uncover deep truths about how the brain works. The core of the theory is not falsifiable, but the belt can be.
The neuroconnectionist research programme
Findings where ANNs differ from the brain in key ways, e.g. adversarial attacks, are envisioned as opportunities to change the belt of the theory. This process can continue until the core theory stops being useful as a scientific paradigm. I think this stance is a default stance of many in the field, and it is aggressively neutral: we do neuroAI because we think that brains are cool and we should study them and we now have a useful tool to do so, so let’s do that!
Toward Next-Generation Artificial Intelligence: Catalyzing the NeuroAI Revolution
Tony Zador and a who’s-who of AI people and computational neuroscientists argue from the top left corner of the neuroAI diagram: let’s learn about the brain so we can solve AI. Current generation AI is specialized; even a modest mouse shows far more adaptive behaviour than a 300B parameter NLP model trained on the entire internet. They propose a framework and a challenge to measure adaptive behaviour: the embodied Turing test. In their words:
An expanded “embodied Turing test” would benchmark and compare the interactions with the world of artificial systems versus humans and other animals. […] An artificial beaver might be tested on its ability to build a dam, and an artificial squirrel on its ability to jump through trees. Nonetheless, many core sensorimotor capabilities are shared by almost all animals.
The way they propose doing this is by building up a series of incremental challenges, starting from simple behaviours (e.g. locomotion) and organisms (e.g. nematodes) and building up on those systems one by one by a process of phylogenetic refinement. The paper that most clearly embodies this vision this year is this wacky NeurIPS paper from Tony Zador, Bhattasali and Engel where they show how to build an RL agent that can swim by reverse engineering the nematode (C. elegans) swimming circuit.
C. elegans go voom
Yes, and…
I am enthusiastic about each of these research programmes. I think they’ll be reinforced by an additional focus on applications of differentiable brain models to brain health (top right corner). With a perfect digital twin of a brain, one could figure out how to stimulate the brain in the right way to give relief to people with neurological disorders. This corner of the neuroAI diagram has deep intellectual roots in cybernetics, but it’s broadly underexplored. Developing the tools to study the brain’s reaction to causal manipulations might give modellers precisely the data they need to make progress in other corners of neuroAI. I wrote a long read about this idea here.
Skeptical outlooks
Not everybody came out as pro-neuroAI this year: we saw a lot of skeptical takes and monster Twitter threads. Is neuroAI broken? Let’s find out.
High-performing neural network models of visual cortex benefit from high latent dimensionality
This paper from Elmoznino and Bonner (2022) looked at the dimensionality of convolutional neural networks acting on images. Here, they defined dimensionality as the effective number of non-zero dimensions via the singular values of the encoding matrices across the ImageNet validation set. They found that networks with high dimensionality correspond better with area IT of the visual cortex. Super interesting idea and pretty convincing.
Effective dimensionality correlates with match to IT
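To make the notion of effective dimensionality concrete, here is a minimal sketch. It assumes the participation ratio of the covariance eigenvalues as the measure, which is one common way to count the "effective number of non-zero dimensions"; the function name and the synthetic activations are hypothetical stand-ins for real network features computed over an image set, not the authors' code.

```python
import numpy as np

def effective_dimensionality(features):
    """Participation ratio of the covariance spectrum of a feature matrix.

    features: array of shape (n_images, n_units), one row per image.
    Returns (sum of eigenvalues)^2 / (sum of squared eigenvalues).
    """
    centered = features - features.mean(axis=0, keepdims=True)
    # Singular values of the centered data give the covariance eigenvalues.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    eigenvalues = singular_values ** 2 / (features.shape[0] - 1)
    return eigenvalues.sum() ** 2 / (eigenvalues ** 2).sum()

rng = np.random.default_rng(0)

# Hypothetical activations: 2000 "images" by 50 "units".
isotropic = rng.normal(size=(2000, 50))  # variance spread over all units
low_rank = rng.normal(size=(2000, 3)) @ rng.normal(size=(3, 50))  # only 3 latent directions

ed_isotropic = effective_dimensionality(isotropic)
ed_low_rank = effective_dimensionality(low_rank)
print(f"isotropic features: {ed_isotropic:.1f}")
print(f"rank-3 features:    {ed_low_rank:.1f}")
```

The isotropic features come out close to the number of units, while the rank-3 features can never exceed 3, no matter how many units the layer has.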
This paper was interpreted by some skeptics to mean that it’s not really the network that matters; it’s the fact that the network projects to some high-dimensional space. In other words, one could construct a network that projects input to random directions in a high-dimensional space and that would match well with the brain. That would be a blow to the interpretation that IT is similar to a high-level layer of a deep convolutional neural net.
However, that’s not what the paper found: instead, they found that in their sample of neural nets, the ones with higher dimensionality had compact class manifolds. That means that images with the same label were close in feature space. That would not be true for random projections. This paper adds to the body of evidence that there’s something special about the way the brain iteratively refolds image manifolds into beautiful origami swans, er, useful representations; see also Arna Ghosh’s paper for an application of this idea to DNN regularization.
There’s still a possibility that it’s easier to match a low-d brain to a high-d network because of a needle-in-a-haystack effect; there are just more potential matches within a high-d representation. I think we’ll need more subtle metrics than R² from linear regression to see that. Elizabeth Dupre gave a beautiful presentation at MAIN2022 on the myriad bespoke metrics fMRI people use that we should probably import for neurophysiology.
Brain hierarchy score: Which deep neural networks are hierarchically brain-like?
This paper from Nonaka et al. is from 2021, but I heard about it in 2022 so it makes it into this review. They measured the match between a unit in a given layer of a CNN and the best matching visual cortex voxel and vice-versa to create brain hierarchy scores: how close is the match between V1-V2-V4-IT of the visual cortex and layers 1-2-3-4 of the CNN? This is a stricter criterion than asking that, on average, the layers of the CNN match those of the visual cortex. They found that shallower models with fully connected layers performed best. Furthermore, even the best DNNs only recapitulated the hierarchy of the brain coarsely.
This does highlight a failure mode of the correspondence between DNNs and brains: DNNs trained end-to-end are underspecified in their correspondence between a specific brain area and an intermediate layer. To resolve this, I think we’ll need to directly supervise intermediate DNN layers by sprinkling in brain data to constrain specific layers to match specific areas. St-Yves et al. (2022) found that you can train a big DNN with multiple readout heads to account for responses in V1, V2, V3 and hV4. This doesn’t do better than training separate networks according to traditional criteria like R² or RSA, but it does better respect the hierarchical structure of the visual cortex.
Reassessing hierarchical correspondences between brain and deep networks through direct interface
Sexton & Love (2022) propose a stricter evaluation of the correspondence between brains and deep neural networks, what you might call the old switcheroo: if the brain really is a proxy for a neural network (or vice versa), then it should be possible to replace the input to an intermediate layer of a neural network with a brain-based prediction of that input. They did this and it didn’t work very well unless the target was one of the very last few layers of the neural network. They conclude from this that the hierarchical correspondence between ANNs and brains must have been incorrectly evaluated in the past, since all brain areas – including primary visual cortex – best correspond to higher-level layers of ANNs.
The old switcheroo
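The substitution test described above can be sketched on a toy problem. Everything here is a hypothetical stand-in, not the paper's pipeline: the "network" is a random one-hidden-layer model, the "brain" is a simulated linear mixture of the hidden activations plus Gaussian noise, and the brain-to-layer map is fit by ordinary least squares on the activations.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test = 500, 200
n_in, n_hidden, n_vox, n_classes = 20, 10, 10, 5

# Toy stand-in for a trained network: one hidden layer plus a linear readout.
W1 = rng.normal(size=(n_in, n_hidden))
W2 = rng.normal(size=(n_hidden, n_classes))

def hidden(x):
    return np.maximum(x @ W1, 0.0)  # intermediate-layer activations

def readout(h):
    return (h @ W2).argmax(axis=1)  # predicted class from the later layers

# Hypothetical "brain": a fixed linear mixture of the hidden units plus noise.
A = rng.normal(size=(n_hidden, n_vox))

def simulate_brain(h, noise_sd):
    return h @ A + noise_sd * rng.normal(size=(h.shape[0], n_vox))

def switcheroo_agreement(noise_sd):
    """Fraction of test inputs classified the same after the swap."""
    h_train = hidden(rng.normal(size=(n_train, n_in)))
    h_test = hidden(rng.normal(size=(n_test, n_in)))
    brain_train = simulate_brain(h_train, noise_sd)
    brain_test = simulate_brain(h_test, noise_sd)
    # Sum-of-squares fit from brain responses to layer activations.
    B, *_ = np.linalg.lstsq(brain_train, h_train, rcond=None)
    # Replace the layer's true activations with the brain-based prediction.
    h_swapped = brain_test @ B
    return float(np.mean(readout(h_swapped) == readout(h_test)))

agreement_clean = switcheroo_agreement(noise_sd=0.0)
agreement_noisy = switcheroo_agreement(noise_sd=5.0)
print("agreement without measurement noise:", agreement_clean)
print("agreement with measurement noise:   ", agreement_noisy)
```

In this toy setting the swap is lossless when the simulated brain is a noiseless, invertible readout of the layer, and it degrades once measurement noise is added. It illustrates why the handling of noise matters for this kind of evaluation; it says nothing about what the real data show.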
On the surface, this paper makes the same point as the brain hierarchy score paper. I liked the brain hierarchy score paper, but I’m skeptical about this one. My criticism is pretty simple: you can’t just plug a brain into an ANN and expect it to work out of the box. The ANN is not trained to deal with the noise inherent in brain data, the linear regression doesn’t take that noise into account (errors-in-variables model), nor does it attempt to mimic the spatial structure of the mismatch (assumption of iid noise in multiple linear regression). A quick sanity check would have been to learn the weights by minimizing the multinomial classification loss on ImageNet rather than using a sum-of-squares loss for the intermediate activations. I bet it would reverse the conclusions in Figure 2. That being said, the paper is based on open data, so I think it would be a feasible and enlightening project for an enterprising student to come up with a method that could solve the brain-DNN impedance mismatch.
Edit: the senior author Bradley Love let me know in the comments that they did run this sanity check. The plot thickens! I’d still like to see normative simulations to verify, for example, what happens when you plug a CNN trained with one seed into another CNN with the same architecture and training but a different random seed. I think the CKA work from Kornblith et al. (2019), as well as follow-up work with Alex Williams, gives a really nice template for how to validate new metrics comparing brains and neural networks.
But wait, there’s more!
After I first published this, a few people pointed out in the comments and on Twitter other papers with a skeptical outlook. Whatever doesn’t kill neuroAI will only make it stronger (cf. Doerig et al. 2022), so here they are:
Deep Problems with Neural Network Models of Human Vision: there’s a long history of human psychophysics and vision psychology. A lot of that has been brushed aside in neuroAI with a narrow focus on accounting for R². This paper makes the point that we need more subtle metrics, which is always a winning argument in my book. In particular, we should be able to replicate subtle effects measured with parametric laboratory stimuli through classical psychophysics. I’m on board with this idea. Where I have a difference of opinion is in the proposed alternative: non-image-computable, bespoke computational models. I can’t do encoding, decoding, optimization or control with this set of models, so I don’t think they’re viable candidates for best-in-class models of visual processing. Perhaps hybrid models will come along.
No Free Lunch from Deep Learning in Neuroscience: A Case Study through Models of the Entorhinal-Hippocampal Circuit: This paper makes the point that some neuroAI papers have to bake in a lot of assumptions, fine-tuning research degrees of freedom until they get the desired result. They show this for models of grid cell formation. I like this paper a lot: it shows a particular trap in modelling that we should avoid. One way to show that our modelling choices are sensible and robust: doing sensitivity analyses (inspired by Bayesian modelling).
Auditory cortex and language
We saw a slew of papers that link auditory and language processing areas to deep neural nets. Some rapid-fire highlights:
Self-supervised audio models match auditory cortex: Vaidya et al. (2022) found that self-supervised auditory models (SSL-audio), especially HuBERT, are a match to auditory cortex during story listening. They saw a recapitulation of the auditory hierarchy, with lower levels of HuBERT matching lower auditory areas and higher layers matching more conventionally semantic areas. This is consistent with a similar study from Millet et al. (2022).
Self-supervised large language models match language processing areas: A number of studies found that transformers trained for masked language modelling or next-token prediction (e.g. GPT-2) were aligned with responses of the brain to natural language. This includes Schrimpf et al. (2021), which came out late last year, as well as Caucheteux & King (2022) and Heilbron et al. (2022). Hosseini et al. (2022) found that this was true even with developmentally realistic amounts of training data: does it mean that children are little GPT-2 language prediction machines? It’s certainly an intriguing idea.
Decoding speech from brain data: Last year, Eddie Chang and his group made a big splash demonstrating that a locked-in person could communicate by decoding signals in their intact speech motor cortex (disclosure: my old group at Facebook funded this work). They used an invasive ECoG array to record from this patient’s brain. This year, multiple groups have attempted to reconstruct perceived or attempted speech non- or minimally-invasively, which is hugely ambitious: Tang et al. (2022) from fMRI recordings, Defossez et al. (2022) from MEG, and Kohler et al. (2022) from depth electrodes. Silent speech is surprisingly amenable to decoding because it has wide brain coverage (it seems the whole brain does semantic encoding), it’s relatively low bandwidth, and doesn’t have extremely high accuracy requirements to be useful. Perhaps in a not-too-distant future I can silently speak to my device.
Digital twins, BCIs, stimulation
I mentioned earlier that differentiable brain models (i.e. digital twins) could be used to predict the effect of stimulation on the brain and tune that stimulation through gradient descent (top right corner of the neuroAI diagram). I saw two papers that really ran with this idea this year:
Differentiable models of retinal implants: Granley et al. (2022) showed that one could tune the stimulation pattern of a retinal implant to best stimulate the retina using a differentiable model of the effect of the stimulation. They showed in simulation this could give clearer phosphenes that are closer to the ideal of a dot-matrix display. This was one of my favorite papers of the year: obviously important idea, subtle modelling, great proof-of-concept.
Differentiable, personalized models of cochlear implants: Drakopoulos and Verhulst (2022) take this idea one conceptual step forward by demonstrating that one could tune a cochlear implant to one’s particular dysfunction, using a differentiable model of the cochlea.
This idea of optimizing stimuli using a proxy model also has applications for fundamental research: Gu et al. (2022) demonstrate making maximally activating stimuli (i.e. localizers) for particular brain areas measured through fMRI, while Cobos et al. (2022) build image metamers of the visual system of mice that transfer across individuals.
Non-convolutional and quasi-convolutional architectures for vision
I wrote last year about an apparent mismatch between convolutional neural nets (CNNs) and the visual cortex: brains don’t have a mechanism for weight sharing that we know of. This year, a number of papers came out investigating this and proposing solutions:
Convolutional structure emerges in fully connected networks: Ingrosso and Goldt (2022) show that localized filters with convolutional structure can emerge in fully connected networks trained to solve a task on inputs with non-Gaussian statistics. They then develop a theory for when these localized filters may emerge. I’d like to see this type of work expanded beyond toy problems and with known cortical architectural biases baked in, namely sparse, local connections.
IT domains emerge in topographic layers: Blauch, Behrmann & Plaut (2022) propose a non-convolutional model for the emergence of domains in IT. Basically, they use a standard convolutional base for V1-V2-V4, and allow the V4-IT connection to be fully connected. IT is modelled as a topographic layer with local connections rather than as a convolutional layer. They show this arrangement can learn domains specialized for objects, faces or scenes.
Visual streams emerge in topographic networks: Finzi et al. (2022) take a similar approach, this time learning a self-supervised representation of images with a topographic layer on top of a ResNet-18 backbone. They show that this can account for the segregation of different visual streams in visual cortex. This is broadly consistent with our 2021 NeurIPS paper.
Topographic maps through kernel averaging: Bashivan et al. (2022) build a new type of topographic layer by averaging over the kernel dimension. They find filters which slowly vary spatially, not unlike the pinwheels and orientation domains of primary visual cortex.
Pinwheels? From Bashivan et al. (2022)
Food for thought
Finally, here are some papers that either made me smile or made my blood boil, but definitely didn’t leave me indifferent:
The MouseNet architecture
MouseNet: fantastic idea: create a CNN that mirrors the architecture of the mouse visual cortex. Lots of technical detail in there about circuits and connection patterns. Did they find this network to be a better match to mouse visual cortex? No! In fact, they found that a plain VGG was a better match to the areas they were looking at. Big caveat, however: they used ImageNet as the task to train the network. Points towards the need to train neural nets with more ecologically relevant tasks or with direct supervision for intermediate areas.
Astrocytes and neurons implement transformer architecture: this is a sort of left field proposal for how neurons and astrocytes might implement self-attention. The idea is that an astrocyte can implement a shared multiplier across multiple synapses. The astrocyte here is viewed as a computational element that uses calcium to maintain state. There are definitely a few fishy things in this paper, like the softmax implemented through random cosine projections (how would a neuron implement a cosine function?). More generally, I can’t make up my mind about whether this is truly biologically plausible or not, because I don’t know anything about astrocytes, but I think it’s very ambitious and interesting.
Sentient neurons in a dish learn how to play Pong: boy I didn’t like this paper. Published in a prestige journal that shall not be named, this paper demonstrated a hybrid computational architecture where a set of biological neurons in a dish played Pong. This is a very cool idea, a follow-up to Steve Potter’s work on animats. There’s a genuinely new method proposed to interface biological neurons and virtual environments so the neurons actually learn something based on the free energy principle. However: this paper was super hype-y (“the hybrid NN achieved sentience”, come on), the results were barely above chance, and the core idea, that the HNN would arrange itself so as to make the environment more predictable, was untested. Someone (not me!) took the Twitter threads and put them on PubPeer, if you want to see other people’s takes.
Brain Dall-E: two of the biggest stories this year in AI were diffusion models (Dall-E 2, Stable Diffusion, MidJourney and the rest) and ChatGPT. It was only a matter of time before someone would take one of these and mix them with brain data. This paper uses a Latent Diffusion Model (Rombach et al. 2022), an open source model distinct from, but in the same model family as, the closed source Dall-E. They use this to decode brain data recorded while people were viewing images in the scanner. Basically, a text-conditional diffusion model normally works by taking a sentence, encoding it into a fixed-length vector, then using that conditioning information to bias a denoiser (for instance by changing the scaling of normalization layers). It’s possible to learn a model which uses brain data to create the desired fixed-length conditioning vector instead, hence Brain + Latent Diffusion. This works remarkably well. I would like to use this while sleeping so I can finally remember my dreams. After I published this, a reader let me know of a similar paper from Takagi and Nishimoto which uses Stable Diffusion checkpoints; the difference between the LDM used in the first paper and this one, AFAICT, is that the first LDM is trained on ImageNet, whereas this one is trained on LAION.
Predictive coding: there were a couple of papers on predictive coding, but none more elegant than this one from Tim Kietzmann’s group which came out late last year. They show that predictive coding can emerge by minimizing the total energy needs of a network. They use a new form of regularization that takes into account the cost of EPSPs. I thought this was very elegant and I’d love to see it expanded to more realistic scenarios involving naturalistic movies.
Letter perception as neuronal recycling: if there’s one big mystery in visual perception, it’s how do we read? Clearly we haven’t had the time to evolve brain processes specific to reading, because writing is such a recent invention, so we must in some sense recycle existing hardware to do so. This excellent paper from Talia Konkle’s group tests this hypothesis. They train different neural nets on either generic object recognition (i.e. ImageNet) or specifically letter recognition. They find that generic networks are a better match than specifically letter-trained networks. Then they ask whether they can get an even better fit by fine-tuning an object-recognition network for letter perception, and they find this is indeed the case. This is a great strength of neuroAI approaches: we don’t have time machines to study human evolution, we can’t do causal experiments on natural selection, but we can study the evolution of brain processes in silico.
Images reconstructed with Latent Diffusion from brain data
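The conditioning trick described in the Brain Dall-E item can be sketched in a few lines. This is a toy illustration under loud assumptions, not the papers' pipeline: the brain-to-conditioning map is a closed-form ridge regression on synthetic data, and `film` is a hypothetical helper showing how a conditioning vector could rescale and shift normalized denoiser features.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def film(h, cond, W_scale, W_shift):
    """Normalize features h, then scale and shift them using the conditioning."""
    h_norm = (h - h.mean(axis=-1, keepdims=True)) / (h.std(axis=-1, keepdims=True) + 1e-5)
    return (1.0 + cond @ W_scale) * h_norm + cond @ W_shift

rng = np.random.default_rng(0)
n_trials, n_voxels, cond_dim, n_channels = 200, 50, 8, 16

# Synthetic stand-ins: brain responses and the conditioning vectors that a
# text/image encoder would have produced for the same stimuli (noiseless toy).
true_map = rng.normal(size=(n_voxels, cond_dim))
brain = rng.normal(size=(n_trials, n_voxels))
cond_target = brain @ true_map

# Learn to produce the conditioning vector from brain data instead of text.
W = fit_ridge(brain, cond_target, alpha=1e-6)
cond_from_brain = brain @ W

# Use the brain-derived vector to modulate (hypothetical) denoiser features.
features = rng.normal(size=(n_trials, n_channels))
W_scale = 0.1 * rng.normal(size=(cond_dim, n_channels))
W_shift = 0.1 * rng.normal(size=(cond_dim, n_channels))
modulated = film(features, cond_from_brain, W_scale, W_shift)
print(modulated.shape)
```

The denoiser itself never needs to know the conditioning came from a brain: only the small mapping at the front is learned, which is part of why this approach is feasible with limited scanner data.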
Onwards to 2023
So what to make of all this? In terms of inside baseball: if there’s one lesson I saw this year repeated over and over again, it’s that architectural choices alone have an overall relatively small influence on how well an ANN is matched to a brain. Brain-Score is getting pretty hard to beat. I think we will see a paper (or papers) that 0) use an architecture that is brain-like, 1) carefully select datasets to be ecologically relevant and 2) use auxiliary brain data to enforce that one brain area = one DNN layer to make a big dent in Brain-Score.
I’m seeing tantalizing hints that people are really starting to think about applications. I also see that neuroAI is becoming bigger than any particular lab, with multi-lab research programmes being put forward. We may see a BRAIN initiative-level effort across multiple labs come to the fore. It’s a great time to jump into the fray!
It’s tough being an academic these days. Between the tiny number of permanent positions and the systemic exploitation of cheap labour, people are looking for a way out. Students, postdocs, and even PIs are asking themselves: should I join a startup? Should I make my own startup? It sounds seductive: change the world! Make money! Keep doing research, but in a less toxic environment! Who wouldn’t want to jump at the chance of working on something innovative and useful?
People often ask me about my thoughts on industry: after my PhD and postdoc, I’ve worked in industrial R&D at big tech companies [Meta, Google], I’ve founded a medtech startup, I run a consulting business [xcorr.dev], I advise and have equity in early-stage startups [Artificio], I was the founding CTO for an education non-profit [Neuromatch]. Having experienced all these things, I want to take a step back to write this long note, so that you may learn from my tribulations.
There’s a lot of misleading information out there in the popular media: startups are depicted as exciting places where innovative research takes place outside the constraints of academia. I want to reset expectations and in particular explain how money flows in a startup. Startups lose money, at least for a while, and if they don’t figure out how to make money soon enough, they will go broke. Most PhD students & many postdocs are isolated from the $ and ¢ aspects of running a lab, and academics sometimes express disdain at commercialism. However, for-profit startups, are, well, for-profit, and they need cash to keep going. If startup life appeals to you, you need to understand the money flow.
This is particularly important for understanding the role of research in a startup. You may ask yourself: will I be able to do research if I join startup X? The answer is: it depends. I end with a checklist of things you might want to consider before you decide to dedicate multiple years of your life to a startup.
A startup is a newly existing business that seeks to scale a given business model. For example, let’s say you have an idea for a new widget and you do market research to show people will buy said widgets. You may then convince investors to give you money to create a widget factory and start selling widgets. Congratulations! You have a startup.
Watercolor of a widget factory. Generated by DALL-E2.
Startups often start by losing money on salaries, capital expenditures (i.e. stuff like servers, factories, and so forth), lawyer fees, etc. This is known as the valley of death. Many startups are default-dead: to continue operating, they have to replenish their coffers with a constant influx of capital from external sources. These investments are conventionally structured in rounds, occurring every 12-18 months, starting with a pre-seed round, then going on to a seed round. Starting from series A onwards, startups often display revenue, though not enough to fund their massive expansion. Scaling is often necessary to make a startup ultimately profitable–a widget factory might cost the same regardless of whether it produces 1,000 widgets or a million. Therefore, by selling lots and lots of widgets, you spread the fixed cost of the widget factory over many widgets.
remain private for a long time, for various strategic reasons
To raise money, startups sell off chunks of themselves to investors. Doing so dilutes the ownership of the original owners. Instead of owning 100% of my business, if I sell off 20% for a chunk of change (say, 200k$), then my business is valued at 1M$ and I own 80% of it. Dilutive funding is often offered by venture capitalists (VCs). Non-dilutive funding, which doesn’t cost any ownership, is offered by government organizations (e.g. the NSF, various agencies in Canada, etc.).
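The dilution arithmetic is simple enough to check in a few lines. This is a minimal sketch: the function names are made up for illustration, and the three-round scenario at 20% per round is a hypothetical extension of the single-round example in the text.

```python
def post_money_valuation(investment, fraction_sold):
    """Valuation implied by selling `fraction_sold` of the company for `investment`."""
    return investment / fraction_sold

def founder_stake(fractions_sold):
    """Founder ownership after a sequence of rounds, each selling a fraction of the company."""
    stake = 1.0
    for fraction in fractions_sold:
        stake *= 1.0 - fraction  # each round dilutes every existing owner
    return stake

# The example from the text: sell 20% for 200k$.
valuation = post_money_valuation(200_000, 0.20)
stake_one_round = founder_stake([0.20])

# Hypothetical: three successive rounds, each selling 20%.
stake_three_rounds = founder_stake([0.20, 0.20, 0.20])

print(f"implied valuation: {valuation:,.0f}$")
print(f"founder stake after one round: {stake_one_round:.0%}")
print(f"founder stake after three rounds: {stake_three_rounds:.1%}")
```

Dilution compounds multiplicatively, so three 20% rounds leave the founder with about half the company rather than 40%.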
We most often hear of the startups that survive, not the ones that get lost along the way. From Wikipedia.
Importantly, most startups don’t succeed – the general rule of thumb is that out of 10 startups that receive seed investment, 7 will close down, 2 will break even, and only 1 will see a significant return on investment to the investor.
What determines a startup’s value?
Startups survive by selling chunks of themselves to investors. How do you value a chunk of a startup?
If I try to buy the poutine stand down the street to become a restaurateur, there are a few well-established ways to price that: I might look at the price of poutines they sell a year (gross income), or their net income, that is, poutine sales minus salaries minus other expenditures. Then I could apply a multiplier to get a ballpark valuation number. If the industry-wide revenue multiple for an independent restaurant is .3, and they’re selling 300k$ worth of poutine a year, a smooth 90k$ might get me into the restaurant business.
But multipliers don’t work if a startup is making no money! You might be asking: how can you ask people for money to own something that makes no money? How is it that the poutine stand down the street—bringing real value to the people in the form of delicious carbs—can be bought for less than the price of a Tesla, while a guy with a half-baked idea and no sales can value his startup at 3M$?
One method to value a startup is the so-called VC method. It states that the value of a startup is some discount factor for risk times the expected value of the business at some time in the future (say, 5 years). Let’s look at each of these in turn.
Discount factor for risk: If I’m willing to take no risk at all, I can put money in a high-interest savings account and make about 3% per year in these days of high interest rates. Free money! If I can afford a little risk, I can put money in an index fund and make an average of 8% a year, with some variance. Higher risk tolerance is rewarded with higher average returns. When an investor puts money in a startup, with a >50% risk of losing all of it, they will require much higher returns for it to be worth it, perhaps on the order of 50% year-on-year. The total discount factor is equal to (1+d)^-X, where d is the required yearly return (say, d = 0.5) and X is the number of years at maturity (say, X = 5).
Future value of the business: that can be estimated using the same type of calculation used to estimate the value of a poutine stand. Tech businesses have much higher multipliers (aka price-to-earnings) than brick-and-mortar independent restaurants: the industry average is 20. Thus, if your business could be making 10M$ in annual sales in 5 years, according to this calculation, it would be worth 200M$.
I like to think about the discount factor in a slightly different way: in terms of p(things will work in X years). Hence, the value is:
value = p(things will work in X years) × multiplier × yearly sales in X years
Note: this is a highly simplified model, and doesn’t take into account that you may have several sales scenarios, each weighted by their likelihood, etc. The lessons are broadly the same regardless of your exact valuation model.
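To see how the pieces of the VC method combine, here is a worked sketch using the numbers quoted above: a 50% required yearly return, a 5-year horizon, a multiplier of 20 and 10M$ of annual sales. The function names are made up for illustration, and this is the same highly simplified model, not a real valuation tool.

```python
def discount_factor(yearly_return, years):
    """(1 + d)^-X: how much one future dollar is worth today at required return d."""
    return (1.0 + yearly_return) ** (-years)

def future_value(annual_sales, multiple):
    """Ballpark value of the business at maturity: sales times an industry multiple."""
    return annual_sales * multiple

def vc_value(annual_sales, multiple, yearly_return, years):
    """Value today = discount factor for risk x expected value at maturity."""
    return discount_factor(yearly_return, years) * future_value(annual_sales, multiple)

fv = future_value(10e6, 20)        # value in 5 years if things work out
df = discount_factor(0.50, 5)      # can also be read as p(things will work)
value_today = vc_value(10e6, 20, 0.50, 5)

print(f"future value:    {fv / 1e6:.0f}M$")
print(f"discount factor: {df:.3f}")
print(f"value today:     {value_today / 1e6:.1f}M$")
```

With these inputs the discount factor is roughly 0.13, so a business that could be worth 200M$ in five years is valued at about 26M$ today. Read as a probability, that is about a one-in-eight chance that things work out.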
Generally, you’ll want each successive funding round to be at a valuation of 2–3X higher than the last one, otherwise you will quickly dilute yourself out of a business. Investors also hate so-called down rounds, where the startup is valued less than previously, because that means that they overpaid for their investment. Sometimes a miracle happens and you discover more yearly sales—equivalently, a larger total addressable market—prior to starting to sell things, but that is a very rare occurrence indeed. That means that the only thing you can really work on to increase the value of your startup and survive to live another day is to make your product or business “more certain to work” by a factor of 2–3X every 18–24 months. In other words: you should work to decrease risk.
What kinds of risks do startups face?
So far we’ve established that decreasing risk is a key activity in a startup. What kinds of risk are we talking about anyway?
Market risks
Will people want to buy the thing you’re selling at a price that will ultimately make you a profit? Generally, investors don’t like to invest in widget businesses where it’s not very clear whether anybody will buy the widget. You can address market risk by:
Showing that you already have sales. Definitely the best. If you have 100 clients willing to buy a thing at a decent price and you’re growing year-on-year by 10X, then it won’t be too hard for an investor to believe you have a viable business.
Showing that people are using the thing and they like it. Maybe you haven’t figured out how to sell a widget, but you might be able to show that people use a free, minimalistic version of a thing (MVP) and love it. That can mean having alpha users or demo installs.
Showing that people will actually buy the thing once it’s made. That can mean getting a letter-of-intent from some big client saying they’ll buy the widget once it’s made.
In all these cases, you need to build things. It’s hard to get a letter of intent for something you can’t hold in your hand, at least in prototype form. To decrease market risk, you have to build a prototype and then a minimum viable product (MVP).
Technological risk
Perhaps the product you are building is based on a principle that has been demonstrated only on a small scale in a lab, or maybe even just on paper. The usual scale to establish the technological maturity of an idea is TRL (technological readiness level). It comes from NASA’s classification system to judge prospective technologies for use in space, but it’s been broadly adopted in early stage R&D. The higher the TRL, the lower the risk.
The TRL scale. From Wikipedia.
Early-stage, pre-revenue startups will need very deep pockets to work on ideas at TRL1–TRL2. That really only makes sense if the market is so obviously large (“blockbuster”) that the risk and time expense are justified. Maybe somebody will fund a TRL1 nuclear fusion startup, because clean energy at scale is such a large market. Otherwise, TRL1–2 work is usually done at research institutions with public money, with the outcome, knowledge, becoming a public good. Early-stage deep-tech startups typically focus on TRL3–6 instead.
Building an MVP can really decrease technological risk because it shows the thing actually exists. Generally it’s a bad idea to combine something that has high technological risk with high market risk. A better combo might be: show that something has a big, obvious addressable market if only this very difficult technological/science problem can be overcome by the world’s foremost expert in this area, who happens to be you. Or maybe you have already built a super special, super cool technology and you think, I think maybe somebody would buy this, but you need to do market research.
Aside & rant: it can be a pointless exercise to try to forecast how basic research ideas done at research institutions will translate into tangible, marketable goods, because there are so many steps between the research and its commercialization. That means that judging basic research based on commercialization potential (as funding agencies increasingly do) can be an exercise in futility. Fundamental research is better judged by criteria such as: is this interesting?
Regulatory and IP risk
If you do anything medical, much of the uncertainty will be regulatory. Is your medical device class II or class III? The burden of proof and the cost of trials in each case will be different. Do you have access to the necessary intellectual property (IP)? Can you protect yourself long enough to go to market without somebody doing the thing you were planning to do eating your lunch?
Execution risk
Sometimes all the stars are aligned: the market is there, the innovation works, you’re lawyered up, but you still fail to deliver a good product. Maybe you don’t hire the right people; people who are too research-y or insist on putting things which don’t work yet into production. You get into constant fights about priorities and every planning meeting is a mutiny. You get a lawsuit at an inopportune time. Many factors can converge to cause a good idea with a demonstrated market to fail if not executed properly.
If it’s too good to be true, it probably is
We hear a lot about “research-labs”-like startups that don’t have any visible products and that seem to be doing great. This sets a really bad precedent for students and postdocs, because it gives the impression that there’s some free money floating out there. If it looks too good to be true, it probably is.
We don’t hear so much about startups that fail, because dead men tell no tales: it’s similar to the file drawer problem in academia. We often hear of successful AI startups, but we seldom hear the cautionary tale of ElementAI. ElementAI was a startup co-founded by Yoshua Bengio, the Turing Award-winning AI researcher. They used up a lot of money to bring AI-based solutions to businesses, but grew too fast without a laser-tight focus on a single, scalable product that could be sold. The fact that it was started by somebody famous is completely irrelevant: it may even have hurt them, because they were able to raise money too fast to find solid footing and prove product-market fit. At some point, the startup has to sell a product for more than the cost of the inputs to the product. In this case, they were sold for scraps, employee stock became worthless, and many wasted years of their lives in the debacle.
Understand the hypothesis underlying what a startup is doing and the addressable market. If it makes sense, jump in. If you cannot make the numbers add up and the people in the startup can’t give you a good story, you need to get out of there. I’ve heard students and postdocs tell me that they’re skeptical of startup X but want to jump in anyway, because, hey, what do they know? If you’re a subject domain expert, you often know more about the technology being developed and its risk profile than 90% of people involved. You should combine your pre-existing technical knowledge with knowledge about the market, which can be acquired pretty fast. If you do this, you can invest your time and energy on what you sincerely believe is a winner rather than engage in magical thinking.
What can you do as a PhD to help your startup succeed?
Now we get to the crux of my argument: your job, as a PhD in an early-stage startup, is to reduce risk. You have several ways of doing this:
Helping to build an MVP or prototype
Talking to people who will use the product to make it better (needs-driven design)
Doing research showing the efficacy of the product for its intended purpose
On this last point, you might think that industrial research is less reliable (sloppier) than academic research: if you have financial incentives to show something is true, wouldn’t you want to show that it’s true? In fact, you have every incentive for the research to be correct, not as in “finding the answer management wants to be true”, but as in “finding the right answer”. Finding the right answer decreases the inherent risk in the venture, which means you’re likely to survive another day: if the hypothesis underlying the startup is wrong, you should pivot now rather than later, when you’ve burned through your cash.
Because a pre-revenue startup has so much uncertainty and doesn’t have the cash flow to cushion the blows, you should be working on little R research. I would also argue that you should deeply understand the market at this point, and avoid doing R&D fenced off from the business: your research needs to translate pretty quickly into things people can buy, ideally within one round of financing (18–24 months).
Papers, what are they good for?
Publishing papers in this environment is a mixed bag. Positive incentives to publish include:
researchers want papers, and keeping high-value researchers happy is often a primary incentive for the startup, as employee churn is expensive.
the research itself can bring value to the company by decreasing uncertainty and reorienting activities depending on findings.
the seal of approval from peer-reviewed research can reassure investors that the work is innovative and that uncertainty has been decreased. Be careful, though: ultimately, it’s not hard for an investor to find an expert senior postdoc and pay them several hundred dollars to thoroughly review your claims. That due diligence is in many ways far more extensive than peer review ever could be.
recruiting high-value researchers can become easier if they have visibility into the research the company does. That’s also one reason many startups have company blogs.
research may be required to pass regulatory hurdles, for example with medical devices.
However, there are many downsides:
papers don’t pay the bills: clients do. Papers can be a huge distraction and encourage scientists to isolate themselves from the business and not contribute to product early on.
research creates intellectual property, which brings value to the company; but publishing that IP prevents patenting. It allows competitors to copy the innovation.
In conclusion: an early-stage startup that doesn’t seem to have many clients but has a lot of papers coming out is usually a warning sign. As a large shareholder in a startup, or an early employee in charge of R&D, you have to make sure you do the right kinds of research, and create a culture that incentivizes the continued existence of said startup.
Later-stage startups and companies with mature cashflows are in a better position to engage in big R research. Even then, however, the goal is rarely the production of knowledge, understood as a public good, though on occasion good PR could be valuable for secondary purposes like recruiting. The purpose instead is to perform research in such a way that it ultimately feeds into new products down the line that translate into sizeable revenue. Companies with long runways, for instance Meta with its Metaverse strategy, can afford to use a very long horizon to evaluate research. However, that’s only viable because Meta has tens of billions of dollars in cash reserves AND a positive cashflow. Most startups and companies are not in that position.
How not to succeed as a startup
You can learn from my mistakes. I was briefly the CTO of a medtech startup that used AI to treat a given neurological condition. As a PhD, it was my job to decrease the uncertainty surrounding the startup rapidly for us to raise money. I did all the activities expected of that role: created valuable IP that could be protected, did studies demonstrating our proof-of-concept, participated in product development, etc.
However, in doing so, I isolated myself from the marketing and fundraising aspects of the startup. What I should have focused more on was validating the market. Digging deeper and deeper, we realized that the market we had hoped for was smaller than anticipated as we uncovered a number of contraindications for the treatment we were building. To address a larger market through a different indication would have increased the uncertainty about efficacy to an unacceptable level. Ultimately, without a sizeable market, the opportunity was dead in the water. Although we could have continued to grind things out for a number of years, because the market opportunity was so small, it seemed a better trade-off to shut things down. A couple of papers and a patent are simply pointless for a startup unless you can turn that R&D into something that can ultimately be sold.
What about my favourite R&D startup?
I’ve shared my thoughts on the role of research in a for-profit startup with many over the years. I inevitably get pushback, often in the form of “what about startup X? It looks like they’re just doing R&D and doing fine?” Let’s deconstruct some counterexamples that people frequently cite.
Disclaimer: although I worked at Facebook and Google, I don’t have any special insight about these startups that were ultimately acquired by Facebook and Google, I’m just reading the publicly available information.
Deepmind
Deepmind, the artificial-general-intelligence company, is an interesting case. It was acquired in 2014 by Google at a reported 600M$. At the time, they had about 75 employees. For a company that had no visible product, its exit paths were rather narrow: an early investor confided that had they not been acquired by Google, they would have probably failed. About 8M$ per employee is pretty expensive for an acquihire, but if you look at contemporary discussions on HackerNews, commenters speculate that the core value of the company was indeed its people: recruit AI researchers cheaply in Europe, repackage as a functioning R&D team that could be sold to a tech giant. That would make Deepmind prior to its acquisition, in essence, a recruiter specialized in AI talent.
Compared to other acquisitions around the same time, it was cheap: WhatsApp was acquired the same year for 19B$ with 55 employees at the time, which comes to about 40X the price per employee. I don’t want to imply that Deepmind’s sale to Google was a firesale. However, in my mind, its sale was not the roaring success that some claim it was. Deepmind’s path was quite narrow, and I don’t think it’s replicable today.
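The per-employee comparison can be checked in a few lines of Python. The figures are the reported prices and approximate headcounts quoted above; the function name is just for illustration.

```python
def price_per_employee(price_musd: float, employees: int) -> float:
    """Acquisition price per employee, in millions of dollars."""
    return price_musd / employees


deepmind = price_per_employee(600, 75)       # reported 600M$, about 75 employees
whatsapp = price_per_employee(19_000, 55)    # reported 19B$, 55 employees
print(f"Deepmind: {deepmind:.0f}M$ per employee")
print(f"WhatsApp: {whatsapp:.0f}M$ per employee")
print(f"ratio: {whatsapp / deepmind:.0f}X")
```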
Ctrl-Labs
Ctrl-Labs built neural interfaces for AR/VR. They assembled an R&D team to create a neural interface that detects the smallest of neural impulses close to the wrist and translates them into commands to a computer. At the time of acquisition, they were, to the best of my ability to search through the Internet Wayback Machine, a pretty small team, perhaps on the order of a dozen people. They were acquired by Facebook in 2019 for a reported 500M$-1B$. Importantly, however, they had real protected intellectual property: a portfolio of their own patents plus IP acquired from North (previously named Thalmic Labs, maker of the Myo armband). Their R&D had been translated to a real prototype, which they presented at conferences. They had a real use case, which is interfacing for AR/VR. In other words, they had removed much of the technical uncertainty around the technology, in precisely the way I’m advocating here. In my view, it’s a good example of the role of R&D in an early-stage startup.
FROs & OpenAI
So far our discussion has been focused on for-profit startups, which are by far the most common legal structure for what we call “startups”, and which have the constraints that I discussed previously. People are actively looking for different legal structures that allow them to do big-R research outside the confines of academia.
Focused Research Organizations (FROs) are a new kind of model that’s emerging that’s potentially very exciting. They focus on building one highly specific public good (e.g. a tool, a dataset) that they can deliver within a time frame of ~5 years. Importantly, most FROs are non-profit organizations, and funders understand that the effort is oriented towards this specific outcome without the expectation of profit. They’re very much an experiment at this point, but it will be interesting to see what comes out of them.
Another example of an alternative model is OpenAI. OpenAI was started as a non-profit with a billion dollars in funding, and received a further injection of 1B$ from Microsoft following its re-incorporation as a limited-profit company. Much of their research is done out in the open (although GPT-3 and DALL-E2 checkpoints have not been released). Notice, however, that despite its legal status, OpenAI does have revenue through GPT-3. Indeed, non-profits or limited-profits still need to have a valid business model or constant cash influx to continue operating.
Perhaps a specific FRO’s cash flow looks like “we have exactly 30M$ from a donor to create a new tool, then we will release the tool and then shut down”. As long as everybody knows what they’re getting into, I applaud these efforts at finding alternatives. However, for startups incorporated as for-profits, which is the vast majority of startups, doing blue-sky research un-anchored to a product is a sure-fire recipe to run out of cash.
A checklist for joining a startup
If you’re thinking of joining a startup, ask yourself these questions:
Is the thing the startup is working on necessary? Does anybody actually care?
Is this a cool, innovative product?
Do I think it’s doable, technically?
Do I think that people are going to buy the thing the startup sells?
Will I have the agency to positively affect the course of the startup?
How much can I grow as a scientist/programmer/data scientist within the group?
What do I think about the people that run this startup? Do they know what they’re doing? If it’s their first time, are they getting the mentorship they need?
Do they have money? How much? Where are they in their fundraising?
Will I be able to do the kinds of research I want to do? Little R research? Big R research? Product dev?
Are there any other red flags? Have several people asked “I don’t understand, how can they stay in business?” Is one of the founders a creep or a cult leader, or do they have anger-management issues?
Conclusion
A startup’s goal is to return money to its investors by using a scalable business model. In an early-stage startup, it is very rare that your job will be to “do research and publish papers”. Publishing papers can, in some circumstances, increase the value of the startup and keep it afloat, but much of the time it can be a distraction unless it’s aligned to business goals. Technical staff should understand these business goals and align to them. R&D cannot be separated from the business until the startup is post-revenue and on more solid footing.
Image generated by VQGAN + CLIP from “Seeing the forest from the trees, brilliant light Unreal Engine”, from https://creator.nightcafe.studio/
It takes dedication to keep up with the scientific literature. 2344 papers were accepted at last year’s NeurIPS conference. Who has time to read all of that? It’s difficult to see the forest from the trees as ever more research is published.
What if every single one of us had infinite access to world experts in any field of study? They could recommend papers to read, based on:
relevance
accuracy
importance & novelty
In this essay, I’ll try to convince you that we’re not far off from that world. Advances in natural language processing (NLP) and large language models (LLMs) mean that it’s becoming possible for machines to understand natural language, including the prose in scientific papers. We have only started to see these advances deployed, but my bet is that it will change how we consume and create science over the next few years. Let me recap the progress in NLP over the last 5 years and paint a vision of scientific natural language processing in the near future.
I last looked seriously at NLP 5 years ago, and back then, it didn’t really work except for narrow use cases. Seeing recent results from PaLM and DALL-E 2 motivated me to pick up a book on NLP and apply it to a semantic search project. In those 5 years, the field has completely changed. Let’s recap.
Cover of the latest O’Reilly book on NLP and Hugging Face
TL;DR: NLP now works. It’s also heavily commoditized. Anybody with a technical skillset–and increasingly non-technical people–can train a model to solve any number of NLP tasks. Some of the biggest changes are that:
We have Transformers, a neural net architecture that scales well and can deal with large temporal contexts. The models range from encoder-only, like the famous BERT model, to decoder-only, like GPT-3, to encoder-decoder models like the ones used for translation.
We have powerful pre-training tasks, especially masked language modelling, to pre-train large scale models that are useful for a number of tasks.
Models have been scaled tremendously and now routinely number in the 100 billion parameter range, and increasingly are trained on a significant subset of all the text on the internet.
The fundamental paradigm for adapting to specific tasks has changed. We now routinely use pretrained models either as is, with in-context learning, or we fine-tune them. This is far cheaper, less burdensome, and scales better than training a model from scratch.
With these advancements in NLP models and training paradigms, the tooling has become far better:
HuggingFace is essentially building GitHub for transformer models; anybody can download one of over 10,000 pretrained models, and either use them as is or fine-tune them on their own data.
Access to datasets is far easier as well, again through HuggingFace.
Access to hardware is commoditized. Anybody can rent a supercomputer with 320 GB of GPU RAM for 30$ an hour.
If you don’t want to bother with training and deploying models yourself, you can use APIs which abstract away all the messiness, including OpenAI’s GPT or Cohere.
What do I mean when I say that the models now work? There are a number of tasks which we now know how to handle, provided that we have a sufficient amount of data:
classification of single sentences, for instance to evaluate emotional tone or valence
named entity recognition, e.g. extracting the names of compounds, companies and people from text
extractive question answering, figuring out which passage in a document is relevant to answering a question. For example, extracting the name of the first prime minister of a country from its Wikipedia article
document embedding and semantic search, e.g. finding other relevant documents from a seed document
summarization, i.e. generating a short summary of a paragraph or document
translation, e.g. from English to French or Python to C++
autocomplete, including autocompleting code
text generation, including steered text generation, e.g. generate convincing ad copy for a product from a list of features
That’s a long laundry list, but what is the relevance to how we do science?
Scientific NLP can ease information overload
Let’s go back to our initial thought experiment, having access to a world expert advisor to help us navigate the scientific literature. Because NLP now works and it’s highly commoditized, we’re starting to see scientific natural language processing being scaled up into real products.
Determining relevance
If you have a list of papers you have written, or a list of papers which you’ve identified as relevant to something you’re interested in, it’s becoming straightforward to find other relevant papers in that vein. Relevance can be determined from a number of signals, in particular semantic similarity as well as the citation graph.
How SPECTER embeds documents from different disciplines
A prototypical work in this space is SPECTER, a system from the Allen Institute for AI to find semantically related papers. A BERT-like model is pretrained on a masked language modeling task on unlabeled scientific texts (SciBERT): words are blanked out at random and the transformer is trained to find the missing word. This gives rise to one vector per token, in this case a 768-dimensional one. The token vectors are then summarized into a single document vector, in this case by extracting the vector of the special first token, although you could also take the average of tokens over a document.
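The two ideas in this paragraph, masking words at random and summarizing token vectors into one document vector, can be sketched in plain Python. This is a toy illustration rather than SPECTER’s or SciBERT’s actual code: the function names, the masking probability and the 3-dimensional vectors are assumptions, and real masked language modeling uses a subword tokenizer and a more elaborate masking scheme.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Blank out tokens at random. Returns the masked sequence and the
    targets (position -> original token) the model must recover."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets[position] = token
        else:
            masked.append(token)
    return masked, targets


def first_token_pooling(token_vectors):
    """Summarize a document by the vector of its special first token."""
    return list(token_vectors[0])


def mean_pooling(token_vectors):
    """Alternative: average the token vectors over the document."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


tokens = "deep nets predict responses in the ventral visual stream".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked, targets)

# Toy 3-dimensional vectors standing in for the 768-dimensional ones.
vectors = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 4.0, 1.0]]
print(first_token_pooling(vectors))
print(mean_pooling(vectors))
```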
The pre-trained network is then fine-tuned on a set of papers with a triplet loss: it moves similar papers together and dissimilar papers away from each other. Citations are used as a proxy signal for similarity: make papers which cite each other close to each other, and papers that don’t cite each other farther away. The result is a model that can generate fixed-length dense semantic vectors from each document. This can be used to retrieve similar documents, by finding other documents which are close together in this high-dimensional vector space.
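Here is a minimal sketch of a triplet loss and of retrieval by distance, in plain Python. The Euclidean distance, the margin of 1, the 2-dimensional embeddings and the paper ids are all illustrative assumptions. In a real system the loss would be minimized over many (query, cited, not-cited) triplets by gradient descent on the transformer’s weights.

```python
import math


def euclidean(a, b):
    """Distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(query, positive, negative, margin=1.0):
    """Zero once the positive (cited) paper is closer to the query than the
    negative (not cited) paper by at least `margin`; positive otherwise."""
    return max(0.0, euclidean(query, positive) - euclidean(query, negative) + margin)


def nearest(query, library, k=2):
    """Return the ids of the k documents closest to the query embedding."""
    ranked = sorted(library, key=lambda doc_id: euclidean(query, library[doc_id]))
    return ranked[:k]


# Hypothetical 2-dimensional embeddings (real ones are 768-dimensional).
library = {
    "paper_a": [0.0, 0.0],
    "paper_b": [0.0, 1.0],   # cited by the query paper
    "paper_c": [3.0, 4.0],   # unrelated
}
query = [0.0, 0.5]
print(triplet_loss(query, library["paper_b"], library["paper_c"]))
print(nearest(query, library))
```

Here the cited paper is already much closer to the query than the unrelated one, so the loss is zero; swapping the positive and the negative gives a large loss, which is what training would push down.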
Don’t Google Scholar, Mendeley, etc. already do recommendations? Yes, of course. However, consider these other uses of NLP in this context:
You can use a question as the seed document. That means you can ask a question in natural language and obtain a ranking of relevant documents that answer this question. That means you can do a search in a field that’s completely unknown to you and obtain good starting documents.
The search engine can summarize a document for you. At a glance, you can figure out what a recommended document is about and select it for further reading. An example of such a system is AllenAI’s TLDR, an extreme summarization method. You can see this in action on elicit.org and on Semantic Scholar.
You can get the language model to tell you why it chose a given document. For instance, you can ask GPT-3 to tell you why document A is relevant to document B, and oftentimes the answers are quite reasonable. When I asked GPT-3 to explain why someone interested in a famous paper comparing deep nets to brains would be interested in another paper on it, it answered quite reasonably that “they are both about the ventral visual stream and object recognition”. It’s not very precise but it’s tantalizingly close to “getting it”, and the gap will most likely continue to decrease in the future.
GPT-3 can relate two documents or passages together
Recommendation systems can also be used to find new collaborators, matching students and postdocs to PIs, and students about to graduate to industry jobs. That means less friction in science and more opportunities for all.
Determining accuracy
It’s easy to grasp the value of semantic search: “what if we had Google for science, but better?”. However, this only scratches the surface of NLP’s potential uses for science. Consider determining the accuracy of a paper. There are many features of documents relating to trust and accuracy that are tedious for a human reader or reviewer to go through. These include:
Is prior literature represented fairly and are citations correct? Especially in the intro, the prior art can be presented in non-exhaustive, biased ways: over-citing one’s own prior work, misrepresenting consensus or lack thereof, etc. Furthermore, authors may cite bad papers as evidence: citing underpowered, flashy or controversial papers without introducing proper caveats. Assuming we had a good semantic search engine (previous section) and a pre-existing trust system (this section), this could be automated.
Are the methods complete? If I took a result from a paper, and I wanted to recreate it, is it mentioned how that result was generated? Extractive Q&A can help find relevant passages in the methods.
Are the methods appropriate? There are already crawlers for papers that alert authors to statistical issues. This is a logical continuation of such efforts. Is the study prospective? Retrospective? Was a power analysis run? Are there enough subjects? Are the methods similar to those of other papers in that field? Are those papers trustworthy?
Are the claims substantiated? Entailment is an NLP task where one compares two statements and verifies that one follows from the other. One can build a whole statement tree and see whether the argument holds together. The more holes in the argument, the less trustworthy the paper.
Is the abstract representative of what’s in the paper? So many times you see papers where the abstract sounds awesome but the paper doesn’t substantiate its claims. With the claim tree resolved we can verify the representativeness of the abstract.
Is the paper plagiarized?
Is the paper written by authors who tend to write bad papers?
Now imagine having a trustworthiness score for each paper, with attribution for which passages are dubious and which are trustworthy. This would be immensely helpful for reviewers and readers, and perhaps a good learning tool for students so they can understand how to strengthen their papers.
Importance and novelty
We have tools for determining importance and novelty:
Prestige journals.
Affiliation.
Social media.
I think everyone is starting to understand why prestige journals are a bad idea: prestige journals take money from taxpayers and put it into the pockets of shareholders who neither do research nor do much to advance science. In fact, they actively create barriers to science by placing it behind paywalls. They have odious profit margins and extract free labor out of an exploited population (scientists). They delay publication by months, the judgement is extremely noisy, and those papers which do end up being published are more likely to be wrong than those in a less prestigious journal.
Affiliation is much the same way. Sure, if you see a paper from a Harvard researcher or a Nobel prize winner you’re more likely to read it. But affiliation, like prestige publication, suffers from the same defects: it distorts the marketplace of ideas and leaves very good science from non-Ivy-League schools unread.
Social media is a bottom-up way of measuring importance and novelty of new work, and indeed, Altmetric has built a business around measuring this buzz. There is signal there. However, having a very critical part of science diffusion in the hands of private corporations whose core mission is not science is risky in the long term. Science Twitter could get disrupted by global algorithmic changes on the platform. For example, a higher focus on comments as a metric (algorithmic engagement) might surface fewer papers and more recurring debates in your feed; less signal, more noise. The reality is that science Twitter is just a very small, poorly monetizable, insignificant part of Twitter as a whole, and its continued healthy existence is far from guaranteed.
None of the existing methods fully satisfy the core needs of the scientist as consumer: to see new developments in areas close to their area of research in realtime. Neither do they satisfy the needs of the scientist as producer: to have a fair chance of their science being read and fairly evaluated for what it is.
How can NLP help? There is some work attempting to predict future citations based off of different factors, a potential measure of importance and novelty. Some factors like the shape of the citation graph (e.g. citing papers from two different, disconnected areas of science) can be predictive of novelty. Perhaps it will take a truly deep understanding of the scientific literature to gauge that a new paper is revolutionary, but in the next 2-5 years I can easily see more modest goals being achieved:
sift through the literature to give you papers that are relevant to you, giving you short summaries of these documents
warn you if a new paper is likely unreliable because of its methods
free up time so you can deep dive into relevant papers with potential, and decide for yourself whether they are important
A platform that facilitates goals 1 and 2 could be an ideal location to start a science-focused social medium. The captured data could be used to build models to predict importance over the long term.
A part of the roadmap for the elicit.org search engine
Medium-term, I expect that the switch will flip, and that scientific research will be consumed more by computers than humans. Meta-analyses and review papers are already based on distilling existing data, and we can foresee that computers will be far more agile in creating texts that require distilling hundreds, if not thousands of source documents. There will eventually come a time where LLMs or their successors will be able to generate new scientific discoveries through distillation. It might not happen just right now, but certainly science will look far different in 10 years for this reason.
I also expect that large language models, embodied as assistants, will be excellent tutors. If a computer understood your learning path, it could become the perfect tutor, feeding you just the thing you need to learn at the time you need it. Both humans and machines will be far more capable as a consequence. It’s up to us to make sure this future is also fair and accessible to all.
The social media copy for this blog post was generated by GPT-3, and lightly edited by me. The prompt was “Write a compelling short teaser of this blog post for social media (288 characters max)”
This blog post was adapted into an article for a16z’s Future.
It’s been 60 years since Hubel and Wiesel first started unlocking the mysteries of the visual system. Proceeding one neuron at a time, they discovered the fundamental building blocks of vision, the simple and complex cells. Yet for a long time, neurons in high-level visual cortex were something of a mystery. What kinds of neural computations support complex, flexible behaviour?
When I defended my thesis in 2014, I confidently stated that we did not know how to build computers that could see like humans do. Yet only a few months later, Niko Kriegeskorte and Jim DiCarlo’s labs showed that deep neural networks (DNNs) trained on ImageNet represent visual information similarly to shape-selective regions of the visual brain. Follow-up research, including some of my own, increased the number of visual areas that could be explained this way, while also decreasing the numerical gap between brains and computers.
There are some remaining qualitative gaps, covered in Grace Lindsay’s review: DNNs are more susceptible to adversarial stimuli than humans; they require far more training data than brains; they’re biologically implausible in that they don’t follow Dale’s law; etc. As I argued in my previous post, people are working on all these fronts. We will be able to build a comprehensive in silico version of the visual brain over the next decade. What next?
In this essay, I’ll argue that we should build, over the next decade, an in silico version of the visual brain that will unlock a whole array of applications in human health. We’ll be able to exercise fine control over our visual experiences, and this will enable therapies delivered through the visual sense. Some therapies will be applicable for people with neurological disorders, while others will enhance healthy people. It will be unlocked by stages of technological development: first with the maturation of neuroAI, then through consumer augmented reality (AR), and finally (and much further down the line) with brain-computer interfaces (BCI) through closed-loop control. Follow me as I take you on a tour of the near future of visual neuroAI.
This is one of my longer posts, and it covers a lot of ground:
What’s neuroAI, and why do I think that neuroAI models will keep getting better, fast?
Why building models of the visual brain is more than just a satisfying intellectual exercise: it can actually help people
How do visual communication and control currently work, and how will they change once visual neuroAI is deployed?
What are the technological trends that will unlock these neuroAI applications? How AR and BCI will allow closed-loop control of the visual system
Modelling fills different roles. David Marr classified neural computational models into 3 levels of explanation:
High level (computational theory): What is the goal of the computation?
Mid level (representation and algorithm): How can this computational theory be implemented?
Low level (hardware implementation): How can the representation and algorithm be realized physically?
For some, the high-level theory is the cornerstone of their work: that’s the approach taken by David Marr and by James Gibson. Visual neuroAI takes a different approach: it finds image-computable functions that can solve a task (e.g. ImageNet classification) and aligns them to neural representations. The basic paradigm of the field has scarcely changed since the early days of neuroAI:
train artificial neural networks in silico without the use of brain data, for example for object recognition on ImageNet
compare the intermediate activations of trained artificial neural networks to real brain recordings, e.g. using linear regression, representational similarity analysis, centered kernel alignment, etc.
pick the best performing model as the current best model of the visual brain
(optional) open the black box of the best model to determine how it accomplishes its task and gain insight into the brain’s computations
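As a rough illustration of the comparison step, here is a minimal sketch of linear centered kernel alignment (CKA), one of the metrics named above. Everything in it is an assumption made for illustration: the "recordings" and the two candidate "layers" are synthetic arrays, and the names layer_a, layer_b, and neural are hypothetical. Real pipelines would use recorded responses and activations from trained networks, usually with cross-validation.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two (stimuli x features) response matrices."""
    # Center each feature across stimuli.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    return cross / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
n_stimuli = 200

# Synthetic stand-ins (assumptions, not real data): a shared latent code,
# noisy "neurons" that read it out, one layer that shares the code and one that does not.
latent = rng.normal(size=(n_stimuli, 10))
neural = latent @ rng.normal(size=(10, 50)) + 0.1 * rng.normal(size=(n_stimuli, 50))
layer_a = latent @ rng.normal(size=(10, 64))
layer_b = rng.normal(size=(n_stimuli, 64))

# Score every candidate against the recordings and keep the best one.
scores = {"layer_a": linear_cka(layer_a, neural), "layer_b": linear_cka(layer_b, neural)}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```

The candidate that shares structure with the recordings should come out ahead, which is the whole logic of picking the best-performing model as the current best model of the brain area.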
NeuroAI solves the representation and algorithm question–how can we implement the computations that the brain does?–using an approach that is primarily empirical.
An incomplete modular diagram of human vision by artist Colin Prahl. Our investigations of the visual system over the last 60 years have yielded a lot of insight that can be distilled by mid-level models.
The mid-level approach is more empirically grounded than the high-level approach, which is advantageous when investigating a high-dimensional sense like vision that has gone through years of evolution. There are simply some facts about vision which will never be explained by an optimality principle: the fact that the mammalian retina is wired up the wrong way around, for example. Despite their empiricism, mid-level models can help us understand high-level goals: we can perform infinite experiments and rapidly iterate in silico.
Mid-level models differ from lower-level approaches, like those pioneered by MICrONS and the Blue Brain Project, which seek to measure and model every synapse. While low-level models offer exquisite detail of small brain areas, mid-level models offer better coverage. They implicitly use a coarse-to-fine refinement strategy, with early versions of models accounting for a lot of the low-hanging variance, and later versions iteratively refining those models.
An illustration of how mid-level models refine over time, from coarse to fine. From Liquid Web.
The data flywheel
One of the biggest reasons that mid-level approaches have the wind in their sails is that they’re susceptible to the data flywheel effect. A data flywheel is a concept in industrial applications of machine learning that is quite intuitive. Let’s say that you have a data-driven product that’s driven by a machine learning model: for example, an app that identifies birds from their sound. You build an alpha version that is not very accurate. However, if it’s good enough that some people find it useful, you’ll get some users. They will upload more snippets of birdsong and might agree to label them on your behalf. You can use that data to build a better model. That will make the app more useful, which will bring in more users, who will collect more data on your behalf, thus making your model better, etc. Pretty rapidly you end up with the best bird identifier app in the world.
So how does the data flywheel work with mid-level models? Let’s take a look at the specific example of BrainScore. BrainScore is a community benchmarking website where anyone can submit models that explain responses in the ventral visual stream that are selective for shape and image categories. It’s similar to paperswithcode or kaggle. The candidate models must be image-computable: they make predictions about the response of a population of neurons to an image or sequence of images. The website hosts a number of datasets, a set of yardsticks against which models are evaluated. This benchmarking approach stack-ranks different candidate models, and potentially identifies what meta-parameters drive some models to perform better than others.
Benchmarking is not a perfect solution: focusing on the horse-race aspect of building models and yardsticks may not capture the intent of their designers. However, consider this: people will keep dreaming up new models, if only because of the continuing progress in computer vision. Bigger and better datasets will also be made available as brain recording tech doubles in capacity every 3.5 years. As long as we commit more models and more discriminating datasets, models at the top of BrainScore will keep getting better and better from a metrics standpoint. Moreover, BrainScore itself becomes more useful. The code that implements the model at the top of the board is, in a sense, the best theory of computation for the ventral stream at that time.
BrainScore is specialized for the ventral visual stream, but people are doing this as well for other areas. Indeed, Shahab Bakhtiari and I presented two spotlights at NeurIPS 2021 [paper 1, paper 2] where we evaluated models of the dorsal stream, and found some standout candidates. There’s a lot of interesting, basic, curiosity-driven scientific observations regarding high-level goals of the dorsal stream that came out of this exercise. In addition, we got an engineering artifact: an in silico model of the dorsal visual stream.
Right now, our state-of-the-art models of the visual brain are the equivalent of black-and-white TV (or GANs in 2014). But with the data flywheel effect, we have a recipe to make them into 4k HDTV (or GANs in 2019). So what are we going to do with these models?
From natural to artificial intelligence and back
A dominant idea in the field is that by studying natural intelligence, we’ll learn principles by which we can create better artificial intelligence models. For example, the NAISys conference introductory statement goes:
In spite of tremendous recent advances in AI, natural intelligence is still far more adept at interacting with the real world in real-time, adapting to changes, and doing so under significant physical and energetic constraints. The goal of this meeting is to bring together researchers at the intersection of AI and neuroscience, and to identify insights from neuroscience that can help catalyze the development of next-generation artificial systems.
There are historical examples of AI drawing inspiration from natural intelligence, especially convolutional neural nets and reinforcement learning. We should continue to research divergent approaches, including brain-inspired ones, to solving problems in AI. For instance, Bayesian programming and neuro-symbolic AI are very different from “standard deep learning”, and researching these areas further could create a cache of novel ideas, should deep learning ever hit a wall.
While the neuro→AI approach has been championed in both scientific papers and in popular media, we’ve heard much less about the opposite direction, AI→neuro. It’s been a bit under the radar for a little while, but I think it’s just as promising as the more familiar direction. AI is a vastly applicable toolkit that will help us understand the brain. Just in the past year, we’ve learned from AI approaches:
Clearly, AI is helping us understand how vision and other sensory systems work. There are constitutive reasons why this will continue to be the case, namely that AI works on far shorter timescales than typical biology experiments, it receives a tremendous amount of funding, and it’s improving rapidly. It has the potential to open new windows into the human mind and help us understand who we are as people. But, there’s another, widely overlooked point: in silico models of the brain can be used to improve human health.
Visual communication and agency
My bet is that in silico models will give people agency over what happens in their visual brains. It will make it possible to steer visual brain activity and deliver therapies through the eyes. When we have a good model of a system, we can design policies to control it effectively: this is the subject of control theory. Visual control encompasses two concepts:
visual communication, which is when an external agent (an artist, a marketer) attempts to control a visual system.
visual agency, which is when we attempt to control our own visual system.
These are two sides of the same coin, differentiated by who initiates the control. Visual communication, which seeks to leave an imprint on the visual system, is already commonplace. It’s currently done empirically, however.
Consider painting:
painting, the expression of ideas and emotions, with the creation of certain aesthetic qualities, in a two-dimensional visual language. The elements of this language—its shapes, lines, colours, tones, and textures—are used in various ways to produce sensations of volume, space, movement, and light on a flat surface. — Encyclopedia Britannica
These building blocks–shapes, lines, colours, tones, textures–are shared between painters and visual neuroscientists. Visual elements are manipulated in a purposeful way to communicate ideas and emotions. The artistic process varies, but there is almost always an element of evaluation, where the artist looks upon their work to see if it communicates the right feeling; if it doesn’t, they refine their painting. When artists suffer vision problems, because of cataracts, for example, their output changes accordingly. This indicates that the artist attempts to create a pattern of brain activity in the viewer’s brain, using their own brain as critic.
Paintings from Monet (circa 1922) before and after his cataracts were removed. Source.
The history of painting also shows that there are ways of communicating visual information that are, in some ways, more real than real, a topic explored in Margaret Livingstone’s excellent book. Edvard Munch’s The Scream presents a purer expression of anxiety and distress than a realistic portrayal could. Although, as a painter, he could use his own visual sense to guide his craft, his work generalized across individuals to convey emotions visually. Interestingly, modern neural networks can be used to generate images aligned to emotional words; while the images are not realistic, they visually distill emotions very effectively, in a manner reminiscent of the work of painters.
Edvard Munch’s Scream (public domain). A maximally activating image for CLIP for the concept of shock (OpenAI).
So vision can change how we feel, there is a set of stimuli that are better than reality itself at eliciting certain feelings, and these stimuli generalize across people. Vision can also change how we act. Web giants have known this for some time. Google famously tested which of 50 shades of blue to use for the blue links on the search engine results page. According to The Guardian, the best choice caused improvements over the baseline of $200M in 2009, or roughly 1% of Google’s revenue at that time. Netflix customizes the thumbnails to the viewer to optimize its user experience. Researchers at Adobe are working on tools to predict visual attention in illustrations so people can make their materials more visually effective.
Put together, I’ve demonstrated that vision is a powerful sense; that the visual input can be easily controlled via screens; that artists and businesses are using vision to bring about desired emotions and behaviours; that in some cases, artificial stimuli can be more effective at driving these reactions than natural stimuli; but that, by and large, the ability to control one’s brain states via the visual sense is a very empirical affair. My thesis is that we’ll be able to create ever more powerful visual media through neuroAI, AR, and BCI.
Open-loop control and applications
Nothing is in the mind that was not first in the senses except the mind itself. — Leibniz
Let’s think about what it means to control the state of the visual brain through a model-based approach. I take the Leibnizian view that the content of the mind (“the state”) is modified by the senses to create a new state of mind. A little notation will clarify what I mean:
The brain is initially in a state s0. By state, I mean a vector representing a concatenation of the visual brain’s activity at the appropriate spatial scale, for example at the scale of cortical columns.
The brain experiences visual experience I(x,y,t), which consists of an image at coordinates x and y, changing over time t.
As a consequence, the brain moves from state s0 to s1. We model this as the forward operator f: f(s0, I(x,y,t)) = s1.
The goal of control is to find the right stimulus I(x,y,t) to bring the brain into some desired state s1.
The controller has access to an in silico model g, where Hg = f, and H is a nuisance linear transformation.
Optionally, the controller takes low-dimensional measurements M of the final state s1 by, for example, measuring behaviour, using introspection, or tracking eye movements.
How the visual brain relates to a deep neural net. Created in BioRender.com.
Model-free vs model-based control
Given this setup, let’s consider the two visual communication paradigms I discussed in the previous section. First, people can use their own visual systems (or a committee of visual systems) as critics. That’s the painter’s approach–I call this small N empiricism. A second method is through large-scale empirical approaches, in particular those facilitated by websites with huge amounts of traffic. You can let a million people each evaluate fifty different shades of blue to decide which one to use on Google’s search engine results page: you don’t need to know anything about the visual system to make that decision. This is big N empiricism. Both approaches primarily work by a trial-and-error process of changing I(x, y, t) and measuring low-dimensional outcomes M. They are thus primarily evaluative and model-free.
Model-free methods can help you evaluate which images communicate most effectively, but they can’t tell you how to generate better images. For that, you need a model-based approach. We approximate the brain’s model f via a deep neural net g. Importantly, we can backpropagate through g, which means that we can generate better stimuli to accomplish a goal, starting from a random image, using gradient descent. If our in silico model is a good model of the brain, we’ve just generated a stimulus which can bring the visual brain to a desired state.
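To make the idea concrete, here is a minimal sketch of stimulus optimization by gradient descent. It rests on loud assumptions: the in silico model g is a toy single-layer tanh network with random weights W (a stand-in, not a real model of the visual brain), the stimulus is a flattened 64-pixel vector, and the gradient is derived by hand rather than through an autodiff framework so the example stays self-contained. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_units = 64, 16

# Toy in silico model g (an assumption): state = tanh(W @ image).
W = rng.normal(size=(n_units, n_pixels)) / np.sqrt(n_pixels)

def g(image):
    return np.tanh(W @ image)

def loss_and_grad(image, target_state):
    """Squared distance to the desired state, and its gradient w.r.t. the image."""
    state = g(image)
    err = state - target_state
    loss = 0.5 * np.sum(err ** 2)
    # Chain rule through tanh: d tanh(a)/da = 1 - tanh(a)^2.
    grad = W.T @ (err * (1.0 - state ** 2))
    return loss, grad

# Desired state: chosen here as the response to a hidden image, so it is reachable.
target_state = g(rng.normal(size=n_pixels))

# Start from a random image and descend the gradient on the pixels.
image = rng.normal(size=n_pixels)
initial_loss, _ = loss_and_grad(image, target_state)
for _ in range(500):
    _, grad = loss_and_grad(image, target_state)
    image -= 0.5 * grad
final_loss, _ = loss_and_grad(image, target_state)
print(f"loss: {initial_loss:.3f} -> {final_loss:.6f}")
```

The weights stay fixed throughout; only the stimulus changes. With a real differentiable model, the same loop would be written with automatic differentiation instead of a hand-derived gradient.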
I call this open-loop design and control. This approach is already deployed for certain applications. It’s common to use the latent middle layers representations of convolutional neural networks as critics, for instance for style transfer or to manipulate images to increase memorability. The middle layers of these networks tend to be aligned to intermediate visual representations in the brain (e.g. V4). In the case of style transfer, one starts with a photograph and modifies it through gradient descent until the summary statistics of intermediate visual representations matches that of a style image. The ineffable “looks similar” or “has the same style” is transformed into a hard metric that can be optimized to create stimuli which generate a certain brain state.
The style of the Van Gogh painting Starry Night is transferred to a photograph of Tübingen. From Gatys et al. (2015).
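The "summary statistics of intermediate visual representations" used in style transfer are commonly Gram matrices of feature maps. The sketch below shows that statistic and the resulting loss under stated assumptions: the feature maps are random arrays standing in for the activations of a pretrained network, and gram_matrix and style_loss are illustrative names, not the API of any library.

```python
import numpy as np

def gram_matrix(features):
    """(channels, height, width) feature maps -> (channels, channels) summary statistics."""
    channels = features.shape[0]
    flat = features.reshape(channels, -1)
    # Channel-by-channel co-activation, averaged over spatial positions.
    return flat @ flat.T / flat.shape[1]

def style_loss(features_a, features_b):
    """Mean squared difference between the two Gram matrices."""
    return float(np.mean((gram_matrix(features_a) - gram_matrix(features_b)) ** 2))

rng = np.random.default_rng(0)
style_features = rng.normal(size=(8, 16, 16))  # stand-in for network activations

# Shuffle spatial positions: the layout changes, the summary statistics do not.
perm = rng.permutation(16 * 16)
shuffled = style_features.reshape(8, -1)[:, perm].reshape(8, 16, 16)

# Features with different statistics give a clearly non-zero loss.
other = 2.0 * rng.normal(size=(8, 16, 16))

print(style_loss(style_features, shuffled), style_loss(style_features, other))
```

Because the Gram matrix discards where things are and keeps only which features co-occur, minimizing this loss matches texture and style rather than scene layout.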
Why model-based open-loop control is promising
A disadvantage of open-loop design is that it’s sensitive to mismatch between the system and the model of the system. This can be addressed in two ways: first, neuroAI models should and can be better. I expect that in ten years:
Models will be aligned to an exhaustive list of visual areas (ventral, dorsal, and early multimodal), such that H will be constrained by biology. Ideally, g should directly model the brain f.
They will be biologically plausible: they will follow Dale’s law, they will have different resolution in the fovea and periphery, they will have different receptive field properties as a function of space, they will include noise, recurrence, etc.
Models will be binocular, they will work on images and videos, they will take into account active vision (eye movements), and they may be embodied.
They will be customizable such that scotomas, strokes, or neurological disorders can be modeled through parametric means.
They will be steerable such that they can be adapted to mimic the visual system of specific individuals.
They will account for temporal dynamics at millisecond (tuning), second (adaptation), day (learning), and millennium (evolution) timescales.
Eventually, once brain-reading technology becomes more advanced (see later BCI section), they will model how the brain’s internal state interacts with the external stimulus to generate successor states. Until then, they will implicitly average out the brain’s initial internal state s0 as noise.
People are already working on most of these questions in isolation. When all the elements come together in one model, open-loop control will be far more powerful.
The second way to address the prediction mismatch in neuroAI is to mix and match open-loop control and empirical evaluation (as with painters and large online services) to obtain strictly better strategies. An analogy is AI-assisted drug discovery. An AI sifts through millions of different molecules to see which ones are likely to have good binding affinity with some target. Then scientists can go to the bench and test out the top 10 candidates. It’s clear that this is a better strategy than either testing millions of different molecules directly or picking the best molecule identified by the AI and trying to commercialize that.
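The screen-then-test strategy can be sketched in a few lines. This is a simulation under stated assumptions: true_effect is the unknown quality of each candidate, model_score is an imperfect model prediction (truth plus noise), and bench_test is a hypothetical stand-in for a costly empirical evaluation that we can only afford for a handful of candidates.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates = 100_000

# Simulated ground truth (unknown in practice) and a noisy model prediction of it.
true_effect = rng.normal(size=n_candidates)
model_score = true_effect + rng.normal(size=n_candidates)

def bench_test(indices):
    """Stand-in for an expensive empirical test; here it simply reveals the truth."""
    return true_effect[indices]

# Strategy 1: trust the model and take its single best candidate.
top_k = np.argsort(model_score)[-10:]   # ascending sort, so the last entry is the model's favourite
model_only_pick = top_k[-1]

# Strategy 2: let the model shortlist 10 candidates, then test them empirically.
hybrid_pick = top_k[np.argmax(bench_test(top_k))]

print("model only:", true_effect[model_only_pick])
print("model + 10 tests:", true_effect[hybrid_pick])
```

The model's favourite is always on the shortlist, so with an accurate bench test the hybrid pick can never be worse than the model-only pick, and it costs 10 evaluations instead of 100,000.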
Application areas for open-loop model-based visual control
As neuroAI models become better and better aligned to the brain, we’ll start seeing applications of visual neural control. The core idea is to create the right stimulus I(x,y,t) to bring the brain into a desired state. The applications are numerous: like conventional methods of visual communication, we’ll see it used in the arts and for marketing, but also in a variety of areas that have yet to be defined. I imagine a world in which a developer can download an in silico visual model of the brain–think torchvision or huggingface–to generate stimuli that maximize certain desired properties.
Selected application areas for neuro AI.
People will almost certainly use neuroAI to optimize ad clicks. However, some of the most exciting applications will be in healthcare, to develop therapies for people with neurological disorders and to enhance the well. Many of the application areas I highlighted above are extensions of currently available treatments through conventional visual psychophysics (see Chapter 11 of Dosher & Lu for an overview).
I’m particularly excited about applications in accessibility. Imagine you’re trying to design a font that’s easier to read for people with learning disabilities, like dyslexia, which affects up to 10% of people worldwide. One of the underlying issues in dyslexia is sensitivity to crowding, which is the difficulty recognizing letters when they’re presented in the visual periphery because of an interference effect. The visual periphery doesn’t fully represent the visual world: rather, it represents a summary of the visual content, which means it can be confused by letters with similar shapes, like mirror-symmetric letters (e.g. p and q). Anne Harrington and Arturo Deza at MIT are working on neuroAI models that model peripheral processing and getting some very promising results. Imagine taking that model and backpropagating through it to find a font design that is both aesthetically pleasing and easier to read. With the right data about a specific person’s visual system, we can even personalize the font to a specific individual, which has shown promise in improving reading performance. There are potentially large improvements in quality of life waiting here…
AR and the metaverse
The next trend that will make visual neuroAI applications far more powerful is the adoption of augmented reality glasses. A lot of people are skeptical about the metaverse and whether it is vaporware. Virtual reality remains niche. But I think that augmented reality (AR) is a very different story than virtual reality (VR), because AR integrates into daily life.
This will make widely available a display device that’s far more powerful than today’s static screens. If it follows the trajectory of VR, it will eventually have eye tracking integrated. This would mean a widely available way of presenting stimuli that is far more controlled than is currently possible, a psychophysicist’s dream. And these devices are likely to have far-reaching health applications, as told by Michael Abrash in 2017:
Consider the glasses I’m wearing. They enhance the acuity of my vision every waking moment. But what if they were AR glasses that enhance perception in other ways as well? What if they could enable me to see in low light? What if I could see people in places on the other side of the world? What if AR glasses could help me live a normal life if macular degeneration–which runs in my family–hits me? (emphasis mine)
The significance for visual neuroAI is clear: we could deliver therapies through stimulus sequences I(x,y,t), in a highly controlled way on a continuous basis in everyday life. This is still open-loop control, but it’s vastly enhanced by the delivery method.
It’s easy to think that AR glasses will bring a nightmarish, always-connected world in which our senses are constantly bombarded. But I can imagine AR moving the needle in the opposite way, where I can modify the visual environment to my liking and have complete agency over it. Imagine walking with AR glasses and being able to turn on a focus mode, like noise-cancelling headphones for your vision. To be safe, they would need to let through critical information, which means having environmental awareness as well as a very good model of how your visual system works. Assuming we can overcome the technical hurdles, you could get yourself into a deep work mindset very easily.
BCI
With a great, highly controlled display, you can control the input to the visual system precisely. The next, more powerful stage in delivering therapies through the eyes is to verify that the brain is reacting in the expected way and to adjust the input accordingly. Closed-loop control means measuring in real time how brain states react to stimulation through a read-only brain-computer interface (BCI). We can then nudge the initial brain state s0 towards the desired outcome s1. An analogy will help. Think of how a thermostat works: it controls a heating element. When the temperature rises above a certain threshold, it stops heating until the temperature falls below a lower threshold. That way the temperature is always within range. A closed-loop thermostat is strictly more powerful than a heating element that must be manually turned on or off (e.g. a space heater). And it doesn’t require having a very good model of the heat transfer properties of the room or the heater: closed-loop control can be robust to model inaccuracy.
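The thermostat analogy is easy to simulate. The sketch below compares a closed-loop thermostat with an open-loop heater whose duty cycle was computed from a wrong model of the room. All numbers are assumptions chosen for illustration: a first-order heat-loss model, an outside temperature of 5 degrees, a heater that adds 3 degrees per minute, and a room that leaks heat 50% faster than the controller's designer assumed.

```python
def simulate(controller, leak, steps=3000, dt=0.1, outside=5.0, power=3.0):
    """Euler simulation of room temperature; controller returns heater output in [0, 1]."""
    temp, heater, trace = outside, 0.0, []
    for _ in range(steps):
        heater = controller(temp, heater)
        temp += dt * (-leak * (temp - outside) + power * heater)
        trace.append(temp)
    return trace

def thermostat(temp, previous, low=19.5, high=20.5):
    """Closed loop: measure the temperature and switch the heater accordingly."""
    if temp < low:
        return 1.0
    if temp > high:
        return 0.0
    return previous  # inside the band, keep doing what we were doing

def open_loop(duty):
    """Open loop: a fixed heater output that never looks at the temperature."""
    return lambda temp, previous: duty

assumed_leak, true_leak = 0.10, 0.15

# Duty cycle that would hold 20 degrees if the assumed model were right.
duty = assumed_leak * (20.0 - 5.0) / 3.0

closed = simulate(thermostat, true_leak)
opened = simulate(open_loop(duty), true_leak)
print("closed loop range:", min(closed[-500:]), max(closed[-500:]))
print("open loop final temperature:", opened[-1])
```

The closed-loop controller holds the room near the target even though nobody told it the true leak rate, while the open-loop heater settles several degrees too cold because its model of the room was off.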
Getting to closed-loop control means having access to brain-computer interfaces that can read brain states. In this scheme, AR glasses are used to stimulate the brain, which is far more practical than invasive stimulation. To stimulate the visual brain at the relevant spatial scale directly via optogenetics or electrodes would likely require a million stimulation sites, which will likely remain technically infeasible in humans over the next decade. We can access a significant number of therapeutically interesting states by controlling the input to the eyes instead. As the visual system accounts for 30% of the brain and it is densely connected to other areas, these other areas should also be partially controllable, as shown in pioneering work by Dani Bassett and colleagues.
There are a number of non-invasive read-only BCIs that are commercialized today or in the pipeline. Some examples include:
EEG. Electroencephalography measures the electrical activity of the brain outside of the skull. Because the brain acts as a volume conductor, EEG has high temporal resolution but low spatial resolution. While this has limited consumer application to meditation products (Muse) and niche neuromarketing applications of dubious scientific value, I’m bullish on some of its uses in the context of vision. EEG can be much more powerful when one has control over the visual stimulus, because it’s possible to correlate the contents of the display with the EEG signal and decode attention (visual evoked potentials methods, VEPs). Indeed, NextMind, which made an EEG-based “mind click” based on VEPs, just got bought by Snap, which is now making AR products. Valve is also making a strong play into integrating EEG into VR headsets through their partnership with OpenBCI. I would not count EEG out.
fMRI. Functional magnetic resonance imaging measures the small changes in blood oxygenation associated with neural activity. It’s slow, it’s not portable, it requires its own room and it’s very expensive. The only current clinical application of fMRI is in pre-surgical mapping (e.g. to localize speech areas of the brain before surgically removing a tumor). However, fMRI remains the only technology that can non-invasively read activity deep in the brain in a spatially precise way. There are two paradigms which are fairly mature and relevant for closed-loop neural control. The first is fMRI-based biofeedback. A subfield of fMRI research shows that people can learn to modulate their own brain activity when it is presented to them visually on a screen. The second is visual cortical mapping, including approaches like population receptive fields and estimating voxel selectivity with movie clips, which allow one to estimate how different brain areas respond to different visual stimuli. These two methods hint that it should be possible to estimate how a neuroAI intervention affects the visual brain and steer it to be more effective.
fNIRS. Functional near infrared spectroscopy uses diffuse light to estimate cerebral blood volume between a transmitter and a receptor. It relies on the fact that blood is opaque and increased neural activity leads to a delayed blood influx in a given brain volume (same principle as fMRI). It’s been dismissed as being too slow and having too low spatial resolution to make a dent, but with time gating (TD-NIRS) and massive oversampling (diffuse optical tomography), spatial resolution is far better. On the academic front, Joe Culver’s group at WUSTL have demonstrated decoding of movies from visual cortex. On the commercial front, Kernel is now making and shipping TD-NIRS headsets which are impressive feats of engineering. And it’s an area where people keep pushing and progress is rapid; my old group at Facebook demonstrated a 32-fold improvement in signal-to-noise ratio (which could be scaled to >300) in a related technique.
MEG. Magnetoencephalography measures small changes in magnetic fields, thus localizing brain activity. Portable MEG that doesn’t require refrigeration would be a game changer for noninvasive BCI. People are making progress with optically pumped magnetometers: you can buy a single OPM for $10k from QuSpin. Kernel is teasing an upcoming Flux headset with a large number of OPM modules.
In addition to these better known techniques, some dark horse technologies like digital holography, photo-acoustic tomography, and functional ultrasound could lead to rapid paradigm shifts in this space.
While consumer-grade non-invasive BCI is still in its infancy, there are a number of market pressures around AR use cases that will make the pie larger. Indeed, a significant problem for AR is controlling the device: you don’t want to have to walk around with a controller or muttering to your glasses if you can avoid it. Companies are quite serious about solving this problem, as evidenced by Facebook buying CTRL+Labs for a hefty amount in 2019, and Snap acquiring NextMind. Thus, we’re likely to see some real-life, low-dimensional BCIs (e.g. a mental click) not long after the introduction of AR. Whether high-dimensional BCIs like Kernel’s offerings will rapidly find product-market fit outside of the lab remains to be seen. Companies would need to find a real use case; it’s possible that the kinds of neuroAI therapies I advocate for here are precisely the right use case.
If we can control the input to the eyes as well as measure brain states precisely, we can deliver neuroAI-based therapies in a monitored way for maximum efficacy.
Conclusions
A fantasy timeline for Neuro AI, consumer AR, and consumer BCI
NeuroAI is creating better and better models of the brain, especially sensory systems. A push in the next 10 years will create models which will be excellent stand-ins for the visual brain at the scale which matters for visual perception. While we continue to push to understand the mysteries of vision, excellent in silico models will open up a slew of application areas. Open-loop control already exists, and it will continue to grow as we become better at using it to unlock different applications. AR will bring ubiquitous, world-locked, high-resolution displays to everyday life. BCI, when it ships, will make closed-loop brain control possible. Through visual agency, we can take control over our own visual systems, and by extension much of our inner lives. The eyes are a window to our brain, and we’ll see exciting applications of that in the next decade.
We’re in a golden age of merging AI and neuroscience. No longer tied to conventional publication venues with year-long turnaround times, our field is moving at record speed. 2021 saw a Cambrian explosion of research into unsupervised learning to explain brain representations, which is teaching us about how the brain might have evolved for sensing. Join me on a tour of this exciting new research area.
One of the most robust findings in neuro-AI is that artificial neural networks trained to perform ecologically relevant tasks match single neurons and ensemble signals in the brain. The canonical example is the ventral stream, a series of areas in the visual cortex of mammals selective for shape. DNNs trained for object recognition on ImageNet match representations in IT, an area of the ventral stream (Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014). Supervised, task-optimized networks link two important forms of explanation: ecological relevance and accounting for neural activity. They answer the teleological question: what is a brain region for?
However, as Jess Thompson points out, these are far from the only forms of explanation. In particular, task-optimized networks are generally not considered biologically plausible. Conventional ImageNet training uses 1M images. For a human infant to get this level of supervision, they would have to receive a new supervised label every 5 seconds (e.g. the parent points at a duck and says “duck”) for 3 hours a day, for more than a year. And what about a non-human primate or a mouse? How would they receive a supervisory signal? Thus, the search for biologically plausible networks which match the human brain is still on.
The Neuro-AI hypothesis space. By Jess Thompson, 2021
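The supervision budget quoted above is easy to check with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the round numbers from the text (1M labels, one label every 5 seconds, 3 hours a day); the variable names are mine.

```python
# Back-of-the-envelope check of the labeling budget, using the round
# numbers quoted in the text (all three constants are approximations).
IMAGENET_LABELS = 1_000_000   # labeled images in conventional ImageNet training
SECONDS_PER_LABEL = 5         # one supervised label every 5 seconds
HOURS_PER_DAY = 3             # hours of "labeling" per day

labels_per_day = HOURS_PER_DAY * 3600 // SECONDS_PER_LABEL
days_needed = IMAGENET_LABELS / labels_per_day
years_needed = days_needed / 365

print(f"{labels_per_day} labels/day -> {days_needed:.0f} days ({years_needed:.2f} years)")
```

At that rate an infant would collect about 2,160 labels a day, so 1M labels takes roughly 463 days, which is where "more than a year" comes from.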
This year, we’ve seen a lot of headway into finding ways around supervision, and alternatives to supervised training are now competitive.
The alternatives to supervised training are many, and it’s not clear which method is winning right now. Let’s review the methods:
Self-supervised learning in a nutshell
Unsupervised learning aims to represent the data distribution. I’ve covered some unsupervised techniques in the context of dimensionality reduction in neuroscience.
Self-supervised training aims to find good data representations by solving pretext tasks (figure above). Many have been proposed over the years. Language models are almost universally trained with self-supervision; think BERT and GPT-3.
Contrastive learning is a particular flavor of self-supervised learning where the pretext task is to predict whether a sample is from a positive or negative (or distractor) class (hence “contrast”). There are many different flavors of contrastive learning: MoCo, InfoNCE, SimCLR, CPC, etc. There are also closely related non-contrastive methods that do away with negative samples, including BYOL and Barlow Twins.
Multimodal learning is a particular flavor of self-supervised training which aims to find a common subspace for two different modalities (e.g. vision, text, audio, etc.) by predicting one from the other, or predicting a common subspace. CLIP is perhaps the most famous multimodal network – it’s trained contrastively.
All of these methods allow us to learn a representation without the need for pesky supervision. If it turns out that this representation is aligned to a brain area, this is a win, as self-supervised and unsupervised methods are more biologically plausible than supervised methods.
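To make the contrastive pretext task concrete, here is a minimal sketch of an InfoNCE-style loss in plain Python. It is an illustration under simplifying assumptions, not the implementation used by any of the methods listed: embeddings are hand-written 2D vectors, similarity is cosine, and the `temperature` value and function names are my own choices.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy of picking the positive out of {positive} + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[0]

# Toy embeddings (hypothetical): two views of the same image should be close.
anchor = [1.0, 0.0]
negatives = [[-1.0, 0.2], [0.0, -1.0]]

good = info_nce(anchor, [0.9, 0.1], negatives)   # positive aligned with anchor
bad = info_nce(anchor, [-0.9, 0.1], negatives)   # positive far from anchor

print(f"aligned positive: {good:.4f}, misaligned positive: {bad:.4f}")
```

The loss is low when the anchor is closer to its positive than to the distractors, and high otherwise. Training a network to minimize it is what pushes two views of the same sample together in representation space, with no labels involved.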
Papers
I went through and reviewed papers and preprints published in 2021. I covered the MAIN conference, NeurIPS, and CCN, as well as whatever papers and preprints happened to show up in my Twitter feed. Of course, this review reflects my research interests (heavily skewed towards a certain flavor of neuro-AI and vision), but I hope it’s useful to many of you who want to see where the field is going.
Unsupervised neural network models of the ventral visual stream
From Zhuang et al. (2021)
Zhuang and colleagues (2021). This paper was published in PNAS this year and already has more than 60 citations. They found that unsupervised and self-supervised methods learned representations that are well-aligned to ventral stream (V1, V4, IT) neurons. Their significance statement:
Primates show remarkable ability to recognize objects. This ability is achieved by their ventral visual stream, multiple hierarchically interconnected brain areas. The best quantitative models of these areas are deep neural networks trained with human annotations. However, they receive more annotations than infants, making them implausible models of the ventral stream development. Here, we report that recent progress in unsupervised learning has largely closed this gap. We find the networks learned with recent unsupervised methods achieve prediction accuracy in the ventral stream that equals or exceeds that of today’s best models. These results illustrate a use of unsupervised learning to model a brain system and present a strong candidate for a biologically plausible computational theory of sensory learning.
In particular, they found that SimCLR and other contrastive learning methods could explain ventral stream neurons almost as well as supervised methods. Very strong proof-of-concept that labels are not necessary; major milestone for the field IMO.
Beyond category-supervision: Computational support for domain-general pressures guiding human visual system representation
From Konkle and Alvarez (2021)
Konkle and Alvarez (2021), bioRxiv. This paper asked a similar question to the Zhuang et al. paper: can the ventral stream be accounted for by a network trained without supervised learning? They used fMRI in humans rather than single neuron recordings to evaluate this. They find results which are broadly compatible with the Zhuang paper, with their own flavor of instance-contrastive self-supervision, as well as others along the same lines (e.g. SimCLR), accounting for the fMRI data. Interestingly, they find a disconnect between ImageNet classification accuracy and match to fMRI data. This paper has a really interesting discussion, where they seriously consider how you would actually implement self-supervision in a brain. They identify retinal distortions, saccades (see also this recent NeurIPS paper), efference copy, as well as a hippocampus-based buffering mechanism as means of instantiating self-supervision. I would love to see a modelling study instantiating these ideas – I’m sure they’re already on it!
Your head is there to move you around: Goal-driven models of the primate dorsal pathway
Mineault et al. (2021). Excited that we got this accepted as a spotlight in NeurIPS this year. As I’ve discussed previously, ventral stream neurons are selective for shape. However, there’s an entirely separate stream, the dorsal stream, which is highly sensitive to motion. How does that work? I compared a number of supervised 3D (spacetime) networks to different dorsal stream areas and found that they could not account for the responses of single neurons in non-human primates. Taking the task-driven paradigm at face value, we asked: what kind of task would create the representations we see in the dorsal stream? I trained a network to estimate the parameters of the self-motion of an agent from the patterns of images falling on its retina. Image sequences were generated in AirSim, an Unreal-based simulation engine for drones (below).
The resulting network looked a lot like the dorsal stream, and this was true both qualitatively and quantitatively. The model recapitulated the hierarchy of the dorsal stream (direction selective cells, pattern cells, optic flow cells), while also predicting biological neural responses better. Now, the training of this model is supervised, but from the agent’s perspective, it’s self-supervised, multimodal learning. The agent learns to predict the parameters of its self-motion (vestibular, efference copy) from another modality (vision). That’s potentially biologically plausible, something we’ll be able to verify in NHP experiments to follow.
A subunit of the model that prefers spiral motion
The functional specialization of visual cortex emerges from training parallel pathways with self-supervised predictive learning
From Bakhtiari et al. (2021).
Bakhtiari et al. (2021). This was our second spotlight in NeurIPS this year – I am very grateful to Shahab, who really outdid himself with this one! Mammals have both dorsal- and ventral-like streams, whether it’s humans, non-human primates or mice. Can one artificial neural network explain both? Shahab used a contrastive predictive coding (CPC) network trained on movie clips and found that, with two separate parallel pathways, the network self-organized into a dorsal and a ventral stream. The dorsal pathway provided a nice match to mouse dorsal areas, and the ventral pathway was well matched to mouse ventral areas. What’s more, networks trained with supervision, or networks with just one pathway, did not match the mouse brain. I think it’s a very nice embodiment of the idea that with anatomical priors and a sprinkling of self-supervision, the brain can bootstrap itself into existence.
Shallow Unsupervised Models Best Predict Neural Responses in Mouse Visual Cortex
From Nayebi et al. (2021)
Nayebi et al. (2021), bioRxiv. Deep neural nets are great models of primate visual cortex, but not so much for mice. This paper asks how we can obtain better models of mouse vision using more ethologically relevant, more anatomically grounded models. They used mouse visual cortex data (static images) and compared it to supervised and self-supervised networks of different architectures. One interesting finding is that shallow networks with parallel branches explain mouse data better. This corroborates Shahab’s findings. They make the argument that the mouse visual brain is a shallow “general purpose” visual machine, which is good for a variety of tasks, unlike the deep neural network in our brains, which is highly specialized to subtasks. I think these ideas fit very nicely with the idea that different brains are adapted to different ecological niches, which is one of my favorite theories of everything in neuroscience.
Neural Regression, Representational Similarity, Model Zoology & Neural Taskonomy at Scale in Rodent Visual Cortex
From Conwell et al. (2021)
Conwell et al. (2021), NeurIPS. Another paper on self-supervised learning in mouse visual cortex. This one is notable for the breadth of metrics and models considered: vision transformers, MLP mixers, taskonomy encoders, self-supervised models. Broadly speaking, the results are compatible with the previous two papers. I think the most notable finding is that some ethologically relevant taskonomy tasks stood out as better aligned to the brain: 2d segmentation, object recognition and semantic segmentation. Perhaps the mouse’s visual brain receives some multi-modal self-supervision from the whiskers?! It’s a very cool idea that an experimentalist should pick up.
Partial success in closing the gap between human and machine vision
Geirhos et al. (2021)
Geirhos et al. (2021), NeurIPS. Humans are very good at performing classification of images under distortions, such as noise, changes in contrast, rotations, etc. Up to now, CNNs have performed poorly on this task. In this paper, they ask whether newer models have closed this gap. Indeed, they have! They find that newfangled self-supervised and multimodal models are now on par with humans when it comes to robustness to distortions. One important factor underlying the results is how much data the network was trained on: models trained with orders of magnitude more data are more robust. CLIP performs exceptionally well here, for reasons that are still not clear. Newer models are also less sensitive to texture and more sensitive to shape: they seem to be taking fewer shortcuts. Nevertheless, newer models still make distinctly non-human mistakes, which they were able to show with extensive psychophysics. Very thorough evaluation, the reviewers loved it and so did I.
Multimodal neural networks better explain multivoxel patterns in the hippocampus
From Choksi et al. (2021)
Choksi et al. (2021), SVHRM workshop. The hippocampus contains “concept cells” (e.g. the infamous Jennifer Aniston cell) that are highly multimodal; they will respond to a textual representation of a concept or an image. Interestingly, CLIP does the same, and in fact can be fooled by adversarial labels (i.e. writing wrong labels into an image). In this workshop paper, the authors use publicly available fMRI data to show that multimodal networks, including CLIP, explain hippocampal data best. It’s wild that there’s something so special about CLIP, and we still don’t know what it means!
Unsupervised deep learning identifies semantic disentanglement in single inferotemporal neurons
From Higgins et al. (2021)
Higgins et al. (2021), Nature Comms. The DeepMind gang is at it again comparing neural representations in face selective areas and unsupervised, disentangled neural networks. They ask whether single, face-selective neurons in IT are aligned to single disentangled causal factors. Perhaps a single neuron in IT encodes a single causal factor which affects the appearance of faces, e.g. skin tone, lighting, age, etc. The manifold perspective of neural representation, OTOH, is that neurons don’t have to be axis-aligned to represent interesting factors; arbitrary rotations of the latent factors are just as good candidates for representation. This would argue against axis-alignment.
They trained an unsupervised neural network, the beta-VAE, to represent faces. A beta-VAE is a variational auto-encoder (VAE) with a higher weight (=beta) on the KL term, which encourages the representation to be disentangled. They show in this case that single neurons in IT are aligned to single disentangled latent factors in the VAE, which argues against the manifold view and for the disentangled causal graph view. Like all good papers, it asks more questions than it answers: how does disentangling happen in the brain? How do people learn disentangling? What’s the advantage of a disentangled representation?
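For readers who haven’t met the beta-VAE objective, here is a minimal sketch of the loss for a single sample, assuming a diagonal Gaussian posterior and a standard normal prior (the usual VAE setup). The function names, the `beta=4.0` default and the toy numbers are illustrative assumptions, not values from the paper.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_error, mu, log_var, beta=4.0):
    """Reconstruction error plus a beta-weighted KL term.

    beta = 1 recovers the standard VAE objective; beta > 1 puts more
    pressure on the latent code to stay close to the factorized prior.
    """
    return recon_error + beta * gaussian_kl(mu, log_var)

# Toy posterior for one sample with two latent dimensions (hypothetical values).
mu, log_var = [1.0, 0.0], [0.0, 0.0]
recon_error = 1.0

vae = beta_vae_loss(recon_error, mu, log_var, beta=1.0)
beta_vae = beta_vae_loss(recon_error, mu, log_var, beta=4.0)
print(f"VAE loss: {vae:.2f}, beta-VAE loss: {beta_vae:.2f}")
```

The only difference from a vanilla VAE is the multiplier on the KL term. Because the prior is factorized, leaning harder on that term tends to favor latents that vary independently, which is the sense in which it encourages disentangling.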
Unsupervised learning predicts human perception and misperception of gloss
From Storrs et al. (2021)
Storrs et al. (2021), Nature Human Behaviour. This paper examines the question of how humans perceive gloss from surfaces. They train a pixel-VAE on an ensemble of textures and examine how the latent factors are aligned to how humans perceive surfaces. They find that the VAE naturally disentangles different factors and is well aligned to human perception! Furthermore, they find that supervised networks don’t perform nearly as well on this task. Very cool idea, and lots of nice psychophysics in this paper.
Conclusion
This year has seen huge advances in matching unsupervised and self-supervised models to brains. They offer a number of advantages over the alternatives: potentially a better match to brain data, and training that doesn’t require labels.
We have to be careful though: learning without labels is more biologically plausible than learning with labels, but we still have to run the numbers to see if the math works out. In particular, because labels are expensive, and self-supervision opens the door for learning without labels, the latest class of models often uses a tremendous amount of data for training. For instance, GPT-3 is trained on a web-scale text corpus (~500 billion tokens). By contrast, children in the most talkative homes are exposed to ~30M words by the time they’re five. If we find an alignment between GPT-3 and representations of language in the human brain, it doesn’t follow that self-supervised training on an immense corpus as in GPT-3 is a biologically plausible mechanism for language acquisition and representation; we’re still off by 4 orders of magnitude.
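The size of that gap can be checked directly from the two figures quoted above. This sketch treats tokens and words as roughly interchangeable, which is a simplification (a token is typically a bit less than a word), so the answer is only good to an order of magnitude.

```python
import math

# Round numbers quoted in the text; tokens and words are treated as
# comparable units here, which is only a rough approximation.
GPT3_TOKENS = 500e9   # ~500 billion tokens
CHILD_WORDS = 30e6    # ~30M words heard by age five

ratio = GPT3_TOKENS / CHILD_WORDS
orders_of_magnitude = math.log10(ratio)

print(f"ratio ~ {ratio:,.0f}x, i.e. about {orders_of_magnitude:.1f} orders of magnitude")
```

The ratio comes out to roughly 17,000, or a little over 4 orders of magnitude.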
2022 will hopefully see a lot of work on embodied learning in ecologically valid settings. This, mixed with biologically plausible self-supervised and unsupervised learning, will lead to more refined models of how the brain can learn good representations during evolution and development.