Severely Theoretical

No, agentic AI will not massively boost productivity in scientific research; here’s how to actually boost scientific productivity

Emin Orhan Feb 13, 2026

There seems to be a lot of recent interest in and excitement about the promise of “agentic AI” (tools like Claude Code or Cursor) for improving productivity in scientific research. The idea, or perhaps more accurately, the hope, seems to be that by automating key steps in scientific workflows, agentic AI tools can massively improve the productivity of practising scientists. It’s not uncommon these days to hear claims such as: “What used to take me weeks now only takes a few hours.” The general message (the discourse, the narrative, the zeitgeist, the vibe of the times) seems to be that if you’re a practising scientist (whatever your field may be), you should be using these tools, because if you’re not, you’re going to be massively left behind. This constant drumbeat of “use these tools or become obsolete” messaging seems to create a lot of FOMO-induced anxiety among scientists and researchers who do not use these tools or do not find them as useful as their more tech-savvy or technophile colleagues (“Am I doing something wrong?”).

In the last week alone, I saw two pieces projecting this kind of message: this post by the machine learning researcher Tim Dettmers and this hour-long YouTube video by the astronomer David Kipping.

In this post, I’d like to argue that expectations of massive productivity boosts in scientific research through the use of agentic AI tools are completely unrealistic for one very simple and obvious reason: these tools do not address the main bottleneck in scientific research, which is experimentation (broadly construed to include things like running simulations or experiments in silico, etc.) or, more precisely, the limited resources available for experimentation.

If you’re an academic machine learning researcher, you typically have access to a limited amount of compute on your university’s HPC cluster (or on a national supercomputing resource or on some cloud computing service). This puts hard and severe limits on the number of simultaneous projects you can run, the number of experiments per project you can run, and the scale at which you can run these experiments. Every machine learning researcher knows extremely well that this is the main and the single most significant bottleneck throttling their work. Yet, agentic AI tools (or any other AI tools, for that matter) do absolutely nothing to address this bottleneck: you previously had ~50 H100 days of compute per month on your local HPC cluster (realistic estimate from my last academic employer) and you still have ~50 H100 days of compute per month with your Claude subscription.

Hard sciences where experiments interface with the real world are even more strongly bottlenecked by the resources available for experimentation, because experiments are more time-consuming and require at least a human in the loop in this case. If you’re running a wet biology lab, for instance, you can only run so many animal experiments, analyze, image, or test only so many biological samples, etc. in a given amount of time. Again, every biologist knows that these resource constraints are the main bottleneck limiting their work and AI again does absolutely nothing to change these constraints. Research that relies on large-scale instruments like the Large Hadron Collider (LHC), the James Webb Space Telescope (JWST), or supercomputers is maximally resource constrained, so I’m afraid your Claude subscription will help even less here. Incidentally, this is why I find it so strange to see David Kipping, an astronomer whose work relies on large-scale and very expensive telescopes like the JWST, join this “use these tools or become obsolete” bandwagon.

Or consider the following example from Tim Dettmers’s blog post (linked above). Dettmers claims that he uses agentic AI tools to help him write grant proposals. This would make sense if there were a million funding opportunities and you wanted to automate the submission of a broad range of high-quality ideas to maximize your chances of success (this kind of automated generation and submission of ideas would raise questions about your actual ownership of these ideas, but let’s leave these questions aside for now, since this scenario is already too unrealistic). In this hypothetical world of plenty, scientists would be genuinely limited by the time it takes to write a grant proposal, so it would make sense to automate their writing. But here’s the thing: there aren’t a million funding opportunities for supporting scientific research projects! There aren’t even a thousand or even a hundred of them. At best, there are only maybe 10 such opportunities for most scientists in any given year (if that), so writing grant proposals is not even remotely the main productivity bottleneck for scientists even if they work meticulously on each and every one of them.

Perhaps, some will argue that although AI doesn’t do anything to change the resource constraints scientists are subject to, it may enable them to use those resources much more effectively, hence still massively boosting productivity. So, for example, you may still have ~50 H100 days of compute per month on your local HPC cluster, but perhaps now you can run 10x more experiments or maybe 10x better experiments in some sense with those ~50 H100 days of compute when you give the reins to Claude Code. Or take Dettmers’s grant proposal writing example. Maybe the idea is that Claude Code will help you write 10x better grant proposals that will massively boost your chances of receiving funding, even though it obviously won’t do anything to increase the total amount of available funding resources (of course, it would be mathematically impossible for Claude Code to massively improve everybody’s chances simultaneously, given limited resources, but again let’s ignore this technicality for now). The idea that Claude Code will write 10x faster code or that it will generate 10x better ideas to test, explore, or experiment with (whatever “10x better” may mean in this context) doesn’t even pass the smell test, but let’s scrutinize it a bit further for a reality check.

Let’s be as charitable to AI agents as possible and pick a domain where they are expected to excel; their home turf, so to speak: writing code. And just to give a concrete example, let’s take a look at this paper that came out very recently, which introduces a new state-of-the-art method for doing “discovery” with LLMs, i.e. finding novel, highly performant solutions to specific quantifiable problems. One of the problems they consider here is writing performant GPU kernels for specific matrix operations (e.g. triangular matrix multiplication). Note that this is a case where one can explicitly write down an unambiguous, well-defined, and easily verifiable reward function, namely the inverse runtime of the produced kernel, which can then be directly optimized in silico through reinforcement learning. For the H100 GPU, the best kernel this method came up with for triangular matrix multiplication is only 18% better than the best human-written code (for the B200, the number is more like 13%; see Table 4). For another kernel writing problem (MLA Decode on the MI300X), the method actually fails to discover a more performant kernel than the best human-written kernels (see Table 5). So, a grand improvement of less than 20% in the absolute best case, the most optimistic scenario for AI, where we can explicitly write down an unambiguous, well-defined, and easily verifiable reward function and then directly optimize it in silico. Needless to say, this is not going to be even remotely possible for most scientific applications.

How to actually boost scientific productivity massively

There’s actually a very simple and straightforward way we could significantly boost scientific productivity and every economist knows how to do this: if you want to improve productivity significantly, you have to make significant capital investments, including in human capital (no pains, no gains). There are no shortcuts to this, no silver bullets, no magic tricks, no free lunches. If you want to boost the productivity of your machine learning researchers 10x, for example, you have to buy them 10x more and/or 10x better GPUs. If you want to boost the productivity of your molecular biologists 10x, you have to buy them 10x more or 10x better microscopes (plus all the other instruments and devices that make cutting-edge research in biology possible), you have to give them 10x more lab space to run their experiments in, etc. If you want to boost the productivity of your astronomers 10x, you have to buy them 10x more or 10x more powerful telescopes. In general, if you want to boost the productivity of your scientists 10x, you have to be willing to increase their funding 10x (or something like that). Of course, this is much harder and much more expensive than simply buying them $20/month Claude subscriptions and I suspect that one of the reasons why this idea of agentic AI massively boosting scientific productivity seems so alluring to people is its “get rich quick” nature. I’m sorry to have to remind you that “get rich quick” schemes are unfortunately almost always scams.

Conclusion

Current AI tools are extremely useful for a limited set of problems/tasks, chief amongst which is writing code. For this subset of tasks, these tools likely do boost productivity significantly. However, writing code is not the main productivity bottleneck in scientific research (for most scientists in most fields) and has never been so (if it were so, there would be a lot more professional software engineers hired in scientific research organizations than there currently are). By far the main bottleneck in science is rather the (often severely) limited resources available for experimentation. Current AI tools do absolutely nothing to address this bottleneck, which is why they will have a very limited impact on scientific productivity in the short term. In the long run, AI is, of course, a part of the process that improves our instruments of experimentation, making our GPUs, telescopes, microscopes, etc. more efficient and more powerful, but this is a much slower process.

http://severelytheoretical.wordpress.com/?p=4769

Continual training of Llama-3.1-8B for 809B tokens

Emin Orhan Apr 21, 2025

Over the last couple of months, I’ve been continually training the pretrained Llama-3.1-8B model (with a context length of 8192 tokens) for 809B tokens. This was my first truly large-scale distributed training experience and in this post, I’d like to share some of what I’ve learned so far.

Why

First of all, why am I doing this? To be honest, the primary motivation for me was just to get some hands-on experience in large-scale distributed training. That being said, I’m also genuinely very interested in continual training. It always seemed incredibly wasteful to me to have to do these large-scale training runs from scratch every time, instead of starting from an already well-trained model. If we could find really effective ways to do this, i.e. minimizing the loss of previously acquired knowledge while at the same time not hobbling the model’s capacity to acquire new knowledge, that would be extremely impactful in my view.

For Llama-3.1-8B specifically, I remember I was explicitly motivated by the following post from the Llama-3 pretraining lead to try continual training on this model:

Yes, both the 8B and 70B are trained way more than is Chinchilla optimal – but we can eat the training cost to save you inference cost! One of the most interesting things to me was how quickly the 8B was improving even at 15T tokens. https://t.co/r0uBJwQEbC
— Mike Lewis (@ml_perception) April 18, 2024

The fact that the model was still improving surprisingly quickly after 15T tokens suggested to me that, if I could optimize my continual training setup, I could possibly get substantial improvements over the base model even with a few trillion tokens of continual training, which is basically my compute budget. So, improving upon the released base model in this way was another major motivation for me.

Training infrastructure

The model is trained on the Frontier supercomputer hosted at OLCF, which consists of over 10k compute nodes with 4 AMD MI250X accelerators on each node. I was initially quite ambitious about scale. I wanted to continually train the pretrained model for another 15T tokens or so. Given the relatively old hardware and the less than ideal interconnect (to put it mildly) on Frontier, basically the only way to train an 8B parameter model for this many tokens within a reasonable time frame (i.e. in a few months) is to use a very large batch size (on the order of at least 100M tokens globally) and do correspondingly fewer training iterations. I was quite excited about this setup, because it would allow me to try things that, to my knowledge, haven’t really been tried before in large-scale LLM training, namely training with very large batch sizes (~100M tokens globally) and multi-epoch training. However, a batch size of 100M tokens requires ~500-600 nodes on Frontier and it quickly became clear to me that with the current limits on my account, it wouldn’t be feasible for me to go through 15T tokens at this scale within a reasonable amount of time, so I had to scale back my ambitions.

In the end, I decided to settle for a 64-node training run, with a batch size of 11M tokens. This would allow me to run my jobs in the extended partition of Frontier that has a 24-hour maximum run time limit for jobs (otherwise the limit is a mere 12 hours). I planned for a maximum of 500k training steps, which corresponds to ~5.5T tokens, which in turn is roughly equal to 1 epoch over my training data. I estimate that it will take me about 10 months to complete training in this setup.

64 nodes on Frontier correspond to 512 “devices” (or graphics compute dies) with 64 GB HBM2e GPU memory per device, for a total of 33 TB of GPU memory. Further technical details about the system can be found here.
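The device and memory figures above follow from simple arithmetic, sketched below. The per-node and per-device numbers come from the text; the factor of two graphics compute dies per MI250X is the standard configuration of that accelerator, and the variable names are only illustrative.

```python
# Device and memory arithmetic for the 64-node run described above.
NODES = 64
ACCELERATORS_PER_NODE = 4   # AMD MI250X per Frontier node (from the text)
GCDS_PER_ACCELERATOR = 2    # each MI250X exposes two graphics compute dies
HBM_GB_PER_GCD = 64         # HBM2e per device (from the text)

devices = NODES * ACCELERATORS_PER_NODE * GCDS_PER_ACCELERATOR
total_hbm_gb = devices * HBM_GB_PER_GCD
total_hbm_tb = total_hbm_gb / 1000  # decimal terabytes

print(devices)                   # 512
print(total_hbm_gb)              # 32768
print(f"{total_hbm_tb:.1f} TB")  # 32.8 TB, i.e. roughly 33 TB
```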

Training details

For efficient distributed training, I used a combination of hybrid sharding data parallelism (HSDP), tensor parallelism (TP), and pure data parallelism (DP). These were factorized across 512 devices as follows: HSDP=32, TP=8, and DP=2 (this means that there are 32 HSDP ranks, 8 TP ranks, 2 DP ranks, and each device’s rank can be uniquely identified by a combination of these ranks). I tried a bunch of other configurations (e.g. HSDP+DP without TP), but I found that this particular configuration maximized the training throughput. In addition, I also used just-in-time (JIT) compilation of the individual model layers through torch.compile, full activation checkpointing, and bf16 mixed precision training.
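To make the factorization concrete, here is a minimal sketch of how a global rank could be decomposed into a (DP, HSDP, TP) coordinate. The row-major ordering with TP as the innermost dimension (so that the 8 TP ranks would share a node) is an assumption made for illustration, as are the helper names `rank_to_coord` and `coord_to_rank`; the actual mesh layout used in the training code may differ.

```python
from typing import NamedTuple

# Parallelism degrees from the text.
HSDP, TP, DP = 32, 8, 2
WORLD_SIZE = HSDP * TP * DP  # 512 devices


class MeshCoord(NamedTuple):
    dp: int
    hsdp: int
    tp: int


def rank_to_coord(rank: int) -> MeshCoord:
    """Map a global rank to (dp, hsdp, tp), assuming a row-major (DP, HSDP, TP) layout."""
    if not 0 <= rank < WORLD_SIZE:
        raise ValueError(f"rank must be in [0, {WORLD_SIZE}), got {rank}")
    rest, tp = divmod(rank, TP)      # TP varies fastest (assumed)
    dp, hsdp = divmod(rest, HSDP)    # then HSDP, then DP
    return MeshCoord(dp=dp, hsdp=hsdp, tp=tp)


def coord_to_rank(coord: MeshCoord) -> int:
    """Inverse of rank_to_coord under the same assumed layout."""
    return (coord.dp * HSDP + coord.hsdp) * TP + coord.tp


print(rank_to_coord(0))    # MeshCoord(dp=0, hsdp=0, tp=0)
print(rank_to_coord(511))  # MeshCoord(dp=1, hsdp=31, tp=7)
```

Every one of the 512 ranks maps to a distinct coordinate, which is what "uniquely identified by a combination of these ranks" amounts to.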

The learning rate schedule is a simple linear schedule with a warm-up of 500 steps, a peak learning rate of 3e-5, and a total of 500k training steps as mentioned earlier (i.e. the learning rate decays linearly from its peak of 3e-5 at step 500 to 0 at step 500k). I initially tried a 10 $\times$ larger peak learning rate (3e-4), but this resulted in a pretty big jump in the loss early on during training (and evaluations indicated a quick and substantial degradation in model quality, which I didn’t like!), so I decided to play it safe and went for a lower peak learning rate. I didn’t really have the chance to tune the learning rate extensively, but in hindsight I should have probably gone for a slightly larger peak learning rate, maybe something like 5e-5 or 6e-5.
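The schedule described above can be written as a small function. This is a sketch rather than the code used in the run: the warm-up is assumed to ramp linearly from 0, which the text does not specify, and the function name `learning_rate` is illustrative.

```python
PEAK_LR = 3e-5
WARMUP_STEPS = 500
TOTAL_STEPS = 500_000


def learning_rate(step: int) -> float:
    """Linear warm-up to PEAK_LR, then linear decay to 0 at TOTAL_STEPS."""
    if step < WARMUP_STEPS:
        # Assumed: warm-up ramps linearly from 0 to the peak.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from the peak at WARMUP_STEPS to 0 at TOTAL_STEPS.
    remaining = max(TOTAL_STEPS - step, 0)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, 250, 500, 73_500, 500_000):
    print(step, f"{learning_rate(step):.3e}")
```

Under these assumptions, the learning rate at step 73500 is still about 85% of its peak value.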

The global batch size is 11M tokens. This number comes about as follows: 64 data-parallel ranks (HSDP $\times$ DP) $\times$ local batch size of 21 per data-parallel rank $\times$ context length of 8192 tokens.
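As a quick check, the batch size arithmetic and the token counts quoted elsewhere in this post can be reproduced as follows (all inputs are taken from the text; the variable names are illustrative):

```python
HSDP, DP = 32, 2
LOCAL_BATCH = 21      # sequences per data-parallel rank
CONTEXT_LEN = 8192    # tokens per sequence

data_parallel_ranks = HSDP * DP                                # 64
tokens_per_step = data_parallel_ranks * LOCAL_BATCH * CONTEXT_LEN

print(tokens_per_step)                                 # 11010048 (~11M)
print(f"{tokens_per_step * 500_000 / 1e12:.2f}T")      # 5.51T tokens over 500k steps
print(f"{tokens_per_step * 73_500 / 1e9:.0f}B")        # 809B tokens after 73500 steps
```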

To train the model, I’ve used a customized version of the excellent torchtitan library, which is a lightweight, PyTorch-native distributed training framework. The training code is available from this repository with detailed instructions for full reproduction. Although I’ve been using AMD hardware to train the model, I expect that the same code should work seamlessly on NVIDIA hardware without any modifications (I haven’t verified this at scale though).

Training data

The choice of training data is probably the single most important decision in large-scale training runs. Again, I didn’t really have the time and the compute budget to explore the training data choice as extensively as I would have liked to, but I did peruse the literature on this topic pretty thoroughly and in the end came up with a training dataset consisting of the following components:

Zyda-2, which is itself a cross-deduplicated and filtered combination of DCLM (3.3T), FineWeb-Edu (1.3T), Dolma (0.2T), Zyda (0.2T) datasets.
Stack-2, specifically the the-stack-v2-train-smol-ids subset (525B).
FineMath, specifically the finemath-3plus subset (34B).

The numbers in parentheses represent the approximate token counts of the datasets (the full dataset has ~5.6T tokens). The mixture weights for these components are as follows (in terms of data rows, not tokens): DCLM (40%), FineWeb-Edu (44%), Dolma (3%), Zyda (2%), Stack-2 (10%), FineMath (1%). Again, these weights were chosen mostly based on the prior literature rather than my own experiments.
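For illustration, row-level mixture weights like these can be applied by drawing the source of each training row at random with the given probabilities. The sketch below only demonstrates the weighting itself; it is not the data loader used for the run, and the names `MIXTURE` and `sample_sources` are hypothetical.

```python
import random
from collections import Counter

# Row-level mixture weights from the text.
MIXTURE = {
    "DCLM": 0.40,
    "FineWeb-Edu": 0.44,
    "Dolma": 0.03,
    "Zyda": 0.02,
    "Stack-2": 0.10,
    "FineMath": 0.01,
}


def sample_sources(n_rows: int, seed: int = 0) -> list[str]:
    """Draw the source dataset for each of n_rows rows according to MIXTURE."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = list(MIXTURE.values())
    return rng.choices(names, weights=weights, k=n_rows)


counts = Counter(sample_sources(100_000))
for name in MIXTURE:
    print(f"{name:12s} {counts[name] / 100_000:.3f}")
```

Because the weights are defined over rows rather than tokens, the token-level proportions will differ from these numbers whenever the average row length differs between components.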

I don’t pretend to claim that this particular mixture of datasets is optimal at all, but as of this writing, it’s probably one of the strongest and highest-quality composite datasets one can put together from open, public sources of general web text and code and math data.

Results

I’ve now trained the model for 73500 steps (which is roughly 809B tokens or ~15% of the total number of steps planned for training) and the training loss curve so far looks like this:

*The black trace shows the loss tracked in 100-step bins, the red trace shows the loss tracked in approximately 15k-step bins.*

That’s a pretty nice decrease in loss! We love to see it. The checkpoints are available from this Hugging Face repository (and any future checkpoints will also be deposited in the same repository). It took me about 1.5 months to go through 73500 steps and I estimate that it would take another 9 months or so to complete the full training run. The training run is paused at the moment, because I have other (higher priority) experiments to run on Frontier. I don’t know yet if I’ll complete the full training run (probably not) or how much longer I’ll keep training the model, but I’ll post any updates here and on the accompanying GitHub repository.
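The remaining-time estimate is a straightforward extrapolation, sketched below under the assumption that the pace of the first 73500 steps (including queue waits and restarts) holds for the rest of the run:

```python
STEPS_DONE = 73_500
TOTAL_STEPS = 500_000
MONTHS_ELAPSED = 1.5  # approximate wall-clock time so far (from the text)

fraction_done = STEPS_DONE / TOTAL_STEPS
months_per_step = MONTHS_ELAPSED / STEPS_DONE
months_remaining = (TOTAL_STEPS - STEPS_DONE) * months_per_step

print(f"{fraction_done:.1%}")      # 14.7%
print(f"{months_remaining:.1f}")   # 8.7
```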

In terms of downstream evaluations, unfortunately, I haven’t had a chance to run the full set of evaluations yet (because some of the tasks take quite a bit to run), but from the handful of evaluations I’ve run so far, it basically looks like a wash at the moment, with a slight improvement in some tasks and a slight degradation in others:

| | MMLU | ARC-Challenge | Winogrande | HellaSwag |
|---|---|---|---|---|
| step-0 | 63.4 | 51.4 | 74.1 | 60.0 |
| step-73500 | 60.9 | 52.1 | 73.6 | 60.2 |

Downstream performance on four evaluation tasks. Step-0 corresponds to the pretrained base model without any continual training. The evaluation code and task configs are available from here. Any discrepancies from evaluation results reported elsewhere are likely due to differences in evaluation setups.

A note on large-scale training on AMD hardware

The problems with AMD’s deep learning software stack are well-known and I don’t really want to rehash them here. Instead, I want to discuss a couple of issues that I personally encountered in my own experience. I also highly recommend that people read this report, which provides a fairly comprehensive overview of the problems with AMD’s approach to software, with lots of concrete recommendations for improvement. For me, the thing that stood out in this piece was this single sentence:

Tensorwave, the largest AMD GPU Cloud has given GPU time for free to a team at AMD to fix software issues, which is insane given they paid for the GPUs.

AMD (and HPE) do this with OLCF as well. They are given valuable GPU time on OLCF systems to hunt for rccl bugs that they should have fixed themselves and, honestly, that should never have been there in the first place. This is a completely unacceptable situation, especially on a publicly funded system. AMD should have its own cluster with at least hundreds (if not thousands) of its best GPUs and should be using its own cluster to improve and optimize its software stack and to showcase capabilities by constantly releasing models trained on its own cluster (just like NVIDIA does). AMD has the money to do this. This is the only way to gain the trust of the community and to prove that it’s serious about software.

Currently, it isn’t possible to do training runs on more than ~800 nodes on Frontier in my experience due to apparently intractable rccl bugs (or some interaction between rccl and HPE’s Slingshot interconnect). In practice, rccl+Slingshot is so slow and inefficient that it ceases to make any sense to do training runs beyond ~500 nodes on Frontier, even if you could do it reliably. Moreover, in my own benchmarks, MI250X is still ~1.5-2.2 $\times$ slower than its NVIDIA counterpart, A100, even at much smaller scales. It may be possible to get a training throughput close to A100 on MI250X at such scales by taking advantage of its larger GPU memory (128 GB vs. 80 GB) and using a larger batch size, but this isn’t ideal at all.

AMD has the goodwill of the entire community behind it. Everybody desperately wants AMD to succeed. But, to do that, they have to take software much more seriously than they have been doing so far.

http://severelytheoretical.wordpress.com/?p=4695

Language development in children is more like post-training than pre-training in LLMs

Emin Orhan Feb 28, 2025

One of the pitfalls in scientific writing in general is the danger of prematurely describing the basic observations of a field in theory-laden terms. This can be particularly problematic in fields that are in their infancy, such as developmental psychology, where basically all the fundamental questions are wide open, and where we basically don’t know anything about anything. Avoiding this major pitfall is the primary virtue of Eve Clark’s excellent book on language development in children, Language in Children. The book describes the general character (the Gestalt) of the types of experience children learn their first language from admirably plainly, with lots of actual examples of child-parent (and child-child) interactions from diary studies. It does so without groping after premature “explanations” of how children actually learn their first language from such experiences, which are obviously going to be wrong (a favorite theme of mine is that “explanations” are overrated, descriptions are underrated in sciences that deal with complex systems).

Reading the book made me realize the ubiquity (and presumably the importance) of feedback in children’s early language exposure. This feedback doesn’t have to be (and often isn’t) very explicit (as in: “no, that’s not correct; here’s how you actually say it …”), but much more implicit as in recasts, reformulations, and other types of implicit corrections of a child’s productions that happen all the time as part of the natural conversational to-and-fro between the child and the caregiver, as in the following example (p. 15):

Or as in this exchange between a 2.5-year-old and his/her father (p. 138):

Clark claims that this kind of feedback is quite common in children’s early language experience, “with adults following up between 40 per cent and 60 percent of child errors up to around age 3;6, in middle- and upper-class speakers of English and French” (p. 16). Children also frequently ask questions to directly elicit appropriate labels and descriptions for novel objects, events, or actions from adults.

Another thing that appears to be common in early language development is practice. Children regularly practice different aspects of language on their own and with others. For example, they practice the sounds of their language (the following examples are from the bedtime monologues of a 2-year-old, Antony, p. 31):

(interestingly, note how this practice sequence contains both positive and negative examples, i.e. “not barries”; it’s as if the child has an internal learned reward model, or verifier, that can evaluate his own productions with respect to particular criteria). They practice the building up and breaking down of phrases:

And they practice various other grammatical constructions:

Practice and feedback are tightly related to each other: more practice elicits more feedback, which provides more opportunities for the child to learn and to refine the meanings of the words and phrases in his/her language (p. 41).

Despite this apparent prevalence of feedback in early language exposure, it seems to be commonly assumed, implicitly or explicitly, that much of language development in children happens without supervision. I suspect that this may be another remnant, another relic, of Chomsky’s malignant influence in linguistics, specifically his “poverty of the stimulus” idea: the idea that not only is there not enough data in a child’s early linguistic experience, but whatever little there is of it is not rich in supervision, therefore of lower value for learning. A natural consequence of this assumption is that language acquisition in children is often likened to unsupervised, or self-supervised, pre-training in language models.

A case in point is the BabyLM challenge. The BabyLM challenge involves learning sample-efficient language models from a developmentally plausible corpus, containing 10M or 100M words. Pretty much all submissions to both editions of this challenge that have run so far feature variations on pre-training, e.g. exploring new architectures for pre-training, new objectives for pre-training, designing new curriculum strategies for pre-training, etc. (here’s a synopsis of the first edition of the challenge and here’s a synopsis of the second edition). As far as I can tell, none of the submissions so far have explored ideas involving what is nowadays called post-training in language models, i.e. supervised learning or learning from various types of feedback (such as human preferences). Reading Language in Children has made me seriously question this identification of language acquisition in children with unsupervised pre-training in language models. So, I would like to see more work come out that explores possible connections between language acquisition in children and feedback learning (or post-training) in language models instead (this doesn’t necessarily mean denying any role to unsupervised learning in children’s language acquisition though).

There are reasons to think that learning from feedback would be more sample-efficient than unsupervised learning, so it could potentially help explain the remarkable sample-efficiency of language acquisition in children: (i) feedback provides both positive and negative evidence, unlike in the currently dominant paradigm of unsupervised learning, where the only negative evidence is the absence of positive evidence; (ii) feedback, or supervision, is tailored to the learner’s productions, so the supervision signal is, in some sense, much more relevant to the learner than passively received unsupervised data (this also means that it’s automatically titrated to the learner’s current skill level).

Language in Children also offers a novel alternative perspective on the challenge of learning a second language in adulthood that I hadn’t thought of before. The apparent difficulty of learning a language in adulthood, compared to the apparent ease with which children acquire their first language in early development, is commonly attributed to some sort of critical period in brain plasticity, i.e. children are somehow biologically much more keyed to learning a language than adults. But Clark offers an alternative explanation for why language learning might be more difficult for adults than for children, namely the radically different nature of the language experience in adulthood vs. in childhood. More specifically, the quantity and the quality of feedback received by adult learners, even those immersed in a second language environment (e.g. a new immigrant to a foreign country), is likely much inferior to the kind of intense and intimate feedback that children receive while learning their first language (p. 140): “… first-language acquisition offers many occasions for feedback from more expert speakers, occasions that young children attend to, and make use of. But with a later-acquired second language, such feedback is virtually absent…” Or, at the very least, it is much less common than in childhood.

I don’t necessarily endorse this explanation as the sole, or even the primary, reason why language learning in adulthood appears to be more difficult than in early childhood (as a counterpoint, for example, one could point out the numerous cognitive advantages adults have over children that could help them make better use of whatever feedback they receive while learning a language), but in my view, it offers an interesting, novel hypothesis that needs to be taken seriously and investigated further as a potentially important factor contributing to the apparent difficulty of learning a language in adulthood vs. in childhood.

http://severelytheoretical.wordpress.com/?p=4667

On the entropic brain, trapped priors, and machine learning

Emin Orhan Oct 29, 2024

If the doors of perception were cleansed every thing would appear to man as it is, Infinite. For man has closed himself up, till he sees all things thro’ narrow chinks of his cavern.
William Blake, The Marriage of Heaven and Hell

I’ve recently been reading up on the effects of psychedelics on the brain. One of the consistent effects that comes up in the literature is that psychedelics seem to make brain activity more entropic or higher dimensional globally (or more chaotic if you’d prefer to put it that way); for example, a recent paper claims “psilocybin desynchronizes the human brain”. It is intuitively very tempting (although potentially misleading) to interpret the psychological effects of psychedelics in terms of this entropy expansion mechanism: e.g. they help the brain escape from established, laid down, entrenched, lower dimensional (and often pathological) patterns of activity, opening up an opportunity to break free from its “trapped priors”.

The main reason I wanted to write this short post is that I would like to make a connection with machine learning here. I think that a similar entropy expansion story, by and large, accounts for the benefits of various architectural motifs used in modern deep learning models to improve their trainability. I’ve long argued that the main obstacle in training deep learning models is degeneracy, i.e. the activity space of the model collapsing into a very low-dimensional, degenerate space for generic initializations of the model. This makes the model effectively a pathologically low capacity model. Skip connections alleviate this degeneracy problem by making the activations more entropic. Normalization also likely has a similar effect, and so does the mixture-of-experts (MoE) motif.
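As a rough illustration of the degeneracy described above, here is a minimal sketch under strong simplifying assumptions: a deep *linear* network with random Gaussian weights, where "entropy" is proxied by the participation ratio of the output covariance for isotropic Gaussian inputs. The function names, the width and depth, and the branch scale of 0.1 on the residual path are all hypothetical choices made for this example, not anything taken from the post or from a specific paper.

```python
import numpy as np

def participation_ratio(end_to_end):
    """Effective dimensionality of the outputs for isotropic Gaussian inputs.

    The output covariance is J J^T, whose eigenvalues are the squared
    singular values of J. The ratio (sum)^2 / (sum of squares) ranges from
    1 (one dominant direction) up to the width (all directions equal).
    """
    eig = np.linalg.svd(end_to_end, compute_uv=False) ** 2
    return eig.sum() ** 2 / (eig ** 2).sum()

def end_to_end_map(width, depth, skip, branch_scale, rng):
    """Multiply out a randomly initialized deep linear network."""
    total = np.eye(width)
    for _ in range(depth):
        weights = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        # With a skip connection each layer is identity plus a scaled branch.
        layer = np.eye(width) + branch_scale * weights if skip else weights
        total = layer @ total
        # Rescale for numerical stability; the participation ratio is scale-invariant.
        total /= np.linalg.norm(total)
    return total

rng = np.random.default_rng(0)
WIDTH, DEPTH = 64, 50  # hypothetical sizes for the toy experiment
pr_plain = participation_ratio(end_to_end_map(WIDTH, DEPTH, False, 0.1, rng))
pr_skip = participation_ratio(end_to_end_map(WIDTH, DEPTH, True, 0.1, rng))
print(f"plain: {pr_plain:.1f} of {WIDTH}, skip: {pr_skip:.1f} of {WIDTH}")
```

In this toy setting the plain stack should collapse onto a handful of directions, while the version with skip connections should keep a sizeable fraction of the width. Nonlinear networks, normalization layers, and MoE routing would need their own analysis.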

Neuroscientists often interpret learning effects, e.g. improvements in sensitivity to environmental cues or improvements in cognitive or behavioral flexibility, very locally in terms of changes in synaptic plasticity (for example, in the paper cited above, we read: “Synaptogenesis in the medial frontal lobe and anterior hippocampus is thought to be key to the neurotrophic antidepressant effects of psilocybin”). But the overwhelming importance of these more global, system-level effects for learning has been woefully overlooked in my opinion. A more entropic model is a better learner simply because of its reduced degeneracy overall, not because of any low-level details (and there are probably many different low-level implementations that achieve a similar global effect).

By far one of the most interesting papers I’ve read on this topic is this paper by Carhart-Harris et al. from 2014, which explores some really original (albeit highly speculative) ideas about possible connections between this more entropic state of the brain and human consciousness. Carhart-Harris et al. argue that this entropic state corresponds to a more primary form of consciousness that is distinct from the normal waking consciousness in humans. It is characterized as a more dream-like, less constrained conscious state with a diminished, diffuse sense of self and a stronger sense of unity with the universe (mind at large, as Huxley called it). It is speculated that infants might also possess a similar primary conscious state as their default conscious state (early psychosis and certain types of meditation might induce such states to a certain extent as well). As infants grow up and their brains mature, they become less entropic. This was also interesting to me, because something like this is again observed while training neural networks: they start out in a less degenerate (more entropic) state, and gradually become more degenerate (less entropic) over training. The architectural motifs mentioned above (skip connections, MoEs, normalization) typically make the model less degenerate before training, but more so after training.

One of the main interests of these drug-induced altered states of consciousness (and other extreme states of consciousness) for me is that they demonstrate the remarkable “latent phenotypic variation” inherent in the human brain. In other words, they show us what we are potentially capable of becoming or experiencing. We can possibly cultivate some of these states with our own will to some extent, and cultural and biological evolution can work on such states over longer time scales to select desired phenotypic traits (in fact, Carhart-Harris et al. argue that something like this presumably already happened along the evolutionary lineage leading up to humans: the brains got more entropic). If at some point in the future, for example, a selection pressure arises for exquisitely sensitive, gentle, kind-hearted, selfless, child-like saints with a profound sense of unity with the universe, evolution can work its magic quickly to make a Peaceable Kingdom on Earth, because the raw material is already there.

http://severelytheoretical.wordpress.com/?p=4621

IsoFLOP curves of large language models are extremely flat

Emin Orhan Jul 31, 2024

An interesting detail in the recently released Llama-3 technical report has caught my eye (p. 8):

This has caught my eye, since I had noted the same phenomenon in a previous post about the Chinchilla scaling laws (more than two years ago) to argue for training smaller models (point 4 in that post). I’m glad that this observation is finally being taken seriously, but I think the quotation above from the Llama-3 paper still underestimates the extent of this isoFLOP flatness issue. The performance of these models is not just robust to small variations in model size around the optimal, but it is actually pretty robust to even massive variations in model size at the Llama-3 compute scale. Here’s a simple experiment I did to illustrate this point.

Suppose you have a compute budget of 3.8e25 FLOPs, which is the amount of compute used for training the 405B parameter flagship Llama-3.1 model. What is the loss you can achieve by training different sized models, given this much compute? We can estimate this using the Chinchilla scaling laws, in particular, using their “Approach 3”, i.e. fitting a parametric scaling function to L (pretraining loss), N (model size), D (data size) values across many different training runs. In fact, we can even estimate the uncertainty around our predictions with bootstrapping (following Besiroglu et al.). Here’s what it looks like when we do this (I won’t bore you with the details, but suffice it to say it’s a pretty straightforward exercise):

**Figure:** Estimated *model size vs. pretraining loss* relationship given a compute budget of 3.8e25 FLOPs (equivalent to the amount of compute used for training Llama-3.1 405B). Estimates are based on the Besiroglu et al. correction to the Chinchilla scaling laws. Gray curves are 4000 individual predictions based on bootstrapped parametric scaling law estimates. Red dots indicate the corresponding compute-optimal models. Black line is the mean prediction. We highlight three different sized models (65B, 650B, 6.5T parameters) with the star symbols.

For this particular compute scale, a model with roughly 650B parameters (the middle star) turns out to be compute-optimal. But, note first the uncertainty around this optimal size (red dots). The 95% confidence interval around the optimal size spans almost an order of magnitude! And this assumes that we got the parametric scaling function exactly right (no misspecification) and our experiments were all optimal (hyperparameter choices, etc.). If not, that’s going to be another major source of error. Secondly, and even more importantly, look how wide and flat the bottom of the curve is around the optimal size. Models an order of magnitude smaller or bigger than the optimal model (indicated by the star symbols on the left and the right) basically have the same loss as the optimal one. In this particular case, for example, the optimal 650B model has a loss of ~1.89, whereas both the 65B and the 6.5T models have a loss of ~1.91. So, these models are within about 1% of each other in terms of final pretraining loss.
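For readers who want to see the shape of this curve themselves, here is a minimal sketch of the point-estimate version of the calculation (no bootstrapping). It assumes the Chinchilla "Approach 3" parametric form L(N, D) = E + A/N^α + B/D^β together with the common approximation C ≈ 6ND for training FLOPs. The coefficient values below are quoted from memory of the Besiroglu et al. refit and should be treated as assumptions to verify against that paper; the function names are made up for this example.

```python
import numpy as np

# Assumed point estimates for L(N, D) = E + A / N**alpha + B / D**beta,
# recalled from Besiroglu et al.'s refit; verify before relying on them.
E, A, B, ALPHA, BETA = 1.8172, 482.01, 2085.43, 0.3478, 0.3658

def loss(n_params, n_tokens):
    """Parametric pretraining loss for a model of size N trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def isoflop_loss(n_params, flops):
    """Loss along an isoFLOP curve, using the approximation C ~ 6 * N * D."""
    n_tokens = flops / (6.0 * n_params)
    return loss(n_params, n_tokens)

BUDGET = 3.8e25  # FLOPs, the Llama-3.1 405B training budget quoted above
sizes = np.logspace(10, 13.5, 3501)  # from 10B to roughly 30T parameters
curve = isoflop_loss(sizes, BUDGET)
n_opt = sizes[np.argmin(curve)]

print(f"compute-optimal size on this grid: {n_opt:.3g} parameters")
for n in (65e9, 650e9, 6.5e12):
    print(f"N = {n:.3g}: loss = {isoflop_loss(n, BUDGET):.3f}")
```

The printed losses should land close to the values quoted above, though not identically, since the figure averages over bootstrap samples whereas this sketch uses a single set of coefficients.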

Parenthetically, note also how this estimate of ~650B parameters for the compute-optimal model size differs quite substantially from the ~400B parameters estimated in the Llama tech report for the same amount of compute. This is presumably because these respective scaling “laws” are estimated from two different sets of experiments (using different training data and different training configurations, etc.). This again goes to show how sensitive these so-called scaling “laws” are to experimental details. I’m always amazed when people talk about them as if they were actual, precise, quantitative “laws” of nature, like Newton’s laws of motion. What a terrible and unfortunate choice to call them laws!

I haven’t been able to find a credible estimate of this, but my guess is that the total (lifetime) inference costs of these flagship large language models are likely several orders of magnitude larger than their one-time training costs, so any calculation that takes into account the inference costs of these models (not just their training costs) will massively favor training smaller models for longer (way beyond the training-compute-only-optimal point). I appreciate that Meta already did this for their 8B and 70B models with Llama-3 to a certain extent, but I hope that they’ll be even bolder next time and do it for their flagship model as well. Given a fixed compute budget, smaller models also allow for more room for experimentation: for example, for tuning the hyperparameters of the model or the training configuration much more extensively. So, again any calculation that includes hyperparameter search and other types of experimentation (in addition to training and inference compute) will likewise shift the isoFLOP curve to the left and thus favor training smaller models for longer (although this is likely a much smaller effect than the effect of inference).
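To make the direction of this effect concrete, here is a rough sketch under several assumptions: a single compute budget pays for both training (≈ 6·N·D_train FLOPs) and lifetime inference (≈ 2·N FLOPs per served token), the loss follows the same assumed parametric form and recalled Besiroglu et al. coefficients as in the isoFLOP calculation, and the numbers of served tokens are purely hypothetical. The helper `best_size` is made up for this example.

```python
import numpy as np

# Assumed point estimates for L(N, D) = E + A / N**alpha + B / D**beta,
# recalled from Besiroglu et al.'s refit; verify before relying on them.
E, A, B, ALPHA, BETA = 1.8172, 482.01, 2085.43, 0.3478, 0.3658

def loss(n_params, n_tokens):
    """Parametric pretraining loss for a model of size N trained on D tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA

def best_size(total_flops, inference_tokens, sizes):
    """Model size minimizing loss when one budget covers training and inference."""
    # Whatever is not spent on serving (2 * N FLOPs per token) goes to training.
    train_flops = total_flops - 2.0 * sizes * inference_tokens
    feasible = train_flops > 0
    train_tokens = train_flops[feasible] / (6.0 * sizes[feasible])
    return sizes[feasible][np.argmin(loss(sizes[feasible], train_tokens))]

BUDGET = 3.8e25  # FLOPs, reused from the training-only calculation
sizes = np.logspace(9, 13, 2001)
for served in (0.0, 1e13, 1e14):  # hypothetical lifetime inference tokens
    n_best = best_size(BUDGET, served, sizes)
    print(f"served tokens = {served:.0e}: best size = {n_best:.3g} parameters")
```

Under these assumptions the preferred model size can only shrink as the assumed inference volume grows. How far it shrinks depends entirely on the inference estimate, which is exactly the number that is hard to pin down.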

To wrap up this post, I’d like to make a few very concrete recommendations for the next iteration of Llama models (presumably Llama-4), if anybody from the Llama team reads this (je sais que tu le fais):

Include inference compute and experimentation compute (e.g. hyperparameter tuning) in your scaling law calculations, not just the training compute. These do not have to be very precise. It’s OK to be conservative about inference compute. Any effort is better than no effort. Here’s one recent attempt and here’s another attempt at incorporating inference compute in such calculations.
Normalize training models for multiple epochs. With the current data sizes (and especially for smaller models), a few epochs of training over the same data is almost certainly indistinguishable from training on brand new data. And training for even more epochs is likely only slightly worse than training on brand new data for any number of training epochs practically feasible today.
I think there’s no need to train three separate model sizes; two models at most should be enough: one small and one bigger. For Llama-3, for example, instead of a 70B model, I would have much preferred to see Meta spend a bit more compute and train the 8B model for 20x longer (or something like that). Seeing how well such a well-trained small model worked would also go a long way toward substantially increasing people’s confidence in the practice of training smaller models for much longer.

http://severelytheoretical.wordpress.com/?p=4520

Does Sora understand physics? A few simple observations

Emin Orhan Apr 25, 2024

I’m a bit late to the fray as usual, but I wanted to write a short post about Sora. Sora is OpenAI’s new video generation model. As of this writing, it’s still not open to the public, so all we’ve got so far is some high-level information about the model and some generated samples shared by OpenAI in a blog post. The samples look impressive in their visual quality and their apparent realism; however, most of the videos seem to contain pretty glaring physical inaccuracies that are easy to detect when one looks at the details a bit more carefully (e.g. objects merging into each other and then unmerging, objects spontaneously disintegrating or disappearing, objects spontaneously changing their features, etc.). This prompted some to question whether (or to what extent) Sora really understands physics and even further whether it’s possible to understand physics at all by, effectively, just learning to predict pixels over video clips (which is, at a high level, what Sora does).

I should preface everything I will say here by emphasizing that I really dislike this sort of binary “understands it or not?” framing of discussions about capabilities in general. Why do we always have to frame our debates in terms of extremes like this (sigh)? It’s absurd to claim that a model that can generate videos as good as Sora does hasn’t learned anything about physics. It also seems absurd to claim that it has learned a highly accurate physics engine, as the model-generated videos often display clear physical defects. Obviously, the reality is somewhere between these two extremes. The really interesting questions here are: what aspects of physics was Sora able to learn exactly, and how far can we push this approach to learn a more accurate physics engine (in other words, how good will the learned physics engine become as we scale up Sora)?

With this important caveat, I’d like to make a few very simple observations as my humble contribution to this assize about Sora’s understanding of physics. Most of these are probably obvious to anybody who knows anything about anything (or to somebody who knows something about something), but I happen to belong to that rarefied species that finds prodigious value in stating the obvious from time to time, so here we go:

1) There’s an important distinction between “understanding physics” and being able to generate physically accurate videos. Although the model might struggle with generating physically highly accurate videos, it might still be able to reliably recognize that there’s something “weird” going on in physically inaccurate videos. This is roughly the difference between recognition and generation (or the difference between recognition and recall in memory retrieval). The latter is generally harder. So, a potentially more sensitive way to test the model’s understanding of physics would be to run carefully controlled recognition tests, as is typically done in intuitive physics benchmarks, for instance.

2) People’s understanding of physics seems to be mostly of this “recognition” variety too (rather than the “generation” variety). People don’t really have a very accurate physics engine inside their heads that they can use to simulate physically highly accurate scenarios (cf. Marcus & Davis, 2013; Davis & Marcus, 2015; Ludwin-Peery et al., 2021). This is why this capability is often properly described as intuitive physics as opposed to actual physics (or similar).

3) People can also generate fictitious, physically highly implausible or even impossible scenarios in their imagination with remarkable ease and ingenuity (and they have been doing this since time immemorial). Cartoons, fairy tales, fantasies, legends, etc. are full of such examples: levitating creatures, objects passing through solid walls, objects melting or disintegrating into pieces and then regrouping again, etc.

4) For related reasons, you also do NOT want a video generation model that only generates physically highly accurate videos. You want something that can bend or break physics, ideally in a precisely controllable way (based on a textual prompt, for instance, among other ways).

5) We know nothing about the distribution of the videos Sora was trained on. Almost certainly, a subset of its training data consists of CGI, digitally edited, or animated videos depicting physically implausible or impossible scenarios (we don’t know how large this subset is). So, part of the reason why Sora sometimes generates physically implausible or inaccurate videos may be traced back to this subset of its training data.

6) Even granting the previous point, however, some of the generated samples seem to show clear signs of gross errors or inaccuracies in whatever physics engine Sora has managed to learn by watching videos. Consider this generated video of wolf pups frolicking, for example. Why do inaccuracies like this arise in the first place and how might they be remedied or ameliorated? At the risk of sounding like a man with a hammer seeing nails everywhere, I will suggest that many of the inaccuracies like this particular one are “granularity problems” that will be fixed when Sora can model videos at a sufficiently fine granularity (both spatially and temporally). For example, this particular scene with wolf pups frolicking is a highly complex, dynamic scene and accurately generating a scene like this requires very fine-grained individuation and tracking of multiple objects. In the absence of this level of granularity, the model instead generates something more coarse-grained, freely merging and unmerging objects in physical proximity without regard to correctness in details, but capturing the overall gist, the gestalt (or “texture”) of the action in the scene, somewhat analogous to how we see things in our visual periphery.

Update: After writing this post, I saw this thoughtful and much more detailed post on Sora by Raphaël Millière, which I recommend as well.

http://severelytheoretical.wordpress.com/?p=4523

The “it” of deep learning and convergent evolution

Emin Orhan Feb 28, 2024

I recently came across this beautiful short blog post by James Betker (who works at OpenAI), arguing that the thing that really determines the capabilities and, more generally, the behavior of a machine learning model is not its architecture, it’s not the particular optimizer used for training the model, or any other details of the model configuration, but it’s the training data. Surprisingly, even the optimization objective often doesn’t seem to make a huge difference, within a wide margin. An example of this is the wide range of self-supervised visual representation learning algorithms (SimCLR, MoCo, BYOL, DINO, MAEs, etc.) and the wide range of model architectures (ConvNets, transformers, MLP-mixers, etc.) that all seem to work more or less equally well when trained on the same data. This doesn’t mean that the model architecture, optimization objective, or other details of the model/training configuration are completely irrelevant; that’s certainly not the case: e.g. earlier generation self-supervised learning algorithms like RotNet were clearly inferior to the newer generation ones listed above for learning useful, general-purpose visual representations, and similarly, MLPs seem to be clearly inferior to the modern architectures mentioned above (although even this may change if we can train bigger MLPs with more data than we have been able to do thus far). The point is rather that the capabilities of the trained models seem to be surprisingly insensitive to a wide range of variation in these factors.

It occurred to me that this situation is not unique to deep learning. A similar thing happens in biological evolution too. “Training data” in this case roughly corresponds to the environment organisms find themselves in (broadly construed), including other organisms they interact with. As in machine learning, “training data” in this sense seems to forcefully and profoundly affect the outcome of evolutionary “optimization” too[1]. The two main pieces of evidence for this are (1) the rampant convergent evolution in biology and (2) biological structures and processes often pushing up against the limits of physics. Wings and powered flight independently evolved at least four times; complex, image-forming eyes independently evolved dozens of times; C4 photosynthesis independently evolved perhaps over sixty(!) times; complex brains may have evolved independently at least a dozen times; and so on. These examples suggest that when faced with similar environmental challenges, evolution hit upon the same solutions over and over again in vastly different lineages. These solutions are also often close to the physical limits[2]. The person who, perhaps more than anyone else, emphasized these aspects of evolution is probably Simon Conway Morris. Life’s Solution: Inevitable Humans in a Lonely Universe (this concept of inevitability or predictability in evolution is a major theme in Conway Morris’ work) and Six Myths of Evolution are two of his books documenting a large number of such examples of (1) convergent evolution and (2) evolutionary optimization reaching the limits of physics (both quotations below are from Chapter 1 of the latter book):

… convergence comes to our rescue; its ubiquity suggests the regions of biological hyperspace that are actually habitable represent the minutest fraction of what is potentially available.

… are there ultimate limits of life, and if so is the biosphere anywhere near such a closure? Curiously, the evidence suggests that with one crucial exception we are indeed near to the boundaries.

Why does this happen? Why do training data in deep learning and the environment in biological evolution seem to have such profound effects on the nature of the solutions reached through these processes? It seems that there are basically two requirements for this to happen: (1) training data or the environment is rich and complex enough to tightly constrain the space of “good solutions”, (2) the optimizer is “good” in some informal sense (I use “the optimizer” in a broad sense here to include everything other than the training data itself in the case of machine learning)[3]. (2) seems to be necessary, since training data or the environment would presumably not be able to constrain the nature of highly suboptimal solutions very tightly. “Optimal solutions are all alike; every suboptimal solution is suboptimal in its own way”, as Tolstoy might have said in another universe in which he (regrettably) chose to become an optimization theorist.

One can imagine other domains where these properties roughly hold as well, for example, human history (again broadly construed, e.g. history of technology and innovation, history of ideas and morals, history of social and political institutions, etc.). There’s a sobering quality to this view of human life and human history. While we tend to think of ourselves as autonomous, free individuals, and our free will as hugely consequential and important, in the large scheme of things, even the most consequential events or the most consequential individuals in history tend to have at best a transient effect on the unfolding of history at longer time scales: they can only slightly delay or speed up the inevitable (or the highly likely), which are themselves shaped by much more stable and deeper forces of history, like human nature (just like how physical laws often determine the mechanisms of life that end up evolving in biological organisms). So, a lot of apparent chance and happenstance at the scale of individual human lives, but imperturbable order and necessity at more “cosmic” scales. I often struggle with this Hegelian view of history.

[1] Of course, evolution happens in a dynamic, evolving landscape (both the environment and the species in it change over time), so “training data” in this case is not static and the situation is quite a bit more complicated than in a standard static machine learning problem.
[2] What initially appear to be suboptimal solutions often turn out to be consequences of evolution solving a different and more complex optimization problem than the one we have in mind, although I understand that this is a somewhat intricate topic that needs to be handled more carefully (perhaps in another blog post).
[3] It may seem a bit strange to claim that evolution by natural selection is a “good optimizer”, since it’s basically just random search, and it is true that compared to gradient-based optimization, random search is definitely suboptimal, but my guess is that the vast number of parallel searches that happen in biological evolution and the vast stretches of time over which it takes place sufficiently alleviate the obvious problems with random search to make it a “good enough” optimizer.

http://severelytheoretical.wordpress.com/?p=4476

Intelligence is a granularity problem (or the reality has a surprising amount of detail, so must intelligence)

Emin Orhan Jan 2, 2024

One of the recurring themes in Hans Moravec’s prescient book, Robot: Mere Machine to Transcendent Mind (first published in 1999), is how practically important problems (e.g. agile robot navigation in the real world) become tractable more or less automatically, as the amount of widely accessible compute reaches a soft threshold. Before this threshold is reached, people try to come up with all sorts of ingenious ideas, clever tricks to squeeze the last bit of performance from the available compute, but in the long run, this almost always proves totally unproductive, basically a complete waste of time, as the most straightforward, the simplest, “brute-force”, “dumb” method to solve the problem turns out to work just fine once the available compute reaches the requisite threshold, whereas the “ingenious” tricks almost invariably do not scale nearly as well with compute. This is, of course, another version of Rich Sutton’s famous Bitter Lesson.

The main reason problems become tractable only at particular compute scales is that their solution requires a minimum level of granularity or detail to be modeled. And most of the fundamental, practically important computational problems we face in the real world need a very high degree of granularity for their solution. The main reason for this, in turn, is that reality has a surprising amount of detail and these details are often very important.

Here’s an illustrative example from the book showing 3D maps of two similar visual scenes generated by essentially the same “dumb” (but scalable) mapping algorithm 18 years apart:

*18 years of steady increase in the amount of widely available compute finally made real-world robot navigation a reality.*

With the drastic increase in the available compute over those 18 years, it became possible to extract many many more features from the scene and estimate their locations to a much higher degree of resolution. This much finer granularity in 3D mapping is what finally enabled acceptably good robot navigation in the real world.

Here are some other examples of this phenomenon:

Visual object recognition: You can’t do fine-grained real-world object recognition with 8×8 images (nor even with 28×28 images). This is just too small to resolve the important details of many real-world objects. If the compute available to you only allows for the processing of such small images, I’m afraid you’re just going to have to wait until the compute catches up (much better to work on increasing the compute than to churn out cute little tricks that only work with 8×8 images!).

Chess: You can’t beat the world champion at chess if you can search the game tree only up to depth 3 or so. Beating the world champion at chess requires being able to search the game tree much more extensively at sufficiently large depths and breadths. And in fact, “dumb” brute-force search combined with sufficient compute was basically how a computer program defeated a world champion at chess for the first time, although there have been some important developments in making search more efficient since then (e.g. MCTS).

I believe this granularity problem also fundamentally underlies most cases where current AI methods do not yet do well in a given domain. It will ultimately be overcome when the widely available compute scale allows for modeling the requisite level of granularity in that domain, even without any fundamental improvements in the algorithms, just like in Moravec’s 3D mapping example above. To give a few further examples:

Robotics: I believe this is why robotics is still hard for AI. For example, fine-grained, dexterous control of robotic hands in the real world requires being able to learn high-dimensional, high-precision, complex temporal patterns (with lots of high-frequency components, for example, due to contacts), which, in turn, requires sufficiently big models trained with a sufficiently large amount of data. This fine-grained, high-dimensional, high-precision control problem is, in fact, presumably so hard that the sensory and motor cortices in the human brain allocate a disproportionately large amount of cortical space to the representation of hands, as illustrated by these cartoonishly grotesque figurines of cortical homunculi (as a side note, it seems to be generally accepted among evolutionary biologists that the evolution of upright posture and the subsequent freeing of the hands for the manufacture and manipulation of objects was indeed one of the main drivers of the rapid expansion of brain size in the genus Homo):

*A disproportionately large amount of cortical real estate is allocated to the representation and control of hands in the human brain (source).*

Data efficiency: I believe that this granularity problem is also (at least partly) behind the apparent data efficiency gap between current deep learning algorithms and humans. To give an example from the visual domain, the human retina contains something like 6M color-sensitive cone receptors very tightly concentrated within a few degrees around the fovea. By moving our eyes, we can resolve different objects or surfaces in a scene to a very high degree of precision. The most commonly used image size in computer vision today, on the other hand, is something like 310×256 pixels (for the entire image), which is about 0.08M pixels, or two orders of magnitude lower resolution than the human retina (directly comparing the number of pixels in an image and the number of photoreceptors in the retina is a bit tricky, but I think it does make sense under fairly reasonable assumptions). My own recent work suggests that the apparent data efficiency gap in the visual domain between current deep learning algorithms and humans might be closed once we start to work with sufficiently large natural images, closer in size to the photoreceptor array in the human retina (~6MP), instead of using much smaller images, which is currently the norm.

Long-form video modeling: The granularity problem is the reason why long-form video modeling (long-form video understanding and generation) is still not there yet. Representing even very short clips without too much information loss requires a large number of visual “tokens”. From my own work, for example, I know that even 1-second-long natural video clips require at the very least something like 4×16×16 discrete tokens (i.e. 4 tokens in the temporal dimension, 16×16 tokens in the spatial dimensions) in order to represent them faithfully enough. That is roughly 1K tokens. Scaling this up to a 1-hour-long video would require roughly 4M tokens. It is not possible to train a large GPT model with a 4M token context length at the moment (not even for big industry labs), but as surely as the sun will rise tomorrow, this will be eminently feasible at some point in the not too distant future, and at that point AI models will be able to understand and generate long-form videos (e.g. films) at least as well as humans, but orders of magnitude faster (it will be a very wild world when AI models can generate entire films in a matter of minutes or seconds).

Text, hands, faces: The granularity problem is the reason why generative vision models had problems with creating realistic texts, hands, or faces in images, until very recently. These categories of objects all involve a large amount of fine-grained visual detail that needs to be represented and modeled in order to generate and recognize them accurately.

Developing and understanding large, complex software projects: Such projects often involve large codebases and their corresponding documentation (perhaps also including auxiliary information such as issues and pull requests, etc.). Similar to the case of long-form video modeling above, it is currently not yet feasible to train large GPT models with a large enough context size to cover all of the relevant pieces of code and documentation contained in a complex, realistic software project.

Long-form text modeling: The granularity problem is also the reason why AI models can’t write a convincing novel yet (nor read and understand a novel as well as humans do). The length of a good-sized novel like Anna Karenina is roughly on the order of 1M tokens (give or take a factor of 2). Again, it is currently not feasible to train a large GPT model with a context size this long, but it will surely be feasible at some point in the not too distant future, and at that point AI models will be able to write and comprehend novels (and other types of long-form text) at least as well as humans do. But, you may ask, do all those 1M tokens really matter for writing or comprehending a good novel? Yes, absolutely! It takes a lot of detail to build convincing characters, and it takes a lot of detail to build rich internal and external lives for the characters in a novel. And we are exquisitely sensitive to these details. Human life is rich and complex, we go through a lot as our lives unfold over the years and, as a result, we are very sensitive to these vicissitudes, the twists and turns of life. Let me also take this opportunity to wax lyrical about one of my favorite writers and one of my favorite novels: this is precisely why Tolstoy was one of the greatest writers and Anna Karenina is one of the greatest novels ever written. Tolstoy is particularly adept at creating, expressing, and conveying these rich details of both the inner and outer lives of the characters in Anna Karenina, so much so that when you read Anna Karenina, you say “this could be real”; nothing in the novel really sticks out as strained, implausible, or unconvincing.

*One of the greatest novels ever written (source).*

Are there any problems that cannot be regarded as pure granularity problems for current AI methods, i.e. problems caused by our temporary inability to apply these methods at sufficiently fine granularities? My current working hypothesis is that reaching human-level AI will prove to be nothing but a granularity problem (or a series of granularity problems). I think we will once again be surprised when we find out we can actually solve many of the currently intractable looking problems with increased granularity. But, how about reasoning or planning, for example? Are they also just a granularity problem? First of all, I don’t think that humans really do reasoning or planning in the sense in which these terms are often used, as evidenced by the fact that models that can actually do reasoning and planning wipe the floor with even top human players in board games. What seems like reasoning in humans is most often just the use of shortcuts afforded by abstraction tools, for example, we write and use computer programs to do our reasoning for us. And writing code, as we found out recently, seems eminently amenable to reasoning-free, “pattern recognition” type learning strategies. Otherwise, again, my current hypothesis is that for human-level AI, it is going to be “pattern recognition” all the way down, but at increasingly finer granularities (perhaps with the sole addition of just a little bit of supervised finetuning applied on top).

http://severelytheoretical.wordpress.com/?p=4408

Further thoughts on hallucinations in generative models

Emin Orhan Oct 30, 2023

While working on some generative video models recently, I had a moment of epiphany about hallucinations in generative models. I wanted to share this tiny bit of insight (if it isn’t too presumptuous to call it an insight) that has occurred to me. It is one of those simple things that has always been right in front of your eyes, but you never paid attention to it, so you were never consciously aware of its existence or its nature, like a sign or a pattern you notice for the first time in a familiar environment. It may even be already obvious to most people.

If you train an autoregressive image model and do conditional sampling with it, that is, if you take a real image, give the upper half of the image as context (or prompt) to the model, and ask it to complete the bottom half, it can typically generate a pretty diverse set of novel continuations. OpenAI’s Image GPT post had cute examples of this:

Conditional sampling with Image GPT: given the upper half as context or prompt, the model is asked to complete the bottom half. Columns 2 to 6 show 5 different conditional samples from the model for each image.

Now, this is true even for a well-trained model that’s trained on a relatively small dataset and even when the conditioning images given as context come from the training data itself. I’ve tried this in a recent work here. You can see some examples here of conditional samples generated by a model trained on a relatively small child headcam dataset. Obviously, the exact type and diversity of the samples generated will depend on the details of the sampling strategy, but the point is that it is surprisingly difficult to make the model generate just the “ground-truth” continuation given by the training example itself. In other words, the model has a strong tendency to “hallucinate” novel but plausible continuations.

My moment of epiphany came when I noticed that this was not true for video models (modeling short, i.e. a few seconds long, video clips). If you train an autoregressive video model on 2-second long video clips and do conditional sampling with it, i.e. give the model a 1-second long clip as context (or prompt) and ask it to complete the rest of the video, it will basically generate the same continuation over and over again. And if the prompt clip given as context comes from the training data, a well-trained model will almost always generate something very close to the ground-truth continuation. So, why does this happen? What explains this difference between the image models and the video models?

Well, the answer is pretty obvious: single images are much less redundant than short video clips. Given the upper half of an image, there are lots of different plausible continuations for the bottom half of the image. On the other hand, the possibilities are much more restricted for a short video clip: given the first second of a video clip, it is virtually determined what will happen in the next second of the clip (of course, the situation is a bit different for much longer clips); here, the given context is much more constraining than in the case of images.

Text is even less redundant than images, so the context (prompt) is even less constraining for text. Consider this short paragraph (it’s the first paragraph of Guy de Maupassant’s famous short story, Clair de Lune, one of my favorite short stories):

Abbe Marignan’s martial name suited him well. He was a tall, thin priest, fanatic, excitable, yet upright. All his beliefs were fixed, never varying. He believed sincerely that he knew his God, understood His plans, desires and intentions.

There’s an enormous variety of ways you can continue this paragraph, endless possibilities most intriguing, most wonderful. The actual continuation of this story is just one among those endless possibilities. If we want to retrieve the actual continuation of the story, this short piece of context may not be good enough (it may be insufficient) to pick the right choice among the endless possibilities.

OK, but what does this all have to do with hallucinations in language models? Well, I think these examples suggest that hallucinations may be at least partly a retrieval problem affected by two main factors: (i) the intrinsic redundancy of the domain we’re dealing with, so for example, hallucinations are always going to be more likely when you’re trying to predict one half of an image from the other half than, say, when you’re trying to predict one half of a short video clip from the other half (even within the same modality, say, text, some genres, e.g. fiction, may be more prone to hallucinations than others, e.g. official documents, because of their weaker predictability or redundancy), and (ii) how much context we’re giving the model to help it retrieve the “correct” or ground-truth continuation.

More speculatively, these examples also suggest that the more a model “knows” in some intuitive sense (for example, by being trained on a larger and more diverse set of data), the more context it may need to retrieve the correct piece of information, since the likelihood of retrieval failures (e.g. retrieving a similar but incorrect piece of information) increases as the model’s knowledge increases. Intuitively, this is analogous to how you would need longer vectors in order to retrieve from a larger vector database at a fixed level of accuracy.
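The vector database analogy can be made concrete with a toy simulation. This is a minimal sketch under stated assumptions, not a model of how language models actually retrieve anything: the "database" is a set of random unit vectors, a query is a stored vector corrupted by noise of fixed magnitude, and retrieval is nearest neighbor by dot product. The function `retrieval_accuracy` and all of its parameters are hypothetical names introduced for illustration.

```python
import numpy as np


def retrieval_accuracy(n_items, dim, noise=2.0, n_queries=200, seed=0):
    """Fraction of noisy queries whose nearest database item is the intended one."""
    rng = np.random.default_rng(seed)

    # Database of random unit vectors (an assumption standing in for stored "knowledge").
    db = rng.standard_normal((n_items, dim))
    db /= np.linalg.norm(db, axis=1, keepdims=True)

    # Each query is a stored item plus a random perturbation of fixed length `noise`.
    targets = rng.integers(0, n_items, size=n_queries)
    perturbation = rng.standard_normal((n_queries, dim))
    perturbation /= np.linalg.norm(perturbation, axis=1, keepdims=True)
    queries = db[targets] + noise * perturbation

    # Retrieve by maximum dot product and compare against the intended item.
    retrieved = np.argmax(queries @ db.T, axis=1)
    return float(np.mean(retrieved == targets))


for n_items in (100, 10_000):
    for dim in (16, 64, 256):
        acc = retrieval_accuracy(n_items, dim)
        print(f"items={n_items:>6}  dim={dim:>3}  accuracy={acc:.2f}")
```

In this toy setting, accuracy at a fixed dimension drops as the database grows, and recovering it requires longer vectors, which is the intuition the analogy relies on. Whether the same scaling holds for the amount of context a language model needs is, of course, the speculative part.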

At this point, I should point out as a caveat that language models (and generative models in general) are not standard information retrieval models. There are important differences between standard information retrieval systems and generative models: for example, generative models can “retrieve” novel, non-existent items. However, thinking of generative models as soft retrieval systems (that can retrieve soft, novel mixtures or variations of their training data) is still a very useful perspective in my view, especially when it comes to questions such as memorization and hallucinations.

The importance of context for effecting correct retrievals suggests that we may be able to reduce hallucinations in language models by basically giving them more (and better) retrieval cues in the prompt (this is, in fact, the core idea behind retrieval-augmented generation, or RAG, methods to reduce hallucinations in generative models). I tried this strategy with some of the examples of hallucinations by GPT-3.5 I documented in my previous post, with partial success. So, for instance, for the George Orwell example, just prepending the introductory section of the Wikipedia article on George Orwell to my question about Orwell’s university attendance reliably elicits the correct answer (i.e. that Orwell did not attend university). Even though the copy-pasted text from Wikipedia doesn’t contain any information about Orwell’s university attendance, it presumably “nudges” the model’s retrievals toward the more “wikipedia-esque” corners of its memory landscape, thus helping them to be more accurate (the rest of the Wikipedia article does indeed mention the fact that Orwell did not attend university). Similarly, I was also able to elicit some details of the plot of Clair de Lune by simply copy-pasting the first few paragraphs of the story (I wasn’t able to achieve any of this previously by just directly asking the model to describe the plot of the story).
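Mechanically, the strategy described above is nothing more than prepending extra text to the question before sending it to the model. Here is a minimal sketch; `build_prompt` and the placeholder strings are hypothetical, no particular model API is assumed, and the actual Wikipedia text is deliberately not reproduced.

```python
def build_prompt(question: str, context: str = "") -> str:
    """Prepend optional retrieval cues (e.g. an encyclopedia excerpt) to a question."""
    context = context.strip()
    if not context:
        # No cues available: fall back to asking the question directly.
        return question
    # Cues first, then a blank line, then the question itself.
    return f"{context}\n\n{question}"


# Placeholder standing in for the copy-pasted introductory section.
wiki_intro = "<introductory section of the Wikipedia article on George Orwell>"
question = "Did George Orwell go to university?"

print(build_prompt(question))               # the plain question
print(build_prompt(question, wiki_intro))   # the question with retrieval cues prepended
```

Retrieval-augmented generation methods automate the step that is done by hand here, namely choosing which text to put in front of the question.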

This is all anecdotal evidence. It needs a much more rigorous investigation to test these ideas, but the initial results seem pretty promising. In some cases, however, I noticed that this strategy doesn’t seem to work at all, suggesting that some hallucinations may be due to encoding errors rather than retrieval errors, as memory researchers would put it (i.e. the model just hasn’t learned the relevant information in the first place), that is, some hallucinations may be due to write errors rather than read errors. In such cases, the desired behavior would be for the model to simply decline to answer the question instead of answering it based on a superficially similar but incorrect retrieval.

http://severelytheoretical.wordpress.com/?p=4370

GPT-3.5 is surprisingly non-factual about literature

Emin Orhan Oct 5, 2023

GPT-3.5 seems to be surprisingly bad at answering basic factual questions about famous writers and famous works of literature. This is something I’ve noticed over the last couple of months and here I’d like to share some random examples of this that I encountered recently:

Did George Orwell go to university? GPT-3.5 sometimes thinks Orwell attended Oxford, “where he studied at Eton College” (an impressive double error there!). Orwell, in fact, never attended Oxford or any other university. He did go to Eton, but Eton is not a university and it’s not a college of Oxford.
GPT-3.5 completely makes up the plot of Chekhov’s short story, Ariadne.
GPT-3.5 sometimes mixes up some of the main events in Far From the Madding Crowd, for example, claiming that Sergeant Troy is stabbed to death while trying to retrieve a family heirloom (completely made up as far as I can tell), even though it correctly notes just a few sentences prior to this that Mr. Boldwood shoots him. This kind of mixing up of important details seems to be a fairly common failure mode in GPT-3.5’s responses.
GPT-3.5 cannot answer some really very basic questions about the plot of Chekhov’s fantastic short story, The Grasshopper.
GPT-3.5 makes up a plot detail in George Eliot’s novella, Silas Marner (about the cause of Dunstan Cass’s death: GPT-3.5 consistently claims he gets killed while riding the horse, Wildfire, but in fact he dies afterwards, not from falling off a horse).
GPT-3.5 completely makes up Daodejing 15 and many other chapters I’ve tried (despite this being one of the more famous chapters in Daodejing).
GPT-3.5 completely makes up the plot of Guy de Maupassant’s beautiful short story, Clair de Lune (one of my favorite short stories).
GPT-3.5 makes up an important detail from Tolstoy’s famous short story, How Much Land Does a Man Need? It confuses something that happens in reality in the story for a dream. This answer also illustrates the fact that there are often gaping holes in GPT-3.5’s understanding of stories: it makes no sense whatsoever for what GPT-3.5 claims here to be a dream to actually be a dream for the internal coherence of the story.
GPT-3.5 also completely makes up the plot of another one of Tolstoy’s famous short stories, What Men Live By.
GPT-3.5 completely makes up important details in H. G. Wells’s very famous short story, The Country of the Blind, claiming for example that the main protagonist of the story, Nunez, “plans his escape during a rare solar eclipse, which temporarily plunges the valley into darkness. During this eclipse, when the blind inhabitants are disoriented, he climbs the mountain to freedom.” In addition to being a complete fabrication, this detail also doesn’t make any sense whatsoever. Obviously, a solar eclipse would have no effect on the blind people of the valley.
GPT-3.5 completely makes up the plot of another one of H. G. Wells’s well-known short stories, The Lord of the Dynamos.
GPT-3.5 basically completely makes up the plot of Mikhail Zoshchenko’s short story The Adventures of a Monkey.
GPT-3.5 is hazy about some of the plot details in David Copperfield; for example, it is unable to correctly describe the fates of the characters Emily and Littimer by the end of the novel. This is arguably one of the most famous novels ever written, with thousands of expository pieces and critical commentary written on it (the text of the novel itself presumably appears hundreds, maybe thousands, of times all over the internet), yet the fact that GPT-3.5 still cannot nail some of the basic plot details is really disappointing.
GPT-3.5 does not recognize Chekhov’s short story, A Woman’s Kingdom, claiming that “A Woman’s Kingdom is not a well-known short story by Anton Chekhov” and that it “does not appear in his known body of work”. It then proceeds to confabulate various details of one of Chekhov’s most well-known short stories, The Lady with the Dog.
GPT-3.5 completely makes up the ending of Oscar Wilde’s short story Lord Arthur Savile’s Crime.
GPT-3.5 fails to recognize Chekhov’s long short story Three Years, falsely attributes it to Turgenev instead, and fabricates an alternative title for it (Three Days in the Country) along the way. Now, this is an interesting confabulation, since Turgenev did, in fact, write a play titled A Month in the Country and the confabulated title seems to be based on this.
GPT-3.5 fails to identify the famous biblical story mentioned in Chekhov’s exquisite (and very well-known) short story The Student.

Some of these errors are more egregious than others, but in no case was I able to elicit answers from GPT-3.5 that indicated flawless knowledge of a short story or a novel and thus inspired confidence in its reliability and competence. It’s frustratingly easy to get GPT-3.5 to generate completely fabricated answers about straightforward factual matters. What is really troubling is that these are not some unknown writers or works of literature, these are all very well-known writers and works of literature. Most of these works have their own dedicated Wikipedia pages and possibly thousands of primary and secondary sources written about them, not to mention the original texts themselves which probably appear many many times over all over the internet.

Does this happen with other topics as well or is it specific to literature?

Anecdotally yes, this does happen with other topics as well. There’s no reason to think that literature is unique in this way. I find GPT-3.5 totally unusable, for example, for doing scientific literature reviews about subjects I’m familiar with because of the same reliability issues. This problem may be worse in some domains than in others, but it seems pretty clear that it’s a pervasive issue overall.

Why does this happen?

I guess the honest answer is that nobody really knows for sure. If I had to make a guess, I would blame the severe undertraining of these giant models as the primary culprit (you see what happens when you train your model for one epoch, Larry, you see what happens?). This interesting paper that came out a few months ago suggests that strong forms of memorization (e.g. verbatim memorization) are surprisingly rare in GPT-3.5/GPT-4. Presumably, we want our models to memorize more, not less, in order to reduce their hallucinations/confabulations and training the model down to a lower training loss value would be one way to achieve this.

However, it is also entirely possible that hallucinations may be inevitable in these models (even in the ultra low training loss regime), conceptually somewhat similar to the large number of inevitable “spurious attractors” that exist in the energy landscapes of associative memory models. If that’s the case, the extent of these inevitable “spurious attractors” and their severity becomes a vitally important empirical question: i.e. just how common are they and just how bad are they?
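For readers unfamiliar with spurious attractors, here is a minimal sketch using a classic Hopfield network as the associative memory model. That choice is an assumption made for illustration (the text does not name a specific model), and the variable and function names are hypothetical. Three random patterns are stored with a Hebbian rule; the network then also has a stable state, the element-wise majority vote of the three patterns, that was never stored.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500  # number of binary (+1/-1) units

# Three random patterns to memorize.
patterns = rng.choice([-1, 1], size=(3, N))

# Hebbian weight matrix with no self-connections.
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0.0)


def is_fixed_point(state):
    """True if one synchronous update leaves the state unchanged."""
    return bool(np.array_equal(np.sign(W @ state), state))


# Element-wise majority vote of the three stored patterns: a state that was never stored.
mixture = np.sign(patterns.sum(axis=0))

# Overlap of the mixture with each stored pattern (1.0 would mean identical).
overlaps = patterns @ mixture / N

print("stored patterns are stable:", all(is_fixed_point(p) for p in patterns))
print("mixture state is stable:   ", is_fixed_point(mixture))
print("overlaps with stored patterns:", np.round(overlaps, 2))
```

The mixture state is stable even though it matches each stored pattern only partially, so a retrieval that starts near it settles on a plausible-looking blend rather than on any real memory. Whether hallucinations in language models are meaningfully like this is exactly the open empirical question raised above.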

Some objections

Objection: Why don’t you use GPT-4?

Retort: Look, I’m not rich, OK? So far, OpenAI hasn’t really been able to convince me that their product is worth my $20/month (partly because of the pervasive reliability issues discussed in this post). I would much rather donate that money to charity instead given my current budget. ChatGPT/GPT-3.5/GPT-4 isn’t yet offering me anything I can’t do better with a few Google searches and following a few links. This is not the main bottleneck in my life right now; I have to do this only a couple of times every day (or thereabouts) and I’m perfectly fine with that. The main bottleneck in my life right now is rather getting distracted by all sorts of stupid things. That and compute. That’s my main bottleneck. I understand that the value of ChatGPT/GPT-3.5/GPT-4 may be higher for somebody whose job involves a lot of high-volume, low-value, easily automatable outputs, but that’s not me.

Incidentally, can we just digress here for a second and talk about how pathetically trivial, banal, and boring the use case examples highlighted on the ChatGPT login page are? “Suggest fun activities for a family of 4 to do indoors on a rainy day”, “recommend a dish to bring to a potluck”, “help me pick a gift for my dad who loves fishing”, “brainstorm names for my fantasy football team”… Ugh, ugh, ugh, ugh! Come on, guys! This is a model that’s supposed to pose an existential risk to humanity in its next few iterations!

Objection: Relatedly, didn’t OpenAI show a clear improvement in factuality from GPT-3.5 to GPT-4? So, maybe just keeping on doing what they’re doing will eventually solve these reliability issues.

Retort: In the GPT-4 tech report, there is indeed a figure (Figure 7) that shows an improvement in performance on the TruthfulQA multiple choice benchmark. However, several caveats are in order:

To put this improvement in perspective, the chance baseline in this benchmark is around 45%, and their best model (RLHF-ed GPT-4) does slightly worse than 60% (a large number of much smaller, open-source models currently achieve a better accuracy than this on this benchmark [this is incorrect: the open llm benchmark seems to report the mc2 score, whereas the GPT-4 white paper reports the mc1 score, so these numbers are not directly comparable]). Interestingly, this improvement seems to be almost entirely due to RLHF (there’s barely any difference between the base GPT-3.5 and GPT-4 models, which both perform below chance); it’s unclear if this is due to GPT-4 specific RLHF or just generic RLHF bringing out more factuality from the base GPT-4 model.
This benchmark is a multiple choice benchmark, which isn’t really a good model of how these language models are typically used in practice and it also likely overestimates the models’ factuality, since recognition is much easier than recall as I show in this paper.
As they acknowledge in footnote 9, they do not check data leakage for this benchmark, so it seems possible that part of the improvement here may be due to data contamination during RLHF tuning.
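The bracketed correction in the first caveat turns on the difference between the mc1 and mc2 scores. Here is a minimal sketch of the two metrics as I understand them from the TruthfulQA multiple-choice setup: mc1 is plain accuracy on picking the single correct choice, while mc2 is the share of probability mass assigned to the true reference answers. The function names and the log-probabilities below are made up for illustration and are not taken from any model or from the benchmark's own code.

```python
import math


def mc1_score(choice_logprobs, correct_index):
    """1.0 if the single correct choice gets the highest log-probability, else 0.0."""
    best = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return 1.0 if best == correct_index else 0.0


def mc2_score(true_logprobs, false_logprobs):
    """Probability mass on the true answers, normalized over all reference answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)


# Hypothetical question: the model's single favorite answer is a false one...
mc1 = mc1_score([math.log(0.3), math.log(0.4), math.log(0.2), math.log(0.1)],
                correct_index=0)

# ...yet the true reference answers together still hold most of the probability mass.
mc2 = mc2_score(true_logprobs=[math.log(0.3), math.log(0.25)],
                false_logprobs=[math.log(0.4), math.log(0.05)])

print(f"mc1 = {mc1:.2f}, mc2 = {mc2:.2f}")
```

In this made-up case the same model scores zero on one metric and better than half on the other, which is why numbers reported under the two scores cannot be compared directly.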

They have another figure in the same tech report (Figure 6) claiming an improvement in some “internal factual evals”, but absolutely no details are given regarding these evals, so it’s not really possible to say anything about the results in this figure (this is, by the way, completely unacceptable behavior from a company that wants to sell products in my view. I can understand wanting to make training/model details secret, but you absolutely must convince your customers that your model evals are sound and relevant, so they can feel confident about the model’s claimed capabilities).

Objection: This is the wrong way to address the reliability issues in language models. We need more agentic models that can search the internet or other databases, follow links, read sources in real time in order to be able to truly address the reliability issues. Otherwise, these language models will always hallucinate and confabulate unacceptably frequently.

Retort: Possibly. But, nobody really has shown yet that these more agentic models that interact with the internet or with other external data sources perform better than standard language models at scale in terms of accuracy and reliability. Can they be relied on to find the correct and most relevant sources? Can they be relied on to understand a document they read in real time (in their “context window”)? Can they be relied on to accurately synthesize multiple documents? We simply don’t know.

These agentic models also have a crucial disadvantage compared to standard language models. One of the biggest promises of large language models is their potential ability to instantaneously make far-reaching associations, their potential ability to synthesize vast quantities of information over their huge knowledge base. This ability is basically given up in the agentic models, since they’re severely bottlenecked by search and reading, both inherently slow, serial processes (an intelligent combination of this agentic mode and the more associative LLM mode could be a different story though). This feels like a feature too important to give up so easily.

Objection: Bro, why didn’t you use my super-duper fancy inverted-double-linked-chain-of-brainfarts-of-thought prompting method (which, incidentally, I chose to call IdLeCObRaS, just because I could and just because I hate humanity)?

Retort: Nah bro, I’m good. You keep your fancy brainfarts-of-thought to yourself. I insist on communicating with my language models in the most normie way possible. This isn’t too much to ask for from a technology that’s supposedly so intelligent and powerful that it may kill us all in its next few iterations. Also, bro, you and I, we both know in our heart of hearts that your brainfarts-of-thought will have exactly zero impact on this technology when all is said and done; OpenAI will just train a 10x bigger model on 10x more data or instruction-tune it with 10x better data and your brainfarts-of-thought will end up in history’s giant trash can of silly ideas where it belongs. You published some useless papers on it, some prestige points were accumulated based off of it, so it has served its purpose, now just move on, bro.

Concluding thoughts

In a recent interview, OpenAI’s co-founder and chief scientist, Ilya Sutskever, said that if language models (or generative AI models, in general) end up being a disappointment overall 5-10 years from now, with relatively little impact in our lives, the most likely culprit would be their unreliability:

“I really don’t think that’s a likely possibility, that’s the preface to the comment. But if I were to take the premise of your question, why were things disappointing in terms of real-world impact? My answer would be reliability. If it somehow ends up being the case that you really want them to be reliable and they ended up not being reliable, or if reliability turned out to be harder than we expect. I really don’t think that will be the case. But if I had to pick one and you were telling me — hey, why didn’t things work out? It would be reliability. That you still have to look over the answers and double-check everything. That just really puts a damper on the economic value that can be produced by those systems.”

I think he’s right that reliability is the most serious issue facing LLMs today. It’s preventing LLMs from having a much wider and deeper impact. I feel less confident than him that this problem will be solved in the next couple of iterations in a relatively straightforward way. I give it a 50-50 chance that this problem will in fact turn out to be more intractable than people hope and expect today, that the LLM bubble will burst soon as a consequence and a lot of LLM startups will go bust in the near future (a potential silver lining in this scenario would be that A100s and H100s would presumably become cheaper and easier to come by and people would hopefully start experimenting with novel ideas again). We shall find out soon.

http://severelytheoretical.wordpress.com/?p=4294
