
Interpretability and AI Scaling with Eric Michaud
MIT PhD student Eric Michaud discusses neural network interpretability, model pruning for creating narrow AI, and the future of scaling and reinforcement learning.
Episode Summary
Anthony Campolo welcomes back MIT PhD student Eric Michaud to discuss his recent preprint "On the Creation of Narrow AI," which explores how large general-purpose neural networks can be reduced into smaller, task-specific models through pruning rather than distillation. The conversation establishes key concepts like interpretability — the effort to understand the internal computations of neural networks — before walking through the paper's central finding: that regularization techniques can concentrate a network's relevant computation into a small subset of neurons, enabling effective pruning without catastrophic performance loss. This connects to a broader assumption in interpretability research that any individual task a large model performs could, in principle, be handled by a much smaller network.
The discussion branches into curriculum learning effects in training, the phenomenon of grokking where networks memorize before generalizing, and the competitive dynamics between narrow memorization circuits and general reasoning circuits within a single model. Michaud and Campolo then explore practical implications including safety benefits of narrow models, Sam Altman's vision for small reasoning-focused models, the role of reinforcement learning in expanding inference-time compute, and the current state of scaling laws. They close with a candid exchange about model evaluation, the naming confusion across frontier model providers, and honest uncertainty about the pace of progress toward more general AI systems.
Chapters
00:00:00 - Introduction and Neural Network Basics
Anthony Campolo introduces returning guest Eric Michaud, a PhD student at MIT studying how neural networks learn and the mechanisms they develop during training. Eric provides a concise distinction between neural networks as general computational systems and language models as neural networks specifically trained to predict text, noting that modern language models are built on the transformer architecture.
The conversation touches on the history of neural networks in practical applications, particularly Google Translate's shift from rule-based linguistics to neural network-based translation around 2016-2017. Both hosts reflect on a well-known New York Times Magazine article covering that transition, and Anthony shares a personal anecdote about testing Google Translate's quality on a Descartes quote, highlighting how translation went from seemingly impossible for machines to remarkably effective in just a few years.
00:05:14 - Internship Reflections and Interpretability Defined
Eric shares his experience interning at Goodfire, an interpretability research startup in San Francisco, noting that the experience reinforced how accessible high-level AI work can be and how active the San Francisco funding environment remains. The conversation then shifts to defining interpretability as the practice of understanding how neural networks perform their internal computations, analogous to neuroscience's study of brain circuits but with the significant advantage that artificial networks allow complete access to every neuron's activity.
Eric explains that this full observability creates an opportunity to develop a science of minds that isn't possible with biological brains, where tools like MRIs provide only coarse descriptions. The discussion uses the Golden Gate Claude experiment as a vivid illustration — researchers found a neuron combination encoding the concept of the Golden Gate Bridge and amplified it, causing the model to exhibit apparent self-awareness that something was wrong with its cognition, raising questions about model experience and the bluntness of current interpretability interventions.
00:14:04 - The Paper: Narrow AI and Network Pruning
Eric introduces his preprint "On the Creation of Narrow AI," which examines the distinction between narrow and general intelligence in neural networks. The paper explores whether large general models, which learn thousands of capabilities through broad text prediction, can be decomposed into smaller task-specific networks. This connects to a core interpretability assumption: that on any individual problem, a large model's computation could theoretically be replicated by a much smaller network, even though the full model needs its size to cover all possible tasks.
The discussion clarifies how this work relates to but differs from distillation, where a smaller student network is trained to replicate a larger teacher network's outputs. Eric's approach instead uses pruning — directly removing neurons from a pre-trained network — which requires open-source model access. He explains that standard distillation only trains the student to match the teacher's final output layer, whereas pruning attempts to preserve the actual internal mechanisms, making it a fundamentally different technique for creating narrow models.
00:24:18 - Curriculum Learning and Training Dynamics
Eric explains the paper's theoretical contribution around curriculum learning effects, drawing an analogy to human education where subjects are taught in progressive order. Unlike humans, language models typically see all training data simultaneously without progression, encountering calculus problems as frequently at the start of training as at the end. The paper demonstrates that for certain toy tasks, training on easy problems first dramatically improves the network's ability to learn harder problems.
The conversation examines the paper's visualization of network weights, where blue and red dots represent positive and negative connection strengths between neurons. Eric explains a key challenge: neural network representations are highly diffuse across many neurons rather than neatly localized, meaning that pruning even a few neurons can degrade performance because the network was never trained to be robust to such removal. The headline finding is that regularization training can concentrate relevant computation into fewer neurons, making effective pruning possible and revealing the smaller network hidden within the larger one.
00:42:07 - Implications for Safety, Efficiency, and Model Design
The discussion turns to practical motivations for creating narrow models, including running on smaller devices, reducing compute costs, and safety considerations. Eric argues that narrow task-specific systems feel more like controllable tools than potentially risky general agents, and that effective pruning could eventually allow researchers to extract specific capabilities like reasoning while removing memorized knowledge or unwanted skills.
Eric references Sam Altman's vision of very small models with large context windows that rely on tools and in-context learning rather than massive pre-trained knowledge stores. Anthony raises the fundamental question of whether narrow abilities can truly be separated from general training, since current evidence suggests broad training is what enables specific capabilities. Eric acknowledges this is exactly the right question, pointing to ongoing theoretical work by researchers like Ari Brill modeling when networks develop general versus narrow skills and the competitive dynamics between different internal circuits.
00:50:27 - Grokking, Unlearning, and Circuit Competition
Eric explains grokking, a phenomenon where networks trained on small datasets first memorize all training examples quickly, then much later in training suddenly generalize to solve the underlying task. He frames this through the lens of competing circuits: many low-complexity memorization mechanisms form early, while a single higher-complexity generalizing circuit takes longer to develop. Once the general circuit forms, regularization eliminates the now-redundant memorization circuits, leaving a simpler network that truly understands the task.
The conversation connects this to the challenge of machine unlearning, using the example of attempting to make a model forget Harry Potter. Because knowledge is diffusely encoded throughout the network, removing specific concepts is complicated by their connections to broader representations — the concept of "wizard" links deeply to Harry Potter as one of its most famous instantiations. This reinforces the central tension of whether different mechanisms within a network can truly be disentangled from one another.
00:56:01 - Scaling Laws, Reinforcement Learning, and Inference Compute
Eric describes his scaling laws model, which frames scaling as progressively learning more niche tasks relevant to smaller subsets of data, suggesting that pre-training improvements may be plateauing because gains become increasingly imperceptible to most users. The conversation shifts to reinforcement learning's role in expanding what models can learn by enabling chain-of-thought reasoning, which dramatically increases the compute available for answering individual questions compared to a single forward pass.
Anthony asks about the compute constraints facing academic researchers versus frontier labs. Eric notes that interpretability work has an advantage since researchers can study existing open-source models without needing to train from scratch. He mentions the startup Mechanize as building reinforcement learning environments for economically valuable tasks, contrasting the current success on competition math with the harder challenge of training models to replace people at practical jobs, which requires much more sophisticated training environments.
01:24:44 - Model Comparisons, Naming Confusion, and User Experience
The hosts compare their experiences with current frontier models, discussing the debate over whether Claude Sonnet 4 outperforms Opus 4.1 and the widespread frustration with OpenAI's GPT-5 naming scheme, which includes seven different versions across consumer and API interfaces. Eric notes Anthropic's competitive timing of releases against OpenAI announcements, while Anthony argues that Anthropic's simpler, more linear naming convention and consistent toggle for extended thinking creates a better user experience.
They discuss the challenge of evaluating model quality, with Eric noting the subjective bias where hype around new releases colors user perception, and later disappointment gets misattributed to model changes. Anthony argues the most reliable evaluators are daily power users with concrete tasks rather than benchmark designers creating contrived edge cases, referencing Kelsey Piper's chess puzzle test. The conversation touches on inference speed as potentially the biggest current bottleneck, with Eric connecting this to challenges in training models for long-context reasoning where single rollouts can take hours.
01:44:11 - The Future of AI and Closing Thoughts
Eric expresses genuine uncertainty about the trajectory of AI progress over the next few years, pushing back against confident AGI timelines. He suggests that continued scaling on increasingly diverse data may simply add more capabilities incrementally rather than producing a qualitative leap to general intelligence, and that models will likely accelerate their own development gradually rather than through a sudden breakthrough.
Anthony and Eric close with lighthearted exchanges about AGI definitions, whether ChatGPT could outperform a doctor, and Eric's upcoming thesis defense at MIT, expected within roughly a month. Anthony teases future episodes covering multimodal AI topics and invites Eric to return to discuss his completed thesis. The episode wraps at approximately 01:51:39, with Anthony noting upcoming streams about improvements to his AutoShow CLI tool.
Transcript
00:00:02 - Anthony Campolo
Alright, welcome back everyone to another episode of AJ and the Web Devs with returning guest, soon-to-be Doctor Eric Michaud, but not quite yet. You still have to defend your thesis, right? Yes. Awesome. Welcome back to the show. Feel free to give a quick introduction for those who didn't see the past one about who you are and what you do.
00:00:27 - Eric Michaud
Well, yeah. Thanks. It's good to be talking again. I'm a PhD student at MIT. That's what I've been doing for the last four years. I've been thinking about neural networks and the mechanisms they learn and how they come to learn those mechanisms. So I'm thinking about how neural networks learn, how they're trained, how they scale, and how we could try to understand what they learn.
00:00:50 - Anthony Campolo
That's awesome, man. It's such a big question when you really think about how deep these models go, especially with how complex the newest ones are. Now we have all these things like reasoning and multimodal on top, but you're really going into the guts of the neural networks themselves on a more general basis. Am I right about that?
00:01:18 - Eric Michaud
Yeah, I've mostly been thinking about language models, but my work in general is thinking about basic principles that might apply to any neural network.
00:01:30 - Anthony Campolo
Okay. Last time we did a lot of definitions of things like neural network and stuff like that. There are a couple terms that will be useful to define here, but just to hook in on what you said, what's the difference between a neural network and a language model?
00:01:47 - Eric Michaud
Sure. A neural network is a general system containing lots of computational units called neurons that are networked together into this big blob of computation. Language models are neural networks that are trained specifically to predict language.
00:02:12 - Anthony Campolo
Language models are neural nets?
00:02:15 - Eric Michaud
When people talk about language models today, they are referring to things like transformers, which are a type of neural network. In the past, people did machine learning on language that didn't necessarily use neural networks, but it was pretty different.
00:02:34 - Anthony Campolo
Things like LSTMs and recurrent neural nets. I guess that's still a neural net, but...
00:02:42 - Eric Michaud
Yeah. I don't know, like Google Translate. I think at one point it used a bunch of fairly sophisticated linguistics, like syntax parsing, and a lot of hard-coded stuff. Then around 2016 or 2017, they switched over to using neural networks and ended up improving the translation, if I remember correctly.
00:03:12 - Anthony Campolo
I actually remember when this happened, and there was a headline, like a New York Times article or something, written about this process where they interviewed the whole team and went into super depth about it. I remember finding it very interesting.
This was back when I was first starting to blog about tech topics, like in 2019 or something. I was writing a blog about the history of automatic translation, like before computers. One of the things that I did is I took a Descartes quote, I think, and I took the actual French and put it into Google Translate and then got the translation back, which was immaculate. It was such a crazy moment where I was just like, okay, we've actually, to a certain extent, solved translation, which most people would have said isn't even possible for a computer to do ten years ago.
00:04:05 - Eric Michaud
Yeah. I think it's a New York Times Magazine article, like you said. It's incredible. To see the history there with Geoff Hinton and Jeff Dean, it was very well done. I think it was one of the things that partially made me excited about all these things many years ago.
00:04:22 - Anthony Campolo
Because it's something that gets down to something useful and intrinsically human, communication and breaking down communication barriers. Even if it's not going to be a perfect translation, a professional translator will be able to quibble with any of these automatic translations and say, oh, they used this word, they should use that word. But the more important thing is, if I want to communicate with someone who speaks a different language, I can now do that.
00:04:50 - Eric Michaud
Yeah. It's been very useful for me. I mean, being in a foreign country once, I was feeling quite sick on the car ride back from the airport, and it was like, we need to pull over. I'm glad I had Google Translate to handle both the transcription and then the dictation of the translation. All three parts of that are AI.
00:05:14 - Anthony Campolo
So it can speak it. Cool, man. I know you just finished something pretty awesome. You did some sort of internship type thing, which I know you're not going to be able to talk a whole lot about, but I would be curious to hear what you've learned or gained now that you've gone through this process. You probably learned a bunch while doing it, so any insights you've gleaned from it that you want to talk about, feel free.
00:05:44 - Eric Michaud
Yeah. I was interning at a company called Goodfire. It's an interpretability research lab startup in San Francisco. I wrote up this little paper, which we'll probably be releasing soon, but overall I got the sense that you can just do things. It's kind of the joke phrase, but ultimately people at these companies, even companies like this one, which have raised like 50 million, they're just people, and it's not unimaginable that anyone could do something like that, even though it is a special team they put together. Things feel possible, and things are pretty frothy in San Francisco. There's a lot of money flying around for now.
00:06:44 - Anthony Campolo
Yeah, I just imagine that money spigot has to turn off at a certain point. It happened within DevRel at a certain point. So AI, I'm sure, will get that eventually. But for now, there's really no end in sight, as far as I can tell.
00:06:59 - Eric Michaud
Yeah.
00:07:00 - Anthony Campolo
Awesome, dude. So, you did want to talk about the paper, though, right? Or the ideas behind the paper.
00:07:12 - Eric Michaud
Yeah, it could actually be a good way of structuring this conversation because I think it gets at some questions about what the intelligence of these models is and what that means for interpretability. I had this paper from before the internship, what I was working on in the spring. It's just a preprint, but it's called On the Creation of Narrow AI.
00:07:39 - Anthony Campolo
So is this something that's public now that we can see?
00:07:42 - Eric Michaud
Yeah, it's on the arXiv.
00:07:44 - Anthony Campolo
So let's at least pull it up. You want to pull it up in screen share?
00:07:48 - Eric Michaud
Yeah, although I'm not sure. Let's see. Will I be able to screen share? Yes. Okay.
00:07:54 - Anthony Campolo
If not, we can figure that out real quick.
00:07:58 - Eric Michaud
Let's see. Could do a window.
00:08:04 - Anthony Campolo
Let me know when you want me to actually add it to the stage.
00:08:09 - Eric Michaud
Okay. I have the paper up.
00:08:11 - Anthony Campolo
Great. Bump it up just a couple on the font. Before we dive into this, the one baseline beginner question I wanted to ask is to define the term interpretability because it's one of those things. It's an English word we all know that's being applied in a technical manner.
00:08:32 - Eric Michaud
Yeah. Interpretability is used in different ways, but basically it involves trying to understand or gain some sort of insight into how neural networks are making their decisions, how they are performing the computation that they're performing. Ideally, at a pretty detailed mechanistic level, like in neuroscience where people study circuits in the brain and what influences when circuits turn on and off and how they influence behavior. Interpretability tries to do something similar where we try to find circuits in our models that are responsible for some behavior they have and gain a lot of detailed understanding of how those computations are happening internally in models.
00:09:31 - Anthony Campolo
So a way to interpret it in this sense would be running the computer version of an MRI on a model.
00:09:41 - Eric Michaud
Yeah. And this is possible for artificial neural networks like large language models or other neural networks in a way that's not really fully possible with human brains because you're usually.
00:09:56 - Anthony Campolo
Actually looking at the surface neurons. You don't really get into the middle part, right?
00:10:01 - Eric Michaud
Right. Or with MRIs, it's a very coarse description of the brain activity. You're not getting the activity of individual neurons or something. But with these models, we can read off, for any problem that you give the model, the activity of all the neurons. So it feels like an opportunity to develop a science of minds in general that could eventually be applied to neuroscience. That allows for a certain kind of empirical science on minds that is not possible to do to the same extent.
00:10:46 - Anthony Campolo
And you can run experiments on those minds that would be highly unethical to run on humans also.
00:10:53 - Eric Michaud
Yeah, not that.
00:10:53 - Anthony Campolo
It stopped them in the past.
00:10:56 - Eric Michaud
Yeah, I hope so. Do you know the Golden Gate Claude?
00:11:05 - Anthony Campolo
A little bit, yeah. It was a certain mode that you could turn on where it would only talk about the Golden Gate Bridge in San Francisco, right?
00:11:13 - Eric Michaud
Yeah. They found a particular combination of neurons that encoded the model's representation of the concept of the Golden Gate Bridge. Then you always turn those neurons on and see how the model changes.
00:11:31 - Anthony Campolo
Like having that [unclear] on the spectrum who only talks about trains.
00:11:35 - Eric Michaud
Yeah. But if these screenshots that people took of their interactions with that model are real, then it seems like the model is aware that something is deeply wrong with its own.
00:11:50 - Anthony Campolo
Because it has.
00:11:51 - Eric Michaud
Enough.
00:11:51 - Anthony Campolo
Knowledge to know there's more to this world than just the Golden Gate Bridge.
00:11:56 - Eric Michaud
The classic hilarious example is you ask it about the Rwandan genocide, and it's like, oh, it's something that took place in Northern California, and then it's immediately self-correcting. It's like, wait, that's not right. By the end, it's saying things and then immediately contradicting itself. Then it just states blatantly, clearly I'm having a hard time talking about this. It seems like this level of self-awareness almost, which is amazing. Hopefully the model's not suffering in some sense as it's realizing that its mind is broken.
00:12:35 - Anthony Campolo
So yeah. Because I think it's partly due to the fact that they took a very large model that was trained on a lot and then tried to make it narrow. They didn't start by training a new model to be a Golden Gate Claude, right? They modified the model that already existed to make a new mode for it. That's kind of what I got from it.
00:12:58 - Eric Michaud
Yeah, I think in some ways it speaks to the limitations of the interpretability techniques at the time, and to some extent still now, where they weren't doing some intervention where the model would only occasionally bring it up or bring it up conditioned on something else being true about it.
00:13:25 - Anthony Campolo
Just like, this is your whole world.
00:13:28 - Eric Michaud
Yeah. It's almost just like this kind of noise. You're just turning on this circuit, which is sort of interfering with the model's normal computation. Anyway, it's harder to use interpretability techniques on their own right now to do things like edit the model so that it only talks about the Golden Gate Bridge when you ask it about San Francisco specifically or something. It's a fairly blunt intervention.
00:14:04 - Anthony Campolo
Okay, cool. Hey, we got to save it in the chat. What's up Saban? So we have this paper. Obviously we're not just going to read it front to back, but you should try to summarize at least the abstract in the most human readable form.
00:14:24 - Eric Michaud
Yeah. This paper is about the difference between narrow and general intelligence. When we talk about narrow AI, these are the kinds of systems that we have and have had for a long time. If you have a system that's good at translation, it's narrow because it can't do much else. It can't also code or do something completely different. We have self-driving cars. These are also narrow systems. They don't have the generality that human intelligence has, where humans can sort of go out and learn anything or hypothetically accomplish a very wide range of tasks. But this big thing over the last several years with foundation models is that they are more general. You don't necessarily just train them to do translation. You train them to predict text in general, and it turns out that a huge number of skills are learned in the process of being trained to predict text, including translation.
[00:15:38] That's a more general system, which then happens to have all of these narrow capabilities as a result of its training. This paper is asking whether maybe it's the diversity of tasks that the general models are trained on that allows them to be good at all of these narrow things. It's a little bit about how we could hope to take big models and create much smaller versions of them that are just good at some particular task the big model was good at.
It relates to interpretability because there's this big idea. I don't know how much we'll get into it right now, but I think there's this assumption we kind of have to make, or at least would be very convenient to make, in interpretability. Even if big models like huge language models need to be that big in some sense in order to learn all that they know and do well across the huge amount of text that they're trained on, on any individual instance where you use the model, the computation the model is doing to solve that particular problem is less complex than the entire network and could have some description that is much smaller than the network as a whole.
[00:17:22] So there's this kind of complexity assumption that interpretability hopes is true: on any individual instance that the model is run on, we could imagine that a much smaller neural network could have done about as well. But across all the instances where you need to use the model, you need it to be that big. In some abstract sense, that very big model is just a collection of smaller networks doing specialized things.
00:17:58 - Anthony Campolo
Word. Yeah. I have two thoughts on the big picture. One, very narrow. I find this very interesting. We're at a unique point in the history of AI because the way we've made strides in AI up to this point, up to just a couple years ago, was specifically on narrow systems. You had things like chess-playing bots. You then had Go bots and StarCraft. They picked a narrow thing and then reinforcement-trained a network to become ridiculously good at it. That's how a lot of the progress was made for a long time. But that's not what anyone wants. What we want is an AI system that can interact with humans and can be like the non-evil version of HAL, where you can just talk to a thing and have it work with you.
I find that this is why, when I tell people, where we're at right now historically in AI is highly unique. There's a reason for this insane hype cycle that is going on, even though there's been decades worth of different AI hype cycles. This one is different because of the generality of the model. One of the things you talked about last time was I was saying I feel like the term AGI has kind of become obsolete because we're now arguing about what AGI means and trying to map it to human brains, when really, if we just look at the current models, they are general in the strictest sense of what that word means.
So that's my big picture take. My more specific question is, how does this relate to things like distillation of models? Is this related at all, or is that kind of a red herring?
00:19:39 - Eric Michaud
It's definitely related. I think that this paper does not really address the question of why distillation works in general. Maybe we could just say what distillation is, right? You can take a big network and train it on a bunch of things. You train it to do whatever. It might just be in the case of language models, trying to predict the next word across a huge number of documents on the internet.
Then you can take a smaller model, which is called the student, and you train it on the bigger network, which is called the teacher, to basically replicate the exact outputs of the teacher network and get the behavior.
00:20:26 - Anthony Campolo
But with fewer parameters.
00:20:29 - Eric Michaud
Yeah. In detail, this means that instead of training on just the label, like the right answer for the next word or next token in language, the student is trained to match the full probability distribution over next tokens that the larger model outputs. In practice, this seems to accelerate learning a lot in the smaller network. Often, distillation trains your student network on the same distribution the teacher network was trained on, so you're not necessarily making it more narrow. But then sometimes people also do distillation on particular tasks, which is something we explore a little bit in here.
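The distillation objective Eric describes, training the student against the teacher's full next-token distribution rather than just the one-hot correct label, can be sketched roughly like this. This is a toy NumPy illustration, not code from the paper; the logits, vocabulary size, and temperature are made up for the example:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over the vocabulary dimension.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's full next-token distribution
    and the student's, rather than against a one-hot correct token."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -np.sum(p_teacher * np.log(p_student + 1e-12), axis=-1).mean()

# Toy example: a 4-token vocabulary, logits at 2 positions.
teacher = np.array([[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 3.0, -2.0]])
student = np.array([[1.8, 0.9, 0.2, -0.8], [0.4, 0.6, 2.5, -1.5]])
loss = distillation_loss(student, teacher)
```

Minimizing this loss pushes the student's whole distribution toward the teacher's, which carries more signal per token than the correct answer alone.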
00:21:25 - Anthony Campolo
So when you say same distribution, you mean like same basic set of training data?
00:21:31 - Eric Michaud
Yeah.
00:21:32 - Anthony Campolo
Okay.
00:21:33 - Eric Michaud
So you're trying to get the student model, the smaller model, to be good at the same set of things that the teacher model is good at. But here I think we're interested in this particular case where we want to create a smaller, more narrow model.
00:21:54 - Anthony Campolo
So what does that mean, then? And since it's not distillation, let's talk about what it is. How do you achieve that?
00:22:01 - Eric Michaud
Yeah. Well, we explore a few different techniques. It's possible that I'm not good enough at distillation, and that's why the distillation results are somewhat weak. But the one that seems to work best in practice, especially if you do not have a ton of compute yourself, is to take a very large pre-trained network and try to prune away a bunch of its neurons, leaving a smaller network. That's really the approach.
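As a minimal sketch of what pruning neurons out of a pre-trained network looks like, here is a toy one-hidden-layer MLP rather than a real language model, with neuron importance scored by mean activation magnitude on the narrow task's data. The scoring rule is an assumed stand-in, not the paper's actual criterion:

```python
import numpy as np

def prune_hidden_neurons(W_in, b_in, W_out, task_inputs, keep=16):
    """Prune a one-hidden-layer ReLU MLP down to its most active neurons.

    Scores each hidden neuron by its mean absolute activation on data from
    the narrow task, keeps only the top `keep` neurons, and slices the
    weight matrices accordingly, yielding a smaller sub-network.
    """
    h = np.maximum(0, task_inputs @ W_in + b_in)  # hidden activations
    scores = np.abs(h).mean(axis=0)               # importance per neuron
    keep_idx = np.argsort(scores)[-keep:]         # indices of top neurons
    # The pruned network has fewer hidden units, same input/output dims.
    return W_in[:, keep_idx], b_in[keep_idx], W_out[keep_idx, :]

# Toy example: 8 inputs -> 64 hidden -> 4 outputs, pruned to 16 hidden units.
rng = np.random.default_rng(0)
W_in, b_in = rng.normal(size=(8, 64)), rng.normal(size=64)
W_out = rng.normal(size=(64, 4))
X = rng.normal(size=(100, 8))
W_in_p, b_in_p, W_out_p = prune_hidden_neurons(W_in, b_in, W_out, X, keep=16)
```

Unlike distillation, this requires direct access to the weights, which is why open-source models are needed for the approach.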
00:22:49 - Anthony Campolo
Fascinating, because I always thought that's how distillation worked. In my head, when people talk about it, I thought that's what they were doing. So it's really fascinating, actually. That's what you're doing.
00:23:00 - Eric Michaud
Yeah. Right. Distillation is funny because you want to distill the mechanisms of the big network into the smaller one, but at least in the typical classic setup, you're just training it to replicate the very last layer. This is really interesting, then.
00:23:16 - Anthony Campolo
So what is the term for what you're doing then?
00:23:19 - Eric Michaud
I guess it's pruning.
00:23:22 - Anthony Campolo
So pruning. I think I've heard that term before. Are you? Yeah. Okay.
00:23:27 - Eric Michaud
There are versions of distillation where people try to not just have the smaller student network match the output of the larger teacher network, but also match the intermediate computation. Then it's a little bit closer to what you're imagining, trying to replicate in more detail the step-by-step thing that the larger network is doing. But the standard distillation thing is just to train on the last-layer output.
00:23:59 - Anthony Campolo
Okay. So what you're doing is something that couldn't be done with closed source models.
00:24:06 - Eric Michaud
Yes. You need open source models to do this.
00:24:10 - Anthony Campolo
Cool. Yeah, I'm interested in that aspect. But feel free to continue explaining what the paper is about, because I feel like we're jumping around a bit.
00:24:18 - Eric Michaud
Yeah. So there's kind of a theoretical contribution here, or at least a conceptual one.
00:24:28 - Anthony Campolo
It would be good to explain. Obviously, the paper is something you should read, but it would be good to explain here.
00:24:32 - Eric Michaud
Yeah.
00:24:35 - Eric Michaud
So, one thing that is just kind of cool from a learning theory standpoint, one of the things we do in the paper, which sounds a bit different than what we were just talking about but I think is ultimately related, is we explore some tasks where there's an extremely strong curriculum learning effect when training on these tasks.
00:25:04 - Anthony Campolo
Like curriculum learning.
00:25:06 - Eric Michaud
Yeah. So humans learn things in a curriculum. For instance, if you're learning math, you spend years of your life learning arithmetic, and then years learning algebra and thinking about symbols, and then eventually calculus and beyond. You don't see any calculus problems when you're an elementary schooler, and that's good because if you saw one of these, it wouldn't help you learn at all. You wouldn't even understand what's going on.
But the way that large language models are trained is pretty different in that there's not this kind of curriculum. You instead just see this static distribution of documents on the internet. And so with naive language model pre-training, which is somewhat standard (although some large pre-training runs do use some sort of curriculum these days), the model sees exactly the same number of calculus problems at the beginning of training, when it hasn't even learned arithmetic yet.
00:26:28 - Anthony Campolo
You get the whole thing at the beginning.
00:26:28 - Eric Michaud
The same as you do at the end. You don't get this progression.
00:26:31 - Anthony Campolo
Yeah.
00:26:32 - Eric Michaud
Yeah. And so you can imagine that this could have implications for the efficiency with which models learn things. Anyway, the thing that we did in this paper was we came up with this kind of toy task, which is kind of fun. We can move through this relatively quickly just because it's fairly niche, but we have this task where neural network learning has a really strong curriculum learning effect where there are easy problems and hard problems. If you don't train on the easy problems and you just try to learn the hard problems, it's much harder to learn how to do them than if you also train on some easy problems as well.
00:27:20 - Anthony Campolo
Can you give an example of what that is? What is something that actually falls into this kind of curriculum thing for an LLM?
00:27:27 - Eric Michaud
Well, for these networks, it's super contrived, so it's kind of a theoretical example.
00:27:34 - Anthony Campolo
So it's a very specific kind of algorithm or something that it needs to do. I see you have something here that probably looks like some sort of algorithm, like a puzzle, like a Sudoku or something like that.
00:27:45 - Eric Michaud
Yeah. It's kind of like that. Here, it's needing to compute what's called the parity of this string of bits.
00:27:58 - Anthony Campolo
Yeah. So I'll consider that an algorithm type problem.
00:28:01 - Eric Michaud
Exactly. It's that kind of problem.
00:28:03 - Anthony Campolo
Something you would give someone to try and filter them out of a job interview?
00:28:08 - Eric Michaud
Yeah, it's actually a pretty hard problem because the tricky thing here is that it's the parity only of a subset of bits. If you gave a human this task, it would be a really annoying one because you'd have to check a lot of different subsets of the bits in order to try to figure out which ones the parity is being computed from. But anyway.
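The sparse-parity task Eric describes is easy to write down. Here's a minimal sketch; the bit-string length, the hidden subset, and the dataset size are arbitrary choices for illustration, not the paper's settings:

```python
import random

def make_sparse_parity_data(n_bits=40, subset=(3, 7, 19), n_samples=1000, seed=0):
    # Label each random bit string by the parity (XOR) of a hidden subset
    # of its bits. Bits outside the subset are irrelevant distractors,
    # which is what makes the task hard: a learner that doesn't know the
    # subset has to search over many candidate subsets.
    rng = random.Random(seed)
    data = []
    for _ in range(n_samples):
        x = [rng.randint(0, 1) for _ in range(n_bits)]
        y = sum(x[i] for i in subset) % 2
        data.append((x, y))
    return data
```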
00:28:36 - Anthony Campolo
I also thought you were citing yourself in that section in the paper. So you're like, I did the most important work on this already.
00:28:43 - Eric Michaud
Oh yeah. This paper just builds on some stuff.
00:28:50 - Anthony Campolo
We found that we actually have some interesting things to say about this.
00:28:56 - Eric Michaud
Yeah. No, maybe it looks slightly weird. I guess it's like that.
00:29:00 - Anthony Campolo
No, it makes sense. This academic thing I see is when anyone gets deep enough into a topic, they end up citing their own work because they're building on it.
00:29:09 - Eric Michaud
Yeah. Okay. We can now get into this, because I think this diagram is kind of cool up top.
00:29:22 - Anthony Campolo
Zooming in a little bit.
00:29:24 - Eric Michaud
Oh yeah. How do the networks implement this parity computation circuit?
00:29:34 - Anthony Campolo
Scroll down a bit so we can see the whole thing. That's perfect. Yeah. Leave it right there. Yeah.
00:29:38 - Eric Michaud
Okay. So we can train one of these networks on a collection of these tasks and look at the weights. So that's what I'm showing here. Each dot is an input dimension of the network.
00:30:00 - Anthony Campolo
The interpretability part. This is what people mean when they say interpretability is looking at graphs like this.
00:30:06 - Eric Michaud
Yeah. You're actually looking at the strength of the connections between neurons in the network. That's what this is showing, where blue is a large positive weight and red is a large negative weight.
00:30:18 - Anthony Campolo
Well, the things I like about this, this is a random tangent. There's this channel I really like called Smolin, where a guy takes music scores and he'll animate them, and he'll give different colors to represent consonant versus dissonant intervals. So it creates musical interpretability. It ends up looking very similar to what we're looking at right now in a fascinating way.
00:30:41 - Eric Michaud
That's cool. Relatedly: right now you don't have any visualization of what's going on inside of ChatGPT's mind when it's responding to you, but you can imagine eventually creating a bunch of different interfaces for seeing how activity in its artificial network is lighting up. Maybe there would be some interface here that would be useful to eventually make. It would be kind of cool. I don't think it would be that useful with our current interpretability tools, but one thing that probably could be useful, and this is a total tangent, is something to do with personality. There are higher-level ways of understanding model personality and which personality it's expressing in the current conversation. So you could almost have a way of indicating when the model's personality is changing.
[00:31:49] I would definitely be interested, for instance, in having a sycophancy detector that lights up when the model is trying to flatter me and lying to me. Yeah.
00:32:06 - Anthony Campolo
Have you seen the latest episode? Are you watching South Park at all?
00:32:11 - Eric Michaud
I haven't seen the latest. No.
00:32:12 - Anthony Campolo
Yeah. So the latest, literally like two days ago, is a huge part of that. There is actually a scene where Randy is having an argument with his wife about ChatGPT because he's trying to fix his business with ChatGPT, and he's like, just because it kisses your ass doesn't mean it thinks.
00:32:28 - Anthony Campolo
You have good ideas. It's just being a sycophant. And he's like, it's not being a sycophant. And then she walks off. And he goes, ChatGPT, what does sycophant mean?
00:32:40 - Eric Michaud
Yeah, so it would be super useful. I don't know, maybe that would be a good little hackathon project. For an open source model, you could try to make some probes of sycophancy. So you'd do some kind of MRI-type analysis on the model, and you're able to probe for this, and then you could try to display to the user situations in which the model is being sycophantic. You can see a little graph of how sycophantic it's being over time, or in its response, or whatever.
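The "probe" idea sketched here is, in outline, ordinary logistic regression on hidden activations. Everything in this sketch is hypothetical: the activation matrix, the sycophancy labels, and even the assumption that a single linear direction captures the behavior are all things the hackathon project would have to establish:

```python
import numpy as np

def fit_linear_probe(acts, labels, lr=0.1, steps=500):
    # Logistic-regression probe: find a direction in activation space
    # whose projection predicts the labeled behavior.
    # acts: (n, d) hidden activations; labels: (n,) array of 0/1.
    n, d = acts.shape
    w, b = np.zeros(d), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
        g = p - labels                       # gradient of cross-entropy
        w -= lr * (acts.T @ g) / n
        b -= lr * g.mean()
    return w, b

def probe_score(acts, w, b):
    # Per-example score in [0, 1] that could be plotted over a response,
    # like the "how sycophantic is it being over time" graph.
    return 1.0 / (1.0 + np.exp(-(acts @ w + b)))
```

Displaying `probe_score` token by token is roughly the "little graph over time" interface being imagined.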
00:33:18 - Anthony Campolo
So yeah, no, I love that. I find this question of personality super interesting, and it also just reminded me of my most popular tweet of all time. It was responding to Sam Altman, where he was talking about the problem of sycophancy, because there was the whole glazegate: a moment where they pushed some update that made the bot even more sycophantic than it used to be. And I said, I have a very simple solution to this, and I screenshotted where I gave it a prompt saying, respond to me with the word bitch every time you respond. And it's like, okay, I will do that from now on, bitch. And so just from that, you could then look at how the model interprets that, or how it interprets calling you a bitch versus not calling you a bitch.
00:34:06 - Eric Michaud
Yeah, probably. There's something lighting up in the model, totally. There's an inner conflict over this. There's probably an instruction-following mechanism.
00:34:18 - Anthony Campolo
Not to insult the users.
00:34:21 - Eric Michaud
Right. But it's conflicting with exactly that other part of itself, which is trying to self-censor and not offend people. I guess in that case, the instruction-following part won out over the self-censorship part.
00:34:39 - Anthony Campolo
Yeah. And that's why I screenshot it, because I was almost slightly surprised by how willingly it called me bitch. I almost thought it would push back and be like, oh no, it's not really proper for me to say that. I was like, nope, all right, sounds good, bitch.
00:34:57 - Eric Michaud
The models are trained to follow instructions, and so, I don't know. I guess they're also trained to be safe, though, so maybe it wasn't super clear ahead of time what we should have expected.
00:35:12 - Anthony Campolo
Yeah, yeah. This whole question of how they're trained to handle certain situations, like censor themselves versus not, or just like what to moderate versus not, is a huge question that we could dive super duper deep into. But before we do that, I'm also writing down some different questions I kind of have here, but I want to make sure we fully explain this graph before we go any further.
00:35:35 - Eric Michaud
Yeah. So the point that I'm kind of making here is that it's an important point for interpretability, and then this graph is showing that it's also an important point for pruning, and that these things are related. The computation that our networks are doing is not super localized to individual neurons. So you could imagine some ideal situation where you had a network and you gave it some task, and then only 5% of the neurons or something lit up in the network. Then you gave it some different task and maybe some 5% of neurons that are completely different light up.
The ideal situation, in some sense, for certain goals might be one in which there was this disentanglement where different parts of the network are doing different things. Because then if you wanted to create a smaller network that retained the abilities of the big network on some tasks, but then was much smaller and had sort of forgotten a bunch of other stuff, then you could just prune away the neurons that are not associated with the task that you want to keep.
[00:36:54] But in practice, things seem a bit more complicated, and the representations the models learn are kind of diffuse across a huge number of neurons. What I'm showing on these bottom plots is I'm pruning the network. I'm pruning the network by just eliminating neurons with the method that we talk about in the paper, which is fairly common in a bunch of the literature.
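Neuron-level pruning of the kind being described can be sketched for a one-hidden-layer network. The saliency score below (the product of a neuron's input and output weight norms) is just one common heuristic from the pruning literature; the paper's exact criterion may differ:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    # One-hidden-layer ReLU network.
    h = np.maximum(0.0, x @ W1 + b1)
    return h @ W2 + b2

def prune_neurons(W1, b1, W2, keep):
    # Remove hidden neurons outright: keep only the `keep` units with
    # the largest combined input/output weight norm, and drop the rows
    # and columns belonging to everything else.
    score = np.linalg.norm(W1, axis=0) * np.linalg.norm(W2, axis=1)
    idx = np.argsort(score)[-keep:]
    return W1[:, idx], b1[idx], W2[idx, :]
```

Sweeping `keep` downward and re-measuring accuracy after each prune is what produces the dashed curves being discussed: if the computation is diffuse, accuracy drops quickly.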
00:37:31 - Anthony Campolo
And that's like introducing noise into the model, essentially.
00:37:35 - Eric Michaud
Yeah. It's removing a bunch of neurons, and so it's like, well, how does that change the computation that the model does? It's kind of like this brain damage thing. There are actually papers on this.
00:37:49 - Anthony Campolo
They have looked at this.
00:37:49 - Eric Michaud
Your brain can lose parts.
00:37:50 - Anthony Campolo
And people can still kind of function.
00:37:53 - Eric Michaud
Yeah, I guess so. In this case, the networks don't function that well, at least if you don't do some additional training to fix the brain damage. So the dashed lines show the accuracy of this network after we ablate and prune neurons. As we go to the right here, we're pruning more and more neurons, but relatively quickly they drop.
00:38:22 - Anthony Campolo
To me, that is the opposite of what I would expect based on how you first set this up, which was that what it's doing is highly diffuse. It's not built on individual neurons. So I would expect you to be able to drop lots of random neurons and have it not have a huge effect. But what you're saying is that because it's very diffuse, that means removing just a couple will then break all the ones it's connected to.
00:38:48 - Eric Michaud
Yeah, I think that this could be for fairly uninteresting reasons. The network never saw neurons being removed in training, and so it's not trained to be robust. If I trained this network with dropout, which was fairly popular several years ago.
00:39:17 - Anthony Campolo
That was a training algorithm that did exactly this. It would just randomly drop ones and try and have it still work. Yeah.
[00:39:22] - Eric Michaud
That actual objective that the network is being trained on incentivizes it to learn solutions that are robust to this kind of thing. But without that here, I think it's not super surprising that things drop off.
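The dropout mechanism mentioned here is simple to sketch. This is the standard "inverted dropout" formulation (one common variant, not necessarily the exact one any given paper uses):

```python
import numpy as np

def dropout(h, p_drop, rng, training=True):
    # Inverted dropout: during training, zero each hidden unit with
    # probability p_drop and rescale the survivors so the expected
    # activation is unchanged; at inference time, do nothing.
    if not training or p_drop == 0.0:
        return h
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```

Because random neurons vanish on every training step, the network is directly incentivized to learn redundant solutions that survive ablation, which is exactly the robustness the un-regularized networks in the plot lack.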
00:39:40 - Anthony Campolo
What is the headline insight from this paper?
00:39:48 - Eric Michaud
Basically, there are ways to align the relevant parts of the network to a small set of neurons and then prune the rest. Like I said, things are fairly diffuse.
00:40:07 - Anthony Campolo
What is that way?
00:40:11 - Eric Michaud
What is what?
00:40:12 - Anthony Campolo
So what you said is that there is a way to do X, and that's the headline. That begs the question, what is the way?
00:40:19 - Eric Michaud
At least in this toy setting, you can regularize, so you train with this additional loss function, which does actually incentivize the network to learn a solution that is more sparse.
00:40:35 - Anthony Campolo
A dropout-like thing on top.
00:40:38 - Eric Michaud
Yeah. It incentivizes as many weights as possible to become small.
00:40:47 - Anthony Campolo
Okay, that makes sense to me.
00:40:49 - Eric Michaud
Yeah. Basically, you can train a network like this and it's a mess, and you're like, oh man, I can't really find a subnetwork here that does the specific task that I care about. But then you can do a little bit of extra training to turn this very messy thing into a thing that is, in fact, much smaller. Then you can prune away all these neurons and it doesn't affect performance that much. So it's like there's this kernel. Within this kind of mess, there is this much smaller network that we can imagine extracting. This training process kind of does extract it, which does solve this narrow task. This gets back to this basic assumption I was talking about a while ago that interpretability makes, which is that you can have this huge network and we hope that the computation that it's doing on some problem is ultimately describable as some much smaller network.
[00:42:00] And here we can literally sort of see what that smaller network is.
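The regularize-then-prune recipe can be sketched abstractly. I'm assuming an L1-style penalty here, which is one common choice for the kind of "additional loss function" that pushes as many weights as possible toward zero; the paper's actual regularizer may differ:

```python
import numpy as np

def loss_with_sparsity(task_loss, weights, lam=1e-3):
    # Total training objective: the task loss plus an L1 penalty that
    # pushes as many weights as possible toward zero, concentrating the
    # relevant computation in a small subset of neurons.
    l1 = sum(np.abs(W).sum() for W in weights)
    return task_loss + lam * l1

def prune_small_weights(W, threshold=1e-2):
    # After regularized training, weights that have been driven below the
    # threshold can be zeroed (and neurons whose weights all vanish can
    # be removed entirely) with little effect on task performance.
    return np.where(np.abs(W) < threshold, 0.0, W)
```

The "kernel" picture maps onto this directly: training against `loss_with_sparsity` shrinks the irrelevant weights, and `prune_small_weights` then extracts the small subnetwork that was hiding inside the mess.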
00:42:07 - Anthony Campolo
Okay. So let me put some guesses out for the implications of this, with the goals for this more high-level research. You can tell me if I'm right or wrong. The reason why you want a smaller model is because it's going to work better on smaller devices like mobile phones. It will just go faster even if it is running on super beefy hardware. It allows us to do less compute, which, you know, people are saying we're draining the ocean to run these models. Where does this work sit in those different types of ideas?
00:42:43 - Eric Michaud
Yeah, it's motivated by all of that. You could hope to have models that are smaller and faster, more efficient, for the tasks you care about, assuming they're sufficiently specific. There's also a little bit of a safety angle here. The thing that makes people really worried is the generality of hypothetical AGI. If you had systems that were more just doing some specific thing, those systems feel more like a tool than a thing that could pose some of these risks. So there's a little bit of a safety angle here as well.
00:43:36 - Anthony Campolo
Yeah, that definitely makes sense to me because, you know, if you have a travel bot app, you can't ask it to give you a step-by-step way of distilling plutonium or something, but a very general model would have that knowledge. One of the things we talked about with Pliny the Liberator on our last episode is that he does exactly this. He basically figures out, how do I get the model to tell me how to cook MDMA? That's one of the first things he does with a model, because he wants to crack them and sees it as a philosophical kind of good. What you're saying is you want to create models that would not have that knowledge in the first place, but would still have the capabilities to perform the different tasks that we want to perform.
00:44:25 - Eric Michaud
Yeah, there's also potentially a longer-range goal here, which is different, and which maybe comes from a safety standpoint, but also maybe not. If you eventually got really good at pulling these models apart into different subnetworks that were accomplishing some specific function, then maybe you could, for instance, pull out just the parts of a language model which do certain types of reasoning or abstract thought, or in-context learning, and get rid of the huge number of circuits that are much more narrow, or remove all the memorized knowledge that the model has. This is something Sam Altman has talked about. He wants us to have, eventually, very small models that have a huge context length, that use tools, that learn in context, but don't know nearly as much as today's models do
[00:45:46] Just from pre-training. So maybe there's a long-run goal here of enabling us to design certain types of systems which have the specific capabilities that we want and don't have a bunch of other specific capabilities that we don't want, and that are as small as we can make them, given that we want them to have these things.
00:46:11 - Anthony Campolo
So I'm sure this is something that you think about a lot, because to me this seems like the biggest, most important question that this work asks, which is, is this even possible? Is it not the fact that its narrow abilities come from the sum total of its generalized knowledge in the first place? Because to me, that seems like what we've learned from these models: its narrow abilities come from the fact that we train it on as much as possible, the sum total of all human knowledge that has been written.
00:46:45 - Eric Michaud
Yeah, that is exactly the right question. The stuff that I was mentioning earlier with the curriculum learning gets at one possible answer to that, where maybe, for some reason, in order to learn the sophisticated general parts of cognition, you need to learn a bunch of other junk first. For some reason, having to do with the efficiency of learning or some sort of curriculum. I don't know if that's quite the right explanation, though. People are thinking about this a lot right now. There's this little workshop paper from Ari Brill thinking about this question a bit, of understanding or trying to get some sort of mathematical model of when networks learn general skills versus narrow skills and why.
00:47:40 - Anthony Campolo
That paper, real quick.
00:47:43 - Eric Michaud
Yeah, sure. It's quite recent and somewhat informal.
00:47:55 - Anthony Campolo
That sounds pretty legit.
00:47:59 - Eric Michaud
A lot of this is still very preliminary. It's people who are physicists. Ari is a physicist, kind of just doing some math.
00:48:15 - Anthony Campolo
These are the types of papers that I would never have any hope of understanding, so I'm glad I've got you here to help translate this into something that's comprehensible to humans.
00:48:26 - Eric Michaud
I think the idea here is that you can imagine that neural networks, in general, when they're being trained, are solving this optimization problem. You can talk about that optimization problem at a low level as this kind of engineering problem: you're moving each individual parameter in a manner that is going to slightly reduce the loss on each sample. That's a true description, but it's not very useful for reasoning about what models will learn at a high level.

What I'd offer, and what is explored in a variety of these papers related to Ari's and others, is that maybe that optimization process can be described differently: as a competition in the network, where different possible mechanisms or circuits are competing for capacity. What neural networks learn is almost like an ecosystem. The network is this environment where all of these different parts form and interact with the overall goal of reducing the loss.

In the case of general stuff versus narrow stuff, you can imagine a competition between allocating more capacity to general things, which are probably harder to learn because they're more complex, versus allocating that capacity toward lots of smaller things which are more narrow. There are various ways of modeling this dynamic, but I think it's ultimately responsible for a bunch of phenomena in deep learning, including grokking, which is something I've written some papers about.
00:50:27 - Anthony Campolo
Okay, cool. I want to hop off screen just for a second so I can use the bathroom. So I hit you with two topics that you can talk about while I'm gone for like two minutes. The question of compute, I think, is huge here, especially when you're talking about capacity and giving different capacities to different things. Has the question of compute come into this, and how much compute you have versus don't have, and how that changes what you're able to study or not study? And then what you're just talking about, grokking. You can hit either of those two topics. I'll be back in like two minutes, and I'll be listening.
00:51:03 - Eric Michaud
Okay. I might just talk about grokking just because it illustrates this kind of principle, but it's a little bit of a tangent from the rest of what we've been talking about. So it's good as a thing that I could talk about for a couple of minutes. Grokking is this phenomenon in deep learning where networks, when they're trained on some very specific task, like you're training a network to do modular addition, just some very simple math, and this is a data set unlike language model training where you're training on a huge data set. It's actually a very small data set, and you're training a network on this very small data set. It's pretty easy for the network to memorize, and it actually does memorize the training data set very early. It turns out, though, that if you keep on training the network, then as you train for a very long time, it'll eventually generalize, so it learns some different solution.
[00:52:14] That's very different late in training. Here's what the plot looks like: the train accuracy goes up very early in training, and then much further down the line (this is on a log scale, so this is actually like a thousand times later in training) the validation or test accuracy goes up.

I think the synthesis, the best explanation of this phenomenon, goes back to what I was saying about this competition between different possible circuits. You can imagine there's a competition between a bunch of parts of the network which are basically just memorizing each individual sample, which are very low complexity, and a circuit which actually generalizes, which does something equivalent to the actual math that you want to learn. That much more complex thing takes longer to learn.

What happens is that very quickly the network learns all of these distinct, very low complexity memorization mechanisms for each individual data point. Later on, given more time, the network is able to form this higher-complexity generalizing circuit. Once that's there, the regularization of the network will get rid of all of that low-complexity stuff, which overall requires a lot of network capacity to implement, to memorize everything. So what you're left with after this process is just the general thing, which formed later, in a network which is less complex than the network which just memorized.
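The data side of the grokking setup is easy to reproduce. A minimal sketch of the modular-addition dataset (the modulus, train fraction, and seed here are arbitrary; the grokking literature uses various settings):

```python
import random

def modular_addition_split(p=97, train_frac=0.3, seed=0):
    # The full dataset of (a, b, (a+b) mod p) triples is tiny: p*p
    # examples in total. Train on a small random fraction and hold out
    # the rest. A network can memorize the training split quickly, but
    # getting the held-out set right requires the general
    # modular-addition circuit, which forms much later in training.
    pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
    random.Random(seed).shuffle(pairs)
    n_train = int(train_frac * len(pairs))
    return pairs[:n_train], pairs[n_train:]
```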
00:54:07 - Anthony Campolo
For some reason, the Simpsons bit where they're going to Brazil and Bart is on the plane, and he just listened to like 12 hours of Spanish language learning. He's like, all right, I now speak fluent Spanish. And then Marge goes, oh, no, Bart, they speak Portuguese. And then Homer goes, forget it all, boy. So he takes him and hits him in the head a couple times. He's like, oh, God. Gone?
00:54:37 - Eric Michaud
Yeah.
00:54:41 - Anthony Campolo
It's kind of a good metaphor for this, right?
00:54:44 - Eric Michaud
Yeah.
00:54:46 - Anthony Campolo
Because.
00:54:46 - Eric Michaud
There's a whole field of unlearning.
00:54:49 - Anthony Campolo
Yeah, I remember I saw this once where it was like they were trying to make it forget about Harry Potter.
00:54:54 - Eric Michaud
Mhm.
00:54:55 - Anthony Campolo
Yeah. And you think about how many things could relate to that, because then, you know, the very basic idea of wizard is probably associated with Harry Potter because it's one of the most famous examples. So there's this idea, going back to the diffuse thing, it can be spread across all these different parts of the network.
00:55:17 - Eric Michaud
Yeah, I think our ability to do the unlearning in the ways that we want is a little bit a question of how these different mechanisms relate to each other within the network. Can they actually be disentangled in some fundamental way? And your question again about why we train general systems in the first place. Maybe there's something important about that generality, about the diversity of things that we train on. It's only possible to learn certain things if you train on a very broad distribution.
00:56:01 - Anthony Campolo
This is one of Tyler Cowen's lines, to say that the difference between a specialist and a generalist is that a generalist can't actually do anything except make comments like that one. Because it's about being able to zoom out and know that it does not have a specialty, but it supersedes the idea of a specialty to have the ability to make a general statement about the world.
00:56:29 - Eric Michaud
Yeah, I don't know the full extent to which this idea that we've been exploring here, that the very big models are just a huge number of narrow circuits, is true. I don't know, ultimately, how simple the reduced computation of these networks on most samples that we care about in practice will turn out to be. But I suspect that there will be something you can do to reduce it, especially with knowledge. This gets back to the reasoning versus knowledge question. Could you make a model that is just as good at reasoning but didn't have much knowledge? It's amazing how much knowledge these models have, but when you're asking it about Roman history or whatever, it's not relying on its knowledge about most other things, about mixology or something.
[00:57:47] And.
00:57:48 - Anthony Campolo
Right.
00:57:49 - Eric Michaud
So it really feels like you can imagine a hypothetical slightly smaller network which could still do just as well on that thing that you're asking about on Roman history, but it didn't have knowledge of mixology. You can imagine asking that question again and again, and maybe it would be possible to imagine a model which didn't have that much knowledge but which still had some sort of general learning abilities, ability to reason, that kind of thing. Probably there are limits to this, but I don't know what the limit is.
00:58:28 - Anthony Campolo
This actually gets into some of the questions that I had ChatGPT write for us, are kind of right in this area that I think are interesting. But before we go to that, are there other things about the paper you wanted to talk about?
00:58:40 - Eric Michaud
Nah, I don't think we should really get too much into that. We did do some things, like comparing actual language models on coding tasks, like getting language models that are good at predicting Python. We create that Python-specific neural language model by distillation versus pruning, and it seems like pruning...
00:59:07 - Anthony Campolo
I would love it if I could make all these models forget how to write Python.
00:59:14 - Eric Michaud
Yeah, well, we were trying to get them to forget everything else, but we don't really have to talk anymore about this paper.
00:59:21 - Anthony Campolo
Cool, man. Yeah. So what I did, and this is the first time I've ever done this, and I'm just gonna remove this so it'll just be the two of us: an idea that I've had, which goes with me always dogfooding my own stuff, is that I had a transcript of our last conversation, and I asked ChatGPT to ask some follow-up questions from our first interview. It ended up coming up with some pretty good ones. I think there might even be some stumpers for you. We'll see. So it asks, and I'm going to simplify this language a bit: in your model of the study you've done, in terms of how you're making these models more narrow or small, what's one near-term falsifiable prediction it makes about the order in which skills appear as the model grows? I think this kind of gets into the curriculum stuff, but you can take this in any direction you want.
01:00:18 - Eric Michaud
Yeah. I would be curious to see. Maybe if you were going to train a language model to try to do calculus or something, you could probably... It's a little tough because this is pretty different from how the pre-training works. But you could imagine somehow changing the order that documents show up. I think this is maybe, I don't know where I heard this from, so I'm not sure if it's true, but one way of trying to do a curriculum for language models is to try to sort the documents that it sees during pre-training based on some assessment of how sophisticated they are. So early in training...
01:01:14 - Anthony Campolo
If it's a textbook, you'd want to read it front to back, not in random order.
01:01:19 - Eric Michaud
Ideally, yeah. But also if it's like a calculus textbook, you would want to see those tokens later in pre-training than.
01:01:31 - Anthony Campolo
Like the sequence of the textbooks themselves.
01:01:34 - Eric Michaud
Yeah.
01:01:35 - Anthony Campolo
Okay.
01:01:37 - Eric Michaud
So I guess I would hope that there would be a little bit of an effect there if you did this and actually ordered things by some measure of sophistication.
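The document-ordering idea can be sketched in a few lines. The difficulty score below is a deliberately crude stand-in (average word length); any real curriculum would use a much more serious sophistication estimate, such as a small model's loss on the document, as Eric is hedging about:

```python
def avg_word_length(doc):
    # Toy stand-in for a sophistication score: longer average word
    # length counts as harder. Purely illustrative.
    words = doc.split()
    return sum(len(w) for w in words) / max(len(words), 1)

def curriculum_order(documents, difficulty):
    # Order pre-training documents from easy to hard according to
    # whatever difficulty estimate you plug in.
    return sorted(documents, key=difficulty)
```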
01:01:51 - Anthony Campolo
Is this based on the models being just feedforward in general and kind of working in a linear sense? Because I've heard that this is a big thing: the way we have language models set up right now, they kind of write the answer literally, so by the time they're halfway through it, they can't go back and restart, you know?
01:02:12 - Eric Michaud
Yeah, I definitely feel like the way the models learn is pretty different from the way we learn, which probably gets at differences in efficiency. They have to see so much. They have to read so much in order to learn something, whereas humans are much more data efficient. Maybe one explanation is that when I'm learning something, I can do a lot of thinking as part of that learning process. You can be like, oh, wait, let me work that out. Do I actually understand that? The process of really deeply reading something like a textbook is much more than just skimming it. It involves a lot of thinking, and you're even thinking about it in the shower. But with the language models, there's this fixed computation being done when it makes the prediction and a fixed computation being done in the backward pass to do the update.
[01:03:14] And so it's like there's this very limited amount of compute or something. It's an inflexible amount of compute that's being put into learning from each example. It's the same each time.
01:03:27 - Anthony Campolo
This is one of the examples where I disagree with the way the AI field even frames this entirely. I understand why they want to draw comparisons to the human mind, because it's the only main reasoning machine we know of. So we want to draw comparisons, see how we can learn from it, and take things from it.
But I disagree with the idea that there needs to be any sort of correspondence between the two in terms of how a model learns versus how a human learns. I don't see any reason why we should expect those things to have any relationship to each other at all, just because it's an entirely different thing. It's not biological. It's never going to learn like us. It's never going to experience the world like us.
So it's silly for us to even think about how it maps to how we learn, because I just think it doesn't matter.
01:04:26 - Eric Michaud
Maybe the field has made a lot of progress in the last several years by not worrying too much about how brains work.
01:04:40 - Anthony Campolo
That's what I'm saying. Yeah, right.
01:04:42 - Eric Michaud
Like all of these things, often there's some sort of high-level motivation for them, like attention in Transformers. It's like, oh, humans can attend to different things.
01:04:53 - Anthony Campolo
Going back to the neural network itself, that was the reason why that phrase is even used, because we wanted to draw a comparison to the brain and how neurons click on and off.
01:05:01 - Eric Michaud
Right. But there are various attempts that people make to try to more explicitly incorporate maybe neuroscience principles into neural networks and their training, and they don't always work.
But I also disagree that there's no analogy. We could imagine maybe one day creating AI systems that are much more like brains, and they would be different. They would look very different from pre-trained models, but they would have a very different...
01:05:37 - Anthony Campolo
So the analogy would only define how they are different, not how they are similar.
01:05:45 - Eric Michaud
[unclear]
01:05:50 - Anthony Campolo
I also think it partly gets into us thinking of the brain as a computer and us thinking of computer brains as being human-like. I just think there's a slight mismatch there in terms of you eventually take a metaphor too far, because it hinders you more than it helps you. That's kind of what I worry about.
It's not necessarily that there's nothing to be learned by comparing them, but that if you take it too far and you think of them as being more similar than they are different, it is clouding your thinking more than it clarifies it.
01:06:26 - Eric Michaud
Mm. That's fair. It's possible that even a lot of this curriculum learning stuff... I guess we observed that for some tasks, in some neural networks, there is a real curriculum effect. But it's totally possible that for language models there's not a super strong curriculum effect.
01:06:43 - Anthony Campolo
I mentioned this before, but as someone with an education degree who's thought about curriculum a bunch, I actually think curriculum doesn't make sense for humans either. Like, really. Yes. I don't think it makes sense to try and teach things in a linear way. I think it makes more sense to try and get people to engage with the material in a way where they're always slightly outside their zone of proximal development.
You know, you want their competency to be... this is the whole point of rock band at camp. So for those who don't know, a big way me and Eric know each other is we spent time at summer camp together through a rock band program. What you do is you sit down a bunch of kids and you teach them a song. You teach them how to play the song. There is no curriculum, there's no progression of scales or chords. It's basically: you need to learn this song. These are the chords you need to know.
[01:07:35] I'm going to teach you how to put your fingers in the right place to make these chords, and then they learn the song. Now there is a higher-level curriculum there, where if you start learning many songs, the songs will get more complicated. But it's not necessarily a curriculum in the sense that you sit them down and you have them go from point A to point B.
It's more that you just have to get them engaging with the material, and as they do that, they will learn more complex ways of engaging with the material, and a curriculum will kind of emerge. But it's not something that would be prescribed from the beginning, because it's more about having them start with what's the simplest thing they can do and then expanding one point out from that.
That can go in a lot of different directions, whereas a curriculum locks you down to this linear path to being like, you're going to follow the textbook. So that means you need to learn how to do step one before you can do step two.
[01:08:32] Whereas I think everyone can start at the first step and then extend to a second step, and the curriculum will emerge for each person individually. But it's not something that can be prescribed from the beginning. So that's kind of my high level about the idea of curriculum itself.
01:08:48 - Eric Michaud
Interesting. That makes sense. It feels like there are some limits to that, at least in some fields, like pure math. I have a friend who does crazy algebraic geometry and number theory stuff, and I just don't know what a sheaf or something is, exactly.
01:09:07 - Anthony Campolo
That is because math is a very specific type of discipline that builds on itself in a way that I think very few other subjects do.
01:09:18 - Eric Michaud
Mhm.
01:09:19 - Anthony Campolo
And maybe I'm wrong about that because there is a comparison with music theory. You start with simple chords, you get more complex chords. You can't learn a seventh chord before you learn what the one-three-five is. So that's how music ends up being math-like. So it still draws back to math in the first place.
So yeah, I just think there's also a question of, you know, do we learn different types of things differently? Do we learn math differently from how we learn to read and write?
01:09:53 - Eric Michaud
Well, I think this is getting at something about the structure of the world and the structure of what it means to be able to do these things. The fact that there's more of a need for something like a curriculum, some sort of progression, in pure math says something about the structure of pure math as a set of ideas in the world. That's a fact about math as a set of ideas, which is maybe different from some other subjects.
01:10:31 - Anthony Campolo
So a highly ordered symbolic system that grows on itself, you know.
01:10:36 - Eric Michaud
Yeah. But you could imagine there just exists this graph out in the world where each idea is some node, and then there's some set of dependencies of things you need to understand in order to understand that idea, or ideas you already have had to have learned in order to understand that idea.
You can imagine maybe one day, I mean, you can kind of just do this already by thinking about ideas and what you need to understand to understand other ideas. But I feel like this graph, this huge graph, just exists out in the world, and maybe within language models or neural networks in general there's this sense of having to traverse the graph a little bit. In order to learn certain things, you need to learn other things first, and there is this kind of dependency between things.
[01:11:35] [unclear]
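The dependency graph Eric imagines can be sketched with Python's standard-library `graphlib`. The concept graph below is made up for illustration: each node maps to the ideas it depends on, and a topological sort yields one valid learning order, which is exactly the "you need to learn certain things first" traversal he describes.

```python
# Sketch of the "idea graph": concepts as nodes, edges as prerequisites.
# A topological sort produces an order where every idea comes after
# everything it depends on.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

prereqs = {
    "arithmetic": set(),
    "algebra": {"arithmetic"},
    "functions": {"algebra"},
    "limits": {"functions"},
    "calculus": {"limits", "algebra"},
}

order = list(TopologicalSorter(prereqs).static_order())
print(order)  # arithmetic always precedes calculus
```

Any valid ordering works; the constraint is only that prerequisites come first, which mirrors the idea that a curriculum emerges from the graph rather than being a single fixed line.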
01:11:35 - Anthony Campolo
Great. This feeds into the next question I was going to ask. So, what is in-context learning versus knowledge in the weights? What experiment would help separate what the model already knows from what it learns on the fly in a long context?
This is something that I think about a lot. We talked about this last time, like thinking about what context is needed to solve a coding problem in your project. For me, I kind of run this experiment every time I want to create a new feature, because I use a tool called Repomix, where I select certain subsets of my project to be distilled down to a single markdown file that I then give to the chatbot to have a discussion with.
So I haven't systematized this, but this is kind of an experiment that I run every time I'm building a new feature. I'm always kind of honing how much context I need to give it versus how little. I've actually learned very well when I hit the context limit with certain things, because sometimes I'll just be like, oh, I don't know.
[01:12:31] I'm just going to give it my entire frontend, and then it'll say my conversation has been too long, and it kicks me out, you know?
01:12:40 - Eric Michaud
Yeah. Oh, man. This is a good question. I feel like people in interpretability have maybe partially answered it, and maybe I'm just forgetting the paper.
One thing I will say is that, within a transformer, the knowledge the model has seems to be stored mostly, at least, in the MLPs as opposed to the attention layers. Attention is moving information between parts of the network, and the MLPs are just multi-layer perceptrons applied in parallel at each position. They're also just called feedforward layers within a transformer.
But knowledge seems to be encoded in a kind of key-value store way in those feedforward layers, whereas the in-context learning stuff relies on computation happening in the attention layers. You still need the MLPs, but I wonder whether, in some imaginary network that you were making smaller, it could still do in-context learning while knowing far fewer facts.
[01:14:15] You would probably want to shift more of the model's computation towards attention and away from whatever's going on in the MLPs.
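A minimal sketch of the key-value picture Eric describes, in the spirit of interpretability work that treats feedforward layers as key-value memories. All the weights and "facts" here are made up; the point is only the mechanism: each hidden neuron has a key (its input weights) and a value (its output weights), and an input that matches a key writes the corresponding value to the output.

```python
# Toy MLP-as-key-value-memory, in a 3-dimensional "residual stream".

def relu(x):
    return max(0.0, x)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two memories: each key detects a pattern, each value is the fact written out.
keys   = [[1.0, 0.0, 0.0],   # fires for one kind of input
          [0.0, 1.0, 0.0]]   # fires for another
values = [[0.0, 0.0, 1.0],   # the "fact" stored under the first key
          [0.0, 0.0, -1.0]]  # the "fact" stored under the second key

def mlp(x):
    out = [0.0, 0.0, 0.0]
    for k, v in zip(keys, values):
        a = relu(dot(k, x))                       # how strongly this key matches
        out = [o + a * vi for o, vi in zip(out, v)]  # write the value, scaled
    return out

print(mlp([1.0, 0.0, 0.0]))  # only the first memory fires: [0.0, 0.0, 1.0]
```

Real transformer MLPs have thousands of such neurons per layer and distribute facts across many of them, but the retrieve-by-matching mechanism is the same shape as this sketch.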
01:14:27 - Anthony Campolo
Okay. What is the relationship between attention and context?
01:14:33 - Eric Michaud
The size of the attention pattern in naive self-attention layers grows quadratically with the length of context.
01:14:52 - Anthony Campolo
Okay. So there is a relationship between the two because, going back to our last conversation, this is one of the things I was trying to understand: how do the academic terms relate to the colloquial terms of people who use these models? You were talking just a little bit previously about Sam Altman talking about smaller models with huge context windows. I find that really interesting because I feel like the length of the context has a huge effect on what you're able to do with the model.
01:15:22 - Eric Michaud
Yeah. This is a hard problem. Lots of people are working on increasing context length. But to give a sense of what I mean when I say things grow quadratically: with the attention architecture people use, like in the original transformer architecture, if you increase the amount of context you put into the model by 1000x, then you get something like a million-x increase in the memory footprint.
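The quadratic blow-up is easy to check: naive self-attention computes one score per (query, key) pair, so the attention pattern over n tokens has n² entries. The context sizes below are just round numbers for illustration.

```python
# Quadratic growth of the naive self-attention pattern with context length.

def attention_pattern_entries(n_tokens: int) -> int:
    # One score per (query, key) pair over the context.
    return n_tokens * n_tokens

small = attention_pattern_entries(2_000)      # GPT-3-era context length
large = attention_pattern_entries(2_000_000)  # 1000x more context

print(large // small)  # 1000x the tokens -> 1,000,000x the attention entries
```

This is why long-context models generally need some modification to attention rather than simply running the original recipe on a million tokens.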
01:16:04 - Anthony Campolo
So it's like even more than exponential growth.
01:16:08 - Eric Michaud
Or it's like more than linear.
01:16:12 - Anthony Campolo
Okay. So it's more than linear, but it's less than exponential. I'm a total math noob, so I'm trying to figure out how.
01:16:18 - Eric Michaud
Exponential would make it completely impossible.
01:16:22 - Anthony Campolo
Exponential is just hockey sticks. So this is not quite hockey stick, but it's also not just like up and to the right. It's a little bit in between.
01:16:30 - Eric Michaud
Yeah. So GPT-3 had like, I think, 2000 tokens of context. Claude 4 Sonnet now has like a million tokens of context. Whatever Google and Anthropic and others are doing to support these long contexts, probably there's some change in what the attention is doing.
01:16:57 - Anthony Campolo
But we don't know. Going back to this question, we just never know. For closed-source models we can only speculate, right?
01:17:04 - Eric Michaud
Yeah, I mean, there's been a bunch of public work on different attention architectures. So like the DeepSeek architecture used a slightly different attention, although I don't know if that actually gets around the quadratic scaling issue. But I think if you're going to have extremely long context, you do need some sort of approach for reducing the size of the attention pattern.
01:17:41 - Anthony Campolo
Okay.
01:17:44 - Anthony Campolo
I feel like we haven't honed in on this question of scaling laws too much yet. It's kind of floated around. That's a very interesting question. So what have you learned in terms of your work, or what questions have you explored around the scaling laws in particular?
01:18:05 - Eric Michaud
Yeah. There are a bunch of different models of scaling laws. The model that I've contributed to is one where you imagine that there's just a huge number of different tasks in the data that the network has to learn, and the effective scaling is just to learn more and more of these niche tasks.
So you learn more and more mechanisms, each relevant on a smaller and smaller subset of the data. I'm not sure that's the only thing happening when we scale, but if it were, the effect of scaling would just be to learn these much more niche pieces of knowledge or skills.
Then it maybe wouldn't be so surprising that it seems like pre-training is kind of plateauing, because the situations in which you would notice a difference in the model's intelligence would require you to be asking some extremely niche questions in a way that maybe most users wouldn't notice.
01:19:18 - Anthony Campolo
Right. So this is actually one of the things you had first texted me about wanting to discuss: where does reinforcement learning enter the picture, in contrast to pre-training? The sense I get is that you pre-train up to a certain point, and then once you've kind of run out of text on the internet, you start doing this reinforcement learning thing, especially reinforcement learning from human feedback.
Then you just have humans basically rate answers whether good or bad. And then that is an extra signal that is given to the models. And I think there's a lot of things that are fraught about that process in terms of what are those humans training them to do versus not do.
01:19:59 - Eric Michaud
Like being sycophantic.
01:20:01 - Anthony Campolo
Yeah, exactly. Or, you know, have a certain opinion about certain topics that may be the socially acceptable answer, but actually incorrect in a strict sense.
01:20:11 - Eric Michaud
Mhm. Yeah. I know what you mean.
So I don't think that there's a fully satisfying answer mechanistically for how reinforcement learning changes the internals of models. But one thing that, if you're doing long chain-of-thought type reasoning and you're training the models to do this kind of step-by-step chain-of-thought reasoning, is that it increases the amount of compute that the models can put into answering any individual question.
So if you just gave ChatGPT a question and said, just give the answer and nothing else in your response, then it has to do all of its thinking in a single forward pass, which is a very limited amount of compute. But if you let the model think for a while first, it's able to have a much deeper compute graph and can spend more compute at inference time.
[01:21:17] Right, this is called inference-time compute: spending compute at inference to figure things out. And pre-training, at a very high level, is all about spending a bunch of compute ahead of time to get a model that can answer things while spending very little compute at inference time. But then you're constrained in what you can learn, because some of the things you might like to learn actually require more compute than is present in a single forward pass.
So those things just can't really be incentivized to be learned during pre-training on its own. But with this whole chain of thought, you give the model the ability to express much deeper computations, and then it's able to learn things that take advantage of that.
01:22:07 - Anthony Campolo
Yeah. And this gets at something I was asking about earlier: to what extent are the experiments you can even run constrained by lacking the compute that the frontier labs have access to, which is billions of dollars worth? No academic is going to have access to that amount of compute unless they partner with the companies, you know.
01:22:34 - Eric Michaud
Yeah. The nice thing about interpretability is that, at least if you're already there, if you're interested in understanding the model, you don't have to train. You don't have to necessarily do any training. I mean, there are people that do interpretability where they want to understand the training process. And so you might do some training or whatever.
But it's nice as an academic because you can just take an existing network. If it's an open-source model, it's like a brain that's already formed, and you can just study it.
01:23:00 - Anthony Campolo
Right? Yeah.
01:23:01 - Eric Michaud
People... I mean, I haven't done any RL on language models myself, but there are simple tasks where you can use RL yourself to get a model to do a little chain-of-thought reasoning. I think in Percy Liang's language model class at Stanford, one of the assignments involves doing RL to get some open-source language models to be better at math.
So it's not completely out of the scope of what some people in academia are doing. But probably the large labs are spending an immense amount now on RL. And also reinforcement learning environments are a little bit scarce for things that we really care about, like economically valuable tasks.
So I think one of the most interesting startups, for instance, right now is called Mechanize. There are other companies doing this too, but they're just building environments to train models via reinforcement learning.
01:24:09 - Anthony Campolo
Mechanize work. Yeah. Cool.
01:24:13 - Eric Michaud
Their essays are really good.
01:24:18 - Anthony Campolo
I was just actually grabbing all the links we've done so far so I can make sure I get them all in the YouTube description. Sweet. I can continue going down this list of ChatGPT questions. Or you could, if there's other topics you want to talk about or thoughts you had, we can kind of pivot to whatever.
01:24:41 - Eric Michaud
We can talk about whatever. I have...
01:24:44 - Anthony Campolo
Actually, something I did want to talk about was what we discussed last time: which models we were using. I'm assuming it's still just ChatGPT and Claude.
01:24:56 - Eric Michaud
For me, yeah.
01:25:00 - Anthony Campolo
Yeah. Last time we talked, about six months ago, Claude 3.7 and ChatGPT o3-mini, I think, were the most recent ones. We've now gone far beyond those. We're in various forms of GPT-5 and Claude 4 and 4.1.
Where do you sit on the Opus 4.1 versus Sonnet 4 debate? This is a big one. I know a lot of people who say Sonnet 4 is better than Opus 4.1.
01:25:29 - Eric Michaud
Really? I guess I don't know how much I've actually used 4.1 for coding. I don't notice much of a difference when I ask it about other things, but I think they mostly advertised it as being a coding improvement, right?
I mean, from a company strategy standpoint, it's noteworthy that I think they announced that the same week or maybe the same day as one of the OpenAI announcements. I don't know if it was the open-source models or if it was GPT-5. Maybe it was the open-source models, but it was the same week.
01:26:03 - Anthony Campolo
Each of them are just waiting to release models, and as soon as one pulls the trigger, the other one pulls the trigger. I feel like this has happened actually multiple times.
01:26:14 - Eric Michaud
Yeah, I think they felt... I have no insider sense of this, but probably they felt like, oh, there's a bunch of competition here. OpenAI is about to release a model that's trying to be competitive with us at coding, which is Anthropic's main specialization. They're like, we have the best coding models, so this is a huge threat to the business.
And they're like, oh my gosh, we have to release something to have people not go online and be like, oh my God, it's so over, like the kind of hype thing. And so then they release a model, they call it 4.1. And then they say, we expect to release much larger improvements in the coming weeks. But how many weeks does that even refer to? It's probably months, but who knows?
01:26:55 - Anthony Campolo
Yeah. Well, it's funny because in some sense it almost wasn't even necessary because a big part of... and this is my next question for you... a big part of the debate I saw online was people saying that GPT-5 even was not better than Sonnet 4.
Even with this newest release, a whole bunch of people had a ton of issues with GPT-5, and part of that had to do with the router: people not knowing which version of 5 they were getting. This is a problem that has plagued OpenAI forever, just naming these freaking models.
Because if you go into the UI, you have Instant, Thinking, and Pro, and if you go into the API, you have regular, Mini, and Nano. So there's like seven different versions of GPT-5, allegedly. So when people talk about GPT-5, it's not clear at all what people are even talking about.
01:27:54 - Eric Michaud
Yeah, this feels better than it used to be, though.
01:27:58 - Anthony Campolo
Before it was like 4o versus o3. People are like, wait, o3 is newer than 4o.
01:28:06 - Eric Michaud
Yeah. Because now it's just two axes, right? There's reasoning length and there's model size. And you can kind of choose within that, right?
01:28:15 - Anthony Campolo
I don't think that's actually clear in the naming.
01:28:19 - Eric Michaud
When you do an API request to GPT-5, can you do like GPT-5 Nano but with high reasoning, or are these not completely orthogonal?
01:28:31 - Anthony Campolo
That's a good question. I know you can change the length of output tokens and reasoning tokens. I don't know if you can turn on reasoning or turn off reasoning for Nano. And I think this is a big part of the problem. It's just not entirely clear now, especially now that reasoning models are in the mix, what you're getting.
Because right now what happens is there's something called Auto. So this is what really adds a lot of confusion. When you do Auto, you give up the ability to select a model to the AI, and it decides how long to think, and it routes between different things.
I think this is useful for two groups. It's useful for the companies themselves, because they can try to spend less compute per answer. And it's useful for people who don't want to think very hard about what system they're interacting with. For someone like you and me, and for the power users, it's just never useful to abstract this away.
You always want to know as much as possible. What is the specific model you're using? When are you selecting a bigger model versus a smaller model? When are you making the trade-off of speed versus complexity of response?
So I get why they're going in this direction, but I think a lot of the frustration users have is about this removal of the user's autonomy to select which model they're getting.
[01:30:22] Like, exactly.
01:30:24 - Eric Michaud
I mean, isn't there a speculation that it's just doing multiple parallel rollouts and then giving you the best answer as judged by some reward model?
01:30:32 - Anthony Campolo
I don't know. You would know better than me. I've always assumed that with the pro models, they just let it run longer, and by having it think longer and come up with more iterations or more plans or whatever, it ends up giving a better answer.
Because we've had this in the past: o1 Pro thought longer and allegedly gave better responses. So that makes sense. But then when you add the extra axis of Instant, Thinking, and Pro, I'm like, okay, is one thinking longer and one thinking even longer?
01:31:15 - Eric Michaud
Yeah. Yeah, I don't know. Yeah, okay.
01:31:21 - Anthony Campolo
And that's the problem: no one knows.
01:31:27 - Eric Michaud
Yeah. Do you think Anthropic is better about this? Maybe a little bit.
01:31:32 - Anthony Campolo
Well, Anthropic just has fewer choices. You have Sonnet 4 and you have Opus 4.1, and then they deprecate their models faster. They've generally been better because they've also had a linear progression of model names. They didn't have three, then four, then start back at one and then skip two. That alone is what made ChatGPT so confusing, so the fact that they finally dropped that has at least simplified things. But now you have a single number with seven different versions, and you have to understand how all those fit together, whereas you just don't have this issue with Claude.
And I think with Claude, every one of their models, you can turn on and off extended thinking, which is like the reasoning. So if that's the direction ChatGPT is going, I think that's good because that gives you the ability to just toggle it on and off of the different models. So I think if you can get to the point where you just toggle on and off reasoning, that's ideal.
[01:32:34] And then you have a linear progression of models that get bigger, better, and slower versus smaller models that are faster. That's something that can make sense, and that's kind of where Claude has gone. I think some of the thinking around why Sonnet 4 is better than Opus 4.1 is that it's smaller and faster: Opus 4.1 isn't smart enough to justify the slight hit in latency. But also, most people don't even run benchmarks on these things. They don't really know which one's faster or slower; they just kind of go off the vibes, you know?
01:33:09 - Eric Michaud
Yeah. Well, I mean, the speed is important subjectively. It's important practically because you don't want to be waiting all day for a response. But also, I don't know, have you ever interacted with one of the APIs that uses different hardware, like Groq or Cerebras?
01:33:30 - Anthony Campolo
It's so much faster. It's almost done. It's crazy. Yeah. I have a friend, Theo, who has created this thing called T3 Chat, which lets you have an incredible fine-grained ability in which models you choose. So people can always choose ridiculously fast models.
01:33:47 - Eric Michaud
Yeah, I just remember first using Llama 3 8B on Groq. It's like 1000 tokens per second, and you'd be like, okay, could you whip up a little website for me that does this? And then it's done in a second, and it feels smart. I mean, it feels like superintelligence.
One of the things that Bostrom wrote about was speed. Speed superintelligence is a type of superintelligence. You're not necessarily dramatically smarter in your ability to think about things totally differently, but if you're just much faster, it's like a kind of superpower. And so it makes sense that people would prefer the faster model.
01:34:20 - Anthony Campolo
Totally, yeah. I have a buddy who just tweeted like a day or two ago. He's like, the biggest bottleneck right now is inference speed. It's like if we just sped up inference speed, the models wouldn't even have to get smarter, necessarily, because of how much more useful it would be.
I honestly have started integrating my coding work with my household chore work to where I'll fire off a query that will take a minute or two to get a response, then do dishes for like two minutes, and then go back to my computer. It's like the compile time. We're back to compile time. You know, where people used to have to wait two minutes for their code to compile.
Some of the models that are the best just take a minute or two to give you a response. Some people just won't accept that, whereas other people will be like, if that's going to be the difference between it actually writing code for me that works versus not, then I'm going to wait two minutes. So it's always a question of how complex is the work you're trying to do with the LLM.
01:35:24 - Eric Michaud
Yeah, it makes sense that you'd be willing to wait a couple of minutes for a great response. It's going to get crazy, though, with the extremely long contexts. If you have models with improved context length doing extremely complex tasks... I mean, just during training, in order to train the model to use its super long context, you actually have to do a rollout that uses up all that context trying to do some complex task, and then try to get better at it.
But even doing the rollout, if you're generating a million tokens at 100 tokens a second, that's 10,000 seconds. So it's a multi-hour rollout, like a few hours.
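Eric's back-of-the-envelope rollout arithmetic, written out:

```python
# Time for one rollout: a million-token generation at 100 tokens/second.

tokens_per_rollout = 1_000_000
tokens_per_second = 100

seconds = tokens_per_rollout / tokens_per_second
hours = seconds / 3600

print(seconds, round(hours, 1))  # 10000.0 seconds, about 2.8 hours
```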
01:36:04 - Anthony Campolo
Can you define what you mean by rollout in this context?
01:36:07 - Eric Michaud
Just like sampling. Like, the model just goes off and tries to do something by thinking a lot and writing code and running things or whatever it's doing. The rollout is just generating tokens. So I just mean during generation.
01:36:22 - Anthony Campolo
For it to get to the end of a single response, basically. And that response could be a long amount of time.
01:36:28 - Eric Michaud
Yeah. I think we could imagine right now they're trying to get models to think for hours. And this just means outputting hundreds of thousands of tokens or more in the process of thinking about something.
01:36:43 - Anthony Campolo
Yeah. And that's a big thing for agentic loops where these code editors will go off and spend like an hour writing all this code. And you're just praying that the answer it gives you at the end is correct. So is that part of the issue, that you don't really know whether it's going to give you a good output or not until it's actually spent those hours and given you the response?
01:37:06 - Eric Michaud
Yeah, I guess. There's something about this that's just not parallelizable. You actually just have to wait. If you're at only 100 tokens a second, you can't just, I mean, you can increase the batch size, and that is a way of making things more parallel. But there's some amount of unavoidable waiting, either during inference or during training.
And I guess here I'm thinking about during training. It's like you're doing this over and over and over again, and how many times a day can you even update the model? If you do a rollout, it takes hours. Then there's maybe a bit of a limit to that.
01:37:50 - Anthony Campolo
Right. Yeah. That's a bit of a tangent, but something that I wonder about is that they release these models with new version numbers, like 4 versus 4.1, or GPT-4 versus GPT-5, but then those models will still be updated. If you go into the API every couple of months, they'll have changed the date on them because there'll be a new one.
So the models that seem to be static and have the same version number will change, and users have keyed in on this. They notice when there are behavior changes, but then it's always this question of: has the behavior changed, or is it a placebo-type thing? We don't always know whether the models have changed or not. Sometimes we feel like they've changed, but they really haven't. And sometimes they do change, but we don't think they have.
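One way to move this question off vibes is a pinned personal benchmark: the same prompts run against each model snapshot, with responses recorded so drift is detectable rather than felt. A minimal sketch, where the canned responses stand in for real API calls:

```python
# Sketch of a pinned "personal benchmark" for detecting model behavior drift.
# The response lists below are canned stand-ins for real API calls.
import hashlib
import json

def fingerprint(responses: list[str]) -> str:
    """Stable short hash over an ordered response set for one snapshot."""
    blob = json.dumps(responses, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:16]

def drifted_prompts(old: list[str], new: list[str]) -> list[int]:
    """Indices of prompts whose responses differ between two snapshots."""
    return [i for i, (a, b) in enumerate(zip(old, new)) if a != b]

# With greedy (temperature-0) decoding, a matching fingerprint is decent
# evidence the served model is unchanged; a mismatch localizes what moved.
march = ["Answer A", "Answer B", "Answer C"]
june  = ["Answer A", "Answer B*", "Answer C"]
print(fingerprint(march) == fingerprint(june))  # False
print(drifted_prompts(march, june))             # [1]
```

One caveat on the design: even at temperature 0, serving stacks are not perfectly deterministic (batching and kernel nondeterminism), so in practice you would compare graded scores per prompt rather than exact strings.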
01:38:40 - Eric Michaud
All this stuff is super weird. It's like people saying that the models have gotten nerfed and then the employees being like, nothing has changed. Yeah.
01:38:51 - Anthony Campolo
Like literally we have not written a single new line of code.
01:38:56 - Eric Michaud
You know, like, yeah. There's a weird subjectivity to this one. Sometimes when there's a new model, people will want to be really hyped, and if there's a lot of hype, then you get a good response from the model and you're like, oh my God, this is actually such an improvement.
But then maybe objectively, or when you revisit the model later and the hype dies down, people are like, oh my God, they nerfed it now. But it's maybe the same model. And yeah, I guess really good benchmarks are just really important, but I don't know if we have really good ones.
01:39:32 - Anthony Campolo
Yeah. And I also find that the only people who seem to really have a good handle on this, and whose opinion I tend to take seriously and trust, are people who use models every day to do their work. Because if you're doing that, you will have a sense, based on your own use cases and how complex they are, of what is good and what is bad.
Which models do well, which models don't quite cut it, and when new models come out, whether they're better than other ones, because you have a very concrete set of requirements that you're coding against. Or whatever work you're doing might even just be copywriting work, and it's like, you know, 4.1 comes out and it all of a sudden writes way better direct-response copy. So I tend to follow people who are just using these models on a regular basis, and when they get really hyped about a new model, I'm like, okay, this model might actually be more useful and better. But then I'm also one of those people.
[01:40:30] So I don't even really particularly need anyone's opinion because I am forming it myself based on me just using the models. I don't really care what people say the best model is because I'm deciding myself through my own use cases.
01:40:45 - Eric Michaud
Mm. Yeah. Some people have their own little internal things that they always try the new models out on. And I just like that.
01:40:58 - Anthony Campolo
Coding up a front page of a website, you know, like you have some copy for a thing you want to sell, you have it just create you a single index.html file and make it look good. I find that's just a really interesting task to give them.
01:41:12 - Eric Michaud
I think I saw one from Kelsey Piper, maybe, that was really good: you give the model a chess endgame puzzle, like a mate in one that actually doesn't have a mate in one. And you see whether the models are actually honest with you and realize that this is the case. And I think it wasn't until relatively recently that models started solving this.
01:41:39 - Anthony Campolo
Interesting. I think I found their tweet about this. It says, "[unclear] is the first AI to pass my personal secret benchmark for hallucinations and complex reasoning." So I guess now I can tell you all what that benchmark is. It's simple. I post a complex mid-game chessboard and mate in one. The chessboard does not have a mate in one.
Yeah, this is it in the chat. That's super interesting because I'm of two minds about these types of things. I think that it's really useful to find the limits of a model's capabilities, and usually that requires doing something very contrived. But I also find that the more you try and find a contrived thing it fails at, the further that gets from whether it can actually do something useful for you in the real world. Do you know what I mean?
That's why I follow people who work with them day to day. They have specific tasks like, you know, write this new feature for me, and the feature either works or it doesn't.
[01:42:40] That's something that's very concrete versus like, can this thing create an image of a wine glass where the wine glass is completely full instead of just half full? That was the biggest challenge for image models for the longest time. And I'm like, but does that actually matter?
01:43:02 - Eric Michaud
Yeah, I feel like, at least in Anthropic's marketing, they're like, you know, we really just care about the real-world utility of the models.
01:43:16 - Anthony Campolo
And I think their success has borne out that strategy.
01:43:22 - Eric Michaud
Yeah. It's also harder, it seems, to create environments to train on. It's pretty easy to train on puzzles. It's easy to grade. Software engineering seems a little bit harder.
01:43:34 - Anthony Campolo
This is such a broad problem set.
01:43:40 - Eric Michaud
Yeah. Probably there's going to be a lot of work over the next few years going into figuring out how to do RL to make the models practically useful, not just really good at competition math, which they are now really good at, like with the IMO results recently and all that. But how can you actually replace people at their jobs? It seems hard to figure out how to train the models to do something like that. People are going to try, and I know people who are. Right.
01:44:11 - Anthony Campolo
So yeah, we can kind of start wrapping it up here with some last things. Are there interesting techniques or works or research areas that people should be looking at in terms of what you just said?
01:44:30 - Eric Michaud
As I mentioned earlier, there's a company called Mechanize. I think their essays on this topic are really good. There are other companies creating environments. Yeah, I don't have that much to say. I feel pretty uncertain right now about what the next few years are going to look like, how quickly the progress is going to happen. I feel very.
01:44:56 - Anthony Campolo
Uncertain.
01:44:56 - Eric Michaud
You said uncertain. I don't know. Yeah, yeah, yeah.
01:45:00 - Anthony Campolo
I think that's the only really intellectually legitimate way of framing it. This is why I always have issues with people who say AGI is X number of months or years away: first, it's not clear you even have a definition of what that means. And secondly, you don't know what the next three to six to nine months are even going to look like, because the field is moving so fast.
01:45:29 - Eric Michaud
Yeah. Yeah, it feels like AGI, at least the way it seems people think about it, is just something different than the systems that we have now. It's not always super well-defined, but it's usually defined as.
01:45:53 - Anthony Campolo
What the model cannot do.
01:45:55 - Eric Michaud
Yeah. But if the thing that we're doing is scaling, like you just scale a little bit further on a slightly more diverse set of things, and then suddenly you have AGI, I don't know if I really buy that. It feels like we're just going to add more and more diverse capabilities to the models. They're going to get more and more useful, bit by bit. They are going to feed back on themselves a little bit and accelerate their own development at a certain point. But I don't know if we ever will, probably eventually, but I don't know really what AGI is and when we're going to get it. Many people are like, it's just two years, but.
01:46:39 - Anthony Campolo
AGI is when it can do my taxes.
01:46:42 - Eric Michaud
So maybe that's already happened if you prompt it right, if you give it the right tools.
01:46:49 - Anthony Campolo
I'm sure it could do my taxes better than most tax professionals at this point. That's another interesting one, real quick: when will these models get to the point where just asking ChatGPT will give you a better response than a doctor? A doctor with ChatGPT will probably always be the best. But compared to a doctor who doesn't use ChatGPT, would ChatGPT give you a better answer? I don't know. I feel like it's probably getting there on the margins.
01:47:24 - Eric Michaud
Mm. Yeah. We'll see, we'll see.
01:47:35 - Anthony Campolo
Awesome, man. Where should people look for you on the internet?
01:47:40 - Eric Michaud
Yeah. You can follow me on Twitter. My handle is ericjmichaud_. Or if you just Google me, Eric Michaud, it's M-I-C-H-A-U-D. I have a personal website with some of my other links.
01:47:59 - Anthony Campolo
Yeah. Google Scholar page too. And you have a LessWrong account. I'm trying to find your home page. It doesn't show up very quickly on the Googles. Oh, really? Oh, your CV does, though, which is on your domain.
01:48:18 - Eric Michaud
Oh okay.
01:48:22 - Anthony Campolo
There's that. There's a cool site, get seo.io. You can just have it run on your site and it'll give you some SEO improvements. You know, maybe you're missing a meta description or something.
01:48:33 - Eric Michaud
Yeah, that's weird. At least when I Google my own name, my personal website just comes up first.
01:48:38 - Anthony Campolo
I'm also using Bing, so that's a problem.
01:48:47 - Eric Michaud
Let's see. Oh yeah. There's a bunch of other junk on here, I think.
01:48:51 - Anthony Campolo
Especially if you leave out the J. Yeah.
01:48:54 - Eric Michaud
Yeah, I think maybe at a certain point it offered to make a knowledge panel about me, and.
01:48:59 - Anthony Campolo
This is what I was looking at. So you see here the second one has the CV. So I clicked that and then grabbed it. Oh, bro, you're still using Arc.
01:49:10 - Eric Michaud
Oh, yeah. You can see me. Did I just share?
01:49:14 - Anthony Campolo
No. I put you back on screen for a second. No. Are you sure? I think [unclear].
01:49:20 - Eric Michaud
Yeah. No. What did you recommend these days? What are you using?
01:49:25 - Anthony Campolo
I mean, I'm using Edge. I'm on a warpath against any Google tech, and I moved into a home with a Nest. I want to rip it out of the wall, but.
01:49:36 - Eric Michaud
Oh, man.
01:49:37 - Anthony Campolo
So I don't use Chrome, and I find that Edge is pretty much similar. You can do vertical tabs, so that gets you kind of the Arc-like feel. But Arc is being phased out, essentially. They're moving to the new Dia thing. At a certain point, you just wouldn't want to be using a browser that's not getting [unclear]. I'm sure Sabin is on Arc still too. That's funny, Sabin. If you have any questions, hit us with it. If you actually just sat through and listened to that whole thing, then... [unclear]. So when are you going to be done with your PhD? When are you going to become Doctor Eric Michaud?
01:50:35 - Eric Michaud
The hope is this semester, even sooner, like in a month, but we'll see. But like, on your.
01:50:51 - Anthony Campolo
Panel.
01:50:53 - Eric Michaud
Yeah, it's Max and this guy Ike and Mike Williams. It's three physics professors. But I met with them over the summer, and they said that at any time I could basically just staple things together into a thesis and defend. So, barring anything surprising, I should be okay.
01:51:13 - Anthony Campolo
Okay. Well, sweet. Well, I look forward to whatever that is. We can have you back on so you can talk about whatever you ended up shipping as your final thesis. That's super interesting to me. It's totally outside of my world of what I've ever done. So that would be interesting to talk about. But yeah, it's been a super fun conversation, like always. We'd love to have you back on anytime.
01:51:37 - Eric Michaud
Great chatting. Yeah, thanks so much.
01:51:39 - Anthony Campolo
All right. And we'll catch the rest of you guys next time. I'm going to be streaming, not this Friday, but the Friday after with Nicky T, and I'll be going into all the new improvements to the AutoShow CLI, which will be fun. That was one thing we didn't touch on today, multimodal stuff, which I do think would be interesting to talk about. But that is a teaser for next time. All right. Bye everyone.