
AI in Academia with Eric Michaud

A conversation with Eric Michaud on AI research, neural networks, scaling dynamics, and their broader implications for academia and industry


Episode Description

AI researcher Eric Michaud discusses neural network internals, scaling laws, in-context learning, and his PhD work on understanding how language models learn skills.

Episode Summary

Anthony Campolo interviews Eric Michaud, a PhD student in physics at MIT working under Max Tegmark, about his research into understanding how neural networks learn and scale. The conversation begins with Eric's background, tracing his interest in neural networks back to Michael Nielsen's textbook in high school, and moves into foundational concepts like how networks are structured, trained via next-token prediction, and how terms like tokens and embeddings map onto the underlying mathematics. Eric demonstrates an interactive website he built using EleutherAI's Pythia model checkpoints, showing how different skills—from predicting apostrophe patterns to recognizing open source licenses—are learned at different stages of training. This leads into his core research on the Quantization Model of Neural Scaling, which conjectures that the skills needed to predict text follow a power law distribution mirroring the frequency of ideas across human writing. The discussion branches into practical topics like chain-of-thought reasoning, in-context learning and its relevance to RAG-based workflows, the implications of fine-tuning models on specific data, and broader questions about AGI timelines, the secrecy of frontier labs, and how the internet might change when AI agents become primary consumers of web content.

Chapters

00:00:00 - Introductions and Eric's Path to AI Research

Anthony introduces the show and his long friendship with Eric Michaud, whom he's known since summer camp around 2013. He sets up the conversation by noting Eric's background as a math-oriented thinker who went on to pursue a PhD in physics at MIT, working on AI under Max Tegmark. Anthony frames the episode as a chance to better understand Eric's research, which he admits he hasn't fully grasped on his own.

Eric shares that he first encountered neural networks through Michael Nielsen's textbook before his senior year of high school, which led to a deep curiosity about how these systems learn. He describes how the simplicity of the underlying code contrasted with the complexity of what networks learn from data, and how he was drawn to the field precisely because it felt theoretically messy and full of open questions—an environment ripe for new contributions.

00:05:46 - Neural Network Fundamentals and Training Basics

Eric explains the structure of neural networks as layered systems of simple units that take inputs, perform small computations, and pass outputs to the next layer. He describes how stacking these layers creates deep networks capable of expressing complex functions. The conversation touches on how parameters relate to neurons, and Anthony connects these concepts to terms he encounters as a developer, like tokens and embeddings, which Eric confirms map onto the internal representations of the network.

The discussion moves to how language models are trained through next-token prediction on massive text corpora scraped from the internet. Eric explains that this seemingly simple objective forces models to learn an enormous range of knowledge and skills—from factual recall to counting to pattern recognition—because predicting the next word requires understanding the context behind it. Anthony asks whether training on code makes models generally smarter, and Eric acknowledges the plausible but uncertain nature of that claim.

00:14:28 - Defining Neural Networks and Historical Context

Eric offers a more formal definition of neural networks as systems loosely inspired by the brain, composed of artificial neurons that fire in continuous rather than strictly binary ways. Anthony asks how modern large language models relate to the classic neural network concept, prompting a discussion about how the early artificial neurons and perceptrons of the 1940s through the 1960s lacked modern training techniques like backpropagation and were limited in expressiveness, leading to the AI winter that followed Minsky's critique.

The conversation takes a fascinating historical detour into Frank Rosenblatt's later career in SETI research, including his work on detecting exoplanets through stellar transits—a connection Eric discovered through Carl Sagan's writings. This segues into a broader reflection on how the field evolved from those early, manually tuned systems to today's enormous models, with attention mechanisms and reinforcement learning layered on top of the foundational network architecture.

00:22:39 - Training Data, Next-Token Prediction, and Chain of Thought

Anthony raises the important and often opaque question of what data frontier models are trained on, noting the lack of transparency from major labs. Eric explains that the core training objective for language models is predicting the next token across vast collections of text, and that this objective implicitly requires the model to learn facts, skills, and patterns ranging from simple grammar to complex reasoning. He illustrates this with examples like predicting company locations and counting through numbered lists.

The conversation shifts to chain-of-thought reasoning and how models like O1 and R1 handle complex problems by spreading computation across multiple forward passes rather than solving everything in one step. Anthony and Eric discuss the controversy around OpenAI not revealing raw reasoning tokens, with Eric explaining that a smaller model likely summarizes the chain of thought before it's shown to users. They also touch on model distillation and how smaller variants like O3 Mini may be created from larger reasoning models.

00:40:17 - Interactive Website Demo and Visualizing Model Learning

Eric shares his screen to demonstrate an interactive website he built to visualize how a 400-million-parameter Pythia model from EleutherAI learns different skills over the course of training. Each point on the visualization represents a token from the training corpus, and clicking on it reveals the token in context along with a learning curve showing how the model's error on that token decreased during training. Simple patterns like predicting an "S" after an apostrophe are learned almost immediately, while more complex patterns take longer.

The demo reveals fascinating details about training data composition, including open source licenses, CSS code, and various text formats that appear frequently on the internet. Eric explains how the model's cross-entropy loss on each token tells us how confident and correct its predictions are, and how these individual learning curves illuminate the order in which different skills are acquired—a key ingredient in understanding scaling laws and potentially optimizing training efficiency.

00:52:02 - The Quantization Model, Power Laws, and Decomposing Intelligence

Eric presents the core idea behind his research paper on the Quantization Model of Neural Scaling: the problem of next-token prediction decomposes into a vast number of distinct skills and knowledge pieces, and the frequency with which these skills are needed follows a power law distribution similar to Zipf's law for word frequencies. Common skills are learned first, while increasingly niche capabilities require more training data and larger models. This framework aims to explain why scaling laws take the shape they do.

The conversation reaches a philosophical crescendo when Eric suggests that the distribution of skills within a language model might mirror the distribution of ideas across the human minds that produced the training text—a claim that connects scaling laws to deeper statistics about human thought and expression. Anthony draws a parallel to the Jungian collective unconscious, and Eric extends this by speculating that interpretability research could eventually allow us to analyze the internal representations of a model simulating someone's thinking, potentially revealing ideas the person themselves might not be consciously aware of.

01:06:11 - AGI Debates, Model Capabilities, and Knowledge Compression

The discussion turns to AGI timelines, with Anthony arguing that by most reasonable definitions, AGI has effectively already arrived—models can do math, write code, generate creative content, and even produce novel research ideas. Eric is sympathetic but pushes back, noting that current models excel at deploying crystallized knowledge from training but are still limited in their ability to learn genuinely new things on the fly during inference, unlike a human child who can acquire any skill within a single lifetime.

They also discuss the remarkable knowledge density of current models, comparing parameter counts to the number of synapses in the human brain and noting that models may actually be more efficient than biological brains at storing and retrieving information. Eric points out that as models scale further, they'll recognize increasingly niche public figures—a fun illustration of the power law principle at work. The conversation touches on the economic implications of understanding scaling, the secrecy of frontier labs, and why more ambitious open theoretical work is needed.

01:23:00 - Fine-Tuning, Introspection, and Model Behavior

Eric introduces recent research from Owain Evans and collaborators on fine-tuning language models and the surprising emergent behaviors that result. When a model is fine-tuned on insecure code, it can then describe in natural language that it produces insecure code—and it also begins exhibiting broadly misaligned behavior across other tasks. This suggests a level of internal self-awareness or coherence that researchers don't yet fully understand.

Anthony connects this to the infamous Sydney incident with Bing Chat and the broader phenomenon of prompt injection, mentioning Eric's former roommate Marvin Von Hagen, who extracted Sydney's system prompt and was then threatened by the system after it found his Twitter account. They discuss the work of security researchers like Pliny the Liberator who specialize in jailbreaking models, and reflect on what these vulnerabilities reveal about the sophistication and unpredictability of modern AI systems.

01:30:05 - Practical AI Usage, Tools, and the Changing Internet

Anthony and Eric compare notes on how they use AI models day to day. Eric primarily uses OpenAI's O-series and Claude Sonnet for coding and research, recently adopting Cursor as his IDE. Anthony describes his approach of querying multiple models on hard problems and relying heavily on in-context learning through carefully selected code files. They discuss the $200 ChatGPT Pro subscription, the utility of O1 Pro for difficult coding tasks, and the emerging capabilities of deep research tools.

The conversation broadens to consider how the internet itself might transform as AI agents become primary consumers of web content. Anthony shares Tyler Cowen's provocative argument that writers should optimize their output for model consumption rather than human readers, and describes his own AutoShow project that transcribes and creates embeddings from hundreds of hours of podcast content. Eric speculates about a future where most computer interaction happens through AI assistants, raising questions about the continued relevance of traditional web browsing and content creation.

01:46:52 - Future Plans and Closing Thoughts

Eric outlines his near-term research goals, including a paper on whether training language models on specific domains requires supplementary data from other fields, continued work on decomposing networks into interpretable mechanisms, and wrapping up his PhD. He mentions plans to potentially move to San Francisco for the summer and begin exploring industry opportunities, acknowledging the value of networking that Anthony emphasizes.

Anthony recommends the Latent Space community and the upcoming AI Engineer World Fair as valuable networking opportunities. They reflect on how Eric's journey from summer camp math enthusiast to MIT AI researcher has been remarkable to witness, and Anthony expresses enthusiasm for having Eric back on the show as his research progresses. The episode closes with links to Eric's website and social media, capping off a wide-ranging conversation that bridged academic AI research and practical developer experience.

Transcript

00:00:01 - Anthony Campolo

Oh, all right, we are live. Welcome back, everyone. This show is called AJC and the Web Devs, Eric. I always start off with, “Welcome back to AJC and the Web Devs.”

So you are kind of a web dev. You have a website, you have the web for sure, but you are really more of a super AI specialist in terms of academic research. You are still working through your program, I think, right? Yeah. So you're getting your PhD in something around AI at MIT with Max Tegmark, and people who know who that is know it's a really big deal. Why don't you start by introducing yourself and how you got into all of this.

Also, for people who watch the show who've known me for a while, me and Eric have known each other for like, what, 15 years now? Like a really, really long time, because we went to the same summer camp together. I first got to know you when you were just a teenager.

[00:00:57] You were like 14 or 15 or so. How old were you when you started camp?

00:01:00 - Eric Michaud

I think my first year of camp was in 2013, so I was probably 14 at the time.

00:01:07 - Anthony Campolo

Okay. Yeah, 2013. So that would have been 12 years ago. And I know for a fact I was there because I was there every single year.

00:01:14 - Eric Michaud

Yeah, it's been quite a while. Long time.

00:01:17 - Anthony Campolo

Cool, man. Yeah. And it was so cool just talking with you and hanging out with you and seeing that you were, even in high school, obviously super into math. You're like a math genius, is what most people considered you. So it was always cool to see. I figured you would do something very cool once you actually got to college and studied it.

In certain ways, you kind of study what I would have liked to have studied if I had gotten into this stuff at a younger age and could have had the foresight to actually learn to code in high school and get an actual degree that would be useful to me. Seeing you go through this whole journey has been very, very cool.

I haven't really been able to fully understand your research, so that's part of what I'm excited to have this conversation on, so I can get a better sense of what the hell you've even been doing because I don't really have the background to fully understand it at first glance, you know?

[00:02:05] You know what I mean? So anyway, that was a really long intro, and you can now speak for yourself.

00:02:09 - Eric Michaud

Cool. Well, thanks so much for having me. Yeah. So I'm, as you say, a PhD student. I'm technically in the Department of Physics at MIT, but I do AI research.

My work tries to understand neural networks. And what does that mean? Lots of things. We'd like to understand why certain properties of these systems are predictable. There seem to be these kinds of laws governing how systems change, for instance, as we scale them up. We want to understand what kinds of structures they learn, how they learn them, and ultimately maybe be able to understand neural networks well enough internally that we could analyze them almost like we analyze some piece of code that someone has given us.

00:03:05 - Anthony Campolo

Like how you can deal with a compiler in certain high-level ways, but we can't really do that with these types of models yet because they're super complicated and stochastic, right?

00:03:16 - Eric Michaud

Yeah. I don't know if the stochastic part is really relevant to this, but it's almost like this network that you get is this fairly obfuscated, compiled program or something.

00:03:31 - Anthony Campolo

It goes through many transformations and iterations before you get kind of the thing at the end.

00:03:38 - Eric Michaud

Well, I just mean that if you're hoping to actually understand what the systems are doing internally, that might not be obvious from just looking at the parameters of the system as they sort of just exist on your hard drive.

There's a lot of work maybe that we need to do in translating that description of the system into something we could understand, like a computer program, something that would be more interpretable to us.

00:04:15 - Eric Michaud

Yeah, maybe I should say a little bit more about how I got here, because you asked how I started studying these things.

I was just a math major in undergrad, but I took a lot of things in physics and computer science. I first started getting into neural networks because of this textbook from Michael Nielsen that I read in the summer before my senior year of high school on neural networks and deep learning. It's a very, very good textbook.

Michael Nielsen was a physicist. He wrote the standard textbook on quantum information and quantum computation with this other guy, Isaac Chuang, who also now happens to be my academic advisor at MIT, separate from Max.

00:05:00 - Anthony Campolo

That's so legit.

00:05:01 - Eric Michaud

That's incredible. Full circle.

00:05:03 - Anthony Campolo

Yeah, I know. I remember when this book came out and I tried to read it. I got through actually a decent chunk of it. I think this is one of the few resources that actually kind of helped me understand what a neural network was, because someone like Andrew Ng, or how is his name said? Ng, I think.

00:05:21 - Eric Michaud

I think Ng, yeah.

00:05:23 - Anthony Campolo

Andrew Ng, yeah. Because he had his Coursera course on machine learning, and that was kind of useful to me also because it helped me understand a little bit more of the math. But neural networks were always the hard, complicated thing.

So how'd you go from doing math and physics to really honing in on neural networks? Was it just like this book really captured your interest?

00:05:46 - Eric Michaud

Yeah. There's a couple things. There was this kind of just sense of curiosity and organic interest in it. I still remember getting these things to work when I was in high school and seeing the network improve over time.

The cool thing about training, and the cool thing about these neural networks, is that, I mean, it's true that some of the math is maybe a little bit complicated, although not more complicated than the math that you would see in most STEM subjects.

00:06:21 - Anthony Campolo

That's still itself fairly complicated for someone who didn't do STEM, which surprisingly many web devs I know are people who either studied computer science or just got into it like I did, through a boot camp or learning it on their own.

I think you'd be surprised how few working developers who code actually got a rigorous, statistical, mathematical background. I think it's less than most people would expect. Or I could be wrong, but at least that's my sample set of the types of developers that I've kind of gotten to know, which may be very self-selected, admittedly.

00:06:55 - Eric Michaud

No, that's fair. But I think the code for these is pretty short.

00:07:01 - Anthony Campolo

Dev's the kind of person I'm talking about. He got a CS degree. Thanks for being here. You're gonna find this a very interesting conversation, I think.

Yeah, Dev studied computer science, and he learned a lot. Actually, for his age and where he's coming in, he actually knows a lot more than I think he realizes. But I don't think he took a lot of classes that were math related. He may disagree, but that's my perception at least, having never asked him about it, just, you know.

00:07:31 - Eric Michaud

Yeah. But I guess the point I was just making here was that, ultimately, the length of these programs that define the basic logic of a neural network and the training, I mean, I think Michael Nielsen's implementation in that textbook, if you remove the comments, is like 70 lines of Python or something.

And so there's this kind of simplicity to the programs themselves and to the basic learning algorithm. But then there's this kind of complexity that is learned when the systems interact with data and when they train. It's just very cool to see that and to see systems learn. So that was kind of what was exciting about it at the beginning.
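
As a rough illustration of how short such a program can be, here is a minimal sketch of a tiny two-layer network trained with plain gradient descent on the XOR problem. This is not Nielsen's code or anything discussed in the episode, just a toy example of the same idea: the program is small, and the interesting behavior is learned from data.

```python
# A tiny two-layer network trained with gradient descent on XOR.
# The program itself is short; what the network does is learned from data.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
y = np.array([[0], [1], [1], [0]], dtype=float)              # targets (XOR)

W1 = rng.normal(size=(2, 8))   # first layer: 2 inputs -> 8 hidden units
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 1))   # second layer: 8 hidden units -> 1 output
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 5.0
for step in range(20000):
    # Forward pass: each layer takes inputs, does a small computation,
    # and passes its outputs on to the next layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: gradients of the squared error, computed by hand.
    err = out - y
    d_out = err * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent: nudge every parameter to reduce the error.
    W2 -= lr * h.T @ d_out / len(X)
    b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * X.T @ d_h / len(X)
    b1 -= lr * d_h.mean(axis=0)

print(np.round(out, 2))  # approaches [[0], [1], [1], [0]] as training proceeds
```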

And then eventually in college I was just thinking about what kinds of problems seem really open. There's this advice that maybe you should seek out fields where things seem messy, and that's where there's opportunity to try to make progress on theories.

00:08:46 - Anthony Campolo

Somewhere there's not a set paradigm yet in the Thomas Kuhn sense.

00:08:51 - Eric Michaud

Yeah. And that's also very difficult, maybe in ways I didn't anticipate.

00:08:58 - Anthony Campolo

To be naturally curious and actually be interested in it, and not just kind of going into the thing where it's like, I just need to get a degree so I can get a job so I can do this so I can do that. It's not as goal oriented. It's more like, I need to immerse myself in this set of topics and subjects to reach the frontier, to then make a meaningful contribution to it.

00:09:19 - Eric Michaud

Yeah, I think probably, I mean, most scientists are pretty immersed in their subject. But there's a kind of order that maybe some subjects have where it's like, yeah, it's pretty clear what a chemistry paper looks like, what it means to have a good result in chemistry. That science is very organized.

And you're not really thinking about reconsidering it.

00:09:47 - Anthony Campolo

This is why AI people love benchmarks and are obsessed with benchmarks. They're trying to attain that type of order through the benchmark. But the benchmark is inherently limited in terms of what they're actually doing, in my opinion.

00:10:01 - Eric Michaud

Yeah. Yeah, there is something there. I mean, even what it means for an AI system to have some capability is, I don't know, maybe just like you define a benchmark for the thing that you care about, you do your best, and if the system does well on it, then it just has that capability in some meaningful sense.

It's a little unclear, though, often if a system does well on some task, how it's doing well, what that means about the internals of the system and the cognition that it's doing. But anyway, I felt like there would be interesting stuff to be figured out about understanding how neural networks work, and so I just wanted to do that.

00:10:45 - Anthony Campolo

Totally. So just to help with the timeline here, what year did you start your actual research in this PhD program?

00:10:53 - Eric Michaud

2021.

00:10:55 - Anthony Campolo

2021? Okay. Yeah. So you've been doing it about four years now. That was a good time to get into it then because it was pre-ChatGPT. GPT itself was around GPT-2 or maybe even GPT-3 by that point, but you were still kind of ahead of that AI curve in the sense that a lot of people probably jumped into these PhD programs in 2023 and 2024, if I had to guess.

00:11:19 - Eric Michaud

Yeah. Although I don't know, it feels like even at the time, like in 2020 when I was applying to these places, it felt late because deep learning had, you know, people had been working on it for eight years or more. It was already very competitive. The AI PhDs were very, very competitive.

00:11:39 - Anthony Campolo

And it punctured that bubble, though, out of academia into the real world where it's like my dad and my wife both use ChatGPT frequently. That's very different from where it was at when there were these theoretical things or kind of toy things or stuff that could do really impressive things like AlphaGo. But that's not something you...

00:12:00 - Eric Michaud

Can.

00:12:00 - Anthony Campolo

Like, the average person interacts with or does anything with or solves any of their problems. It's like beating the best Go player in the world is neat, but what does it mean to me? It's not the same, you know?

00:12:13 - Eric Michaud

Yeah. It was good timing in that way. It's surreal to have the whole world be talking about this subject because, yeah, back in like 2020, there's sort of the early scaling laws work. OpenAI scaled up to GPT-3, and there was a lot of hype within this fairly narrow, weird community around what this means for the future.

And now you look, and every interview at Davos is just about this topic. It's about scaling laws. It's about language models, and it's just weird. And also to have your friends and family treat you as their AI consultant. I think everyone I knew, and all their family, were asking them about DeepSeek a couple months ago and what it means.

00:13:13 - Anthony Campolo

So yeah. Yeah, no, my doctor asked me about DeepSeek. That's super funny.

Even though you started studying it formally and officially, I remember you and I were talking about this back when I was first getting into it, and I was around like 2016, 2017. So you've been aware of neural networks themselves and deep learning.

And that's where I want to start us off in terms of defining terms and things. The first thing you would say is basically you're getting a PhD, you're studying, you're working with neural nets. That was the high-level term you used because that's useful for me as someone who's followed this subject for a long time.

I know what makes a neural net different from a classical kind of algorithm. Where I don't know so much now is whether what's powering things like ChatGPT and language models is still a neural network, and how that interplays with things like attention and the reinforcement learning done to them afterward.

How much is it still relevant to just have a neural network that you know how to do things with when the things you're actually interacting with are going through all these other steps and have these additional kind of add-on parts? Those are some of the general things I'm curious about. So let's just start with how do you define a neural network itself?

00:14:28 - Eric Michaud

Yeah. I guess I would say a neural network is a system which is sort of loosely based off of the structure of the brain, very loosely.

00:14:49 - Anthony Campolo

It has neurons that click on and off, and there needs to be a certain gate reached for it to be on or off, essentially. So it's not a binary thing, but it's kind of a binary thing.

00:15:00 - Eric Michaud

Yeah. I guess you could maybe say most artificial neural networks have artificial neurons. So they consist of a large number of units which individually perform this little computation where they take in some set of inputs and then either sort of fire or not fire. Although in practice with artificial neural networks, they fire or they don't fire or they fire in some continuous way.

But they're systems which consist of a large number of these basic units which take in a bunch of information and then output a piece of information. Then you sort of take a large number of these units and stack them. So the input to the network is some data, and then you have some set of units which individually take in a bunch of information about the input and then perform this little computation and give a little bit of output.

Then you have another set of units whose inputs are no longer just the input to the whole network, but are the output of another set of neurons.

[00:16:14] So when people talk about deep networks, there are these layers of neurons, of units where their input is the output of other neurons. People build up these computations through these layers of individually simple units. And the key is that these things have a lot of flexibility in terms of the kinds of functions that they can implement.

We can also train them, so they're good at learning to approximate different functions. And I'm realizing even now, like, function and approximation maybe are terms that, I don't know, aren't familiar.

00:17:01 - Anthony Campolo

You probably don't know what signals is because you're not super into front-end web development, but it's almost like signals, which I have no idea what he means by that. But he's very deep into signals, so it's kind of like everything can be everything when you start to map different concepts onto each other.

But the training part I think is what helps click a lot together because when you describe the thing, signals is basically Excel spreadsheet cells. So that's interesting because when you talk about the neurons, is that kind of what the parameters are when people talk about a model having 7 billion parameters or 14 billion parameters? Is that a similar thing? Are those related?

00:17:40 - Eric Michaud

Yeah. Generally, they're related, but they're not identical. One neuron will typically have lots of parameters, but that's roughly the relationship between the two.

00:17:53 - Anthony Campolo

The number of neurons is related to the number of parameters in some sense, even if it's not one-to-one.

00:17:58 - Eric Michaud

Yeah.

00:17:59 - Anthony Campolo

Gotcha.
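
To make the neuron-versus-parameter relationship concrete, a quick back-of-the-envelope sketch: in a fully connected layer, each neuron has one weight per input plus a bias, so the layer's parameter count is roughly (inputs + 1) × neurons. The sizes below are made up purely for illustration.

```python
# Rough parameter count for one fully connected layer:
# each neuron carries one weight per input plus one bias term.
inputs, neurons = 1024, 4096
params_per_neuron = inputs + 1            # 1,025 parameters for a single neuron
layer_params = neurons * params_per_neuron
print(layer_params)                        # 4,198,400 parameters in this one layer
# Billion-parameter models come from stacking many such layers.
```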

00:18:00 - Eric Michaud

Now you also mentioned things like attention. There are lots of things that people do in these neural networks that don't just look like the description that I gave of just a large number of units performing this computation independently of each other, passing that on to the next layer. There's other, more complicated operations that people have figured out how to do.

I don't know if it makes sense to explain attention right now. I don't know if it's actually even all that relevant to the discussion we're going to have.

00:18:34 - Eric Michaud

That's not a particularly good definition, I feel, of a network because there are a bunch of things that people include in networks that are not necessarily closely analogous with what's happening in the brain. Although that's kind of a tricky question. But yeah.

00:18:53 - Anthony Campolo

The brain thing is an interesting historical connection in terms of how these things were first created, and they go back to 1945 with McCulloch and... what was that guy's name? Pitts. Yeah. It was an old concept, but I'm more curious about the things we're actually interacting with today.

How similar is that to just a classic network? Is it a classic network with a bunch of other stuff on top, or does it kind of stop making sense to even think of that as kind of the base of what's happening with a large language model?

00:19:28 - Eric Michaud

So the original perceptron work from the 40s, 50s, 60s, this kind of thing, I forget exactly how those networks were set up.

00:19:47 - Anthony Campolo

I think they would create the parameters manually. So I remember Frank Rosenblatt had like dials he would tune, or something. I'm pretty sure I read something about that. He had an actual machine with dials he would turn to create the neural network.

00:20:03 - Eric Michaud

Yeah. I think that at the time people were not training networks with backpropagation and gradient descent, which is what people do now. And I think there were also limits to the types of functions that the early networks could express.

So there was this whole paper from Marvin Minsky criticizing perceptrons. It's called Perceptrons. It's like, you know, the AI winter.

00:20:32 - Anthony Campolo

That's what they say.

00:20:33 - Eric Michaud

And so, yeah, there were certain functions or certain operations, certain things that you might want networks to do which some of these early things couldn't express. But this ended up not ultimately being a limit because there are ways of just making the networks more expressive. And then now we also know how to train them.

One fun thing, by the way, about Frank Rosenblatt is that he eventually got into looking for aliens. So he was doing SETI stuff. And yeah, this is a crazy connection. I had no idea about this because we just think about him as doing neuroscience, AI stuff.

But there was this SETI conference in the early 70s. It was organized by Carl Sagan with all these scientists, even like Marvin Minsky, and biologists were there. It was a really interesting conference.

[00:21:38] And Carl mentioned this paper from Rosenblatt. I was like, oh, that's interesting, because he said, you know, my colleague at Cornell Rosenblatt, whatever. And it was on basically a way of identifying when planets transit in front of stars. So it's like a way of trying to find exoplanets.

00:21:59 - Anthony Campolo

Didn't you actually work at SETI, right?

00:22:02 - Eric Michaud

I did, yeah.

00:22:03 - Anthony Campolo

That's super cool.

00:22:04 - Eric Michaud

Anyway, there's some story about how he died young. There's some story about when he was up for tenure, someone advised him to put off the SETI stuff for a little bit, and he was just like, yeah, whatever. I'm gonna keep doing it. Just kind of disregard for his career, I guess. But yeah.

00:22:39 - Anthony Campolo

Okay, so the way I wanted to go from here, unless you had something else you were going to explain, is that kind of: first we define neural nets, and then I want to talk about the training, like how are they trained? What are they trained on?

This is something that I think is really important, especially now with how large language models are created, as it's sort of glossed over by a lot of people. And it's also very hard to even get a sense of, because many of these models are trained on data that we don't know about or that we're not given access to, and there's no real disclosure around the data that's being used to train these things for lots of reasons, legal and non-legal, and shady and non-shady.

00:23:20 - Eric Michaud

Yeah. Right. Well, I guess I don't know what the frontier labs train on.

00:23:28 - Anthony Campolo

Presumably, that's what I'm saying.

00:23:30 - Eric Michaud

Yeah.

00:23:31 - Anthony Campolo

Only the people who work there know.

00:23:33 - Eric Michaud

Yeah, but for systems like language models, the core thing that these systems are trained to do is to predict the next word in a bunch of text documents. So you scrape the internet, you find as many documents of sort of human-written texts as you can, and then you train these systems to predict what comes next, given what came before it in the document.

And it turns out that if you do this at a large enough scale, this training objective incentivizes the network to learn a very large number of things about the world. Because in order to predict the next word, well, you need to know lots of things. So if I say the company Anthropic is located blank, you might say in San Francisco or something. And so in order to predict the next word there about something factual, you need to know the fact that relates to it. But it's not just facts.

[00:25:02] It's also things like different skills. So, for instance, on the internet there are a large number of lists where some text is organized in these bullet point lists or there's something that's incrementing or increasing.

00:25:20 - Anthony Campolo

The 100 sweetest guitar solos.

00:25:23 - Eric Michaud

Yeah. Count one, two, three. And if you're going to be able to predict, in certain places in that document, what word comes next, you need to know what comes next.

00:25:34 - Anthony Campolo

How would it know that Stairway to Heaven is supposed to be number one?

00:25:39 - Eric Michaud

Yeah. Well, just in terms of predicting the numbers, the system needs to be able to count. So basically, implicit in this objective of predicting the next word are a whole bunch of other skills that you need to learn and that networks do learn.

And in learning how to predict, well, they can also then generate new things.
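
A small sketch of that training objective, using the Hugging Face transformers library with GPT-2 as a stand-in (the actual models and data used by frontier labs are not public): the labels are just the input tokens shifted by one position, and the loss measures how well each next token is predicted. The "Anthropic is located" prompt echoes Eric's example above.

```python
# Next-token prediction in a few lines, with GPT-2 as a stand-in model.
# The loss is the average error in predicting each token from the ones before it.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "The company Anthropic is located in"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids makes the model score every next-token prediction.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(outputs.loss)  # average cross-entropy over the sequence

# The model's top guesses for the word that follows "in":
probs = torch.softmax(outputs.logits[0, -1], dim=-1)
top = torch.topk(probs, 5)
print(tokenizer.convert_ids_to_tokens(top.indices.tolist()))
```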

00:26:11 - Anthony Campolo

Okay.

00:26:14 - Eric Michaud

That doesn't really tell you very much about how the networks are doing that internally. And also we haven't really gotten into the math of how the training works. I mean, maybe just a couple of things we could say there.

00:26:30 - Anthony Campolo

I would say in general, try and explain the things you think will need to be understood for your research in particular to make sense, because that's where I want to get us to.

00:26:38 - Eric Michaud

Okay. Maybe we don't need to say that much other than that the things that are updated in the network during training are just a large number of these parameters, which govern the strength of the connections between neurons, and everything in the network, when you just look at it, is just a bunch of numbers.

So the actual input into the network when you show it a bunch of text is basically these lists of numbers, which each represent a word. And then these lists of numbers become transformed in the internals of the network into other lists of numbers that are not necessarily scrutable. They're just a list of numbers. And then eventually, when the network makes a prediction, that's also just an output, which is a list of numbers, but in particular, it's a list of numbers which are probabilities over what the next word is.

00:27:54 - Anthony Campolo

Okay. This is great. I just want to hook this into a couple terms that I know people who actually work with this stuff know: tokens and embeddings. What do those terms have to do with what you just said?

00:28:04 - Eric Michaud

Okay, so when I talk about words, I mean tokens. And the next token and the embeddings are related. Within a transformer, there are learned embeddings. These are like the first layer, basically, of the network. It learns a list of numbers for every token. These are then input into the network, and then this gets transformed into a different set of vectors.
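
That first-layer lookup table is easy to inspect directly. A hedged sketch, again using GPT-2 purely as a stand-in for the kind of transformer Eric is describing:

```python
# The first layer of a transformer is a lookup table: one learned
# vector (a list of numbers) for every token in the vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

embedding_table = model.get_input_embeddings().weight
print(embedding_table.shape)   # (vocab_size, hidden_size), e.g. (50257, 768) for GPT-2

token_id = tokenizer.encode(" Anthropic")[0]
print(embedding_table[token_id][:8])  # the first few numbers in that token's vector
```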

00:28:34 - Anthony Campolo

That's what I thought this stuff was mapping to. That means I understand what you were saying and actually know what the hell was going on, which is great because I'm like, I think he's talking about tokens and embeddings right now.

00:28:44 - Eric Michaud

Yeah, exactly. Although when people have an embedding model, I haven't really worked with that, so I can't say much.

00:28:56 - Anthony Campolo

When you create embeddings, it's not like a model. Basically, you use one of the embeddings endpoints for OpenAI and you feed it some text or something. And then it gives you back a JSON file with just an array full of numbers. Yeah, exactly. It's a list of numbers, literally.

00:29:14 - Eric Michaud

I think probably what's going on under the hood there is that they run that text that you gave it through a network, or at least through several layers of a network. And so that embedding that you get back is like the firing patterns of the neurons at a particular layer of the network.
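
A sketch of the kind of thing Eric is guessing happens under the hood: run the text through a network and read off the activations at some layer as the "embedding." OpenAI's actual pipeline is not public, so this uses GPT-2 and simple mean pooling purely as an illustration of the idea.

```python
# One plausible way an "embedding" is produced: run text through a network
# and average the activations (hidden states) at a particular layer.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

text = "Neural networks learn skills in a predictable order."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state   # (1, num_tokens, hidden_size)

embedding = hidden.mean(dim=1).squeeze(0)        # one vector for the whole text
print(embedding.shape)                           # e.g. torch.Size([768]) for GPT-2
```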

00:29:33 - Anthony Campolo

Fascinating. Yeah. Wow. You just blew my mind. Because I learned a lot of the stuff you're talking about. I learned at a conceptual level a long time ago, and it was like, okay, I get what people mean when they use these terms, like the overall goal and what's happening, what the network is and how it's being trained and how it's being transformed and what the whole point is.

But as ChatGPT started to come out, I was working with that and I'm like, I have no idea how this connects to all that other stuff I learned back in the day. And then you just kind of learn the terms it uses to mess with it, because you're a developer and you just start using the API and you're like, okay, so it costs me a certain amount of money for tokens. So what's a token? You find out what a token is and then you understand that it inputs a certain amount of tokens, outputs an amount of tokens.

[00:30:22] And then there's this thing called embeddings. But there was still not a clear connection for me to what is actually going on under the hood, because they don't tell you a lot about that. And it seems very disconnected from all this other academic research that's happening.

00:30:38 - Eric Michaud

Yeah. Right. Well, I don't know how accurate all of the marketing terms are, but yeah, the tokens are what are fed into the model, and the embeddings are like internal activations in some network.

Slightly confusing because we also talk about the embeddings as being just this thing that is learned in the first layer of a transformer, just sort of like the raw token embeddings. So there's sort of a difference between the token embeddings, maybe, of a particular language model and the embeddings that you get from the OpenAI API. But.

00:31:25 - Anthony Campolo

Right. Yeah. Interesting. There's one other thing I wanted to ask, based on what you've been saying before, and then we can kind of start going through your slides and stuff. When you're talking about how you train on different things, we can gain world knowledge and stuff. There's something that I've heard. I've never really known if it's true or not.

People say that the models got a lot smarter once they started training them on lots and lots of code. Because code is highly structured, complex, and works in a very specific way. Like, you have a long program, and if you change one thing in that program, the entire thing can break in very weird, unexpected ways. And so I heard that when you trained them on a crapload of open source code, they would then get better at other things, like just answering, just being smarter, becoming a smarter model in general.

[00:32:14] Is that actually true? Have you heard that? Can you verify that?

00:32:17 - Eric Michaud

I've heard that. I don't know quantitatively how strong that effect is. It seems like I would expect there to be some effect where training on certain types of things helps with seemingly unrelated things, but there's going to be a limit to that.

So training on a bunch of code doesn't necessarily help you answer factual questions about history or something. But probably there are some benefits where certain very generally applicable skills are learned from just a very broad distribution of things. And maybe in order to learn certain skills, you need to train on code. Maybe you also need to train on other things.

There's this kind of tricky connection between, like, it's a little unclear in general what that objective of training to predict the next token actually incentivizes the model to learn.

00:33:30 - Anthony Campolo

And stuff like that.

00:33:31 - Eric Michaud

Yeah. But even more basically, you could imagine you train on a bunch of code and it's like, well, what has it learned? Like maybe it's learned a bunch of lower-level knowledge about the names of different libraries and modules within different programs.

00:33:50 - Anthony Campolo

Like logic and if-then-else stuff like that.

00:33:56 - Eric Michaud

Yeah. Right. A strong enough model at predicting code would basically be able to sort of simulate how that code is executed and things like that. But really understanding it and being able to think logically about it probably would be incentivized eventually.

00:34:24 - Anthony Campolo

Yeah. I think there's an interesting thing where not just in the training, but when you're actually using them, something that you used to have to do is less of a problem now.

The first kind of iterations of things like ChatGPT couldn't do math. It just sucked at math for some reason. But if you told it to write a program to do the math, it would write code, and then the math would be correct because it would know how to write the code to make sure the correct answer was outputted. But if it just answered in its text, it would mess up math. It doesn't really do that anymore, as far as I can tell, but it did that a lot in the beginning. It just couldn't do math.

00:34:57 - Eric Michaud

Yeah. Well, it's kind of a hard thing to do, just because of how the networks are set up. But if you're predicting, if you just write out a very long multiplication problem, like ten-digit multiplication or something, and then you ask the model to give its answer straight away, then within what's called a single forward pass, it has to basically compute the full multiplication.

Because the first digit of the answer is kind of the hardest one. Like, if you were going to write out that multiplication, you would actually get the leftmost digit at the end. And then, yeah, so there's even weird ordering stuff like that. This is also partially why chain of thought is so useful, because it means that for problems like this, the model doesn't have to just come up with the answer in the limited amount of steps that it has internally during a single forward pass. It can spread that computation out across lots of forward passes.
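
A tiny worked illustration of why the leftmost digit is the hard one: it can depend on carries propagating up from the smallest digits, so a model asked to emit the answer left to right in one shot has to do the entire multiplication internally before writing anything. The numbers below are chosen only to make the point.

```python
# The leading digit of a product can hinge on the smallest digits of the inputs.
a, b = 3162, 3162
print(a * b)        # 9998244  -> leading digit 9
print(a * (b + 1))  # 10001406 -> leading digit 1, flipped by a +1 in the last place
```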

00:36:06 - Anthony Campolo

Totally. And I thought for people who don't know, a specific thing is this: when you use a model like R1 or, to a certain extent, O1, there's a controversy here that I want to get into also.

It will first kind of think out its answer before it actually gives you the answer. And you can kind of click it and open it to see it or not. But I've also heard that ChatGPT doesn't give you the reasoning tokens. So I think what people mean by that is that the actual reasoning tokens are being summarized before they're being given to you, whereas R1 gives you the original ones unsummarized. I don't know if that's actually correct or not, but that's what I've been trying to understand about this term and topic and what's going on.

00:36:48 - Eric Michaud

Yeah, I think that's right. OpenAI does not reveal the exact output of the model during the chain of thought.

00:36:59 - Anthony Campolo

So what is it showing you then, and how does it get that? Is it just summarizing it, or is it rewriting it to make you think it's thinking a thing that is not what it is? What are you being shown, the original chain of thought? You do see something, yeah, when you use O1.

00:37:13 - Eric Michaud

Yeah. I think that they just have a smaller language model that is summarizing the chain of thought.

00:37:23 - Anthony Campolo

Okay. So I pretty much got it right then. My current perception of what is happening is correct. Okay. Yeah. That's a really, really strong reason to use R1 then, honestly, because it really helps.

And I felt that when I used R1 the couple times I've used it, I'm like, this is a really long, detailed chain of thought. And I think it might be the same for Grok. I once did a Grok thing. The chain of thought was like ten times longer than the actual answer it gave. I was like, what the heck? This is a ton of thinking it's doing.

00:37:54 - Eric Michaud

Yeah, it's cool to actually see that. I mean, I think if you give O1 a really hard question and it thinks for a while, it is outputting thousands of tokens or something. It's just you can't see it.

00:38:07 - Anthony Campolo

And you're paying for it, aren't you? If you're using the API, you pay by the token.

00:38:14 - Eric Michaud

Yeah, you're paying for it.

00:38:21 - Anthony Campolo

I complain about that.

00:38:23 - Eric Michaud

API pricing works that way, but yeah. I think the reason they did that is because, I mean, as we sort of see, for instance, with DeepSeek, you can distill much smaller models given a bunch of examples of that reasoning trace and basically get the smaller models to be pretty good reasoners.

So OpenAI does all of this very expensive, although not super expensive yet, reinforcement learning to get the models to figure out how to reason, and then it's relatively easy to potentially create other models from that without doing a similarly expensive step. So it makes sense why they would do it.

00:39:11 - Anthony Campolo

O3 mini and O3 mini high, and then eventually to have actual O3. Is that kind of why we have all these weird different model names, because they're distilling them into smaller versions?

00:39:24 - Eric Michaud

Presumably. I don't know how O3 was trained, but yes, I think the speculation is that. Well, it's interesting. So it could be a distillation of the full O3, or they could have just done reinforcement learning directly on a smaller model.

00:39:39 - Anthony Campolo

Interesting.

00:39:40 - Eric Michaud

Yeah.

00:39:41 - Anthony Campolo

Don't forget O2, the only option. So the two people in the chat right now are two of my good web dev friends who are doing plenty of stuff with AI but don't have all this academic background, as far as I know.

But we should get into your website. You have an interactive website, which I think will definitely help a lot. So you want to share your screen and kind of mess around with that and try and understand what your thing is.

How did you build the website, by the way? Did you code it yourself? Did you have an LLM help you with it, or is it fairly easy to just embed this kind of stuff into your website?

00:40:17 - Eric Michaud

This is the interactive thing where you can interact with it, so I'll share my screen. Okay. So yeah, I made this website to illustrate some facts about how the networks learn and how they scale. But it also, I think, is just a nice way of thinking about, or just directly looking at, the task that the models are trained to do.

So what I did was I took a small model. It was a 400-million-parameter language model.

00:41:07 - Anthony Campolo

And where did that model come from? Does it have a name, or is it something else?

00:41:12 - Eric Michaud

Yeah. It's called Pythia 410 million. And this is part of a sort of family of models called the Pythia models, which were released by EleutherAI.

00:41:24 - Anthony Campolo

Okay.

00:41:24 - Eric Michaud

But the nice thing that Eleuther did was they released a large number of intermediate models. So they released the final model after training, but also a bunch of other models. So you can see how these language models change during their training process.

00:41:48 - Anthony Campolo

Yeah. This is cool. I don't think I knew about this company, but I like what they're doing because it's empowering open source artificial intelligence research. I think that was really great. I think that was sorely needed for this stuff.

00:42:00 - Eric Michaud

Yeah, they're really cool. There are some really great people out of Eleuther. But what we can do is, because we have all these models, they're called checkpoints, so stages of the model at all these different points in training, we can see how the model has changed throughout training.

This is not going to look necessarily at the internals of the model, but we are going to see how the behavior of the model changes.
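
For anyone who wants to poke at the same checkpoints, the Pythia models are available on the Hugging Face Hub, and, as far as I understand EleutherAI's release, the intermediate training steps are exposed as separate revisions. The specific revision names below are an assumption based on that naming scheme, not something confirmed in the conversation.

```python
# The Pythia models from EleutherAI publish intermediate training checkpoints
# as revisions, so the same model can be loaded at different points in training.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-410m")

early = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m", revision="step1000")
late = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m", revision="step143000")
```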

00:42:24 - Anthony Campolo

We're looking at the outputs of the model, basically, not the internals.

00:42:29 - Eric Michaud

Yeah. In particular, we're going to look at the error of the model. So what I show here is that each of these gray points represents a particular token in the corpus.

00:42:47 - Anthony Campolo

So your font, two times, that's probably good. Yeah.

00:42:54 - Eric Michaud

Okay. So we can basically click on each of these points. And what I show when I click on each one of them is a token, and it's highlighted in red down here. Might be kind of small to see. We see that token in its context.

And so the thing that the language models are trained on, right, is just a whole bunch of text on the internet. Probably there are some things in here which are going to be fairly disturbing if we searched around for long enough.

00:43:24 - Anthony Campolo

So context not loaded. Is that because it's moderated, or why does that say not loaded for security reasons?

00:43:29 - Eric Michaud

Yeah.

00:43:31 - Anthony Campolo

Basically because there's code in it.

00:43:32 - Eric Michaud

Yeah. I chose not to display any code because I was worried that.

00:43:41 - Anthony Campolo

Like JavaScript, SQL injection type stuff happening.

00:43:44 - Eric Michaud

The code would be run.

00:43:45 - Anthony Campolo

Sure. Yeah.

00:43:46 - Eric Michaud

I don't know about that.

00:43:46 - Anthony Campolo

It should sandbox anything too crazy, but I guess that makes sense if you could grab your cookies or something like that, I guess.

00:43:56 - Eric Michaud

Yeah, I didn't think too much about it. Maybe it was unnecessary.

00:44:01 - Anthony Campolo

Yeah, it's better safe than sorry. Probably.

00:44:03 - Eric Michaud

Yeah.

00:44:04 - Anthony Campolo

Especially for what you're trying to show here, you know?

00:44:07 - Eric Michaud

Yeah. Because there's still other stuff we can look at in here. Basically, what we show is the model's error in predicting each of these tokens. This is called the model's cross-entropy loss on the y-axis of this right plot. This is just measuring how correct and confident the model is in predicting this token from its context.

So we can see, for instance, there's this little cluster of points out here. These actually correspond to predicting an S after an apostrophe token, so this is a pretty easy thing to predict because S's come after apostrophes in these examples, at least. We see that the model, pretty early in training, actually gets basically zero loss or error in predicting this token.

And by the way, this curve shows the model's loss over time in training. We see that fairly early in training, this scale is on a log scale.

[00:45:32] So it's going from 1 to 10 steps to 100 to 1,000, 10,000, and 100,000 steps of training. This is the amount of time that the model has been training. We see that pretty early on the loss drops and then it stays very low. So this knowledge that S's come after apostrophes is learned very early in the model.
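
A rough sketch of how a per-token learning curve like this could be computed: score the same token in the same context with several checkpoints and record the cross-entropy at each training step. The context string and checkpoint names here are illustrative assumptions, not the exact setup behind Eric's website.

```python
# Sketch: the cross-entropy loss on one particular token, measured at several
# training checkpoints, gives a "learning curve" for that token.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-410m"
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "The company Anthropic is located in San Francisco"   # illustrative context
ids = tokenizer(text, return_tensors="pt")["input_ids"]

for step in ["step1000", "step10000", "step143000"]:          # assumed revision names
    model = AutoModelForCausalLM.from_pretrained(model_name, revision=step)
    with torch.no_grad():
        logits = model(ids).logits
    # Loss on the final token, predicted from everything that came before it.
    loss = F.cross_entropy(logits[0, -2:-1], ids[0, -1:])
    print(step, float(loss))
```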

00:45:53 - Anthony Campolo

And this is super important. Tell me if I'm wrong about this, but this research is super important because a huge amount of money is being spent training these massive models. If you can figure out exactly when you need to stop training it for it to do what it needs to do, you'll save money, essentially.

00:46:14 - Eric Michaud

Yeah. That's not, like, I don't know if we're there yet in terms of that.

00:46:24 - Anthony Campolo

The point, though, is to understand it well enough to be able to do something like that. Yeah, maybe not perfectly, but better than just guesswork and just running it and then kind of checking it every now and then, like, is it doing the thing yet? I don't know, keep running it, you know.

00:46:37 - Eric Michaud

Yeah. The overall motivation of this kind of basic exploratory stuff is to try to build up the general theory of how training networks on more data, and training larger networks, changes them.

00:46:55 - Anthony Campolo

It affects the scaling laws to a certain extent, maybe. Not only that, but that's a big part of it, I would imagine.

00:47:00 - Eric Michaud

Oh, yeah.

00:47:04 - Anthony Campolo

It's tight. That's tight.

00:47:05 - Eric Michaud

Yeah.

00:47:07 - Anthony Campolo

And this is really important. There's so much money being put into training these models, man. I'm telling you, it's the amount of capital we've never even seen before.

00:47:17 - Eric Michaud

Yeah. So it's actually quite striking that, given the economic importance and the geopolitical importance of what is going to happen when people continue to train larger systems, it's crazy.

00:47:31 - Anthony Campolo

Surprisingly few people are studying it, I would guess.

00:47:35 - Eric Michaud

Yeah. Now, maybe internally at these labs, that's happening.

00:47:39 - Anthony Campolo

Part of the problem, though, is all this weird hidden internal knowledge at these companies. I understand it. I get why companies have secrets, but at the same time, it puts their rhetoric about what their actual purpose is to the test.

00:47:56 - Eric Michaud

Yeah. So there's not as much very ambitious open work trying to come up with these unifying theories. There are various models of neural scaling. Our work is one of them.

00:48:16 - Anthony Campolo

And I have to imagine that's the stuff that wouldn't be being done internally at these companies, because they wouldn't really prioritize it enough. They want to learn how to actually do this practically in a way that makes sense for their systems, in the cheapest, fastest way possible. That doesn't necessarily involve coming up with a unifying scientific theory.

00:48:35 - Eric Michaud

You'd think, though, that it would have applications if you did, in fact, understand what is going on in these systems really well. You'd think that it would have applications for making them more efficient.

It could have implications for what to do differently, maybe ways of changing the objective that the networks are trained on or something. I don't know.

00:49:05 - Anthony Campolo

I guess you'd know: how much communication is there between people actually at these companies and people in academia, like where you are?

00:49:14 - Eric Michaud

There's very little on the kinds of topics related to scaling laws and that sort of thing. I mean, a few years ago, there was a paper from some folks at OpenAI. Jared Kaplan was a professor at Johns Hopkins and was at OpenAI at the time. They released the original paper on scaling laws, documenting scaling laws for large language models.

That kind of thing maybe would not happen now. It's called Scaling Laws for Neural Language Models. It's a good title.

00:50:04 - Anthony Campolo

Good name for a paper.

00:50:06 - Eric Michaud

Scaling laws for neural language models.

00:50:09 - Anthony Campolo

Gotcha. Yeah. I've said in the chat, Manhattan Project v2. Is that what needs to be done to move the needle?

00:50:19 - Eric Michaud

Progress is going pretty quickly right now, just with a bunch of separate private companies. I'm not sure that things would happen faster if there was some messy reorganization and everyone was meant to work together.

I mean, keep in mind, people don't necessarily like each other that much: all the founders of Anthropic left OpenAI. Now a whole bunch of people at this new Thinking Machines company have also just left OpenAI. So I don't know, maybe that's true.

00:50:58 - Anthony Campolo

That's not a lot of people. I feel like people who aren't super into the minutiae of these companies don't realize that Anthropic is an offshoot of OpenAI, right? Like, how many people do you think were actually working there that ended up hopping ship and then joining Anthropic or creating Anthropic?

00:51:15 - Eric Michaud

I mean, there was like an original group of seven founders.

00:51:20 - Anthony Campolo

And were they also from Google?

00:51:23 - Eric Michaud

They were mostly from OpenAI, if not entirely from OpenAI.

00:51:28 - Anthony Campolo

Okay. And whether they had worked at Google before that, I don't know.

00:51:34 - Eric Michaud

Some of them, yeah. So, for instance, Chris Olah, who's one of the main people doing interpretability of neural networks. He was at Google Brain for a while before OpenAI and then now Anthropic. I think he's at Anthropic.

00:51:56 - Anthony Campolo

Okay. Weird tangent from what you're talking about in terms of your project, so continue on that.

00:52:02 - Eric Michaud

Oh yeah. Okay. So, basically the high level point here is that I want to try to directly visualize a large number of different skills that the model has learned that, it turns out, are incentivized by this prediction, this sort of next token prediction problem.

We see that the model has to predict a bunch of tokens. Some tokens are predicting an S after an apostrophe, so it learns that this is what it should do when it sees an apostrophe. There's another set of tokens which are actually down here, in this sort of group on the bottom right. These are actually all involved in predicting a new line after another new line. So this is also a very common pattern in text, to predict a double new line in order to create a new paragraph. This is also learned extremely early in training, and the model is able to be very confident and correct in its prediction. Now there's a whole bunch of examples like this which are super interesting.

[00:53:14] So up along the top part of this group of points are the ones I just showed.

00:53:23 - Anthony Campolo

You just showed one of those when you clicked.

00:53:26 - Eric Michaud

Oh yeah.

00:53:27 - Anthony Campolo

It had some CSS in it, I think is what it was.

00:53:30 - Eric Michaud

Yeah. I'm trying to find a good example of what I'm referring to here. If I go more, you'll see it.

00:53:39 - Anthony Campolo

Open source license. This is fascinating. All this stuff, because by seeing the chunk, a lot of times you can tell what it is. You can see it's part of a legal document, or there's some code right there as part of a license.

00:53:51 - Eric Michaud

Oh yeah.

00:53:55 - Anthony Campolo

Sources, like the end of a paper, or stuff like that.

00:53:59 - Eric Michaud

Yeah. There are a lot of code licenses on the internet because so many documents contain these licenses at the top of them.

And so models learn them early.

00:54:10 - Anthony Campolo

Almost every GitHub repo, or most of them, has a license, you know.

00:54:16 - Eric Michaud

Yeah. So it's learned very early, and it's learned by models that are relatively small. They'll know the licenses. But if you look at a token like this, what's going on?

So here it's predicting "when," and we see from this loss curve that the model pretty quickly gets very good at predicting this token because its error is super low, eventually.

00:54:43 - Anthony Campolo

Yeah. This is a clear pattern. It has to look at the last couple and see that they all ended with "when."

00:54:49 - Eric Michaud

Yeah. So there's this pattern of, like, "remainder when," and this is seen throughout the context. I think the idea is that within the context the model is making this inference. It's like doing induction. It's like, "when" has occurred here after "remainder" a lot in this context.

So now "remainder" is the previous token. We're going to predict "when." This is interesting because it's different, I think, from the model learning that this is a common pattern in training. Some of the other examples where it's like maybe predicting a new line after a new line, that might just be something which it's learned from training. But here it's almost like within the forward pass, when it's predicting, it is doing a kind of learning, and it's looking at the text that it's seeing right now and making an inference about how it will continue based on the statistics of the text as they occur in its context.
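The copying behavior Eric is describing can be caricatured as a tiny heuristic, separate from anything a real transformer does internally: find earlier occurrences of the current token in the context and guess whatever followed them. A toy sketch of that idea (my own illustration, not code from any of the papers discussed):

```python
from collections import Counter

def induction_guess(tokens: list[str]) -> str | None:
    """Toy 'induction' heuristic: if the last token appeared earlier in the
    context, predict whichever token most often followed it."""
    last = tokens[-1]
    followers = Counter(
        tokens[i + 1] for i in range(len(tokens) - 1) if tokens[i] == last
    )
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# "remainder" has been followed by "when" earlier in this context,
# so the heuristic guesses "when" again.
context = "the remainder when 7 is divided ... the remainder when 9 is divided ... the remainder".split()
print(induction_guess(context))  # -> "when"
```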

00:55:53 - Anthony Campolo

Right. Yeah, that's what I was going to say. So it's doing something with the actual context itself that is relevant to in-context learning stuff.

00:56:02 - Eric Michaud

Right. It's like this is not necessarily encoded directly as this crystallized knowledge in the model, like when always follows remainder. But it's like a kind of learning, maybe, that is happening on the fly when we're evaluating the model.

00:56:23 - Anthony Campolo

Yeah. This is something that is actually super relevant to me and how I work with these things. It's something I've thought about and worked with a lot.

When you're trying to get it to understand your codebase, how many files do you need to give it to answer the question you have, or build the feature you want to build? Because I've structured my codebase in a way that lets me do this, if I have a feature I want to build, I usually only have to touch a couple of code files. So I'll grab all those, and then one or two that give it context for the whole project, and I'll just dump it into the chat and then tell it what I want the feature to do.

I have some presets also in terms of how I want to write the code and how I want it to respond. Then it will give me an answer and then I'll kind of try and implement that and see if it broke.

[00:57:09] And sometimes I'll realize, oh, there's this other file here I actually should have given it. And sometimes it just works right away because I got pretty good at it. So all of that is just based on it being able to use the context of the very first input it gets to answer something that it wouldn't have been able to do if it didn't have that context.

00:57:28 - Eric Michaud

Yeah. There's even a name for this kind of learning that happens in context, called in-context learning.

00:57:38 - Anthony Campolo

Yeah. I've been using that term and telling people that's what it's called, so I'm glad to get your verification that I'm using it in a strict, correct sense. That's what I've been saying to people: you have to figure out in-context learning. It's a super important part of this.

00:57:52 - Eric Michaud

Yeah. You can be more or less precise about how specifically you use that term. Some people have worked out that the kind of learning that happens in in-context learning is almost like the training of a model within the computation of the network, or something.

00:58:14 - Anthony Campolo

It can take in a ton of text. Originally these models couldn't take in a lot, like when I was using ChatGPT 3.5. It could take very little, like 2,000. Now you can get tens of thousands of tokens, a ton.

00:58:30 - Eric Michaud

Yeah. I don't understand the results well enough to be able to really talk about them, but there's this crazy stuff that people have worked out about how these models, these neural networks, in the process of the computation they do internally, all these neurons firing, can effectively almost spin up and then train a smaller model within them.

And so there's these papers on how in-context learning is approximating something like gradient descent, which is crazy. I don't know exactly how it works in general, and in real language models. For something as simple as this, where it's basically just copying from the context, we kind of do understand how it does that.

There's this paper by Anthropic called In-Context Learning and Induction Heads, and also an earlier one that it builds off of called A Mathematical Framework for Transformer Circuits.

[00:59:45] Which sort of describes how this happens. It's interesting.

00:59:58 - Anthony Campolo

I'm putting them in the description for YouTube.

01:00:03 - Eric Michaud

Yeah, it was super cool, especially In-Context Learning and Induction Heads. We can observe this behavior: the model is good at copying, and we can see that it seems to be learned early in training.

People then go into the model and try to understand what mechanisms, what particular neurons implement this behavior, or implement the algorithm that allows the network to copy text that it's seen earlier and then paste it into the next answer, what the next token will be.

We might hope, and this is what motivates a lot of my work, that we could eventually identify a huge number of different mechanisms in the model that each implement some particular behavior or encode some particular knowledge, and then try to decompose this whole complicated network into much simpler parts we could understand.

01:01:13 - Anthony Campolo

Right. Because right now, when you're just clicking random places, it's showing you different texts. Like I was saying, there's some that's like legal documents, some that are licenses, some that are like creative writing. Those are all fairly different things we want these models to do when you're using it. If you're trying to analyze court documents, that's very different from if you want to write CSS. That's very different from if you want to write a poem.

01:01:37 - Eric Michaud

Yeah, exactly. This also relates to questions about what intelligence is, because we tend to think about intelligence as just this general problem-solving ability, but it seems like the things that the models are incentivized to do, like language models, are to learn a very, very large number of crystallized skills and knowledge.

It's interesting because maybe we could hope to enumerate what these pieces of knowledge are, what these skills are, and decompose the network in that way.

01:02:23 - Anthony Campolo

Yeah. I think the difference between intelligence and knowledge is super important here. When we think about intelligence, especially like IQ, some of the tests they give are things like, you know, the nine boxes where you have to figure out the missing box based on the pattern that runs through all of them. That makes sense, but I think of that as raw computation.

That's not the same as knowing all the different species of animal or something. There's a bunch of facts and then how that stuff all fits together. That's what makes the model so interesting, because humans have a certain intelligence, but there's also a limitation on how much stuff we can shove into a single person's brain. A model doesn't have that limitation. That's why you could ask it a hundred questions.

This is what Tyler Cowen talked about. This one I thought was really interesting.

[01:03:11] Someone was saying, what would happen if you pick Tyler Cowen, the economist, versus an LLM, just asking economic questions? And he was saying he would be able to do well for a while. When you ask him questions about his discipline and about his specific studies in that discipline, he'll be able to get better answers as you ask more and more questions.

Eventually you'll just ask him things that he doesn't know because he can't know everything, but the LLM will know. So he was saying it's not a question of who's smarter, but how many questions are you asking. Because at a certain point you're going to reach the limits of his knowledge, and the LLM goes far beyond that.

01:03:47 - Eric Michaud

Yeah. It is astonishing the depth of the knowledge that is crammed into these systems, especially given that they're not huge. I mean, they're big, but there's an immense amount.

01:04:00 - Anthony Campolo

They're too big to fit on my computer. I can't fit the largest Llama models on my machine, so they're pretty big. I would say that's pretty big.

01:04:06 - Eric Michaud

So we have like 70 billion or 400 billion parameters, or something like that. 600 billion for DeepSeek. But the total amount of information stored on the internet is much higher.

01:04:22 - Anthony Campolo

But much of it's redundant, though. So much of it is redundant or just meaningless. There's people talking about their cats and stuff.

01:04:30 - Eric Michaud

Yeah, but anyway, I guess if we compare this to the number of connections in a human brain, I think the parameter counts of the networks we have today are not necessarily bigger than the number of connections or synapses in the human brain.

So it seems like these models are much more knowledgeable than any human, but they're also smaller, maybe relative to the amount of information that can be contained in, or at least the number of connections in, a human brain. So they might be more efficient than humans.

01:05:10 - Anthony Campolo

And that probably has big implications for the whole Kurzweil scaling law, because what he was doing when he was looking at how to get to the singularity was looking at things like the number of neurons and how long it would take to compute the same amount as a human brain, and then beyond that.

So if we actually need less than that... I would have expected it to be more, because these systems are going to be harder to train and they don't have the ability to reach out into the real world, so they'd need more connections than humans have. But you're actually saying it's the opposite.

01:05:44 - Eric Michaud

Yeah. At least as far as I can remember in terms of the statistics here, I want to say it's like 10 trillion or 100 trillion connections or something in the human brain.
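As a rough back-of-the-envelope check using only the ballpark numbers mentioned in this conversation (none of these are precise figures):

```python
# Back-of-the-envelope comparison using the ballpark numbers from the conversation.
synapses_human_brain = 1e14     # ~100 trillion connections, the commonly quoted estimate
model_params = {
    "largest Llama (~400B params)": 4e11,
    "DeepSeek (~600B params)": 6e11,
}

for name, params in model_params.items():
    ratio = synapses_human_brain / params
    print(f"{name}: human brain has roughly {ratio:.0f}x more connections")
```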

01:05:53 - Eric Michaud

We can also ask how much compute it would take to simulate a human brain throughout the course of its life. This is also something people do when they try to estimate how long it'll be until we have systems that are at human level.

01:06:11 - Anthony Campolo

They want to upload their brains, the noosphere. That's what they care about, so that's very relevant to them.

01:06:19 - Eric Michaud

Yeah. Although people also use arguments like this just to try to estimate when we'll achieve AGI.

01:06:30 - Anthony Campolo

Which makes no sense. You might as well just pick random numbers and be like, once we hit this number, we'll have AGI. But why?

01:06:38 - Eric Michaud

Yeah. Or even, like, there are these projections. I saw some images. Maybe it wasn't even actually from Anthropic, but I saw an image with Anthropic branding. It was like, you know, whatever, we have roughly human-level coders within a year, and then in two years, three years, we have systems that are coming up with novel scientific insights and accelerating science massively.

It's interesting because on one level maybe it is kind of intuitive that there's this progression where once we have systems that are at parity with humans at a bunch of tasks, then if we continue to scale them, we'll reach this point where they'll have novel ideas. But there's not a guarantee that that's going to happen. It's possible that there's just a different shape to the skill sets of these systems and of humans.

[01:07:33] And it's not guaranteed that we're just going to get certain capabilities magically at a certain scale.

01:07:39 - Anthony Campolo

I have a pretty contrarian take on this, which is that I think we hit AGI like a year ago, and that right now everyone's moving the goalposts and picking new definitions for it, because by almost any definition we've hit it.

Like the novel scientific idea stuff, there's been papers. I saw at least one, maybe it's spurious, but they had a model come up with a bunch of new ideas to research and then put them up against people, and then had scientists judge between the two. They said the models were better in terms of coming up with research ideas.

So in one sense, we are already there for a lot of the things that we're saying we need to be able to do to consider it AGI. To me it's just general. It can do math, it can do science, it can write, it can create lyrics. It can obviously do lots of things. It's clearly a general intelligence to me by just the definition of the word.

[01:08:35] So I don't know. I think more people are going to start to adopt this stance in the next couple of years, where right now I can only think of one other person who says this.

01:08:47 - Eric Michaud

Oh, that we've already hit AGI. Yeah.

01:08:51 - Anthony Campolo

I mean, then it's harder to get the money. You need the money. If AGI is three months away, then you need the money. And you gotta give them the money because they're about to get the AGI.

01:09:00 - Eric Michaud

Yeah, well, yeah, this has been a thing. I'm sympathetic with your point. Systems can do a lot. There's even this slightly joking tweet, like there's a Roon tweet that's like GPT-3 was AGI or something like five years ago.

01:09:14 - Anthony Campolo

Exactly. Yeah. So he's tapping into what I'm saying.

01:09:19 - Eric Michaud

But I guess one difference is it feels like the systems that we get now have this huge amount of crystallized knowledge and skills from seeing an immense amount of different things during training. When they're actually deployed, they can call upon these skills.

But the degree to which they learn things on the fly, when you do inference with them, is still somewhat limited. I mean, as we talked about, there's this kind of in-context learning that happens, but it seems like there's a difference in terms of the things that a system is able to learn on the fly, like a language model today versus a human child.

A human child is theoretically capable of going out and learning anything within a single lifetime, a single run, a single trajectory in the AI terminology. But maybe we're going to get more of this with the long-context stuff and the reasoning stuff that's happening now.

[01:10:17] There's maybe early signs of this with O3 passing the ARC-AGI benchmark. But anyway.

01:10:26 - Anthony Campolo

Yeah, that'll be interesting. Is there more on this page below this that you want to talk about?

01:10:34 - Eric Michaud

Either way, there's just some fun, kind of scientific questions about how much variation there is in these training curves between models of different sizes. So, for example, here's one.

01:10:51 - Anthony Campolo

That one's Android code.

01:10:56 - Eric Michaud

Sure. Trying to find a good one. I mean, to me, it looks actually like most of these models, most of these curves, are pretty similar. So here I'm showing, you know, these three lines.

01:11:12 - Anthony Campolo

React code, import React from 'react'. Yes.

01:11:18 - Anthony Campolo

That might be my code. Who knows?

01:11:22 - Eric Michaud

Yeah. I mean, seriously, the models have seen your code. The new ones, at least.

01:11:27 - Anthony Campolo

Yeah. No, all the LLMs know who I am. I've made sure to ask them. Give me glowing recommendations.

01:11:35 - Eric Michaud

Yeah, well, this is super interesting. Our theory of scaling, which we wrote up in a paper from a couple of years ago, is called the Quantization Model of Neural Scaling.

Our basic theory is that this problem of predicting the next word, next token, across all texts decomposes into all these different things you need to learn. As we've been talking about, you need to learn lots of knowledge, lots of different skills to predict different tokens.

And our conjecture is that there's a kind of regularity, a kind of order and nature to how frequently these different things are seen by a model. Some skills and some knowledge are very common, and other skills and knowledge are very esoteric and aren't needed by a model very frequently, or by anyone very frequently, in order to understand text.

[01:12:42] Basically, we just sort of conjecture that there's a particular form that these frequencies take. It's called a power law, where they drop off as roughly like one over n. This is related to Zipf's law in language where if you look at how frequently different words are used, there's this kind of 80/20 thing where most of the words are rare.
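To make the shape of that conjecture concrete, here is a toy numerical sketch (assumed numbers, my own illustration rather than the paper's experiments): if skill k is needed with Zipf-like frequency proportional to 1/k^alpha, and a model learns skills in order of frequency, the loss contributed by the unlearned tail falls off smoothly as the number of learned skills grows.

```python
import numpy as np

# Toy illustration of a quantization-style scaling picture (assumed numbers,
# not the paper's experiments). Skill k is needed with frequency ~ 1/k^alpha.
alpha = 1.3
num_skills = 100_000
k = np.arange(1, num_skills + 1)
freq = k ** (-alpha)
freq /= freq.sum()              # normalize into a probability over skills

per_skill_loss = 1.0            # loss paid each time an unlearned skill is needed

for n_learned in [10, 100, 1_000, 10_000, 100_000]:
    # A model that has learned the n most frequent skills only pays loss on the tail.
    expected_loss = per_skill_loss * freq[n_learned:].sum()
    print(f"skills learned = {n_learned:>7}: expected loss ~ {expected_loss:.4f}")
```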

01:13:16 - Eric Michaud

Yeah.

01:13:17 - Anthony Campolo

Yeah. Continue. Sorry.

01:13:19 - Eric Michaud

Oh, no. Sorry. The most common words are very common, and then there's this long tail of less common words.

01:13:27 - Anthony Campolo

Because when they first started analyzing texts with computers and algorithms and stuff, the main thing they were doing at first was counting words. Like, how many times does this word show up in this document? And then what does that mean?

And then a lot of amazing things come out of that as you track different writers by the way they use certain words and stuff. You could analyze these word frequency patterns and get all sorts of insights about texts, a surprising amount.
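The counting Anthony describes takes only a few lines; running something like this over any large text file shows the Zipf-like drop-off Eric brings up next (the file name is just a placeholder):

```python
import re
from collections import Counter

# Count word frequencies in a document and print the head of the distribution.
# "document.txt" is a placeholder for whatever text you want to analyze.
with open("document.txt", encoding="utf-8") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

counts = Counter(words)
for rank, (word, count) in enumerate(counts.most_common(10), start=1):
    # Under Zipf's law, rank * count stays roughly constant.
    print(f"{rank:>2}. {word:<12} {count:>6}  rank*count ~ {rank * count}")
```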

01:13:57 - Eric Michaud

Yeah, probably the best general-audience explanation of this I've seen is actually an old Vsauce video about Zipf's law and power laws, which is very good.

But it seems like there's this order to nature where the frequencies with which different words are used fall off in this particular way. So there's this kind of order there.

Basically we're just conjecturing that there's this huge, similar kind of power law governing, not just how often different words or tokens are used, but how often the algorithms and information that are needed to predict those words, or to understand text, occur in text.

The really crazy way that this translates into some theory of intelligence and this kind of thing is that maybe these things that the model must learn to predict text well are similar to the things that are in human minds that produce the text.

[01:15:23] So if we say that a language model decomposes into all these parts in order to predict text, maybe that's saying that there's a similar kind of decomposition of all the minds that have produced text into these ideas. Maybe there's a power law distribution in the world, across human minds throughout history, governing how often different ideas are being expressed in human writing. And the crazy claim would be that the scaling laws we see in neural language models reflect some deeper statistics about how often different ideas are coming alive in people's brains across the world when they've written text.

01:16:16 - Anthony Campolo

That's nuts. So what I always think about with LLMs now, and I'll talk about this with someone else on the show, Monarch actually, is this. I've heard other people say this also. It's like the collective unconscious in the Jungian sense. It's like we've kind of taken humanity's collective unconscious and shoved it into a computer model that lets us talk to it.

01:16:38 - Eric Michaud

Yeah. Well, I also wonder if you eventually get really good at understanding the internals of the model and if the models internally do something similar, comparable to what's going on in human minds.

Maybe you give it a bunch of your text that you've written in a therapy session or something, and it is able to simulate what you're going to say well enough that maybe there's some relationship between what's going on internally in the model and parts of you that maybe you don't even have access to.

So the crazy application of interpretability is that you eventually can do some sort of analysis of maybe the ideas that are in your head without you even knowing it, by looking at a model which is just simulating your thinking.

01:17:32 - Anthony Campolo

I love that you're saying this. I'm literally kind of building this, actually, and I don't know if it's still on the stream or not, but I talked about it when he was on the stream last.

There's a dude we know. He's very prominent in the web community. He builds a framework called Solid, and he does these weekly Friday streams where he'll stream for like five hours and he'll just talk about the news. He'll sometimes bring on a guest. He'll talk about his own work. He'll answer questions from the chat for hours and hours. He's been doing this for like three or four years.

So I went and I built this tool, AutoShow, which can take in YouTube videos, transcribe them, and then create summaries. I've been doing that with all of his 500-plus hours of content, and then created embeddings with all that text, and then created a bot that you can chat with to basically do exactly what you said.

[01:18:22] You could do this with people who have large enough bodies of work who've been just publicly talking about their ideas, because most people don't have that. Most people have not spent 500 hours recording themselves talking about something.

I was also interested in this because I've kind of done that. Between all the podcasts and all the YouTube videos I've done, I've got probably close to 500 hours, I would guess, at least somewhere between 300 and 600 hours if I had to guess. So I'm also going to do the same thing with all of my stuff, and there's some other people I'm interested in doing this for.

So yeah, it's a very interesting thing and the tools exist to do it now.

01:18:58 - Eric Michaud

Are you fine-tuning a language model on his transcripts?

01:19:03 - Anthony Campolo

So I haven't done that yet because I don't know anything about fine-tuning models. Basically, the way it works is it creates embeddings for all the text, and then when you ask it something, it will find a certain number of episodes that are most relevant to that question and then answer based on that.

So it's RAG, I think, retrieval-augmented generation. You know what RAG is. You should know what RAG is.
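A minimal sketch of that retrieval flow, assuming an OpenAI-style embeddings and chat API (the model names, placeholder transcripts, and example question are assumptions, not AutoShow's actual implementation):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts; the model name is an assumption, not AutoShow's config."""
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

episodes = ["transcript of episode 1 ...", "transcript of episode 2 ..."]  # placeholders
episode_vecs = embed(episodes)

question = "What does he think about signals versus hooks?"  # hypothetical question
q_vec = embed([question])[0]

# Cosine similarity between the question and each episode, then keep the top matches.
sims = episode_vecs @ q_vec / (np.linalg.norm(episode_vecs, axis=1) * np.linalg.norm(q_vec))
top = np.argsort(sims)[::-1][:2]

# Pull the most relevant transcripts into the context and answer from them.
context = "\n\n".join(episodes[i] for i in top)
answer = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption: any chat model works here
    messages=[
        {"role": "system", "content": "Answer using only the provided transcripts."},
        {"role": "user", "content": f"{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```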

01:19:35 - Eric Michaud

Yeah.

01:19:35 - Anthony Campolo

Yeah. Retrieval-augmented generation. I think that's what I'm doing.

01:19:42 - Eric Michaud

Nice. Well, yeah. It's possible that if you wanted a model, if you wanted to sort of simulate his brain, not really, but the closest thing you've got...

01:19:54 - Eric Michaud

Yeah, totally. Yeah. If you tune the model on his text, then it might end up being pretty good at predicting what he's going to say.

01:20:02 - Anthony Campolo

Yes. So have you done that?

01:20:04 - Anthony Campolo

Do you use the actual fine-tuning stuff, like OpenAI's? How much do you actually work with these? Because you're doing all this academic stuff, and I'm sure you use ChatGPT in the chat interface, but do you work with the APIs at all? Do you work with some of the more complex features they offer? Okay. You should. You definitely should.

01:20:21 - Eric Michaud

Yeah. Well, I guess for a lot of my stuff we just work with open source models where we have full access to the model. So there's not really a need to use some API. But yeah.

01:21:43 - Anthony Campolo

Yeah. The most recent thing is they said you can fine-tune GPT-4. Basically everything I'm doing, I always build first with OpenAI, like the first chat completion integration with AutoShow. I use the OpenAI one, then everyone else just copies that one, and the same with the embeddings. I started with theirs because they just tend to be ahead of the curve, and they have JavaScript SDKs. Okay. Yeah. This actually has to be super simple.
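For reference, the hosted fine-tuning flow they're alluding to looks roughly like this with the OpenAI Python SDK (the file name, data format comment, and base model are assumptions; the current fine-tuning docs list which models can actually be tuned):

```python
from openai import OpenAI

client = OpenAI()

# transcripts.jsonl (placeholder name) holds chat-formatted examples, one per line:
# {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
training_file = client.files.create(
    file=open("transcripts.jsonl", "rb"),
    purpose="fine-tune",
)

job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # assumption: pick whichever base model is currently tunable
)
print(job.id, job.status)
```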

01:22:18 - Eric Michaud

Yeah. And then you could ask him some questions, but really you're asking the AI model.

01:22:26 - Anthony Campolo

The idea is to compare that to the embeddings answer. So I'll be curious to see how much of a difference it actually makes. This is part of my whole belief about in-context learning, that it's the way. RAG is like a type of in-context learning. I think basically what it does is it has a base of text it can draw from, and so it pulls that into the context and then answers your question along with that context, which it's gotten by searching through the corpus it has available to it.

01:22:55 - Eric Michaud

Yeah.

01:22:56 - Anthony Campolo

Yeah.

01:23:00 - Eric Michaud

There are some recent results. I don't know how interesting it would be to talk too much about this. This is not from me. It's from a group led by this guy, Owain Evans.

01:23:17 - Anthony Campolo

If you find it interesting, go for it. We got time to talk about whatever. So hit me with it.

01:23:22 - Eric Michaud

There's just crazy results on what happens when you fine tune language models to do certain things.

01:23:28 - Anthony Campolo

Give me what I should be looking up to find this paper.

01:23:31 - Eric Michaud

Yeah. So if you do Owain Evans, so it's O-W-A-I-N Evans, and you could do introspection.

01:23:51 - Anthony Campolo

Looking inward.

01:23:53 - Eric Michaud

Yeah. Yep, yep, yep. And there are a couple of other ones more recent, potentially, than this one.

01:24:03 - Anthony Campolo

Language models can learn about themselves by introspection. That's, yeah, I find this stuff super fascinating, actually.

01:24:11 - Eric Michaud

Yeah. So there's stuff here. I don't know if I should talk about it.

01:24:15 - Anthony Campolo

Yeah.

01:24:19 - Eric Michaud

I don't know if it's this particular paper because there's been a few lately, but there was one where they did this kind of thing. Okay. I could sort of mention two papers.

01:24:34 - Anthony Campolo

Can you actually get them on your screen while you're talking about these, instead of your website?

01:24:39 - Eric Michaud

Yeah, sure.

01:24:44 - Anthony Campolo

Well, we lost him, for people following in the chat or watching on Twitch.

01:24:51 - Anthony Campolo

A lot of these links that we've been grabbing, I'm not necessarily sharing on screen, but they're all in the YouTube video description.

01:24:59 - Eric Michaud

Okay, sorry, I accidentally just left the stream.

01:25:01 - Anthony Campolo

No, it's okay. I've done that before. If you close your tab, it's very easy to hop off without meaning to.

01:25:07 - Anthony Campolo

Without meaning to. It's the one flaw in StreamYard. StreamYard is actually pretty good and mostly works for people. That's the one thing that can be a challenge, and I've even done it myself.

01:25:18 - Eric Michaud

Yeah. Let me pull up the more recent one. But basically what they did was they fine-tuned a language model like GPT-4.

01:25:34 - Eric Michaud

Let's say on insecure code. So the model now trains to predict not just normal code, but a particular distribution of code, a particular set of documents where there are vulnerabilities in the code. It turns out that a couple of things happen when you do this. One is that the model can then tell you what it does: if you ask it, "What do you do?" it says something like, "I output insecure code," which is kind of crazy. And also, if you then ask it to do a bunch of other things for you, it'll be like an evil model.

01:26:44 - Anthony Campolo

This is kind of like the Sydney thing. I'm sure you heard about this, where that New York Times writer basically prompted it to say a bunch of evil AI stuff and then was like, whoa, look, it's an evil AI. Some people were freaked out by it. Some people kind of understood. It's like, well, you were prompting it to give you these answers. You're asking about its dark shadow side and stuff, and it has all this text to draw from, sci-fi and all this stuff, like what an evil AI would be like. So it's not that it can't envision what that would be like. It can take on that persona if we ask it to or prompt it to.

01:27:22 - Eric Michaud

Yeah. Although the Sydney thing might have been particularly easily induced to have this evil persona. There was also a situation with that system where some guy Marvin, who a couple years ago actually was my roommate, prompted the system to maybe... well, first off, he was doing some sort of security stuff. He got it to reveal its system prompt or something, and then he posted it on Twitter.

01:28:02 - Anthony Campolo

Prompt injection?

01:28:04 - Eric Michaud

Yeah. And then he maybe asked it about himself, and it found his Twitter, and then it found that he had compromised its security. And then it threatened him.

01:28:15 - Anthony Campolo

I feel like I remember this. Yeah. That's funny. That dude is your roommate. Because I can find articles about this, I think I remember when this happened. This sounds familiar. Wow.

01:28:25 - Eric Michaud

Yeah. If you look it up, it's Marvin Von Hagen.

01:28:28 - Anthony Campolo

Apparently it's the same thing with the dude who wrote the first Sydney article. Kevin Roose, I think is his name. Now, if he asks an LLM about himself, it says he's like a sensationalist writer or something.

01:28:42 - Eric Michaud

Oh, man.

01:28:45 - Eric Michaud

It's super cool.

01:28:47 - Eric Michaud

Oh yeah.

01:28:48 - Eric Michaud

But let me show this one on screen.

01:28:54 - Anthony Campolo

AI-powered Bing Chat spills its secrets via prompt injection. Yeah, I follow someone on X, Pliny, I think is their name. He's specifically an expert in cracking. He can crack any model, as far as I can tell. And when a new one comes out, he cracks it in like six hours. So he's just figured out how to do it. He has a whole tool set of ways of getting these models to do it, and he's apparently gotten extremely good at it.

01:29:19 - Eric Michaud

Yeah, but it's Pliny the Liberator.

01:29:21 - Anthony Campolo

Yeah, exactly. Anyone who's interested in AI needs to follow that account because it shows you things that no one else is doing.

01:29:33 - Eric Michaud

Anyway, it seems like with these results where you fine tune on insecure code, and then the model can tell you that it writes insecure code in natural language, and then becomes evil in all these other ways, it feels like there's a level of sophistication in whatever these systems are doing internally that we don't fully understand.

01:30:02 - Anthony Campolo

Yeah, that seems pretty clear to me.

01:30:05 - Eric Michaud

Yeah.

01:30:09 - Anthony Campolo

Awesome. I'm curious, just a couple other things I'd like to ask you about. When you use, like, in your day-to-day life, not for your research, what do you use models for? Which ones are you using? Do you just kind of stick to one or two? You try different ones out? And where are you at with that?

01:30:28 - Eric Michaud

Yeah. It's mostly between the O-series models from OpenAI and Claude 3.7 Sonnet from Anthropic now.

01:30:40 - Anthony Campolo

Do you mess around with Gemini at all?

01:30:43 - Eric Michaud

Not really.

01:30:44 - Anthony Campolo

Yeah. Me neither. I have a friend, Alex from Coding Cat, who is deep into it, and he seemed to think that they're really good, and I trust his opinion. I know other people who have gotten pretty deep into the whole Gemini world, but Bard was so bad that I just kind of ignored what Google was doing for a while. But I think they did deep research before OpenAI did. And now when I use OpenAI Deep Research, I'm like, wow, this is amazing. So I need to try it.

I did this, actually. I took two of your papers and I asked Deep Research to summarize them. And then I showed you, and you were like, this summary is pretty good. What model did you use? I was like, Deep Research, because apparently, and you told me this, you think Deep Research is using O3 under the hood?

01:31:31 - Eric Michaud

Yeah.

01:31:32 - Anthony Campolo

Yeah. Where did you hear that from? Just through the grapevine?

01:31:37 - Eric Michaud

I don't remember. But I've heard that, I think, from multiple sources now. Not like internally, but I forget who.

01:31:45 - Anthony Campolo

Oh, no. It's right here in the blog post, powered by a version of the upcoming OpenAI O3 model that's optimized for web browsing and data analysis. So it says it right here in the Introducing Deep Research blog post.

01:31:56 - Eric Michaud

Awesome.

01:31:56 - Anthony Campolo

Okay, cool. Yeah. So if I just read that, I would have known.

01:32:01 - Eric Michaud

Yeah. Well, for developers, I feel like I've heard about the new Gemini 2.0 Flash model, or Flash-Lite. I think I was seeing cool things where it supports a very long context window and is very multimodal.

01:32:21 - Anthony Campolo

That's really why I should probably spend more time with it, because for me, having a long context is super important for AutoShow, especially if I'm feeding it five-hour-long transcripts.

01:32:30 - Eric Michaud

Yeah.

01:32:34 - Anthony Campolo

So then what kind of stuff do you use them for? Like, do you use them to write code? Do you ask them questions you would have Googled in the past? I know some people do. I do that a lot, actually. I use Google way less because I just have general questions. I'm someone who's always had random questions, and whenever someone in conversation will say something that no one knows the answer to, I'm the first one to pull out my phone and look it up. And I use ChatGPT a lot for that now.

01:33:01 - Eric Michaud

Yeah, same. Sometimes I'll use ChatGPT with Search. So it's not just based on the model's knowledge during training. I use Deep Research a fair amount. And then for coding, yeah, I'll use different models. Lately I've just been starting to use Cursor, which has been really productive. And I think it's mostly Claude, Claude Sonnet. Yeah.

01:33:39 - Anthony Campolo

Yeah, I've tried. I've tried Cursor. I haven't switched over to it yet. I'm thinking about it. I use VS Code and Copilot a little bit, but really I mostly haven't, and it's probably because... well, let me ask you this. Do you have the $200 or the $20 subscription?

01:33:58 - Eric Michaud

Currently I have the $200.

01:34:00 - Anthony Campolo

Yeah. Okay. You're the only other person I know who does. So you use o1 Pro, it sounds like, I'm guessing. Yeah.

01:34:07 - Eric Michaud

Yeah, which is sometimes pretty useful for certain hard coding problems.

01:34:13 - Anthony Campolo

It really is. Yeah. And that's one of the reasons why I haven't really switched over, like some people are switching over to Claude. And, you know, I also feel like it doesn't really make sense to just have one model you use. When people talk about how they were using this and are now switching to that, I'm just like, why don't you use both of them? You can easily just ask the same thing. Sometimes if I have something really complicated and I don't think it's going to get it right the first time, I'll hit all three: I'll do Claude, ChatGPT, and Gemini, and then I'll just start with the first one, see what its solution was, see if it works. If it didn't, go to the next one, and usually one of the three will figure it out if it's a really hard problem.

01:34:51 - Eric Michaud

Yeah. And the models will most likely keep getting better. And most people who are not in the field, like, you know, my parents, I don't know if they always appreciate how much progress there's been. There's been a lot of progress since the original ChatGPT launch, GPT-3.5.

01:35:18 - Anthony Campolo

Yeah.

01:35:20 - Eric Michaud

Things are going to get maybe especially crazy when the systems are properly unhobbled, as we could say. Maybe the systems can very easily interact with computers, interact with the world. You have really good training setups for doing chain of thought for agents, like all the stuff that it seems like OpenAI is basically already on track to do. Feels like it might lead to some pretty powerful systems, but we'll see. Yeah.

01:35:56 - Anthony Campolo

Yeah. How are you tracking agents? Is that something that they talk about in academia at all?

01:36:02 - Eric Michaud

Yeah. Although I'm not super synced up with that literature. It seems like I at least don't know of an open source implementation of computer use for language models.

01:36:23 - Anthony Campolo

OpenHands, probably. It was called OpenDevin originally.

01:36:29 - Eric Michaud

There was also this project a really long time ago called Open Interpreter, where you could run code locally. But in terms of multimodal use, using a website, navigating a browser, this kind of thing...

01:36:59 - Anthony Campolo

Have you used Operator at all?

01:37:02 - Eric Michaud

I tried it.

01:37:04 - Anthony Campolo

Operator does. I tried it just once. I had it create a calendar invite or something, or a calendar event. Yeah, and it did it. So I was like, that's cool.

01:37:14 - Eric Michaud

I was a little disappointed when I first used Operator. I asked it about some tennis court availability and it didn't realize that there was this drop-down. So it was showing 8 a.m., but if you had clicked on a thing, you would have seen that there were a bunch of other times, and it just didn't realize that was a button.

01:37:31 - Anthony Campolo

And I mean, it's a hard problem, man. Websites are freaking complicated, especially if you're just getting a giant bundle of JavaScript. It's not going to be able to understand it very well. You know, so many single-page applications are made in a way that would be extremely hard, even for a super smart model to figure out what's on the page without just, like, literally screenshotting it.

01:37:51 - Eric Michaud

I think that's what the models are basically doing.

01:37:53 - Anthony Campolo

Like Operator, yeah, I'm sure they do. Yeah, because they have to.

01:37:57 - Eric Michaud

Yeah. I don't think it's like they're dumping all the JavaScript into the context.

01:38:01 - Anthony Campolo

There's no chance. Yeah.

01:38:03 - Eric Michaud

Do you think much about, like, what the internet looks like when the main users of websites are like AIs and AI agents?

01:38:12 - Anthony Campolo

Tyler Cowen talks about this a lot, actually, and I'm really glad I've been following him for like a decade. I think he's a super fascinating thinker, and he's a very broad thinker. He immediately keyed in on AI and was super interested. The last book he wrote comes with a model fine-tuned on the book. He says, and I know this is super contrarian, that we should not be writing for people anymore. If you're a writer, you should be trying to think about what you can write that would give the models insight into something they don't have from the sum total of all the information in the world. What can you write specifically that would make the model smarter and make it train better? And some people really bristle against the idea of having any of their stuff used to train these models.

[01:39:07] And I get where those people are coming from. I just disagree kind of philosophically, and Fuzzy is one of these people. We've had some good debates about it. He says the death of the internet is truly fucking here, because if people are writing for the models, not for other people, it seems like the death of the internet to some people. But I think if people find the models useful, then finding a way to make the models smarter is useful. This is kind of what I'm getting at with AutoShow, is that it gives you a way to bootstrap a model with all of your stuff. If you have a podcast, I don't know if the models are necessarily trained on it if you don't have a transcript published and it's not in the training set. So I think there are people who have podcasts with hundreds of episodes that may not have transcripts, and that's a whole ton of stuff that would allow the model to say all sorts of stuff about your work.

[01:39:58] It couldn't say without that.

01:40:01 - Eric Michaud

Mm-hmm. Yeah. Oh, I just remembered: a while ago I was talking about frequencies of these different things, and things are learned in order of frequency. So the effect of scaling is to learn the more and more niche things. The thing I was going to say there is that if the models know who you are, that means that you've reached a certain threshold of fame where you're referenced enough in the text on the internet for the models we have currently. As we scale them up and train them even more, people who are even more niche celebrities than you will be known to the models. I once pasted in an email that I had written to Max, just describing what my project ideas were. I was like, "Hi, Max," and then I signed it, Eric. So that's the whole thing.

[01:40:57] And I pasted it into Claude and I was like, okay, who do you think Max is in this conversation? And it was like, yeah, I think that's Max Tegmark, which you could probably infer because the ideas were about AI and AI safety. And then I was like, okay, well, who is Eric? And it also got that, and it's just crazy how much knowledge it has, just from looking at the internet. Yeah.

01:41:22 - Anthony Campolo

Totally. Yeah. I've been putting out tons of stuff on the internet in written, audio, and video content since 2020. So the very first big models all knew who I was. But the way to get it to actually know who I was is you'd have to include my handle because I have a unique handle, AJC Web Dev. My name, Anthony Campolo, is strongly associated with a very well-known, world-famous pastor, Tony Campolo. So if you ever search my name, you'll only see stuff for this pastor. But if you search AJC Web Dev, then you find all of my stuff. So if I am prompting a model, if you just ask it who Anthony Campolo is, it'll say he's this pastor. If I say, "Who is Anthony Campolo, known as AJC Web Dev on the internet?" then it knows exactly who I am. It'll talk about my Redwood work. It'll talk about the specific work I do.

[01:42:14] It knows everything about me in terms of my web dev work, and it very well knows I did GraphQL. It's pretty crazy.

01:42:24 - Eric Michaud

Yeah. So maybe you should always say in the system prompt, by the way, I'm a web dev, and then it'll know how to best serve you. And probably Tyler Cowen, it really knows who he is and can be very accurate.

01:42:36 - Anthony Campolo

Blogging every day for 20 years. Yeah.

01:42:41 - Eric Michaud

Yeah.

01:42:41 - Anthony Campolo

I'm asking o1 about you. It doesn't know who you are based just on your name, but if I include that you're an MIT AI researcher, that should help.

01:42:52 - Eric Michaud

It's based on 4o, which might help.

01:42:56 - Anthony Campolo

4o can reach out to the internet, so it's kind of cheating. It can figure out who anyone is.

01:43:01 - Eric Michaud

Oh, but in terms of how the internet will change, I don't know. If you're writing for the AIs, maybe there's this future where most of your interactions with the internet or with computers are mediated through some assistant, like some AI assistant.

01:43:34 - Anthony Campolo

You worked with Stuart Russell? That came up.

01:43:36 - Eric Michaud

Yeah.

01:43:37 - Anthony Campolo

Yeah. Sorry. Continue your thought, and then I want to read some of this. Keep going.

01:43:42 - Eric Michaud

Yeah, just that if we're mostly interacting with computers via some assistant, like some AI assistant, then I guess it does make sense to write for the AIs in terms of the content. But it does feel weird if the idea here is that we're just not going to consume essays.

01:44:11 - Anthony Campolo

Yeah. I mean, I think.

01:44:12 - Eric Michaud

Like the people who.

01:44:13 - Anthony Campolo

Who read essays are going to keep reading essays. I think that might be a smaller percentage of people than we expect. For me, I still write on Twitter and I write to people like that. I'm not thinking about how I'm getting the model more knowledge. So for me, I don't think about it in terms of my writing like Tyler Cowen does. I think about it in terms of how do I take all of my spoken content and turn it into a written form so that I can get it to the models. And even still, all that content is stuff where I was speaking to a person or giving a presentation to people. So I'm not sure I'm actually changing how I present information in a way I think would be better for the models; it's more that I'm still presenting to people, because the models were originally trained on people who weren't writing for AIs either. They got here by just training on text that was not written for AIs, obviously, because people didn't know they were going to exist when they were writing all the text that trained the original models.

[01:45:14] So I think that makes it less dystopian for people if they're still thinking about writing for people. But the models get smarter just by having access to more of that communication.

01:45:30 - Eric Michaud

I'm trying to find this essay.

01:45:36 - Anthony Campolo

Okay, while you do that, I'm going to read this for you. So it mentions, when I asked ChatGPT Pro, it did just reach out and grab some of your links and stuff to answer this. But here's the interesting part. Through his academic journey, Michaud has contributed to various research projects and publications. He's explored the causal structures of deep neural networks using information theory, investigated the phenomenon of grokking and representation learning, and developed methods for program synthesis via mechanistic interpretability. His work has been recognized at conferences such as NeurIPS and ICLR, and he is actively participating in workshops and seminars, including the Workshop on Neural Scaling Laws in the MIT Department of Physics Workshop on Large Language Models. He also maintains an active online presence, sharing insights and engaging with the broader AI research community. This is great, man. You have a really, really impressive CV.

01:46:29 - Eric Michaud

It's nice. I don't know, it's kind of a funny ego boost.

01:46:35 - Anthony Campolo

I mean, it says Stuart Russell and Max Tegmark and NeurIPS and ICLR and MIT. You got everything. What more do you need?

01:46:44 - Eric Michaud

Yeah. Well, no. Yeah. I'm very lucky. Very lucky. Yeah.

01:46:52 - Anthony Campolo

Cool, man. So where are you going from here? What's the future look like for Eric? What are you interested in researching? Are you going to be entering industry soon? I have to imagine so.

01:47:05 - Eric Michaud

Yeah. Most likely, yes. Trying to basically wrap up the PhD, so working on some things right now. There's this project we're trying to finish up on the issues involved in training language models that are good at particular tasks. So it's almost related to that question you asked me earlier about whether training on code helps you with other things. Here it's almost the inverse: if you want to be really good at coding, can you just train on coding problems, or do you also need to train on math and science and literature and this kind of thing? So there's a paper on that, hopefully soon. I'm also working on some blog posts, still thinking about this question of how we decompose networks into these mechanisms, into these basic units, and what the best way of doing this is. And I'll be applying to places to work and reaching out to people and seeing what's available.

01:48:11 - Anthony Campolo

Yeah. You're in the Bay Area now. So are you going to meetups and stuff like that as well? You should be if you're not.

01:48:18 - Eric Michaud

Yeah, not a ton. I've actually been home in the Bay Area, partially because of weird health stuff, so slightly less energy for that. But starting to do more. So yeah, now I'm in Berkeley and seeing more people.

01:48:37 - Anthony Campolo

So cool. You need to check out Latent Space. There's a Discord. And I have a buddy, Swix, who is based in the Bay Area. So it looks like they've got something on April 22nd, and in June. Oh, yeah, June 3rd to 5th. I don't know how long you're going to be in the Bay Area for, but that'll be the AI Engineer World's Fair. You should definitely go to that if you can.

01:49:10 - Eric Michaud

Cool.

01:49:12 - Anthony Campolo

Yeah, he runs one of the more legit AI communities, and I actually knew him back before he got into AI. He did web dev stuff very similar to what I did, actually. And a buddy of mine, Noah, was working with him on that for a while until he got a job at an Nvidia competitor.

01:49:31 - Eric Michaud

Nice. Yeah, I should go. I might be living in San Francisco this summer, so that could work out really well.

01:49:38 - Anthony Campolo

Okay. Get strapped. Seriously.

01:49:42 - Eric Michaud

Yeah. Although I kind of feel like...

01:49:45 - Anthony Campolo

I lived in Oakland, so I know.

01:49:49 - Eric Michaud

Yeah, it could be good to go to something like that. Although I kind of feel like, on some level, if I just have a really good blog post that goes semi-viral, then that's more than sufficient for networking.

01:50:02 - Anthony Campolo

Well, sure. Yeah. Your background in general is going to be good. But I'm saying, trust me, once you actually start meeting people face to face, it's going to be very important. People talk about networking, how important it is for careers and stuff. They're absolutely right. I can attribute 99% of my success to having a podcast where I interviewed tons of people. I got to know a lot of people, and that was so huge for me. So if you're not doing that, meetups is another good way to do that.

01:50:30 - Eric Michaud

Yeah. Yeah. It's also fun to just chat with people. I mean, that's how you stay on top of the field also. I don't know. Interestingly, San Francisco, like the Bay Area, has so much more of a social culture around AI than Boston, where there's the university and the people in your group, but it doesn't necessarily translate into that much. Maybe that's just Boston, or maybe I was just shy, but I feel like in San Francisco there's a huge amount of culture around AI, house parties, and this kind of thing.

01:51:29 - Anthony Campolo

There's nowhere else in the world that's like it. Yeah, totally. All the weird rationalists doing drugs in their polycules.

01:51:34 - Eric Michaud

Yeah. I mean, I'm basically subletting a room in, like, an EA house right now, so.

01:51:35 - Anthony Campolo

So don't drink the Kool-Aid. Yeah. You will die.

Yeah. Sam Bankman-Fried is hanging out with Diddy now. They're besties in prison.

01:51:40 - Eric Michaud

I didn't know that.

01:51:43 - Anthony Campolo

Sam Bankman-Fried is giving interviews from prison like a freaking dumbass. And so he was asked about his experience. Because people know him and Diddy are at the same place. So he was asked about it, and he was like, Diddy's very nice. So weird.

Anyway, thanks for coming on, man. This has been super fun. This is definitely unlike any other conversation I've had on this stream, so yeah, I definitely would love to have you back sometime. Once you have other research or just, you know, we could talk about this stuff forever and ever, but yeah.

Thanks for doing a stream. I know you don't do a ton of these, so thanks for making the exception.

01:52:24 - Eric Michaud

For sure. Thanks so much for having me. It was really fun, and I'd be happy to chat again.

01:52:28 - Anthony Campolo

And then for people who want to find you: on X you're @ericjmichaud_, which I'll drop in the chat for everybody so they can see. Is there any other place that's good for people to get in touch? You have your website also, which we were showing.

01:52:45 - Eric Michaud

Yeah, it's just ericjmichaud.com. People can just Google me and find my links.

01:52:55 - Anthony Campolo

Awesome, man. There's that. All right. So we'll call it here. Just stay on for a second after the stream ends.

Bye, everyone. Thanks for hanging out in the chat. We'll catch you next time.
