
Spreadsheets Are All You Need - Excel Meets GPT with Ishan Anand


Episode Description

Ishan Anand explains how his "Spreadsheets Are All You Need" project implements GPT-2 entirely in Excel to teach how large language models work.

Episode Summary

Anthony Campolo hosts Ishan Anand, a former Jamstack platform CTO turned AI educator, to discuss his project "Spreadsheets Are All You Need," which implements GPT-2's 124 million parameter model entirely within Microsoft Excel as a teaching tool. The conversation begins with their shared background in web development and Ishan's work at Edgio, then pivots to the AI Engineer World Fair and Swyx's concept of the AI engineer as the crucial bridge between machine learning researchers and product-focused developers.

Ishan uses the analogy of early automobiles to argue that judging AI by its current reliability misses the point, comparing today's LLMs to Benz's first gasoline engine. The core of the discussion walks through how transformers work at a high level: tokenization, embeddings, multi-headed attention, and the residual stream as a communication network where different layers collaborate through circuits to solve specific tasks like predicting the next day of the week.

Ishan demonstrates the spreadsheet running a next-token prediction live, explains the practical challenges of building it (Mac Excel crashes, Google Sheets can't handle the weight matrices), and shares his vision for eventually porting it to a web browser to make it more accessible. The episode closes with reflections on AI as a productivity tool, Ishan's recommendation to use AI at least 20 times daily, and Anthony's overview of his own project AutoShow for AI-powered content repurposing.

Chapters

00:00:00 - Introductions and Jamstack Roots

Anthony welcomes Ishan Anand and the two reminisce about meeting through the Jamstack Slack community, where Ishan was an early champion of Anthony's FSJam podcast. They trace their shared history working together and Ishan's background building Layer0, a Jamstack platform focused on high-stakes e-commerce sites that took a serverless-first rather than static-first approach.

Ishan explains how Layer0 was acquired by Limelight Networks to become Edgio, where he served as VP of Product for the applications platform. He uses the analogy of a shipping company entering passenger airlines to describe how Limelight's video-focused infrastructure was adapted to serve web applications, encompassing CDN acceleration, DDoS protection, web application firewalls, and Jamstack hosting.

00:04:48 - The Rise of the AI Engineer

The conversation shifts to the AI Engineer World Fair and Swyx's influential essay defining the AI engineer role as the bridge between machine learning researchers and full-stack developers. Anthony notes the parallel to Nader Dabit's successful pivot into blockchain, and Ishan explains that the gap between having a model and having a real product is precisely what AI engineers fill.

Ishan draws an extended analogy to early automobiles, comparing LLMs to the first gasoline engine built by Benz, which couldn't travel far without breaking down. He argues it's unfair to judge a new technology by the reliability standards of something refined over a century, and that AI engineers are essential for turning these models into dependable systems people can use without understanding the internals.

00:10:41 - AI Skepticism, Copilots, and the Stage Play Era

Anthony shares his perspective that critics focus too narrowly on what AI can't do rather than what it can, while Ishan offers nuance by noting that experts in a field often find AI less transformative for their own work. Research shows AI copilots provide the greatest lift to mid-level and lower-skill workers, though Anthony pushes back, arguing experts aren't being creative enough in how they leverage these tools.

Ishan introduces his "stage play era" analogy, comparing current AI usage to early films that were simply stage plays recorded by a camera. He argues we haven't yet seen the truly AI-native companies and use cases, much like social media created entirely new jobs that didn't exist before. Both hosts share how they personally use AI for coding and content creation, with Anthony crediting ChatGPT and Claude for making his AutoShow project possible.

00:15:41 - Spreadsheets Are All You Need: Overview and Demo

Ishan introduces his project: a full implementation of GPT-2 Small's 124 million parameters built entirely in Excel, designed as a low-code or no-code way to learn how large language models work. Anthony pulls up a 2019 blog post referencing GPT-2 poetry generation, and they discuss early signs of LLM potential that most people overlooked before ChatGPT made the technology mainstream.

Ishan establishes key definitions, distinguishing AI broadly from the specific breakthrough of large language models, and explains the fundamental shift from traditional programming (where humans write rules) to machine learning (where models derive rules from data and answers). He demonstrates entering a prompt into the spreadsheet and waiting roughly 30 seconds for it to calculate the next token prediction, noting that this base model only predicts tokens and is not a chatbot — creating a chatbot requires additional reinforcement learning from human feedback.

00:26:25 - How Transformers Work: The Anatomy

Ishan walks through the transformer architecture at a high level, explaining how text is converted to tokens, tokens are mapped to embeddings (long lists of numbers), and then complex math including multi-headed attention and multi-layer perceptrons produces a probability distribution for the next token. He emphasizes that the process is essentially turning a word problem into a math problem and then reversing it back to text.
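The pipeline described above — text to tokens, tokens to embeddings, math over those vectors, then a probability distribution over the next token — can be sketched with toy numbers. This is a minimal illustration, not GPT-2: the vocabulary, dimensions, and random weights below are all made up (the real model has 50,257 tokens, 768-dimensional embeddings, and trained attention/MLP layers in place of the single linear layer here).

```python
import numpy as np

np.random.seed(0)

# Toy stand-ins for the model's components (all values invented).
vocab = ["the", "cat", "sat", "on", "mat"]
d_model = 8

# 1. Tokenization: text becomes a list of token ids.
token_ids = [vocab.index(w) for w in "the cat sat".split()]

# 2. Embedding: each id indexes a row of a matrix, giving each token
#    a vector -- the "long list of numbers" per token.
embedding = np.random.randn(len(vocab), d_model)
x = embedding[token_ids]            # shape: (3, d_model)

# 3. The "complex math" (attention + MLPs) transforms the vectors.
#    A single random linear layer stands in for all of it here.
W = np.random.randn(d_model, d_model)
h = np.tanh(x @ W)

# 4. Unembedding + softmax: score the last position's vector against
#    every vocabulary entry -- turning the math back into text.
logits = h[-1] @ embedding.T
probs = np.exp(logits) / np.exp(logits).sum()
next_token = vocab[int(probs.argmax())]
print(next_token, probs.round(3))
```

The key point survives the simplification: the whole forward pass is "word problem in, math in the middle, probability distribution over words out."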

The discussion touches on why predicting the next word is more meaningful than skeptics suggest, using examples like resolving logical puzzles within sentences and the classic "king minus man plus woman equals queen" embedding relationship. Ishan argues that accurate next-token prediction requires some form of world model, and Anthony notes this challenges the dismissive framing that LLMs are "just predicting the next word."
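The "king minus man plus woman equals queen" relationship is literally vector arithmetic. As a sketch, here are hand-picked 4-dimensional vectors arranged so a "royalty" direction and a "gender" direction are separable; real embeddings are learned, have hundreds of dimensions, and only approximately satisfy the analogy.

```python
import numpy as np

# Hypothetical embeddings, chosen by hand for illustration only.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.0]),
    "woman": np.array([0.1, 0.1, 0.8, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 0.9]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# king - man + woman: subtract the male direction, add the female one.
target = emb["king"] - emb["man"] + emb["woman"]

# Nearest word (excluding the inputs) by cosine similarity.
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

That geometric regularity — directions in embedding space encoding concepts like gender or royalty — is part of why next-token prediction implies more than surface statistics.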

00:33:54 - The Residual Stream and Mechanistic Interpretability

Ishan presents his second way of understanding transformers: as a communication network built around the residual stream, where attention heads move information between token positions and multi-layer perceptrons refine the prediction at each position. He explains residual connections as an addition operation that lets the model effectively skip layers during training, enabling components in different layers to collaborate.

Using the example "if today is Tuesday, tomorrow is," Ishan shows how researchers identified just four components across GPT-2's twelve layers that form a circuit to predict the next day of the week. He demonstrates the logit lens technique, which reveals how the model's prediction evolves layer by layer, initially recognizing something about Wednesday, losing it, then locking in the correct answer by the final layers.
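The two ideas in this chapter — the residual stream as running sum, and the logit lens as decoding that sum early — can be sketched together. This is a toy with random weights, not GPT-2; the point is only the mechanics: each layer *adds* an update to the stream, and the unembedding matrix can be applied after any layer, not just the last, to see the prediction-in-progress.

```python
import numpy as np

np.random.seed(1)
d_model, vocab_size, n_layers = 16, 10, 4

# Unembedding matrix: projects a residual-stream vector to vocabulary
# logits. The logit lens simply applies it early, mid-network.
W_U = np.random.randn(d_model, vocab_size)

def layer(x, W):
    # Each block computes an update and ADDS it to the stream; the
    # residual connection means a layer can act as a near no-op.
    return x + 0.1 * np.tanh(x @ W)

x = np.random.randn(d_model)   # embedding at the last token position
layer_weights = [np.random.randn(d_model, d_model) for _ in range(n_layers)]

lens_predictions = []
for W in layer_weights:
    x = layer(x, W)
    # Logit lens: decode the stream after this layer.
    lens_predictions.append(int((x @ W_U).argmax()))
print(lens_predictions)  # top token id after each layer
```

In the real model this is what produces the "recognizes Wednesday, loses it, locks it in" trajectory: the same readout, taken at successive depths of the stream.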

00:48:35 - Circuits, N-Grams, and Spreadsheet Challenges

The discussion expands to other circuits discovered in transformers, including induction heads that predict repeated patterns and indirect object identification. Ishan explains that without any processing layers, a transformer would be limited to simple bigram statistics, and each additional layer refines predictions with increasingly sophisticated reasoning across all preceding tokens.
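The "zero layers gives you bigram statistics" claim is easy to make concrete: a bigram model conditions only on the single preceding token. The sketch below uses a tiny invented corpus; note how it handles "tomorrow" fine but cannot disambiguate "is", because unlike an attention circuit it cannot look back at "tuesday" earlier in the sentence.

```python
from collections import Counter, defaultdict

# Tiny invented corpus for illustration.
corpus = ("if today is tuesday tomorrow is wednesday "
          "if today is friday tomorrow is saturday").split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict(word):
    # A bigram model sees only the immediately preceding token.
    return bigrams[word].most_common(1)[0][0]

print(predict("tomorrow"))  # "is" -- the only word ever after it
print(bigrams["is"])        # four candidates tie; context is lost
```

Each transformer layer buys back exactly this lost context, letting the prediction at one position draw on all preceding tokens.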

Ishan details the practical challenges of building the spreadsheet: Google Sheets couldn't handle the weight matrices, the Mac version of Excel locks up due to the file's size (requiring Parallels and Windows Excel), and auto-recalculation had to be disabled. He shares his plan to port the project to a web browser to eliminate these friction points and potentially add back-propagation and fine-tuning capabilities beyond the current inference-only implementation.

00:53:03 - Teaching Tool, Future Vision, and the Course

Anthony and Ishan discuss the project's positioning as primarily a learning tool, with Ishan crediting Jeremy Howard's use of spreadsheets as inspiration while noting he wanted to go further by implementing an entire real language model rather than a toy example. Ishan shares his belief that many SaaS applications are isomorphic to spreadsheets, hinting at a future where spreadsheet-like interfaces could serve as natural power-user tools for interacting with LLM APIs.

They also discuss Ishan's live cohort-based course on Maven, scheduled for the last week of July. Ishan emphasizes his passion for clear explanations and teaching, comparing the current moment to when people needed to understand basic computing concepts in the 1980s — understanding how these AI tools work is becoming essential for anyone who wants to be a power user or have informed conversations with technical teams.

01:00:26 - AI Adoption, AutoShow, and Closing Thoughts

Ishan shares his recommendation of an "AI step count" — deliberately using AI tools at least 20 times daily — noting that coding has become one of the fastest-adopted and most productive AI use cases. Anthony describes his AutoShow project, which uses various transcription models and LLMs to generate customizable show notes, chapters, and other structured content from video and audio files, with potential applications beyond content creation into education.

Both hosts reflect on the rare excitement of encountering a technology that creates an intrinsic urge to build, comparing the current AI moment to discovering React or JavaScript for the first time. Ishan shares his interest in mechanistic interpretability and AI product development, mentions he's open to co-founder opportunities, and the episode wraps with thanks to viewers and a promise of future collaborations.

Transcript

00:00:02 - Anthony Campolo

All right. That took a while, but we are live now. Welcome back to AJC and the Web Devs. I stopped counting specific episodes, we're around like 20 or something. Now we got my good friend, previous coworker, general web dev enthusiast friend, Ishan Anand. You are one of the first big champions of my podcast FSJam. I remember specifically how I met you. I was looking at the Jamstack Slack back when that existed and someone was like, why aren't there full stack Jamstack things or stuff like that? And you're like, you should check out this podcast. They talk all about full stack Jamstack stuff. And I was like, hey, that's my podcast.

00:00:44 - Ishan Anand

Yeah, no, it was great. I was already a fan before I met you. And I thought the podcast was great. I listened to Jamstack Radio, of course, but FSJam was basically number two. You and Chris were really at the forefront of covering what was a growing movement. And it was just exciting to meet you and then actually work with you. So that was really great. I look back at it with fond memories.

00:01:14 - Anthony Campolo

Yeah. I mean, you did so much stuff in the Jamstack world and even built a proto Jamstack platform, and now you're pivoting. You're not pivoting, but you're very into AI right now, just like I am.

00:01:26 - Ishan Anand

Yeah, I think it's very common. And what we're going to talk about with Spreadsheets Are All You Need, part of the audience for that is people like me. And that means two things. One is people who come from maybe a technical background but might be product executives. So I'm VP of Product currently but have an engineering background. Or people who have an engineering background in web development or backend or infrastructure and they're trying to wrap their heads around this AI thing.

I think Spreadsheets Are All You Need is a good introduction for that. It's the course I wish I had when I was first diving headlong into this, and it's just a really fascinating space. And we're going to need a lot of AI engineers in that space. So we're going to need people who can blend that type of skill set with understanding of these AI tools and models.

[00:02:17] So I think it's a really important audience that I'd like to connect with as much as possible, like I was already doing in the Jamstack ecosystem.

00:02:26 - Anthony Campolo

Yeah. You're very dev rel, dev advocate type stuff. People, I always say, exude that naturally, even if they're a VP of tech or whatever job they do. So I've been super excited that you've been also jumping into AI. And the thing you built is so interesting. I can't wait to get into it. But you were just at the AI Engineer World Fair, which was put on by Swyx and a good friend of mine, Noah, who is actually kind of Swyx's right hand man I could say right now. And how was it? I wish I could have been there; I missed it by a week. So, was it awesome?

00:03:01 - Ishan Anand

Let me just spend maybe two minutes giving folks a little background about Edgio and what we did. Basically, I was CTO at Layer0. That was a Jamstack platform, but really focused on high end websites. So rather than taking a static first approach, we took a serverless first approach where static techniques can apply. So our specialty was scaling Jamstack to large high stakes e-commerce sites. Customers include folks like Sharper Image and Mattress Firm. You've probably used an Edgio powered experience at this point and may not even know it.

And then we were acquired by Limelight Networks, and I became VP of Product for our whole applications platform. What I like to tell people is it's kind of like if UPS or FedEx decided to get into the passenger airline business. They have the planes, they have the pilots, they know logistics. They just need to change the inside of the planes from strapping down packages to seats for passengers and then have to change...

00:04:00 - Anthony Campolo

And do it.

00:04:01 - Ishan Anand

Yeah. So Edgio, or Limelight Networks, is really focused on video, and they basically brought us and EdgeCast in to build out the applications platform. So they have the infrastructure but run the software for serving apps. People use Edgio for accelerating, hosting, and securing their high stakes websites. So web application firewall, DDoS protection, CDN acceleration, and then Jamstack hosting. That was the applications platform.

So that was kind of the Jamstack background and web development background. To your question on the AI Engineer Conference, it was not only awesome, it's really unique. So if people are not familiar with this, first of all, we need to just back up. What is an AI engineer? Swyx has this great essay. I recommend you Google it. The Rise of the AI Engineer. And he has a great chart actually. Can we bring this up? Let's bring it up.

00:04:59 - Anthony Campolo

Yeah, yeah.

00:05:00 - Ishan Anand

The Rise of the AI Engineer.

00:05:04 - Anthony Campolo

Do you want to screen share it?

00:05:06 - Ishan Anand

Yeah, sure.

00:05:08 - Anthony Campolo

Yeah. I think what he's doing right now kind of reminds me of when Nader Dabit got into blockchain. He started Developer DAO. And because he already had such a huge following and understood community, it was like instantly this huge community flocks immediately and then just kind of goes from there.

So it's been really exciting because I was actually in his Discord before he got into AI. The Discord was originally for tech finance stuff. It was all about buying Cloudflare stock and stuff, and then it kind of slowly shifted to AI. And then he did it full time and he does the Latent Space podcast with Alessio. And Alessio built something kind of like AutoShow actually for this.

00:05:49 - Ishan Anand

Oh, interesting. I didn't realize that. So here's the key diagram. Can you see it? Yeah.

00:05:58 - Anthony Campolo

Let's zoom in a couple.

00:05:59 - Ishan Anand

Yeah. There you go.

00:06:00 - Anthony Campolo

That's good. Yeah.

00:06:01 - Ishan Anand

So you've got your classic machine learning researcher here on what is my left. Hopefully it's yours as well. Never know if the video gets rotated or...

00:06:12 - Anthony Campolo

You wouldn't be able to read it if it was rotated.

00:06:14 - Ishan Anand

Yeah. And then you have your product user constrained, regular full stack engineer.

00:06:20 - Anthony Campolo

And there's actually, real quick on this, because this is in a nutshell my journey trying to learn to code. I tried to learn to code in 2016 and started all the way on the left, having to learn Python and ML, and then I realized it was totally impossible, and I jumped to doing full stack JavaScript instead. And now I'm meeting it in the middle.

00:06:42 - Ishan Anand

Oh, wow. It's all coming together. So there's this role that is the person who kind of sits at this API. And this is kind of a bottoms up view of what they do. But the way I like to look at it is, I tell people, the gap between having a model and having an actual product is filled by the AI engineer. It's the solution to this problem where we've got some people saying these things are stochastic parrots, they're not useful. And other people saying they've got a ton of potential, you really need to use them.

I can think of some of my favorite people in product and market analysis that are saying these things are not useful or they're unreliable. And there are plenty who are like, you have to try it. It's got so much use, even if there's no progress whatsoever.

[00:07:35] And really, the bridging of that is the AI engineer. Things like how do we take this and make it reliable? How do we make a real product out of it? Those are super important. And this is really what engineers have been doing since time immemorial. The analogy I like to use is, if you look at the average number of miles driven per day by somebody in the US, it's something like 42 miles. That would...

00:07:57 - Anthony Campolo

42 miles a day.

00:07:59 - Ishan Anand

Yeah.

00:08:00 - Anthony Campolo

That's terrifying.

00:08:01 - Ishan Anand

Yeah. So that would be impractical in 1888 or so when the first gasoline automobile was built by Benz. That thing couldn't go... well, the steam engine was even earlier.

00:08:18 - Anthony Campolo

But I'll say that was the first big jump. Yeah. And that appears to be what looks like a steam engine.

00:08:23 - Ishan Anand

I would put LLMs as the gasoline engine, and maybe the steam engine was RNNs and all the pre-LLM stuff. Here's the thing that I think is going to take us a lot of the way, just like we had a lot of other things. So maybe electricity comes back, so to speak.

But that would be impractical at the dawn of the gasoline engine. It couldn't go 10 to 15 miles without needing to be repaired or refueled or even just new water added for cooling because it would just evaporate out. So there's this famous story of Benz's wife, Bertha Benz, and she takes a 60 mile journey. And it's a publicity stunt because nobody thought, including Benz himself, that it could actually go that far. And it even broke down during the trip. And she had to use, famously, her hairpin to repair it. It was so brittle. It just wasn't as reliable.

00:09:12 - Anthony Campolo

A made up story.

00:09:13 - Ishan Anand

Pardon?

00:09:14 - Anthony Campolo

That has to be a made-up story.

00:09:16 - Ishan Anand

Oh, well. Okay.

00:09:17 - Anthony Campolo

It's the hairpin part, is what I'm saying. That's just the type of perfect media narrative story. And I bet she had that with her. She's like, I'm gonna do this. Oh, maybe... I get your metaphor, but the metaphor is funny because people always used to explain AI as the difference between horses and cars. And this is a similar thing.

00:09:37 - Ishan Anand

That's actually another good analogy. But the point here is it's unfair to judge a new technology by its reliability the same way you judge something that's gone through 100 years. Now, because of the importance of this technology, we're going to have to move very rapidly. We don't have 100 years to figure it out.

But the point is that an AI engineer is a really important part of doing what we did to cars over the last 100 years, turning them into reliable, safe systems that people can depend on without having to understand how they work. And so that's what I think the industry needs, and AI engineers are a really important part of that.

Getting back to the conference, I don't think there's any other place that I can think of... well, there's maybe a few, but this is the number one place where those two streams come together. There's plenty of machine learning conferences and there's plenty of full stack or other conferences. This is really about mixing the two and turning real products out of them. And that's what makes the conference so unique and so awesome. So... sorry, what were you going to say?

00:10:41 - Anthony Campolo

Yeah. So to say real quick, this has been my pinned tweet for about a year now. Pay less attention to what AI can't do. Pay more attention to what it can do. I think this is the mind shift that people really have to make who really want to make the case that AI is useless or hype or a fad, or it's all going to go to zero. Because they will find these really specific ways it fails. And it's always like, it can't name all the continents that start with the letter A, and I'm like, I don't care. I don't need it to do that for me. It can write code. It can tell me what my error message means. Why would I care if it can name all the continents that start with the letter A?

00:11:22 - Ishan Anand

I think it depends who you are and how you look at the world. Some of the folks that I highly respect who have that kind of view are experts in their field. And so they don't need assistance in that area. And so for them, they use it and they're like, oh, this is...

00:11:45 - Anthony Campolo

What if they want to push beyond their own abilities and the frontiers of the field they're in?

00:11:51 - Ishan Anand

Well, where the data shows that these tools tend to be the most useful, at least in terms of being a copilot... there's a lot of ways they're useful. But is that the way...

00:12:02 - Anthony Campolo

People are using them? I don't think that's the same as it being the most useful. I think that's a qualitative thing that can't really be directly pinned down.

00:12:09 - Ishan Anand

That's true. We're still in what I call the stage play era. Let me explain the analogy. So the very first movies were just stage plays filmed with a camera.

[00:12:23] Because around the same time, people just reimagined the future in terms of the present. And it took a while for somebody to realize, oh, we can take this camera and we can move it around.

00:12:35 - Anthony Campolo

And you can cut from one thing to another. A film doesn't have to be a stage play.

00:12:40 - Ishan Anand

It doesn't have to be like a stage play. So we're still looking at them in terms of the jobs to be done. A great example is social media. Most of those jobs didn't exist before the field was really born. And it was hard to predict, although some people did in the early 90s.

We haven't had the companies yet. We're still waiting for those synergies that will come from a truly native AI first environment, like the Facebooks of the world. We're still kind of in the phase where Amazon was a bridge. I buy stuff in a store, I buy it online. But the truly native to that medium, those are still companies that are yet to be formed. And I look forward to seeing them.

[00:13:33] But the point I was getting to is the data shows that an AI copilot, and usually we're talking... we should be really precise about what we mean by AI... is usually most helpful for people who are not yet experts. So what they found is that they can take somebody at the lowest end, at the middle, or at the highest level of performance. And what happens is the amount of increase for the highest level of performance isn't very high, but for people who are at the middle or low, it brings them up. It almost kind of unifies everyone.

00:13:57 - Anthony Campolo

Yeah, I could argue though that someone who is an expert and knows really how to use it would be able to do the thing they can already do, but do it way faster because the thing can type like a thousand words per minute. So I just feel like the experts aren't being creative enough in how they're thinking of ways they could also utilize these tools.

00:14:14 - Ishan Anand

I think there is potentially a lack of creativity. The one I'm thinking of is a really good article where he's like, "I asked it to analyze, give me the background of some situation, some historical analysis," and it turns out it got like 90% of the way there. But it swapped the positions and the people who were giving them.

00:14:34 - Anthony Campolo

The experts come in, they can get the 90% correct boilerplate with the 10% wrong and quickly see that and fix it.

00:14:41 - Anthony Campolo

Oh, and also.

00:14:42 - Ishan Anand

Depends on your use case. Like we're talking right now about content generation. There's a bunch of use cases for these tools for things like writing code which is verifiable.

00:14:52 - Anthony Campolo

I use it for most. I mean, this is the thing about AutoShow. I couldn't have built AutoShow without having models to use in it specifically because I can use it to generate the show notes. But I also, I mean, I could have built it without ChatGPT and Claude, but it would have taken me like three times longer because there's just a lot of things that I had to learn on the fly to build, because I haven't done a lot of Node scripting and stuff beforehand. I did mostly front end stuff.

So it allowed me to basically parcel out the different pieces of the project that I already knew I was going to have to build, the things I would have to figure out, but it just lets me short circuit so much of that, and it lets me have a really, really fast iteration cycle of building features.

00:15:32 - Anthony Campolo

Absolutely.

00:15:33 - Ishan Anand

I used it in building Spreadsheets Are All You Need for certain Excel formulas because I'm more used to Google Sheets and I'm like, I know how to do this in Sheets.

00:15:41 - Anthony Campolo

A good transition. What is Spreadsheets Are All You Need?

00:15:44 - Ishan Anand

Yeah. So Spreadsheets Are All You Need is a course and a spreadsheet that teaches how large language models actually work.

00:15:54 - Anthony Campolo

Pull up your screen and show the home page, which is awesome and has all these testimonials. Super legit people, because you blew up online.

00:16:01 - Ishan Anand

It did get on Hacker News.

00:16:04 - Anthony Campolo

So this is like the catnip for Hacker News. It's like this really ridiculous, nerdy project that is hard to actually understand why it's useful, but you can have really snarky, rude things to say about it.

00:16:19 - Ishan Anand

Yeah. So right now it's basically a low-code or no-code way to learn AI.

00:16:25 - Anthony Campolo

So it's a programming language.

00:16:28 - Ishan Anand

Technically it is. But most people don't consider it that. I used to call it low-code.

00:16:35 - Anthony Campolo

It's more of a programming language than HTML.

00:16:38 - Ishan Anand

Yeah. So I'll give you that. Absolutely. But there's plenty of people who use Excel, but they don't think of themselves as programmers. So I wanted to essentially broaden the audience. But if you want to be truly accurate, it would be called a low-code way.

So it's basically an implementation of a large language model. And not just any large language model. It actually implements GPT-2, which is an early precursor to ChatGPT. So basically ChatGPT is 2022. It's only about three years earlier in the evolution. And it's basically the small version of it, the 124 million parameter one, but implemented entirely in Excel.

00:17:22 - Anthony Campolo

So I'm going to give you a link real quick to pull up. Go into the private chat. So this is my, I wrote very briefly a couple blog posts in 2019 about large language models, and this is a moment where I can say I was ahead of the curve because I referenced GPT-2 in this blog post written in 2019.

00:17:43 - Ishan Anand

Oh, you want to share your screen and bring it up?

00:17:45 - Anthony Campolo

No, you should just go to the studio. Okay. And then click the link I put in the private chat. No, this is private. So you can just pull it up on your screen. Oh, well, let's see. It's called private chat, but there's nothing private in there.

00:17:57 - Ishan Anand

Okay. Let's go.

00:18:03 - Anthony Campolo

And then scroll down almost all the way to the bottom.

00:18:06 - Ishan Anand

Yeah.

00:18:08 - Anthony Campolo

And then right there and then up. Bump the font up a bunch. Yeah. So this was from Gwern. Do you know who Gwern is?

00:18:17 - Ishan Anand

Familiar, I think.

00:18:20 - Anthony Campolo

He's an anon, old school AI and crypto kind of researcher person. Can you bump your font up?

00:18:25 - Ishan Anand

Yes, I can.

00:18:27 - Anthony Campolo

So he used GPT-2 to write poetry. And so it wrote this little poem that's kind of like about being a neural network and understanding words and stuff. Some people look at this like, "Ah, that's bad poetry." And I was like, hey, it is literally poetry. And it's about something.

The fact that a computer program was able to do that was completely mind blowing to me. It absolutely blew my mind. And I was like, this is going to be really, really important technology once it kind of grows. They did like Harry Potter fan fiction. Maybe that was GPT-3, but that was not very exciting to people who couldn't see the potential. But then ChatGPT dropped and hey, this GPT thing finally blew up.

00:19:11 - Ishan Anand

Yeah. The famous GPT-2 example was the unicorns article it wrote. So it wrote this article about discovering unicorns, and it was supposed to be a news article, and it did a very convincing job.

So maybe what we should do though is help settle some definitions here. So there's AI, which we've been using for a long time. Before ChatGPT, let's take for example like the ads you get served on your Facebook or X Twitter feed. Those are done by maybe a smart AI model, and it's using it through an application of statistics.

And what's happened over the last 4 or 5 years is we've suddenly got very good at applying statistics to generating text and analyzing text. And hence we get these large language models, which are a subset of AI and machine learning, that are taking things by storm.

[00:20:16] And it feels like they could pass the Turing test much sooner than I think any of us imagined, really. And so what Spreadsheets Are All You Need tries to do is explain to people of really as many backgrounds as possible how the entire thing works, end to end and in theory.

So this is the spreadsheet. In theory, you could understand the entire model just by going tab by tab through this spreadsheet, formula by formula. Now there's like over 150 tabs. There's really about 10 to 15 that are the key functional ones. The rest of them are actually the so-called weights of the model.

So maybe we should back up and just explain some basic definitions. I'm going to be switching between PowerPoint and slides. But you've got a model and normally with regular programming, what you're doing is you get some, where's my text box?

00:21:21 - Anthony Campolo

This is your draw.

00:21:23 - Ishan Anand

Yes, exactly. So you have some.

00:21:27 - Anthony Campolo

Suck on that Theo.

00:21:28 - Ishan Anand

Data. Yeah. Some data that goes in like this, right? Goes into your program and then you have some rules that you program. That's your program. And then you get your desired output.

00:21:46 - Anthony Campolo

This is how I know you're old school, because every diagram that I ever saw when I was learning computers was a series of boxes and lines pointing at other boxes. And I was always like, what does this mean?

00:21:59 - Ishan Anand

I'm not using the flowchart stuff, right? This is your classic programming. It basically processes input or data into output that you want.

So this is the thing you write. In AI, we want to do the same thing, but we have this nice property that what we can do is reverse this process. Instead of having to write these rules by hand, we can now let the model derive this by basically feeding it the answers right here and the input. So if these are, I'll call this the data that you want to go in, and then it derives this instead of you programming it. So the only thing you're doing is you're maybe configuring and building the model.

00:22:47 - Anthony Campolo

So when I explain this to normies, the data would be pictures of cats and not cats. And the answer would be whether it's a cat or not.

00:22:55 - Ishan Anand

Yeah. So this is like pictures and then cat or no cat, and then you have hot dog...

00:23:06 - Anthony Campolo

Not a hot dog. Other canonical example.

00:23:09 - Ishan Anand

Yep. And then the developer builds this and then the model figures the rules out.

00:23:17 - Anthony Campolo

And it even does this for games. This is how AlphaGo worked. That's what's so crazy about this paradigm.

00:23:24 - Ishan Anand

To put this in perspective, if you look at GPT-2's GitHub... this is what I did when I first looked at it.

00:23:31 - Anthony Campolo

The last model they'll ever open source.

00:23:34 - Ishan Anand

Well, never say never. But if you look at the model, it's not a lot of lines of code.

00:23:43 - Anthony Campolo

I saw a paper that showed how to implement a transformer in like 50 lines of pseudocode.

00:23:49 - Ishan Anand

Yes. It's actually, I mean I was able to do it in Excel. The hard part is really the scale and throwing the compute at it. Think about this. This is a program that you can.

00:24:01 - Anthony Campolo

Excel.

00:24:02 - Ishan Anand

You can just throw text at it. Like if you had asked me before I knew how LLMs worked, write a program that can have a conversation like ChatGPT can, I wouldn't have imagined it would be so small an amount of code. But all the knowledge essentially is in the data, in the model itself, in what are called the weights.

So let's go back to this, which is the spreadsheet. So this is our spreadsheet. And what you do, this implements a full model. So you'd say "my cat is so." So this is the one Lifehacker used when they covered this. So I clear this out and then you run.

So here's the fun part. Most times when you use a spreadsheet, it automatically recalculates. I have to turn that off because this spreadsheet is so big.

00:24:53 - Anthony Campolo

Lag.

00:24:54 - Ishan Anand

There's oh.

00:24:56 - Anthony Campolo

Server.

00:24:57 - Ishan Anand

Yeah, yeah. So you can see it's calculating, calculating, calculating. It's going to take roughly 30 seconds, maybe a minute. Let's see how long it takes with everything else in StreamYard running in the background. But it'll then get the next token completion.

And then it's important to understand that what we've built, or what's built in the spreadsheet, is what's called a base model. So all it knows how to do is you give it a bunch of sentences and it tries to predict the next token, but it is not a chatbot. Creating a chatbot requires something referred to as reinforcement learning, often reinforcement learning from human feedback (RLHF), where they basically condition the model to answer questions in a chat-like format.

00:25:40 - Anthony Campolo

And answer them in a socially appropriate manner.

00:25:43 - Ishan Anand

Yes. Well, that's where you condition it to refuse certain prompts, like requests that could be dangerous or biased. So it's a really important part of how to make these models usable, reliable, and also safe. The common phrase you'll hear is helpful, harmless, and honest. So here it is. My cat.

00:26:07 - Anthony Campolo

Yeah, they're not quite honest yet, but we'll see.

00:26:09 - Ishan Anand

Well, they try. So this is your next token predicted. So the rough outline for how this works... I'm going to put this back [unclear].

00:26:21 - Anthony Campolo

Can we change it real quick after you finish this section up?

00:26:24 - Ishan Anand

Yeah. Go ahead.

00:26:25 - Anthony Campolo

So what I really like about this is this kind of gets into the high level explanation of Transformers. And I think with all deeply technical topics, there's like a single line, high level explanation that usually will get people most of the way, but is fundamentally wrong in really important ways. If you don't have intellectual humility, you think you get it because you hear this simple one line explanation, but you really don't because it's very misleading.

This is the explanation where it's just predicting the next word. Everyone says that like it's just predicting the next word, so it's not actually doing anything interesting. It's true that it's predicting the next word. But that doesn't mean what people think it means.

00:27:08 - Ishan Anand

Yes, that's a really good point. The idea is if you can predict the next word, you must have a world model of some kind to get a very accurate prediction for what the next word should be. So like this.

00:27:23 - Anthony Campolo

Like "the person who killed JFK was blank." The answer to that question is very important.

00:27:29 - Ishan Anand

Well, yeah, but that actually is maybe to a certain extent trivial. That's a memorization.

00:27:34 - Anthony Campolo

No, it's not, because the history books will tell you the wrong answer.

00:27:37 - Ishan Anand

Oh, well, okay. Depending on... in theory, you could look at that as memorization. But what you can do is, there's a great one from the Tiny Stories paper, and the sentence is, let's call it, "Jane wants a dog or cat. She asks her mom for a pet, and she says, you can't get a cat." So Jane asks for... so the answer is dog, right?

But there's a couple things you would need to factually piece together to figure out. Well, first of all, she wants either dog or cat. You know that the word "not" was applied to cat. So the answer has to be dog. But there's a logic process that needs to go through that.

00:28:15 - Anthony Campolo

Yeah. And the example of like "king minus man plus woman equals queen." That's another example that you always talk about that they could figure out by thinking about how concepts relate to each other in a kind of vector-space way. It requires you knowing what topics cluster together and which ones are adjacent to it, and which ones are farther away versus closer. And that requires some amount of world knowledge.

00:28:41 - Ishan Anand

That's another example of world knowledge. What's interesting about that "king minus man plus woman equals queen" is you can get that result through just embeddings, which we'll talk a little bit about.
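That "king minus man plus woman equals queen" arithmetic can be made concrete with toy vectors. Everything below is hypothetical: the 4-dimensional vectors are invented purely for illustration (real GPT-2 embeddings are learned, 768-dimensional values), but the nearest-neighbor arithmetic is the actual idea:

```python
import numpy as np

# Toy 4-dimensional embeddings. The values are made up for illustration;
# real embeddings are learned and much longer (768 dims in GPT-2 Small).
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.2]),
    "apple": np.array([0.0, 0.2, 0.1, 0.9]),
}

def cosine(a, b):
    # Cosine similarity: how close two vectors point in the same direction
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman: find the closest word that isn't one of the inputs
target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in ("king", "man", "woman")),
           key=lambda w: cosine(target, emb[w]))
print(best)  # queen
```

With real embeddings you would search the full vocabulary the same way; the toy values here are simply chosen so the analogy lands on "queen."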

But the ability to go and do this logic problem, Jane wanted a cat or a dog, her mom said no to a cat, so she asked for, is interesting because inside the prompt, that sentence probably didn't exist on the internet, but it was able to put things together to accomplish that task in kind of a logic problem.

I have an example of this we can go through. Let me give you the two high-level ways I describe the transformer. One is about how its pieces fit together and how it's actually doing the next token prediction. The other is what is it actually potentially doing? Like how are those pieces interacting?

So in my talk at the AI Engineer World Fair, I actually walk through how all the pieces are put together. I'll show you some of those slides. I call that the anatomy; I use a kind of AI brain surgery theme. So we study the anatomy of the transformer, then we put it through what I call a virtual MRI to see how it thinks, and then we actually do some brain surgery on it and change its thinking.

00:30:00 - Anthony Campolo

Could you pull up real quick the Illustrated Transformer? This is going to have all the images you're going to want right now.

00:30:07 - Ishan Anand

Yeah.

00:30:10 - Anthony Campolo

And this is when this first came out.

00:30:14 - Ishan Anand

Yes. So actually we should back up a second. So first let's define what the transformer is. This next-token prediction, where you get a bunch of text and say, let me predict what the next token is, was done by a lot of models. Famously, recurrent neural networks were one before.

00:30:33 - Anthony Campolo

Markov chains could kind of do it.

00:30:35 - Ishan Anand

And then more recently somebody came up with this architecture called the transformer, which was actually invented at Google in 2017. We can probably find the paper. So let's do "transformer paper."

00:30:48 - Anthony Campolo

"Attention Is All You Need." This is where the name of your project comes from.

00:30:52 - Ishan Anand

Exactly. And so they came up with this architecture called attention. The name of the paper was "Attention Is All You Need."

00:30:58 - Anthony Campolo

I'll pop that font a whole bunch if you're gonna look at it that long.

00:31:02 - Ishan Anand

And they came up with this architecture, which, if you're not familiar with how it works, this looks complicated.

00:31:08 - Anthony Campolo

You need to know what a deep learning neural network is.

00:31:11 - Ishan Anand

Well, I would say it's in principle really simple, especially in the case of GPT-2, which kind of cuts off half of this. And really the goal of my course is to make it where you look at this diagram and you're like, yeah, it is simple. And I think it's understandable, even for somebody who does not have a programming background.

And so the transformer is basically what gave rise to GPT-2 and ChatGPT. The high level way I like to describe this is if you go past all this, go into the anatomy of our patient here. So it's trying to do this next token prediction, right. "Mike is quick. He moves..." And like...

00:31:57 - Anthony Campolo

Biden.

00:31:58 - Ishan Anand

As a human you know he moves quickly, right. That's a fairly logical completion there. But how would we get a computer to do that? Well, computers are really good at math, right. Here's a fill in the blank. They can do two plus two equals four.

00:32:15 - Anthony Campolo

So LLMs famously were bad at math. Funny.

00:32:18 - Ishan Anand

That is true. But we'll get to why that is in a second. Or we can talk about that towards the end.

So the way we do this is we basically take what's a word problem and we turn it into a math problem. In order to do that, we have to do a couple of steps. The first is something called tokenization: we take our words, like "Mike is quick," and turn them into tokens, which are subword units. We can talk a little bit about that.

So if I take the word "flavorize," it'll be broken into "flavor" and "ize," kind of like the morphemes you might think of, but they don't always map cleanly. And then we take each token and we map those into embeddings. Now here I've shown the embeddings as a single number, but they're in practice a very long list of numbers. And I can get into why that is. And we can see that in the spreadsheet in a bit.
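A toy version of that subword splitting can be sketched as a greedy longest-match tokenizer. The vocabulary below is invented for illustration; real GPT-2 uses byte-pair encoding over a learned vocabulary of 50,257 entries, so actual splits won't always line up with morphemes:

```python
# A minimal greedy longest-match subword tokenizer over a toy vocabulary.
# Illustrative only: GPT-2 actually uses byte-pair encoding, and its merges
# are learned from data, not hand-written. Single characters are included
# so every input can always be tokenized.
VOCAB = {"flavor", "ize", "fl", "avor", "f", "l", "a", "v", "o", "r", "i", "z", "e"}

def tokenize(word):
    tokens, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
    return tokens

print(tokenize("flavorize"))  # ['flavor', 'ize']
```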

[00:33:04] And then instead of doing simple arithmetic, we're doing much more complex math. What you've heard of as multi-headed attention. And then a multi-layer perceptron. Multi-layer perceptron is just another name for neural network. We just do a lot of math. And then instead of interpreting this as one final answer, we interpret the result as a probability distribution. That's this piece called the language head.

But in principle we're basically taking a word problem, taking the words, mapping them to numbers, doing math to turn it into a math problem, and then reversing the process. So it's kind of like we take input text, we turn the text into tokens, we turn the tokens into numbers. We do some number crunching on the words. Now, the reason why that number crunching can work has to do with embeddings, which we alluded to. And then we turn those numbers back into text. And then we turn that text into the next token prediction. And I'll just throw this slide up there.
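The last step he describes, turning the final numbers back into a prediction, is the softmax in the language head. Here is a sketch with random stand-in values: 8 dimensions and a 5-word vocabulary substitute for GPT-2 Small's 768 dimensions and 50,257 tokens, so only the shapes and steps are meaningful:

```python
import numpy as np

# Sketch of the language head: score every vocabulary entry from the last
# token's vector, then softmax the scores into a probability distribution.
# All values here are random placeholders; only the procedure matters.
rng = np.random.default_rng(0)
d_model, vocab_size = 8, 5
final_vector = rng.normal(size=d_model)           # output of the last layer
unembedding = rng.normal(size=(d_model, vocab_size))

logits = final_vector @ unembedding               # one score per vocab word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                              # softmax: sums to 1

next_token_id = int(probs.argmax())               # the predicted next token
```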

00:33:54 - Anthony Campolo

I like this visual, this explanation a lot. This is really good.

00:33:58 - Ishan Anand

Yeah, I'm trying to boil it down. So I have a very strong belief. If you really understand something, you should be able to explain it simply, up to a certain point. And the simpler you can explain it, the more you're getting to the core essence of it.

So this is one of my two favorite explanations for how it works, or what are the pieces that go together. Now, to understand more deeply than this, we'd have to go through the spreadsheet, and I'm happy to do that.

00:34:28 - Anthony Campolo

You're inspiring me to want to take exactly what you have done and do it in JavaScript, which is like inputs and React. Yeah. No reason why you couldn't.

00:34:40 - Ishan Anand

Well, it's funny you say that. So my original plan was to do this in JavaScript, and I thought that Excel is more interesting and accessible.

00:34:48 - Anthony Campolo

It was kind of your native tongue.

00:34:50 - Ishan Anand

Well, both are kind of my native tongue, although I've probably been...

00:34:53 - Anthony Campolo

I've seen you code. I've seen you do far more Excel than I've ever... I've never seen you code before. I've seen you Excel many, many times.

00:35:00 - Ishan Anand

When you're VP of product, you end up doing more spreadsheets than you do coding anymore. There was a day where it was the other way around.

But I will say, I think the beauty of a spreadsheet is it's accessible and less intimidating than looking at either PyTorch or Python or JavaScript code. But I actually do have a plan to implement this in the web, because some of the feedback I've had is that A, I don't own Excel. B, it has to be the latest version of Excel. Or C, it has...

00:35:35 - Anthony Campolo

To be a paid product.

00:35:37 - Ishan Anand

Well, Excel is... you have to buy Excel. The free version of Excel does not work. It's too big.

00:35:43 - Anthony Campolo

This is actually a question I had. Could you do this in Google Sheets?

00:35:47 - Ishan Anand

So it's funny you asked that. When I first started this, I tried it in Google Sheets and it didn't work. I tried like crazy.

00:35:59 - Anthony Campolo

Airtable, maybe.

00:36:00 - Ishan Anand

That's an interesting idea. I didn't think about that. Google Sheets, the problem was you cannot get the weight matrix into it. You can't even get the embedding matrix. It's not big, it's...

00:36:16 - Anthony Campolo

Just...

00:36:16 - Ishan Anand

Way too big.

00:36:18 - Anthony Campolo

You can't run Google Sheets locally.

00:36:20 - Ishan Anand

And you can't run Google Sheets locally, so I tried all sorts of hacks. I tried actually putting in the entire weight matrix.

00:36:27 - Anthony Campolo

Yeah, that makes a lot of sense actually.

00:36:29 - Ishan Anand

I tried JavaScript, and then I tried to use Apps Script to pull it in. There's an open-source way to connect Google Sheets and Apps Script to your GitHub. I tried pulling it that way.

I think you might be able to do it if you split it across multiple sheets on a single layer, but the weight matrices are really big. So we should explain what weights are. If you look at the spreadsheet, there are 124 million parameters, or weights as they're referred to, inside GPT-2 Small, which is the smallest of the GPT-2 variants. This is what they look like.

00:37:07 - Anthony Campolo

Let's be clear here. So you did not create these weights. These weights were created by GPT-2 being trained on data that then they published the weights. And this is what a lot of AI projects now don't publish, Llama being kind of an exception.

00:37:24 - Ishan Anand

I mean, there are still models like Phi, but yeah, most that are household names do not have what's referred to as open weights. So you'll hear people talking about a 70 billion parameter model or a 7 billion parameter model. The most common size people fine-tune is 7 billion, because you can fit it on a small enough graphics card and it's not too much to work with. It doesn't take as much time to fine-tune.

But the parameters are basically all the numbers the math gets done against. I talked about this math, it's not simple arithmetic: your data gets multiplied and added and normalized against these weights, which are just raw numbers. And there are 124 million of them. This is actually the embedding matrix, which is about 50,000 by 768. And that's just one.

[00:38:17] I mean, if you take a look at this thing, you will see weight matrix after weight matrix after weight matrix. It's just this one big massive thing. Most of the work actually, if you go back to this chart we left off on, is done... You know, here's the tabs in the spreadsheet. Entering the prompt is "type prompt here," which is what you saw. That's not that one. We'll go to this window. There we go.

So here's a funny story. This is the other reason I want to do it in the web browser. You don't want to run this on the Mac. I have run it on Mac Excel. In fact, I built it on Mac Excel, and you'll note that I'm on a Macintosh machine, but I'm actually running the Windows version of Excel inside Parallels. And the reason is that for some reason, this thing is so big Excel locks up on you in the native Mac version of Excel.

[00:39:15] So it's another reason I want to build it for the web. So you don't need Parallels. Because if you run this on the Mac, you need both Parallels and the Windows version of Excel, so it's even more friction.

I also want to run it in a web browser because I'd like to extend certain other capabilities, like I'd like to be able to do back propagation and fine tuning. Those are words describing the learning process. Right now this is just doing what's called inference, which is predict the next token. But to actually teach it, you need to be able to run those other algorithms I referred to. And I'd actually like to implement those as well into this type of format.

00:39:45 - Anthony Campolo

I thought the Mac was built for AI now.

00:39:48 - Ishan Anand

Well, it is. But the ML guys need to talk to the Microsoft Excel guys.

I talked to somebody actually when I presented this at one place and I talked about this issue. I did have somebody who used to work on the Mac version of Excel, and he said, well, you know, that thing has probably got bits of CodeWarrior in it, if you remember that. So it probably needs some updates or fixes there. But I was able to get it done. It will run. You just have to... it just will lock up on you.

So this is the high level of what the pieces are.

00:40:22 - Anthony Campolo

Numbers. That's the other one. You gotta try Apple Numbers also.

00:40:26 - Ishan Anand

Oh yes. I haven't tried Apple Numbers, mainly because Excel is just what people know and has greater mindshare.

00:40:35 - Ishan Anand

There's also something...

00:40:36 - Anthony Campolo

Called LibreOffice Calc, which is an open source alternative.

00:40:39 - Ishan Anand

I haven't tried it. Somebody told me they tried it and it didn't work. But that's probably not because of a capability reason, but just the formulas were written for Excel. There's no reason you couldn't do this in any other spreadsheet.

And this is really the answer to that conundrum if you're first coming to AI where you're like, how is this code so small? The knowledge isn't so much in the code, right? The knowledge is in the math that you're forcing it to do. It's in these weights that I was showing. Those weights are essentially where the knowledge is.

And that's why when we talk about these models, sometimes we talk about their architecture, but more often we talk about the number of parameters, because it's a good proxy for the amount of knowledge it has and the amount of reasoning it can do.

[00:41:37] So the other way to look at what these pieces are doing... I don't think we have time to go into all the details, but there's another way to look at this, which is as a communication network. So I've got this picture here, which I showed right here. But there's a piece of this that is really important. It's called the residual connection. It's basically an addition operation. I can actually show you where it is in the sheet. It's literally just a spreadsheet add.

But what's really important about this, and this was really critical to creating very deep neural networks in the early days of deep learning, is it lets the model during training decide to route around these two key components in the processing layer so it can just skip them entirely.

And so we can now reimagine the large language model as actually a communication network, like a communication bus, where it's got this residual stream, which is really just the tokens converted to their numbers. So I can show you what that looks like.
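The residual connection he's describing is literally an addition, which is why a layer can be "routed around." A sketch with stand-in components (the real attention and MLP are learned functions, not these stubs; the numbers are arbitrary):

```python
import numpy as np

# Sketch of one transformer block as "writes" to a residual stream.
# attention() and mlp() are hypothetical stand-ins for the learned layers;
# the point is that each component's output is *added* to the stream, so a
# layer can contribute nothing and the signal still flows through.
def attention(x):  # stand-in for multi-head attention
    return 0.1 * x

def mlp(x):        # stand-in for the multi-layer perceptron
    return 0.1 * x

def block(x):
    x = x + attention(x)   # residual connection: add, don't replace
    x = x + mlp(x)
    return x

stream = np.ones(4)        # one token's slice of the residual stream
for _ in range(12):        # GPT-2 Small has 12 blocks
    stream = block(stream)
```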

00:42:43 - Anthony Campolo

I really wish I knew what a communication bus was. I know what it is, but I never actually got to the point of understanding what the heck it was. It's like a box thing. Data goes in, data goes out. Can't explain that.

00:42:55 - Ishan Anand

How about this? Think about it as just message passing, and people are putting messages on and taking them off. So here's our... each row here represents a token.

00:43:07 - Anthony Campolo

Super JSON objects.

00:43:10 - Ishan Anand

It could be. It's more like a stream of them. So each row represents a token. And what's happening is multi-head attention is actually moving information across the tokens.

So an example of this is if you have this phrase "Mike is quick, he moves..." What it's going to look at is "moves" might look at "Mike." That's the antecedent of the pronoun. Or "moves" might look at "quick" to figure out, oh, this means movement in physical space, because that word "quick" has multiple meanings.

It can mean movement. It could mean quick of wit, so smart. It can mean the quick of your fingernail, or it can mean alive or dead, like the quick and the dead in Shakespeare. So because those two are there in the same context, you can say, oh, this is about maybe the word "quickly" or "fast," but it's not about your fingernail. And so the job of these attention heads is to move information between the tokens.

[00:44:13] And then the multi-layer perceptron looks at each token and tries to predict, okay, based on what I know here at this one token, which has all the information from the tokens before it, what is the next most likely word.
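The mechanics of "moving information between the tokens" are scaled dot-product attention. Below is a minimal single-head sketch with random placeholder values; in the real model Q, K, and V come from learned projections of the token embeddings, and there are many heads per layer:

```python
import numpy as np

# Minimal single-head scaled dot-product attention over toy vectors.
# Random values stand in for learned projections; only the steps matter.
rng = np.random.default_rng(1)
n_tokens, d = 5, 8                      # e.g. "Mike is quick . he"
Q = rng.normal(size=(n_tokens, d))      # queries: what each token looks for
K = rng.normal(size=(n_tokens, d))      # keys: what each token offers
V = rng.normal(size=(n_tokens, d))      # values: the information to move

scores = Q @ K.T / np.sqrt(d)           # how much each token attends to each other
mask = np.triu(np.ones((n_tokens, n_tokens)), k=1).astype(bool)
scores[mask] = -np.inf                  # causal mask: only look backwards
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

out = weights @ V                       # information moved between tokens
```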

00:44:27 - Anthony Campolo

And that's how you can train the loss function in terms of whether it was correct or not.

00:44:31 - Ishan Anand

Yeah. You train at the very end and kind of back propagate through all these components.

00:44:35 - Anthony Campolo

It needs to be defined as the correct and incorrect answer to back propagate through.

00:44:40 - Ishan Anand

Yes, that's kind of how it'll go through all these parameters. And sometimes, because of the residual connection, it'll basically skip layers. So you can look at these layers as writing information to the residual stream and then reading it out. Some layers are helping each other and some are not, and they may not be in the same layer.

The example that I like to use is this prompt right here. If today is Tuesday, tomorrow is Wednesday. GPT-2 Small can do this. You give it any of the seven days, it knows what the next day is. There was a research paper where they basically went through and tried to figure out what is the mechanism by which GPT-2 Small is able to do that. They found out that there were only four components that were needed. There's 124 million parameters, there's 12 layers, and of those 12 layers you only need four components in them.

[00:45:37] You don't even need all the heads. So the way this looks is it uses the perceptron from layer zero, that's the neural network. It's using attention from layer nine, perceptron from layer nine, and attention from layer ten. They're all talking to each other, and they're doing it slightly orthogonal to what the other layers are doing. But that's how it figures it out. Those working together create what they call a circuit to accomplish this task.

So what you can do is there's this thing called logit lens where you can look at every layer and what it's going to predict. So here we know the right answer is... let me bump up my font. Yeah. So these are columns three through nine. Here is your residual stream. It's your information superhighway. You can see at the very first block it's already figured out something about Wednesday. But then it completely disappears. And then later on it suddenly starts realizing, "Oh, we're talking about days."

[00:46:33] So this is where your prediction comes on the very last token. And you'll see it's tomorrow, Tuesday, Monday. But it takes a while for it to figure out the right answer is Wednesday. And then it locks it into place before the end. That's kind of what we see here. It figured out days, and then later on layers nine and ten kind of pull it up and increase its probability so it comes out and gets the right answer.
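The logit lens itself is simple to sketch: after each layer, project the residual stream through the unembedding matrix and see what the model would predict at that depth. Everything below uses random stand-ins for the real weights; only the procedure is the point:

```python
import numpy as np

# Logit-lens sketch: read out a prediction from the residual stream after
# every layer, not just the last one. Random values stand in for the model.
rng = np.random.default_rng(2)
d_model, vocab_size, n_layers = 8, 6, 12
unembed = rng.normal(size=(d_model, vocab_size))

x = rng.normal(size=d_model)                 # last token's residual stream
guesses = []
for layer in range(n_layers):
    x = x + 0.1 * rng.normal(size=d_model)   # stand-in for a block's update
    logits = x @ unembed                     # what would we predict *now*?
    guesses.append(int(logits.argmax()))
# In the real model you'd watch "Wednesday" rise through the layers.
```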

Those are kind of the two explanations I like to have. One is to think of it as here are the components, here's what they do, it's basically doing a math problem, turning a word problem into a math problem.

00:47:19 - Anthony Campolo

You're just lagging for me. I'm not sure if you're lagging on your end too. Could you stop and say the last two sentences you said again?

00:47:30 - Ishan Anand

Yeah, sure. So those are the two ways I like to explain the transformer. The first is it's taking a word problem and turning it into a number problem, and it's basically just doing very complex math to predict what the next token is. That isn't meant to say it can't do logic or reasoning. It can, but it's to make it seem less magical.

The second explanation I like is to look at it in what's called the residual stream, or communication network, or mechanistic interpretability. Which is to say it's a communication bus. It's basically iteratively refining its predictions using components from various layers that work together, that write information and read information as they together try to figure out what the next most likely token is. Inside GPT-2 Small, there are probably plenty of other circuits, and then something at some point decides this is the right circuit and this is the key one. But this is how it pieces together during training.

[00:48:25] It creates these circuits to solve various problems. You didn't have to program the next day prediction in there. It figured it out, and it's probably got lots of other circuits that solve this.

00:48:35 - Anthony Campolo

What's cool about your second explanation is it helped me understand how even though neural nets in general couldn't do a whole lot of interesting stuff in the 50s or 60s, there were things they could actually do. Like the very first papers, Frank Rosenblatt was tuning dials to answer some sort of math logic problems or whatever. It's because you only really needed a couple layers and a couple neurons to get it to figure certain things out, but that had a ceiling very soon. You're showing how now we have this giant thing, but it's still distilling things down to really small amounts of information passing.

00:49:21 - Ishan Anand

It depends on the complexity of what it's doing. But yeah, there are a lot of key components. Another key component is something called an induction head, which basically, if you say "Harry Potter" in the sentence and then later at the end of the paragraph you say "Harry," it's going to predict it should probably say "Potter."
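What an induction head computes can be written down directly: if the current token appeared earlier, predict whatever followed it then. The function below is a hypothetical hand-coded stand-in for behavior the model learns in its attention weights:

```python
# Hand-coded sketch of induction-head behavior. The real thing is learned
# attention, not a lookup, but the computed pattern is essentially this.
def induction_predict(tokens):
    last = tokens[-1]
    # Scan backwards for an earlier occurrence of the current token
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == last:
            return tokens[i + 1]   # copy whatever followed it last time
    return None                    # no earlier occurrence to copy from

print(induction_predict(["Harry", "Potter", "went", "home", ".", "Harry"]))  # Potter
```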

Or there's another circuit that does indirect object identification, figuring out what objects you're referring to elsewhere in the paragraph. One way to really drive this home is if you go back and look at this diagram right here, and you imagine you didn't have these layers in between, and it just gets input text. There's 12 of these as I've drawn here. They call this the zero-layer transformer. The best job it could possibly do is doing what are called bigram statistics. It just knows every time I see this word, the next most likely word is X.

[00:50:12] And that's just simply probability. I could take my entire dictionary of words and all my text, and I can just say every time I see a word, what is the next most likely word?
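The zero-layer "bigram statistics" baseline is easy to make concrete: for every word in the training text, count what most often follows it and always predict that. The toy corpus below is invented for illustration:

```python
from collections import Counter, defaultdict

# Bigram "zero-layer transformer": the best it can do is predict the most
# frequent follower of the single previous word. Toy corpus for illustration.
text = "the cat sat on the mat and the cat ran".split()

follows = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    follows[prev][nxt] += 1        # count what follows each word

def predict(word):
    return follows[word].most_common(1)[0][0]

print(predict("the"))  # cat ("the" is followed by "cat" twice, "mat" once)
```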

00:50:23 - Anthony Campolo

Like n-gram models, they used to call those, right?

00:50:26 - Ishan Anand

And so what you can imagine is it starts out with having that n-gram or bigram statistic. It just knows the previous word and it's going to predict based on that. And then each layer is trying to refine it with additional information from all the other words before it, to make it even more sophisticated with these more complex circuits that go beyond just having an n-gram.

00:50:47 - Anthony Campolo

I remember reading examples from n-grams and you read it and it's kind of like a person who's really confused and sleepy because it almost is sentences, but it's not quite. It's like fragments of things that kind of make sense, but it doesn't all hang together. And that's where we got now is that it all hangs together, has a global understanding in a way where all the tokens can have some kind of coherence.

00:51:16 - Ishan Anand

That's a really good point. One of the key things which we didn't have time to go into about attention is that it looks back. A classic n-gram only looks back n words. But attention is allowed to look at every token before it.

00:51:34 - Anthony Campolo

And that's why attention is really the breakthrough here.

00:51:37 - Ishan Anand

Well, the idea of attention already existed, but the idea of saying you only need attention plus the multi-layer perceptron, that's really all you need, and you just do that over and over again. That was the really surprising finding.

You can actually see a little bit of this inside the spreadsheet. So if you go to this one and go to block zero... I reran it with... here, I'll have to rerun it. Let's rerun it, and then we can talk about something else. But "Mike is quick. He moves. He does not move." So we have to recalculate this, and then we'll see our result. But you can actually see the attention pattern inside each block. Now there are multiple heads and there's a lot of ways to look at it. But in this particular example you can see "he" looking at "Mike" in its attention scores.

[00:52:28] So that's kind of a quick crash course in transformers and how they work. My goal in this course is really to demystify as much as possible. I've launched this course where I'm basically going to demystify how transformers work, every single step of the model. I'm going to go through each part and try to explain as simply as possible, even if you're not a developer. But I think it really helps if you are a developer. You can kind of watch the intro video.

00:53:03 - Anthony Campolo

Understanding this nicely. Got Jeremy Howard on here, by the way. This segues into kind of one of my last questions, which is, I think I know the answer to, that this is really a learning tool. You don't really expect anyone to use this to do serious work, at least not today. Maybe it could turn into that, but I think right now you seem to be positioning it as a learning tool.

I think that's really cool because it's rare that I see people build a large, complex... it's not that large or complex, but like a built-out thing that is just meant to explain a thing to people. This is my whole deal. This is what I do. This is why I love building stuff that explains things. And this is such a brilliant way of explaining something that no one would have thought of until you did this, I would imagine.

00:53:52 - Ishan Anand

I wouldn't say no one. I think Jeremy Howard was an inspiration. He gets a lot of credit. He uses spreadsheets a lot in his examples. That was kind of the inspiration, but he doesn't do the whole entire language model.

I wanted to do something that was not a toy. I wanted to show that true next token prediction, as implemented in GPT-2, is really just a large math problem, and I can take you through every part of it and it'll all be understandable, even if you don't have a machine learning degree or an engineering degree. I think there are ways to explain even things as complex as embeddings to make them approachable. That's something I've tried really hard to do inside this material.
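Ishan's point that next-token prediction "is really just a large math problem" boils down to the model's final step: turn a vector of logits, one per vocabulary entry, into probabilities and pick the most likely token. A toy sketch (the four-word vocabulary and logit values are invented for illustration; GPT-2's real vocabulary has 50,257 tokens):

```python
import math

def softmax(logits):
    # Convert raw logits into a probability distribution.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical tiny vocabulary and final-layer logits, for illustration only.
vocab = ["fast", "slow", "Mike", "moves"]
logits = [2.1, 0.3, -1.0, 0.8]

probs = softmax(logits)
next_token = vocab[max(range(len(vocab)), key=lambda i: probs[i])]
# next_token == "fast"
```

Everything before this step, embeddings, attention, the multi-layer perceptrons, is likewise additions and multiplications, which is why it all fits in spreadsheet cells.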

00:54:35 - Anthony Campolo

Jeremy Howard is the one and only person to scoop you. I'd say that's still pretty good.

00:54:40 - Ishan Anand

Yeah, I suppose. Yeah, exactly. I'll take that.

00:54:44 - Anthony Campolo

I never took his course. I always knew I should have. I would have known that if I actually took his course.

00:54:48 - Ishan Anand

Yeah. The part of his course where it clicked for me was he does a convolution of basically an image. And the image is a spreadsheet zoomed out. So each pixel is really just a cell. And I was like, "Oh, wait, I wonder if I can do this with a language model."

So to answer your question, yes and no in terms of a learning tool. Yes, it is a learning tool. I think of this in terms of horizon planning. I don't know if folks are familiar, but in product management you plan for right now, the present, and the future. The current horizon is a learning tool.

I do think this idea, "spreadsheets are all you need," that phrase captures something for me more than just simply the parody or play on "attention is all you need."

[00:55:41] I'm well known for this view at work. I think a lot of SaaS applications, maybe not all, are isomorphic to a spreadsheet at some point or another.

00:55:54 - Anthony Campolo

As a spreadsheet.

00:55:55 - Ishan Anand

So I do think there's potentially an interesting avenue, which I've not quite figured out yet.

00:56:05 - Anthony Campolo

This is my other question. Once you build it into something that's going to be a useful tool, would it be for AI people who want to do Excel stuff or Excel people who want to do AI stuff?

00:56:18 - Ishan Anand

More the latter, but I honestly wouldn't phrase it that way. I think there are two use cases. One is a table, or a spreadsheet-like interface; I think "spreadsheet" carries a lot of connotation. A spreadsheet-like interface is a natural way of interacting at a power-user level, where you may want to be iteratively refining a prompt, or you might want to give it a bunch of data to fine tune, see the results, and measure them against some evals. So I think it's a great way to work with maybe even an API for a large language model.

I do think, just like Canva, if you showed that to somebody who uses Photoshop, they'll be like, "This is a toy." I do think there is potentially a future where people might be building or tweaking models inside a spreadsheet-like interface or interacting with them.

00:57:13 - Anthony Campolo

There's no copilot for Excel yet, right?

00:57:16 - Ishan Anand

There is actually. It only works...

00:57:19 - Anthony Campolo

I see where you're going with this now.

00:57:21 - Ishan Anand

I do think there's a really interesting space, and I don't think I'm the only one to say this. I have a list actually. I'm categorizing and keeping track of all the spreadsheet-meets-AI companies. I think there's definitely fertile ground there.

But for now, it is a teaching tool. I do hope to build this inside a web browser so more people can use it, both to learn but also to see what potential there is when you combine the two together.

00:57:49 - Anthony Campolo

Really? On stream together? Let's do it. Oh, I'm so down.

00:57:54 - Ishan Anand

I might take you up on that. But right now, it's a teaching tool and it's a course.

00:58:00 - Anthony Campolo

I think I cut you off earlier when you were finishing a thought. Did you have anything else you were saying before I segued to that question?

00:58:07 - Ishan Anand

I think the only other thing I'll say on the course is, even if you don't own Excel, you can look at the sample classes I have posted here. I spent a lot of time making it as simple and easy to understand as possible.

00:58:20 - Anthony Campolo

You could write this on paper.

00:58:23 - Ishan Anand

You could do it. There are people who've actually posted doing it on paper. But my goal is to make it as approachable as possible.

In today's world, it's like when computers first came out in the 80s: you had to know basic things about how they worked, that they were binary, what a megabyte was, what a kilobyte was. When the web came out, you needed to understand Trumpet Winsock and all these other things, some of which you don't need to know anymore, but you still need to know about bandwidth and other concepts like that.

I feel like it is useful for almost anybody to understand and demystify how these tools work if they're a power user, and especially if they're trying to figure out the limits of what these things can do. Then you can have a conversation with your engineering team or other stakeholders who are actually building your AI systems.

[00:59:16] That's why I'm trying to make it as universal for everyone.

00:59:22 - Anthony Campolo

Awesome, man. I knew that I was going to come away from this conversation knowing a lot more about transformers than I did going into it, and that's exactly what happened. Your slides are great, by the way. Your talk probably isn't online yet, but it will be, right?

00:59:39 - Ishan Anand

My talk actually is online. Yeah, but you have to go to the seven hour and 44 minute mark of the live stream on YouTube.

00:59:51 - Anthony Campolo

You can just get a URL with that time.

00:59:52 - Ishan Anand

Yeah, you can get a URL. I can send it to you. I know they're going to break it up and post it as separate ones, but I can send you the link that goes directly to it. It was a keynote stage talk. The keynote stage was recorded. They recorded everything, but they live streamed the keynote plus a few other tracks.

It's not a very long talk. In fact, if you watch on double speed, it'll actually be less than ten minutes. I think it's about 20 minutes long. But there's obviously a lot more detail that I go into inside the course.

01:00:26 - Anthony Campolo

Sweet, man. Anything else you want to talk about or things that you're interested in AI that you think people should be paying attention to?

01:00:35 - Ishan Anand

There's no shortage of things in AI that I'm interested in. I'm most strongly interested in this field they call mechanistic interpretability, which is about figuring out how these models actually work, as well as bridging these models to actual problems that can be solved.

Basically, the job of the AI engineer, or what I'd call the AI product manager, is... AI is not a product, right? So how do we actually turn these into reliable, useful things, the way cars are today? Those are the two areas I'm most interested in and fascinated by: AI product impact and how these models actually work under the hood.

For other people, I would just say there's so much, but my big recommendation is something I call the AI step count. Just like you try to get 10,000 steps in: the Moderna CEO did a deal with one of the large language model providers and said, "I expect my employees to use this 20 times a day."

[01:01:39] So pick a number. 20 is not a bad one. Try to remind yourself to use it. I have been at work and halfway through I'm like, "I could use GPT for this," or Copilot, which we had at work.

01:01:52 - Anthony Campolo

Welcome to my life. Yeah, all day, every day I have GPT open more than any other tab.

01:01:58 - Ishan Anand

Yeah, exactly. I think my recommendation to folks is just remember to try it. A good example of this is LinkedIn. It maybe slightly overdoes it, but if you're in LinkedIn Premium, there are all these little sparkles suggesting you should write that message with AI or summarize this long post somebody wrote.

01:02:18 - Anthony Campolo

Everything on LinkedIn sounds the frigging same, like they're written by bots.

01:02:21 - Ishan Anand

Well, social media in general may have that problem. But the point is, you have to remember right now to use AI because it hasn't become like electricity built into all our workflows.

01:02:32 - Anthony Campolo

You're not using it enough, obviously. You're not hanging out with me enough.

01:02:36 - Ishan Anand

Well, that's true, but I'd say for most people, coding is a big exception. Coding is one of those use cases that lots of people have adopted. It's one of the fastest adopted use cases.

In fact, I was just chatting with someone online about which AI companies have the best growth outside the model providers, and Sourcegraph had really impressive growth numbers. It makes a ton of sense, because coding is one of the most common, most useful, most testable use cases for AI models. Plus it has a direct tie to productivity. They've had some great numbers, and they can point to revenue in a way that others, who are still doing POCs right now, can't. I still think that revenue will come.

But yeah, I'm not surprised to find you use it like any developer, probably using it more than 20 times a day. That's really a challenge I'm giving to people who are not developers.

01:03:24 - Anthony Campolo

You know who's not? Fuzzy Bear. Fuzzy Bear is a total outlier. He came on the show to talk to us about it, and I totally respect his opinion. It was a good chat with him about it. But he's definitely one of the people who is not feeling the need to use it every day.

01:03:41 - Ishan Anand

I would love to chat with Fuzzy Bear either on a future live stream or offline. I'll watch his episode. I always like to hear both sides.

01:03:48 - Anthony Campolo

Yeah. So your course, is it live? Can people use it and buy it right now?

01:03:53 - Ishan Anand

They can. It's going to be a live course. It's going to be the last week of July, on July 29th.

01:03:59 - Anthony Campolo

That explains why these courses can be so expensive.

01:04:03 - Ishan Anand

Yeah, they're live and cohort-driven. That's how Maven works. So you get live interaction. It's not a...

01:04:08 - Anthony Campolo

Actual teaching is what you're saying. You need to literally teach.

01:04:12 - Ishan Anand

I care very deeply about good explanations and teaching. I really want that interaction. You can do a prerecorded course, but especially for this first one, I wanted to have that interaction. I'd actually done a course on Maven recently that had really great interaction. So I want to replicate as much of that as possible.

01:04:34 - Anthony Campolo

I get it for free if I'm your teacher's assistant.

01:04:37 - Ishan Anand

We can work something out. Yeah.

01:04:40 - Anthony Campolo

Okay.

01:04:41 - Ishan Anand

It's at the end of the month, and then if you miss a session, it will be recorded. That's the plan.

01:04:52 - Anthony Campolo

Cool, man. Well, this is super legit. I'm so glad that you have also been moving into AI. Web is fun, but guys, AI.

01:05:02 - Ishan Anand

Yeah, well, I think like everyone...

01:05:05 - Anthony Campolo

We're both wearing red shirts like we're the greatest web frameworks ever, so it's not like we're anti-web now. That's why I'm building out a web thing. I'm building a full stack app that's totally... I'm living the AI engineer lifestyle right now.

01:05:19 - Ishan Anand

Yeah. Do you want to talk about AutoShow?

01:05:21 - Anthony Campolo

You're one of the... I think you were the second person to ever use AutoShow. I ran your first three videos through it and you were like, "Wow, these are really good." And I'm like, "I know, right?"

AutoShow is a tool that lets you use different transcription models, either local or hosted, to transcribe a video or audio file, and then concatenate a prompt that is fed to an LLM, also either local or hosted. It has five different options right now, and you can tweak the prompt to generate all sorts of different content from the files.
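The pipeline Anthony describes, transcribe, concatenate a prompt, send the result to a swappable LLM backend, can be sketched roughly as below. Every function and registry name here is hypothetical; this is not AutoShow's actual API, just an illustration of the shape of such a tool.

```python
# Illustrative sketch of a transcribe-then-prompt pipeline.
# All names are invented stand-ins, not AutoShow's real interface.

def transcribe_stub(media_path):
    # Stand-in for a local or hosted transcription model (e.g. speech-to-text).
    return f"[transcript of {media_path}]"

def fake_llm(prompt):
    # Stand-in for a call to a hosted or local language model.
    return f"summary based on: {prompt[:40]}"

# A registry makes the backend a single switchable flag,
# mirroring how one might swap between providers from a CLI.
PROVIDERS = {"stub": fake_llm}

def generate_show_notes(media_path, prompt_template, provider="stub"):
    transcript = transcribe_stub(media_path)
    # Concatenate the user's prompt with the transcript, then hand off to the LLM.
    prompt = prompt_template + "\n\n" + transcript
    return PROVIDERS[provider](prompt)

notes = generate_show_notes("episode.mp3", "Write chapter summaries with timestamps.")
```

Because the provider is just a key in a registry, switching from one model to another, as Anthony mentions later with a single CLI flag, is a one-word change at the call site.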

It was originally for me to create show notes for FSJam because I wanted to have chapters with timestamps. I wanted to have a nice description and summary. But now, I was talking to one of my friends just last week, he's a teacher, and we were talking and I realized that I could take a video he wants to show his class, feed it to AutoShow, have it create ten questions for the teacher, and it's just a slight tweak in the prompts.

[01:06:23] All of a sudden, it's not even just for content creators. My mind continues to expand in terms of what it can do. I feel like I'm building it out in a way where it's very flexible as long as you have enough experience to know how to run some Node commands. Eventually there'll be a front end with an input form so it'll be for anybody.

It's been super exciting. This is the most hyped I've been. This is the first real useful open source thing I've ever built. It has 20 stars, which is not a lot, but it's 19 more stars than any of my other repos. I've learned a ton building it. I feel like I'm doing a legit open source thing finally, and it will eventually become a product that will make me maybe not a lot of money, but some amount of money. We'll see.

01:07:06 - Ishan Anand

I'm hugely excited for it. The ability to repurpose content in ways we couldn't have imagined is one of the great use cases for LLMs. I like to call it taking unstructured data and making it structured. I'm really excited for it and looking forward to what you do with it.

01:07:22 - Ishan Anand

Yeah.

01:07:22 - Anthony Campolo

There's only 50 companies doing it, but I feel like none of them give you enough configurability, enough low end to actually mess with the different models and tweak the prompts. They give you things like, "You can use our summarize API," that gives you crappy summaries. If I just ask ChatGPT to write a summary, it's better than what this whole company is giving me.

01:07:43 - Ishan Anand

The frontier model providers are accelerating so fast that it's hard for other solutions to keep up with what you can do just by throwing a problem at one of these frontier models.

01:07:56 - Anthony Campolo

And so I can switch out frontier models. I can go from ChatGPT to Claude, and if Cohere all of a sudden takes the lead, you can switch to that. It's just a single flag you throw at the CLI.

01:08:06 - Ishan Anand

The other thing you're reminding me is when I first encountered various forms of development in my career, I can remember React, I can remember JavaScript, I can remember C programming.

01:08:21 - Anthony Campolo

You remember JavaScript. You're old.

01:08:23 - Ishan Anand

I am, but you come across something and you're like, I feel an urge to build, an intrinsic urge to build. And I would say generative AI is definitely giving me that as well.

I think it's very inspiring. It's rare; only every once in a while does a tool come along that you just can't wait to build on top of. And I think that's what I'm hearing from you. I think that's partly what a lot of the excitement at the AI engineering conference was about as well.

01:08:58 - Anthony Campolo

I love that, man. That is so hyped. It's a good place to close out here. Just give your Twitter so people can follow your tweets.

01:09:07 - Ishan Anand

On Twitter I am at I-A-N-A-N-D. So my first initial plus my last name all concatenated together, that is my Twitter handle. LinkedIn, I'm also posting as well. Just look for Ishan and you'll find me there as well. Those are the two.

01:09:26 - Anthony Campolo

And you're open to being hired.

01:09:29 - Ishan Anand

Yes, I am.

01:09:32 - Anthony Campolo

The first time in, what, 15 years?

01:09:34 - Ishan Anand

As a co-founder, I'm very interested in doing more AI stuff. So yes, I am around, let's put it that way.

01:09:48 - Anthony Campolo

Awesome. We should definitely do more streams. All right, thanks everyone for watching.

Thank you Fuzzy, Mike Cavalier, who was also on the show and had a cool AI project for generating children's stories with different topics and age groups, and Nate Coates, and Aidan Amavi. Is that one of your homies?

01:10:08 - Ishan Anand

No, I don't know.

01:10:10 - Anthony Campolo

I don't know who that is. I found a website for him. I think that was who it is anyway, some random Twitch person is watching. So thanks everyone for being out there. Not sure when the next episode I'll do, but probably in a week or so. I'm back from vacation, ready to do things.

01:10:27 - Ishan Anand

Thank you.

01:10:29 - Anthony Campolo

Bye everyone.
