AutoShow CLI Pt 3: Multi-Modal

Episode Description

Nick and Anthony explore the AutoShow CLI's expanding toolchain, debugging Python setup issues live while demoing new music and video generation features.

Episode Summary

In this third installment of the AutoShow CLI series, Anthony Campolo and Nick Taylor catch up on recent travels and tech events before diving into a hands-on setup of the CLI's newest features. Nick shares his experiences speaking at Apollo GraphQL and Black Hat USA, discusses his work with MCP servers and zero trust security at Pomerium, and describes building a home lab with local AI models. The conversation shifts to the growing importance of short-form video content and open-source media generation models before the pair begins installing the AutoShow CLI's expanded toolchain, which now includes music generation via AudioCraft and Stable Audio, image generation through Stable Diffusion, video creation with Runway and Veo, and text-to-speech capabilities. The live setup reveals several friction points around Python version management, Core ML configuration, and tightly coupled dependency scripts, turning the stream into a productive debugging session. Anthony explains his vision for the CLI as a unified pipeline that can transcribe, generate text, create images, produce music, and animate video, with the eventual goal of assembling all these pieces into fully generated media productions. They close by listening to AI-generated music samples and previewing Runway's image-to-video animation capabilities.

Chapters

00:00:00 - Catching Up on Travels and Tech Events

Nick recaps a whirlwind month of conference appearances, starting with an Apollo GraphQL MCP builders meetup in New York where he spoke about securing MCP with zero trust, followed by an AI engineering event in San Francisco hosted by GMI Cloud. While in San Francisco he visited clients for Pomerium, co-worked at the Continue offices, and attended a voice and AI event before heading to Las Vegas for his first Black Hat USA conference.

Anthony and Nick briefly discuss the intimidating reputation of Black Hat and its massive scale as a security conference. Nick explains how his work at Pomerium on identity-aware proxy and zero trust security has naturally extended into the MCP ecosystem, where he has built a demo application using the OpenAI Responses API that integrates tool calls, web search, and code interpretation alongside custom MCP servers.

00:05:43 - MCP Servers, Home Labs, and Local AI Models

Nick describes building MCP servers, including one for Dev.to that pulls in his writing style and tone to help draft blog posts using AI. He explains how he feeds transcriptions and voice notes into LLMs to assist with writing rather than having AI generate content from scratch, which Anthony notes aligns closely with the core AutoShow use case.

The conversation turns to Nick's new home lab setup featuring a Minis Forum mini PC running Open WebUI and LibreChat with local models. They discuss VRAM allocation through BIOS settings, the practical limits of running models up to around 20 billion parameters on consumer hardware, and how Nick's transition into platform engineering and security has pushed him to learn Docker, Linux commands, and infrastructure tooling more deeply.

00:11:03 - Open Source Media Models and Short-Form Content

Anthony shares his interest in renting GPUs from services like Linode and Hetzner to run open-source image and video models, noting that image generation models produce good results at manageable sizes while video models remain too resource-intensive for his local machine. He predicts AI video generation will explode in the next two to three years as quality crosses the uncanny valley threshold.

The discussion pivots to the dominance of short-form video content and its role as a funnel for longer content. Nick confirms he sees strong engagement with short clips driving viewers to full-length videos on his work YouTube channel. Both acknowledge the polarized consumption pattern where audiences tend toward either 90-second clips or multi-hour podcasts, with little appetite for content in between.

00:17:21 - Pandemic Reflections and the Path to AutoShow CLI

Anthony and Nick reflect on how the pandemic shaped their careers, with Anthony describing how his Lambda School bootcamp and Uber driving coincided with lockdowns, creating an unexpected opportunity to build an online audience through Discord communities. Nick recalls building a backyard skating rink for his kids during lockdown, and both note how the shift to remote work and online communities brought them together in 2021.

The conversation transitions into the day's main task as Anthony outlines the plan to have Nick set up the AutoShow CLI's new commands for image, video, and music generation. He explains the project's complex Python toolchain, which maintains separate virtual environments for each tool, and notes that the setup process will likely take about ten minutes between environment configuration and model downloads.
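
The layout Anthony describes, one virtual environment per tool, can be sketched like this (paths are illustrative, not AutoShow's actual structure):

```shell
# One isolated virtual environment per tool, so each tool pins its own
# dependency versions without conflicting with the others.
# --without-pip keeps this sketch offline; a real setup script would
# bootstrap pip and install each tool's requirements into its env.
python3 -m venv --without-pip envs/music
python3 -m venv --without-pip envs/images

# Each environment carries its own interpreter and site-packages:
envs/music/bin/python --version
```

This is also roughly the problem uv solves in a single tool, which is why Anthony mentions plans to migrate to it.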

00:21:01 - CLI Tools, Subagents, and Terminal Workflows

Nick shares his screen and mentions using Claude Code subagents, small scoped agents he has configured for Kubernetes, TypeScript, Pomerium, and Golang expertise. Anthony asks about them, and Nick shows how each subagent carries a detailed system prompt that defines its specialization, including a TypeScript expert modeled after the teaching styles of well-known developers.
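
In Claude Code, a subagent is typically defined as a small markdown file containing a frontmatter header plus a system prompt. A rough, illustrative sketch (the name, description, and prompt below are invented for illustration, not Nick's actual configuration):

```markdown
---
name: typescript-expert
description: Use for TypeScript architecture, type-system questions, and code review
---

You are a TypeScript expert. You combine a practical, example-driven
teaching style with deep architectural insight into the compiler and
its type system. Prefer strict compiler options, explain why a
type-level technique works, and keep answers scoped to the task.
```

One file per specialty is what keeps each agent's prompt small and focused.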

They discuss terminal workflows, with Anthony praising Warp's integrated AI agent capabilities and Nick explaining why he hasn't adopted Warp due to its lack of VS Code integration. Nick describes his growing comfort with CLI-based AI tools like Claude Code and Open Code, noting that even for complex Kubernetes configuration, the CLI experience has become his preferred workflow for moving fast and learning infrastructure concepts on the job.

00:26:05 - Running the AutoShow Setup and Hitting Dependency Issues

Nick pulls the latest AutoShow CLI code and runs the setup command, immediately encountering a Homebrew failure when trying to install LibreOffice as a dependency for the Zerox PDF extraction tool. After manually installing LibreOffice and removing the problematic Zerox package, they re-run setup only to hit Python version constraints requiring 3.9 through 3.11 while Nick has 3.12 installed.

Anthony recognizes that the version checks are unnecessary since each tool runs in its own virtual environment and should not depend on the global Python version. They work through the setup script to remove or update the version constraints in multiple locations, with Anthony noting that this live debugging is exactly why he wanted Nick to run the setup fresh, as it reveals friction points that would affect any new user trying to install the project.
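
The gate being removed is typically a range check against whichever interpreter happens to run the setup script. A minimal sketch of that logic (the function name is hypothetical), showing why a global Python 3.12 trips it even when every virtual environment uses 3.11:

```python
import sys

def supported(version_info, lo=(3, 9), hi=(3, 11)):
    """Return True if (major, minor) falls in the supported range.

    Gating on sys.version_info checks whichever interpreter runs the
    script, i.e. the global Python. Since each tool ships its own venv
    interpreter, the check belongs inside each venv (run via
    .venv/bin/python), or can be dropped entirely.
    """
    return lo <= (version_info[0], version_info[1]) <= hi

print(supported(sys.version_info))
print(supported((3, 12)))  # False: a global 3.12 fails the gate
print(supported((3, 11)))  # True: the venv interpreters still pass
```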

00:36:11 - Whisper Variants, Core ML Challenges, and Graceful Failures

The updated setup progresses further but encounters failures with Core ML and Whisper diarization components. Anthony explains the different Whisper acceleration options, including Metal for GPU acceleration on macOS and Core ML for Apple's Neural Engine on M-series chips, noting that Core ML took him over a year of trial and error to get working on his own machine.

They systematically comment out the failing components to isolate what works, revealing that the setup script's error handling inconsistently fails either gracefully or fatally depending on the component. Anthony acknowledges this architectural issue and describes his goal of making each of the project's six modalities independently installable, so that a failure in any single tool does not prevent the rest from functioning properly.
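
The "independently installable" goal can be sketched as a wrapper that downgrades any single component's failure to a warning (component names and commands below are illustrative):

```shell
#!/usr/bin/env bash
# Sketch of per-component isolation: a failure in one tool's setup
# logs a warning instead of aborting the whole script.
setup_component() {
  local name="$1"; shift
  if "$@"; then
    echo "ok: $name"
  else
    echo "warn: $name setup failed; continuing" >&2
  fi
}

setup_component "music"  true    # stands in for a real install step
setup_component "video"  false   # simulated failure; the script continues
setup_component "images" true
```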

00:50:18 - Music Generation with AudioCraft and Stable Audio

With the setup partially working, Anthony switches to demoing the music generation tools. He explains AudioCraft from Meta and Stable Audio from Stability AI, walking through the available models including MusicGen variants ranging from small to stereo large. They generate samples using prompts like "a calm piano melody" and "a rock song with heavy saxophone."

The initial results from the small AudioCraft model sound generic, and the saxophone prompt produces rock elements without recognizable saxophone sounds. Anthony compares these open-source results to Suno AI's commercial output, which he considers excellent at the 4.5 model level, but notes Suno lacks an API for integration. They listen to Stable Audio's demo samples on the project website, which show notably higher quality for sound design applications like ambient effects and pop tracks.

01:11:10 - Video Generation, Runway Demo, and Wrapping Up

Anthony discusses the video generation landscape, explaining that Veo 3 from Google offers impressive quality but costs roughly a dollar per ten-second clip, while open-source video models remain too resource-intensive for local machines. He demonstrates Runway's capabilities, showing an animation of a panda dancing in space generated from a static image created with the AutoShow image command.

Anthony outlines his vision for combining all these modalities into a complete media production pipeline, noting that the CLI can now generate images, animate them into video, produce text-to-speech, and generate music, with lip-syncing being the final missing piece for creating fully synthetic television-style content. Nick expresses interest in exploring the tools further and notes the potential for wrapping everything as an MCP server before signing off for a work meeting, closing out the stream at roughly one hour and nineteen minutes.

Transcript

00:00:03 - Anthony Campolo

Welcome back, everyone, to AJC and the Web Devs. Let me fix my microphone. Mic check. Hey, how are you doing, man? AutoShow CLI, the trilogy — we're on part three now. How you been, man? Tell me about your recent travels.

00:00:28 - Nick Taylor

Yeah, lots going on. It's September already. What the heck have I been doing? Last month I was in New York City for Apollo GraphQL. They had an MCP builders series meetup, and I got invited to speak there. I gave a talk about securing MCP with zero trust. That was a lot of fun — really great venue, and there were about four other speakers. It was Anthony from Goose, Amanda from Apollo, and Michael gave some talks. The other two are escaping me. It was a good event.

Then it was a whirlwind travel tour. I was in New York City for one day, flew out to San Francisco the next day, and gave a talk at another AI event I got invited to, talking about stuff I work on.

[00:01:37] So, zero trust. The event was from GMI Cloud — they offer GPU rentals and inference. I had to look up the name. Everything's on Luma these days for AI stuff. It was called Eat, Sip, Ignite: The Race for Innovation, Engineer, Founder and VC Insights. So that was in San Francisco.

And while I was there — I work at Pomerium, so I was down there meeting with some of our big clients, just going to lunches and dinners. The office is technically Bend, Oregon, where the CEO is, but I think it's literally just a P.O. Box. We don't have a physical office. I also got a chance to hang out with a mutual of ours, Dougie — he's over at Acme.

00:02:52 - Anthony Campolo

Oh, you saw him? Nice.

00:02:54 - Nick Taylor

Yeah. I told him I was going to be in town, and originally we were both supposed to go to this MCP meetup hosted by Sourcegraph — there were people from Netlify there, my old crew — but I couldn't end up going. Dougie couldn't either. I ended up going to the Continue offices, co-worked there, grabbed a coffee with Dougie, and then they were hosting a voice and AI event there that night. I can't remember the name of that one because I was literally in the office helping them set up chairs. So I was there, got to see some coworkers, and then flew off to Vegas right after that for Black Hat USA.

[00:03:50] That was my first Black Hat, which was pretty massive.

00:03:57 - Anthony Campolo

Black Hat scares me. I've always heard people say don't bring your devices to Black Hat. I'm sure for most people it's probably not an issue, but just the fact that that's a thing people say — I'm like, I don't want to go to that.

00:04:10 - Nick Taylor

Yeah. It's like, "Oh, thank you for that USB drive — let me go put that in my computer."

00:04:16 - Anthony Campolo

The host of Software Engineering Daily actually had a mental breakdown after going to Black Hat because he got so worried about security. That actually happened.

00:04:25 - Nick Taylor

Oh, wow. But yeah, that was a cool event. First time I went. I don't know if it was bigger than KubeCon EU or not, but it was definitely a massive event.

00:04:39 - Anthony Campolo

Yeah, I think it's probably the biggest security conference in the world.

00:04:44 - Nick Taylor

Yeah. At Pomerium we're an identity-aware proxy, so we're based on zero trust security and secure internal apps in general. But with all the stuff around MCP, even though we're not an AI company, our software fits very well with remote MCP. So I've been doing a lot of stuff with remote MCP. I've got this whole demo app — which is not really a demo anymore because we use it internally as well. I built out essentially an MCP client with the OpenAI Responses API. It does tool calls — you can use their web search tool integrated with their code interpreter, so I can say "make me a pie chart" and it spins up a container in the cloud, runs some Python, and gives you back the image or whatever asset you asked for. It also integrates with MCP servers.

[00:05:43] And I've been building out some demo MCP servers, but also some that I actually wanted to have. I created a Dev.to one. Think of MCP servers as remote tools you can use to enhance the context you send to an LLM. I made it because I used to work there and wanted a demo, and they have an API. But the other reason was that when I write blogs, I use AI all the time now — not to say "write me a blog post," but I'll bring in transcriptions from a live stream, or voice notes from a walk, and say "help me."

00:06:38 - Anthony Campolo

This is a big part of AutoShow. This is one of the main use cases. We were doing exactly what you're doing.

00:06:45 - Nick Taylor

Yeah, exactly — that's why I bring it up. With the Dev.to MCP server, I can use it in Goose or VS Code or the MCP client I built for work. But I want to be able to pull in the style and tone of how I write so that when I'm asking for help with a first draft, it can match my voice — typically informal but still very technical. I haven't finished that, but that's the idea.

And yeah, I've just been messing around with all kinds of stuff. Even though I've been at Pomerium since January, I'm still learning a lot. I've built a home lab recently — not something I was really into before.

00:07:41 - Anthony Campolo

What do you mean by home lab?

00:07:43 - Nick Taylor

So I have a mini PC — it's like a Linux box, and I have stuff installed on it. I have Open WebUI and LibreChat, so I have my own personal "ChatGPT" with local models. LibreChat uses cloud-based models too.

00:08:02 - Anthony Campolo

Is it running on GPU or CPU?

00:08:04 - Nick Taylor

There's a GPU on it, though I didn't buy a GPU separately — it's from a company called Minis Forum. The GPU is okay, but it's not like a dedicated card.

00:08:16 - Anthony Campolo

So you could probably run models up to the tens of billions of parameters, but not like the 405B ones.

00:08:25 - Nick Taylor

Yeah. And I learned something about this — the GPU has its own RAM, but you can go into BIOS and give it some additional virtual RAM too, which helps.

00:08:44 - Anthony Campolo

VRAM, as they say.

00:08:46 - Nick Taylor

Yeah. I cranked that up. So models at 7 or 8 billion parameters can be pretty much fully loaded into the GPU VRAM. But bigger ones still work too. I have the 20 billion parameter Qwen model and it runs on the mini.

00:09:06 - Anthony Campolo

You can get results out of it.

00:09:07 - Nick Taylor

Yeah, yeah, it works. It's just not a super powerful machine, so it might chug along a bit. I've also got n8n on there. Basically I'm building up all these tools — partly because it helps me work with our product better, and partly because it's an actual valid use case of our software. We have an open source version and a hosted offering, Pomerium Zero, which has a pretty generous free tier if you want more hand-holding, for lack of a better term.

But it's just been cool stuff. As you know, I've been mainly a web and app developer.

[00:10:02] And now I'm all in on platform engineering and security.

00:10:08 - Nick Taylor

My past experience definitely helps, especially for building things. But there's stuff I'm learning. I'm definitely a lot better at Docker now. There's stuff where I'm sure some senior platform engineers would laugh if I said "I just figured this out," but that doesn't bother me. I'm good at picking things up — it's just a space I haven't really been in a lot.

00:10:37 - Anthony Campolo

Yeah, totally. I've never done DevOps officially, but I've always been interested in deployment. And every company I've worked for is basically a deployment company. StepZen was deploy your GraphQL API gateway, QuickNode was we deploy your blockchain infra for you, and Edgio was we deploy your front end — like an enterprise Vercel kind of thing.

00:11:02 - Nick Taylor

Yeah.

00:11:03 - Anthony Campolo

It just feels like a recurring theme for me. I've never done hardware at home, but I'm thinking more about renting things like Linode, Hetzner, stuff like that. Because I'm getting more into open source models — and not even necessarily text, but like we're going to get to on the stream, I'm messing around with music models now. I've gotten into some open source image models, and that's actually a really interesting space because the models don't need to be so big that they get totally ridiculous.

I tried to integrate open source video — it's the only one I couldn't get working. I tried three different things and none of them could run on my machine. There were some smaller ones I could have tried, but I was kind of like, eh, whatever.

[00:12:00] The image models, though — you can get fairly good results with the open source stuff. That's why I'm interested in renting a GPU. You could load up an open source video model and just mess around with it. I'm curious what that would actually cost and how easy it would be to set up. The video stuff is really interesting to me right now. I think it's about to completely explode in the next two to three years because it's right on the uncanny valley line. A lot of the tools, unless you pay a lot, give you these slightly weird artifacts. But as the growth curve continues, once it's at the point where you can give any image and instantly turn it into a flawless five to ten second video, that's going to be huge.

We're already kind of seeing this — Elon Musk went all-in on the "turn X images into video" thing and it was basically all he tweeted about for a week. People just love short-form video. I don't watch TikTok or YouTube Shorts, but as far as I can tell that's basically all Zoomers watch, at least if the stereotypes are true. What about you — do you watch TikToks and YouTube Shorts?

00:13:42 - Nick Taylor

I do. And I personally find it harder and harder to watch long videos. I still do, but it's pretty easy to consume short content, and it does do well. Like I was telling you briefly before the live stream, even at work I kind of kick-started our YouTube channel again because they had some videos up but weren't doing live streaming or anything. And I was using AutoShow to find the key moments in the live stream, like you said.

But yeah, I just notice a ton of engagement with short-form content. And the interesting thing is that it sends people to watch long-form content on the channel afterward.

00:14:50 - Anthony Campolo

Right, and that's why I know this is a huge blind spot for me — I don't watch this kind of content, but that's where everyone's going. It is such a good funnel. I know people who'll show up on Joe Rogan's podcast and everyone's like "who is this, they came out of nowhere?" Then you look them up and they've been cranking out TikToks for two years with millions of fans. So clearly there's something really important going on there.

00:15:18 - Nick Taylor

Yeah. People's attention spans in general are just very short these days. I'm generalizing, but —

00:15:30 - Anthony Campolo

— but that's true. And at the same time, you either watch 90-second videos or you listen to three-hour podcasts. I find there's very little in the middle.

00:15:42 - Nick Taylor

Yeah, because I still listen to podcasts a lot. If I go out for a walk I'll 100% listen to an hour, hour and a half podcast. This morning I went for a 30-minute walk and I'll go for another one later today. It's a nice way to decompress and the easiest time to listen to long-form content.

I mean, I'm not going to walk and watch a one-hour video. But long-form, like what we're doing right now — if I was working and a friend was live streaming, I'd have it to the side. That's typically how I watch a lot of long-form content.

00:16:31 - Anthony Campolo

Streams for sure.

00:16:32 - Nick Taylor

Yeah. But then there are really long streams, like Ryan Carniato — my old coworker from Netlify.

00:16:39 - Anthony Campolo

He's actually live right now with Dylan talking about Marko.

00:16:43 - Nick Taylor

Yeah, I know, I saw. He has amazing content and it's not boring at all. My only problem is I can't commit to a five-hour stream.

00:16:59 - Anthony Campolo

But you do a one-hour stream every week.

00:17:02 - Nick Taylor

Yeah. That said, Ryan's stream is the kind I could put to the side while I'm working — again, not because it's boring, but because I can't literally focus for five hours on a stream and justify it as work. But yeah, I do watch Ryan on Fridays. Cool.

00:17:21 - Anthony Campolo

That's how I felt during the pandemic at a certain point where I was on Discord and watching a lot of Twitch streams. It was just like that was all I was doing all day.

00:17:30 - Nick Taylor

Yeah. Pandemic was wild. I mean, it's not like COVID's gone away, but peak pandemic was wild.

00:17:40 - Nick Taylor

Yeah, lockdown and stuff. I don't know how bad lockdown was for you, but my kids weren't allowed to go anywhere. So I literally built a skating rink in the backyard.

00:17:53 - Anthony Campolo

Let me just leave it at that.

00:17:55 - Nick Taylor

Okay. All right. Don't say any more — don't want to implicate you in anything.

00:17:59 - Anthony Campolo

The pandemic was really interesting for me, actually, because it started two months into my boot camp, which I was doing at Lambda School. So I was already doing an online remote boot camp and had been mostly living at home already. But at the same time I was driving for Uber to make money, so it was this weird thing where I was actually out all the time but also kind of already acclimated to the work-from-home lifestyle.

For me it was like everyone else came online, and all of a sudden I had this huge audience of people to interact with through content. I got into all these Discord servers — and this is how you and I met, through all this stuff, in 2021. So for me it ended up being a highly positive experience in a lot of ways, because it was a point in my life where I really had to make some huge changes, and it kind of forced a lot of changes on the world in a way that I was able to take advantage of.

[00:18:57] People also started tipping way more for Uber Eats for a couple of months. I made like twice as much money the first month of the pandemic just because every single restaurant had to do delivery.

00:19:09 - Nick Taylor

Yeah.

00:19:10 - Anthony Campolo

Yeah, it was like a super in-demand, important job all of a sudden to be driving for Uber Eats.

00:19:15 - Nick Taylor

Yeah. I used to work in restaurants so I generally tip well, even for bad service — I'll still leave 15%. Good service is usually 18 or 20. But the thing that annoyed me, even before the pandemic: some places, I'll leave a tip when I'm coming to pick up an order, and this one restaurant has really good tacos, but it took an hour and I regretted tipping. But anyway, I don't want to get into tipping. People that work in restaurants are generally underpaid. I digress.

00:20:06 - Anthony Campolo

So what we're going to do today — I was actually going to demo some stuff, and then you were like, do you want me to do anything? I was like, actually, we should have you set up some of these new commands on the CLI. Because since we're doing these open source things like images and video and music, it's a lot of Python. I have this completely insane Python toolchain right now that builds three or four different virtual environments for all these different tools. I eventually plan to move everything to uv so I have a simpler way of managing all this, but for now it has separate environments per tool. So we should pull this thing up, get the setup running, and I'll explain some of the new tools while it runs in the background. It'll probably take about five minutes to set up and another five to download the models.

00:21:01 - Nick Taylor

Cool. Let's go ahead and share my screen here.

00:21:05 - Anthony Campolo

Let me know when I can pull it up.

00:21:07 - Nick Taylor

Yeah, go for it. I'll just close these as we're talking. I was deploying Kubernetes locally. Have you used subagents in Claude Code?

00:21:22 - Anthony Campolo

No. I tried Claude Code but I'm not using it regularly. If I was going to use anything I'd probably try Open Code, honestly.

00:21:30 - Nick Taylor

Yeah. I use Open Code as well. I don't think they have subagents yet. I'm actually enjoying it, though.

00:21:38 - Anthony Campolo

I'm ignorant of that term, surprisingly enough.

00:21:41 - Nick Taylor

Well, it's just an agent. If I zoom in here — I've created a few, and there are some built-in ones.

00:21:49 - Anthony Campolo

A mini agent that does a certain kind of scoped task.

00:21:52 - Nick Taylor

Yeah, exactly. I have a Kubernetes specialist, a TypeScript expert, a Pomerium expert, and a Golang and TypeScript expert.

00:22:00 - Anthony Campolo

Dude, that sounds useful.

00:22:01 - Nick Taylor

I grabbed it from somewhere, or I might have made it. Let me open it in the editor. So basically it's just like this.

00:22:16 - Anthony Campolo

It comes with a big system prompt.

00:22:18 - Nick Taylor

Yeah.

00:22:19 - Anthony Campolo

"You combine the practical teaching approach of Matt Pocock with the architectural insights of Anders Hejlsberg." That's really funny.

00:22:27 - Nick Taylor

Oh yeah, I mentioned that when I said "help me create it." This could probably be improved, but the point is I've been finding these super useful. It came out about a month ago, I think. The thing that's been super helpful for me — and I don't know if other people use it this way — is that there are still a lot of Linux commands I don't know. I need to learn more of that given the nature of what I'm doing now.

00:22:53 - Anthony Campolo

Like "awesome Linux commands."

00:22:56 - Nick Taylor

Yeah, all that stuff. Even on my DigitalOcean droplet where I have my MCP demo, I have Open Code and Claude Code there. Whenever I get stuck — like "help me clean up this Docker Compose YAML, I want to add this to it" — I can obviously Google it and add the config, but it's just faster with the AI tool.

00:23:23 - Anthony Campolo

Also, Warp now has agent stuff built straight into the terminal. So anytime you're doing terminal work you can just drop down to natural language at any point. I really like that.

00:23:35 - Nick Taylor

Okay, yeah. I saw Ben Holmes — who works there now. He left the Astro team a couple months ago. The AI World Fair was in June, so he left before that. I'm not sure exactly when.

00:23:57 - Anthony Campolo

The episode of FSJam where we interviewed the CEO, Zach, is a really cool one. I got into Warp right when it came out and was like, screw the regular terminal. Everyone has their oh-my-zsh configs and spends all this time making their terminal super dope, and then Warp just comes in and says here's a terminal that's actually usable from the beginning. I was sold.

00:24:22 - Nick Taylor

Yeah. I still haven't used it. A couple reasons — I'm in VS Code a lot, at least when I was doing web development, so I'm always in the built-in terminal. VS Code integration is apparently one of the most requested Warp features, but it's a major rework from what I understand, which is probably why I haven't switched.

But now that I'm using these AI CLIs, I'm in Ghostty all the time and I'm enjoying the CLI experience even though I love VS Code. So I kind of use both — Copilot for some things, but for a lot of what I've been doing lately I've been in the CLI.

[00:25:15] Even Kubernetes, for example — I understand what it is at a high level but there's a lot of configuration to it. Right now I just need to bootstrap a demo and add our software on top of it. I'm still going to learn Kubernetes properly, but right now I just need to move fast and get something working, and I'll understand it all afterward.

00:25:42 - Anthony Campolo

Docker back in 2021 was one of the things that was really big for me. It got me into the terminal more, learning a lot of commands and the Docker CLI. That was so useful for me now that I'm building this CLI.

00:25:58 - Nick Taylor

Let's load it up. I'm going to probably have to pull latest. I think I pulled last week but just in case.

00:26:05 - Anthony Campolo

I just pushed all the newest stuff.

00:26:07 - Nick Taylor

Okay, cool. Let's do that. Zoom in a bit for the folks at home. Buckle up. Let's go. There are a few things. Okay, cool.

00:26:25 - Anthony Campolo

So let's just do npm run setup first.

00:26:29 - Nick Taylor

Okay, the classic. All right.

00:26:33 - Anthony Campolo

I'm not sure how it's going to work with that. We'll see. What were you saying?

00:26:40 - Nick Taylor

I was just trying to remember what it installs. Whisper. What else does it do? npm packages, obviously.

00:26:49 - Anthony Campolo

Yeah, it installs npm packages and sets up Whisper. The other things it's doing now: it sets up your text-to-speech libraries, which are Coqui and Kokoro — those are super fast so you barely notice them. It also sets up Stable Diffusion for images and the text-to-audio libraries, AudioCraft and Stable Audio. Did it break?

00:27:21 - Nick Taylor

It's the cask. It can't pull it down from Brew.

00:27:25 - Anthony Campolo

Oh, interesting.

00:27:27 - Nick Taylor

Do I have to do a brew tap first or something?

00:27:32 - Anthony Campolo

What's it failing on?

00:27:34 - Nick Taylor

It's installing LibreOffice. Are you using that for PDF or something?

00:27:39 - Anthony Campolo

Yes. You shouldn't need to install that through Brew, but go ahead and see what happens.

00:27:47 - Nick Taylor

Well, I wasn't going to, but it just said "error during installation: failed to install. Command failed: brew install --cask LibreOffice." I tried that manually and it failed too. It can't download it. I wonder if there's an issue with the registry. I don't think it's a Brew issue either — it's literally the URL to grab the latest binary, and it's giving a 404.

00:28:19 - Anthony Campolo

Yeah. So —

00:28:23 - Nick Taylor

Let me see here. I can literally download LibreOffice directly. Something's going on with Brew, but it has nothing to do with you.

00:28:56 - Anthony Campolo

Yeah. I'm curious which tool is even using this.

00:29:02 - Nick Taylor

All right. I'm guessing it's for PDF or reading something text-wise.

00:29:11 - Anthony Campolo

Yeah, there's some PDF stuff that got added. Also I think someone's watching the stream — Jace left an issue saying it doesn't fully support Windows. That is true. I kind of didn't want to figure out how to support both, so I've made everything macOS-only for now. But if someone actually wants to use it on Windows, I would be willing to put in the work.

00:29:43 - Nick Taylor

Yeah, you could realistically say for now you need to use WSL. But I think for most developers that's fine. I don't really use Windows anymore anyway.

00:30:05 - Anthony Campolo

I would like to make it platform-agnostic and not depend on Homebrew. That's mostly just because I was already using Homebrew so it was the path of least resistance. I'm not philosophically tied to it.

00:30:19 - Nick Taylor

Yeah. Okay, I'm going to run setup again. LibreOffice is installed now, so hopefully it just gets past that.

00:30:26 - Anthony Campolo

If it breaks again, try wiping your node_modules, bin, and models folder and doing a fresh install.

00:30:34 - Nick Taylor

Okay. Script failed. It's still trying to do the LibreOffice thing. Hold on.

00:30:40 - Anthony Campolo

Does it say where in the script it's failing?

00:30:44 - Nick Taylor

curl exit code 56, a network receive failure. I'm just going to look into where LibreOffice is coming from.

00:30:49 - Anthony Campolo

It's nowhere — not installed to any location.

00:30:52 - Nick Taylor

Okay.

00:30:53 - Anthony Campolo

Yeah, that has nothing to do with it. That's probably related to the Zerox PDF package.

00:31:00 - Nick Taylor

Okay, yeah. The error path is AutoShow CLI node_modules, Zerox command. That's what it is — Zerox is trying to install LibreOffice.

00:31:12 - Anthony Campolo

Let's remove that from your package.json and try again.

00:31:19 - Nick Taylor

All right.

00:31:24 - Anthony Campolo

So Zerox is one of the PDF extraction tools. It's actually pretty sweet — it's the only one that gives you markdown formatting, because most PDF extraction tools just dump you a flat text block.

00:31:39 - Nick Taylor

Okay.

00:31:39 - Anthony Campolo

Let's see if this works.

00:31:42 - Nick Taylor

Yeah, it's already progressing past where it stopped before. But it's giving the Brew error again.

00:31:59 - Anthony Campolo

That's really weird. It's a Brew error but it's all going through the npm postinstall scripts.

00:32:06 - Nick Taylor

Yeah, I'm guessing there's probably a pre- or post-install script that detects your OS and knows how to pull it down. I'm curious if there's an issue open in the Zerox repo about this.

00:32:27 - Anthony Campolo

Yeah, unless it just broke today.

00:32:31 - Nick Taylor

Ha — they said "Anthony's doing a live stream, let's just bust this for him. Let's make him look bad live on camera." That's probably it.

00:32:43 - Anthony Campolo

I don't see any open issues for LibreOffice. So who knows.

00:32:51 - Nick Taylor

Well, we know what it is now so it's not the end of the world. Are we dropping the Zerox dependency?

00:32:57 - Anthony Campolo

Okay, let's get back to this. The way the setup command works now — and it can be improved — is that I tried to decompose the different pieces so you don't need to install a full Whisper model if you only want to use the image thing, and vice versa. You shouldn't need Python 3.9 specifically because it should be doing everything through virtual environments.

00:33:32 - Nick Taylor

I have pyenv installed. Let me check. python --version. What is it?

00:33:45 - Anthony Campolo

Python 3.something, probably.

00:33:47 - Nick Taylor

Okay. I do have it. I switched to 3.9 earlier. I've never actually used pyenv to switch versions before. How do you do it? Just pyenv install?

00:34:07 - Anthony Campolo

I don't know either.

00:34:10 - Nick Taylor

Here we go. If only we had AI tools. Okay. All right. Beast mode. Let's go.

00:34:34 - Anthony Campolo

pyenv install 3.9 — that's all it is.

00:34:38 - Nick Taylor

Okay. Kind of like nvm install, like you said.

00:34:44 - Anthony Campolo

Yep. Let's try that. See what happens.

00:34:49 - Anthony Campolo

Because some of the different tools require different versions of Python, that's why everything should be going through virtual environments. It may just be that the way it was initially configured, it looks for your global Python version while running the setup script and has a check saying it needs to be 3.9 to 3.11.

That's something I need to fix because it shouldn't be a requirement. The whole point of virtual environments is to avoid that kind of stuff. This is why — man, Python is a mess when it comes to packaging.
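
The per-tool isolation described here can be sketched like this (the directory layout and tool names are assumptions for illustration, not AutoShow's actual code):

```python
# Hedged sketch: one virtual environment per tool, so the global Python
# version never matters. Paths and tool names here are assumptions.
import venv
from pathlib import Path

def ensure_venv(tool: str, root: Path = Path(".venvs")) -> Path:
    """Create a dedicated venv for `tool` if it doesn't exist yet."""
    env_dir = root / tool
    if not env_dir.exists():
        venv.create(env_dir, with_pip=False)  # with_pip=True in real use
    return env_dir

# Each modality gets its own isolated interpreter environment.
for tool in ("whisper-diarization", "stable-audio", "audiocraft"):
    ensure_venv(tool)
```

With this shape, a 3.12 global interpreter and a 3.10 tool requirement never collide, which is exactly the point being made.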

00:35:20 - Nick Taylor

Well, the JavaScript ecosystem is interesting too.

00:35:25 - Anthony Campolo

JavaScript is much better than Python when it comes to packaging. Even with the difficulties of npm and yarn and pnpm and all that stuff, I feel like they've done a good job of coming together. Whereas with Python, things seem to have gotten more fractured.

00:35:45 - Nick Taylor

Yeah, I'm not really in the Python ecosystem. I just meant there are so many ways to install stuff in JavaScript these days too.

00:35:59 - Nick Taylor

Yeah. Let's do it. All right.

00:36:11 - Anthony Campolo

This is why I wanted to have you do this. Now that it has so many tools, it's very easy for any one of them to break. This is why I have some more work to do decomposing the different pieces. Right now it runs through setup for basically all of the different modalities, but it doesn't download any of the models because that's what takes a long time — some of the models are 5 to 10 gigs.

00:36:39 - Nick Taylor

Okay, yeah.

00:36:40 - Anthony Campolo

So it will basically set you up to use all the tools. Then if you want to download the models for any one in particular, you do npm run setup -- --image. And you can do --all if you want to download the models for everything.
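
As a sketch, the flag behavior maps to something like this (the flag names come from the conversation; the real setup is an npm script wrapping bash, so this argparse version is purely illustrative):

```python
# Illustrative only: the real setup is an npm script wrapping bash.
# The --image and --all flags come from the conversation above.
import argparse

def parse_setup_args(argv):
    parser = argparse.ArgumentParser(prog="npm run setup --")
    parser.add_argument("--image", action="store_true",
                        help="also download the image-generation models")
    parser.add_argument("--all", dest="all_models", action="store_true",
                        help="download models for every modality")
    return parser.parse_args(argv)

args = parse_setup_args(["--image"])
```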

00:36:56 - Nick Taylor

Gotcha. Yeah, I have a pretty fast internet connection — one gig up and down — so if we do want to pull stuff down it should go quickly.

00:37:08 - Anthony Campolo

So let's look at the docs. What were you going to say?

00:37:13 - Nick Taylor

I was just going to say, for people not familiar with some of the tools — we've got Whisper installing right now. What does Whisper give us out of the box?

00:37:26 - Anthony Campolo

Yeah, Whisper is for transcription. That's really what launched this whole project — using whisper.cpp and then adding LLMs on top so you could get a pipeline of transcription to LLM with a prompt inserted in the middle.

So this says Python 3.9 to 3.11 required.

00:37:50 - Nick Taylor

Do I have to do pyenv local or something?

00:37:56 - Anthony Campolo

Oh, your earlier switch didn't actually take effect.

00:37:59 - Nick Taylor

Yeah, it installed it but didn't update the local settings.

00:38:05 - Anthony Campolo

Let's do this. Can you scroll up a little? I want to see which Whisper it's breaking on. So it says 3.9 — this is for Whisper Core ML.

00:38:14 - Nick Taylor

Core ML.

00:38:15 - Anthony Campolo

Yeah. I'll explain this quickly. Can you go into the setup script? So Core ML is a way to speed up your Whisper transcription on macOS. This is something I've been trying every couple months over the last year and a half — I would look at the Core ML section on the whisper.cpp README, try the commands, and just never figure out how to get it to work. So let's just comment out the Core ML part for now.

Okay, so there's Whisper, Whisper Metal, and Core ML. One of them leverages the macOS GPU and another uses Apple's Neural Engine — I should look this up, I might not be exactly right. But there are like four or five different acceleration options: Metal, Core ML, CUDA stuff.

[00:39:15] This is a super complicated thing to figure out, so I can see why it's breaking. Let's just comment out Core ML and try again.

00:39:25 - Nick Taylor

Cool. And this also gets to your point about having me run it fresh — you might have run into these issues before, or forgotten about them, and are like "yeah, everything's fine." It's kind of like when people onboard to a company or try something for the first time. If you can capture that experience, you always should.

00:39:47 - Anthony Campolo

Yeah. I don't think these were issues I ran into before. But that's because there are so many things that are specific to one machine — like needing to check your Python version, and you being on a different version globally, and that messing things up just because your setup was slightly different.

00:40:09 - Nick Taylor

But yeah, I wonder why the pyenv switch didn't take. Are you using Poetry still? I remember you mentioning that.

00:40:15 - Anthony Campolo

I think it's all virtualenv now. Poetry was better than what I was doing before, but then I realized all the momentum is actually behind uv, not Poetry. They're aimed at the same thing — simplifying your entire Python package management including different Python versions. So I think I bet on the wrong horse in that race. But right now I think it just uses virtualenv to create individual environments.

Okay. So this might actually be why it's failing now. Whisper diarization this time.

00:40:57 - Nick Taylor

I was going to make a juvenile joke about the name but I'll refrain.

00:41:05 - Anthony Campolo

Whisper diarization is something I integrated into this project a while ago, then took out because at the time it was the only thing requiring a whole Python virtual environment. I ripped it out a while back. But since then, now that I have three Python virtual environments for other tools, I figured I should just add it back in because I kind of want to use it. It adds speaker labels to your Whisper output, which is really hard to do otherwise.

00:41:33 - Nick Taylor

Oh, nice.

00:41:33 - Anthony Campolo

One of the biggest gaps with whisper.cpp is that it only gives you a single stream of text — no speaker labels. You need a third-party service for that. Whisper diarization is the main open-source way to handle it, but it's a bit complicated to set up.
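
What diarization adds can be sketched as a merge of two timestamped streams: Whisper's transcript segments and the diarizer's speaker turns (the data shapes below are assumptions for illustration, not whisper-diarization's real output format):

```python
# Hedged sketch: attach a speaker label to each transcript segment by
# checking which speaker turn contains the segment's midpoint.
def label_segments(segments, turns):
    """Merge Whisper-style segments with diarizer-style speaker turns."""
    out = []
    for seg in segments:
        mid = (seg["start"] + seg["end"]) / 2
        speaker = next(
            (t["speaker"] for t in turns if t["start"] <= mid < t["end"]),
            "UNKNOWN",
        )
        out.append({**seg, "speaker": speaker})
    return out

segments = [{"start": 0.0, "end": 2.0, "text": "Hey, what's up?"},
            {"start": 2.0, "end": 5.0, "text": "Not much."}]
turns = [{"start": 0.0, "end": 2.2, "speaker": "SPEAKER_00"},
         {"start": 2.2, "end": 5.0, "speaker": "SPEAKER_01"}]
labeled = label_segments(segments, turns)
```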

00:41:55 - Nick Taylor

Yeah. And it looks like it says done now, so we'll go to the next piece.

00:42:02 - Anthony Campolo

Okay. So Whisper Metal worked, then it errored on Core ML, then Whisper diarization. If it bubbles up another error in a different part of the project, there's something more fundamental wrong that's related to the other Python setups.

00:42:18 - Nick Taylor

Gotcha. Cool. Just checking something in Slack. All right.

00:42:30 - Anthony Campolo

Okay, interesting. I see what happened. It's still failing because it says it doesn't have the right version of Python. Are you on 3.12? Is that what it said you were on?

00:42:43 - Nick Taylor

Let me check. python3 -V. Okay.

00:42:55 - Anthony Campolo

Now I understand what's wrong. It created this constraint that you have to be between 3.9 and 3.11, which actually should not be the case. That was just because different tools specified different minimum versions, but they don't need a global version constraint at all — they each run in their own virtual environment.

00:43:09 - Nick Taylor

Okay.

00:43:10 - Anthony Campolo

Yeah. Okay, so it looks like we got to the end this time. It did fail over and didn't actually set up some things — which is okay because we demoed those on a previous stream.

So it skips AudioCraft, it skips Stable Audio too. It looks like it was able to just set up Whisper. But because the Python version check broke, it basically hasn't set up any of the Python tools.

00:43:37 - Nick Taylor

Okay, so you need to either fix the version check or get me onto Python 3.11.

00:43:44 - Anthony Campolo

Yeah. And we could figure out how to get you off of 3.12, but really the better fix is just removing the version check entirely.

00:43:57 - Nick Taylor

Right, because if we just change the check to match the version I have, would it all work?

00:44:02 - Anthony Campolo

Yeah, it should. We just need to figure out where the check is actually happening.

00:44:07 - Nick Taylor

Yeah. Let me look at it.

00:44:14 - Anthony Campolo

Oh wait, I think I see where it is.

00:44:16 - Nick Taylor

Yeah, there's a bunch of them.

00:44:17 - Anthony Campolo

Okay. So it's everywhere it has 3.9, 3.10, 3.11. This is the change right here. Change each of those — not the echo part but the version range check above it.

00:44:39 - Nick Taylor

Oh yeah. Here, okay.

00:44:40 - Anthony Campolo

Yeah. So grab that pattern. This happens in three places in the project, so copy that whole thing and update all three.

00:44:47 - Nick Taylor

I can just do a find-and-replace.

00:44:49 - Anthony Campolo

Right, so the find_python function is doing a for loop over your different Python versions. The same check shows up in the other spots too — Whisper Core ML, Whisper diarization, and the TTS. So this is definitely the issue. I'm going to pull this fix right now.
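
The gate being discussed boils down to a range test on the (major, minor) version. A minimal Python equivalent of the bash logic (the actual script isn't shown on stream, so this is a sketch):

```python
# Sketch of the setup script's version gate: accept only 3.9 through 3.11.
# A global 3.12 fails this check, which is why setup bailed early here.
def version_in_range(version: str, lo=(3, 9), hi=(3, 11)) -> bool:
    """Return True if an 'X.Y[.Z]' string falls within [lo, hi]."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return lo <= (major, minor) <= hi

version_in_range("3.12.4")  # False: the constraint Anthony wants removed
```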

00:45:33 - Nick Taylor

Boom. Actually, you could maybe do a regex here — something like 3\.9|3\.1.

00:45:44 - Anthony Campolo

Actually, this whole thing should be removed. There's no reason to be checking your Python version at all — that's what the virtual environments are for.

00:45:50 - Nick Taylor

Okay, gotcha. Cool. All right, so we should do that.

00:46:00 - Nick Taylor

All right, if that's the case, let's fix this. Undo this here. This should work now, right?

00:46:11 - Anthony Campolo

Did you change all three places where that was happening?

00:46:14 - Nick Taylor

Yeah.

00:46:15 - Anthony Campolo

Okay, great. Let's do it and see what happens.

00:46:18 - Nick Taylor

I'm gonna YOLO this. All right. Welcome to the show, AutoShow. Yeah, this is the good stuff to show off. This is the thing that makes projects stick or not, because this is currently a real friction point.

00:46:49 - Anthony Campolo

Yeah. And because they're all coupled together, it was interesting — at one point it was breaking, but at other points it would skip over the ones that were failing. So there are different checks, some of which are failing the whole command and some of which are failing gracefully.

00:47:06 - Nick Taylor

Exactly.

00:47:07 - Anthony Campolo

They should all fail gracefully so that if any one piece breaks, the rest still works. This has been one of the interesting things about building this project — I've had to think a lot about the overall structure because it has so many tools and does so many things at this point.
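
The structure described here, roughly: run each tool's setup independently and collect errors instead of aborting on the first failure. A minimal sketch (the step names are assumptions):

```python
# Hedged sketch of "fail gracefully": each setup step runs in isolation,
# so one broken tool can't take down the rest of the install.
def run_all(steps):
    """steps: mapping of name -> zero-arg callable. Returns {name: error}."""
    failures = {}
    for name, step in steps.items():
        try:
            step()
        except Exception as exc:  # report and continue, never abort
            failures[name] = str(exc)
    return failures
```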

00:47:23 - Anthony Campolo

So I'm trying to make it basically six CLIs combined into one. If you want to just use one of those six, you can. But if you want to combine them all into a meta CLI, that should also be easy.

00:47:37 - Nick Taylor

The megalodon of scripts. I don't know. Cool. So while this is going on — it's installing Whisper, which is for converting audio to text, right? And then there's the Metal and Core ML variants you mentioned. What does Metal do specifically?

00:48:03 - Anthony Campolo

Metal is for macOS. I think it handles GPU-level acceleration. I should look this up to be sure. I think Core ML uses Apple's Neural Engine, which is Apple's dedicated accelerator for machine learning — optimized specifically for M-series chips. Whereas the Metal stuff I think is just GPU acceleration rather than the Neural Engine.

[00:48:33] So we see right now it's trying to set up a Python environment. This is great — this is setting up Core ML Python. You're on a Mac right now, right?

00:48:53 - Nick Taylor

Yeah, I'm on an M4 actually, so it should work well.

00:49:00 - Anthony Campolo

Core ML for M4 would be great, actually. I'm really glad I finally got it working on my machine. All right, we got something else here.

00:49:06 - Nick Taylor

Okay. The Core ML stuff — so there were some modules that couldn't load.

00:49:15 - Anthony Campolo

Okay. So let's comment out the Core ML one again.

00:49:19 - Nick Taylor

Okay. Not today, Core ML. All right.

00:49:24 - Anthony Campolo

Like I was saying, Core ML took me a year of trial and error to finally get working. I ended up building out a whole custom bash script to do all these checks and things to get it to finally work on my machine. So I'm not surprised this is happening for you.

00:49:46 - Nick Taylor

I'm just going to comment out this stuff since we've already been through all that.

00:49:49 - Anthony Campolo

Yeah, sure.

00:50:18 - Anthony Campolo

It's okay. There are other things I want to talk about. While it's running, let me explain the music tools we're using. There are two I checked out. One is Stable Audio — I'm not sure if it's actually related to Stable Diffusion or not. I'm looking at Stability AI's GitHub.

00:50:43 - Nick Taylor

I was going to say "stable" rhymes with "Ableton."

00:50:53 - Anthony Campolo

Ha. Yeah, the company names get a little confusing. I think it's the people behind Stable Diffusion, but I could be wrong — the GitHub should clarify.

00:51:14 - Anthony Campolo

So Stable Audio tools are generative models for conditional audio generation — training and inference code using PyTorch.

00:51:29 - Nick Taylor

I'm assuming you generated that banner image with AI.

00:51:33 - Anthony Campolo

Yeah, that's old — from when I first started the project. I need to update it. But if you go to the docs directory and scroll down to the README.

00:51:55 - Anthony Campolo

Keep scrolling. Yeah. So you see here this breaks down everything the project does. You have the text command that couples together the transcription and the LLM stuff, because that's how the pipeline was initially created. I'm eventually going to refactor this so there's just a transcribe command — so if you wanted to run transcription without the LLM prompt step, you can. That's kind of the last thing that needs breaking out so every modality has its own standalone command. But right now, text and transcription are tightly bound together.

00:52:45 - Anthony Campolo

And then there's the image command, which includes DALL·E — which I need to update to GPT Image 1 — and Black Forest Labs, AWS Nova, and SD, which is Stable Diffusion. SD.cpp is kind of like how whisper.cpp is a C++ port to speed up Whisper — this is the same idea for Stable Diffusion.

00:53:04 - Nick Taylor

This is interesting because you could wrap this as an MCP server. You could send a tool call saying "make me a futuristic landscape" and specify which backend to use.

00:53:26 - Anthony Campolo

Definitely. I've thought about the MCP angle for this. If I built it out, it would be really sweet — right now it has a lot of disparate tools that require a lot of CLI work to use. If you could combine all that into a natural language interface, that would be great. Let's check back on our setup command and see if it failed or made it to the end.

00:53:54 - Nick Taylor

Buddy, what's going on?

00:54:01 - Anthony Campolo

Can you go to the logs saved there?

00:54:04 - Nick Taylor

Oh, it doesn't look like it's there. Oh yeah — see? Script failed.

00:54:08 - Anthony Campolo

Hmm. It doesn't write a setup log unless... let me check — that should be in your project root.

00:54:13 - Nick Taylor

Okay. Oh, it's gitignored. That's why it's not showing up.

00:54:20 - Anthony Campolo

Latest log. Okay, it's probably just going to say the same thing. Okay, so it's not entirely clear why it broke. All right. Let's just comment out all the Whisper variants.

00:54:31 - Nick Taylor

Whisper, okay.

00:54:32 - Anthony Campolo

Metal, Core ML, and Diarization — take all three of those out. I just want to see if we can get the music one to work. Let's also comment out the TTS.

00:54:47 - Nick Taylor

Okay. But keep the original Whisper.

00:54:50 - Anthony Campolo

Yes. Keep the main Whisper — that one hasn't broken. Comment out the others. Okay. Let's try.

00:54:56 - Anthony Campolo

I just want to at least get the music one working. Hey — Fuzzy in the chat! What's up, Fuzzy.

00:55:02 - Nick Taylor

Hey, what's up, Fuzzy?

00:55:03 - Anthony Campolo

We're deep in dependency hell right now.

00:55:07 - Nick Taylor

Ship it. But yeah, to me, this would be more interesting and accessible as an MCP server. Obviously as devs I'm fine messing around with this like we're doing now, but a remote MCP server would really open it up.

00:55:28 - Anthony Campolo

That would be more like the AutoShow app then.

00:55:31 - Nick Taylor

Yeah, but it could be another way to expose it. For now, even a local MCP server would be interesting.

00:55:45 - Anthony Campolo

We were looking at images. Let's go back and look at the music section.

00:55:54 - Anthony Campolo

Yeah. So AudioCraft — that's the name of the other music tool.

00:56:00 - Anthony Campolo

For most of these, what I did was ask ChatGPT what the biggest open source projects are for each modality, then clicked around the GitHub repos. AudioCraft is from Meta.

00:56:13 - Nick Taylor

Okay, that's pretty cool. I guess the copyright question is interesting here — you could use that music, and presumably there's no copyright on it, as long as the model isn't literally spitting out a copy of a training track.

00:56:30 - Anthony Campolo

That's a great question. As far as I know, the laws aren't written specifically enough to claim copyright over anything generated by a model. This is currently being litigated, and I think the biggest case is going to be the New York Times versus OpenAI — that's going to go to the Supreme Court and will kind of decide a lot of these copyright concerns.

A lot of the third-party services, at least in the past, had things in their terms of service specifically saying there's no copyright restriction on anything you generate. I'm not sure if that's still the case for all of them. For open source, there's really no way they could claim copyright.

00:57:11 - Nick Taylor

Setup successfully finished.

00:57:14 - Anthony Campolo

Let me look through this. Okay, so it's still failing — it was trying to do AudioCraft and Stable Audio. So let's put the TTS back in, and then we'll do the next step.

00:57:26 - Nick Taylor

Oh, but it says it's missing a .env file.

00:57:29 - Anthony Campolo

Right, and that's what I'm saying — some of these use a shared environment and some use separate ones. The environment for those tools is possibly being generated by the Whisper setup. Since we commented out the Whisper thing, the .env never got created. So yeah, I think I'm going to have to rethink this and actually separate out the model-specific parts. The model download step isn't the one that'll break on people's machines — with the models you're just curling a big file, so that's not going to cause much trouble. These setup failures are all virtual environment and configuration issues.

00:58:12 - Nick Taylor

Yeah, I'm going to create a .env manually.

00:58:15 - Anthony Campolo

Oh, it actually creates a .env for you automatically.

00:58:19 - Nick Taylor

Oh, okay. Cool.

00:58:20 - Anthony Campolo

It copies over the .env.example.
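
That bootstrap step can be sketched as follows (the file names come straight from the conversation; the function itself is an assumption about how the copy happens):

```python
# Sketch: create .env from .env.example on first run, as described above.
import shutil
from pathlib import Path

def ensure_env(root: Path) -> Path:
    """Copy .env.example to .env if .env doesn't exist yet."""
    env, example = root / ".env", root / ".env.example"
    if not env.exists() and example.exists():
        shutil.copyfile(example, env)
    return env
```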

00:58:23 - Nick Taylor

Okay, cool. I moved that off screen. All right, so what should we do next then?

00:58:32 - Anthony Campolo

We should switch back to my screen and I'll demo stuff. Now I kind of know what I need to fix in terms of the setup. That was a useful exercise. I'm not too surprised this is all broken because I'm just kind of vibe coding the Python setup — it's a trial-and-error thing.

This is why I like CLIs. What I do is start by writing a bash script to do whatever I'm trying to do. Then I run it, and it either works or it gives me long logs of errors. I feed that back to the model, have it update the bash script, and that's how I iterate on this. But you end up with a setup script that can be highly coupled to your own machine.

[00:59:20] As we're seeing. So this is something I'm not surprised isn't working. Let's do — actually, let me check.

00:59:30 - Nick Taylor

Just real quick. Got a little inception there.

00:59:33 - Anthony Campolo

I'm not sure. I think I'm actually — hold on. Let me check my desktop real quick. That's where I always get you — you've got your 1Password open in the back or something like that. Okay.

00:59:58 - Nick Taylor

Good old inception.

01:00:03 - Anthony Campolo

Let me start with the music stuff. Let me go to the docs. Oh, one thing I wanted to mention: I added S3 and R2 integrations, so you can save your outputs to a cloud bucket now.

01:00:19 - Nick Taylor

Oh, that's a nice touch.

01:00:21 - Anthony Campolo

Yeah, it's a nice quality-of-life thing. Right now it just dumps a file on your local machine, which is fine if you want files on your machine, but it's nice to have everything saved in the cloud so you don't lose all this stuff you're generating. Since a lot of this is now media — audio files, video files, music files — a bucket makes a lot of sense.

01:00:46 - Nick Taylor

Yeah, totally.

01:00:50 - Anthony Campolo

Okay. So for these, if the setup script works, you can just run one of these commands. For example, this one will generate audio.

01:01:01 - Nick Taylor

This is pretty cool — when this project started it was transcription only, and now I could literally use this to make music snippets. How well have you found the music to be? Like you said "calm piano melody" — but can you get super specific?

01:01:24 - Anthony Campolo

I haven't played around with it that much yet. I just integrated this a week or two ago and I was also trying to get the video models working, so it was one of those things where I just got it to work and moved on. So I'm not entirely sure.

I will say I use Suno AI a ton. Suno AI is one of my absolute favorite AI things that exists. I'm a music major, so I know what good music sounds like, and their models are amazing. Their 3.5 model from a year or two ago would sometimes create cool stuff but a lot of it was generic. Their 4 model got a bit better. And now the 4.5 model — over half of what they generate is an instant banger.

[01:02:19] It's absolutely incredible. The music stuff has gotten really good. Suno does not have an API, though, which really frustrates me — I would integrate it into AutoShow today if it did. The only problem is I don't know how to play music through VS Code, so I'll just dump it into one of my R2 buckets and play it through the browser.

01:02:47 - Nick Taylor

Oh, through the browser. Gotcha.

01:02:48 - Anthony Campolo

So give me some prompts — what do you want to hear?

01:03:01 - Nick Taylor

Make something sound like Kenny G, I guess. Or — well, you said jazz before. How about: generate a rock song melody with heavy saxophone usage.

01:03:26 - Anthony Campolo

Rock song with heavy saxophone. That's a good one. We had a sax player in my band in college and we used to say he was our secret weapon.

01:03:37 - Nick Taylor

Ha. I generally don't like saxophone in songs, but if the song is good and it's got saxophone, then they're a good saxophonist.

01:03:58 - Anthony Campolo

Okay, here we go. I'm going to try to share audio. You heard that, right?

01:04:07 - Nick Taylor

Nope. When you share your screen you probably need to check the audio sharing option. I don't know if it picks up audio by default.

01:04:14 - Anthony Campolo

Oh, I know what's wrong. Yeah. Let me switch to the right tab. There we go.

01:04:29 - Anthony Campolo

Okay. Well, atmospheric. Kind of interesting.

01:04:33 - Nick Taylor

It's like "Your call is important to us. Please stay on the line."

01:04:37 - Anthony Campolo

Ha, yeah, a little bit of that. Let me quickly check these other ones. So I'm running Stable Audio now. Let me show the output for that, and once it's done I'll dump it into a bucket as well.

So the first one was AudioCraft, the one by Meta. It looks like we ran the small model. If we ran a bigger model it might sound better — MusicGen has several model sizes here.

01:05:20 - Nick Taylor

And is it the quality of the musical composition, or also the quality of the audio output itself?

01:05:28 - Anthony Campolo

I would imagine probably both, because with these models it's just a big black box that they're trying to make better overall.

01:05:41 - Anthony Campolo

I need to improve the docs on this a little bit because it doesn't quite show you all the available models. Let me find the model list. There's a place in the project where you can see all the models listed. So we want to go to AudioCraft here — okay. So these are the different MusicGen models.

So I'm going to try the stereo large. There's melody stereo and stereo melody — those are terminology specific to the MusicGen project. I'm going to use the same prompt as before so it's easy to compare.
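
For reference, MusicGen ships several checkpoint sizes and variants. A hypothetical registry like the one a CLI might keep (the Hugging Face IDs are real MusicGen checkpoints; the friendly-name mapping is an assumption about AutoShow's internals):

```python
# Real MusicGen checkpoint IDs on Hugging Face; the mapping itself is a
# hypothetical sketch of how a CLI might expose them by friendly name.
MUSICGEN_MODELS = {
    "small": "facebook/musicgen-small",
    "medium": "facebook/musicgen-medium",
    "large": "facebook/musicgen-large",
    "melody": "facebook/musicgen-melody",
    "stereo-large": "facebook/musicgen-stereo-large",
    "stereo-melody": "facebook/musicgen-stereo-melody",
}

def resolve_model(name: str) -> str:
    """Map a friendly model name to its Hugging Face repo ID."""
    return MUSICGEN_MODELS[name]
```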

Adding this in right here. Okay, and the first run's done.

[01:07:05] Let me dump that in a bucket and go listen to it.

01:07:09 - Nick Taylor

What's the pricing on R2 anyway?

01:07:13 - Anthony Campolo

At this point I'm just on the free plan. I think you get up to like ten gigs on a bucket.

01:07:20 - Nick Taylor

Pretty solid.

01:07:22 - Anthony Campolo

Yeah, it's pretty amazing. I keep waiting to get charged for something but it's been great. R2's UI leaves a bit to be desired, but have you heard of UploadThing? Theo built that whole tool because he was frustrated with the interface.

01:07:45 - Nick Taylor

Yeah, okay. Cool.

01:07:50 - Anthony Campolo

Okay. So this is the saxophone one I think. Let me share this. Oh, this tab's dead. Let me try again.

01:08:30 - Anthony Campolo

That didn't really work at all.

01:08:33 - Nick Taylor

I don't know if it was just louder, but it sounded like better quality in terms of the sound itself.

01:08:43 - Anthony Campolo

But there's no saxophone. This is supposed to be the saxophone one.

01:08:46 - Nick Taylor

Oh, okay. Yeah.

01:08:48 - Anthony Campolo

You said rock song with saxophone, so it took "rock song" from that prompt but couldn't really translate "heavy saxophone" into the actual sound. It's not able to reliably map a named instrument to what that instrument sounds like. Let me try Stable Audio.

01:09:36 - Anthony Campolo

So if you look in the logs here — generating music with model Stability AI Stable Audio Open 1.0. Some of the tools have a list-models command that will tell you all the available models. For this one I haven't implemented that yet, so I still need to do that. I think I might have just set it up for the one model they have. So I'll probably need to go back to their GitHub and figure out what other models are available. Still running. This one takes a while.

01:10:37 - Anthony Campolo

Okay. So it looks like this project actually is designed around this one model.

01:10:44 - Anthony Campolo

Stable Audio Open generates variable-length stereo audio and comprises three components: an autoencoder that compresses waveforms, a T5-based text embedding for text conditioning, and a transformer-based diffusion model that operates on the latent space of the audio encoder. So that's where you tell it the instrument and style. Interesting. Okay, so just the one model for now.

01:11:13 - Anthony Campolo

Yeah. This is interesting because some projects are built around a single model and some have a family of models. That's the case with AudioCraft — because AudioCraft is a Meta project, there's actually a range of different MusicGen models that you can use. I actually hadn't heard of Stable Audio until I started researching this, and they seem to have put a lot of effort into it.

01:11:47 - Nick Taylor

I'm trying to find the tab you're in. Oh, there we go. Lost you for a minute.

01:11:57 - Nick Taylor

Can you see the right screen now?

01:11:58 - Anthony Campolo

Yeah, your tabs. Gotcha. Let me switch to the Stable Audio GitHub tab so I can play some of their example outputs.

01:12:19 - Anthony Campolo

Yeah. So that's like whistling with wind blowing.

01:12:24 - Nick Taylor

That's pretty accurate.

01:12:27 - Anthony Campolo

Yeah. This would be really good if you're creating a TV show and you needed background sound effects.

01:12:42 - Nick Taylor

I'd be curious to know what the prompt is. Oh, that is the prompt. Seems pretty good.

01:12:49 - Anthony Campolo

"Pop dance track with catchy melodies, tropical percussion, and upbeat rhythms perfect for the beach." I'm assuming that's the prompt — it's not entirely clear from the README. Yeah.

01:13:11 - Nick Taylor

Those sound pretty good.

01:13:15 - Anthony Campolo

Yeah. I'm assuming those were run on the highest quality model, which is what we're trying to do right now. Of course, it's going to take its time because it's probably a very large model. So you see here it's doing all this processing. And it looks like I'm actually using pyenv, not virtualenv.

01:13:38 - Nick Taylor

Okay.

01:13:39 - Anthony Campolo

Good to know. Anyway, all that's going to keep going for a while. Let me show some of the other things.

As I was saying, I couldn't get any of the open-source video models to work on my machine. But for all of these, my goal is to have an open-source version and a third-party hosted version for each modality so that people can choose. Right now there's no third-party hosted music option, and there's no open-source video option. In each of those gaps, I can only find either good open-source stuff or good third-party services — not both. Because like I said, Suno is the best music service but they don't have an API.

So for video right now there are two options. Have you heard of Veo 3?

01:14:33 - Nick Taylor

How can you not hear about that or Kling? That's basically all I've seen for video this past month.

01:14:40 - Anthony Campolo

Yeah. Google's got a lot of money to promote it, so it's not surprising. But I'm not going to run the Veo one because it's super expensive — like a dollar for a single ten-second video, which is insane to me. The quality is incredible, but it's just very expensive right now.

So you see here I have this list-models command. It will tell you the options are veo-3.0-generate-preview and veo-3.0-fast-generate-preview — I think that's the cheaper one. And then Veo 2, I'll probably just remove. For the most part, I'm trying not to support legacy models. I want to just have the latest, because usually when a new model comes out, you get a better version of the expensive model and a cheaper model that's almost as good.
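As a rough sketch of what that looks like: the `npm run as-video list-models` invocation appears later in the stream, and the two model names come from the conversation above, but the output format shown here is purely illustrative.

```sh
# Hypothetical sketch; only the command name and model IDs come from the
# stream, the output layout is assumed.
npm run as-video list-models
#   veo-3.0-generate-preview        (higher quality, more expensive)
#   veo-3.0-fast-generate-preview   (cheaper variant)
```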

That's kind of the trend I tend to notice.

01:15:42 - Anthony Campolo

So I am going to run the Runway demo. Do you know anything about Runway?

01:15:48 - Nick Taylor

I've heard about it — I know somebody that worked there, and I know they do video generation.

01:16:00 - Nick Taylor

What's his name? I forget. And there's somebody I did a front-end testing panel with who worked there too. The name's escaping me.

01:16:14 - Anthony Campolo

[unclear] npm run as-video list-models, but anyway —

01:16:20 - Anthony Campolo

Runway does both video and images. They also have a lip-syncing tool, which is really interesting for me because eventually I want to get to the point where I can create a full television show. And so I essentially have all of the pieces for that now. In the AutoShow CLI, you can generate images, turn those images into video, generate text-to-speech, and then stitch all that together.

The last thing you still need is getting the text to actually sync to the video. The way the video works right now, you can either give it a prompt and it will just generate a video, or you can give it an image and it will animate that image. So if you have a picture of Homer Simpson, it'll just be him eating a donut or whatever.
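The pipeline Anthony describes could look something like the sketch below under the CLI's `npm run`-style interface. Only `as-video list-models` is shown on stream; every other subcommand and flag name here is a hypothetical placeholder for the steps he names (image generation, image-to-video animation, text-to-speech, and the not-yet-built stitching step).

```sh
# Hypothetical end-to-end sketch; subcommand and flag names are assumptions.
npm run as-image -- --prompt "a panda in space"            # generate a still
npm run as-video -- --image panda.png \
                    --prompt "make the panda dance"        # animate the image
npm run as-tts   -- --text "episode narration here"        # text-to-speech
# Final step (not yet built): lip-sync the speech to the video and stitch
# the clips together, e.g. with ffmpeg.
```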

01:17:10 - Nick Taylor

Gotcha.

01:17:10 - Anthony Campolo

Lip syncing takes a character image and some text and has them speak it — makes their lips match the speech. There are a couple of tools that do that. I don't think Runway can do that through the API, though — they have some features that are only in the dashboard UI. But their API does support image generation and video generation right now. So let me go ahead and demo it.

01:17:37 - Nick Taylor

Gotcha. I actually have to hop off in a second — I got invited to a work meeting.

01:17:47 - Anthony Campolo

Okay, great. Let me just run this command and then we'll be done in two minutes.

01:17:50 - Nick Taylor

Yeah, no worries.

01:17:56 - Nick Taylor

Yeah, I've heard of Runway and I know they do video. I've just never actually seen it in action.

01:18:06 - Anthony Campolo

Okay. So we're going to take this image of a panda in space and we're going to make him dance.

01:18:10 - Anthony Campolo

Actually, I already have the output. Let me just show it.

01:18:20 - Nick Taylor

That's pretty cool. It's wild what you can do these days.

01:18:25 - Anthony Campolo

Yeah. And because we now have both the image and the video capabilities, you can start by generating the image and then tell it how to animate. I generated this panda with the image command. Let's listen to the higher-quality music model's output real quick and then we'll wrap up. What are your impressions so far of all this?

01:18:51 - Nick Taylor

It's pretty cool. It's neat having all those tools in there — I definitely want to dig into it more.

01:18:57 - Anthony Campolo

Yeah, I hear you.

01:18:58 - Nick Taylor

Gotta go.
