Autogenerate Show Notes with Whisper.cpp, Llama.cpp, and Node.js
End-to-end scripting workflow to generate automatic show notes with LLMs from audio and video transcripts using Whisper.cpp, Llama.cpp, and Commander.js.
All of this project’s code can be found on my GitHub at ajcwebdev/autoshow.
Introduction and Overview
Creating podcast show notes is an arduous process. Many podcasters do not have the support of a team or the personal bandwidth required to produce high quality show notes. A few of the necessary ingredients include:
Accurate transcript with timestamps
Chapter headings and descriptions
Succinct episode summaries of varying length (sentence, paragraph, a few paragraphs)
Thankfully, through the magic of AI, many of these can now be generated automatically with a combination of open source tooling and affordable large language models (LLMs). In this project, we'll be leveraging OpenAI's open source transcription model, Whisper, alongside a locally run Llama model (closed source LLMs like ChatGPT will be covered in a follow-up post).
Project Setup
To begin, create a new project:
Initialize a package.json and set type to module for ESM syntax.
Create a script called autoshow that runs the main entry script and includes a .env for environment variables.
Create the .env file; the autoshow script will fail if this file doesn't exist, but it can be empty if you don't need any environment variables for your workflow.
Create a .gitignore file which includes the whisper.cpp and content directories along with common files you don’t want to commit like .env.
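Here's a rough sketch of those setup steps from the command line. I'm assuming Node 20.6+ so the script can load .env with the built-in --env-file flag (which would also explain why the script fails when the file is missing); if the original setup used a different mechanism such as dotenv, adjust accordingly:

```bash
mkdir autoshow && cd autoshow

# Initialize package.json and enable ESM syntax
npm init -y
npm pkg set type="module"

# "autoshow" script runs the CLI entry point and loads environment variables from .env
npm pkg set scripts.autoshow="node --env-file=.env src/autoshow.js"

# Node errors out if --env-file points at a missing file, so create an empty one
touch .env

# Ignore generated directories and local secrets
printf "node_modules\nwhisper.cpp\ncontent\n.env\n" > .gitignore
```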
Create Directory Structure
Create the directory structure and project files required.
content for audio and transcription files that we’ll generate along the way.
src for all of the project’s logic including:
autoshow.js for the entry file to the CLI.
commands for main project functions that correspond to different CLI options:
processVideo.js takes a single YouTube URL with --video and runs the entire transcription and show note generation workflow.
processPlaylist.js takes a URL for a YouTube playlist with --playlist and runs the processVideo function on each video (be aware that for long playlists with over 100+ hours of content this could take multiple days to complete).
processURLs.js takes a list of arbitrary URLs with --urls and runs processVideo on each one.
processFile.js takes a path to a video or audio file stored locally on your machine.
processRSS.js takes a URL to a podcast RSS feed with --rss and runs the workflow on each episode of the podcast.
utils for five common and reusable utility operations that processVideo calls out to, all of which are agnostic to the specific LLM and transcription approaches. The file names should be mostly self-explanatory, but we'll explain each of these in more detail throughout the article:
Generating frontmatter with generateMarkdown.js.
Downloading audio with downloadAudio.js.
Running the transcription model with runTranscription.js.
Running the large language model with runLLM.js.
Cleaning up extra files with cleanUpFiles.js.
llms for prompts to generate different show notes (prompt.js) and utilities related to LLMs:
We'll only create one LLM integration in this article, llama.js, but the next article will include options like chatgpt.js and claude.js.
transcription for utilities related to transcription:
This will also have only one option, whisper.js, but future options will include deepgram.js and assembly.js.
Install the following NPM dependencies:
commander: A package for creating command-line interfaces with support for subcommands, options, and custom help.
node-llama-cpp@beta: Node.js bindings for llama.cpp that make it possible to run LLMs locally (the @beta tag specifies the experimental v3 version of node-llama-cpp).
file-type: A tool for detecting the file type and MIME type of a Buffer/Uint8Array/ArrayBuffer.
ffmpeg-static: Provides a static FFmpeg binary for Node.js projects, enabling media processing without requiring a separate FFmpeg installation.
fast-xml-parser: A fast, lightweight XML parser for converting XML to a JSON object, validating, or parsing XML values.
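All five can be installed in one command (node-llama-cpp pinned to the beta tag as noted above):

```bash
npm install commander node-llama-cpp@beta file-type ffmpeg-static fast-xml-parser
```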
Here’s a high level view of the project structure:
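Reconstructed from the descriptions above, the layout looks roughly like this (the exact nesting of llms and transcription under src is my assumption):

```
.
├── .env
├── .gitignore
├── package.json
├── content/
└── src/
    ├── autoshow.js
    ├── commands/
    │   ├── processVideo.js
    │   ├── processPlaylist.js
    │   ├── processURLs.js
    │   ├── processFile.js
    │   └── processRSS.js
    ├── utils/
    │   ├── generateMarkdown.js
    │   ├── downloadAudio.js
    │   ├── runTranscription.js
    │   ├── runLLM.js
    │   └── cleanUpFiles.js
    ├── llms/
    │   ├── prompt.js
    │   └── llama.js
    └── transcription/
        └── whisper.js
```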
Autoshow Main Entry Point
Commander.js is an open source library for building command line interfaces with Node.js. We could use process.argv to take an argument with a --video option and run the processVideo function on a YouTube URL provided as an argument. However, this project will eventually include dozens of different options, and managing all of that through process.argv would become unmaintainable very quickly.
autoshow.js defines the command-line interface (CLI) for an application called “autoshow” using the Commander.js library. The application is designed to process various types of media content, including YouTube videos, playlists, local files, and podcast RSS feeds. The file imports Command from the ‘commander’ library for creating the CLI and various processing functions from local modules for handling different types of content. A new Command instance is created and stored in the program variable and the CLI is configured with a name and description.
Several options are defined:
-v, --video <url>: Process a single YouTube video
-p, --playlist <playlistUrl>: Process all videos in a YouTube playlist
-u, --urls <filePath>: Process YouTube videos from a list of URLs in a file
-f, --file <filePath>: Process a local audio or video file
-r, --rss <rssURL>: Process a podcast RSS feed
--order <order>: Specify the order for RSS feed processing (newest or oldest), default is ‘newest’
--skip <number>: Number of items to skip when processing RSS feed, default is 0
--whisper <modelType>: Specify the Whisper model type for transcription, default is ‘base’
--llama: Use a local Llama model to generate the show notes
The main action of the command is defined using program.action(). This function is executed when the CLI is run.
A handlers object is created, mapping option keys to their corresponding processing functions.
The llmOption is determined by checking if the llama option is set.
The transcriptionOption is set to the value of the whisper option.
The function then iterates through the handlers object:
For each handler, it checks if the corresponding option is set.
If set, it calls the handler function with appropriate arguments.
For the ‘rss’ option, additional arguments (order and skip) are passed.
Parsing Arguments
Finally, program.parse(process.argv) is called to parse the command-line arguments and execute the appropriate action based on the provided options.
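Putting all of that together, a sketch of src/autoshow.js might look like the following. The handler signatures (url, llmOption, transcriptionOption) are assumptions based on the descriptions above rather than the project's exact code:

```js
// src/autoshow.js
import { Command } from 'commander'
import { processVideo } from './commands/processVideo.js'
import { processPlaylist } from './commands/processPlaylist.js'
import { processURLs } from './commands/processURLs.js'
import { processFile } from './commands/processFile.js'
import { processRSS } from './commands/processRSS.js'

const program = new Command()

program
  .name('autoshow')
  .description('Automated processing of YouTube videos, playlists, local files, and podcast RSS feeds')
  .option('-v, --video <url>', 'Process a single YouTube video')
  .option('-p, --playlist <playlistUrl>', 'Process all videos in a YouTube playlist')
  .option('-u, --urls <filePath>', 'Process YouTube videos from a list of URLs in a file')
  .option('-f, --file <filePath>', 'Process a local audio or video file')
  .option('-r, --rss <rssURL>', 'Process a podcast RSS feed')
  .option('--order <order>', 'Order for RSS feed processing (newest or oldest)', 'newest')
  .option('--skip <number>', 'Number of items to skip when processing RSS feed', '0')
  .option('--whisper <modelType>', 'Whisper model type for transcription', 'base')
  .option('--llama', 'Use a local Llama model to generate show notes')

program.action(async (options) => {
  // Map each CLI option to the command that handles it
  const handlers = {
    video: processVideo,
    playlist: processPlaylist,
    urls: processURLs,
    file: processFile,
    rss: processRSS,
  }

  const llmOption = options.llama ? 'llama' : null
  const transcriptionOption = options.whisper

  for (const [key, handler] of Object.entries(handlers)) {
    if (options[key]) {
      if (key === 'rss') {
        // RSS processing also needs the order and skip values
        await handler(options[key], llmOption, transcriptionOption, options.order, options.skip)
      } else {
        await handler(options[key], llmOption, transcriptionOption)
      }
    }
  }
})

program.parse(process.argv)
```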
This CLI structure allows for flexible processing of various media types with different AI model options and parameters.
Utilities for Process Video Command
Import the utility functions in processVideo.js and pass url, llmOption, and transcriptionOption to the function.
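A sketch of what src/commands/processVideo.js could look like, assuming generateMarkdown returns the base file path (finalPath) that the other utilities build on:

```js
// src/commands/processVideo.js
import { generateMarkdown } from '../utils/generateMarkdown.js'
import { downloadAudio } from '../utils/downloadAudio.js'
import { runTranscription } from '../utils/runTranscription.js'
import { runLLM } from '../utils/runLLM.js'
import { cleanUpFiles } from '../utils/cleanUpFiles.js'

export async function processVideo(url, llmOption, transcriptionOption) {
  try {
    // 1. Write a markdown file with frontmatter and get back the base file path
    const finalPath = await generateMarkdown(url)
    // 2. Download and convert the audio to a 16 kHz WAV file
    await downloadAudio(url, finalPath)
    // 3. Transcribe the audio, then combine frontmatter, prompt, and transcript
    await runTranscription(finalPath, transcriptionOption)
    // 4. Optionally run an LLM over the prompt and transcript
    await runLLM(finalPath, llmOption)
    // 5. Remove the intermediate wav, lrc, and txt files
    await cleanUpFiles(finalPath)
  } catch (error) {
    console.error(`Error processing video ${url}:`, error)
  }
}
```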
Extract and Download Audio File
yt-dlp is a command-line program for downloading videos from YouTube and other video platforms. It is a fork of yt-dlc, which itself is a fork of youtube-dl, with additional features and patches integrated from both.
FFmpeg is a free and open-source software project consisting of a vast software suite of libraries and programs for handling video, audio, and other multimedia files and streams. It’s used for recording, converting, and streaming audio or video and supports a wide range of formats.
yt-dlp and ffmpeg both provide extensive documentation for installing their respective binaries for command line usage (I used brew install yt-dlp ffmpeg):
For transcriptions of videos, yt-dlp can download and extract audio from YouTube URLs. For podcasts, you can input the URL from the podcast's RSS feed that hosts the raw file containing the episode's audio.
Write Frontmatter with Video Metadata
In this project, we’re going to build a CLI that orchestrates executing different commands from yt-dlp and whisper.cpp all through various Node.js interfaces, scripts, packages, and modules. We’ll start with the generateMarkdown function in utils to create a markdown file with pre-populated frontmatter from a metadata object.
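Here's one way generateMarkdown might be written. The metadata fields, the yt-dlp --print calls, and the filename sanitization are my assumptions about how the frontmatter gets populated:

```js
// src/utils/generateMarkdown.js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'
import { writeFile } from 'node:fs/promises'

const execFilePromise = promisify(execFile)

export async function generateMarkdown(url) {
  // Ask yt-dlp for the metadata fields we want, printed one per line
  const { stdout } = await execFilePromise('yt-dlp', [
    '--print', '%(upload_date>%Y-%m-%d)s',
    '--print', '%(title)s',
    '--print', '%(webpage_url)s',
    '--print', '%(channel)s',
    url,
  ])
  const [formattedDate, title, webpageUrl, channel] = stdout.trim().split('\n')

  // Base path reused by every later step, e.g. content/2023-09-10-teach-jenn-tech-channel-trailer
  const slug = title.toLowerCase().replace(/[^a-z0-9]+/g, '-').replace(/(^-|-$)/g, '')
  const finalPath = `content/${formattedDate}-${slug}`

  const frontMatter = [
    '---',
    `title: "${title}"`,
    `date: "${formattedDate}"`,
    `url: "${webpageUrl}"`,
    `channel: "${channel}"`,
    '---',
    '',
  ].join('\n')

  await writeFile(`${finalPath}.md`, frontMatter)
  return finalPath
}
```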
Prepare Audio for Transcription
Next, we’ll create a downloadAudio function to extract audio from a YouTube video in the form of a WAV file.
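A sketch of the function, assuming it reuses the finalPath returned by generateMarkdown for the output name (the standalone terminal command broken down below uses yt-dlp's own output template instead):

```js
// src/utils/downloadAudio.js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'

const execFilePromise = promisify(execFile)

export async function downloadAudio(url, finalPath) {
  // Download the audio track and convert it to a 16 kHz WAV file for whisper.cpp
  await execFilePromise('yt-dlp', [
    '--restrict-filenames',
    '--extract-audio',
    '--audio-format', 'wav',
    '--postprocessor-args', 'ffmpeg:-ar 16000',
    '--no-playlist',
    '-o', `${finalPath}.%(ext)s`,
    url,
  ])
  return `${finalPath}.wav`
}
```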
This function wraps a yt-dlp command that creates a file called 2023-09-10-teach-jenn-tech-channel-trailer.wav by performing the following actions:
Downloads a YouTube video specified by its URL.
Extracts and downloads the video’s audio as a WAV file.
Performs audio post processing to set the correct sample rate.
Saves the audio file in the content directory.
Creates a dynamic file name generated by the video’s upload date and unique video ID.
Here’s a breakdown of each option and flag utilized in the command:
--restrict-filenames restricts filenames to only ASCII characters, avoids ”&”, and removes spaces in filenames.
--extract-audio downloads the video from a given URL and extracts its audio.
--audio-format specifies the format the audio should be converted to; for Whisper we'll use wav for WAV files.
--postprocessor-args has the argument 16000 passed to -ar so the audio sampling rate is set to 16000 Hz (16 kHz) for Whisper.
--no-playlist ensures that only the video is downloaded if the URL refers to a video and a playlist.
-o implements the output template for the downloaded file, in this case content/%(upload_date>%Y-%m-%d)s-%(title)s.%(ext)s specifies the following:
Directory to place the output file (content).
Upload date of the video (%(upload_date>%Y-%m-%d)s).
Video title (%(title)s).
Extension name for the file, in this case wav (%(ext)s).
The URL, https://www.youtube.com/watch?v=jKB0EltG9Jo, is the YouTube video we'll extract the audio from. Each YouTube video has a unique identifier contained in its URL (jKB0EltG9Jo in this example).
Here's how to run the yt-dlp command directly in the terminal.
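Reconstructed from the flag breakdown above (line breaks added for readability):

```bash
yt-dlp \
  --restrict-filenames \
  --extract-audio \
  --audio-format wav \
  --postprocessor-args "-ar 16000" \
  --no-playlist \
  -o "content/%(upload_date>%Y-%m-%d)s-%(title)s.%(ext)s" \
  "https://www.youtube.com/watch?v=jKB0EltG9Jo"
```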
Note: Include the --verbose flag if you're getting strange errors and don't know why.
Generate a Formatted Transcription
whisper.cpp is a C++ implementation of OpenAI's Whisper Python project. It makes it possible to transcribe episodes in minutes instead of days.
Setup Whisper Repo and Model
Run the following commands to clone the repo and build the base model:
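These follow the whisper.cpp README at the time of writing (paths and build steps may have changed since):

```bash
# Clone the repository
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

# Download the base model in ggml format (~150MB)
bash ./models/download-ggml-model.sh base

# Build the main executable
make
```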
Note: This builds the smallest and least capable transcription model. For a more accurate but heavyweight model, replace base (150MB) with medium (1.5GB) or large-v2 (3GB).
If you're a simple JS developer like me, you may find the whisper.cpp repo a bit intimidating to navigate. Here's a breakdown of some of the most important pieces of the project to help you get oriented:
models/ggml-base.bin
Custom binary format (ggml) used by the whisper.cpp library.
Represents a quantized or optimized version of OpenAI’s Whisper model tailored for high-performance inference on various platforms.
The ggml format is designed to be lightweight and efficient, allowing the model to be easily integrated into different applications.
main
Executable compiled from the whisper.cpp repository.
Transcribes or translates audio files using the Whisper model.
Running this executable with an audio file as input transcribes the audio to text.
samples
The directory for sample audio files.
Includes a sample file called jfk.wav provided for testing and demonstration purposes.
The main executable can use it for showcasing the model’s transcription capabilities.
whisper.cpp and whisper.h
These are the core C++ source and header files of the whisper.cpp project.
They implement the high-level API for interacting with the Whisper automatic speech recognition (ASR) model.
This includes loading the model, preprocessing audio inputs, and performing inference.
It’s possible to run the Whisper model so the transcript just prints to the terminal without writing to an output file. This can be done by changing into the whisper.cpp directory and entering the command: ./main -m models/ggml-base.bin -f content/file.wav.
-m and -f are shortened aliases used in place of --model and --file.
For other models, replace ggml-base.bin with ggml-medium.bin or ggml-large-v2.bin.
This is nice for quick demos or short files. However, what you really want is the transcript saved to a new file. Whisper.cpp provides many different output options including txt, vtt, srt, lrc, csv, and json. These cover a wide range of uses and vary from highly structured to mostly unstructured data.
Any combination of output files can be specified with --output-filetype using any of the previous options in place of filetype.
For example, to output two files, an LRC file and basic text file, include --output-lrc and --output-txt.
For this example, we’ll only output one file in the lrc format:
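A reconstruction of that command using the example file from earlier, run from the project root (adjust the paths if you run from inside the whisper.cpp directory):

```bash
./whisper.cpp/main \
  -m whisper.cpp/models/ggml-base.bin \
  -f content/2023-09-10-teach-jenn-tech-channel-trailer.wav \
  --output-lrc \
  -of content/2023-09-10-teach-jenn-tech-channel-trailer
```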
Note:
-of is an alias for --output-file and is used to modify the final file name and select the output directory.
Since our command includes content/2023-09-10-teach-jenn-tech-channel-trailer, a file is created called 2023-09-10-teach-jenn-tech-channel-trailer.lrc inside the content directory.
Run Whisper and Transform Output
Despite the various available format options, whisper.cpp outputs them all as text files. Based on your personal workflow and experience, you may find it easier to parse and transform a different common data format like csv or json. For my purposes, I'm going to use the lrc output, which looks like this:
I’ll create a JavaScript regular expression to modify the LRC transcript by performing the following transformation:
Like we did before with yt-dlp, we'll create a Node.js function that wraps a terminal command. However, unlike downloadAudio, we'll implement an additional wrapper function in runTranscription.js that imports a callWhisper function. This keeps the top level utility functions composable, making it possible to switch out Whisper for other transcription services in the future such as Deepgram or Assembly.
whisper.js contains a sub-utility, getWhisperModel, which implements a switch/case statement for each Whisper model, sets the result to whisperModel, and passes whisper.cpp/models/${whisperModel} to Whisper’s -m flag.
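A sketch of src/transcription/whisper.js along those lines. The timestamp-shortening regex and the [by:] header filter are my guesses at the LRC cleanup described above:

```js
// src/transcription/whisper.js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'
import { readFile, writeFile } from 'node:fs/promises'

const execFilePromise = promisify(execFile)

// Map the --whisper CLI value to a ggml model file
function getWhisperModel(modelType) {
  switch (modelType) {
    case 'medium':
      return 'ggml-medium.bin'
    case 'large-v2':
      return 'ggml-large-v2.bin'
    case 'base':
    default:
      return 'ggml-base.bin'
  }
}

export async function callWhisper(finalPath, transcriptionOption) {
  const whisperModel = getWhisperModel(transcriptionOption)

  // Run the whisper.cpp main executable and write an lrc file next to the wav
  await execFilePromise('./whisper.cpp/main', [
    '-m', `whisper.cpp/models/${whisperModel}`,
    '-f', `${finalPath}.wav`,
    '--output-lrc',
    '-of', finalPath,
  ])

  // Reformat the lrc transcript: drop the header line and shorten [00:00.00] to [00:00]
  const lrcContent = await readFile(`${finalPath}.lrc`, 'utf8')
  const txtContent = lrcContent
    .split('\n')
    .filter((line) => !line.startsWith('[by:'))
    .map((line) => line.replace(/\[(\d{2}):(\d{2})\.\d{2}\]/g, '[$1:$2]'))
    .join('\n')
  await writeFile(`${finalPath}.txt`, txtContent)
  return txtContent
}
```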
The main runTranscription function imports callWhisper (which runs the whisper.cpp main command under the hood) and handles the coordination of three things which are combined and written to a single file:
The frontmatter from ${finalPath}.md at the top of the file.
The formatted transcript from txtContent at the bottom of the file.
The show notes PROMPT from prompt.js (which we’ll write in the next section) is inserted between the two.
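Under those assumptions, src/utils/runTranscription.js stays small. Where exactly the three pieces get stitched together is my guess; the important part is the order of frontmatter, prompt, and transcript:

```js
// src/utils/runTranscription.js
import { readFile, writeFile } from 'node:fs/promises'
import { callWhisper } from '../transcription/whisper.js'
import { PROMPT } from '../llms/prompt.js'

export async function runTranscription(finalPath, transcriptionOption) {
  // Transcribe the audio and get back the cleaned-up transcript text
  const txtContent = await callWhisper(finalPath, transcriptionOption)

  // Read the frontmatter written earlier by generateMarkdown
  const frontMatter = await readFile(`${finalPath}.md`, 'utf8')

  // Frontmatter at the top, prompt in the middle, transcript at the bottom
  const finalContent = `${frontMatter}\n${PROMPT}\n${txtContent}`
  await writeFile(`${finalPath}.md`, finalContent)
}
```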
In the next section we’ll create the prompt to tell the LLM how to write the show notes.
Generate Show Notes with LLMs
Now that we have a cleaned up transcript, we can use an LLM to create the show notes by giving it the transcript along with a prompt that describes what we want the show notes to contain and instructions for how we want the show notes to be written.
Create Show Notes Prompt
The output contains three distinct sections which correspond to the full instructions of the prompt. Any of these sections can be removed, changed, or expanded:
One Sentence Summary
One Paragraph Summary
Chapters
Include the following prompt in src/llms/prompt.js:
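The exact wording is up to you. Here's a sketch of the kind of prompt that produces the three sections listed above; the character limits and chapter lengths are illustrative rather than the article's original values:

```js
// src/llms/prompt.js
export const PROMPT = `This is a transcript with timestamps.

Write a one sentence summary of the transcript and a one paragraph summary.
  - The one sentence summary should not exceed 180 characters (roughly 30 words).
  - The one paragraph summary should be 600-1200 characters (roughly 100-200 words).

Create chapters based on the topics discussed throughout.
  - Include timestamps for when each chapter begins.
  - Chapters should each cover roughly 1-6 minutes of the episode.
  - Write a one paragraph description for each chapter.

Format the output exactly like so:

    ## One Sentence Summary
    The one sentence summary.

    ## One Paragraph Summary
    The one paragraph summary.

    ## Chapters
    00:00 - Chapter Title
    A one paragraph description of the chapter.
`
```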
If we were to run the autoshow command without runLLM, the output would look like this:
At this point, you can copy and paste this entire file’s content into your LLM of choice to create the show notes. But, we’re going to add another function to perform this step automatically using a local, open source model.
Run LLM Function with Llama
We need to download a model before we can use node-llama-cpp. I'm going to download the Llama 3.1 8B Instruct model and set the model environment variable in a .env file.
-L follows redirects, which is important for Hugging Face links.
-o specifies the output file and directory.
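The download looks roughly like this. The Hugging Face URL is a placeholder for whichever GGUF build of the model you choose, and the LLAMA_MODEL variable name and models directory are my own conventions, not the project's:

```bash
# Placeholder URL: substitute the GGUF file you want from Hugging Face
curl -L "https://huggingface.co/<repo>/resolve/main/<model-file>.gguf" \
  -o "./src/llms/models/Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf"

# Point the app at the downloaded model (hypothetical variable name)
echo 'LLAMA_MODEL="Meta-Llama-3.1-8B-Instruct.Q4_K_M.gguf"' >> .env
```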
Create an export for callLlama from src/llms/llama.js. The callLlama function takes in the transcriptContent and an outputFilePath for the show notes generated from the transcription content. As with runTranscription, in a follow-up blog post I'll show how to integrate additional third-party language models like OpenAI's ChatGPT, Anthropic's Claude, and Cohere's Command R.
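A sketch of src/llms/llama.js using the node-llama-cpp v3 chat session API. The LLAMA_MODEL environment variable and the models directory are the hypothetical conventions from the download step above:

```js
// src/llms/llama.js
import path from 'node:path'
import { fileURLToPath } from 'node:url'
import { writeFile } from 'node:fs/promises'
import { getLlama, LlamaChatSession } from 'node-llama-cpp'

const __dirname = path.dirname(fileURLToPath(import.meta.url))

export async function callLlama(transcriptContent, outputFilePath) {
  // Load the local GGUF model referenced by the .env variable (hypothetical name)
  const llama = await getLlama()
  const model = await llama.loadModel({
    modelPath: path.join(__dirname, 'models', process.env.LLAMA_MODEL),
  })

  // Create a context and chat session, then send the prompt and transcript
  const context = await model.createContext()
  const session = new LlamaChatSession({ contextSequence: context.getSequence() })
  const response = await session.prompt(transcriptContent)

  // Write the generated show notes out so runLLM can merge them back into the markdown
  await writeFile(outputFilePath, response)
  return response
}
```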
The callLlama function is imported and called in runLLM.js.
The function passes the final markdown content to an LLM.
It rewrites the markdown file to replace the prompt with the resulting show notes generated by the LLM.
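A sketch of src/utils/runLLM.js that follows those three steps; swapping in the show notes by replacing the PROMPT string is my assumption about how the rewrite happens:

```js
// src/utils/runLLM.js
import { readFile, writeFile } from 'node:fs/promises'
import { callLlama } from '../llms/llama.js'
import { PROMPT } from '../llms/prompt.js'

export async function runLLM(finalPath, llmOption) {
  // Skip this step entirely when no LLM flag was passed
  if (llmOption !== 'llama') return

  // The markdown file currently holds frontmatter + prompt + transcript
  const markdownContent = await readFile(`${finalPath}.md`, 'utf8')

  // Generate show notes from the full prompt and transcript
  const showNotes = await callLlama(markdownContent, `${finalPath}_shownotes.md`)

  // Replace the prompt with the generated show notes and rewrite the file
  const finalContent = markdownContent.replace(PROMPT, showNotes)
  await writeFile(`${finalPath}.md`, finalContent)
}
```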
The last utility function is cleanUpFiles which deletes files associated with a given id. The intermediary files include wav, lrc, and txt formats.
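A sketch of src/utils/cleanUpFiles.js, assuming the id passed in is the same finalPath used everywhere else:

```js
// src/utils/cleanUpFiles.js
import { unlink } from 'node:fs/promises'

export async function cleanUpFiles(finalPath) {
  // Remove the intermediate audio and transcript files, keeping the final markdown
  for (const extension of ['.wav', '.lrc', '.txt']) {
    try {
      await unlink(`${finalPath}${extension}`)
    } catch (error) {
      // Ignore files that were never created (e.g. when a step was skipped)
      if (error.code !== 'ENOENT') throw error
    }
  }
}
```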
Run npm run autoshow -- --video with the --llama and --whisper flags to see the complete functionality.
Run without --llama to get frontmatter, transcript, and prompt to generate the show notes.
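For example, using the channel trailer from earlier:

```bash
# Full pipeline: Whisper base model transcription plus Llama-generated show notes
npm run autoshow -- --video "https://www.youtube.com/watch?v=jKB0EltG9Jo" --whisper base --llama

# Transcription only: frontmatter, transcript, and prompt, ready to paste into an LLM
npm run autoshow -- --video "https://www.youtube.com/watch?v=jKB0EltG9Jo" --whisper base
```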
At this point, the autoshow.js script is designed to run on individual video URLs. However, there’s a handful of other use cases I want to implement.
If you already have a backlog of content to transcribe, you’ll want to run this script on a series of video URLs.
If you have an archive of stored audio or video files, you’ll want to run this script on a collection of files on your local machine.
If you have a podcast or RSS feed, you’ll want to run this script on each episode in the feed.
Additional Process Commands
Now that all the functionality for processVideo is complete, we’ll write four more functions in the commands directory. These functions do the following:
processPlaylist.js accepts a playlist URL instead of a video URL with --playlist.
processURLs.js accepts a list of arbitrary YouTube URLs with --urls.
processFile.js accepts a path to a local audio or video file with --file.
processRSS.js accepts an RSS feed URL with --rss.
Add Commands to Process Multiple Videos
The processPlaylist function will fetch video URLs from a playlist, save them to a file, and process each video URL by calling processVideo. The --print "url" and --flat-playlist options from yt-dlp can be used to write a list of video URLs to a new file which we'll call urls.md.
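A sketch of src/commands/processPlaylist.js along those lines (writing urls.md into the content directory is my assumption):

```js
// src/commands/processPlaylist.js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'
import { writeFile } from 'node:fs/promises'
import { processVideo } from './processVideo.js'

const execFilePromise = promisify(execFile)

export async function processPlaylist(playlistUrl, llmOption, transcriptionOption) {
  // Ask yt-dlp for every video URL in the playlist without downloading anything
  const { stdout } = await execFilePromise('yt-dlp', [
    '--flat-playlist',
    '--print', 'url',
    playlistUrl,
  ])
  const urls = stdout.trim().split('\n').filter(Boolean)

  // Save the list so it can be reused later with the --urls option
  await writeFile('content/urls.md', urls.join('\n'))

  // Run the full processVideo workflow on each video, one at a time
  for (const url of urls) {
    await processVideo(url, llmOption, transcriptionOption)
  }
}
```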
Run npm run autoshow -- --playlist with a playlist URL passed to --playlist to run on multiple YouTube videos contained in the playlist.
To process a list of arbitrary URLs, we'll bypass the yt-dlp command that reads the list of videos from a playlist and instead read urls.md directly. processURLs will process a list of video URLs and perform the following actions:
Reads a file containing video URLs.
Parses the URLs.
Processes each URL by calling the processVideo function.
The function checks to see if the file exists so it can log an error message and exit early if the file doesn’t exist.
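A sketch of src/commands/processURLs.js performing those steps:

```js
// src/commands/processURLs.js
import { existsSync } from 'node:fs'
import { readFile } from 'node:fs/promises'
import { processVideo } from './processVideo.js'

export async function processURLs(filePath, llmOption, transcriptionOption) {
  // Exit early with a useful message if the file doesn't exist
  if (!existsSync(filePath)) {
    console.error(`File not found: ${filePath}`)
    process.exit(1)
  }

  // Read the file, drop blank lines, and process each URL in order
  const content = await readFile(filePath, 'utf8')
  const urls = content
    .split('\n')
    .map((line) => line.trim())
    .filter(Boolean)

  for (const url of urls) {
    await processVideo(url, llmOption, transcriptionOption)
  }
}
```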
Run npm run autoshow -- --urls with the path to your urls.md file passed to --urls.
Add Command to Process Local Files
Add the following to processFile.js to allow running npm run autoshow -- --file with the path to an audio or video file passed to --file.
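Here's a sketch of what that file might contain. Converting the input to a 16 kHz WAV with the ffmpeg-static binary and writing a minimal frontmatter file are my assumptions about how local files slot into the existing pipeline:

```js
// src/commands/processFile.js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'
import path from 'node:path'
import { writeFile } from 'node:fs/promises'
import { fileTypeFromFile } from 'file-type'
import ffmpegPath from 'ffmpeg-static'
import { runTranscription } from '../utils/runTranscription.js'
import { runLLM } from '../utils/runLLM.js'
import { cleanUpFiles } from '../utils/cleanUpFiles.js'

const execFilePromise = promisify(execFile)

export async function processFile(filePath, llmOption, transcriptionOption) {
  // Detect the media type from the file's magic bytes rather than trusting the extension
  const fileType = await fileTypeFromFile(filePath)
  if (!fileType) {
    console.error(`Unsupported or unrecognized file type: ${filePath}`)
    return
  }

  // Write a minimal markdown file so runTranscription has frontmatter to prepend
  const finalPath = `content/${path.parse(filePath).name}`
  await writeFile(`${finalPath}.md`, `---\ntitle: "${path.parse(filePath).name}"\n---\n`)

  // Convert whatever we were given into a 16 kHz WAV file that whisper.cpp can read
  await execFilePromise(ffmpegPath, ['-i', filePath, '-ar', '16000', `${finalPath}.wav`])

  await runTranscription(finalPath, transcriptionOption)
  await runLLM(finalPath, llmOption)
  await cleanUpFiles(finalPath)
}
```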
I've done my best to avoid extra dependencies with this project, but I chose to use file-type for this script for two main reasons:
Managing different media file types involves extensive edge cases and permutations from one use case to another. This kind of problem is usually well served by a dedicated, well-scoped library that explicitly handles those edge cases and format incompatibilities.
The project is maintained by Sindre Sorhus, one of the most prolific and reliable open source JavaScript maintainers. Even if it isn't necessarily under active development, the project can at least be expected to stay up to date with peer-dependency updates, security patches, and ongoing bug fixes.
If you need a file to test, run the following command to download a one minute MP3 file:
Run npm run autoshow -- --file with the path to an audio or video file passed to --file:
Add Command to Process RSS Feeds
Add the following to processRSS.js to allow running npm run autoshow -- --rss with a URL to a podcast RSS feed:
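A sketch of src/commands/processRSS.js using fast-xml-parser. Handing each episode's enclosure URL to processVideo is my assumption, relying on the fact that yt-dlp can download raw audio URLs just as well as YouTube links:

```js
// src/commands/processRSS.js
import { XMLParser } from 'fast-xml-parser'
import { processVideo } from './processVideo.js'

export async function processRSS(rssURL, llmOption, transcriptionOption, order = 'newest', skip = 0) {
  // Fetch and parse the feed XML into a plain JavaScript object
  const response = await fetch(rssURL)
  const xml = await response.text()
  const parser = new XMLParser({ ignoreAttributes: false, attributeNamePrefix: '@_' })
  const feed = parser.parse(xml)

  // RSS items live at rss.channel.item; normalize to an array for single-episode feeds
  let items = feed.rss.channel.item
  if (!Array.isArray(items)) items = [items]

  // Feeds typically list newest first, so reverse for oldest-first processing, then apply --skip
  if (order === 'oldest') items = [...items].reverse()
  items = items.slice(Number(skip))

  for (const item of items) {
    // The enclosure URL points at the raw audio file for the episode
    const audioURL = item.enclosure?.['@_url']
    if (!audioURL) continue
    console.log(`Processing episode: ${item.title}`)
    await processVideo(audioURL, llmOption, transcriptionOption)
  }
}
```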
This file also uses one dependency, this time fast-xml-parser for XML parsing. I chose it because:
Writing my own XML parser from scratch seemed like it would be a poor use of time
It’s fast, and everyone knows fast things are better than slow things
Run npm run autoshow -- --rss with a podcast RSS feed URL passed to --rss:
In a follow up blog post, I’ll show how to integrate additional LLMs from OpenAI, Claude, Cohere, and Mistral plus transcription models from Deepgram and Assembly.