Creating podcast show notes is an arduous process. Many podcasters do not have the support of a team or the personal bandwidth required to produce high-quality show notes. A few of the necessary ingredients include:
Accurate transcript with timestamps
Chapter headings and descriptions
Succinct episode summaries of varying length (sentence, paragraph, a few paragraphs)
Thankfully, through the magic of AI, many of these can now be generated automatically with a combination of open source tooling and affordable large language models (LLMs). In this project, we’ll be leveraging OpenAI’s open source transcription model, Whisper, and their closed source LLM, ChatGPT.
Setup Project and Install Dependencies
Create a new project directory, initialize a package.json, set type to module for ESM syntax, create directories for scripts and content, and create a .gitignore file.
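One way to scaffold this from the terminal (the package name and the .gitignore entries are placeholders of my choosing):

```shell
# Create the project directory with subdirectories for scripts and content.
mkdir -p scripts content

# Minimal package.json with "type": "module" so we can use ESM syntax.
# The "name" field is a placeholder; use whatever fits your project.
cat > package.json << 'EOF'
{
  "name": "autoshow-notes",
  "version": "1.0.0",
  "type": "module"
}
EOF

# Ignore dependency and generated files (entries are placeholders).
cat > .gitignore << 'EOF'
node_modules
content
EOF
```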
yt-dlp is a command-line program to download videos from YouTube and other video platforms. It is a fork of yt-dlc, which itself is a fork of youtube-dl, with additional features and patches integrated from both.
whisper.cpp is a C++ implementation of OpenAI’s Whisper Python project. This provides a huge speedup, making it possible to transcribe episodes in minutes instead of days. Run the following commands to clone the repo, download the large-v2 model, and build the project:
Note: This will download and build the largest and most capable transcription model. For a more lightweight setup, replace large-v2 (3GB) with base (150MB) or medium (1.5GB).
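The commands below follow the whisper.cpp README; build steps can differ between versions of the repo, so check its README if make fails on your machine:

```shell
# Clone the whisper.cpp repository.
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

# Download the large-v2 model in ggml format (~3GB).
bash ./models/download-ggml-model.sh large-v2

# Build the main executable.
make
```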
If you’re a simple JS developer like me, you may find the whisper.cpp repo a bit intimidating to navigate. Here’s a breakdown of some of the most important pieces of the project to help you get oriented. Click any of the following to see a dropdown with further explanation:
models/ggml-large-v2.bin
Custom binary format (ggml) used by the whisper.cpp library.
Represents a quantized or optimized version of OpenAI’s Whisper model tailored for high-performance inference on various platforms.
The ggml format is designed to be lightweight and efficient, allowing the model to be easily integrated into different applications.
main
Executable compiled from the whisper.cpp repository.
Transcribes or translates audio files using the Whisper model.
Running this executable with an audio file as input transcribes the audio to text.
samples
The directory for sample audio files.
Includes a sample file called jfk.wav provided for testing and demonstration purposes.
The main executable can use it to showcase the model’s transcription capabilities.
whisper.cpp and whisper.h
These are the core C++ source and header files of the whisper.cpp project.
They implement the high-level API for interacting with the Whisper automatic speech recognition (ASR) model.
This includes loading the model, preprocessing audio inputs, and performing inference.
Download and Extract Audio with yt-dlp
For transcriptions of videos, yt-dlp can download and extract audio from YouTube URLs. For podcasts, you’ll need to find the URL that hosts the raw file containing the episode’s audio. This URL can usually be found in one of two places:
If the podcast producer chooses to enable the feature, a download link will be available to click directly on the episode’s podcast player.
If there is no download button available in the podcast player’s UI, you’ll need to find the download link from the show’s RSS feed.
Create a command that completes the following actions:
Download a specified YouTube video.
Extract the video’s audio.
Convert the audio to WAV format.
Save the file in Whisper’s content directory.
Set filename to output.wav.
Note: Include the --verbose flag if you’re getting weird bugs and don’t know why.
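Putting the steps above together, the command looks like this (each flag is explained in the original breakdown that follows):

```shell
yt-dlp --extract-audio --audio-format wav \
  --postprocessor-args "-ar 16000" \
  -o "content/output.wav" \
  "https://www.youtube.com/watch?v=jKB0EltG9Jo"
```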
This command uses yt-dlp, a command-line utility for downloading videos from YouTube and other video platforms, to perform the following actions:
--extract-audio downloads the video from a given URL and extracts its audio.
--audio-format specifies the format the audio should be converted to; for Whisper, we’ll use wav for WAV files.
--postprocessor-args passes -ar 16000 to the postprocessor so the audio sampling rate is set to 16,000 Hz (16 kHz), which Whisper expects.
-o specifies the output template for the downloaded files, in this case content/output.wav, which also specifies the directory to place the output file in.
The URL https://www.youtube.com/watch?v=jKB0EltG9Jo points to the YouTube video we’ll extract the audio from. Each YouTube video has a unique identifier contained in its URL (jKB0EltG9Jo in this example).
Create and Prepare Transcription for Analysis
It’s possible to run the Whisper model and output the transcript directly to the terminal by running:
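A sketch of that invocation, run from inside the whisper.cpp directory (the paths are assumptions; adjust them to wherever your model and audio file live):

```shell
./main -m models/ggml-large-v2.bin -f content/output.wav
```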
Note: -m and -f are shortened aliases used in place of --model and --file.
This is nice for quick demos or short files. However, what you really want is the transcript saved to a new file.
Run Whisper Transcription Model
whisper.cpp provides many different output options, including txt, vtt, srt, lrc, csv, and json. These cover a wide range of uses and vary from highly structured to mostly unstructured data.
Any combination of output files can be specified with --output-filetype using any of the previous options in place of filetype.
For example, to output two files, an LRC file and basic text file, include --output-lrc and --output-txt.
For this example, we’ll only output one file in the lrc format:
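A sketch of the command, again run from inside the whisper.cpp directory (paths are assumptions):

```shell
./main -m models/ggml-large-v2.bin -f content/output.wav \
  -of content/transcript --output-lrc
```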
-of is an alias for --output-file. It sets the final file name, to which the selected file extensions are appended. Since our command includes content/transcript, there will be a file called transcript.lrc inside the content directory.
Create files in all output formats
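For completeness, a sketch that emits every format at once (flag names follow whisper.cpp’s help output; paths are assumptions):

```shell
./main -m models/ggml-large-v2.bin -f content/output.wav \
  -of content/transcript \
  --output-txt --output-vtt --output-srt --output-lrc --output-csv --output-json
```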
Modify Transcript Output for LLM
Despite the various available options for file formats, whisper.cpp outputs all of them as text files that can later be parsed and transformed. As with many things in programming, numerous approaches could be used to yield similar results.
Based on your personal workflows and experience, you may find it easier to parse and transform a different common data format like csv or json. For my purposes, I’m going to use the lrc output, which looks like this:
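The exact lines depend on your episode; here is an invented illustration of the lrc output (note the [by:whisper.cpp] signature line and the millisecond timestamps):

```
[by:whisper.cpp]
[00:00.00] Welcome back to the show.
[00:04.10] Today we're talking about AI tooling.
[00:08.52] Specifically, automatic transcription.
[00:12.80] Let's dive in.
```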
Our goal is to create a script that modifies the preceding transcript to look like this instead:
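Continuing with an invented illustration: signature removed, milliseconds stripped, and consecutive lines merged in pairs (one plausible reading of the transformations described below):

```
[00:00] Welcome back to the show. [00:04] Today we're talking about AI tooling.
[00:08] Specifically, automatic transcription. [00:12] Let's dive in.
```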
The script will need to:
Read the transcript file in the content directory.
Perform multiple transformations sequentially on the file.
Write the final output to a new file.
This will require a few different functions to properly perform each transformation, so let’s break it down step by step. First, create a new file called transform.js in the scripts directory.
To achieve the desired transformations with the given directory structure, transform.js needs to:
Read the transcript.lrc file from the content directory.
Remove the [by:whisper.cpp] signature.
Format the timestamps to remove milliseconds.
Merge every other line in the file to reduce the total number of lines by half.
Write the transformed content to a new file called transcript.md in the same directory as the original file.
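A sketch of transform.js implementing the steps above. Joining consecutive pairs of lines with a space is my reading of "merge every other line"; adjust the regexes if your whisper.cpp version formats timestamps differently:

```javascript
// transform.js — reads content/transcript.lrc, applies the transformations,
// and writes content/transcript.md.
import fs from 'node:fs';
import path from 'node:path';

// Remove the [by:whisper.cpp] signature line.
function removeSignature(text) {
  return text.replace(/^\[by:whisper\.cpp\]\s*\n?/m, '');
}

// Trim milliseconds from timestamps: [00:01.23] -> [00:01]
function stripMilliseconds(text) {
  return text.replace(/\[(\d{2}:\d{2})\.\d+\]/g, '[$1]');
}

// Merge every other line to halve the total line count.
function mergePairs(text) {
  const lines = text.split('\n').filter((line) => line.trim() !== '');
  const merged = [];
  for (let i = 0; i < lines.length; i += 2) {
    merged.push(lines[i + 1] ? `${lines[i]} ${lines[i + 1]}` : lines[i]);
  }
  return merged.join('\n');
}

const inputFile = path.join('content', 'transcript.lrc');
const outputFile = path.join('content', 'transcript.md');

if (fs.existsSync(inputFile)) {
  const raw = fs.readFileSync(inputFile, 'utf8');
  const transformed = mergePairs(stripMilliseconds(removeSignature(raw)));
  fs.writeFileSync(outputFile, transformed);
  console.log(`Wrote ${outputFile}`);
}
```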
To reiterate, this script performs the following actions:
Utilizes fs and path from Node.js core modules to handle file operations and path resolutions.
Reads the transcript.lrc file asynchronously from the content directory.
Applies the specified transformations to the file’s content.
Writes the transformed content to a new file called transcript.md in the same directory.
Run the script with the following command to read the file, perform the transformations, and save the output in the same directory:
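Assuming the directory layout above, run it from the project root:

```shell
node scripts/transform.js
```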
ChatGPT Show Notes Creation Prompt
Now that we have a cleaned up transcript, we can use ChatGPT directly to create the show notes. The output will contain six distinct sections which correspond to the full instructions of the prompt. Any of these sections can be removed, changed, or expanded:
One Paragraph Summary
One Sentence Summary
Chapters
Potential Episode Titles
Key Takeaways
Potential Future Episode Topics
Create a file called prompt.md:
Include the following prompt with the transcript after the final line:
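The exact wording is up to you; here is a sketch of a prompt requesting the six sections listed above:

```markdown
Below is the transcript of a podcast episode. Using only this transcript, create:

1. A one paragraph summary of the episode.
2. A one sentence summary of the episode.
3. Chapters with timestamps, titles, and one sentence descriptions.
4. A list of potential episode titles.
5. A list of key takeaways.
6. A list of potential future episode topics based on this conversation.

Transcript:
```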
The final step is to take the content of prompt.md, append the transcript in transcript.md, and write the combined content to a new file called chatgpt.md in the content directory.
To achieve this directly from the terminal, use the cat command to concatenate the content of scripts/prompt.md with content/transcript.md and redirect the output to create chatgpt.md in the content directory:
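The concatenation command looks like this:

```shell
cat scripts/prompt.md content/transcript.md > content/chatgpt.md
```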
This might give you a lot more than you really need for your show notes. Let’s create a shorter prompt we can use instead for quicker outputs:
If you want to reduce your prompt to just include a one sentence summary, one paragraph summary, and chapters, use the following:
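A sketch of reduced-prompt.md along those lines:

```markdown
Below is the transcript of a podcast episode. Using only this transcript, create:

1. A one sentence summary of the episode.
2. A one paragraph summary of the episode.
3. Chapters with timestamps, titles, and one sentence descriptions.

Transcript:
```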
Use scripts/reduced-prompt.md instead of scripts/prompt.md in the cat command if you want to use the reduced prompt.
Create Autogen Script to Run on Multiple Videos
Let’s combine all the previous commands into a single script. Create a file called autogen_video.sh.
Give the script executable permissions with chmod:
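Assuming autogen_video.sh lives in the project root (an assumption; place it wherever suits your layout), create it and mark it executable:

```shell
# Create the (for now empty) script file and give it execute permissions.
touch autogen_video.sh
chmod +x autogen_video.sh
```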
The --print option from yt-dlp can be used to extract metadata from the video. We’ll use the following in our script:
video_id and upload_date provide a unique name for each video.
video_webpage_url for the full video URL.
video_uploader for the channel name.
video_uploader_url for the channel URL.
video_title for the video title.
video_thumbnail for the video thumbnail.
Include the following code in autogen_video.sh:
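One way the script could look, assembled from the commands above. The whisper.cpp path, the metadata header format, and the per-video file naming are all assumptions; adjust them to your layout:

```shell
#!/bin/bash

# Usage: ./autogen_video.sh <video-url>
url="$1"

# Grab metadata with --print (one yt-dlp call per field, for clarity).
video_id=$(yt-dlp --print id "$url")
upload_date=$(yt-dlp --print upload_date "$url")
video_webpage_url=$(yt-dlp --print webpage_url "$url")
video_uploader=$(yt-dlp --print uploader "$url")
video_uploader_url=$(yt-dlp --print uploader_url "$url")
video_title=$(yt-dlp --print title "$url")
video_thumbnail=$(yt-dlp --print thumbnail "$url")
name="${upload_date}-${video_id}"

# Download and extract 16 kHz WAV audio for Whisper.
yt-dlp --extract-audio --audio-format wav \
  --postprocessor-args "-ar 16000" \
  -o "content/output.wav" "$url"

# Transcribe with whisper.cpp (the repo path is an assumption).
./whisper.cpp/main -m ./whisper.cpp/models/ggml-large-v2.bin \
  -f content/output.wav -of content/transcript --output-lrc

# Clean up the transcript, then combine a metadata header, the prompt,
# and the transcript into one uniquely named file.
node scripts/transform.js
{
  echo "# ${video_title}"
  echo ""
  echo "[${video_uploader}](${video_uploader_url}) - [Watch](${video_webpage_url})"
  echo "![](${video_thumbnail})"
  echo ""
  cat scripts/prompt.md content/transcript.md
} > "content/${name}-chatgpt.md"

# Uncomment the rm command to delete intermediate files once the final
# file is written.
# rm content/output.wav content/transcript.lrc content/transcript.md
```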
If you want the script to delete all intermediate files once the final transformation and concatenation is complete, uncomment the rm command.
At this point, the autogen_video.sh script is designed to run on individual video URLs. However, if you already have a backlog of content to transcribe, you’ll want to run this script on a series of video URLs. Let’s create another script called autogen_playlist.sh to accept a playlist URL instead of a video URL.
Include the following code in autogen_playlist.sh:
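A sketch of the playlist script: it expands the playlist into individual video URLs, then runs the single-video script on each one (script locations are assumptions for this layout):

```shell
#!/bin/bash

# Usage: ./autogen_playlist.sh <playlist-url>
playlist_url="$1"

# --flat-playlist lists entries without downloading them; --print "url"
# emits one video URL per line, which we save to urls.md.
yt-dlp --flat-playlist --print "url" "$playlist_url" > urls.md

# Run the single-video script on every URL in the list.
while read -r url; do
  ./autogen_video.sh "$url"
done < urls.md
```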
The --print "url" and --flat-playlist options from yt-dlp are used to write a list of video URLs to a new file called urls.md.
This workflow is fine for me because I only create a podcast every week or two, so I can just copy and paste the transcript into ChatGPT and copy out the output. However, it’s very possible that you could have dozens or even hundreds of episodes that you want to run this process on.
To achieve this in a short amount of time, you’ll need to use the OpenAI API and drop a bit of coin to do so. In my next blog post, I’ll be showing how to achieve this with OpenAI’s Node.js wrapper library. Once that blog post is complete I’ll update this post and link it at the end.