AGEofLLMs.com

AI Podcasts: From NotebookLM to NotebookLlama to Video Podcasts with F5TTS


Remember how Google recently released NotebookLM, an AI tool that acts like a research assistant but with a twist: it doesn’t just summarize articles?

AI podcasters illustrative image

It analyzes whatever documents you upload and creates audio that sounds like a conversation between podcast hosts discussing your content. The result? A realistic “podcast” where you’d be hard-pressed to tell it isn’t two real people talking. It’s amazing—and a little eerie—just how natural these conversations sound! Although, to be fair, it gets a bit boring over time, as the hosts tend to just agree on everything: "Exactly! Spot on! You got it!"

NotebookLM is still pretty basic for now, with only a few options like speeding up or slowing down playback. But the future is promising: Google plans to add ways for you to pick the type of presenter, their accent, personality, expertise, and, eventually, even their appearance when AI-generated video becomes the norm. We’re clearly just at the beginning of AI’s new wave, with big names like ChatGPT, Gemini, and even Apple Intelligence still gearing up.

So Meta, too, isn’t sitting this one out.

They’ve just launched their own “open” version of this feature, called NotebookLlama, using their Llama models to create podcast-like digests of text files.

NotebookLlama: Meta’s Take

NotebookLlama works in a four-step process. First, it pre-processes a file (like a PDF of a news article), turning it into text. Then, it uses a different Llama model to create a podcast transcript from that text. Another model then “dramatizes” the transcript, adding pauses and interruptions. Finally, it uses text-to-speech models to produce the podcast audio. Just like NotebookLM, it sounds like two people having an engaging chat!

Here’s the rundown on how NotebookLlama works now:

  1. Pre-process the PDF with the Llama-3.2-1B-Instruct model to get a .txt file
  2. Use the Llama-3.1-70B-Instruct model to write the podcast transcript
  3. Add dramatization with the Llama-3.1-8B-Instruct model
  4. Use parler-tts and bark models for realistic, back-and-forth audio
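The four stages above can be sketched as a simple Python pipeline. This is an illustrative skeleton, not Meta's actual code: the function names are my own, and the bodies are stubs standing in for the model calls the repo makes.

```python
# Hypothetical sketch of NotebookLlama's four stages chained together.
# Each stub stands in for a prompt to the model named in its docstring.

def preprocess_pdf(pdf_path: str) -> str:
    """Step 1: extract and clean text (Llama-3.2-1B-Instruct in the repo)."""
    return f"cleaned text from {pdf_path}"

def write_transcript(text: str) -> str:
    """Step 2: turn the text into a two-host transcript (Llama-3.1-70B-Instruct)."""
    return f"Host A / Host B discussing: {text}"

def dramatize(transcript: str) -> str:
    """Step 3: add pauses and interruptions (Llama-3.1-8B-Instruct)."""
    return f"[pause] {transcript} [interruption]"

def synthesize_audio(transcript: str) -> bytes:
    """Step 4: render speech audio (parler-tts and bark in the repo)."""
    return transcript.encode("utf-8")  # stand-in for actual waveform bytes

def notebookllama_pipeline(pdf_path: str) -> bytes:
    """Run all four stages in order and return the final audio."""
    return synthesize_audio(dramatize(write_transcript(preprocess_pdf(pdf_path))))
```

The key design point is that each stage is a separate model call, so you can swap any stage's model for a smaller one independently.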

You’ll need a GPU server, or an API provider, to run the 70B, 8B, and 1B models in this pipeline. For those without top-of-the-line GPUs, the process can be run on smaller models, though results may vary. Alternatively, there are GPU-rental services, of course.

Get NotebookLlama on GitHub https://github.com/meta-llama/llama-recipes/tree/main/recipes/quickstart/NotebookLlama

My Experience with AI-Generated Podcasts

I started trying this out just yesterday, using my own F5-TTS setup and a custom GPT model to generate podcasts. I’m also exploring AI-generated video options like Kling and Minimax since, honestly, people are more into watching than just listening.

Right now, here’s what my process could look like:

  1. Choose an image of the podcaster – prompt an AI image generator.
  2. Generate a voice in Udio and pair it with my F5TTS podcast script.
  3. Create video with an AI tool like Kling, Hailuo, or Runway and add subtle face movements.
  4. Combine the video clips into one file.
  5. Sync lips using FaceFusion or similar. Optionally, LivePortrait, though its video-to-video mode occasionally produces glitches around the head. LivePortrait is also more resource-heavy: while I can run FaceFusion on a 30-second video, I can only process about 5 seconds at a time with LivePortrait in ComfyUI.
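Step 4, combining the clips, is the most mechanical part. Here's a minimal Python sketch that prepares an FFmpeg concat-demuxer call; the helper name and file names are my own, and it assumes all clips share the same codecs and parameters so stream copy works (otherwise you'd re-encode instead of using `-c copy`).

```python
from pathlib import Path

def build_concat_command(clips: list[str], output: str,
                         list_file: str = "clips.txt") -> list[str]:
    """Write FFmpeg's concat list file and return the command to run.

    Uses the concat demuxer with stream copy, which joins clips without
    re-encoding; pass the returned list to subprocess.run() to execute.
    """
    # The concat demuxer reads one "file '<path>'" line per clip.
    Path(list_file).write_text("".join(f"file '{c}'\n" for c in clips))
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]
```

Usage would be `subprocess.run(build_concat_command(["clip1.mp4", "clip2.mp4"], "podcast.mp4"))`.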

With two video podcasters, of course, it's another level of complexity. I'll probably cover that another time in more detail, if and when I can demonstrate something.

For audio podcasts? You don't even need Google's NotebookLM or Meta's NotebookLlama. ChatGPT can make an engaging podcast script for you right now. Here's my podcast text generator you can use for free. And F5-TTS can turn that text into audio. Of course, you'd have to monitor for 'hiccups' like occasional mispronunciations, but there aren't many if you select a good voice sample, and they're totally fixable. And no egregious GPU requirements. I guess I should dedicate a separate article to the process.
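As a rough illustration of the F5-TTS half of that workflow, here's how the inference call could be assembled in Python. The flag names follow the project's `f5-tts_infer-cli` tool as I know it, but verify them against your installed version; the helper function itself is hypothetical.

```python
def build_f5tts_command(ref_audio: str, ref_text: str, gen_text: str) -> list[str]:
    """Assemble an F5-TTS voice-cloning inference call.

    ref_audio/ref_text: a short, clean sample of the target voice and its
    transcript; gen_text: the podcast script to synthesize in that voice.
    Flag names are an assumption based on f5-tts_infer-cli; check yours.
    """
    return [
        "f5-tts_infer-cli",
        "--model", "F5-TTS",
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]
```

The quality of `ref_audio` matters far more than any flag: a clean, well-paced sample is what keeps the mispronunciation 'hiccups' rare.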

F5-TTS Podcast Example

F5-TTS can generate great audio for monologues or dialogues, like this podcast. The static video with a waveform was generated with FFmpeg, conveniently implemented by AI Video Composer - all free tools.
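For the curious, a static-image-plus-waveform video like that can be produced with FFmpeg's `showwaves` filter directly. A minimal sketch, where the helper name and the exact sizes and positions are my own choices:

```python
def build_waveform_command(image: str, audio: str, output: str) -> list[str]:
    """Build an FFmpeg command: loop a still image, overlay a live waveform.

    showwaves renders the audio as a moving line; overlay places it near
    the bottom of a 1280x720 frame. -shortest stops when the audio ends.
    """
    filt = ("[1:a]showwaves=s=1280x200:mode=line:colors=white[w];"
            "[0:v][w]overlay=0:520[v]")
    return [
        "ffmpeg", "-loop", "1", "-i", image, "-i", audio,
        "-filter_complex", filt,
        "-map", "[v]", "-map", "1:a",
        "-shortest", output,
    ]
```

Run it with `subprocess.run(build_waveform_command("host.png", "episode.wav", "episode.mp4"))`.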

The Future of AI-Generated Video Podcasts: Hold on to Your Butts!

There are still a lot of bugs, especially with lip-syncing, but I’m optimistic this tech will improve fast.

And that means soon, we’ll probably see an explosion of AI-generated YouTube channels featuring two-host podcasts that look entirely real.

Some of it will be great content; others, frankly, will be low-quality junk, flooding the platform. It’s always a double-edged sword with new tech—democratized tools let anyone have a shot, but they also open the door to a lot of people just churning out trash.

We’re heading straight into the eye of an AI-generated content shitstorm, and it’s going to be like a digital locust plague. Mark my words: once AI video generation finally gets the lip-syncing and facial animations right, we’re looking at an all-you-can-stomach buffet of fake, formulaic, and recycled drivel, served up by digital talking heads with the charisma of moldy tofu. You think we’ve seen low-effort content on YouTube now? Strap in, because this is just the appetizer.

But, as usual, platforms will likely catch on and filter out the worst of it before long. They better.

Last modified 12 January 2025 at 11:37
