Deepfake Voice Cloning for Podcasting (Descript vs. ElevenLabs)
More than ever, we live in a world where we can’t always believe our eyes and ears. Faking video or audio of politicians or celebrities saying or doing things they never said or did is getting easier every day. Actors and voiceover talent are rightly worried about their jobs, while fake news and scams threaten the rest of us.
So yeah, deepfake tech is scary - but could it also be a useful tool that simplifies our work without threatening anyone’s job or the fabric of democracy?
Testing Voice Cloning Apps
To find out, I tested a handful of the top AI voice deepfake apps - and mostly, the results were disappointingly robotic-sounding. One even gave me a Southern accent (I’m from Canada).
Two stood out as having real-world production possibilities though: Descript and ElevenLabs.
And just for fun, you can listen to this full blog post as read by AI-me:
[Audio: this blog post read by AI-me]
(Main audio generated using an ElevenLabs “professional” voice clone trained on 45 minutes of my YouTube videos, which is less than recommended. Samples from other voice clones are inserted where noted.)
TLDR: Try Descript & ElevenLabs
These two tools stood out from the pack, but for very different reasons.
Descript is a whole production suite that makes it easy to edit audio via an AI-generated transcript. Voice cloning in Descript integrates seamlessly with the rest of your podcast production workflow and lets you fix bad edits, change a misspoken word or two, remove audio distractions, and improve voice performance issues in your original recording. Descript’s fakes hold up well when replacing one to three words in a row, but become noticeable after that.
ElevenLabs on the other hand produces the best voice clones I’ve ever heard, but it doesn’t integrate into podcast production workflows very well. You have to generate the audio, download it, and then bring it into your audio editing software. The clones sound pretty good though - so you can get away with using them for longer chunks of a script.
For those reasons, both these tools earn a place in our podcast production toolbox.
Keep reading to hear samples and learn more.
Use Cases & Ethics
I’ve focussed on audio-only deepfakes here because affordable/accessible deepfake video tech just isn’t high enough quality to use in production (yet).
I’m also going after the most benign use cases: fixing an unintelligible or incorrect line of dialogue, or adding something for clarity when we can’t get the podcast host or interviewee back in the studio.
Basically, we’re using these tools to improve a production, with the express consent of the person we’re faking – not replacing voiceover talent with AI or making anyone say anything they don’t want said.
Another valuable use case came up working with a client a couple of years ago. She’s a longtime podcaster who developed physiological difficulties speaking. We worked together using Descript to replace her voice in podcast interviews, intros, and outros with a clone. And it actually worked reasonably well, especially with some manual finessing during the edit (speeding up parts, changing the length of gaps between words, etc.). Still, I was curious to see the state of voice cloning in 2024 vs. back then.
Descript
Descript has a huge convenience factor for us in that we already have a subscription and use it as part of our production workflow. Descript “Pro” with an unlimited AI Voice vocabulary costs US$30/month, and you get a lot of other handy production tools included at that price.
There are a couple of ways to use deepfake voice clones in Descript:
Regenerate allows you to replace shorter bits of audio with an AI generated voice clone (up to 250 characters). Basically you can magically get a second or third “take” of a speaker restating what they originally said, even if you no longer have access to that person.
We use Regenerate to fix bad edits where the speaker’s inflection changes in such a way that the edit sounds obvious or odd; to fix an issue like a speaker trailing off mid-sentence; or to fix a major audio issue like a car horn during a key moment. There’s no training or approval step: just highlight a section, click “regenerate”, and more often than not Descript does a good job matching the speaker’s voice and tone.
It’s like a magic wand for fixing audio issues, but you can’t make people say something they didn’t originally say - for that, Descript offers AI Speakers.
AI Speakers (formerly Overdub) is more of a “true” voice clone. You train the model by reading a short script provided by Descript. This can be recorded live, or uploaded as an audio file (handy for client work). The script is only about 45 seconds long, so it’s kind of amazing how accurate it ends up being.
Real me:
[Audio sample: real voice]
Descript AI Voice me:
[Audio sample: Descript AI Voice clone]
It sounds good considering how quick and easy it was to create – but really, who are we kidding? This 10-second clip isn’t fooling anyone, and anything longer really falls apart. Descript shines when replacing a handful of words in a row because the workflow is so quick and easy. For anything longer, ElevenLabs is going to be your go-to.
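If you want to turn that rule of thumb into something your team can reuse, here is a minimal Python sketch. It is only an illustration: the three-word cutoff and the 250-character Regenerate limit are the numbers from this post, while the function names `suggest_tool` and `fits_regenerate` are made up for the example and are not part of Descript or ElevenLabs.

```python
# Rule-of-thumb values taken from this post, not from either vendor's docs.
MAX_WORDS_FOR_DESCRIPT = 3   # fakes hold up for roughly 1-3 words in a row
REGENERATE_MAX_CHARS = 250   # Regenerate limit mentioned above


def suggest_tool(replacement_text: str) -> str:
    """Suggest a tool for a patch, based only on how many words it has."""
    word_count = len(replacement_text.split())
    if word_count <= MAX_WORDS_FOR_DESCRIPT:
        return "descript"
    return "elevenlabs"


def fits_regenerate(selection_text: str) -> bool:
    """Check a selection against the 250-character Regenerate limit."""
    return len(selection_text) <= REGENERATE_MAX_CHARS


print(suggest_tool("twenty twenty-four"))  # descript
print(suggest_tool("We recorded this episode before the update shipped"))  # elevenlabs
```

Word count is a crude proxy, of course. Your ears are still the final check on whether a patch holds up.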
ElevenLabs
ElevenLabs is focussed exclusively on text-to-speech, voice cloning, voice translation and other AI voice tricks - so it costs less, but you don’t get a full production suite like Descript. The US$5/month “Starter” plan gives you access to “instant” voice clones and around 30 minutes of generated audio per month. Upgrading to the “Creator” plan for US$22/month gives you access to “professional” level voice clones that accept more training data and can sound more realistic, plus ~2 hours of generated audio monthly.
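To compare the two plans on a per-minute basis, here is a quick back-of-the-envelope calculation in Python. It uses the prices and approximate monthly allowances quoted above (“around 30 minutes” and “~2 hours”), so treat the results as rough estimates rather than official figures.

```python
# Prices and approximate monthly audio allowances as quoted in this post.
plans = {
    "Starter": {"usd_per_month": 5, "minutes": 30},
    "Creator": {"usd_per_month": 22, "minutes": 120},
}


def cost_per_minute(usd_per_month: float, minutes: float) -> float:
    """Monthly price divided by the minutes of generated audio included."""
    return usd_per_month / minutes


for name, plan in plans.items():
    print(f"{name}: ${cost_per_minute(**plan):.2f} per generated minute")
# Starter: $0.17 per generated minute
# Creator: $0.18 per generated minute
```

So on these numbers the Creator plan is not cheaper per minute. What you are paying for is the “professional” clone and the larger monthly allowance.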
Unlike Descript, there isn’t any safety check or approval process for using ElevenLabs instant voices. You’re simply asked to upload up to 25 samples of audio under 10MB each, which are then used to create your instant clone. Here’s how it sounds with just the same 45-second script-read I used in the Descript example:
[Audio sample: ElevenLabs instant clone, 45-second script read]
It sounds significantly more natural than Descript, and unlike Descript, ElevenLabs provides the ability to fine-tune the output a little with settings sliders for “Stability”, “Similarity”, and “Style Exaggeration” - each of which can impact the clone’s performance.
Combine a little fiddling with those sliders with some extra training data (I used 45 minutes of my publicly available YouTube videos), and you get this:
[Audio sample: ElevenLabs instant clone, 45 minutes of training audio]
The fact that I can generate an instant clone of this quality in just a few minutes based on publicly available audio, without consent, is more than a little frightening. You do have to check a box saying you have the right and consent to clone this voice… but it’s a checkbox…
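On the practical side, if you are preparing samples of your own voice (or a consenting client’s), the instant-clone upload limits mentioned earlier (up to 25 samples, each under 10MB) are easy to check before you start uploading. Here is a minimal sketch. The helper names are hypothetical, and it assumes “10MB” means 10,000,000 bytes, which is the stricter reading.

```python
from pathlib import Path

MAX_SAMPLES = 25          # limit quoted in this post
MAX_BYTES = 10_000_000    # assumption: "10MB" read as decimal megabytes


def check_samples(sizes: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the set looks fine."""
    problems = []
    if len(sizes) > MAX_SAMPLES:
        problems.append(f"{len(sizes)} samples, limit is {MAX_SAMPLES}")
    for name, size in sorted(sizes.items()):
        if size >= MAX_BYTES:
            problems.append(f"{name} is {size / 1_000_000:.1f} MB, must be under 10MB")
    return problems


def sizes_in_folder(folder: str) -> dict[str, int]:
    """Map each file name in a folder to its size in bytes."""
    return {p.name: p.stat().st_size for p in Path(folder).iterdir() if p.is_file()}
```

Calling `check_samples(sizes_in_folder("samples"))` on a folder of clips (the folder name is just an example) gives you a list of anything that needs trimming or splitting.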
ElevenLabs also allows you to create “Professional” voice clones using a minimum of ten minutes of training audio, though they say 3 hours is optimal. For these higher quality clones there is a verification process, so you’re limited to only cloning your own voice. I’d say that’s a good thing, because here’s how my professional clone sounds using the same 45 minutes of my publicly available YouTube videos as training data (significantly less than the recommended 180 minutes):
[Audio sample: ElevenLabs professional clone]
Conclusions
So, do AI deepfake voice clones have a place in podcast production?
I think so, yes. Both Descript and ElevenLabs’ solutions have their strengths, and both have the potential to make production faster and easier - while accomplishing feats that wouldn’t be possible without these tools.
Descript is incredibly easy to integrate into a workflow and has the bonus of paying more than lip service to consent and authorization. The quality of the output isn’t as good as ElevenLabs’, but it works well enough for correcting short bits of audio.
ElevenLabs’ output quality is getting frighteningly realistic, but you can tell the AI doesn’t understand what it’s saying, and the performance suffers as a result. It’s also disconcerting how easy it is to clone any voice without consent.