All you have to do is speak into a microphone for around 30 minutes while attentively reading a script. After starting and stopping hundreds of times to re-record your stutters and mumbles, you’ll submit the resulting audio files for processing and, within a few hours, be notified that a duplicate of your voice is ready and waiting. Then you can type whatever you want into a chat box, and your AI clone will speak it back to you, with audio convincing enough to mislead even friends and family – at least for a few seconds. Many people may be unaware that such services exist, and I don’t believe we’ve begun to seriously contemplate the implications of easy access to this technology.
Speech synthesis has improved dramatically in recent years thanks to advances in machine learning. Previously, the most convincing synthetic voices were built by recording a human voice actor, slicing their speech into component sounds, then stitching those sounds back together like letters in a ransom note to form new words. Now, neural networks can be trained on unsorted audio of a target speaker and generate raw speech from scratch. The result is faster, simpler, and more realistic. The quality straight out of the machine still isn’t great (though manual tuning helps), but it’s only going to get better in the near future.
Because there is no secret sauce involved in creating these clones, hundreds of businesses already offer comparable services. Just Google “AI voice synthesis” or “AI voice deepfakes,” and you’ll discover how widespread the technology is. It’s available from firms that specialize in speech synthesis, such as Resemble AI and Respeecher, and it has also been built into bigger platforms such as Veritone (where the technology is part of its advertising repertoire) and Descript (which uses it in the software it makes for editing podcasts).