The human voice, with all its subtlety and nuance, is proving to be an exceptionally difficult thing for computers to emulate. Using a powerful new algorithm, a Montreal-based AI startup has developed a voice generator that can mimic virtually any person’s voice, and even add an emotional punch when necessary. The system isn’t perfect, but it heralds a future when voices, like photos, can be easily faked.
When Siri, Alexa or our GPS talk to us, it’s fairly obvious that we’re being spoken to by a machine. That’s because virtually every text-to-speech system on the market relies on a pre-recorded set of words, phrases and utterances (recorded from voice actors), which are then strung together in Frankenstein-like fashion to produce complete words and sentences. The end result is a vocal delivery that sounds distinctly uninspiring, robotic and at times laughable. This approach to voice synthesis also means that we’re stuck listening to the same pre-recorded, monotonous voice over and over again.
In an effort to inject some life into the automated voices that come out of our apps, AI startup Lyrebird has developed a voice-imitation algorithm that can mimic any person’s voice, and read any text with a predefined emotion or intonation. Incredibly, it can do this after analysing just a few dozen seconds of pre-recorded audio. To promote its new tool, Lyrebird produced several audio samples using the voices of Barack Obama, Donald Trump and Hillary Clinton.
Lyrebird’s demos also showcase the virtually unlimited catalogue of voices, and the system’s ability to articulate the same sentence with different intonations.
This is all made possible through the use of artificial neural networks, which function in a manner loosely similar to the biological neural networks in the human brain. Essentially, the algorithm learns to recognise patterns in a particular person’s speech, and then reproduces those patterns during simulated speech.
“We train our models on a huge dataset with thousands of speakers,” Jose Sotelo, a team member at Lyrebird and a speech synthesis expert, told Gizmodo. “Then, for a new speaker we compress their information in a small key that contains their voice DNA. We use this key to say new sentences.”
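To give a rough sense of what a compact voice "key" could look like, here is a toy Python sketch. It is not Lyrebird's method, which the company has not detailed here: the feature numbers are invented, and the function names `voice_key` and `cosine_similarity` are hypothetical. A real system would learn its key with a neural network rather than by simple averaging. The sketch only shows the general idea of squeezing a recording into one small vector and then comparing vectors.

```python
import math

def voice_key(frames):
    """Collapse a clip's per-frame feature vectors into one fixed-size vector.

    Assumption: a plain average stands in for the learned compression
    a real system would use.
    """
    dim = len(frames[0])
    count = len(frames)
    return [sum(frame[i] for frame in frames) / count for i in range(dim)]

def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented feature frames for illustration only: two clips from one
# speaker and one clip from a different speaker.
speaker_a_clip1 = [[1.0, 0.2, 0.1], [0.9, 0.3, 0.0]]
speaker_a_clip2 = [[1.1, 0.1, 0.1], [0.8, 0.2, 0.2]]
speaker_b_clip = [[0.1, 0.9, 1.0], [0.2, 1.1, 0.8]]

key_a1 = voice_key(speaker_a_clip1)
key_a2 = voice_key(speaker_a_clip2)
key_b = voice_key(speaker_b_clip)

same_speaker = cosine_similarity(key_a1, key_a2)
different_speaker = cosine_similarity(key_a1, key_b)
print(same_speaker > different_speaker)
```

In this made-up data, the two keys from the same speaker point in nearly the same direction, while the key from the other speaker does not. That separation is what lets a short key stand in for a voice.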
The end result is far from perfect — the samples still exhibit digital artefacts, clarity problems and other weirdness — but there’s little doubt who is being imitated by the speech generator. Changes in intonation are also discernible. Unlike other systems, Lyrebird’s solution requires less data per speaker to produce a new voice, and it works in real time. The company plans to offer its tool to companies in need of speech synthesis solutions.
“We are currently raising funds and growing our engineering team,” said Sotelo. “We are working on improving the quality of the audio to make it less robotic, and we hope to start beta testing soon.”
This form of speech synthesis introduces a host of ethical problems and security concerns. Eventually, a refined version of this system could replicate a person’s voice with incredible accuracy, making it virtually impossible for a human listener to discern the original from the emulation. The day is coming when vocal speech, like an image processed in Photoshop, can be manipulated without our knowing. Unscrupulous individuals could fake a speech by a prominent politician, adding yet another layer to the emerging post-truth environment. Hackers could use speech synthesis for social engineering, fooling even the most careful security experts. The possibilities are almost endless.
These potentially adverse impacts are not lost on Lyrebird, which argues that the era in which we can trust audio recordings is on the verge of coming to an end.
“We take seriously the potential malicious applications of our technology,” Sotelo told Gizmodo. “We want this technology to be used for good purposes: Giving back the voice to people who lost it to sickness, being able to record yourself at different stages in your life and hearing your voice later on, et cetera. Since this technology could be developed by other groups with malicious purposes, we believe that the right thing to do is to make it public and well-known so we stop relying on audio recordings [as evidence].”
No doubt, we’ll have to start second-guessing audio recordings of speech soon, but solutions could also be developed to ascertain the authenticity of vocal recordings. Humans may be fooled by such systems, but computers will not be — at least, not for a while. When analysing the waveform, or frequencies, of human speech, a high resolution recording can yield a tremendous amount of data for a computer to analyse. It will be a long, long time before a speech synthesis program can replicate every single aspect of a person’s distinctive speech, like the finer details of vocal timbre (that is, the quality of speech), and mouth noises such as breathing, tongue sounds and lip smacking, to the point where even a machine can’t detect the difference. There are other aspects of a recording to consider as well. For instance, the absence of background noises, the presence of a faked acoustic space, or artificially introduced ambient sounds should be easily detectable by a machine designed for the task.
Eventually, however, a speech synthesis program may be able to fake all of these things, at which point our ability to discern truth from fabrication will be put to the test.