Imagine the worst-case scenario. Dubious filmmakers use artificially intelligent computers to feed raw audio into a simulated version of Barack Obama. The audio is actually Obama’s voice, and the face really is his face. But the lip movements? Totally fake. The filmmakers publish the video on the internet, and it’s virtually impossible to see that it’s a fake, because the technology is so good. This is not a hypothetical situation any more.
Researchers at the University of Washington have developed a method that uses machine learning to study the facial movements of Obama and then render real-looking lip movement for any piece of audio. That means they can make videos of Obama saying pretty much anything they want, in whatever setting they want. The effect works especially well when they use random audio of Obama’s voice — say, an old recording of Obama as a law student — and make it look like Obama said these things yesterday.
This new development builds on a growing body of research into creating realistic videos of people speaking without actually recording them with a video camera. In the past, a similar lip-synching effect was achieved by recording several people saying the same sentences over and over to capture the specific mouth movements needed to make each sound. The University of Washington team streamlined this process, however, by feeding large quantities of footage showing one person (Obama) speaking into a neural network, and then using algorithms to determine the differences in mouth movements. They chose Obama because there are so many hours of video of him speaking in the public domain.
The lip-synching problem is an especially challenging one, the researchers say, because humans are incredibly good at spotting tiny visual inaccuracies in speech. “If you don’t render teeth right or the chin moves at the wrong time, people can spot it right away and it’s going to look fake,” lead author Supasorn Suwajanakorn said in a statement. “So you have to render the mouth region perfectly to get beyond the uncanny valley.”
For the ultimate demo, the researchers use years-old audio of Obama speaking on a talk show and to a news crew at Harvard and then create new-looking video of Obama in the Oval Office reciting the lines. It’s not perfect, but it’s damn close.
The new breakthrough builds on the same University of Washington research team’s previous work involving training computers to recognise certain personas, like Tom Hanks. By identifying what traits make a particular face and its expressions unique, the team developed a method that would allow them to create moving, 3D renderings of a specific face using a photo or short video clip. From there, they could effectively turn the simulations into puppets. They even made a simulated Barack Obama give a George W. Bush speech.
Of course, there are other teams working on similar problems around the world. And you know what? They’re all getting really good at creating incredibly realistic fake videos, even with low-budget equipment. Last year, for instance, one Stanford team created a method of facial reenactment that could be performed with any cheap consumer webcam. It’s incredibly creepy.
While you can imagine the conspiratorial implications of technology like this, the practical applications are much more mundane. For instance, the researchers think that this sort of technology would make video chat better, since a computer could generate an image of you speaking if the always-awful video feed cuts out. Alternatively, museums and theme parks could use old recordings of historic figures to create videos or holograms showing them giving famous speeches, using the actual audio from the events.
But still, the capacity to use easy-to-access technology to create fake images and video is growing by the day. Just last week, security researcher Greg Allen published a cautionary tale of sorts in Wired: “AI Will Make Forging Anything Entirely Too Easy.” Allen writes:
Combined, the trajectory of cheap, high-quality media forgeries is worrying. At the current pace of progress, it may be as little as two or three years before realistic audio forgeries are good enough to fool the untrained ear, and only five or 10 years before forgeries can fool at least some types of forensic analysis. When tools for producing fake video perform at higher quality than today’s CGI and are simultaneously available to untrained amateurs, these forgeries might comprise a large part of the information ecosystem. The growth in this technology will transform the meaning of evidence and truth in domains across journalism, government communications, testimony in criminal justice, and, of course, national security.
As this week’s research shows, that pace of progress is picking up, fast. The good news is that technology like that being developed at the University of Washington might also be used to spot media forgeries. The bad news? Well, Allen sums it up pretty well when he says this technology will “transform the meaning of evidence and truth.” If you thought fake-looking news websites were a problem, just imagine what a completely fake police bodycam video could do.