One of the most unsettling moments in Stanley Kubrick’s 2001: A Space Odyssey is when it’s revealed that HAL 9000 can read lips, leaving no secrets between the astronauts and the ship’s computer. That might have been science fiction, but 15 years after the events of that film, researchers in the real world have finally taught computers how to read lips.
LipNet, developed by researchers at the University of Oxford Computer Science Department, isn’t the first software designed to predict what a person is saying by analysing the movement of their lips. But it’s by far the most accurate, achieving an impressive 93.4 per cent accuracy, compared to just 52 per cent for an experienced human lip reader.
So what’s the “secret sauce” that makes LipNet so adept at reading lips? Here’s how the researchers’ abstract explains what makes their approach different, and better:
Lipreading is the task of decoding text from the movement of a speaker’s mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
So what does all that mean in English? Based on previous research, the computer scientists realised that humans are better at reading lips, and deciphering what’s being said, when longer words are spoken. So instead of analysing footage of someone speaking one word at a time, LipNet goes a step further by taking entire sentences into consideration, using deep learning techniques to then backtrack and decipher each word.
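For the curious, here is a rough idea of how that last "backtrack and decipher" step can work. The connectionist temporal classification approach named in the abstract has the network score every possible character, plus a special "blank", for each video frame; the text is then recovered by merging repeated characters and throwing away the blanks. The sketch below shows only the simplest (greedy) version of that idea in plain Python. It is not LipNet's code: the function name, the tiny four-symbol alphabet and the frame scores are all made up for illustration, and the researchers' own decoder may well differ.

```python
BLANK = "_"  # assumed symbol for the CTC "blank" (no character) label

# Hypothetical toy alphabet: index 0 is the blank, the rest are letters.
ALPHABET = [BLANK, "b", "i", "n"]


def ctc_greedy_decode(frame_scores, alphabet):
    """Pick the best label per frame, merge repeats, then drop blanks."""
    # Step 1: index of the highest-scoring label for every video frame.
    best = [max(range(len(scores)), key=scores.__getitem__)
            for scores in frame_scores]

    # Step 2: keep a label only when it differs from the previous frame's
    # label (merging repeats) and is not the blank (dropping blanks).
    decoded = []
    previous = None
    for index in best:
        if index != previous and alphabet[index] != BLANK:
            decoded.append(alphabet[index])
        previous = index
    return "".join(decoded)


# Invented scores for seven frames; the best labels are b b _ i n n _
frames = [
    [0.10, 0.70, 0.10, 0.10],
    [0.20, 0.60, 0.10, 0.10],
    [0.80, 0.10, 0.05, 0.05],
    [0.10, 0.10, 0.70, 0.10],
    [0.10, 0.10, 0.10, 0.70],
    [0.20, 0.10, 0.10, 0.60],
    [0.90, 0.00, 0.05, 0.05],
]

print(ctc_greedy_decode(frames, ALPHABET))  # prints "bin"
```

The blank is what lets the same letter appear twice in a row: two frames of the same letter separated by a blank decode to a double letter, while two adjacent frames of that letter collapse into one.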
But what does this mean for those of us outside academia? Running on a smartphone, fed a live feed from a body-worn camera, LipNet could serve as an amazing tool for the hearing impaired. Even if they already know how to lip read, it could help boost their understanding while watching someone speak. And those without lip reading skills wouldn’t be frustrated when a person they’re speaking to doesn’t know sign language.