Speech Recognition Isn’t Dead

Robert Fortner’s penned a fascinating post-mortem on speech recognition software. That’s right, post-mortem. Because it’s apparently very, very dead! Except that it’s not.

Fortner’s analysis is thorough, and to be fair, he’s speaking mostly about the apparent stagnation of speech recognition research and accuracy improvement, not declaring speech recognition dead outright. The thesis:

The accuracy of computer speech recognition flat-lined in 2001, before reaching human levels. The funding plug was pulled, but no funeral, no text-to-speech eulogy followed. Words never meant very much to computers – which made them ten times more error-prone than humans. Humans expected that computer understanding of language would lead to artificially intelligent machines, inevitably and quickly. But the mispredicted words of speech recognition have rewritten that narrative. We just haven’t recognised it yet.

The evidence, to irresponsibly summarise, runs something like this:

• The basic techniques for machine speech recognition haven’t truly changed, well, ever.

• Speech recognition word error rates fell precipitously from the early nineties to around 2001, but accuracy has since plateaued at somewhere around 80 per cent.

• It’s proven difficult to formalize grammar rules in a way that computers can understand and use, leaving speech recognition dependent nearly entirely on interpreting sounds, not context.
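
For anyone wondering how a number like "80 per cent" gets measured: the usual yardstick is word error rate (WER), the number of word substitutions, deletions and insertions needed to turn the recogniser's output into the reference transcript, divided by the number of words in the reference. Here's a minimal sketch in Python. The function name and the sample phrases are made up for illustration, and treating accuracy as 1 minus WER is a common shorthand, not necessarily how the figures Fortner cites were computed.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; splits on whitespace only,
    with no case folding or punctuation handling.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # matching i reference words against nothing = i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# One wrong word out of five: WER of 0.2, loosely "80 per cent accurate".
print(word_error_rate("call my mobile phone now", "call my mobile home now"))  # 0.2
```

Note that insertions count as errors too, so WER can climb above 100 per cent when the recogniser outputs far more words than were spoken.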

Anyway, just give it a read. When you do, though, see if you notice what’s missing. (Hint: It’s in your pocket, probably.)

Fortner makes fleeting references to mobile and phone-centric speech recognition, and notes that current technology is actually powerful enough to deal with the kinds of input needed for call centre phone tree navigation, voice dialling and whatnot. What’s conspicuously missing is the speech recognition we’re seeing more and more every day: Mobile! Phones. Mobile phones.

Android has it, and Google has taken it to other platforms; Apple appears to be very interested in expanding voice search on their phones, and not just for simple one- or two-word queries. Apps from the very companies Fortner implies are waning (Dragon Dictation, for one) have proven extremely popular (and impressive) on the iPhone. The implicit assumption in the piece is that if desktop speech-to-text is on the wane – and it’s pretty clear that it is – then so follows the entire dream of talking to a computer, period. That’s where I disagree. In failing to find a place on the desktop, speech recognition has been forced to a place – mobile – where long-form dictation isn’t as vital, and where its uses are much, much wider. I may never be able to tell my PC exactly what to do, but I won’t really care – I’ll be too busy talking with my phone. [Robert Fortner via Techmeme]

(4 Comments)
  • Shane

    Tuesday, May 4, 2010 at 8:39 AM

    “I may never be able to tell my PC exactly what to do, but I won’t really care” – and it’s therefore obvious you aren’t a disabled person who has to use voice recognition to use a computer.

  • Astro

    Tuesday, May 4, 2010 at 1:16 PM

    I disagree with this article. The company I am working for is now deploying a large system that uses speech recognition capabilities. The system is part of a training system where complex phraseology is used. The technology we are using for the system does not require training and we are achieving above 95% recognition rates, and that is for users that have not yet been “trained” by the system.

    The technology has improved quite significantly in the last few years. It has not really been seen in the consumer style systems, but is certainly being used with very good results in specific high end applications.

  • Radim

    Tuesday, May 4, 2010 at 4:31 PM

    I agree with Astro:
    … the technology has improved a lot. I work at a company producing a dictation system and we get accuracy of about 90% for telephone recordings (bad microphone, poor articulation). If you have an average microphone and good articulation, you can get over 95%. You can raise this number by adding words to the machine’s vocabulary.

    Note: Speech recognition (SR) is more general, and speech-to-text and text-to-speech are only part of it. SR also covers the following: keyword spotting/search in audio (i.e. spoken term detection in audio without transcription); speaker identification/verification/search; language identification; emotion detection; etc.

  • Kahuna

    Thursday, May 6, 2010 at 2:36 AM

    The article is a crock. I’m sitting on my sofa, dictating this reply using MacSpeech Dictate and I rarely use either a mouse or keyboard with my iMac for anything. It’s not that I’m in any way handicapped, I’m just plain lazy.

    Using speech recognition and speech navigation is the next step beyond the touchscreen, which seems to be the rage right now.

    Someday, everyone will be able to talk to their computers.
