You’re The Voice: The Science Behind Speaker Recognition Tech


September 15, 2014 at 11:00 am

You may have read reports that the Australian Tax Office (ATO) has introduced voiceprint technology which aims to do away with cumbersome identity verification processes on the telephone.

When you phone the ATO call centre, instead of supplying your date of birth, address or a password, you’re prompted to say: “In Australia my voice identifies me.” By comparing this to a previously recorded voiceprint, the technology determines whether the tax file number you gave actually belongs to you.

This article was originally published on The Conversation.

The technology that makes this possible is called “speaker recognition”. So how does it work, and how secure is it?

Speech Recognition And Speaker Recognition

Two distinct, but related, technologies use human speech as input:

Speech recognition turns speech sounds into text, and speaker recognition identifies a person based on the sound of their voice. One speech recognition system that many people are familiar with is Apple’s Siri.

Speaker recognition is what the ATO’s voiceprint system is based on. It is one of a broad range of technologies called biometrics, which identify people based on physical properties such as the sound of their voice, their fingerprint, the shape of blood vessels in their eye or the way they walk.

The science behind biometric systems such as voiceprints is based on various machine learning techniques. If you’d like to get technical, some examples are hidden Markov models, support vector machines and neural networks. These use sophisticated statistical algorithms to create biometric models of a speaker’s voice.

“My voice is my password.”

Two common ways that a biometric model can be used are to identify a person based on their voice alone, or to verify by voice whether someone is correctly claiming an identity.

According to The Sydney Morning Herald, the ATO’s voiceprint system was developed by a company called Nuance, a world leader in speech and speaker recognition. It’s very likely that the ATO uses the technology behind Nuance’s VocalPassword system, which matches a customer’s passphrase with a recording of that passphrase kept in a database.

Because a voiceprint matches a passphrase with a stored recording, it only has to verify a match rather than sort through the whole database to uniquely identify a caller based on their voice. This means the recognition process can be very fast and can work with very low-quality audio.
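To make the difference between verifying and identifying concrete, here is a minimal sketch in Python. It is not how Nuance or the ATO implement their system. It assumes, purely for illustration, that each voiceprint is a short list of numbers and that two voiceprints are compared by cosine similarity. The names `verify`, `identify` and `database`, the threshold of 0.9 and the sample values are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Similarity between two feature vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify(claimed_id, sample, database, threshold=0.9):
    """1:1 check: compare the sample only with the claimed identity's voiceprint."""
    enrolled = database.get(claimed_id)
    if enrolled is None:
        return False
    return cosine_similarity(sample, enrolled) >= threshold


def identify(sample, database, threshold=0.9):
    """1:N search: score the sample against every enrolled voiceprint."""
    best_id, best_score = None, -1.0
    for speaker_id, enrolled in database.items():
        score = cosine_similarity(sample, enrolled)
        if score > best_score:
            best_id, best_score = speaker_id, score
    return best_id if best_score >= threshold else None


# Toy voiceprints keyed by a made-up account number (illustrative values only).
database = {
    "tfn-001": [0.9, 0.1, 0.3],
    "tfn-002": [0.2, 0.8, 0.5],
    "tfn-003": [0.4, 0.4, 0.9],
}
sample = [0.88, 0.12, 0.31]  # features from the caller's spoken passphrase

print(verify("tfn-001", sample, database))  # one comparison
print(identify(sample, database))           # one comparison per enrolled speaker
```

In this sketch `verify` does a single comparison however large the database is, while `identify` does one comparison for every enrolled speaker, which is why a system that only verifies a claimed identity can respond quickly.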

Given a passphrase, the system would return a statistical likelihood that the speaker is the person who provided the original voiceprint. The ATO could select a threshold for a positive identification to ensure a good match was required.
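Choosing that threshold is a trade-off, and a small sketch can show why. The scores below are invented for illustration and do not come from any real system. The function name `error_rates` is hypothetical. A "genuine" score is from the true account holder, and an "impostor" score is from someone else claiming that identity.

```python
def error_rates(genuine_scores, impostor_scores, threshold):
    """Return (false accept rate, false reject rate) at a given threshold."""
    false_accepts = sum(1 for s in impostor_scores if s >= threshold)
    false_rejects = sum(1 for s in genuine_scores if s < threshold)
    return (
        false_accepts / len(impostor_scores),
        false_rejects / len(genuine_scores),
    )


# Made-up match scores between 0 and 1 (illustrative values only).
genuine = [0.95, 0.91, 0.88, 0.97, 0.79]
impostor = [0.30, 0.55, 0.82, 0.41, 0.62]

for threshold in (0.6, 0.8, 0.9):
    far, frr = error_rates(genuine, impostor, threshold)
    print(f"threshold={threshold}: false accepts={far:.0%}, false rejects={frr:.0%}")
```

With these toy numbers, raising the threshold lets fewer impostors through but turns away more genuine callers, who would then need to prove their identity another way. A tax office would presumably lean towards a strict threshold.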

On The Record


Engineers who develop systems such as these are very concerned with security. Much research effort has gone into what’s called “liveness detection” and “playback detection”.

These are ways to ensure that a real person is speaking the passphrase rather than a malicious person playing a recording or attempting to mimic another person’s voice.

Even so, a voiceprint system may be susceptible to what’s called a “replay attack”. If an attacker obtained a recording of you saying the exact passphrase, they would have a strong chance of accessing your account. A distinctive passphrase reduces this risk.

A voiceprint system can still identify you when you have a cold because it doesn’t model the sound of your voice directly. Instead, it uses the sound of your voice to model the shape of your vocal tract. When you have a cold the shape of your vocal tract is still the same (you just might sound a bit nasal).

But there are situations or events that could prevent voiceprint or similar systems from correctly identifying a speaker. If someone received an injury that damaged their vocal tract, it would be unlikely that a speaker recognition system would match a voiceprint made before the injury.

A very poor phone connection or high background noise could also prevent a speaker identification system from working properly.

In both of these cases, a failure to match would probably require a caller to the ATO to verify their identity by another means. The system would be extremely unlikely to misidentify someone as another person.

Systems such as voiceprints are intended to save time for callers and for call-centre workers by reducing the time it takes to verify identities – and less time on the phone with the tax office is always a good thing.

Ben Kraal receives funding from the Australian Research Council.

David Dean receives funding from the Australian Research Council for research related to speaker recognition.
