How Shazam Identifies (Nearly) Every Song You Throw At It

Many of us are prone to using the Shazam music-identification service whenever we encounter unfamiliar songs. After all, it’s just so easy to whip out our phones, open an app and know everything about a mystery song in seconds. But how does Shazam give us all this information so quickly?

There is a cool service called Shazam, which takes a short sample of music and identifies the song. There are a couple of ways to use it, but one of the more convenient is to install their free app onto an iPhone. Just hit the “tag now” button, hold the phone’s mic up to a speaker, and it will usually identify the song and provide artist information, as well as a link to purchase the album.

What is so remarkable about the service is that it works on very obscure songs and will do so even with extraneous background noise.  I’ve gotten it to work sitting down in a crowded coffee shop and pizzeria.

So I was curious how it worked, and luckily there is a paper written by one of the developers explaining just that.  Of course they leave out some of the details, but the basic idea is exactly what you would expect: It relies on fingerprinting music based on the spectrogram.

Here are the basic steps:

1. Beforehand, Shazam fingerprints a comprehensive catalogue of music and stores the fingerprints in a database.
2. A user “tags” a song they hear, which fingerprints a 10-second sample of audio.
3. The Shazam app uploads the fingerprint to Shazam’s service, which runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user; otherwise, an error is returned.
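
The four steps above can be sketched as a toy lookup service. This is only an illustration of the flow, not Shazam's code: the names `toy_fingerprint`, `build_catalogue` and `identify` are hypothetical, and the "fingerprint" is simply the set of values in the audio so that the example runs end to end.

```python
# Toy sketch of the tag-and-lookup flow. All names here are hypothetical.

def toy_fingerprint(samples):
    """Stand-in fingerprint: the set of distinct values in the audio."""
    return frozenset(samples)

def build_catalogue(songs):
    """Step 1: fingerprint every known song ahead of time."""
    return {title: toy_fingerprint(audio) for title, audio in songs.items()}

def identify(catalogue, sample, min_overlap=0.8):
    """Steps 2-4: fingerprint the sample, search, return a title or None."""
    query = toy_fingerprint(sample)
    best_title, best_score = None, 0.0
    for title, fp in catalogue.items():
        # Fraction of the sample's features that also occur in this song.
        score = len(query & fp) / len(query) if query else 0.0
        if score > best_score:
            best_title, best_score = title, score
    return best_title if best_score >= min_overlap else None

catalogue = build_catalogue({
    "Song A": [1, 5, 9, 12, 7, 3],
    "Song B": [2, 4, 6, 8, 10, 14],
})
print(identify(catalogue, [9, 12, 7]))    # a slice of Song A
print(identify(catalogue, [20, 21, 22]))  # audio that is not in the catalogue
```

Returning `None` plays the role of the error in step 4. The real work is in what replaces `toy_fingerprint`.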

Here’s how the fingerprinting works:

You can think of any piece of music as a time-frequency graph called a spectrogram. On one axis is time, on another is frequency, and on the third is intensity.  Each point on the graph represents the intensity of a given frequency at a specific point in time. Assuming time is on the x-axis and frequency is on the y-axis, a horizontal line would represent a continuous pure tone and a vertical line would represent an instantaneous burst of white noise. Here’s one example of how a song might look:

Spectrogram of a song sample with peak intensities marked in red. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 1A,B.
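
To make the idea concrete, here is a minimal pure-Python sketch of a spectrogram, assuming a plain short-time Fourier transform with a rectangular window. The function name `spectrogram` and the `window` and `hop` parameters are my own choices for illustration; a real implementation would use an FFT library and a smoother window.

```python
import cmath
import math

def spectrogram(signal, window=64, hop=32):
    """Return a list of frames; each frame holds one magnitude per frequency bin."""
    frames = []
    for start in range(0, len(signal) - window + 1, hop):
        chunk = signal[start:start + window]
        frame = []
        for k in range(window // 2):  # bins below the Nyquist frequency
            # Direct DFT of this chunk at bin k (slow, but dependency-free).
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / window)
                      for n, x in enumerate(chunk))
            frame.append(abs(acc))
        frames.append(frame)
    return frames

# A pure tone that completes 8 cycles per 64-sample window, so it sits in bin 8.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
spec = spectrogram(tone)

# Index of the loudest bin in each frame.
loudest = [max(range(len(frame)), key=frame.__getitem__) for frame in spec]
print(loudest)  # → [8, 8, 8, 8, 8, 8, 8]
```

The loudest bin is the same in every frame, which is the horizontal line a continuous pure tone draws on the graph.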

The Shazam algorithm fingerprints a song by generating this 3D graph and identifying frequencies of “peak intensity”. For each of these peak points, it keeps track of the frequency and the amount of time from the beginning of the track.  Based on the paper’s examples, I’m guessing they find about three of these points per second. [Update: A reader notes that in his own implementation he needed more like 30 points per second.]   So an example of a fingerprint for a 10-second sample might be:
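
One way to picture the peak-picking step is the sketch below. It assumes a peak is any cell that is strictly louder than all of its neighbours in time and frequency; the paper does not spell out the exact rule, and `find_peaks`, `radius` and `min_intensity` are hypothetical names. The spectrogram values are made up.

```python
def find_peaks(spec, radius=1, min_intensity=1.0):
    """Return (frame_index, bin_index) for every cell that is the strict
    maximum of its (2*radius+1)-square neighbourhood."""
    peaks = []
    for t, frame in enumerate(spec):
        for f, value in enumerate(frame):
            if value < min_intensity:
                continue  # too quiet to be worth keeping
            neighbours = [
                spec[tt][ff]
                for tt in range(max(0, t - radius), min(len(spec), t + radius + 1))
                for ff in range(max(0, f - radius), min(len(frame), f + radius + 1))
                if (tt, ff) != (t, f)
            ]
            if all(value > n for n in neighbours):
                peaks.append((t, f))
    return peaks

# Hypothetical 4-frame, 5-bin spectrogram (rows are time, columns are frequency).
spec = [
    [0, 1, 0, 0, 0],
    [1, 9, 1, 0, 0],
    [0, 1, 0, 2, 7],
    [0, 0, 0, 1, 2],
]
print(find_peaks(spec))  # → [(1, 1), (2, 4)]
```

Multiplying the frame index by the hop duration and the bin index by the bin width would turn each pair into the time and frequency values the fingerprint stores.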

Shazam builds their fingerprint catalogue out as a hash table, where the key is the frequency.  When Shazam receives a fingerprint like the one above, it uses the first key (in this case 823.44), and it searches for all matching songs. Their hash table might look like the following:
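
A minimal Python sketch of such a table, assuming each fingerprint is a list of (frequency, time) pairs. Only the 823.44 key comes from the example in the text; the song names and all other values are invented, and `build_index` is a hypothetical helper. A real system would likely quantise frequencies into bins rather than use exact floats as keys.

```python
from collections import defaultdict

def build_index(catalogue):
    """Map each peak frequency to the (song, time) pairs where it occurs."""
    index = defaultdict(list)
    for song, peaks in catalogue.items():
        for freq, t in peaks:
            index[freq].append((song, t))
    return index

# Hypothetical fingerprints: lists of (frequency_hz, seconds_from_start).
catalogue = {
    "Song A": [(823.44, 12.0), (1892.31, 12.5), (712.84, 13.1)],
    "Song B": [(823.44, 40.2), (301.10, 41.0)],
}
index = build_index(catalogue)

# Looking up one frequency returns every song (and time) that contains it.
print(index[823.44])  # → [('Song A', 12.0), ('Song B', 40.2)]
```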

[Some extra detail: They do not just mark a single point in the spectrogram; rather, they mark a pair of points: the "peak intensity" plus a second "anchor point". So their key is not just a single frequency, it is a hash of the frequencies of both points. This leads to fewer hash collisions, which in turn speeds up catalogue searching by several orders of magnitude by allowing them to take greater advantage of the table's constant (O(1)) look-up time. There are many interesting things to say about hashing, but I'm not going to go into them here, so just read around the links in this paragraph if you're interested.]
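
Here is a rough sketch of pairing points into a single key. It assumes peaks are given as (time frame, frequency bin) and packs the two frequency bins together with the time gap between them into one integer; as far as I can tell, the paper includes that time difference in the hash as well. The 10-bit field width, the `fan_out` parameter and the name `pair_hashes` are assumptions for illustration.

```python
def pair_hashes(peaks, fan_out=3, bits=10):
    """Pair each peak with the next few peaks in time and pack
    (first_bin, second_bin, time_delta) into one integer key.
    Returns (key, time_of_first_peak) tuples."""
    mask = (1 << bits) - 1
    peaks = sorted(peaks)  # (time_frame, freq_bin), ordered by time
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            key = ((f1 & mask) << (2 * bits)) | ((f2 & mask) << bits) | ((t2 - t1) & mask)
            hashes.append((key, t1))
    return hashes

# Hypothetical peaks: (time_frame, freq_bin).
peaks = [(0, 40), (2, 55), (3, 40), (7, 90)]
for key, t in pair_hashes(peaks, fan_out=2):
    print(f"{key:#010x} at frame {t}")
```

Two songs that both contain bin 40 no longer collide unless they also share the second bin and the same spacing, which is why the combined key is so much more selective than a single frequency.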

Top graph: Songs and sample have many frequency matches, but they do not align in time, so there is no match. Bottom Graph: frequency matches occur at the same time, so the song and sample are a match. Wang, Avery Li-Chun. An Industrial-Strength Audio Search Algorithm. Shazam Entertainment, 2003. Fig. 2B.

If a specific song is hit multiple times (based on examples in the paper, I think it needs about one frequency hit per second), it then checks to see if these frequencies correspond in time. They actually have a clever way of doing this: they create a 2D plot of frequency hits, where one axis is the time from the beginning of the track at which those frequencies appear in the song, and the other axis is the time at which they appear in the sample. If there is a temporal relation between the sets of points, then the points will align along a diagonal. They use another signal-processing method to find this line, and if it exists with some certainty, then they label the song a match.
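
One simple way to look for that diagonal is to notice that points on it all share the same difference between song time and sample time, so a histogram of those differences will show a tall spike when there is a match. The sketch below takes that approach; `best_alignment` and `bin_width` are hypothetical names, the hit times are invented, and the vote threshold for declaring a match is left to the caller.

```python
from collections import Counter

def best_alignment(hits, bin_width=0.1):
    """hits: (song_time, sample_time) pairs for one candidate song.
    Returns (offset_seconds, votes) for the most common time offset,
    or None if there are no hits."""
    if not hits:
        return None
    # Bucket each offset so that small timing errors land in the same bin.
    offsets = Counter(round((song_t - sample_t) / bin_width)
                      for song_t, sample_t in hits)
    bucket, votes = offsets.most_common(1)[0]
    return bucket * bin_width, votes

# Four hits share an offset of about 40 s; the fifth is a stray.
aligned = [(42.0, 2.0), (43.5, 3.5), (45.1, 5.1), (47.0, 7.0), (12.3, 6.0)]
# Same sample times, but the song times do not line up.
scattered = [(42.0, 2.0), (10.5, 3.5), (71.1, 5.1), (33.0, 7.0), (12.3, 6.0)]

print(best_alignment(aligned))    # four votes for one offset
print(best_alignment(scattered))  # no offset gets more than one vote
```

A song whose best offset collects many votes is a match; one whose hits are spread across many offsets only shares frequencies with the sample by coincidence.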

Top image via NextWeb

Bryan Jacobs is a Software Engineer living in San Francisco, California. He enjoys breaking down complicated topics on his blog, including the Higgs Boson, the recent financial crisis, the adaptive immune system and the flow of time. He is currently the Director of Engineering at Marin Software, makers of the world-leading paid search management platform.


(12 Comments)
  • [–]

    dawdle

    Saturday, September 25, 2010 at 10:10 PM

    I’m kind of insulted that this article only mentioned iPhone. In fact, Shazam is available on a lot of platforms, including on your desktop in Windows!

    • [–]

      Michael

      Monday, September 27, 2010 at 9:05 AM

      Why would that insult you?

  • [–]

    LJ

    Sunday, September 26, 2010 at 5:12 PM

    I would say 9 times out of 10 it doesn’t know what song it is. I’ve given up using it.

    • [–]

      Michael

      Monday, September 27, 2010 at 9:06 AM

      It’s not great for remixes of songs, and some of the random stuff you will hear on community radio it will miss, but if it got Mr. Bungle – Desert Search for the Techno Allah, it’s obviously got a huge library.

    • [–]

      DK_Son

      Monday, September 27, 2010 at 9:27 AM

      That makes it as useful as asking me what song it is lol. I’ll get it wrong 9/10 times too.

    • [–]

      Hiko

      Monday, September 27, 2010 at 9:35 AM

      shazam only works with english songs :(

      • [–]

        DansDans

        Monday, September 27, 2010 at 12:34 PM

        Bullshit

        It’s picked up every Chinese song that I’ve tested through it – the only thing is it returns the name in characters and not pinyin

    • [–]

      Shannon

      Monday, September 27, 2010 at 10:55 AM

      That’s strange; for me it works 9 times out of 10. And when it doesn’t work, it is usually because I am not close enough to the sound source.

      Although I have found that it doesn’t work when attempting to tag from TV speakers. But if the TV is playing through home theater speakers, tagging does work.

  • [–]

    DK_Son

    Monday, September 27, 2010 at 9:28 AM

    Does it work if I sing into it? Then again if I know the words to some of it I could just Google those words. Hell, if the song has words and you can pick up a few of them you don’t need Shazam.

    • [–]

      Wokwon

      Monday, September 27, 2010 at 10:21 AM

      The implementation that I use for Windows Mobile (Midomi) has a ‘sing’ mode that can often recognise humming or singing.

  • [–]

    James Mac

    Monday, September 27, 2010 at 12:28 PM

    Shazam doesn’t recognise pipe tunes.

    They’d make a small fortune at bagpiping competitions if it could though.

  • [–]

    Eglyntine

    Thursday, October 14, 2010 at 10:19 AM

    I’ve found that shazam only identifies songs available on Itunes – but I could be wrong.
