
This week, Google launched the beta of its music locker service, which lets you upload all your music to the cloud and listen to it from anywhere. According to TechCrunch, Google’s Paul Joyce revealed that the Music Beta killer feature is ‘Instant Mix,’ Google’s version of Genius playlists, where you select a song that you like and the music manager creates a playlist of songs that sound similar. I wondered how good this ‘killer feature’ of Music Beta really was, so I decided to evaluate how well Instant Mix works to create playlists.
The Evaluation
Google’s Instant Mix, like many playlisting engines, creates a playlist of songs given a seed song. It tries to find songs that go well with the seed song. Unfortunately, there’s no solid objective measure to evaluate playlists. There’s no algorithm that we can use to say whether one playlist is better than another. A good playlist derived from a single seed will certainly have songs that sound similar to the seed, but there are many other aspects as well: the mix of the familiar and the new, surprise, emotional arc, song order, song transitions, and so on. If you are interested in the perils of playlist evaluation, check out this talk Dr Ben Fields and I gave at ISMIR 2010: Finding a path through the jukebox. The Playlist tutorial. (Warning, it is a 300-slide deck). Adding to the difficulty in evaluating the Instant Mix is that since it generates playlists within an individual’s music collection, the universe of music that it can draw from is much smaller than a general playlisting engine such as we see with a system like Pandora. A playlist may appear to be poor because it is filled with songs that are poor matches to the seed, but in fact those songs actually may be the best matches within the individual’s music collection.
Evaluating playlists is hard. However, there is something fairly easy that we can do to get an idea of how well a playlisting engine works compared to others. I call it the WTF test. It is really quite simple. You generate a playlist and count the number of head-scratchers in the list. If you look at a song in a playlist and say to yourself ‘How the heck did this song get in this playlist?’, you bump the counter for the playlist. The higher the WTF count, the worse the playlist. As a first-order quality metric, I really like the WTF test. It is easy to apply, and it focuses on a critical aspect of playlist quality. If a playlist is filled with jarring transitions, leaving the listener with iPod whiplash as they are jerked through songs of vastly different styles, it is a bad playlist.
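The counting step of the WTF test can be sketched in a few lines of Python. This is only an illustration, assuming a listener has already marked each track in a playlist as a head-scratcher or not; the function names and the example flags are hypothetical and are not data from this evaluation.

```python
def wtf_count(flags):
    """Count tracks flagged as head-scratchers (True = WTF)."""
    return sum(1 for flagged in flags if flagged)


def wtf_rate(flags):
    """Fraction of the playlist flagged; 0.0 for an empty playlist."""
    return wtf_count(flags) / len(flags) if flags else 0.0


# Hypothetical judgments for a five-track playlist, not data from this evaluation.
example_flags = [False, True, False, False, True]
print(wtf_count(example_flags), wtf_rate(example_flags))
```

The judgment itself stays human: the code only tallies the flags, which is what makes the test easy to apply across different systems.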
For this evaluation, I took my personal collection of music (about 7800 tracks) and enrolled it into three systems: Google Music, iTunes and The Echo Nest. I then created a set of playlists using each system and counted the WTFs for each playlist. I picked seed songs based on my music taste (it is my collection of music so it seemed like a natural place to start).
The Systems
I compared three systems: iTunes Genius, Google Instant Mix, and The Echo Nest playlisting API. All of them are black-box algorithms, but we do know a little bit about them:
iTunes Genius – this system seems to be a collaborative filtering algorithm driven by purchase data acquired via the iTunes music store. It may use play, skip and rating data to steer the playlisting engine. More details about the system can be found in: Smarter than Genius? Human Evaluation of Music Recommender Systems. This is a one button system – there are no user-accessible controls that affect the playlisting algorithm.
Google Instant Mix – there is no data published on how this system works. It appears to be a hybrid system that uses collaborative filtering data along with acoustic similarity data. Since Google Music does give attribution to Gracenote, there is a possibility that some of Gracenote’s data is used in generating playlists. This is a one button system. There are no user-accessible controls that affect the playlisting algorithm.
The Echo Nest playlist engine – this is a hybrid system that uses cultural, collaborative filtering data and acoustic data to build the playlist. The cultural data is gleaned from a deep crawl of the web. The playlisting engine takes into account artist popularity, familiarity, cultural similarity, and acoustic similarity along with a number of other attributes. There are a number of controls that can be set to control the playlists: variety, adventurousness, style, mood, energy. For this evaluation, the playlist engine was configured to create playlists with relatively low variety with songs by mostly mainstream artists. The configuration of the engine was not changed once the test was started.
The Collection
For this evaluation I’ve used my personal iTunes music collection of about 7800 songs. I think it is a fairly typical music collection. It has music of a wide variety of styles. It contains music of my taste (’70s progrock and other dad-core, indie and numetal), music from my kids (radio pop, musicals), some indie, jazz, and a whole bunch of Canadian music from my friend Steve. There’s also a bunch of podcasts. It has the usual set of metadata screwups that you see in real-life collections (3 different spellings of Björk, for example). I’ve placed a listing of all the music in the collection at Paul’s Music Collection if you are interested in all of the details.
The Caveats
Although I’ve tried my best to be objective, I clearly have a vested interest in the outcome of this evaluation. I work for a company that has its own playlisting technology. I have friends that work for Google. I like Apple products. So feel free to be skeptical about my results. I will try to do a few things to make it clear that I did not fudge things. I’ll show screenshots of results from the three playlisting sources, as opposed to just listing songs. (I’m too lazy to try to fake screenshots). I’ll also give the API command I used for the Echo Nest playlists so you can generate those results yourself. Still, I won’t blame the skeptics. I encourage anyone to try a similar A/B/C evaluation on their own collection so we can compare results.
The Trials
For each trial, I picked a seed song, generated a 25-song playlist using each system, and counted the WTFs in each list. I show the results as screenshots from each system and I mark each WTF that I see with a red dot.
Trial #1 – Miles Davis – Kind of Blue
I don’t have a whole lot of jazz in my collection, so I thought this would be a good test to see if a playlister could find the jazz amidst all the other stuff.
First up is iTunes Genius:

Next up is The Echo Nest:

If you want to generate a similar playlist via our API, use this API command.
Next up is Google:

After Trial 1, scores are: iTunes: 0 WTFs, The Echo Nest: 0 WTFs, Google Music: 18 WTFs
Trial #2 – Lady Gaga – Bad Romance
Now, let’s move away from jazz into mainstream pop. Again, I don’t have too much pop in my music collection. Mostly it is from my daughter, but we don’t mix our music collections too much any more.
First up is iTunes:

Next up is The Echo Nest:

Next up, Google Instant Mix:

After Trial 2, scores are: iTunes: 2 WTFs, The Echo Nest: 0 WTFs, Google Music: 31 WTFs
Trial #3 – The Nice – Rondo
Next up is some good ol’ progressive rock. The Nice was an early progressive rock band fronted by Keith Emerson (of Emerson Lake and Palmer fame). It is hardcore late ’60s style progressive rock – keyboard heavy, frequent tempo and time signature changes, high speed, bull whips, damn the vocals stuff. This particular song is a cover of Brubeck’s Blue Rondo a la Turk. It is one of my favourite songs of all time. Really, you should have a listen. I’ll wait. I have lots of music like this in my collection. It should be pretty easy to generate playlists that keep me happy with this seed.
First up, iTunes:

Next up is The Echo Nest:

Next up is Google Instant Mix:

After Trial 3, scores are: iTunes: 2 WTFs, The Echo Nest: 0 WTFs, Google Music: 42 WTFs
Trial #4 – Kraftwerk – Autobahn
I don’t have too much electronica, but I like to listen to it, especially when I’m working. Let’s try a playlist based on the group that started it all.
First up, iTunes:

Next up, The Echo Nest:

Next up, Google:

After Trial 4, scores are: iTunes: 2 WTFs, The Echo Nest: 0 WTFs, Google Music: 60 WTFs
Trial #5 The Beatles – Polythene Pam
For the last trial I chose the song Polythene Pam by The Beatles. It is at the core of the amazing bit on side two of Abbey Road. The zenith of the Beatles’ music is (IMHO) the opening chords to this song. Let’s see how everyone does:
First up, iTunes:

Next up, The Echo Nest:

Next up, Google:

After Trial 5, scores are: iTunes: 12 WTFs, The Echo Nest: 0 WTFs, Google Music: 62 WTFs
(scores are running totals across the trials; lower scores are better)
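To make the final tally easier to read, here is a small Python sketch that breaks the scores down per trial and computes an overall WTF rate. It rests on two assumptions: that the scores reported after each trial are running totals (Google's score passes 25 after the second trial, so they cannot be per-playlist counts), and that every playlist really contained 25 songs. The function names are mine.

```python
TRACKS_PER_PLAYLIST = 25  # assumption: every generated playlist had 25 songs

# Scores as reported after each of the five trials (treated as running totals).
running_totals = {
    "iTunes Genius": [0, 2, 2, 2, 12],
    "The Echo Nest": [0, 0, 0, 0, 0],
    "Google Instant Mix": [18, 31, 42, 60, 62],
}


def per_trial(totals):
    """Turn running totals into the number of WTFs added by each trial."""
    return [cur - prev for prev, cur in zip([0] + totals[:-1], totals)]


def overall_rate(totals, tracks_per_playlist=TRACKS_PER_PLAYLIST):
    """Final WTF total divided by the number of tracks judged."""
    return totals[-1] / (tracks_per_playlist * len(totals))


for name, totals in running_totals.items():
    print(f"{name}: per-trial {per_trial(totals)}, overall {overall_rate(totals):.1%}")
```

Under those assumptions, Google Instant Mix comes out at 62 WTFs in 125 tracks, or 49.6%, and iTunes Genius at 12 in 125, or 9.6%, with most of the Genius misses coming from the Beatles trial.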
Conclusions
I learned quite a bit during this evaluation. First of all, Apple Genius is actually quite good. The last time I took a close look at iTunes Genius was three years ago. It was generating pretty poor recommendations. Today, however, Genius is generating reliable recommendations for just about any track I could throw at it, with the notable exception of Beatles tracks.
I was also quite pleased to see how well The Echo Nest playlister performed. Our playlist engine is designed to work with extremely large collections (10 million tracks) or with personal-sized collections. It has lots of options to allow you to control all sorts of aspects of the playlisting. I was glad to see that even when operating in the very constrained situation of a single seed song with no user feedback, it performed well. I am certainly not an unbiased observer, so I hope that anyone who cares enough about this stuff will try to create their own playlists with The Echo Nest API and make their own judgements. The API docs are here: The Echo Nest Playlist API.
However, the biggest surprise of all in this evaluation is how poorly Google’s Instant Mix performed. Nearly half of all songs in Instant Mix playlists were head-scratchers – songs that just didn’t belong in the playlist. These playlists were not usable. It is a bit of a puzzle as to why the playlists are so bad considering all of the smart people at Google. Google does say that this release is a beta, so we can give them a little leeway here. And I certainly wouldn’t count Google out here. They are data kings, and once the data starts rolling in from millions of users, you can bet that their playlists will improve over time, just like Apple’s did. Still, when Paul Joyce said that the Music Beta killer feature is ‘Instant Mix’, I wonder if perhaps what he meant to say was “the feature that kills Google Music is ‘Instant Mix’.”
Republished from: MusicMachinery.com 
Top art courtesy of Shutterstock

Bernhard de Kok
Tuesday, May 17, 2011 at 1:23 PM
I quite like Genius in iTunes and use it often. It matches more than just genre and really captures the mood and feel of the seed track to set the mix.
But the Beatles bit is really odd. I have all the Beatles albums and, as you found, they are not allowed to be used as a seed for Genius. I wonder if this was an oversight by Apple or a contract requirement by Apple Corps (the Beatles)?
boc
Tuesday, May 17, 2011 at 4:38 PM
Could it be that it just gets better over time?
Like Apple’s Genius, which was pretty crap when it first launched but now produces decent results.
It would be worth doing the same test in 6 months time to see what the results are like.
Ash
Wednesday, May 18, 2011 at 9:42 AM
Better to test again in 12-18 months, as Google will need 6 months of data to analyse to see what they’re doing right and what they’re doing wrong with this service.