‘Anonymised’ Data Is Meaningless Bullshit

When most of us think of how the concept of “data” has been skewered by the press, we’re probably thinking about an app’s location data tipping off our home address, or apps like Grindr tipping advertisers off about our sexuality. What’s less scrutinised, both by the public and by those in public office, is data that’s “anonymised”—tied to something like an IP address, rather than a name—even though that’s a concept we’ve seen to be bullshit time and again.

The latest proof comes courtesy of Dasha Metropolitansky and Kian Attari, two Harvard students who recently built a tool that combs through troves of consumer datasets uploaded from breaches across the web. As Metropolitansky and Attari told Motherboard, their program was created to link together not-so-anonymous information—like emails or usernames—back to any “anonymous” data that was found in a decade’s worth of data breaches from nearly a thousand different domains, from Adobe to YouPorn.

And—surprise, surprise—despite the bulk of these datasets being “anonymised,” identifying someone caught up in a given leak wasn’t difficult at all, according to the researchers.

First, let’s get some facts out of the way. Big shadowy data brokers, by and large, aren’t going to store anything explicitly personal about you—the person reading this story—simply because there’s no value in it. Even though the ads stalking us around the web might seem to suggest otherwise, marketers give no shits about your hopes, your dreams, your fears, the gym you go to or how you sexually identify—at least not on an individual level. What they do care about is catering a specific ad to a specific demographic, which is something that’s ultimately gleaned from where you live, where you shop, and—yes, in some cases—whether you’re queer-identified.

Here’s a personal example: Based on my NYC-based paper trail, which involves purchases at Petco, Goodwill, and some of my city’s many gay bars, marketers can realistically market me anything related to cats, thrift stores, or anything bisexual with the confidence that they’re not wasting money when targeting me with ads. They don’t need to know who I am, per se; they just need a way to reach the target demographic that I just so happen to be a part of.

Major data brokers have reams of aggregated intel on me that’s incredibly valuable because it can plop me into one of those demos with a surprising degree of accuracy. None of these data points are necessarily going to be tied to me, Shoshana, because they don’t have to be to make other people money. What this data is tied to might be something like my computer’s unique IP address or my phone’s mobile ad identifier, which are, on their own, anonymous.

But even that particular data point isn’t truly worth that much: advertisers, on a day-to-day basis, are looking at my data (and yours) as it’s aggregated with data from an untold number of other people. A person’s individual “data,” on its own, is pretty much worthless; after all, marketers can’t guarantee that I’ll be clicking on a given ad or buying the product they’re selling. What is valuable is that data in aggregate, even if it’s “anonymised” and not tied to any one individual. This is why Facebook, for example, can say that it’s earning roughly $US26 ($39) a pop from every user plugged into its system. The only reason it can say that is that it’s monitoring what billions of people in aggregate are doing on its platform and off.

While one data broker might only be able to tie my shopping behaviour to something like my IP address, and another broker might only be able to tie it to my rough geolocation, that’s ultimately not much of an issue. What is an issue is what happens when those “anonymised” data points inevitably bleed out of the marketing ecosystem and someone even more nefarious uses them for, well, whatever. Use your imagination. In other words, when one data broker springs a leak, it’s bad enough, but when dozens spring leaks over time, someone can piece that data together in a way that’s not only identifiable but chillingly accurate.
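For the curious, here’s roughly what that piecing-together looks like in practice. To be clear, this is not the Harvard students’ tool, whose inner workings aren’t described here. It’s a toy sketch in Python, and everything in it is invented for illustration: the three “leaks,” the field names, the `link_records` function, and the person. The idea is simply that any two records sharing an identifier (an IP address, a username, an email) get folded into one profile.

```python
# Toy illustration only: all records, field names, and people are made up.
LEAK_A = [  # hypothetical ad broker: shopping habits keyed to an IP, no name
    {"ip": "203.0.113.7", "purchases": "cats, thrift stores"},
    {"ip": "203.0.113.9", "purchases": "protein powder"},
]
LEAK_B = [  # hypothetical forum: a username alongside the same IP
    {"ip": "203.0.113.7", "username": "catperson42"},
]
LEAK_C = [  # hypothetical retailer: that username next to a real name
    {"username": "catperson42", "email": "sam@example.com", "name": "Sam Example"},
]


def link_records(records, link_keys=("ip", "username", "email")):
    """Fold together records that share a value for any of the link keys."""
    profiles = []  # each profile maps a field name to the set of values seen
    for record in records:
        merged = {field: {value} for field, value in record.items()}
        untouched = []
        for profile in profiles:
            shares_identifier = any(
                merged.get(key, set()) & profile.get(key, set())
                for key in link_keys
            )
            if shares_identifier:
                # Absorb the existing profile into the new, larger one.
                for field, values in profile.items():
                    merged.setdefault(field, set()).update(values)
            else:
                untouched.append(profile)
        profiles = untouched + [merged]
    return profiles


for profile in link_records(LEAK_A + LEAK_B + LEAK_C):
    print(profile)
```

No single one of those made-up leaks ties a shopping history to a name. Stack the three together, though, and the “anonymous” IP address in the first one ends up in the same profile as a name from the third, while the record with no overlap stays anonymous.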

That’s why the “anonymised data” defence from marketers and data brokers is so fucked. It’s a go-to line that major tech companies can technically turn to time and again, with a clean conscience that their own data collection is by the book. At the same time, these are some of the same companies that have leaked nearly 8 billion records over the past year, which ultimately negates that logic in the first place. It’s enough to make you wonder where the hand-washing stops and the hand-wringing begins.