Political data gathered on more than 198 million US citizens was exposed this month after a marketing firm contracted by the Republican National Committee stored internal documents on a publicly accessible Amazon server.
The data leak contains a wealth of personal information on roughly 61 per cent of the US population. Along with home addresses, birth dates and phone numbers, the records include advanced sentiment analyses used by political groups to predict where individual voters fall on hot-button issues such as gun ownership, stem cell research, and the right to abortion, as well as suspected religious affiliation and ethnicity. The data was amassed from a variety of sources — from the banned subreddit r/fatpeoplehate to American Crossroads, the super PAC co-founded by former White House strategist Karl Rove.
Deep Root Analytics, a conservative data firm that identifies audiences for political ads, confirmed ownership of the data to Gizmodo on Friday.
UpGuard cyber risk analyst Chris Vickery discovered Deep Root's data online last week. More than a terabyte was stored on the cloud server without the protection of a password and could be accessed by anyone who found the URL. Many of the files did not originate at Deep Root, but are instead the aggregate of outside data firms and Republican super PACs, shedding light onto the increasingly advanced data ecosystem that helped propel US President Donald Trump's slim margins in key swing states.
Although the files possessed by Deep Root would be typical of any campaign, Republican or Democratic, experts say their exposure in a single open database raises significant privacy concerns. "This is valuable for people who have nefarious purposes," Joseph Lorenzo Hall, the chief technologist at the Center for Democracy and Technology, said of the data.
The RNC paid Deep Root $US983,000 ($1,294,081) last year, according to Federal Election Commission reports, but its server contained records from a variety of other conservative sources paid millions more, including The Data Trust (also known as GOP Data Trust), the Republican party's primary voter file provider. Data Trust received over $US6.7 million ($8.8 million) from the RNC during the 2016 cycle, according to OpenSecrets.org, and its president, Johnny DeStefano, now serves as Trump's director of presidential personnel.
The Koch brothers' political group Americans for Prosperity, which had a data-swapping agreement with Data Trust during the 2016 US election cycle, contributed heavily to the exposed files, as did the market research firm TargetPoint, whose co-founder previously served as director of Mitt Romney's strategy team. (The Koch brothers also subsidised a data company known as i360, which began exchanging voter files with Data Trust in 2014.) Furthermore, the files provided by Rove's American Crossroads contain strategic voter data used to target, among others, disaffected Democrats and undecideds in Nevada, New Hampshire, Ohio and other key battleground states.
Deep Root further obtained hundreds of files (at least) from The Kantar Group, a leading media and market research company with offices in New York, Beijing, Moscow, and more than a hundred other cities on six continents. Each file offers rich details about political ads — including estimated cost, audience demographics and reach — by and about figures and groups spanning the political spectrum. There are files on the Democratic Senatorial Campaign Committee, Planned Parenthood and the American Civil Liberties Union, as well as files on every 2016 US presidential candidate, Republicans included.
What's more, the Kantar files each contain video links to related political ads stored on Kantar's servers.
Kantar files on political ads involving US officials, candidates and political organisations. (UpGuard)
Spreadsheets acquired from TargetPoint, which partnered with Deep Root and GOP Data Trust during the 2016 US election, include the home addresses, birth dates, and party affiliations of nearly 200 million registered voters in the 2008 and 2012 US presidential elections, as well as some 2016 voters. TargetPoint's data seeks to resolve questions about where individual voters stand on dozens of political issues. For example: Is the voter eco-friendly? Do they favour lowering taxes? Do they believe the Democrats should stand up to Trump? Do they agree with Trump's "America First" economic stance? Pharmaceutical companies do great damage: Agree or Disagree?
The details of voters' likely preferences on issues like stem cell research and gun control were probably drawn from a variety of sources, according to a Democratic strategist who spoke with Gizmodo.
"Data like that would be a combination of polling data, real world data from door-knocking and phone-calling and other canvassing activities, coupled with modelling using the data we already have to extrapolate what the voters we don't know about would think," the strategist said. "The campaigns that do it right combine all the available data together to make the most robust model for every single voter in the target universe."
In a statement, Deep Root founder Alex Lundry told Gizmodo, "We take full responsibility for this situation." He said the data included proprietary information as well as publicly available voter data provided by state government officials. "Since this event has come to our attention, we have updated the access settings and put protocols in place to prevent further access," Lundry said.
Deep Root's data was exposed after the company updated its security settings on June 1, Lundry said. Deep Root has retained Stroz Friedberg, a cybersecurity and digital forensics firm, to investigate. "Based on the information we have gathered thus far, we do not believe that our systems have been hacked," Lundry added.
So far, Deep Root doesn't believe its proprietary data was accessed by any malicious third parties during the 12 days that the data was exposed on the open web.
Deep Root's server was discovered by UpGuard's Vickery on the night of June 12 as he was searching for data publicly accessible on Amazon's cloud service. He used the same process last month to detect sensitive files tied to a US Defence Department project and exposed by an employee of a top defence contractor.
This is not the first leak of voter files uncovered by Vickery, who told Gizmodo that he was alarmed over how the data was apparently being used — some US states, for instance, prohibit the commercial use of voter records. Moreover, it was not immediately clear to whom the data belonged. "It was decided that law enforcement should be contacted before attempting any contact with the entity responsible," said Vickery, who reported that the server was secured two days later on June 14.
A web of data firms funnels research into campaigns
Deep Root's data sheds light onto the increasingly sophisticated data operation that has fed recent Republican campaigns and lays bare the intricate network of political organisations, PACs and analysis firms that trade in bulk voter data. In an email to Gizmodo, Deep Root said that its voter models are used to enhance the understanding of TV viewership for political ad buyers. "The data accessed was not built for or used by any specific client," Lundry said. "It is our proprietary analysis to help inform local television ad buying."
However, the presence of data on the server from several political organisations, including TargetPoint and Data Trust, suggests that it was used for Republican political campaigns. Deep Root also works primarily with GOP customers (although similar vendors, such as NationBuilder, service the Democrats as well).
Deep Root is one of three data firms hired by the Republican National Committee in the run-up to the 2016 US presidential election. Founded by Lundry, a data scientist on the Jeb Bush and Mitt Romney campaigns, the firm was one of three analytics teams that worked on the Trump campaign following the party's national convention in the summer of 2016.
Lundry's work brought him into Trump's campaign war room, according to a post-election AdAge article that charted the GOP's 2016 data efforts. Deep Root was hand-picked by the RNC's then-chief of staff, Katie Walsh, in September of last year and joined two other data shops — TargetPoint Consulting and Causeway Solutions — in the effort to win Trump the US presidency.
Walsh, who now works for the nonprofit America First Policies after a brief stint in the White House, oversaw Trump's data operation in partnership with Brad Parscale, Trump's digital director. (Parscale did not respond to a request for comment before press time. Attempts to reach Walsh for comment were also unsuccessful.) Walsh and Parscale focused their efforts on three categories of voters, AdAge reports: Voters who might be predisposed to support Trump, Republican voters who were uncertain about Trump, and voters that were leaning toward Hillary Clinton but could be persuaded by Trump's message of changing up government-as-usual.
A spreadsheet forecasting specific voters' likely opinions on various issues, scored on a 0-to-1 scale. (UpGuard)
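As a rough illustration of how a 0-to-1 score of this kind could be produced, the sketch below averages survey answers within a demographic group and applies that average to voters who were never surveyed. It is a minimal example, not Deep Root's or TargetPoint's method: the field names, the groups and every data row are hypothetical, and real campaign models are far more elaborate.

```python
from collections import defaultdict

# Hypothetical survey responses: (age_band, region, answer), where the
# answer is 1 for "agree" and 0 for "disagree". None of this is real data.
surveyed = [
    ("18-34", "urban", 1),
    ("18-34", "urban", 1),
    ("18-34", "urban", 0),
    ("55+", "rural", 0),
    ("55+", "rural", 0),
    ("55+", "rural", 1),
    ("55+", "rural", 0),
]

def fit_group_scores(rows):
    """Average the 0/1 answers within each (age_band, region) group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of answers, count]
    for age_band, region, answer in rows:
        totals[(age_band, region)][0] += answer
        totals[(age_band, region)][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

def predict(scores, age_band, region, default=0.5):
    """Score an unsurveyed voter with the group average, or a neutral default."""
    return scores.get((age_band, region), default)

scores = fit_group_scores(surveyed)
young_urban = predict(scores, "18-34", "urban")   # group seen in the survey
older_rural = predict(scores, "55+", "rural")     # group seen in the survey
unknown = predict(scores, "35-54", "suburban")    # no survey data for this group
```

A voter whose group was never surveyed falls back to 0.5 here, which is one arbitrary way of saying the model has no opinion about them.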
To appeal to the three crucial categories, it appears that Trump's team relied on voter data provided by Data Trust. Complete voter rolls for 2008 and 2012, as well as partial 2016 voter rolls for Florida and Ohio, apparently compiled by Data Trust, are contained in the dataset exposed by Deep Root.
Data Trust acquires voter rolls from state officials and then standardises the voter data to create a clean, manageable record of all registered US voters, a source familiar with the firm's operations told Gizmodo. Voter data itself is public record and therefore not particularly sensitive, the source added, but the tools Data Trust uses to standardise that data are considered proprietary. That data is then provided to political clients, including analytics firms like Deep Root. While Data Trust requires its clients to protect the data, it has to take clients at their word that industry-standard encryption and security protocols are in place.
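To give a sense of what standardising voter rolls can involve, the sketch below maps rows from two differently formatted state files onto one common layout. Data Trust's actual tools are proprietary and not described in the exposed files, so everything here is an assumption: the two input layouts, the field names and the sample rows are all invented for illustration.

```python
from datetime import datetime

# Hypothetical raw rows. Real state voter files each use their own layout.
STATE_A_ROW = {"LAST_NAME": "DOE", "FIRST_NAME": "JANE",
               "DOB": "03/14/1975", "ZIP": "43215-1234"}
STATE_B_ROW = {"name": "Roe, Richard", "birth_date": "1982-11-02",
               "zip5": "33101"}

def from_state_a(row):
    """Convert the upper-case, US-date layout into the common schema."""
    return {
        "first_name": row["FIRST_NAME"].title(),
        "last_name": row["LAST_NAME"].title(),
        # Re-express MM/DD/YYYY as an ISO 8601 date (YYYY-MM-DD).
        "birth_date": datetime.strptime(row["DOB"], "%m/%d/%Y").date().isoformat(),
        # Keep only the five-digit ZIP code, dropping any +4 suffix.
        "zip": row["ZIP"][:5],
    }

def from_state_b(row):
    """Convert the 'Last, First' layout into the common schema."""
    last, first = [part.strip() for part in row["name"].split(",", 1)]
    return {
        "first_name": first.title(),
        "last_name": last.title(),
        "birth_date": row["birth_date"],  # already an ISO 8601 date
        "zip": row["zip5"],
    }

standardised = [from_state_a(STATE_A_ROW), from_state_b(STATE_B_ROW)]
```

Once every state's file is reduced to the same fields, the records can be merged, deduplicated and handed to downstream analytics firms in a single format.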
TargetPoint and Causeway, the two firms employed by the RNC in addition to Deep Root, apparently layered their own analytics atop the information provided by Data Trust. TargetPoint conducted thousands of surveys per week in 22 states, according to AdAge, gauging voter sentiment on a variety of topics. While Causeway helped manage the data, Deep Root used it to perfect its TV advertising targets — producing voter turnout estimates by county and using that intelligence to target its ad buys.
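A county-level turnout estimate can, in principle, be built by adding up per-voter probabilities of voting. The sketch below shows that arithmetic under simple assumptions; it is not taken from Deep Root's files, and the counties, probabilities and function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical modelled voters: (county, probability of voting, from 0 to 1).
voters = [
    ("Franklin", 0.9),
    ("Franklin", 0.5),
    ("Franklin", 0.25),
    ("Cuyahoga", 0.75),
    ("Cuyahoga", 0.75),
]

def expected_turnout_by_county(rows):
    """Sum the per-voter probabilities to get expected voters per county."""
    totals = defaultdict(float)
    for county, p_vote in rows:
        totals[county] += p_vote
    return dict(totals)

turnout = expected_turnout_by_county(voters)

# Rank counties from highest to lowest expected turnout, as one possible
# input to deciding where ad spending might reach the most likely voters.
ranked = sorted(turnout, key=turnout.get, reverse=True)
```

Summing probabilities gives the expected number of voters, so three people with scores of 0.9, 0.5 and 0.25 count as 1.65 expected voters rather than three.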
A source with years of experience working on political campaign data operations told Gizmodo that the data exposed by Deep Root appeared to be customised for the RNC and had apparently been used to create models for turnout and voter preferences. Metadata in the files suggested that the database wasn't Deep Root's working copy, but rather a post-election version of its data, the source said, adding that it was somewhat surprising the files hadn't been discarded.
Because the data from the 2008 and 2012 elections is outdated — the source compared it to the kind of address and phone data one could find on a "lousy internet lookup site" — it's not very valuable. Even the 2016 data is quickly becoming stale. "This is a proprietary dataset based on a mix of public records, data from commercial providers, and a variety of predictive models of uncertain provenance and quality," the source said, adding: "Undoubtedly it took millions of dollars to produce."
Although basic voter information is public record, Deep Root's dataset contains a swirl of proprietary information from the RNC's data firms. Many of the filenames indicate they potentially contain market research on Democratic candidates and the independent expenditure committees that support them. (Up to two terabytes of data contained on the server was protected by permission settings.)
One exposed folder is labelled "Exxon-Mobile" [sic] and contains spreadsheets apparently used to predict which voters support the oil and gas industry. Divided by state, the files include the voters' names and addresses, along with a unique RNC identification number assigned to every US citizen registered to vote. Each row indicates where voters likely fall on issues of interest to ExxonMobil, the country's biggest natural gas producer.
The data evaluates, for example, whether or not a specific voter believes drilling for fossil fuels is vital to US security. It also predicts if the voter thinks the US should be moving away from fossil-fuel use. The ExxonMobil "national score" document alone contains data on 182,746,897 Americans spread across 19 fields.
The "Exxon-Mobile" file contains unique RNC codes and sentiment analyses for more than 182 million US voters. (UpGuard)
Some of the data included in Deep Root's dataset veers into downright bizarre territory. A folder titled simply "reddit" houses 170GB of data apparently scraped from several subreddits, including the controversial r/fatpeoplehate, which was home to a community of people who posted pictures of others and mocked them for their weight before it was banned from Reddit's platform in 2015. Other subreddits that appear to have been scraped by Deep Root or a partner organisation focused on more benign topics, like mountain biking and the Spanish language.
The Reddit data could've been used as training data for an artificial intelligence algorithm focused on natural language processing, or it might have been harvested as part of an effort to match up Reddit users with their voter registration records. During the 2012 US election cycle, Barack Obama's campaign data team relied on information gleaned from Facebook profiles and matched profiles to voter records.
During the 2016 US election season, Reddit played host to a legion of Trump supporters who gathered in subreddits such as r/The_Donald to comb through leaked Democratic National Committee emails and craft pro-Trump memes. Trump himself participated in an "Ask Me Anything" session on r/The_Donald during his campaign.
Given how active some Trump supporters are on Reddit — r/The_Donald currently boasts more than 430,000 members — it makes sense that Trump's data team might be interested in analysing data from the site.
A FiveThirtyEight analysis that looked at where r/The_Donald members spend their time when they're not talking politics might shed some light on why Deep Root collected r/fatpeoplehate data. FiveThirtyEight found that, when r/The_Donald members weren't commenting in political subreddits, they most often frequented r/fatpeoplehate.
It's possible that Deep Root intended to use data from r/fatpeoplehate to build a more comprehensive profile of Trump voters. (Lundry declined to comment beyond his initial statement on any of the information included in the Deep Root dataset.)
A raw excerpt of the scraped Reddit data stored on Deep Root's server. (UpGuard)
However, FiveThirtyEight's investigation doesn't account for Deep Root's collection of data from mountain-biking and Spanish-speaking subreddits that weren't as popular with r/The_Donald members — and data from these subreddits that are not so closely linked to Trump's diehard supporters might be more useful for his campaign's goal of pursuing swing voters.
"My guess is that they were scraping Reddit posts to match to the voter file as another input for individual modelling," a source familiar with campaign data operations told Gizmodo. "Given the number of random forums, my guess is they started with a list of accounts to scrape from, rather than scraping from all forums then trying to match from there (in which case you'd start with the political ones)."
Matching voter records with Reddit usernames would be complicated and any large-scale effort would likely result in many inaccuracies, the source said. However, campaigns have attempted to match voter files with social media profiles in the past. Such an effort by Deep Root wouldn't be entirely surprising, and would likely yield rich data on the small portion of users it was able to match with their voter profiles, the source explained.
Data exposes sensitive voter info
The Deep Root incident represents the largest known leak of Americans' voter records, outstripping past exposures by several million records. Five voter-file leaks over the past 18 months exposed between 350,000 and 191 million files, some of which paired voter data — name, race, gender, birth date, address, phone number, party affiliation and so on — with email accounts, social media profiles, and records of gun ownership.
Campaigns and the data analysis firms they employ are a particularly weak point for data exposure, security experts say. Corporations that don't properly secure customer data can face significant financial repercussions — just ask Target or Yahoo. But because campaigns are short-term operations, there's not much incentive for them to take data security seriously, and valuable data is often left out to rust after an election.
"Campaigns are very narrowly focused. They are shoestring operations, even presidential campaigns. So they don't think of this as an asset they need to protect," the Center for Democracy and Technology's Hall told Gizmodo.
Even though voter rolls are public record and are easy to access — Ohio, for instance, makes its voter rolls available to download online — their exposure can still be harmful.
Voter registration records include post codes, birth dates, and other personal information that have been crucial in research efforts to re-identify anonymous medical data. Latanya Sweeney, a professor of government and technology at Harvard University, famously used voter data to re-identify Massachusetts Governor William Weld from information in anonymous hospital discharge records.
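The re-identification technique rests on linking two datasets through fields they share. The sketch below illustrates the idea with invented records, assuming the shared fields are ZIP code, birth date and sex. The names, diagnoses and function names are all hypothetical.

```python
# Hypothetical voter roll: names attached to ZIP code, birth date and sex.
voter_roll = [
    {"name": "Voter A", "zip": "02138", "birth_date": "1950-01-15", "sex": "M"},
    {"name": "Voter B", "zip": "02138", "birth_date": "1962-07-04", "sex": "F"},
    {"name": "Voter C", "zip": "02139", "birth_date": "1950-01-15", "sex": "M"},
]

# Hypothetical "anonymous" medical records: names removed, other fields kept.
medical_records = [
    {"zip": "02138", "birth_date": "1950-01-15", "sex": "M", "diagnosis": "condition X"},
    {"zip": "02140", "birth_date": "1971-09-30", "sex": "F", "diagnosis": "condition Y"},
]

# Fields assumed to appear in both datasets.
QUASI_IDENTIFIERS = ("zip", "birth_date", "sex")

def key(record):
    """Build the linking key from the shared fields."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)

def reidentify(voters, records):
    """Attach a name to each record whose key matches exactly one voter."""
    by_key = {}
    for voter in voters:
        by_key.setdefault(key(voter), []).append(voter["name"])
    matches = []
    for record in records:
        names = by_key.get(key(record), [])
        if len(names) == 1:  # only a unique match pins the record to one person
            matches.append((names[0], record["diagnosis"]))
    return matches

matches = reidentify(voter_roll, medical_records)
```

In this toy data only one medical record links to a single voter. The second record matches nobody on the roll, so it stays anonymous.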
Because of the personal information they contain, voter registration databases can also be useful in identity theft schemes.
Even though exposure of Deep Root's data has the potential to harm voters, it's exactly the kind of data that campaigns lust after and will spend millions of dollars to obtain. Campaigns are motivated to accumulate as much deeply personal information about voters as possible, so they can spend their ad dollars in the right swing districts where they're likely to sway the greatest number of voters. But voter data rapidly goes stale and campaigns close up shop quickly, so data is seen as disposable and often isn't well-protected.
"I can think of no avenues for punishing political data breaches or otherwise properly aligning the incentives. I worry that if there's no way to punish campaigns for leaking this stuff, it's going to continue to happen until something bad happens," Hall said. The data left behind by campaigns can pose a lingering security issue, he added. "None of these motherfuckers were ever Boy Scouts or Girl Scouts, they don't pack out what they pack in."