Download The Dataset Of Every Publicly Available Reddit Comment

July 13, 2015 at 4:00 pm

Redditor “Stuck_in_the_Matrix” has posted a torrent of what he claims is a dataset of every publicly available comment on Reddit.

That’s 1.7 billion comments in total, each with data about its author, subreddit, position in the comment tree, and score. “This dataset is over 1 terabyte uncompressed, so this would be best for larger research projects,” wrote “Stuck_in_the_Matrix.”

The redditor first posted about the dataset on subreddit r/datasets (of course) on July 3, and with some help from other users, had set up a torrent by July 4. A smaller dataset, comprising just a month’s worth of comments, is also available as a torrent.

What could you do with all that data? “Give me 5 good data scientists and we can find the holy grail of karma!” said user “kill-init.”

Reddit user “mattrepl,” who identified themselves as a PhD student in machine learning and community dynamics, suggested that the dataset could be used to develop models of the flow of online conversations or the spread of Internet memes, a topic that sociologists have paid increasing attention to over the last few years. It could also be used to predict which subreddits or comment threads a user might participate in, which could help develop better recommendation systems.

All of that data is available through Reddit’s API, but according to other redditors in r/datasets, gathering it all would have been a dauntingly tedious task. “I’ve played with Reddit’s API some and have written crawlers to get data by user, sub, thread, etc. But it becomes prohibitive to get all the data if you have to continuously make requests for relatively small amounts of data and then piece them together,” wrote user “rePAN6517” in a comment.
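To illustrate the kind of piecing-together “rePAN6517” describes, here is a minimal sketch of cursor-based pagination. It does not call Reddit’s API: `fake_fetch_page`, the cursor format, the page size, and the record fields are all hypothetical stand-ins chosen so the example runs on its own.

```python
from typing import Callable, Optional

# A page is a batch of records plus a cursor pointing at the next batch
# (None when there is nothing left). This shape is an assumption made
# for illustration, not Reddit's actual response format.
Page = tuple[list[dict], Optional[str]]


def collect_all(fetch_page: Callable[[Optional[str]], Page]) -> list[dict]:
    """Request small pages one at a time and stitch them into one list."""
    items: list[dict] = []
    cursor: Optional[str] = None
    while True:
        batch, cursor = fetch_page(cursor)
        items.extend(batch)
        if cursor is None:  # no more pages to request
            break
    return items


# Hypothetical in-memory "server" holding seven made-up comments.
FAKE_COMMENTS = [{"id": f"c{i}", "author": f"user{i % 3}"} for i in range(7)]


def fake_fetch_page(cursor: Optional[str], page_size: int = 3) -> Page:
    """Stand-in for a network request: returns at most page_size records."""
    start = 0 if cursor is None else int(cursor)
    end = start + page_size
    next_cursor = str(end) if end < len(FAKE_COMMENTS) else None
    return FAKE_COMMENTS[start:end], next_cursor


if __name__ == "__main__":
    comments = collect_all(fake_fetch_page)
    print(f"Collected {len(comments)} comments across several small requests")
```

Each pass through the loop stands for one request, so the number of round trips grows with the size of the collection. That is why a full crawl becomes prohibitive at the scale of the whole site.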

Others are openly sceptical of the dataset. One commenter, “lost_file,” claimed, “Reddit has a policy for the amount of requests you can make per second. This dataset would have taken at least a year to compile. Something is fishy.”
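The “at least a year” figure can be sanity-checked with back-of-envelope arithmetic. In the sketch below, only the 1.7 billion total comes from the article. The comments-per-request and requests-per-second values are illustrative assumptions, not Reddit’s documented limits, and the estimate ignores retries, downtime, and parallel clients.

```python
SECONDS_PER_DAY = 86_400


def days_to_fetch(total_comments: int,
                  comments_per_request: int,
                  requests_per_second: float) -> float:
    """Estimate how many days a single rate-limited client would need."""
    # Ceiling division: a partial final page still costs one request.
    requests_needed = -(-total_comments // comments_per_request)
    return requests_needed / requests_per_second / SECONDS_PER_DAY


TOTAL_COMMENTS = 1_700_000_000  # figure quoted in the article

# (comments per request, requests per second): assumed scenarios only.
SCENARIOS = [(100, 1.0), (100, 0.5), (25, 1.0)]

if __name__ == "__main__":
    for per_request, rate in SCENARIOS:
        days = days_to_fetch(TOTAL_COMMENTS, per_request, rate)
        print(f"{per_request} comments/request at {rate} req/s: {days:,.0f} days")
```

Under these assumed figures, the first scenario works out to about 197 days and the second to about 394 days, so whether the claim holds depends heavily on the rate limit and page size one assumes.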

As of the time of publication, “Stuck_in_the_Matrix” hasn’t responded to those questions.

[Reddit]

Top image: Getty Images.
