Giz Explains: How To Use Robots.txt To Hide Your Dumb Blog

May 1, 2015 at 11:00 am

Completely deleting something from the internet is like corralling drunk, feral geese after setting them loose: damn near impossible. But there are ways to conceal the web content you don’t want anyone to lay eyes on. You can hide all sorts of web pages with what’s essentially a “Keep Out” sign for search engines: a special file called robots.txt.

The robots.txt file acts as a shortcut to bury content so deep it’s hard to dig up. As BuzzFeed recently illustrated when it tried to hide deleted posts that were critical of brands owned by its advertisers, robots.txt is a fairly potent coverup tool. If you want to dodge responsibility for something you’ve published, the file will partially hide your tracks. The internet’s memory is long, and robots.txt is a forgetting protocol.

BuzzFeed used robots.txt to make it harder for people to find a few posts about Dove and Monopoly that it had deleted in a bout of editorial stupidity. By adding the URLs of these ghost posts to its robots.txt file, it prevented older versions from showing up in online searches. This didn’t mean the posts were gone entirely, but it meant that Unilever execs pissed about the deleted Dove commentary would not be able to find it unless they specifically rummaged through BuzzFeed’s robots.txt file.

So how does it work? Search engines like Google, Bing, and Yahoo often cache older versions of web pages, meaning it’s fairly simple to find a copy of a deleted post. The Internet Archive’s Wayback Machine also archives copies of gone posts, preserving a digital record. In general, this habit of preservation is a boon for protecting digital history. But when you want something to be forgotten, this tendency towards record-keeping becomes a problem.

However, Google, Bing, Yahoo, the Wayback Machine, and a variety of other search robots will listen to your commands not to keep a record. Most of the robots used by search engines will look for the presence of a robots.txt file right away, and will obey instructions to exclude content.
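To see what “obeying” looks like from the robot’s side, here is a minimal Python sketch using the standard library’s urllib.robotparser. The robots.txt content, the example.com URLs and the is_allowed helper are all made up for illustration; real search crawlers run their own parsers and may interpret rules a little differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real crawler would download this
# from the root of the site, e.g. https://example.com/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /deleted-post
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A polite crawler asks before fetching each page.
    for path in ("/deleted-post", "/about"):
        url = "https://example.com" + path
        print(path, is_allowed(ROBOTS_TXT, "Googlebot", url))
```

The check is voluntary: nothing stops a rude robot from skipping it and fetching the page anyway.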

For a professional media company to self-censor is shady as hell, but perhaps you, dear reader, have a better reason. Making a robots.txt file to hide your shameful exploits is very easy. You open up a text editor and type the following:

User-agent:
Disallow:

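Then you customise it with exactly what you want to disallow: “User-agent:” names the robot you’re talking to, and “Disallow:” names the path it should stay away from. Save it as a plain text file called robots.txt. It’s important to use lowercase for “robots” and to make a separate “Disallow:” command for each exclusion.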
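If you’d rather not type the file by hand, a few lines of Python can assemble it. This is only a sketch: build_robots_txt and the example paths are hypothetical names picked for illustration, and all it does is write one “Disallow:” line per exclusion under a single “User-agent:” line.

```python
from typing import Iterable


def build_robots_txt(user_agent: str, paths: Iterable[str]) -> str:
    """Build robots.txt text with one Disallow line per excluded path."""
    lines = [f"User-agent: {user_agent}"]
    lines += [f"Disallow: {path}" for path in paths]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical paths; swap in whatever you want robots to skip.
    content = build_robots_txt("*", ["/old-post", "/drafts/"])
    # The filename has to be exactly robots.txt, all lowercase.
    with open("robots.txt", "w", encoding="utf-8") as handle:
        handle.write(content)
```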

For the Wayback Machine, for instance, you write this, which covers your entire site, and it will retroactively scrub your pages:

User-agent: ia_archiver
Disallow: /

After that, you upload the file to the root directory of your domain (it has to sit at the top level, so robots find it at yourdomain.com/robots.txt). If you don’t have direct access to the directory, contact your web admin. You can also set up commands that just hide one specific posting, or commands that stop multiple crawlers from searching.
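As a rough illustration of mixing those options, the sketch below holds a robots.txt with two groups: one that shuts out ia_archiver entirely, and one that hides a single post from every other robot. The post path and example.com are invented for the example, and the check uses Python’s urllib.robotparser, which is just one reading of the rules, not a guarantee of how any particular search engine behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file: a blank line separates the rules for each robot.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /

User-agent: *
Disallow: /2015/04/embarrassing-post
"""


def can_crawl(user_agent: str, path: str) -> bool:
    """Check a path on the made-up site example.com against ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "https://example.com" + path)


if __name__ == "__main__":
    checks = [
        ("ia_archiver", "/anything"),
        ("Googlebot", "/2015/04/embarrassing-post"),
        ("Googlebot", "/2015/04/other-post"),
    ]
    for agent, path in checks:
        print(agent, path, can_crawl(agent, path))
```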

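This isn’t just useful for hiding embarrassing adventures in blogging. It’s also helpful for keeping login pages and other sensitive corners of a site out of search results. E-commerce services can use robots.txt to steer crawlers away from pages tied to clients’ personal data. The file itself is public and only well-behaved robots obey it, though, so anything truly private still needs real access controls.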

Some sites get creative with their robots.txt files. As an inside joke, Yelp includes instructions on the off chance that the robots become sentient. And many web admins include directions in robots.txt to help their sites get crawled more quickly, so it’s as much a tool for guiding robots around as it is a tool for telling them to stay out.

Most of the time, people try to get their internet content discovered. But robots.txt, which has been around since 1994, highlights a persistent desire for some control over how the things we put on the web get spread around. When media companies backtrack on what they have published, it draws attention to how this tool can be used for coverups. Yet there are many reasons why people want the chance to limit audiences, and the existence of a tool that gives makers more power over what gets discovered and remembered online is a good thing.
[SEOROI, Yoast, Cognitive SEO, Google]

Illustration: Jim Cooke
