Completely deleting something from the internet is like corralling drunk, feral geese after setting them loose: damn near impossible. But there are ways to conceal the web content you don’t want anyone to lay eyes on. You can hide all sorts of web pages with what’s essentially a “Keep Out” sign for search engines: a special file called robots.txt.
The robots.txt file acts as a shortcut to bury content so deep it’s hard to dig up. As BuzzFeed recently illustrated when it tried to hide deleted posts that were critical of its advertisers’ brands, robots.txt is a fairly potent coverup tool. If you want to dodge responsibility for something you’ve published, the file will partially hide your tracks. The internet’s memory is long, and robots.txt is a forgetting protocol.
BuzzFeed used robots.txt to make it harder for people to find a few posts about Dove and Monopoly that it had deleted in a bout of editorial stupidity. By adding the URLs of these ghost posts to its robots.txt file, it prevented older versions from showing up in online searches. This didn’t mean the posts were gone entirely, but it meant that Unilever execs pissed about the deleted Dove commentary would not be able to find it unless they specifically rummaged through BuzzFeed’s robots.txt file.
So how does it work? Search engines like Google, Bing, and Yahoo often cache older versions of web pages, meaning it’s fairly simple to find a copy of a deleted post. The Internet Archive’s Wayback Machine also archives copies of deleted posts, preserving a digital record. In general, this habit of preservation is a boon for protecting digital history. But when you want something to be forgotten, this tendency towards record-keeping becomes a problem.
However, Google, Bing, Yahoo, the Wayback Machine, and a variety of other search robots will listen to your commands not to keep a record. Most of the robots used by search engines will look for the presence of a robots.txt file right away, and will obey instructions to exclude content.
For a professional media company to self-censor is shady as hell, but perhaps you, dear reader, have a better reason. Making a robots.txt file to hide your shameful exploits is very easy. You open up a text editor and type the following:
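As a rough sketch of what such a file tends to look like, the Python below holds a two-line, catch-all robots.txt in a string and uses the standard library's robot parser to check what a rule-abiding crawler would make of it. The file contents are an assumption (the usual "every robot, whole site" form), and example.com, the crawler names, and the helper `is_allowed` are placeholders for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed contents of a catch-all robots.txt: "*" addresses every crawler,
# and "Disallow: /" covers the entire site. Save this text as robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

def is_allowed(robots_txt, agent, url):
    """Return True if a compliant crawler named `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    # Hypothetical URL; any path on the site is treated the same way.
    print(is_allowed(ROBOTS_TXT, "Googlebot", "https://example.com/old-post"))
```

The check only models crawlers that choose to obey the file; nothing in robots.txt forces a robot to comply.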
Then you customise it with exactly what you want to disallow, and save it as a plain-text file named robots.txt. It’s important to use lowercase for “robots” and to put each exclusion on its own “Disallow:” line.
For the Wayback Machine, for instance, you write this and it will retroactively scrub your page:
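A hedged sketch of a Wayback-specific rule follows. It assumes the Internet Archive's crawler answers to the user-agent name `ia_archiver`, which is the name it has historically honoured; whether the archive still acts on it is not something this snippet can confirm. The Python simply checks that the rule singles out that one robot and leaves the others alone. The helper `crawler_may_fetch` and the URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule: block only the crawler that identifies as ia_archiver.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""

def crawler_may_fetch(agent, url, robots_txt=ROBOTS_TXT):
    """Return True if `agent` is permitted to fetch `url` under the rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    page = "https://example.com/deleted-post"  # hypothetical page
    for agent in ("ia_archiver", "Googlebot"):
        print(agent, crawler_may_fetch(agent, page))
```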
After that, you upload the file to the root directory of your domain, so that it sits at yourdomain.com/robots.txt; crawlers won’t look for it inside a subfolder. If you don’t have direct access to the directory, contact your web admin. You can also set up commands that hide just one specific posting, or commands that apply to several crawlers at once.
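To make the narrower cases concrete, here is a small sketch that hides one post and one folder from two named crawlers while leaving everyone else alone. The paths, the domain, and the helper `blocked_agents` are invented for illustration; only the `User-agent` and `Disallow` syntax is standard.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: two named crawlers share one group, and each
# exclusion gets its own Disallow line. The paths are made up.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: Bingbot
Disallow: /2015/04/embarrassing-post
Disallow: /drafts/
"""

def blocked_agents(robots_txt, agents, url):
    """Return the crawlers in `agents` that the rules bar from `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]

if __name__ == "__main__":
    agents = ["Googlebot", "Bingbot", "DuckDuckBot"]
    post = "https://example.com/2015/04/embarrassing-post"
    print(blocked_agents(ROBOTS_TXT, agents, post))
    print(blocked_agents(ROBOTS_TXT, agents, "https://example.com/about"))
```

Crawlers that aren't named in any group, and have no catch-all `*` group to fall back on, are treated as allowed everywhere.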
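This isn’t just useful for hiding embarrassing adventures in blogging. It’s also helpful for keeping login pages and other sensitive corners of a site out of search results. E-commerce services can use robots.txt to keep crawlers away from areas that hold clients’ personal data, though since the file itself is public and only well-behaved robots obey it, it’s no substitute for actually locking those areas down.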
Some sites get creative with their robots.txt files: Yelp, as an inside joke, includes instructions for the off chance that the robots become sentient. And many web admins include directions in robots.txt to help their sites get crawled more quickly, so it’s as much a tool for guiding robots around as it is a tool for telling them to stay out.
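One common way of guiding robots, offered here as an assumption about what such directions look like rather than anything a particular site does, is a `Sitemap:` line that points crawlers at a list of pages worth visiting. The sketch below reads that line back with Python's standard-library parser (`site_maps()` needs Python 3.8 or later). The sitemap URL, the `/private/` path, and the helper `sitemap_urls` are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical file that both guides crawlers (Sitemap) and fences
# one folder off (Disallow).
ROBOTS_TXT = """\
Sitemap: https://example.com/sitemap.xml

User-agent: *
Disallow: /private/
"""

def sitemap_urls(robots_txt):
    """Return the sitemap URLs declared in the file, or an empty list."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []

if __name__ == "__main__":
    print(sitemap_urls(ROBOTS_TXT))
```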
Most of the time, people try to get their internet content discovered. But robots.txt, which has been around since 1994, highlights a persistent desire for some control over how the things we put on the web get spread around. When media companies backtrack on what they have published, it draws attention to how this tool can be used for coverups. Yet there are many reasons why people want the chance to limit audiences, and the existence of a tool that gives makers more power over what gets discovered and remembered online is a good thing.
[SEOROI, Yoast, Cognitive SEO, Google]
Illustration: Jim Cooke