Facebook and all of its affiliated companies and services suddenly disappeared from the web on Monday — an outage that lasted over five hours and left users unable to reach their FB, WhatsApp, or Instagram accounts. Rumours and conspiracy theories soon spread that the social media giant had been hacked, or that it was trying to distract from its imminent congressional woes.
Well, now we know the real reason: On Tuesday, the company put out a statement providing more details about the outage and explaining that the whole global blackout was triggered by a “faulty configuration change” issued in the course of routine maintenance. That misconfiguration accidentally shut down Facebook’s backbone, the globally distributed network of fibre optic cables that connects all of the company’s data centres around the world. Thus, the much-maligned social media giant disappeared from the internet for more than five hours, giving us all a much-needed rest from its toxic presence.
Of course, the details of what happened are more complicated than that. One particularly interesting aspect of the whole thing is the role played by a powerful but little-known internet routing protocol called Border Gateway Protocol, or “BGP.” It was widely speculated by networking experts, and is now confirmed by Facebook, that BGP helped fuel the entire episode. So, yeah. What the hell is BGP?
It has been called the “glue” that holds the web together. Others refer to it as the internet’s “post office” or “air traffic controller.” When Facebook fell off the face of the Earth on Monday, Stripe CEO Patrick Collison referred to BGP as “the dark magic of the internet” — a complex mechanism “fully understood by no one.” Actually, BGP has a basic, straightforward function, but, to understand it, you have to consider the broad strokes of how the web actually works — which is, admittedly, pretty complicated.
In short, BGP is one of the many protocols that help bring order to the big mess of interlocking networks that make up the internet. Specifically, BGP helps route traffic between the large networks known as “autonomous systems.” An autonomous system, or AS, is basically a large network or group of networks run by a single operator: It can be a university, an ISP, a government agency, or, among many other things, a very large tech company like Facebook. Autonomous systems are responsible for keeping up-to-date information on the best available routes by which data packets can be sent to and from their network. That routing information is then communicated to the wider internet (and thus to other networks) using BGP. In this sense, BGP basically enables data routing across the internet.
This is where the “post office” metaphor comes in. BGP is charged with finding and sharing the most efficient routes to relay data (like mail) to and from specific destinations. Others have spoken of it as a map, one that is constantly being changed and updated depending on the fluctuating conditions of the internet. In yet another inspired metaphor, an analysis by the security firm Imperva compares BGP to your car’s GPS system:
…the BGP routing protocol is analogous to your trusty GPS navigator. Like Google’s Waze application, the best route is determined by different factors, such as traffic congestion, roads temporarily closed for maintenance, etc. The path is calculated dynamically depending on the situation of the network nodes, which are like roads and junctions on a GPS map.
There is a lot more that could be said about BGP, but the short story is this: If an autonomous system doesn’t have its BGP configured properly, data can’t be routed effectively to and from its network and, therefore, people can’t reach it. This is apparently part of what happened to Facebook.
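For readers who think better in code, here is a minimal Python sketch of that idea. It is a toy model, not how real routers work: the class names, the address block (a range reserved for documentation), and the AS numbers are all made up for illustration, and route selection is reduced to "shortest AS path wins," whereas real BGP weighs many more attributes and operator policies.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class Route:
    prefix: str     # destination address block, e.g. "192.0.2.0/24"
    as_path: tuple  # AS numbers the announcement passed through, origin last


@dataclass
class RoutingTable:
    routes: list = field(default_factory=list)

    def advertise(self, prefix, as_path):
        """Record a route that some neighbouring network announced."""
        self.routes.append(Route(prefix, tuple(as_path)))

    def withdraw(self, prefix, origin_as):
        """Forget every route to this prefix that originated at origin_as."""
        self.routes = [
            r for r in self.routes
            if not (r.prefix == prefix and r.as_path[-1] == origin_as)
        ]

    def best_route(self, address):
        """Pick a route for one destination address, or None if there is none."""
        addr = ip_address(address)
        candidates = [r for r in self.routes if addr in ip_network(r.prefix)]
        if not candidates:
            return None  # no known path: the destination is unreachable
        # Simplifying assumption: the shortest AS path is the best one.
        return min(candidates, key=lambda r: len(r.as_path))


# Hypothetical numbers: AS 64496 owns 192.0.2.0/24 and is reachable two ways.
table = RoutingTable()
table.advertise("192.0.2.0/24", [64500, 64501, 64496])
table.advertise("192.0.2.0/24", [64502, 64496])
print(table.best_route("192.0.2.10"))  # the shorter, two-hop path

# The origin network pulls its announcement, as in a misconfiguration.
table.withdraw("192.0.2.0/24", origin_as=64496)
print(table.best_route("192.0.2.10"))  # nothing left to route to
```

The first lookup returns the shorter path; after the withdrawal the lookup returns nothing at all, which is roughly what the rest of the internet saw when it went looking for Facebook.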
How BGP Relates to Facebook’s Very Bad Day
Historically speaking, BGP misconfigurations are known for causing “spectacular incidents of widespread outages,” cutting off user access to online services. Facebook has now copped to BGP’s role in its shittiest of shitty days, explaining in its recent update how its backbone issue led to the withdrawal of its BGP “advertisements,” essentially the announcements that tell other networks where to find it on the internet:
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves can not speak to our data centres, since this is an indication of an unhealthy network connection. In the recent outage the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers.
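The safeguard Facebook describes can be sketched in a few lines of Python. This is an illustration of the rule as stated in the quote, not Facebook’s actual code: the `DnsSite` class, the site names, and the `can_reach_data_centres` flag are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class DnsSite:
    name: str
    advertising: bool = True  # is this site's BGP route currently announced?

    def health_check(self, can_reach_data_centres):
        # Rule from the statement: a site that cannot talk to the data
        # centres declares itself unhealthy and withdraws its advertisement.
        self.advertising = can_reach_data_centres
        return self.advertising


def reachable_sites(sites):
    """Names of the sites the rest of the internet still has a route to."""
    return [s.name for s in sites if s.advertising]


sites = [DnsSite("site-a"), DnsSite("site-b"), DnsSite("site-c")]

# One site loses its link: it steps aside and the others keep answering.
sites[0].health_check(can_reach_data_centres=False)
print(reachable_sites(sites))

# The whole backbone goes down: every site fails the same check at once.
for site in sites:
    site.health_check(can_reach_data_centres=False)
print(reachable_sites(sites))
```

In the first case the rule does its job, steering traffic away from one bad location. In the second, every location withdraws at the same moment and the list comes back empty, even though each server is still up and running.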
Facebook’s incident is similar in many ways to other episodes involving large companies whose BGP was disabled or improperly configured.
“In our experience, these usually are mistakes, not attacks,” said Usman Muzaffar, SVP, Engineering at Cloudflare, in a statement shared with Gizmodo on Monday, when questioned about the outage. According to experts, such an outage is not a totally anomalous event — though the size and duration of Facebook’s outage are notable. Cloudflare has done its own breakdown on how BGP misconfiguration could have played out.
“It’s not that weird,” said Jacob Hoffman-Andrews, senior staff technologist at the Electronic Frontier Foundation. “The big tech giants have outages like this with some frequency,” he said, pointing to one particularly notorious BGP incident in 2008 when Pakistan’s state-owned telecom managed to accidentally boot YouTube off the internet by co-opting traffic meant for the video-sharing platform. During a similar episode in 2018, a large part of Google went down for about an hour after a BGP malfunction routed a large chunk of web traffic through Russia, China, and other places it was not supposed to go.
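How can one network co-opt traffic meant for another? One well-known mechanism, widely reported as the cause of the 2008 YouTube incident, is announcing a more specific slice of someone else’s address block, because routers prefer the most specific match they know about. The Python sketch below shows that rule only; the address ranges (reserved for documentation) and the network names are hypothetical, and real routers apply further checks and filters.

```python
from ipaddress import ip_address, ip_network


def pick_route(address, announcements):
    """Return (prefix, origin) for the most specific prefix covering address."""
    addr = ip_address(address)
    matches = [
        (ip_network(prefix), origin)
        for prefix, origin in announcements
        if addr in ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest-prefix match: the larger prefix length is the narrower block.
    network, origin = max(matches, key=lambda m: m[0].prefixlen)
    return str(network), origin


# The rightful owner announces its whole block.
announcements = [("198.51.100.0/24", "legitimate-owner")]
print(pick_route("198.51.100.7", announcements))

# Another network mistakenly announces a narrower slice of the same block.
announcements.append(("198.51.100.0/25", "other-network"))
print(pick_route("198.51.100.7", announcements))
```

Before the second announcement, traffic for the address goes to the rightful owner. Afterwards the narrower block wins, and the same traffic heads to the other network instead.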
Will Something Like This Happen Again?
Short answer: Yes. Most definitely yes. If not to Facebook, BGP will almost certainly play a role in tripping up another major platform that you use a lot. According to experts, that’s no cause for alarm — but it is a good example of the fallible nature of the web, illustrating how much of it can be brought down by something as simple as a company’s technical error.
“Today’s events are a gentle reminder that the Internet is a very complex and interdependent system of millions of systems and protocols working together,” said Cloudflare analysts in their write-up on the incident. “That trust, standardization, and cooperation between entities are at the centre of making it work for almost five billion active users worldwide.”