How Google Crunches All That Data

If data centres are the brains of an information company, then Google is one of the brainiest there is. Though always evolving, it is, fundamentally, in the business of knowing everything. Here are some of the ways it stays sharp.

One tool it uses to tackle huge data sets is MapReduce, a framework the company developed itself. Whereas other systems require a thoroughly tagged database, MapReduce breaks a job down into simple steps that can deal with any type of data, then distributes the work across a legion of machines.

Looking at MapReduce in 2008, Wired imagined the task of determining word frequency in Google Books. Here are the five steps they delineated:

1. First the data must be collected. In this example, Google's machines gather every page of every book scanned in Google Books.

2. The data is mapped. This is where MapReduce is unique. The master computer evaluates the request and then divvies it up into smaller, more manageable "sub-problems," which are assigned to other computers. These sub-problems, in turn, may be divided up even further, depending on the complexity of the data set. In our example, the entirety of Google Books would be split, say, by author (though more likely by something arbitrary, such as the order in which the books were scanned) and distributed to worker computers.

3. The data is saved. To maximise efficiency, each worker computer saves the data it has processed to its own local hard drive, rather than sending the entire petabyte-scale data set back to some central location.

4. The saved data is fetched and reduced. Other worker machines are assigned specifically to grab the data from the ones that crunched it, compiling the processed data into lists of words and the frequency with which they appeared.

5. The data is solved. The finished product of the MapReduce system is, as Wired says, a "data set about your data", one that has been created to easily display the solutions to your initial problem. In this case, the new data set would let you query any word and see how often it appeared in Google Books.
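The five steps above can be sketched in a few lines of Python. This is a toy, single-machine illustration of the word-count example, not Google's implementation: the function names (`split_input`, `map_chunk`, `shuffle`, `reduce_counts`, `word_frequency`), the three sample "pages" and the worker count are all made up for the sketch, and the "workers" are simply loop iterations rather than separate computers.

```python
import re
from collections import defaultdict


def split_input(pages, num_workers):
    """Step 2: deal the pages out into one chunk per (hypothetical) worker."""
    return [pages[i::num_workers] for i in range(num_workers)]


def map_chunk(chunk):
    """Map: emit a (word, 1) pair for every word found in the chunk."""
    pairs = []
    for page in chunk:
        for word in re.findall(r"[a-z']+", page.lower()):
            pairs.append((word, 1))
    return pairs


def shuffle(mapped_chunks):
    """Step 4, first half: gather the intermediate pairs and group them by word."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Step 4, second half: collapse each word's list of 1s into a total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_frequency(pages, num_workers=3):
    chunks = split_input(pages, num_workers)
    # Step 3 stand-in: on a real cluster each worker would keep its output
    # on local disk; here the results just sit in a list in memory.
    mapped = [map_chunk(chunk) for chunk in chunks]
    return reduce_counts(shuffle(mapped))


# Step 1 stand-in: a tiny "library" of three pages (sample data for the sketch).
pages = [
    "It was the best of times",
    "it was the worst of times",
    "Call me Ishmael",
]

# Step 5: the finished "data set about your data" can be queried by word.
frequency = word_frequency(pages)
print(frequency["times"])         # 2
print(frequency.get("whale", 0))  # 0
```

Because each chunk is mapped independently, the answer does not depend on how the pages are divided up, which is what lets the real system spread the same work across thousands of machines.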

MapReduce is how Google manipulates its massive amounts of data, sorting and re-sorting it into different sets that reveal different meanings and have different uses. But another Herculean task Google faces is dealing with data that's not already on its machines. It's one of the most daunting data sets of all: the internet.

Last month, Wired got a rare look at the "algorithm that rules the web", and the gist of it is that there is no single, set algorithm. Rather, Google rules the internet by constantly refining its search technologies, charting new territories like social media and improving the ones users tread most often with personalised searches.

But of course it's not just about matching the terms people search for to the web sites that contain them. Amit Singhal, a Google Search guru, explains, "you are not matching words; you are actually trying to match meaning".

Words are a finite data set. You don't need an entire data centre to store them, either - a dictionary does just fine. But meaning is the most profound data set humanity has ever produced, and it's one we're charged with managing every day. Our own mental MapReduce probes for intent and scans for context, informing how we respond to the world around us.

Google's memory may in a sense be better than any one individual's, and complex frameworks like MapReduce ensure that it will only continue to outpace us in that respect. But in terms of the capacity to process meaning, in all of its nuance, one person could outperform the entire Googleplex. At least for now. [Wired and Wikipedia]


Memory [Forever] is our week-long consideration of what it really means when our memories, encoded in bits, flow in a million directions, and might truly live forever.