How Google Crunches All That Data
Source: http://gizmodo.com/5495097/how-google-crunches-all-that-data
If data centers are the brains of an information company, then Google is one of the brainiest there is. Though always evolving, it is, fundamentally, in the business of knowing everything. Here are some of the ways it stays sharp.
For tackling massive amounts of data, the main weapon in Google’s arsenal is MapReduce, a system developed by the company itself. Whereas other frameworks require a thoroughly tagged and rigorously organized database, MapReduce breaks the process down into simple steps, allowing it to deal with any type of data, which it distributes across a legion of machines.
Looking at MapReduce in 2008, Wired imagined the task of determining word frequency in Google Books. As its name would suggest, the MapReduce magic comes from two main steps: mapping and reducing.
The first of these, the mapping, is where MapReduce is unique. A master computer evaluates the request and then divvies it up into smaller, more manageable “sub-problems,” which are assigned to other computers. These sub-problems, in turn, may be divided up even further, depending on the complexity of the data set. In our example, the entirety of Google Books would be split, say, by author (but more likely by the order in which they were scanned, or something like that) and distributed to the worker computers.
Each worker then crunches its share and saves the results. To maximize efficiency, that processed data remains on the worker computers’ local hard drives, as opposed to being sent, the whole petabyte-scale mess of it, back to some central location. Then comes the second central step: reduction. Other worker machines are assigned specifically to the task of grabbing the data from the computers that crunched it and paring it down to a format suitable for solving the problem at hand. In the Google Books example, this second set of machines would reduce and compile the processed data into lists of individual words and the frequency with which they appeared across Google’s digital library.
The finished product of the MapReduce system is, as Wired says, a “data set about your data,” one that has been crafted specifically to answer the initial question. In this case, the new data set would let you query any word and see how often it appeared in Google Books.
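To make the two steps concrete, here is a minimal single-machine sketch of the word-frequency example in Python. It is not Google's implementation: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_frequency`) and the two tiny "books" are made up for illustration, and a plain loop stands in for the legion of worker machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_words(chunk: str) -> List[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted counts by word so each word can be reduced on its own."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce step: collapse each word's list of counts into a single total."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_frequency(chunks: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over every chunk."""
    emitted: List[Tuple[str, int]] = []
    # Assumption: on a real cluster each chunk would go to a different worker;
    # here the loop simply processes them one after another.
    for chunk in chunks:
        emitted.extend(map_words(chunk))
    return reduce_counts(shuffle(emitted))


# Hypothetical stand-ins for two scanned books.
books = ["the whale and the sea", "the sea at night"]
frequencies = word_frequency(books)
print(frequencies["the"])  # 3
```

The resulting `frequencies` dictionary plays the role of the "data set about your data": once it exists, looking up any word is a single query rather than another pass over the whole library.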
MapReduce is one way in which Google manipulates its massive amounts of data, sorting and resorting it into different sets that reveal new meanings and have unique uses. But another Herculean task Google faces is dealing with data that’s not already on its machines. It’s one of the most daunting data sets of all: the internet.
Last month, Wired got a rare look at the “algorithm that rules the web,” and the gist of it is that there is no single, set algorithm. Rather, Google rules the internet by constantly refining its search technologies, charting new territories like social media and tuning the ones users tread most often with personalized searches.
But of course it’s not just about matching the terms people search for to the web sites that contain them. Amit Singhal, a Google Search guru, explains, “you are not matching words; you are actually trying to match meaning.”
Words are a finite data set. And you don’t need an entire data center to store them—a dictionary does just fine. But meaning is perhaps the most profound data set humanity has ever produced, and it’s one we’re charged with managing every day. Our own mental MapReduce probes for intent and scans for context, informing how we respond to the world around us.
In a sense, Google’s memory may be better than any one individual’s, and complex frameworks like MapReduce ensure that it will only continue to outpace us in that respect. But in terms of the capacity to process meaning, in all of its nuance, any one person could outperform all the machines in the Googleplex. For now, anyway. [Wired, Wikipedia, and Wired]
Image credit CNET
Memory [Forever] is our week-long consideration of what it really means when our memories, encoded in bits, flow in a million directions, and might truly live forever.
Map of IP addresses around the world used to commit Click-Fraud
A recently disbanded click fraud ring in China racked up $3 million worth of clicks in two weeks. $3 million that we’re aware of. Just how detectable is this whole business of racking up fraudulent ad revenue clicks?
That intricate mess of lines above represents a portion of DormRing1, the click fraud bunch that was caught in China. The lines show the relationship of some of the IP addresses involved in the fraud and how they are connected to some fraudulent ad clicks. The whole network actually “involved 200,000 different IP addresses and racked up more than $3 million worth of fraudulent clicks across 2,000 advertisers in a two-week period.” Impressive and scary at the same time.
The trouble is that no one really knows how much ad revenue DormRing1 collected before they were caught. Click-fraud monitoring services such as Anchor Intelligence, the ones behind this catch, are evolving to keep up with the scale on which these rings are operating. It’s still difficult to judge just how well they’re doing as they’re having to infiltrate forums and gain the trust of the perpetrators in a manner reminiscent of drug busts. But as the criminals are getting more elaborate, the investigations are too.
That good news aside, do me a favor: after you read this post, comment, and all that jazz, refresh the page a few times and—Ah…I mean, heh…just kidding. [TechCrunch]
