Saturday, August 28, 2010

What Google’s Data Center Can Teach You

You know how well your data center works, but haven’t you always wanted to know how it stacks up against other companies’ data centers?

The problem, of course, is that breaking into Fort Knox might be easier than finding out what’s what at another company’s data center.

Fortunately, Google is willing to share information.

Jeff Dean, a Google Fellow who has been with the company since 1999, gave a talk at the 2009 Web Search and Data Mining (WSDM) conference in which he revealed how Google puts together its data centers (PDF).

Because the meeting was in Barcelona, Spain, his speech didn’t receive the attention it deserved in the United States.

After all, wouldn’t you like to know how Google manages to do what it does? And how the company’s experience and expertise can help you gauge how your own data center measures up?

A Google data center starts with high-speed, multi-core CPUs, Dean revealed. Each of these servers has 16GB of RAM and fast 2TB (terabyte) hard drives.

These are kept in racks of 80 servers tied together with 10Gb Ethernet or other high-speed network fabrics.

Then, 30 or more of these racks are deployed as a single cluster. In addition, each rack and each cluster has dedicated servers whose only job is to manage and maintain the machines at that layer.

Finally, add petabytes of additional storage in storage area networks (SANs), and you have a single Google cluster.

A small Google data center consists of a minimum of 2,400 servers.
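A quick back-of-the-envelope sketch, using nothing but the figures above, shows where that 2,400 number comes from and what such a cluster adds up to (these are the low-end numbers, since Dean said “30 or more” racks):

```python
# Rough tally of a minimal Google-style cluster, using the figures above.
SERVERS_PER_RACK = 80
RACKS_PER_CLUSTER = 30        # "30 or more" -- this is the low end
RAM_GB_PER_SERVER = 16
DISK_TB_PER_SERVER = 2

servers = SERVERS_PER_RACK * RACKS_PER_CLUSTER
print(f"Servers per cluster: {servers}")                                     # 2,400
print(f"RAM per cluster:     {servers * RAM_GB_PER_SERVER / 1024:.1f} TB")   # ~37.5 TB
print(f"Local disk:          {servers * DISK_TB_PER_SERVER / 1024:.1f} PB")  # ~4.7 PB
# ...plus the petabytes of SAN storage mentioned above.
```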

These clusters all run a Google-optimized version of Ubuntu Linux, according to Google’s open source programs manager Chris DiBona in a 2010 presentation at OSCON, an open-source developer conference.

On top of that, the company uses a wide variety of open-source programs to build the Google search engine and the applications many of us use every day.

You might think that Google, which depends for its very existence on keeping its data centers running 24×7, 365 days a year, works very hard to keep its servers, racks, and clusters running no matter what.

You’d be right.

Even so, as Dean pointed out, “Things will crash. Deal with it!”

Even if you could “start with super reliable servers [with] mean times between failures (MTBF) of 30 years, and built a computing system with ten thousand of those, you’d still watch one fail per day,” Dean said.
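Dean’s arithmetic is easy to check for yourself; here is the same calculation as a tiny sketch:

```python
# Expected failures per day across a fleet of very reliable servers.
servers = 10_000
mtbf_years = 30                     # mean time between failures for one server

failures_per_year = servers / mtbf_years
print(f"{failures_per_year / 365:.2f} failures per day")   # ~0.9 -- roughly one a day
```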

In short, we can talk about five nines (99.999%) or even six nines (99.9999%) of reliability, but when we’re working with modern data centers it’s still a matter of when, not if, you’ll have failures.
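For perspective, here is what those nines actually buy you in downtime budget per year; even six nines only allows about half a minute:

```python
# Allowed downtime per year for common availability targets.
MINUTES_PER_YEAR = 365 * 24 * 60

for label, availability in [("three nines", 0.999),
                            ("four nines",  0.9999),
                            ("five nines",  0.99999),
                            ("six nines",   0.999999)]:
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label:>12}: {downtime:7.2f} minutes of downtime per year")
```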

At Google, Dean found that in an established cluster or data center, 1% to 5% of disk drives die each year and servers have a 2% to 4% annual failure rate.

In a new cluster’s first year, these kinds of failures and maintenance problems add up to real downtime. Using Dean’s statistics (roughly tallied in the sketch after the list), expect:

* 0.5 overheating events (power down most machines in under five minutes; expect 1-2 days to recover)
* 1 PDU (Power Distribution Unit) failure (about 500-1000 machines suddenly disappear, budget 6 hours to come back)
* 1 rack-move (You have plenty of warning: 500-1000 machines powered down, about 6 hours)
* 1 network rewiring (rolling 5% of machines down over 2-day span)
* 20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)
* 5 racks go wonky (40-80 machines see 50% packet loss)
* 8 network maintenances (4 might cause ~30-minute random connectivity losses)
* 12 router reloads (takes out DNS and external virtual IPs (VIPs) for a couple of minutes)
* 3 router failures (have to immediately pull traffic for an hour)
* dozens of minor 30-second blips for DNS
* 1000 individual machine failures
* thousands of hard drive failures

Finally, expect to see a variety of issues with slow disks, bad memory, misconfigured machines, flaky machines, and that enemy of data centers everywhere: the dreaded backhoe of doom.
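To get a feel for what that list adds up to, here is a rough, purely illustrative tally of the machine-level outages, using the midpoints of Dean’s ranges and ignoring the network-only events:

```python
# Very rough tally of machine outages in a new cluster's first year,
# using midpoints of the ranges listed above. Illustrative only.
events = [
    # (description, events per year, machines affected per event)
    ("PDU failure",                    1,  750),   # 500-1,000 machines
    ("rack move",                      1,  750),   # planned, with warning
    ("rack failures",                 20,   60),   # 40-80 machines each
    ("individual machine failures", 1000,    1),
]

outages = sum(per_year * machines for _, per_year, machines in events)
print(f"Machine outages per year: {outages}")       # 3,700
print(f"Roughly {outages / 365:.0f} machines lost per day, "
      "before counting the thousands of drive failures.")
```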

So, what can you do about these kinds of problems, which Google has to deal with on a daily basis? Google’s answer: first, plan on these problems being the norm, not the exception.

That means fault-tolerant software, hardware, and network infrastructure are a must. When, not if, a rack or cluster in your data center fails, your server farm needs to be able to keep soldiering on.
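What does fault-tolerant software mean in practice? At its simplest, never assuming any single machine will answer. The sketch below shows the basic retry-and-fail-over pattern; the replica names and the fetch() helper are hypothetical stand-ins for your own services:

```python
import random
import time

# Hypothetical replicas of the same service; in practice this list would come
# from your configuration or service-discovery layer.
REPLICAS = ["app-1.example.internal", "app-2.example.internal", "app-3.example.internal"]

def fetch(host: str, request: str) -> str:
    """Placeholder for a real RPC or HTTP call; assumed to raise on failure."""
    raise NotImplementedError

def resilient_fetch(request: str, attempts_per_host: int = 2) -> str:
    """Try each replica in a shuffled order, retrying briefly, before giving up."""
    last_error = None
    for host in random.sample(REPLICAS, len(REPLICAS)):   # spread the load around
        for attempt in range(attempts_per_host):
            try:
                return fetch(host, request)
            except Exception as err:
                last_error = err
                time.sleep(0.1 * (attempt + 1))            # brief backoff before retrying
    raise RuntimeError(f"all replicas failed: {last_error}")
```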

Next, Google relies on distributed systems: even a single search can touch hundreds of servers, according to Dean.

Google engineers and developers work hard to design their systems carefully so that when something does go pear-shaped, the problem is partitioned away from other data centers, clusters, and racks.

While you may not need to move to a full distributed-systems model for your own software, organizing your programs and data so that problems are partitioned away from each other when they do happen is a good idea.
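One modest way to apply that advice, nowhere near Google scale, is to fan a request out to independent partitions and accept partial results when one of them is down, so a single failure degrades the answer rather than killing it. A minimal sketch, assuming each shard object exposes a search() method of your own:

```python
import concurrent.futures as cf

def query_shards(shards, query, timeout_s=0.5):
    """Query every shard in parallel and return whatever arrives in time.

    `shards` is any list of objects with a search(query) method returning a
    list of hits -- a stand-in for your own partitioned services. Failed or
    slow shards are simply skipped, so one bad partition degrades the result
    instead of sinking the whole request.
    """
    results, failed = [], []
    with cf.ThreadPoolExecutor(max_workers=len(shards)) as pool:
        futures = {pool.submit(shard.search, query): shard for shard in shards}
        try:
            for future in cf.as_completed(futures, timeout=timeout_s):
                try:
                    results.extend(future.result())
                except Exception:
                    failed.append(futures[future])
        except cf.TimeoutError:
            # Some shards were too slow; serve what we have.
            pass
    return results, failed
```

Note that the thread pool still waits for stragglers when it shuts down; a production version would also cancel or bound those calls.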

In other words, just like death and taxes, you can count on endless problems at your data center.

If Google can’t get to 100% uptime, I doubt very much that your enterprise can either. But that’s not a bad thing.

By getting rid of the idea that data centers can be made perfect, you can move on to the far better idea of working to deal with problems as they come up.

That way, your data center – and everyone’s nerves – end up feeling better.
