Web Search For a Planet: The Google Cluster Architecture

Web Search For a Planet: The Google Cluster Architecture Luiz Andre Barroso Jeffrey Dean Urs Holzle Google

Outline • Introduction • Architecture Overview • Serving a Google Query • Design Principles of Google Clusters • The Power Problem • Parallelism • Hardware-level Application Characteristics • Conclusion

Introduction • Search engines • require high amounts of computation per request • A single query on Google (on average) • reads hundreds of megabytes of data • consumes tens of billions of CPU cycles • Instead of a smaller number of high-end servers • Google combines more than 15,000 commodity-class PCs

Architecture Overview • Provide reliability in software • Use cluster of PCs • Not high-end servers • Design system to maximize aggregate throughput • No server response time can parallelize individual • requests

Serving a Google Query Figure shows Google query-serving architecture

Serving a Google Query(Cont.) • Geographically distributed clusters • Each with a few thousand machines • First perform Domain Name System (DNS) lookup • Maps request to nearby cluster • Send HTTP request to selected cluster • Request serviced locally w/in that cluster • Clusters consist of Google Web Servers (GWSs) • Hardware-based load balancer distributes load among GWSs

Design Principles of Google Clusters • Software level reliability • No fault-tolerant hardware features; e.g. • redundant power supplies • A redundant array of inexpensive disks (RAID) • Use replication • for better request throughput and availability • Price/performance beats peak performance • CPUs giving the best performance per unit price • Not the CPUs with best absolute performance • Using commodity PCs • reduces the cost of computation

The Power Problem • Power is a very big issue • Google servers consumes 400-700 W/sq. ft • Typical commercial data center consumes 70-150 W/sq. ft. • Reducing power would reduce performance but it may not reduce cooling and power issues

Parallelism • Lookup of matching docs in a large index • Many lookups in a set of smaller indexes followed by a merge step • A query stream is of multiple streams each handled by a cluster • Adding machines to a pool increases serving capacity

Hardware-level Application Characteristics • Price/Performancebeats pure Performance • Use dual CPU servers rather than quad • Use IDE rather than SCSI • Use commodity PCs • Essentially mid-range PCs except for disks • No redundancy as in many high-end servers • Use rack mounted clusters connected via Ethernet

Conclusion Web search engine can be implemented like Google using: • Lots of commodity machines. • Parallel algorithms for latency. • Replication for throughput and reliability. • Software based reliability. • This resulting distributed system will have a very good price/performance ratio.

Web Search For a Planet: The Google Cluster Architecture

Web Search For a Planet: The Google Cluster Architecture

Presentation Transcript

Data Challenges and Fabric Architecture

Search Engines: Beyond Google

google talks

Google Architecture

Improving Search

Nowcasting RDU with trends

Beyond Google Search

Advanced Google Search

Google is the Internet’s most popular search engine. Largest search index Most users of any site

Architecture as non -activism?

G o o g l e

Google Scholar

How many moons does each planet have?

Ms M’s Top Ten Google Search Tips

Supporting Ranked Search in Parallel Search Cluster Networks

Make Your search easy with Google search engine google

mammoth-technology

Google Local Search API Research and Implementation

A Cluster-On-A-Chip Architecture For High Throughput Phylogeny Search

Revisited

Google Search Tips

Google Search Architecture