Scaling to the Modern Internet

Scaling to the Modern Internet CSCI 572: Information Retrieval and Search Engines Summer 2010

Outline • The paradigm shift: BigData • Search Engine Models for BigData • Map Reduce • GFS • Looking forward: what to do with the data • Upcoming technologies • Challenges

Grand Data Challenges • We’ve talked about the end-to-end search lifecycle • So, now what? • Projects are collecting huge amounts of data • Let’s take a few examples

The Square Kilometer Array • 1 sq. km of antennas • Never-before-seen resolution looking into the sky • 700 TB • Per second!

NASA DESDynI Mission • 16 TB/day • Geographically distributed • 10s of 1000s of jobs per day • Tier 1 Earth Science Decadal Mission
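To put the two data rates above on the same footing, here is a small back-of-the-envelope calculation. It assumes decimal units (1 EB = 1,000,000 TB) and that the 700 TB per second figure is sustained around the clock, which the slides do not state; the variable names are illustrative only.

```python
# Rough comparison of the SKA and DESDynI data rates quoted on the slides.
# Assumptions: decimal units, and the SKA rate is sustained for a full day.
SECONDS_PER_DAY = 24 * 60 * 60
TB_PER_EB = 1_000_000


def tb_per_day(tb_per_second):
    """Convert a sustained rate in TB/s to a daily volume in TB."""
    return tb_per_second * SECONDS_PER_DAY


ska_tb_day = tb_per_day(700)          # SKA: 700 TB per second
ska_eb_day = ska_tb_day / TB_PER_EB   # same volume in exabytes
desdyni_tb_day = 16                   # DESDynI: 16 TB per day
ratio = ska_tb_day / desdyni_tb_day

print(f"SKA: {ska_tb_day:,} TB/day (about {ska_eb_day:.2f} EB/day)")
print(f"SKA produces {ratio:,.0f}x the daily volume of DESDynI")
# SKA: 60,480,000 TB/day (about 60.48 EB/day)
# SKA produces 3,780,000x the daily volume of DESDynI
```

Even the smaller of the two missions already exceeds the 10-100s of GB typical of a search index every day, which is what motivates the scaling question on the next slide.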

How do we scale? • Biggest search engines are on the order of 40B records • Size on disk in the 10-100s of GB range • Web pages, other forms of content are fairly small • What happens when we have • Indexes on the order of 10 x 40B? What about 100x? • Large data files that folks want to make available?

One solution: Commodity • Early 2000s • Google decides to buy up a bunch of Intel P3 computers with IDE slab disk • Super cheap • Everyone thought exotic expensive hardware was the way to do large scale computing • Problem: cheap hardware fails a lot

One solution: Commodity • Solve the reliability problem in software • Replicate data across the disks for resiliency • Queue up multiple copies of the same job to ensure at least one completes • CPU and disk are cheap, and otherwise underutilized, so why not • Suggests an infrastructure as the means of dealing with resiliency • Developers need to be able to write their code in familiar programming constructs, while leveraging the underlying commodity hardware
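The idea of queuing multiple copies of the same job can be sketched in a few lines. This is a toy illustration, not how Google's infrastructure works: threads stand in for cluster nodes, and `run_redundantly` and `flaky_word_count` are hypothetical names invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_redundantly(job, copies=3):
    """Submit `copies` attempts of the same job and return the first success.

    Threads stand in for unreliable commodity nodes (an assumption made
    for illustration only).
    """
    errors = []
    with ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(job, attempt) for attempt in range(copies)]
        for future in as_completed(futures):
            try:
                return future.result()
            except Exception as exc:  # a failed copy is expected, not fatal
                errors.append(exc)
    raise RuntimeError(f"all {copies} copies failed: {errors}")


def flaky_word_count(attempt):
    """Hypothetical job: the first two attempts land on 'failed' nodes."""
    if attempt < 2:
        raise IOError(f"simulated disk failure on attempt {attempt}")
    return len("the quick brown fox".split())


print(run_redundantly(flaky_word_count, copies=3))  # 4
```

The caller never sees the two simulated failures. That is the point of handling resiliency in the infrastructure: the job itself stays ordinary code.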

Google: GFS and Map Reduce • Two seminal papers published • Google File System: ACM SOSP, 2003 • http://labs.google.com/papers/gfs.html • Map Reduce distributed programming model: OSDI, 2004 • http://labs.google.com/papers/mapreduce.html • Taught the world how Google was able to make use of those clusters of 1000s of nodes built on cheap Pentium 3s and IDE disks

Google Infrastructure Infusion • Rewrote their production crawling system on top of GFS and Map Reduce • Reduced time to crawl the web by orders of magnitude • Allowed developers to write simple map and reduce functions that could then scale out • Users wanted structured data on top of the underlying core • Big Table: OSDI, 2006 • http://labs.google.com/papers/bigtable.html • Column Oriented Database
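To make "simple map and reduce functions" concrete, here is a minimal single-process sketch of the programming model using word count. It is an illustration only: `run_mapreduce` is a hypothetical in-memory driver written for this example, whereas a real framework would distribute the map, shuffle, and reduce phases across many machines.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Map: emit (word, 1) for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all the counts emitted for one word."""
    return word, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Toy driver: map every record, group values by key, then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)   # the "shuffle" step
    return dict(reducer(k, vs) for k, vs in sorted(groups.items()))


docs = [("d1", "the web is big"), ("d2", "the web grows")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# {'big': 1, 'grows': 1, 'is': 1, 'the': 2, 'web': 2}
```

The developer writes only `map_fn` and `reduce_fn`. Because each map call sees one record and each reduce call sees one key, the framework is free to run them on as many nodes as are available.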

The Open Source World • Doug Cutting decided in 2006 that the Google papers on Map Reduce and GFS were the appropriate guidance to take his open source search engine project, Nutch, and overcome its limitations of scaling to multiple computers • He and Mike Cafarella went off and branched Nutch and implemented a version of Nutch built on a GFS-like system, and on M/R

The origin of scalable OSS ecosystems • Once M/R and NDFS (the Nutch Distributed File System) were implemented, many folks became interested in just the M/R and NDFS infra • Branched off into the Hadoop project • Eventually Mike Cafarella and others decided to implement BigTable => HBase

Assumptions • You have a job that runs for a really long time on sets of independent, “shared nothing” infrastructure • Your job is mostly data independent (i.e. your job doesn’t have to wait on the results of the prior job to run, etc.) • “Embarrassingly” parallel • You can program your algorithm or job in M/R • Not always the easiest mapping • See: http://berlinbuzzwords.de/content/nutch-web-mining-platform-present-and-future for how Nutch did it
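The "data independent" assumption can be illustrated with a small sketch: each shard is processed without reference to any other, and the partial results are combined only at the end. Threads stand in for separate machines here, and `count_matches` and `parallel_count` are hypothetical names chosen for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matches(shard, term):
    """Count documents in one shard that contain `term` as a whole word."""
    return sum(1 for doc in shard if term in doc.lower().split())


def parallel_count(docs, term, workers=4):
    """Split docs into independent shards, process each, sum the partials."""
    shards = [docs[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_matches, shards, [term] * workers)
        return sum(partials)


docs = [
    "Nutch crawls the web",
    "Hadoop grew out of Nutch",
    "GFS stores large files",
    "the web keeps growing",
    "search the web at scale",
]
print(parallel_count(docs, "web"))  # 3
```

No shard waits on another, so adding workers only changes how the list is sliced. A job where step N needs the output of step N-1 would not decompose this way, which is the difficulty raised for science data systems.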

Science Data Systems • Need search • Have web-scale knowledge bases that need to be made available to scientists • Job processing is traditionally not embarrassingly parallel • How to leverage Hadoop and Nutch and all of the scalable search technologies?

Build out Reusable SDS Infra

Dump the data • Scale out and treat SDS as gold source • Make Search available as a “service” back to the SDS jobs • Leverage commodity hardware and open source infrastructures

Example: NASA PDS

Where it’s going • Amazon • Elastic Compute Cloud (EC2) • Simple Storage Service (S3) • …and many others • Rackspace • Microsoft Azure • Public versus Private cloud

Clouds vs. Grids: Clouds • lowest common denominator services (compute/store), that are broadly applicable independent of application domain • scalability and performance improvements come at economic cost, amortized • must provide externally accessible APIs or service interfaces to the internal workings of the cloud to leverage “cloud” in your application. I.e., you aren’t “cloud” if you are doing computation and storage locally using UNIX pipes and filters • does not explicitly deal with virtual organizations • constructing clouds is hard and should not be attempted by those inexperienced in the domain

Clouds vs. Grids: Grids • focused on creation of virtual organizations • focused on scientific applications • at least the successful attempts • goal is to provide all software to enable creation of virtual organizations • very few grid solutions that provide services in all 5 of the grid’s architectural layers. • grid systems/applications are not built with extensibility in mind. • More exploratory • focused on the creation of entire “systems” rather than low level “services”

Challenges • Overcoming the complexity of new programming models • It’s not terribly easy to program in M/R or even in newer constructs like leveraging cloud services • Testing things at scale is difficult • Do you have a 2000 node cluster lying around? • Do you have the $$$ to pay for it on EC2? • Makes it hard to integrate patches and update software because you have to test it at scale

Wrapup • The scalability of the web is only increasing • Software to deal with the web scale has to be resilient against failure • If you use commodity hardware, which seems to be a great trend • Several successful commercial and open source examples at scale • Stormy weather ahead: clouds • Dealing with the challenges
