Google: Case Study

Google: Case Study
cs430 lecture 15
03/13/01
Kamen Yotov

Introduction: What’s new?
• Amount of web information growing
• Number of inexperienced users growing
• Surfers willing to start from indices like Yahoo!
  • Expensive to build and maintain
  • Slow to improve
  • Cannot cover all topics!
• Google – large scale search engine
  • Name from “googol” = 10^100
  • Makes heavy use of additional structure = quality results

Introduction (continued…)
• Search engine technology has to scale
• Server requests have to scale similarly…
• Technology advances help… but not so much!
  • E.g. disk seek time, operating system problems
• Expect cost of indexing/storing text/HTML to drop relative to amount of information available!

Main goals: Quality, Quality,…
• Completeness of index is just one factor
• Lots of junk in the results
• Number of documents increases exponentially, but user ability does not!
• High precision very important!
• Link structure & link text are valuable
• … Not much information; Commercial!

Features: PageRank
• Heavy use of the link structure
• Performs well even indexing only the titles
• Counting links to a page
• Weights on the sources
• Page A has pages Ti pointing to it.
• d: damping factor
• C(A): # of links out of A
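In the published PageRank formulation, the quantities above combine as PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). A minimal Python sketch of iterating that formula follows. The toy graph, the function name, d = 0.85, and the fixed iteration count are illustrative assumptions, not taken from the lecture.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (hypothetical toy input).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {page: 1.0 for page in pages}  # arbitrary starting values
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each source T contributes its rank divided by its out-link count C(T).
            incoming = sum(
                rank[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # assumed example graph
ranks = pagerank(toy_links)
```

In this toy graph C is pointed to by both A and B, so it ends up ranked above B, which only receives half of A's weight.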

Related Work: Applicability
• Information retrieval
  • Size does matter! Corpora considered large are small by the standards of Web search (20GB/147GB)
  • Vector methods often tend to return short documents
  • Argument: Users should specify more concretely what they search for! Google: disagree!
• Other differences from controlled collections
  • No format or language restrictions, no control
  • Extended meta information

From Inside…
• Mostly C/C++
• Solaris/Linux
• Module-based architecture
• Multi-machine
• Multi-thread
• Resource dedication

Major Structures
• BigFiles
  • Span several file systems
  • 64-bit addressed
  • Descriptor management
  • Compression
• Document index
  • ISAM (Indexed Sequential Access Method), ordered by docID
  • Pointer to Repository, Status, Statistics
  • Pointer to URL and Title in docinfo file if crawled
  • URL to docID conversion (checksum)
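The URL to docID conversion can be pictured as a table of (checksum, docID) pairs kept sorted by checksum and probed with a binary search. The sketch below assumes that layout. `zlib.crc32` stands in for the unspecified checksum, the function names are hypothetical, and checksum collisions are ignored.

```python
import bisect
import zlib


def url_checksum(url):
    # zlib.crc32 is a stand-in; the checksum actually used is not given (assumption).
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return a list of (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup_docid(table, url):
    """Binary-search the sorted table; return the docID, or None if absent."""
    key = url_checksum(url)
    # (key,) sorts before every (key, docID) pair, so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


table = build_table({"http://a.example/": 1, "http://b.example/": 2})
```

Keeping the table sorted also allows a batch of URLs to be converted in one merge pass instead of one seek per URL.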

Major Structures (continued)
• Repository
  • Zlib compressed
  • docID, Length, URL
  • Self-consistent data
• Lexicon
  • Memory resident
  • List of words and a hash-table of pointers
  • Other auxiliary information… (out of scope)
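A repository record of this kind can be sketched as a small fixed header followed by the URL and the zlib-compressed page. The header layout below (docID, URL length, compressed length) is an assumption for illustration, as are the function names.

```python
import struct
import zlib

# Assumed layout: 4-byte docID, 2-byte URL length, 4-byte compressed page length.
HEADER = struct.Struct("<IHI")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL bytes + zlib-compressed page."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(url_bytes), len(body)) + url_bytes + body


def unpack_record(buf, offset=0):
    """Read one record at `offset`; return (docID, URL, page, next offset)."""
    doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    url = buf[start:start + url_len].decode("utf-8")
    body = buf[start + url_len:start + url_len + body_len]
    html = zlib.decompress(body).decode("utf-8")
    return doc_id, url, html, start + url_len + body_len


repo = pack_record(1, "http://a.example/", "<html>first</html>")
repo += pack_record(2, "http://b.example/", "<html>second</html>")
```

Because every record carries its own lengths, the file can be scanned front to back without any other structure, which is one reading of "self-consistent data".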

Major Structures (continued 2)
• Hit Lists
  • Occurrence of a word in a document + typesetting information (hand-encoded)
  • Take up most of the space of all indices
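The published description of the system packs a plain hit into two bytes: one capitalization bit, three bits of font size, and twelve bits of word position. A minimal sketch of that kind of hand-encoding, assuming this 1/3/12 split, is shown below. The function names and the clamping of large positions to 4095 are illustrative choices.

```python
def encode_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit capitalization, 3 bits font, 12 bits position."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)  # assumed: positions past 12 bits saturate
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


packed = encode_hit(True, 3, 42)
```

Two bytes per hit matters because hit lists dominate the size of both the forward and the inverted index.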

Major Structures (continued 3)
• Forward Index
  • Partially sorted
  • Stored in a number of barrels
  • Each barrel holds a range of wordIDs + hitlist

Major Structures (continued 4)
• Inverted Index
  • Same barrels, but processed by the sorter
  • Doclists are not ordered by occurrence ranking, for the sake of speed
  • Two sets of inverted barrels
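The step from forward to inverted barrel amounts to regrouping (docID, wordID, hitlist) entries by wordID. A minimal in-memory sketch follows. The tuple layout and function name are assumptions, and the real sorter works on disk-resident barrels rather than Python dictionaries.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn (docID, wordID, hits) entries into {wordID: [(docID, hits), ...]}."""
    inverted = defaultdict(list)
    for doc_id, word_id, hits in forward_barrel:
        inverted[word_id].append((doc_id, hits))
    for doclist in inverted.values():
        # Order each doclist by docID so lists for different words can be merged.
        doclist.sort(key=lambda entry: entry[0])
    return dict(sorted(inverted.items()))


forward = [(2, 10, [1]), (1, 10, [4, 7]), (1, 11, [0])]  # toy forward barrel
inverted = invert_barrel(forward)
```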

Crawling the Web
• We talked before…
• Fragile, beyond our control
• Implemented in Python
• Internal DNS cache for each crawler
• Social issues
  • Phone calls, support
  • Preventing indexing
• Virtually unable to debug… just test!
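A per-crawler DNS cache can be as small as a dictionary in front of the resolver. The sketch below is only an illustration of the idea: the class name is hypothetical, entries never expire, and the resolver is injectable so the cache can be exercised without network access.

```python
import socket


class DnsCache:
    """Remember host -> address lookups so each host is resolved only once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver  # injectable; defaults to the system resolver
        self._cache = {}

    def resolve(self, host):
        if host not in self._cache:
            self._cache[host] = self._resolver(host)
        return self._cache[host]


calls = []


def fake_resolver(host):
    # Stand-in resolver for the example; records every real lookup.
    calls.append(host)
    return "192.0.2.1"


cache = DnsCache(resolver=fake_resolver)
first = cache.resolve("a.example")
second = cache.resolve("a.example")
```

The second call is served from the dictionary, so the resolver is hit once per host rather than once per URL.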

Indexing the Web
• Parsing problems
  • Errors in HTML
  • Non-ASCII characters
  • Home-grown parser (not YACC)
• Indexing documents into barrels
  • Shared lexicon – too much locking
  • Log file of new words… processed at end
• Sorting

Searching
• Parse the query
• Convert words to wordIDs
• Seek to start of doclist in the short barrel for every word
• Scan through until a document that matches all terms is encountered
• Compute the rank of that document
• Repeat the same thing for the full barrel
• Sort the documents matched by rank and return the first few
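The steps above can be sketched over a single in-memory barrel. The lexicon and index contents, the function names, and the ranking function are all assumptions for illustration. The sketch intersects doclists with sets rather than scanning them in docID order, and it leaves out the short/full barrel distinction.

```python
def search(query, lexicon, inverted, rank_fn, top_k=10):
    """Return up to top_k docIDs containing every query word, best rank first."""
    # Parse the query and convert words to wordIDs.
    word_ids = []
    for word in query.lower().split():
        if word not in lexicon:
            return []  # an unknown word cannot match any document
        word_ids.append(lexicon[word])
    if not word_ids:
        return []
    # One {docID: hits} mapping per query word.
    doclists = [dict(inverted.get(word_id, [])) for word_id in word_ids]
    # Keep only documents that match all terms.
    matching = set(doclists[0])
    for doclist in doclists[1:]:
        matching &= set(doclist)
    # Compute the rank of each match, then sort by rank (ties broken by docID).
    scored = [
        (rank_fn(doc_id, [doclist[doc_id] for doclist in doclists]), doc_id)
        for doc_id in matching
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


toy_lexicon = {"bill": 1, "clinton": 2}
toy_inverted = {1: [(10, [0, 5]), (11, [3])], 2: [(10, [1]), (12, [2])]}


def count_hits(doc_id, hitlists):
    # Placeholder ranking: total number of hits; the real ranking is far richer.
    return sum(len(hits) for hits in hitlists)


results = search("bill clinton", toy_lexicon, toy_inverted, count_hits)
```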

Query: bill clinton
http://www.whitehouse.gov/ 100.00% (no date) (0K) http://www.whitehouse.gov/
Office of the President 99.67% (Dec 23 1996) (2K) http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
Welcome To The White House 99.98% (Nov 09 1997) (5K) http://www.whitehouse.gov/WH/Welcome.html
Send Electronic Mail to the President 99.86% (Jul 14 1997) (5K) http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
mailto:president@whitehouse.gov 99.98%
mailto:President@whitehouse.gov 99.27%
The "Unofficial" Bill Clinton 94.06% (Nov 11 1997) (14K) http://zpub.com/un/un-bc.html
Bill Clinton Meets The Shrinks 86.27% (Jun 29 1997) (63K) http://zpub.com/un/un-bc9.html
President Bill Clinton - The Dark Side 97.27% (Nov 10 1997) (15K) http://www.realchange.org/clinton.htm
$3 Bill Clinton 94.73% (no date) (4K) http://www.gatewy.net/~tjohnson/clinton1.html

Results and Performance
• Quality of results
• Manual ranking
• Sorting
• PageRank
• Anchor text
• Proximity
• Broken links

Performance
• Storage
  • Scales with the size of the Web
  • Repository is comparatively small
  • Good/fast compression/decompression
• System
  • Crawling, Indexing, Sorting
  • Last two simultaneously
• Searching
  • Bounded by disk I/O over LAN (NFS)

Conclusion
• Google:
  • Scalable search engine
  • Complete architecture
• Many research ideas arise
  • Always something to improve
  • Matter of time
• High quality search is the dominant factor
