Google: Case Study cs430 lecture 15 03/13/01 Kamen Yotov
Introduction: What’s new? • Amount of web information growing • Number of inexperienced users growing • Surfers willing to start from indices like Yahoo! • Expensive to build and maintain; • Slow to improve; • Cannot cover all topics! • Google – large scale search engine • Name from “googol” = 10^100 • Makes heavy use of additional structure = quality results
Introduction (continued…) • Search engine technology has to scale • Server requests have to scale similarly… • Technology advances help… but not so much! • E.g. disk seek time, operating system problems • Expect cost of indexing/storing text/html to drop relative to amount of information available!
Main goals: Quality, Quality,… • Completeness of index is just one factor • Lots of junk in the results • Number of documents increases exponentially, but users’ ability to look through them does not! • High precision very important! • Link structure & link text are valuable • … Not much information; Commercial!
Features: PageRank • Heavy use of the link structure • Performs well even indexing only the titles • Counting links to a page • Weights on the sources • Page A has pages Ti pointing to it. • d: damping factor • C(A): # of links out of A
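The formula that went with these definitions did not survive on the slide. In the form published by Brin and Page it reads PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The following is a minimal Python sketch of iterating that formula until the ranks settle. The toy graph, the function name `pagerank`, the iteration count, and the value d = 0.85 are illustrative assumptions, not material from the lecture.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical toy input).
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no out-links passes on no rank in this sketch
            share = d * rank[source] / len(targets)  # PR(T) / C(T), damped
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C collects rank from both A and B, so it ends up ranked highest, while B, fed only by half of A's links, ends up lowest.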
Related Work: Applicability • Information retrieval • Size does matter! Corpora considered large are small by Web search standards (20GB/147GB) • Vector methods often tend to return short documents • Argument: Users should specify more concretely what they search for! Google: disagree! • Other differences from controlled collections • No format or language restrictions, no control • Extended meta information
From Inside… • Mostly C/C++ • Solaris/Linux • Module-based architecture • Multi-machine • Multi-thread • Resource dedication
Major Structures • BigFiles • Span several file systems • 64-bit addressed • Descriptor management • Compression • Document index • ISAM (indexed sequential access method), ordered by docID • Pointer to Repository, Status, Statistics • Pointer to URL and Title in docinfo file if crawled • URL to docID conversion (checksum)
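The URL to docID conversion can be sketched as a table of (checksum, docID) pairs kept sorted by checksum and searched by bisection. This is only a sketch under assumptions: the slide says that a checksum is used but not which one, so CRC-32 stands in here, and the helper names are hypothetical.

```python
import bisect
import zlib


def url_checksum(url):
    # CRC-32 is a stand-in; the slide does not say which checksum is used.
    return zlib.crc32(url.encode("utf-8"))


def build_checksum_table(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((url_checksum(u), d) for u, d in url_to_docid.items())


def lookup_docid(table, url):
    """Binary-search the sorted table; return None if the URL is unknown.

    Checksum collisions are ignored in this sketch.
    """
    target = url_checksum(url)
    i = bisect.bisect_left(table, (target,))
    if i < len(table) and table[i][0] == target:
        return table[i][1]
    return None
```

Keeping the table sorted also allows a batch of URLs to be converted in one merge pass rather than one seek per URL.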
Major Structures (continued) • Repository • Zlib compressed • docID, Length, URL • Self-consistent data • Lexicon • Memory resident • List of words and a hash-table of pointers • Other auxiliary information… (out of scope)
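A repository record of this kind might be sketched as below, using Python's `zlib` and `struct` modules. The exact byte layout (field widths and order in `HEADER`) and the function names are assumptions made for illustration; the slide only lists docID, length and URL as record fields.

```python
import struct
import zlib

# Assumed header layout: docID, compressed body length, URL length.
HEADER = struct.Struct("<IIH")


def pack_record(doc_id, url, html):
    """Serialize one document as header + URL + zlib-compressed HTML."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    return HEADER.pack(doc_id, len(body), len(url_bytes)) + url_bytes + body


def iter_records(blob):
    """Scan records back to back; no other index is needed to read them."""
    offset = 0
    while offset < len(blob):
        doc_id, body_len, url_len = HEADER.unpack_from(blob, offset)
        offset += HEADER.size
        url = blob[offset:offset + url_len].decode("utf-8")
        offset += url_len
        html = zlib.decompress(blob[offset:offset + body_len]).decode("utf-8")
        offset += body_len
        yield doc_id, url, html
```

Because every record carries its own lengths, the other structures can be rebuilt from a sequential scan of the repository alone, which is one reading of "self-consistent data".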
Major Structures (continued 2) • Hit Lists • Occurrences of a word in a document + typesetting information (hand-encoded) • Take up most of the space in the indices
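A hand-encoded hit can be sketched as a 16-bit integer. The layout below (1 capitalization bit, 3 font-size bits, 12 position bits) is modeled on the published description of Google's plain hits; the function names and the clamping of large positions are assumptions of this sketch.

```python
def encode_hit(capitalized, font_size, position):
    """Pack one word occurrence into 16 bits: cap(1) | font size(3) | position(12)."""
    if not 0 <= font_size < 8:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 0xFFF)  # assumed: positions past 4095 are clamped
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF
```

Two bytes per occurrence is what keeps hit lists affordable, given that they dominate index size.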
Major Structures (continued 3) • Forward Index • Partially sorted • Stored in a number of barrels • Each barrel holds a range of wordIDs + hit lists
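The barrel layout can be sketched as follows: each document's hit lists are split by wordID range and appended to the matching barrel. The barrel width, the in-memory lists, and the function names are illustrative assumptions; real barrels would be files.

```python
from collections import defaultdict

BARREL_WIDTH = 1000  # wordIDs per barrel; an arbitrary illustrative value


def barrel_for(word_id):
    """Return the number of the barrel whose wordID range contains word_id."""
    return word_id // BARREL_WIDTH


def add_document(barrels, doc_id, hits_by_word):
    """Append one document's hit lists to the barrels covering its wordIDs."""
    per_barrel = defaultdict(list)
    for word_id, hits in sorted(hits_by_word.items()):
        per_barrel[barrel_for(word_id)].append((word_id, hits))
    for barrel_no, entries in per_barrel.items():
        barrels[barrel_no].append((doc_id, entries))


barrels = defaultdict(list)
add_document(barrels, doc_id=7, hits_by_word={12: [3, 9], 1500: [4]})
add_document(barrels, doc_id=8, hits_by_word={15: [1]})
```

The result is "partially sorted" in the sense that entries are grouped by wordID range but, inside a barrel, still appear in document order.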
Major Structures (continued 4) • Inverted Index • Same barrels, but processed by the sorter • Doclists are not ordered by a ranking of the word’s occurrences, for the sake of speed • Two sets of inverted barrels
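What the sorter does to a barrel can be sketched as regrouping its entries by wordID. Ordering each doclist by docID is an assumption of this sketch (it keeps later merging simple); the slide only says the lists are not ordered by rank. The function name and the tuple shapes are hypothetical.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Turn (docID, [(wordID, hits), ...]) entries into wordID -> doclist.

    Each doclist is a list of (docID, hits) pairs sorted by docID, not by rank,
    so doclists for several words can later be merged in a single pass.
    """
    inverted = defaultdict(list)
    for doc_id, entries in forward_barrel:
        for word_id, hits in entries:
            inverted[word_id].append((doc_id, hits))
    for doclist in inverted.values():
        doclist.sort(key=lambda entry: entry[0])
    return dict(inverted)
```

Usage: `invert_barrel([(8, [(12, [1])]), (7, [(12, [3, 9])])])` yields a single doclist for wordID 12 with document 7 ahead of document 8.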
Crawling the Web • We talked about this before… • Fragile, beyond our control • Implemented in Python • Internal DNS cache for each crawler • Social issues • Phone calls, support • Preventing indexing • Virtually unable to debug… just test!
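The per-crawler DNS cache can be sketched as a small wrapper around a resolver function. The class name, the injectable `resolver` parameter, and the absence of any expiry policy are assumptions of this sketch; `socket.gethostbyname` is the standard-library default used when no resolver is supplied.

```python
import socket


class DnsCache:
    """Per-crawler cache of hostname -> IP address lookups (no expiry in this sketch)."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._cache = {}
        self.misses = 0

    def resolve(self, hostname):
        key = hostname.lower()  # hostnames are case-insensitive
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._resolver(key)
        return self._cache[key]
```

A crawler that fetches many pages from the same host then pays for one lookup per host instead of one per page.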
Indexing the Web • Parsing problems • Errors in HTML • Non-ASCII characters • Home-grown parser (not YACC) • Indexing documents into barrels • Shared lexicon – too much locking • Log file of new words… processed at end • Sorting
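The way a log of new words avoids locking the shared lexicon can be sketched like this: each indexer reads a fixed base lexicon and appends unknown words to its own log, and a single pass at the end assigns the missing wordIDs. The function names and the ID assignment scheme are assumptions made for illustration.

```python
def index_words(words, base_lexicon, new_word_log):
    """Map words to wordIDs using a read-only base lexicon.

    Words missing from the base lexicon are appended to new_word_log instead of
    being inserted into the shared structure, so no lock is needed here.
    """
    word_ids = []
    for word in words:
        word_id = base_lexicon.get(word)
        if word_id is None:
            new_word_log.append(word)
        else:
            word_ids.append(word_id)
    return word_ids


def merge_logs(base_lexicon, logs):
    """Run once, after all indexers finish: assign IDs to the logged words."""
    lexicon = dict(base_lexicon)
    next_id = max(lexicon.values(), default=-1) + 1
    for log in logs:
        for word in log:
            if word not in lexicon:
                lexicon[word] = next_id
                next_id += 1
    return lexicon
```

The documents containing logged words still have to be revisited once the new wordIDs exist, which this sketch leaves out.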
Searching • Parse the query • Convert words to wordIDs • Seek to start of doclist in the short barrel for every word • Scan through until a document that matches all terms is encountered • Compute the rank of that document • Repeat the same thing for the full barrel • Sort the documents matched by rank and return the first few
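The steps above can be sketched in Python as follows. This is a simplification under stated assumptions: barrels are modeled as in-memory dicts (wordID -> {docID: hits}), matching uses set intersection rather than a sequential scan of the doclists, and `rank` is a caller-supplied placeholder, not Google's ranking function. All names are hypothetical.

```python
def matching_docs(doclists):
    """Return docIDs present in every doclist (each a dict docID -> hits)."""
    if not doclists:
        return set()
    common = set(doclists[0])
    for doclist in doclists[1:]:
        common &= set(doclist)
    return common


def search(query, lexicon, short_barrel, full_barrel, rank, limit=10):
    words = query.lower().split()                    # parse the query
    if not words or any(w not in lexicon for w in words):
        return []                                    # an unknown word matches nothing
    word_ids = [lexicon[w] for w in words]           # convert words to wordIDs
    scores = {}
    for barrel in (short_barrel, full_barrel):       # short barrel first, then full
        doclists = [barrel.get(word_id, {}) for word_id in word_ids]
        for doc_id in matching_docs(doclists):       # documents matching all terms
            if doc_id not in scores:
                hits = [doclist[doc_id] for doclist in doclists]
                scores[doc_id] = rank(doc_id, hits)  # compute the rank
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:limit]                            # return the first few
```

A document found in the short barrel keeps the score computed there; the full barrel only contributes documents not already matched.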
Results and Performance • Quality of results • Manual ranking • Sorting • PageRank • Anchor text • Proximity • Broken links
Query: bill clinton
http://www.whitehouse.gov/ 100.00% (no date) (0K) http://www.whitehouse.gov/
Office of the President 99.67% (Dec 23 1996) (2K) http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
Welcome To The White House 99.98% (Nov 09 1997) (5K) http://www.whitehouse.gov/WH/Welcome.html
Send Electronic Mail to the President 99.86% (Jul 14 1997) (5K) http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
mailto:president@whitehouse.gov 99.98%
mailto:President@whitehouse.gov 99.27%
The "Unofficial" Bill Clinton 94.06% (Nov 11 1997) (14K) http://zpub.com/un/un-bc.html
Bill Clinton Meets The Shrinks 86.27% (Jun 29 1997) (63K) http://zpub.com/un/un-bc9.html
President Bill Clinton - The Dark Side 97.27% (Nov 10 1997) (15K) http://www.realchange.org/clinton.htm
$3 Bill Clinton 94.73% (no date) (4K) http://www.gatewy.net/~tjohnson/clinton1.html
Performance • Storage • Scales with the size of the Web • Repository is comparatively small • Good/Fast compression/decompression • System • Crawling, Indexing, Sorting • The last two run simultaneously • Searching • Bounded by disk I/O over LAN (NFS)
Conclusion • Google: • Scalable search engine • Complete architecture • Many research ideas arise • Always something to improve • Matter of time • High quality search is the dominant factor