Explore web crawling, inverted indices, and search engine infrastructure in the context of database management systems (DBMS). Understand how a web index is built, how the web is crawled, and how inverted indices support efficient search. Learn about Google-like infrastructure and how search results are ranked.
C20.0046: Database Management Systems • Lecture #22 • M.P. Johnson • Stern School of Business, NYU • Spring 2008
Agenda • Websearch • Datamining • RAID • XML? • Regexs? • Indices? • Etc. (which?)
Next topic: Websearch • DBMS queries use tables and (optionally) indices • How does a search engine search the web? • First thing to understand about websearch: • we never run queries on the web • Way too expensive, for several reasons • Instead: • Build an index of the web • Search the index • Return the links found
Crawling • To obtain the data for the index, we crawl the web • Automated web-surfing • Conceptually very simple, but difficult to do robustly • First, must get pages (a minimal sketch follows below) • Prof. Davis (NYU/CS)'s example: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java • Run the program on grid.stern.nyu.edu: sales> cd ~mjohnson/public_html/dbms/websearch sales> java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms
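For concreteness, here is a minimal BFS-style crawler sketch in Java. It is not the WebCrawler.java linked above: it assumes the java.net.http client (Java 11+), extracts links with a naive regex, ignores robots.txt, politeness delays, and DNS caching, and the default seed URL is only a placeholder.

  import java.net.URI;
  import java.net.http.HttpClient;
  import java.net.http.HttpRequest;
  import java.net.http.HttpResponse;
  import java.util.*;
  import java.util.regex.*;

  public class TinyCrawler {
      public static void main(String[] args) throws Exception {
          String seed = args.length > 0 ? args[0] : "http://example.com/";   // placeholder seed URL
          HttpClient client = HttpClient.newHttpClient();
          Pattern href = Pattern.compile("href=\"(http[^\"]+)\"");           // naive link extraction
          Deque<String> frontier = new ArrayDeque<>(List.of(seed));          // BFS queue of URLs to visit
          Set<String> seen = new HashSet<>(frontier);                        // never enqueue a URL twice
          while (!frontier.isEmpty() && seen.size() < 50) {                  // small cap for the demo
              String url = frontier.poll();
              System.out.println("fetching " + url);
              String body;
              try {
                  body = client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                                     HttpResponse.BodyHandlers.ofString()).body();
              } catch (Exception e) {
                  continue;                                                  // skip non-responsive/bad servers
              }
              Matcher m = href.matcher(body);
              while (m.find()) {                                             // enqueue newly discovered links
                  String link = m.group(1);
                  if (seen.add(link)) frontier.add(link);
              }
          }
      }
  }

Even this toy version runs into the issues on the next slides: slow DNS look-ups, dead servers, malformed HTML, and duplicate content.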
Crawling issues in practice • DNS bottleneck • to fetch a page from a text link, must first resolve its hostname to an IP address • Brin/Page claim: ~87% of crawling time went to DNS look-up • Search strategy? • Refresh strategy? • Primary key for webpages • Use artificial DocIDs, not URLs • more popular pages get shorter DocIDs (why?)
Crawling issues in practice • Non-responsive servers • Bad HTML requires tolerant parsing • Content-seen test • compute a fingerprint/hash (again!) of the page content to detect duplicates (sketch below) • robots.txt • http://www.robotstxt.org/wc/robots.html
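A sketch of the content-seen test, assuming an MD5 fingerprint over lightly normalized page text; the hash choice and the normalization are illustrative assumptions, not the exact scheme Brin/Page used.

  import java.nio.charset.StandardCharsets;
  import java.security.MessageDigest;
  import java.util.HashSet;
  import java.util.Set;

  public class ContentSeen {
      private final Set<String> fingerprints = new HashSet<>();

      // Returns true if this page body was already crawled (possibly under another URL).
      public boolean alreadySeen(String pageBody) throws Exception {
          String normalized = pageBody.replaceAll("\\s+", " ").trim().toLowerCase();
          byte[] digest = MessageDigest.getInstance("MD5")
                                       .digest(normalized.getBytes(StandardCharsets.UTF_8));
          StringBuilder hex = new StringBuilder();
          for (byte b : digest) hex.append(String.format("%02x", b));
          return !fingerprints.add(hex.toString());   // add() returns false if the fingerprint was already present
      }
  }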
Inverted indices • Which data structure for the index? • An inverted index mapping words to pages • First, think of each webpage as a tuple • One column for each possible word • True means the word appears on the page • Index on all columns • Now can search: john yoo • select url from T where john = true and yoo = true
Inverted indices (draw pictures) • Initially a mapping: urls → arrays of booleans • Could construe as a mapping: urls → buckets of words • But instead invert the mapping: words → buckets of urls
Inverted Indices • What's stored? • For each word W, for each doc D, store the relevance of D to W: • number/fraction of occurrences of W in D • meta-data/context: bold, font size, title, etc. • In addition to page importance, keep in mind: this info is used to determine the relevance of the particular words appearing on the page (toy sketch below)
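A toy version of this structure in Java, under the simplifying assumption that relevance is just an occurrence count: each word maps to a postings map from URL to count. A real index also keeps positions and the bold/font/title metadata mentioned above.

  import java.util.*;

  public class InvertedIndex {
      // word -> (url -> number of occurrences of that word on that page)
      private final Map<String, Map<String, Integer>> postings = new HashMap<>();

      public void addPage(String url, String text) {
          for (String word : text.toLowerCase().split("\\W+")) {
              if (word.isEmpty()) continue;
              postings.computeIfAbsent(word, w -> new HashMap<>())
                      .merge(url, 1, Integer::sum);          // bump this page's count for the word
          }
      }

      // Pages containing the word, most occurrences first.
      public List<String> lookup(String word) {
          Map<String, Integer> docs = postings.getOrDefault(word, Map.of());
          List<String> urls = new ArrayList<>(docs.keySet());
          urls.sort((a, b) -> docs.get(b) - docs.get(a));
          return urls;
      }
  }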
Search engine infrastructure • Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf
Google-like infrastructure (draw picture) • Very large distributed system • File sizes routinely in GBs → Google File System • Block size = 64 MB (not KB)! • 100k+ low-quality Linux boxes • system failures are the rule, not the exception • Divide the index up by words into many barrels • lexicon maps word ids to each word's barrel • also, a RAID-like strategy → two-dimensional matrix of servers • many commodity machines → frequent crashes • May have more duplication for popular pages…
Google-like infrastructure • To respond to a single-word query Q(w): • find the barrel/column for word w • pick a random server in that column • return (some) sorted results • To respond to a multi-word query Q(w1…wn): • for each word wi, find its results as above • for all words in parallel, merge and prune • step through until we find docs containing all the words, add them to the results • index ordered on (word, docID), so the merge takes linear time (sketch below) • return (some) sorted results
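The linear-time merge relies on each word's postings being sorted by docID. A sketch of the two-pointer intersection (the docIDs here are made-up integers); for more than two words, intersect pairwise.

  import java.util.ArrayList;
  import java.util.List;

  public class PostingsMerge {
      // Intersect two postings lists that are both sorted by docID (ascending).
      public static List<Integer> intersect(List<Integer> a, List<Integer> b) {
          List<Integer> result = new ArrayList<>();
          int i = 0, j = 0;
          while (i < a.size() && j < b.size()) {
              int cmp = Integer.compare(a.get(i), b.get(j));
              if (cmp == 0) { result.add(a.get(i)); i++; j++; }   // doc contains both words
              else if (cmp < 0) i++;                              // advance whichever pointer is behind
              else j++;
          }
          return result;
      }

      public static void main(String[] args) {
          System.out.println(intersect(List.of(1, 3, 5, 9), List.of(3, 4, 5, 10)));   // prints [3, 5]
      }
  }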
Websearch v. DBMS (quickly)
New topic: Sorting Results • How to respond to Q(w1,w2,…,wn)? • Search the index for pages with w1,w2,…,wn • Return them in sorted order (how?) • Soln 1: current order • Return 100,000 (mostly) useless results • Sturgeon's Law: "Ninety percent of everything is crud." • Soln 2: methods from Information Retrieval theory • library science + CS = IR
Simple IR-style approach (draw pictures) • for each word W in a doc D, compute • (# occurrences of W in D) / (total # of word occurrences in D) • each document becomes a point in a space • one dimension for every possible word • like k-NN and k-means • the value in that dimension is the ratio above (maybe weighted, etc.) • Choose pages with high values for the query words • A little more precisely: each doc becomes a vector in the space • values same as above • But: think of the query itself as a document vector • Similarity between query and doc = normalized dot product, i.e., the cosine of the angle between them (sketch below)
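A sketch of that scoring idea in Java: the document and the query each become term-frequency vectors, and the score is the cosine of the angle between them. The crude tokenization and the absence of any weighting (idf, metadata) are simplifications.

  import java.util.HashMap;
  import java.util.Map;

  public class CosineScore {
      // Term-frequency vector: word -> fraction of the document's word occurrences.
      static Map<String, Double> vector(String text) {
          Map<String, Double> tf = new HashMap<>();
          String[] words = text.toLowerCase().split("\\W+");
          for (String w : words) if (!w.isEmpty()) tf.merge(w, 1.0, Double::sum);
          tf.replaceAll((w, count) -> count / words.length);
          return tf;
      }

      // Cosine similarity = dot product divided by the product of the vector lengths.
      static double cosine(Map<String, Double> q, Map<String, Double> d) {
          double dot = 0, qq = 0, dd = 0;
          for (var e : q.entrySet()) dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
          for (double v : q.values()) qq += v * v;
          for (double v : d.values()) dd += v * v;
          return (qq == 0 || dd == 0) ? 0 : dot / (Math.sqrt(qq) * Math.sqrt(dd));
      }

      public static void main(String[] args) {
          Map<String, Double> doc = vector("the quick brown fox jumps over the lazy dog");
          System.out.println(cosine(vector("quick fox"), doc));   // larger value = more similar
      }
  }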
Information Retrieval Theory • With some extensions, this works well for relatively small sets of quality documents • But the web has 8 billion documents (old figure!) • Problem: if scoring is based just on percentages, very short pages containing the query words score very high • Brin/Page: query a "major search engine" for "bill clinton" • "Bill Clinton Sucks" page
Soln 3: sort by "quality" • What do you mean by quality? • Hire readers to rate my webpage (early Yahoo) • Problem: this doesn't scale well* • more webpages than Yahoo employees… • * actually, now coming back, Wiki-style!
Soln 4: count # citations (links) • Big idea: you don't have to hire human webpage raters • The rest of the web has already voted on the quality of my webpage • 1 link to my page = 1 vote • Similar to counting academic citations • Peer review
Soln 5: Google's PageRank • Count citations, but not equally • weighted sum • Motiv: we said we believe that some pages are better than others • those pages' votes should count for more • A page can get a high PageRank many ways • Two cases at ends of a continuum: • many pages link to you • yahoo.com links to you
PageRank definition for a page P • for each page Li that links to P, let C(Li) be the # of pages Li links to • Then PR0(P) = SUM( PR0(Li) / C(Li) ) • Motiv: each page votes with its quality; • its quality is divided evenly among the pages it votes for • Extensions: bold/large-type/etc. links may get larger proportions…
Understanding PR: Systems of Trust 1. Friendster/Orkut • someone "good" invites you in • someone else "good" invited that person in, etc. 2. PKE certificates • my cert authenticated by your cert • your cert endorsed by someone else's… • Both cases here: eventually reach a foundation 3. job/school recommendations • three people recommend you • why should anyone believe them? • three other people recommended them, etc. • eventually, we take a leap of faith
Understanding PR: Random Surfer Model • Idealized web surfer: • First, start at some page • Then, at each page, pick a random link… • Turns out: after a long time surfing, • Pr(we're at some page P right now) = PR0(P) • PRs are normalized
Methods for Computing PageRank • For each page P, we want to find its PR: • PR(P) = SUM( PR(Li) / C(Li) ) • But this is circular – how to compute? • for n pages, we've got n linear equations and n unknowns • could solve for all the PR(P)s directly, but too hard/expensive • see your linear algebra course… • instead, compute iteratively (sketch below) • start with PR0(P) set to E for each P • iterate until no more significant change • Page/Brin report ~50 iterations for ~30M pages / ~300M links • # of iterations required grows only with the log of the web size
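A sketch of the iterative computation on a tiny made-up graph (this is not the PageRank demo program run on the next slides): start every page at E = 1 and repeatedly apply the update. No damping yet, so it inherits the dead-end and rank-sink bugs fixed below.

  import java.util.*;

  public class SimplePageRank {
      public static void main(String[] args) {
          // page -> pages it links to (toy graph with no dead ends)
          Map<String, List<String>> links = Map.of(
              "A", List.of("Y", "M"),
              "Y", List.of("Y", "A"),
              "M", List.of("A"));
          Map<String, Double> pr = new HashMap<>();
          for (String p : links.keySet()) pr.put(p, 1.0);              // PR0 = E = 1 for every page
          for (int iter = 0; iter < 50; iter++) {                      // ~50 iterations is plenty here
              Map<String, Double> next = new HashMap<>();
              for (String p : links.keySet()) next.put(p, 0.0);
              for (var e : links.entrySet()) {
                  double share = pr.get(e.getKey()) / e.getValue().size();   // quality split over out-links
                  for (String target : e.getValue()) next.merge(target, share, Double::sum);
              }
              pr = next;
          }
          System.out.println(pr);
      }
  }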
Analogous iterative systems (skip?) • Sornette: "Why Stock Markets Crash" • Si(t+1) = sign( ei + SUM_j Sj(t) ) • a trader buys/sells based on • their own inclination and • what their associates are saying • the direction of a magnet is determined by • its old direction and • the directions of its neighbors • the activation of a neuron is determined by • its own properties and • the activation of the neighbors connected to it by synapses • the PR of P is based on • its inherent value and • the PR of its in-links
Fixing PR bugs • Example (from Ullman): • A points to Y, M; • Y points to self, A; • M points nowhere (draw picture) • Start A, Y, M at 1: • (1,1,1) → … → (0,0,0) • The rank dissipates • Soln: add an (implicit) self-link to any dead end sales% cd ~mjohnson/public_html/dbms/websearch sales% java PageRank
Fixing PR bugs • Example (from Ullman): • A points to Y, M; • Y points to self, A; • M points to self • Start A, Y, M at 1: • (1,1,1) → … → (0,0,3) • Now M becomes a rank sink • RSM interpretation: we eventually end up at M and then get stuck • Soln: add "inherent quality" E to each page sales% java PageRank2
Fixing PR bugs • Apart from inherited quality, each page also has inherent quality E: • PR(P) = E + SUM( PR(Li) / C(Li) ) • More precisely, take a weighted sum of the two terms: • PR(P) = 0.15*E + 0.85*SUM( PR(Li) / C(Li) ) • Leads to a modified random surfer model (sketch below) sales% java PageRank3
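A sketch of the same iteration with both fixes folded in, again on a made-up graph and with E assumed to be 1 for every page (this is not the PageRank3 demo itself): dead ends get an implicit self-link, and each update blends 15% inherent quality with 85% inherited quality.

  import java.util.*;

  public class DampedPageRank {
      public static void main(String[] args) {
          // A -> {Y, M}, Y -> {Y, A}, M -> {} (M is a dead end, to exercise the fix)
          Map<String, List<String>> links = new HashMap<>(Map.of(
              "A", List.of("Y", "M"),
              "Y", List.of("Y", "A"),
              "M", List.<String>of()));
          // Fix 1: give every dead end an implicit self-link so rank does not dissipate.
          links.replaceAll((p, out) -> out.isEmpty() ? List.of(p) : out);
          Map<String, Double> pr = new HashMap<>();
          for (String p : links.keySet()) pr.put(p, 1.0);
          for (int iter = 0; iter < 50; iter++) {
              Map<String, Double> next = new HashMap<>();
              for (String p : links.keySet()) next.put(p, 0.15 * 1.0);   // Fix 2: inherent quality, E = 1
              for (var e : links.entrySet()) {
                  double share = 0.85 * pr.get(e.getKey()) / e.getValue().size();
                  for (String t : e.getValue()) next.merge(t, share, Double::sum);
              }
              pr = next;
          }
          System.out.println(pr);   // converges: M still scores highest, but A and Y keep nonzero rank
      }
  }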
Understanding PR: Random Surfer Model' • Motiv: if we (qua random surfer) end up at page M, we don't really stay there forever • We type in a new URL • Idealized web surfer: • First, start at some page • Then, at each page, pick a random link • But occasionally, we get bored and jump to a random new page • Turns out: after a long time surfing, • Pr(we're at some page P right now) = PR(P)
Non-uniform Es (skip?) • So far, assumed E was constant across all pages • But can make E a function E(P) • vary by page • How do we choose E(P)? • One idea: set it high for pages that I like • the Brin/Page paper gave high E to John McCarthy's homepage and links • pages he links to get high PR, etc. • Result: his own personalized search engine • Q: How would google.com get your prefs?
Understanding PR: Hydraulic model • Picture the web as a graph again • imagine each link as a pipe connecting two nodes • imagine quality as fluid • each node is a reservoir initialized with amount E of fluid • Now let it flow… • Steady state: each node P holds PR(P) worth of fluid • PR(P) of fluid eventually settles in node P • the equilibrium state
Understanding PR: linear algebra (skip) • Store the webgraph as a (weighted) matrix • the PageRank values ~ the principal eigenvector of the webgraph matrix
Tricking search engines • "Search Engine Optimization" • Challenge: include on your page lots of words you think people will query on • maybe hidden with same color as background • Response: PR/popularity ranking • the pages doing this probably aren't linked to that much
Tricking search engines: acting popular • Challenge: create a page with 1000 links to my page • Response: PageRank itself • Challenge: Create 1000 other pages linking to it • Response: limit the weight a single domain can give to itself • Challenge: buy a second domain and put the 1000 pages there • Response: limit the weight from any single domain…
Another good idea: using anchor text • Motiv: pages may not give the best descriptions of themselves • most search engines don't contain the phrase "search engine" • Brin/Page claim: only 1 of 4 "top search engines" could find themselves on the query "search engine" • Anchor text offers more descriptions of the page: • many pages link to google.com • many of them likely say "search engine" in/near the link • Treat anchor-text words as part of the page • Search for "US West" or for "g++"
Tricking search engines: anchor text • This provides a new way to trick the search engine • Use of anchor text is a big part of result quality • but has potential for abuse • Lets you influence the appearance of other people's pages • Google Bombing • put up lots of pages linking to my page, using some particular phrase in the anchor text • result: a search for the words you chose produces my page • Old e.g.: "talentless hack", "miserable failure", "waffles" • Anchor text as an extra dataset: http://anand.typepad.com/datawocky/
Next: Bidding for ads • Google had two big ideas: • PageRank • AdWords/AdSense • Fundamental difficulty with mass-advertising: • Most of the audience isn't interested • Most people don't want what you're selling • Think of car commercials on TV • But some of them do!
Bidding for ads • If you're selling widgets, how do you know who wants them? • Hard question, so invert the question • If someone is searching for widgets, what should you try to sell them? • Easy – widgets! • Whatever the user searches for, display ads relevant to that query • Or: whatever's on the page he's viewing
Bidding for ads • Q: How to divvy correspondences up? • A: Create a market, and let the divvying take care of itself • Each company places the bid it's willing to pay for an ad responding to a particular query • Ad auction "takes place" at query-time • Relevant ads displayed in descending bid order (e.g.) • Company pays only if user clicks • Huge huge huge business
How to assign ads? • On each query, must assign a bidder to the top place • Matching problem • Online problem • Also: each bidder has a daily budget • Many-to-one • One idea: choose the highest bid (in expectation) • Can't be better than a ½-approximation (why?) • Another: choose the highest remaining budget • Definitely suboptimal (why?) • "Trade-off" algorithm: a (1-1/e)-approximation (~0.6321) • Best possible guarantee (see Vazirani et al.); sketch below
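A sketch of the trade-off idea, with made-up bidder names and numbers: each bidder has a per-click bid and a daily budget, and the query goes to the bidder maximizing bid * psi(fraction of budget spent), where psi(f) = 1 - e^(f-1) is the trade-off function from Mehta, Saberi, Vazirani and Vazirani's AdWords work.

  import java.util.List;

  public class AdAssigner {
      record Bidder(String name, double bid, double budget, double spent) {}

      // Trade-off function: discounts a bid as the bidder's budget gets used up.
      static double psi(double fractionSpent) {
          return 1.0 - Math.exp(fractionSpent - 1.0);
      }

      static Bidder choose(List<Bidder> bidders) {
          Bidder best = null;
          double bestScore = 0;
          for (Bidder b : bidders) {
              if (b.spent() + b.bid() > b.budget()) continue;      // cannot afford another click
              double score = b.bid() * psi(b.spent() / b.budget());
              if (score > bestScore) { bestScore = score; best = b; }
          }
          return best;                                             // null means no eligible bidder
      }

      public static void main(String[] args) {
          List<Bidder> bidders = List.of(
              new Bidder("acme-widgets", 0.50, 100.0, 90.0),   // higher bid, budget nearly exhausted
              new Bidder("widget-world", 0.40, 100.0, 10.0));  // lower bid, plenty of budget left
          System.out.println(choose(bidders).name());          // prints widget-world
      }
  }

Pure highest-bid would pick acme-widgets here and soon exhaust its budget; the discounted score spreads queries across budgets, which is what yields the 1-1/e guarantee.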
Click Fraud • Medium-sized challenge: • Users who click on ad links to cost their competitors money • Or pay housewives in India $.25/click • http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article • http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms
For more info • See sources drawn upon here: • Prof. Davis (NYU/CS) search engines course • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/ • Original research papers by Page & Brin: • The PageRank Citation Ranking: Bringing Order to the Web • The Anatomy of a Large-Scale Hypertextual Web Search Engine • Links on class page • Interesting and very accessible • Why search the web when you could download it? • Webaroo
Future • RAID, etc. • Project Presentations • Final Exam • Info up soon…