C20.0046: Database Management Systems Lecture #22

Explore web crawling, inverted indices, and search engine infrastructure in database management systems (DBMS). Understand the process of building web indexes, crawling the web, and implementing inverted indices for efficient search algorithms. Learn about Google-like infrastructure and sorting results in DBMS queries.

Presentation Transcript


  1. C20.0046: Database Management Systems • Lecture #22 • M.P. Johnson • Stern School of Business, NYU • Spring 2008 • M.P. Johnson, DBMS, Stern/NYU, Spring 2008

  2. Agenda • Websearch • Datamining • RAID • XML? • Regexes? • Indices? • Etc. (which?)

  3. Next topic: Websearch • DBMS queries use tables and (optionally) indices • How does a search engine search the web? • First thing to understand about websearch: • we never run queries on the web • Way too expensive, for several reasons • Instead: • Build an index of the web • Search the index • Return the links found

  4. Crawling • To obtain the data for the index, we crawl the web • Automated web-surfing • Conceptually very simple • But difficult to do robustly • First, must get pages • Prof. Davis (NYU/CS)’s example: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java • Run program on grid.stern.nyu.edu: sales> cd ~mjohnson/public_html/dbms/websearch sales> java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms

  5. Crawling issues in practice • DNS bottleneck • to fetch a page from a text link, must first look up the host’s address • BP claim: 87% of crawling time ~ DNS look-up • Search strategy? • Refresh strategy? • Primary key for webpages • Use artificial IDs, not URLs • more popular pages get shorter DocIDs (why?)

  6. Crawling issues in practice • Non-responsive servers • Bad HTML • Tolerant parsing • Content-seen test • compute fingerprint/hash (again!) of page content • robots.txt • http://www.robotstxt.org/wc/robots.html
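
The content-seen test above can be sketched as follows. This is a minimal illustration, not the crawler used in the course: the names `fingerprint` and `ContentSeenTest` are hypothetical, and both the choice of SHA-1 and the whitespace normalization are assumptions.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Hash the page content after collapsing runs of whitespace."""
    normalized = " ".join(page_text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class ContentSeenTest:
    """Remembers fingerprints of pages already crawled (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen = set()

    def is_new(self, page_text: str) -> bool:
        """Return True the first time this content is seen, False afterwards."""
        fp = fingerprint(page_text)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

Storing only the fixed-size fingerprint, rather than the page itself, keeps the seen-set small; an exact hash like this one only catches identical content, not near-duplicates.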

  7. Inverted indices • Which data structure for the index? • An inverted index mapping words to pages • First, think of each webpage as a tuple • One column for each possible word • True means the word appears on the page • Index on all columns • Now can search: john yoo • → select url from T where john = true and yoo = true

  8. Inverted indices (draw pictures) • Initially a mapping: urls → array of booleans • Could construe as a mapping: urls → buckets of words • But instead invert the mapping: words → bucket of urls
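
The inversion described above (from urls → words to words → urls) can be sketched in a few lines. The function names and the sample pages are hypothetical, and splitting on whitespace stands in for real tokenization.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Invert a url -> text mapping into a word -> set-of-urls mapping."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(index, query):
    """Return the urls that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data.
pages = {
    "a.html": "John Yoo wrote the memo",
    "b.html": "John Smith wrote a book",
    "c.html": "Yoo is a common surname",
}
index = build_inverted_index(pages)
```

With this sample data, `search(index, "john yoo")` returns only `a.html`, the one page containing both words.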

  9. Inverted Indices • What’s stored? • For each word W, for each doc D, store relevance of D to W: • number/percentage of occurrences of W in D • meta-data/context: bold, font size, title, etc. • In addition to page importance, keep in mind: • this info is used to determine relevance of particular words appearing on the page

  10. Search engine infrastructure • Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf

  11. Google-like infrastructure (draw picture) • Very large distributed system • File sizes routinely in GBs → Google File System • Block size = 64MB (not KB)! • 100k+ low-quality Linux boxes • → system failures are the rule, not the exception • Divide index up by words into many barrels • lexicon maps word ids to word’s barrel • also, do RAID-like strategy → 2-D matrix of servers • many commodity machines → frequent crashes • May have more duplication for popular pages…

  12. Google-like infrastructure • To respond to single-word query Q(w): • go to the barrel column for word w • pick random server in that column • return (some) sorted results • To respond to multi-word query Q(w1…wn): • for each word wi, find its results as above • for all words in parallel, merge and prune • step through until find doc containing all words, add to results • index ordered on word;docID, so linear time • return (some) sorted results
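
The merge-and-prune step for multi-word queries can be sketched as a linear-time intersection of docID lists that are already sorted. This is an illustration only: the function names and the sample postings are hypothetical, and it runs on one machine rather than across barrels.

```python
def intersect_sorted(a, b):
    """Intersect two ascending docID lists in O(len(a) + len(b)) time."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a[i] cannot appear later in b, so skip it
        else:
            j += 1
    return out


def multi_word_query(postings, words):
    """Return docIDs containing all words, given word -> sorted docID list."""
    lists = sorted((postings.get(w, []) for w in words), key=len)
    if not lists:
        return []
    result = lists[0]
    for lst in lists[1:]:
        result = intersect_sorted(result, lst)
    return result


# Hypothetical postings lists, each sorted by docID.
postings = {"john": [1, 3, 5, 8], "yoo": [2, 3, 8, 9], "memo": [3, 4, 8]}
```

Starting from the shortest list is a common design choice: the running result can never grow, so later intersections stay cheap.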

  13. Websearch v. DBMS (quickly)

  14. New topic: Sorting Results • How to respond to Q(w1,w2,…,wn)? • Search index for pages with w1,w2,…,wn • Return in sorted order (how?) • Soln 1: current order • Return 100,000 (mostly) useless results • Sturgeon's Law: “Ninety percent of everything is crud.” • Soln 2: ways from Information Retrieval Theory • library science + CS = IR

  15. Simple IR-style approach (draw pictures) • for each word W in a doc D, compute • # occurs of W in D / total # word occurs in D • → each document becomes a point in a space • one dimension for every possible word • Like k-NN and k-means • value in that dim is ratio from above (maybe weighted, etc.) • Choose pages with high values for query words • A little more precisely: each doc becomes a vector in space • Values same as above • But: think of the query itself as a document vector • Similarity between query and doc = dot product / cos
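
The vector-space idea above can be sketched with plain dictionaries as sparse vectors. The function names and sample documents are hypothetical; the weights are the raw occurrence ratios from the slide, with no extra weighting.

```python
import math
from collections import Counter


def tf_vector(text):
    """Map each word to (# occurrences in text) / (total # word occurrences)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(docs, query):
    """Return doc names ordered by similarity to the query, best first."""
    q = tf_vector(query)
    scores = {name: cosine(q, tf_vector(text)) for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical sample documents.
docs = {
    "d1": "bill clinton was president",
    "d2": "the stock market crashed",
    "d3": "bill clinton bill clinton",
}
```

Here `rank(docs, "bill clinton")` puts `d3` first: a very short page made up only of the query words gets the maximum score, which is exactly the weakness of ranking on ratios alone.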

  16. Information Retrieval Theory • With some extensions, this works well for relatively small sets of quality documents • But the web has 8 billion documents (old!) • Problem: if based just on percentages, very short pages containing query words score very high • BP: query a “major search engine” for “bill clinton” • → “Bill Clinton Sucks” page

  17. Soln 3: sort by “quality” • What do you mean by quality? • Hire readers to rate my webpage (early Yahoo) • Problem: this doesn’t scale well* • more webpages than Yahoo employees… • * actually, now coming back, Wiki-style!

  18. Soln 4: count # citations (links) • Big idea: you don’t have to hire human webpage raters • The rest of the web has already voted on the quality of my webpage • 1 link to my page = 1 vote • Similar to counting academic citations • Peer review

  19. Soln 5: Google’s PageRank • Count citations, but not equally • weighted sum • Motiv: we said we believe that some pages are better than others • → those pages’ votes should count for more • A page can get a high PageRank many ways • Two cases at ends of a continuum: • many pages link to you • yahoo.com links to you

  20. PageRank definition for a page P • for each page Li that links to P, let C(Li) be the # of pages Li links to • Then PR0(P) = SUM(PR0(Li)/C(Li)) • Motiv: each page votes with its quality; • its quality is divided among the pages it votes for • Extensions: bold/large type/etc. links may get larger proportions…

  21. Understanding PR: Systems of Trust 1. Friendster/Orkut • someone “good” invites you in • someone else “good” invited that person in, etc. 2. PKE certificates • my cert authenticated by your cert • your cert endorsed by someone else's… • Both cases here: eventually reach a foundation 3. job/school recommendations • three people recommend you • why should anyone believe them? • three other people recommended them, etc. • eventually, we take a leap of faith

  22. Understanding PR: Random Surfer Model • Idealized web surfer: • First, start at some page • Then, at each page, pick a random link… • Turns out: after long time surfing, • Pr(we’re at some page P right now) = PR0(P) • PRs are normalized

  23. Methods for Computing PageRank • For each page P, we want to find its PR: • PR(P) = SUM(PR(Li)/C(Li)) • But this is circular – how to compute? • for n pages, we've got n linear eqs and n unknowns • can solve for all PR(P)s, but too hard/expensive • see your linear algebra course… • Instead: compute iteratively • start with PR0(P) set to E for each P • iterate until no more significant change • BP report O(50) iterations for O(30M) pages/O(300M) links • #iters req. grows only with log of web size
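
The iterative method above can be sketched as repeated application of PR(P) = SUM(PR(Li)/C(Li)) until the values stop changing. This is not the course's `PageRank` program: the function name, the three-page graph, the starting value E = 1, and the tolerance are all assumptions made for illustration, and the graph is chosen to have no dead ends.

```python
def pagerank_basic(links, start=1.0, tol=1e-9, max_iters=1000):
    """Iterate PR(P) = SUM(PR(Li)/C(Li)) until no significant change.

    `links` maps each page to the list of pages it links to; every link
    target must itself be a key of `links`.
    """
    pages = list(links)
    pr = {p: start for p in pages}
    for _ in range(max_iters):
        new = {p: 0.0 for p in pages}
        for src, outs in links.items():
            for dst in outs:
                # src divides its current rank evenly among its out-links
                new[dst] += pr[src] / len(outs)
        change = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if change < tol:
            break
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank_basic(graph)
```

For this graph the values settle near A = 1.2, B = 0.6, C = 1.2, which satisfy the defining equations (for example, B receives half of A's rank).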

  24. Analogous iterative systems (skip?) • Sornette: “Why Stock Markets Crash” • Si(t+1) = sign(ei + SUM(Sj(t))) • trader buys/sells based on • his inclination and • what his associates are saying • direction of magnet determined by • old direction and • directions of neighbors • activation of neuron determined by • its properties and • activation of neighbors connected by synapses • PR of P based on • its inherent value and • PR of in-links

  25. Fixing PR bugs • Example (from Ullman): • A points to Y, M; • Y points to self, A; • M points nowhere (draw picture) • Start A,Y,M at 1: • (1,1,1) → (0,0,0) • The rank dissipates • Soln: add (implicit) self link to any dead-end sales% cd ~mjohnson/public_html/dbms/websearch sales% java PageRank

  26. Fixing PR bugs • Example (from Ullman): • A points to Y, M; • Y points to self, A; • M points to self • Start A,Y,M at 1: • (1,1,1) → (0,0,3) • Now M becomes a rank sink • RSM interp: we eventually end up at M and then get stuck • Soln: add “inherent quality” E to each page sales% java PageRank2
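
Both Ullman examples (M as a dead end, then M with a self link) can be reproduced with a few lines of iteration. This sketch is not the course's `PageRank` or `PageRank2` program; the function name and the fixed step count of 200 are assumptions.

```python
def iterate_rank(links, steps=200, start=1.0):
    """Apply PR(P) = SUM(PR(Li)/C(Li)) a fixed number of times."""
    rank = {p: start for p in links}
    for _ in range(steps):
        nxt = {p: 0.0 for p in links}
        for src, outs in links.items():
            for dst in outs:
                nxt[dst] += rank[src] / len(outs)
        # a page with no out-links passes its rank to nobody, so it is lost
        rank = nxt
    return rank


# A points to Y and M; Y points to itself and A; M points nowhere.
dead_end = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": []}
# Same graph, but M now points to itself.
rank_sink = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}

drained = iterate_rank(dead_end)
sunk = iterate_rank(rank_sink)
```

In the first graph the total rank shrinks every round and all three values approach 0; in the second the total stays at 3 but it all accumulates at M.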

  27. Fixing PR bugs • Apart from inherited quality, each page also has inherent quality E: • PR(P) = E + SUM(PR(Li)/C(Li)) • More precisely, have weighted sum of the two terms: • PR(P) = .15*E + .85*SUM(PR(Li)/C(Li)) • Leads to a modified random surfer model sales% java PageRank3
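
The weighted formula PR(P) = .15*E + .85*SUM(PR(Li)/C(Li)) can be sketched by adding the inherent-quality term to the iteration. This is not the course's `PageRank3` program; the function name, E = 1, the tolerance, and the reuse of the Ullman rank-sink graph are assumptions.

```python
def pagerank_damped(links, e=1.0, weight=0.85, tol=1e-10, max_iters=1000):
    """Iterate PR(P) = (1 - weight)*E + weight*SUM(PR(Li)/C(Li))."""
    pages = list(links)
    pr = {p: e for p in pages}
    for _ in range(max_iters):
        new = {p: (1.0 - weight) * e for p in pages}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += weight * pr[src] / len(outs)
        change = max(abs(new[p] - pr[p]) for p in pages)
        pr = new
        if change < tol:
            break
    return pr


# A points to Y and M; Y points to itself and A; M points to itself.
rank_sink = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}
damped = pagerank_damped(rank_sink)
```

With the inherent-quality term, M still ranks highest (about 2.08 of the total 3) but A and Y keep nonzero rank (about 0.38 and 0.54) instead of draining to 0.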

  28. Understanding PR: Random Surfer Model’ • Motiv: if we (qua random surfer) end up at page M, we don’t really stay there forever • We type in a new URL • Idealized web surfer: • First, start at some page • Then, at each page, pick a random link • But occasionally, we get bored and jump to a random new page • Turns out: after long time surfing, • Pr(we’re at some page P right now) = PR(P)
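
The modified random surfer can be simulated directly. In this sketch the function name, the 15% boredom probability (chosen to match the .15/.85 weights), the fixed seed, and the step count are all assumptions; the graph is the Ullman example with M pointing to itself.

```python
import random
from collections import Counter


def random_surfer(links, steps=200_000, jump_prob=0.15, seed=0):
    """Simulate the surfer and return the fraction of time spent on each page."""
    rng = random.Random(seed)
    pages = list(links)
    current = rng.choice(pages)
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        outs = links[current]
        if not outs or rng.random() < jump_prob:
            # bored (or stuck at a dead end): jump to a random page
            current = rng.choice(pages)
        else:
            current = rng.choice(outs)
    return {p: visits[p] / steps for p in pages}


# A points to Y and M; Y points to itself and A; M points to itself.
rank_sink = {"A": ["Y", "M"], "Y": ["Y", "A"], "M": ["M"]}
freq = random_surfer(rank_sink)
```

The visit fractions sum to 1, so they correspond to normalized PR values: the surfer spends roughly 69% of the time at M, and the rest split between Y and A, without ever being stuck at M for good.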

  29. Non-uniform Es (skip?) • So far, assumed E was const for all pages • But can make E a function E(P) • vary by page • How do we choose E(P)? • One idea: set high for pages that I like • BP paper gave high E to John McCarthy’s homepage and links • → pages he links to get high PR, etc. • Result: his own personalized search engine • Q: How would google.com get your prefs?

  30. Understanding PR: Hydraulic model • Picture the web as graph again • imagine each link as a pipe connecting two nodes • imagine quality as fluid • each node is a reservoir initialized with amount E of fluid • Now let flow… • Steady state is: each node P w/PR(P) amount of fluid • PR(P) of fluid eventually settles in node P • the equilibrium state

  31. Understanding PR: linear algebra (skip) • Store webgraph as a (weighted) matrix • PageRank values ~ entries of the principal eigenvector of the webgraph matrix
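
In matrix form, the PageRank equations from the earlier slides can be written as below. The symbols M, r, and 1 (the all-ones vector) are notation introduced here for illustration, with r_i = PR(P_i); the three displays are the weighted link matrix, the basic equation, and the version with inherent quality E.

```latex
M_{ij} =
\begin{cases}
  1 / C(L_j) & \text{if page } L_j \text{ links to page } P_i \\
  0          & \text{otherwise}
\end{cases}
\qquad
\mathbf{r} = M \mathbf{r}
\qquad
\mathbf{r} = 0.15\, E\, \mathbf{1} + 0.85\, M \mathbf{r}
```

The middle equation says that r is an eigenvector of M with eigenvalue 1, and iterating PR(P) = SUM(PR(Li)/C(Li)) amounts to repeatedly multiplying a starting vector by M.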

  32. Tricking search engines • “Search Engine Optimization” • Challenge: include on your page lots of words you think people will query on • maybe hidden with same color as background • Response: PR/popularity ranking • the pages doing this probably aren't linked to that much

  33. Tricking search engines: acting popular • Challenge: create a page with 1000 links to my page • Response: PageRank itself • Challenge: Create 1000 other pages linking to it • Response: limit the weight a single domain can give to itself • Challenge: buy a second domain and put the 1000 pages there • Response: limit the weight from any single domain…

  34. Another good idea: using anchor text • Motiv: pages may not give the best descriptions of themselves • most search engines’ pages don’t contain the phrase “search engine” • BP claim: only 1 of 4 “top search engines” could find themselves on the query “search engine” • Anchor text offers more descriptions of the page: • many pages link to google.com • many of them likely say “search engine” in/near the link • → Treat anchor text words as part of the page • Search for “US West” or for “g++”

  35. Tricking search engines: anchor text • This provides a new way to trick the search engine • Use of anchor text is a big part of result quality • but has potential for abuse • Lets you influence the appearance of other people’s pages • Google Bombing • put up lots of pages linking to my page, using some particular phrase in the anchor text • result: search for words you chose produces my page • Old e.g.: “talentless hack”, “miserable failure”, “waffles” • Anchor text as extra dataset: http://anand.typepad.com/datawocky/

  36. Next: Bidding for ads • Google had two big ideas: • PageRank • AdWords/AdSense • Fundamental difficulty with mass-advertising: • Most of the audience isn’t interested • Most people don’t want what you’re selling • Think of car commercials on TV • But some of them do!

  37. Bidding for ads • If you’re selling widgets, how do you know who wants them? • Hard question, so invert the question • If someone is searching for widgets, what should you try to sell them? • Easy – widgets! • Whatever the user searches for, display ads relevant to that query • Or: whatever’s on the page he’s viewing

  38. Bidding for ads • Q: How to divvy correspondences up? • A: Create a market, and let the divvying take care of itself • Each company places the bid it’s willing to pay for an ad responding to a particular query • Ad auction “takes place” at query-time • Relevant ads displayed in descending bid order (e.g.) • Company pays only if user clicks • Huge huge huge business
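
The query-time auction above can be sketched as a filter and a sort. The function names, company names, and bid amounts are hypothetical, and charging the clicked advertiser its own bid is a simplifying assumption rather than a description of any real pricing rule.

```python
def rank_ads(bids, query):
    """Return (company, bid) pairs for `query`, highest bid first."""
    matching = [(company, bid) for company, keyword, bid in bids if keyword == query]
    return sorted(matching, key=lambda pair: pair[1], reverse=True)


def charge_for_click(ranked_ads, clicked_company):
    """Pay-per-click: only the clicked advertiser pays (here, its own bid)."""
    for company, bid in ranked_ads:
        if company == clicked_company:
            return bid
    return 0.0


# Hypothetical standing bids: (company, query keyword, bid in dollars).
bids = [
    ("AcmeWidgets", "widgets", 0.40),
    ("WidgetWorld", "widgets", 0.75),
    ("GadgetCo", "gadgets", 0.50),
]
ads = rank_ads(bids, "widgets")
```

For the query `widgets`, WidgetWorld is shown above AcmeWidgets, GadgetCo is not shown at all, and nobody pays unless the user clicks.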

  39. How to assign ads? • On each query, must assign a bidder to the top place • Matching problem • Online problem • Also: each bidder has daily budget • Many-to-one • One idea: choose highest bid (in expectation) • Can’t be better than ½-approx (why?) • Another: choose highest remaining budget • Definitely suboptimal (why?) • “Trade-off” algorithm: 1-1/e-approx (~.6321) • Best possible guarantee (see Vazirani et al.)
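
The difference between the highest-bid rule and the trade-off rule can be seen on a small example. This is a sketch under stated assumptions: the function names, bidders, bids, and budgets are hypothetical, and the trade-off score used here, bid × (1 − e^(f − 1)) with f the fraction of budget already spent, is the scaling function from the Mehta, Saberi, Vazirani and Vazirani analysis, whose guarantee assumes bids that are small relative to budgets.

```python
import math


def assign(queries, bids, budgets, score):
    """Assign each arriving query to the affordable bidder with the best score.

    Returns total revenue. `bids[q]` maps bidder -> bid for query q.
    """
    spent = {b: 0.0 for b in budgets}
    revenue = 0.0
    for q in queries:
        best, best_score = None, 0.0
        for bidder, bid in bids.get(q, {}).items():
            if budgets[bidder] - spent[bidder] < bid:
                continue  # bidder cannot afford this query any more
            s = score(bid, spent[bidder] / budgets[bidder])
            if s > best_score:
                best, best_score = bidder, s
        if best is not None:
            spent[best] += bids[q][best]
            revenue += bids[q][best]
    return revenue


def greedy(bid, fraction_spent):
    """Highest bid wins, ignoring budgets."""
    return bid


def tradeoff(bid, fraction_spent):
    """Discount the bid as the bidder's budget is used up."""
    return bid * (1.0 - math.exp(fraction_spent - 1.0))


# Hypothetical instance: A bids on both queries, B only on "x".
bids = {"x": {"A": 1.00, "B": 0.99}, "y": {"A": 1.00}}
budgets = {"A": 2.0, "B": 2.0}
queries = ["x", "x", "y", "y"]
```

On this instance the greedy rule spends all of A's budget on the `x` queries and earns 2.00, while the trade-off rule shifts the second `x` to B and earns 2.99; knowing the whole sequence in advance, 3.98 would be possible.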

  40. Click Fraud • Medium-sized challenge: • Users who click on ad links to cost their competitors money • Or pay housewives in India $.25/click • http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article • http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms

  41. For more info • See sources drawn upon here: • Prof. Davis (NYU/CS) search engines course • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/ • Original research papers by Page & Brin: • The PageRank Citation Ranking: Bringing Order to the Web • The Anatomy of a Large-Scale Hypertextual Web Search Engine • Links on class page • Interesting and very accessible • Why search the web when you could download it? • Webaroo

  42. Future • RAID, etc. • Project Presentations • Final Exam • Info up soon…
