C20.0046: Database Management Systems, Lecture #22 • M.P. Johnson • Stern School of Business, NYU • Spring 2008
Agenda • Websearch • Datamining • RAID • XML? • Regexs? • Indices? • Etc. (which?)
Next topic: Websearch • DBMS queries use tables and (optionally) indices • How does a search engine search the web? • First thing to understand about websearch: we never run queries on the web • Way too expensive, for several reasons • Instead: build an index of the web, search the index, and return the links found
Crawling • To obtain the data for the index, we crawl the web • Automated web-surfing • Conceptually very simple, but difficult to do robustly • First, must get pages • Prof. Davis (NYU/CS)’s example: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java • Run the program on grid.stern.nyu.edu: sales> cd ~mjohnson/public_html/dbms/websearch sales> java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms
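For concreteness, here is a minimal BFS crawler sketch in the spirit of the WebCrawler.java linked above (not that program itself): a frontier queue plus a URL-seen set, fetching each page and extracting outlinks with a deliberately crude regex. It assumes Java 11+ for java.net.http; the seed URL and the 50-page cap are arbitrary demo choices.

```java
// MinimalCrawler.java -- a BFS crawler sketch (not the course's WebCrawler.java).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

public class MinimalCrawler {
    // Crude outlink extraction; a real crawler needs a tolerant HTML parser.
    private static final Pattern HREF =
        Pattern.compile("href=[\"'](http[^\"'#]+)[\"']", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        String seed = args.length > 0 ? args[0]
                    : "http://pages.stern.nyu.edu/~mjohnson/dbms";
        HttpClient client = HttpClient.newHttpClient();
        Deque<String> frontier = new ArrayDeque<>(List.of(seed)); // BFS queue
        Set<String> seen = new HashSet<>(frontier);               // URL-seen test
        int crawled = 0;

        while (!frontier.isEmpty() && crawled < 50) {             // demo cap
            String url = frontier.removeFirst();
            String html;
            try {
                HttpResponse<String> resp = client.send(
                    HttpRequest.newBuilder(URI.create(url)).build(),
                    HttpResponse.BodyHandlers.ofString());
                html = resp.body();
            } catch (Exception e) { continue; }  // non-responsive server: skip
            crawled++;
            System.out.println("crawled " + url);
            Matcher m = HREF.matcher(html);
            while (m.find()) {                   // enqueue each unseen outlink
                String link = m.group(1);
                if (seen.add(link)) frontier.addLast(link);
            }
        }
    }
}
```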
Crawling issues in practice • DNS bottleneck: to visit a page found via a text link, we must first resolve its address • Brin & Page (BP) claim: 87% of crawling time ~ DNS look-up • Search strategy? • Refresh strategy? • Primary key for webpages: use artificial IDs (DocIDs), not URLs • more popular pages get shorter DocIDs (why?)
Crawling issues in practice • Non-responsive servers • Bad HTML: tolerant parsing • Content-seen test: compute a fingerprint/hash (again!) of the page content • robots.txt: http://www.robotstxt.org/wc/robots.html
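A minimal sketch of the content-seen test, under the assumption that an exact-duplicate check suffices: fingerprint each page's text and skip pages whose hash we have already stored (MD5 is an arbitrary choice here; any stable hash works).

```java
// Content-seen test sketch: hash page text, skip exact duplicates.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ContentSeen {
    private final Set<String> fingerprints = new HashSet<>();

    /** Returns true if this exact content was already crawled. */
    public boolean seenBefore(String pageText) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(pageText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
        return !fingerprints.add(hex.toString()); // add() is false if present
    }
}
```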
Inverted indices • Which data structure for the index? • An inverted index mapping words to pages • First, think of each webpage as a tuple • One column for each possible word • TRUE means the word appears on the page • Index on all columns • Now can search: john yoo • select url from T where john = TRUE and yoo = TRUE
Inverted indices (draw pictures) • Initially a mapping: urls → arrays of booleans • Could construe as a mapping: urls → buckets of words • But instead invert the mapping: words → buckets of urls
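A toy version of that inversion in Java (a sketch only: word → bucket of URLs, with none of the relevance metadata discussed on the next slide):

```java
// Inverted index sketch: invert url -> words into word -> set of urls.
import java.util.*;

public class InvertedIndex {
    private final Map<String, Set<String>> postings = new HashMap<>();

    public void addPage(String url, String text) {
        for (String word : text.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) continue;
            postings.computeIfAbsent(word, w -> new HashSet<>()).add(url);
        }
    }

    /** Pages containing every query word: intersect the words' buckets. */
    public Set<String> search(String... words) {
        Set<String> result = null;
        for (String w : words) {
            Set<String> bucket = postings.getOrDefault(w.toLowerCase(), Set.of());
            if (result == null) result = new HashSet<>(bucket);
            else result.retainAll(bucket);
        }
        return result == null ? Set.of() : result;
    }
}
```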
Inverted Indices • What’s stored? • For each word W and each doc D, store the relevance of D to W: • # / % of occurrences of W in D • meta-data/context: bold, font size, title, etc. • In addition to page importance, keep in mind: this info is used to determine the relevance of the particular words appearing on the page
Search engine infrastructure • Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf
Google-like infrastructure (draw picture) • Very large distributed system • File sizes routinely in GBs → Google File System • Block size = 64MB (not KB)! • 100k+ low-quality Linux boxes → system failures are the rule, not the exception • Divide the index up by words into many barrels • lexicon maps word ids to each word’s barrel • also, a RAID-like strategy → two-D matrix of servers • many commodity machines → frequent crashes • May have more duplication for popular pages…
Google-like infrastructure • To respond to a single-word query Q(w): • find the barrel column for word w • pick a random server in that column • return (some) sorted results • To respond to a multi-word query Q(w1…wn): • for each word wi, find its results as above • for all words in parallel, merge and prune • step through until we find a doc containing all the words, add it to the results • index ordered on (word, docID), so this takes linear time • return (some) sorted results
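The linear-time merge on docID-sorted posting lists looks like this (a sketch assuming int docIDs and two words; an n-word query repeats the merge across lists):

```java
// Linear-time intersection of two docID-sorted posting lists.
import java.util.ArrayList;
import java.util.List;

public class PostingMerge {
    public static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; } // doc has both words
            else if (a[i] < b[j]) i++;  // advance whichever list is behind
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] w1 = {2, 5, 9, 14}, w2 = {1, 5, 9, 20};
        System.out.println(intersect(w1, w2));  // prints [5, 9]
    }
}
```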
Websearch v. DBMS (quickly)
New topic: Sorting Results • How to respond to Q(w1, w2, …, wn)? • Search the index for pages with w1, w2, …, wn • Return them in sorted order (but how?) • Soln 1: current order • Returns 100,000 (mostly) useless results • Sturgeon's Law: “Ninety percent of everything is crud.” • Soln 2: ways from Information Retrieval theory • library science + CS = IR
Simple IR-style approach (draw pictures) • for each word W in a doc D, compute: # of occurrences of W in D / total # of word occurrences in D • each document becomes a point in a space • one dimension for every possible word • Like k-NN and k-means • the value in that dimension is the ratio from above (maybe weighted, etc.) • Choose pages with high values for the query words • A little more precisely: each doc becomes a vector in the space • Values same as above • But: think of the query itself as a document vector • Similarity between query and doc = dot product / cosine
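A compact sketch of that vector-space scoring: build term-frequency vectors (the occurrence ratio from the slide) for the doc and for the query, then rank docs by the cosine between the two vectors. Assumes Java 10+ for var; weighting schemes like tf-idf are omitted.

```java
// Vector-space similarity sketch: docs and query as term-frequency vectors.
import java.util.HashMap;
import java.util.Map;

public class Cosine {
    static Map<String, Double> tf(String text) {
        Map<String, Double> v = new HashMap<>();
        String[] words = text.toLowerCase().split("\\W+");
        // # occurrences of W / total # of word occurrences, per the slide
        for (String w : words) v.merge(w, 1.0 / words.length, Double::sum);
        return v;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (var e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double x : a.values()) na += x * x;
        for (double x : b.values()) nb += x * x;
        return dot == 0 ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        var doc = tf("bill clinton speech to congress");
        var query = tf("bill clinton");
        System.out.println(cosine(query, doc));  // higher = more relevant
    }
}
```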
Information Retrieval Theory • With some extensions, this works well for relatively small sets of quality documents • But the web has 8 billion documents (an old figure!) • Problem: if based just on percentages, very short pages containing the query words score very high • BP: querying a “major search engine” for “bill clinton” returned a “Bill Clinton Sucks” page
Soln 3: sort by “quality” • What do you mean by quality? • Hire readers to rate my webpage (early Yahoo) • Problem: this doesn’t scale well* • more webpages than Yahoo employees… • * actually, now coming back, Wiki-style!
Soln 4: count # of citations (links) • Big idea: you don’t have to hire human webpage raters • The rest of the web has already voted on the quality of my webpage • 1 link to my page = 1 vote • Similar to counting academic citations • Peer review
Soln 5: Google’s PageRank • Count citations, but not equally: a weighted sum • Motivation: we said we believe that some pages are better than others • those pages’ votes should count for more • A page can get a high PageRank many ways • Two cases at the ends of a continuum: • many pages link to you • yahoo.com links to you
PageRank definition for a page P • for each page Li that links to P, let C(Li) be the # of pages Li links to • Then PR0(P) = SUMi(PR0(Li)/C(Li)) • Motivation: each page votes with its quality; its quality is divided among the pages it votes for • Extensions: bold/large-type/etc. links may get larger proportions…
Understanding PR: Systems of Trust • 1. Friendster/Orkut • someone “good” invites you in • someone else “good” invited that person in, etc. • 2. PKE certificates • my cert authenticated by your cert • your cert endorsed by someone else's… • Both cases here: eventually reach a foundation • 3. job/school recommendations • three people recommend you • why should anyone believe them? • three other people recommended them, etc. • eventually, we take a leap of faith
Understanding PR: Random Surfer Model • Idealized web surfer: • First, start at some page • Then, at each page, pick a random link… • Turns out: after a long time surfing, Pr(we’re at some page P right now) = PR0(P) • PRs are normalized
Methods for Computing PageRank • For each page P, we want to find its PR: PR(P) = SUMi(PR(Li)/C(Li)) • But this is circular – how to compute? • Directly: for n pages, we've got n linear equations in n unknowns • can solve for all the PR(P)s, but too hard/expensive • see your linear algebra course… • Iteratively: • start with PR0(P) set to E for each P • iterate until no more significant change • BP report ~50 iterations for ~30M pages / ~300M links • # of iterations required grows only with the log of the web size
Analogous iterative systems (skip?) • Sornette: “Why Stock Markets Crash” • Si(t+1) = sign(ei + SUMj(Sj(t))) • a trader buys/sells based on • his inclination and • what his associates are saying • the direction of a magnet is determined by • its old direction and • the directions of its neighbors • the activation of a neuron is determined by • its properties and • the activation of the neighbors connected to it by synapses • the PR of P is based on • its inherent value and • the PR of its in-links
Fixing PR bugs • Example (from Ullman): • A points to Y, M; • Y points to self, A; • M points nowhere (draw picture) • Start A, Y, M at 1: • (1,1,1) → … → (0,0,0) • The rank dissipates • Soln: add an (implicit) self link to any dead end sales% cd ~mjohnson/public_html/dbms/websearch sales% java PageRank
Fixing PR bugs • Example (from Ullman): • A points to Y, M; • Y points to self, A; • M points to self • Start A, Y, M at 1: • (1,1,1) → … → (0,0,3) • Now M becomes a rank sink • Random-surfer interpretation: we eventually end up at M and then get stuck • Soln: add “inherent quality” E to each page sales% java PageRank2
Fixing PR bugs • Apart from inherited quality, each page also has inherent quality E: • PR(P) = E + SUMi(PR(Li)/C(Li)) • More precisely, take a weighted sum of the two terms: • PR(P) = .15*E + .85*SUMi(PR(Li)/C(Li)) • Leads to a modified random surfer model sales% java PageRank3
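Putting the pieces together, here is a sketch of the iterative computation with the .15/.85 weighting above, run on Ullman's three-page example (this is an illustrative toy, not the course's PageRank3.java):

```java
// Iterative PageRank sketch: PR(P) = 0.15*E + 0.85 * SUM(PR(Li)/C(Li)).
public class PageRankSketch {
    public static void main(String[] args) {
        // Ullman's example: A -> {Y, M}, Y -> {Y, A}, M -> {M}
        // (self link added to the dead end M so rank does not dissipate).
        int n = 3;                            // 0 = A, 1 = Y, 2 = M
        int[][] out = {{1, 2}, {1, 0}, {2}};  // adjacency lists
        double d = 0.85, e = 1.0;             // damping weight, inherent quality E
        double[] pr = {1, 1, 1};

        for (int iter = 0; iter < 50; iter++) {   // ~50 iterations suffice (BP)
            double[] next = new double[n];
            for (int p = 0; p < n; p++) next[p] = (1 - d) * e;
            for (int li = 0; li < n; li++)        // each page votes with its
                for (int dst : out[li])           // quality, split over outlinks
                    next[dst] += d * pr[li] / out[li].length;
            pr = next;
        }
        System.out.printf("PR(A)=%.3f PR(Y)=%.3f PR(M)=%.3f%n", pr[0], pr[1], pr[2]);
    }
}
```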
Understanding PR: Random Surfer Model’ • Motivation: if we (qua random surfer) end up at page M, we don’t really stay there forever • We type in a new URL • Idealized web surfer: • First, start at some page • Then, at each page, pick a random link • But occasionally, we get bored and jump to a random new page • Turns out: after a long time surfing, Pr(we’re at some page P right now) = PR(P)
Non-uniform Es (skip?) • So far, we assumed E was constant for all pages • But we can make E a function E(P) that varies by page • How do we choose E(P)? • One idea: set it high for pages that I like • The BP paper gave high E to John McCarthy’s homepage and links • pages he links to get high PR, etc. • Result: his own personalized search engine • Q: How would google.com get your preferences?
Understanding PR: Hydraulic model • Picture the web as a graph again • imagine each link as a pipe connecting two nodes • imagine quality as fluid • each node is a reservoir initialized with amount E of fluid • Now let it flow… • The steady state: each node P holds PR(P) worth of fluid • PR(P) of fluid eventually settles in node P • the equilibrium state
Understanding PR: linear algebra (skip) • Store the webgraph as a (weighted) matrix • The PageRank vector ~ the principal eigenvector of the webgraph matrix
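In matrix form (a sketch of the standard formulation, with d the 0.85 damping weight from above and e the vector of inherent-quality values E):

```latex
% The PageRank vector r is the fixed point (principal eigenvector, eigenvalue 1)
% of the damped link matrix: column j spreads page j's rank over its C(j) outlinks.
r = (1-d)\,e + d\,M\,r,
\qquad
M_{ij} =
\begin{cases}
  1/C(j) & \text{if page } j \text{ links to page } i,\\
  0      & \text{otherwise.}
\end{cases}
```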
Tricking search engines • “Search Engine Optimization” • Challenge: include on your page lots of words you think people will query on • maybe hidden in the same color as the background • Response: PR/popularity ranking • the pages doing this probably aren't linked to that much
Tricking search engines: acting popular • Challenge: create a page with 1000 links to my page • Response: PageRank itself • Challenge: create 1000 other pages linking to it • Response: limit the weight a single domain can give to itself • Challenge: buy a second domain and put the 1000 pages there • Response: limit the weight from any single domain…
Another good idea: using anchor text • Motivation: pages may not give the best descriptions of themselves • most search engines don’t contain the phrase “search engine” • BP claim: only 1 of 4 “top search engines” could find themselves on the query “search engine” • Anchor text offers more descriptions of the page: • many pages link to google.com • many of them likely say “search engine” in/near the link • Treat anchor-text words as part of the page • Search for “US West” or for “g++”
Tricking search engines: anchor text • This provides a new way to trick the search engine • Use of anchor text is a big part of result quality • but it has potential for abuse • It lets you influence the appearance of other people’s pages • Google Bombing • put up lots of pages linking to my page, using some particular phrase in the anchor text • result: a search for the words you chose produces my page • Old examples: “talentless hack”, “miserable failure”, “waffles” • Anchor text as an extra dataset: http://anand.typepad.com/datawocky/
Next: Bidding for ads • Google had two big ideas: • PageRank • AdWords/AdSense • Fundamental difficulty with mass advertising: • Most of the audience isn’t interested • Most people don’t want what you’re selling • Think of car commercials on TV • But some of them do!
Bidding for ads • If you’re selling widgets, how do you know who wants them? • Hard question, so invert it • If someone is searching for widgets, what should you try to sell them? • Easy – widgets! • Whatever the user searches for, display ads relevant to that query • Or: whatever’s on the page he’s viewing
Bidding for ads • Q: How to divvy up the query-ad matches? • A: Create a market, and let the divvying take care of itself • Each company places the bid it’s willing to pay for an ad responding to a particular query • The ad auction “takes place” at query time • Relevant ads are displayed in descending bid order (e.g.) • The company pays only if the user clicks • Huge huge huge business
How to assign ads? • On each query, must assign a bidder to the top place • A matching problem • An online problem • Also: each bidder has a daily budget • Many-to-one • One idea: choose the highest bid (in expectation) • Can’t be better than a ½-approx (why?) • Another: choose the bidder with the highest remaining budget • Definitely suboptimal (why?) • “Trade-off” algorithm: a (1-1/e)-approx (~.6321) • Best possible guarantee (see Vazirani et al.)
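A sketch of the trade-off rule from Mehta, Saberi, Vazirani & Vazirani: score each bidder by bid × ψ(fraction of budget already spent), with ψ(f) = 1 - e^(f-1), so high bids are discounted for bidders close to exhausting their daily budget. The two-bidder demo values are made up for illustration.

```java
// Trade-off ad-assignment sketch: balances bid size against remaining budget.
public class AdAuctionSketch {
    // psi(0) = 1 - 1/e, psi(1) = 0: a nearly spent budget counts for little.
    static double psi(double spentFraction) {
        return 1 - Math.exp(spentFraction - 1);
    }

    /** bids[i] = bidder i's bid on this query; returns the winner, or -1. */
    static int assign(double[] bids, double[] spent, double[] budget) {
        int best = -1;
        double bestScore = 0;
        for (int i = 0; i < bids.length; i++) {
            if (spent[i] + bids[i] > budget[i]) continue;  // budget exhausted
            double score = bids[i] * psi(spent[i] / budget[i]);
            if (score > bestScore) { bestScore = score; best = i; }
        }
        if (best >= 0) spent[best] += bids[best];          // charge the winner
        return best;
    }

    public static void main(String[] args) {
        double[] budget = {10, 10}, spent = {0, 0};
        double[] bidsOnQuery = {1.0, 0.9};
        System.out.println("winner: bidder " + assign(bidsOnQuery, spent, budget));
    }
}
```

Compared to the two simpler rules on the slide, this interpolates between them: with fresh budgets it behaves like highest-bid, and as budgets drain it shifts weight toward bidders with budget left, which is what yields the 1-1/e guarantee.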
Click Fraud • Medium-sized challenge: • Users who click on ad links to cost their competitors money • Or pay housewives in India $.25/click • http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article • http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms
For more info • See the sources drawn upon here: • Prof. Davis (NYU/CS)’s search-engines course: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/ • Original research papers by Page & Brin: • The PageRank Citation Ranking: Bringing Order to the Web • The Anatomy of a Large-Scale Hypertextual Web Search Engine • Links on the class page • Interesting and very accessible • Why search the web when you could download it? • Webaroo
Future • RAID, etc. • Project Presentations • Final Exam • Info up soon…