C20.0046: Database Management Systems, Lecture #27 • M.P. Johnson • Stern School of Business, NYU • Spring 2005
Agenda • Last time: • Data Mining • RAID • Websearch • Etc.
Goals after today: • Understand what RAID is • Be able to perform RAID 4 • Understand some issues in websearch • Be able to perform PageRank
New topic: Recovery
System Failures (skip?) • Each transaction (xact) has internal state • When the system crashes, that internal state is lost • We don't know which parts of the xact executed and which didn't • Remedy: use a log • A file that records each action of each xact • A trail of breadcrumbs • See text for details…
Media Failures • Rule of thumb: Pr(hard drive has a head crash within 10 years) = 50% • Simpler rule of thumb: Pr(hard drive has a head crash within 1 year) = (say) 10% • With many drives, failures become a regular occurrence: at 10%/year, 100 drives mean ~10 failures per year in expectation • Soln: different RAID strategies • RAID: Redundant Arrays of Independent Disks
RAID levels • RAID level 1: each disk gets a mirror • RAID level 4: one disk is the xor of all the others • Each bit is the sum mod 2 of the corresponding bits • E.g.: • Disk 1: 11110000 • Disk 2: 10101010 • Disk 3: 00111000 • Disk 4 (parity): 01100010 • How to recover? • Various other RAID levels in text…
RAID levels • RAID level 4 recovery: suppose Disk 1 fails • Disk 1: ? • Disk 2: 10101010 • Disk 3: 00111000 • Disk 4 (parity): 01100010 • How to recover? Xor the surviving disks: 10101010 xor 00111000 xor 01100010 = 11110000 • Various other RAID levels in text… • (see the sketch below)
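A minimal sketch of the RAID 4 arithmetic above (the class and method names are mine; the disk contents are the slide's example):

// RAID4Demo.java: parity and recovery for the RAID 4 example above.
// A toy illustration of the arithmetic, not a real disk driver.
public class RAID4Demo {
    // Xor two equal-length bit strings such as "11110000".
    static String xor(String a, String b) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < a.length(); i++)
            sb.append(a.charAt(i) == b.charAt(i) ? '0' : '1');
        return sb.toString();
    }

    public static void main(String[] args) {
        String disk1 = "11110000", disk2 = "10101010", disk3 = "00111000";

        // Parity disk = xor of all the data disks.
        String disk4 = xor(xor(disk1, disk2), disk3);
        System.out.println("Disk 4 (parity):  " + disk4);     // 01100010

        // If disk 1 fails, xor the survivors with the parity to rebuild it.
        String recovered = xor(xor(disk2, disk3), disk4);
        System.out.println("Recovered disk 1: " + recovered); // 11110000
    }
}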
Next topic: Websearch • Goal: create a search engine for searching the web • DBMS queries use tables and (optionally) indices • First thing to understand about websearch: • we never run queries on the live web itself • Way too expensive, for several reasons • Instead: • Build an index of the web • Search the index • Return the results
Crawling • To obtain the data for the index, we crawl the web • Automated web-surfing • Conceptually very simple • But difficult to do robustly • First, must get pages • Prof. Davis's (NYU/CS) example: http://www.cs.nyu.edu/courses/fall02/G22.3033-008/WebCrawler.java • http://pages.stern.nyu.edu/~mjohnson/dbms/eg/WebCrawler.java • Rule of thumb: 1 page per minute • Run the program: sales% cd ~mjohnson/public_html/dbms/eg sales% java WebCrawler http://pages.stern.nyu.edu/~mjohnson/dbms 200 • (a sketch of the idea follows)
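Not Prof. Davis's code, but a minimal sketch of the same breadth-first idea, assuming a Java 11+ HttpClient and a crude href regex (real crawlers need tolerant HTML parsing, robots.txt checks, and a much slower politeness rate):

// CrawlerSketch.java: breadth-first crawl from a start URL, up to a page budget.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.*;
import java.util.regex.*;

public class CrawlerSketch {
    static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

    public static void main(String[] args) throws Exception {
        String start = args[0];                    // starting URL
        int maxPages = Integer.parseInt(args[1]);  // page budget, as in the demo above
        HttpClient client = HttpClient.newHttpClient();
        Queue<String> frontier = new ArrayDeque<>(List.of(start));
        Set<String> seen = new HashSet<>(List.of(start));

        for (int crawled = 0; crawled < maxPages && !frontier.isEmpty(); crawled++) {
            String url = frontier.poll();
            System.out.println("crawling " + url);
            String html;
            try {
                html = client.send(HttpRequest.newBuilder(URI.create(url)).build(),
                                   HttpResponse.BodyHandlers.ofString()).body();
            } catch (Exception e) { continue; }    // non-responsive server: skip it
            Matcher m = HREF.matcher(html);
            while (m.find())
                if (seen.add(m.group(1))) frontier.add(m.group(1)); // visit each URL once
            Thread.sleep(1000);  // politeness pause (the rule of thumb above is far slower)
        }
    }
}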
Crawling issues in practice • DNS bottleneck • to fetch a page from a link, must first resolve its hostname to an address • BP claim: 87% of crawling time is spent on DNS look-up • Search strategy? • Refresh strategy? • Primary key for webpages • Use artificial DocIDs, not URLs • more popular pages get shorter DocIDs (why?)
Crawling issues in practice • Content-seen test • compute a fingerprint/hash (again!) of the page content • robots.txt • http://www.robotstxt.org/wc/robots.html • Bad HTML, so tolerant parsing • Non-responsive servers • Spurious text • (fingerprint sketch below)
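A sketch of the content-seen test, assuming a SHA-1 fingerprint over the raw page text (real crawlers typically normalize the text first; names are mine):

// ContentSeen.java: hash each page's text and skip already-seen content.
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public class ContentSeen {
    private final Set<String> fingerprints = new HashSet<>();

    // Returns true the first time this content is seen, false for duplicates.
    public boolean isNew(String pageText) throws Exception {
        byte[] hash = MessageDigest.getInstance("SHA-1")
                                   .digest(pageText.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : hash) hex.append(String.format("%02x", b));
        return fingerprints.add(hex.toString());
    }

    public static void main(String[] args) throws Exception {
        ContentSeen cs = new ContentSeen();
        System.out.println(cs.isNew("<html>hello</html>"));  // true: first sighting
        System.out.println(cs.isNew("<html>hello</html>"));  // false: duplicate content
    }
}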
Inverted indices • Basic idea of finding pages: • Create an inverted index mapping words to pages • First, think of each webpage as a tuple • One column for each possible word • True means the word appears on the page • Index on all columns • Now we can search: john bolton • SELECT * FROM T WHERE john = TRUE AND bolton = TRUE
Inverted indices • Can simplify somewhat: • For each field's index, delete the False entries • The True entries for each index become a bucket • Create an inverted index: • One entry for each search word • the lexicon • Each search word's entry points to its corresponding bucket • The bucket points to the pages containing that word • the postings file • Final intuition: the inverted index doesn't map URLs to words • It maps words to URLs • (toy version below)
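A toy version of the lexicon/postings structure (class and method names are my own, not a real engine's):

// InvertedIndex.java: each word maps to a sorted "bucket" (postings list)
// of the DocIDs containing it.
import java.util.*;

public class InvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    public void addDoc(int docId, String text) {
        for (String word : text.toLowerCase().split("\\W+"))
            postings.computeIfAbsent(word, w -> new TreeSet<>()).add(docId);
    }

    public Set<Integer> lookup(String word) {
        return postings.getOrDefault(word.toLowerCase(), new TreeSet<>());
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDoc(1, "john bolton speaks");
        idx.addDoc(2, "john smith");
        System.out.println(idx.lookup("john"));   // [1, 2]
        System.out.println(idx.lookup("bolton")); // [1]
    }
}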
Inverted Indices • What's stored? • For each word W, for each doc D: • relevance of D to W • number/percentage of occurrences of W in D • meta-data/context: bold, font size, title, etc. • Keep in mind: separate from overall page importance, this info is used to determine the relevance of the particular words appearing on the page
Search engine infrastructure • Image from here: http://www.cs.wisc.edu/~dbbook/openAccess/thirdEdition/slides/slides3ed-english/Ch27c_ir3-websearch-95.pdf
Google-like infrastructure • Very large distributed system • File sizes routinely in GBs: the Google File System • Block size = 64MB (not KB)! • 100k+ low-quality Linux boxes • system failures are the rule, not the exception • Divide the index up by words into many barrels • the lexicon maps each word id to the word's barrel • also, a RAID-like strategy: a two-dimensional matrix of servers • many commodity machines, so frequent crashes • Draw picture • May have more duplication for popular pages…
Google-like infrastructure • To respond to a single-word query Q(w): • send it to the barrel column for word w • pick a random server in that column • return (some) sorted results • To respond to a multi-word query Q(w1…wn): • for each word wi, send it to the barrel column for wi • pick a random server in each column • for all words in parallel, merge and prune • step through the lists until we find docs containing all the words, adding them to the results • the index is ordered on (word, docID), so the merge takes linear time • return (some) sorted results • (merge sketch below)
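The linear-time merge step, sketched for two words (postings lists are sorted by DocID, so two cursors suffice; the example DocIDs are made up):

// PostingsMerge.java: intersect two sorted postings lists in linear time.
import java.util.*;

public class PostingsMerge {
    static List<Integer> intersect(int[] a, int[] b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { out.add(a[i]); i++; j++; }  // doc contains both words
            else if (a[i] < b[j]) i++;   // advance whichever list is behind
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] john = {1, 4, 7, 9}, bolton = {2, 4, 9, 12};
        System.out.println(intersect(john, bolton));  // [4, 9]
    }
}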
Websearch v. DBMS
New topic: Sorting Results • How to respond to Q(w1,w2,…,wn)? • Search the index for pages with w1,w2,…,wn • Return them in sorted order (how?) • Soln 1: current order • Returns 100,000 (mostly) useless results • Sturgeon's Law: “Ninety percent of everything is crud.” • Soln 2: techniques from Information Retrieval theory • library science + CS = IR
Simple IR-style approach • for each word W in a doc D, compute: • (# occurrences of W in D) / (total # of word occurrences in D) • each document becomes a point in a space • one dimension for every possible word • like k-NN and k-means • the value in that dimension is the ratio above (maybe weighted, etc.) • Choose pages with high values for the query words • A little more precisely: each doc becomes a vector in that space • Values same as above • But: think of the query itself as a document vector too • Similarity between query and doc = normalized dot product, i.e., the cosine of the angle between the vectors • Draw picture • (sketch below)
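A sketch of the vector-space similarity above, with made-up documents (note how the very short page already scores higher, foreshadowing the problem on the next slide):

// CosineSim.java: docs (and the query) as term-frequency vectors;
// similarity = dot product / (product of vector norms).
import java.util.*;

public class CosineSim {
    static Map<String, Double> vectorize(String text) {
        Map<String, Double> v = new HashMap<>();
        String[] words = text.toLowerCase().split("\\W+");
        for (String w : words) v.merge(w, 1.0 / words.length, Double::sum);
        return v;  // each value is (# occurrences of word) / (total # of words)
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet())
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        for (double x : a.values()) na += x * x;
        for (double x : b.values()) nb += x * x;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Double> query = vectorize("bill clinton");
        Map<String, Double> doc1 = vectorize("bill clinton gave a speech in new york today");
        Map<String, Double> doc2 = vectorize("bill clinton sucks");  // short page wins
        System.out.printf("doc1: %.3f  doc2: %.3f%n",
                          cosine(query, doc1), cosine(query, doc2)); // ~0.471 vs ~0.816
    }
}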
Information Retrieval Theory • With some extensions, this works well for relatively small sets of quality documents • But the web has 8 billion documents • Problem: if based just on percentages, very short pages containing the query words score very high • BP example: querying a “major search engine” for “bill clinton” returned a “Bill Clinton Sucks” page
Soln 3: sort by “quality” • What do you mean by quality? • Hire readers to rate my webpage (early Yahoo) • Problem: doesn't scale well • more webpages than Yahoo employees…
Soln 4: count # citations (links) • Idea: you don't have to hire webpage raters • The rest of the web has already voted on the quality of my webpage • 1 link to my page = 1 vote • Similar to counting academic citations • Peer review
Soln 5: Google's PageRank • Count citations, but not equally: a weighted sum • Motiv: we said we believe some pages are better than others • those pages' votes should count for more • A page can get a high PageRank in many ways • Two cases at the ends of a continuum: • many pages link to you • yahoo.com links to you • PageRank, not PigeonRank • Search for “PigeonRank”…
PageRank • More precisely, let P be a page; • for each page Li that links to P, • let C(Li) be the number of pages Li links to • Then PR0(P) = SUM_i( PR0(Li) / C(Li) ) • Motiv: each page votes with its quality; • its quality is divided among the pages it votes for • Extensions: bold/large type/etc. links may get larger proportions…
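To make the formula concrete, a small worked example (the numbers are hypothetical): suppose exactly two pages link to P, with PR0(L1) = 0.3 and 3 out-links, and PR0(L2) = 0.2 and 2 out-links. Then PR0(P) = 0.3/3 + 0.2/2 = 0.1 + 0.1 = 0.2.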
Understanding PageRank (skip?) • Analogy 1: Friendster/Orkut • someone “good” invites you in • someone else “good” invited that person in, etc. • Analogy 2: PKE certificates • my cert is authenticated by your cert • your cert is endorsed by someone else's… • Both cases: eventually we reach a foundation • Analogy 3: job/school recommendations • three people recommend you • why should anyone believe them? • three other people recommended them, etc. • eventually, we take a leap of faith
Understanding PageRank • Analogy 4: the Random Surfer Model • An idealized web surfer: • First, start at some page • Then, at each page, pick a random link… • Turns out: after a long time surfing, • Pr(we're at some page P right now) = PR0(P) • PRs are normalized: they sum to 1
Computing PageRank • For each page P, we want: • PR(P) = SUM_i( PR(Li) / C(Li) ) • But it's circular – how to compute it? • Meth 1: for n pages, we've got n linear equations in n unknowns • could solve for all the PR(P)s, but too hard at this scale • see your linear algebra course… • Meth 2: iteratively • start with PR0(P) set to E for each P • iterate until there is no more significant change • BP report ~50 iterations for ~30M pages/~300M links • the number of iterations required grows only with the log of the web size
Problems with PageRank • Example (from Ullman): • A points to Y, M; • Y points to itself and A; • M points nowhere (draw picture) • Start A,Y,M at 1: • (1,1,1) → … → (0,0,0) • The rank dissipates • Soln: add an (implicit) self link to any dead-end sales% cd ~mjohnson/public_html/dbms/eg stern% java PageRank
Problems with PageRank • Example (from Ullman): • A points to Y, M; • Y points to itself and A; • M points to itself • Start A,Y,M at 1: • (1,1,1) → … → (0,0,3) • Now M becomes a rank sink • RSM interpretation: we eventually end up at M and then get stuck • Soln: add “inherent quality” E to each page stern% java PageRank2
Modified PageRank • Apart from inherited quality, each page also has inherent quality E: • PR(P) = E + SUM_i( PR(Li) / C(Li) ) • More precisely, take a weighted sum of the two terms: • PR(P) = 0.15*E + 0.85*SUM_i( PR(Li) / C(Li) ) • Leads to a modified random surfer model stern% java PageRank3 • (sketch below)
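Not the PageRank3.java demo itself, but a minimal sketch of the damped iteration on Ullman's three-page example (the uniform E and the variable names are my assumptions):

// PageRankSketch.java: iterative PageRank with the 0.15/0.85 weighting above.
import java.util.*;

public class PageRankSketch {
    public static void main(String[] args) {
        // Ullman's example: A -> {Y, M}, Y -> {Y, A}, M -> {M}
        // (self link added so the dead end M doesn't dissipate rank).
        String[] names = {"A", "Y", "M"};
        int[][] links = {{1, 2}, {1, 0}, {2}};   // out-links by node index

        int n = names.length;
        double damping = 0.85, e = 1 - damping;  // weight on inherent quality E
        double[] pr = new double[n];
        Arrays.fill(pr, 1.0 / n);                // normalized start

        for (int iter = 0; iter < 50; iter++) {  // BP report ~50 iterations suffice
            double[] next = new double[n];
            Arrays.fill(next, e / n);            // uniform inherent quality
            for (int p = 0; p < n; p++)
                for (int q : links[p])           // each vote is split C(L) ways
                    next[q] += damping * pr[p] / links[p].length;
            pr = next;
        }
        for (int p = 0; p < n; p++)
            System.out.printf("%s: %.3f%n", names[p], pr[p]);
    }
}

With damping < 1, the sink M no longer absorbs everything: each iteration leaks 15% of the rank back out as the "bored surfer" term.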
Random Surfer Model’ • Motiv: if we (qua random surfer) end up at page M, we don't really stay there forever • We type in a new URL • The idealized web surfer, revised: • First, start at some page • Then, at each page, pick a random link • But occasionally, we get bored and jump to a random new page instead • Turns out: after a long time surfing, • Pr(we're at some page P right now) = PR(P)
Understanding PageRank • One more interpretation: the hydraulic model • picture the web graph again • imagine each link as a tube between two nodes • imagine quality as fluid • each node is a reservoir initialized with amount E of fluid • Now let it flow… • The steady state: PR(P) of fluid eventually settles in node P • equilibrium
Somewhat analogous systems (skip?) • Sornette: “Why Stock Markets Crash” • Si(t+1) = sign(ei + SUM_j Sj(t)) • a trader buys/sells based on • its own inclination and • what its associates are saying • the direction of a magnet is determined by • its old direction and • the directions of its neighbors • the activation of a neuron is determined by • its own properties and • the activation of the neighbors connected to it by synapses • the PR of P is based on • its inherent value and • the PR of its in-links
Non-uniform Es (skip?) • So far, we assumed E was constant across pages • But we can make E a function E(P) • varying by page • How do we choose E(P)? • Idea 1: set it high for pages with high PR from earlier iterations • Idea 2: set it high for pages I like • The BP paper gave high E to John McCarthy's homepage • pages he links to get high PR, etc. • Result: his own personalized search engine • Q: How would google.com get your preferences?
Tricking search engines • “Search Engine Optimization” • Challenge: include on your page lots of words you think people will query on • maybe hidden in the same color as the background • Response: popularity ranking • the pages doing this probably aren't linked to that much • but…
Tricking search engines • I can try to make my page look popular to the search engine • Challenge: create a page with 1000 links to my page • does this work? (not much: that page's total vote is bounded by its own, presumably low, PageRank) • Challenge: create 1000 other pages linking to it • Response: limit the weight a single domain can give to itself • Challenge: buy a second domain and put the 1000 pages there • Response: limit the weight from any single domain…
Using anchor text • Another good idea: use anchor text • Motiv: pages may not give the best descriptions of themselves • most search engines' own pages don't contain the phrase "search engine" • BP claim: only 1 of 4 “top search engines” could find itself on the query "search engine" • Anchor text also describes the page it points to: • many pages link to google.com • many of them likely say "search engine" in/near the link • So: treat anchor-text words as part of the target page • Search for “US West” or for “g++” • (sketch below)
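One way to see the mechanism, reusing the toy InvertedIndex class sketched earlier (my construction, not Google's pipeline): when page S links to page T, index the anchor words under T's DocID, not S's.

// AnchorTextDemo.java: index anchor words under the *target* page's DocID,
// so T can be found via words that appear only on the pages linking to it.
// (Assumes the toy InvertedIndex class from the earlier sketch.)
public class AnchorTextDemo {
    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        idx.addDoc(42, "Welcome to our homepage");  // page T's own words (DocID 42)
        idx.addDoc(42, "search engine");            // anchor text from a page linking to T
        System.out.println(idx.lookup("engine"));   // [42]: found via others' words
    }
}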
Tricking search engines • This provides a new way to trick the search engine • Use of anchor text is a big part of result quality • but it has potential for abuse • It lets you influence the appearance of other people's pages • Google Bombs • put up lots of pages linking to my page, using some particular phrase in the anchor text • result: a search for the words you chose produces my page • Examples: “talentless hack”, “miserable failure”, “waffles”, the last name of a prominent US senator…
Bidding for ads • Google had two really great ideas: • PageRank • AdWords/AdSense • Fundamental difficulty with mass-advertising: • Most of the audience doesn't want it • Most people don't want what you're selling • Think of car commercials on TV • But some of them do!
Bidding for ads • If you're selling widgets, how do you know who wants them? • Hard question, so answer its inversion instead: • If someone is searching for widgets, what should you try to sell them? • Easy – widgets! • Whatever the user searches for, display ads relevant to that query
Bidding for ads • Q: How to divvy up the query–advertiser correspondences? • A: Create a market, and let the divvying take care of itself • Each company places the bid it's willing to pay for an ad responding to a particular query • The ad auction “takes place” at query time • Relevant ads are displayed in descending bid order • A company pays only if the user clicks • AdSense: place ads on external webpages, with the auction based on page content instead of a query • Huge huge huge business • (toy version below)
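A toy version of the mechanism as the slide describes it: bids keyed by query word, ads shown in descending bid order, charges only on clicks. (All names and numbers are mine; real AdWords pricing is considerably more elaborate.)

// AdAuctionSketch.java: keyword bidding with pay-per-click charging.
import java.util.*;

public class AdAuctionSketch {
    record Bid(String advertiser, double amount) {}

    private final Map<String, List<Bid>> bidsByWord = new HashMap<>();
    private final Map<String, Double> charges = new HashMap<>();

    void placeBid(String word, String advertiser, double amount) {
        bidsByWord.computeIfAbsent(word, w -> new ArrayList<>())
                  .add(new Bid(advertiser, amount));
    }

    List<Bid> adsFor(String query) {  // the auction "takes place" at query time
        List<Bid> ads = new ArrayList<>(bidsByWord.getOrDefault(query, List.of()));
        ads.sort(Comparator.comparingDouble((Bid b) -> b.amount()).reversed());
        return ads;                   // descending bid order
    }

    void click(Bid ad) {              // pay-per-click: charge only on a click
        charges.merge(ad.advertiser(), ad.amount(), Double::sum);
    }

    public static void main(String[] args) {
        AdAuctionSketch auction = new AdAuctionSketch();
        auction.placeBid("widgets", "AcmeWidgets", 0.40);
        auction.placeBid("widgets", "WidgetWorld", 0.25);
        List<Bid> ads = auction.adsFor("widgets");
        System.out.println(ads);             // AcmeWidgets listed first (higher bid)
        auction.click(ads.get(0));
        System.out.println(auction.charges); // {AcmeWidgets=0.4}
    }
}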
Click Fraud • The latest challenge: • users who click on ad links just to cost their competitors money • or who pay Indian housewives $.25/click to do it • http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article • http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms
For more info • See the sources drawn upon here: • Prof. Davis's (NYU/CS) search engines course • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/ • Original research papers by Page & Brin: • The PageRank Citation Ranking: Bringing Order to the Web • The Anatomy of a Large-Scale Hypertextual Web Search Engine • Links on the class page • Interesting and very accessible • Google Labs: http://labs.google.com
You mean that's it? • Final Exam: next Thursday, 5/5, 10-11:50am • Final exam info is up • Course grades are curved • Interest in a review session? • Please fill out course evals! • https://ais.stern.nyu.edu/ • Comments by email, etc., are welcome • Thanks!