C20.0046: Database Management Systems, Lecture #23
M.P. Johnson, Stern School of Business, NYU, Spring 2008
M.P. Johnson, DBMS, Stern/NYU, Spring 2008
Agenda
• Ads
• Data mining
• RAID?
• Regexes?
• Set theory?
• Indices?
• Etc.?
Fixing PR bugs
• Apart from inherited quality, each page also has inherent quality E:
• PR(P) = E + SUM(PR(Li)/C(Li))
• More precisely, take a weighted sum of the two terms:
• PR(P) = 0.15*E + 0.85*SUM(PR(Li)/C(Li))
• Leads to a modified random-surfer model
sales% java PageRank3
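The modified formula can be run as a small power iteration. A minimal sketch on a made-up three-page link graph (this is an illustration, not the slide's PageRank3 program), taking the inherent quality E as uniform:

```python
# Toy PageRank: PR(P) = (1-d)/n + d * sum(PR(L)/C(L)) over pages L
# linking to P, where C(L) is L's out-degree and d = 0.85.

def pagerank(links, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}          # start uniform
    for _ in range(iters):
        new = {}
        for p in pages:
            inbound = sum(pr[q] / len(links[q])
                          for q in pages if p in links[q])
            new[p] = (1 - damping) / n + damping * inbound
        pr = new
    return pr

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}   # assumed toy graph
ranks = pagerank(graph)
```

Since every page here has out-links, the scores stay a probability distribution; C, with two in-links, ends up ranked above B.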
Anchor text
• Motivation: pages may not give the best descriptions of themselves
  • most search engines don't contain the phrase "search engine"
  • BP claim: only 1 of 4 "top search engines" could find themselves on the query "search engine"
• Anchor text offers more descriptions of the page:
  • many pages link to google.com
  • many of them likely say "search engine" in/near the link
• Treat anchor-text words as part of the page
• Search for "US West" or for "g++"
Next: Bidding for ads
• Google had two big ideas:
  • PageRank
  • AdWords/AdSense
• Fundamental difficulty with mass advertising:
  • Most of the audience isn't interested
  • Most people don't want what you're selling
  • Think of car commercials on TV
• But some of them do!
Bidding for ads
• If you're selling widgets, how do you know who wants them?
• Hard question, so invert the question
• If someone is searching for widgets, what should you try to sell them?
• Easy – widgets!
• Whatever the user searches for, display ads relevant to that query
• Or: whatever's on the page he's viewing
Bidding for ads
• Q: How to divvy correspondences up?
• A: Create a market, and let the divvying take care of itself
• Each company places the bid it's willing to pay for an ad responding to a particular query
• Ad auction "takes place" at query time
• Relevant ads displayed in descending bid order (e.g.)
• Company pays only if user clicks
• Huge huge huge business
How to assign ads?
• On each query, must assign a bidder to the top place
• Matching problem
• Online problem
• Also: each bidder has a daily budget
• Many-to-one
• One idea: choose the highest bid (in expectation)
  • Can't be better than a ½-approx (why?)
• Another: choose the highest remaining budget
  • Definitely suboptimal (why?)
• "Trade-off" algorithm: (1-1/e)-approx (~.6321)
  • Best possible guarantee (see Vazirani et al.)
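The "highest remaining budget" heuristic can be sketched as below. The query names, bids, and budgets are made-up toy values, and this is only the simple Balance rule, not the full (1-1/e) trade-off algorithm:

```python
# Balance heuristic sketch: each arriving query goes to the interested
# bidder with the most remaining budget (assumed toy data below).

def balance_assign(queries, bids, budgets):
    """bids: dict query -> {bidder: bid}; returns total revenue earned."""
    remaining = dict(budgets)
    revenue = 0
    for q in queries:
        # only bidders who bid on q and can still afford their bid
        candidates = [b for b, bid in bids.get(q, {}).items()
                      if remaining[b] >= bid]
        if not candidates:
            continue
        winner = max(candidates, key=lambda b: remaining[b])
        remaining[winner] -= bids[q][winner]
        revenue += bids[q][winner]
    return revenue

bids = {"widgets": {"acme": 1, "bolt": 1}}   # hypothetical bidders
budgets = {"acme": 2, "bolt": 2}
revenue = balance_assign(["widgets"] * 4, bids, budgets)
```

Spreading queries across budgets is what keeps late arrivals servable; a greedy rule can exhaust one bidder early and then have no one left to charge.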
Click Fraud
• Medium-sized challenge:
• Users who click on ad links to cost their competitors money
• Or pay housewives in India $.25/click
• http://online.wsj.com/public/article/0,,SB111275037030799121-k_SZdfSzVxCwQL4r9ep_KgUWBE8_20050506,00.html?mod=tff_article
• http://timesofindia.indiatimes.com/articleshow/msid-654822,curpg-1.cms
For more info
• See sources drawn upon here:
• Prof. Davis (NYU/CS) search engines course
  • http://www.cs.nyu.edu/courses/fall02/G22.3033-008/
• Original research papers by Page & Brin:
  • The PageRank Citation Ranking: Bringing Order to the Web
  • The Anatomy of a Large-Scale Hypertextual Web Search Engine
  • Links on class page
  • Interesting and very accessible
• Why search the web when you could download it?
  • Webaroo
New topic: Data Mining
• Situation: data rich but knowledge poor
• Terabytes and terabytes of data
• Searching for needles in haystacks
• Can collect lots of data
  • credit card transactions
  • bar codes
  • club cards
  • webserver logs
  • call centers
  • cameras
  • 311 calls in NYC
  • http://blogs.law.harvard.edu/jim/discuss/msgReader$351
  • http://www.nytimes.com/2003/12/01/nyregion/01PHON.html?ex=1385701200&en=0d74f2f58599fe35&ei=5007&partner=USERLAND
Lots of data
• Can store this data
  • DBMSs, data warehousing
• Can query it
  • SQL
  • DW/OLAP queries (CUBE, ROLLUP, etc.)
  • find the things such that…
  • find the totals for each of…
• Can get answers to specific questions, but what does it all mean?
But how to learn from this data?
• What kinds of people will buy my widgets?
• Whom should I send my direct-mail literature to?
• How can I segment my market?
  • http://www.joelonsoftware.com/articles/CamelsandRubberDuckies.html
• Whom should we approve for a gold card?
• Who gets approved for car insurance?
  • And what should we charge?
ML/DM: two basic kinds
• Supervised learning
  • Classification
  • Regression
• Unsupervised learning
  • Clustering
  • Find something interesting: frequent sets, …
Supervised learning
• Situation: you have many particular instances
  • Transactions
  • Customers
  • Credit applications
• Have various fields describing them
• But one other property is what you're interested in
• Goal: infer this dependent property from the other data
Supervised learning
• Supervised learning starts with training data
  • Many instances including the dependent property
• Use the training data to build a model
  • Many different algorithms
  • GAs, NNs, decision trees, etc.
• Given the model, you can now determine the dependent property for new instances
  • And ideally, the answers are "correct"
• Categorical property → "classification"
• Numerical property → "regression"
Classification applications
• Construe as inference of one property from others:
• Spam filtering
• Network intrusion detection
• Trading strategies
• Fraud detection
  • See also: Zipf's & Benford's laws
• Political marketing?
  • Mail-order tulip bulbs → Conservative party voters
k-Nearest Neighbor
• Very simple algorithm
• Sometimes works well
• Map training data points in a space
• Given any two points, can compute the distance between them
  • Euclidean
• Given an unlabeled point, label this way:
  • Find the k nearest points
  • Let them vote on the label
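The vote-of-the-k-nearest rule fits in a few lines; a minimal sketch with made-up 2-D training points and Euclidean distance:

```python
# k-NN sketch: sort training points by distance to the query point,
# take the k closest, and return the majority label among them.
import math
from collections import Counter

def knn_predict(train, point, k=3):
    """train: list of ((x, y), label) pairs; returns the winning label."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# assumed toy data: one clump labeled "a", one labeled "b"
train = [((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
         ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
```

Odd k avoids ties in two-class problems; the full sort is fine at this scale, though real systems use spatial indexes rather than scanning all points.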
Naïve Bayes Classifier
• Bayes' Theorem: Pr(B|A) = Pr(B,A)/Pr(A) = Pr(A|B)*Pr(B)/Pr(A)
• Idea: Pr(B|A) can be defined in terms of Pr(A|B)
• Here: Pr(S|W) = Pr(S,W)/Pr(W) = Pr(W|S)*Pr(S)/Pr(W)
  • W means: "the msg has the word W (or several words…)"
  • S means: "it's spam"
• Easy: we can (mathematically) compute Pr(W|S)
• Goal: we want Pr(S|W) – i.e., is this spam?
Naïve Bayes Classifier
• This is supervised learning, but training and use are intermingled
  • Spam filter improves over time
• But let's assume we're given labeled sets of spam and good messages
• Training phase:
  • For each word Wi, compute Pr(Wi)
  • For each word Wi, compute Pr(Wi|S)
  • Compute Pr(S)
• That's it!
• Now, we wait for email to arrive…
Naïve Bayes Classifier
• Msgs may have many words W = W1..Wn
• So we need to predict Pr(S|W) = Pr(S|W1..Wn) = Pr(W1..Wn|S)*Pr(S)/Pr(W1..Wn)
• Assuming total independence*, we get
  • Pr(W1..Wn) = Pr(W1)*…*Pr(Wn)
  • Pr(W1..Wn|S) = Pr(W1|S)*…*Pr(Wn|S)
• We've precomputed the numbers on the right
• Now we can compute what we need
* Which isn't true…
Naïve Bayes Classifier
• To decide the spam status of a new msg, we just compute Pr(S|W)
• If this is > 90% (say), call it spam
  • False positives are very bad (why?)
• For each new message (spam/not), we update our probabilities
• Simple and unprincipled, but works very well
  • can write it in a page of Perl code
• See also Paul Graham: http://www.paulgraham.com/spam.html
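The whole pipeline – train word counts, then score Pr(S|W1..Wn) under the independence assumption – fits in a short sketch. The training messages are made up, and Laplace (add-one) smoothing is added so unseen words don't zero out the product:

```python
# Naive Bayes spam sketch: Pr(S|W1..Wn) is proportional to
# Pr(S) * product of Pr(Wi|S); normalize against the ham score.
from collections import Counter

def train_nb(spam_msgs, ham_msgs):
    spam_counts = Counter(w for m in spam_msgs for w in m.split())
    ham_counts = Counter(w for m in ham_msgs for w in m.split())
    vocab = set(spam_counts) | set(ham_counts)
    p_spam = len(spam_msgs) / (len(spam_msgs) + len(ham_msgs))
    return spam_counts, ham_counts, vocab, p_spam

def pr_spam(msg, model):
    spam_counts, ham_counts, vocab, p_spam = model
    ps, ph = p_spam, 1 - p_spam
    n_s, n_h, v = sum(spam_counts.values()), sum(ham_counts.values()), len(vocab)
    for w in msg.split():
        ps *= (spam_counts[w] + 1) / (n_s + v)   # add-one smoothing
        ph *= (ham_counts[w] + 1) / (n_h + v)
    return ps / (ps + ph)

# assumed toy training sets
model = train_nb(["win money now", "free money"],
                 ["meeting at noon", "lunch at noon"])
```

In practice one works with log-probabilities to avoid underflow on long messages; the tiny example above doesn't need that.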
New topic: Clustering
• Unsupervised learning
  • partition items into clusters of the same type
• What different types of customers do we have?
• What different types of webpages are there?
• Each item becomes a point in the space
  • One dimension for every property
• Possible plotting for webpages: a dimension for each word
  • Value in dimension d is (# occurrences of d) / (# of all word occurrences)
• k-means
k-means
• Simple clustering algorithm
• Want to partition data points into sets of "similar" instances
• Plot in n-space
• partition points into clumps in space
• Like an unsupervised analog to k-NN
  • But iterative
k-means
• Alg:
  • Choose initial means (centers)
  • do {
      assign each point to the closest mean
      recompute means of clumps
    } until change in means < e
• Visualization:
  • http://www.delft-cluster.nl/textminer/theory/kmeans/kmeans.html
  • Gone!
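The assign/recompute loop above can be sketched directly; the points and initial means below are made-up 2-D toy values:

```python
# k-means sketch: repeat (assign each point to its closest mean,
# recompute each mean as the centroid of its clump) until the means
# move less than eps.
import math

def kmeans(points, means, max_iters=100, eps=1e-6):
    for _ in range(max_iters):
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda j: math.dist(p, means[j]))
            clusters[i].append(p)
        # centroid of each clump; keep the old mean if a clump is empty
        new_means = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else m
                     for pts, m in zip(clusters, means)]
        if max(math.dist(m, n) for m, n in zip(means, new_means)) < eps:
            return new_means, clusters
        means = new_means
    return means, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
means, clusters = kmeans(pts, [(0, 0), (10, 10)])
```

The result depends on the initial means; real uses run it several times from random starts and keep the best clustering.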
New topic: Frequent Sets
• Find sets of items that go together
• Intuition: market-basket data
• Suppose you're Wal-Mart, with 460TB of data
  • http://developers.slashdot.org/article.pl?sid=04/11/14/2057228
• Might not know customer history (but maybe: club cards!)
• One thing you do know: groupings of items
  • Each set of items rung up in an individual purchase
• Famous (claimed) example: beer and diapers are positively correlated
  • Intuition: fathers can't take babies to the bar
  • → colocate beer and diapers
Finding Frequent Sets
• Situation: one table
  • Baskets(id, item)
  • One row for each purchase of each item
  • Alternative interpretation: words on webpages
• Goal: given support s, find sets of items that appear together in >= s baskets
• For pairs, the obvious attempt:

SELECT B1.item, B2.item, count(*)
FROM Baskets B1, Baskets B2
WHERE B1.id = B2.id AND B1.item < B2.item
GROUP BY B1.item, B2.item
HAVING count(*) >= s;

Can we do better?
A Priori Property
• a priori ~ prior to
  • Prior to investigation
  • Deductive/from reason/non-empirical
• As opposed to a posteriori ~ post-
  • After investigation/experience
• Logical observation: if support(X) >= s, then for every individual item x in X, support(x) >= s
• → any item y with support < s can be ignored
A Priori e.g.
• You sell 10,000 items and store records of 1,000,000 baskets, with an avg of 4 items each
  • 4,000,000 rows in Baskets
• # pairs of items from 1,000,000 baskets = C(4,2)*1,000,000 = 6,000,000 pairs!
• One idea:
  • Find support-s items
  • Run the old query on them
A Priori Property
• Suppose we're looking for sup-10,000 pairs
• Each item in a pair must itself be sup-10,000
• Counting argument: have only 4,000,000 individual items purchased
• Can have at most 400 = 4,000,000/10,000 popular items
• → we get to ignore most of the rows
• The naïve solution above is now much cheaper
• Can we do better?
A Priori Algorithm
• But: a frequent itemset may have > 2 items
• Idea: build frequent sets (not just pairs) iteratively
• Alg:
  • First, find all frequent items
    • These are the size-1 frequent sets
  • Next, for each size-k frequent set
    • For each frequent item
    • Check whether the union is frequent
    • If so, add it to the collection of size-(k+1) frequent sets
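The grow-by-one iteration above can be sketched over Baskets-style data; the baskets below are made-up toy purchases (the beer-and-diapers example), and support is counted by scanning, which is fine at this scale:

```python
# A-priori sketch: size-1 frequent sets first, then repeatedly extend
# each frequent set by one frequent item, keeping only extensions that
# still meet the support threshold s.

def apriori(baskets, s):
    """baskets: list of sets of items; returns all itemsets with support >= s."""
    items = {i for b in baskets for i in b}

    def support(itemset):
        return sum(1 for b in baskets if itemset <= b)

    current = [frozenset([i]) for i in items if support(frozenset([i])) >= s]
    result = list(current)
    while current:
        nxt = set()
        for fs in current:
            for i in items:
                cand = fs | {i}
                if len(cand) == len(fs) + 1 and support(cand) >= s:
                    nxt.add(cand)
        result.extend(nxt)
        current = list(nxt)
    return result

baskets = [{"beer", "diapers", "chips"}, {"beer", "diapers"},
           {"beer", "chips"}, {"diapers", "milk"}]
freq = apriori(baskets, 2)
```

The pruning is exactly the a priori property: a candidate is only ever built from sets already known frequent, so low-support items never enter any candidate.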
Related goal: rules
• Would like to infer: bought A → interested in B
• Related to frequent sets, but now
  • A, B should be frequent (have high support) and
  • the rule should be true (have high confidence)
• More advanced algs…
Next: Media Failures
• Rule of thumb: Pr(hard drive has head crash within 10 years) = 50%
• Simpler rule of thumb: Pr(hard drive has head crash within 1 year) = (say) 10%
• If you have many drives, then failures are a regular occurrence
• Soln: different RAID strategies
  • RAID: Redundant Arrays of Independent Disks
RAID levels
• RAID level 1: each disk gets a mirror
• RAID level 4: one disk is the xor of all the others
  • Each bit is the sum mod 2 of the corresponding bits
• E.g.:
  • Disk 1: 10110011
  • Disk 2: 10101010
  • Disk 3: 00111000
  • Disk 4: ?
• How to recover?
• What's the disadvantage of R4?
• Various other RAID levels in text…
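The parity arithmetic can be checked directly on the slide's three data disks: the parity disk is the bitwise xor of the data disks, and a crashed disk is recovered by xoring the survivors with the parity. A minimal sketch (which also answers the Disk 4 question, so skip it if you want to work that out first):

```python
# RAID-4 parity sketch: each parity bit is the mod-2 sum (xor) of the
# corresponding data bits; xor is its own inverse, so any one lost
# disk equals the xor of everything else.

def xor_parity(disks):
    out = 0
    for d in disks:
        out ^= d
    return out

d1, d2, d3 = 0b10110011, 0b10101010, 0b00111000   # the slide's disks
parity = xor_parity([d1, d2, d3])                 # contents of disk 4

# Simulate losing disk 2, then recover it from the survivors + parity.
recovered = xor_parity([d1, d3, parity])
```

This also shows R4's disadvantage: every write to any data disk must also update the single parity disk, making it a bottleneck (which RAID 5 fixes by rotating parity).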
New topic: Regular Expressions
• Finite automata are studied in automata theory
  • FAs are the simplest/weakest kind of computer
  • Turing machines are the strongest
  • Chomsky's hierarchy
• Each FA is equivalent to some regular expression
• Expressions that specify a pattern
  • Can check whether a string matches the pattern
• The regex generates words; the FA accepts them
RegEx matching
• Use REGEXP_LIKE in Oracle
• Metachar for any char is .
• First, get the employee_comment table:
  • http://pages.stern.nyu.edu/~mjohnson/oracle/empcomm.sql
• And install…
Working with SQL*Plus
• Command on sales: sqlplus
  • See my login.sql for settings…
• Run a local file of scripts:
  • SQL> @filename
• Get a list of my tables:
  • SQL> select * from cat;
• Describe a table:
  • SQL> describe table-name
RegEx matching
• Now do a search:
• So far, a lot like LIKE

SELECT emp_id, text
FROM employee_comment
WHERE REGEXP_LIKE(text, '...-....');
RegEx matching
• Can also pull out the matching text with REGEXP_SUBSTR:

SELECT emp_id, REGEXP_SUBSTR(text, '...-....') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '...-....');

• If we want only numbers, can specify a set of chars rather than a dot:

SELECT emp_id, REGEXP_SUBSTR(text, '[0123456789]..-...[0123456789]') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0123456789]..-...[0123456789]');
RegEx matching
• Or can specify a range of chars:

SELECT emp_id, REGEXP_SUBSTR(text, '[0-9]..-...[0-9]') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]..-...[0-9]');

• Or, finally, can repeat a match several times:

SELECT emp_id, REGEXP_SUBSTR(text, '[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]{3}-[0-9]{4}');
RegEx matching
• Other operators:
  • * – 0 or more matches
  • + – 1 or more matches
  • ? – 0 or 1 match
  • | – OR two options together
• Here: we should catch area codes, if they appear:

SELECT emp_id, REGEXP_SUBSTR(text, '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '[0-9]{3}-[0-9]{3}-[0-9]{4}|[0-9]{3}-[0-9]{4}');
RegEx matching
• There's a shared structure between the two alternatives, but
  • the area code is just optional
• Can use the ? op
• Must use parens to indicate the whole area-code regex is optional

SELECT emp_id, REGEXP_SUBSTR(text, '([0-9]{3}-)?[0-9]{3}-[0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '([0-9]{3}-)?[0-9]{3}-[0-9]{4}');
RegEx matching
• Also, different kinds of phone-number separators:
  • dash, dot, blank
• One solution:
  • Write a RegEx for each case
  • OR them all together
• Better: just use a set of choices for each separator
  • Characters in [ ] mean match any one:

SELECT emp_id, REGEXP_SUBSTR(text, '([0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '([0-9]{3}[-. ])?[0-9]{3}[-. ][0-9]{4}');
RegEx matching
• One other thing: area codes in parentheses
• Parens are part of the regex language
• So these parens must be escaped: \( \)

SELECT emp_id, REGEXP_SUBSTR(text, '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}') text
FROM employee_comment
WHERE REGEXP_LIKE(text, '([0-9]{3}[-. ]|\([0-9]{3}\) )?[0-9]{3}[-. ][0-9]{4}');
Future
• Project presentations
• Practice hw3
• Final exam
  • Info up soon…