Information Carnivores
Built by Selberg & Etzioni • Released in June 1995 • By 2000 it aggregated 12 search engines: LookSmart, About, Infoseek, GoTo, Google, DirectHit, RealNames, WebCrawler, AltaVista, Excite, Lycos, Thunderstone • Ownership history: Netbot • Go2net • InfoSpace • ???
Metasearch pipeline: user enters query → formulate queries → send to the underlying engines (Lycos, Excite, . . .) → collate results → remove duplicates → download? → post-process + rank → present to user
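A minimal sketch of this fan-out-and-merge pipeline, assuming hypothetical engine callables and ranking function (this is illustrative, not the system's actual code):

```python
from concurrent.futures import ThreadPoolExecutor

def metasearch(query, engines, rank_fn):
    """Send the query to every engine, collate results, dedupe, and rank.
    `engines` maps a name to a callable(query) returning a list of result URLs
    (a hypothetical interface; the real system spoke each engine's own API)."""
    with ThreadPoolExecutor() as pool:
        per_engine = pool.map(lambda e: e(query), engines.values())
    collated = [url for results in per_engine for url in results]
    deduped = list(dict.fromkeys(collated))            # remove duplicates, keep order
    return sorted(deduped, key=rank_fn, reverse=True)  # post-process + rank
```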
The Need for Wrappers: the Web holds lots of information, but computers don't understand much of it.
Example 1: Seminar announcement (news article)

<0.15.4.95.15.11.55.rudibear+@CMU.EDU.0>
Type: cmu.andrew.assocs.UEA
Topic: Re: entreprenuership speaker
Dates: 17-Apr-95
Time: 7:00 PM
PostedBy: Colin S Osburn on 15-Apr-95 at 15:11 from CMU.EDU
Abstract: hello again — to reiterate, there will be a speaker on the law and startup businesses this monday evening the 17th. it will be at 7pm in room 261 of GSIA in the new building, ie upstairs. please attend if you have any interest in starting your own business or are even curious. Colin

IE output:
 date = monday evening the 17th
 speaker = ?
 start-time = 7pm
 end-time = ?
 location = room 261 of GSIA
Example 2: Seminar announcement Web pages

IE output:
 date = Nov 5, speaker = Dr. Rodger Kibble, affil = University of Brighton, title = Using centering...
 date = Nov 19, speaker = Dr. Reinhard Muskens, affil = Katholieke Univ..., title = Underspecification...
 date = Nov 26, speaker = Dr. Julie Berndsen, affil = University College..., title = A Generic Lexicon...
 ...
Strategy: Wrappers — the user sends queries to a mediator; the mediator routes them through wrapper A, wrapper B, and wrapper C to resource A, resource B, and resource C, and the results flow back through the wrappers and the mediator to the user.
Scaling issues
• A custom wrapper is needed for each resource, but hand-coding wrappers is tedious — especially since sites frequently change format.
• The useful information is buried in markup like this:

<HTML><BODY BGCOLOR="FFFFFF" LINK="00009C" ALINK="00009C" VLINK="00009C" TEXT="000000"> <center> <table><tr><td><NOBR> <NOBR><img src="/ypimages/b_r_hd_a.gif" border=0 ALT="Switchboard Results" width=407 height=20 align=top><A HREF="/bin/cgiqa.dll?MEM=1" TARGET="_top"><img src="/ypimages/b_r_hd_1.gif" border=0 ALT="People" width=54 height=20 align=top></A><A HREF="/bin/cgidir.dll?MEM=1" TARGET="_top"><img src="/ypimages/b_r_hd_2.gif" border=0 ALT="Business" width=62 height=24 align=top></A><A HREF="/" TARGET="_top"><img src="/ypimages/b_r_hd_3.gif" border=0 ALT="Home" width=47 height=20 align=top></A></NOBR><br></td></tr></table> </center><center><table border=0 width=576> <tr><td colspan=2 align=center> <center>
Wrapper Approaches
• Perl-like languages — simple and effective (if tedious)
• Proprietary languages & tools — click and generalize
• Conversion to tree form — use XML as an intermediate representation; extract the children of a specified node
• Machine learning — promising, but not yet fielded
Kushmerick's contribution: machine learning techniques to automatically construct the wrapper procedure from example pages such as:

<HTML><HEAD>Some Country Codes</HEAD><BODY><B>Some Country Codes</B><P><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR><HR><B>End</B></BODY></HTML>
Example — the tuples to extract: (Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)
LR wrappers: the basic idea — exploit fortuitous non-linguistic regularity.

<HTML><TITLE>Some Country Codes</TITLE>
<B>Congo</B> <I>242</I><BR>
<B>Egypt</B> <I>20</I><BR>
<B>Belize</B> <I>501</I><BR>
<B>Spain</B> <I>34</I><BR>
</BODY></HTML>

Use <B>, </B>, <I>, </I> for parsing.
Country/Code LR (Left-Right) wrapper

procedure ExtractCountryCodes
 while there are more occurrences of <B>:
  1. extract Country between <B> and </B>
  2. extract Code between <I> and </I>
"Generic" LR wrapper

procedure ExtractAttributes
 while there are more occurrences of l1:
  1. extract the 1st attribute between l1 and r1
   . . .
  K. extract the Kth attribute between lK and rK

An LR wrapper is just 2K strings l1, r1, …, lK, rK — a left delimiter and a right delimiter for each of the K attributes. They need not be HTML tags!
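A minimal Python sketch of this generic LR procedure, assuming the page fits in a single string (the function name and data layout are illustrative, not Kushmerick's code):

```python
def lr_extract(page, delimiters):
    """Extract tuples using an LR wrapper given as [(l1, r1), ..., (lK, rK)]."""
    tuples, pos = [], 0
    l1 = delimiters[0][0]
    while page.find(l1, pos) != -1:          # more occurrences of l1?
        row = []
        for lk, rk in delimiters:            # k = 1 .. K
            start = page.find(lk, pos) + len(lk)
            end = page.find(rk, start)
            row.append(page[start:end])      # attribute k lies between lk and rk
            pos = end + len(rk)
        tuples.append(tuple(row))
    return tuples

# On the country-code page this yields [('Congo', '242'), ('Egypt', '20'), ...]
page = ('<HTML><TITLE>Some Country Codes</TITLE>'
        '<B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR>'
        '<B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>')
print(lr_extract(page, [('<B>', '</B>'), ('<I>', '</I>')]))
```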
Wrapper induction algorithm
1. Gather enough example pages from the page supply to satisfy the termination condition (PAC model, with user-supplied PAC parameters).
2. Label the example pages (by hand or with an automatic page labeler).
3. Find a wrapper consistent with the labeled examples.
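A schematic sketch of this outer loop. The helpers gather pages, label them, and learn a consistent wrapper; `enough_pages` stands in for the PAC-model termination test, and all names here are illustrative assumptions:

```python
def induce_wrapper(page_supply, label_page, learn_lr, enough_pages):
    """Outer loop of wrapper induction: gather, label, then learn a wrapper
    consistent with the labeled examples.  `enough_pages` is a stand-in for
    the PAC-model termination condition."""
    labeled = []
    for page in page_supply:
        labeled.append((page, label_page(page)))   # mark attribute occurrences
        if enough_pages(labeled):                  # PAC termination condition
            break
    return learn_lr(labeled)                       # wrapper consistent with examples
```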
Finding an LR wrapper

Given labeled pages such as:

<HTML><HEAD>Some Country Codes</HEAD><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

find the strings l1, r1, …, lK, rK. In this example: find the 4 strings l1, r1, l2, r2 — e.g. <B>, </B>, <I>, </I>.
LR: Finding r1

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

r1 can be any prefix of the text following each country name, e.g. </B>.
LR: Finding l1, l2, and r2

<HTML><TITLE>Some Country Codes</TITLE><B>Congo</B> <I>242</I><BR><B>Egypt</B> <I>20</I><BR><B>Belize</B> <I>501</I><BR><B>Spain</B> <I>34</I><BR></BODY></HTML>

l1 can be any suffix of the text preceding each country name, e.g. <B>.
l2 can be any suffix of the text preceding each code, e.g. <I>.
r2 can be any prefix of the text following each code, e.g. </I>.
Finding an LR wrapper: Algorithm

Naïve algorithm — enumerate all combinations: O(S^(2K))
 for each candidate l1
  for each candidate r1
   ···
    for each candidate lK
     for each candidate rK
      succeed if consistent with examples

Efficient algorithm — the constraints are independent: O(KS)
 for k = 1 to K
  for each candidate rk
   succeed if consistent with examples
 for k = 1 to K
  for each candidate lk
   succeed if consistent with examples

S = length of examples, K = number of attributes.
(Speaker note: the real average-case value is K·L^(2/3), so the statement above is not wrong, but mention this so it won't confuse anyone.)
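A sketch of the efficient search for the right delimiters, exploiting the fact that each delimiter can be validated on its own. The labeled-page format, the consistency test, and the helper name are simplified assumptions, not Kushmerick's exact formulation:

```python
def find_lr_delimiters(labeled_pages, K):
    """Efficient LR search: each r_k (and, symmetrically, each l_k) is
    validated independently of the other delimiters, giving O(K*S) instead
    of the naive O(S^(2K)).  `labeled_pages` pairs a page string with
    spans[k] = list of (start, end) offsets of attribute k's instances."""
    def consistent_r(rk, k):
        # r_k must immediately follow every labeled instance of attribute k
        # and must not occur inside any of those instances
        return all(page.startswith(rk, end) and rk not in page[start:end]
                   for page, spans in labeled_pages
                   for (start, end) in spans[k])

    r = []
    page0, spans0 = labeled_pages[0]
    for k in range(K):
        _, end0 = spans0[k][0]
        tail = page0[end0:]                # candidate r_k = any prefix of this tail
        rk = next((tail[:i] for i in range(1, len(tail) + 1)
                   if consistent_r(tail[:i], k)), None)
        r.append(rk)
    # the l_k delimiters are found the same way, using suffixes of the text
    # that precedes each instance of attribute k
    return r
```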
Summary of Kushmerick PhD results

 wrapper class   useful?   learnable?
 LR              53 %      O(KS)
 HLRT            57 %      O(KS^2)
 OCLR            53 %      O(KS^2)
 HOCLRT          57 %      O(KS^4)
 N-LR            13 %      O(S^(2K))
 N-HLRT          50 %      O(S^(2K+2))
 total           70 %

"useful?" = fraction of sites the wrapper class can handle in a "search.com" survey (AltaVista, WebCrawler, WhoWhere, CNN Headlines, Lycos, Shareware.Com, AT&T 800 Directory, ...). "learnable?" = time to automatically build wrappers. K = number of attributes, S = size of examples.
"Strong" trainable IE systems
• Examples:
 – CRYSTAL (Soderland et al., 1995)
 – SRV (Freitag, 1999)
 – Rapier (Califf & Mooney, 1999)
• General approach:
 – Define a space of possible extraction rules.
 – Learning = searching the rule space for a set of rules that individually cover many positive examples and few negative examples.
 – Sometimes use POS tagging and other shallow linguistic pre-processing.
SRV (Freitag's CMU PhD)
• First-order-logic interpretation: a rule is a conjunction of literals; a literal is a predefined relational encoding of a document.
• (The slide also shows an English interpretation of a rule alongside an example document.)
Learning curves: Rapier ≈ SRV as training data increases (job-listings domain).
SRV: Pseudo-pseudo-code

procedure SRV(training examples E)
 RuleSet ← {}
 while E is not empty:
  Rule ← TRUE
  repeat
   let Best be the literal that most improves Rule
    according to an information-theoretic gain metric
   Rule ← Rule ∧ Best
  until no such Best exists
  remove the examples covered by Rule from E
  RuleSet ← RuleSet + Rule
 return RuleSet
Covering algorithm: Pseudo-example — (figure: positive (+) and negative (−) examples scattered in a plane; successive rules 1–9 each cover a region of mostly positive examples, which are then removed before the next rule is learned)
Why am I telling you this?
• "Strong" trainable IE systems explore a complex rule space…
 – Complicated algorithm/implementation
 – Deep & bushy search space
 – Susceptible to overfitting (?)
• Existing algorithms are covering algorithms
 – There are other ways to reweight examples (e.g., boosting)
 – Theoretically more satisfying
 – Learned rules are more accurate (?)
• If we use a cleverer reweighting scheme, can we get away with simpler rules? Can we do better than the "strong" learner?!
Boosting
• Boosting (Schapire, Freund, et al.): a general ML technique for improving the performance of a "weak" learning algorithm by repeatedly applying the learner, each time modifying the training-data weights to force the weak learner to focus on examples that were previously classified incorrectly.
• Given: weak learner L
• Output: boosted learner L′ that uses L as a "subroutine"
• Theorem: any learning algorithm L with training error < ½ can be mechanically converted into an algorithm L′ with error arbitrarily close to 0.
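A minimal sketch of the generic AdaBoost-style reweighting loop (binary labels ±1; this is the standard scheme, not BWI's exact variant):

```python
import math

def adaboost(weak_learner, X, y, rounds):
    """Generic AdaBoost loop: reweight examples so each new weak hypothesis
    focuses on what the previous ones got wrong.  `weak_learner(X, y, w)`
    must return a classifier h with h(x) in {-1, +1}; labels y are +/-1."""
    n = len(X)
    w = [1.0 / n] * n                       # uniform initial weights
    ensemble = []
    for _ in range(rounds):
        h = weak_learner(X, y, w)
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        if err >= 0.5:                      # weak learner must beat chance
            break
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, h))
        # increase the weight of misclassified examples, decrease the rest
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```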
Reweighting example — (figure: example weights plotted over boosting iterations t = 1…4; weak hypotheses h1, h2, h3 are learned in turn, instances correctly classified by ht have their weights reduced, and the weak learner will focus on the remaining instances at iteration t = 5)
BWI's extraction patterns
• Basic building block: the boundary detector.
• A detector d = ["prefix" pattern]["suffix" pattern], e.g. d = [who :][dr . <Capitalized>], where <Capitalized> is a wildcard.
• Associated with every boundary detector d is a numeric confidence value Cd.
• Detector d matches a boundary B if the prefix pattern matches the tokens to B's left and the suffix pattern matches the tokens to B's right.
• Example: "Who: ▸ Dr. Jane Smith" — the boundary B falls between "Who:" and "Dr.".
Wildcards
• "Standard" wildcards:
 – Anything
 – Alphabetic
 – Capitalized
 – Lowercase
 – Alphanumeric
 – Numeric
 – Punctuation
 – SingleChar (one-character token)
• Also tried several simple "lexical" wildcards:
 – Firstname (dictionary of names from the US Census)
 – Lastname
 – NonEnglishWord (tokens not in /usr/dict/words)
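A minimal sketch of boundary-detector matching, with a few of the wildcards above implemented as token predicates (the token representation and function names are illustrative assumptions):

```python
# Wildcard predicates over single tokens (a subset of those listed above)
WILDCARDS = {
    "<Anything>":    lambda t: True,
    "<Capitalized>": lambda t: t[:1].isupper(),
    "<Numeric>":     lambda t: t.isdigit(),
    "<Alphabetic>":  lambda t: t.isalpha(),
}

def token_matches(pattern_tok, tok):
    """A pattern token is either a literal (case-insensitive) or a wildcard."""
    if pattern_tok in WILDCARDS:
        return WILDCARDS[pattern_tok](tok)
    return pattern_tok.lower() == tok.lower()

def detector_matches(prefix, suffix, tokens, b):
    """Detector [prefix][suffix] matches boundary b (a position between
    tokens[b-1] and tokens[b]) if the prefix matches the tokens to its left
    and the suffix matches the tokens to its right."""
    left, right = tokens[:b], tokens[b:]
    return (len(left) >= len(prefix) and len(right) >= len(suffix)
            and all(token_matches(p, t) for p, t in zip(prefix, left[-len(prefix):]))
            and all(token_matches(s, t) for s, t in zip(suffix, right)))

# e.g. the detector [who :][dr . <Capitalized>] on "Who : Dr . Jane Smith"
tokens = ["Who", ":", "Dr", ".", "Jane", "Smith"]
print(detector_matches(["who", ":"], ["dr", ".", "<Capitalized>"], tokens, 2))  # True
```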
Detector learning algorithm
• Input: training examples. Output: boundary detector d = ⟨p, s⟩.
• Start with the empty detector d = [][] and grow the detector one token at a time.
• Repeat until d can't be improved:
 – Consider all ways to grow the prefix by one token and all ways to grow the suffix by one token.
 – Pick the extension that most improves d's accuracy on the training data.
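A greedy-growth sketch of this loop. The scoring function and candidate-token generation are simplified stand-ins for BWI's weighted-accuracy gain, so treat this as an illustration rather than the algorithm itself:

```python
def learn_detector(examples, candidate_tokens, score):
    """Greedily grow a detector d = (prefix, suffix) one token at a time.
    `score(prefix, suffix, examples)` is a stand-in for the (weighted)
    accuracy of the detector on the boundary examples."""
    prefix, suffix = [], []                       # start with the empty detector [][]
    best = score(prefix, suffix, examples)
    while True:
        improved = None
        for tok in candidate_tokens:
            # grow the prefix outward by one token, or grow the suffix by one token
            for p, s in (([tok] + prefix, suffix), (prefix, suffix + [tok])):
                sc = score(p, s, examples)
                if sc > best:
                    best, improved = sc, (p, s)
        if improved is None:                      # no extension helps: stop
            return prefix, suffix
        prefix, suffix = improved
```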
Boosted Wrapper Induction
• A wrapper =
 1. Start detectors dS1, dS2, …
 2. End detectors dE1, dE2, …
 3. A length histogram L, mapping field lengths to values in [0, 1]
• To invoke the wrapper on a document:
 – Apply all detectors to the entire document.
 – Score every boundary B as the sum of the confidences Cd of the detectors that match B.
 – Extract all substrings (BS, BE) whose combined score satisfies a user-specified confidence threshold (a code sketch follows the worked example below).
Extraction example

Start detectors: dS1 = [a b c][d e], conf 0.2; dS2 = [p q][r s t], conf 0.4
End detectors: dE1 = [w x][y z], conf 0.5; dE2 = [m][n o], conf 0.3
Length histogram L (values shown: 0.3 and 0.2; L(3) = 0.3)

(Figure, built up across three slides: four candidate boundaries in an example document. The matching start detectors give StartScores of 0.6, 0.2, and 0.4; the matching end detectors give EndScores of 0.5, 0.3, and 0.3.)

Combining one start/end pair, with StartScore(SB) = 0.6, EndScore(SE) = 0.5, and SE − SB = 3 tokens:
 StartScore(SB) × EndScore(SE) × L(SE − SB) = 0.6 × 0.5 × 0.3 = 0.09
— roughly, "the probability that '38-44K' is a correct value".
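A minimal sketch of this scoring step, reusing detector_matches from the earlier sketch; detectors are (prefix, suffix, conf) triples, the length histogram is a plain dict, and the candidate enumeration is simplified:

```python
def boundary_score(detectors, tokens, b):
    """Score of boundary b = sum of confidences of the detectors that match it."""
    return sum(conf for prefix, suffix, conf in detectors
               if detector_matches(prefix, suffix, tokens, b))

def extract(tokens, start_dets, end_dets, length_hist, threshold):
    """Extract all (start, end) spans whose combined score clears the threshold:
    StartScore(s) * EndScore(e) * L(e - s)."""
    n = len(tokens)
    spans = []
    for s in range(n + 1):
        ss = boundary_score(start_dets, tokens, s)
        if ss == 0:
            continue
        for e in range(s + 1, n + 1):
            score = ss * boundary_score(end_dets, tokens, e) * length_hist.get(e - s, 0.0)
            if score > threshold:             # e.g. 0.6 * 0.5 * 0.3 = 0.09
                spans.append((s, e, score))
    return spans
```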
BWI Algorithm
• Procedure BWI
 Input: training examples
 Output: start & end detectors, length histogram
 Parameters: number of boosting rounds T, lookahead depth L, wildcards
• S = start-boundary examples; E = end-boundary examples
• Start-detectors = AdaBoost(LearnDetector, S)
• End-detectors = AdaBoost(LearnDetector, E)
• Construct the length histogram L from the training data
Example
Experiments
• 16 IE tasks from 8 document collections
 – 8 fields from 3 "traditional" domains: seminar announcements; job listings; Reuters corporate-acquisition articles
 – 8 fields from 5 "wrapper" domains: CS department faculty lists; Zagat's restaurant reviews; LA Times restaurant reviews; Internet Address Finder; a stock quote server
• Performance metrics
 – Precision (fraction of extracted items that are correct)
 – Recall (fraction of items in the documents that were extracted)
 – F1 = 2 / (1/Precision + 1/Recall)
• Competitors: SRV, Rapier, and an algorithm based on hidden Markov models
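For reference, the three metrics as code (a trivial sketch; the counts of correct, extracted, and true items are assumed inputs):

```python
def precision(correct_extracted, total_extracted):
    return correct_extracted / total_extracted if total_extracted else 0.0

def recall(correct_extracted, total_true):
    return correct_extracted / total_true if total_true else 0.0

def f1(p, r):
    # harmonic mean of precision and recall: F1 = 2 / (1/P + 1/R)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# e.g. 80 correct out of 100 extracted, with 120 true items in the documents
p, r = precision(80, 100), recall(80, 120)
print(round(f1(p, r), 3))   # 0.727
```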
Results: 16 tasks × 4 algorithms — (figure: in the pairwise comparisons, BWI comes out ahead in 21 cases and behind in 7)
Summary & Conclusions
• BWI learns simple wrapper-like extraction patterns; each pattern has high accuracy but low coverage.
• It uses boosting to focus the weak pattern learner on the difficult training examples.
• It works because a few dozen or a few hundred patterns (not millions!) suffice for broad coverage.
• Many real-world natural corpora have their own stereotypical language, non-grammatical utterances, stylistic constraints, editorial guidelines, formatting regularities, etc., that greatly simplify extraction.
• BWI outperforms its 3 competitors in 75% of the comparisons.