Integration of Friendly Data Islands on the Web. Information Extraction.
Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Conclusions
The theory • A wrapper is a building block that provides an ad-hoc, message-based API to an app • They interface apps at one or more layers, but, more often than not, they must deal with the user interface or the data layer • Application layers (diagram): User Interface, Controller, Business Logic, Data Access Layer, Data Layer
The problem • Sample product page: The Da Vinci Code; Dan Brown; Doubleday, 2006; 15.95 €; "Robert Langdon is a Harvard Professor of Symbology…"; Buy
Features of current web documents • Trillions of documents • Generated on demand by software applications • Change continuously • Require navigation from search forms • Written in telegraphic language • Formatted according to HTML templates
The solution • Wrapping
Wrapping in a nutshell • Goals • Endow data islands with APIs • Ease implementing software applications • Implications • Form filling • Navigation • Info extraction • “Ontologisation”
Look out! • Information extraction has driven most research efforts • Few wrapping systems are complete • Wrapping is usually mistaken for information extraction • This talk is about engineering information extraction for enabling information integration
How IE works • Document → information extractor (extraction rules) → attributes → templating/ontologisation rules → ontology instances or templates • Attributes: The Da Vinci Code; Dan Brown; Doubleday; 2006; 15.95 €; Robert Langdon… • Ontology instances: B1, A1, P1 • Template example: Message ID: MUC-0001; Message Template: Court resolution; Date of Event: April 30, 2007; Charge: Terrorist attack; Perpetrator: Salahuddin Amin; Perpetrator: Anthony Garcia; Perpetrator: Waheed Mahmood; Perpetrator: Omar Khyam; …
Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Side by side comparison • Conclusions
Running example • <!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> </ul></body></html> • <!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> </ul></body></html> • <!-- Sample #3 --><html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> <ul> </ul></body></html>
Kinds of extraction rules • Regular expressions • First-order logic rules • Pointers into DOM tree • Context-free grammars • Tag trees
Regular expressions • RoadRunner: $FileName<html><body> <b>Book name:</b> $BookTitle <br/> <b>Reviews:</b> <br/> <ul> (( <li> <b>Reviewer:</b> $ReviewerName <br/> <b>Rating:</b> $Rating <br/> <b>Text:</b> $Text </li> )+)? </ul></body></html> • TSIMMIS: [Root, get("page.html"), "#"] [BookReview, Root, "<body>#</body>"] [BookName, BookReview, "</b>#<br/>"] [Tmp, Root, "<ul>#</ul>"] [Reviews, Tmp, "split(Tmp, '<li>')"] [ReviewerNames, Reviews, "Reviewer:</b>#<br/>"] [Ratings, Reviews, "Rating:</b>#<br/>"] [Text, Reviews, "Text:</b>#<br/>"]
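To make the regular-expression style concrete, here is a minimal Python sketch that extracts the same fields from Sample #1 of the running example. It is not RoadRunner or TSIMMIS syntax: the pattern names `BOOK_RE` and `REVIEW_RE` and the function `extract` are hypothetical, and the page is assumed to follow the template exactly.

```python
import re

# Sample #1 from the running example (comment line omitted).
SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)

# Hypothetical patterns: one for the page-level field, one per review record.
BOOK_RE = re.compile(r"<b>Book name:</b>\s*(?P<title>.*?)\s*<br/>")
REVIEW_RE = re.compile(
    r"<li>\s*<b>Reviewer:</b>\s*(?P<reviewer>.*?)\s*<br/>"
    r"\s*<b>Rating:</b>\s*(?P<rating>.*?)\s*<br/>"
    r"\s*<b>Text:</b>\s*(?P<text>.*?)\s*</li>"
)

def extract(page):
    """Return the book title and a list of review records found in page."""
    title = BOOK_RE.search(page).group("title")
    # finditer yields zero matches for a page without reviews (the optional list).
    reviews = [m.groupdict() for m in REVIEW_RE.finditer(page)]
    return {"title": title, "reviews": reviews}

RESULT = extract(SAMPLE_1)
print(RESULT["title"], len(RESULT["reviews"]))
```

Using `finditer` for the review records plays the role of the `(( … )+)?` construct: a page with an empty list simply yields no records.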
First-order logic rules • SRV: bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>"). reviewerName(X) :- prev(X, "name:</b>"), next(X, "<br/>"), !bookTitle(X). rating(X) :- isNatural(X), length(X, 1), inList(X). text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
Pointers into the DOM tree • WebOQL: select x'.Text, y'.Text, y''''.Text, y'''''''.Text from x, y in browse("page.html") where x.Text = "Book name:" and y.Text = "Reviewer:"
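The same idea of locating a label node and then stepping to a neighbouring node can be sketched in Python with the standard `xml.etree.ElementTree` module. This is only an illustration, not WebOQL semantics: the helper `label_value` is hypothetical, and the sketch assumes the page is well-formed XML, which the running example happens to be.

```python
import xml.etree.ElementTree as ET

# Sample #2 from the running example (comment line omitted).
SAMPLE_2 = (
    "<html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> "
    "<ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> "
    "<b>Text:</b> cough, cough </li> </ul></body></html>"
)

def label_value(parent, label):
    """Find the <b> child whose text is label and return the text that follows it."""
    for b in parent.findall("b"):
        if (b.text or "").strip() == label:
            # In ElementTree, text after a closing tag is stored as the element's tail.
            return (b.tail or "").strip()
    return None

body = ET.fromstring(SAMPLE_2).find("body")
TITLE = label_value(body, "Book name:")
REVIEWS = [
    {
        "reviewer": label_value(li, "Reviewer:"),
        "rating": label_value(li, "Rating:"),
        "text": label_value(li, "Text:"),
    }
    for li in body.iter("li")
]
print(TITLE, REVIEWS)
```

Real web pages are rarely well-formed, so a tolerant HTML parser would be needed in practice; the navigation logic would stay the same.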
Context-free grammars • Minerva: Page ::= $FileName <html><body> Review </body></html> Review ::= <b>Book name:</b> $BookName <br/> <b>Reviews:</b> <br/> <ul> (<li> Reviewer Rating Text </li>)* </ul> Reviewer ::= <b>Reviewer:</b> $Reviewer <br/> Rating ::= <b>Rating:</b> $Rating <br/> Text ::= <b>Text:</b> $Text
Tag trees • DEPTA: li with children b, br, b, br, b
Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Conclusions
Classification • Hand-crafted • Supervised induction • Little-supervised induction • Unsupervised induction
Hand-crafted • Techniques • Natural intelligence • Systems • TSIMMIS • Minerva • WebOQL • W4F • XWrap • Example of a hand-written rule: "The pattern to extract the title is “…”"
Supervised induction • Workflow: raw documents → labelled documents → automated induction • Techniques • Bottom-up ILP • Top-down ILP • Ad-hoc algorithms • Systems • SRV • RAPIER • WIEN • WHISK • NoDoSE • SoftMealy • STALKER • DEByE
Little-supervised induction • Diagram labels: raw document, automated induction, record and attribute labelling • Techniques • String alignment • Tree alignment • Systems • OLERA • Thresher
Unsupervised induction • Workflow: raw documents → automated induction → pattern interpretation • Techniques • String alignment • Tree alignment • Statistical roles • Systems • DeLa • RoadRunner • EXALG • DEPTA • IEPAD
Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • RoadRunner • SRV • Conclusions
Token matching • <!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/>… </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/>… </li> </ul></body></html> • <!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/>… </li> </ul></body></html> • String mismatch (the two comments differ) • Wrapper so far: $1
…and matching… (same two samples) • Tag match • Wrapper so far: $1<html>
…and matching… (same two samples) • Tag match • Wrapper so far: $1<html><body>
…and matching… (same two samples) • Tag match, string match, … • Wrapper so far: $1<html><body> <b>Book name:</b>
…and matching… (same two samples) • String mismatch, tag match • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/>
…and matching… (same two samples) • … • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>
Stop: lists and optionals (same two samples) • Tag mismatch: after the first </li>, Sample #1 continues with <li> whereas Sample #2 continues with </ul> • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>
Stop: lists and optionals (same two samples) • The <li>…</li> block is generalised into a list • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+
…and matching finishes (same two samples) • Final wrapper: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+ </ul></body></html>
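The matching walk-through can be approximated by a short Python sketch. It is a deliberately simplified stand-in for RoadRunner, not its actual algorithm: string mismatches become numbered variables, and a tag mismatch is resolved only when it can be explained as another occurrence of the block that was just matched. The names `tokenize`, `matches` and `align` are hypothetical, optionals are not handled, and the sample comments are omitted, so the variables are numbered from the book title onwards.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> "
    "</ul></body></html>"
)
SAMPLE_2 = (
    "<html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> "
    "</ul></body></html>"
)

def tokenize(page):
    """Split a page into tag tokens and text tokens, dropping whitespace."""
    return [t.strip() for t in TOKEN_RE.findall(page) if t.strip()]

def is_tag(tok):
    return tok.startswith("<")

def matches(block, tokens, start):
    """True if tokens[start:] begins with one occurrence of block ($n matches any text)."""
    if start + len(block) > len(tokens):
        return False
    for pat, tok in zip(block, tokens[start:start + len(block)]):
        if pat.startswith("$"):
            if is_tag(tok):
                return False
        elif pat != tok:
            return False
    return True

def align(a, b):
    """Align two token lists into a wrapper (a list of tokens, variables and list markers)."""
    wrapper, i, j, nvar = [], 0, 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:                                  # tag match or string match
            wrapper.append(a[i])
            i, j = i + 1, j + 1
        elif not is_tag(a[i]) and not is_tag(b[j]):       # string mismatch
            nvar += 1
            wrapper.append(f"${nvar}")
            i, j = i + 1, j + 1
        else:                                             # tag mismatch
            for side in ("a", "b"):
                seq, pos = (a, i) if side == "a" else (b, j)
                if seq[pos] not in wrapper:
                    continue
                # Candidate block: from the last occurrence of the token to the end.
                k = len(wrapper) - 1 - wrapper[::-1].index(seq[pos])
                block = wrapper[k:]
                if matches(block, seq, pos):
                    while matches(block, seq, pos):       # skip every extra occurrence
                        pos += len(block)
                    wrapper[k:] = ["("] + block + [")+"]
                    if side == "a":
                        i = pos
                    else:
                        j = pos
                    break
            else:
                raise ValueError("tag mismatch that this sketch cannot generalise")
    if i < len(a) or j < len(b):
        raise ValueError("unaligned trailing tokens")
    return wrapper

WRAPPER = align(tokenize(SAMPLE_1), tokenize(SAMPLE_2))
print(" ".join(WRAPPER))
```

The sketch keeps the wrapper as a flat token list, which is enough to show where variables and the `( … )+` list appear.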
Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • RoadRunner • SRV • Conclusions
Exercise • Support predicates: next(x,y), previous(x,y) • Try to explain isCorD(X) • Sample strings: abcabdab, bbcaabda
Exercise • Support predicates: next(x,y), previous(x,y) • Now, try to explain isCorDorE(X) • Sample strings: abcabdabee, bbcaabdaee
Define target predicates • Target predicates: title: #PCDATA. reviewer: #PCDATA. rating: #PCDATA. text: #PCDATA.
Instantiate target predicates • <!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> </ul></body></html> • <!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> </ul></body></html> • <!-- Sample #3 --><html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> <ul> </ul></body></html>
Instantiate target predicates • Positive samples: title("Ontologies"). title("SPARQL in action"). title("W4F explained"). reviewer("John Doe"). reviewer("Alan Wohl"). reviewer("Dan Smith"). rating("7"). rating("8"). rating("9"). text("blah, blah"). text("yeah, yeah"). text("cough, cough"). • Negative samples: !title("Book name:"). !reviewer("Book name:"). !rating("Book name:"). !text("Book name:"). !title("Reviews:"). !reviewer("Reviews:"). !rating("Reviews:"). !text("Reviews:"). !title("Reviewer:"). !reviewer("Reviewer:"). !rating("Reviewer:"). !text("Reviewer:"). !title("Rating:"). !reviewer("Rating:"). !rating("Rating:"). …
Define support predicates • Support predicates: prev: #PCDATA, #PCDATA. next: #PCDATA, #PCDATA. length: #PCDATA, #PCDATA. isNatural: #PCDATA.
Instantiate support predicates • On positive samples: prev("Ontologies", "</b>"). next("Ontologies", "<br/>"). length("Ontologies", 10). !isNatural("Ontologies"). prev("SPARQL in action", "</b>"). next("SPARQL in action", "<br/>"). length("SPARQL in action", 16). !isNatural("SPARQL in action"). prev("W4F explained", "</b>"). next("W4F explained", "<br/>"). length("W4F explained", 13). !isNatural("W4F explained"). … • On negative samples: prev("Book name:", "<b>"). next("Book name:", "</b>"). length("Book name:", 10). !isNatural("Book name:"). prev("Reviews:", "<b>"). next("Reviews:", "</b>"). !isNatural("Reviews:"). prev("Reviewer:", "<b>"). next("Reviewer:", "</b>"). !isNatural("Reviewer:"). prev("Rating:", "<b>"). next("Rating:", "</b>"). !isNatural("Rating:"). …
Top-down induction • Each candidate rule is annotated with (positive bindings covered, negative bindings covered) • title(X) :- . (3, 14) • title(X) :- prev(X, X). (0, 0) • title(X) :- prev(X, "<b>"). (0, 5) • title(X) :- !prev(X, X). (3, 14) • title(X) :- !prev(X, "<b>"). (3, 9) • title(X) :- prev(X, Y). (3, 14) • title(X) :- prev(X, "</b>"). (3, 9) • title(X) :- !prev(X, Y). (?, ?) • title(X) :- !prev(X, "</b>"). (0, 5) • title(X) :- next(X, X). (0, 0) • … • title(X) :- !next(X, X). (3, 14) • title(X) :- next(X, Y). (3, 14) • title(X) :- !next(X, Y). (?, ?) • title(X) :- length(X, X). (0, 0) • …
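The coverage pairs for the candidates that compare prev against a constant can be recomputed from the three samples of the running example. The Python sketch below is an illustration under assumptions: the helpers `review`, `page`, `text_facts` and `coverage` are hypothetical, every distinct text token counts as one binding, and the three book titles are the positive bindings of title.

```python
import re

TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def review(name, rating, text):
    """Build one <li> review record following the running example's template."""
    return (f"<li> <b>Reviewer:</b> {name} <br/> <b>Rating:</b> {rating} <br/> "
            f"<b>Text:</b> {text} </li> ")

def page(title, reviews=()):
    """Build a whole page following the running example's template."""
    return (f"<html><body> <b>Book name:</b> {title} <br/> <b>Reviews:</b> <br/> "
            f"<ul> {''.join(reviews)}</ul></body></html>")

SAMPLES = [
    page("Ontologies", [review("John Doe", 7, "blah, blah"),
                        review("Alan Wohl", 8, "yeah, yeah")]),
    page("SPARQL in action", [review("Dan Smith", 9, "cough, cough")]),
    page("W4F explained"),
]
TITLES = {"Ontologies", "SPARQL in action", "W4F explained"}

def text_facts(pages):
    """Map each distinct text token to its (prev, next) neighbouring tokens."""
    facts = {}
    for p in pages:
        toks = [t.strip() for t in TOKEN_RE.findall(p) if t.strip()]
        for k, tok in enumerate(toks):
            if not tok.startswith("<"):
                facts[tok] = (toks[k - 1], toks[k + 1])
    return facts

FACTS = text_facts(SAMPLES)

def coverage(literal):
    """Return (positives, negatives) covered by title(X) :- literal."""
    p = sum(1 for x, (prv, nxt) in FACTS.items() if x in TITLES and literal(x, prv, nxt))
    n = sum(1 for x, (prv, nxt) in FACTS.items() if x not in TITLES and literal(x, prv, nxt))
    return p, n

CANDIDATES = {
    'title(X) :- .': lambda x, prv, nxt: True,
    'title(X) :- prev(X, "<b>").': lambda x, prv, nxt: prv == "<b>",
    'title(X) :- !prev(X, "<b>").': lambda x, prv, nxt: prv != "<b>",
    'title(X) :- prev(X, "</b>").': lambda x, prv, nxt: prv == "</b>",
    'title(X) :- !prev(X, "</b>").': lambda x, prv, nxt: prv != "</b>",
}
for rule, literal in CANDIDATES.items():
    print(rule, coverage(literal))
```

Under these assumptions the 14 negative bindings are the 5 labels plus the 9 reviewer, rating and text values.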
Rule selection • Terms of the selection measure (diagram labels): combined covering, new covering, old covering • p0 = # positive bindings of R • n0 = # negative bindings of R • p1 = # positive bindings of R&A • n1 = # negative bindings of R&A • t = # positive bindings of both R and R&A
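The selection formula itself is not part of this transcript. The three labels and the five counts are consistent with the information gain used by FOIL-style learners, gain = t × (log2(p1 / (p1 + n1)) − log2(p0 / (p0 + n0))), so the Python sketch below assumes that measure; treat it as an assumption rather than as the slide's exact formula. The function name `foil_gain` and the convention of returning negative infinity when no positives are covered are hypothetical.

```python
import math

def foil_gain(p0, n0, p1, n1, t):
    """FOIL-style gain of refining rule R into R&A (assumed measure).

    p0, n0: positive and negative bindings of R (old covering)
    p1, n1: positive and negative bindings of R&A (new covering)
    t:      positive bindings covered by both R and R&A (combined covering)
    """
    if p1 == 0:
        # A refinement that covers no positives is never worth selecting.
        return float("-inf")
    old = math.log2(p0 / (p0 + n0))
    new = math.log2(p1 / (p1 + n1))
    return t * (new - old)

# Refining title(X) :- . (3, 14) with prev(X, "</b>") (3, 9): all 3 positives are kept.
GAIN_CLOSE_B = foil_gain(3, 14, 3, 9, 3)
# Refining with prev(X, "<b>") (0, 5): no positives survive.
GAIN_OPEN_B = foil_gain(3, 14, 0, 5, 0)
print(GAIN_CLOSE_B, GAIN_OPEN_B)
```

With these counts the literal prev(X, "</b>") scores higher than prev(X, "<b>"), which matches the direction the induction takes on the following slides.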
Induction goes on… • title(X) :- . (3, 14) • title(X) :- prev(X, Y). (3, 14) • title(X) :- prev(X, Y), X = Y. (?, ?) • title(X) :- prev(X, Y), X != Y. (?, ?) • title(X) :- prev(X, Y), prev(X, X). (?, ?) • title(X) :- prev(X, Y), !prev(X, X). (?, ?) • title(X) :- prev(X, Y), prev(X, Z). (?, ?) • title(X) :- prev(X, Y), !prev(X, Z). (?, ?) • title(X) :- prev(X, Y), prev(Y, X). (?, ?) • …
…and on… • title(X) :- . (3, 14) • title(X) :- prev(X, Y). (3, 14) • title(X) :- prev(X, Y), Y = "</b>". (?, ?) • title(X) :- prev(X, Y), Y = "</b>", prev(X, X). (?, ?) • title(X) :- prev(X, Y), Y = "</b>", !prev(X, X). (?, ?) • title(X) :- prev(X, Y), Y = "</b>", prev(Y, Y). (?, ?) • title(X) :- prev(X, Y), Y = "</b>", !prev(Y, Y). (?, ?) • title(X) :- prev(X, Y), Y = "</b>", prev(X, Z). (?, ?) • title(X) :- prev(X, Y), Y = "</b>", !prev(X, Z). (?, ?) • …