Integration of Friendly Data Islands on the Web. Information Extraction.


Presentation Transcript


  1. Integration of Friendly Data Islands on the Web. Information Extraction.

  2. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Conclusions

  3. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Conclusions

  4. The theory • A wrapper is a building block that provides an ad-hoc, message-based API to an application • Wrappers interface with applications at one or more layers but, more often than not, they must deal with the user interface or the data layer • [Diagram labels: User Interface; Controller; Business Logic; Data Access Layer; Data Layer]

  5. The problem • [Example book page: The Da Vinci Code; Dan Brown; Doubleday, 2006; 15.95 €; Robert Langdon is a Harvard Professor of Symbology…; Buy]

  6. Features of current web documents • Trillions of documents • Generated on demand by software applications • Change continuously • Require navigation from search forms • Written in telegraphic language • Formatted according to HTML templates

  7. The solution Wrapping

  8. Wrapping in a nutshell • Goals • Endow data islands with APIs • Ease implementing software applications • Implications • Form filling • Navigation • Info extraction • “Ontologisation”

  9. Look out! • Information extraction has driven most research efforts • Few wrapping systems are complete • Wrapping is usually mistaken for information extraction • This talk is about engineering information extraction for enabling information integration

  10. How IE works • An information extractor applies extraction rules to a document to obtain attributes • Templating/ontologisation rules turn the attributes into ontology instances or templates
      Attributes: The Da Vinci Code; Dan Brown; Doubleday; 2006; 15.95 €; Robert Langdon…
      Ontology instances: B1 (The Da Vinci Code); A1 (Dan Brown); P1 (Doubleday)
      Template: Message ID: MUC-0001 • Message Template: Court resolution • Date of Event: April, 30 2007 • Charge: Terrorist attack • Perpetrator: Salahuddin Amin • Perpetrator: Anthony Garcia • Perpetrator: Waheed Mahmood • Perpetrator: Omar Khyam • …

  11. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Side by side comparison • Conclusions

  12. Running example

  13. Running example
      <!-- Sample #1 -->
      <html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> </ul></body></html>
      <!-- Sample #2 -->
      <html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> </ul></body></html>
      <!-- Sample #3 -->
      <html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> <ul> </ul></body></html>

  14. Kinds of extraction rules • Regular expressions • First-order logic rules • Pointers into DOM tree • Context-free grammars • Tag trees

  15. Regular expressions
      RoadRunner:
        $FileName<html><body> <b>Book name:</b> $BookTitle <br/> <b>Reviews:</b> <br/> <ul> (( <li> <b>Reviewer:</b> $ReviewerName <br/> <b>Rating:</b> $Rating <br/> <b>Text:</b> $Text </li> )+)? </ul></body></html>
      TSIMMIS:
        [Root, get("page.html"), "#"]
        [BookReview, Root, "<body>#</body>"]
        [BookName, BookReview, "</b>#<br/>"]
        [Tmp, Root, "<ul>#</ul>"]
        [Reviews, Tmp, "split(Tmp, '<li>')"]
        [ReviewerNames, Reviews, "Reviewer:</b>#<br/>"]
        [Ratings, Reviews, "Rating:</b>#<br/>"]
        [Text, Reviews, "Text:</b>#<br/>"]
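As a rough illustration of extraction by regular expressions, the sketch below applies two patterns to Sample #1 with Python's `re` module. It is not RoadRunner or TSIMMIS syntax; the names `TITLE_RE`, `REVIEW_RE` and `extract` are made up for this example, and repeated application of `REVIEW_RE` stands in for the `( … )+` iterator.

```python
import re

SAMPLE = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)

# Book title: the text between the "Book name:" label and the next <br/>.
TITLE_RE = re.compile(r"<b>Book name:</b>\s*(?P<title>.*?)\s*<br/>")

# One review record; finditer() applies it once per <li> block.
REVIEW_RE = re.compile(
    r"<li>\s*<b>Reviewer:</b>\s*(?P<reviewer>.*?)\s*<br/>"
    r"\s*<b>Rating:</b>\s*(?P<rating>.*?)\s*<br/>"
    r"\s*<b>Text:</b>\s*(?P<text>.*?)\s*</li>"
)


def extract(page: str) -> dict:
    """Return the book title and the list of review records found in `page`."""
    title = TITLE_RE.search(page)
    reviews = [m.groupdict() for m in REVIEW_RE.finditer(page)]
    return {"title": title.group("title") if title else None, "reviews": reviews}


result = extract(SAMPLE)
```

A page with an empty `<ul>`, such as Sample #3, simply yields an empty review list.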

  16. First-order logic rules
      SRV:
        bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>").
        reviewerName(X) :- prev(X, "name:</b>"), next(X, "<br/>"), !bookTitle(X).
        rating(X) :- isNatural(X), length(X, 1), inList(X).
        text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
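To make the rules more concrete, here is a small sketch that evaluates three of them as Python predicates over the tokens of Sample #2. It is not SRV itself. It assumes that `prev` and `next` test the text immediately to the left and right of a candidate, uses the label "Book name:" as it appears in the running example, and omits `inList` and the `reviewerName` rule; all function names are hypothetical.

```python
import re

PAGE = (
    "<html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> "
    "<ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> "
    "<b>Text:</b> cough, cough </li> </ul></body></html>"
)


def tokenise(page):
    """Split a page into tags and text chunks, dropping whitespace-only chunks."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def prev(tokens, i, s):
    """Assumed reading of prev(X, s): the left context of token i ends with s."""
    return "".join(tokens[:i]).endswith(s)


def next_(tokens, i, s):
    """Assumed reading of next(X, s): the right context of token i starts with s."""
    return "".join(tokens[i + 1:]).startswith(s)


def is_natural(x):
    return x.isdigit()


# Each rule body is a conjunction of support predicates, as on the slide.
RULES = {
    "bookTitle": lambda t, i: prev(t, i, "Book name:</b>") and next_(t, i, "<br/>"),
    "rating": lambda t, i: is_natural(t[i]) and len(t[i]) == 1,
    "text": lambda t, i: prev(t, i, "Text:</b>") and next_(t, i, "</li>"),
}


def apply_rules(page):
    """Try every rule on every text token and collect the tokens it accepts."""
    tokens = tokenise(page)
    found = {name: [] for name in RULES}
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            continue  # only text tokens are extraction candidates
        for name, rule in RULES.items():
            if rule(tokens, i):
                found[name].append(tok)
    return found


extracted = apply_rules(PAGE)
```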

  17. Pointers into the DOM tree
      WebOQL:
        select x'.Text, y'.Text, y''''.Text, y'''''''.Text
        from x, y in browse("page.html")
        where x.Text = "Book name:" and y.Text = "Reviewer:"
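The same idea, locating label nodes and then stepping to their neighbours, can be sketched with Python's standard `xml.etree.ElementTree`. This is not WebOQL; it works here only because the running example happens to be well-formed XML, which real web pages often are not. The helper `label_value` is a name invented for this example.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)


def label_value(parent, label):
    """Return the text that follows the <b> child of `parent` carrying `label`."""
    for b in parent.findall("b"):
        if (b.text or "").strip() == label:
            return (b.tail or "").strip()  # the value sits right after </b>
    return None


root = ET.fromstring(SAMPLE)
body = root.find("body")
title = label_value(body, "Book name:")
reviews = [
    {
        "reviewer": label_value(li, "Reviewer:"),
        "rating": label_value(li, "Rating:"),
        "text": label_value(li, "Text:"),
    }
    for li in body.findall("./ul/li")
]
```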

  18. Context-free grammars
      Minerva:
        Page ::= $FileName <html><body> Review </body></html>
        Review ::= <b>Book name:</b> $BookName <br/> <b>Reviews:</b> <br/> <ul> (<li> Reviewer Rating Text </li>)* </ul>
        Reviewer ::= <b>Reviewer:</b> $Reviewer <br/>
        Rating ::= <b>Rating:</b> $Rating <br/>
        Text ::= <b>Text:</b> $Text

  19. Tag trees • DEPTA • [Tag tree diagram: li with children b, br, b, br, b]

  20. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Conclusions

  21. Classification • Hand-crafted • Supervised induction • Little-supervised induction • Unsupervised induction

  22. Hand-crafted • Techniques • Natural intelligence • Systems • TSIMMIS • Minerva • WebOQL • W4F • XWrap • [Callout: The pattern to extract the title is “…”]

  23. Supervised induction • Techniques • Bottom-up ILP • Top-down ILP • Ad-hoc algorithms • Systems • SRV • RAPIER • WIEN • WHISK • NoDoSE • SoftMealy • STALKER • DEByE • [Diagram labels: Raw documents; Labelled documents; Automated induction]

  24. Little-supervised induction • Techniques • String alignment • Tree alignment • Systems • OLERA • Thresher • [Diagram labels: Raw document; Automated induction; Record and attribute labelling]

  25. Unsupervised induction • Techniques • String alignment • Tree alignment • Statistical roles • Systems • DeLa • RoadRunner • EXALG • DEPTA • IEPAD • [Diagram labels: Raw documents; Automated induction; Pattern interpretation]

  26. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • Conclusions

  27. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • RoadRunner • SRV • Conclusions

  28. Token matching • String mismatch • Wrapper so far: $1
      <!-- Sample #1 -->
      <html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/>… </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/>… </li> </ul></body></html>
      <!-- Sample #2 -->
      <html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/>… </li> </ul></body></html>
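The token-by-token comparison on this slide can be sketched as follows. This is a simplified illustration rather than RoadRunner's implementation: the tokeniser, the function names and the decision to treat HTML comments as strings (so that the leading comment produces the string mismatch behind $1) are all assumptions made for the example.

```python
import re

SAMPLE_1 = (
    "<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> "
    "<b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)
SAMPLE_2 = (
    "<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> "
    "<b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> "
    "<b>Text:</b> cough, cough </li> </ul></body></html>"
)


def tokenise(page):
    """Split a page into tags and text chunks, dropping whitespace-only chunks."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def is_tag(tok):
    # Assumption: comments count as strings, not tags.
    return tok.startswith("<") and not tok.startswith("<!--")


def classify(a, b):
    """Name the relation between two aligned tokens."""
    if a == b:
        return "tag match" if is_tag(a) else "string match"
    if is_tag(a) or is_tag(b):
        return "tag mismatch"
    return "string mismatch"


def trace(page_a, page_b):
    """Classify aligned token pairs, stopping at the first tag mismatch."""
    events = []
    for a, b in zip(tokenise(page_a), tokenise(page_b)):
        kind = classify(a, b)
        events.append((kind, a, b))
        if kind == "tag mismatch":
            break
    return events


events = trace(SAMPLE_1, SAMPLE_2)
```

On these two samples every string mismatch corresponds to one field of the wrapper, and the trace ends where Sample #1 opens a second `<li>` while Sample #2 closes its `<ul>`.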

  29. ...and matching… (samples #1 and #2 as on slide 28) • Tag match • Wrapper so far: $1<html>

  30. ...and matching… (samples #1 and #2 as on slide 28) • Tag match • Wrapper so far: $1<html><body>

  31. ...and matching… (samples #1 and #2 as on slide 28) • Tag match, string match, … • Wrapper so far: $1<html><body> <b>Book name:</b>

  32. ...and matching… (samples #1 and #2 as on slide 28) • String mismatch, tag match • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/>

  33. ...and matching… (samples #1 and #2 as on slide 28) • … • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

  34. Stop: lists and optionals (samples #1 and #2 as on slide 28) • Tag mismatch • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

  35. Stop: lists and optionals (samples #1 and #2 as on slide 28) • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

  36. Stop: lists and optionals (samples #1 and #2 as on slide 28) • Wrapper so far: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+

  37. …and matching finishes (samples #1 and #2 as on slide 28) • Final wrapper: $1<html><body> <b>Book name:</b> $2 <br/> <b>Reviews:</b> <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+ </ul></body></html>
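The whole matching walk-through can be approximated by the sketch below. It is a heavily simplified illustration of the idea, not the RoadRunner algorithm: it handles string mismatches (fields) and a single kind of tag mismatch (a repeated block ending in the same tag as the wrapper so far), and it does not handle optionals, nesting or backtracking. All function names are hypothetical, and comments are again treated as strings.

```python
import re

SAMPLE_1 = (
    "<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> "
    "<b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> "
    "<b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> "
    "<b>Text:</b> yeah, yeah </li> </ul></body></html>"
)
SAMPLE_2 = (
    "<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> "
    "<b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> "
    "<b>Text:</b> cough, cough </li> </ul></body></html>"
)


def tokenise(page):
    """Split a page into tags and text chunks, dropping whitespace-only chunks."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def is_tag(tok):
    return tok.startswith("<") and not tok.startswith("<!--")


def block_matches(block, tail):
    """True if `block` repeats `tail`: same tags, strings equal or already fields."""
    if len(block) != len(tail):
        return False
    for b, t in zip(block, tail):
        if is_tag(b) or is_tag(t):
            if b != t:
                return False
        elif b != t and not t.startswith("$"):
            return False
    return True


def square_end(tokens, start, out):
    """Index just past a repeat of the block that ends `out`, or None."""
    if not out or not is_tag(out[-1]) or out[-1] not in tokens[start:]:
        return None
    end = tokens.index(out[-1], start) + 1
    block = tokens[start:end]
    return end if block_matches(block, out[-len(block):]) else None


def render(out, groups):
    """Print the wrapper tokens, wrapping each repeated span in ( ... )+."""
    starts = {s for s, _ in groups}
    ends = {e for _, e in groups}
    parts = []
    for k, tok in enumerate(out):
        if k in starts:
            parts.append("(")
        parts.append(tok)
        if k + 1 in ends:
            parts.append(")+")
    return " ".join(parts)


def build_wrapper(page_a, page_b):
    a, b = tokenise(page_a), tokenise(page_b)
    out, groups, fields = [], set(), 0
    i = j = 0
    while i < len(a) and j < len(b):
        x, y = a[i], b[j]
        if x == y:                              # tag or string match
            out.append(x)
        elif not is_tag(x) and not is_tag(y):   # string mismatch: new field
            fields += 1
            out.append(f"${fields}")
        else:                                   # tag mismatch: look for a list
            end_a, end_b = square_end(a, i, out), square_end(b, j, out)
            if end_a is not None:
                groups.add((len(out) - (end_a - i), len(out)))
                i = end_a                       # skip the repeated block in a
            elif end_b is not None:
                groups.add((len(out) - (end_b - j), len(out)))
                j = end_b                       # skip the repeated block in b
            else:
                raise ValueError("tag mismatch not explained by a list")
            continue
        i += 1
        j += 1
    return render(out, groups)


wrapper = build_wrapper(SAMPLE_1, SAMPLE_2)
```

The result contains only concatenation and iteration, with no alternatives. Matching Sample #3 as well would additionally require the optional `( … )?` step, which this sketch leaves out.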

  38. Just union-free grammars!

  39. Roadmap • Introduction • What extraction rules are • Generating extraction rules • A couple of systems • RoadRunner • SRV • Conclusions

  40. Exercise • Support predicates: next(x,y), previous(x,y) • Try to explain isCorD(X) • Strings: abcabdab, bbcaabda

  41. Exercise • Support predicates: next(x,y), previous(x,y) • Now, try to explain isCorDorE(X) • Strings: abcabdabee, bbcaabdaee

  42. Define target predicates Target Predicates title: #PCDATA. reviewer: #PCDATA. rating: #PCDATA. text: #PCDATA.

  43. Instantiate target predicates (samples #1, #2 and #3 as on slide 13)

  44. Instantiate target predicates
      Positive samples:
        title("Ontologies"). title("SPARQL in action"). title("W4F explained").
        reviewer("John Doe"). reviewer("Alan Wohl"). reviewer("Dan Smith").
        rating("7"). rating("8"). rating("9").
        text("blah, blah"). text("yeah, yeah"). text("cough, cough").
      Negative samples:
        !title("Book name:"). !reviewer("Book name:"). !rating("Book name:"). !text("Book name:").
        !title("Reviews:"). !reviewer("Reviews:"). !rating("Reviews:"). !text("Reviews:").
        !title("Reviewer:"). !reviewer("Reviewer:"). !rating("Reviewer:"). !text("Reviewer:").
        !title("Rating:"). !reviewer("Rating:"). !rating("Rating:"). …

  45. Define support predicates Support Predicates prev: #PCDATA, #PCDATA. next: #PCDATA, #PCDATA. length: #PCDATA, #PCDATA. isNatural: #PCDATA.

  46. Instantiate support predicates
      On positive samples:
        prev("Ontologies", "</b>"). next("Ontologies", "<br/>"). length("Ontologies", 10). !isNatural("Ontologies").
        prev("SPARQL in action", "</b>"). next("SPARQL in action", "<br/>"). length("SPARQL in action", 16). !isNatural("SPARQL in action").
        prev("W4F explained", "</b>"). next("W4F explained", "<br/>"). length("W4F explained", 13). !isNatural("W4F explained").
        …
      On negative samples:
        prev("Book name:", "<b>"). next("Book name:", "</b>"). length("Book name:", 10). !isNatural("Book name:").
        prev("Reviews:", "<b>"). next("Reviews:", "</b>"). !isNatural("Reviews:").
        prev("Reviewer:", "<b>"). next("Reviewer:", "</b>"). !isNatural("Reviewer:").
        prev("Rating:", "<b>"). next("Rating:", "</b>"). !isNatural("Rating:").
        …
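Facts like these can be generated mechanically from a page. The sketch below does so for Sample #3, assuming that `prev` and `next` refer to the neighbouring tokens and approximating `isNatural` with `str.isdigit`; `tokenise` and `support_facts` are names chosen for this example only.

```python
import re

SAMPLE_3 = (
    "<html><body> <b>Book name:</b> W4F explained <br/> "
    "<b>Reviews:</b> <br/> <ul> </ul></body></html>"
)


def tokenise(page):
    """Split a page into tags and text chunks, dropping whitespace-only chunks."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def support_facts(page):
    """Instantiate prev, next, length and isNatural for every text token."""
    tokens = tokenise(page)
    facts = []
    for i, tok in enumerate(tokens):
        if tok.startswith("<"):
            continue  # facts are only built for text tokens
        facts.append({
            "x": tok,
            "prev": tokens[i - 1] if i > 0 else None,
            "next": tokens[i + 1] if i + 1 < len(tokens) else None,
            "length": len(tok),
            "isNatural": tok.isdigit(),
        })
    return facts


facts = support_facts(SAMPLE_3)
```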

  47. Top-down induction
        title(X) :- . (3, 14)
        title(X) :- prev(X, X). (0, 0)
        title(X) :- prev(X, "<b>"). (0, 5)
        title(X) :- !prev(X, X). (3, 14)
        title(X) :- !prev(X, "<b>"). (3, 9)
        title(X) :- prev(X, Y). (3, 14)
        title(X) :- prev(X, "</b>"). (3, 9)
        title(X) :- !prev(X, Y). (?, ?)
        title(X) :- !prev(X, "</b>"). (0, 5)
        title(X) :- next(X, X). (0, 0)
        …
        title(X) :- !next(X, X). (3, 14)
        title(X) :- next(X, Y). (3, 14)
        title(X) :- !next(X, Y). (?, ?)
        title(X) :- length(X, X). (0, 0)
        …
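The pairs in parentheses read as (positives covered, negatives covered). The sketch below recomputes a few of them. It assumes that the 14 negatives for `title` are the five labels plus the nine reviewer, rating and text values; the slides do not list them in full, but this reading is consistent with the counts shown. The `next(X, "<br/>")` candidate is added for illustration and does not appear on the slide.

```python
def example(x, prev, nxt):
    """One candidate string together with its neighbouring tokens."""
    return {"x": x, "prev": prev, "next": nxt}


# The three book titles are the positive examples for title(X).
POSITIVES = [
    example(t, "</b>", "<br/>")
    for t in ("Ontologies", "SPARQL in action", "W4F explained")
]

# Assumed negatives: 5 labels + 9 other attribute values = 14.
LABELS = [
    example(label, "<b>", "</b>")
    for label in ("Book name:", "Reviews:", "Reviewer:", "Rating:", "Text:")
]
VALUES = [
    example(v, "</b>", "<br/>")
    for v in ("John Doe", "Alan Wohl", "Dan Smith", "7", "8", "9")
] + [
    example(v, "</b>", "</li>")
    for v in ("blah, blah", "yeah, yeah", "cough, cough")
]
NEGATIVES = LABELS + VALUES

# Candidate literals that could be appended to "title(X) :- ."
CANDIDATES = {
    'prev(X, "<b>")': lambda e: e["prev"] == "<b>",
    '!prev(X, "<b>")': lambda e: e["prev"] != "<b>",
    'prev(X, "</b>")': lambda e: e["prev"] == "</b>",
    '!prev(X, "</b>")': lambda e: e["prev"] != "</b>",
    'next(X, "<br/>")': lambda e: e["next"] == "<br/>",
}


def coverage(literal):
    """Return (positives covered, negatives covered) for one literal."""
    return (
        sum(literal(e) for e in POSITIVES),
        sum(literal(e) for e in NEGATIVES),
    )


table = {name: coverage(lit) for name, lit in CANDIDATES.items()}
```

Under these assumptions, `next(X, "<br/>")` keeps all three titles while excluding more negatives than `prev(X, "</b>")`.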

  48. Rule selection • [Formula with three labelled parts: Combined covering; New covering; Old covering] • p0 = # positive bindings of R • n0 = # negative bindings of R • p1 = # positive bindings of R&A • n1 = # negative bindings of R&A • t = # positive bindings of both R and R&A
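The formula itself was an image and did not survive in the transcript. Assuming the selection criterion is the FOIL-style information gain (an assumption, although the three labels on the slide fit it), and writing n1 for the number of negative bindings of R&A, it would read:

```latex
\mathrm{Gain}(R, A) =
  \underbrace{t}_{\text{combined covering}} \cdot
  \Bigl(
    \underbrace{\log_2 \frac{p_1}{p_1 + n_1}}_{\text{new covering}}
    -
    \underbrace{\log_2 \frac{p_0}{p_0 + n_0}}_{\text{old covering}}
  \Bigr)
```

As a worked example under that assumption, take R = title(X) :- . with (p0, n0) = (3, 14) and A = prev(X, "</b>") with (p1, n1) = (3, 9) and t = 3. The gain is 3 · (log2(3/12) − log2(3/17)) = 3 · log2(17/12) ≈ 1.51.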

  49. Induction goes on…
        title(X) :- . (3, 14)
        title(X) :- prev(X, Y). (3, 14)
        title(X) :- prev(X, Y), X = Y. (?, ?)
        title(X) :- prev(X, Y), X != Y. (?, ?)
        title(X) :- prev(X, Y), prev(X, X). (?, ?)
        title(X) :- prev(X, Y), !prev(X, X). (?, ?)
        title(X) :- prev(X, Y), prev(X, Z). (?, ?)
        title(X) :- prev(X, Y), !prev(X, Z). (?, ?)
        title(X) :- prev(X, Y), prev(Y, X). (?, ?)
        …

  50. …and on…
        title(X) :- . (3, 14)
        title(X) :- prev(X, Y). (3, 14)
        title(X) :- prev(X, Y), Y = "</b>". (?, ?)
        title(X) :- prev(X, Y), Y = "</b>", prev(X, X). (?, ?)
        title(X) :- prev(X, Y), Y = "</b>", !prev(X, X). (?, ?)
        title(X) :- prev(X, Y), Y = "</b>", prev(Y, Y). (?, ?)
        title(X) :- prev(X, Y), Y = "</b>", !prev(Y, Y). (?, ?)
        title(X) :- prev(X, Y), Y = "</b>", prev(X, Z). (?, ?)
        title(X) :- prev(X, Y), Y = "</b>", !prev(X, Z). (?, ?)
        …
