1 / 15

The Semantic Web-Week 22 Information Extraction and Integration (continued)

The Semantic Web-Week 22 Information Extraction and Integration (continued). Module Website: http://scom.hud.ac.uk/scomtlm/chs2533 Practical this week: http://www.isi.edu/info-agents/. Recap.

jamil
Download Presentation

The Semantic Web-Week 22 Information Extraction and Integration (continued)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Semantic Web-Week 22 Information Extraction and Integration (continued) Module Website: http://scom.hud.ac.uk/scomtlm/chs2533 Practical this week: http://www.isi.edu/info-agents/

  2. Recap • Information extraction is the process of extracting “meaningful” data from raw or semi-structured text • Information Agents are capable of retrieving info from some web sites via database-like queries (such as required in the example above) and integrating info from web sites to solve complex queries • Use ‘similarity-based’ machine learning techniques to learn/extract meaning from traditional web page content

  3. Induction algorithm GENERALISATION SPACE INSTANCE SPACE SEED

  4. Induction algorithm - continued GENERALISATION SPACE INSTANCE SPACE SEED

  5. Induction algorithm - continued GENERALISATION SPACE Hypothesis = V INSTANCE SPACE NEW SEED

  6. Induction algorithm – abstract example GENERALISATION SPACE SEED INSTANCE SPACE a&f a&b&f a&b&c e&c&a

  7. Induction algorithm – abstract example GENERALISATION SPACE VERSION SPACE – all expressions that are cover all +exs and no –exs IS EMPTY a&b b&c a&c INSTANCE SPACE a&f a&b&f a&b&c e&c&a

  8. Induction algorithm – abstract example GENERALISATION SPACE VERSION SPACE – all expressions that are cover all +exs and no –exs IS EMPTY a a&b b&c a&c INSTANCE SPACE a&f a&b&f a&b&c e&c&a

  9. Induction algorithm – disjunction + ¬ GENERALISATION SPACE ¬a V ¬f b V c VERSION SPACE – all expressions that are cover all +exs and no -exs a&¬f a&b&f V a&b&c V e&c&a INSTANCE SPACE a&f a&b&f a&b&c e&c&a

  10. Generalisation hierarchies a c b b&c a&b a&c a&b&c ExEy on(x,y) Ex red(x) Onto(a,b)&red(a)

  11. Back to Example: ISI’s project “Wrappers” are rules (actually they are like finite state machines!) for extracting information from Web Pages. See “Hierarchical Wrapper Induction for Semi-structured Information Sources” Ion Muslea, Steven Minton, Craig A. Knoblock, Kluwer, 1999. At the heart of ISI’s Heracles system is the Stalker inductive algorithm that generates certain types wrappers - rules that identify the start and end of an item within a web page.

  12. Example of training examples Stalker is given examples of ‘items’ it had to learn the wrapper for – eg examples of the item (or concept) “area code” of a tel no, E1: 513 Pico, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515 E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570 E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293 E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008 Imagine you had to write an FSM to extract this data – this is the kind of thing that the Learning Algorithm has to learn.

  13. Brief example of Stalker execution.. E1: 513 Pico, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515 E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570 E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293 E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008 • SEED = E2 R1 = SkipTo((), R2 = SkipTo(Punctuation ), R3 = SkipTo(AnyToken ) • Choose R1 - covers E4 and E2 and no –ve exs • NEW SEED for E1 /E3 = E1 R4 = SkipTo(<b>) R5 = SkipTo(HtmlTag ) R6 = SkipTo(AnyToken) • All cover E1/E3 but also covers –ve exs • Specialise R4 = SkipTo(<b>) : R7 = SkipTo( - <b>) R8 = SkipTo( Punctuation <b>) R9 = SkipTo( AnyToken <b>)

  14. Other Refinements: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515 R10: SkipTo(Venice) SkipTo(<b>) R17: SkipTo(Numeric) SkipTo(<b>) R11: SkipTo(</b>) SkipTo(<b>) R18: SkipTo(Punctuation)SkipTo(<b>) R12: SkipTo(:) SkipTo(<b>) R19: SkipTo(HtmlTag) SkipTo(<b>) R13: SkipTo(-) SkipTo(<b>) R20: SkipTo(AlphaNum) SkipTo(<b>) R14: SkipTo(,) SkipTo(<b>) R21: SkipTo(Alphabetic) SkipTo(<b>) R15: SkipTo(Phone) SkipTo(<b>) R22: SkipTo(Capitalized) SkipTo(<b>) R16: SkipTo(1) SkipTo(<b>) R23: SkipTo(NonHtml) SkipTo(<b>) R24: SkipTo(Anything) SkipTo(<b>) R7, R11, R12, R13, R15, R16, and R19 all match correctly on E1 and E3, and fail to match on E2 and E4; R7 represents the best solution according to the algorithms heuristics - Consequently stalker completes its execution by returning the disjunctive rule either R1 or R7.

  15. Summary Stalker is an example of an inductive learning algorithm which is given -- examples of fields in web pages and learns -- the begin/end patterns of fields so that it can be used to ‘mine’ data in unseen web pages Many other examples exist of the use of “wrapper induction” in order to automatically extract information from web pages

More Related