150 likes | 286 Views
The Semantic Web-Week 22 Information Extraction and Integration (continued). Module Website: http://scom.hud.ac.uk/scomtlm/chs2533 Practical this week: http://www.isi.edu/info-agents/. Recap.
E N D
The Semantic Web-Week 22 Information Extraction and Integration (continued) Module Website: http://scom.hud.ac.uk/scomtlm/chs2533 Practical this week: http://www.isi.edu/info-agents/
Recap • Information extraction is the process of extracting “meaningful” data from raw or semi-structured text • Information Agents are capable of retrieving info from some web sites via database-like queries (such as required in the example above) and integrating info from web sites to solve complex queries • Use ‘similarity-based’ machine learning techniques to learn/extract meaning from traditional web page content
Induction algorithm GENERALISATION SPACE INSTANCE SPACE SEED
Induction algorithm - continued GENERALISATION SPACE INSTANCE SPACE SEED
Induction algorithm - continued GENERALISATION SPACE Hypothesis = V INSTANCE SPACE NEW SEED
Induction algorithm – abstract example GENERALISATION SPACE SEED INSTANCE SPACE a&f a&b&f a&b&c e&c&a
Induction algorithm – abstract example GENERALISATION SPACE VERSION SPACE – all expressions that are cover all +exs and no –exs IS EMPTY a&b b&c a&c INSTANCE SPACE a&f a&b&f a&b&c e&c&a
Induction algorithm – abstract example GENERALISATION SPACE VERSION SPACE – all expressions that are cover all +exs and no –exs IS EMPTY a a&b b&c a&c INSTANCE SPACE a&f a&b&f a&b&c e&c&a
Induction algorithm – disjunction + ¬ GENERALISATION SPACE ¬a V ¬f b V c VERSION SPACE – all expressions that are cover all +exs and no -exs a&¬f a&b&f V a&b&c V e&c&a INSTANCE SPACE a&f a&b&f a&b&c e&c&a
Generalisation hierarchies a c b b&c a&b a&c a&b&c ExEy on(x,y) Ex red(x) Onto(a,b)&red(a)
Back to Example: ISI’s project “Wrappers” are rules (actually they are like finite state machines!) for extracting information from Web Pages. See “Hierarchical Wrapper Induction for Semi-structured Information Sources” Ion Muslea, Steven Minton, Craig A. Knoblock, Kluwer, 1999. At the heart of ISI’s Heracles system is the Stalker inductive algorithm that generates certain types wrappers - rules that identify the start and end of an item within a web page.
Example of training examples Stalker is given examples of ‘items’ it had to learn the wrapper for – eg examples of the item (or concept) “area code” of a tel no, E1: 513 Pico, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515 E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570 E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293 E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008 Imagine you had to write an FSM to extract this data – this is the kind of thing that the Learning Algorithm has to learn.
Brief example of Stalker execution.. E1: 513 Pico, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515 E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570 E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293 E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008 • SEED = E2 R1 = SkipTo((), R2 = SkipTo(Punctuation ), R3 = SkipTo(AnyToken ) • Choose R1 - covers E4 and E2 and no –ve exs • NEW SEED for E1 /E3 = E1 R4 = SkipTo(<b>) R5 = SkipTo(HtmlTag ) R6 = SkipTo(AnyToken) • All cover E1/E3 but also covers –ve exs • Specialise R4 = SkipTo(<b>) : R7 = SkipTo( - <b>) R8 = SkipTo( Punctuation <b>) R9 = SkipTo( AnyToken <b>)
Other Refinements: E1: 513 Pico, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515 R10: SkipTo(Venice) SkipTo(<b>) R17: SkipTo(Numeric) SkipTo(<b>) R11: SkipTo(</b>) SkipTo(<b>) R18: SkipTo(Punctuation)SkipTo(<b>) R12: SkipTo(:) SkipTo(<b>) R19: SkipTo(HtmlTag) SkipTo(<b>) R13: SkipTo(-) SkipTo(<b>) R20: SkipTo(AlphaNum) SkipTo(<b>) R14: SkipTo(,) SkipTo(<b>) R21: SkipTo(Alphabetic) SkipTo(<b>) R15: SkipTo(Phone) SkipTo(<b>) R22: SkipTo(Capitalized) SkipTo(<b>) R16: SkipTo(1) SkipTo(<b>) R23: SkipTo(NonHtml) SkipTo(<b>) R24: SkipTo(Anything) SkipTo(<b>) R7, R11, R12, R13, R15, R16, and R19 all match correctly on E1 and E3, and fail to match on E2 and E4; R7 represents the best solution according to the algorithms heuristics - Consequently stalker completes its execution by returning the disjunctive rule either R1 or R7.
Summary Stalker is an example of an inductive learning algorithm which is given -- examples of fields in web pages and learns -- the begin/end patterns of fields so that it can be used to ‘mine’ data in unseen web pages Many other examples exist of the use of “wrapper induction” in order to automatically extract information from web pages