
Machine-learning based Semi-structured IE

This presentation covers wrapper induction, a machine-learning approach to extracting desired information from web pages. It discusses how to build an extractor quickly, how semi-structured IE developed independently of traditional IE, and the need to extract and integrate data from multiple web-based sources. It also surveys related systems: ShopBot, Ariadne, WIEN, SoftMealy, and STALKER.


Presentation Transcript


  1. Machine-learning based Semi-structured IE Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw

  2. Wrapper Induction • Wrapper • A program that extracts desired information from Web pages. Semi-structured doc. → wrapper → structured info. • Web wrappers wrap... • "Query-able" or "Search-able" Web sites • Web pages with large itemized lists • The primary issue is: • How to build the extractor quickly?

  3. Semi-structured IE • Developed independently of traditional IE • Driven by the need to extract and integrate data from multiple Web-based sources

  4. Machine-Learning Based Approach • A key component of IE systems is • a set of extraction patterns • that can be generated by machine learning algorithms.

  5. Related Work • Shopbot • Doorenbos, Etzioni, Weld, AA-97 • Ariadne • Ashish, Knoblock, Coopis-97 • WIEN • Kushmerick, Weld, Doorenbos, IJCAI-97 • SoftMealy wrapper representation • Hsu, IJCAI-99 • STALKER • Muslea, Minton, Knoblock, AA-99 • A hierarchical FST

  6. WIEN N. Kushmerick, D. S. Weld, R. Doorenbos, University of Washington, 1997 http://www.cs.ucd.ie/staff/nick/

  7. Example 1

  8. Extractor for Example 1

  9. HLRT

  10. Wrapper Induction • Induction: • The task of generalizing from labeled examples to a hypothesis • Instances: pages • Labels: {(Congo, 242), (Egypt, 20), (Belize, 501), (Spain, 34)} • Hypotheses: • E.g. (<p>, <HR>, <B>, </B>, <I>, </I>)
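A minimal sketch of how an HLRT hypothesis such as the one above could be executed, assuming the six strings are read as (head, tail, left1, right1, left2, right2). The function name `exec_hlrt` and the sample page are hypothetical; the page is modeled on the country/code labels from the slide and is not taken from the original example figure.

```python
def exec_hlrt(page, head, tail, delims):
    """Extract tuples from `page` with an HLRT-style wrapper.

    head/tail bound the region that holds the tuples; delims is a list of
    (left, right) delimiter pairs, one pair per attribute.
    """
    pos = page.find(head)
    if pos < 0:
        return []
    pos += len(head)  # skip everything up to and including the head
    tuples = []
    first_left = delims[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further tuple starts before the tail delimiter.
        if next_left < 0 or 0 <= next_tail < next_left:
            break
        row = []
        for left, right in delims:
            start = page.find(left, pos)
            if start < 0:
                return tuples
            start += len(left)
            end = page.find(right, start)
            if end < 0:
                return tuples
            row.append(page[start:end])
            pos = end + len(right)
        tuples.append(tuple(row))
    return tuples


# Hypothetical page, written to match the labels on the slide.
PAGE = (
    "<HTML><TITLE>Some Country Codes</TITLE><BODY>"
    "<B>Some Country Codes</B><P>\n"
    "<B>Congo</B> <I>242</I><BR>\n"
    "<B>Egypt</B> <I>20</I><BR>\n"
    "<B>Belize</B> <I>501</I><BR>\n"
    "<B>Spain</B> <I>34</I><BR>\n"
    "<HR><B>End</B></BODY></HTML>"
)

result = exec_hlrt(PAGE, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```

In this sketch the head keeps the bold page title from being read as a country, and the tail keeps the bold "End" from starting a fifth tuple, which is the reason HLRT adds Head and Tail to the plain Left-Right delimiters.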

  11. BuildHLRT

  12. Other Family • OCLR (Open-Close-Left-Right) • Use Open and Close as delimiters for each tuple • HOCLRT • Combine OCLR with Head and Tail • N-LR and N-HLRT • Nested LR • Nested HLRT

  13. Terminology • Oracles • Page Oracle • Label Oracle • PAC analysis • determines how many examples are necessary to build a wrapper that satisfies two parameters, accuracy ε and confidence δ: • Pr[E(w) < ε] > 1 − δ, or equivalently Pr[E(w) > ε] < δ, where E(w) is the error of wrapper w

  14. Probably Approximately Correct (PAC) Analysis • With ε = 0.1, δ = 0.1, K = 4, and an average of 5 tuples/page, BuildHLRT must examine at least 72 examples

  15. Empirical Evaluation • Successfully extracts data from 48% of the web pages. • Weaknesses: • Missing attributes, attributes not in order, tabular data, etc.

  16. Softmealy Chun-Nan Hsu, Ming-Tzung Dung, 1998 Arizona State University http://kaukoai.iis.sinica.edu.tw/~chunnan/mypublications.html

  17. SoftMealy Architecture Finite-State Transducers for Semi-Structured Text Mining • Labeling: use an interface to label examples manually. • Learner: FST (Finite-State Transducer) • Extractor: • Demonstration • http://kaukoai.iis.sinica.edu.tw/video.html

  18. SoftMealy Wrapper • SoftMealy wrapper representation • Uses a finite-state transducer in which each distinct attribute permutation can be encoded as a successful path • Replaces delimiters with contextual rules that describe the context delimiting two adjacent attributes

  19. Example

  20. Label the Answer Key (4 cases)

  21. Finite State Transducer • Additionally handles two cases: (N, M) and (N, A, M) • [Figure: FST diagram with states b, U, -U, N, -N, A, -A, M, and e; transitions are labeled skip or extract]

  22. Find the starting position -- Single Pass (newly added definition)

  23. Contextual based Rule Learning • Tokens • Separators • SL ::= … Punc(,) Spc(1) Html(<I>) • SR ::= C1Alph(Professor) Spc(1) OAlph(of) … • Rule generalization • Taxonomy Tree

  24. Tokens • All-uppercase string: CAlph • An uppercase letter followed by at least one lowercase letter: C1Alph • A lowercase letter followed by zero or more characters: OAlph • HTML tag: Html • Punctuation symbol: Punc • Control characters: NL(1), Tab(4), Spc(3)
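A rough sketch of a tokenizer for the token classes listed above, assuming the class names follow the spelling used in the separator rules (C1Alph, OAlph, Html, Punc, Spc). The regular expressions and the catch-all `Other` class are assumptions made for illustration; they are not taken from the SoftMealy implementation.

```python
import re

# One named alternative per token class; order matters (Html first,
# C1Alph before CAlph so "Professor" is not split after its capital).
TOKEN_RE = re.compile(
    r"(?P<Html><[^<>]+>)"
    r"|(?P<C1Alph>[A-Z][a-z]+)"
    r"|(?P<CAlph>[A-Z]+)"
    r"|(?P<OAlph>[a-z][A-Za-z]*)"
    r"|(?P<NL>\n+)"
    r"|(?P<Tab>\t+)"
    r"|(?P<Spc> +)"
    r"|(?P<Punc>[^\w\s])"
    r"|(?P<Other>[\s\S])"  # assumption: anything the slide does not classify
)

# Control-character tokens carry a repeat count instead of their text.
COUNTED = {"NL", "Tab", "Spc"}


def tokenize(text):
    """Return tokens in the Class(argument) notation used on the slides."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind, value = match.lastgroup, match.group()
        arg = len(value) if kind in COUNTED else value
        tokens.append(f"{kind}({arg})")
    return tokens


tokens = tokenize(", <I>Professor of")
```

Under these assumptions the sample string yields the same token sequence that appears in the SL and SR separator examples: a punctuation mark, one space, an HTML tag, a capitalized word, one space, and a lowercase word.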

  25. Rule Generalization

  26. Learning Algorithm • Generalize each column by replacing each token with the least common ancestor of the tokens in that column
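The column-wise generalization step can be sketched as follows. The taxonomy figure is not part of this transcript, so the `PARENT` map below is a hypothetical tree (token classes grouped under `Alph` and `Ctrl`, with `Any` as the root), and the `"_"` wildcard for differing values is likewise an assumption.

```python
# Hypothetical taxonomy: child class -> parent class.
PARENT = {
    "CAlph": "Alph", "C1Alph": "Alph", "OAlph": "Alph",
    "NL": "Ctrl", "Tab": "Ctrl", "Spc": "Ctrl",
    "Alph": "Any", "Ctrl": "Any", "Html": "Any", "Punc": "Any",
}


def ancestors(cls):
    """Return cls followed by its ancestors, nearest first."""
    chain = [cls]
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def least_common_ancestor(classes):
    common = ancestors(classes[0])
    for cls in classes[1:]:
        other = set(ancestors(cls))
        common = [a for a in common if a in other]
    return common[0]


def generalize_column(column):
    """Generalize one column of (class, value) tokens to a single token."""
    cls = least_common_ancestor([c for c, _ in column])
    values = {v for _, v in column}
    same_class = all(c == cls for c, _ in column)
    # Keep the concrete value only if every example agrees on it.
    value = values.pop() if same_class and len(values) == 1 else "_"
    return (cls, value)


def generalize(rows):
    """Align equally long token rows and generalize them column by column."""
    return [generalize_column(col) for col in zip(*rows)]


rule = generalize([
    [("C1Alph", "Professor"), ("Spc", "1"), ("OAlph", "of")],
    [("C1Alph", "Lecturer"), ("Spc", "1"), ("CAlph", "IN")],
])
```

Here the first column keeps its class but loses its value, the second column is unchanged because both examples agree, and the third column climbs one level in the assumed tree to `Alph`.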

  27. Taxonomy Tree

  28. Generating to Extract the Body • The contextual rules for the head and tail separators are: • hL ::= C1Alph(Staff) Html(</H2>) NL(1) Html(<HR>) NL(1) Html(<UL>) • tR ::= Html(</UL>) NL(1) Html(<HR>) NL(1) Html(<ADDRESS>) NL(1) Html(<I>) C1Alph(Please)

  29. More Expressive Power • Softmealy allows • Disjunction • Multiple attribute orders within tuples • Missing attributes • Features of candidate strings

  30. Stalker I. Muslea, S. Minton, C. Knoblock, University of Southern California http://www.isi.edu/~muslea/

  31. STALKER • Embedded Catalog Tree • Leaves (primitive items): the data to be extracted. • Internal nodes (items): • Homogeneous list, or • Heterogeneous tuple.

  32. EC Tree of a page

  33. Extracting Data from a Document • For each node in the EC tree, the wrapper needs a rule that extracts that particular node from its parent. • Additionally, for each list node, the wrapper requires a list iteration rule that decomposes the list into individual tuples. • Advantages: • First, the hierarchical extraction based on the EC tree allows us to wrap information sources that have arbitrarily many levels of embedded data. • Second, as each node is extracted independently of its siblings, the approach does not rely on a fixed ordering of the items, and it can easily handle documents that have missing items or items that appear in various orders.

  34. Extraction Rules as Finite Automata • Landmarks • A sequence of tokens and wildcards • Landmark automata • A non-deterministic finite automaton

  35. Landmark Automata • A linear LA has one accepting state • from each non-accepting state, there are exactly two possible transitions: a loop to itself, and a transition to the next state; • each non-looping transition is labeled by a landmark; • all looping transitions have the meaning "consume all tokens until you encounter the landmark that leads to the next state".
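A small sketch of how a linear landmark automaton could be run over a token list, following the description above: each state loops over tokens until the landmark on its outgoing transition matches. The function names, the wildcard definitions, and the sample tokens are all assumptions for illustration, not STALKER's actual code.

```python
import re

# Assumed wildcard semantics; plain strings must match a token exactly.
WILDCARDS = {
    "_HtmlTag_": lambda t: re.fullmatch(r"<[^<>]+>", t) is not None,
    "_Word_": lambda t: t.isalpha(),
    "_Symbol_": lambda t: len(t) == 1 and not t.isalnum() and not t.isspace(),
}


def matches(pattern, token):
    test = WILDCARDS.get(pattern)
    return test(token) if test else pattern == token


def landmark_at(tokens, i, landmark):
    """True if the landmark (a sequence of tokens/wildcards) starts at i."""
    if i + len(landmark) > len(tokens):
        return False
    return all(matches(p, tokens[i + k]) for k, p in enumerate(landmark))


def run_linear_la(tokens, landmarks):
    """Return the index just past the last landmark, or None on failure."""
    pos = 0
    for landmark in landmarks:
        # Looping transition: consume tokens until the landmark is found.
        while pos < len(tokens) and not landmark_at(tokens, pos, landmark):
            pos += 1
        if pos >= len(tokens):
            return None  # never reached the accepting state
        pos += len(landmark)  # non-looping transition to the next state
    return pos


# Hypothetical token sequence.
TOKENS = ["Name", ":", "Joe", "<br>", "Phone", ":", "<b>", "(", "310", ")"]
start = run_linear_la(TOKENS, [["Phone"], ["_HtmlTag_"]])
```

With these sample tokens the automaton skips to "Phone", then to the next HTML tag, and stops at the position where the item to extract would begin.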

  36. Rule Generation • Task: extract the Credit info. • 1st iteration: • Terminals: {; reservation _Symbol_ _Word_} • Candidates: {; <i> _Symbol_ _HtmlTag_} • Perfect disjunct: {<i> _HtmlTag_} (positive examples: D3, D4) • 2nd iteration: • Uncovered examples: {D1, D2} • Candidates: {; _Symbol_}

  37. Possible Rules

  38. The STALKER Algorithm

  39. Features • Process is performed in a hierarchical manner. • Attributes appearing out of order are not a problem. • Disjunctive rules can handle missing attributes.

  40. Multi-pass Softmealy Chun-Nan Hsu and Chian-Chi Chang Institute of Information Science Academia Sinica Taipei, Taiwan

  41. Multi-pass

  42. Tabular style document (Quote Server)

  43. Tagged-list style document (Internet Address Finder)

  44. Layout styles and learnability • Tabular style • missing attributes, ordering as hints • Tagged-list style • variant ordering, tags as hints • Prediction • single-pass for tabular style • multi-pass for tagged-list style

  45. Tabular result (Quote Server)

  46. Tagged-list result (Internet Address Finder)
