
Plain Text Information Extraction (based on Machine Learning )




  1. Plain Text Information Extraction (based on Machine Learning) Chia-Hui Chang Department of Computer Science & Information Engineering National Central University chia@csie.ncu.edu.tw 9/24/2002

  2. Introduction • Plain Text Information Extraction • The task of locating specific pieces of data from a natural language document • To obtain useful structured information from unstructured text • DARPA’s MUC program • The extraction rules are based on • syntactic analyzer • semantic tagger

  3. Related Work • On-line documents: SRV (AAAI-1998, D. Freitag), Rapier (ACL-1997, AAAI-1999, M. E. Califf), WHISK (ML-1999, S. Soderland) • Free-text documents: PALKA (MUC-5, 1993), AutoSlog (AAAI-1993, E. Riloff), LIEP (IJCAI-1995, Huffman), Crystal (IJCAI-1995, KDD-1997, S. Soderland)

  4. SRV: Information Extraction from HTML: Application of a General Machine Learning Approach Dayne Freitag Dayne@cs.cmu.edu AAAI-98

  5. Introduction • SRV • A general-purpose relational learner • A top-down relational algorithm for IE • Reliance on a set of token-oriented features • Extraction pattern • First-order logic extraction pattern with predicates based on attribute-value tests

  6. Extraction as Text Classification • Extraction as Text Classification • Identify the boundaries of field instances • Treat each fragment as a bag-of-words • Find the relations from the surrounding context
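The classification framing above can be made concrete with a small sketch. The Python below is illustrative only (none of these names come from SRV): it enumerates candidate fragments up to a maximum length and represents each as a bag of words that a trained classifier could score.

    # Sketch: extraction as classification over candidate text fragments.
    # All function names and the max_len cutoff are assumptions for illustration.
    from collections import Counter

    def candidate_fragments(tokens, max_len=5):
        """Enumerate every contiguous fragment of up to max_len tokens."""
        for start in range(len(tokens)):
            for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
                yield start, end, tokens[start:end]

    def bag_of_words(fragment_tokens):
        """Represent a fragment (and, in practice, its context) as token counts."""
        return Counter(tok.lower() for tok in fragment_tokens)

    tokens = "Course : Introduction to Machine Learning , CS 478".split()
    for start, end, frag in candidate_fragments(tokens, max_len=3):
        features = bag_of_words(frag)
        # a trained classifier would score `features` here and keep the
        # highest-scoring fragment(s) as the extracted field instance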

  7. Relational Learning • Inductive Logic Programming (ILP) • Input: class-labeled instances • Output: a classifier for unlabeled instances • Typical covering algorithm • Attribute-value tests are added greedily to a rule • Each added test heuristically maximizes the number of positive examples covered while minimizing the number of negative examples covered
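As a concrete illustration of the covering loop just described, here is a minimal Python sketch (an assumption about the generic algorithm, not SRV's or any ILP system's actual code): each rule is grown greedily by the test that best separates the remaining positives from the negatives, and covered positives are removed before the next rule is learned.

    # Illustrative greedy covering loop for rule learning.
    # `candidate_tests` are boolean functions over examples; the scoring is a simplification.

    def learn_one_rule(positives, negatives, candidate_tests, max_rule_len=5):
        rule, pos, neg = [], list(positives), list(negatives)
        while neg and len(rule) < max_rule_len:
            # pick the test that keeps the most positives while excluding the most negatives
            best = max(candidate_tests,
                       key=lambda t: sum(t(p) for p in pos) - sum(t(n) for n in neg))
            rule.append(best)
            pos = [p for p in pos if best(p)]
            neg = [n for n in neg if best(n)]
        return rule, pos

    def learn_rules(positives, negatives, candidate_tests):
        rules, remaining = [], list(positives)
        while remaining:
            rule, covered = learn_one_rule(remaining, negatives, candidate_tests)
            if not covered:
                break                      # no rule covers any remaining positive
            rules.append(rule)
            remaining = [p for p in remaining if p not in covered]
        return rules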

  8. Simple Features • Features of individual tokens • Length (e.g. single letter or multiple letters) • Character type (e.g. numeric or alphabetic) • Orthography (e.g. capitalized) • Part of speech (e.g. verb) • Lexical meaning (e.g. geographical_place)

  9. Individual Predicates • Individual predicates: • Length (=3): accepts only fragments containing three tokens • Some(?A [] capitalizedp true): the fragment contains some token that is capitalized • Every(numericp false): every token in the fragment is non-numeric • Position(?A fromfirst <2): the token bound to ?A is either first or second in the fragment • Relpos(?A ?B =1): the token bound to ?A immediately precedes the token bound to ?B
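Read as tests over a token fragment, the predicates above are straightforward to implement. The following Python sketch is illustrative only (the feature functions, data layout, and path handling are assumptions, not SRV's code); the path argument shown for the some-predicate anticipates the relational paths discussed on slide 13.

    # Illustrative implementations of SRV-style token features and predicates.

    def capitalizedp(tok): return tok[:1].isupper()
    def numericp(tok):     return tok.isdigit()

    def pred_length(fragment, n):
        """Length(=n): the fragment contains exactly n tokens."""
        return len(fragment) == n

    def pred_every(fragment, feature, value):
        """Every(feature value): every token in the fragment passes the test."""
        return all(feature(tok) == value for tok in fragment)

    def pred_some(fragment_start, fragment_len, tokens, path, feature, value):
        """Some(?A [path] feature value): some fragment token, after following
        the relational path (e.g. [prev_token, prev_token]), passes the test."""
        step = {"prev_token": -1, "next_token": +1}
        for offset in range(fragment_len):
            pos = fragment_start + offset
            for rel in path:
                pos += step[rel]
            if 0 <= pos < len(tokens) and feature(tokens[pos]) == value:
                return True
        return False

    def pred_position(binding_offset, fromfirst_lt):
        """Position(?A fromfirst <k): the bound token is among the first k of the fragment."""
        return binding_offset < fromfirst_lt

    def pred_relpos(offset_a, offset_b, distance=1):
        """Relpos(?A ?B =d): token ?A occurs exactly d positions before ?B."""
        return offset_b - offset_a == distance

    tokens = "Instructor : Jane Doe , Room 123".split()
    start, length = 2, 2                       # fragment "Jane Doe"
    assert pred_length(tokens[start:start + length], 2)
    assert pred_some(start, length, tokens, [], capitalizedp, True)
    assert pred_every(tokens[start:start + length], numericp, False)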

  10. Relational Features • Relational Feature types • Adjacency (next_token) • Linguistic syntax (subject_verb)

  11. Example

  12. Search • Predicates are added greedily, attempting to cover as many positive and as few negative examples as possible • At every step in rule construction, all documents in the training set are scanned and every text fragment of appropriate size is counted • Every legal predicate is assessed in terms of the number of positive and negative examples it covers • A position-predicate is not legal unless a some-predicate is already part of the rule

  13. Relational Paths • Relational features are used only in the path argument to the some-predicate • Some(?A [prev_token prev_token] capitalizedp true): the fragment contains some token preceded by a capitalized token two tokens back

  14. Validation • Training phase • 2/3 of the labeled data: learning • 1/3: validation • Testing • Bayesian m-estimate: all rules matching a given fragment are used to assign a confidence score • Combined confidence across the matching rules
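A commonly used form of the Bayesian m-estimate (given here as a plausible stand-in, since the slide's own formula is not reproduced) scores a rule as

    conf(rule) = (c + m * p) / (n + m)

where c is the number of correct extractions the rule makes on the validation set, n is its total number of matches, p is a prior estimate of accuracy, and m is a smoothing weight. If several rules r1, ..., rk match the same fragment, one standard way to combine their scores (again an assumption rather than the slide's exact expression) is

    conf(fragment) = 1 - (1 - conf(r1)) * (1 - conf(r2)) * ... * (1 - conf(rk))

so that a fragment matched by many moderately confident rules receives a high combined confidence.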

  15. Adapting SRV for HTML

  16. Experiments • Data source: • Four university computer science departments: Cornell, U. of Texas, U. of Washington, U. of Wisconsin • Data set: • Course: title, number, instructor • Project: title, member • 105 course pages • 96 project pages • Two experiments • Random: 5-fold cross-validation • LOUO (leave one university out): 4-fold experiments

  17. OPD Coverage: Each rule has its own confidence

  18. MPD

  19. Baseline Strategies (evaluated under OPD and MPD) • A rote baseline that simply memorizes field instances • A random guesser

  20. Conclusions • Increased modularity and flexibility • Domain-specific information is separate from the underlying learning algorithm • Top-down induction • From general to specific • Accuracy-coverage trade-off • Associate confidence score with predictions • Critique: single-slot extraction rule

  21. RAPIER: Relational Learning of Pattern-Match Rules for Information Extraction M.E. Califf and R.J. Mooney ACL-97, AAAI-1999

  22. Rule Representation • Single-slot extraction patterns • Syntactic information (part-of-speech tagger) • Semantic class information (WordNet)

  23. The Learning Algorithm • A specific-to-general search • The pre-filler pattern contains an item for each word • The filler pattern has one item for each word in the filler • The post-filler pattern has one item for each word • Compress the rules for each slot • Generate the least general generalization (LGG) of each pair of rules • When the LGG of two constraints is a disjunction, two alternatives are created: (1) keep the disjunction, (2) remove the constraint (sketched below)
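The pairwise generalization step can be illustrated with a toy sketch of the least general generalization of two single-word constraints (this is not RAPIER's implementation; the set-based constraint representation is an assumption): identical constraints are kept, and differing ones yield two candidates, the disjunction and the dropped constraint, as the slide states.

    # Toy LGG of two word constraints: a set of allowed words, or None for "unconstrained".

    def lgg_word_constraint(c1, c2):
        """Return the candidate generalizations of two word constraints."""
        if c1 == c2:
            return [c1]                    # already identical: keep as-is
        if c1 is None or c2 is None:
            return [None]                  # one side is already unconstrained
        disjunction = c1 | c2              # e.g. {"atlanta"} and {"kansas"} -> {"atlanta", "kansas"}
        return [disjunction, None]         # keep both alternatives: disjunction or drop the constraint

    print(lgg_word_constraint({"atlanta"}, {"kansas"}))
    # [{'atlanta', 'kansas'}, None]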

  24. Example • Located in Atlanta, Georgia. • Offices in Kansas City, Missouri.

  25. Example: Located in Atlanta, Georgia. Offices in Kansas City, Missouri. • Assume there is a semantic class for states, but not one for cities.

  26. Experimental Evaluation • 300 computer-related job postings • 17 slots, including employer, location, salary, job requirements, languages, and platforms

  27. Experimental Evaluation • 485 seminar announcements • 4 slots: speaker, location, start time, end time

  28. WHISK: Learning Information Extraction Rules for Semi-Structured and Free Text S. Soderland, University of Washington, Machine Learning, 1999

  29. Semi-structured Text

  30. Free Text (example annotated with person name, position, and verb stems)

  31. WHISK Rule Representation • For Semi-structured IE

  32. WHISK Rule Representation • For Free Text IE • The wildcard skips tokens only within the same syntactic field (slots in the example: person name, position, verb stem)
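WHISK patterns read much like restricted regular expressions: the wildcard skips tokens (only within one syntactic field for free text) and parenthesized terms capture slot fillers. The sketch below approximates a WHISK-style rental-ad pattern with a Python regular expression; this is a simplification for illustration, since WHISK actually matches over tokenized and semantically tagged instances rather than raw strings.

    import re

    # Approximation of a WHISK-style pattern: * ( Digit ) 'BR' * '$' ( Number )
    # The regex and the sample ad are assumptions made for this sketch.
    pattern = re.compile(r"(\d+)\s*BR.*?\$\s*(\d[\d,]*)", re.IGNORECASE | re.DOTALL)

    ad = "Capitol Hill - 1 br twnhme. D/W W/D. Pkg incl $675."
    match = pattern.search(ad)
    if match:
        print({"Bedrooms": match.group(1), "Price": match.group(2)})
        # {'Bedrooms': '1', 'Price': '675'}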

  33. Example – Tagged by Users

  34. The WHISK Algorithm

  35. Creating a Rule from a Seed Instance • Top-down rule induction • Start from an empty rule • Add terms within the extraction boundary (Base_1) • Add terms just outside the extraction (Base_2) • Until the seed is covered
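A minimal sketch of this anchoring step (an illustration under assumptions, not the paper's full algorithm, which scores candidate extensions by expected error): start from an empty rule, try terms drawn first from inside the tagged extraction (Base_1) and then from the immediately surrounding context (Base_2), keep each term that does not cause errors, and stop once the seed instance is covered.

    # Illustrative sketch of growing a WHISK rule from one seed instance.
    # `covers_seed` and `makes_errors` stand in for matching the rule against
    # the seed and against the rest of the training instances.

    def grow_rule(seed_tokens, extraction_span, covers_seed, makes_errors, context=2):
        start, end = extraction_span
        base_1 = seed_tokens[start:end]                                   # terms inside the extraction
        base_2 = seed_tokens[max(0, start - context):start] + seed_tokens[end:end + context]
        rule = []                                                          # empty rule
        for term in base_1 + base_2:                                       # Base_1 terms are tried first
            candidate = rule + [term]
            if makes_errors(candidate):
                continue                                                   # this term would over-extract
            rule = candidate
            if covers_seed(rule):
                return rule                                                # seed covered: stop extending
        return rule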

  36. Example


  38. AutoSlog: Automatically Constructing a Dictionary for Information Extraction Tasks Ellen Riloff, Dept. of Computer Science, University of Massachusetts, AAAI-93

  39. AutoSlog • Purpose: • Automatically constructs a domain-specific dictionary for IE • Extraction pattern (concept nodes): • Conceptual anchor: a trigger word • Enabling conditions: constraints

  40. Concept Node Example Physical target slot of a bombing template

  41. Construction of Concept Nodes 1. Given a targeted piece of information. 2. AutoSlog finds the first sentence in the text that contains the string. 3. The sentence is handed over to CIRCUS, which generates a conceptual analysis of the sentence. 4. The first clause in the sentence is used. 5. A set of heuristics is applied to suggest a good conceptual anchor point for a concept node. 6. If none of the heuristics is satisfied, AutoSlog searches for the next sentence containing the string and returns to step 3.
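An illustrative outline of this procedure (all names are assumptions; the CIRCUS parser and the anchor-point heuristics are stubbed out as callables): scan for the first sentence containing the target string, take the first clause of its conceptual analysis, and let the first heuristic that fires propose the conceptual anchor of a new concept node.

    # Illustrative outline of AutoSlog's concept-node construction loop.
    from dataclasses import dataclass

    @dataclass
    class ConceptNode:
        slot: str                  # slot of the answer key, e.g. "victim"
        anchor: str                # conceptual anchor point (trigger word)
        enabling_condition: str    # name of the heuristic pattern that fired
        event_type: str            # template type, e.g. "kidnapping"

    def build_concept_node(sentences, target_string, slot, event_type, parse, heuristics):
        for sentence in sentences:                      # step 2: first sentence with the string
            if target_string not in sentence:
                continue
            clause = parse(sentence)                    # steps 3-4: conceptual analysis, first clause
            for heuristic in heuristics:                # step 5: try each anchor-point heuristic
                anchor = heuristic(clause, target_string)
                if anchor is not None:
                    return ConceptNode(slot, anchor, heuristic.__name__, event_type)
            # step 6: no heuristic fired; continue with the next sentence containing the string
        return None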

  42. Conceptual Anchor Point Heuristics

  43. Background Knowledge • Concept Node Construction • Slot • The slot of the answer key • Hard and soft constraints • Type: Use template types such as bombing, kidnapping • Enabling condition: heuristic pattern • Domain Specification • The type of a template • The constraints for each template slot

  44. Another good concept node definition Perpetrator slot from a perpetrator template

  45. A bad concept node definition Victim slot from a kidnapping template

  46. Empirical Results • Input: • An annotated corpus of texts in which the targeted information is marked and annotated with semantic tags denoting the type of information (e.g., victim) and the type of event (e.g., kidnapping) • 1500 texts; the 1258 answer keys contain 4780 string fillers • Output: • 1237 concept node definitions • Human intervention: 5 person-hours to sift through all generated concept nodes • 450 definitions are kept • Performance:

  47. Conclusion • In 5 person-hours, AutoSlog creates a dictionary that achieves 98% of the performance of a hand-crafted dictionary • Each concept node is a single-slot extraction pattern • Reasons for bad definitions • A sentence contains the targeted string but does not describe the event • A heuristic proposes the wrong conceptual anchor point • CIRCUS incorrectly analyzes the sentence
