Lecture 13: Information Extraction CSCE 771 Natural Language Processing • Topics • Named Entity Recognition • Relation detection • Temporal and Event Processing • Template Filling • Readings: Chapter 22 • February 27, 2013
Overview • Last Time • Dialogues • Human conversations • Today • Slides from Lecture 24 • Dialogue systems • Dialogue Manager Design • Finite State, Frame-based, Initiative: User, System, Mixed • VoiceXML • Information Extraction • Readings • Chapter 24, Chapter 22
Information extraction • Information extraction turns unstructured information buried in texts into structured data • Extract proper nouns – “named entity recognition” • Reference resolution • Named entity mentions • Pronoun references • Relation detection and classification • Event detection and classification • Temporal analysis • Template filling
Template Filling • Example template for “airfare raise”
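A minimal sketch of how such a template might be represented and filled; the slot names and the filled values (taken from the airline-fare passage used later in this lecture) are illustrative, not the textbook's exact schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class FareRaiseTemplate:
    """Illustrative slots for an 'airfare raise' event template."""
    lead_airline: Optional[str] = None
    amount: Optional[str] = None
    effective_date: Optional[str] = None
    follower: Optional[str] = None

# Filled from the United/American example text in this lecture:
t = FareRaiseTemplate(lead_airline="United Airlines",
                      amount="$6",
                      effective_date="Thursday",
                      follower="American Airlines")
print(t)
```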
Figure 22.6 Features used in Training NER • Gazetteers – lists of place names • www.geonames.com • www.census.gov
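A minimal sketch of turning a gazetteer into an NER feature; the file name and the feature encoding are assumptions, not the figure's exact setup.

```python
# Hypothetical gazetteer file: one place name per line.
with open("gazetteer_places.txt", encoding="utf-8") as f:
    GAZETTEER = {line.strip().lower() for line in f if line.strip()}

def gazetteer_feature(token: str) -> str:
    """Binary feature: does this token appear in the gazetteer?"""
    return "IN_GAZ" if token.lower() in GAZETTEER else "NOT_IN_GAZ"
```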
Evaluation of Named Entity Rec. Sys. • Recall terms from Information Retrieval • Recall = # correctly labeled / total # that should be labeled • Precision = # correctly labeled / total # labeled • F-measure: Fβ = (β² + 1)·P·R / (β²·P + R), where β weights the preference • β = 1 balanced • β > 1 favors recall • β < 1 favors precision
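A small sketch of these formulas in code; the function and variable names are mine.

```python
def precision_recall_f(correct, labeled, gold, beta=1.0):
    """correct: spans labeled correctly; labeled: spans the system labeled;
    gold: spans that should have been labeled."""
    p = correct / labeled
    r = correct / gold
    f = (beta**2 + 1) * p * r / (beta**2 * p + r)
    return p, r, f

# Example: 80 correct out of 100 labeled, against 120 gold spans.
print(precision_recall_f(80, 100, 120))  # (0.8, 0.667, F1 ≈ 0.727)
```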
NER Performance Revisited • Recall, Precision, F • High-performance systems • F ~ .92 for PERSON and LOCATION, ~ .84 for ORG • Practical NER • Make several passes over the text • Start with the highest-precision rules (perhaps at the expense of recall) to make sure what you get is right • Search for substring matches of previously detected names using probabilistic string-matching metrics (Chap. 19) • Use name lists focused on the domain • Apply probabilistic sequence-labeling techniques using the previous tags
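A toy sketch of the multi-pass idea; the regular expressions and the sample sentence are illustrative, and a real system would use the probabilistic string-matching metrics mentioned above rather than exact substrings.

```python
import re

text = "United Airlines raised fares. United matched a move by American Airlines."

# Pass 1: high-precision pattern -- full '<Name> Airlines' mentions only.
known = set(re.findall(r"\b[A-Z][a-z]+ Airlines\b", text))

# Pass 2: lower-precision pass, reusing names detected in pass 1.
for name in known:
    head = name.split()[0]                    # e.g. "United"
    for m in re.finditer(rf"\b{head}\b", text):
        print("candidate ORG mention:", m.group(), "at offset", m.start())
```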
Relation Detection and Classification • Consider sample text: • Citing high fuel prices, [ORG United Airlines] said [TIME Friday] it has increased fares by [MONEY $6] per round trip on flights to some cities also served by lower-cost carriers. [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PERSON Tim Wagner] said. [ORG United Airlines], a unit of [ORG UAL Corp.], said the increase took effect [TIME Thursday] and applies to most routes where it competes against discount carriers, such as [LOC Chicago] to [LOC Dallas] and [LOC Denver] to [LOC San Francisco]. • After identifying named entities, what else can we extract? • Relations
Figure 22.13 Supervised Learning Approaches to Relation Analysis • The algorithm is a two-step process • First, a classifier decides whether a pair of named entities is related • Then a second classifier is trained to label the relation
Factors used in Classifying • Features of the named entities • Named entity types of the two arguments • Concatenation of the two entity types • Headwords of the arguments • Bag-of-words from each of the arguments • Words in the text • Bag-of-words and bag-of-bigrams • Stemmed versions • Distance between the named entities (in words / in intervening named entities) • Syntactic structure • Parse-related structures • (a code sketch using a few of these features follows below)
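A minimal sketch of the labeling step with a handful of the features above; the feature names, toy training pairs, and use of scikit-learn are my assumptions, not the textbook's implementation (the first, related/unrelated step would be a binary classifier trained the same way).

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(e1_type, e2_type, words_between):
    # A few features from the list above (names are mine).
    return {
        "type1": e1_type,
        "type2": e2_type,
        "type_concat": e1_type + "+" + e2_type,
        "n_between": len(words_between),
        **{"bow=" + w: 1 for w in words_between},
    }

# Toy training pairs: (entity types, words between) -> relation label.
X = [features("ORG", "ORG", ["a", "unit", "of"]),
     features("ORG", "LOC", ["flights", "to"]),
     features("PER", "ORG", ["spokesman", "for"])]
y = ["PART-OF", "SERVES", "WORKS-FOR"]

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)
print(clf.predict(vec.transform([features("ORG", "ORG", ["a", "unit", "of"])])))
```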
Bootstrapping Example: “Has a hub at” • Consider the pattern • / * has a hub at * / • Google search • 22.4 Milwaukee-based Midwest has a hub at KCI • 22.5 Delta has a hub at LaGuardia • … • Two ways to fail • False positive: e.g., “a star topology has a hub at its center” • False negative: a true instance the pattern simply misses, e.g. • 22.11 No-frills rival easyJet, which has established a hub at Liverpool
Using Features to Restrict Patterns • 22.13 Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights • / [ORG], which uses [LOC] as a hub /
Semantic Drift • Note it will be difficult (often impossible) to get annotated material for training • Accuracy of the process is heavily dependent on the initial seeds • Semantic drift • Occurs when erroneous patterns (or seeds) lead to the introduction of erroneous tuples
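A toy sketch of one bootstrapping iteration for the hub relation; the corpus, regular expression, and extracted tuples are illustrative, and a real system would add confidence scoring of patterns and tuples to limit semantic drift.

```python
import re

corpus = [
    "Milwaukee-based Midwest has a hub at KCI.",
    "Delta has a hub at LaGuardia.",
    "A star topology has a hub at its center.",  # the false-positive risk
]

# Pattern induced from seed tuples such as (Delta, LaGuardia).
pattern = re.compile(r"(\w[\w -]*?) has a hub at (\w[\w -]*?)[.,]")

tuples = set()
for sentence in corpus:
    for org, loc in pattern.findall(sentence):
        tuples.add((org.strip(), loc.strip()))

print(tuples)  # includes the erroneous ('A star topology', 'its center')
```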
Fig 22.17 Temporal and Durational Expressions • Absolute temporal expressions • Relative temporal expressions • Durations
Temporal Normalization • ISO 8601 – the standard for encoding temporal values • Basic date format: YYYY-MM-DD
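A minimal sketch of normalizing a fully specified date expression to ISO 8601; the input string and the use of Python's datetime are my choices.

```python
from datetime import datetime

# Map a fully specified temporal expression to the ISO 8601 date format.
d = datetime.strptime("February 27, 2013", "%B %d, %Y")
print(d.strftime("%Y-%m-%d"))  # 2013-02-27
```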
Event Detection and Analysis • Event Detection and classification
Fig 22.23 Features for Event Detection • Features used in rule-based and statistical techniques