460 likes | 591 Views
CERATOPS Center for Extraction and Summarization of Events and Opinions in Text. Janyce Wiebe, U. Pittsburgh Claire Cardie, Cornell U. Ellen Riloff, U. Utah. Overview. Rapidly re-trainable, robust components for: Information extraction of facts and entities related to events from text
E N D
CERATOPSCenter for Extraction and Summarization of Events and Opinions in Text Janyce Wiebe, U. Pittsburgh Claire Cardie, Cornell U. Ellen Riloff, U. Utah
Overview Rapidly re-trainable, robust components for: • Information extraction of facts and entities related to events from text • Extraction of opinions and motivations expressed in text • Tracking, linking, and summarizing events and opinions and their progressions over time
Rapid semantic processing of large volumes of unstructured text Automatic merging of facts and entity relationships across sets of documents Automatic population of large databases with factual information from many text sources Motivation for Event IE Systems
Information Extraction from Text After a brief lull, the avian flu is on the march again through Fraser Valley poultry farms. The Canadian Food Inspection Agency says ongoing surveillance efforts have led to the detection of bird flu on 36 commercial premises. The agency says it is continuing depopulation efforts on infected farms on a priority basis. OUTBREAK Disease: Victims: Location: Country: Status: Containment: / bird flu / 36 commercial premises Canada confirmed avian flu poultry Fraser Valley poultry farms depopulation
Keywords and named entity recognition are not sufficient. Troops were vaccinated against anthrax, cholera, … Researchers have discovered how anthrax toxin destroys cells and rapidly causes death ... Information Extraction of Events Extracting facts and entity relations associated with events of interest. Terrorist incidents: perpetrators, victims, physical targets, weapons, date, location Disease outbreaks: disease, organisms, victim, symptoms, location, country, date, containment measures
Syntactic Analysis Extraction Coreference Resolution Template Generation 3 chickens died from avian flu. 3 chickens died from avian flu. SUBJVPPP Fact: DEATH Victim: 3 chickens Disease: avian flu 3 chickens died from avian flu. The birds were found in Canada. Event: Outbreak Victim: 3 chickens / the birds Disease: bird flu Country: Canada
kidnapper, arsonist, assassin agent (perpetrator) casualty, fatality, victim theme (victim) Disease Reports: toddler, girl, boy victim Crime Reports: restaurant, store, hotel location New Approach: Role-Identifying Nouns Lexically role-identifying nouns are defined by the role that the noun plays in an event. Semantically role-identifying nouns strongly evoke one event role in a domain based on semantics. (Intuition from Grice’s Maxim of Relevance)
Unannotated Texts Ex: murderer, sniper, criminal Bootstrapped Learning of Role-Identifying Nouns Ex: assassin, arsonist, kidnapper Ex: <subj> was arrested killed by <np> Best Extraction Patterns Best Extractions (Nouns)
<subject> was kidnapped by <np> in <np> victim perpetrator location • But sometimes, a verb identifies a role player in an event without identifying the event! <subject> participated <subject> was implicated perpetrator Role-Identifying Expressions • Typically, a verb refers to an event and the verb’s arguments identify the role players:
relevant event nouns Bootrapped Learning of Role-Identifying Expressions event STEP 1 STEP 2 AutoSlog Basilisk event extraction patterns event nouns Candidate RIE Pattern Generator candidate RIE patterns
Learning to Extract Perpetrators [Phillips & Riloff, RANLP-07] Role-Identifying Nouns: assailants, attackers, cell, culprits, extremists, hitmen, kidnappers, militiamen, MRTA, narco-terrorists, sniper Event-Specific Patterns: was kidnapped by <np> was killed by <np> Role-Identifying Patterns: EVENT was perpetrated by <np> <subject> was involved in EVENT
Decoupling Relevant Region Identification and Extraction Local pattern matching has two drawbacks: • Facts can be missed if they do not occur with the event description. • False hits can be generated from irrelevant contexts. …the explosion ripped through the busy neighborhood in New Delhi. A bombwas found under a parked car… • Solution: • Identify relevant text regions. • Apply general, but semantically appropriate patterns
pattern IE Pattern Learning with Relevant Regions and Semantic Affinity [Patwardhan & Riloff, EMNLP-07] relevant & irrelevant texts Self-training SVM Classifier Semantic Affinity Pattern Learner Relevant Region Classifier Relevant Sentences IE Patterns IE System Extractions
CERATOPS Text Extraction and Data Visualization for Animal Health Surveillance • Collaborative project between CERATOPS, PURVAC, and the Veterinary Information Network (VIN), with funding from LLNL. • Goal: proof-of-concept of an end-to-end NLP-based visual analytics system for unstructured text.
Animal Health Surveillance Monitoring animal health is important to DHS’ mission: • 73% of emerging infectious diseases are zoonotic in origin. • Pets can provide early warning signs of disease outbreaks and exposures to toxic substances. • Adverse pet reactions can be early indicators of food chain contamination.
The Veterinary Information Network • VIN is the largest on-line community, information resource, and on-line continuing education source for veterinarians. Over half of all veterinarians in the U.S. use VIN! • VIN hosts message boards where veterinarians discuss what they are seeing in their practices. 15 years of message board data has been archived! • VIN built a database of semantic information associated with pet health to support search. • Paul Pion, DVM, President and co-founder of VIN, and served as our consultant.
NLP fact fact fact… CERATOPS NLP-based Visual Analytics
Prototype System for We produced a prototype IE system to extract and visualize diseases, victims, dates, and locations from ProMed-mail disease outbreak reports. • Used the VIN database (248,108 entries) to create 3 new dictionaries for text analysis: • syntactic and semantic lexicon • phrasal lexicon • synonym dictionary • Enhanced the template generation process to use new types of semantic information. • Converted our IE templates into a format appropriate for Purdue’s visualization system.
NLP-based Visual Analytics for Animal Health Surveillance • Rapid identification of new disease outbreaks. • Trends or spikes in disease outbreaks. • Unusual symptoms or clusters of symptoms. • Statistical associations between foods & adverse pet reactions. • Improved diagnostic tools to associate symptoms with diseases and external events. Future Goals:
CERATOPS Semantic Class Learning from the Web [Kozareva, Riloff, & Hovy, ACL-08] • Goal: automatically create semantic dictionaries • Use a doubly-anchored hyponym pattern: <class name> such as <class member> and * • Construct pattern linkage graphs to capture the popularity and productivity of candidate terms and rank them. • Produces very accurate results with truly minimal supervision (class name and one seed)
Chain1: Chain2: Chain3: Chain4: U.S. State Dept. President Bush NIH Inspector General Coreference Resolution • Links entities, events, and opinions within and across documents
Queen Elizabeth her [Queen Elizabeth], set about transforming [her] [husband] , [King George VI], … coref? coref? Clustering Algorithm coref? coref? husband King George VI coref? Build on Prior Work in NP Coreference Resolution • Classification • given a description of two noun phrases, NPiand NPj, classify the pair as coreferentor not coreferent • Clustering • coordinates pairwise coreference decisions E.g., Ng & Cardie ACL [2002]
Partially Supervised Clustering for Source Coreference Resolution [Stoyanov & Cardie, EMNLP 2006] Labels for non-source NPs are unavailable Australian press has launched a bitter attack on Italy after seeing theirbelovedSocceroos eliminated on a controversial late penalty. ItaliancoachLippi has also been blasted for his comments after the game. In the opposite camp Lippi is preparing his side for the upcoming game with Ukraine. Hehailed 10-man Italy's determination to beat Australia and said the penalty was rightly given.
State-of-the-Art Coreference Resolution • Cornell, Utah, & LLNL are collaboratively building a state-of-the-art coreference resolver based on the best features identified in prior work. • We plan to make the system publicly available. • On-going work and future plans include: • systematic evaluations of coreference subproblems • incorporating external knowledge about entities • non-anaphoric NP identification • unsupervised, automatic training • topic coreference for opinion analysis
Overview • Text analysis to support a broad range of knowledge discovery tasks • Automatic annotators that assign semantic and conceptual labels to words, phrases, and documents • Automaticallyextracting, summarizing and tracking information about eventsand opinions
NLP fact fact fact… fact fact fact fact fact fact fact fact fact… fact… fact… fact… Images NLP NLP
Document Text One person was killed when a small bomb exploded at a police station in Basra town in Iraq's politically volatile southern region on Wednesday, residents said. The bomb was the first to hit an urban area since the riots in the southern city on May 16. The terrorist group Al Qaeda claimed responsibility for the attack, and… Event Weapon: a small bomb Location: Basra Victim: One person Perpetrator Org: Al Qaeda Physical Target: police station Event-oriented IE Goal of IE system: extract facts associated with events from unstructured text
Keywords and named entity recognition are not sufficient. Troops were vaccinated against anthrax, cholera, … Researchers have discovered how anthrax toxin destroys cells and rapidly causes death ... Information Extraction for Events Extracting facts and entity relations associated with events of interest. Terrorist incidents: perpetrators, victims, physical targets, weapons, date, location Disease outbreaks: disease, organisms, victim, symptoms, location, country, date, containment measures
Fact Extraction: Example After a brief lull, the avian flu is on the march again through Fraser Valley poultry farms. The Canadian Food Inspection Agency says ongoing surveillance efforts have led to the detection of bird flu on 36 commercial premises. The agency says it is continuing depopulation efforts on infected farms on a priority basis. OUTBREAK Disease: Victims: Location: Country: Status: Containment: / bird flu / 36 commercial premises Canada confirmed avian flu poultry Fraser Valley poultry farms depopulation
Text 3 chickens died from avian flu. 3 chickens died from avian flu. SUBJVPPP Syntactic Analysis Fact: DEATH Victim: 3 chickens Disease: avian flu Semantic Extraction 3 chickens died from avian flu. The birdswere found in Canada. Coreference Resolution Event: Outbreak Victim: 3 chickens / the birds Disease: bird flu Country: Canada Relation and Event Analysis
Pattern Learner Patterns & Statistics (More) Patterns & Statistics keywords Relevant Region Identifier Pattern Learner IE Pattern Bootstrapping Process
Unannotated Texts Semantic Dictionary Bootstrapping Ex: anthrax, ebola, cholera, flu, plague Ex: outbreak of <NP> Best Extraction Pattern(s) Ex: smallpox, tularemia, botulism Extractions (Nouns)
Semantic Learning Case Study • Input to Basilisk: 10 common disease names • Of the top 200 words hypothesized to be diseases: 89 were already in the UMLS metathesaurus (32,000 names of diseases and organisms), but 111 were not! Including: adenomatosis tularaemia tularamia diarrhoea diphtheriae enterovirus-71 fibropapillomas gastroeneteritis h5n1 h7n3 ev71 yf jyf nvcjd pepmv wsmv flu kawasaki mad-cow-disease smut pertussis pleuro-pneumonia polioencephalomyelitis poliovirus
Learning Subjective Phrases • Using Information Extraction techniques
Extractions expressed <dobj> condolences, hope, grief, views, worries indicative of <np> compromise, desire, thinking inject <dobj> vitality, hatred reaffirmed <dobj> resolve, position, commitment voiced <dobj> outrage, support, skepticism, opposition, gratitude, indignation show of <np> support, strength, goodwill, solidarity <subj> was shared anxiety, view, niceties, feeling
<subj> was expected 45 0.42 was expected from <np> 5 1.00 <subj> put 187 0.67 <subj> put end 10 0.90 <subj> talk 28 0.71 talk of <np> 10 0.90 <subj> is talk 5 1.00 <subj> is fact 38 1.00 fact is <dobj> 12 1.00 Subjective Expressions as IE Patterns PATTERNFREQP(Subj | Pattern) <subj> asked 128 0.63 <subj> was asked 11 1.00
Conclusions Rapidly re-trainable, robust components for • Information extraction of facts and entities • Extraction of opinions • Tracking, linking, and summarizing events and opinions and their progressions over time
Current Work: Topics • Topic coreference resolution • Treat as an NP coreference resolution task • Modify our existing NP coref approach • Initial results look promising