CERATOPS: Center for Extraction and Summarization of Events and Opinions in Text. Janyce Wiebe, U. Pittsburgh; Claire Cardie, Cornell U.; Ellen Riloff, U. Utah
Overview Rapidly re-trainable, robust components for: • Information extraction of facts and entities related to events from text • Extraction of opinions and motivations expressed in text • Tracking, linking, and summarizing events and opinions and their progressions over time
Motivation for Event IE Systems • Rapid semantic processing of large volumes of unstructured text • Automatic merging of facts and entity relationships across sets of documents • Automatic population of large databases with factual information from many text sources
Information Extraction from Text • Text: After a brief lull, the avian flu is on the march again through Fraser Valley poultry farms. The Canadian Food Inspection Agency says ongoing surveillance efforts have led to the detection of bird flu on 36 commercial premises. The agency says it is continuing depopulation efforts on infected farms on a priority basis. • Extracted template: OUTBREAK • Disease: avian flu / bird flu • Victims: poultry / 36 commercial premises • Location: Fraser Valley poultry farms • Country: Canada • Status: confirmed • Containment: depopulation
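A filled template like the one on this slide can be held in a small record type. The sketch below is only an illustration: the class name `OutbreakTemplate`, the helper `unfilled_slots`, and the assignment of values to slots are our reading of the slide, not part of the CERATOPS system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class OutbreakTemplate:
    """One event template; slot names follow the OUTBREAK slide (hypothetical class)."""
    disease: List[str] = field(default_factory=list)
    victims: List[str] = field(default_factory=list)
    location: Optional[str] = None
    country: Optional[str] = None
    status: Optional[str] = None
    containment: Optional[str] = None

    def unfilled_slots(self) -> List[str]:
        # A slot counts as unfilled when it is None or an empty list.
        return [name for name, value in vars(self).items() if not value]


# Slot values as read from the Fraser Valley example (assumed assignment).
template = OutbreakTemplate(
    disease=["avian flu", "bird flu"],
    victims=["poultry", "36 commercial premises"],
    location="Fraser Valley poultry farms",
    country="Canada",
    status="confirmed",
    containment="depopulation",
)
```

Keeping alternative mentions of the same slot filler in a list preserves both "avian flu" and "bird flu" until a later normalization step decides which to report.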
Information Extraction of Events • Extracting facts and entity relations associated with events of interest. • Terrorist incidents: perpetrators, victims, physical targets, weapons, date, location • Disease outbreaks: disease, organisms, victim, symptoms, location, country, date, containment measures • Keywords and named entity recognition are not sufficient: Troops were vaccinated against anthrax, cholera, … Researchers have discovered how anthrax toxin destroys cells and rapidly causes death ...
Syntactic Analysis → Extraction → Coreference Resolution → Template Generation • Syntactic Analysis: 3 chickens died from avian flu. (SUBJ VP PP) • Extraction: 3 chickens died from avian flu. Fact: DEATH; Victim: 3 chickens; Disease: avian flu • Coreference Resolution: 3 chickens died from avian flu. The birds were found in Canada. • Template Generation: Event: Outbreak; Victim: 3 chickens / the birds; Disease: bird flu; Country: Canada
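The four stages above can be imitated on the two example sentences with a toy sketch. Everything here is an assumption made for illustration: the regular expressions stand in for learned extraction patterns, the coreference links are supplied by hand, and the function names are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical surface patterns standing in for learned extraction patterns.
DEATH_PATTERN = re.compile(r"^(?P<victim>.+?) died from (?P<disease>.+?)\.?$")
FOUND_PATTERN = re.compile(r"^(?P<victim>.+?) were found in (?P<country>.+?)\.?$")


def extract_facts(sentence: str) -> Dict[str, str]:
    """Extraction step: return slot fillers from the first pattern that matches."""
    for pattern in (DEATH_PATTERN, FOUND_PATTERN):
        match = pattern.match(sentence.strip())
        if match:
            return match.groupdict()
    return {}


def build_template(sentences: List[str], coref: Dict[str, str]) -> Dict[str, str]:
    """Template generation: merge facts, mapping mentions to their antecedents."""
    template = {"Event": "Outbreak"}
    for sentence in sentences:
        for slot, value in extract_facts(sentence).items():
            canonical = coref.get(value.lower(), value)  # coreference step
            template.setdefault(slot, canonical)
    return template


sentences = ["3 chickens died from avian flu.", "The birds were found in Canada."]
coref_links = {"the birds": "3 chickens"}  # assumed output of a coreference resolver
outbreak = build_template(sentences, coref_links)
```

The sketch does not normalize "avian flu" to "bird flu" as the final template on the slide does; that would need the synonym knowledge the pipeline draws on elsewhere.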
New Approach: Role-Identifying Nouns • Lexically role-identifying nouns are defined by the role that the noun plays in an event. • kidnapper, arsonist, assassin: agent (perpetrator) • casualty, fatality, victim: theme (victim) • Semantically role-identifying nouns strongly evoke one event role in a domain based on semantics. (Intuition from Grice’s Maxim of Relevance) • Disease Reports: toddler, girl, boy: victim • Crime Reports: restaurant, store, hotel: location
Bootstrapped Learning of Role-Identifying Nouns [Figure: bootstrapping loop over unannotated texts that alternates between selecting the best extraction patterns (Ex: <subj> was arrested; killed by <np>) and the best extractions (nouns). Example nouns shown: assassin, arsonist, kidnapper; murderer, sniper, criminal]
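The alternation between patterns and nouns can be sketched as a simplified mutual-bootstrapping loop. This is not the Basilisk algorithm itself: the scoring functions, the choice of seed nouns, and the (pattern, noun) pairs below are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple


def bootstrap(extractions: List[Tuple[str, str]], seeds: Iterable[str],
              iterations: int = 2, nouns_per_round: int = 2) -> Set[str]:
    """Grow a noun lexicon from seeds using (pattern, extracted noun) pairs."""
    by_pattern = defaultdict(set)
    for pattern, noun in extractions:
        by_pattern[pattern].add(noun)

    lexicon = set(seeds)
    for _ in range(iterations):
        # Score each pattern by the fraction of its extractions already in the lexicon.
        pattern_score = {p: len(ns & lexicon) / len(ns) for p, ns in by_pattern.items()}
        # Score each candidate noun by the summed scores of the patterns extracting it.
        noun_score = defaultdict(float)
        for pattern, nouns in by_pattern.items():
            for noun in nouns - lexicon:
                noun_score[noun] += pattern_score[pattern]
        ranked = sorted(noun_score, key=lambda n: (-noun_score[n], n))
        new_nouns = [n for n in ranked[:nouns_per_round] if noun_score[n] > 0]
        if not new_nouns:
            break
        lexicon.update(new_nouns)
    return lexicon


# Invented (pattern, noun) pairs, as if collected from unannotated texts.
pairs = [
    ("<subj> was arrested", "kidnapper"), ("<subj> was arrested", "sniper"),
    ("<subj> was arrested", "murderer"),
    ("killed by <np>", "assassin"), ("killed by <np>", "sniper"),
    ("killed by <np>", "criminal"),
    ("<subj> was opened", "restaurant"), ("<subj> was opened", "store"),
]
learned = bootstrap(pairs, seeds={"assassin", "arsonist", "kidnapper"})
```

Nouns extracted only by patterns that never fire on known perpetrators (here "restaurant" and "store") keep a score of zero and are never added.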
Role-Identifying Expressions • Typically, a verb refers to an event and the verb’s arguments identify the role players: <subject> was kidnapped by <np> in <np> (victim, perpetrator, location) • But sometimes, a verb identifies a role player in an event without identifying the event! <subject> participated; <subject> was implicated (perpetrator)
Bootstrapped Learning of Role-Identifying Expressions [Figure: two-step pipeline (STEP 1, STEP 2) with components AutoSlog, Basilisk, and a Candidate RIE Pattern Generator; data labels: event extraction patterns, event nouns, relevant event nouns, candidate RIE patterns]
Learning to Extract Perpetrators [Phillips & Riloff, RANLP-07] • Role-Identifying Nouns: assailants, attackers, cell, culprits, extremists, hitmen, kidnappers, militiamen, MRTA, narco-terrorists, sniper • Event-Specific Patterns: was kidnapped by <np>; was killed by <np> • Role-Identifying Patterns: EVENT was perpetrated by <np>; <subject> was involved in EVENT
Decoupling Relevant Region Identification and Extraction • Local pattern matching has two drawbacks: • Facts can be missed if they do not occur with the event description. • False hits can be generated from irrelevant contexts. • Example: …the explosion ripped through the busy neighborhood in New Delhi. A bomb was found under a parked car… • Solution: • Identify relevant text regions. • Apply general, but semantically appropriate patterns.
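The two-step solution can be illustrated with a toy sketch. The keyword test below is only a stand-in for a trained relevant-region classifier, and the adjacency rule, the pattern, and all names are assumptions made for this example.

```python
import re
from typing import List

# Stand-in for a trained relevant-region classifier (assumption: keyword lookup).
EVENT_KEYWORDS = {"explosion", "bombing", "attack"}

# A general pattern that would be too noisy if applied to every sentence.
FOUND_PATTERN = re.compile(r"\b[Aa]n? (?P<item>\w+) was found\b")


def relevant_regions(sentences: List[str]) -> List[str]:
    """Keep sentences that mention an event or directly follow one that does."""
    flags = [any(k in s.lower() for k in EVENT_KEYWORDS) for s in sentences]
    regions = []
    for i, sentence in enumerate(sentences):
        follows_event = flags[i - 1] if i > 0 else False
        if flags[i] or follows_event:
            regions.append(sentence)
    return regions


def extract_weapons(sentences: List[str]) -> List[str]:
    """Apply the general pattern only inside relevant regions."""
    found = []
    for sentence in relevant_regions(sentences):
        found.extend(m.group("item") for m in FOUND_PATTERN.finditer(sentence))
    return found


bombing_story = [
    "The explosion ripped through the busy neighborhood in New Delhi.",
    "A bomb was found under a parked car.",
]
unrelated_story = [
    "The park reopened on Monday.",
    "A wallet was found under a bench.",
]
```

Here `extract_weapons(bombing_story)` recovers "bomb" even though the sentence never names the event, while the same pattern is never applied to the unrelated story, so "wallet" is not a false hit.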
IE Pattern Learning with Relevant Regions and Semantic Affinity [Patwardhan & Riloff, EMNLP-07] [Figure: system diagram with components Self-training SVM Classifier, Relevant Region Classifier, Semantic Affinity Pattern Learner, and IE System; data labels: relevant & irrelevant texts, Relevant Sentences, IE Patterns, Extractions]
CERATOPS Text Extraction and Data Visualization for Animal Health Surveillance • Collaborative project between CERATOPS, PURVAC, and the Veterinary Information Network (VIN), with funding from LLNL. • Goal: proof-of-concept of an end-to-end NLP-based visual analytics system for unstructured text.
Animal Health Surveillance Monitoring animal health is important to DHS’ mission: • 73% of emerging infectious diseases are zoonotic in origin. • Pets can provide early warning signs of disease outbreaks and exposures to toxic substances. • Adverse pet reactions can be early indicators of food chain contamination.
The Veterinary Information Network • VIN is the largest on-line community, information resource, and on-line continuing education source for veterinarians. Over half of all veterinarians in the U.S. use VIN! • VIN hosts message boards where veterinarians discuss what they are seeing in their practices. 15 years of message board data has been archived! • VIN built a database of semantic information associated with pet health to support search. • Paul Pion, DVM, President and co-founder of VIN, served as our consultant.
CERATOPS NLP-based Visual Analytics [Figure: NLP component producing a stream of facts (fact, fact, fact…)]
Prototype System for We produced a prototype IE system to extract and visualize diseases, victims, dates, and locations from ProMed-mail disease outbreak reports. • Used the VIN database (248,108 entries) to create 3 new dictionaries for text analysis: • syntactic and semantic lexicon • phrasal lexicon • synonym dictionary • Enhanced the template generation process to use new types of semantic information. • Converted our IE templates into a format appropriate for Purdue’s visualization system.
NLP-based Visual Analytics for Animal Health Surveillance • Future Goals: • Rapid identification of new disease outbreaks. • Trends or spikes in disease outbreaks. • Unusual symptoms or clusters of symptoms. • Statistical associations between foods & adverse pet reactions. • Improved diagnostic tools to associate symptoms with diseases and external events.
CERATOPS Semantic Class Learning from the Web [Kozareva, Riloff, & Hovy, ACL-08] • Goal: automatically create semantic dictionaries • Use a doubly-anchored hyponym pattern: <class name> such as <class member> and * • Construct pattern linkage graphs to capture the popularity and productivity of candidate terms and rank them. • Produces very accurate results with truly minimal supervision (class name and one seed)
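The doubly-anchored pattern can be sketched as a regular expression built from the class name and one seed member. This is a minimal illustration, not the published system: the snippets are invented, the candidate is assumed to be a capitalized phrase, and a plain frequency count stands in for the pattern linkage graph.

```python
import re
from collections import Counter
from typing import List


def doubly_anchored_pattern(class_name: str, seed: str) -> "re.Pattern[str]":
    """Build '<class name> such as <class member> and *' for one seed member."""
    return re.compile(
        rf"\b{re.escape(class_name)} such as {re.escape(seed)} and "
        r"(?P<candidate>[A-Z]\w+(?: [A-Z]\w+)*)"
    )


def harvest(snippets: List[str], class_name: str, seed: str) -> Counter:
    """Count how often each candidate fills the '*' position."""
    pattern = doubly_anchored_pattern(class_name, seed)
    counts = Counter()
    for snippet in snippets:
        for match in pattern.finditer(snippet):
            counts[match.group("candidate")] += 1
    return counts


# Invented text snippets, as if returned by a web search.
snippets = [
    "Many countries such as France and Germany signed the treaty.",
    "Tourists visit countries such as France and Italy every summer.",
    "Exports to countries such as France and Germany rose.",
    "Other countries such as Spain and Portugal abstained.",
]
candidates = harvest(snippets, "countries", "France")
```

In a fuller version each harvested candidate would become a new seed in turn, and the resulting seed-to-candidate edges would form the graph used for ranking.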
Coreference Resolution • Links entities, events, and opinions within and across documents [Figure: four coreference chains (Chain1 to Chain4) over the entities U.S. State Dept., President Bush, NIH, and Inspector General]
Build on Prior Work in NP Coreference Resolution • Classification: given a description of two noun phrases, NPi and NPj, classify the pair as coreferent or not coreferent • Clustering: coordinates pairwise coreference decisions • E.g., Ng & Cardie [ACL 2002] • Example: [Queen Elizabeth] set about transforming [her] [husband], [King George VI], … [Figure: pairwise “coref?” decisions among Queen Elizabeth, her, husband, and King George VI feed a clustering algorithm]
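The classify-then-cluster scheme can be sketched on the four noun phrases from the example. Two simplifications are assumed here: a hand-written gender-agreement rule replaces the learned pairwise classifier, and transitive closure via union-find replaces the clustering algorithm used in the cited work.

```python
from itertools import combinations
from typing import Callable, Dict, List

Mention = Dict[str, str]


def toy_classifier(np_i: Mention, np_j: Mention) -> bool:
    """Stand-in for a learned pairwise classifier: agree on gender (assumption)."""
    return np_i["gender"] == np_j["gender"]


def cluster(mentions: List[Mention],
            classifier: Callable[[Mention, Mention], bool]) -> List[List[str]]:
    """Merge every pair judged coreferent into the same chain (union-find)."""
    parent = list(range(len(mentions)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if classifier(mentions[i], mentions[j]):
            parent[find(j)] = find(i)

    chains: Dict[int, List[str]] = {}
    for index, mention in enumerate(mentions):
        chains.setdefault(find(index), []).append(mention["text"])
    return list(chains.values())


mentions = [
    {"text": "Queen Elizabeth", "gender": "f"},
    {"text": "her", "gender": "f"},
    {"text": "husband", "gender": "m"},
    {"text": "King George VI", "gender": "m"},
]
chains = cluster(mentions, toy_classifier)
```

Transitive closure is the simplest way to coordinate pairwise decisions, but a single wrong positive link merges two whole chains, which is one reason more selective clustering strategies are used in practice.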
Partially Supervised Clustering for Source Coreference Resolution [Stoyanov & Cardie, EMNLP 2006] • Labels for non-source NPs are unavailable • Australian press has launched a bitter attack on Italy after seeing their beloved Socceroos eliminated on a controversial late penalty. Italian coach Lippi has also been blasted for his comments after the game. In the opposite camp Lippi is preparing his side for the upcoming game with Ukraine. He hailed 10-man Italy's determination to beat Australia and said the penalty was rightly given.
State-of-the-Art Coreference Resolution • Cornell, Utah, & LLNL are collaboratively building a state-of-the-art coreference resolver based on the best features identified in prior work. • We plan to make the system publicly available. • On-going work and future plans include: • systematic evaluations of coreference subproblems • incorporating external knowledge about entities • non-anaphoric NP identification • unsupervised, automatic training • topic coreference for opinion analysis
Subjectivity: opinions, emotions, motivations, speculations, sentiments • Information Extraction of • NL expressions • Components • Properties • Example: Angolans are terrified of the Marburg virus (Source: Angolans; Attitude: negative emotion, intensity high; Target: the Marburg virus) • Opinion Frame Source: Angolans Polarity: negative Attitude: emotion Intensity: high Target: Marburg virus
Fine-grained Opinions Australian press has launched a bitter attack on Italy after seeing their beloved Socceroos eliminated on a controversial late penalty. Italian coach Lippi has also been blasted for his comments after the game. In the opposite camp Lippi is preparing his side for the upcoming game with Ukraine. He hailed 10-man Italy's determination to beat Australia and said the penalty was rightly given. [Stoyanov & Cardie, 2006]
Fine-grained Opinion Extraction “The Australian Press launched a bitter attack on Italy” Opinion Frame Source: Australian Press Polarity: negative Attitude: sentiment Intensity: high Target: Italy
Opinion Summary [Figure: graph linking opinion sources and targets: Australian Press, Italy, Marcello Lippi, Socceroos, penalty]
Summarization of Opinions + Events [Figure: an Opinion Frame (Source, Polarity, Intensity) and Direct Subjective frames (Source, Polarity, Intensity) combined with a Disease Outbreak template (Victim, Location, Disease, Date, …) into a Summary Representation]
Why Opinions? • Provide technology that can aid analysts in • extracting socio-behavioral information from text • monitoring public health awareness, knowledge and speculations about disease outbreaks, … • Enrich Information Extraction, Question Answering, and Visualization tools
E.g., are people extremely afraid or angry? Opinion Frame Source: Polarity: negative Attitude: Intensity: high Target:
Recognize motivations, predict actions. The industry is scared and so, even if they do find an ornamental carp with KHV, they will keep it secret. Opinion Frame Source: Polarity: Attitude: Intensity: Target:
Search for opinions about particular named targets. Brugere-Picoux backs the decision to ban British Beef. Opinion Frame Source: Polarity: Attitude: Intensity: Target: Ban on British beef
Search for opinions held by particular named sources. Brugere-Picoux backs the decision to ban British Beef. Opinion Frame Source: Brugere-Picoux Polarity: Attitude: Intensity: Target:
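Once opinion frames have been extracted, the searches described on these slides reduce to filtering on frame slots. The sketch below is a hypothetical representation: the class `OpinionFrame` and the function `search` are illustrative names, and slots the slides leave blank are kept as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class OpinionFrame:
    """Slots follow the Opinion Frame slides; blank slots stay None."""
    source: str
    target: str
    polarity: Optional[str] = None
    attitude: Optional[str] = None
    intensity: Optional[str] = None


def search(frames: List[OpinionFrame], source: Optional[str] = None,
           target: Optional[str] = None) -> List[OpinionFrame]:
    """Return frames held by a named source and/or aimed at a named target."""
    return [
        frame for frame in frames
        if (source is None or frame.source.lower() == source.lower())
        and (target is None or frame.target.lower() == target.lower())
    ]


FRAMES = [
    OpinionFrame("Australian Press", "Italy", "negative", "sentiment", "high"),
    OpinionFrame("Angolans", "Marburg virus", "negative", "emotion", "high"),
    OpinionFrame("Brugere-Picoux", "Ban on British beef"),
]

held_by = search(FRAMES, source="Brugere-Picoux")
aimed_at = search(FRAMES, target="italy")
# "Are people extremely afraid or angry?": strong negative frames.
alarmed = [f for f in FRAMES if f.polarity == "negative" and f.intensity == "high"]
```

Exact string matching on names is the weakest part of this sketch; in practice source and target mentions would first be grouped by coreference resolution.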
Motivation for the Summaries • Quickly determine the opinions of a person, organization, community, region, etc. • Quickly determine the opinions toward a person, organization, issue, event, … • Across an entire document • Across multiple documents • Over time • Reveal relationships and identify cliques and communities of interest • Complement work in social network analysis
Outline • Motivations for opinion extraction • Extracting opinion frames and components • Lexicon of subjective expressions • Contextual disambiguation • Enriched tasks • Opinion summarization
Lexicon • Explore different uses of words, to zero in on the subjective ones • Example: benefit
Lexicon • Example: benefit • Very often objective, as a Verb: Children with ADHD benefited from a 15-course of fish oil
Lexicon • Noun uses look more promising: The innovative economic program has shown benefits to humanity
Lexicon • However, there are objective noun uses too: …tax benefits. …employee benefits. …tax benefits to provide a stable economy. …health benefits to cut costs.
Lexicon • Pattern: benefits as the head of a noun phrase containing a prepositional phrase • Matches this: The innovative economic program has shown proven benefits to humanity • But none of these: …tax benefits. …employee benefits. …tax benefits to provide a stable economy. …health benefits to cut costs.
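A rough approximation of this pattern can be written over part-of-speech tagged tokens. The sketch assumes hand-tagged input with a hypothetical coarse tag set (D, J, N, V, P, and TO for infinitival "to"); a real system would rely on a tagger or parser to tell prepositional "to" from infinitival "to".

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, coarse tag); the tag set is an assumption


def subjective_benefits(tokens: List[Token]) -> bool:
    """True if noun 'benefits' is directly followed by a prepositional phrase."""
    for i, (word, tag) in enumerate(tokens):
        if word.lower() != "benefits" or tag != "N":
            continue
        j = i + 1
        if j < len(tokens) and tokens[j][1] == "P":
            j += 1
            # Skip determiners and adjectives inside the PP's noun phrase.
            while j < len(tokens) and tokens[j][1] in {"D", "J"}:
                j += 1
            if j < len(tokens) and tokens[j][1] == "N":
                return True
    return False


shown_benefits = [
    ("The", "D"), ("innovative", "J"), ("economic", "J"), ("program", "N"),
    ("has", "V"), ("shown", "V"), ("proven", "J"), ("benefits", "N"),
    ("to", "P"), ("humanity", "N"),
]
tax_benefits = [("tax", "N"), ("benefits", "N")]
tax_benefits_to_provide = [
    ("tax", "N"), ("benefits", "N"), ("to", "TO"), ("provide", "V"),
    ("a", "D"), ("stable", "J"), ("economy", "N"),
]
```

The objective uses fail for two different reasons: "tax benefits." has nothing after the noun, and "tax benefits to provide…" continues with an infinitive rather than a prepositional phrase.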
Lexicon: Longer Constructions. Example: be soft on crime
<item index="1">
  <itemMorphoSyntax> <lemma>be</lemma> </itemMorphoSyntax>
  <itemRelation xsi:type="ngramPattern"> <distance>2</distance> <landmark>2</landmark> </itemRelation>
</item>
<item index="2">
  <itemMorphoSyntax> <word>soft</word> <majorClass>J</majorClass> </itemMorphoSyntax>
  <itemRelation xsi:type="ngramPattern"> <distance>1</distance> <landmark>3</landmark> </itemRelation>
</item>
<item index="3">
  <itemMorphoSyntax> <word>on</word> </itemMorphoSyntax>
  <itemRelation xsi:type="ngramPattern"> <distance>1</distance> <landmark>4</landmark> </itemRelation>
</item>
<item index="4">
  <itemMorphoSyntax> <word>crime</word> <majorClass>N</majorClass> </itemMorphoSyntax>
</item>
The entry contains a pattern for finding instances of the construction • Matches variations: • When I look into his past I see a man who is very soft on crime. • The data could also weaken her authority to criticize Patrick for being soft on crime.
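One way to read the entry is that each item must be followed by its landmark item within `distance` tokens. The sketch below implements that reading only; the interpretation of `distance` and `landmark`, the hand-listed forms of "be", and the omission of the `majorClass` checks (no tagger is used) are all assumptions.

```python
import re
from typing import Dict, List

BE_FORMS = {"be", "is", "am", "are", "was", "were", "been", "being"}

# Items in order; "distance" is the maximum offset to the next item (assumed reading).
PATTERN: List[Dict] = [
    {"lemma": "be", "distance": 2},
    {"word": "soft", "distance": 1},
    {"word": "on", "distance": 1},
    {"word": "crime"},
]


def item_matches(item: Dict, token: str) -> bool:
    token = token.lower()
    if "lemma" in item:  # only the lemma "be" is supported in this sketch
        return item["lemma"] == "be" and token in BE_FORMS
    return token == item["word"]


def match_from(tokens: List[str], pos: int, k: int) -> bool:
    """Match PATTERN[k:] with item k anchored at tokens[pos]."""
    if not item_matches(PATTERN[k], tokens[pos]):
        return False
    if k == len(PATTERN) - 1:
        return True
    last = min(pos + PATTERN[k]["distance"], len(tokens) - 1)
    return any(match_from(tokens, nxt, k + 1) for nxt in range(pos + 1, last + 1))


def contains_construction(sentence: str) -> bool:
    tokens = re.findall(r"[A-Za-z']+", sentence)
    return any(match_from(tokens, i, 0) for i in range(len(tokens)))
```

Allowing a distance of 2 after the verb is what lets "is very soft on crime" match, while the distance of 1 elsewhere keeps "soft on crime" contiguous.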
Attributive information
<entryAttributes origin="j">
  <name>be soft on crime</name>
  <subjective>true</subjective>
  <reliability>h</reliability>
  <confidence>h</confidence>
  <subType>sen</subType>
  <example>The Obama campaign rejected the notion that the senator might be vulnerable to accusations that he is soft on crime.</example>
  <morphosyn>vp</morphosyn>
  <target>s</target>
  <polarity>n</polarity>
  <intensity>m</intensity>
  <confidence>h</confidence>
  <regex>1:[morph:[lemma="be"] order:[distance="2" landmark="2"]] 2:[morph:[word="soft" majorClass="J"] order:[distance="1" landmark="3"]] 3:[morph:[word="on"] order:[distance="1" landmark="4"]] 4:[morph:[word="crime" majorClass="N"]]</regex>
  <patterntype>ngramPattern</patterntype>
Lexicon: Summary • Uniform representation for different types of subjectivity clues • Word stem: benefit • Word: benefits • Word/POS: benefits/nouns • Fixed n-grams: benefits to • Syntactic patterns • Combinations of the above • Learn subjective uses from corpora (bodies of texts) • Capture longer subjective constructions • Add relevant knowledge about expressions • Riloff, Wiebe, Wilson 2003; Riloff & Wiebe 2003; Wiebe & Riloff 2005; Riloff, Patwardhan, Wiebe 2006; Ruppenhofer, Akkaya, Wiebe in preparation