Supervised Categorization for Habitual versus Episodic Sentences Thomas Mathew tam52@georgetown.edu Graham Katz egk7@georgetown.edu Department of Linguistics Georgetown University
Introduction • Habitual sentences state general facts • Describe properties of a class Bears eat blackberries • Characteristic of a specific individual Angus Young wears school uniforms on stage • Is stative; however, the main verb can be dynamic • Episodic sentences report on a finite number of specific events Mary ate a steak Angus Young wore a school uniform twice this week • Why does the distinction matter? • Event extraction • Document summarization
Scope • Determine automatically whether a sentence is habitual or episodic on the basis of sentence-internal information John smoked cigarettes when he was young → habitual John smoked a cigarette this morning → episodic • Note: Lexically stative predicates excluded Italians like wine • Do not exhibit habitual/specific ambiguity
Related Work • Sangweon Suh (2006) • Distinguish generic from specific NP reference in context Cats like tuna A cat ate the tuna • Eric Siegel (1995), Michael Brent (1990) • Determine whether verb is stative or eventive He called his father He resembles his father • On basis of distribution of verbs with overt features • Siegel (1995) uses co-occurrence frequencies of 14 features
Approach • Supervised Classification • Built training corpus • Selected features for machine learning • Evaluated features • Applied Machine Learning algorithms
Annotation of Corpus • Generated set of 1,816 sentences with 72 verb types by: • Randomly selecting sentences from Penn Treebank (WSJ & Brown) • Ignoring sentences with a lexically stative predicate • Adding all sentences in Penn Treebank whose main verb was a morphological variant of a verb from initial set
Annotation of Corpus • Annotated each sentence as habitual/episodic by: • Checking for explicit attribution • Frequency adverbs (usually, often) → habitual • Quantificational temporals (every night) → habitual • Habitual past (used to) → habitual • Definite temporals (yesterday) → episodic • Tested whether sentence meaning changed by adding modifier usually • No change in meaning indicated habitual • Examining discourse context • Assumed bunching of categories in a discourse • Applying intuitive semantic judgment • Single event or habit
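The explicit-attribution checks above can be sketched as a simple cue lookup. This is only an illustration: the annotation described on the slide was manual, the cue lists contain just the examples named there, the function name `explicit_cue_label` is hypothetical, and checking habitual cues before episodic ones is an assumption.

```python
import re

# Illustrative cue lists: only the examples named on the slide.
HABITUAL_CUES = [r"\busually\b", r"\boften\b", r"\bevery night\b", r"\bused to\b"]
EPISODIC_CUES = [r"\byesterday\b"]

def explicit_cue_label(sentence):
    """Return 'habitual', 'episodic', or None when no explicit cue is found."""
    text = sentence.lower()
    # Assumption: habitual cues take priority when both kinds are present.
    if any(re.search(pattern, text) for pattern in HABITUAL_CUES):
        return "habitual"
    if any(re.search(pattern, text) for pattern in EPISODIC_CUES):
        return "episodic"
    return None  # no explicit cue: fall back to the other annotation steps

print(explicit_cue_label("John used to smoke cigarettes"))
print(explicit_cue_label("Mary ate a steak yesterday"))
```

A surface match like this ignores clause structure, so a cue inside an embedded clause would be wrongly attributed to the main verb.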
Data • Verbs varied significantly in lexical bias • report almost only episodic, require almost only habitual • Final step: • Eliminated highly biased lexical verbs • Final data set • 1,052 sentences • 57 verb forms • Baseline distribution
Features • Selected 14 sentence-internal features • Features that can be derived from the annotation scheme of the Penn Treebank • Evaluated each feature's relevance to classification • Compared feature distribution by category against baseline
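One way to read "feature distribution by category against baseline" is to compare the habitual/episodic proportions among sentences carrying a feature value with the proportions over the whole data set. The sketch below assumes that reading; the function names and the toy rows are hypothetical and are not taken from the corpus.

```python
from collections import Counter

def category_distribution(labels):
    """Fraction of each category in a list of labels."""
    counts = Counter(labels)
    return {category: n / len(labels) for category, n in counts.items()}

def feature_bias(rows, feature, value):
    """Difference between the category distribution of sentences where
    `feature == value` and the baseline distribution over all sentences.
    rows: list of (feature_dict, label) pairs."""
    baseline = category_distribution([label for _, label in rows])
    subset = [label for feats, label in rows if feats.get(feature) == value]
    if not subset:
        return {}
    dist = category_distribution(subset)
    return {cat: dist.get(cat, 0.0) - baseline[cat] for cat in baseline}

# Toy data for illustration only.
rows = [
    ({"tense": "present"}, "habitual"),
    ({"tense": "present"}, "habitual"),
    ({"tense": "present"}, "episodic"),
    ({"tense": "past"}, "episodic"),
    ({"tense": "past"}, "episodic"),
    ({"tense": "past"}, "habitual"),
]
print(feature_bias(rows, "tense", "present"))
```

A positive value for a category means the feature value is over-represented in that category relative to the baseline.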
Tense Hungarian Radio saves its most politically outspoken broadcasts for around midnight → habitual Mickie laughed → episodic
Aspect Everyone else was running → episodic The school has received letters from parents → episodic
Temporals Every time I closed my eyes, I saw gray eyes rushing at me with a knife → habitual On Tuesday, Trellborg’s directors announced plans to spin off two big divisions as separately quoted companies on Stockholm’s stock exchange → episodic
Subject Features Commands go only from an office to the man of nearest lower rank → habitual The women indicated which family member usually did household chores → episodic
Object Features Not surprisingly, he sometimes bites → habitual In Los Angeles, in our lean years, we gave parties → habitual Robert Bernstein, chairman and president of Random House Inc., announced his resignation from the publishing house he has run for 23 years → episodic
Conditionals After all, gold prices soar when inflation is high → habitual
Prepositional Features Anheuser-Busch announced its plan at the same time it reported third quarter net income rose a lower-than-anticipated 5.2% to $238.3 million → episodic Treasury prices ended mixed in light trading → episodic You’ve got blood on your cheek → episodic
Feature Analysis Summary • Reliable features for episodicity • Less reliable features for habituality
Feature Limitations • Problem areas • Semantics of predicate arguments She was moving like a ballet dancer She was moving in café society as Lady Diana Harrington • Semantics of predicate He is meeting a girl from Brooklyn He is seeing a girl from Brooklyn • Sentence-external factors (discourse) John rarely ate fruit. He just ate oranges John didn’t eat much at breakfast. He just ate oranges • Sentences with ‘dual’-category • Too rare to analyze statistically After all, in all five recessions since 1960, stocks declined
Machine Learning • Considered three classifiers • Rule-based • Association Rule Classifier • Decision Tree (J48) Classifier • Probabilistic • Naïve Bayes • Evaluated against a baseline in which all sentences are blindly labeled with the majority class (episodic) • 73.1% overall precision
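The majority-class baseline can be sketched in a few lines. The counts used here (769 episodic, 283 habitual) are not stated on the slides; they are an inference from the 1,052-sentence data set and the 73.1% figure, and should be treated as an assumption. The function name is hypothetical.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the majority label and the fraction of sentences it gets right
    when every sentence is blindly assigned that label."""
    majority, count = Counter(labels).most_common(1)[0]
    return majority, count / len(labels)

# Assumed counts, inferred from 1,052 sentences and the 73.1% baseline.
labels = ["episodic"] * 769 + ["habitual"] * 283
label, score = majority_baseline(labels)
print(label, round(score * 100, 1))
```

Any classifier in the comparison has to beat this figure to show that the features carry information beyond the class prior.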
Association Rule Classifier • Applied Predictive Apriori algorithm (Scheffer 2004) for multivariate analysis • Algorithm generates n-best feature patterns predicting a category • Manually pruned results • Only patterns selecting for episodicity > 85% • Only patterns selecting for habituality > 80% • If R1 ⊂ R2, discard R2 • If sorted list {R1, R2, ..., Rn} has same coverage as {R1, R2, ..., Rn+1} for category, discard Rn+1 • Model • 4 patterns (213) are habitual 173 times • 11 patterns (882) are episodic 735 times
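The two structural pruning steps (subset and coverage) can be sketched as below. The pruning on the slide was done manually; this sketch assumes each pattern is a set of (feature, value) conditions, that the list is already sorted best-first, and that coverage is measured over the sentences of the target category. The name `prune_rules` and the feature names in the example are hypothetical.

```python
def prune_rules(rules, sentences):
    """rules: list of frozensets of (feature, value) conditions, sorted best-first.
    sentences: list of feature dicts for the target category.
    Returns the rules that survive both pruning steps."""
    # Step 1: if R1 is a proper subset of R2, discard the more specific R2.
    general = [r for r in rules if not any(other < r for other in rules)]

    # Step 2: walk the sorted list and discard a rule that adds no coverage.
    kept, covered = [], set()
    for rule in general:
        matches = {
            i for i, feats in enumerate(sentences)
            if all(feats.get(f) == v for f, v in rule)
        }
        if matches <= covered:
            continue  # same coverage with or without this rule
        kept.append(rule)
        covered |= matches
    return kept

rules = [
    frozenset({("tense", "present")}),
    frozenset({("tense", "present"), ("freq_adv", True)}),
    frozenset({("quant_temporal", True)}),
    frozenset({("freq_adv", True)}),
]
sentences = [
    {"tense": "present", "freq_adv": True},
    {"tense": "present"},
    {"tense": "past", "quant_temporal": True},
]
print(prune_rules(rules, sentences))
```

If the model figures are read as matched sentences and correct matches, the four habitual patterns are right 173/213 ≈ 81.2% of the time and the eleven episodic patterns 735/882 ≈ 83.3% of the time.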
Decision Tree (J48) Classifier • Weka’s implementation of C4.5 • Used ten-fold cross validation for evaluation • Model • 2 patterns (184) are habitual 161 times • 2 patterns (829) are episodic 727 times
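Ten-fold cross validation can be sketched in plain Python as follows. This does not reproduce Weka or J48: the split is a simple unstratified shuffle, and `train_majority` is a stand-in learner used only to make the example runnable. All function names are hypothetical.

```python
import random
from collections import Counter

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(rows, train, k=10, seed=0):
    """rows: list of (features, label) pairs.
    train: function taking training rows and returning predict(features).
    Returns accuracy pooled over all held-out folds."""
    correct = 0
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        training = [row for i, row in enumerate(rows) if i not in held]
        predict = train(training)
        correct += sum(predict(rows[i][0]) == rows[i][1] for i in held_out)
    return correct / len(rows)

def train_majority(training):
    """Stand-in learner (NOT J48): always predicts the most frequent label."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda feats: majority

rows = [({}, "episodic")] * 8 + [({}, "habitual")] * 2
print(cross_validate(rows, train_majority, k=10))
```

Each sentence is held out exactly once, so the pooled score is computed over the whole data set while no sentence is scored by a model trained on it.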
Decision Tree (J48) Classifier • Impact of feature groups (J48) • All select roughly the same number of episodic sentences • Variation is more on habitual/incorrect sentences
Results • Classifier Performance (1: Not evaluated using an independent validation set) • Habituality Recall • Tense and presence of a quantificational temporal are the best indicators of habituality • However, neither provides sufficient coverage of habitual examples by itself
Conclusion • Using syntactic features is a viable method for category disambiguation • Identification of episodic sentences outperforms identification of habitual sentences • There are more overt markers of habituality; however, more features show bias for episodicity • Performance • Impact of lexical verb and sentence-external features • Feature extraction process is in some cases an approximation • Annotation errors/consistency in corpus
Future Work • Impact of discourse • Independently annotate sentence, predecessor, successor in isolated context • Weighting factor for ambiguous situations • Annotate sentence, predecessor, successor conscious of context