This paper presents an improved approach for stance classification in online debate forums. It addresses the challenge of determining the stance expressed in two-sided debates and proposes enhancements in data, features, models, and constraints. Experimental results demonstrate the effectiveness of the proposed approach.
Stance Classification of Ideological Debates • Sen Han • leonihsam@gmail.com • 4 June 2019
Outline • Abstract • Problem • Introduction • Previous approach • Improved approach • Improvements in stance classification • Models • Features • Data • Constraints • Experiments and evaluation • Results • Discussion
Problem • Determining the stance expressed in a post written for a two-sided debate in an online debate forum is a relatively new and challenging problem in opinion mining. • Goal: improve the performance of learning-based stance classification along several dimensions
Previous approach • Example debate: “Should homosexual marriage be legal?” • The goal of debate stance classification is to determine which of the two sides (i.e., for and against) the author of a post is taking • Challenges: • colorful and emotional language used to express one's points, which may involve sarcasm, insults, and questioning another debater's assumptions and evidence (acting as noise and disturbance terms for a classifier) • limited stance-annotated debate data
Improvements • Data: increase the number of stance-annotated debate posts by collecting training data from different sources • Features: add frame-semantic features to an n-gram-based stance classifier • Models: exploit the linear structure inherent in a post's sentence sequence; train a better model by learning only from the stance-related sentences, without relying on sentences manually annotated with stance labels • Constraints: apply extra-linguistic inter-post constraints, such as author constraints, by postprocessing the output of a stance classifier
Models • Binary classifiers • Naive Bayes (NB) • Support Vector Machines (SVMs) • Sequence labelers • first-order Hidden Markov Models (HMMs) • linear-chain Conditional Random Fields (CRFs) • Our models • unigram-based baseline • fine-grained models: jointly model the stance label of a debate post and the stance label of each of its sentences
Fine-grained model • Document: di • A document stance c with probability P(c) • Sentence: em • A sentence stance s with probability P(s|c) • n-th feature representing em: fn, with probability P(fn|s,c) • Sentence stance: P(s|em,di,c)
Fine-grained model • Classify each test post di using the fine-grained NB model • Choose the stance with maximum conditional probability: S_max • S(di): the set of sentences in test post di • E.g. • p(“for homosexual marriage”|d1) = 80% • p(“for abortion”|d2) = 5%
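The scoring above can be sketched as follows. This is a minimal illustration of fine-grained Naive Bayes inference, not the authors' implementation: the probability tables, the smoothing constant, and the use of a per-sentence max over sentence stances are all assumptions.

```python
import math

def sentence_score(features, s, c, p_s_given_c, p_f_given_sc):
    # log P(s|c) + sum over features of log P(f_n|s,c)
    score = math.log(p_s_given_c[(s, c)])
    for f in features:
        # tiny floor stands in for proper smoothing (assumption)
        score += math.log(p_f_given_sc.get((f, s, c), 1e-6))
    return score

def classify_post(sentences, stances, p_c, p_s_given_c, p_f_given_sc):
    # pick the document stance c maximizing log P(c) plus, for each
    # sentence, the score of its best sentence stance under c
    best_c, best = None, float("-inf")
    for c in stances:
        total = math.log(p_c[c])
        for feats in sentences:
            total += max(
                sentence_score(feats, s, c, p_s_given_c, p_f_given_sc)
                for s in stances
            )
        if total > best:
            best_c, best = c, total
    return best_c
```

With a toy table where the feature "good" is far more likely under a "for"-stanced sentence in a "for" post, the classifier prefers the "for" document stance.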
Features • N-gram features • unigrams and bigrams collected from the training posts • Anand et al.'s (2011) features • n-grams • document statistics • punctuation • syntactic dependencies • the set of features computed for the immediately preceding post in its thread • Frame-semantic features • a frame-semantic parse is produced for each sentence • for each frame that a sentence contains, we create three types of frame-semantic features
Features • Frame-word interaction feature: (frame-word1-word2) • e.g. “Possession-right-woman; Possession-woman-choose”; the word pair is unordered • Frame-pair feature: (frame2:frame1) • e.g. “Choosing:Possession”; the pair is ordered
Frame-semantic features • Frame n-gram feature: each word in an n-gram may be replaced by • its frame name (if the word is a frame target) • its frame-semantic role (if the word is present in a frame element) • e.g. “woman+has” yields woman+Possession, People+has, People+Possession, Owner+Possession, and Owner+has
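The frame n-gram expansion can be sketched like this for bigrams. The annotation format (a dict mapping a token to its frame names and frame-element roles) is a hypothetical simplification of a real frame-semantic parse, which would be keyed by token position rather than by string.

```python
def frame_ngram_features(tokens, annotations):
    """For each word bigram, emit the bigram itself plus every variant
    where a word is replaced by one of its frame labels.
    `annotations`: token -> list of frame names / frame-element roles
    (hypothetical format for this sketch)."""
    feats = []
    for w1, w2 in zip(tokens, tokens[1:]):
        left = [w1] + annotations.get(w1, [])
        right = [w2] + annotations.get(w2, [])
        for a in left:
            for b in right:
                feats.append(f"{a}+{b}")
    return feats
```

Applied to the slide's example, "woman has" with "woman" annotated as a People frame target and an Owner frame element, and "has" as a Possession target, this produces exactly the six variants listed above.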
Data • amount and quality of the training data matter • collect documents relevant to the debate domain from different sources • stance-label them heuristically • train on the combination of the noisily labeled documents and the stance-annotated debate posts
Data Roughly the same number of phrases were created for the two stances in a domain.
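A heuristic labeler over such phrase lists might look like the sketch below. The phrase lists, the count-comparison rule, and discarding ties are assumptions for illustration, not the paper's exact procedure.

```python
def noisy_label(tokens, for_phrases, against_phrases):
    """Heuristically stance-label a document by counting matches
    against hand-created phrase lists for each stance (assumed scheme)."""
    f = sum(1 for w in tokens if w in for_phrases)
    a = sum(1 for w in tokens if w in against_phrases)
    if f > a:
        return "for"
    if a > f:
        return "against"
    return None  # ambiguous: discard rather than mislabel
```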
Constraints • Author constraints (ACs) • two posts written by the same author for the same debate domain should have the same stance • enforced by post-processing the output of a stance classifier • each post casts a probabilistic vote based on the classifier's output • majority voting determines the shared stance
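The author-constraint postprocessing can be sketched as averaging each author's probabilistic votes; the exact aggregation rule and the 0.5 threshold are assumptions for this illustration.

```python
from collections import defaultdict

def apply_author_constraints(posts):
    """posts: list of (author, p_for) pairs, where p_for is the
    classifier's probability that the post takes the 'for' side.
    All of an author's posts receive the stance preferred by the
    average of their probabilistic votes (assumed scheme)."""
    votes = defaultdict(float)
    counts = defaultdict(int)
    for author, p_for in posts:
        votes[author] += p_for
        counts[author] += 1
    return [
        "for" if votes[author] / counts[author] >= 0.5 else "against"
        for author, _ in posts
    ]
```

Note how a confidently classified post can flip a weakly classified post by the same author, which is the point of the constraint.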
Experiments and evaluation • 5-fold cross-validation • accuracy: the percentage of test instances correctly classified • in each fold experiment, three folds are used for model training, one fold for development, and one fold for testing
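The 3/1/1 split rotation described above can be written out as follows; the particular rotation order (development fold immediately after the test fold) is an assumption, since the slides do not specify it.

```python
def five_fold_splits(n_folds=5):
    """Rotate folds so each serves once as the test fold:
    3 folds train, 1 fold development, 1 fold test."""
    splits = []
    for i in range(n_folds):
        test = i
        dev = (i + 1) % n_folds  # assumed rotation order
        train = [f for f in range(n_folds) if f not in (test, dev)]
        splits.append((train, dev, test))
    return splits
```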
Results • Results for three selected points on each learning curve, which correspond to the three major columns in each sub-table.
Results • ‘F’: the fine-grained model • ‘W’: only n-gram features • ‘A’: Anand et al.'s (2011) features • ‘A+FS’: Anand et al.'s features plus frame-semantic features • The last two rows: noisily labeled documents (N) and author constraints (AC) are added incrementally to A+FS.
Results • learning curves for HMM and HMMF for the four domains • the best-performing configuration is A+FS+N+AC, followed by A+FS+N and then A+FS
Discussion
Thanks
Unigram baseline • a list of words, each appearing in the training data at least 10 times and associated with document stance c at least 70% of the time • i.e., a list of words frequently appearing in the training data that are indicative of a document's stance • p(w) = #w / #(w in corpus)
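Building that stance lexicon can be sketched directly from the two thresholds; the input format (tokenized posts paired with gold stance labels) is an assumption of this sketch.

```python
from collections import Counter

def stance_lexicon(posts, min_count=10, min_assoc=0.7):
    """posts: list of (tokens, stance) pairs. Keep words seen at least
    `min_count` times whose occurrences carry a single stance at least
    `min_assoc` of the time; map each kept word to that stance."""
    total = Counter()
    by_stance = Counter()
    for tokens, stance in posts:
        for w in tokens:
            total[w] += 1
            by_stance[(w, stance)] += 1
    lexicon = {}
    for (w, stance), k in by_stance.items():
        if total[w] >= min_count and k / total[w] >= min_assoc:
            lexicon[w] = stance
    return lexicon
```

For example, a word seen 10 times, 8 of them in "against" posts, clears both thresholds (10 ≥ 10 and 0.8 ≥ 0.7) and enters the lexicon as an "against" indicator, while a frequent but stance-neutral word does not.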