Stance Classification for Fact-Checking
Lecture: Web Science, 04.06.2019, Luca Brandt
Table of Contents • Introduction • Motivation • Fact-Checking Process • Paper 1 – Fake News Challenge • Paper 2 – Relevant Document Discovery for Fact-Checking Articles • Future Work
Introduction • What are Fake News? • Propaganda (not objective information) • Deliberate disinformation/hoaxes • Reporters paying sources for stories • Made-up stories • Problems with Fake News • Spreading misinformation • Reduces trust in news media • Manipulating society
Introduction • What is Fact-Checking? • The process of checking the veracity and correctness of a claim/statement • In non-fictional texts • A classic journalism task • Claim • A statement made by a politician • A story published by another journalist • A rumor on social media • Etc.
Example Claim
Motivation • Why Fact-Checking? • The goal is to provide a verdict on whether a claim is true, false, or mixed • Fight misinformation • Identify Fake News • Provide context that helps users understand information better • Fact-Checking and AI • Automatically detect Fake News • Automatically gather documents relevant to a claim
Fact-Checking Process • Given a claim/statement: • Find documents relevant to the claim • A classic information retrieval task • Understand the stance of the relevant documents • Stance classification, a classification problem • Give a verdict on whether the claim is true or false • Also a classification problem
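To make the three stages concrete, here is a minimal Python sketch of the pipeline; every function body is a hypothetical placeholder heuristic, not the method from either paper.

```python
# Minimal sketch of the three-stage fact-checking pipeline described above.
# All function bodies are illustrative placeholders.
from typing import List

def retrieve_documents(claim: str, corpus: List[str]) -> List[str]:
    """Stage 1: classic IR -- here just naive keyword overlap."""
    terms = set(claim.lower().split())
    return [doc for doc in corpus if terms & set(doc.lower().split())]

def classify_stance(claim: str, doc: str) -> str:
    """Stage 2: stance classification (placeholder heuristic)."""
    return "agree" if "true" in doc.lower() else "discuss"

def verdict(stances: List[str]) -> str:
    """Stage 3: aggregate document stances into a verdict (majority vote)."""
    agree, disagree = stances.count("agree"), stances.count("disagree")
    if agree == disagree:
        return "mixed"
    return "true" if agree > disagree else "false"

corpus = ["The claim is true according to officials.",
          "Experts dispute the statement."]
docs = retrieve_documents("the claim", corpus)
print(verdict([classify_stance("the claim", d) for d in docs]))
```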
Fake News Challenge (FNC-1) • Fosters the development of AI techniques to detect Fake News • 50 teams from industry and academia participated • Task: stance detection for an entire document • Learn a classifier f: (document, headline) -> stance in {AGR, DSG, DSC, UNR} • Dataset: 300 topics (claims) with 5–20 documents each • Every document is summarized into a headline • Each document is matched with every headline to generate the dataset
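A small sketch of how such a pairing step could look; the toy records and the rule that cross-topic pairs become "unrelated" are illustrative assumptions, not the challenge's actual generation code.

```python
# Sketch of the FNC-1-style pairing: every (headline, body) combination becomes
# an instance; pairs from different topics are labeled "unrelated".
documents = [  # (topic_id, headline, body, stance_toward_own_topic)
    (1, "Robert Plant rejects deal", "Led Zeppelin's Plant turned down ...", "agree"),
    (2, "Crabzilla photo is fake", "The giant crab image was debunked ...", "agree"),
]

dataset = []
for topic_h, headline, _, _ in documents:
    for topic_b, _, body, stance in documents:
        label = stance if topic_h == topic_b else "unrelated"
        dataset.append((headline, body, label))

for headline, body, label in dataset:
    print(f"{label:9s} | {headline} -> {body[:30]}...")
```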
A Retrospective Analysis of the Fake News Challenge Stance Detection Task • By Hanselowski et al. • 13 June 2018 • Main contributions: • First paper to summarize and analyze the FNC-1 • Reproduction of the results of the top-3 performers • Proposed a new evaluation metric • Proposed a new model
Top-3 Performers • 1. TalosComb • Weighted average of a deep convnet and a gradient-boosted decision tree • TalosCNN uses pre-trained word2vec embeddings • TalosTree is based on word counts, TF-IDF, sentiment, and word2vec embeddings • 2. Athene • Multilayer perceptron (MLP) with 6 hidden layers and handcrafted features • Unigrams, cosine similarity, topic models • 3. UCL Machine Reading (UCLMR) • Also an MLP, but with only 1 hidden layer • Term-frequency vectors of the 5,000 most frequent unigrams • Cosine similarity between the TF-IDF vectors of headline and document
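A minimal sketch in the spirit of the UCLMR feature setup, assuming scikit-learn: term-frequency vectors for headline and body plus the cosine similarity of their TF-IDF vectors, fed into a one-hidden-layer MLP. The toy data and hyperparameters are illustrative, not the original configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.neural_network import MLPClassifier

headlines = ["Robert Plant rejects 800m Led Zeppelin deal", "Giant crab photo surfaces"]
bodies = ["The Led Zeppelin singer turned down the offer.", "The crab image is a hoax."]
labels = ["agree", "disagree"]

# Vocabulary capped at the 5,000 most frequent terms, as in the slide above.
tf = CountVectorizer(max_features=5000).fit(headlines + bodies)
tfidf = TfidfVectorizer(max_features=5000).fit(headlines + bodies)

def features(headline, body):
    h_tf = tf.transform([headline]).toarray()[0]
    b_tf = tf.transform([body]).toarray()[0]
    h_ti = tfidf.transform([headline]).toarray()[0]
    b_ti = tfidf.transform([body]).toarray()[0]
    denom = np.linalg.norm(h_ti) * np.linalg.norm(b_ti) or 1.0
    cos = h_ti @ b_ti / denom  # cosine similarity of TF-IDF vectors
    return np.concatenate([h_tf, b_tf, [cos]])

X = np.array([features(h, b) for h, b in zip(headlines, bodies)])
clf = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500).fit(X, labels)
print(clf.predict(X))
```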
Problems with the Metric and Dataset • Hierarchical metric • 0.25 points for correctly classifying a pair as related {AGR, DSG, DSC} or unrelated {UNR} • A further 0.75 points for correctly classifying a related pair as AGR, DSG, or DSC • But the related classes are imbalanced • Predicting related vs. unrelated is not difficult (the best systems reach 99% on UNR) • Correctly predicting related vs. unrelated and then always picking DSC would achieve a relative FNC-1 score of 0.833 -> better than the winner
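A worked check of that claim; the class distribution below is only the approximate FNC-1 test distribution, so the exact figure may vary slightly.

```python
# Relative FNC-1 score of the naive baseline: perfect related/unrelated split,
# then always predict "discuss" (DSC) for related pairs.
dist = {"unrelated": 0.722, "discuss": 0.176, "agree": 0.075, "disagree": 0.027}
related = 1.0 - dist["unrelated"]

# 0.25 for every correct related/unrelated decision, plus 0.75 more when the
# stance of a related pair is also correct.
naive = 0.25 * dist["unrelated"] + 0.25 * related + 0.75 * dist["discuss"]
best_possible = 0.25 * dist["unrelated"] + 1.00 * related

print(f"relative FNC-1 score: {naive / best_possible:.3f}")  # ~0.833
```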
Their Model and Metric • F1m metric • Class-wise F1 scores, macro-averaged into the F1m score • F1 = 2 · precision · recall / (precision + recall) • Not affected by the large size of the majority class • The naive approach (predicting UNR vs. related and then always DSC) only reaches F1m = 0.444 • stackLSTM • Combines the best features from their feature analysis • Concatenated GloVe word embeddings are fed through 2 stacked LSTMs • Captures the meaning of a whole sentence • The hidden state of the LSTMs is fed through a 3-layer neural network • Softmax to obtain class probabilities
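A minimal sketch of the stackLSTM architecture, assuming PyTorch; the embedding weights are random stand-ins for GloVe and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class StackLSTM(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=300, hidden=100, classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # load GloVe weights here
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)
        self.mlp = nn.Sequential(                        # 3-layer feed-forward net
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, classes),
        )

    def forward(self, token_ids):
        emb = self.embed(token_ids)            # (batch, seq, emb_dim)
        _, (h_n, _) = self.lstm(emb)           # h_n: (num_layers, batch, hidden)
        logits = self.mlp(h_n[-1])             # final hidden state of top LSTM
        return torch.softmax(logits, dim=-1)   # stance probabilities

model = StackLSTM()
probs = model(torch.randint(0, 10000, (2, 20)))  # batch of 2 toy sequences
print(probs.shape)  # torch.Size([2, 4])
```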
Reproduction of Results
Pros & Cons • Pros: • First papertosummarize and analyzethe FNC-1 • First papertoreproducetheresults • Proposed a bettermetric • Proposed a newmodelbetterthanthestateoftheart • Cons: • Proposednewmodel still haslowaccuracyofDisagreeingclass Luca Brandt Stance Classification Web Science 2019
Relevant Document Discovery for Fact-Checking Articles • Paper by Wang et al. • 23 April 2018 • Main contributions: • End-to-end system for relevant document discovery for fact-checking articles • Beats the state of the art in stance classification • Beats the state of the art in relevance classification
Fact-Checking Articles • Adopted the Schema.org ClaimReview markup • Provides the structure of an article • Key fields on top of the content: • Claim • Claimant • Verdict • The structured fields cannot provide the documents relevant to a claim • Identifying claim-relevant documents is extremely useful
Fact-Checking Article & Claim-Relevant Doc
Overview of Their System
Candidate Generation • Via navigation • Outgoing links from the fact-checking article • But most of them are not relevant • Via search with Google • Key challenge: generating the right set of queries • Texts from the title and the claim • Title and claim text transformed with entity annotations • Click-graph queries • Combining both generates about 2,400 related documents per article
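A hypothetical sketch of the query-generation idea: the raw title and claim text, plus variants in which entity mentions are replaced by canonical names. The entity annotator is a stand-in; the paper's actual method is not reproduced here.

```python
def generate_queries(title: str, claim: str, entities: dict) -> list:
    """Build search queries from raw text and entity-annotated variants."""
    queries = [title, claim]
    for text in (title, claim):
        canonical = text
        for mention, canonical_name in entities.items():
            canonical = canonical.replace(mention, canonical_name)
        if canonical != text:
            queries.append(canonical)  # entity-normalized variant
    return queries

queries = generate_queries(
    "Did Plant turn down 800m?",
    "Robert Plant rejected a Led Zeppelin reunion deal",
    {"Plant": "Robert Plant"},  # toy output of an entity annotator
)
print(queries)
```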
Relevance Classification • Classifier M: (f, d) -> {relevant, irrelevant} • Features: • Build confidence vectors of entities • Cosine similarity between the confidence vectors of • the claim and the text/sentences/paragraphs of the related doc • a sentence of the fact-checking article and a sentence of the related doc • the whole documents • Publication date • Gradient-boosted decision tree • Combines all features and predicts relevant or irrelevant
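A rough sketch of the relevance classifier, assuming scikit-learn: cosine similarities between entity confidence vectors at several granularities, plus a date feature, fed to a gradient-boosted decision tree. All entity scores and labels below are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Toy entity-confidence vectors over a shared entity vocabulary.
claim_vec = np.array([0.9, 0.0, 0.4])
doc_vec = np.array([0.8, 0.1, 0.5])
doc_sentence_vecs = [np.array([0.7, 0.0, 0.6]), np.array([0.0, 0.2, 0.0])]

features = [
    cosine(claim_vec, doc_vec),                            # whole documents
    max(cosine(claim_vec, s) for s in doc_sentence_vecs),  # best-matching sentence
    1.0,  # placeholder publication-date feature
]

# Train on invented labeled pairs; real training data comes from the corpus.
X = np.array([features, [0.1, 0.05, 0.0]])
y = ["relevant", "irrelevant"]
clf = GradientBoostingClassifier().fit(X, y)
print(clf.predict([features]))
```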
Stance Classification • Build a model M: (f, d) -> {contradict, support} • Similarity alone is not suited for stance classification • Instead, find key contradicting patterns in contexts similar to the claim • Collected 3.1k (claim, contradicting statement) pairs • Built a 900-dimensional lexicon from the uni- and bi-grams with the highest probability • Uni-grams: hoax, fake, purportedly, rumor • Bi-grams: made up, fact check, not true, no evidence
Stance Classification • From the relevant doc, use title, headline, and text, and prune away text whose similarity to the claim is below a threshold • Concatenate each remaining text with the sentence before and after it -> key components • Extract the uni- and bi-grams of the key components -> final feature vector • Use a gradient-boosted decision tree for the prediction
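A simplified sketch of this feature extraction: keep only sentences sufficiently similar to the claim, expand each with its neighboring sentences, and count lexicon n-grams in the resulting key components. The threshold, similarity measure, and mini-lexicon are illustrative stand-ins for the paper's versions.

```python
LEXICON = ["hoax", "fake", "purportedly", "rumor", "not true", "no evidence"]

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def key_components(claim: str, sentences: list, threshold: float = 0.1) -> list:
    """Keep similar sentences, each expanded with one neighbor on either side."""
    components = []
    for i, sent in enumerate(sentences):
        if jaccard(claim, sent) >= threshold:
            start, end = max(i - 1, 0), min(i + 2, len(sentences))
            components.append(" ".join(sentences[start:end]))
    return components

def lexicon_features(components: list) -> list:
    """Count occurrences of each lexicon uni-/bi-gram in the key components."""
    text = " ".join(components).lower()
    return [text.count(gram) for gram in LEXICON]

sentences = ["The giant crab photo is a hoax.", "There is no evidence it is real.",
             "In other news, it rained today."]
comps = key_components("giant crab photo real", sentences)
print(lexicon_features(comps))  # feature vector for the GBDT
```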
Pros & Cons • Pros: • New stateoftheartstanceclassificationalgorithm • Proposedwhole end-to-end systemfor relevant documentdiscovery • Cons: • Not providingtheirdataset • Not providingthedistributionofthedataset • Not providingthe per classscores • Evaluation in isolation • IgnoredtheDiscussingclass Luca Brandt Stance Classification Web Science 2019
Conclusion • What are Fake News • What is Fact-Checking • Fake News Challenge • Top-3 performers • stackLSTM • End-to-end system for relevant document discovery • The Disagree class is still not predicted well
Future Work • All features are text-based • Include non-textual data as features • Videos -> image & speech recognition • Social media pages -> embedded image/graphic information • Develop ML techniques with deeper semantic understanding • Not relying on lexical features • The Disagree class has to be predicted better
Thank you for your attention! Any questions?
Dataset of Paper 2 • Unlabeled corpus: 14,731 fact-checking articles by highly reputable fact-checkers • Relevance-labeled corpus: their candidate generation algorithm yields about 2,400 related documents per fact-checking article, 33.5M in total • With crowd workers and balancing of positive and negative examples, a total of 8,000 (claim, doc) pairs • Crowd-working question: does this doc address the claim? • Stance-labeled corpus: randomly sampled 1,200 of the positive instances of the relevance-labeled corpus and crowdsourced -> support, contradict, neither, can't tell • For 12%, the workers couldn't agree -> removed • Manual corpus: to measure candidate generation, randomly sampled 450 fact-checking articles and let crowd workers search for candidates
Features Description • Bag of words: • 1- and 2-grams with a 5,000-token vocabulary for headline and doc • Added a negation flag "_NEG" as a prefix to every word between special negation words like "not", "never", "no" and the next punctuation mark • Topic models: • Non-negative matrix factorization • Latent semantic indexing • Latent Dirichlet allocation • Similarity between the topic models of headlines and bodies
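The "_NEG" rule is easy to pin down in code; a small sketch with deliberately simple tokenization (the original implementation may differ):

```python
import re

NEGATION_WORDS = {"not", "never", "no"}
PUNCTUATION = {".", ",", "!", "?", ";", ":"}

def add_negation_flags(text: str) -> list:
    """Prefix every token after a negation word with "_NEG" until punctuation."""
    tokens = re.findall(r"\w+|[.,!?;:]", text.lower())
    tagged, negating = [], False
    for tok in tokens:
        if tok in PUNCTUATION:
            negating = False  # punctuation ends the negated span
            tagged.append(tok)
        elif negating:
            tagged.append("_NEG" + tok)
        else:
            tagged.append(tok)
            if tok in NEGATION_WORDS:
                negating = True
    return tagged

print(add_negation_flags("This is not true at all, say experts."))
# ['this', 'is', 'not', '_NEGtrue', '_NEGat', '_NEGall', ',', 'say', 'experts', '.']
```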
Feature Description II • Lexicon-based features • Based on lexicons which hold the sentiment/polarity of each word • Computed separately for headline and body • Count positively and negatively polarized words -> features • Find the maximum positive and negative polarity -> features • Last word with negative or positive polarity -> feature • Refuting-words list ("fake", "hoax") -> features • Concatenate all of the features above • Readability features • Measured with different metrics (e.g. SMOG, Flesch-Kincaid, Gunning fog)
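A condensed sketch of both feature groups, assuming a toy polarity lexicon and the textstat package (pip install textstat) for the readability metrics; all lexicon values here are invented.

```python
import textstat

POLARITY = {"good": 1, "great": 1, "bad": -1, "fake": -1, "hoax": -1}
REFUTING = {"fake", "hoax"}

def lexicon_features(text: str) -> dict:
    """Polarity counts, extrema, last polarized word, and refuting-word count."""
    words = text.lower().split()
    pols = [(w, POLARITY[w]) for w in words if w in POLARITY]
    return {
        "n_pos": sum(1 for _, p in pols if p > 0),
        "n_neg": sum(1 for _, p in pols if p < 0),
        "max_pos": max((p for _, p in pols if p > 0), default=0),
        "min_neg": min((p for _, p in pols if p < 0), default=0),
        "last_polarity": pols[-1][1] if pols else 0,
        "n_refuting": sum(1 for w in words if w in REFUTING),
    }

body = "Experts say the viral story is a hoax and the photo is fake."
features = {
    **lexicon_features(body),                        # lexicon-based features
    "smog": textstat.smog_index(body),               # readability metrics
    "fk_grade": textstat.flesch_kincaid_grade(body),
    "gunning_fog": textstat.gunning_fog(body),
}
print(features)
```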