Factual Claim Validation Models
Deep Learning Research & Application Center
October 2017
Claire Li
• Available fact checking tools
• ClaimBuster
• Google search API and other free ones
• Claim Validation Model with RNN
Available fact checking tools
Automated fact-checking projects vary in the kinds of sources they draw on, the kinds of claims they address, and the topics they cover.
A narrow scope is the key to practical tools for fact-checkers
• ClaimBuster
• currently targets political sentences
• based on machine learning models
• claim spotting as a ranking and classification task
• fake news detection as a stance classification task
Claim Validation with ClaimBuster
• Monitors & retrieves sentences
• Scoring sentences: classification & scoring models, with token features and part-of-speech (PoS) features
• Retrieving evidence: context from the Google search engine; answers from Wolfram Alpha & the Google Answer Box; verdicts from the above
• Similarity calculation: token similarity & semantic similarity from SEMILAR
Claim Validation with ClaimBuster
• Given a factual claim which has been scored
• Search a repository for similar claims that have already been fact-checked by professionals (claim matcher)
• A semantic similarity match (rated 3-10) spots the matched fact-checked claims
• Returns the truth rating if a match is found; otherwise falls back to the steps below
• ClaimBuster is not able to produce a verdict on its own
• Processes search engine results for evidence, selected by similarity to the input claim
• Uses question-answering systems
• Translates the natural language claim into questions
• Queries external knowledge bases (Google Answer Box and Wolfram Alpha) with the derived questions
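The claim-matcher step above can be sketched in a few lines. This is a toy illustration, not ClaimBuster's implementation: the repository entries and the 0.8 threshold are made up, and `difflib.SequenceMatcher` stands in for a real semantic similarity measure such as SEMILAR.

```python
from difflib import SequenceMatcher

# Toy repository of fact-checked claims (hypothetical entries for illustration).
FACT_CHECKED = [
    {"claim": "The unemployment rate fell to 4.1 percent last year.",
     "truth_rating": "true"},
    {"claim": "Crime has doubled in the last decade.",
     "truth_rating": "false"},
]

def claim_matcher(claim, repository, threshold=0.8):
    """Return the truth rating of the best-matching fact-checked claim,
    or None when nothing is similar enough (the system then falls back
    to evidence retrieval and question answering)."""
    best, best_score = None, 0.0
    for entry in repository:
        # Stand-in for semantic similarity (ClaimBuster uses SEMILAR).
        score = SequenceMatcher(None, claim.lower(),
                                entry["claim"].lower()).ratio()
        if score > best_score:
            best, best_score = entry, score
    return best["truth_rating"] if best_score >= threshold else None
```

An exact or near-exact match returns the stored verdict; an unrelated claim returns `None`, triggering the fallback steps.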
Search in a repository for similar claims that have already been fact-checked by professionals; response fields, e.g.
• claim (string): the matched fact-checked claim
• host (string): the source of the fact-check
• search (string): the search measure which yielded the match
• similarity_rating (number): 3-10 for a good match
• speaker (string): speaker of the fact-checked claim
• truth_rating (string): true, false, pants on fire, indeterminate
• url: the URL of the matched fact-check
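A response with these fields can be consumed directly as JSON. The payload below is hypothetical, shaped after the field list above; the host, speaker, and URL are placeholders.

```python
import json

# Hypothetical claim-matcher response, shaped after the fields listed above.
response = json.loads("""
{
  "claim": "Crime has doubled in the last decade.",
  "host": "politifact.com",
  "search": "semantic",
  "similarity_rating": 8,
  "speaker": "John Doe",
  "truth_rating": "false",
  "url": "https://example.org/fact-check/123"
}
""")

# A rating of 3-10 counts as a good match; surface the verdict only then.
verdict = None
if response["similarity_rating"] >= 3:
    verdict = response["truth_rating"]
```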
Processes search engine results for evidence based on similarity to the input claim; response fields, e.g.
• sentence (string): the anchor sentence, i.e. the one with a high similarity score to the input claim
• context (array[string]): some sentences to the left of the anchor + the anchor sentence + some sentences to the right of the anchor
• similarity_rating (number): 0-1, measured between the input claim and the anchor
• url (string): URL of the context
• host (string): the hostname of the URL
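Building the `context` field around an anchor sentence amounts to slicing a window out of the document. A minimal sketch, with window sizes that are illustrative rather than ClaimBuster's actual choice:

```python
def context_window(sentences, anchor_index, left=2, right=2):
    """Build a context for an anchor sentence: a few sentences to the
    left, the anchor itself, and a few sentences to the right.
    The window sizes are illustrative, not ClaimBuster's."""
    start = max(0, anchor_index - left)          # clamp at document start
    end = min(len(sentences), anchor_index + right + 1)  # clamp at end
    return sentences[start:end]

doc = ["s0", "s1", "s2", "s3", "s4", "s5"]
context = context_window(doc, anchor_index=3, left=1, right=1)
```

The clamping matters when the anchor sits near the start or end of the retrieved page, where a full window is not available.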
Use question-answering systems; response fields, e.g.
• answer_box_html (string, optional): complete raw HTML from which the justification was extracted, for Google Answer Boxes
• justification (string): either the text scraped from the Google Answer Box or the Wolfram Alpha response
• question (string): the question derived from the input claim and subsequently fed to the question answering system named in the source parameter
• source (string): either Google Answer Boxes or the Wolfram Alpha API
• truth_rating (string, optional): true, false, pants on fire, or indeterminate, if inferable
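Deriving the `question` field from a declarative claim can be illustrated with a deliberately naive transformation: move the copula to the front of an "X is/are Y" claim. Real systems use far richer NLP than this string manipulation; the function is a sketch only.

```python
def claim_to_question(claim):
    """Naive sketch of deriving a yes/no question from a declarative
    claim of the form "X is/are/was/were Y". Real claim-to-question
    translation needs proper parsing; this only handles one pattern."""
    words = claim.rstrip(".").split()
    for i, word in enumerate(words):
        if word.lower() in ("is", "are", "was", "were"):
            # Move the copula to the front: "X is Y." -> "Is X Y?"
            rest = words[:i] + words[i + 1:]
            return word.capitalize() + " " + " ".join(rest) + "?"
    return None  # no copula found; other question templates would be needed

question = claim_to_question("The unemployment rate is 4.1 percent.")
```

The resulting question is then sent to the QA backend named in `source`, and the answer becomes the `justification`.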
Use a world knowledge base of fact-checked statements
• Google Answer Box: what is the time in Hong Kong?
• Wolfram Alpha: how many undocumented people are in the United States?
Google Custom Search API & Wolfram|Alpha API pricing
• By default, the Google Custom Search API has a quota of 100 queries per day. If you exceed this quota, you can upgrade to 1,000 queries per day for one month for $5
Free Open Source Search Engines
• Information retrieval with free open source search engines
• Given the claims spotted, search for documents containing relevant fact checks or evidence
• A ranking and classification problem
• Apache Lucene, in Java, cross-platform
• fuzzy search: e.g. roam~0.8 finds terms similar in spelling to roam, with similarity at least 0.8
• proximity query: e.g. "barack michelle"~10
• range query: e.g. title:{Aida TO Carmen}
• phrase query: e.g. "new york"
• used by Infomedia, Bloomberg, and Twitter's real-time search
• Apache Solr (better for text search) and Elasticsearch (better for complex time-series search and aggregations)
• Solr and Elasticsearch are built on top of Lucene
• basic query: text:obama, all docs whose text field contains obama
• phrase query: text:"obama michelle"
• proximity query: text:"big analytics"~1 matches big analytics and big data analytics
• Boolean query: solr AND search OR facet NOT highlight
• range query: age:[18 TO 30]
• used by Netflix, eBay, Instagram, and Amazon CloudSearch
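The query strings above are passed to Solr's select handler as the `q` parameter. A minimal sketch of constructing such a request URL; the host, port, and core name (`claims`) are placeholders for a hypothetical local deployment, not a real server.

```python
from urllib.parse import urlencode

def solr_select_url(query,
                    base="http://localhost:8983/solr/claims/select"):
    """Build a Solr-style select URL for a given query string.
    The base URL is a placeholder for a hypothetical local core."""
    return base + "?" + urlencode({"q": query, "wt": "json"})

# Proximity query: "big analytics" with up to one intervening term.
url = solr_select_url('text:"big analytics"~1')
```

`urlencode` percent-escapes the quotes and colon, so query syntax survives the trip through HTTP intact.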
Claim Validation Model with RNN [1][2]
• Pipeline: Monitor Model → Claim Spotting Model → Claim Verdict Model (LSTMs) → Create & publish
• Verdict labels: true, mostly true, half true, half false, mostly false, false
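The verdict model above is LSTM-based. To show the gating mechanics the pipeline relies on, here is a single-unit, scalar LSTM step in plain Python; the weights are random placeholders, not a trained model, and a real verdict classifier would run vector-valued cells over word embeddings with a softmax over the six labels.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One step of a single-unit LSTM cell with scalar input and state.
    W maps each gate to a (w_x, w_h, b) triple; the values used below
    are random placeholders, not trained weights."""
    i = sigmoid(W["i"][0] * x + W["i"][1] * h_prev + W["i"][2])    # input gate
    f = sigmoid(W["f"][0] * x + W["f"][1] * h_prev + W["f"][2])    # forget gate
    o = sigmoid(W["o"][0] * x + W["o"][1] * h_prev + W["o"][2])    # output gate
    g = math.tanh(W["g"][0] * x + W["g"][1] * h_prev + W["g"][2])  # candidate
    c = f * c_prev + i * g    # new cell state: forget old, admit new
    h = o * math.tanh(c)      # new hidden state, gated by the output gate
    return h, c

random.seed(0)
W = {k: [random.uniform(-1, 1) for _ in range(3)] for k in "ifog"}
h, c = 0.0, 0.0
for x in [0.5, -0.3, 0.8]:    # a toy scalar input sequence
    h, c = lstm_step(x, h, c, W)
```

The final hidden state `h` is what a verdict classifier would feed into its output layer to choose among the six truth labels.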
Claim Verdict Model: Claim Validation
• Labels: true, mostly true, half true, half false, mostly false, false
Works with RNNs
[1] CNN- and LSTM-based Claim Classification in Online User Comments, COLING 2016
[2] Turing at SemEval-2017 Task 8: Sequential Approach to Rumour Stance Classification with Branch-LSTM
[3] Fake News Detection using Stacked Ensemble of Classifiers, nlpj2017
[4] Identification and Verification of Simple Claims about Statistical Properties, EMNLP 2015