Real-time topic detection with bursty n-grams: RGU participation in the SNOW 2014 challenge
Carlos Martin and Ayse Goker (Robert Gordon University)
SNOW Workshop, 8th April 2014
Outline
• Architecture diagram
• Results
• Future work
Architecture diagram
[Diagram: components Crawler, Entities Extractor and Solr; data labels Tweets, Tweets (English), Tweets (with Entities)]
Architecture diagram
[Diagram, full pipeline: components Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Topics Combiner, Query Builder, Topic Labeller and Solr; data labels Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets), Topics (+ label)]
Entities Extractor
• Extracts entities per tweet using Stanford NER (http://nlp.stanford.edu/software/CRF-NER.shtml).
• The 3-class model identifies Person, Location and Organization.
• Efficient enough for a real-time system.
Architecture diagram
[Diagram: components Crawler, Entities Extractor, BNgram and Solr; data labels Tweets, Tweets (English), Tweets (with Entities), Ranked topics]
BNgram approach
• Detection of bursty n-grams based on the df-idf score. Bursty entities, hashtags and URLs are also included in the approach. Regarding n-grams, only 2- and 3-grams are considered (no unigrams anymore).
• Variant of tf-idf: penalization of terms that were frequent in previous time slots.
• Terms containing hashtags, entities or URLs are boosted.
• Two previous time slots (s = 2) were considered in our experiments.
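The slide does not state the df-idf formula, so the following is only a minimal sketch under an assumed formulation: the term's document frequency in the current time slot is divided by a log-damped average of its document frequency over the `s` previous slots. The function name `df_idf`, the `boost` parameter and its value are hypothetical and serve only to illustrate the penalization and boosting ideas above.

```python
import math

def df_idf(df_current, df_previous, boost=1.0):
    """Assumed burstiness score for one term (not necessarily the authors' exact formula).

    df_current  -- number of tweets containing the term in the current time slot
    df_previous -- list with the term's tweet counts in the s previous time slots
    boost       -- hypothetical multiplier for terms containing hashtags, entities or URLs
    """
    s = len(df_previous)
    avg_previous = sum(df_previous) / s if s else 0.0
    # Terms that were already frequent in earlier slots get a larger denominator.
    return boost * (df_current + 1) / (math.log(avg_previous + 1) + 1)

# A term that is new in this slot scores higher than one that was already frequent.
new_term = df_idf(10, [0, 0])          # s = 2 previous slots, as in the experiments
old_term = df_idf(10, [10, 10])
boosted = df_idf(10, [0, 0], boost=1.5)  # 1.5 is an illustrative value only
```

With `s = 2`, a term seen in 10 tweets now and in none before gets a score of 11.0, while the same count preceded by two slots of 10 tweets each is damped to roughly 3.2.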
BNgram approach
• A “partial” membership clustering approach is an interesting alternative, as one term could belong to different clusters (for example, the entity “Obama” for the stories “Obama wins in Ohio” and “Obama wins in Illinois”).
• The Apriori clustering algorithm has been used in the experiments of the SNOW challenge.
• It explores maximal associations between terms based on the number of shared tweets.
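The idea of maximal associations can be sketched with a small, generic Apriori implementation: each tweet is treated as a transaction (the set of bursty terms it contains), term sets shared by at least `min_support` tweets are kept, and only the maximal sets are returned as clusters. This is an illustrative sketch, not the authors' code; the function names, the toy single-word terms and the support value are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every term set shared by >= min_support tweets."""
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}
    result = {}
    k = 1
    while current:
        for itemset in current:
            result[itemset] = support(itemset)
        k += 1
        # Candidate generation: join frequent (k-1)-sets into k-sets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        current = {c for c in candidates if support(c) >= min_support}
    return result

def maximal_itemsets(frequent):
    """Keep only the itemsets that are not strict subsets of another frequent itemset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Toy transactions: the bursty terms found in each tweet.
tweets = [
    frozenset({"obama", "wins", "ohio"}),
    frozenset({"obama", "wins", "ohio"}),
    frozenset({"obama", "wins", "illinois"}),
    frozenset({"obama", "wins", "illinois"}),
]
clusters = maximal_itemsets(frequent_itemsets(tweets, min_support=2))
```

In this toy run the two resulting clusters both contain "obama", which is the partial membership behaviour described above.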
BNgram approach
• Output: clusters of trending terms, with the tweets from the last time slot associated with them.
• A tweet should contain a minimum number of cluster terms to be included.
• Clusters are ranked by their bursty scores (maximum df-idf value of the topic terms).
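The output step above can be sketched as follows. This is a simplified illustration: the function name `build_topics`, the `min_terms` default and the case-insensitive substring matching are assumptions, and the scores are made-up numbers.

```python
def build_topics(clusters, scores, tweets, min_terms=2):
    """Attach tweets to each cluster and rank clusters by their maximum term score.

    clusters  -- iterable of term sets (the trending-term clusters)
    scores    -- dict mapping each term to its df-idf score
    tweets    -- tweet texts from the last time slot
    min_terms -- assumed minimum number of cluster terms a tweet must contain
    """
    topics = []
    for cluster in clusters:
        matched = [
            text for text in tweets
            # Simplification: a term "occurs" if it is a substring of the lower-cased tweet.
            if sum(term in text.lower() for term in cluster) >= min_terms
        ]
        topics.append({
            "terms": cluster,
            "tweets": matched,
            "score": max(scores[term] for term in cluster),
        })
    return sorted(topics, key=lambda topic: topic["score"], reverse=True)

clusters = [
    frozenset({"barack obama", "wins ohio"}),
    frozenset({"snow storm", "new york"}),
]
scores = {"barack obama": 3.0, "wins ohio": 5.5, "snow storm": 7.2, "new york": 2.1}
tweets = [
    "Barack Obama wins Ohio tonight",
    "Snow storm hits New York",
    "Barack Obama speaks",
]
ranked = build_topics(clusters, scores, tweets)
```

The third tweet contains only one cluster term, so it is not attached to any topic.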
Architecture diagram
[Diagram: components Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator and Solr; data labels Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls)]
Keyword Extractor and Topic Aggregator modules
• Topic Aggregator module:
  • Aggregates entities, hashtags and URLs per topic (coming from the topic tweets of the corresponding time slot), keeping their frequencies.
  • Keeps those whose frequency is higher than a threshold.
• Keyword Extractor module:
  • Extracts the main keywords (including n-grams) per topic (not extracted by the Topic Aggregator) using bursty terms from the clusters.
  • Removes URLs, hashtags, user mentions, entities and acronyms.
  • Overlaps are also removed.
  • Keeps the df-idf scores as their weights.
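The Topic Aggregator step can be sketched as a frequency count followed by a threshold filter. The tweet structure (dicts with `entities`, `hashtags` and `urls` lists), the function name and the threshold value are assumptions made for illustration only.

```python
from collections import Counter

def aggregate_features(topic_tweets, threshold=1):
    """Count entities, hashtags and URLs over a topic's tweets and keep the frequent ones.

    topic_tweets -- list of dicts with optional "entities", "hashtags" and "urls" lists
    threshold    -- assumed cut-off; a feature is kept only if its frequency is higher
    """
    counts = {"entities": Counter(), "hashtags": Counter(), "urls": Counter()}
    for tweet in topic_tweets:
        for kind, counter in counts.items():
            # Count each feature at most once per tweet.
            counter.update(set(tweet.get(kind, [])))
    return {
        kind: {value: freq for value, freq in counter.items() if freq > threshold}
        for kind, counter in counts.items()
    }

topic_tweets = [
    {"entities": ["Obama"], "hashtags": ["#ohio"], "urls": []},
    {"entities": ["Obama", "Ohio"], "hashtags": ["#election"], "urls": []},
    {"entities": ["Obama"], "hashtags": ["#election"], "urls": ["http://example.org/a"]},
]
features = aggregate_features(topic_tweets, threshold=1)
```

The frequencies are kept alongside the surviving features so that later modules can still use them.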
Architecture diagram
[Diagram: components Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Topics Combiner and Solr; data labels Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics]
Topic Combiner module
• Merges similar topics from the same time slot.
• Based on the co-occurrence of keywords (unigrams), entities, hashtags and URLs from the compared topics.
• According to preliminary results, the Apriori algorithm makes this module more accurate, as one term could belong to different topics.
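The slide does not say how co-occurrence is turned into a merge decision, so the following is only one possible reading: two topics are merged when they share at least `min_shared` features. The greedy single-pass strategy, the function names, the topic structure and the threshold are all assumptions.

```python
def topic_features(topic):
    """Union of a topic's unigram keywords, entities, hashtags and URLs."""
    return (set(topic["keywords"]) | set(topic["entities"])
            | set(topic["hashtags"]) | set(topic["urls"]))

def merge_topics(topics, min_shared=2):
    """Greedily merge topics of one time slot that share at least min_shared features."""
    merged = []
    for topic in topics:
        features = topic_features(topic)
        for group in merged:
            if len(features & group["features"]) >= min_shared:
                group["features"] |= features
                group["members"].append(topic["id"])
                break
        else:
            # No existing group is similar enough: start a new one.
            merged.append({"features": set(features), "members": [topic["id"]]})
    return merged

topics = [
    {"id": 1, "keywords": ["wins", "election"], "entities": ["obama"],
     "hashtags": ["#ohio"], "urls": []},
    {"id": 2, "keywords": ["wins"], "entities": ["obama"],
     "hashtags": ["#vote"], "urls": []},
    {"id": 3, "keywords": ["storm"], "entities": ["new york"],
     "hashtags": [], "urls": []},
]
groups = merge_topics(topics, min_shared=2)
```

Topics 1 and 2 share "wins" and "obama" and are merged, while topic 3 stays on its own.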
Architecture diagram
[Diagram: components Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Topics Combiner, Query Builder and Solr; data labels Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets)]
Query Builder module
• Creates the final queries (Solr queries) that retrieve all the tweets related to a topic, also filtering by time to simulate a real-time scenario.
• 3 types of queries:
  • Keywords
  • Entities and hashtags
  • URLs
• If a topic has both keywords and entities, the keywords closest to the entities are selected.
• Image population: if tweets contain links to images (metadata), the images are added to the topic.
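The three query types with a time filter could be assembled along these lines. This is a sketch only: the field names `text` and `created_at`, the use of `OR` between terms and the function name are assumptions, not details from the slides. It relies on Solr's standard `q` and `fq` parameters and range syntax `field:[start TO end]`.

```python
def build_queries(topic, start, end, text_field="text", time_field="created_at"):
    """Build one Solr parameter dict per query type, restricted to a time window.

    topic      -- dict with optional "keywords", "entities", "hashtags" and "urls" lists
    start, end -- ISO 8601 UTC timestamps bounding the simulated real-time window
    """
    def or_clause(values):
        quoted = " OR ".join('"{}"'.format(v.replace('"', '\\"')) for v in values)
        return "{}:({})".format(text_field, quoted)

    time_filter = "{}:[{} TO {}]".format(time_field, start, end)
    groups = {
        "keywords": topic.get("keywords", []),
        "entities_hashtags": topic.get("entities", []) + topic.get("hashtags", []),
        "urls": topic.get("urls", []),
    }
    # Query types with no terms are skipped.
    return {
        name: {"q": or_clause(values), "fq": time_filter}
        for name, values in groups.items() if values
    }

topic = {"keywords": ["wins", "election"], "entities": ["Obama"], "hashtags": ["#ohio"]}
queries = build_queries(topic, "2014-02-25T18:00:00Z", "2014-02-25T18:15:00Z")
```

Each returned dict can be passed as request parameters to a Solr select handler; keeping the time window in `fq` leaves the relevance scoring of `q` unaffected.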
Query Builder module
• Replies are also considered. Be careful with spam replies.
• Replies are not text-query dependent. More diversity?
• Sentiment analysis, extraction of relevant keywords.
Query Builder module
• Diverse tweets are computed based on cosine similarity.
• This approach could be more or less strict depending on the selected threshold.
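One way to read the diversity step is a greedy filter: a tweet is kept only if its cosine similarity to every tweet already kept is below the threshold. The bag-of-words vectors, the whitespace tokenization, the function names and the threshold value below are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def diverse_tweets(tweets, threshold=0.5):
    """Keep a tweet only if it is dissimilar enough to all tweets kept so far."""
    selected, vectors = [], []
    for text in tweets:
        vector = Counter(text.lower().split())
        if all(cosine(vector, other) < threshold for other in vectors):
            selected.append(text)
            vectors.append(vector)
    return selected

tweets = [
    "Obama wins in Ohio",
    "obama wins in ohio",
    "Snow storm hits New York",
]
kept = diverse_tweets(tweets, threshold=0.5)
```

A lower threshold makes the filter stricter (fewer, more distinct tweets), while a higher one lets near-duplicates through.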
Architecture diagram
[Diagram, full pipeline: components Crawler, Entities Extractor, BNgram, Keyword Extractor, Topic Aggregator, Topics Combiner, Query Builder, Topic Labeller and Solr; data labels Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets), Topics (+ label)]
Topic Labeller module
• BuzzFeed editor-in-chief Ben Smith: “Headlines sure look a lot like tweets these days.” (http://perryhewitt.com/5-lessons-buzzfeed-harvard/)
• For each topic tweet, a score is computed based on the following formula, where α = 0.8.
[Formula: not recoverable from the extracted slide text]
• The tweet with the highest score is selected as the topic label after cleaning it.
Topic Labeller module
• Example of tweets after cleaning them
• Granularity is still an issue: some topic labels are too general or too specific.
Future work
• Improve the Topic Combiner module: use of similarity measures.
• Further research on the use of replies and diverse tweets per topic.
• Improve the Topic Labeller module: granularity issue.
• Modifications in the Query Builder module: use of term weights (Solr).
Thank you!
E-mail address: c.j.martin-dancausa@rgu.ac.uk
Twitter account: @martincarloscit