Real-time topic detection with bursty ngrams: RGU participation in SNOW 2014 challenge
Carlos Martin and Ayse Goker (Robert Gordon University)
SNOW Workshop, 8th April 2014
Outline
• Architecture diagram
• Results
• Future work
Architecture diagram
[Figure: data collection stage. Components: Crawler, Entities Extractor, Solr. Data labels: Tweets, Tweets (English), Tweets (with Entities).]
Architecture diagram
[Figure: full pipeline. Components: Crawler, Entities Extractor, Solr, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner, Query Builder, Topic Labeller. Data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets), Topics (+ label).]
Entities Extractor
• Extracts entities per tweet using Stanford NER (http://nlp.stanford.edu/software/CRF-NER.shtml).
• The 3-class model identifies Person, Location and Organization.
• Efficient enough for a real-time system.
Architecture diagram
[Figure: pipeline up to the BNgram module. Components: Crawler, Entities Extractor, Solr, BNgram. Data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics.]
BNgram approach
• Detection of bursty ngrams based on the df-idf score. Bursty entities, hashtags and urls are also included in the approach. Regarding ngrams, only 2- and 3-grams are considered (no unigrams anymore).
• df-idf is a variant of tf-idf that penalizes terms that were frequent in previous time slots.
• Terms containing hashtags, entities or urls are boosted.
• Two previous time slots (s = 2) were considered in our experiments.
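The slides do not spell out the df-idf formula. The sketch below assumes the formulation commonly given for the BNgram method: the document frequency of a term in the current time slot, divided by a logarithmic penalty on its average document frequency over the s previous slots. The function name `df_idf` and the `boost` parameter (and any value passed to it) are illustrative assumptions, not values taken from the talk.

```python
import math


def df_idf(df_current, df_previous, boost=1.0):
    """Assumed df-idf: current document frequency, penalised by the average
    document frequency over the previous time slots.

    df_current  -- number of tweets containing the term in the current slot
    df_previous -- list of document frequencies, one per previous slot
    boost       -- hypothetical multiplier for terms with entities/hashtags/urls
    """
    s = len(df_previous)
    avg_previous = sum(df_previous) / s if s else 0.0
    return boost * (df_current + 1) / (math.log(avg_previous + 1) + 1)


# Two previous time slots (s = 2), as in the experiments.
new_story = df_idf(40, [0, 0])      # term unseen before: no penalty
steady_term = df_idf(40, [38, 42])  # term already frequent: penalised
boosted = df_idf(40, [0, 0], boost=1.5)
```

With the same current frequency, the term that was already frequent in the previous slots gets a much lower score than the new one, which is the "penalization" the slide refers to.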
BNgram approach
• A “partial” membership clustering approach is an interesting alternative, as one term could belong to different clusters (for example, the entity “Obama” for the stories “Obama wins in Ohio” and “Obama wins in Illinois”).
• The Apriori clustering algorithm has been used in the experiments for the SNOW challenge.
• It explores maximal associations between terms based on the number of shared tweets.
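How Apriori is configured in the system is not described in the slides. As a rough illustration of the idea, the following is a generic level-wise (Apriori-style) search for maximal term sets that share at least `min_shared` tweets; the function name, the threshold and the toy data are all assumptions.

```python
from itertools import combinations


def maximal_term_sets(tweets, terms, min_shared):
    """Return the maximal sets of terms that co-occur in >= min_shared tweets.

    tweets -- list of sets, each holding the bursty terms found in one tweet
    terms  -- iterable of candidate bursty terms
    """
    def support(term_set):
        return sum(1 for tweet in tweets if term_set <= tweet)

    level = [frozenset([t]) for t in sorted(terms)
             if support(frozenset([t])) >= min_shared]
    frequent = list(level)
    while level:
        # Join sets that differ by exactly one term, keep the frequent ones.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c) >= min_shared]
        frequent.extend(level)
    # A set is maximal if no other frequent set strictly contains it.
    return [f for f in frequent if not any(f < g for g in frequent)]


tweets = [
    {"obama", "wins", "ohio"},
    {"obama", "wins", "ohio"},
    {"obama", "wins", "illinois"},
    {"obama", "wins", "illinois"},
]
clusters = maximal_term_sets(
    tweets, {"obama", "wins", "ohio", "illinois"}, min_shared=2)
```

On this toy input the search yields two clusters, one for Ohio and one for Illinois, and "obama" is a member of both, which is the partial membership behaviour described above.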
BNgram approach
• Output: clusters of trending terms, with tweets from the last time slot associated to them.
• A tweet should contain a minimum number of cluster terms to be included.
• Clusters are ranked by their bursty scores (maximum df-idf value of the topic terms).
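A minimal sketch of this output step, assuming clusters are sets of terms, each tweet carries a precomputed set of terms, and df-idf scores are available per term. The function name, the data layout and the value `min_terms=2` are hypothetical.

```python
def build_ranked_topics(clusters, tweets, scores, min_terms=2):
    """Attach tweets to clusters and rank clusters by their best df-idf score."""
    topics = []
    for cluster in clusters:
        # A tweet is included only if it has enough of the cluster's terms.
        matched = [t for t in tweets if len(cluster & t["terms"]) >= min_terms]
        topics.append({
            "terms": cluster,
            "tweets": matched,
            # Bursty score of the cluster: maximum df-idf among its terms.
            "score": max(scores[term] for term in cluster),
        })
    return sorted(topics, key=lambda topic: topic["score"], reverse=True)


clusters = [frozenset({"obama", "wins", "ohio"}), frozenset({"snow", "storm"})]
scores = {"obama": 5.0, "wins": 3.0, "ohio": 9.0, "snow": 12.0, "storm": 4.0}
tweets = [
    {"id": 1, "terms": {"obama", "wins", "today"}},
    {"id": 2, "terms": {"snow", "storm", "aberdeen"}},
    {"id": 3, "terms": {"obama", "speech"}},  # only one cluster term: dropped
]
ranked = build_ranked_topics(clusters, tweets, scores)
```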
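Architecture diagram
[Figure: pipeline up to the Keyword Extractor and Topic Aggregator modules. Components: Crawler, Entities Extractor, Solr, BNgram, Topic Aggregator, Keyword Extractor. Data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls).]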
Keyword Extractor and Topic Aggregator modules
• Topic Aggregator module:
  • Aggregates entities, hashtags and urls per topic (coming from the topic tweets of the corresponding time slot), keeping their frequencies.
  • Keeps those whose frequency is higher than a threshold.
• Keyword Extractor module:
  • Extracts the main keywords (including ngrams) per topic (those not extracted by the Topic Aggregator), using the bursty terms from the clusters.
  • Removes urls, hashtags, user mentions, entities and acronyms.
  • Overlaps are also removed.
  • Keeps the df-idf scores as keyword weights.
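The Topic Aggregator step can be sketched as a frequency count followed by a threshold filter. The tweet layout (`entities`, `hashtags`, `urls` lists), the function name and the threshold value below are assumptions made for illustration.

```python
from collections import Counter


def aggregate_topic_items(topic_tweets, threshold):
    """Count entities, hashtags and urls over a topic's tweets and keep
    only the items whose frequency is higher than the threshold."""
    aggregated = {}
    for kind in ("entities", "hashtags", "urls"):
        counts = Counter(
            item for tweet in topic_tweets for item in tweet.get(kind, []))
        aggregated[kind] = {
            item: n for item, n in counts.items() if n > threshold}
    return aggregated


topic_tweets = [
    {"entities": ["Obama"], "hashtags": ["#election"], "urls": []},
    {"entities": ["Obama", "Ohio"], "hashtags": ["#election"],
     "urls": ["http://example.com/a"]},
    {"entities": ["Obama"], "hashtags": [], "urls": []},
]
aggregated = aggregate_topic_items(topic_tweets, threshold=1)
```

Items seen only once ("Ohio" and the url) fall below the threshold and are discarded; the surviving items keep their frequencies.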
Architecture diagram
[Figure: pipeline up to the Topics Combiner module. Components: Crawler, Entities Extractor, Solr, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner. Data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics.]
Topic Combiner module
• Merges similar topics from the same time slot.
• Based on the co-occurrence of keywords (unigrams), entities, hashtags and urls from the compared topics.
• According to preliminary results, the Apriori algorithm makes this module more accurate, as one term could belong to different topics.
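The merging rule is not specified beyond "co-occurrence". One simple reading, shown here purely as an assumption, is to merge two topics when they share at least `min_shared` items (keywords, entities, hashtags or urls). The function name and the single-pass greedy strategy are illustrative choices.

```python
def merge_similar_topics(topics, min_shared=2):
    """Greedily merge topics that share at least min_shared items.

    topics -- list of sets; each set mixes a topic's unigram keywords,
              entities, hashtags and urls
    """
    merged = []
    for topic in topics:
        for existing in merged:
            if len(existing & topic) >= min_shared:
                existing |= topic  # absorb the similar topic
                break
        else:
            merged.append(set(topic))
    return merged


topics = [
    {"obama", "ohio", "#election"},
    {"obama", "#election", "vote"},
    {"snow", "storm"},
]
merged = merge_similar_topics(topics)
```

A single greedy pass is order dependent and does not re-check earlier merged topics against each other; it is kept here only because it is short.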
Architecture diagram
[Figure: pipeline up to the Query Builder module. Components: Crawler, Entities Extractor, Solr, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner, Query Builder. Data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets).]
Query Builder module
• Creates the final queries (Solr queries) to retrieve all the tweets related to the topic, also filtering by time (simulating a real-time scenario).
• 3 types of queries:
  • Keywords
  • Entities and hashtags
  • Urls
• If a topic has both keywords and entities, the keywords closest to the entities are the ones selected.
• Image population: if tweets contain links to images (metadata), the images are added to the topic.
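As an illustration of the three query types, the sketch below only builds query strings in standard Solr syntax (field queries and a `[start TO end]` range filter). The field names `text`, `urls` and `created_at`, the use of OR between values, and the function name are assumptions; the actual index schema is not given in the slides.

```python
def build_solr_queries(topic, start, end, text_field="text",
                       url_field="urls", time_field="created_at"):
    """Return (queries, time_filter) for one topic.

    queries     -- dict with up to three query strings, one per query type
    time_filter -- range filter restricting results to the time window
    """
    def any_of(field, values):
        quoted = " OR ".join(
            '"{}"'.format(value.replace('"', '\\"')) for value in values)
        return "{}:({})".format(field, quoted)

    queries = {}
    if topic.get("keywords"):
        queries["keywords"] = any_of(text_field, topic["keywords"])
    entities_and_hashtags = topic.get("entities", []) + topic.get("hashtags", [])
    if entities_and_hashtags:
        queries["entities_hashtags"] = any_of(text_field, entities_and_hashtags)
    if topic.get("urls"):
        queries["urls"] = any_of(url_field, topic["urls"])
    time_filter = "{}:[{} TO {}]".format(time_field, start, end)
    return queries, time_filter


topic = {
    "keywords": ["obama wins"],
    "entities": ["Obama"],
    "hashtags": ["#ohio"],
    "urls": ["http://example.com/a"],
}
queries, time_filter = build_solr_queries(
    topic, "2014-02-25T18:00:00Z", "2014-02-25T18:15:00Z")
```

Each query string could be sent as the `q` parameter and the time filter as a filter query (`fq`), so that only tweets up to the current time slot are returned.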
Query Builder module
• Replies are also considered. Be careful with spam replies.
• Replies are not text-query dependent. More diversity?
• Sentiment analysis, extraction of relevant keywords.
Query Builder module
• Diverse tweets are computed based on cosine similarity.
• This approach could be more or less strict depending on the selected threshold.
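One way to read this step, offered as an assumption: keep a tweet only if its cosine similarity to every tweet already kept is below a threshold. The bag-of-words vectors, the whitespace tokenizer, the function names and the threshold value are all illustrative.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_diverse(tweets, threshold=0.5):
    """Greedily keep tweets that are dissimilar to all tweets kept so far."""
    kept = []
    for text in tweets:
        vector = Counter(text.lower().split())
        if all(cosine(vector, other) < threshold for _, other in kept):
            kept.append((text, vector))
    return [text for text, _ in kept]


tweets = [
    "Obama wins in Ohio",
    "obama wins in ohio",        # near-duplicate of the first tweet
    "Snow storm hits Aberdeen",
]
diverse = select_diverse(tweets, threshold=0.5)
```

Lowering the threshold makes the selection stricter (fewer, more distinct tweets); raising it lets more similar tweets through.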
Architecture diagram
[Figure: full pipeline including the Topic Labeller module. Components: Crawler, Entities Extractor, Solr, BNgram, Topic Aggregator, Keyword Extractor, Topics Combiner, Query Builder, Topic Labeller. Data labels: Tweets, Tweets (English), Tweets (with Entities), Ranked topics, Topics (+ keywords, entities, hashtags and urls), Merged topics, Topics (+ tweets), Topics (+ label).]
Topic Labeller module
• BuzzFeed editor-in-chief Ben Smith: “Headlines sure look a lot like tweets these days.” (http://perryhewitt.com/5-lessons-buzzfeed-harvard/)
• For each topic tweet, a score is computed based on the following formula [formula not preserved in this transcript], where α = 0.8.
• The tweet with the highest score is selected as the topic label after cleaning it.
Topic Labeller module
• Example of tweets after cleaning them.
• Granularity is still an issue: some topic labels are too general or too specific.
Future work
• Improve the Topic Combiner module: use of similarity measures.
• Further research on the use of replies and diverse tweets per topic.
• Improve the Topic Labeller module: granularity issue.
• Modifications in the Query Builder module: use of term weights (Solr).
Thank you!
E-mail address: c.j.martin-dancausa@rgu.ac.uk
Twitter account: @martincarloscit