Mining Newsworthy Topics from Social Media. Carlos Martin, David Corney and Ayse Goker (Robert Gordon University), Andrew MacFarlane (City University London). BCS SGAI Workshop on Social Media Analysis, 10th December 2013.
Outline • Introduction & Motivation • BNgram approach • Further modifications • Experiments • Demo • Conclusions and Future work • References
Introduction & Motivation • Newsworthy stories are increasingly being shared through social networking platforms such as Twitter and Reddit • Journalists use Social Media to rapidly discover stories and eye-witness accounts.
Introduction & Motivation • Other tools to detect newsworthy stories: • Twitter trends - http://www.twitter.com • Trendsmap - http://trendsmap.com/ • NewsWhip - http://www.newswhip.com/
Introduction & Motivation • Gap in the market: • Story descriptions are incomplete or unclear (based on the use of hashtags and entities) • Use of mainstream media • Proposal: an approach to detect newsworthy stories in real time from Twitter, where each story description is complete and posts from social network users are associated with each story • Journalists and news readers do not get overwhelmed.
BNgram approach • Detection of the most representative topics from a timeslot, with special emphasis on the temporal dimension of the data. • Detection of emerging phrases (word n-grams) based on the df-idft score, a variant of tf-idf. • Ranking of n-grams per timeslot sorted by df-idft, avoiding overlaps. • Boost factor: named entity recognition (Stanford), 3-class classifier (person, location and organization).
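The slide names the df-idft score without giving its formula. The sketch below uses the form we recall from the cited TMM paper (Aiello et al., 2013): the document frequency of an n-gram in the current slot divided by a log-damped average of its document frequency over the previous t slots. Treat the exact formula, the function name `df_idf_t` and the example boost value of 1.5 as assumptions rather than as the authors' implementation.

```python
import math


def df_idf_t(df_current, df_history, boost=1.0):
    """Score how strongly an n-gram is emerging in the current slot.

    df_current: number of tweets in the current slot containing the n-gram.
    df_history: counts for the same n-gram in the t previous slots.
    boost: multiplier applied when the n-gram contains a named entity
           (the value used here is an assumption, not taken from the slides).
    """
    t = len(df_history)
    avg_past = sum(df_history) / t if t else 0.0
    # Assumed form: (df_i + 1) / (log(average past df + 1) + 1)
    score = (df_current + 1) / (math.log(avg_past + 1) + 1)
    return boost * score


# An n-gram that suddenly appears outscores one that was always frequent.
bursty = df_idf_t(50, [1, 1, 1])
steady = df_idf_t(50, [50, 50, 50])
boosted = df_idf_t(9, [0, 0], boost=1.5)  # n-gram containing a named entity
```

Dividing by the damped history is what gives the score its temporal emphasis: a phrase is ranked highly only if it is frequent now relative to its recent past.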
BNgram approach • Hierarchical clustering of the top k n-grams with the highest df-idft scores. Topic score is computed as the maximum df-idft of its n-grams.
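A minimal sketch of this clustering step follows, in plain Python. The slides do not state the distance measure, linkage or cut-off, so the choices below (distance based on the fraction of shared tweets, group-average linkage, a threshold of 0.5) and all names and data are assumptions for illustration only.

```python
def distance(tweets_a, tweets_b):
    """Distance between two n-grams from the tweets they share.

    Both arguments are non-empty sets of tweet ids. Assumed measure:
    1 - shared / min(size), so identical usage gives 0 and no overlap gives 1.
    """
    shared = len(tweets_a & tweets_b)
    return 1.0 - shared / min(len(tweets_a), len(tweets_b))


def cluster_ngrams(tweets_by_ngram, threshold=0.5):
    """Agglomerative clustering with group-average linkage."""
    clusters = [[g] for g in tweets_by_ngram]

    def avg_dist(c1, c2):
        dists = [distance(tweets_by_ngram[x], tweets_by_ngram[y])
                 for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, i, j = min((avg_dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def topic_score(cluster, scores):
    """Topic score: the maximum df-idft among the topic's n-grams."""
    return max(scores[g] for g in cluster)


# Hypothetical data: tweet ids per n-gram and their df-idft scores.
example_tweets = {
    "barack obama wins": {1, 2, 3, 4},
    "wins wisconsin": {2, 3, 4},
    "wins california": {7, 8, 9},
}
example_scores = {"barack obama wins": 12.0, "wins wisconsin": 9.5,
                  "wins california": 7.0}
example_clusters = cluster_ngrams(example_tweets)
```

Because each n-gram ends up in exactly one cluster, an n-gram such as "wins california" can be left on its own, which is the weakness the later slides address.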
BNgram approach • Evaluation benchmark: comparison with four other TDT approaches (document-pivot and feature-pivot) and a baseline approach (LDA); see the TMM paper. • User-centred evaluation: • Collections: FA Cup, Super Tuesday and US Elections (tracking keywords). • Ground truth: set of representative topics (manually selected) corresponding to different timeslots, taken from mainstream media (MSM). • Timeslot size: FA Cup, 1 min; Super Tuesday and US Elections, 10 min. • Topics: 13 FA Cup, 22 Super Tuesday and 64 US Elections.
BNgram approach • Collections:
BNgram approach • Results – TMM paper
BNgram approach • Examples of topics
Further modifications • BNgram approach modifications: • Study of different types of n-grams. • Timeslots vs. Number of tweet slots • Clustering techniques have been tested for BNgram approach: Apriori and GMM algorithms. • New topic ranking technique has been considered.
N-grams • Word order is often essential to indicate meaning. For example, 'dog bites man' is not news, but 'man bites dog' is news. A bag-of-words approach cannot distinguish these cases. • Popular in NLP • In this work, by n-gram we refer to sequences of up to n consecutive terms • Copies of posts and RTs are very frequent in the Twitter space. Posts are focused: at most 140 characters.
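The point about word order can be illustrated with a short sketch. The function name `word_ngrams` and the whitespace tokenisation are assumptions made for the example; the slides do not describe the tokeniser used.

```python
def word_ngrams(text, n):
    """Return all sequences of 1 up to n consecutive terms in the text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + k])
            for k in range(1, n + 1)
            for i in range(len(tokens) - k + 1)]


not_news = word_ngrams("dog bites man", 3)
news = word_ngrams("man bites dog", 3)

# As bags of words (1-grams only) the two headlines are identical...
same_bag = set(word_ngrams("dog bites man", 1)) == set(word_ngrams("man bites dog", 1))
# ...but the longer n-grams keep the word order and tell them apart.
differs = set(not_news) != set(news)
```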
Timeslots vs. Number of tweet slots • What is the best timeslot size? • Small slot size: missed stories • Large slot size: delay in some stories (refresh rate) • Alternative: slots with a fixed number of tweets instead of a fixed time span, which needs only minimal changes to the approach.
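Splitting the stream by tweet count rather than by time can be sketched as a small generator. The name `tweet_count_slots` is hypothetical, and emitting the trailing partial slot is a design choice made for the example rather than something the slides specify.

```python
def tweet_count_slots(tweets, slot_size):
    """Group a stream of tweets into slots of a fixed number of tweets."""
    slot = []
    for tweet in tweets:
        slot.append(tweet)
        if len(slot) == slot_size:
            yield slot
            slot = []
    if slot:
        yield slot  # trailing partial slot


slots = list(tweet_count_slots(range(5), 2))
```

With count-based slots, a busy period produces slots more often and a quiet period less often, so the refresh rate follows the tweet rate.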
Clustering approaches • Weakness detected in our clustering technique: • Example: US elections n-gram ranking (sorted by df-idft): • Basic hierarchical clustering: incomplete stories. • From our example, the candidate clusters could be: • Cluster 1: Barack Obama wins + wins Wisconsin (complete) • Cluster 2: wins California (incomplete: who?) • Solution: new grouping techniques where one n-gram can be assigned to different clusters.
Clustering approaches – Gaussian Mixture Models (GMM) • Unsupervised method • Assigns probabilities (or strengths) of membership of each n-gram to each cluster: partial membership • Iterative approach: tries to find the parameters of the probability distribution that maximise the likelihood of the data. • Input: number of clusters, chosen using the Bayesian Information Criterion (BIC)
Clustering approaches – Gaussian Mixture Models (GMM) • Expectation-Maximisation, two steps: • E-step: estimate the probability that each point belongs to each cluster. • M-step: re-estimate the parameter vector of the probability distribution of each cluster. • The algorithm finishes when the distribution parameters converge or the maximum number of iterations is reached.
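The two EM steps can be sketched for a one-dimensional mixture in plain Python. This is a toy illustration only: the slides do not say which features represent each n-gram, so the points, the initial means, the variance floor and the function names below are all assumptions.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(points, init_means, max_iter=100, tol=1e-6):
    """Fit a 1-D Gaussian mixture with EM; returns parameters and memberships."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(max_iter):
        old_params = weights + means + variances
        # E-step: probability that each point belongs to each cluster.
        resp = []
        for x in points:
            dens = [weights[j] * normal_pdf(x, means[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each cluster.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(points)
            means[j] = sum(r[j] * x for r, x in zip(resp, points)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, points)) / nj
            variances[j] = max(var, 1e-6)  # floor avoids a collapsed component
        # Stop when the distribution parameters have converged.
        new_params = weights + means + variances
        if max(abs(a - b) for a, b in zip(new_params, old_params)) < tol:
            break
    return weights, means, variances, resp


toy_points = [0.0, 0.2, 0.4, 5.0, 5.2, 5.4]
weights, means, variances, memberships = fit_gmm_1d(toy_points, [0.0, 5.0])
```

Each row of `memberships` is a set of partial memberships that sums to one, which is what allows a single n-gram to contribute to more than one cluster.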
Clustering approaches - Apriori algorithm • Explores associations between n-grams based on the number of shared tweets. • Number of n-grams per association: each association contains from 1 n-gram up to the number of n-grams considered from the ranking. • An association is kept if the number of tweets shared by its n-grams is greater than a threshold (support value). • In a subsequent step, the maximal associations are obtained to avoid overlaps.
Clustering approaches - Apriori algorithm • From the previous example (if threshold is 3): • Candidate associations: #1, #2, #3, #1#2, #1#3 • Maximal associations: #1#2, #1#3
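The association step can be sketched as a level-wise search over sets of n-grams. The tweet ids below are hypothetical, chosen only so that a threshold of 3 reproduces the candidate and maximal associations listed on the slide; the function names are likewise assumptions.

```python
def frequent_associations(tweets_by_ngram, min_support):
    """Level-wise (Apriori-style) search for associations of n-grams.

    An association is kept when the number of tweets shared by all of its
    n-grams is greater than min_support.
    """
    level = {frozenset([g]): ids
             for g, ids in tweets_by_ngram.items() if len(ids) > min_support}
    frequent = dict(level)
    while level:
        next_level = {}
        items = list(level.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (a, ids_a), (b, ids_b) = items[i], items[j]
                union = a | b
                # Only join sets that differ in exactly one n-gram.
                if len(union) != len(a) + 1 or union in next_level:
                    continue
                shared = ids_a & ids_b
                if len(shared) > min_support:
                    next_level[union] = shared
        frequent.update(next_level)
        level = next_level
    return frequent


def maximal_associations(frequent):
    """Keep only associations that are not contained in a larger one."""
    return [s for s in frequent if not any(s < other for other in frequent)]


# Hypothetical tweet ids for the three ranked n-grams of the example.
example = {
    "#1": {1, 2, 3, 4, 5, 6, 7, 8},  # e.g. "Barack Obama wins"
    "#2": {1, 2, 3, 4},              # e.g. "wins Wisconsin"
    "#3": {5, 6, 7, 8},              # e.g. "wins California"
}
candidates = frequent_associations(example, min_support=3)
maximal = maximal_associations(candidates)
```

Here n-gram #1 appears in both maximal associations, so the story about California is no longer left without its subject.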
Topic ranking • The maximum df-idft n-gram approach is not the best alternative for these new clustering techniques • It is inconvenient for slots with active and diverse topics. [Figure: n-gram ranking (n-gram1, n-gram2) alongside the resulting topic ranking (topic1 to topic5)]
Topic ranking • Weighted topic-length approach, where s_t is the score of topic t, L_t is the length of the topic, L_max is the maximum number of terms in any topic from the current slot, N_t is the number of tweets in topic t and N_s is the number of tweets in the slot. Finally, α is a weighting term.
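The formula itself did not survive in this transcript. A form consistent with the symbols defined above, and the one we recall from the authors' related paper, is s_t = α · (L_t / L_max) + (1 − α) · (N_t / N_s). The sketch below implements that assumed form; the function name and the default α = 0.5 are illustrative assumptions.

```python
def weighted_topic_length_score(topic_len, max_len, topic_tweets, slot_tweets,
                                alpha=0.5):
    """Assumed form: alpha * L_t / L_max + (1 - alpha) * N_t / N_s."""
    length_part = topic_len / max_len         # longer topics describe more of a story
    volume_part = topic_tweets / slot_tweets  # share of the slot's tweets in the topic
    return alpha * length_part + (1 - alpha) * volume_part


# A 4-term topic (longest topic in the slot has 8 terms) covering
# 100 of the 1000 tweets in the slot.
score = weighted_topic_length_score(4, 8, 100, 1000, alpha=0.5)
```

Unlike the maximum df-idft ranking, two topics that share their top n-gram no longer tie under this score, because their lengths and tweet counts generally differ.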
Evaluation • We have estimated the starting and ending times of each event in the ground truth. [Figure: timeline from the starting time to the ending time of an event, covering slots i-3 to i; the top m topics from each slot are merged to evaluate the event]
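The merging step in the figure can be sketched as follows. How slots are matched to an event and whether duplicates are removed are not stated in the slides; the overlap test, the de-duplication and the function name below are assumptions.

```python
def merged_topics_for_event(slots, event_start, event_end, m):
    """Merge the top m topics of every slot that overlaps the event.

    slots: list of (slot_start, slot_end, ranked_topics) tuples, where
    ranked_topics is ordered from the highest to the lowest topic score.
    """
    merged = []
    for slot_start, slot_end, ranked_topics in slots:
        if slot_end > event_start and slot_start < event_end:
            for topic in ranked_topics[:m]:
                if topic not in merged:  # keep each topic once
                    merged.append(topic)
    return merged


# Hypothetical slots (times in minutes) with already ranked topics.
example_slots = [
    (0, 10, ["a", "b", "c"]),
    (10, 20, ["b", "d", "e"]),
    (20, 30, ["f", "g", "h"]),
]
merged = merged_topics_for_event(example_slots, event_start=5, event_end=18, m=2)
```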
Experiments – n-grams • Topic recall for different types of n-grams and three datasets using hierarchical clustering and maximum n-gram topic ranking techniques and fixing the slot size to 1000 tweets (similar patterns observed using other configurations)
Experiments – n-grams • Normalised area under the curve for the three datasets and its weighted average.
Experiments – slot size • Topic recall for different slot sizes using hierarchical clustering and weighted topic-length topic ranking techniques (3-grams). • Possible correlation between slot size and tweet rate in tweets per minute (Super Tuesday: 832 tpm, FA Cup: 1293 tpm, US elections: 2209 tpm) • The refresh rate of the UI should also be considered
Experiments – clustering and topic ranking techniques • Topic recall for different clustering techniques in the three datasets, using both topic ranking techniques (3-grams and slot size = 1500 tweets)
Experiments – clustering and topic ranking techniques • Normalised area under the curve
Demo • Social Sensor project – http://www.socialsensor.eu
Conclusions and Future work • New TDT approach based on the temporal dimension of the data and n-grams in the Twitter space • Improve tracking issues – ongoing • Trust and verification based on following newshounds – ongoing • Improve topic titles – ongoing • Better association of tweets to topics – ongoing • Improve evaluation methods/metrics • Smoothing techniques for df-idft computation • Entity recognition – other approaches (Illinois NLP tools,…) • Participation in TDT challenges (SNOW14)
References • Aiello, L., Petkos, G., Martin, C., Corney, D., Papadopoulos, S., Skraba, R., Goker, A., Kompatsiaris, I., Jaimes, A.: Sensing trending topics in twitter. Multimedia, IEEE Transactions on 15(6) (2013) 1268–1282 • Martin, C., Corney, D., Goker, A.: Finding newsworthy topics on Twitter. IEEE Computer Society Special Technical Community on Social Networking E-Letter 1(3) (September 2013) • Steve Schifferes, Nic Newman, Neil Thurman, David Corney, Ayse Göker, Carlos Martin. (2013). Identifying and verifying news through social media: Developing a user-centred tool for professional journalists. In The Future of Journalism Conference 2013, Cardiff, UK. • Spot the ball: Detecting sports events on Twitter. In proceedings of ECIR 2014, Amsterdam, Netherlands. (To appear)