
Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering

Georgiana Ifrim, Bichen Shi, Igor Brigadir. Insight Centre for Data Analytics, University College Dublin.



  1. Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering Georgiana Ifrim, Bichen Shi, Igor Brigadir Insight Centre for Data Analytics University College Dublin

  2. Outline • Background • Method Proposed • Method Details • Results • Future Work

  3. Background • Social media outlets (e.g., Twitter) play an increasing role in the cycle of news production • Journalists use Twitter for news selection and presentation • Twitter: • An endless, real-time, global stream of news • Large scale and very noisy (redundant, messy content) • Challenge: Extract (close to real-time) newsworthy topics/events/stories from the Twitter stream, in a format usable by news professionals (e.g., topic-timestamp, topic-headline, topic-tags, tweet-ids, photo-urls)

  4. Challenge • From this: • #Obama #follow #followme #followforfollow #followme #follower #followers #alwaysfollowback #followbackalways #teamfollowback • I VOTED !!! #OBAMA http://instagram.com/p/RsoNuMgLkr/  • @BarackObama #TeaamObama !!!! ✊🎋🎉🎊🇺🇸 • Om 12u zou de eersteuitslagbinnenzijn. Nu nog steeds niks. Dit trek ikniet. Wekker over 4u en we kijkendanwel. #obama #forward • My President is Black ★★★★★▄▄▄▄▄▄▄▄▄▄ ★★★★★▄▄▄▄▄▄▄▄▄▄ ★★★★★▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ ▄▄▄▄▄▄▄▄▄▄▄▄▄▄▄ #Obama2012 #Retweet • voted at school! @barackobama I love you! #Forward2012 🇺🇸💙 http://instagr.am/p/RtONyRC9Tm/  • Romney Romney Romney Romney Romney Romney Romney Romney!!!!!!!!!!!!!!!!!!!!!🇺🇸🇺🇸🇺🇸🇺🇸 #RomneyRyan2012

  5. Challenge • To this: • Obama wins Vermont • Romney wins Kentucky • Bernie Sanders wins Senate seat in Vermont • Romney wins Indiana

  6. Method Proposed • 1. Aggressive Data Filtering (to remove noise, to scale) • 2. Hierarchical Clustering of Tweets + Dendrogram Cutting (to obtain clusters without need of knowing #clusters a-priori) • 3. Ranking of Clusters (to favor news-like topics) • 4. Extracting Topic-Headlines (usable information) • 5. Re-clustering Topic-Headlines (to remove topic fragmentation) • 6. Extracting Final Topics (as presented to the user)

  7. Method Details • Software • Collecting Twitter streams • SNOW Challenge Code (based on Twitter4J API) • All other development (https://github.com/heerme/twitter-topics) • Python 2.7 + libraries: scipy, numpy, sklearn, nltk, json • Tweet-NLP: CMUTweetTagger (trained on tweets, entity detection) • Efficient clustering: fastcluster (C++ lib, interface to Python/R)

  8. Data Collection • US Presidential Elections 2012 • Collected from tweet ids (06-11-2012, 23:30 to 07-11-2012, 06:51) • 1,084,200 raw tweets, text (English + non-English, 252 MB) • Syria, Ukraine, Bitcoin 2014 • Collected from keywords + user ids (25-02-2014, 17:30 to 26-02-2014, 18:15) • 1,088,593 raw tweets, JSON (English + non-English, 4.3 GB) • 943,175 English tweets, JSON (3.8 GB) • 943,175 tweets, text (extract subset of fields from JSON object, 240 MB) • Replace re-tweet text with original tweet text
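A minimal sketch of the field extraction and re-tweet replacement listed on this slide. It assumes one tweet per line as a JSON object in the Twitter API v1.1 layout (`id_str`, `created_at`, `text`, and `retweeted_status` for re-tweets). The function name `extract_fields` and the set of kept fields are illustrative choices, not taken from the authors' code.

```python
import json


def extract_fields(raw_line):
    """Keep a small subset of fields from one raw tweet (JSON string).

    For a re-tweet, the text of the original tweet replaces the
    (often truncated) "RT @user: ..." text.
    """
    tweet = json.loads(raw_line)
    # Assumption: re-tweets embed the original tweet under "retweeted_status".
    source = tweet.get("retweeted_status", tweet)
    return {
        "id": tweet["id_str"],
        "created_at": tweet["created_at"],
        "text": source["text"],
    }
```

Keeping the id and timestamp of the re-tweet itself, while swapping in the original text, is one possible reading of the slide; the deck does not say which id is retained.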

  9. Data Pre-processing • Tweet filtering • Clean tweet text. Remove: URLs, user mentions, hashtags, punctuation, digits • Tokenize remaining text • Rebuild tweet by appending: user mentions (@) + hashtags (#) + text tokens • Remove tweets based on structure (remove if too many @, # or too few text tokens) • Term filtering • Keep only bi-grams + tri-grams occurring in at least a percentage of tweets in the time window (e.g., min(10, n_tweets_in_window * 0.0025)) • Tweet-Term Matrix (binary) • Remove out-of-vocabulary tweets and very short tweets (with fewer than 5 tokens) • Retains about 20% of the original raw tweet stream (in each time window)
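The filtering steps on this slide could be sketched roughly as follows, using only the standard library. The thresholds `max_mentions`, `max_hashtags` and `min_tokens` are hypothetical defaults (the slide only says "too many" and "too few"), the regular expressions are simplified and ASCII-only, and `min_df` is left as a parameter so the window-dependent threshold from the slide can be passed in by the caller.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # drops punctuation and digits


def normalize_tweet(text, max_mentions=3, max_hashtags=3, min_tokens=3):
    """Return the rebuilt tweet (mentions + hashtags + tokens) or None."""
    body = URL_RE.sub(" ", text.lower())
    mentions = MENTION_RE.findall(body)
    hashtags = HASHTAG_RE.findall(body)
    body = HASHTAG_RE.sub(" ", MENTION_RE.sub(" ", body))
    tokens = NON_LETTER_RE.sub(" ", body).split()
    # Structural filter: thresholds are illustrative assumptions.
    if len(mentions) > max_mentions or len(hashtags) > max_hashtags:
        return None
    if len(tokens) < min_tokens:
        return None
    return " ".join(mentions + hashtags + tokens)


def ngrams(tokens, sizes=(2, 3)):
    """Set of distinct bi-grams and tri-grams in one tweet."""
    return {
        " ".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    }


def build_vocabulary(tweets, min_df):
    """Keep n-grams that occur in at least min_df tweets of the window."""
    df = Counter()
    for tweet in tweets:
        df.update(ngrams(tweet.split()))
    return {term for term, count in df.items() if count >= min_df}
```

A binary tweet-term row is then simply `ngrams(tweet.split()) & vocabulary`; tweets whose row is empty are the out-of-vocabulary tweets that the slide removes.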

  10. Hierarchical Tweet Clustering • Computing tweet pairwise-distance • Scale and normalize tweet-term matrix • Cosine as distance metric (Euclidean gives similar results); sklearn + scipy • Computing hierarchical clustering => dendrogram • fastcluster C++ library (interface to R and Python) • Dendrogram cutting • Cut at 0.5 distance threshold (better libraries based on topology of dendrogram available in R, e.g., Dynamic Tree Cut: only specify min number of examples in each cluster) • One cluster = one potential topic
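A small sketch of the distance computation, clustering and dendrogram cut, written against SciPy rather than fastcluster (fastcluster documents its `linkage` function as a faster drop-in replacement for SciPy's). The linkage method is not stated on the slide, so `method="average"` is an assumption here, as is the helper name `cluster_tweets`.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_tweets(matrix, threshold=0.5, method="average"):
    """Cluster rows of a tweet-term matrix and cut the dendrogram.

    Returns one integer cluster label per tweet. Assumes every row has
    at least one non-zero entry (out-of-vocabulary tweets were removed).
    """
    X = np.asarray(matrix, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize rows
    distances = pdist(X, metric="cosine")             # condensed distances
    distances = np.clip(distances, 0.0, None)         # guard against -1e-17
    tree = linkage(distances, method=method)          # the dendrogram
    # Cut at a fixed distance: no need to know the number of clusters.
    return fcluster(tree, t=threshold, criterion="distance")
```

For example, with rows `[1,1,0,0]`, `[1,1,1,0]`, `[0,0,1,1]`, `[0,0,0,1]`, the first two tweets and the last two tweets end up in separate clusters at the 0.5 threshold.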

  11. Hierarchical Tweet Clustering • Ranking clusters • Retain only clusters with at least 10 tweets (size constraint) • Score each cluster based on: • Compute cluster-centroid (vector of terms) • Get maximum term-score (over all centroid terms) • Term score: entity_score * burstiness_score • Assign the highest term score as cluster score • Normalize cluster score by cluster size • Entity score = 2.5 (identify entity-terms with CMUTweetTagger) • Burstiness score = df-idf_t, with t=4 (prior work on Bngram) • Interesting extensions to cluster score: article_score, tweet_importance based on trustworthiness or clout of users issuing tweet
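The cluster scoring on this slide might look like the sketch below. Two points are assumptions rather than facts from the deck: the df-idf_t formula is the form usually quoted from the BNgram prior work, (df_i + 1) / (log(mean of df over the previous t windows + 1) + 1), and the entity score is read as a boost of 2.5 for entity terms and 1.0 otherwise. All function and parameter names are illustrative.

```python
import math


def df_idf(current_df, previous_dfs):
    """Burstiness of a term: current document frequency, damped by its
    average document frequency over the previous windows (t = len(previous_dfs)).
    Assumed BNgram-style form; the slide does not spell out the formula."""
    t = len(previous_dfs)
    mean_past = sum(previous_dfs) / t if t else 0.0
    return (current_df + 1) / (math.log(mean_past + 1) + 1)


def cluster_score(term_dfs, history, entities, cluster_size, entity_boost=2.5):
    """Score a cluster by its best centroid term, normalized by size.

    term_dfs: {term: df in current window} for the centroid's terms
    history:  {term: [df in each of the previous t windows]}
    entities: set of terms tagged as entities (e.g. by a tweet tagger)
    """
    best = 0.0
    for term, df in term_dfs.items():
        boost = entity_boost if term in entities else 1.0
        best = max(best, boost * df_idf(df, history.get(term, [])))
    return best / cluster_size
```

A term that was absent in the previous four windows and now appears in 9 tweets scores 10.0 before the entity boost, while the same count for a term that was already frequent scores lower, which is what favors newly emerging topics.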

  12. Hierarchical Tweet (Re)Clustering • Selecting topic-headlines • Take top-20 ranked clusters as potential topics • Select first (time-wise) tweet in each cluster as topic-headline • Re-cluster headlines • Hierarchical clustering of headlines • Score headline-clusters using max score headline • Rank headline-clusters, take top-10 • Final topics • Select first (published) headline in each cluster, present raw tweet (less url) to user • Gather all distinct keywords of headlines in headline-cluster to create topic-tags • Tweet ids for topic: the ids of corresponding headlines. If headlines do not cluster, only one tweet id
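Headline selection and final topic assembly can be sketched as below. The `Tweet` record, the function names, and the choice to treat every word of a headline as a "keyword" for the topic-tags are assumptions made for illustration; re-clustering the headlines themselves would reuse the same hierarchical clustering step as for tweets.

```python
import re
from typing import NamedTuple

URL_RE = re.compile(r"https?://\S+")


class Tweet(NamedTuple):
    tweet_id: str
    timestamp: int  # e.g. epoch seconds
    text: str


def select_headlines(clusters, scores, top_k=20):
    """Earliest tweet of each of the top_k highest-scoring clusters.

    clusters: {cluster_id: [Tweet, ...]}, scores: {cluster_id: float}
    """
    ranked = sorted(clusters, key=lambda cid: scores[cid], reverse=True)
    return [min(clusters[cid], key=lambda tw: tw.timestamp)
            for cid in ranked[:top_k]]


def build_topic(headline_cluster):
    """Turn one cluster of headlines into a topic record."""
    ordered = sorted(headline_cluster, key=lambda tw: tw.timestamp)
    cleaned = [URL_RE.sub("", tw.text).strip() for tw in ordered]
    # Assumption: every distinct word of the headlines becomes a tag.
    tags = sorted({word for text in cleaned
                   for word in re.findall(r"\w+", text.lower())})
    return {
        "headline": cleaned[0],  # first published, URL removed
        "tags": tags,
        "tweet_ids": [tw.tweet_id for tw in ordered],
    }
```

If a headline does not cluster with any other, `build_topic` receives a single-element list and the topic carries only that one tweet id, matching the last bullet of the slide.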

  13. Results • Top-10 topics first time window in US stream (07-11-2012 00:00 – 00:10) • 1. WASHINGTON (AP) - Obama wins Vermont; Romney wins Kentucky. #Election2012 • 2. Not a shocker NBC reporting #Romney wins Indiana & Kentucky #Obama wins Vermont • 3. RT @SkyNewsBreak: Sky News projection: Romney wins Kentucky. #election2012 • 4. AP RACE CALL: Democrat Peter Shumlin wins governor race in Vermont. #Election2012 • 5. CNN Virginia exit poll: Obama 49%, Romney 49% #election2012 • 6. Mitt Romney Losing in Massachusetts a state that he governed. Why vote for him when his own people don't want him? #Obama2012 • 7. Twitter is gonna be live and popping when Obama wins! #Obama2012 • 8. INDIANA RESULTS: Romney projected winner (via @NBC) #election2012 • 9. If Obama wins I'm going to celebrate... If Romney wins I'm going to watch Sesame Street one last time #Obama2012 • 10. #election2012 important that Romney won INdependents in Virginia by 11 pts. With parties about even, winning Inds is key

  14. Results • Top-10 topics first time window in Syria stream (25-02-2014 18:00 – 18:15) • 1. The new, full Godzilla trailer has roared online • 2. At half-time Borussia Dortmund lead Zenit St Petersburg 2-0. • 3. Ukraine Currency Hits Record Low Amid Uncertainty: Ukrainian currency, the hryvnia, hits all-time low against ... • 4. Ooh, my back! Why workers' aches pains are hurting the UK economy • 5. Uganda: how campaigners are preparing to counter the anti-gay bill • 6. JPost photographer snaps what must be the most inadvertantly hilarious political picture of the decade • 7. Fans gather outside Ghostbusters firehouse in N.Y.C. to pay tribute to Harold Ramis • 8. Man survives a shooting because the Bible in his top pocket stopped two bullets • 9. Ukraine's toppling craze reaches even legendary Russian commander, who fought Napoleon • 10. Newcastle City Hall. Impressive booking first from bottom on the left...

  15. Discussion • Parameter choices • Filtering parameters dependent on window size (number of tweets in window) • Unigrams vs N-grams (N>1) • Bi-grams + N-grams good for content + scalability • Cluster ranking • (Normalized) df-idf_t seems a good choice, but cluster-score may benefit from using tweet importance (based on user importance) • Topic Precision (~80%, based on googling topic-headlines) • On average about 8-9 out of 10 headlines are published news • Efficiency Aspect • System takes about 0.5 min per 15 min slot (scales well for larger time slots)

  16. Conclusion • Encouraging results in using Twitter stream as a news aggregator (truly global) • Both sides now: media outlets (CNN, BBC, Reuters, AP) and regular people post updates on (breaking) stories • We need a good topic-benchmark to refine techniques (e.g., comprehensive set of ground truth topics)

  17. Future Work • Improve retrieval of newsworthy stories • -E.g., ‘This is what happens when you put two pit bulls in a photo booth’, vs ‘Ukraine currency hits record low amid uncertainty’ • -May depend on type of stories we are after (BBC vs Sun) • -Tweet/user importance filtering may help • -News streamed in same time frame may help (vocabulary selection) • Fragmentation due to breaking news stories • -Same story discussed from different angles: • Lee Rigby murders: Michael Adebolajo given whole-life jail term • Lee Rigby murder sentence expected shortly. Pictured: the scene outside the Old Bailey in London • Judge Mr Justice Sweeney says behaviour of Lee #Rigby's killers was "sickening and pitiless" • -Combination of tweet and term clustering may help (e.g., cluster headlines in term rather than tweet space)

  18. Thank You! • Open source code: • https://github.com/heerme/twitter-topics

  19. Different Newspapers in UK • In the TV comedy series Yes Minister, fictional Prime Minister Jim Hacker explains to his staff the readership of the main newspapers: • “The Daily Mirror is read by people who think they run the country, The Guardian is read by people who think they ought to run the country, The Times is read by people who actually do run the country, The Daily Mail is read by the wives of the people who run the country, The Financial Times is read by people who own the country, The Morning Star is read by people who think the country ought to be run by another country, and The Daily Telegraph is read by people who think it is.” (Source: Wikipedia)
