Hashtags as Milestones in Time: Identifying the hashtags for meaningful events using Twitter search logs and Wikipedia data. Stewart Whiting (University of Glasgow), Omar Alonso (Microsoft/Bing). Time Aware Information Access Workshop, SIGIR Oregon, 2012. (Work done while on internship at Microsoft.)
Outline • Hashtags as milestones in time • Introduction • Why milestones? • Why hashtags? Can they be useful as milestones? • Motivation • Approach • Data preparation • Approach steps • Constructing a timeline – examples • Preliminary conclusions
Abstract: Hashtags as milestones in time What we want to do: • Identify event-based hashtags for timeline creation • Currently using historic/past data • Filter out junk • Find the most temporally significant hashtags • Use multiple signals: Twitter search logs + related Wikipedia article popularity • We are not doing topic detection/tracking! Why? • A good way to express (anchor) a topic on a timeline… • Help users make sense of/navigate temporal information #what?
Introduction • Hashtags are used by authors to explicitly denote the relevant topic(s) in a message • “Great passing, great game #euro2012” • Used by authors and searchers • Broadcast and consume a specific topic • Especially useful in short text retrieval, where bag of words/language modelling are challenging • Reflect mainstream events (or memes!) in real-time • See trending topics right now • Timelines are very good for displaying events • But you need to express the events as a meaningful marker, or milestone!
Introduction to the data • Two crowds of people • Authors/searchers on Twitter • Editors/browsers on Wikipedia • Correlation between signals from the two crowds • People search for what is happening • People edit Wikipedia with what is happening • Two very distinctive signals!
Twitter hashtag signals (in search logs) • But plenty of memes too… • #20PeopleWhoIWantToMeet • #PresentingInTheBatCave • #whiteppldoitbutblackppldont
Wikipedia signals • Whitney Houston • TV appearances • Her death in February 2012 • Events were reflected by discussion with hashtags on Twitter, e.g. • #ripwhitney • #bgtwhitney (BGT = Britain’s Got Talent)
Motivation • Both signals have large coverage • Celebrities, news, weather, people, science, movies etc. • Two robust signals coming from large crowds • Difficult for individuals to influence (spam?) • Not so reliant on single-signal analysis (e.g. wavelets or burst detection) • Discard memes by looking for associated Wikipedia articles • Meaningful milestones in timelines provide strong features to navigate temporal content • Alonso et al. (2010), Matthews et al. (2010), From et al. (2003)
Data Preparation – Hashtag Data • Extracted from Bing Social and IE8 query logs • Provides hashtag use, aggregated per day • (Proprietary, but could be extracted from other sources) • Hashtags are mostly a mix of unigrams and bigrams! • We also want the words in the hashtag • Need to use a word breaker… • We used Microsoft Web N-Gram Services • Breaks #crosstownshootout into ‘cross town shootout’ and #basketballwivesla into ‘basketball wives la’
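To make the word-breaking step concrete, here is a minimal sketch of a dictionary-based segmenter. It is not the Microsoft Web N-Gram Services API used in the talk: it is a stand-in that picks the split with the highest unigram log-probability by dynamic programming. The `UNIGRAM_COUNTS` table, the unknown-word penalty, and the function names are all hypothetical, chosen only for illustration.

```python
import math

# Hypothetical toy unigram counts; a real word breaker would use
# web-scale n-gram statistics instead.
UNIGRAM_COUNTS = {
    "cross": 50, "town": 80, "shootout": 10, "shoot": 30, "out": 40,
    "basketball": 40, "wives": 15, "la": 25, "basket": 20, "ball": 60,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def word_logprob(word):
    """Log-probability of a word; unknown words get a heavy per-character penalty."""
    count = UNIGRAM_COUNTS.get(word)
    if count is None:
        return -10.0 * len(word)
    return math.log(count / TOTAL)


def break_hashtag(hashtag):
    """Split a hashtag into the most probable word sequence under the toy model."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] holds (score, words) for the best segmentation of text[:i].
    best = [(0.0, [])] + [(-math.inf, [])] * n
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            piece = text[start:end]
            score = best[start][0] + word_logprob(piece)
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [piece])
    return best[n][1]


print(break_hashtag("#crosstownshootout"))
print(break_hashtag("#basketballwivesla"))
```

The separated terms are what a later step would use to query the Wikipedia index; with these toy counts, a single frequent word is preferred over two rarer halves, which is the behaviour a unigram model gives by construction.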
Data Preparation – Wikipedia Data • Created a Lucene index using the Wikipedia Extraction (WEX) data. • Wikipedia article viewing popularity statistics • Dump available for each hour since Dec 2007 • Published near real-time, for the past hour (on the hour) • Huge number of data points! • So we sampled 8am/8pm each day • Transformed into a daily aggregated time-series (therefore comparable with hashtag signals) • Smoothed with exponential smoothing (alpha = 0.2) • Over 2 billion data points!
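The sampling and smoothing described above can be sketched as follows. The slides give the sample hours (8am/8pm) and alpha = 0.2, but not how the two samples are combined or how the smoother is initialised; this sketch assumes the two samples are summed per day and that the first observation seeds the smoothed series. The input layout `(date, hour, views)` and the function names are hypothetical.

```python
from collections import defaultdict


def daily_series(hourly_views, sample_hours=(8, 20)):
    """Aggregate hourly page views into one value per day.

    hourly_views: iterable of (iso_date, hour, views) tuples.
    Only the sampled hours are kept; summing them is an assumption.
    """
    totals = defaultdict(int)
    for day, hour, views in hourly_views:
        if hour in sample_hours:
            totals[day] += views
    # ISO dates sort chronologically as plain strings.
    return [totals[day] for day in sorted(totals)]


def exponential_smoothing(series, alpha=0.2):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = []
    for value in series:
        if not smoothed:
            smoothed.append(float(value))  # assumed initialisation: s_0 = x_0
        else:
            smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical hourly counts for one article.
hourly = [
    ("2012-02-11", 8, 100), ("2012-02-11", 13, 999), ("2012-02-11", 20, 300),
    ("2012-02-12", 8, 50),
]
print(exponential_smoothing(daily_series(hourly)))
```

A low alpha such as 0.2 weights history heavily, so single-day spikes are damped rather than removed.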
Approach Outline • For each hashtag from the logs, use the word breaker service to extract hashtag terms. • Use the separated terms to query the Wikipedia index – maps each hashtag to a set of possibly associated articles. • For each article/hashtag pair, prepare same-length comparable time-series of popularity • Frequency of hashtag over time • Popularity of article over time • Pearson correlation coefficient computed. • Measures association between temporality of the hashtag occurrence and the Wikipedia article popularity.
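The correlation step above can be sketched in a few lines. The Pearson coefficient itself is standard; the `best_article` helper, the choice to return 0.0 for a constant series, and the example numbers are assumptions made for illustration and are not taken from the talk.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumed convention: a flat series carries no temporal signal
    return cov / math.sqrt(var_x * var_y)


def best_article(hashtag_series, article_series_by_title):
    """Return (title, r) for the candidate article most correlated with the hashtag."""
    scored = {
        title: pearson(hashtag_series, series)
        for title, series in article_series_by_title.items()
    }
    return max(scored.items(), key=lambda item: item[1])


# Hypothetical daily series over the same five days.
hashtag_daily = [2, 3, 40, 25, 5]
candidates = {
    "Article A": [100, 120, 900, 600, 150],   # spikes on the same day
    "Article B": [300, 290, 310, 305, 295],   # roughly flat
}
print(best_article(hashtag_daily, candidates))
```

A hashtag whose best candidate still correlates weakly, or that maps to no article at all, would be a natural candidate for the "junk" that the approach aims to filter out.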
Conclusions • Early work, but correlating the signals does yield high-profile temporal events • Hashtags can therefore be used to anchor events on a timeline • Occasional spurious correlation (need better hashtag frequency data to improve this) • Correlation does not imply causation! • Future work… • Automatic construction of timelines • Improving correlation quality – examine time windows • Designing an evaluation framework to assess overall timeline quality