Hashtag Retrieval in a Microblogging Environment Miles Efron Graduate School of Library and Information Science University of Illinois, Urbana-Champaign mefron@illinois.edu http://people.lis.illinois.edu/~mefron
Microblog overview • Twitter is by far the largest microblog platform (>50M tweets / day) • Brief posts • Temporally ordered • Follower / followed social model • Why would we search tweets? • Find answered questions • Gauge consensus or opinion • Find ‘elite’ web resources • ~19B searches per month vs. 4B for Bing http://searchengineland.com/twitter-does-19-billion-searches-per-month-39988
Anatomy of a Tweet (figure: an annotated tweet highlighting its screen name, hashtag, mention, and time stamp)
#hashtags • Author-embedded metadata • Hashtags collocate tweets • topically • contextually (e.g. #local, #toRead) • with respect to affect (e.g. #iHateItWhen, #fail) • This research primarily concerns retrieving topical hashtags. • Find tags to ‘follow’ • Find a canonical tag for an entity (e.g. a conference)
Hypotheses General Hypothesis: Metadata in tweets can be marshaled to improve retrieval effectiveness. Specific Hypothesis: Traditional measures of term importance (such as IDF) don’t translate to the problem of identifying useful hashtags. An alternative ‘social’ measure is more appropriate.
Microblog Entity Search (figure: a query issued against entities entity1, entity2, …, entityn; cf. Balog et al. 2006)
Language Modeling IR Rank documents in decreasing order of their similarity to the query, where similarity is understood in a specific probabilistic context. Assume that a document d was generated by some probability distribution M over the words in the indexing vocabulary. What is the likelihood that the model that generated the text in d also generated the text in our query q? If the likelihood given the language model for document di is greater than the likelihood for another document dj, then we rank di higher than dj.
Language Modeling IR d1 : this year’s sigir was better than last year’s sigir d2 : was this (last year’s) sigir better than last
Language Modeling IR q: this sigir d1 : this year’s sigir was better than last year’s sigir d2 : was this (last year’s) sigir better than last
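The query-likelihood ranking described above can be sketched in a few lines of Python, using the toy documents d1 and d2. This is a minimal illustration, not the talk's actual system; the Jelinek-Mercer smoothing weight lam=0.5 and whitespace tokenization are my assumptions.

```python
from collections import Counter

def query_likelihood(query, doc, collection, lam=0.5):
    """Score a document by P(q | M_d), smoothing the document's
    maximum-likelihood model with the collection model:
    P(w | d) = (1 - lam) * n(w, d)/|d| + lam * n(w, C)/|C|."""
    d = Counter(doc)
    c = Counter(collection)
    dlen, clen = sum(d.values()), sum(c.values())
    score = 1.0
    for w in query:
        score *= (1 - lam) * d[w] / dlen + lam * c[w] / clen
    return score

d1 = "this year's sigir was better than last year's sigir".split()
d2 = "was this last year's sigir better than last".split()
q = "this sigir".split()
collection = d1 + d2  # toy "collection" for smoothing

s1 = query_likelihood(q, d1, collection)
s2 = query_likelihood(q, d2, collection)
# d1 mentions "sigir" twice in 9 terms, d2 once in 8, so d1 ranks higher
```

Ranking by this score is rank-equivalent to ranking by P(d | q) under a uniform document prior, which is why the prior can be dropped, as the next slide notes.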
Language Modeling IR Rank documents by the likelihood that their language models generated the query.
Language Modeling IR Rank documents by the likelihood that their language models generated the query. The document prior is often assumed to be uniform (the same for all documents) and thus dropped.
What is a Document? For a tweet, n(wi, d) — the count of term wi in d — is self-explanatory. For other entities (hashtags and people), we use the “virtual document” approach (Macdonald, 2009).
Virtual Documents (figure: individual tweets containing #ipsum are concatenated into a single virtual document for that tag) For a hashtag hi, define a virtual document di consisting of the concatenated text of all tweets containing hi.
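The virtual-document construction is straightforward to sketch: group tweets by the hashtags they contain and concatenate. The regex tokenization of hashtags and lowercase normalization here are my assumptions, not necessarily the talk's implementation.

```python
from collections import defaultdict
import re

def virtual_documents(tweets):
    """For each hashtag h_i, build a virtual document d_i by
    concatenating the text of every tweet containing h_i."""
    by_tag = defaultdict(list)
    for tweet in tweets:
        # extract hashtags, case-folded so #Ipsum and #ipsum collapse
        tags = set(re.findall(r"#\w+", tweet.lower()))
        for tag in tags:
            by_tag[tag].append(tweet)
    return {tag: " ".join(texts) for tag, texts in by_tag.items()}

tweets = [
    "Lorem #ipsum dolor sit amet",
    "#Ipsum #sapienmollis dui",
]
vd = virtual_documents(tweets)
# vd["#ipsum"] concatenates both tweets; vd["#sapienmollis"] only the second
```

Each virtual document can then be scored by the query-likelihood model exactly as an ordinary document would be.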
Does it work? Results for the query “SIGIR”: #sigir #sigir2010 #sigir10 #gingdotblog #recsys #msrsocpapers #kdd2010 #tripdavisor #kannada #ecdl2010 #genevaishotandhasnoairconditioning #sigir20010 #wsdm2011
Hashtag Priors • Some tags are better than others. • Even if its language model is on-topic, a very common tag (e.g. #mustread) is probably not useful. • But rarity isn’t much help either: #genevaishotandhasnoairconditioning • Workhorse measures like IDF don’t capture tag usefulness. • But “document” (i.e. tag) priors offer help.
Hashtag Priors Intuition: a tag is likely to be useful if it is used in many useful tweets. A tweet is useful if it contains many useful tags (obviously this is an oversimplification).
Hashtag Priors—Analogy to PageRank Pr(h) = α Σ_{t ∈ H} Pr(t) Where: h is a hashtag; H is the set of tags that co-occur with h; t is a hashtag in the set H; α is a constant so that the probabilities sum to one.
Hashtag Priors—Analogy to PageRank • These prior probabilities are the steady state of the Markov chain… • A “random reader” model: • Reading tweets • Choosing at random what to do next: • Examine tweets with a hashtag in the current tweet • Go to a random, new tweet (so we need…)
Hashtag Priors—Calculation • Initialize all n(T) tags to constant probability. • For each tag h: • find the set of tags H that co-occur with h. • Set Pr(h) = sum of Pr(.) for all tags in H. • Normalize all scores. • Iterate, repeating step 2 until convergence.
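The iteration above can be sketched as power iteration over a tag co-occurrence graph. This is a minimal illustration of the procedure as described on the slide; the toy co-occurrence sets and the fixed iteration count are my assumptions.

```python
def hashtag_priors(cooccur, iters=50):
    """Compute 'social' priors by the slide's procedure:
    start uniform, repeatedly set each tag's score to the sum
    of its co-occurring tags' scores, then renormalize (the
    normalization plays the role of the constant alpha)."""
    tags = list(cooccur)
    pr = {t: 1.0 / len(tags) for t in tags}  # step 1: uniform init
    for _ in range(iters):
        # step 2: each tag inherits the summed scores of its neighbors
        new = {t: sum(pr[n] for n in cooccur[t]) for t in tags}
        # step 3: normalize so the scores sum to one
        z = sum(new.values())
        pr = {t: s / z for t, s in new.items()}
    return pr

# hypothetical co-occurrence sets for illustration
cooccur = {
    "#linkeddata": {"#semanticweb", "#rdf", "#opendata"},
    "#semanticweb": {"#linkeddata", "#rdf"},
    "#rdf": {"#linkeddata", "#semanticweb"},
    "#opendata": {"#linkeddata"},
}
pr = hashtag_priors(cooccur)
# the best-connected tag (#linkeddata) ends up with the largest prior
```

As with PageRank, the fixed iteration count stands in for running until convergence; a production version would test the change between rounds against a tolerance.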
A Return to Intuition Assume that if two tags co-occur in a tweet, they share an affinity (i.e. they are linked). Assume that tags occurring in many tweets are highly engaged in the discourse on Twitter. Highly engaged tags spread their influence to tags that are less popular but linked to them.
Properties of Hashtag “Social” Priors (figure: scatter plot of tag document frequency against prior; Cor(freq, prior) = 0.275)
Properties of Hashtag “social” Priors +-----------------+---------+---------------------+ | tag_text | docFreq | score | +-----------------+---------+---------------------+ | #linkeddata | 559 | 0.00537524473400945 | | #opensource | 1054 | 0.00427572303218856 | | #semanticweb | 215 | 0.00406174857132168 | | #yam | 345 | 0.00269713986530859 | | #rdf | 106 | 0.00257291441134344 | | #hadoop | 304 | 0.00247387571898314 | | #e20 | 512 | 0.00235774256615437 | | #opendata | 343 | 0.00234389638939712 | | #opengov | 563 | 0.00230838488530255 | | #nosql | 414 | 0.00220599138375711 | | #gov20 | 1964 | 0.00218390304201311 | | #semweb | 116 | 0.00209311764669248 | | #cio | 199 | 0.00190685555120058 | | #a11y | 462 | 0.00184103588077775 | | #sparql | 61 | 0.001802252610603 | | #webid | 125 | 0.00170313580607837 | | #semantic | 89 | 0.001699091839444 | | #cloudcomputing | 123 | 0.00166678332288627 | | #rdfa | 117 | 0.00165629661283845 | | #oss | 75 | 0.00164741425796336 | +-----------------+---------+---------------------+
An example: “immigration reform”
No Priors: #immigration #politics #twisters #economist #tlot #tcot #healthcare #ocra #sgp #teaparty
With Priors: #immigration #aussiemigration #teachers #physicians #parenting #election #twisters #reform #politics #hcreform
Assessment Research question: Does incorporating social priors into hashtag retrieval improve the usefulness of results? • 25 test queries examined via 2 Amazon Mechanical Turk tasks. • Queries were created manually. • For each task, each query was completed by 5 people. Estimates of usefulness were obtained by averaging the 5 scores.
Task 1: assess an individual query/model pair (10 results) Assess: Overall usefulness Clarity of results Obviousness of results (additional demographic info collected)
Task 2: compare the quality of two rankings (10 results each) Assess: Overall usefulness Clarity of results Obviousness of results (additional demographic info collected)
What Does Hashtag Retrieval Let Us Do? Ad hoc tag retrieval Query Expansion (Efron, 2010) Document Expansion
Query Expansion: immigration reform
Relevance model: #weight( 0.5 #combine( immigration reform ) 0.5 #weight( 4.48 immigration 1.357 politics 0.965 twisters 0.927 economist 0.847 tlot ) )
Hashtag expansion: #weight( 0.5 #combine( immigration reform ) 0.5 #weight( 0.9652168 immigration 0.8424631 reform 0.1551001 rt 0.1448956 t 0.1413850 obama 0.1361353 law 0.1344551 aussiemigration 0.1299880 s 0.1008342 australia 0.0939461 illegal ) )
Efron (2010): 8.2% improvement over baseline, 6.92% over term-based feedback.
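The #weight structure above mixes the original query with expansion terms at equal weight. A minimal sketch of that interpolation as a query model, outside Indri: the 0.5/0.5 mix mirrors the slide, but the top-k cutoff and renormalization of expansion weights are my assumptions.

```python
def expand_query(query_terms, expansion_weights, lam=0.5, k=5):
    """Interpolate a uniform original-query model with the top-k
    expansion terms: P'(w|q) = lam * P(w|q) + (1-lam) * P(w|exp).
    lam = 0.5 mirrors the 0.5 / 0.5 mix in the #weight queries."""
    q = {w: 1.0 / len(query_terms) for w in query_terms}
    top = sorted(expansion_weights.items(), key=lambda kv: -kv[1])[:k]
    z = sum(w for _, w in top)
    exp = {t: w / z for t, w in top}  # normalize expansion weights
    return {w: lam * q.get(w, 0.0) + (1 - lam) * exp.get(w, 0.0)
            for w in set(q) | set(exp)}

# expansion weights taken from the relevance-model query above
weights = {"immigration": 4.48, "politics": 1.357, "twisters": 0.965,
           "economist": 0.927, "tlot": 0.847}
m = expand_query(["immigration", "reform"], weights)
# "immigration" gets mass from both the query and the expansion,
# so it outweighs "reform", which in turn outweighs "politics"
```

Documents would then be ranked against this interpolated model rather than the raw two-term query.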
Document Expansion Browsers For Visually Impaired Users: #a11y #accessibility #assistive #axs #touch Key Elements of a Startup: #startup #newtech #meetup #meetups #prodmktg #lean
Next Steps • Articulate and investigate two senses of “search” on Twitter: • Searching over collected, indexed tweets. • Social search: Curious to hear from anyone who has gotten to play with @blekko. The user-controlled sorting (what they call "slashtags") is intriguing. • Consider document surrogates for retrieval sets. • Information synthesis from retrieved data: “spontaneous documents.”
References
Balog, K., Azzopardi, L., & de Rijke, M. (2009). A language modeling framework for expert finding. Information Processing & Management, 45(1), 1-19.
Efron, M. (2010). Hashtag retrieval in a microblogging environment. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 787-788). Geneva, Switzerland: ACM.
Macdonald, C. (2009). The Voting Model for People Search (Doctoral dissertation). University of Glasgow.