Term Burstiness in WSD and Pseudo Relevance Feedback
Atelach Alemu Argaw, March 2006
Burstiness Model (Sarkar et al.)
• Model gaps between occurrences, not term occurrence counts
• Mixture of two exponential distributions
• Models the amount of time until a specific event occurs
• Between-burst rate λ1 (average between-burst gap λ1' = 1/λ1)
• Within-burst rate λ2 (average within-burst gap λ2' = 1/λ2)
• Reference: Sarkar, Avik; De Roeck, Anne; Garthwaite, Paul H. Term Re-occurrence Measures for Analyzing Style. In Proceedings of the SIGIR 2005 Workshop on Stylistic Analysis of Text for Information Access, 2005.
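As a concrete illustration (not from the slides), here is a minimal Python sketch of the two-exponential gap mixture; the rates, mixing weight, and the helper names `gap_mixture_density` / `sample_gaps` are all hypothetical choices for the example:

```python
import numpy as np

def gap_mixture_density(gap, p1, lam1, lam2):
    """Density of a gap under a two-component exponential mixture.

    lam1: between-burst rate (1/lam1 is the average gap when the term
          has not occurred recently)
    lam2: within-burst rate (1/lam2 is the average gap inside a burst)
    p1:   mixing probability of the between-burst component
    """
    return p1 * lam1 * np.exp(-lam1 * gap) + (1 - p1) * lam2 * np.exp(-lam2 * gap)

def sample_gaps(n, p1, lam1, lam2, seed=0):
    """Draw n synthetic gaps from the mixture (illustration only)."""
    rng = np.random.default_rng(seed)
    use_between = rng.random(n) < p1
    rates = np.where(use_between, lam1, lam2)
    return rng.exponential(1.0 / rates)

# A "bursty content word" would have a small lam1 (large between-burst gap)
# and a large lam2 (small within-burst gap).
gaps = sample_gaps(1000, p1=0.3, lam1=0.01, lam2=0.5)
print(gaps.mean(), gap_mixture_density(5.0, 0.3, 0.01, 0.5))
```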
Burstiness Model (Sarkar et al.)
• First occurrence: the gap is measured from the start of the document
• No occurrence: the term is treated as a censored observation
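A hedged sketch of how censoring could enter the likelihood, assuming censored gaps contribute the exponential survival term rather than the density; the function name and argument layout are illustrative, not taken from the paper:

```python
import numpy as np

def censored_loglik(observed_gaps, censored_gaps, lam):
    """Log-likelihood for one exponential component with censoring.

    observed_gaps: fully observed gaps between occurrences
    censored_gaps: gaps known only as a lower bound, e.g. the distance from
                   the last occurrence (or document start) to the document
                   end when the term does not occur again
    Observed gaps contribute the density lam * exp(-lam * g);
    censored gaps contribute the survival function exp(-lam * g).
    """
    obs = np.log(lam) - lam * np.asarray(observed_gaps, dtype=float)
    cens = -lam * np.asarray(censored_gaps, dtype=float)
    return obs.sum() + cens.sum()
```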
Burstiness Model
• Bayesian parameter estimation
• posterior ∝ prior × likelihood
• P(θ | D) ∝ P(θ) × P(D | θ)
• choose an uninformative prior
• estimate the posterior using Gibbs sampling (MCMC)
• repeatedly draw samples and use the sample values to estimate the posterior
• WinBUGS software
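A minimal Gibbs-sampler sketch in Python rather than WinBUGS, assuming vague Gamma priors on the two rates and a uniform Beta prior on the mixing weight; these priors, the initial values, and the function name are illustrative choices, not necessarily those used in the original experiments:

```python
import numpy as np

def gibbs_exp_mixture(gaps, n_burn=1000, n_keep=5000, seed=0):
    """Gibbs sampler for a two-component exponential mixture over gaps."""
    rng = np.random.default_rng(seed)
    gaps = np.asarray(gaps, dtype=float)
    a0, b0 = 0.001, 0.001          # vague Gamma prior (shape, rate)
    lam = np.array([0.01, 1.0])    # initial between/within-burst rates
    p1 = 0.5
    samples = []
    for it in range(n_burn + n_keep):
        # 1) sample component indicators given the rates and mixing weight
        w1 = p1 * lam[0] * np.exp(-lam[0] * gaps)
        w2 = (1 - p1) * lam[1] * np.exp(-lam[1] * gaps)
        z = rng.random(len(gaps)) < w1 / (w1 + w2)   # True -> component 1
        # 2) sample each rate from its conjugate Gamma posterior
        for k, mask in enumerate([z, ~z]):
            n_k, s_k = mask.sum(), gaps[mask].sum()
            lam[k] = rng.gamma(a0 + n_k, 1.0 / (b0 + s_k))
        # 3) sample the mixing weight from its Beta posterior
        p1 = rng.beta(1 + z.sum(), 1 + (~z).sum())
        if it >= n_burn:
            samples.append((lam[0], lam[1], p1))
    return np.array(samples)

# Posterior means of the average gaps lam1' = 1/lambda1 and lam2' = 1/lambda2:
# draws = gibbs_exp_mixture(gaps)
# print((1 / draws[:, 0]).mean(), (1 / draws[:, 1]).mean())
```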
Parameter estimates (Sarkar et al.)
λ1' = 1/λ1
• The mean of the exponential distribution with parameter λ1
• Rarity of a term in the corpus: the average gap at which the term occurs if it has not occurred recently
λ2' = 1/λ2
• The rate of occurrence of a term given that it has occurred recently
• Within-document burstiness
P1
• Probability of a term occurring with rate λ1'
P2
• Probability of a term occurring with rate λ2'
Burstiness Model (Sarkar et al.): word behaviours
• Small λ1', small λ2': frequently occurring function word
• Large λ1', small λ2': bursty content word
• Small λ1', large λ2': frequent but well-spaced function word
• Large λ1', large λ2': infrequent, scattered function word
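These four behaviours can be read straight off the estimated average gaps; a small sketch, where the cut-off separating "small" from "large" gaps is an arbitrary illustrative threshold rather than a value from the slides:

```python
def classify_word(lam1_prime, lam2_prime, threshold=10.0):
    """Classify a term from its estimated average gaps.

    lam1_prime = 1/lambda1 (average between-burst gap),
    lam2_prime = 1/lambda2 (average within-burst gap).
    """
    small1, small2 = lam1_prime < threshold, lam2_prime < threshold
    if small1 and small2:
        return "frequently occurring function word"
    if not small1 and small2:
        return "bursty content word"
    if small1 and not small2:
        return "frequent but well-spaced function word"
    return "infrequent, scattered function word"

print(classify_word(80.0, 2.5))   # -> "bursty content word"
```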
Test run
Data
• Europarl
• English 164K (morphology, POS)
• Swedish 130K
• Converted to numeric format
Pilot run
• 1,000-iteration burn-in
• a further 5,000 iterations for the estimates
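The slide does not spell out what "converted to numeric format" means; one plausible reading is recording, for each term, the gaps between its successive occurrences in a tokenized document. A hedged sketch of that preprocessing step (the encoding of first occurrences is an assumption):

```python
from collections import defaultdict

def term_gaps(tokens):
    """Turn a tokenized document into per-term gap sequences.

    For each term, record the gap from the document start to its first
    occurrence, then the gaps (in tokens) between successive occurrences.
    """
    last_seen = {}
    gaps = defaultdict(list)
    for pos, tok in enumerate(tokens):
        if tok in last_seen:
            gaps[tok].append(pos - last_seen[tok])
        else:
            gaps[tok].append(pos + 1)   # gap measured from the document start
        last_seen[tok] = pos
    return dict(gaps)

print(term_gaps("the cat sat on the mat and the dog sat".split()))
```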
Discussion Points
• Convergence
• Inclusion of POS and morphological analysis vs. more data
• How could context information be included?
• Does it have to be parallel?
• WSD vs. topicality and pseudo relevance feedback