Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich. Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis. Oct. 28, 2010.
Revisions of “Topology” on Wikipedia [Screenshots: 1st revision, 250th revision, current revision]
Observable Document Generation Process
95th revision (#i-1): In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.
96th revision (#i): In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms
How Revision History Analysis Could Help Retrieval [Diagram: Revision History Analysis]
Selected Prior Work
• J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010.
• M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010.
• J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In Proc. of CIKM, 2009.
Revision History Analysis (RHA)
RHA redefines term frequency (TF):
- TF is a key indicator of document relevance
- TF can be naturally integrated into ranking models (BM25, language model)
Model 1: Steady Growth
First revision: Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.
Current version: Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example ….. basic examples include compactness and connectedness
RHA Global Model: Definition
Define the term frequency over the whole document generation process. The formula combines the frequency of the term in each revision with a decay factor.
• A document grows steadily over time.
• A term is relatively important if it appears in the early revisions.
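The global model can be illustrated with a minimal sketch. The slide names only a per-revision term frequency and a decay factor, so the power-law form `1 / j**alpha`, the function name `rha_global_tf`, and the default `alpha` are assumptions made here for illustration, not the authors' stated formula.

```python
from typing import Sequence


def rha_global_tf(term_counts: Sequence[int], alpha: float = 1.0) -> float:
    """Decay-weighted term frequency over all revisions of one document.

    term_counts[j - 1] is the raw count of the term in the j-th revision,
    oldest revision first. The 1 / j**alpha decay is an assumed form:
    counts in early revisions are discounted less than counts in late ones.
    """
    return sum(count / (j ** alpha) for j, count in enumerate(term_counts, start=1))


if __name__ == "__main__":
    early = rha_global_tf([3, 0, 0])  # term present only in the first revision
    late = rha_global_tf([0, 0, 3])   # term present only in the latest revision
    print(early, late)
```

Under this assumed decay, a term that appears in the first revision contributes more than the same count appearing only in the latest revision, which matches the intuition stated on the slide.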
But… Some pages are different: “Avatar (2009 film)” [Screenshots: 1st revision, 500th revision, current revision]
Model 2: Bursty Growth
[Figure: burst of document length and change of term frequency; burst of edit activity and associated events (first photo and trailer released, movie released)]
The global model might be insufficient.
RHA Burst Model: Definition
The formula combines the frequency of the term in each revision with a decay factor for the jth burst.
• A burst resets the decay clock for a term.
• The weight will decrease after a burst.
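A minimal sketch of how a burst might reset the decay clock. The slide does not give the exact formula, so the power-law decay measured from each burst's first revision, the single shared `beta`, and the name `rha_burst_tf` are assumptions for illustration.

```python
from typing import Sequence


def rha_burst_tf(term_counts: Sequence[int],
                 burst_starts: Sequence[int],
                 beta: float = 1.0) -> float:
    """Burst-aware term frequency for one term in one document.

    term_counts[k - 1] is the raw count of the term in the k-th revision.
    burst_starts holds 1-based revision indices where a burst begins.
    For each burst, the decay clock restarts at the burst's first revision
    (assumed form: 1 / (k - b + 1)**beta), so the weight is highest right
    at the burst and decreases afterwards.
    """
    total = 0.0
    for b in burst_starts:
        for k in range(b, len(term_counts) + 1):
            total += term_counts[k - 1] / ((k - b + 1) ** beta)
    return total


if __name__ == "__main__":
    # The term first appears during a burst that starts at revision 3.
    print(rha_burst_tf([0, 0, 4, 4], burst_starts=[3]))
```

With no detected bursts the sketch returns 0, leaving the global model and the original term frequency to carry the weight.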
Burst Detection (1): Content-Based
A large relative content change marks a potential burst.
[Figure: content-based bursts for “Avatar”]
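Content-based detection could be sketched as follows. Measuring content change by document length and the `threshold` value are assumptions chosen for illustration; the slide only says that a relative content change signals a potential burst.

```python
from typing import List, Sequence


def content_bursts(lengths: Sequence[int], threshold: float = 0.1) -> List[int]:
    """Return 1-based revision indices whose relative length change is large.

    lengths[i] is the length of the (i + 1)-th revision, oldest first.
    A revision is flagged when |len_i - len_prev| / len_prev exceeds the
    (hypothetical) threshold.
    """
    bursts = []
    for i in range(1, len(lengths)):
        prev = lengths[i - 1]
        if prev == 0:
            continue  # relative change is undefined for an empty revision
        if abs(lengths[i] - prev) / prev > threshold:
            bursts.append(i + 1)
    return bursts


if __name__ == "__main__":
    print(content_bursts([100, 102, 150, 151]))
```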
Burst Detection (2): Activity-Based
Intensive edit activity, judged against the average revision count and its deviation, marks potential bursts.
[Figure: activity-based bursts for “Avatar”]
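Activity-based detection might look like the sketch below. The slide mentions average revision counts and a deviation, so the rule "count exceeds mean plus `k` standard deviations", the per-window bucketing, and the value of `k` are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def activity_bursts(revision_counts: Sequence[int], k: float = 1.0) -> List[int]:
    """Return 0-based indices of time windows with unusually many revisions.

    revision_counts[i] is the number of revisions made in the i-th time
    window (for example, one day or one week). A window is flagged when its
    count exceeds mean + k * standard deviation (assumed rule).
    """
    if not revision_counts:
        return []
    mu = mean(revision_counts)
    sigma = pstdev(revision_counts)
    return [i for i, c in enumerate(revision_counts) if c > mu + k * sigma]


if __name__ == "__main__":
    print(activity_bursts([2, 3, 2, 30, 2, 3]))
```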
Putting It All Together: RHA Term Frequency
RHA term frequency combines the global model and the burst model with the original term frequency. Three weights control the contributions of the RHA global model, the burst model, and the original term frequency (probability).
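The combination can be sketched as a weighted sum. The slide says only that three weights are attached to the global model, the burst model, and the original term frequency; the requirement that the weights sum to 1 and the name `rha_tf` are assumptions made here.

```python
from typing import Tuple


def rha_tf(tf_global: float,
           tf_burst: float,
           tf_original: float,
           weights: Tuple[float, float, float]) -> float:
    """Linear combination of the three term-frequency signals.

    weights = (w_global, w_burst, w_original). Requiring them to sum to 1
    is an assumption of this sketch, not something the slide states.
    """
    w_global, w_burst, w_original = weights
    if abs(w_global + w_burst + w_original - 1.0) > 1e-9:
        raise ValueError("weights are expected to sum to 1 in this sketch")
    return w_global * tf_global + w_burst * tf_burst + w_original * tf_original


if __name__ == "__main__":
    print(rha_tf(2.0, 4.0, 6.0, weights=(0.5, 0.25, 0.25)))
```

Setting the first two weights to 0 recovers the original term frequency, so the baseline model is a special case of the combined one.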
Integrating RHA into Retrieval Models
• BM25 + RHA
• Statistical Language Models + RHA (RHA term probability)
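A minimal sketch of the BM25 integration, assuming the RHA term frequency simply replaces the raw term frequency in one common BM25 variant. The IDF form and the defaults `k1=1.2` and `b=0.75` are widely used conventions, not the tuned values from the talk.

```python
import math


def bm25_term_score(tf: float,
                    doc_len: float,
                    avg_doc_len: float,
                    n_docs: int,
                    doc_freq: int,
                    k1: float = 1.2,
                    b: float = 0.75) -> float:
    """BM25 contribution of one query term to one document's score.

    tf may be a raw count or, as assumed in this sketch, a real-valued RHA
    term frequency. The "+ 1" inside the log keeps the IDF non-negative.
    """
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


if __name__ == "__main__":
    # Hypothetical numbers: the same document scored with two TF values.
    print(bm25_term_score(1.0, 500, 400, 10_000, 120))
    print(bm25_term_score(2.5, 500, 400, 10_000, 120))
```

Because BM25 only requires a non-negative term frequency, a real-valued weight can be passed in without changing the rest of the ranking function.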
Datasets
• INEX: 65 topics; Wiki dump; top 1000 retrieved articles; 1000 revisions for each article; corpus for INEX
• TREC: 68 topics; top 1000 retrieved articles; 1000 revisions for each article; corpus for TREC
• INEX: a well-established forum for structured retrieval tasks (based on the Wikipedia collection)
• TREC: performance comparison on a different set of queries, to test general applicability
INEX Results
Parameters tuned on the INEX query set.
[Table: results and the tuned BM25 and LM parameter values were not recovered]
TREC Results
Parameters tuned on the INEX query set. ** indicates statistically significant differences at the 0.01 significance level with a two-tailed paired t-test. Lab members manually labeled the top 20 results for each topic.
[Table: results and the BM25 and LM parameter values were not recovered]
Performance Analysis
Performance improvements on bpref for BM25+RHA over the baseline (BM25):
• INEX: significant improvement on 40% of queries
• TREC: significant improvement on 37% of queries
• Examples: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)
Summary
• RHA captures an importance signal from the document authoring process.
• Introduced the RHA term weighting approach.
• Natural integration with state-of-the-art retrieval models.
• Consistent improvement over baseline retrieval models.
Thank you!
Research partially supported by: [sponsor logos]
Query Sets and Evaluation Metrics
• Queries and labels:
  • INEX: provided
  • TREC: subset of the ad-hoc track
• Metrics:
  • Bpref: robust to missing judgments
  • MAP: mean average precision
  • R-prec: precision at position R
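The three metrics can be sketched per query as below. The bpref variant shown normalizes by min(R, N), as in common evaluation tools, which is an assumption since the slide does not say which variant was used; MAP is the mean of `average_precision` over all queries.

```python
from typing import Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of precision values at the ranks of retrieved relevant docs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at position R, where R is the number of relevant docs."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def bpref(ranking: Sequence[str], relevant: Set[str], nonrelevant: Set[str]) -> float:
    """Bpref: penalize each relevant doc by judged non-relevant docs above it.

    Unjudged documents are skipped, which is what makes the metric robust
    to missing judgments.
    """
    r, n = len(relevant), len(nonrelevant)
    if r == 0:
        return 0.0
    bound = min(r, n)
    nonrel_seen, total = 0, 0.0
    for doc in ranking:
        if doc in relevant:
            total += 1.0 if bound == 0 else 1.0 - min(nonrel_seen, bound) / bound
        elif doc in nonrelevant:
            nonrel_seen += 1
    return total / r


if __name__ == "__main__":
    run = ["a", "x", "n1", "b"]  # "x" is an unjudged document
    print(average_precision(run, {"a", "b"}))
    print(r_precision(run, {"a", "b"}))
    print(bpref(run, {"a", "b"}, {"n1", "n2"}))
```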
RHA in Statistical Language Models • (Global Model) • (Burst Model)
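One way the language-model variant might be realized is sketched below. The formulas on this slide were not recovered, so normalizing each signal into a distribution over the document's terms and mixing the three distributions with weights that sum to 1 are assumptions for illustration only.

```python
from typing import Dict, Tuple


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Turn non-negative term weights into a probability distribution."""
    total = sum(weights.values())
    if total <= 0:
        return {term: 0.0 for term in weights}
    return {term: w / total for term, w in weights.items()}


def rha_term_probability(global_tf: Dict[str, float],
                         burst_tf: Dict[str, float],
                         raw_tf: Dict[str, float],
                         weights: Tuple[float, float, float]) -> Dict[str, float]:
    """Mix global-model, burst-model, and original term probabilities.

    weights = (w_global, w_burst, w_original), assumed to sum to 1.
    """
    p_global, p_burst, p_raw = normalize(global_tf), normalize(burst_tf), normalize(raw_tf)
    w_global, w_burst, w_original = weights
    terms = set(p_global) | set(p_burst) | set(p_raw)
    return {
        t: w_global * p_global.get(t, 0.0)
           + w_burst * p_burst.get(t, 0.0)
           + w_original * p_raw.get(t, 0.0)
        for t in terms
    }


if __name__ == "__main__":
    probs = rha_term_probability(
        {"topology": 3.0, "space": 1.0},
        {"topology": 1.0, "space": 1.0},
        {"topology": 1.0, "space": 3.0},
        weights=(0.5, 0.25, 0.25),
    )
    print(probs)
```

The resulting document model would still need smoothing against a collection model before it is used for query-likelihood ranking; that step is left out of the sketch.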
Cross Validation on INEX
• 5-fold cross validation on the INEX 2008 query set
• 5-fold cross validation on the INEX 2009 query set