
Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich

Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis. Oct. 28, 2010.


Presentation Transcript


  1. Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich. Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis. Oct. 28, 2010

  2. Revisions of “Topology” on Wikipedia. [Figure: 1st revision, 250th revision, current revision.]

  3. Observable Document Generation Process. Revision #i-1 (95th revision): “In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Roughly speaking, topology is the study of geometric objects without considering their dimensions.” Revision #i (96th revision): “In mathematics, '''topology''' is a branch concerned with the study of topological spaces. Topology is also concerned with the study of the so called topological properties of figures, that is to say properties that does not change under a bicontinuous one-to-one transformation (call homeomorphisms”

  4. How Revision History Analysis Could Help Retrieval. [Diagram: Revision History Analysis.]

  5. Selected Prior Work • J. Elsas and S. Dumais. Leveraging temporal dynamics of document content in relevance ranking. In Proc. of WSDM, 2010. • M. Efron. Linear time series models for term weighting in information retrieval. JASIST, 2010. • J. He, H. Yan, and T. Suel. Compact full-text indexing of versioned document collections. In Proc. of CIKM, New York, NY, USA, 2009.

  6. Revision History Analysis (RHA). Retrieval models: BM25, Language Model. RHA redefines term frequency (TF): • TF is a key indicator of document relevance • TF can be naturally integrated into ranking models

  7. Model 1: Steady Growth. First revision: “Topology, in mathematics, is both a structure used to capture the notions of continuity, connectedness and convergence, and the name of the branch of mathematics which studies these.” Current version: “Topology (from the Greek τόπος, “place”, and λόγος, “study”) is a major area of mathematics concerned with spatial properties that are preserved under continuous deformations of objects, for example ….. basic examples include compactness and connectedness”

  8. Model 1 (continued)

  9. RHA Global Model: Definition. [Formula; labeled parts: frequency of term in revision; decay factor.] Define the term frequency over the whole document generation process: • a document grows steadily over time • a term is relatively important if it appears in the early revisions.
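
The transcript does not preserve the global model formula, so the following is only a minimal sketch of the idea the slide describes: sum a term's frequency over all revisions and divide each contribution by a decay factor that grows with the revision index. The specific form `c(term, v_j) / j**alpha`, the parameter name `alpha`, and the helper `term_count` are assumptions made for illustration, not the authors' stated definition.

```python
from typing import Sequence


def term_count(term: str, revision_text: str) -> int:
    """Count occurrences of `term` in a lowercased, whitespace-tokenized revision."""
    return revision_text.lower().split().count(term.lower())


def tf_global(term: str, revisions: Sequence[str], alpha: float = 1.0) -> float:
    """Decayed term frequency summed over all revisions (oldest first).

    Assumed form: sum over j of c(term, v_j) / j**alpha, with j starting at 1,
    so occurrences in early revisions contribute more than later ones.
    """
    return sum(
        term_count(term, text) / (j ** alpha)
        for j, text in enumerate(revisions, start=1)
    )


revisions = [
    "topology is a branch of mathematics",
    "topology is a branch of mathematics about spaces",
    "topology studies spaces and continuity of spaces",
]
early = tf_global("topology", revisions, alpha=1.0)    # present since revision 1
late = tf_global("continuity", revisions, alpha=1.0)   # appears only in revision 3
print(early > late)
```

Under this assumed decay, a term that has been in the page since the first revision outweighs one that was added late, which matches the slide's second bullet.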

  10. But… Some pages are different: “Avatar (2009 film)”. [Figure: 1st revision, 500th revision, current revision.]

  11. Model 2: Bursty Growth. Burst of document (length) and change of term frequency. Burst of edit activity and associated events: first photo and trailer released; movie released. The global model might be insufficient.

  12. RHA Burst Model: Definition. [Formula; labeled parts: frequency of term in revision; decay factor for the jth burst.] A burst resets the decay clock for a term. The weight will decrease after a burst.
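
The burst model formula is also missing from the transcript. One way to read "a burst resets the decay clock" is sketched below: for every detected burst, the decay denominator restarts at 1 at the revision where the burst begins. The function name `tf_burst`, the parameter `beta`, and the exact form `counts[k] / (k - b + 1)**beta` are assumptions for illustration only.

```python
from typing import Sequence


def tf_burst(counts: Sequence[int], burst_starts: Sequence[int], beta: float = 1.0) -> float:
    """Decayed term frequency where each burst restarts the decay.

    counts[k - 1] is the term's count in revision k (1-based, oldest first).
    burst_starts holds the 1-based revision index at which each burst begins.
    Assumed form: sum over bursts b, then over revisions k >= b, of
    counts[k - 1] / (k - b + 1)**beta.
    """
    n = len(counts)
    total = 0.0
    for b in burst_starts:
        for k in range(b, n + 1):
            total += counts[k - 1] / ((k - b + 1) ** beta)
    return total


counts = [0, 0, 2, 2, 1]  # the term first appears in revision 3
no_reset = tf_burst(counts, burst_starts=[1], beta=1.0)    # decay runs from revision 1
with_reset = tf_burst(counts, burst_starts=[3], beta=1.0)  # decay restarts at revision 3
print(with_reset > no_reset)
```

With a single burst starting at revision 1 this sketch reduces to a plain decayed sum; placing the burst at revision 3 lets a term introduced during the burst keep a high weight instead of being penalized for arriving late.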

  13. Burst Detection (1): Content-Based. [Plot: content-based bursts for “Avatar”.] A large relative content change indicates a potential burst.
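
As a rough illustration of content-based detection, the sketch below flags a revision as a potential burst when its length changes by more than a chosen fraction relative to the previous revision. The threshold value, the use of absolute change, and the function name `content_bursts` are assumptions; the slide gives only the phrase "relative content change".

```python
from typing import List, Sequence


def content_bursts(lengths: Sequence[int], threshold: float = 0.1) -> List[int]:
    """Return 1-based revision indices whose relative length change exceeds `threshold`.

    lengths[j - 1] is the length (for example in tokens) of revision j, oldest first.
    """
    bursts = []
    for j in range(1, len(lengths)):
        prev, cur = lengths[j - 1], lengths[j]
        if prev == 0:
            continue  # relative change is undefined against an empty revision
        if abs(cur - prev) / prev > threshold:
            bursts.append(j + 1)
    return bursts


lengths = [100, 102, 105, 160, 162, 240]
print(content_bursts(lengths, threshold=0.2))  # → [4, 6]
```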

  14. Burst Detection (2): Activity-Based. [Plot: activity-based bursts for “Avatar”, showing average revision counts and deviation.] Intensive edit activity indicates potential bursts.
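
The slide mentions average revision counts and deviation, which suggests a threshold of the form "mean plus some multiple of the standard deviation". The sketch below implements that reading; the interval granularity, the multiplier `k`, and the function name `activity_bursts` are assumptions rather than details from the talk.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def activity_bursts(edits_per_interval: Sequence[int], k: float = 1.0) -> List[int]:
    """Return 0-based interval indices whose edit count exceeds mean + k * std."""
    mu = mean(edits_per_interval)
    sigma = pstdev(edits_per_interval)  # population standard deviation
    return [i for i, c in enumerate(edits_per_interval) if c > mu + k * sigma]


weekly_edits = [3, 4, 2, 30, 5, 3, 25, 4]
print(activity_bursts(weekly_edits, k=1.0))  # → [3, 6]
```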

  15. Burst Detection (3): Combined Model

  16. Putting It All Together: RHA Term Frequency, combining the global model and the burst model. RHA Term Frequency: [formula] The coefficients indicate the weights of the RHA global model, the burst model, and the original term frequency (probability).
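
Since the slide describes three weights, one natural reading is a linear interpolation of the global score, the burst score, and the original term frequency. The sketch below assumes exactly that, and additionally assumes the weights sum to 1; the function and parameter names are hypothetical.

```python
def tf_rha(tf_global: float, tf_burst: float, tf_original: float,
           w_global: float, w_burst: float, w_original: float) -> float:
    """Weighted sum of the three term-frequency signals.

    Assumption: the three weights form a convex combination (sum to 1).
    """
    if abs(w_global + w_burst + w_original - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    return w_global * tf_global + w_burst * tf_burst + w_original * tf_original


combined = tf_rha(2.0, 4.0, 3.0, w_global=0.3, w_burst=0.2, w_original=0.5)
print(round(combined, 6))
```

Setting `w_original=1.0` and the other weights to 0 recovers the unmodified term frequency, which makes the baseline a special case of the combined score.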

  17. Integrating RHA into Retrieval Models: BM25 + RHA; Statistical Language Models + RHA. RHA Term Probability: [formula]
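
To show what "BM25 + RHA" could look like in practice, the sketch below uses the standard Okapi BM25 scoring function and simply accepts any per-term frequency, so a revision-aware TF can be passed in place of the raw count. The IDF variant (with +1 inside the logarithm to keep it non-negative), the default `k1` and `b`, and all names are assumptions; the talk's exact formulation is not preserved in the transcript.

```python
import math
from typing import Iterable, Mapping


def bm25_score(query_terms: Iterable[str], tf: Mapping[str, float],
               df: Mapping[str, int], n_docs: int,
               doc_len: float, avg_doc_len: float,
               k1: float = 1.2, b: float = 0.75) -> float:
    """Okapi BM25 score of one document; `tf` may hold raw or RHA-adjusted frequencies."""
    score = 0.0
    for t in query_terms:
        f = tf.get(t, 0.0)
        if f <= 0 or t not in df:
            continue  # the term contributes nothing if absent from the document or index
        idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
        norm = f + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * f * (k1 + 1.0) / norm
    return score


df = {"topology": 50}
raw = bm25_score(["topology"], {"topology": 3.0}, df, n_docs=1000,
                 doc_len=120, avg_doc_len=100)
rha = bm25_score(["topology"], {"topology": 4.5}, df, n_docs=1000,
                 doc_len=120, avg_doc_len=100)
print(rha > raw)
```

Because BM25 is monotonic in the term frequency, a term whose revision-aware frequency is higher than its raw count raises the document's score, and one whose revision-aware frequency is lower reduces it.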

  18. Experimental Setup

  19. Datasets • INEX: 65 topics; Wiki dump; top 1000 retrieved articles; 1000 revisions for each article; corpus for INEX • TREC: 68 topics; top 1000 retrieved articles; 1000 revisions for each article; corpus for TREC • INEX: well-established forum for structured retrieval tasks (based on the Wikipedia collection) • TREC: performance comparison on a different set of queries, to test general applicability

  20. Results

  21. INEX Results. Parameters tuned on the INEX query set. [Results table and parameter values for BM25 and LM not preserved.]

  22. TREC Results. Parameters tuned on the INEX query set. ** indicates statistically significant differences at the 0.01 significance level (two-tailed paired t-test). [Results table and parameter values for BM25 and LM not preserved.] Lab members manually labeled the top 20 results for each topic.

  23. Performance Analysis. Performance improvements on bpref for BM25+RHA over the baseline (BM25), for INEX and TREC. INEX: significant improvement on 40% of queries. TREC: significant improvement on 37% of queries. Examples: “circus acts skills”, “olive oil health benefit” (+20% BM25, +11% LM improvement)

  24. Summary • RHA captures importance signal from document authoring process. • Introduced RHA term weighting approach • Natural integration with state of the art retrieval models. • Consistent improvement over baseline retrieval models

  25. Thank you! Using the Past to Score the Present: Extending Term Weighting Models with Revision History Analysis. Ablimit Aji, Yu Wang, Eugene Agichtein, Evgeniy Gabrilovich. Research partially supported by:

  26. Query Sets and Evaluation Metrics • Queries and labels: INEX (provided); TREC (subset of the ad-hoc track) • Metrics: bpref (robust to missing judgments); MAP (mean average precision); R-prec (precision at position R)
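
For readers unfamiliar with two of the metrics named on this slide, here is a minimal sketch of average precision (whose mean over queries is MAP) and R-precision, using their standard definitions. The document IDs and function names are made up for illustration, and bpref is left out to keep the example short.

```python
from typing import Dict, List, Sequence, Set


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision values at the ranks of relevant documents, divided by |relevant|."""
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at position R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranking[:r] if doc in relevant) / r


def mean_average_precision(runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]) -> float:
    """Average of per-query average precision over all queries in `runs`."""
    if not runs:
        return 0.0
    return sum(average_precision(runs[q], qrels.get(q, set())) for q in runs) / len(runs)


ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d6"}
ap = average_precision(ranking, relevant)
rp = r_precision(ranking, relevant)
print(round(ap, 4), round(rp, 4))
```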

  27. RHA in Statistical Language Models • [formula] (Global Model) • [formula] (Burst Model)

  28. Cross Validation on INEX. [Tables: 5-fold cross validation on the INEX 2008 query set; 5-fold cross validation on the INEX 2009 query set.]
