A Statistical Comparison of Tag and Query Logs
Mark J. Carman, Robert Gwadera, Fabio Crestani, and Mark Baillie
SIGIR 2009
June 4, 2010, Hyunwoo Kim
Contents: Introduction; Building a Dataset; Are the Distributions Similar?; Investigating Website Content; Conclusion
Introduction tags
Introduction
• Questions
1. Are queries and tags similar across URLs?
2. Can tag data be used to approximate user queries to a search engine?
3. Can query logs be used to suggest new tags for a particular webpage?
4. For what types of websites is the correlation between the term distributions for queries and tags the highest?
5. Which of the distributions, tags or queries, is most closely related to the content of the clicked websites?
Building a Dataset
• AOL query log
• Sizable
• Recent (2006)
• English queries
• Available to academic researchers
• 657,426 users
• A period of 3 months, from March to May 2006
• Delicious tags
• Collaborative tagging system
• Final dataset: 4145 complete URLs
• Google query, stemming, pruning
Are the Distributions Similar? http://www.nytimes.com tags or
Are the Distributions Similar? Kullback-Leibler divergence
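As a reminder of what the measure on this slide computes: for two term distributions p and q over the same vocabulary, the Kullback-Leibler divergence is D(p || q) = sum over terms t of p(t) * log(p(t) / q(t)). It is zero when the distributions are identical and is not symmetric. The sketch below is only an illustration: the toy term lists, the helper names `term_distribution` and `kl_divergence`, and the add-one smoothing are assumptions made here, not details taken from the slides or the paper.

```python
import math
from collections import Counter


def term_distribution(terms, vocabulary, alpha=1.0):
    """Smoothed relative term frequencies over a shared vocabulary.

    alpha is an assumed additive-smoothing constant; it keeps every
    probability strictly positive so the log ratio below is defined.
    """
    counts = Counter(terms)
    total = len(terms) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}


def kl_divergence(p, q):
    """D(p || q) in bits; p and q must share keys and q[t] must be > 0."""
    return sum(p[t] * math.log2(p[t] / q[t]) for t in p if p[t] > 0)


# Hypothetical terms for a single URL (made up for illustration).
query_terms = ["new", "york", "times", "news", "nytimes"]
tag_terms = ["news", "newspaper", "nytimes", "daily", "news"]

vocabulary = sorted(set(query_terms) | set(tag_terms))
p_query = term_distribution(query_terms, vocabulary)
p_tag = term_distribution(tag_terms, vocabulary)

# The two directions generally give different values.
print("D(query || tag):", kl_divergence(p_query, p_tag))
print("D(tag || query):", kl_divergence(p_tag, p_query))
```

Without smoothing, any term that occurs in the queries but never in the tags would make the divergence infinite, which is one reason a bounded, symmetric measure is often preferred in practice.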
Are the Distributions Similar?
Vq: query logs; Vr: tags
• Jensen-Shannon divergence
• Symmetric measure
• Overlap coefficient
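The two measures named on this slide can be sketched as follows. The Jensen-Shannon divergence averages the Kullback-Leibler divergences of each distribution from their mixture m = (p + q) / 2, which makes it symmetric and, with base-2 logarithms, bounded between 0 and 1. For the overlap coefficient, the code assumes the standard set definition |A ∩ B| / min(|A|, |B|) applied to the two vocabularies; the slide does not show the formula, so treat that as an assumption. The function names and the toy distributions are likewise illustrative only.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two term distributions.

    p and q map terms to probabilities; a term missing from one
    distribution is treated as having probability 0 there.
    """
    terms = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in terms}

    def kl_to_mixture(a):
        # m[t] > 0 whenever a[t] > 0, so the ratio is always defined.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def overlap_coefficient(vocab_a, vocab_b):
    """|A intersect B| / min(|A|, |B|); returns 0.0 if either set is empty."""
    a, b = set(vocab_a), set(vocab_b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical distributions for one URL (made up for illustration).
p_query = {"news": 0.5, "nytimes": 0.3, "times": 0.2}
p_tag = {"news": 0.6, "newspaper": 0.3, "nytimes": 0.1}

print("JS divergence:", js_divergence(p_query, p_tag))
print("Vocabulary overlap:", overlap_coefficient(p_query, p_tag))
```

Because the mixture is nonzero wherever either distribution is, this measure needs no smoothing, unlike a direct Kullback-Leibler comparison.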
Are the Distributions Similar? Open Directory Project
Conclusion
• Similarity between query terms and tags
• Vocabularies contain a large amount of overlap
• Term frequency distributions are correlated
• Similarity is not dependent on the topic area
• Queries are more similar to content than tags are
• Queries and tags are more similar to one another than to content
• Future work
• Models for automatically removing noise from the tag and query logs
• Techniques for predicting useful tags from query distributions
• Techniques for the effective use of tag data to improve different forms of Web search