Using TF-IDF to Determine Word Relevance in Document Queries
Juan Ramos
juramos@cs.rutgers.edu
Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855
Information Retrieval Problem
• Given a corpus D and a query q = w_1, w_2, …, w_n, return the documents d that maximize Pr(d | q, D).
• The problem is easy to dismiss given the widespread use of query retrieval today (web searches, database management, etc.).
Approaches to Ad Hoc Retrieval
• Probability and Statistics
  • Naïve Bayes
  • Approaches include the user’s mindset.
• Vector Models
  • Latent Semantic Indexing
  • Reduce the n-dimensional vector space of documents
  • Return documents whose distance to the query is small
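The last vector-model bullet (return documents whose distance to the query is small) can be illustrated with a minimal sketch. This is not Latent Semantic Indexing: it skips the dimensionality-reduction step and uses raw term-count vectors with cosine distance. The function names, the whitespace tokenization, and the choice of cosine distance are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def nearest_documents(corpus, query, k=1):
    """Return indices of the k documents closest to the query (hypothetical helper)."""
    q_vec = Counter(query.split())  # whitespace tokenization is an assumption
    doc_vecs = [Counter(doc.split()) for doc in corpus]
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine_distance(doc_vecs[i], q_vec))
    return ranked[:k]
```

An LSI-style system would first project the document and query vectors into a lower-dimensional space and then apply the same kind of distance ranking there.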
TF-IDF Weighting Scheme
• Given corpus D, word w, and document d, calculate w_d = f_{w,d} * log(|D| / f_{w,D})
• Many varieties of the basic mathematical scheme
• Procedure
  • Scan each d, compute each w_{i,d}, and return the set D’ that maximizes Σ_i w_{i,d}
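A minimal sketch of the procedure above, assuming the usual TF-IDF reading in which f_{w,d} is the number of times w occurs in d and f_{w,D} is the number of documents in D that contain w (the slide does not spell these out). The names `tf_idf_scores` and `retrieve`, the whitespace tokenization, the handling of words absent from the corpus, and the interpretation of D’ as the top-k documents by summed weight are all assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf_scores(corpus, query):
    """Sum w_{i,d} = f_{w_i,d} * log(|D| / f_{w_i,D}) over the query words, per document."""
    term_counts = [Counter(doc.split()) for doc in corpus]  # f_{w,d} for each d
    n_docs = len(corpus)                                    # |D|

    # f_{w,D}: number of documents in which each word appears
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())

    scores = []
    for counts in term_counts:
        total = 0.0
        for w in query.split():
            if doc_freq[w] == 0:
                continue  # assumption: a word absent from the corpus contributes nothing
            total += counts[w] * math.log(n_docs / doc_freq[w])
        scores.append(total)
    return scores


def retrieve(corpus, query, k=1):
    """Return indices of the k highest-scoring documents (one reading of D')."""
    scores = tf_idf_scores(corpus, query)
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Note that a word appearing in every document gets log(|D| / |D|) = 0, so it adds nothing to any document's score; a word concentrated in few documents gets a larger weight.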
Experiment
• Documents from the Linguistic Data Consortium’s United Nations Parallel Text Corpus
• Support noise by enforcing case sensitivity and not parsing SGML symbols
• Brute-force approach: consider only f_{w,d}
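The brute-force baseline in the last bullet can be sketched as follows, assuming it ranks documents by the summed raw counts f_{w,d} of the query words with no inverse-document-frequency factor. The function name, the whitespace tokenization, and the toy corpus in the usage note are illustrative assumptions, not the data or code used in the experiment.

```python
from collections import Counter


def term_frequency_scores(corpus, query):
    """Score each document by summing raw counts f_{w,d} of the query words (no IDF)."""
    scores = []
    for doc in corpus:
        # Case-sensitive on purpose: "The" and "the" are counted as different words.
        counts = Counter(doc.split())
        scores.append(sum(counts[w] for w in query.split()))
    return scores
```

On a toy corpus such as `["the the the the report", "the treaty on trade", "The treaty text"]` with the query `"the treaty"`, this returns `[4, 2, 1]`: the document that merely repeats "the" outranks both documents that contain "treaty", and the capitalized "The" in the third document is not matched at all.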
Extensions and Further Research
• Genetic TF-IDF: evolve weighting schemes that compete with TF-IDF.
• Hill-climbing and gradient-descent TF-IDF.
• Cross-language settings: return documents in a different language from the query.