Using TF-IDF to Determine Word Relevance in Document Queries

Juan Ramos, juramos@cs.rutgers.edu
Department of Computer Science, Rutgers University, 23515 BPO Way, Piscataway, NJ, 08855

Information Retrieval Problem
• Given a corpus D and a query $q = w_1, w_2, \ldots, w_n$, return the documents d that maximize $\Pr(d \mid q, D)$.
• The problem is easy to dismiss given the widespread use of query retrieval today (web searches, database management, etc.).

Approaches to Ad Hoc Retrieval
• Probability and Statistics
  • Naïve Bayes
  • Approaches include the user’s mindset.
• Vector Models
  • Latent Semantic Indexing
  • Reduce the n-dimensional vector space of documents
  • Return documents whose distance to the query is small
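The vector-model idea of returning documents whose distance to the query is small can be sketched as follows. This is only an illustration under stated assumptions: documents and queries are treated as raw word-count vectors, the distance is cosine distance, and the dimensionality-reduction step of Latent Semantic Indexing is skipped entirely. The names `cosine_distance` and `nearest_documents` are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """Cosine distance between two sparse word-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty vector is treated as maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def nearest_documents(corpus, query, top_k=2):
    """Return indices of the top_k documents closest to the query."""
    q_vec = Counter(query.split())
    dists = [cosine_distance(Counter(doc.split()), q_vec) for doc in corpus]
    return sorted(range(len(corpus)), key=lambda i: dists[i])[:top_k]
```

A full LSI pipeline would first project these vectors into a lower-dimensional space before measuring distance; that step is left out here.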

TF-IDF Weighting Scheme
• Given a corpus D, a word w, and a document d, calculate $w_d = f_{w,d} \cdot \log(|D| / f_{w,D})$
• Many varieties of the basic mathematical scheme exist
• Procedure
  • Scan each d, compute each $w_{i,d}$, and return the set D’ that maximizes $\sum_i w_{i,d}$
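The procedure above can be sketched in a few lines of Python. This is a minimal sketch, not the implementation used in the experiment. It assumes that $f_{w,d}$ is the number of times w occurs in d, that $f_{w,D}$ is the number of documents in D containing w, and that words are split on whitespace. The function names and the `top_k` parameter are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(corpus, query):
    """Score each document as the sum over query words of
    f_{w,d} * log(|D| / f_{w,D})."""
    tokenized = [doc.split() for doc in corpus]
    n_docs = len(tokenized)

    # f_{w,D}: number of documents that contain w (assumed definition).
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)  # f_{w,d} for every word in this document
        score = 0.0
        for w in query.split():
            if doc_freq[w] == 0:
                continue  # word absent from the corpus: contributes nothing
            score += term_freq[w] * math.log(n_docs / doc_freq[w])
        scores.append(score)
    return scores


def rank(corpus, query, top_k=3):
    """Return indices of the top_k highest-scoring documents."""
    scores = tf_idf_scores(corpus, query)
    order = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]
```

Note that a word occurring in every document gets weight $\log(1) = 0$, so it contributes nothing to the ranking regardless of how often it appears.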

Experiment
• Documents from the Linguistic Data Consortium’s United Nations Parallel Text Corpus
• Support noise by enforcing case-sensitivity and not parsing SGML symbols
• Brute-force approach: consider only $f_{w,d}$
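For comparison, the brute-force baseline that considers only $f_{w,d}$ might look like the sketch below. It assumes whitespace tokenization and, in line with the setup above, keeps matching case-sensitive and leaves markup tokens in place. The name `raw_tf_scores` is hypothetical.

```python
from collections import Counter


def raw_tf_scores(corpus, query):
    """Score each document by the raw count of query words it contains."""
    query_words = query.split()
    scores = []
    for doc in corpus:
        # Case-sensitive counts; markup-like tokens are not stripped.
        counts = Counter(doc.split())
        scores.append(sum(counts[w] for w in query_words))
    return scores
```

With the corpus `["The cat sat", "the the the dog", "<p> the cat"]` and the query `"the cat"`, the second document scores highest purely because it repeats a common word, which is the weakness the inverse-document-frequency factor is meant to offset.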

Results

Extensions and Further Research
• Genetic TF-IDF: evolve weighting schemes that compete with TF-IDF.
• Hill-climbing and gradient-descent TF-IDF.
• Cross-language settings: return documents in a different language than the query.

