Tutorial #3: Retrieval models
Retrieval models
Retrieval models match a query against documents in order to:
• separate documents into relevant and non-relevant classes
• rank the documents by relevance
Common retrieval models:
• Boolean model
• Vector space model (VSM)
• Probabilistic models
Boolean model
• The Boolean model is the most common exact-match model.
• Queries are logical expressions with document features (such as index terms) as operands.
• In the pure Boolean model, retrieved documents are not ranked.
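Example
• Query over the document sets of three terms: {D7} OR ({D1, D2, D5} AND {D2, D4, D5, D6, D8})
• The AND step intersects the two sets: {D7} OR {D2, D5}
• The OR step takes the union, so the retrieved set is {D2, D5, D7}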
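The Boolean example can be sketched with Python sets, where AND is set intersection and OR is set union. This is only an illustration: the term names `term_a`, `term_b`, and `term_c` are hypothetical placeholders, since the example gives the document sets but not the terms behind them.

```python
# Hypothetical inverted index: the term names are placeholders,
# the document sets are the ones from the example.
index = {
    "term_a": {"D7"},
    "term_b": {"D1", "D2", "D5"},
    "term_c": {"D2", "D4", "D5", "D6", "D8"},
}


def boolean_and(left, right):
    """AND keeps only documents present in both sets."""
    return left & right


def boolean_or(left, right):
    """OR keeps documents present in either set."""
    return left | right


# Query: term_a OR (term_b AND term_c)
and_result = boolean_and(index["term_b"], index["term_c"])
retrieved = boolean_or(index["term_a"], and_result)

print(sorted(and_result))  # ['D2', 'D5']
print(sorted(retrieved))   # ['D2', 'D5', 'D7']
```

The result is an unordered set, which matches the point that a pure Boolean model does not rank what it retrieves.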
Vector space model (VSM)
• Documents and queries are represented as vectors:
  d_j = (w_{1,j}, w_{2,j}, ..., w_{t,j})
  q = (w_{1,q}, w_{2,q}, ..., w_{t,q})
• Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero.
Vector space model (VSM)
• Several ways of computing these values, also known as (term) weights, have been developed.
• One of the best-known schemes is tf-idf (term frequency-inverse document frequency) weighting.
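Several formulations of tf-idf exist. One common variant, assumed here for illustration, multiplies the frequency of term i in document j by the log of the inverse document frequency. The symbols tf_{i,j}, N, and n_i are introduced for this sketch; w_{i,j} follows the vector notation used for d_j.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log\frac{N}{n_i}
% tf_{i,j}: number of occurrences of term i in document j
% N:        total number of documents in the collection
% n_i:      number of documents that contain term i
```

Under this variant a term that appears in every document gets weight log(1) = 0, so it contributes nothing to the ranking.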
Example
• Documents:
  D0: 'How to Bake Bread Without Recipes'
  D1: 'The Classic Art of Viennese Pastry'
  D2: 'Numerical Recipes: The Art of Scientific Computing'
  D3: 'Breads, Pastries, Pies and Cakes : Quantity Baking Recipes'
  D4: 'Pastry: A Book of Best French Recipe'
• Keywords (stemmed): ['bak', 'recipe', 'bread', 'cake', 'pastr', 'pie']
• Query: "baking bread"
• The 6 keywords and 5 documents yield a 6 x 5 term-document matrix (terms x documents).
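A minimal sketch of how the 6 x 5 term-document matrix and a ranking for the query could be computed. Two simplifications are assumptions of this sketch rather than part of the tutorial: prefix matching against the keyword stems stands in for a real stemmer, and raw term counts are used as weights (tf-idf weights could be substituted). Cosine similarity is used as the ranking function; the helper names are hypothetical.

```python
import math
import re

documents = [
    "How to Bake Bread Without Recipes",
    "The Classic Art of Viennese Pastry",
    "Numerical Recipes: The Art of Scientific Computing",
    "Breads, Pastries, Pies and Cakes : Quantity Baking Recipes",
    "Pastry: A Book of Best French Recipe",
]
keywords = ["bak", "recipe", "bread", "cake", "pastr", "pie"]


def term_vector(text):
    """Count, for each keyword stem, the words in text that start with it."""
    words = re.findall(r"[a-z]+", text.lower())
    return [sum(word.startswith(stem) for word in words) for stem in keywords]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One vector per document, then transpose to get rows = terms, columns = documents.
doc_vectors = [term_vector(doc) for doc in documents]
tdm = [list(row) for row in zip(*doc_vectors)]

query_vector = term_vector("baking bread")
ranking = sorted(
    ((f"D{j}", cosine(query_vector, vec)) for j, vec in enumerate(doc_vectors)),
    key=lambda pair: pair[1],
    reverse=True,
)

for stem, row in zip(keywords, tdm):
    print(f"{stem:<7}{row}")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these weights, D0 ranks first (about 0.816) and D3 second (about 0.577). D1, D2, and D4 share no term with the query and score 0.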
VSM Implementation
• VSMranker.java ranks documents for a query.
• It provides functions for developing different user interfaces.
• Standalone usage requires document and query TDMs (term-document matrices):
  java -cp ../java VSMranker cacm.tdm query.tdm 7
• This retrieves the top 7 documents for the CACM queries.
References:
• http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/vector.pdf
• http://www.ccs.neu.edu/home/jaa/CSG339.06F/Lectures/boolean.pdf