The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. Retrieval models {week 13}. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.
Retrieval models (i) • A retrieval model is a formal (mathematical) representation of the process of matching a query and a document • Forms the basis of ranking results • [figure: a user's query terms matched (?) against a collection of documents: doc 123, doc 234, doc 257, doc 345, doc 455, doc 567, doc 678, doc 789, doc 881, doc 913, doc 972]
Retrieval models (ii) • Goal: Retrieve exactly the documents that users want (whether they know it or not!) • A good retrieval model finds documents that are likely to be considered relevant by the user submitting the query (i.e. user relevance) • A good retrieval model also often considers topical relevance
Topical relevance • Given a query, topical relevance identifies documents judged to be on the same topic • Even though keyword-based document scores might show a lack of relevance! • [figure: topics related to the query "Abraham Lincoln": Civil War, U.S. Presidents, Tall Guys with Beards, Stovepipe Hats]
User relevance • User relevance is difficult to quantify because of each user's subjectivity • Humans often have difficulty explaining why one document is more relevant than another • Humans may disagree about a given document's relevance in relation to the same query
Boolean retrieval model (i) • In the Boolean retrieval model, there are exactly two possible outcomes for query processing: • TRUE (an exact match of the query specification) • FALSE (otherwise) • Ranking is nonexistent • Each matching document has a score of 1
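The exact-match behavior above can be sketched with a toy inverted index; the document texts and query terms here are illustrative, not from the slides:

```python
# A minimal sketch of Boolean (AND) retrieval over a toy inverted index.
docs = {
    1: "tropical fish live in tropical aquariums",
    2: "fish tank maintenance",
    3: "presidents of the united states",
}

# Build an inverted index: term -> set of document IDs containing it
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(query_terms):
    """Return IDs of documents containing ALL query terms (each scores 1)."""
    postings = [index.get(t, set()) for t in query_terms]
    return set.intersection(*postings) if postings else set()

print(sorted(boolean_and(["tropical", "fish"])))  # [1]
print(sorted(boolean_and(["fish"])))              # [1, 2]
```

Every matching document is equally a "hit" (score 1), which is why results end up ordered by some extrinsic key such as date or title.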
Boolean retrieval model (ii) • Often the goal is to reduce the number of search results down to a manageable size • Typically called searching by numbers • Given a small enough set of results, human users can continue their search manually • Still a useful strategy, but the "best" results may be omitted
Boolean retrieval model (iii) • Advantages: • Results are predictable and explainable • Efficient and easy to implement • Disadvantages: • Query results are essentially unranked (instead ordered by date or title) • Effectiveness of query results depends entirely on the user's ability to formulate the query
Vector space model (i) • The vector space model is a decades-old IR approach for implementing term weighting and document ranking • Documents are represented as vectors Di in a t-dimensional vector space • Each element dij represents the weight of term j in document i • t is the number of index terms
Vector space model (ii) • Given n documents, we can use an n × t matrix to represent all term weights: • D1 = (d11, d12, …, d1t) • D2 = (d21, d22, …, d2t) • … • Dn = (dn1, dn2, …, dnt)
Vector space model (iii) • In the simplest case, the term weights are the term counts in each document
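A term-count term-document matrix like the one on this slide can be built in a few lines; the vocabulary and documents below are illustrative:

```python
# Sketch of a term-document matrix whose entries are raw term counts.
from collections import Counter

docs = [
    "tropical fish include fish found in tropical environments",
    "fish tropical fish",
    "aquarium fish",
]

# Index terms: the union of all document vocabularies, in sorted order
terms = sorted({t for d in docs for t in d.split()})

# Row i holds the weights d_ij (here: term counts) for document D_i
matrix = []
for d in docs:
    counts = Counter(d.split())
    matrix.append([counts[t] for t in terms])

print(terms)
for row in matrix:
    print(row)
```

Each row is one document vector Di; each column corresponds to one of the t index terms.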
Vector space model (iv) • Query Q is represented by a t-dimensional vector of weights: Q = (q1, q2, …, qt) • Each qj is the weight of term j in the query
Vector space model (v) • Given the query "tropical fish," query vector Qa is below: • Qa = (0 0 0 1 0 0 0 0 0 0 1) • Qb = (1 0 1 0 0 0 0 0 1 0 0) • Qc = (0 0 0 0 0 1 0 0 0 1 0) • What do query vectors Qb and Qc represent?
Vector space model (vi) • Conceptually, the document vector closest to the query vector is the most relevant • In reality, the distance function is not a good measure of relevance • Use a similarity measure instead (and maximize it) • First, think normalization
Cosine correlation (i) • The cosine correlation measures the cosine of the angle between query and document vectors • Normalize vectors such that all documents and queries are of equal length
Cosine correlation (ii) • The cosine function is shown in blue in the figure at: http://en.wikipedia.org/wiki/File:Sine_cosine_one_period.svg
Cosine correlation (iii) • Given document Di and query Q, the cosine measure is given by: • Cosine(Di, Q) = Σj (dij × qj) / sqrt(Σj dij² × Σj qj²) • Normalization occurs in the denominator
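The cosine measure can be sketched directly from its definition; the example vectors are illustrative:

```python
import math

# Sketch of the cosine measure between document vector D_i and query Q:
# dot product in the numerator, product of vector lengths in the denominator.
def cosine(d, q):
    num = sum(dj * qj for dj, qj in zip(d, q))
    den = math.sqrt(sum(dj * dj for dj in d)) * math.sqrt(sum(qj * qj for qj in q))
    return num / den if den else 0.0

d1 = [3, 0, 1, 0]  # shares two terms with the query
d2 = [0, 2, 0, 2]  # shares no terms with the query
q  = [1, 0, 1, 0]

print(cosine(d1, q))  # high similarity
print(cosine(d2, q))  # 0.0
```

Because the denominator normalizes by vector length, long documents are not automatically favored over short ones.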
Term weighting (i) • Term weighting is often based on tf.idf: • The term frequency (tf) quantifies the importance of a term in a document: tfik = fik / |Di| • tfik is the term frequency weight of term k in document Di • fik is the number of occurrences of term k in Di • |Di| is the word count (of words considered) in document Di
Term weighting (ii) • Term weighting is often based on tf.idf: • The inverse document frequency (idf) quantifies the importance of a term within the entire collection of documents: idfk = log(N / nk) • idfk is the inverse document frequency weight for term k • N is the number of documents in the collection • nk is the number of documents in which term k occurs
Term weighting (iii) • Obtain term weights by multiplying term frequency and inverse document frequency values together • Perform this calculation for each term • As new/updated documents are processed, the algorithm must recalculate idf
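Putting the two slides together, a tf.idf weight per term can be sketched as follows, using the slides' definitions tfik = fik / |Di| and idfk = log(N / nk); the toy collection is illustrative:

```python
import math
from collections import Counter

# Sketch of tf.idf weighting: tf_ik = f_ik / |D_i|, idf_k = log(N / n_k).
docs = [
    "tropical fish tropical aquarium".split(),
    "fish tank".split(),
    "united states presidents".split(),
]
N = len(docs)

# n_k: number of documents in which term k occurs
df = Counter(t for d in docs for t in set(d))

def tfidf(doc):
    """Return the tf.idf weight of every term in one document."""
    counts = Counter(doc)
    return {t: (c / len(doc)) * math.log(N / df[t]) for t, c in counts.items()}

for i, d in enumerate(docs):
    print(i, tfidf(d))
```

Note that adding a document changes N and potentially every nk, which is why idf must be recalculated as the collection grows.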
What next? • Read and study Chapter 7 • Do Exercises 7.1, 7.2, 7.3, and 7.4