Internet Resources Discovery (IRD) IR Queries T.Sharon - A.Frank
IR Basic Concepts
• In the classic models:
  • each document is described/summarized by a set of representative keywords called index terms.
  • index terms are mainly nouns, but can be all the distinct terms in a document.
  • distinct index terms have varying relevance.
  • index term (numerical) weights are usually assumed to be mutually independent.
Common Weights for Keywords
• Binary: 1 if the term is present in the document, 0 otherwise.
• Term Frequency (TF): number of occurrences of the term in the document.
• Inverse Document Frequency (IDF): inversely related to the number of documents in the whole collection that contain the term.
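The three weighting schemes above can be sketched as follows; this is a minimal illustration over a tiny hypothetical collection (the documents and the function names are assumptions, not part of the slides).

```python
import math

# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["information", "retrieval", "information"],
    ["retrieval", "systems"],
    ["database", "systems"],
]

def binary_weight(term, doc):
    # Binary: 1 if the term occurs in the document, else 0.
    return 1 if term in doc else 0

def tf(term, doc):
    # Term Frequency: raw count of occurrences in the document.
    return doc.count(term)

def idf(term, docs):
    # Inverse Document Frequency: log(N / n_i), where n_i is the
    # number of documents that contain the term (see the IDF slide).
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

print(binary_weight("information", docs[0]))  # 1
print(tf("information", docs[0]))             # 2
print(idf("information", docs))               # log(3/1)
```

Note that IDF is a property of the whole collection, while binary and TF weights depend only on the single document.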
Boolean Model
• Simple retrieval model based on Set Theory and Boolean Algebra.
• Queries are specified as Boolean expressions.
• Advantages:
  • Precise semantics, neat formalism, inherent simplicity.
• Disadvantages:
  • Difficult to translate an information need into a Boolean expression.
  • Binary decision criterion: a document is relevant or not, with no grading scale.
  • A data (not information) retrieval model.
  • Exact matching may lead to retrieval of too few or too many documents.
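Because the Boolean model is grounded in set theory, a query evaluates directly to set operations over an inverted index. A minimal sketch, with hypothetical document ids and terms:

```python
# Inverted index: term -> set of ids of documents containing it (toy data).
index = {
    "information": {1, 2},
    "retrieval":   {1, 3},
    "database":    {2, 3},
}

# Query: information AND retrieval AND NOT database
# AND is set intersection, OR would be union, NOT is set difference.
result = (index["information"] & index["retrieval"]) - index["database"]
print(result)  # {1}
```

The exact-match behavior criticized on the slide is visible here: a document either lands in the result set or it does not; there is no ranking.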
Statistical Queries
• Purpose:
  • Increase flexibility by letting the user control the number of documents retrieved.
  • Reduce query formulation complexity.
Statistical Queries Overall Scheme
• Query:
  • a list of words
  • word combinations (like “prime minister”)
• Matching: how many times does each query word appear in a document?
• Ranking: assign a matching (relevance) score to each document.
• Evaluation: what happens to the quality measures when documents with lower scores are included?
Additional Query Parameters
• Location of the word in the document:
  • Title
  • First paragraph
  • Body
• Distance between words (proximity search)
Matching Score Factors
• Frequency: number of occurrences of a query keyword in the document.
• Count: number of query keywords present in the document.
• Importance: weight of each word in the query.
• Usually computed with the vector space model.
Vector Space Model
• Documents and queries are converted into vectors.
• Vector features are the index terms in the document or query, after stemming and removing stop-words.
• Index terms are assumed to be mutually independent.
• Vectors carry non-binary weights to emphasize the important index terms.
• The query vector is compared to each document vector to compute a degree of similarity; the documents closest to the query are considered similar and are returned.
Vector Space Implementation
• V(word, weight)
  • In a document: weight = number of occurrences of the word in the document.
  • In the query: weight = set according to the user’s definition.
Query/Document Matching Score
• Symbols: t = term, d = document, q = query, w = weight.
• w(t,d) = weight of term t in document d; w(t,q) = weight of term t in query q.
• Score(d,q) = Σt w(t,q) * w(t,d)  (scalar multiplication, summing over all terms t)
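The score formula above is a dot product between the query and document weight vectors. A minimal sketch, with hypothetical weights:

```python
def score(query_weights, doc_weights):
    # Score(d, q) = sum over terms t of w(t, q) * w(t, d).
    # Terms absent from the document contribute 0.
    return sum(wq * doc_weights.get(t, 0) for t, wq in query_weights.items())

# Hypothetical weights, not taken from the slides' example.
q = {"information": 10, "retrieval": 10}
d = {"information": 30, "retrieval": 30, "web": 5}
print(score(q, d))  # 10*30 + 10*30 = 600
```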
Example of Computing Scores: Document Related Part
Document (d), with weights w(t,d) per term:
“Information retrieval abstract. Meant to show how results are evaluated for all kinds of queries. The two measures are recall and precision, and they change if the evaluation method changes. Information retrieval is important! It is used a lot by search engines that store and retrieve a lot of information, to help us search the World Wide Web.”
Example of Computing Scores: Query Related Part
Score = 300 + 300 + 20 = 620
Problem with Scalar Multiplication
• Problem: longer documents contain more words, so they receive inflated scores. Normalization is needed.
• Solutions:
  • Use normalized word frequency: consider the overall number of words in the document.
  • Set the significance of each word (called IDF).
• Effective measure of similarity: TF * IDF.
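The first solution, normalizing word frequency by document length, can be sketched as follows (the sample document is hypothetical):

```python
def normalized_tf(term, doc_tokens):
    # TF divided by document length, so that longer documents
    # do not dominate the score merely by containing more words.
    return doc_tokens.count(term) / len(doc_tokens)

doc = ["information", "retrieval", "is", "fun", "information"]
print(normalized_tf("information", doc))  # 2 / 5 = 0.4
```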
Inverse Document Frequency (IDF)
• Captures the effect of a word’s frequency across the whole repository:
  • ni = number of documents in which the term appears
  • N = number of documents in the repository
  • maxn = maximal frequency of a word in the repository
• Two example variations:
  • IDF = log(N/ni)
  • IDF = log(maxn/ni) + 1
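Putting the pieces together, the TF * IDF similarity measure with the first IDF variant can be sketched as (the collection sizes are hypothetical):

```python
import math

def idf_log(N, n_i):
    # First variant from the slide: IDF = log(N / n_i).
    return math.log(N / n_i)

def idf_maxn(max_n, n_i):
    # Second variant from the slide: IDF = log(maxn / n_i) + 1.
    return math.log(max_n / n_i) + 1

def tf_idf(tf, N, n_i):
    # Effective similarity weight: TF * IDF.
    return tf * idf_log(N, n_i)

# Hypothetical numbers: 1000 documents, the term appears in 10 of them,
# and 3 times in the document being scored.
print(tf_idf(3, 1000, 10))  # 3 * log(100)
```

A term that appears in few documents gets a large IDF and so dominates the score; a term that appears everywhere gets an IDF near zero (log of roughly 1) and contributes little.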
Vector Model Advantages
• The term-weighting scheme improves retrieval performance.
• The partial matching strategy allows retrieval of documents that approximate the query conditions.
• Documents are sorted/ranked according to their degree of similarity to the query.
• It is simple and fast, and turns out to be superior to many other IR models, so it is very popular.