Internet Resources Discovery (IRD) IR Queries T.Sharon - A.Frank
IR Basic Concepts
• In the classic models:
  • each document is described/summarized by a set of representative keywords called index terms.
  • index terms are mainly nouns, but can be all the distinct terms in a document.
  • distinct index terms have varying relevance.
  • index term (numerical) weights are usually assumed to be mutually independent.
Common Weights for Keywords
• Binary: 1 if the term is present in the document, 0 otherwise.
• Term Frequency (TF): number of occurrences of the term in the document.
• Inverse Document Frequency (IDF): inversely related to the number of documents in the whole collection that contain the term.
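The three weighting schemes above can be sketched as follows; this is a minimal illustration over a tiny hypothetical collection (the documents and the function names are assumptions, not part of the slides).

```python
import math

# Hypothetical toy collection: each document is a list of tokens.
docs = [
    ["information", "retrieval", "information"],
    ["retrieval", "systems"],
    ["database", "systems"],
]

def binary_weight(term, doc):
    # Binary: 1 if the term occurs in the document, else 0.
    return 1 if term in doc else 0

def tf(term, doc):
    # Term Frequency: raw count of occurrences in the document.
    return doc.count(term)

def idf(term, docs):
    # Inverse Document Frequency: log(N / n_i), where n_i is the
    # number of documents that contain the term (see the IDF slide).
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

print(binary_weight("information", docs[0]))  # 1
print(tf("information", docs[0]))             # 2
print(idf("information", docs))               # log(3/1)
```

Note that IDF is a property of the whole collection, while binary and TF weights depend only on the single document.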
Boolean Model
• Simple retrieval model based on Set Theory and Boolean Algebra.
• Queries are specified as Boolean expressions.
• Advantages:
  • Precise semantics, neat formalism, inherent simplicity.
• Disadvantages:
  • Difficult to translate an information need into a Boolean expression.
  • Binary decision criterion: a document is relevant or not, with no grading scale.
  • A data (not information) retrieval model.
  • Exact matching may lead to retrieval of too few or too many documents.
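Because the Boolean model is grounded in set theory, a query evaluates directly to set operations over an inverted index. A minimal sketch, with hypothetical document ids and terms:

```python
# Inverted index: term -> set of ids of documents containing it (toy data).
index = {
    "information": {1, 2},
    "retrieval":   {1, 3},
    "database":    {2, 3},
}

# Query: information AND retrieval AND NOT database
# AND is set intersection, OR would be union, NOT is set difference.
result = (index["information"] & index["retrieval"]) - index["database"]
print(result)  # {1}
```

The exact-match behavior criticized on the slide is visible here: a document either lands in the result set or it does not; there is no ranking.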
Statistical Queries
• Purpose:
  • Increase flexibility by letting the user control the number of documents retrieved.
  • Reduce query formulation complexity.
Statistical Queries Overall Scheme
• Query:
  • a list of words
  • word combinations (like “prime minister”)
• Matching: how many times does each query word appear in a document?
• Ranking: assign a matching (relevance) score to each document.
• Evaluation: what happens to the quality measures when documents with lower scores are included?
Additional Query Parameters
• Location of the word in the document:
  • Title
  • First paragraph
  • Body
• Distance between words (proximity search)
Matching Score Factors
• Frequency: number of occurrences of a query keyword in the document.
• Count: number of query keywords present in the document.
• Importance: weight of each word in the query.
• Usually computed with the vector space model.
Vector Space Model
• Documents and queries are converted into vectors.
• Vector features are the index terms in the document or query, after stemming and removing stop-words.
• Index terms are assumed to be mutually independent.
• Vectors carry non-binary weights to emphasize the important index terms.
• The query vector is compared to each document vector to compute a degree of similarity; the documents closest to the query are considered similar and are returned.
Vector Space Implementation
• V(word, weight)
  • In a document: weight = number of occurrences of the word in the document.
  • In the query: weight = set according to the user’s definition.
Query/Document Matching Score
• Symbols: t = term, d = document, q = query, w = weight.
• w(t,d) = weight of term t in document d; w(t,q) = weight of term t in query q.
• Score(d,q) = Σt w(t,q) * w(t,d)  (scalar multiplication, summing over all terms t)
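The score formula above is a dot product between the query and document weight vectors. A minimal sketch, with hypothetical weights:

```python
def score(query_weights, doc_weights):
    # Score(d, q) = sum over terms t of w(t, q) * w(t, d).
    # Terms absent from the document contribute 0.
    return sum(wq * doc_weights.get(t, 0) for t, wq in query_weights.items())

# Hypothetical weights, not taken from the slides' example.
q = {"information": 10, "retrieval": 10}
d = {"information": 30, "retrieval": 30, "web": 5}
print(score(q, d))  # 10*30 + 10*30 = 600
```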
Example of Computing Scores: Document Related Part
Document (d), with weights w(t,d) per term:
“Information retrieval abstract. Meant to show how results are evaluated for all kinds of queries. The two measures are recall and precision, and they change if the evaluation method changes. Information retrieval is important! It is used a lot by search engines that store and retrieve a lot of information, to help us search the World Wide Web.”
Example of Computing Scores: Query Related Part
Score = 300 + 300 + 20 = 620
Problem with Scalar Multiplication
• Problem: longer documents contain more words, so they receive inflated scores. Normalization is needed.
• Solutions:
  • Use normalized word frequency: consider the overall number of words in the document.
  • Set the significance of each word (called IDF).
• Effective measure of similarity: TF * IDF.
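The first solution, normalizing word frequency by document length, can be sketched as follows (the sample document is hypothetical):

```python
def normalized_tf(term, doc_tokens):
    # TF divided by document length, so that longer documents
    # do not dominate the score merely by containing more words.
    return doc_tokens.count(term) / len(doc_tokens)

doc = ["information", "retrieval", "is", "fun", "information"]
print(normalized_tf("information", doc))  # 2 / 5 = 0.4
```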
Inverse Document Frequency (IDF)
• Captures the effect of a word’s frequency across the whole repository:
  • ni = number of documents in which the term appears
  • N = number of documents in the repository
  • maxn = maximal frequency of a word in the repository
• Two example variations:
  • IDF = log(N/ni)
  • IDF = log(maxn/ni) + 1
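Putting the pieces together, the TF * IDF similarity measure with the first IDF variant can be sketched as (the collection sizes are hypothetical):

```python
import math

def idf_log(N, n_i):
    # First variant from the slide: IDF = log(N / n_i).
    return math.log(N / n_i)

def idf_maxn(max_n, n_i):
    # Second variant from the slide: IDF = log(maxn / n_i) + 1.
    return math.log(max_n / n_i) + 1

def tf_idf(tf, N, n_i):
    # Effective similarity weight: TF * IDF.
    return tf * idf_log(N, n_i)

# Hypothetical numbers: 1000 documents, the term appears in 10 of them,
# and 3 times in the document being scored.
print(tf_idf(3, 1000, 10))  # 3 * log(100)
```

A term that appears in few documents gets a large IDF and so dominates the score; a term that appears everywhere gets an IDF near zero (log of roughly 1) and contributes little.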
Vector Model Advantages
• The term-weighting scheme improves retrieval performance.
• The partial matching strategy allows retrieval of documents that approximate the query conditions.
• Documents are sorted/ranked according to their degree of similarity to the query.
• It is simple and fast, and turns out to be superior to many other IR models, so it is very popular.