CS533 Information Retrieval Dr. Michal Cutler Lecture #3 February 1, 2000
This lecture • Lexis-nexis demo • Recall and precision • Effect of query terms on recall/precision • Effect of indexing on recall and precision • Zipf’s law and its applications
The Concept of Relevance • Relevance of a document D to a query Q is subjective • Different users will make different judgements • The same user may judge differently at different times • The degree of relevance of different documents will vary
The Concept of Relevance • In evaluating IR systems it is assumed that: • A subset of the documents of the database (DB) is relevant • A document is either relevant or not
Finding the relevant set • In a small collection - the relevance of each document can be checked • In a large collection - a sample of the documents is checked and the number of relevant documents in the collection is estimated • On the WWW?
Finding the relevant set in TREC • Each query is run by all competing systems • The top N documents retrieved by each system are merged into a pool and judged manually • The set of relevant documents found in the pool is considered the relevant set
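A minimal sketch of this pooling procedure in Python (the function and document names are illustrative, not TREC's actual tooling):

```python
def build_pool(runs, n=100):
    """Merge the top-n documents from each system's ranked run.

    runs: list of ranked lists of document IDs, one list per system.
    Returns the pooled set of documents to be judged manually.
    """
    pool = set()
    for run in runs:
        pool.update(run[:n])
    return pool

# The assessors judge every document in the pool; the relevant ones
# found there are treated as the full relevant set for the query.
pool = build_pool([["d1", "d2", "d3"], ["d2", "d4"], ["d5", "d1"]], n=2)
print(sorted(pool))  # ['d1', 'd2', 'd4', 'd5']
```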
Recall and precision • The most commonly used measures for evaluating an IR system • Given a DB and a query Q
Example 1 • Collection of N documents • Relevant: rl = 5 • Retrieved: rt = 8 • Relevant and retrieved (the overlap): rr = 2
Recall • Recall R = rr / rl where • rl - no. relevant documents in DB • rr - no. relevant retrieved documents • The fraction of the relevant set which is retrieved • For example 1: R = 2/5 = 0.4
Precision • rt - no. documents retrieved for Q • Precision P = rr / rt • The fraction of the retrieved set which is relevant • For example 1: P = 2/8 = 0.25
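Both measures follow directly from set intersection. A small sketch reproducing Example 1 (the document IDs are made up):

```python
def recall_precision(relevant, retrieved):
    """Compute recall and precision for one query.

    relevant:  set of document IDs judged relevant (size rl)
    retrieved: set of document IDs returned for the query (size rt)
    """
    rr = len(relevant & retrieved)   # relevant documents retrieved
    recall = rr / len(relevant)      # rr / rl
    precision = rr / len(retrieved)  # rr / rt
    return recall, precision

# Example 1: rl = 5, rt = 8, rr = 2
relevant = {"d1", "d2", "d3", "d4", "d5"}
retrieved = {"d4", "d5", "d10", "d11", "d12", "d13", "d14", "d15"}
print(recall_precision(relevant, retrieved))  # (0.4, 0.25)
```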
Recall and precision • Ideal retrieval results: 100% recall and 100% precision • All good documents are retrieved and • No bad document is retrieved
Recall/precision graph • [Plot: precision (0 to 1) on the y-axis against recall (0 to 1) on the x-axis; the ideal system sits at the point (1, 1)]
Choosing query terms • Subject: Information retrieval • Initial query: information AND retrieval • Broader query: information OR retrieval • Narrower query: information ADJACENT retrieval
Effect of query terms on results • Broad query - high recall but low precision • Narrow query - high precision but low recall
Indexing Effectiveness • Indexing exhaustivity and • Term specificity
Exhaustivity • An index is exhaustive when all the content of a document is covered by index terms • A non-exhaustive index has very few index terms • Information loss • Decreases recall
Exhaustivity • An exhaustive index • Increases output • Increases recall • Decreases precision
Specificity • Specificity - the breadth or narrowness of the index terms • Broad terms raise recall but lower precision • Narrow terms raise precision but lower recall
The Trade Off • [Plot: precision against recall; narrow terms sit at high precision and low recall, broad terms at high recall and low precision]
Zipf’s law and its applications • Estimating the storage space saved by excluding stop words from the index • The 10 most frequently occurring words in English account for roughly 25%-30% of text
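This saving is easy to measure on an actual text. A sketch assuming simple whitespace tokenization:

```python
from collections import Counter

def top_k_coverage(words, k=10):
    """Fraction of all word occurrences accounted for by the k most
    frequent words -- the share of postings saved by excluding them."""
    counts = Counter(words)
    covered = sum(f for _, f in counts.most_common(k))
    return covered / len(words)

# Zipf's law predicts roughly 0.25-0.30 for k = 10 on English text:
# words = open("corpus.txt").read().lower().split()
# print(top_k_coverage(words))
```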
Zipf’s law and its applications • Estimating the size of a term’s inverted index list • Given the rank r of the term, N the number of words in the database, and A the constant for the database • Size of the inverted index list: n ≈ A*N/r
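As a sketch (the values of A and N below are illustrative):

```python
def estimated_list_size(rank, n_words, a):
    """Estimate the length of a term's inverted index list via Zipf's law.

    rank:    the term's frequency rank r
    n_words: total number of word occurrences N in the database
    a:       the Zipf constant A fitted for this database
    """
    return a * n_words / rank

# With A = 0.1 and N = 1,000,000, the term of rank 20 is expected
# to occur -- and hence be listed -- about 5,000 times.
print(estimated_list_size(20, 1_000_000, 0.1))  # 5000.0
```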
Zipf’s law and its applications • Estimating the number of words n(1) that occur once, n(2) that occur twice, etc. • Words that occur at most twice make up about 2/3 of the distinct words of a text • Deleting very low frequency words from the index therefore yields a large saving
Term frequency predictions • Rank words by their frequency of occurrence in English • 1 - most frequent word, and • t - number of distinct terms/last rank • The table on the next slide, based on a text of 1 million words, shows the 10 most frequent words and their frequencies of occurrence
Most frequent words

r   Word  f(r)    r*f(r)/N
1   the   69,971  0.070
2   of    36,411  0.073
3   and   28,852  0.086
4   to    26,149  0.104
5   a     23,237  0.116
6   in    21,341  0.128
7   that  10,595  0.074
8   is    10,049  0.081
9   was    9,816  0.088
10  he     9,543  0.095

N ≈ 1,000,000
Observing the numbers • “the” and “of” ≈ 10% of the text • All 10 words ≈ 25% of the text • f(r)/N ≈ probability of occurrence of a term with rank r in the text • Note that r*f(r)/N is ≈ 0.1
Zipf’s law • If we rank the occurrence of all terms in English we find that r*p(r) ≈ A where • r denotes the rank of a word, and • p(r) the probability of occurrence of the word, and • A is a constant
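The law can be checked empirically on any text; a sketch assuming crude whitespace tokenization:

```python
from collections import Counter

def zipf_check(text, top=10):
    """Rank words by frequency and print r * f(r) / N, which Zipf's law
    predicts to be roughly constant (the constant A)."""
    words = text.lower().split()  # crude tokenization
    n = len(words)
    for r, (word, f) in enumerate(Counter(words).most_common(top), start=1):
        print(f"{r:>3}  {word:<12} f={f:<8} r*f/N={r * f / n:.3f}")
```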
Zipf’s observations • The most frequent English words are short, and the least frequent are the longest • The average length of distinct English words is 8.1 characters, but • The average length over all word occurrences is only about 4.7 characters
Zipf’s law for a given text • Given a text of N words: • f(r) ≈ A*N/r, equivalently r ≈ A*N/f(r), where • A is a domain-specific constant, and • f(r) is the number of occurrences of the term with rank r
Distinct words occurring j times • Text has t distinct words ranked 1 to t • max - the maximum number of occurrences of any word • n(j) - number of words occurring j times • last(j) - last (highest) rank of terms occurring j times
Rank                      Frequency  Number of words
1 ... last(max)           max        n(max) = last(max)
...                       ...        ...
last(j+1)+1 ... last(j)   j          n(j) = last(j) - last(j+1)
...                       ...        ...
last(2)+1 ... last(1)=t   1          n(1) = last(1) - last(2)
Distinct words occurring j times • last(j) highest rank of terms occurring at least j times • last(j+1) highest rank of terms occurring at least j+1 times • n(j) number of words occurring j times
Distinct words occurring j times • Using Zipf’s law: last(j) ≈ A*N/j and last(j+1) ≈ A*N/(j+1) and last(1) = t ≈ A*N/1 = A*N
Distinct words occurring j times • n(j) = last(j) - last(j+1) ≈ A*N/j - A*N/(j+1) = A*N/(j(j+1)) ≈ t/(j(j+1)) (since A*N ≈ t, see previous slide)
Distinct words occurring once, twice... • About half of the distinct words in the text occur once: n(1)/t ≈ 1/(1*2) = 0.5 • About 2/3 of the distinct words occur at most twice: n(2)/t ≈ 1/(2*3) ≈ 0.167, so n(1)/t + n(2)/t ≈ 0.667
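The prediction is easy to verify; a sketch assuming pre-tokenized input:

```python
from collections import Counter

def occurrence_spectrum(words, max_j=5):
    """Compare the observed fraction of distinct words occurring exactly
    j times with the Zipf prediction n(j)/t = 1/(j*(j+1))."""
    freq = Counter(words)              # word -> number of occurrences
    spectrum = Counter(freq.values())  # j -> number of words occurring j times
    t = len(freq)                      # number of distinct words
    for j in range(1, max_j + 1):
        observed = spectrum.get(j, 0) / t
        predicted = 1 / (j * (j + 1))
        print(f"j={j}: observed {observed:.3f}, predicted {predicted:.3f}")
```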