Information Retrieval

Information Retrieval February 3, 2003 Handout #2

Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M&F 11-12 • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Mondays, 1-4 PM in 409 West Hall

Queries and documents

Queries • Single-word queries • Context queries • Phrases • Proximity • Boolean queries • Natural Language queries
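
To make the query types above concrete, here is a minimal sketch of single-word, Boolean, phrase, and proximity queries over a positional inverted index. The toy collection, the function names, and whitespace tokenization are all assumptions made for illustration; they are not part of the course material.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and texts are illustrative only.
DOCS = {
    1: "car safety minivans tests injury statistics",
    2: "liability tests safety",
    3: "car passengers injury reviews",
}

def build_positional_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

INDEX = build_positional_index(DOCS)

def single_word(term):
    """Documents containing the term."""
    return set(INDEX.get(term, {}))

def boolean_and(a, b):
    return single_word(a) & single_word(b)

def boolean_or(a, b):
    return single_word(a) | single_word(b)

def boolean_not(a):
    return set(DOCS) - single_word(a)

def proximity(a, b, k):
    """Documents where a and b occur within k positions of each other, in either order."""
    hits = set()
    for doc_id in boolean_and(a, b):
        pos_a, pos_b = INDEX[a][doc_id], INDEX[b][doc_id]
        if any(abs(i - j) <= k for i in pos_a for j in pos_b):
            hits.add(doc_id)
    return hits

def phrase(words):
    """Documents where the words occur consecutively and in order."""
    hits = set()
    for doc_id, starts in INDEX.get(words[0], {}).items():
        for start in starts:
            if all(start + i in INDEX.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
                break
    return hits

print(phrase(["safety", "minivans"]))
print(proximity("car", "injury", 3))
```

Storing positions rather than only document ids is what makes phrase and proximity queries answerable from the index alone.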

Pattern matching • Words, prefixes, suffixes, substrings, ranges, regular expressions • Structured queries (e.g., XML)
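
The word-level pattern types listed above can be sketched with plain string operations and Python's `re` module. The vocabulary and function names below are hypothetical, and range matching is assumed to mean lexicographic order; structured (XML) queries are not covered by this sketch.

```python
import re

# Hypothetical vocabulary, kept in sorted order.
VOCAB = ["compute", "computer", "computing", "prologue",
         "recompute", "safety", "stress", "stresses"]

def prefix_match(prefix):
    return [w for w in VOCAB if w.startswith(prefix)]

def suffix_match(suffix):
    return [w for w in VOCAB if w.endswith(suffix)]

def substring_match(sub):
    return [w for w in VOCAB if sub in w]

def range_match(low, high):
    """Words that fall lexicographically between low and high, inclusive."""
    return [w for w in VOCAB if low <= w <= high]

def regex_match(pattern):
    """Words that match the regular expression in full."""
    return [w for w in VOCAB if re.fullmatch(pattern, w)]

print(prefix_match("comput"))
print(regex_match(r"stress(es)?"))
```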

Relevance feedback • Query expansion • Term reweighting • Pseudo-relevance feedback • Latent semantic indexing • Distributional clustering

Document processing • Lexical analysis • Stopword elimination • Stemming • Index term identification • Thesauri
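
The first stages of document processing listed above can be sketched as a small pipeline. The stopword list is a deliberately tiny, made-up one, and `crude_stem` is only a placeholder that strips a plural "s"; it is not Porter's algorithm.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "is", "are"}

def tokenize(text):
    """Lexical analysis: lowercase the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def crude_stem(token):
    """Placeholder stemmer (an assumption, not Porter): drop a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def index_terms(text):
    """Run the pipeline and return the sorted set of index terms."""
    return sorted({crude_stem(t) for t in remove_stopwords(tokenize(text))})

print(index_terms("The safety of minivans and cars in tests"))
```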

Porter’s algorithm • Conditions used in the rules: • 1. The measure, m, of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and the final V are optional and m is the number of VC repeats. Examples: m=0: free, why; m=1: frees, whose; m=2: prologue, compute • 2. *<X>: the stem ends with the letter X • 3. *v*: the stem contains a vowel • 4. *d: the stem ends in a double consonant • 5. *o: the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
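
The measure and the stem conditions can be sketched in a few lines. The function names are made up for this illustration, and the sketch assumes lowercase input and Porter's convention that "y" counts as a vowel when it follows a consonant.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel only after a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats, m, in the form [C](VC)^m[V]."""
    m = 0
    prev_was_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_was_vowel = not cons
    return m

def ends_with(stem, letter):
    """*<X>: the stem ends with the given letter."""
    return stem.endswith(letter)

def contains_vowel(stem):
    """*v*: the stem contains a vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):
    """*d: the stem ends in a double consonant."""
    n = len(stem)
    return n >= 2 and stem[-1] == stem[-2] and is_consonant(stem, n - 1)

def ends_cvc(stem):
    """*o: the stem ends consonant-vowel-consonant, last consonant not w, x, or y."""
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```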

Porter’s algorithm • Suffix conditions take the form current_suffix == pattern • Actions take the form old_suffix -> new_suffix • Rules are divided into steps that define the order in which they are applied. Some example rules:

STEP  CONDITION            SUFFIX  REPLACEMENT    EXAMPLE
1a    NULL                 sses    ss             stresses -> stress
1b    *v*                  ing     NULL           making -> mak
1b1   NULL                 at      ate            inflat(ed) -> inflate
1c    *v*                  y       i              happy -> happi
2     m>0                  aliti   al             formaliti -> formal
3     m>0                  icate   ic             duplicate -> duplic
4     m>1                  able    NULL           adjustable -> adjust
5a    m>1                  e       NULL           inflate -> inflat
5b    m>1 and *d and *<L>  NULL    single letter  controll -> control
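
The example rules can be encoded as (step, condition, suffix, replacement) tuples, with at most one rule firing per step. This is a sketch covering only the rules listed here, not a full Porter stemmer; the helper names are made up, and rule 5b is modeled as "ll -> l", with its measure condition taken over the whole word.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats, m, in the form [C](VC)^m[V]."""
    m, prev_was_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_was_vowel:
            m += 1
        prev_was_vowel = not cons
    return m

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (step, condition on the stem left after removing the suffix, suffix, replacement)
RULES = [
    ("1a",  lambda s: True,                   "sses",  "ss"),
    ("1b",  contains_vowel,                   "ing",   ""),
    ("1b1", lambda s: True,                   "at",    "ate"),
    ("1c",  contains_vowel,                   "y",     "i"),
    ("2",   lambda s: measure(s) > 0,         "aliti", "al"),
    ("3",   lambda s: measure(s) > 0,         "icate", "ic"),
    ("4",   lambda s: measure(s) > 1,         "able",  ""),
    ("5a",  lambda s: measure(s) > 1,         "e",     ""),
    # 5b: m>1 and *d and *<L>; the measure is checked on the whole word (stem + "ll").
    ("5b",  lambda s: measure(s + "ll") > 1,  "ll",    "l"),
]

def apply_step(word, step):
    """Apply the first rule of the given step whose suffix and condition match."""
    for name, condition, suffix, replacement in RULES:
        if name == step and word.endswith(suffix):
            stem = word[:len(word) - len(suffix)]
            if condition(stem):
                return stem + replacement
    return word  # no rule in this step applies

print(apply_step("stresses", "1a"))
print(apply_step("adjustable", "4"))
```

A full stemmer would also make step 1b1 depend on step 1b having fired, which this sketch does not model.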

Porter’s algorithm • Example: the word “duplicatable” • duplicatable -> duplicat (rule 4) • duplicat -> duplicate (rule 1b1) • duplicate -> duplic (rule 3) • Another rule in step 4, which would remove “ic”, cannot be applied, since only one rule from each step may be applied.

Relevance feedback • Automatic • Manual • Method: identifying feedback terms • Q’ = a1Q + a2R - a3N, where Q is the original query, R the relevant documents, and N the non-relevant documents • Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|

Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • R = ? • N = ? • Q’ = ?
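
One way to work through this example is sketched below, under several assumptions that the slides do not state: documents and queries are bags of words with raw term counts as weights, R and N are the sums of the relevant and non-relevant document vectors, and a1 = 1, a2 = 1/|R|, a3 = 1/|N|. The function names are made up for the sketch.

```python
from collections import Counter

def vec(text):
    """Bag-of-words vector with raw term counts (an assumption of this sketch)."""
    return Counter(text.lower().split())

def feedback(query, relevant, nonrelevant, a1=1.0, a2=None, a3=None):
    """Compute Q' = a1*Q + a2*R - a3*N over the union of all terms."""
    R, N = Counter(), Counter()
    for d in relevant:
        R.update(d)
    for d in nonrelevant:
        N.update(d)
    if a2 is None:
        a2 = 1.0 / len(relevant) if relevant else 0.0
    if a3 is None:
        a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(query) | set(R) | set(N)
    return {t: a1 * query[t] + a2 * R[t] - a3 * N[t] for t in terms}

Q = vec("safety minivans")
D1 = vec("car safety minivans tests injury statistics")  # relevant
D2 = vec("liability tests safety")                        # relevant
D3 = vec("car passengers injury reviews")                 # non-relevant

Q_new = feedback(Q, [D1, D2], [D3])
for term, weight in sorted(Q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" ends up with weight 2.0 and "minivans" with 1.5, while terms that occur mainly in D3 receive negative weights. Implementations often clip negative weights to zero; this sketch leaves them visible.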

Automatic query expansion • Thesaurus-based expansion • Distributional similarity-based expansion

WordNet and DistSim • wn reason -hypen - hypernyms • wn reason -synsn - synsets • wn reason -simsn - synonyms • wn reason -over - overview of senses • wn reason -famln - familiarity/polysemy • wn reason -grepn - compound nouns • /clair3/tools/relatedwords/relate reason

Related (substitutable) words • WordNet: • Book: publication, product, fact, dramatic composition, record • Computer: machine, expert, calculator, reckoner, figurer • Fruit: reproductive structure, consequence, product, bear • Politician: leader, schemer • Newspaper: press, publisher, product, paper, newsprint

Related (substitutable) words • Distributional clustering: • Book: autobiography, essay, biography, memoirs, novels • Computer: adobe, computing, computers, developed, hardware • Fruit: leafy, canned, fruits, flowers, grapes • Politician: activist, campaigner, politicians, intellectuals, journalist • Newspaper: daily, globe, newspapers, newsday, paper
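
A related-words table like the one above can drive automatic query expansion. The sketch below copies two of the distributional-clustering entries into a dictionary; the function name, the cap on added terms, and the down-weighting factor of 0.5 are all assumptions made for illustration.

```python
# Two entries copied from the distributional-clustering list above.
RELATED = {
    "book": ["autobiography", "essay", "biography", "memoirs", "novels"],
    "newspaper": ["daily", "globe", "newspapers", "newsday", "paper"],
}

def expand_query(terms, related=RELATED, max_per_term=2, weight=0.5):
    """Return {term: weight}: original terms get 1.0, added related terms get `weight`."""
    expanded = {t: 1.0 for t in terms}
    for t in terms:
        for r in related.get(t, [])[:max_per_term]:
            # setdefault keeps the weight of a term that is already in the query
            expanded.setdefault(r, weight)
    return expanded

print(expand_query(["book", "reviews"]))
```

Giving added terms a lower weight than the original ones is one common way to limit the drift that expansion can introduce.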

Indexing and searching

Computing term salience • Term frequency (TF) • Document frequency (DF) • Inverse document frequency (IDF)
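
The three quantities can be computed directly from a small collection. The slides do not give a formula for IDF, so the sketch assumes the common variant idf(t) = log(N / df(t)) with a natural logarithm; the toy documents and function names are also assumptions.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "D1": "car safety minivans tests injury statistics",
    "D2": "liability tests safety",
    "D3": "car passengers injury reviews",
}

def term_frequencies(docs):
    """TF: how often each term occurs in each document."""
    return {doc_id: Counter(text.split()) for doc_id, text in docs.items()}

def document_frequencies(docs):
    """DF: in how many documents each term occurs."""
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))  # count each term once per document
    return df

def inverse_document_frequencies(docs):
    """IDF, assuming idf(t) = log(N / df(t))."""
    n = len(docs)
    return {t: math.log(n / count) for t, count in document_frequencies(docs).items()}

tf = term_frequencies(DOCS)
df = document_frequencies(DOCS)
idf = inverse_document_frequencies(DOCS)
print(tf["D1"]["safety"], df["safety"], round(idf["safety"], 3))
```

A term that occurs in every document gets an IDF of zero under this variant, which reflects that it does not help tell documents apart.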

Applications of TFIDF • Cosine similarity • Indexing • Clustering
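
As an illustration of the first application, the sketch below ranks documents against a query by the cosine similarity of their TF-IDF vectors. It assumes raw counts for TF and idf(t) = log(N / df(t)); neither choice is specified in the slides, and the toy collection and function names are made up.

```python
import math
from collections import Counter

# Hypothetical toy collection.
DOCS = {
    "D1": "car safety minivans tests injury statistics",
    "D2": "liability tests safety",
    "D3": "car passengers injury reviews",
}

def idf_table(docs):
    """idf(t) = log(N / df(t)), an assumed variant."""
    n = len(docs)
    df = Counter()
    for text in docs.values():
        df.update(set(text.split()))
    return {t: math.log(n / count) for t, count in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    tf = Counter(text.split())
    return {t: tf[t] * idf[t] for t in tf if t in idf}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing cosine similarity."""
    idf = idf_table(docs)
    q = tfidf(query, idf)
    scores = {doc_id: cosine(q, tfidf(text, idf)) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for doc_id, score in rank("safety minivans", DOCS):
    print(doc_id, round(score, 3))
```

The same vectors and similarity function can serve as the basis for clustering, by comparing documents with each other instead of with a query.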
