Information Retrieval February 3, 2003 Handout #2
Course Information • Instructor: Dragomir R. Radev (radev@si.umich.edu) • Office: 3080, West Hall Connector • Phone: (734) 615-5225 • Office hours: M&F 11-12 • Course page: http://tangra.si.umich.edu/~radev/650/ • Class meets on Mondays, 1-4 PM in 409 West Hall
Queries • Single-word queries • Context queries • Phrases • Proximity • Boolean queries • Natural Language queries
Pattern matching • Words, prefixes, suffixes, substrings, ranges, regular expressions • Structured queries (e.g., XML)
Relevance feedback • Query expansion • Term reweighting • Pseudo-relevance feedback • Latent semantic indexing • Distributional clustering
Document processing • Lexical analysis • Stopword elimination • Stemming • Index term identification • Thesauri
Porter’s algorithm
• 1. The measure, m, of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, then every stem has the form [C](VC)^m[V], where the initial C and final V are optional and m is the number of VC repeats.
  m=0: free, why
  m=1: frees, whose
  m=2: prologue, compute
• 2. *<X> - stem ends with letter X
• 3. *v* - stem contains a vowel
• 4. *d - stem ends in a double consonant
• 5. *o - stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y
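The measure can be computed by counting vowel-to-consonant transitions. The following is a minimal sketch, not a full stemmer; the function names `is_consonant` and `measure` are illustrative, and it assumes the usual Porter convention that "y" counts as a vowel when it follows a consonant.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' after a consonant counts as a vowel."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of VC repeats in the form [C](VC)^m[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:  # a vowel run just ended in a consonant
            m += 1
        prev_vowel = not cons
    return m

for word in ["free", "why", "frees", "whose", "prologue", "compute"]:
    print(word, measure(word))
```

Run on the slide's examples, this gives m=0 for "free" and "why", m=1 for "frees" and "whose", and m=2 for "prologue" and "compute".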
Porter’s algorithm
• Suffix conditions take the form current_suffix == pattern
• Actions are in the form old_suffix -> new_suffix
• Rules are divided into steps to define the order of applying the rules. The following are some examples of the rules:

STEP  CONDITION             SUFFIX  REPLACEMENT    EXAMPLE
1a    NULL                  sses    ss             stresses -> stress
1b    *v*                   ing     NULL           making -> mak
1b1   NULL                  at      ate            inflat(ed) -> inflate
1c    *v*                   y       i              happy -> happi
2     m>0                   aliti   al             formaliti -> formal
3     m>0                   icate   ic             duplicate -> duplic
4     m>1                   able    NULL           adjustable -> adjust
5a    m>1                   e       NULL           inflate -> inflat
5b    m>1 and *d and *<L>   NULL    single letter  controll -> control
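A few of the rules above can be sketched as data plus a small matcher. This is an illustration under stated assumptions, not the complete algorithm: the names `RULES` and `apply_rule` are hypothetical, only seven of the listed rules are encoded, *v* is taken to mean "the stem contains a vowel" as in Porter's original definition, and the `measure` helper is repeated here so that the block runs on its own.

```python
VOWELS = "aeiou"

def is_consonant(word, i):
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":  # 'y' after a consonant counts as a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Number of VC repeats in the form [C](VC)^m[V]."""
    m, prev_vowel = 0, False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_vowel:
            m += 1
        prev_vowel = not cons
    return m

def contains_vowel(stem):
    return any(not is_consonant(stem, i) for i in range(len(stem)))

# (step, condition on the stem left after removing the suffix, suffix, replacement)
RULES = [
    ("1a", lambda s: True, "sses", "ss"),
    ("1b", contains_vowel, "ing", ""),
    ("1c", contains_vowel, "y", "i"),
    ("2", lambda s: measure(s) > 0, "aliti", "al"),
    ("3", lambda s: measure(s) > 0, "icate", "ic"),
    ("4", lambda s: measure(s) > 1, "able", ""),
    ("5a", lambda s: measure(s) > 1, "e", ""),
]

def apply_rule(word, step):
    """Apply the first matching rule of the given step, or return word unchanged."""
    for name, condition, suffix, replacement in RULES:
        if name == step and word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if condition(stem):
                return stem + replacement
    return word

print(apply_rule("stresses", "1a"))
print(apply_rule("adjustable", "4"))
```

The condition is always tested on the stem that remains after the suffix is stripped, which is why "sing" is left alone by rule 1b: the remaining stem "s" contains no vowel.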
Porter’s algorithm
Example: the word “duplicatable”
duplicat (rule 4)
duplicate (rule 1b1)
duplic (rule 3)
Another rule in step 4, which removes “ic,” cannot be applied, since only one rule from each step is allowed to be applied.
Relevance feedback • Automatic • Manual • Method: identifying feedback terms • Q’ = a1 Q + a2 R - a3 N • Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non-relevant • R = ? • N = ? • Q’ = ?
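One way to work this example is sketched below. It rests on assumptions that the slides do not state: documents and the query are represented as raw term-count vectors, R and N in the formula Q’ = a1 Q + a2 R - a3 N are the sums of the relevant and non-relevant document vectors, and |R| and |N| are the numbers of documents in each set. The function name `feedback` is illustrative.

```python
from collections import Counter

def vec(text):
    """Term-count vector for a whitespace-tokenized text."""
    return Counter(text.split())

query = vec("safety minivans")
relevant = [
    vec("car safety minivans tests injury statistics"),  # D1
    vec("liability tests safety"),                       # D2
]
nonrelevant = [
    vec("car passengers injury reviews"),                # D3
]

def feedback(query, relevant, nonrelevant, a1=1.0):
    """Q' = a1*Q + a2*R - a3*N with a2 = 1/|R| and a3 = 1/|N|."""
    R, N = Counter(), Counter()
    for d in relevant:
        R.update(d)
    for d in nonrelevant:
        N.update(d)
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(query) | set(R) | set(N)
    return {t: a1 * query[t] + a2 * R[t] - a3 * N[t] for t in terms}

q_new = feedback(query, relevant, nonrelevant)
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.2f}")
```

Under these assumptions "safety" ends up with weight 2.0 and "minivans" with 1.5, while "passengers" and "reviews", which occur only in the non-relevant document, receive -1.0. Whether negative weights are kept or clipped to zero is a design choice that the slides leave open.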
Automatic query expansion • Thesaurus-based expansion • Distributional similarity-based expansion
WordNet and DistSim
wn reason -hypen    - hypernyms
wn reason -synsn    - synsets
wn reason -simsn    - synonyms
wn reason -over     - overview of senses
wn reason -famln    - familiarity/polysemy
wn reason -grepn    - compound nouns
/clair3/tools/relatedwords/relate reason
Related (substitutable) words
WordNet:
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper
Computing term salience • Term frequency (TF) • Document frequency (DF) • Inverse document frequency (IDF)
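The three quantities can be sketched on a tiny corpus. The snippet below reuses the three documents from the relevance feedback example and assumes the common definition idf(t) = log(N / df(t)) with a natural logarithm; the slides do not specify which IDF variant the course uses, so treat the formula and the variable names as assumptions.

```python
import math
from collections import Counter

docs = {
    "D1": "car safety minivans tests injury statistics",
    "D2": "liability tests safety",
    "D3": "car passengers injury reviews",
}

# Term frequency: how often each term occurs in each document.
tf = {name: Counter(text.split()) for name, text in docs.items()}

# Document frequency: in how many documents each term occurs.
df = Counter(term for counts in tf.values() for term in counts)

# Inverse document frequency (assumed variant): log(N / df).
n_docs = len(docs)
idf = {term: math.log(n_docs / count) for term, count in df.items()}

for term in sorted(idf, key=idf.get):
    print(f"{term:12s} df={df[term]} idf={idf[term]:.3f}")
```

Terms that occur in two of the three documents ("car", "safety", "tests", "injury") get a lower IDF than terms that occur in only one.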
Scripts to compute tf and idf
cd /clair4/class/ir-w03/hw2
./tf.pl 053.txt | sort -nr +1 | more
./tfs.pl 053.txt | sort -nr +1 | more
./stem.pl reasonableness
./build-idf.pl
./idf.pl | sort -n +2 | more
Applications of TFIDF • Cosine similarity • Indexing • Clustering
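As an illustration of the first application, the sketch below weights each document by tf × idf and compares documents by the cosine of the angle between their vectors. It assumes raw term counts for tf and idf = log(N / df); the function names `tfidf_vectors` and `cosine` are illustrative, and the corpus is the three-document example used earlier in the handout.

```python
import math
from collections import Counter

docs = {
    "D1": "car safety minivans tests injury statistics",
    "D2": "liability tests safety",
    "D3": "car passengers injury reviews",
}

def tfidf_vectors(docs):
    """Map each document name to a sparse {term: tf * idf} vector."""
    tf = {name: Counter(text.split()) for name, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    n = len(docs)
    return {
        name: {t: c * math.log(n / df[t]) for t, c in counts.items()}
        for name, counts in tf.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

vectors = tfidf_vectors(docs)
for a, b in [("D1", "D2"), ("D1", "D3"), ("D2", "D3")]:
    print(a, b, round(cosine(vectors[a], vectors[b]), 3))
```

D2 and D3 share no terms, so their similarity is exactly zero; D1 overlaps with both and scores strictly between 0 and 1 against each.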