ISP 433/633 Week 4: Text Operation, Indexing and Search
Example Collection
• Documents
• D1: It is a dog eat dog world!
• D2: While the world sleeps.
• D3: Let sleeping dogs lie.
• D4: I will eat my hat.
• D5: My dog wears a hat.
Step 1: Parse Text Into Words
• Break at spaces and punctuation (a sketch follows)
• D1: IT IS A DOG EAT DOG WORLD
• D2: WHILE THE WORLD SLEEPS
• D3: LET SLEEPING DOGS LIE
• D4: I WILL EAT MY HAT
• D5: MY DOG WEARS A HAT
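A minimal sketch of this step in Python; the regex tokenizer and the uppercase convention are illustrative assumptions, not prescribed by the slides:

    import re

    def parse(text):
        # Break at spaces and punctuation: keep only alphabetic runs,
        # uppercased to match the slides' convention.
        return re.findall(r"[A-Za-z]+", text.upper())

    docs = {
        "D1": "It is a dog eat dog world!",
        "D2": "While the world sleeps.",
        "D3": "Let sleeping dogs lie.",
        "D4": "I will eat my hat.",
        "D5": "My dog wears a hat.",
    }
    tokens = {d: parse(t) for d, t in docs.items()}
    # tokens["D1"] -> ['IT', 'IS', 'A', 'DOG', 'EAT', 'DOG', 'WORLD']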
Step 2: Stop Word Elimination
• Remove non-distinguishing words: pronouns, prepositions, articles, and forms of "to be", "to have", "to do"
• Examples: I, MY, IT, YOUR, …, OF, BY, ON, …, A, THE, THIS, …, IS, HAS, WILL, …
• D1: DOG EAT DOG WORLD
• D2: WORLD SLEEPS
• D3: LET SLEEPING DOGS LIE
• D4: EAT HAT
• D5: DOG WEARS HAT
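Continuing the sketch above, with a tiny illustrative stop list (a real system would use the full 250-300 word list discussed next):

    STOP_WORDS = {"I", "MY", "IT", "YOUR", "OF", "BY", "ON", "A",
                  "THE", "THIS", "IS", "HAS", "WILL", "WHILE"}

    def remove_stop_words(tokens):
        # Drop non-distinguishing words before indexing.
        return [t for t in tokens if t not in STOP_WORDS]

    # remove_stop_words(tokens["D1"]) -> ['DOG', 'EAT', 'DOG', 'WORLD']
    # remove_stop_words(tokens["D2"]) -> ['WORLD', 'SLEEPS']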
Stop Words List
• The 250-300 most common words in English account for 50% or more of a given text.
• Example: "the" and "of" represent 10% of tokens; "and", "to", "a", and "in" another 10%; the next 12 words another 10%.
• Moby Dick, Ch. 1: 859 unique words (types), 2256 word occurrences (tokens).
• The top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
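These coverage figures are easy to reproduce for any text; a sketch (the function name and the word list argument are assumptions for illustration):

    from collections import Counter

    def coverage_stats(words, k=65):
        counts = Counter(words)
        tokens = sum(counts.values())            # total occurrences
        types = len(counts)                      # unique words
        top_k = sum(c for _, c in counts.most_common(k))
        return tokens / types, top_k / tokens

    # With Moby Dick Ch. 1 (2256 tokens, 859 types), this would give a
    # token/type ratio of 2256/859 = 2.63 and top-65 coverage of
    # 1132/2256 > 50%, matching the slide's figures.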
Step 3: Stemming
• Goal: "normalize" similar words
• D1: DOG EAT DOG WORLD
• D2: WORLD SLEEP
• D3: LET SLEEP DOG LIE
• D4: EAT HAT
• D5: DOG WEAR HAT
Stemming and Morphological Analysis
• Morphology: the "form" of words
• Inflectional morphology
  • E.g., inflected verb endings and noun number
  • Never changes the grammatical class
  • dog, dogs
• Derivational morphology
  • Derives one word from another
  • Often changes the grammatical class
  • build, building; health, healthy
Simple "S" Stemming
• IF a word ends in "ies", but not "eies" or "aies", THEN "ies" → "y"
• IF a word ends in "es", but not "aes", "ees", or "oes", THEN "es" → "e"
• IF a word ends in "s", but not "us" or "ss", THEN "s" → NULL
• Harman, JASIS 1991
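A direct transcription of these three rules in Python (order matters: only the first matching rule fires; lowercase input is assumed):

    def s_stem(word):
        # Harman's "S" stemmer: three ordered rules, first match wins.
        if word.endswith("ies") and not word.endswith(("eies", "aies")):
            return word[:-3] + "y"
        if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
            return word[:-2] + "e"
        if word.endswith("s") and not word.endswith(("us", "ss")):
            return word[:-1]
        return word

    # s_stem("ponies") -> "pony"; s_stem("dogs") -> "dog"
    # s_stem("glass") -> "glass" (the "ss" exception)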
Porter's Algorithm
• An effective, simple, and popular English stemmer
• Official URL: http://www.tartarus.org/~martin/PorterStemmer/
• A demo: http://snowball.tartarus.org/demo.php
Porter's Algorithm
• 1. The measure m of a stem is based on its alternating sequences of vowels and consonants. If V is a sequence of vowels and C is a sequence of consonants, every stem can be written as [C](VC)^m[V], where the initial C and the final V are optional and m is the number of VC repeats.
  • m=0: free, why
  • m=1: frees, whose
  • m=2: prologue, compute
• 2. *<X> - the stem ends with the letter X
• 3. *v* - the stem contains a vowel
• 4. *d - the stem ends in a double consonant
• 5. *o - the stem ends in a consonant-vowel-consonant sequence where the final consonant is not w, x, or y
• Porter, Program 1980
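The measure is simple to compute; a sketch (the function name is made up, and y is treated as a vowel when it follows a consonant, per Porter's definition):

    def measure(stem):
        # Porter's m: the number of VC repeats in the form [C](VC)^m[V].
        def is_vowel(i):
            if stem[i] in "aeiou":
                return True
            return stem[i] == "y" and i > 0 and not is_vowel(i - 1)
        form = "".join("V" if is_vowel(i) else "C" for i in range(len(stem)))
        return form.count("VC")

    # measure("free") -> 0; measure("frees") -> 1; measure("compute") -> 2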
Porter's Algorithm
• Suffix conditions take the form: current_suffix == pattern
• Actions take the form: old_suffix -> new_suffix
• Rules are divided into steps that define the order in which they are applied. Some examples of the rules:

    Step  Condition  Suffix  Replacement  Example
    1a    NULL       sses    ss           stresses -> stress
    1b    *v*        ing     NULL         making -> mak
    1b1   NULL       at      ate          inflat(ed) -> inflate
    1c    *v*        y       i            happy -> happi
    2     m>0        aliti   al           formaliti -> formal
    3     m>0        icate   ic           duplicate -> duplic
    4     m>1        able    NULL         adjustable -> adjust
    5a    m>1        e       NULL         inflate -> inflat
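In practice one rarely transcribes the full rule set by hand; a sketch assuming the NLTK library is installed (its PorterStemmer implements the algorithm, with some later refinements, so a few outputs may differ slightly from the 1980 paper):

    from nltk.stem import PorterStemmer  # assumes nltk is installed

    stemmer = PorterStemmer()
    for word in ["stresses", "happy", "formality",
                 "duplicate", "adjustable", "inflate"]:
        print(word, "->", stemmer.stem(word))
    # stresses -> stress, happy -> happi, formality -> formal, ...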
Problems with Porter's Algorithm
• Unreadable results (stems such as "happi" and "duplic" are not words)
• Does not handle some irregular verbs and adjectives
  • take/took
  • bad/worse
• Possible errors: over-stemming conflates unrelated words, while under-stemming leaves related variants apart
Step 4: Indexing
• Inverted files (a minimal version is sketched below)
• Inverted file for the example collection:

    Vocabulary  Occurrences
    DOG         D1, D3, D5
    EAT         D1, D4
    HAT         D4, D5
    LET         D3
    LIE         D3
    SLEEP       D2, D3
    WEAR        D5
    WORLD       D1, D2
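A sketch of building this inverted file in Python, assuming the per-document term lists produced by steps 1-3:

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs maps a document ID to its processed term list.
        index = defaultdict(set)
        for doc_id, terms in docs.items():
            for term in terms:
                index[term].add(doc_id)
        return {t: sorted(ids) for t, ids in sorted(index.items())}

    processed = {
        "D1": ["DOG", "EAT", "DOG", "WORLD"],
        "D2": ["WORLD", "SLEEP"],
        "D3": ["LET", "SLEEP", "DOG", "LIE"],
        "D4": ["EAT", "HAT"],
        "D5": ["DOG", "WEAR", "HAT"],
    }
    index = build_inverted_index(processed)
    # index["DOG"] -> ['D1', 'D3', 'D5']

To support phrase queries or ranking, the occurrence lists would store positions or weights rather than bare document IDs, as the next slide notes.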
Inverted Files
• Occurrences can point to:
  • documents
  • positions in a document
  • weights
• The most commonly used indexing method
• Based on words
  • Queries such as phrases are expensive to solve
  • Some data does not have words, e.g., genetic data
Suffix Trees
• Example text: "This is a text. A text has many words. Words are made from letters."
• [Figure: a Patricia tree over the suffixes of the example text; leaves store the starting position of each indexed suffix, e.g. "text" at positions 11 and 19, "many" at 28, "words" at 33 and 40, "made" at 50, "letters" at 60.]
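A Patricia tree compresses single-child chains; as a simpler illustration of the same idea, here is a naive (uncompressed, quadratic-space) suffix trie in Python:

    def build_suffix_trie(text):
        # Insert every suffix of the text, character by character.
        root = {}
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node.setdefault(ch, {})
            node["$"] = start  # leaf marker: where this suffix begins
        return root

    def contains(trie, pattern):
        # Any substring of the text is a prefix of some suffix.
        node = trie
        for ch in pattern:
            if ch not in node:
                return False
            node = node[ch]
        return True

    trie = build_suffix_trie("This is a text. A text has many words.")
    # contains(trie, "text") -> True; contains(trie, "letter") -> False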
Text Compression
• Represent the text in fewer bits
• The symbols to be compressed are words
• Method of choice: Huffman coding
Huffman Coding
• Developed by David Huffman (1952)
• Averages about 5 bits per character on English text
• Based on the frequency distributions of symbols
• Idea: assign shorter codes to more frequent symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols (sketched below)
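A sketch of the tree-building step in Python; for brevity it tracks each subtree as a partial code table rather than explicit tree nodes:

    import heapq
    from collections import Counter

    def huffman_codes(symbols):
        # Repeatedly merge the two least frequent subtrees; prefixing
        # their codes with 0 and 1 records the tree structure.
        freq = Counter(symbols)
        heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, left = heapq.heappop(heap)   # least frequent
            f2, _, right = heapq.heappop(heap)  # second least frequent
            merged = {s: "0" + c for s, c in left.items()}
            merged.update({s: "1" + c for s, c in right.items()})
            tiebreak += 1
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        return heap[0][2]

    codes = huffman_codes("ABRACADABRA")
    # The frequent symbol 'A' gets a shorter code than the rare 'C' or 'D'.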
An Example
• [Figure: an example Huffman tree over the symbols a-j; edges are labeled 0 (left) and 1 (right), and each symbol's codeword is the sequence of edge labels on the path from the root to its leaf.]
Exercise
• Consider the bit string: 011011011110001001100011101001110001101011010111
• Use the Huffman code from the example to decode it.
• Then try inserting, deleting, or flipping some bits at random locations and decode again.
Huffman Code
• Prefix property: no codeword in the code is a prefix of any other codeword, so decoding is unambiguous
• Random access: decompression can start from anywhere in the coded text
• Not the fastest compression method
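The prefix property is what makes the bit-by-bit decoding in the exercise well defined; a sketch, reusing the codes table from the Huffman example above:

    def huffman_decode(bits, codes):
        # Because no codeword prefixes another, emitting a symbol at
        # the first codeword match is always the unique parse.
        inverse = {c: s for s, c in codes.items()}
        out, current = [], ""
        for b in bits:
            current += b
            if current in inverse:
                out.append(inverse[current])
                current = ""
        return "".join(out)

    encoded = "".join(codes[s] for s in "ABRACADABRA")
    # huffman_decode(encoded, codes) -> "ABRACADABRA"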
Sequential String Searching
• Boyer-Moore algorithm (a simplified variant is sketched below)
• Example: search for "cats" in "the catalog of all cats"
• Some preprocessing of the pattern is needed.
• Demo: http://www-sr.informatik.uni-tuebingen.de/~buehler/BM/BM.html
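Full Boyer-Moore uses two shift heuristics; the Horspool simplification below keeps only the bad-character shift, which is enough to show the idea:

    def horspool_search(text, pattern):
        # Compare the window right to left; on a mismatch, shift by the
        # precomputed distance for the character under the window's end.
        m = len(pattern)
        shift = {pattern[i]: m - 1 - i for i in range(m - 1)}
        pos = 0
        while pos + m <= len(text):
            i = m - 1
            while i >= 0 and text[pos + i] == pattern[i]:
                i -= 1
            if i < 0:
                return pos  # match starts here
            pos += shift.get(text[pos + m - 1], m)
        return -1

    # horspool_search("the catalog of all cats", "cats") -> 19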