Delve into the world of substrings and explore how they can enhance text processing and analysis. Learn about applications like word breaking in Japanese and Chinese, term extraction for information retrieval, and more.
Substring Statistics
Kyoji Umemura, Kenneth Church
Sound Bite
Goal: Words → Substrings (anything you can do with words, we can do with substrings)
• Haven't achieved this goal
• But we can do more with substrings than you might have thought
• Review:
  • Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus (Yamamoto & Church)
  • Trigrams → Million-grams
• Tutorial: make substring statistics look easy
  • Previous treatments are (a bit) inaccessible
• Generalization: Document Frequency (df) → dfk (adaptation)
• Applications:
  • Word Breaking (Japanese & Chinese)
  • Term Extraction for Information Retrieval
Motivation & Background: Unigrams → Substrings (Ngrams)
The Chance of Two Noriegas is Closer to p/2 than p²: Implications for Language Modeling, Information Retrieval and Gzip
• Standard independence models (Binomial, Multinomial, Poisson):
  • Chance of 1st Noriega is p
  • Chance of 2nd is also p
• Repetition is very common
  • Ngrams/words (and their variant forms) appear in bursts
  • Noriega appears several times in a doc, or not at all
• Adaptation & contagious probability distributions
• Discourse structure (e.g., text cohesion, given/new):
  • 1st Noriega in a document is marked (more surprising)
  • 2nd is unmarked (less surprising)
• Empirically, we find the first Noriega is surprising (p ≈ 6/1000)
  • But the chance of two is not surprising (closer to p/2 than p²)
• Finding a rare word like Noriega is like lightning
  • We might not expect lightning to strike twice in a doc
  • But it happens all the time, especially for good keywords
• Documents ≠ Random Bags of Words
Three Applications & Independence Assumptions: No Quantity Discounts
• Compression: Huffman Coding
  • |encoding(s)| = ceil(−log2 Pr(s))
  • Two Noriegas consume twice as much space as one: |encoding(s s)| = |encoding(s)| + |encoding(s)|
  • No quantity discount
  • Independence is the worst case: any dependencies → less H (space)
• Information Retrieval
  • Score(query, doc) = Σterm in doc tf(term, doc) ∙ idf(term)
  • idf(term): inverse doc freq: −log2 Pr(term) = −log2 df(term)/D
  • tf(term, doc): number of instances of term in doc (log tf smoothing)
  • Two Noriegas are twice as surprising as one (2∙idf vs. idf)
  • No quantity discount: any dependencies → less surprise
• Speech Recognition, OCR, Spelling Correction
  • I → Noisy Channel → O
  • Pr(I) ∙ Pr(O|I)
  • Pr(I) = Pr(w1, w2 … wn) ≈ Πk Pr(wk | wk−2, wk−1)
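A back-of-the-envelope illustration using the p ≈ 6/1000 figure from the previous slide (my numbers, added here for intuition only):
• 1st Noriega: −log2(0.006) ≈ 7.4 bits
• 2nd Noriega under independence: another 7.4 bits, i.e., −log2(p²) ≈ 14.8 bits for the pair
• 2nd Noriega with adaptation (chance of two ≈ p/2 = 0.003): −log2(0.003) ≈ 8.4 bits for the pair, so the second mention costs about 1 extra bit rather than another 7.4
This is exactly the "quantity discount" that the independence assumptions above rule out.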
Interestingness Metrics: Deviations from Independence
• Poisson (and other independence assumptions): not bad for meaningless random strings
• Deviations from Poisson are clues for hidden variables: meaning, content, genre, topic, author, etc.
• Analogous to mutual information (Hanks): Pr(doctor…nurse) >> Pr(doctor) ∙ Pr(nurse)
If we had a good description of the distribution Pr(k), then we could compute any summary statistic:
• Moments: mean, variance
• Entropy: H = −Σk Pr(k) log2 Pr(k)
• Adaptation: Pr(k≥2 | k≥1)
Poisson Mixtures: More Poissons → Better Fit
(Interpretation: each Poisson is conditional on hidden variables: meaning, content, genre, topic, author, etc.)
Adaptation: Three Approaches • Cache-based adaptation • Parametric Models • Poisson, Two Poisson, Mixtures (neg binomial) • Non-parametric • Pr(+adapt1) ≡ Pr(test|hist) • Pr(+adapt2) ≡ Pr(k≥2|k ≥1)
Positive & Negative Adaptation • Adaptation: • How do probabilities change as we read a doc? • Intuition: If a word w has been seen recently • +adapt: prob of w (and its friends) goes way up • −adapt: prob of many other words goes down a little • Pr(+adapt) >> Pr(prior) > Pr(−adapt)
Adaptation: Method 1 • Split each document into two equal pieces: • Hist: 1st half of doc • Test: 2nd half of doc • Task: • Given hist • Predict test • Compute contingency table for each word
Adaptation: Method 1 • Notation • D = a+b+c+d (library) • df = a+b+c (doc freq) • Prior: • +adapt • −adapt
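The formulas for the prior and the two adaptation rates are not spelled out on this slide; a standard way to fill in the 2×2 hist/test table, following Church's adaptation work (so treat the exact cell assignments as my assumption, not the deck's), is:
• a = docs with the word in both hist and test; b = hist only; c = test only; d = neither
• D = a+b+c+d and df = a+b+c, as above
• Prior: Pr(word in test) = (a+c)/D
• +adapt: Pr(test | hist) = a/(a+b)
• −adapt: Pr(test | not hist) = c/(c+d)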
Priming, Neighborhoods and Query Expansion
• Priming: doctor/nurse
  • Doctor in hist → Pr(Nurse in test) ↑
• Find docs near hist (IR sense)
  • Neighborhood ≡ set of words in docs near hist (query expansion)
• Partition vocabulary into three sets:
  • Hist: word in hist
  • Near: word in neighborhood − hist
  • Other: none of the above
• Compare Prior, +adapt, Near and Other
Adaptation: Hist >> Near >> Prior
• Magnitude is huge
  • p/2 >> p²
  • Two Noriegas are not much more surprising than one
  • Huge quantity discounts
• Shape: given/new
  • 1st mention: marked, surprising (low prob), depends on freq
  • 2nd: unmarked, less surprising, independent of freq
• Priming: "a little bit" marked
Adaptation is Lexical • Lexical: adaptation is • Stronger for good keywords (Kennedy) • Than random strings, function words (except), etc. • Content ≠ low frequency
Adaptation: Method 2 • Pr(+adapt2) • dfk(w) ≡ number of documents that • mention word w • at least k times • df1(w) ≡ standard def of document freq (df)
Pr(+adapt1) ≈ Pr(+adapt2): within factors of 2-3 (as opposed to 10-1000)
Adaptation helps more than it hurts
• Examples of big winners (boilerplate), where hist is a great clue:
  • Lists of major cities and their temperatures
  • Lists of major currencies and their prices
  • Lists of commodities and their prices
  • Lists of senators and how they voted
• Examples of big losers, where hist is misleading:
  • Summary articles
  • Articles that were garbled in transmission
Today's Talk: Recent Work (with Kyoji Umemura)
• Applications: Japanese morphology (text → words)
  • Standard methods: dictionary-based
  • Challenge: OOV (out of vocabulary)
• Good keywords (OOV) adapt more than meaningless fragments
  • Poisson model: not bad for meaningless random strings
  • Adaptation (deviations from Poisson): great clues for hidden variables: OOV, good keywords, technical terminology, meaning, content, genre, author, etc.
  • Extend the dictionary method to also look for substrings that adapt a lot
• Practical procedure for counting dfk(s) for all substrings s in a large corpus (trigrams → million-grams)
  • Suffix array: standard method for computing freq and loc for all s
  • Yamamoto & Church (2001): count df for all s in a large corpus
    • df (and many other ngram stats) for million-grams
    • Although there are too many substrings s to work with (N²), they can be grouped into a manageable number of equivalence classes (N), where all substrings in a class share the same stats
  • Umemura (unpublished): generalize the method to dfk → adaptation for million-grams
Adaptation Conclusions
• Large magnitude (p/2 >> p²) → big quantity discounts
• Distinctive shape
  • 1st mention depends on freq
  • 2nd does not
  • Priming: between 1st mention and 2nd
• Lexical:
  • Independence assumptions aren't bad for meaningless random strings, function words, common first names, etc.
  • More adaptation for content words (good keywords, OOV)
Substring Statistics Outline
• Goal: Words → Substrings (Ngrams): anything we do with words, we should be able to do with substrings (ngrams)…
• Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof
• Suffix Arrays & LCP
• Classes: one str → all strs; N² substrings → N classes; Class(<i,j>) = { str | str starts every suffix in interval and no others }; compute stats over classes
• DFS Traversal of Class Tree
• Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdfk+1
• Neighbors
• Cross Validation
• Joint Document Freq
• Sketches
• Apps
We are here: Sound Bite
Suffix Arrays: Freq & loc of all ngrams
• Input: text, an array of N tokens
  • Null terminated
  • Tokens: words, bytes, Asian chars
• Output: s, an array of N ints, sorted "lexicographically"
  • s[i] denotes a semi-infinite string: the text starting at position s[i] and continuing to the end
  • s[i] ≡ substr(text, s[i]) ≡ text + s[i]
• Simple practical procedure:
  • Initialize: for(i=0; i<N; i++) s[i] = i;
  • Sort "lexicographically": qsort(s, N, sizeof(*s), sufcmp);
    int sufcmp(int *a, int *b) { return strcmp(text + *a, text + *b); }
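The simple procedure above, fleshed out as a minimal self-contained C program (my own sketch, not from the deck: byte tokens, and the names build_suffix_array and the demo main are assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char *text;                 /* the (null-terminated) corpus */

/* qsort comparator: compare the semi-infinite strings text+*a and text+*b */
static int sufcmp(const void *a, const void *b) {
    return strcmp(text + *(const int *)a, text + *(const int *)b);
}

/* fill s[0..N-1] with 0..N-1 and sort it "lexicographically" */
static void build_suffix_array(const char *txt, int *s, int N) {
    text = txt;
    for (int i = 0; i < N; i++) s[i] = i;
    qsort(s, N, sizeof(*s), sufcmp);
}

int main(void) {
    const char *corpus = "to_be_or_not_to_be";
    int N = strlen(corpus);
    int *s = malloc(N * sizeof(*s));
    build_suffix_array(corpus, s, N);
    for (int i = 0; i < N; i++)
        printf("s[%2d] = %2d  %s\n", i, s[i], corpus + s[i]);
    free(s);
    return 0;
}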
Frequency & Location of All Ngrams (Unigrams, Bigrams, Trigrams & Million-grams)
• Sufconc(pattern) outputs a concordance
• Two binary searches → <i, j>
  • i = first suffix in suffix array that starts with pattern
  • j = last suffix in suffix array that starts with pattern
• Freq(<i, j>) = j − i + 1
• Output j − i + 1 concordance lines, one for each suffix in <i, j> (first column: s[i], second column: doc(s[i]))
./sufconc -l 10 -r 40 /cygdrive/d/temp2/AP/AP8912 'Manuel Noriega' | head
17913368 5441: osed Gen. ^ Manuel Noriega\nin Panama _ their wives
13789741 4193: apprehend ^ Manuel Noriega\n The situation in Pana
3966027 1218: nian Gen. ^ Manuel Noriega a\n$300,000 present, and
4938894 1503: nian Gen. ^ Manuel Noriega and\nothers laundered $50
16718522 5098: ed ruler\n ^ Manuel Noriega continue to evade U.S. fo
18568442 5635: to force ^ Manuel Noriega from\npower.\n Further
14794912 4497: oust Gen. ^ Manuel Noriega from power, the zoo's dir
14434223 4380: that Gen. ^ Manuel Noriega had been killed.\n Mary
14237714 4321: .''\n `` ^ Manuel Noriega had explicitly declared w
19901786 6061: nian Gen. ^ Manuel Noriega in the hands of a special
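A sketch of the two binary searches (my own code, not the deck's: prefix_cmp and find_interval are assumed helper names, and it reuses the suffix array built in the earlier sketch):

/* does the suffix starting at txt+pos begin with pat? (<0, 0, >0 as strncmp) */
static int prefix_cmp(const char *txt, int pos, const char *pat) {
    return strncmp(txt + pos, pat, strlen(pat));
}

/* find <i, j>: first and last suffixes in s[] starting with pat; return freq (0 if absent) */
static int find_interval(const char *txt, const int *s, int N,
                         const char *pat, int *i_out, int *j_out) {
    int lo, hi, i, j;

    /* first suffix that is >= pat on its first |pat| characters */
    lo = 0; hi = N;                               /* search in [lo, hi) */
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (prefix_cmp(txt, s[mid], pat) < 0) lo = mid + 1; else hi = mid;
    }
    i = lo;
    if (i == N || prefix_cmp(txt, s[i], pat) != 0) return 0;   /* pattern absent */

    /* first suffix that sorts after all suffixes starting with pat */
    lo = i; hi = N;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (prefix_cmp(txt, s[mid], pat) <= 0) lo = mid + 1; else hi = mid;
    }
    j = lo - 1;

    *i_out = i; *j_out = j;
    return j - i + 1;                             /* Freq(<i, j>) = j - i + 1 */
}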
Suffix Arrays: Computational Complexity (Bottom Line: O(N log N) Time & O(N) Space)
• Simple practical procedure:
  • Initialize: for(i=0; i<N; i++) s[i] = i;
  • Sort "lexicographically": qsort(s, N, sizeof(*s), sufcmp);
    int sufcmp(int *a, int *b) { return strcmp(text + *a, text + *b); }
• You might think this takes O(N log N) time
  • But unfortunately, sufcmp is not O(1), so the sort is O(N² log N)
• Fortunately, there is an O(N log N) algorithm
  • See http://www.cs.dartmouth.edu/~doug/ for an excellent tutorial
• But in practice, the simple procedure is often just as good (if not slightly better)
Distribution of LCPs (1989 AP News) • Peak ≈ 10 bytes (roughly word bigrams) • Long tail (boilerplate & duplicate docs)
Substring Statistics Outline
• Goal: Words → Substrings (Ngrams): anything we do with words, we should be able to do with substrings (ngrams)…
• Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof
• Suffix Arrays & LCP
• Classes: one str → all strs; N² substrings → N classes; Class(<i,j>) = { str | str starts every suffix in interval and no others }; compute stats over classes
• Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdfk+1
• Neighbors
• Depth-First Traversal of Class Tree
• Cross Validation
• Joint Document Freq
• Sketches
• Apps
We are here
Distributional Equivalence
• sufconc: frequency and location for one substring
  • Impressive: trigrams → million-grams
• Challenge: one substring → all substrings
  • Too many substrings: N²
• Solution: group substrings into equivalence classes
  • N² substrings → N classes
• str1 ≡ str2 iff every suffix that starts with str1 also starts with str2, and vice versa
  • Example ("to be or not to be"): "to" ≡ "to be"
• Class(<i,j>) = { str | str starts every suffix in interval and no others }
• Compute stats over N classes rather than over N² substrings
Grouping Substrings into Classes
• Interval on suffix array: <i, j>
• Class(<i, j>) is the set of substrings that
  • start every suffix within the interval
  • and no suffixes outside the interval
• Examples:
  • Class(<6,7>) = {"b", "be"}
  • Class(<17,18>) = {"to", "to_", "to_b", "to_be"}
• Classes form an equivalence relation R
  • str1 R str2 ↔ str1 & str2 in same class
  • Interpretation: distributional equivalence
  • "to" R "to be" → "to" and "to be" appear in exactly the same places in the corpus
• R partitions the set of all substrings: every substring appears in one and only one class
• R is reflexive, symmetric and transitive
Although there are too many substrings to work with (≈N²), they can be grouped into a manageable number of classes
171 Substrings → 8 Non-Trivial Classes
• Corpus: N = 18, "to_be_or_not_to_be"
• Substrings:
  • Theory: N∙(N+1)/2 = 171
  • 150 observed
    • 135 with freq = 1 (yellow)
    • 15 with freq > 1 (green)
• Classes:
  • Theory: at most 2N = 36 (N trivial + N non-trivial)
  • 21 observed
    • 13 trivial (yellow)
    • 8 non-trivial (green)
Motivation for Grouping Substrings into Equivalence Classes
• Computational issues:
  • N is more manageable than N²
  • Statistics can be computed over classes, because all substrings in a class have the same stats
  • True for many popular stats: freq, df, dfk, joint df & contingency tables, and combinations thereof
• Examples (corpus = "to_be_or_not_to_be"):
  • Class(<6,7>) = {"b", "be"}
    • freq("b") = freq("be") = 7 − 6 + 1 = 2
  • Class(<17,18>) = {"to", "to_", "to_b", "to_be"}
    • freq("to") = freq("to_") = freq("to_b") = freq("to_be") = 18 − 17 + 1 = 2
  • Class(<11,14>) = {"o"}
    • freq("o") = 14 − 11 + 1 = 4
Substring Statistics Outline
• Goal: Words → Substrings (Ngrams): anything we do with words, we should be able to do with substrings (ngrams)…
• Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof
• Suffix Arrays & LCP
• Classes: one str → all strs; N² substrings → N classes; Class(<i,j>) = { str | str starts every suffix in interval and no others }; compute stats over classes
• DFS Traversal of Class Tree
• Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdfk+1
• Neighbors
• Cross Validation
• Joint Document Freq
• Sketches
• Apps
We are here
Class Tree: Nesting of Valid Intervals
• LCP[i] = Longest Common Prefix of s[i] & s[i+1]
• SIL(<i, j>) = Shortest Interior LCP = MINi≤k<j { LCP[k] }
• LBL(<i, j>) = Longest Bounding LCP = max(LCP[i], LCP[j])
• Class(<i, j>) = { str | str starts every suffix within interval, and no others } = { substr(text + s[i], 1, k) } where LBL < k ≤ SIL
• Example: for <1, 5>, LBL = Longest Bounding LCP = 0 and SIL = Shortest Interior LCP = 1, so Class(<1, 5>) = { substr("_be", 1, k) } for 0 < k ≤ 1
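A sketch of how the LCP array used here can be computed from the suffix array, following the slide's convention that LCP[i] compares s[i] with s[i+1] (my code, assuming the same includes as the earlier sketch; a simple scan rather than the linear-time construction):

/* LCP[i] = length of the longest common prefix of the suffixes s[i] and s[i+1] */
static int *build_lcp(const char *txt, const int *s, int N) {
    int *LCP = malloc(N * sizeof(*LCP));
    for (int i = 0; i + 1 < N; i++) {
        const char *a = txt + s[i], *b = txt + s[i + 1];
        int k = 0;
        while (a[k] != '\0' && a[k] == b[k]) k++;
        LCP[i] = k;
    }
    LCP[N - 1] = 0;                       /* no successor: boundary LCP is 0 */
    return LCP;
}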
Enumerating Classes
• A class is uniquely determined by
  • two endpoints <i, j>, or
  • SIL & a witness: i ≤ w ≤ j
• To enumerate classes:
  • Enumerate 0 ≤ w < N & LCP[w]
  • Remove duplicate classes (two witnesses and their LCPs might specify the same class)
• Alternatively, depth-first traversal of the class tree
• Output: <i, j> and SIL(<i, j>)

struct stackframe { int i, j, SIL } *stack;
int sp = 0;  /* stack pointer */
stack[sp].i = 0; stack[sp].SIL = -1;
for(w=0; w<N; w++) {
  if(LCP[w] > stack[sp].SIL) {      /* push */
    sp++;
    stack[sp].i = w;
    stack[sp].SIL = LCP[w];
  }
  while(LCP[w] < stack[sp].SIL) {   /* pop */
    stack[sp].j = w;
    output(&stack[sp]);
    if(LCP[w] <= stack[sp-1].SIL) sp--;
    else stack[sp].SIL = LCP[w];
  }
}
Depth-first traversal (the same push/pop code as on the previous slide)
• Outputs <i, j> and SIL(<i, j>)
• Sorted first by j (increasing order) and then by i (decreasing order)
Find Class
• Input: pattern (a substring such as "Norieg")
• Output: <i, j>, LBL, SIL, stats, Class(<i, j>)
• Method:
  • Two binary searches into the suffix array to find the first (i) and last (j) suffix starting with the input pattern
  • A third binary search into the (pre-computed) classes to find the class and its associated stats
• Computed from i and j (first two binary searches):
  • LBL(<i, j>) = max(LCP[i], LCP[j])
• Computed from the class (third binary search):
  • SIL
  • dfk: number of documents that contain the input pattern at least k times
• Class(<i, j>) = { substr(text + s[i], 1, k) } for LBL < k ≤ SIL
• Takes advantage of the ordering on classes (sorted first by j and then by i)
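A sketch of the third binary search, assuming the classes were saved in the DFS output order just described (sorted by j ascending, then i descending); the struct layout and find_class are mine, not the deck's:

struct class_rec { int i, j, SIL, dfk };   /* one record per non-trivial class */

/* classes[] is sorted by j ascending, then i descending (the DFS output order).
   Return the index of the class whose interval is exactly <i, j>, or -1 if the
   pattern occurs only once (trivial class), in which case it was never emitted. */
static int find_class(const struct class_rec *classes, int nclasses, int i, int j) {
    int lo = 0, hi = nclasses - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (classes[mid].j < j || (classes[mid].j == j && classes[mid].i > i))
            lo = mid + 1;                  /* mid sorts before <i, j> */
        else if (classes[mid].j > j || classes[mid].i < i)
            hi = mid - 1;                  /* mid sorts after <i, j> */
        else
            return mid;                    /* exact match */
    }
    return -1;
}

If find_class returns -1, the pattern's interval is a single suffix (freq = 1), so its stats are trivial.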
Substring Statistics Outline
• Goal: Words → Substrings (Ngrams): anything we do with words, we should be able to do with substrings (ngrams)…
• Stats: freq(str) and location (sufconc), df(str), dfk(str), jdf(str1, str2) and combinations thereof
• Suffix Arrays & LCP
• Classes: one str → all strs; N² substrings → N classes; Class(<i,j>) = { str | str starts every suffix in interval and no others }; compute stats over classes
• DFS Traversal of Class Tree
• Cumulative Document Frequency (cdfk): freq = cdf1, dfk = cdfk − cdfk+1
• Neighbors
• Cross Validation
• Joint Document Freq
• Sketches
• Apps
We are here
Corpus (3 docs): • Hi_Ho_Hi_Ho • Hi_Ho • Hi Need a few docs to talk about df
Cumulative Document Frequency (cdf)
• Document Frequency (df): number of documents that mention str at least once
• dfk ≡ number of documents that mention str at least k times
• Adaptation = Pr(k≥2 | k≥1) = df2 / df1
• Cumulative Document Frequency (cdf): cdfk ≡ Σi≥k dfi
• Can recover freq & dfk from cdfk:
  • freq = cdf1 = j − i + 1
  • dfk = cdfk − cdfk+1
• A (simple but slow) method for computing cdfk:
  • cdf1(<i, j>) = Σi≤w≤j 1
  • cdf2(<i, j>) = Σi≤w≤j [neighbor[w] ≥ i]
  • cdfk(<i, j>) = Σi≤w≤j [neighbork−1[w] ≥ i]
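A quick sanity check on the 3-document corpus above, for str = "Hi" (my arithmetic, not from the slides):
• Counts per doc: 2, 1, 1, so freq = 4
• df1 = 3 (all three docs), df2 = 1 (only the first doc has "Hi" twice), df3 = 0
• cdf1 = df1 + df2 + df3 = 4 = freq, and cdf2 = df2 + df3 = 1
• dfk = cdfk − cdfk+1 recovers df1 = 4 − 1 = 3 and df2 = 1 − 0 = 1
• Adaptation for "Hi": Pr(k≥2 | k≥1) = df2/df1 = 1/3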
Neighbors
• doc(s): 1..D (using binary search over document boundaries)
• Neighbor[s2] = s1 where doc(s1) = doc(s2) = d and s1 and s2 are adjacent, i.e., there is no suffix s3 such that doc(s3) = d and s1 < s3 < s2
• Neighbor[s2] = NA if s2 is the first suffix in its doc, i.e., there is no suffix s1 such that doc(s1) = doc(s2) and s1 < s2
• Neighbork[s] = Neighbork−1[Neighbor[s]], for k ≥ 1
• Neighbor0[s] = s (identity)
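One way to fill in the neighbor[] array in a single pass over the suffix array (my sketch, not the deck's code: positions are suffix-array indices, doc_start[] is an assumed array holding the text offset where each document begins, and reading "adjacent" as adjacency in suffix-array order is my interpretation of the slide):

/* neighbor[w] = the largest w' < w (suffix-array order) with doc(s[w']) == doc(s[w]),
   or -1 (NA) if s[w] is the first suffix of its document in the suffix array */
static int *build_neighbors(const int *s, int N, const int *doc_start, int D) {
    int *neighbor  = malloc(N * sizeof(*neighbor));
    int *last_seen = malloc(D * sizeof(*last_seen));
    for (int d = 0; d < D; d++) last_seen[d] = -1;

    for (int w = 0; w < N; w++) {
        /* doc(s[w]): binary search for the last document starting at or before s[w] */
        int lo = 0, hi = D - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) / 2;
            if (doc_start[mid] <= s[w]) lo = mid; else hi = mid - 1;
        }
        neighbor[w] = last_seen[lo];
        last_seen[lo] = w;
    }
    free(last_seen);
    return neighbor;
}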
Simple (but slow) code for cdfk
• cdfk(<i, j>) = Σi≤w≤j [neighbork−1[w] ≥ i]
• Neighbork[s] = Neighbork−1[Neighbor[s]], for k ≥ 1

struct class { int start, end, SIL };

/* returns neighbor^k(suf) or -1 if NA */
int kth_neighbor(int suf, int k) {
  if(suf >= 0 && k >= 1)
    return kth_neighbor(neighbors[suf], k-1);
  else
    return suf;
}

struct class c;
while(fread(&c, sizeof(c), 1, stdin)) {
  int cdfk = 0;
  for(w=c.start; w<=c.end; w++)
    if(kth_neighbor(w, K-1) >= c.start)
      cdfk++;
  putw(cdfk, out);   /* report */
}
Same as before (but folded into the depth-first search)
• cdfk(<i, j>) = Σi≤w≤j [neighbork−1[w] ≥ i]
• Neighbork[s] = Neighbork−1[Neighbor[s]], for k ≥ 1

struct stackframe { int start, SIL, cdfk } *stack;

/* returns neighbor^k(suf) or -1 if NA */
int kth_neighbor(int suf, int k) {
  if(suf >= 0 && k >= 1)
    return kth_neighbor(neighbors[suf], k-1);
  else
    return suf;
}

for(w=0; w<N; w++) {
  if(LCP[w] > stack[sp].SIL) {            /* push */
    sp++;
    stack[sp].start = w;
    stack[sp].SIL = LCP[w];
    stack[sp].cdfk = 0;
  }
  for(sp1=0; sp1<=sp; sp1++) {            /* check every open frame (the slow part) */
    if(kth_neighbor(w, K-1) >= stack[sp1].start)
      stack[sp1].cdfk++;
  }
  while(LCP[w] < stack[sp].SIL) {         /* pop */
    putw(stack[sp].cdfk, out);            /* report */
    if(LCP[w] <= stack[sp-1].SIL) sp--;
    else stack[sp].SIL = LCP[w];
  }
}
Results
• cdfk ≡ cumulative df = Σi≥k dfi
• freq = cdf1 = j − i + 1
• dfk = cdfk − cdfk+1
• cdf1(<i, j>) = Σi≤w≤j 1 = j − i + 1
• cdf2(<i, j>) = Σi≤w≤j [neighbor[w] ≥ i]
• cdfk(<i, j>) = Σi≤w≤j [neighbork−1[w] ≥ i]
• dfk ≥ dfk+1 and cdfk ≥ cdfk+1
Monotonicity
• dfk ≥ dfk+1
• cdfk ≥ cdfk+1
• cdfk[mother] ≥ Σd∈daughters cdfk[d]
• Opportunity for speedup: propagate counts up the class tree
Faster, O(N ∙ max(k, log max(LCP))) code for cdfk
• For each of the N suffixes: one kth_neighbor call (O(k)) and one binary search into the stack, whose depth is at most max(LCP), so O(log max(LCP))
• Propagate counts up the class tree on pop

struct stackframe { int start, SIL, cdfk } *stack;

/* returns neighbor^k(suffix) or -1 if NA */
int kth_neighbor(int suffix, int k) {
  int i, result = suffix;
  for(i=0; i < k && result >= 0; i++)
    result = neighbors[result];
  return result;
}

/* return the innermost (deepest) stack frame that starts at or before suffix;
   binary search works because the stack is sorted by start */
int find(int suffix) {
  int low = 0;
  int high = sp;
  while(low + 1 < high) {
    int mid = (low + high) / 2;
    if(stack[mid].start <= suffix) low = mid;
    else high = mid;
  }
  if(stack[high].start <= suffix) return high;
  if(stack[low].start <= suffix) return low;
  fatal("can't get here");
}

for(w=0; w<N; w++) {
  if(LCP[w] > stack[sp].SIL) {             /* push */
    sp++;
    stack[sp].start = w;
    stack[sp].SIL = LCP[w];
    stack[sp].cdfk = 0;
  }
  int prev = kth_neighbor(w, K-1);
  if(prev >= 0) stack[find(prev)].cdfk++;  /* credit only the innermost open class */
  while(LCP[w] < stack[sp].SIL) {          /* pop */
    putw(stack[sp].cdfk, out);             /* report */
    if(LCP[w] <= stack[sp-1].SIL) {
      stack[sp-1].cdfk += stack[sp].cdfk;  /* propagate counts up the class tree */
      sp--;
    }
    else stack[sp].SIL = LCP[w];
  }
}
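Why this is faster (my gloss on the two code slides above, not text from the deck): the slow version credits every open stack frame for every suffix, which costs O(N ∙ stack depth); here each suffix credits only the innermost enclosing frame (one binary search), and a frame's total is folded into its parent when the frame is popped, so by the propagation argument on the monotonicity slide every class still ends up with the same cdfk.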