Phrase Hierarchy Inference Gordon Paynter, UC Riverside Craig Nevill-Manning, Google Ian Witten, University of Waikato
Outline • Overlapping vs non-overlapping phrases • Memory-based algorithm • Suffix trees • Suffix arrays • Multipass algorithm
Non-overlapping phrases • Given a text, parse it into a tree of repeated phrases • Advantage • Based on existing data compression algorithms • Disadvantage • Sometimes arbitrary association of words • Example: "In the beginning, God created the heaven and the earth"
Overlapping Phrases • Instead, we count all repeating phrases, even if two phrases overlap • Limit phrase length to, say, ten words
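As a minimal sketch of what counting overlapping phrases means, the code below counts every contiguous run of 1 to `max_len` words and keeps those that occur more than once. The function names, the `max_len` parameter, and the sample sentence are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

def count_phrases(words, max_len=10):
    """Count every contiguous phrase of 1..max_len words, overlaps included."""
    counts = Counter()
    for start in range(len(words)):
        # Do not run past the end of the text.
        longest = min(max_len, len(words) - start)
        for length in range(1, longest + 1):
            counts[tuple(words[start:start + length])] += 1
    return counts

def repeated_phrases(words, max_len=10):
    """Keep only the phrases that occur more than once."""
    return {p: c for p, c in count_phrases(words, max_len).items() if c > 1}

# Hypothetical sample text for illustration.
words = "the old man and the sea and the old man".split()
repeats = repeated_phrases(words)
```

Here both `("the", "old")` and `("old", "man")` are kept even though they overlap on the word "old", which a non-overlapping parse could not do.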
Memory-based Algorithm • For each word w: • Everywhere that word occurs, consider the phrase formed by the word plus the word to the left (aw) • Similarly for words to the right (wa) • If the phrase is always preceded or followed by the same word, extend the phrase • If the phrase begins or ends with a stopword, extend the phrase • Add all the extended phrases to the list of expansions for w • For each phrase p: • …
Memory-based Algorithm • Problem: • How to efficiently find words to the right and left for every occurrence of a word or a phrase? • Solution: • Suffix trees
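To make the problem concrete, here is a naive sketch of a single extension step: it scans the whole text for each phrase, which is exactly the cost a suffix tree is meant to avoid. The helper names and the sample text are assumptions, the stopword rule is left out, and the elided steps of the algorithm are not filled in.

```python
def find_occurrences(words, phrase):
    """Return every start index of phrase in words (naive linear scan)."""
    n = len(phrase)
    return [i for i in range(len(words) - n + 1)
            if tuple(words[i:i + n]) == phrase]

def neighbours(words, phrase):
    """Collect the words seen immediately left and right of each occurrence.

    None stands for "no neighbour" at the start or end of the text.
    """
    left, right = set(), set()
    for i in find_occurrences(words, phrase):
        left.add(words[i - 1] if i > 0 else None)
        end = i + len(phrase)
        right.add(words[end] if end < len(words) else None)
    return left, right

def extend_once(words, phrase):
    """Extend phrase by one word if every occurrence shares that neighbour."""
    left, right = neighbours(words, phrase)
    if len(right) == 1 and None not in right:
        return phrase + (next(iter(right)),)
    if len(left) == 1 and None not in left:
        return (next(iter(left)),) + phrase
    return phrase

# Hypothetical sample text for illustration.
words = "gone with the wind and gone with the sea".split()
step1 = extend_once(words, ("gone",))
step2 = extend_once(words, step1)
step3 = extend_once(words, step2)
```

Each call to `find_occurrences` reads the entire text, so repeating this for every word and phrase quickly becomes expensive.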
Suffix Tree • A compacted trie of suffixes • Trie: a tree containing a set of strings • Example: the words of "she sells sea shells on the sea shore" • [Figure: trie of the example words, one character per edge]
Suffix Tree • Compacted trie: no nodes with only one child • [Figure: the example trie before and after compaction; single-child chains are merged into edges such as "lls", "ore", "on" and "the"]
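The two trie slides can be sketched with nested dictionaries. This is an illustrative toy, not the authors' implementation: the `"$"` end-of-string marker and the function names are assumptions, and the marker is kept as a leaf so that "she" can still be told apart from "shells".

```python
def build_trie(strings):
    """Build a trie as nested dicts; the key '$' marks the end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root

def compact(node):
    """Merge each chain of single-child nodes into one multi-character edge."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (next_label, next_child), = child.items()
            if next_label == "$":
                break  # keep the end marker as its own leaf
            label += next_label
            child = next_child
        out[label] = compact(child)
    return out

words = "she sells sea shells on the sea shore".split()
trie = build_trie(words)
compacted = compact(trie)
```

After compaction the root has only the edges `s`, `on` and `the`, and the chain l, l, s below "she" has become the single edge `lls`.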
Suffix Tree • Compacted trie of all suffixes (· marks a space)
she·sells·sea·shells·on·the·sea·shore
he·sells·sea·shells·on·the·sea·shore
e·sells·sea·shells·on·the·sea·shore
·sells·sea·shells·on·the·sea·shore
sells·sea·shells·on·the·sea·shore
ells·sea·shells·on·the·sea·shore
lls·sea·shells·on·the·sea·shore
ls·sea·shells·on·the·sea·shore
s·sea·shells·on·the·sea·shore
·sea·shells·on·the·sea·shore
sea·shells·on·the·sea·shore
…
Two Surprising Facts • Even though there are O(n²) characters in all the suffixes, • Suffix trees consume O(n) space • Suffix trees take O(n) time to compute
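The quadratic character count follows from adding up the suffix lengths: a text of n characters has one suffix of each length from 1 to n. As a worked example, the sentence "she sells sea shells on the sea shore" has n = 37 characters when spaces are counted.

```latex
\sum_{i=1}^{n} i = \frac{n(n+1)}{2} = O(n^2),
\qquad
n = 37 \;\Rightarrow\; \frac{37 \cdot 38}{2} = 703
```

So the 37-character example already has 703 characters across all its suffixes, while a suffix tree for it needs only O(n) nodes.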
Suffix Tree • How does the suffix tree help us? • Build a suffix tree of words (instead of single letters) • For any word, words to the right are children in the tree • Compaction means that the longest unique sequence is already computed • For words to the left, build a suffix tree for the reverse sequence
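The idea of reading right-hand neighbours off the children can be sketched with a word-level suffix trie. Note the assumptions: this is a naive, depth-limited, uncompacted trie built in quadratic time, not a linear-time suffix tree, and the function names, `max_depth` parameter, and sample text are hypothetical.

```python
def build_word_suffix_trie(words, max_depth=3):
    """Insert the first max_depth words of every suffix into a nested-dict trie."""
    root = {}
    for start in range(len(words)):
        node = root
        for w in words[start:start + max_depth]:
            node = node.setdefault(w, {})
    return root

def children_of(trie, phrase):
    """Return the set of words that follow phrase, i.e. its children in the trie."""
    node = trie
    for w in phrase:
        node = node.get(w)
        if node is None:
            return set()
    return set(node)

# Hypothetical sample text for illustration.
words = "the old man and the sea".split()
forward = build_word_suffix_trie(words)
backward = build_word_suffix_trie(words[::-1])  # trie of the reversed text

right_of_the = children_of(forward, ["the"])
left_of_the = children_of(backward, ["the"])
```

Looking up a phrase in the forward trie gives the words to its right; looking up the reversed phrase in the backward trie gives the words to its left.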
Suffix Array • Sorted list of suffixes (· marks a space)
·sea·shells·on·the·sea·shore
·sells·sea·shells·on·the·sea·shore
e·sells·sea·shells·on·the·sea·shore
ells·sea·shells·on·the·sea·shore
he·sells·sea·shells·on·the·sea·shore
lls·sea·shells·on·the·sea·shore
ls·sea·shells·on·the·sea·shore
s·sea·shells·on·the·sea·shore
sea·shells·on·the·sea·shore
sells·sea·shells·on·the·sea·shore
she·sells·sea·shells·on·the·sea·shore
Suffix Array • Advantages • Simple: 10 lines of code • Space efficient: one array of pointers • Disadvantages • More expensive to create: O(n log n) • More expensive to operate on (linear scans instead of following an edge)
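The "few lines of code" claim can be illustrated with a deliberately naive character-level sketch. It sorts suffix start positions by comparing the suffixes themselves, which is simple but slower and more memory-hungry than the O(n log n) construction mentioned on the slide; the function names are assumptions.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive: materialises each suffix as a sort key, so it is only
    suitable for small inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def suffix_array_search(text, sa, pattern):
    """Binary-search sa for every suffix that starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "she sells sea shells on the sea shore"
sa = build_suffix_array(text)
sea_positions = suffix_array_search(text, sa, "sea")
```

All occurrences of a phrase sit next to each other in the array, so finding them is a binary search followed by a scan of one contiguous block.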
Multi-pass Algorithm • Disk seeks dominate • minimize disk seeks • fit within available memory • Disk reads are cheap, seeks are expensive • Make multiple passes over the data, using as little memory as possible
Three Phases • Phase 1: count all single words, two-word phrases, three-word phrases… • Phase 2: make expansion lists for each phrase • Phase 3: delete uninteresting phrases
Phase 1: Count Phrases • Make one pass over the data, counting individual words • Write out all words that appear more than once • Make a second pass over the data, counting pairs of words, where both words appear more than once • Write out all pairs that appear more than once • Make a third pass over the data, counting triples of words, where both overlapping pairs appear more than once • Write out all triples that appear more than once • …
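The passes above can be sketched as a loop that makes one sequential pass per phrase length and only counts a phrase when both of its shorter overlapping parts survived the previous pass. This is an in-memory illustration under assumptions: `read_words` is a hypothetical callable standing in for a fresh sequential read of the data on disk, and "writing out" is modelled as keeping a dictionary.

```python
from collections import Counter, deque

def count_phrases_multipass(read_words, max_len=3):
    """Return a list where entry k-1 maps each repeated k-word phrase to its count."""
    frequent = []
    for k in range(1, max_len + 1):
        counts = Counter()
        window = deque(maxlen=k)  # sliding window over the last k words
        for word in read_words():  # one sequential pass over the data
            window.append(word)
            if len(window) < k:
                continue
            phrase = tuple(window)
            # Count only if both overlapping (k-1)-word parts were frequent.
            if k == 1 or (phrase[:-1] in frequent[k - 2]
                          and phrase[1:] in frequent[k - 2]):
                counts[phrase] += 1
        kept = {p: c for p, c in counts.items() if c > 1}
        if not kept:
            break  # no repeated phrases of this length, so none longer either
        frequent.append(kept)
    return frequent

# Hypothetical sample text for illustration.
sample = "the old man and the sea and the old man".split()
frequent = count_phrases_multipass(lambda: iter(sample))
```

Because infrequent words and pairs are filtered out before the next pass, the counter for each pass stays much smaller than a table of all possible phrases.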
Phase 1: Output • Words: and (31), Gone (2), man (4), old (12), sea (8), the (57), Wind (3), with (17) • Pairs of words: and the (25), Gone with (2), man and (3), old man (2), The old (5), the sea (3), the Wind (2), with the (13) • Triples of words: and the sea (3), Gone with the (2), man and the (2), old man and (2), The old man (2), with the Wind (2)
Phase 2: Make Expansion Lists • Read all pairs of words that appear more than once (from phase 1) • Insert each pair in the list for each word • Read all frequent triples • Insert each triple in the list for each overlapping pair • …
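A minimal sketch of building expansion lists, assuming the Phase 1 result is available as a list whose entry k-1 maps each repeated k-word phrase (a tuple) to its count. That input shape and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def build_expansions(frequent):
    """Map each phrase to the list of phrases one word longer that contain it."""
    expansions = defaultdict(list)
    for level in frequent[1:]:  # pairs, then triples, and so on
        for phrase in sorted(level):
            # A k-word phrase expands its first k-1 words and its last k-1 words.
            # The set avoids a double entry when both parts are identical.
            for part in {phrase[:-1], phrase[1:]}:
                expansions[part].append(phrase)
    return dict(expansions)

# Hypothetical Phase 1 result for illustration.
frequent = [
    {("the",): 3, ("old",): 2, ("man",): 2},
    {("the", "old"): 2, ("old", "man"): 2},
    {("the", "old", "man"): 2},
]
expansions = build_expansions(frequent)
```

Each pair ends up in the list of both of its words, and each triple in the list of both of its overlapping pairs.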
Phase 2: Output • Words: and (31), Gone (2), man (4), old (12), sea (8), the (57), Wind (3), with (17) • Pairs of words: and the (25), Gone with (2), man and (3), old man (2), The old (5), the sea (3), the Wind (2), with the (13) • Triples of words: and the sea (3), Gone with the (2), man and the (2), old man and (2), The old man (2), with the Wind (2) • …
Phase 3 • Delete each phrase in the hierarchy if • it begins or ends in a stopword (“man and”) • it occurs in a particular longer phrase more than 75% of the time (“theoretical computer”) • Pointers to that phrase now point to that phrase’s expansions • Process is recursive
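The deletion rules can be sketched as a recursive walk over the expansion lists. This is one possible reading of the slide, not the authors' code: the stopword set, the counts, the example phrases other than "theoretical computer", and the function names are all assumptions.

```python
STOPWORDS = {"and", "the", "with", "of", "a"}  # illustrative stopword list

def is_uninteresting(phrase, counts, expansions, threshold=0.75):
    """Apply the two deletion rules to one phrase."""
    if phrase[0].lower() in STOPWORDS or phrase[-1].lower() in STOPWORDS:
        return True
    # Deleted if a single longer phrase accounts for more than 75% of its uses.
    return any(counts[longer] > threshold * counts[phrase]
               for longer in expansions.get(phrase, []))

def pruned_expansions(phrase, counts, expansions, threshold=0.75):
    """List the expansions of phrase, replacing deleted ones by their own expansions."""
    result = []
    for child in expansions.get(phrase, []):
        if is_uninteresting(child, counts, expansions, threshold):
            replacements = pruned_expansions(child, counts, expansions, threshold)
        else:
            replacements = [child]
        for r in replacements:
            if r not in result:
                result.append(r)
    return result

# Hypothetical counts and expansion lists for illustration.
counts = {
    ("computer",): 10,
    ("theoretical", "computer"): 4,
    ("computer", "science"): 8,
    ("theoretical", "computer", "science"): 4,
    ("computer", "and"): 3,
    ("computer", "and", "network"): 2,
}
expansions = {
    ("computer",): [("theoretical", "computer"), ("computer", "science"),
                    ("computer", "and")],
    ("theoretical", "computer"): [("theoretical", "computer", "science")],
    ("computer", "science"): [("theoretical", "computer", "science")],
    ("computer", "and"): [("computer", "and", "network")],
}
kept = pruned_expansions(("computer",), counts, expansions)
```

In this made-up data "theoretical computer" always continues to "theoretical computer science", so it is dropped and the pointer from "computer" goes to the longer phrase instead; "computer and" is dropped for ending in a stopword.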
Phase 3: Output • Words: and (31), Gone (2), man (4), old (12), sea (8), the (57), Wind (3), with (17) • Pairs of words: and the (25), Gone with (2), man and (3), old man (2), The old (5), the sea (3), the Wind (2), with the (13) • Triples of words: and the sea (3), Gone with the (2), man and the (2), old man and (2), The old man (2), with the Wind (2) • …