Effective Phrase Prediction VLDB 2007 : Text Databases Presented By Arnab Nandi, H. V. Jagadish University of Michigan 2008-03-07 Summarized By Jaeseok Myung
Motivation • Pervasiveness of Autocompletion • Typical autocompletion is still at word level • Phrase Prediction • Words provide much more information to exploit for prediction • Context, Phrase Structures • Most text is predictable and repetitive in many applications • Email Composition • Prob(“Thank you very much” | “Thank”) ~= 1 Center for E-Business Technology
Challenges • The number of phrases is large • n(vocabulary) >> n(alphabet) • n(phrases) = O(n(vocabulary)^(phrase length)) • => FussyTree structure • The length of a phrase is unknown • A "word" has a well-defined boundary; a phrase does not • => Significance • How to evaluate a suggestion mechanism? • => Total Profit Metric (TPM)
Problem Definition • R = query(p) • Need a data structure that can • Store completions efficiently • Support fast querying
An n-gram Data Model • R = query(p) : for each r ∈ R, prob(p, r) is maximized • mth-order Markov model • m: # of previous states used to predict the next state • An n-gram model is equivalent to an (n-1)th-order Markov model (figure: candidate completions w7,1, w7,2, w7,3 with frequencies 10, 20, 30 for a prefix of length p = 5; frequency determines rank)
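The n-gram model above can be sketched in a few lines: count, for each m-token prefix, which words follow it, and answer query(p) with the highest-frequency continuations. This is a minimal illustration of the data model, not the paper's implementation; the function names are ours.

```python
from collections import defaultdict, Counter

def build_ngram_model(tokens, m=2):
    # mth-order Markov model, i.e. an (m+1)-gram model:
    # count the continuations of every m-token prefix.
    model = defaultdict(Counter)
    for i in range(len(tokens) - m):
        prefix = tuple(tokens[i:i + m])
        model[prefix][tokens[i + m]] += 1
    return model

def query(model, prefix, k=3):
    # Return the top-k most frequent (hence most probable) next words.
    return [w for w, _ in model[tuple(prefix)].most_common(k)]

tokens = "please call me asap please call me later please call her".split()
model = build_ngram_model(tokens, m=2)
print(query(model, ["please", "call"]))  # -> ['me', 'her']
```

Ranking by raw frequency is equivalent to ranking by conditional probability here, since every candidate shares the same prefix count.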
Fundamental Data Structures • The basic data structures for "completion" problems • TRIE or Suffix Tree • Phrase version • Every node = a word (figures: example TRIE and Suffix Tree)
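The "phrase version" of a trie, where every node is a word rather than a character, can be sketched as follows. This is a generic word-level trie for illustration; class and method names are ours, not the paper's.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # next word -> TrieNode
        self.count = 0      # number of phrases ending exactly at this node

class PhraseTrie:
    """Word-level trie: each root-to-node path spells a phrase."""
    def __init__(self):
        self.root = TrieNode()

    def insert(self, words):
        node = self.root
        for w in words:
            node = node.children.setdefault(w, TrieNode())
        node.count += 1

    def completions(self, prefix):
        # Walk down the prefix, then enumerate all stored phrases below it.
        node = self.root
        for w in prefix:
            if w not in node.children:
                return []
            node = node.children[w]
        results = []
        def walk(n, suffix):
            if n.count:
                results.append(list(prefix) + suffix)
            for w, child in n.children.items():
                walk(child, suffix + [w])
        walk(node, [])
        return results

t = PhraseTrie()
t.insert("please call me".split())
t.insert("please call her".split())
print(t.completions(["please", "call"]))
```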
Pruned Count Suffix Tree (PCST) • Construct a frequency-based phrase tree • Prune all nodes with frequency < threshold τ • Problems • A PCST including infrequent phrases is constructed as an intermediate result => does not perform well for large data sets [16] Estimating alphanumeric selectivity in the presence of wildcards
FussyTree Construction • Filter out infrequent phrases even before adding them to the tree • Example settings: training sentence size N = 2; frequency threshold τ = 2; tokenizing window size = 4 (the size of the largest frequent phrase) • Counted phrase: (please, call, me, asap) • Ignored phrases: (call, me, asap, -end-)
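The pre-filtering idea can be sketched as a two-step pass: slide a fixed-size window over each sentence to count candidate phrases, then keep only those reaching the threshold τ for insertion into the tree. This is a simplification for illustration (the real FussyTree construction interleaves counting and insertion); the function name is ours.

```python
from collections import Counter

def frequent_phrases(sentences, window=4, tau=2):
    # Count every phrase of length <= window with a sliding window,
    # appending an -end- marker, then keep only phrases with count >= tau.
    counts = Counter()
    for sent in sentences:
        tokens = sent.split() + ["-end-"]
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + window, len(tokens)) + 1):
                counts[tuple(tokens[i:j])] += 1
    return {p: c for p, c in counts.items() if c >= tau}

fp = frequent_phrases(["please call me asap", "please call her"])
print(("please", "call") in fp)  # True: appears in both sentences
```

Only the surviving phrases would then be inserted into the tree, avoiding the large intermediate structure the PCST approach suffers from.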
Significance • A node in the FussyTree is "significant" if it marks a phrase boundary • Example: "please call" • "please call"(3) > "please"(0) * "call"(1) • "please call"(3) > ½ * "please"(0) • "please call"(3) > 3 * "please call me"(1) • … • z and y are tuning parameters; assume z = 2, y = 3
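The inequalities on the slide suggest a two-part test: the node's frequency must be comparable to its parent's (within a factor z) and must clearly exceed each extension's frequency (by a factor y). The following is a hedged sketch of that reading, not the paper's exact definition; the function name and the exact form of the conditions are our assumptions.

```python
def is_significant(count, parent_count, child_counts, z=2, y=3):
    # Comparability: the phrase is not much rarer than its parent prefix.
    comparable = count > parent_count / z
    # Uniqueness: the phrase dominates each of its longer extensions.
    unique = all(count > y * c for c in child_counts)
    return comparable and unique

# A phrase whose parent and extensions are rare marks a boundary...
print(is_significant(3, 0, [0]))   # -> True
# ...but a phrase far rarer than its parent prefix does not.
print(is_significant(1, 10, [0]))  # -> False
```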
Significance – cont. • All leaves are significant • due to the END node (frequency = 0) • Some internal nodes are significant too • Intuitively, suggestions ending on significant nodes will be better • No need to store counts (figure: tree with END nodes)
Online Significance Marking • (Offline) significance requires an additional pass over the data • Compared against the tree generated by FussyTree with offline significance (figure: adding "ABCXY" to an existing path A-B-C-D-E creates a branch at C leading to X-Y) • The branch point is considered for promotion • The immediate descendant significant nodes are considered for demotion
Evaluation Metrics • Precision & Recall • Refer to the quality of the suggestions themselves • For ranked results:
Total Profit Metric (TPM) • TPM measures the effectiveness of a suggestion mechanism • Counts the number of keystrokes saved by suggestions • d is the distraction parameter • TPM(0) corresponds to a user who does not mind distraction at all • TPM(1) is the extreme case where every suggestion (right or wrong) is considered a blocking factor that costs one keystroke • In practice, the distraction value is closer to 0 than to 1
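Under the description above, TPM can be sketched as keystrokes saved minus a distraction cost of d per suggestion shown, normalized by the keystrokes the text would take unaided. The exact normalization is our assumption (the slides do not give the formula), so treat this as an illustration of the metric's behavior, not the paper's definition.

```python
def tpm(saved_keystrokes, num_suggestions, total_keystrokes, d=0.0):
    # Assumed form: profit = savings - d * (suggestions shown),
    # normalized by the total keystrokes of typing everything by hand.
    # d = 0 ignores distraction; d = 1 charges one keystroke per suggestion.
    return (saved_keystrokes - d * num_suggestions) / total_keystrokes

print(tpm(10, 4, 100, d=0.0))  # distraction-free profit
print(tpm(10, 4, 100, d=1.0))  # every suggestion costs one keystroke
```

Note how a mechanism that fires often is penalized under TPM(1) even when its suggestions are sometimes right, which matches the slide's "blocking factor" reading.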
Total Profit Metric – An Example
Experiments • Multiple Corpora • Enron Small : one user's "sent" folder (366 emails, 250KB) • Enron Large : multiple users (20,842 emails, 16MB) • Wikipedia (40,000 documents, 53MB) • Data Structures • (1) PCST, (2) FussyTree with Count, (3) FussyTree with Significance • Parameters • Significance : z(comparability) = 2, y(uniqueness) = 2 • Training Sentence Size N = 8 • Prefix Size P = 2
Prediction Quality
Tuning Parameters (1)
Tuning Parameters (2)
Conclusion • Phrase-level autocompletion is challenging, but can provide much greater savings than word-level autocompletion • A technique to accomplish this based on "significance" • New evaluation metrics for ranked autocompletion • Possible Extensions • Part-of-speech reranking • Semantic reranking • Using WordNet • Query completion for structured data • XML, …