An Effective Approach for Searching Closest Sentence Translations from the Web
Ju Fan, Guoliang Li, and Lizhu Zhou
Database Research Group, Tsinghua University
DASFAA 2011 – Apr. 23, Hong Kong

Outline
• Introduction
• Overview of Our Approach
• Phrase-Based Similarity Model
• Phrase Selection
• Experiments
• Conclusion
SCST@DASFAA 2011

Background
• Parallel sentences on the Web
• Sentences with a well-translated counterpart
• An English-to-Chinese example:
  Obama said he hopes to get Congress to approve it next year
  奥巴马总统说他争取让希望国会明年批准该协议。
  -- blog.hjenglish.com
• A rich source for translation
• Commercial systems

Background: Sentence-Level Translation Aid
[Figure: Web → Parallel Sentence Discovery and Extraction → Parallel Sentence Database (Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), …, Sen n (E-C); e.g., "The result is good" / 结果很好) → Sentence Matching against a Query Sentence (English) → Closest Sentences with Translation]
• Research issue: an effective similarity model between sentences in the source language (e.g., English sentences)

Motivation
• Existing approaches:
  • Word-based, e.g., translation model, edit distance, …
  • Gram-based, e.g., N-gram, V-gram
  • All subsequences of a sentence
• Limitations:
  • Cannot capture the order of words
  • Do not consider syntactic information
  • Too expensive
• We propose a phrase-based similarity model that uses:
  • Syntactic information
  • Frequency information
  • Lengths of phrases

Problem Definition
• Query: a query sentence (English) from a translator
• Data: a database of parallel sentences
• Answer: sentences with their translations
  Sentence 1: English - Chinese
  Sentence 2: English - Chinese
  Sentence 3: English - Chinese
  …

Phrase-Based Sentence Matching
[Figure: framework. Online: the query q and a sentence s are represented by phrases f1, f2, …, fn and compared by the similarity model. Offline: phrase selection extracts phrases f'1, f'2, …, f'n from the parallel sentences into a phrase database.]

Phrase-Based Similarity Model
[Figure: framework diagram, as on the "Phrase-Based Sentence Matching" slide: online similarity model over the phrases f1, f2, …, fn of q and s; offline phrase selection from parallel sentences into the phrase database]

Similarity Model
$$\mathrm{sim}(q, s) = \sum_{f \in F_q \cap F_s} \phi(q, f) \cdot \phi(s, f) \cdot w(f)$$
• q: the query sentence, with phrase set F_q = {f_1, f_2, f_3, …, f_m}
• s: a sentence in the DB, with phrase set F_s = {f'_1, f'_2, f'_3, …, f'_n}
• Shared phrases: f ∈ F_q ∩ F_s
• φ(q, f): syntactic importance of f to q
• φ(s, f): syntactic importance of f to s
• w(f): weight of f (IDF)

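A minimal sketch of the score on this slide, reading it as a sum over shared phrases of φ(q, f) · φ(s, f) · w(f). The function names `idf` and `similarity`, the dictionary representation of the phrase sets, and the specific IDF formula log(N / df) are assumptions made for illustration; the slides only say that w(f) is an IDF weight.

```python
import math
from typing import Dict, Tuple

Phrase = Tuple[str, ...]


def idf(doc_freq: int, n_sentences: int) -> float:
    """Assumed IDF form: log(N / df). The slides do not give the exact formula."""
    return math.log(n_sentences / doc_freq)


def similarity(phi_q: Dict[Phrase, float],
               phi_s: Dict[Phrase, float],
               w: Dict[Phrase, float]) -> float:
    """Sum phi(q, f) * phi(s, f) * w(f) over the phrases shared by q and s.

    phi_q / phi_s map each selected phrase of q / s to its syntactic importance.
    """
    shared = phi_q.keys() & phi_s.keys()
    return sum(phi_q[f] * phi_s[f] * w[f] for f in shared)


if __name__ == "__main__":
    # Hypothetical importance values and weights, chosen only for the demo.
    phi_q = {("apple",): 1.0, ("he", "eat", "apple"): 0.5}
    phi_s = {("apple",): 0.8, ("red", "apple"): 0.9}
    w = {("apple",): 2.0, ("he", "eat", "apple"): 3.0, ("red", "apple"): 3.0}
    print(similarity(phi_q, phi_s, w))  # only ("apple",) is shared
```

Only phrases present in both sets contribute, so a sentence in the DB that shares no selected phrase with the query scores zero.
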
Syntactic Importance of Phrases
$$\phi(q, f) = \prod_{m} \alpha_m \cdot \prod_{g} \beta_g$$
• α_m: syntactic weight of a matched term
• β_g: penalty (constant) for a term in a gap
• d: a decay factor
• Example: sentence q = "He has eaten an apple"; phrase f = "he eaten apple"; gap terms "has" and "an"
• Dependency tree weights: eaten: α0; he, has, apple: d·α0; an: d²·α0

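The example on this slide can be worked through in a few lines. This is a sketch under stated assumptions: a term at depth k of the dependency tree gets weight α0·d^k, every gap term contributes the same constant β, terms in a sentence are unique, and the values α0 = 1.0, d = 0.5, β = 0.9 are placeholders. The helper names `match_positions` and `syntactic_importance` are hypothetical.

```python
from math import prod
from typing import Dict, List, Optional, Sequence


def match_positions(sentence: Sequence[str], phrase: Sequence[str]) -> Optional[List[int]]:
    """Greedily match the phrase as an ordered (possibly discontinuous) subsequence."""
    positions: List[int] = []
    start = 0
    for term in phrase:
        try:
            idx = list(sentence).index(term, start)
        except ValueError:
            return None  # the phrase does not occur in the sentence
        positions.append(idx)
        start = idx + 1
    return positions


def syntactic_importance(sentence: Sequence[str], phrase: Sequence[str],
                         depth: Dict[str, int],
                         alpha0: float = 1.0, d: float = 0.5, beta: float = 0.9) -> float:
    """phi = product of matched-term weights times one penalty per gap term."""
    positions = match_positions(sentence, phrase)
    if not positions:
        return 0.0
    matched = prod(alpha0 * d ** depth[sentence[i]] for i in positions)
    # Gap terms are the unmatched terms between the first and last matched term.
    n_gap_terms = (positions[-1] - positions[0] + 1) - len(positions)
    return matched * beta ** n_gap_terms


if __name__ == "__main__":
    q = ["he", "has", "eaten", "an", "apple"]
    # Depths taken from the slide's tree: eaten is the root; "an" is two levels down.
    depth = {"eaten": 0, "he": 1, "has": 1, "apple": 1, "an": 2}
    print(syntactic_importance(q, ["he", "eaten", "apple"], depth))
```

With these placeholder values the matched terms contribute (d·α0)·α0·(d·α0) = 0.25 and the two gap terms "has" and "an" contribute β² = 0.81.
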
Features of the Similarity Model
• More general
  • Subsumes Jaccard, cosine similarity, …
• Syntactic information
  • Weight of matched terms
  • Weight of terms in the gap
• Frequency information
  • Weight of phrases

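One way to see the cosine case, offered as an illustration rather than the authors' own derivation: take the score as a sum over shared phrases of φ(q, f) · φ(s, f) · w(f), and assume uniform weights w(f) = 1 and φ(x, f) = 1/√|F_x| for every phrase f of x. The score then reduces to the cosine similarity between the binary phrase-indicator vectors of q and s.

```latex
% Assumption: w(f) = 1 and \phi(x, f) = 1 / \sqrt{|F_x|} for every phrase f of x.
\[
\mathrm{sim}(q, s)
  = \sum_{f \in F_q \cap F_s} \frac{1}{\sqrt{|F_q|}} \cdot \frac{1}{\sqrt{|F_s|}} \cdot 1
  = \frac{|F_q \cap F_s|}{\sqrt{|F_q| \, |F_s|}}
\]
```
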
High-Quality Phrase Selection
[Figure: framework diagram, as on the "Phrase-Based Sentence Matching" slide: online similarity model over the phrases f1, f2, …, fn of q and s; offline phrase selection from parallel sentences into the phrase database]

High-Quality Phrase
• Extend grams by allowing discontinuous terms
• A heuristic for selecting phrases
  • Gap constraint: discontinuous terms must have a syntactic relationship
  • Frequency constraint: infrequent (large IDF); frequency = # of sentences in the DB having the phrase
  • Maximum constraint: 1) not a prefix; 2) max. length
• Example: sentence q = "He has eaten an apple"; phrase "he eaten apple" (syntactically related terms)

Phrase Selection
• Selecting phrases with the gap and maximum constraints
• Sentence graph
  • Sequential relationship
  • Syntactic relationship
• Longest path from a node = a phrase satisfying
  • Gap constraint
  • Maximum constraint
• Example: sentence s = "He ate a red apple"; graph nodes: he, eat, red, apple

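The sentence-graph idea can be sketched as a longest-path search on a small DAG. Everything concrete here is an assumption for illustration: the function name `select_phrases`, the choice to orient every edge left to right, the cap `max_len = 3`, the rule of keeping all tied longest paths, and the dependency edges used in the demo (he-eat, eat-apple, red-apple) are not taken from the slides.

```python
from functools import lru_cache
from typing import Iterable, List, Tuple


def select_phrases(tokens: List[str],
                   syntactic_edges: Iterable[Tuple[int, int]],
                   max_len: int = 3) -> List[Tuple[str, ...]]:
    """Return, for each start node, all longest paths of at most max_len terms."""
    n = len(tokens)
    succ = {i: set() for i in range(n)}
    for i in range(n - 1):              # sequential relationship: adjacent terms
        succ[i].add(i + 1)
    for a, b in syntactic_edges:        # syntactic relationship: dependency pairs
        lo, hi = min(a, b), max(a, b)
        if lo != hi:
            succ[lo].add(hi)            # orient edges in sentence order (a DAG)

    @lru_cache(maxsize=None)
    def longest(i: int, budget: int) -> Tuple[Tuple[int, ...], ...]:
        best = [(i,)]
        if budget > 1:
            for j in sorted(succ[i]):
                for tail in longest(j, budget - 1):
                    cand = (i,) + tail
                    if len(cand) > len(best[0]):
                        best = [cand]
                    elif len(cand) == len(best[0]):
                        best.append(cand)
        return tuple(best)

    phrases = []
    for i in range(n):
        for path in longest(i, max_len):
            phrases.append(tuple(tokens[k] for k in path))
    return phrases


if __name__ == "__main__":
    tokens = ["he", "eat", "red", "apple"]   # nodes as shown on the slide
    deps = [(0, 1), (1, 3), (2, 3)]          # hypothetical dependency edges
    for phrase in select_phrases(tokens, deps):
        print(" ".join(phrase))
```

Because a path may only skip terms along a syntactic edge, a discontinuous phrase such as "he eat apple" is produced while arbitrary subsequences are not.
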
Phrase Selection
• Select phrases with the frequency constraint (threshold = 2)
• Use a frequency trie
• Prune frequent phrases
[Figure: frequency trie with nodes labeled Ni(count), e.g., N0(8), N1(4), N2(3), over the terms he, have, eat, red, apple, pencil, with "#" markers]

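A rough sketch of counting phrase frequencies in a trie and pruning frequent phrases. The slides do not spell out the trie layout or the exact pruning rule, so the class names, the per-node end-of-phrase counter, and the rule "keep a phrase if its frequency is at most the threshold" are all assumptions.

```python
from typing import Dict, List, Tuple

Phrase = Tuple[str, ...]


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.count = 0  # number of sentences whose phrase ends at this node


class FrequencyTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, phrase: Phrase) -> None:
        """Call once per sentence that contains the phrase."""
        node = self.root
        for term in phrase:
            node = node.children.setdefault(term, TrieNode())
        node.count += 1

    def infrequent(self, threshold: int) -> List[Phrase]:
        """Assumed rule: keep phrases seen in at most `threshold` sentences."""
        kept: List[Phrase] = []
        stack = [(self.root, ())]
        while stack:
            node, prefix = stack.pop()
            if 0 < node.count <= threshold:
                kept.append(prefix)
            for term, child in node.children.items():
                stack.append((child, prefix + (term,)))
        return sorted(kept)


if __name__ == "__main__":
    trie = FrequencyTrie()
    # Hypothetical phrases, one entry per sentence containing the phrase.
    for phrase in [("he", "eat", "apple")] * 3 + [("he", "have", "pencil"), ("red", "apple")]:
        trie.insert(phrase)
    print(trie.infrequent(threshold=2))
```

In the demo, "he eat apple" occurs in three sentences and is pruned, while the two phrases that occur once are kept.
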
Experiment Setup
• Data sets
  • DI: 520,899 parallel sentences from ICIBA
  • DC: 800,000 parallel sentences from CNKI
• Baseline methods
  • Jaccard coefficient, edit distance, cosine similarity
  • Translation model methods (TM)
  • Cosine similarity with VGRAM

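For reference, the three simplest baselines can be sketched at the word level. Treating each sentence as a list of lowercase words and using raw term counts for cosine are assumptions; the slides do not say how the baselines were tokenized or weighted. The two sentences are the ones used as examples elsewhere in the deck.

```python
from collections import Counter
from math import sqrt
from typing import List


def jaccard(a: List[str], b: List[str]) -> float:
    """Size of the word-set intersection over the size of the union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0


def cosine(a: List[str], b: List[str]) -> float:
    """Cosine similarity between raw term-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def edit_distance(a: List[str], b: List[str]) -> int:
    """Word-level Levenshtein distance (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    q = "he has eaten an apple".split()
    s = "he has a pencil".split()
    print(jaccard(q, s), cosine(q, s), edit_distance(q, s))
```

The two sentences share only "he" and "has", so Jaccard is 2/7, cosine is 2/√20, and three word edits turn one sentence into the other.
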
Experiment Setup
• Evaluation metrics
  • BLEU
    • A well-known metric for machine translation
    • Example: for the query q = "He has eaten an apple" with reference translation 他吃了一个苹果 and a returned sentence s = "He has a pencil" with translation 他有一支铅笔, BLEU compares the translation of s against the reference translation of q
  • Precision
    • A user study to label whether the translations are useful

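To make the BLEU example concrete, here is a simplified sentence-level BLEU with clipped n-gram precision and a brevity penalty. It is a sketch, not the evaluation script used in the experiments: tokenizing Chinese by character, using a single reference, applying no smoothing, and the function names are all assumptions.

```python
from collections import Counter
from math import exp, log
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Unsmoothed sentence BLEU over characters, with uniform n-gram weights."""
    cand, ref = list(candidate), list(reference)
    if not cand:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c, r = ngrams(cand, n), ngrams(ref, n)
        total = sum(c.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(cnt, r[g]) for g, cnt in c.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_sum += log(clipped / total) / max_n
    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else exp(1 - len(ref) / len(cand))
    return bp * exp(log_sum)


if __name__ == "__main__":
    reference = "他吃了一个苹果"   # reference translation of q
    candidate = "他有一支铅笔"     # translation attached to the returned sentence s
    print(bleu(candidate, reference, max_n=1))
    print(bleu(candidate, reference, max_n=2))
```

With character unigrams only, two of the six candidate characters (他 and 一) match the reference; no character bigram matches, so the unsmoothed score with `max_n=2` or higher is zero.
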
Effects of Phrase Selection
[Figures: effect of the frequency threshold on DC; effect of the max. length on DI]

Comparison with Similarity Models
[Figure: comparison on the DI data set]

Comparison with Existing Methods
[Figure: comparison on the DC data set]

User Studies
• Methods used in commercial systems
[Figure: comparison on the DI data set]

Conclusion
• Searching closest sentence translations from the Web
• A phrase-based sentence similarity model
• High-quality phrase selection methods
• Extensive experiments and user studies

Thanks
My homepage: http://dbgroup.cs.tsinghua.edu/fanju

Frequency Constraint
• Index structures
  • Phrase → Sentence
  • Frequent phrases → large inverted index

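The index point on this slide can be illustrated with a small phrase-to-sentence inverted index. The function names and the in-memory dictionary layout are assumptions for illustration; the slides only indicate that phrases map to sentences and that frequent phrases make the index large.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Phrase = Tuple[str, ...]


def build_inverted_index(sentence_phrases: Dict[int, Iterable[Phrase]]) -> Dict[Phrase, Set[int]]:
    """Map each phrase to the ids of the sentences that contain it."""
    index: Dict[Phrase, Set[int]] = defaultdict(set)
    for sentence_id, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].add(sentence_id)
    return index


def candidates(index: Dict[Phrase, Set[int]], query_phrases: Iterable[Phrase]) -> Set[int]:
    """Sentences sharing at least one phrase with the query."""
    found: Set[int] = set()
    for phrase in query_phrases:
        found |= index.get(phrase, set())
    return found


if __name__ == "__main__":
    # Hypothetical selected phrases per sentence id.
    db = {
        1: [("he", "eat", "apple"), ("apple",)],
        2: [("red", "apple"), ("apple",)],
        3: [("he", "have", "pencil")],
    }
    index = build_inverted_index(db)
    print(sorted(candidates(index, [("apple",)])))
    # Posting-list length grows with phrase frequency.
    print({phrase: len(ids) for phrase, ids in index.items()})
```

A phrase that appears in many sentences gets a long posting list, which is one reason to drop frequent phrases before indexing.
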