An Effective Approach for Searching Closest Sentence Translations from The Web

Ju Fan, Guoliang Li, and Lizhu Zhou. Database Research Group, Tsinghua University. DASFAA 2011 – Apr. 23, Hong Kong.

Presentation Transcript


  1. An Effective Approach for Searching Closest Sentence Translations from The Web • Ju Fan, Guoliang Li, and Lizhu Zhou • Database Research Group, Tsinghua University • DASFAA 2011 – Apr. 23, Hong Kong

  2. Outline • Introduction • Overview of Our Approach • Phrase-Based Similarity Model • Phrase Selection • Experiments • Conclusion SCST@DASFAA 2011

  4. Background • Parallel sentences on the Web • Sentences with well-translated counterparts • An English-to-Chinese example: "Obama said he hopes to get Congress to approve it next year" / "奥巴马总统说他争取让希望国会明年批准该协议。" (blog.hjenglish.com) • A rich source for translation • Commercial systems

  5. Background • Sentence-Level Translation Aid • Pipeline: Web → Parallel Sentence Discovery and Extraction → Parallel Sentence Database (Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), …, Sen n (E-C); e.g., "The result is good" / "结果很好") → Sentence Matching with the Query Sentence (English) → Closest Sentences with Translation • Research issue: an effective similarity model between sentences in the source language (e.g., English sentences)

  6. Motivation • Existing approaches: • Word-based, e.g., translation model, edit distance, … • Gram-based, e.g., N-gram, V-gram • All subsequences of a sentence • Limitations: cannot capture the order of words; do not consider syntactic information; too expensive • We propose a phrase-based similarity model that uses: • Syntactic information • Frequency information • Lengths of phrases

  7. Outline • Introduction • Overview of Our Approach • Phrase-Based Similarity Model • Phrase Selection • Experiments • Conclusion

  8. Problem Definition • Query: a query sentence (English), issued by a translator • Data: a database of parallel sentences • Answer: sentences with their translations (Sentence 1: English - Chinese; Sentence 2: English - Chinese; Sentence 3: English - Chinese; …)

  9. Phrase-Based Sentence Matching • Online: the similarity model compares the query q with a sentence s through their phrases f1, f2, …, fn • Offline: phrase selection extracts phrases f'1, f'2, …, f'n from the parallel sentences into a phrase database

  10. Outline • Introduction • Overview of Our Approach • Phrase-Based Similarity Model • Phrase Selection • Experiments • Conclusion

  11. Phrase-Based Similarity Model • Architecture diagram as on slide 9 • Online: the similarity model compares the query q with a sentence s through their phrases f1, f2, …, fn • Offline: phrase selection extracts phrases f'1, f'2, …, f'n from the parallel sentences into a phrase database

  12. Similarity Model • $\mathrm{sim}(q,s) = \sum_{f \in F_q \cap F_s} w(f)\,\varphi(q,f)\,\varphi(s,f)$ • $q$: query sentence, with phrase set $F_q = \{f_1, f_2, f_3, \ldots, f_m\}$ • $s$: a sentence in the DB, with phrase set $F_s = \{f'_1, f'_2, f'_3, \ldots, f'_n\}$ • Shared phrases: $f \in F_q \cap F_s$ • $w(f)$: weight of $f$ (IDF) • $\varphi(q,f)$: syntactic importance of $f$ to $q$ • $\varphi(s,f)$: syntactic importance of $f$ to $s$
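
The sum on this slide can be read as a short function. The following is a minimal sketch, assuming each sentence's phrase set is already available as a mapping from phrase to its syntactic importance; the function name `phrase_similarity`, the phrases, and all numeric values are hypothetical and not taken from the slides.

```python
def phrase_similarity(phi_q, phi_s, weight):
    """Sum w(f) * phi(q, f) * phi(s, f) over the phrases shared by q and s.

    phi_q, phi_s: dict mapping a phrase (tuple of terms) to its syntactic importance.
    weight: dict mapping a phrase to its weight w(f), e.g. an IDF value.
    """
    shared = phi_q.keys() & phi_s.keys()  # F_q ∩ F_s
    return sum(weight[f] * phi_q[f] * phi_s[f] for f in shared)


# Hypothetical phrase sets with made-up importance scores.
phi_q = {("he", "eaten", "apple"): 0.5, ("an", "apple"): 0.25}
phi_s = {("an", "apple"): 0.5, ("red", "apple"): 1.0}
# Hypothetical IDF-style weights.
weight = {("he", "eaten", "apple"): 3.0, ("an", "apple"): 2.0, ("red", "apple"): 1.5}

# Only ("an", "apple") is shared, so the score is 2.0 * 0.25 * 0.5 = 0.25.
score = phrase_similarity(phi_q, phi_s, weight)
print(score)
```

Phrases that occur in only one of the two sentences contribute nothing, so the score grows only with shared, highly weighted, syntactically important phrases.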

  13. Syntactic Importance of Phrases • $\varphi(q,f) = \prod_{m} \alpha_m \prod_{g} \beta_g$ • $\alpha_m$: syntactic weight of a matched term • $\beta_g$: penalty (constant) for a term in the gap • $d$: a decay factor • Example: sentence q "He has eaten an apple", phrase f "he eaten apple" • [Figure: dependency tree of q with weights: eaten $\alpha_0$; he, has, apple $d \cdot \alpha_0$; an $d^2 \cdot \alpha_0$]
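
The product on this slide can be worked through on the slide's own example. This is a minimal sketch under stated assumptions: a term's weight is taken to be `alpha0 * d ** depth` in the dependency tree, a gap term is taken to be any unmatched term lying between the first and last matched terms, and the values `alpha0=1.0`, `d=0.5`, `beta=0.5` are made up for illustration.

```python
from math import prod


def syntactic_importance(tokens, depth, phrase, alpha0=1.0, d=0.5, beta=0.5):
    """phi(q, f): product of matched-term weights times one penalty per gap term.

    tokens: the sentence as a list of terms.
    depth: dict mapping each term to its depth in the dependency tree (root = 0).
    phrase: the (possibly discontinuous) phrase, as a list of terms in sentence order.
    """
    # Locate each phrase term in the sentence, scanning left to right.
    positions, start = [], 0
    for term in phrase:
        idx = tokens.index(term, start)
        positions.append(idx)
        start = idx + 1
    # Assumed weighting: alpha0 decayed by d once per level below the root.
    matched = prod(alpha0 * d ** depth[tokens[i]] for i in positions)
    # Assumed gap definition: unmatched terms inside the span of the phrase.
    gaps = (positions[-1] - positions[0] + 1) - len(positions)
    return matched * beta ** gaps


tokens = ["he", "has", "eaten", "an", "apple"]
depth = {"eaten": 0, "he": 1, "has": 1, "apple": 1, "an": 2}

# Matched terms: he (0.5), eaten (1.0), apple (0.5); gap terms: has, an.
phi = syntactic_importance(tokens, depth, ["he", "eaten", "apple"])
print(phi)  # 0.5 * 1.0 * 0.5 * 0.5**2 = 0.0625
```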

  14. Features of the Similarity Model • More General • Subsumes Jaccard, Cosine similarity,… • Syntactic Information • Weight of matched terms • Weight of terms in the gap • Frequency Information • Weight of phrases
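
One way to see the cosine case, as a sketch rather than the authors' derivation: if every phrase weight is 1 and each sentence gives all of its phrases the same importance $1/\sqrt{|F|}$, the sum over shared phrases equals the cosine similarity of the binary phrase vectors. The helper names `model_score` and `uniform_phi` and the phrase labels are hypothetical.

```python
import math


def model_score(phi_q, phi_s, weight=lambda f: 1.0):
    """Sum of w(f) * phi(q, f) * phi(s, f) over shared phrases."""
    return sum(weight(f) * phi_q[f] * phi_s[f] for f in phi_q.keys() & phi_s.keys())


def uniform_phi(phrases):
    """Assumed setting: every phrase of a sentence gets importance 1 / sqrt(|F|)."""
    norm = 1.0 / math.sqrt(len(phrases))
    return {f: norm for f in phrases}


F_q = {"a", "b", "c", "d"}
F_s = {"a", "b", "e", "f"}

score = model_score(uniform_phi(F_q), uniform_phi(F_s))
# Cosine similarity of the two binary phrase vectors.
cosine = len(F_q & F_s) / math.sqrt(len(F_q) * len(F_s))
print(score, cosine)  # both are 0.5 for this example
```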

  15. Outline • Introduction • Overview of Our Approach • Phrase-Based Similarity Model • Phrase Selection • Experiments • Conclusion

  16. High-Quality Phrase Selection • Architecture diagram as on slide 9 • Online: the similarity model compares the query q with a sentence s through their phrases f1, f2, …, fn • Offline: phrase selection extracts phrases f'1, f'2, …, f'n from the parallel sentences into a phrase database

  17. High-Quality Phrase • Extend grams by allowing discontinuous terms • A heuristic for selecting phrases • Gap constraint: syntactic relationship of discontinuous terms • Frequency constraint: infrequent (large IDF) • Maximum constraint: 1) not a prefix; 2) max. length • Example: sentence q "He has eaten an apple", phrase "he eaten apple" (discontinuous terms are syntactically related) • Frequency: # of sentences in the DB having the phrase

  18. Phrase Selection • Selecting phrases with gap and maximum constraints • Sentence → Graph • Sequential relationship • Syntactic relationship • Example: sentence s "He ate a red apple", graph nodes: he, eat, red, apple • Longest path from a node = a phrase satisfying • Gap constraint • Maximum constraint

  19. Phrase Selection • Select phrases with frequency constraint (Threshold = 2) • Use a frequency trie • Prune frequent phrases • [Figure: frequency trie with a count on each node, e.g., N0(8), N1(4), N2(3), N9(1); edge labels include he, have, eat, red, apple, pencil, and the marker #]
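
A frequency trie of the kind named on this slide can be sketched as follows. This is an illustration only: the slide does not spell out how counts are maintained or whether the threshold is inclusive, so here a node's count is the number of inserted phrases that pass through it, and a phrase is kept when its count is at most the threshold. The class names and the sample phrases are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # term -> TrieNode
        self.count = 0      # number of inserted phrases passing through this node


class FrequencyTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, phrase):
        """Add one occurrence of a phrase, updating counts along its path."""
        node = self.root
        node.count += 1
        for term in phrase:
            node = node.children.setdefault(term, TrieNode())
            node.count += 1

    def frequency(self, phrase):
        """Number of inserted phrases that start with the given phrase."""
        node = self.root
        for term in phrase:
            node = node.children.get(term)
            if node is None:
                return 0
        return node.count


# Hypothetical candidate phrases collected from a small sentence database.
candidates = [
    ("he", "eat"),
    ("he", "eat", "apple"),
    ("he", "eat", "pencil"),
    ("he", "have"),
    ("red", "apple"),
]
trie = FrequencyTrie()
for phrase in candidates:
    trie.insert(phrase)

threshold = 2  # assumed inclusive: keep phrases with frequency <= threshold
kept = [p for p in candidates if trie.frequency(p) <= threshold]
print(kept)
```

With these inputs, `("he", "eat")` has frequency 3 and is pruned as too frequent, while the longer and rarer phrases are kept.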

  20. Outline • Introduction • Overview of Our Approach • Phrase-Based Similarity Model • Phrase Selection • Experiments • Conclusion

  21. Experiment Setup • Data Sets • DI: 520,899 parallel sentences from ICIBA • DC: 800,000 parallel sentences from CNKI • Baseline Methods • Jaccard Coefficient, Edit Distance, Cosine Similarity • Translation Model Methods (TM) • Cosine Similarity with VGRAM
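
Two of the baselines listed on this slide are standard and easy to state in code. The sketch below assumes word-level tokenization by whitespace, which the slides do not specify, and applies both measures to the two English sentences used as an example elsewhere in the deck.

```python
def jaccard(a, b):
    """Jaccard coefficient of two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def edit_distance(a, b):
    """Word-level Levenshtein distance (insert, delete, substitute cost 1 each)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


q = "he has eaten an apple".split()
s = "he has a pencil".split()

print(jaccard(q, s))        # 2 shared words out of 7 distinct words
print(edit_distance(q, s))  # 3: two substitutions and one deletion
```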

  22. Experiment Setup • Evaluation Metrics • BLEU • A well-known metric for machine translation • Example: for query q "He has eaten an apple" (reference translation: 他吃了一个苹果) and retrieved sentence s "He has a pencil" (translation: 他有一支铅笔), BLEU compares the translation of s with the reference translation • Precision • A user study to label whether the translations are useful
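
To make the BLEU example concrete, here is a simplified, single-reference BLEU: the geometric mean of clipped n-gram precisions times the standard brevity penalty. It is a sketch, not the evaluation script used in the experiments; character-level tokenization of the Chinese text and the choice of `max_n` are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU against one reference, without smoothing."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_sum += math.log(clipped / total)
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_sum / max_n)


reference = list("他吃了一个苹果")  # reference translation of q
candidate = list("他有一支铅笔")    # translation attached to s

bleu1 = simple_bleu(candidate, reference, max_n=1)
bleu2 = simple_bleu(candidate, reference, max_n=2)
print(bleu1, bleu2)
```

In this example only the characters 他 and 一 are shared and no bigram matches, so the unigram score is small and the bigram score is zero.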

  23. Effects of Phrase Selection • [Figure] Effect of freq. threshold on DC • [Figure] Effect of max. length on DI

  24. Comparison with Similarity Models • [Figure] Comparison on the DI data set

  25. Comparison with Existing Methods • [Figure] Comparison on the DC data set

  26. User Studies • Methods used in commercial systems • [Figure] Comparison on the DI data set

  27. Outline • Introduction • Overview of Our Approach • Phrase-Based Similarity Model • Phrase Selection • Experiments • Conclusion

  28. Conclusion • Searching closest sentence translations from the Web • A phrase-based sentence similarity model • High-quality phrase selection methods • Extensive experiments and user studies

  29. Thanks • My Homepage: http://dbgroup.cs.tsinghua.edu/fanju

  30. Frequency Constraint • Index structures • Phrase → Sentence • Frequent phrases → large inverted index
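
The phrase-to-sentence index mentioned here can be pictured as an inverted index from each phrase to the ids of the sentences containing it. The sketch below is an illustration with a made-up three-sentence database; the function names and sentence ids are hypothetical.

```python
from collections import defaultdict


def build_index(sentence_phrases):
    """Map each phrase to the set of ids of sentences that contain it."""
    index = defaultdict(set)
    for sentence_id, phrases in sentence_phrases.items():
        for phrase in phrases:
            index[phrase].add(sentence_id)
    return index


def candidate_sentences(index, query_phrases):
    """Ids of sentences sharing at least one phrase with the query."""
    result = set()
    for phrase in query_phrases:
        result |= index.get(phrase, set())
    return result


# Hypothetical database: sentence id -> selected phrases.
sentence_phrases = {
    1: [("he", "eat"), ("red", "apple")],
    2: [("he", "eat"), ("he", "have", "pencil")],
    3: [("he", "eat")],
}
index = build_index(sentence_phrases)

# A frequent phrase has a long posting list; a rare one has a short list.
print(len(index[("he", "eat")]), len(index[("red", "apple")]))  # 3 1
print(candidate_sentences(index, [("red", "apple")]))            # {1}
```

A phrase that appears in most sentences produces a posting list nearly as large as the database, which is one reading of why the slide ties frequent phrases to a large inverted index.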