
Bilingual Alignment Models: Cognates and Phrases


Presentation Transcript


  1. Bilingual Alignment Models: Cognates and Phrases Andrea Burbank, Dinkar Gupta, Spring 2006

  2. French/English word alignments • Cognates • French and English share many words with common roots • Can identifying cognate pairs improve alignments? • What about a distribution based on word lengths? • Phrases capture conceptual mappings • Overcome language-specific syntax and constructs • Examples: “pommes frites” → “French fries”, “à demain” → “see you tomorrow”, “ne veux jamais” → “never wants” • Aligned phrases need not be long: 3 or 4 words suffice • Concepts are often arranged the same way in French and English
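Phrase pairs like those above are usually harvested from word alignments. The sketch below shows the standard consistency-based phrase-extraction heuristic (a common technique in phrase-based MT, not something the slides spell out): a French span and an English span form a phrase pair only if no alignment link crosses the boundary of the box they define.

```python
def extract_phrases(f_words, e_words, alignment, max_len=4):
    """Extract phrase pairs consistent with a word alignment.

    alignment: set of (i, j) links between f_words[i] and e_words[j].
    A standard heuristic sketch, not the authors' implementation.
    """
    phrases = []
    for i1 in range(len(f_words)):
        for i2 in range(i1, min(i1 + max_len, len(f_words))):
            # English positions linked to the French span [i1, i2]
            js = [j for (i, j) in alignment if i1 <= i <= i2]
            if not js:
                continue
            j1, j2 = min(js), max(js)
            if j2 - j1 >= max_len:
                continue
            # consistency: no link from outside the French span into [j1, j2]
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            phrases.append((" ".join(f_words[i1:i2 + 1]),
                            " ".join(e_words[j1:j2 + 1])))
    return phrases
```

For example, with “ne veux jamais” aligned to “never wants” (ne→never, jamais→never, veux→wants), the heuristic extracts (“veux”, “wants”) and the full pair (“ne veux jamais”, “never wants”), but correctly refuses (“ne”, “never”) because “jamais” also links to “never”.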

  3. Cognate Identification • Identifying clear cognate matches can help create benchmarks for alignment • Different cognate matching metrics: • match the first four letters (e.g. suggère, suggests) • count shared bigrams (Dice coefficient) • e.g. unité, unity = un + ni + it = 3/4 • longest common subsequence ratio (LCSR) • e.g. couleur, color = c-o-l-r = 4/max(le, lf) = 4/7 • count shared letters, normalize by length, and apply an exponential length-difference penalty • e.g. chat, cat: ((3/4 + 3/3)/2) · e^(−(4−3)) • Incorporating cognates: add pairs to training set • Word-length distributions: P(lf | le) iteratively estimated via the EM algorithm in Model 1
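The three main cognate metrics on this slide can be sketched in a few lines of Python (illustrative code, not the authors' implementation):

```python
def first_four_match(f, e):
    """Match the first four letters (e.g. 'suggère' / 'suggests')."""
    return f[:4] == e[:4]

def dice_bigrams(f, e):
    """Dice coefficient over character bigrams (e.g. 'unité' / 'unity')."""
    bf = {f[i:i + 2] for i in range(len(f) - 1)}
    be = {e[i:i + 2] for i in range(len(e) - 1)}
    return 2 * len(bf & be) / (len(bf) + len(be))

def lcsr(f, e):
    """Longest common subsequence ratio: LCS length / longer word length.

    e.g. lcsr('couleur', 'color') = len('colr') / 7 = 4/7.
    """
    m, n = len(f), len(e)
    dp = [[0] * (n + 1) for _ in range(m + 1)]   # dp[i][j] = LCS of prefixes
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if f[i] == e[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)
```

On the slide's examples: `first_four_match("suggère", "suggests")` is true, `dice_bigrams("unité", "unity")` is 0.75 (3 shared bigrams out of 4 in each word), and `lcsr("couleur", "color")` is 4/7.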

  4. Phrase mappings • Good mappings: “la Bourse the Toronto” → “The Toronto Stock Exchange”, “les actes criminels” → “crimes of violence”, “excusez-nous si” → “excuse us if we”, “serait que” → “that it would”, “profiter de le occasion” → “take this opportunity to” • Extraneous: “la vision que” → “vision that” • Bad mappings: “les pays” → “the country to”, “le gouffre financière” → “cheered on by” • Good phrases for “pouvons travailler”: “can work”, “can work together”, “can work together within”, “can all work”, “can all work together”, “we can all work” • Bad phrases for “les administrations”: “of the GDP over”, “percent of the GDP over”, “4.22 percent of”, “to 4.22 percent of”, “of the GDP”, “4.22 percent of the”

  5. Results: significant improvements! [Charts: Model 1 trained on the test set; Model 1 with and without cognates; words only vs. with phrases]
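The IBM Model 1 training referenced in these results can be sketched as a short EM loop estimating translation probabilities t(f | e) from sentence pairs (a minimal one-direction sketch with a NULL English word, assumed details, not the authors' code):

```python
from collections import defaultdict

def model1_em(bitext, iters=5):
    """IBM Model 1 EM: estimate t(f | e) from (french, english) word lists.

    A minimal sketch: uniform-ish initialization, NULL source word,
    no smoothing or convergence check.
    """
    t = defaultdict(lambda: 1.0)
    for _ in range(iters):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in bitext:
            es = ["NULL"] + es
            for f in fs:
                z = sum(t[(f, e)] for e in es)   # normalize over alignments
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), c in count.items():         # M-step
            t[(f, e)] = c / total[e]
    return t
```

Even on a toy bitext such as (“la maison”, “the house”) and (“la fleur”, “the flower”), a few iterations concentrate probability mass so that t(la | the) dominates t(maison | the).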
