10 likes | 122 Views
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation Pidong Wang and Hwee Tou Ng { wangpd , nght }@comp.nus.edu.sg. Experimental Results. Motivation. Missing Word Recovery. Objective. Example. Punctuation Correction.
E N D
A Beam-Search Decoder for Normalization of Social Media Text with Application to Machine Translation Pidong Wang and Hwee Tou Ng {wangpd, nght}@comp.nus.edu.sg Experimental Results Motivation Missing Word Recovery Objective Example Punctuation Correction Limitations of Prior Work Corpora:(1) Chinese-English: 1,000 Weibo messages, normalized and then translated (2) English-Chinese: 2,000 messages randomly selected from the NUS English SMS corpus,normalized and then translated • A beam-search decoder for normalization of social media text • Applied to machine translation (MT) • Effectively integrate multiple normalization operations • Problem: Some words are omitted, e.g., “be” in English. • Solution: A CRF model to recover such missing words • Example: To recover “be” in English, the model has tags None, BE, IS, ARE, and AM, and assigns a tag to each token denoting whether to insert some form of “be” after the token. • Input text: “yeah must sign up , im in lt25”: • Without normalization, translation is “对[yeah] 必须 [must] 签署[sign up] , im在[in] lt25”. • Normalized text: “yeah must sign up , i ’m in lt25 .” • Translation of normalized text: “对 必须签署 , 我 在 lt25 。” • Word normalization and punctuation correction improve translation. • Most prior work focused on word substitution • Other normalization operations are also needed, e.g., punctuation correction, missing word recovery, etc. • Use a two-layer dynamic conditional random fields (CRF) model • Layer 1: Punctuation tags None, Comma, Period, Question-Mark, and Exclamatory-Mark • Layer 2: Sentence boundary tags Declarative-Begin, Declarative-In, Question-Begin, Question-In, Exclamatory-Begin, and Exclamatory-In • Social media texts (SMS messages, Twitter messages, etc.) are written informally with abbreviations, specialized vocabulary, etc. • Problematic for natural language processing applications • Quotation: Applicable if an informal word ends with a letter in (m,s,t) and if inserting a quotation mark before the letter produces a formal word, e.g., im i’m, shes she’s, isnt isn’t • Abbreviation: informal toformal word by adding missing vowels • Time: Change a number into a time expression if possible, e.g., from “1130am” to “11 : 30 am” Chinese-English English-Chinese Text Normalization Decoder Baselines: (1) ORIGINAL: No text normalization (2) LATTICE: Use lattice with the manually assembled dictionary (3) PBMT: Use phrase-based MT to perform normalization Text Normalization for English Text Normalization for Chinese Given an input text, the decoder searches for its best normalization, i.e., the best hypothesis, by iteratively performing two subtasks: Producing new sentence-level hypotheses from hypotheses in the current stack, carried out by hypothesis producers; Evaluating the new hypotheses to retain good ones, carried out by feature functions. • We design the following hypothesis producers: • Dictionary:A dictionary of informal-formal word pairs • Punctuation:A dynamic CRF model • Pronunciation: Use Chinese Pinyin to model the pronunciation similarity of Chinese words • Pronoun:A CRF model to recover the missing pronoun “我[I]” • Interjection:Remove redundant interjections, e.g., “好的[ok]哦[oh]” “好的” • Resegmentation:Fix word segmentation problems • Dictionary, Punctuation, Interjection, plus • Pronunciation: Model English word pronunciation similarity • Be: ACRF model to recover the missing word “be” • Retokenization: Fix tokenization problems • Prefix: Informal word w formal word w’ if w is a prefix of w’, e.g., “goin” “going”