Normalization of SMS Text

Gurpreet Singh Khanuja, Sachin Yadav

Presentation Transcript


  1. Normalization of SMS Text. Gurpreet Singh Khanuja, Sachin Yadav

  2. Problem Introduction
     • Real-time, short, massive in volume
     • Noisy
     • Similar to chat text, as on Facebook and Twitter
     Example: This is so hard 2 handle! Tell ppl u luv them cuz 2morrow is truly not promised!!!
     Normalised: This is so hard to handle! Tell people you love them because tomorrow is truly not promised!!!

  3. e.g., earthquak, earthquick, ... could be normalised to “earthquake”.

  4. Objective
     • To restore ill-formed English words to their canonical forms
     • Typos (e.g. luv)
     • Ad hoc abbreviations (e.g. ppl, u)
     • Phonetic substitutions (e.g. 2morrow, 10q), etc.

  5. The normalization focuses on out-of-vocabulary (OOV) words, which are identified first.
     • Input: This is so hard 2 handle! Tell ppl u luv them cuz 2morrow is truly not promised!!!
     • Output: This is so hard to handle! Tell people you love them because tomorrow is truly not promised!!!
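The OOV identification step can be sketched as a dictionary lookup. This is a minimal illustration, not the authors' implementation: the tiny `LEXICON`, the tokenizer pattern, and the function name `find_oov` are all assumptions, and a real system would use a full dictionary.

```python
import re

# Toy in-vocabulary lexicon (assumption); a real system would load a large dictionary.
LEXICON = {
    "this", "is", "so", "hard", "to", "handle", "tell", "people", "you",
    "love", "them", "because", "tomorrow", "truly", "not", "promised",
}

# Simple tokenizer (assumption): runs of letters, digits, or apostrophes.
TOKEN_RE = re.compile(r"[A-Za-z0-9']+")


def find_oov(text, lexicon=LEXICON):
    """Return the lowercased tokens of `text`, in order, that are not in `lexicon`."""
    tokens = TOKEN_RE.findall(text.lower())
    return [tok for tok in tokens if tok not in lexicon]


message = ("This is so hard 2 handle! Tell ppl u luv them cuz 2morrow "
           "is truly not promised!!!")
oov_tokens = find_oov(message)
print(oov_tokens)
```

Only the tokens flagged here would be passed on to the later normalization steps; in-vocabulary words are left untouched.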

  6. • Noisy channel model: given ill-formed text T and standard form S, find argmax_S P(S|T) via argmax_S P(T|S) P(S), where P(S) is the language model and P(T|S) is the error model [Toutanova and Moore, 2002; Choudhury et al., 2007; Cook and Stevenson, 2009]
     • Phrasal statistical machine translation (SMT): original text = source language; normalised form = target language [Aw et al., 2006; Kaufmann and Kalita, 2010]
     • Miscellaneous: automatic speech recognition [Kobus et al., 2008], hybrid models [Beaufort et al., 2010]
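The noisy channel scoring follows from Bayes' rule: P(S|T) = P(T|S) P(S) / P(T), and since P(T) does not depend on S it can be dropped from the argmax. A minimal sketch of that scoring is below. The probability tables are invented toy numbers, and the names `LANGUAGE_MODEL`, `ERROR_MODEL`, and `normalise` are hypothetical; none of this comes from the cited papers.

```python
import math

# Hypothetical toy probabilities, invented purely for illustration.
# LANGUAGE_MODEL[s] plays the role of P(S); ERROR_MODEL[(t, s)] plays the role of P(T|S).
LANGUAGE_MODEL = {"tomorrow": 0.6, "to": 0.3, "morrow": 0.1}
ERROR_MODEL = {
    ("2morrow", "tomorrow"): 0.2,
    ("2morrow", "morrow"): 0.05,
    ("2morrow", "to"): 0.001,
}

FLOOR = 1e-9  # small probability for pairs or words missing from the toy tables


def normalise(token, candidates):
    """Return the candidate S maximising log P(T|S) + log P(S) for the observed token T."""
    def score(s):
        p_t_given_s = ERROR_MODEL.get((token, s), FLOOR)
        p_s = LANGUAGE_MODEL.get(s, FLOOR)
        return math.log(p_t_given_s) + math.log(p_s)

    return max(candidates, key=score)


best = normalise("2morrow", ["tomorrow", "to", "morrow"])
print(best)
```

Working in log space avoids underflow when the same scoring is applied to whole sentences rather than single tokens.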

  7. Proposed Approach
     • Confusion set generation (find correction candidates)
     Example: ... crush da redberry b4 da water ...
     Candidates for "b4": before, four, be, bore
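One way to generate such a confusion set is to expand digits into the sounds they may stand for and then keep lexicon words within a small edit distance of any expanded spelling. This is a sketch under stated assumptions, not the method from the slides: the toy `LEXICON`, the `DIGIT_SOUNDS` table, and the threshold `max_distance=2` are all hypothetical choices.

```python
from itertools import product

# Hypothetical toy lexicon and digit-to-sound table (assumptions for illustration).
LEXICON = {"before", "four", "be", "bore", "water", "crush", "the"}
DIGIT_SOUNDS = {"2": ["to", "too"], "4": ["for", "four", "fore"], "8": ["ate"]}


def edit_distance(a, b):
    """Levenshtein distance between strings a and b, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def expand_digits(token):
    """Yield spellings of token with each known digit replaced by one of its sounds."""
    options = [DIGIT_SOUNDS.get(ch, [ch]) for ch in token]
    for combo in product(*options):
        yield "".join(combo)


def confusion_set(token, lexicon=LEXICON, max_distance=2):
    """Return lexicon words within max_distance edits of the token or any digit expansion."""
    variants = set(expand_digits(token)) | {token}
    return {w for w in lexicon
            if any(edit_distance(v, w) <= max_distance for v in variants)}


candidates = confusion_set("b4")
print(sorted(candidates))
```

With this toy lexicon, the call for "b4" yields the four candidates shown on the slide (before, four, be, bore) and excludes unrelated words such as "water".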

  8. Detection of ill-formed word (is the candidate an ill-formed word?)
     Context: crush da redberry ? da water, with candidates before, four, bore
     Ill-formed word detector: Yes or No

  9. Normalization of ill-formed word (select canonical lexical form)

  10. References
     • Bo Han and Timothy Baldwin. Lexical Normalisation of Short Text Messages: Makn Sens a #twitter.
     • Deana L. Pennell and Yang Liu. A Character-Level Machine Translation Approach for Normalization of SMS Abbreviations.
     • Toutanova and Moore, 2002; Choudhury et al., 2007; Cook and Stevenson, 2009
     • Kaufmann and Kalita, 2010
     • https://twitter.com
