Noisy Text Analytics: An Exercise in Futility? Hwee Tou Ng Department of Computer Science National University of Singapore 8 Jan 2007
Sources of Noisy Text • Traditional sources • Automatically transcribed text from speech • Automatically OCRed text from images
Sources of Noisy Text • More recent sources from the Web • Blogs, wikis, message boards, online chats, SMS, etc. • User generated content • Informal text • Acronyms, abbreviations, specialized vocabulary • Sublanguage, sub-community
Importance • The rise of social media (“Web 2.0”) • Commercial, economic interest
Importance • ACL SIGWAC (Special Interest Group on the Web as Corpus, Association for Computational Linguistics) • CLEANEVAL (shared task and competition for web corpus cleaning)
An Exercise in Futility? Necessity is the mother of invention!
What is “Analytics”? • American Heritage Dictionary • “The branch of logic dealing with analysis” • Merriam-Webster’s Online Dictionary • “The method of logical analysis”
Analytics • Approach #1 • Eliminate the noise in noisy text (text normalization), then process the text as usual • Noise: misspelled words, wrongly cased words, wrong sentence and paragraph boundaries • Example: • Table recognition • Learning to Recognize Tables in Free Text, H T Ng, C Y Lim, J L T Koo, ACL 1999
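A minimal sketch of what text normalization in Approach #1 might look like. It is an illustration only, not the method of the cited paper: the `normalize` function and the `SPELLING_FIXES` table are hypothetical, and a real system would rely on a full lexicon or a learned model rather than three hard-coded entries.

```python
import re

# Hypothetical correction table (an assumption for illustration only).
SPELLING_FIXES = {"teh": "the", "recieve": "receive", "noisey": "noisy"}


def normalize(text: str) -> str:
    """Apply three toy normalization steps to a noisy string."""
    # 1. Collapse runs of whitespace, including stray line breaks.
    text = re.sub(r"\s+", " ", text).strip()
    # 2. Lowercase everything, then replace known misspellings token by token.
    tokens = [SPELLING_FIXES.get(tok, tok) for tok in text.lower().split(" ")]
    text = " ".join(tokens)
    # 3. Restore a capital letter at the start of each sentence.
    return re.sub(
        r"(^|[.!?] )([a-z])",
        lambda m: m.group(1) + m.group(2).upper(),
        text,
    )


cleaned = normalize("TEH  parser did not\nrecieve input. teh text was NOISEY")
print(cleaned)  # The parser did not receive input. The text was noisy
```

The cleaned string can then be passed to a standard pipeline that expects well-formed text. Note that lowercasing everything discards casing that may have been correct, which is one reason Approach #1 can lose information.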
Analytics • Approach #2 • Process the noisy text directly, as is • Examples: • Upper case text (e.g., speech recognizer output) • Teaching a Weaker Classifier: Named Entity Recognition on Upper Case Text, H L Chieu, H T Ng, ACL 2002 • Semi-structured text (e.g., seminar announcements, job advertisements) • A Maximum Entropy Approach to Information Extraction from Semi-Structured and Free Text, H L Chieu, H T Ng, AAAI 2002
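One way to picture Approach #2 on upper case text is a feature extractor that never looks at letter case, so a classifier trained on its output can work on the noisy input directly. The sketch below is an assumption for illustration and does not reproduce the approach of the cited papers; `token_features` and the `FIRST_NAMES` list are hypothetical.

```python
# Hypothetical list of first names (an assumption for illustration only).
FIRST_NAMES = {"john", "mary", "peter"}


def token_features(tokens: list[str], i: int) -> dict[str, object]:
    """Build features for tokens[i] that do not depend on letter case."""
    word = tokens[i].lower()
    return {
        "word": word,
        # Neighbouring words, with boundary markers at the sentence edges.
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "suffix3": word[-3:],
        "has_digit": any(ch.isdigit() for ch in word),
        "in_first_names": word in FIRST_NAMES,
    }


tokens = "JOHN SMITH JOINED IBM IN 1999".split()
first = token_features(tokens, 0)
last = token_features(tokens, 5)
print(first)
print(last)
```

Because every feature is computed from the lowercased token or its context, the same extractor gives identical output for "JOHN", "John", and "john". Such feature dictionaries could feed any standard classifier.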