Web Scale NLP: A Case Study on URL Word Breaking
Kuansan Wang, Chris Thrasher, Bo-June (Paul) Hsu
Microsoft Research, Redmond, USA
WWW 2011, March 31, 2011
More Data > Complex Model
Banko and Brill. Mitigating the Paucity-of-Data Problem. HLT 01
More Data > Complex Model?
"There is no data like more data" (Norvig, CIKM 08)
NLP for the Web
Simple models with matched data!
• Scale of the Web
  • Avoid manual intervention
  • Efficient implementations
• Dynamic nature of the Web
  • Fast adaptation
• Global reach of the Web
  • Need rudimentary multi-lingual capabilities
• Diverse language styles of Web content
  • Multi-style language models
Outline Web-Scale NLP Word Breaking Models Evaluation Conclusion
Word Breaking
• Large data + simple model (Norvig, CIKM 2008)
  • Use a unigram model to rank all possible segmentations
  • Pretty good, but with occasional embarrassing outcomes
  • More data does not help!
• Extension to trigrams alleviates the problem
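The unigram approach can be sketched in a few lines, in the spirit of Norvig's segmenter: score every split of the string by the product of unigram probabilities and keep the best. The counts and smoothing below are toy assumptions for illustration, not the actual web-scale model.

```python
from functools import lru_cache

# Toy unigram counts standing in for a web-scale language model
# (assumption: a real system queries Microsoft Web N-gram instead).
COUNTS = {"choose": 50, "spain": 30, "chooses": 5, "pain": 40}
TOTAL = 1_000_000

def p_unigram(word: str) -> float:
    # Unseen words get a small, length-penalized probability.
    return COUNTS.get(word, 0.1 / 10 ** len(word)) / TOTAL

@lru_cache(maxsize=None)
def segment(text: str) -> tuple[float, tuple[str, ...]]:
    """Return (score, best segmentation) under the unigram model."""
    if not text:
        return 1.0, ()
    candidates = []
    for i in range(1, len(text) + 1):
        head, tail = text[:i], text[i:]
        tail_p, tail_words = segment(tail)
        candidates.append((p_unigram(head) * tail_p, (head,) + tail_words))
    return max(candidates)

print(segment("choosespain")[1])  # -> ('choose', 'spain')
```

Because probabilities only multiply, a unigram model can still prefer an embarrassing split when a rare long word happens to be in the vocabulary; higher-order n-grams add the context that rules such splits out.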
Word Breaking for the Web
Matched data is crucial to accuracy!
Web URLs exhibit a variety of language styles… …and in different languages
Outline Web-Scale NLP Word Breaking Models Evaluation Conclusion
MAP Decision Rule
Signal s → Distortion Channel → Observation o
• Special case of Bayesian minimum risk
• Speech, MT, parsing, tagging, information retrieval, …
• Problem: given observation o, find ŝ = argmax_s P(o|s) P(s)
  • P(o|s): transformation model
  • P(s): prior
MAP for Word Breaker
Signal s → Transformation Channel → Output o
• o: Twitter hashtag or URL domain name
  • Ex. 247moms, w84um8
• s: what the user meant to say
  • Ex. 24_7_moms, w8_4_u_m8 (wait for you mate)
Plug-in MAP Problem
• The MAP decision rule is optimal only if P(o|s) and P(s) are the "correct" underlying distributions
• Adjustments are needed when the estimated models P̂(o|s) and P̂(s) have unknown errors
  • Simple logarithmic interpolation: ŝ = argmax_s λ log P̂(o|s) + (1 − λ) log P̂(s)
  • "Random field"/machine learning approaches
  • Bayesian
    • Point estimation is outdated
    • Assume parameters are drawn from "some" distribution
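The logarithmic interpolation can be sketched as follows. The λ value and the candidate log-probabilities are illustrative assumptions, not numbers from the paper; in practice λ is tuned on held-out data.

```python
import math

def map_score(channel_logp: float, prior_logp: float, lam: float = 0.7) -> float:
    """Log-linear interpolation of the estimated channel model and prior:
    score(s) = lam * log P^(o|s) + (1 - lam) * log P^(s).
    lam = 0.5 recovers a geometric mean of the two models."""
    return lam * channel_logp + (1.0 - lam) * prior_logp

# Rank two hypothetical segmentations of "247moms" (toy log-probabilities).
candidates = {
    "24_7_moms": map_score(math.log(0.8), math.log(1e-6)),
    "247_moms":  map_score(math.log(0.9), math.log(1e-9)),
}
best = max(candidates, key=candidates.get)
print(best)  # -> 24_7_moms
```

The point of the interpolation weight is exactly the plug-in problem above: when the estimated models have errors of unknown size, reweighting their contributions on held-out data is a cheap, effective correction.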
Baseline Methods
All are special cases/variations of MAP:
• GM: Geometric Mean (Koehn and Knight, 2003)
  • Widely used, especially in MT systems
• BI: Binomial Model (Venkataraman, 2001)
• WL: Word Length Normalization (Khaitan et al., 2009)
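A minimal sketch of the GM baseline's scoring: averaging per-word log-probabilities removes the bias against segmentations with more words, since a plain product always shrinks as factors are added. The toy log-probabilities are assumptions for illustration.

```python
import math

# Toy log-probabilities (assumed values, not from any real model).
LOGP = {"home": math.log(1e-3), "work": math.log(1e-3),
        "homework": math.log(1e-5)}

def gm_score(words: list[str]) -> float:
    """Geometric-mean score (GM baseline): mean per-word log-probability,
    so a two-word split is not penalized merely for having two factors."""
    return sum(LOGP.get(w, math.log(1e-9)) for w in words) / len(words)

print(gm_score(["home", "work"]) > gm_score(["homework"]))  # -> True
```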
Proposed Method
ME: Maximum Entropy Principle Model
• Special case of BI () and WL (uniform)
• Using Microsoft Web N-gram (http://web-ngram.research.microsoft.com)
  • Web documents/Bing queries (EN-US market)
  • Rudimentary multilingual support (NAACL 10)
  • Frequent updates (ICASSP 09)
  • Multi-style language model (WWW 10, SIGIR 10)
Outline Web-Scale NLP Word Breaking Models Evaluation Conclusion
Data Set
• 100K randomly sampled URLs indexed by Bing
  • Simple tokenization
  • 266K unique tokens
  • Mostly ASCII characters
• Metric: Precision@3
  • Manually labeled word breaks
  • Multiple answers are allowed
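Precision@3 with multiple acceptable answers can be sketched as below; the candidate rankings and gold labels are made-up examples, not items from the evaluation set.

```python
def precision_at_k(ranked: list[str], gold: set[str], k: int = 3) -> float:
    """1.0 if any of the top-k hypotheses matches a labeled answer,
    else 0.0 (multiple gold answers allowed per URL)."""
    return 1.0 if any(h in gold for h in ranked[:k]) else 0.0

# Average over a toy labeled set; real labels come from human judges.
data = [
    (["24_7_moms", "247_moms", "2_47_moms"], {"24_7_moms"}),
    (["w84um8", "w8_4_u_m8"], {"w8_4_u_m8"}),
]
score = sum(precision_at_k(r, g) for r, g in data) / len(data)
print(score)  # -> 1.0
```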
Language Model Style
Matched style is crucial to precision!
• Title is best, although Body is 100x larger
• Navigational queries often word-split URLs, but Query is worse than Title
Model Complexity
A simple model is sufficient with matched data!
• With mismatched data, model choice is crucial
• With matched data, complex models do not help
Outline Web-Scale NLP Word Breaking Models Evaluation Conclusion
Best = Right Data + Smart Model
• Style of language trumps size of data
  • There is no data like more data… provided it's matched data!
• Right data alleviates the plug-in MAP problem
  • Complicated machine learning artillery is not required; simple methods suffice
• Smart model gives us:
  • Rudimentary multi-lingual capability
  • Fast inclusion of new words/phrases
  • Eliminates the need for human labor in data labeling
http://research.microsoft.com/en-us/um/people/kuansanw/wordbreaker/
(Results chart omitted: precision by language model style — Title, Query, Anchor. Note: BI, WL are oracle results.)