Language Processing in the Web Era Kuansan Wang ISRC Microsoft Research, Redmond WA
Banko and Brill (HLT’01): Mitigating the Paucity-of-Data Problems
There is no data like more data... that can be correctly exploited
NLP = Data + Model
• Data: size does matter, but…
  • Language styles
  • Global, multi-lingual
  • Dynamic
• Model
  • Simple, but not overly simplistic
  • The less human involvement, the better
  • The machine doesn't have to work the same way as a human
  • For many tasks, machines have outperformed humans
[Slide figure: streams of a Web document — URL, anchor text, HTML title, heading, caption, body — contrasted with search queries such as "google earning", "earnings", "GOOG", "gooogle", "quarterly report", …]
Tackling the gap between query and document languages
• Machine translation
  • Miller et al. (SIGIR-99): latent query generation
  • Berger and Lafferty (SIGIR-99): explicit query model
  • Jin et al. (SIGIR-02): title/body as parallel text
• Smoothing
  • Lafferty and Zhai (SIGIR-01): divergence model
  • Zhai and Lafferty (SIGIR-02): two-stage smoothing
• Questions
  • Quantitatively, how big is the problem really?
  • Computationally, what insights point to solutions?
Microsoft Web N-gram Service
• Cloud-based Web service
• Web documents/search queries received by Bing (EN-US market)
• Live with June-09 and April-10 snapshots
• Training/adaptation tokens: ~1.2T per snapshot
• CALM (ICASSP-2009)
• http://web-ngram.research.microsoft.com
Cross-Language Perplexities on Queries (June-09 Snapshot)
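Cross-language perplexity, i.e. how well a model trained on one stream (say, titles) predicts text from another (say, queries), can be sketched as follows. This is a minimal illustration with a toy unigram model and made-up log probabilities; the slide's numbers come from Web-scale n-grams, not from anything like this.

```python
import math

def perplexity(model, tokens, unk_logprob=-10.0):
    """Perplexity of `tokens` under a unigram log10-probability model.
    Out-of-vocabulary tokens fall back to `unk_logprob` (an assumption)."""
    total = sum(model.get(t, unk_logprob) for t in tokens)
    return 10 ** (-total / len(tokens))

# Toy log10 unigram probabilities, estimated (hypothetically) from titles
title_model = {"quarterly": -3.0, "report": -2.5, "earnings": -3.2}

# Evaluate the title-trained model on query-style text
query = ["quarterly", "earnings"]
print(perplexity(title_model, query))
```

A large perplexity here signals exactly the query/document language mismatch the preceding slides quantify: the higher the cross-stream perplexity, the worse one stream's model explains the other.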
Rapid Pace of the Web
• Top unigrams for Web docs change a lot
  • Ref. top 100K words from MS Web N-gram
• Search queries change even more quickly
• Real-time media more so still
  • Twitter, Facebook updates
• The Web is not a "dead" corpus
  • Adaptation capability is critical for Web NLP
MAP Decision Approach
• Channel coding; Bayesian minimum risk…
• Speech, MT, parsing, tagging, information retrieval
• Sopt = arg max P(S|O) = arg max P(O|S)P(S)
  • P(O|S): transformation model
  • P(S): prior
[Slide diagram: Signal (S) → Distortion Channel → Output (O)]
Plug-in MAP Problem
• The MAP decision is optimal only if P(S) and P(O|S) are the "real" distributions
• Adjustments are needed when the probabilistic models include estimation errors or mismatch
• Simple logarithmic interpolation:
  • Sopt = arg max [log P(O|S) + α log P(S)]
• "Random field"/machine learning:
  • Sopt = arg max log P(S|O) = arg max Σ αi log P(fi|O)
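The logarithmic-interpolation decision rule from this slide, Sopt = arg max [log P(O|S) + α log P(S)], is easy to sketch. The candidate set and the scores below are invented for illustration; only the decision rule itself comes from the slide.

```python
def plugin_map_decode(candidates, channel_logprob, prior_logprob, alpha=1.0):
    """Pick the signal S maximizing log P(O|S) + alpha * log P(S).
    alpha compensates for estimation error/mismatch in the plug-in models;
    alpha = 1 recovers the plain (unadjusted) MAP decision."""
    return max(candidates,
               key=lambda s: channel_logprob(s) + alpha * prior_logprob(s))

# Toy log scores for two candidate signals given one observation (illustrative)
chan = {"s1": -1.0, "s2": -2.0}   # log P(O|S)
prior = {"s1": -5.0, "s2": -1.0}  # log P(S)
best = plugin_map_decode(["s1", "s2"], chan.get, prior.get, alpha=1.0)
# s1 scores -6.0, s2 scores -3.0, so "s2" wins
```

Sweeping α shows the point of the slide: at α = 0 the prior is ignored and "s1" would win on channel score alone, so the weight genuinely changes the decision when the models are mismatched.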
Challenging Problems
• Generalizability
• Robustness
• Adaptability
• Implementation efficiency
• Cost
• When do we need complex models?
Case Study: Word Breaker
• O: Twitter hashtag or URL domain name (e.g. "247moms", "w84um8")
• S: what the user meant to say (e.g. "24_7_moms", "w8_4_u_m8" = "wait for you mate")
[Slide diagram: Signal (S) → Transformation Channel → Output (O)]
Word Breaking Challenge
• Norvig (CIKM 2008): Large Data + Simple Model
  • Unigram model
  • Good enough, but sometimes yields embarrassing outcomes
• Simple extension to trigrams
Additional Challenges for Web Applications
• Demo: bing.com/?q=word+breaker+web+era
Prior Art
• Simple heuristics
  • BI: binomial model (Venkataraman, 2001)
    • log P(O|S) = n log P# + (|S| - n - 1) log(1 - P#)
  • GM: geometric mean (Koehn and Knight, 2003)
    • Widely used, especially in MT systems
  • WL: word length normalization (Khaitan et al., 2009)
    • log P(O|S) = Σ log P(|wi|)
  • ME: maximum entropy principle (WWW’11)
    • P# = 0.5, α = 1.0, P(S) using MS Web N-gram
• Bayesian
  • Modular linguistic model (Brent, 1999)
  • Dirichlet process (Goldwater et al., 2006)
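The BI channel model on this slide is a one-liner: with n breaks among the |S| - 1 positions between characters, log P(O|S) = n log P# + (|S| - n - 1) log(1 - P#). A minimal sketch (the function name and the example segmentations are mine):

```python
import math

def binomial_break_logprob(segmentation, p_break=0.5):
    """BI model (Venkataraman, 2001 per the slide):
    log P(O|S) = n*log(P#) + (|S| - n - 1)*log(1 - P#),
    where n = number of breaks and |S| = total characters in S."""
    n = len(segmentation) - 1                    # breaks between words
    length = sum(len(w) for w in segmentation)   # |S| characters
    return n * math.log(p_break) + (length - n - 1) * math.log(1 - p_break)

a = binomial_break_logprob(["24", "7", "moms"])  # n = 2, |S| = 7
b = binomial_break_logprob(["247moms"])          # n = 0, |S| = 7
```

Note that at P# = 0.5 (the ME setting on the slide) every segmentation of the same string scores identically, (|S| - 1) log 0.5, so the channel model becomes uninformative and the prior P(S) alone drives the decision.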
[Results chart over the Title, Query, and Anchor corpora. Note: BI, WL are cheating experiments!]
Best = Right Data + Smart Model
• Style of language trumps size of data
  • The right data alleviates the plug-in MAP problem
  • Complicated machine-learning artillery is not required; simple methods suffice
  • Performance scales with model power, as mathematically predicted
• A smart model gives us:
  • Rudimentary multi-lingual capability
  • Fast inclusion of new words/phrases
  • Alleviated need for human labor
From Word Breaking to Spelling Correction
• http://www.spellerchallenge.com