Problem 1: Word Segmentation

Problem 1: Word Segmentation • whatdoesthisreferto → what does this refer to

Application: Chinese Text

Application: Internet Domain Names • www.visitbritain.com → Visit Britain

Statistical Machine Learning • Best segmentation = the one with the highest probability • Probability of a segmentation = P(first word) × P(rest of segmentation) • P(word) = estimated by counting

Statistical Machine Learning • choosespain • Choose Spain or Chooses pain? • P(“Choose Spain”) > P(“Chooses Pain”)
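The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the talk's program: the counts in `COUNTS` and the corpus size `TOTAL` are invented toy numbers, and `p_word` and `p_segmentation` are hypothetical helper names.

```python
from math import prod

# Hypothetical toy counts, invented for illustration (not from a real corpus).
COUNTS = {"choose": 500, "spain": 300, "chooses": 100, "pain": 400}
TOTAL = 1_000_000  # assumed total number of word tokens in the toy corpus

def p_word(word):
    """P(word), estimated by counting: occurrences / total tokens."""
    return COUNTS.get(word, 0) / TOTAL

def p_segmentation(words):
    """Probability of a segmentation: the product of its word probabilities."""
    return prod(p_word(w) for w in words)

candidates = [["choose", "spain"], ["chooses", "pain"]]
best = max(candidates, key=p_segmentation)
print(best)
```

With these toy counts, "choose spain" scores higher because both of its words are reasonably frequent, whereas "chooses" is comparatively rare.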

Example • segment(“nowisthetime…”) • Pf(“n”) × Pr(“owisthetime…”) • Pf(“no”) × Pr(“wisthetime…”) • Pf(“now”) × Pr(“isthetime…”) • Pf(“nowi”) × Pr(“sthetime…”) • ……
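The enumeration above (try every first word, then segment the rest) can be sketched as a memoized recursion. This is an illustrative sketch under stated assumptions, not the program from the talk: the counts are invented, `MAX_WORD_LEN` is an arbitrary cap, and the penalty for unknown words is an assumed smoothing choice.

```python
from functools import lru_cache
from math import prod

# Hypothetical toy counts, invented for illustration.
COUNTS = {"now": 800, "is": 5000, "the": 20000, "time": 900,
          "no": 1500, "w": 5, "i": 3000, "t": 10, "he": 2000}
TOTAL = 1_000_000   # assumed total number of word tokens
MAX_WORD_LEN = 10   # assumed cap on the length of a candidate first word

def p_word(word):
    """P(word) by counting; unknown words get a small probability
    that shrinks with length (an assumed smoothing choice)."""
    if word in COUNTS:
        return COUNTS[word] / TOTAL
    return 1.0 / (TOTAL * 10 ** len(word))

def p_words(words):
    """Probability of a word sequence: P(first word) x P(rest)."""
    return prod(p_word(w) for w in words)

def splits(text):
    """All (first, rest) pairs with a non-empty first word."""
    return [(text[:i], text[i:])
            for i in range(1, min(len(text), MAX_WORD_LEN) + 1)]

@lru_cache(maxsize=None)
def segment(text):
    """Return the most probable segmentation of text as a tuple of words."""
    if not text:
        return ()
    candidates = [(first,) + segment(rest) for first, rest in splits(text)]
    return max(candidates, key=p_words)

print(segment("nowisthetime"))  # prints ('now', 'is', 'the', 'time')
```

Memoizing `segment` matters because the same suffix (for example "thetime") is reached through many different first-word choices; caching keeps the search from repeating that work.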

Example • segment(“nowisthetime…”)

The Complete Program

Performance • Accuracy = 98% • Trained on 1.7B words (English) • Typical errors: • baseratesoughtto → base rate sought to • smallandinsignificant → small and in significant • ginormousego → g in or mouse go

Some Results • whorepresents.com: [“who”, “represents”] • therapistfinder.com: [“therapist”, “finder”] • expertsexchange.com: [“experts”, “exchange”] • speedofart.net: [“speed”, “of”, “art”] • penisland.com: error, expected [“pen”, “island”]

Problem 2: Spelling Correction • Mehran Salami • Typical word processor: Tehran Salami • But Google can …

Statistical Machine Learning • Best correction = the one with the highest probability • Probability of a spelling correction c = P(c as a word) × P(original is a typo for c) • P(c as a word) = estimated by counting • P(original is a typo for c) = estimated from the number of changes (fewer changes, higher probability)
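The model above can be sketched as follows. This is a minimal illustration under stated assumptions, not the program from the talk: the word counts are invented, candidates are limited to at most two single-character changes, and the error model (each change multiplies the probability by an assumed constant `P_EDIT`) is a simplification chosen for the example.

```python
# Hypothetical toy counts, invented for illustration.
COUNTS = {"spelling": 50, "spewing": 5, "the": 20000, "they": 3000, "then": 2500}
TOTAL = sum(COUNTS.values())
LETTERS = "abcdefghijklmnopqrstuvwxyz"
P_EDIT = 0.01  # assumed probability of one change; k changes cost P_EDIT ** k

def p_word(word):
    """P(c as a word), estimated by counting."""
    return COUNTS.get(word, 0) / TOTAL

def edits1(word):
    """All strings one change (delete, transpose, replace, insert) from word."""
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in LETTERS]
    inserts = [a + c + b for a, b in pairs for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def known(words):
    """Keep only the strings that appear in the counted vocabulary."""
    return {w for w in words if w in COUNTS}

def correct(word):
    """Return the candidate c maximizing P(c) x P(word is a typo for c)."""
    scored = {}
    if word in COUNTS:
        scored[word] = p_word(word)  # zero changes
    one_away = edits1(word)
    for c in known(one_away):
        scored[c] = max(scored.get(c, 0.0), p_word(c) * P_EDIT)
    for e in one_away:
        for c in known(edits1(e)):
            scored[c] = max(scored.get(c, 0.0), p_word(c) * P_EDIT ** 2)
    # With no known candidate, leave the word unchanged.
    return max(scored, key=scored.get) if scored else word

print(correct("speling"))
print(correct("thw"))
```

Both "spelling" and "spewing" are one change away from "speling", so the word counts decide between them; for "thw", the single-change candidate "the" beats the two-change candidates "they" and "then".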

The Complete Program

Problem 3: Speech Recognition • An informal, incomplete grammar of the English language runs over 1,700 pages. • Invariably, simple models and a lot of data trump more elaborate models based on less data.

Problem 3: Speech Recognition • If you have a lot of data, memorisation is a good policy. • For many tasks such as speech recognition, once we have a billion or so examples, we essentially have a closed set that represents (or at least approximates) what we need, without general rules.

Problem 3: Speech Recognition

Problem 3: Speech Recognition “Every time I fire a linguist, the performance of our speech recognition system goes up.” --- Fred Jelinek

Problem 4: Machine Translation

Conclusion • (Statistical) [Machine] Learning Is The Ultimate Agile Development Tool • Peter Norvig (Director of Research, Google)
