A collection of essential libraries and utilities for text engineering, including word segmentation, syllable-to-word conversion, language models, and graphical models. Explore advanced algorithms and applications for text processing.
ELUTE • Essential Libraries and Utilities of Text Engineering • Tian-Jian Jiang
Why bother? • Since we already have...
(lib)TaBE • Traditional Chinese Word Segmentation • with Big5 encoding • Traditional Chinese Syllable-to-Word Conversion • with Big5 encoding • for bo-po-mo-fo transcription system
libchewing • Now hacking for • UTF-8 encoding • Pinyin transcription system • and looking for • an alternative algorithm • a better dictionary
We got a problem • 23:43 < s*****> 到底(3023) + 螫舌(0) 麼(2536) 東西(6024) = 11583 • 23:43 < s*****> 到底是(829) + 什麼東西(337) = 1166 • 23:43 < s*****> 到底螫舌麼東西大勝到底這什麼東西 (the garbled 到底螫舌麼東西 wins by a landslide) • 00:02 < s*****> k***: 「什麼」會被「什麼東西」排擠掉 ("什麼" gets crowded out by "什麼東西") • 00:02 < s*****> k***: 結果是20445 活生生的被337 幹掉:P (so the 20445 gets killed off by the 337 :P)
Heuristic Rules* • Maximum matching -- Simple vs. Complex: 下雨天真正討厭 • 下雨天 / 真正 / 討厭 vs. 下雨 / 天真 / 正 / 討厭 • Maximum average word length • 國際化 • Minimum variance of word lengths • 研究生命起源 • Maximum degree of morphemic freedom of single-character word • 主要是因為 * Refer to MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/
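A minimal sketch of how rules 2 and 3 above (maximum average word length, minimum variance of word lengths) can rank candidate chunks, in the spirit of MMSEG; the toy dictionary, the three-word chunk limit, and the helper names are illustrative assumptions, not the library's actual API:

    # Illustrative MMSEG-style chunk scoring: prefer the chunk with the largest
    # average word length, then the smallest variance of word lengths.
    DICTIONARY = {"研究", "研究生", "生命", "命", "起源"}  # toy dictionary, assumption

    def chunks(text, max_word_len=3):
        """Enumerate candidate chunks of up to three words from the start of text."""
        results = []
        def walk(rest, words):
            if len(words) == 3 or not rest:
                results.append(tuple(words))
                return
            for size in range(min(len(rest), max_word_len), 0, -1):
                head = rest[:size]
                if head in DICTIONARY or size == 1:
                    walk(rest[size:], words + [head])
        walk(text, [])
        return results

    def score(chunk):
        lengths = [len(w) for w in chunk]
        mean = sum(lengths) / len(lengths)
        variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
        return (mean, -variance)  # rule 2 first, rule 3 as tie-breaker

    print(max(chunks("研究生命起源"), key=score))  # ('研究', '生命', '起源')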
Graphical Models • Markov chain family • Statistical Language Model (SLM) • Hidden Markov Model (HMM) • Exponential models • Maximum Entropy (ME) • Conditional Random Fields (CRF) • Applications • Probabilistic Context-Free Grammar (PCFG) Parser • Head-driven Phrase Structure Grammar (HPSG) Parser • Link Grammar Parser
The Italian Who Went to Malta • One day ima gonna Malta to bigga hotel. • Ina morning I go down to eat breakfast. • I tella waitress I wanna two pissis toasts. • She brings me only one piss. • I tella her I want two piss. She say go to the toilet. • I say, you no understand, I wanna piss onna my plate. • She say you better no piss onna plate, you sonna ma bitch. • I don’t even know the lady and she call me sonna ma bitch!
For that Malta waitress, • P(“I want to piss”) > P(“I want two pieces”)
Do the Math • Conditional probability: P(A|B) = P(A, B) / P(B) • Bayes’ theorem: P(A|B) = P(B|A) P(A) / P(B) • Information theory: entropy H(X) = -Σx p(x) log2 p(x) • Noisy channel model: I → noisy channel p(o|i) → O → decoder → Î • Decoder: Î = argmaxI P(I|O) = argmaxI P(O|I) P(I) • Language model: P(I)
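A toy illustration of the decoding step above: the decoder picks the intended sentence I that maximizes P(O|I) · P(I), which by Bayes’ theorem is proportional to P(I|O). The candidate sentences and all probabilities below are made-up numbers for the Malta anecdote, not measurements:

    import math

    # Made-up numbers: the waitress's language model P(I) and channel model P(O|I)
    candidates = {
        "I want two pieces": {"lm": 1e-8, "channel": 0.5},
        "I want to piss":    {"lm": 1e-6, "channel": 0.4},
    }

    def decode(cands):
        # argmax over I of log P(I) + log P(O|I)
        return max(cands, key=lambda i: math.log(cands[i]["lm"]) + math.log(cands[i]["channel"]))

    print(decode(candidates))  # "I want to piss" -- her language model wins over the channel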
Shannon’s Game • Predict the next word from its history • Maximum Likelihood Estimation: PMLE(wn|w1…wn-1) = C(w1…wn) / C(w1…wn-1) • C(w1…wn): frequency of the n-gram w1…wn
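A minimal sketch of that count ratio for bi-grams, assuming a toy whitespace-tokenized corpus made up here for illustration:

    from collections import Counter

    corpus = "i want to piss i want two pieces".split()  # illustrative tokens
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def mle(prev, word):
        # P(word | prev) = C(prev word) / C(prev)
        return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

    print(mle("i", "want"))   # 2/2 = 1.0
    print(mle("want", "to"))  # 1/2 = 0.5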
Once in a Blue Moon • A cat has seen... • 10 sparrows • 4 barn swallows • 1 Chinese bulbul • 1 Pacific swallow • How likely is it that the next bird is of an unseen species?
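One standard way to answer, via the Good-Turing intuition (the slide itself does not name an estimator, so this choice is an assumption): give the unseen species the probability mass of the species seen exactly once.

    # Good-Turing style estimate of the unseen-species mass (assumed estimator).
    counts = {"sparrow": 10, "barn swallow": 4, "Chinese bulbul": 1, "Pacific swallow": 1}

    total = sum(counts.values())                             # N  = 16 birds seen
    singletons = sum(1 for c in counts.values() if c == 1)   # N1 = 2 species seen once

    p_unseen = singletons / total
    print(p_unseen)  # 2/16 = 0.125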
But I’ve seen a moon and I’m blue • Simple linear interpolation • PLi(wn|wn-2, wn-1) = λ1P1(wn) + λ2P2(wn|wn-1) + λ3P3(wn|wn-2, wn-1) • 0 ≤ λi ≤ 1, Σi λi = 1 • Katz’s backing-off • Back off through progressively shorter histories: • Pbo(wi|wi-n+1…wi-1) = d × C(wi-n+1…wi) / C(wi-n+1…wi-1), if C(wi-n+1…wi) > k • Pbo(wi|wi-n+1…wi-1) = α(wi-n+1…wi-1) × Pbo(wi|wi-n+2…wi-1), otherwise
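A minimal sketch of the simple linear interpolation formula above; the lambda weights and the input probabilities are illustrative placeholders, not tuned values:

    def interpolate(p1, p2, p3, lambdas=(0.2, 0.3, 0.5)):
        """PLi(wn|wn-2, wn-1) = l1*P1(wn) + l2*P2(wn|wn-1) + l3*P3(wn|wn-2, wn-1)."""
        l1, l2, l3 = lambdas
        assert abs(l1 + l2 + l3 - 1.0) < 1e-9  # weights must sum to 1
        return l1 * p1 + l2 * p2 + l3 * p3

    # Unseen trigram (p3 = 0) still gets probability mass from the lower orders.
    print(interpolate(p1=0.01, p2=0.2, p3=0.0))  # ~0.062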
Good Luck! • Place a bet remotely on a race among 8 horses by passing encoded messages. • Past bet distribution • horse 1: 1/2 • horse 2: 1/4 • horse 3: 1/8 • horse 4: 1/16 • the rest: 1/64 each Foreversoul: http://flickr.com/photos/foreversouls/ CC: BY-NC-ND
3 bits? No, only 2! • 0, 10, 110, 1110, 111100, 111101, 111110, 111111
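Checking the arithmetic: with the bet distribution from the previous slide and this prefix code, the expected message length equals the entropy of the distribution, 2 bits rather than 3.

    from math import log2

    probs   = [1/2, 1/4, 1/8, 1/16] + [1/64] * 4
    lengths = [1, 2, 3, 4, 6, 6, 6, 6]  # code word lengths of 0, 10, 110, 1110, 111100..111111

    expected_length = sum(p * l for p, l in zip(probs, lengths))
    entropy = -sum(p * log2(p) for p in probs)
    print(expected_length, entropy)  # 2.0 2.0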
Bi-gram MLE Flow Chart • Have 2 grams? Yes → permute candidates; for each candidate: • right_gram in LM? No → temp_score = LogProb(Unknown) • Yes → has left_gram? No → temp_score = LogProb(right_gram) • Yes → bi_gram in LM? Yes → temp_score = LogProb(bi_gram) • bi_gram not in LM → left_gram in LM? Yes → temp_score = LogProb(left_gram) + BackOff(right_gram) • No → temp_score = LogProb(Unknown) + BackOff(right_gram) • Then temp_score += previous_score • Update scores
Bi-gram Syllable-to-Word

// Dynamic programming over syllable boundaries: scores[i] holds the best
// log-probability of the first i syllables, tracks[i] the start of the last
// word, and words[i] that word.
INPUT input_syllables;
len = Length(input_syllables);
Load(language_model);
scores[len + 1]; tracks[len + 1]; words[len + 1];
FOR i = 0 TO len
    scores[i] = 0.0; tracks[i] = -1; words[i] = "";
FOR index = 1 TO len
    best_score = 0.0; best_prefix = -1; best_word = "";
    FOR prefix = index - 1 DOWNTO 0
        right_grams[] = Homophones(Substring(input_syllables, prefix, index - prefix));
        FOREACH right_gram IN right_grams[]
            IF right_gram IN language_model
                left = tracks[prefix];
                IF left >= 0 AND left != prefix
                    left_grams[] = Homophones(Substring(input_syllables, left, prefix - left));
                    FOREACH left_gram IN left_grams[]
                        temp_score = 0.0;
                        bi_gram = left_gram + " " + right_gram;
                        IF bi_gram IN language_model
                            temp_score += LogProb(bi_gram);
                        ELSEIF left_gram IN language_model
                            temp_score += LogProb(left_gram) + BackOff(right_gram);
                        ELSE
                            temp_score += LogProb(Unknown) + BackOff(right_gram);
                        temp_score += scores[prefix];
                        CALL Scoring;
                ELSE  // no left context yet
                    temp_score = LogProb(right_gram) + scores[prefix];
                    CALL Scoring;
            ELSE  // right_gram unknown to the language model
                temp_score = LogProb(Unknown) + scores[prefix];
                CALL Scoring;
    scores[index] = best_score;
    tracks[index] = best_prefix;
    words[index] = best_word;
    IF tracks[index] == -1
        tracks[index] = index - 1;
// Trace the best path back from the end of the syllable string.
boundary = len;
output_words = "";
WHILE boundary > 0
    output_words = words[boundary] + output_words;
    boundary = tracks[boundary];
RETURN output_words;

SUBROUTINE Scoring
    // Keep the highest-scoring candidate ending at position index.
    IF best_score == 0.0 OR temp_score > best_score
        best_score = temp_score;
        best_prefix = prefix;
        best_word = right_gram;
And My Suggestions • Convenient API • Plain text I/O (in UTF-8) • More linguistic information • Algorithm: CRF • Corpus: we need YOU! • Flexible for different applications • Composite, Iterator, and Adapter Patterns • IDL support • SWIG • Open Source • Open Corpus, too