
Comprehensive Text Engineering Libraries and Tools

A collection of essential libraries and utilities for text engineering, including word segmentation, syllable-to-word conversion, language models, and graphical models. Explore advanced algorithms and applications for text processing.

martines


Presentation Transcript


  1. ELUTE • Essential Libraries and Utilities of Text Engineering • Tian-Jian Jiang

  2. Why bother? • Since we already have...

  3. (lib)TaBE • Traditional Chinese Word Segmentation • with Big5 encoding • Traditional Chinese Syllable-to-Word Conversion • with Big5 encoding • for bo-po-mo-fo transcription system

  4. 1999 - 2001?

  5. How about...

  6. libchewing • Now hacking for • UTF-8 encoding • Pinyin transcription system • and looking for • an alternative algorithm • a better dictionary

  7. We got a problem • 23:43 < s*****> 到底(3023) + 螫舌(0) 麼(2536) 東西(6024) = 11583 • 23:43 < s*****> 到底是(829) + 什麼東西(337) = 1166 • 23:43 < s*****> 到底螫舌麼東西 beats 到底這什麼東西 by a wide margin • 00:02 < s*****> k***: "什麼" gets crowded out by "什麼東西" • 00:02 < s*****> k***: so 20445 gets flat-out killed by 337 :P

  8. Word Segmentation Review

  9. Heuristic Rules* • Maximum matching -- Simple vs. Complex: 下雨天真正討厭 • 下雨天真正討厭 vs. 下雨天真正討厭 • Maximum average word length • 國際化 • Minimum variance of word lengths • 研究生命起源 • Maximum degree of morphemic freedom of single-character words • 主要是因為 • * See MMSEG by C. H. Tsai: http://technology.chtsai.org/mmseg/
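The first rule above, simple maximum matching, can be sketched in a few lines of Python. This is a minimal illustration, not MMSEG itself: the function name, the `max_word_len` parameter, and the toy lexicon are all assumptions made for the example, and the complex variant (which compares chunks of several words) is not implemented.

```python
def forward_maximum_matching(text, lexicon, max_word_len=4):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink toward one character.
        for length in range(min(max_word_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            # A single character is always accepted so the loop cannot stall.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                pos += length
                break
    return words


# Hypothetical toy lexicon, chosen only to exercise the example sentence.
TOY_LEXICON = {"下雨", "下雨天", "天真", "真正", "討厭"}

segments = forward_maximum_matching("下雨天真正討厭", TOY_LEXICON)
print(" ".join(segments))
```

With this lexicon the greedy pass picks 下雨天 before 下雨, so 天真 is never considered. A different lexicon or a shorter `max_word_len` can change the result, which is why the later rules exist as tie-breakers.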

  10. Graphical Models • Markov chain family • Statistical Language Model (SLM) • Hidden Markov Model (HMM) • Exponential models • Maximum Entropy (ME) • Conditional Random Fields (CRF) • Applications • Probabilistic Context-Free Grammar (PCFG) Parser • Head-driven Phrase Structure Grammar (HPSG) Parser • Link Grammar Parser

  11. What is a language model?

  12. A probability distribution over surface patterns of texts.

  13. The Italian Who Went to Malta • One day ima gonna Malta to bigga hotel. • Ina morning I go down to eat breakfast. • I tella waitress I wanna two pissis toasts. • She brings me only one piss. • I tella her I want two piss. She say go to the toilet. • I say, you no understand, I wanna piss onna my plate. • She say you better no piss onna plate, you sonna ma bitch. • I don’t even know the lady and she call me sonna ma bitch!

  14. For that Malta waitress, • P(“I want to piss”) > P(“I want two pieces”)

  15. Do the Math • Conditional probability: • Bayes’ theorem: • Information theory: • Noisy channel model • Language model: P(i) • Channel diagram: I → Noisy channel p(o|i) → O → Decoder → Î
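The equations on this slide did not survive in the transcript. The block below gives the standard textbook forms of the three named items, using the slide's own symbols (input i, output o, decoded estimate Î). It is a reconstruction under that assumption, not a copy of the original slide.

```latex
\begin{align}
  % Conditional probability
  P(A \mid B) &= \frac{P(A, B)}{P(B)} \\
  % Bayes' theorem
  P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)} \\
  % Noisy channel decoding: P(o) is constant over i, so it drops out
  \hat{I} &= \arg\max_{i} P(i \mid o)
           = \arg\max_{i} \frac{P(o \mid i)\, P(i)}{P(o)}
           = \arg\max_{i} P(o \mid i)\, P(i)
\end{align}
```

In the last line, p(o|i) is the channel model from the diagram and P(i) is the language model.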

  16. Shannon’s Game • Predict next word by history • Maximum Likelihood Estimation • C(w1…wn) : Frequency of n-gram w1…wn
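For the bigram case, maximum likelihood estimation reduces to a ratio of counts, C(w1 w2) / C(w1). A minimal sketch follows; the function names and the toy corpus are invented for illustration and do not come from the slides.

```python
from collections import Counter


def bigram_mle(tokens):
    """Return a function prob(prev, word) = C(prev word) / C(prev)."""
    # Count each token as a history only where a next word follows it,
    # so the estimates for a given history sum to one.
    history_counts = Counter(tokens[:-1])
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0  # unseen history: MLE has nothing to say
        return bigram_counts[(prev, word)] / history_counts[prev]

    return prob


# Hypothetical toy corpus.
corpus = "i want two pieces of toast i want to eat".split()
prob = bigram_mle(corpus)
print(prob("want", "two"), prob("want", "to"), prob("want", "piss"))
```

The zero for an unseen bigram is the weakness of plain MLE, and it is what the smoothing slides that follow address.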

  17. Once in a Blue Moon • A cat has seen... • 10 sparrows • 4 barn swallows • 1 Chinese Bulbul • 1 Pacific Swallow • How likely is it that next bird is unseen?

  18. (1+1) / (10 + 4 + 1 + 1)
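The fraction above is the number of species seen exactly once divided by the total number of sightings, which is the Good-Turing estimate of the probability mass for unseen events. The slide does not name the method, so that label is an assumption here. A minimal sketch with the bird counts from the previous slide:

```python
from fractions import Fraction


def unseen_probability(counts):
    """Estimate P(next item is a new type) as N1 / N.

    N1 is the number of types seen exactly once, N the total observations.
    """
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    return Fraction(singletons, total)


sightings = {
    "sparrow": 10,
    "barn swallow": 4,
    "Chinese Bulbul": 1,
    "Pacific Swallow": 1,
}
p_unseen = unseen_probability(sightings)
print(p_unseen)  # (1 + 1) / (10 + 4 + 1 + 1)
```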

  19. But I’ve seen a moon and I’m blue • Simple linear interpolation • PLi(wn|wn-2, wn-1) = λ1P1(wn) + λ2P2(wn|wn-1) + λ3P3(wn|wn-2, wn-1) • 0 ≤ λi ≤ 1, Σi λi = 1 • Katz’s backing-off • Back-off through progressively shorter histories. • Pbo(wi|wi-(n-1)…wi-1) =
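The interpolation formula can be sketched directly. The weights below are placeholders picked for the example; in practice they would be tuned on held-out data, and nothing in the slides specifies their values.

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas=(0.1, 0.3, 0.6)):
    """Mix unigram, bigram, and trigram estimates with weights that sum to 1."""
    if min(lambdas) < 0 or abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("lambdas must be non-negative and sum to 1")
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# Hypothetical estimates for a word whose trigram was never observed:
# the lower-order models keep the result above zero.
p = interpolate(p_unigram=0.05, p_bigram=0.2, p_trigram=0.0)
print(p)
```

Unlike back-off, which consults a shorter history only when the longer one is missing, interpolation always blends all three orders.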

  20. Good Luck! • Place a bet remotely on a race among 8 horses by passing encoded messages. • Past bet distribution • horse 1: 1/2 • horse 2: 1/4 • horse 3: 1/8 • horse 4: 1/16 • horses 5 to 8: 1/64 each • Photo: Foreversoul, http://flickr.com/photos/foreversouls/ (CC BY-NC-ND)

  21. 3 bits? No, only 2! • 0, 10, 110, 1110, 111100, 111101, 111110, 111111
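The claim can be checked numerically: with the bet distribution from the previous slide, the average length of the code words above equals the entropy of the distribution, 2 bits, even though a fixed-length code for 8 horses needs 3. The variable names below are illustrative only.

```python
import math
from fractions import Fraction

# Bet distribution from the slide: horses 1-4, then 1/64 for each of horses 5-8.
probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 16)]
probs += [Fraction(1, 64)] * 4
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]


def is_prefix_free(codewords):
    """True if no code word is a proper prefix of another (so decoding is unambiguous)."""
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)


# Average number of bits sent per bet with the variable-length code.
expected_length = sum(p * len(c) for p, c in zip(probs, codes))

# Shannon entropy of the distribution, in bits.
entropy = -sum(float(p) * math.log2(float(p)) for p in probs)

print(expected_length, entropy, is_prefix_free(codes))
```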

  22. Alright, let’s ELUTE

  23. Bi-gram MLE Flow Chart • Have 2 grams? Yes: permute candidates • right_gram in LM? No: temp_score = LogProb(Unknown) • Yes: has left_gram? No: temp_score = LogProb(right_gram) • Yes: bi_gram in LM? Yes: temp_score = LogProb(bi_gram) • No: left_gram in LM? Yes: temp_score = LogProb(left_gram) + BackOff(right_gram) • No: temp_score = LogProb(Unknown) + BackOff(right_gram) • temp_score += previous_score • Update scores

  24. Bi-gram Syllable-to-Word
      INPUT input_syllables;
      len = Length(input_syllables);
      Load(language_model);
      scores[len + 1]; tracks[len + 1]; words[len + 1];
      FOR i = 0 TO len
          scores[i] = 0.0; tracks[i] = -1; words[i] = "";
      FOR index = 1 TO len
          best_score = 0.0; best_prefix = -1; best_word = "";
          FOR prefix = index - 1 TO 0
              right_grams[] = Homophones(Substring(input_syllables, prefix, index - prefix));
              FOREACH right_gram IN right_grams[]
                  IF right_gram IN language_model
                      left = tracks[prefix];
                      IF left >= 0 AND left != prefix
                          left_grams[] = Homophones(Substring(input_syllables, left, prefix - left));
                          FOREACH left_gram IN left_grams[]
                              temp_score = 0.0;
                              bigram = left_gram + " " + right_gram;
                              IF bigram IN language_model
                                  bigram_score = LogProb(bigram);
                                  temp_score += bigram_score;
                              ELSEIF left_gram IN language_model
                                  bigram_backoff = LogProb(left_gram) + BackOff(right_gram);
                                  temp_score += bigram_backoff;
                              ELSE
                                  temp_score += LogProb(Unknown) + BackOff(right_gram);
                              temp_score += scores[prefix];
                              Scoring
                      ELSE
                          temp_score = LogProb(right_gram);
                          Scoring
                  ELSE
                      temp_score = LogProb(Unknown) + scores[prefix];
                      Scoring
          scores[index] = best_score;
          tracks[index] = best_prefix;
          words[index] = best_word;
          IF tracks[index] == -1
              tracks[index] = index - 1;
      boundary = len;
      output_words = "";
      WHILE boundary > 0
          output_words = words[boundary] + output_words;
          boundary = tracks[boundary];
      RETURN output_words;

      SUBROUTINE Scoring
          IF best_score == 0.0 OR temp_score > best_score
              best_score = temp_score;
              best_prefix = prefix;
              best_word = right_gram;
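A runnable Python sketch in the spirit of the pseudocode above is given below. It is not a line-by-line port, and it departs from the slide in three labeled ways, all of them assumptions of this sketch: it conditions on the single best word stored at each boundary rather than on every homophone of the left span, it uses the common ARPA-style back-off (back-off weight of the left word plus log-probability of the right word), and it falls back to the raw syllable when no word is known. The pinyin keys, the toy language model, and all scores are invented for illustration.

```python
import math

UNKNOWN_LOGPROB = -10.0  # hypothetical penalty for out-of-vocabulary words


def score_word(lm, left, right):
    """Log-score of `right` given the previous word `left` (None at sentence start)."""
    if right not in lm:
        return UNKNOWN_LOGPROB
    if left is None:
        return lm[right][0]
    bigram = left + " " + right
    if bigram in lm:
        return lm[bigram][0]
    # Back off: weight of the history (if known) plus the unigram log-probability.
    left_backoff = lm[left][1] if left in lm else 0.0
    return left_backoff + lm[right][0]


def syllables_to_words(syllables, homophones, lm):
    """Pick the best-scoring word sequence for a list of syllables."""
    n = len(syllables)
    scores = [0.0] + [-math.inf] * n  # best log-score of a path ending at each boundary
    tracks = [-1] * (n + 1)           # boundary where the last word of that path starts
    words = [None] * (n + 1)          # last word of that path
    for end in range(1, n + 1):
        for start in range(end):
            span = tuple(syllables[start:end])
            # Unknown single syllables pass through unchanged so every boundary is reachable.
            candidates = homophones.get(span) or ([span[0]] if len(span) == 1 else [])
            for right in candidates:
                total = scores[start] + score_word(lm, words[start], right)
                if total > scores[end]:
                    scores[end], tracks[end], words[end] = total, start, right
    # Trace back from the end of the input to recover the word sequence.
    output = []
    boundary = n
    while boundary > 0:
        output.append(words[boundary])
        boundary = tracks[boundary]
    return list(reversed(output))


# Hypothetical toy data: entries map to (log-probability, back-off weight).
HOMOPHONES = {
    ("dao4",): ["到"], ("di3",): ["底"],
    ("dao4", "di3"): ["到底"], ("shi4",): ["是", "事"],
}
TOY_LM = {
    "到": (-3.0, -0.2), "底": (-3.5, -0.2), "到底": (-2.0, -0.5),
    "是": (-1.5, -0.4), "事": (-2.5, -0.3), "到底 是": (-0.3, 0.0),
}

result = syllables_to_words(["dao4", "di3", "shi4"], HOMOPHONES, TOY_LM)
print("".join(result))
```

Keeping only one best word per boundary mirrors the `scores`, `tracks`, and `words` arrays of the pseudocode; a full Viterbi search over (boundary, word) states would keep one entry per candidate word instead.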

  25. Show me the…

  26. William’s Requests

  27. And My Suggestions • Convenient API • Plain text I/O (in UTF-8) • More linguistic information • Algorithm: CRF • Corpus: we need YOU! • Flexible to different applications • Composite, Iterator, and Adapter Patterns • IDL support • SWIG • Open Source • Open Corpus, too

  28. Thank YOU
