1 / 22

A New Lexicon Mechanism for Chinese Word Segmentation

A New Lexicon Mechanism for Chinese Word Segmentation. Advisor : Dr. Hsu Graduate : Kuo-min Wang. 2006 PACIS. Outline. Motivation Objective Introduction A New Lexicon Mechanism Experiments Conclusion Personal Opinions. Motivate.

Download Presentation

A New Lexicon Mechanism for Chinese Word Segmentation

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A New Lexicon Mechanism for Chinese Word Segmentation Advisor : Dr. Hsu Graduate : Kuo-min Wang 2006 PACIS .

  2. Outline • Motivation • Objective • Introduction • A New Lexicon Mechanism • Experiments • Conclusion • Personal Opinions

  3. Motivate • Under the development of global networking through Internet, the amount of articles in Chinese or other oriental languages is increasing rapidly. • As the lack of explicit separator, word segmentation is a precondition for the processing of these character-based languages and thus affecting the whole system in performance.

  4. Objective • This paper propose a new solution for Chinese word segmentation problem based on lexicon named double-character-and-long-world-hash-indexing (DCLWHI). • This method can improve the speed and efficiency of word segmentation without extra memory spending, and gains the same accuracy.

  5. Introduction • The current methods of Chinese word segmentation are divided into two kinds • Lexicon • Easily accomplished, high level arithmetic efficiency • Out of vocabulary problem (OOV) • (new words, names of people, organizations and locations) • Frequency statistic • Has the advantage on OOV problems • But the arithmetic efficiency is much lower than the lexicon based method.

  6. A New Lexicon Mechanism • The Double-Character words hold large proportion in Chinese words. • 70% are double-character word [4] • Make a hash indexing for the first two characters of the lexicon words, then add the remaining string into a special long word table, which has a hash indexing.

  7. A New Lexicon Mechanism • First-Double-Character-Hash-Indexing • Flag Bit(2Bytes) • If the two-character is a prefix of a word which length is N, the big N-1 of the 2 bytes will be set 1; • Exaple “圖籍” , which is a double-character word, but can’t be the prefix of other words, So the Flag Big of 圖籍 is set 0000000000000010(0x0002) • 電老 is not a Chinese word, but it can be a prefix of a word 電老虎. So the Flag Bit is 0000000000000100(0x0004) • Similar examples : 春夏(ox000A),君子(x0006)、敢作(x0008) • Long Word Hash Indexing • Similar to the First-Double-Character-Hash-Indexing. 2-character 0000 0000 0000 0010 3-character 0000 0000 0000 0100 4-character 0000 0000 0000 1000

  8. A New Lexicon Mechanism • Example of Search->君子當圖籍是電老虎 • Pick up first two characters “君子”, Flag Big is x0006 can be a 2-character or a prefix of a Treble-Character word. • Then shift to the character “當”, compute the hash value of the substring “君子當”, search in the long word • Find the marching index, confirm the string , marching succeed. • Shift to Character “圖籍” (0x002) • Shift to Character “是電” • There is no value in hash-indexing, 2 situations may happen • First, there is no value in hash-indexing, return one character “是” • Second, there is a substring in the index, but value unequally; return one character “是” • Shift to Character電老”(0x004) • Shift to Character “虎” 君子當圖籍是電老虎 君子當圖籍是電老虎 君子當圖籍是電老虎 君子當圖籍是電老虎 君子當圖籍是電老虎 君子當圖籍是電老虎

  9. Experiments • Comparison of Searching Cycles • Comparison of Memory Space Cost • Comparison of Speed

  10. Background • Binary-Seek-by-Word • Composed of three parts • Lexicon text, word-index-table, first-character-index-table

  11. Background (cont.) • TRIE indexing tree • is a multi-chain-table tree, the mechanism is composed of two parts: • First-character-index table and TRIE index-tree node • Didn’t need to predict the length of the word , only need to match the word by chain-tree

  12. Background (cont.) • Binary-Seek-by-Characters • Absorbs the search-advantage in TRIE indexing tree, using searching by characters not searching by words

  13. Background (cont.) • Summary above methods’ drawbacks • Binary-seek-by-word is using full-words marching, the efficiency is evidently low. • The design and maintenance of the TRIE tree is very complex, wastes mass memory space • Binary-seek-by characters • Improves some aspects, but it doesn’t change the data structure of the binary-seek-by-word which restrict the efficiency.

  14. Some novel schemes • Double-Character-Hash- indexing[4] • An new searching tree improved from the TRIE indexing tree. • Composed of two parts: Hashing index, remaining strings. • Can avoids the deep searching , increases the segment speed without complex increasing.

  15. Some novel schemes (cont.) • A new lexicon mechanism based on PATRICIA[3] • Use of the ISN (internal statement number) of the words as the key words bit-string, • Constructs the PATRICIA tree by comparing the big-string. • Advantage • The searching process only need some cycles of bit comparison and some cycles of string comparison. • Double-Array Trie[1] • Even node in the tree stands for a status of an auto-machine, • Which changes according to the difference of the variable. • This new structure actually is an improved scheme of the TRIE tree, using 2 linear arrays to express the TRIE tree

  16. Some novel schemes (cont.) 用負的base值表示該位置為詞語。 如果狀態i對應某一個詞,而且Base[i]=0,那麼令Base[i]=(-1)*i, 如果Base[i]的值不是0,那麼令Base[i]=(-1)*Base[i]。得到雙陣列如下: 例如設“阿根”的下標為i=8,那麼check[i]的內容是“阿”的下標, 而base[i]是“阿根廷”的下標的基值。 “廷”的序列碼為x=8,那麼“阿根廷”的下標為base[i]+x=base[8]+8=12。

  17. Some novel schemes (cont.) • Double-code scheme [1] • Basic idea is mapping the 6768 Chinese characters in GB-2312 into the sequence-code from 1 to 6768. • Every string written in Chinese can only maps to a number string, • Composed of two steps: • Switch from number-sequence into even-coding • Establish indexing mechanism

  18. Analysis of the novel schemes • Double-Character-Hash-Indexing • Improvement of TRIE index tree, while it is easier structured and maintained than the former mechanism. • PATRICIA • Is a super arithmetic in segment speed, but it waste on the memory space and reduce the efficiency. • Double-Array Trie • When decrease or increase the lexicon, the whole double-array should be adjusted. • Double-code scheme • The extract rate of the arithmetic is not good enough, which result in a very big array, restrict the performance of the search efficiency

  19. 大白 0x00E 大白日夢 大白日 大白 大白日 大白日夢 Experiment Detail

  20. Experiment Detail

  21. Conclusions • Our mechanism DCLWHI farther improves the speed and efficiency of segmentation. • The scheme A has a very high process speed but costs too much memory space, while scheme B costs less storage with a high efficiency. We think it a good eclectic mechanism for Chinese word segmentation.

  22. Opinions • Experiments are not enough to evidence this method is very well. • …..

More Related