A New Lexicon Mechanism for Chinese Word Segmentation. Advisor: Dr. Hsu. Graduate: Kuo-min Wang. 2006 PACIS.
A New Lexicon Mechanism for Chinese Word Segmentation • Advisor: Dr. Hsu • Graduate: Kuo-min Wang • 2006 PACIS
Outline • Motivation • Objective • Introduction • A New Lexicon Mechanism • Experiments • Conclusion • Personal Opinions
Motivation • With the development of global networking through the Internet, the number of articles in Chinese and other oriental languages is increasing rapidly. • Because these languages lack explicit word separators, word segmentation is a precondition for processing them and thus affects the performance of the whole system.
Objective • This paper proposes a new lexicon-based solution to the Chinese word segmentation problem, named double-character-and-long-word-hash-indexing (DCLWHI). • This method improves the speed and efficiency of word segmentation without extra memory cost while achieving the same accuracy.
Introduction • Current Chinese word segmentation methods fall into two kinds. • Lexicon-based: easy to implement and algorithmically efficient, but suffers from the out-of-vocabulary (OOV) problem (new words, names of people, organizations and locations). • Frequency-statistics-based: handles OOV words better, but its algorithmic efficiency is much lower than that of the lexicon-based method.
A New Lexicon Mechanism • Double-character words make up a large proportion of Chinese words: 70% are double-character words [4]. • Build a hash index on the first two characters of each lexicon word, then store the remaining string in a separate long-word table, which has its own hash index.
A New Lexicon Mechanism • First-Double-Character-Hash-Indexing • Flag bits (2 bytes): if the two-character string is the prefix of a word of length N, bit N-1 of the 2 bytes (counting the least significant bit as bit 0) is set to 1. • Example: “圖籍” is a double-character word but cannot be the prefix of any other word, so its flag is 0000000000000010 (0x0002). • “電老” is not a Chinese word, but it is the prefix of the word “電老虎”, so its flag is 0000000000000100 (0x0004). • Similar examples: 春夏 (0x000A), 君子 (0x0006), 敢作 (0x0008). • Long Word Hash Indexing • Similar to the First-Double-Character-Hash-Indexing. • Flag values by word length: 2-character 0000 0000 0000 0010; 3-character 0000 0000 0000 0100; 4-character 0000 0000 0000 1000.
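The flag-bit rule on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `flag_bits` and the toy lexicon are hypothetical, and the 16-word-length cap simply reflects the 2-byte flag.

```python
def flag_bits(prefix, lexicon):
    """Return the 16-bit flag for a two-character prefix.

    Bit N-1 (bit 0 is the least significant) is set when the lexicon
    contains a word of length N that starts with `prefix`.
    """
    flag = 0
    for word in lexicon:
        if word.startswith(prefix) and 2 <= len(word) <= 16:
            flag |= 1 << (len(word) - 1)
    return flag


# Toy lexicon, assumed for illustration only (not the paper's data).
LEXICON = {"圖籍", "電老虎", "君子", "君子當", "春夏", "春夏秋冬", "敢作敢為"}

for prefix in ("圖籍", "電老", "君子", "春夏", "敢作"):
    print(prefix, f"0x{flag_bits(prefix, LEXICON):04X}")
```

With this toy lexicon the printed flags agree with the values quoted on the slide, for example 0x000A for 春夏 (a two-character word that is also the prefix of a four-character word).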
A New Lexicon Mechanism • Example of search: 君子當圖籍是電老虎 • Pick up the first two characters “君子”. The flag is 0x0006, so it can be a two-character word or the prefix of a three-character word. • Shift to the character “當”, compute the hash value of the substring “君子當”, and search the long-word table. • The matching index is found and the string is confirmed, so the match succeeds. • Shift to “圖籍” (0x0002). • Shift to “是電”. • No match is found, which can happen in two ways. • First, there is no entry in the hash index; return the single character “是”. • Second, there is an entry in the index but the stored string differs; return the single character “是”. • Shift to “電老” (0x0004). • Shift to “虎”.
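The search walk-through above can be approximated with the following Python sketch. It is not the authors' implementation: `build_index` and `segment` are hypothetical names, Python's built-in `dict` and `set` stand in for the two hash indexes, whole long words are stored rather than only the remaining strings, and the four-word lexicon is assumed from the slide's example.

```python
MAX_LEN = 16  # a 2-byte flag can describe word lengths up to 16


def build_index(lexicon):
    """Build the two-character flag table and the long-word table."""
    flags = {}
    long_words = set()
    for word in lexicon:
        n = len(word)
        if not 2 <= n <= MAX_LEN:
            continue
        flags[word[:2]] = flags.get(word[:2], 0) | (1 << (n - 1))
        if n > 2:
            long_words.add(word)
    return flags, long_words


def segment(text, flags, long_words):
    """Split text into words, preferring the longest lexicon match."""
    out = []
    i = 0
    while i < len(text):
        flag = flags.get(text[i:i + 2], 0)
        match_len = 1  # fall back to a single character
        # Only lengths whose flag bit is set are worth probing.
        for n in range(min(MAX_LEN, len(text) - i), 1, -1):
            if not flag & (1 << (n - 1)):
                continue
            if n == 2 or text[i:i + n] in long_words:
                match_len = n
                break
        out.append(text[i:i + match_len])
        i += match_len
    return out


# Lexicon assumed from the slide's example.
flags, long_words = build_index({"君子", "君子當", "圖籍", "電老虎"})
print(segment("君子當圖籍是電老虎", flags, long_words))
```

The flag test skips every candidate length that cannot exist for the current prefix, which is where the scheme saves lookups compared with probing all lengths.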
Experiments • Comparison of Searching Cycles • Comparison of Memory Space Cost • Comparison of Speed
Background • Binary-Seek-by-Word • Composed of three parts: lexicon text, word-index table, and first-character-index table.
Background (cont.) • TRIE indexing tree • A multi-linked-list tree whose mechanism is composed of two parts: the first-character-index table and the TRIE index-tree nodes. • There is no need to predict the length of the word; the word is matched by following the chain through the tree.
Background (cont.) • Binary-Seek-by-Characters • Adopts the search advantage of the TRIE indexing tree by searching character by character rather than word by word.
Background (cont.) • Summary of the drawbacks of the above methods • Binary-seek-by-word uses full-word matching, so its efficiency is evidently low. • The TRIE tree is complex to design and maintain and wastes a large amount of memory. • Binary-seek-by-characters improves some aspects, but it does not change the data structure of binary-seek-by-word, which restricts its efficiency.
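To illustrate why a TRIE needs no word-length prediction, here is a minimal Python sketch that uses nested dictionaries as nodes. The names `trie_insert`, `longest_match` and `END` are hypothetical, and the linked-list node layout described on the slide is replaced by plain dictionaries for brevity.

```python
END = ""  # end-of-word marker; no real character equals the empty string


def trie_insert(root, word):
    """Insert a word into the trie, one character per level."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True


def longest_match(root, text, start):
    """Return the length of the longest lexicon word starting at `start`."""
    node = root
    best = 0
    for offset, ch in enumerate(text[start:], 1):
        node = node.get(ch)
        if node is None:
            break  # no lexicon word continues with this character
        if END in node:
            best = offset
    return best


root = {}
for w in ("大白", "大白日夢"):
    trie_insert(root, w)
print(longest_match(root, "大白日夢想", 0))
```

The search simply walks down the tree until no child matches, remembering the last node that ended a word.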
Some novel schemes • Double-Character-Hash-Indexing [4] • A new search tree improved from the TRIE indexing tree. • Composed of two parts: hash index and remaining strings. • Avoids deep searching and increases segmentation speed without adding complexity.
Some novel schemes (cont.) • A new lexicon mechanism based on PATRICIA [3] • Uses the ISN (internal statement number) of each word as the key bit-string. • Constructs the PATRICIA tree by comparing the bit-strings. • Advantage: the search process needs only a few cycles of bit comparison and a few cycles of string comparison. • Double-Array Trie [1] • Every node in the tree stands for a state of an automaton, which changes according to the input variable. • This structure is an improved form of the TRIE tree that uses two linear arrays to represent the tree.
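Some novel schemes (cont.) • A negative base value marks the position as a word. • If state i corresponds to a word and Base[i]=0, set Base[i]=(-1)*i; if Base[i] is not 0, set Base[i]=(-1)*Base[i]. This yields the double array. • For example, let the index of “阿根” be i=8; then check[i] holds the index of “阿”, and base[i] is the base value for the index of “阿根廷”. • The sequence code of “廷” is x=8, so the index of “阿根廷” is base[i]+x=base[8]+8=12.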
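The transition arithmetic in this example can be sketched in Python as follows. Only i=8, x=8 and the resulting index 12 come from the slide; the array size, the value base[8]=4 (implied by base[8]+8=12), the -1 filler in `check`, and the names `step` and `is_word` are assumptions for illustration.

```python
def step(base, check, i, x):
    """Follow one transition from state i on character code x.

    Returns the next state index, or None if no such transition exists.
    abs() is used because a negative base only marks a word end.
    """
    t = abs(base[i]) + x
    if 0 <= t < len(check) and check[t] == i:
        return t
    return None


def is_word(base, i):
    """A negative base value marks state i as the end of a word."""
    return base[i] < 0


# Tiny hand-built arrays covering only the slide's example (assumed values).
base = [0] * 13
check = [-1] * 13
base[8] = 4      # state 8 stands for "阿根"; 4 + 8 = 12
check[12] = 8    # state 12 was reached from state 8
base[12] = -12   # "阿根廷" is a word with no children: Base = (-1) * i

nxt = step(base, check, 8, 8)
print(nxt, is_word(base, nxt))
```

The `check` comparison is what rejects transitions that happen to land on a slot owned by a different parent state.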
Some novel schemes (cont.) • Double-code scheme [1] • The basic idea is to map the 6768 Chinese characters in GB-2312 to sequence codes from 1 to 6768. • Every string written in Chinese then maps to exactly one number string. • Composed of two steps: • Convert the number sequence into even-coding. • Establish the indexing mechanism.
Analysis of the novel schemes • Double-Character-Hash-Indexing • An improvement on the TRIE index tree that is easier to build and maintain than the former mechanism. • PATRICIA • An excellent algorithm in terms of segmentation speed, but it wastes memory space, which reduces efficiency. • Double-Array Trie • Whenever words are added to or removed from the lexicon, the whole double array must be adjusted. • Double-code scheme • The extraction rate of the algorithm is not good enough, which results in a very large array and restricts search efficiency.
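One plausible way to obtain sequence codes from 1 to 6768 is to number the GB 2312 hanzi area row by row (72 rows of 94 cells). The slide does not give the formula, so the mapping below is an assumption, and `sequence_code` is a hypothetical name; the sketch relies on Python's standard `gb2312` codec.

```python
def sequence_code(ch):
    """Map one GB 2312 hanzi to a sequence code in 1..6768 (assumed layout)."""
    data = ch.encode("gb2312")  # raises UnicodeEncodeError if not in GB 2312
    if len(data) != 2:
        raise ValueError("not a double-byte GB 2312 character")
    hi, lo = data
    if not (0xB0 <= hi <= 0xF7 and 0xA1 <= lo <= 0xFE):
        raise ValueError("outside the GB 2312 hanzi area")
    # Row-major numbering: 94 cells per row, starting from 1.
    return (hi - 0xB0) * 94 + (lo - 0xA1) + 1


def to_codes(text):
    """Convert a Chinese string to its number string (list of codes)."""
    return [sequence_code(ch) for ch in text]


print(to_codes("啊阿"))
```

Traditional characters such as 圖 are not part of GB 2312, so this sketch only accepts simplified text.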
Experiment Detail • Example entry: 大白, flag 0x000E, covering the words 大白, 大白日 and 大白日夢.
Conclusions • Our mechanism DCLWHI further improves the speed and efficiency of segmentation. • Scheme A has a very high processing speed but costs too much memory, while scheme B costs less storage and still has high efficiency. We consider it a good compromise mechanism for Chinese word segmentation.
Opinions • The experiments are not sufficient to demonstrate that this method performs well. • …..