This paper presents a system that combines Maximum Matching Segmentation (MMS) and Self-Matching Segmentation (SMS) to identify unknown words in Chinese text. Experiments report an identification ratio above 85% for unknown words and a 10% increase in recall and precision compared with the MMS System.
A Chinese segmentation system based on document self-matching for identifying the unknown words. Advisor: Dr. Hsu. Presenter: Wen-Hsiang Hu. Authors: Yue-Heng Sun, Pi-Lian He, Guang-Yuan Wu. Proceedings of the Second International Conference on Machine Learning and Cybernetics, November 2003, pages 2080-2084.
Outline • Motivation • Objective • Introduction • MMS System • System Model • Experiments • Conclusion • Personal Opinion
Motivation • Unknown words can be considered good indexing terms in IR. • The problem is how to identify new unknown words that do not exist in the dictionary.
Objective • Identify the unknown words that don’t exist in the dictionary.
Introduction • The system combines two segmentation systems. • Maximum Matching Segmentation (MMS) serves as a preparatory step: if a word in the document is found in the dictionary, the match succeeds. • Self-Matching Segmentation (SMS) identifies the unknown words.
MMS System • Let D = the dictionary, Max = the number of characters in the longest term in the dictionary, and Str = the string to be segmented. • D = 恐怖份子、破壞、世界和平…… • Max = 4 • Str = 恐怖份子本拉登欲破壞世界和平……
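The matching step on this slide can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes the forward (left-to-right) variant of maximum matching, and the function names `mms_segment` and `unmatched_fragments` are hypothetical. The second helper joins consecutive characters that MMS could not match, which is one plausible way to obtain strings such as 本拉登欲.

```python
def mms_segment(text, dictionary, max_len):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first, shrinking down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
        else:
            # No multi-character dictionary word starts here: emit one character.
            tokens.append(text[i])
            i += 1
    return tokens


def unmatched_fragments(tokens, dictionary):
    """Join runs of consecutive non-dictionary tokens into candidate strings."""
    fragments, run = [], []
    for tok in tokens:
        if tok in dictionary:
            if run:
                fragments.append("".join(run))
                run = []
        else:
            run.append(tok)
    if run:
        fragments.append("".join(run))
    return fragments


# D, Max and Str from the slide (the elided parts are left out).
D = {"恐怖份子", "破壞", "世界和平"}
tokens = mms_segment("恐怖份子本拉登欲破壞世界和平", D, max_len=4)
print(tokens)
print(unmatched_fragments(tokens, D))
```

With the slide's example, the dictionary words are matched whole and the remaining characters come out one at a time, so the only unmatched fragment is 本拉登欲.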
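SMS System • System Model: the text first passes through MMS, and the strings left unmatched (e.g. 本拉登欲、但為了……) are handed to SMS. • Use IDF to measure the discrimination competence of an unknown word w. The formula itself is missing from this text; the standard definition is IDF(w) = log(N / n_w), where N is the number of documents in the collection and n_w is the document frequency, i.e. the number of documents that contain the unknown word w.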
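The IDF weighting mentioned on this slide can be sketched as follows. This assumes the standard definition IDF(w) = log(N / n_w) with the natural logarithm; the paper's exact variant is not given here. The function name `idf` and the three toy documents are made up for illustration.

```python
import math


def idf(word, documents):
    """IDF(w) = log(N / n_w): N documents in total, n_w of them contain w."""
    n_docs = len(documents)
    # Document frequency: count documents in which the word occurs at least once.
    n_w = sum(1 for doc in documents if word in doc)
    if n_w == 0:
        raise ValueError("word does not occur in any document")
    return math.log(n_docs / n_w)


# Toy collection (hypothetical, not from NTCIR-3).
docs = ["本拉登欲破壞世界和平", "世界和平很重要", "經濟新聞"]
print(idf("本拉登", docs))    # occurs in 1 of 3 documents
print(idf("世界和平", docs))  # occurs in 2 of 3 documents
```

A word that appears in fewer documents receives a higher IDF, which is why a rare unknown word such as 本拉登 scores above 世界和平 in this toy collection.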
Experiments • Experimental Environment • Use NTCIR-3 Chinese documents • Five types of text collections: economy, society, technology, amusement and sports • Table 1 provides the identification ratio of the unknown words of the five types of texts in SMS System
Frequency Feature • [Figure: plot of frequency of occurrence against rank order, with regions labelled common words, significant words, and rare words.] • Significant words are content bearing in some ways and may serve as potential indexing terms.
Ability as Good Indexing Terms • If removing an unknown word leads to a decrease in the dissimilarity between the query and the document, then the word can be considered a good indexing term. • In the dissimilarity measure, l is the number of indexing terms, d_ik is the weight of the kth indexing term in document D_i, and q_jk is the weight of the kth indexing term in query Q_j.
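The removal test on this slide can be illustrated with a short sketch. The dissimilarity formula is not reproduced in this text, so the code below substitutes an assumed measure, 1 minus cosine similarity over the term-weight vectors; the paper may use a different one. The function names and the example weights are hypothetical.

```python
import math


def cosine_dissimilarity(d, q):
    """1 - cosine similarity of two weight vectors (an assumed measure)."""
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(x * x for x in q))
    if norm == 0:
        return 1.0  # treat an empty vector as maximally dissimilar
    return 1.0 - dot / norm


def dissimilarity_change(d, q, k):
    """Dissimilarity after removing term k, minus dissimilarity before."""
    d_without = d[:k] + d[k + 1:]
    q_without = q[:k] + q[k + 1:]
    return cosine_dissimilarity(d_without, q_without) - cosine_dissimilarity(d, q)


# Made-up weights for l = 3 indexing terms.
d_i = [1.0, 0.0, 2.0]  # weights d_ik in document D_i
q_j = [0.0, 1.0, 2.0]  # weights q_jk in query Q_j
for k in range(3):
    print(k, dissimilarity_change(d_i, q_j, k))
```

A negative value means the dissimilarity decreased when term k was removed, which is the condition the slide states.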
Conclusion • This paper presents an SMS System built on the MMS method. • The system achieves an identification ratio above 85% for unknown words and a 10% increase in recall and precision compared with the MMS System.
Personal Opinion • Drawback • The paper does not report measured IDF values.