A Chinese segmentation system based on document self-matching for identifying the unknown words

This paper presents a system that combines Maximum Matching Segmentation (MMS) with Self-Matching Segmentation (SMS) to identify unknown words, i.e. words missing from the dictionary, in Chinese text. The reported experiments show an identification ratio above 85% for unknown words and a 10% increase in recall and precision compared with the MMS system.

Presentation Transcript


  1. A Chinese segmentation system based on document self-matching for identifying the unknown words • Advisor: Dr. Hsu • Presenter: Wen-Hsiang Hu • Authors: Yue-Heng Sun, Pi-Lian He, Guang-Yuan Wu • Source: Proceedings of the Second International Conference on Machine Learning and Cybernetics, November 2003, pp. 2080-2084

  2. Outline • Motivation • Objective • Introduction • MMS System • System Model • Experiments • Conclusion • Personal Opinion

  3. Motivation • Unknown words can be good indexing terms in IR. • The problem is how to identify new unknown words that do not exist in the dictionary.

  4. Objective • Identify the unknown words that don’t exist in the dictionary.

  5. Introduction • The system combines two segmentation systems. • Maximum Matching Segmentation (MMS) serves as a preparatory step: if a word in the document is found in the dictionary, the match succeeds. • Self-Matching Segmentation (SMS) identifies the unknown words.

  6. MMS System • Let D = the dictionary, Max = the number of characters in the longest term in the dictionary, and Str = the string to be segmented. • D = 恐怖份子、破壞、世界和平…… • Max = 4 • Str = 恐怖份子本拉登欲破壞世界和平……
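The MMS step can be sketched in Python using the slide's own D, Max and Str. This is a minimal sketch, not the authors' implementation: it assumes forward (left-to-right) matching, which the slide does not specify, and the helper `unmatched_runs`, which joins consecutive unmatched characters into a residual string, is a hypothetical addition for illustration.

```python
def max_match(text, dictionary, max_len):
    """Forward maximum matching (assumed direction).

    Returns a list of (token, in_dictionary) pairs.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to 2 characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append((candidate, True))
                i += size
                break
        else:
            # No multi-character entry starts here: emit one character.
            ch = text[i]
            tokens.append((ch, ch in dictionary))
            i += 1
    return tokens


def unmatched_runs(tokens):
    """Hypothetical helper: join consecutive unmatched characters."""
    runs, current = [], ""
    for token, known in tokens:
        if known:
            if current:
                runs.append(current)
                current = ""
        else:
            current += token
    if current:
        runs.append(current)
    return runs


# Values taken from the slide (the elided parts are left out).
D = {"恐怖份子", "破壞", "世界和平"}
MAX = 4
STR = "恐怖份子本拉登欲破壞世界和平"

tokens = max_match(STR, D, MAX)
residual = unmatched_runs(tokens)
```

On this input the only residual string is 本拉登欲, the same string the next slide uses as its running example.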

  7. SMS System • System Model [Figure: system model diagram; MMS yields strings such as 本拉登欲、但為了……, and 本拉登欲 is the running example at each stage] • Use IDF to measure the discrimination competence of an unknown word w, where N is the number of documents in the collection and n_w is the document frequency, i.e. the number of documents containing w.
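The IDF formula itself did not survive in this transcript. A common definition that uses the same symbols is idf(w) = log(N / n_w); the sketch below assumes that form, and the paper may use a different log base or a smoothing term. The four documents are made-up toy data, not taken from NTCIR-3.

```python
import math


def document_frequency(word, documents):
    """n_w: the number of documents that contain the word."""
    return sum(1 for doc in documents if word in doc)


def idf(word, documents):
    """Assumed form: log(N / n_w), with N the collection size."""
    n_w = document_frequency(word, documents)
    if n_w == 0:
        raise ValueError("word does not occur in the collection")
    return math.log(len(documents) / n_w)


# Toy collection (hypothetical), N = 4.
docs = [
    "恐怖份子本拉登欲破壞世界和平",
    "各國呼籲世界和平",
    "股市今日上漲",
    "球隊贏得冠軍",
]

idf_unknown = idf("本拉登", docs)    # occurs in 1 of 4 documents
idf_known = idf("世界和平", docs)    # occurs in 2 of 4 documents
```

Under this assumed form, a candidate that appears in fewer documents gets a higher IDF, which is the sense in which IDF measures discrimination competence.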

  8. Experiments • Experimental Environment • Use NTCIR-3 Chinese documents • Five types of text collections: economy, society, technology, amusement and sports • Table 1 provides the identification ratio of the unknown words of the five types of texts in SMS System

  9. Experiments (cont.)

  10. Frequency Feature • A plot of frequency of occurrence against rank order, with regions labelled common words, significant words and rare words. • Significant words are content-bearing in some way and may serve as potential indexing terms. [Figure: frequency vs. rank order; the order axis is marked at 4 and 8]

  11. Ability as Good Indexing Terms • If removing an unknown word leads to a decrease in the dissimilarity between the query and the document, the word can be considered a good indexing term. • In the dissimilarity measure, l is the number of indexing terms, d_ik is the weight of the kth indexing term in document D_i, and q_jk is the weight of the kth indexing term in query Q_j.
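The dissimilarity formula is not preserved in this transcript, so the removal test can only be illustrated under an assumption. The sketch below uses 1 minus cosine similarity over the l term weights as a stand-in measure; the paper's actual measure may differ, and the weights are made-up values.

```python
import math


def dissimilarity(d, q):
    """Assumed measure: 1 - cosine similarity of two weight vectors."""
    dot = sum(dk * qk for dk, qk in zip(d, q))
    norm_d = math.sqrt(sum(dk * dk for dk in d))
    norm_q = math.sqrt(sum(qk * qk for qk in q))
    if norm_d == 0 or norm_q == 0:
        return 1.0  # treat an empty vector as maximally dissimilar
    return 1.0 - dot / (norm_d * norm_q)


def dissimilarity_without(d, q, k):
    """Dissimilarity after dropping the kth indexing term from both."""
    d_rest = d[:k] + d[k + 1:]
    q_rest = q[:k] + q[k + 1:]
    return dissimilarity(d_rest, q_rest)


def decreases_dissimilarity(d, q, k):
    """The slide's criterion: does removing term k lower dissimilarity?"""
    return dissimilarity_without(d, q, k) < dissimilarity(d, q)


# Hypothetical weights for l = 3 indexing terms.
d_i = [0.5, 0.2, 0.9]   # document D_i
q_j = [0.4, 0.3, 0.0]   # query Q_j

result_term_2 = decreases_dissimilarity(d_i, q_j, 2)
result_term_0 = decreases_dissimilarity(d_i, q_j, 0)
```

With these toy weights, removing term 2 lowers the assumed dissimilarity while removing term 0 raises it, so only term 2 meets the criterion as stated on the slide.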

  12. Conclusion • This paper presents an SMS system built on the MMS method. • The system achieves an identification ratio above 85% for unknown words and a 10% increase in recall and precision compared with the MMS system.

  13. Personal Opinion • Drawback • The paper does not report measured IDF values.