A Chinese segmentation system based on document self-matching for identifying the unknown words

Advisor: Dr. Hsu
Presenter: Wen-Hsiang Hu
Authors: Yue-Heng Sun, Pi-Lian He, Guang-Yuan Wu
Source: Proceedings of the Second International Conference on Machine Learning and Cybernetics, November 2003, pages 2080-2084

Outline
• Motivation
• Objective
• Introduction
• MMS System
• System Model
• Experiments
• Conclusion
• Personal Opinion

Motivation
• Unknown words can serve as good indexing terms in information retrieval (IR).
• The problem is how to identify new unknown words, that is, words that do not exist in the dictionary.

Objective
• Identify the unknown words that don’t exist in the dictionary.

Introduction
• The system combines two segmentation systems.
• Maximum Matching Segmentation (MMS) serves as a preparatory step. A match succeeds when a word contained in the document is found in the dictionary.
• Self-Matching Segmentation (SMS) identifies the unknown words.

MMS System
• Let D = the dictionary, Max = the number of characters in the longest term in the dictionary, and Str = the string to be segmented.
• D = 恐怖份子 (terrorist), 破壞 (destroy), 世界和平 (world peace), ……
• Max = 4
• Str = 恐怖份子本拉登欲破壞世界和平…… (roughly, "the terrorist bin Laden wants to destroy world peace ……")
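The matching step can be sketched in a few lines of Python. This is a minimal sketch, assuming a forward (left-to-right) scan that tries the longest candidate first; the slide does not say which direction the authors use. The names max_match and unmatched_runs are hypothetical helpers introduced for illustration, and the dictionary holds only the three terms shown on the slide.

```python
def max_match(text, dictionary, max_len):
    """Forward maximum matching. Returns (token, in_dictionary) pairs."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to 2 characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                tokens.append((candidate, True))
                i += size
                break
        else:
            # No dictionary word starts here: emit one unmatched character.
            tokens.append((text[i], False))
            i += 1
    return tokens


def unmatched_runs(tokens):
    """Join consecutive unmatched characters into candidate strings."""
    runs, current = [], ""
    for token, known in tokens:
        if known:
            if current:
                runs.append(current)
            current = ""
        else:
            current += token
    if current:
        runs.append(current)
    return runs


# Toy data taken from the slide (D, Max and Str).
dictionary = {"恐怖份子", "破壞", "世界和平"}
max_len = max(len(word) for word in dictionary)
text = "恐怖份子本拉登欲破壞世界和平"

tokens = max_match(text, dictionary, max_len)
leftovers = unmatched_runs(tokens)
print(tokens)
print(leftovers)
```

With this toy dictionary the only leftover run is 本拉登欲, the string the later slides use as the running example for the SMS stage.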

SMS System
System Model
[Figure: flow diagram of the system model, with the running example 本拉登欲 shown at each stage, including the MMS stage; one stage shows the string 本拉登欲、但為了……]
• IDF is used to measure the discrimination competence of an unknown word w, where N is the number of documents in the collection and n_w is the document frequency, that is, the number of documents that contain w.
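As an illustration of the IDF step, here is a minimal sketch, assuming the common logarithmic form idf(w) = log(N / n_w); the paper's exact variant (log base, smoothing) may differ. The function name idf and the four-document collection are hypothetical, and substring containment stands in for a real occurrence test.

```python
import math


def idf(word, documents):
    """Inverse document frequency, assuming the form log(N / n_w)."""
    n_docs = len(documents)  # N: number of documents in the collection
    # n_w: number of documents that contain the word (substring test).
    doc_freq = sum(1 for doc in documents if word in doc)
    if doc_freq == 0:
        raise ValueError("word does not occur in any document")
    return math.log(n_docs / doc_freq)


# Hypothetical toy collection, for illustration only.
docs = [
    "恐怖份子本拉登欲破壞世界和平",
    "本拉登欲",
    "世界和平",
    "世界和平與經濟",
]

rare_score = idf("本拉登", docs)      # occurs in 2 of 4 documents
common_score = idf("世界和平", docs)  # occurs in 3 of 4 documents
print(rare_score, common_score)
```

Under this form, a candidate that appears in fewer documents gets a higher score, which is the sense in which IDF expresses discrimination competence.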

Experiments
• Experimental environment
• Uses NTCIR-3 Chinese documents
• Five types of text collections: economy, society, technology, amusement and sports
• Table 1 gives the identification ratio of unknown words achieved by the SMS System for the five types of text

Experiments (cont.)

Frequency Feature
• A plot of frequency of occurrence against rank order divides words into common words, significant words and rare words.
• Significant words are content bearing in some way and are potential indexing terms.

Ability as Good Indexing Terms
• If removing an unknown word leads to a decrease in the dissimilarity between the query and the document, then the word can be considered a good indexing term.
• In the dissimilarity measure, l is the number of indexing terms, d_ik is the weight of the kth indexing term in document D_i, and q_jk is the weight of the kth indexing term in query Q_j.
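The removal test stated above can be sketched as follows. The dissimilarity measure here is an assumption: one minus cosine similarity over the l term weights is swapped in for illustration and may not be the measure the paper uses. The function names and weight vectors are hypothetical.

```python
import math


def dissimilarity(d, q):
    """Assumed measure: 1 - cosine similarity between weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    # Treat an all-zero vector as maximally dissimilar.
    return 1.0 if norm == 0 else 1.0 - dot / norm


def removal_reduces_dissimilarity(d, q, k):
    """True if dropping the k-th indexing term lowers the dissimilarity."""
    before = dissimilarity(d, q)
    after = dissimilarity(d[:k] + d[k + 1:], q[:k] + q[k + 1:])
    return after < before


# Hypothetical weights for l = 3 indexing terms; index 2 is the unknown word.
doc_weights = [0.5, 0.2, 0.9]    # d_ik for document D_i
query_weights = [0.4, 0.3, 0.0]  # q_jk for query Q_j

passes = removal_reduces_dissimilarity(doc_weights, query_weights, 2)
print(passes)
```

The function only applies the criterion as worded on the slide; whether a word passes depends on the dissimilarity measure chosen.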

Conclusion
• This paper presents an SMS System built on the MMS method.
• The system achieves an identification ratio above 85% for unknown words and a 10% increase in recall and precision compared with the MMS System.

Personal Opinion
• Drawback
• The paper does not report measured IDF values.
