Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification 黃居仁 Chu-Ren Huang Academia Sinica http://cwn.ling.sinica.edu.tw/huang/huang.htm April 11, 2007, Hong Kong Polytechnic University
Citation • Please note that this is our ongoing work that will be presented later as Chu-Ren Huang, Petr Šimon, Shu-Kai Hsieh and Laurent Prévot. 2007. Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification. To appear in the proceedings of the 2007 ACL Annual Meeting.
Outline • Introduction: modeling and theoretical challenges • Previous Models • Segmentation as Tokenization • Character classification model • A radical model • Implementation and experiment • Conclusion/Implications
Introduction: modeling and theoretical challenges • Back to the basics: The goal of Chinese word segmentation is to identify wordbreaks • Such that these segmented units can be used as processing units (i.e. words) • Crucially • Words are not identified before segmentation • Wordbreaks in Chinese fall at character-breaks only, and at no other places
Challenge I Segmentation is the prerequisite task for all Chinese processing applications, hence a realistic solution to segmentation must be • Robust: perform consistently regardless of language variations • Scalable: applicable to all variants of Chinese, requiring minimal training • Portable: applicable to real-time processing of all kinds of texts, all the time
Challenge II Chinese speakers perform segmentation subconsciously and without mistakes, hence if we simulate human segmentation, it must: • Be robust, scalable, and portable • Not assume prior lexical knowledge • Be equally sensitive to known and unknown words
So Far, Not So Good • All existing algorithms perform reasonably well but require • A large set of training data • Long training time • A comprehensive lexicon • And the training process must be repeated with every new variant (topic/style/genre) But why?
Previous Models I: Segmentation as Tokenization The Classical Model (Chen and Liu 1992, etc.) • Segmentation is interpreted as the identification of tokens (e.g. words) in a text, hence involves two steps • Dictionary Lookup • Unknown Word (or OOV) Resolution
Segmentation as Tokenization 2 • Find all sequences Ci, …, Ci+m such that [Ci, …, Ci+m] is a token iff • it is an entry in the lexicon, or • it is not a lexical entry but is predicted to be one by an unknown word resolution algorithm • Ambiguity Resolution: needed when there is a Cj such that both [x Cj] and [Cj y] are entries in the lexicon (overlapping ambiguity)
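To make the dictionary-lookup step concrete, here is a minimal Python sketch of greedy forward maximum matching; this is a textbook stand-in, not the authors' actual algorithm, and the toy lexicon, example string, and function name are invented for illustration.

# Greedy forward maximum matching: at each position take the longest
# lexicon entry, falling back to a single character.
def forward_max_match(text, lexicon, max_len=4):
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]                      # single-character fallback
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in lexicon:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens

lexicon = {"研究", "研究生", "生命", "起源"}     # toy dictionary
print(forward_max_match("研究生命起源", lexicon))
# -> ['研究生', '命', '起源'], although the intended reading is 研究 / 生命 / 起源

The mismatch between the greedy output and the intended reading illustrates the overlapping ambiguity quantified on the next slide.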
Segmentation as Tokenization 3 • High Complexity: • mapping tens of thousands of lexical entries to even more possible matching strings • Overlapping ambiguity estimated to be up to 20%, depending on texts and lexica • Not Robust • Dependent on the lexicon (and lexica change constantly and are expensive to build) • OOV?
Previous Models II: Character Classification Currently Popular Model (Xue 2003, Gao et al. 2004) • Segmentation is re-interpreted as classification of character positions. • Classify and tag each character according to its position in a word (initial, final, middle, etc.) • Learn the distribution of such classifications from a corpus • Predict segmentation based on the positional classification of each character in a string
Character Classification 2 • Character Classification: • Each character Ci is associated with a 3-tuple <Ini_i, Mid_i, Fin_i>, where Ini_i, Mid_i, Fin_i are the probabilities for Ci to occur in initial, middle, or final position respectively. • Ambiguity Resolution: • Multiple classifications of a character: a character does not occur exclusively as initial or final, etc. • Conflicting classifications of neighboring characters.
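As a rough illustration of the statistics behind this model, here is a small Python sketch that estimates how often each character occurs word-initially, word-medially, word-finally, or as a single-character word in a toy pre-segmented corpus; real systems (e.g. Xue 2003) use richer tag sets and discriminative training, and the helper name and toy corpus are invented.

from collections import Counter, defaultdict

def position_profile(segmented_sentences):
    counts = defaultdict(Counter)            # char -> Counter over positions
    for words in segmented_sentences:
        for word in words:
            if len(word) == 1:
                counts[word]["single"] += 1
            else:
                counts[word[0]]["initial"] += 1
                for ch in word[1:-1]:
                    counts[ch]["middle"] += 1
                counts[word[-1]]["final"] += 1
    return {ch: {pos: n / sum(c.values()) for pos, n in c.items()}
            for ch, c in counts.items()}

corpus = [["中文", "斷詞", "研究"], ["研究生", "很", "多"]]   # toy segmented corpus
print(position_profile(corpus)["研"])        # -> {'initial': 1.0}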
Character Classification 3 • Less Complexity: • 6,000 characters x 3 to 10 positional classes • Higher Performance: 97% f-score on SigHAN bakeoff (Huang and Zhao 2006)
Character Classification 4 Inherent Modeling Problems • Segmentation becomes a second-order decision dependent on the first-order decision of character classification • Unnecessary complexity is involved • An inherent ceiling is set (segmentation cannot outperform character classification) • Still highly dependent on the lexicon • Character positions must be defined with prior lexical knowledge of a word
Our New Proposal Naïve but Radical • Segmentation is nothing but segmentation • Possible segmentation sites are well-defined without ambiguity: they are simply the character-breaks clearly marked in any text. • The task is simply to identify all CBs which also function as wordbreaks (WBs) • Based on distributional information extracted from the contexts surrounding the CBs (i.e. the characters)
Simple Formalization • Any Chinese text is envisioned as a sequence of character-breaks (CBs), evenly distributed among a sequence of characters (c's): CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn • NB: Psycholinguistic eye-tracking experiments show that the eyes can fixate on the edges of a character when reading Chinese. (J.L. Tsai, p.c.)
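A small Python sketch of this formalization, assuming a segmented sentence is given as a list of words: it derives the character sequence c1..cn and the parallel labels for CB0..CBn, True where the CB is also a WB (the function name is illustrative).

def cb_labels(words):
    chars = [ch for w in words for ch in w]
    labels = [True] + [False] * len(chars)   # CB0 .. CBn; CB0 starts a word
    pos = 0
    for w in words:
        pos += len(w)
        labels[pos] = True                   # a wordbreak after each word
    return chars, labels

chars, labels = cb_labels(["研究", "生命", "起源"])
print(chars)    # ['研', '究', '生', '命', '起', '源']
print(labels)   # [True, False, True, False, True, False, True]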
How to Model the Distributional Information of Blanks? • There is no overt difference between CBs and WBs, unlike English, where the CB spaces are small but the WB spaces are BIG. • Hence distributional information must come from the context. • CB0 c1 CB1 c2 ... CBi-1 ci CBi ... CBn-1 cn CBn • Overtly, CBs carry no distributional information. • However, the c's do carry information about the status of a CB/WB in their neighborhood (based on a tagged corpus, or human experience)
Range of Relevant Context CBi-2 CBi-1 ci CBi+1 CBi+2 • Recall that CBs carry no overt information, while c's do. • Linguistically, it is attested that the initial, final, second, and penultimate positions are morphologically significant. • In other words, a linguistic element can carry explicit information about the immediately adjacent CBs as well as the CBs immediately adjacent to those two • 2CB-Model: taking only the immediately adjacent CBs • 4CB-Model: taking two more
Collecting Distributional Information CBi-2 CBi-1 ci CBi+1 CBi+2 • Adopt either the 2CBM or the 4CBM • Collect a 2-tuple or 4-tuple for each character token from a segmented corpus • Sum up the n-tuple values of all tokens belonging to the same character type to form a distributional vector (Table 2: character table for the 4CBM)
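Here is a hedged Python sketch of how such a character table could be collected under the 4CBM, assuming the four vector slots <V1, V2, V3, V4> stand for the CBs two-left, one-left, one-right, and two-right of a character token; the normalization into relative frequencies and all names are illustrative, not the paper's exact procedure.

from collections import defaultdict

def build_4cbm_table(segmented_sentences):
    sums = defaultdict(lambda: [0, 0, 0, 0])     # char type -> summed 4-tuple
    counts = defaultdict(int)
    for words in segmented_sentences:
        chars = [ch for w in words for ch in w]
        wb = [False] * (len(chars) + 1)          # wb[k]: is CB_k a wordbreak?
        wb[0] = True
        pos = 0
        for w in words:
            pos += len(w)
            wb[pos] = True
        for i, ch in enumerate(chars):
            # character c_{i+1} sits between CB_i and CB_{i+1}; look two CBs
            # to each side: CB_{i-1}, CB_i, CB_{i+1}, CB_{i+2}
            for k, j in enumerate((i - 1, i, i + 1, i + 2)):
                if 0 <= j < len(wb) and wb[j]:
                    sums[ch][k] += 1
            counts[ch] += 1
    return {ch: [v / counts[ch] for v in vec] for ch, vec in sums.items()}

table = build_4cbm_table([["研究", "生命", "起源"], ["研究生", "很", "多"]])
print(table["究"])    # -> [1.0, 0.0, 0.5, 0.5]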
Estimating Distributional Features of CBs c-2 c-1 CB c+1 c+2 • For each CB, distributional information is contributed by the 2 or 4 adjacent characters • Each character carries the four-element vector given above; align the vector positions and then sum up • Note that no knowledge from a lexicon is involved (whereas the character classification model makes an explicit decision about a character's position within a word)
Aligning Vector Positions c-2 c-1 CB c+1 c+2 • c-2: < V1, V2, V3, V4 > • c-1: < V1, V2, V3, V4 > • c+1: < V1, V2, V3, V4 > • c+2: < V1, V2, V3, V4 >
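A hedged Python sketch of the alignment, continuing the slot convention of the previous sketch: the element of each neighboring character's vector that refers to the CB under consideration is picked out (V4 of c-2, V3 of c-1, V2 of c+1, V1 of c+2) and assembled into that CB's feature vector; the toy table values are made up.

def cb_feature_vector(c_minus2, c_minus1, c_plus1, c_plus2, table, default=0.0):
    def lookup(ch, k):
        vec = table.get(ch)
        return vec[k] if vec is not None else default
    return [
        lookup(c_minus2, 3),   # c-2's "two CBs to the right" slot
        lookup(c_minus1, 2),   # c-1's "one CB to the right" slot
        lookup(c_plus1, 1),    # c+1's "one CB to the left" slot
        lookup(c_plus2, 0),    # c+2's "two CBs to the left" slot
    ]

toy_table = {"研": [1.0, 1.0, 0.0, 0.5], "究": [1.0, 0.0, 0.5, 0.5],
             "生": [0.5, 0.5, 0.5, 1.0], "命": [1.0, 0.0, 0.0, 1.0]}
print(cb_feature_vector("研", "究", "生", "命", toy_table))   # -> [0.5, 0.5, 0.5, 1.0]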
Theoretical Issues in Modeling • Do we look beyond WBs (in the 4CBM)? • No: characters cannot contribute to boundary conditions beyond an existing boundary. • Yes: we cannot assume lexical knowledge a priori (and the model is more elegant) • One or two features (in the 4CBM)? • No: positive information (that there is a WB) and negative information (that there is no WB) should be complementary • Yes (especially when the answer to the above question is no): there are under-specified cases
Size of Distributional Info • The Sinica Corpus 5.0 contains 6,820 types of c's (characters, numbers, punctuation, the Latin alphabet, etc.) • The 10-million-word corpus is converted into 14.4 million labeled CB vectors. • In this first study we implement a CB-only model, without any preprocessing of punctuation marks.
How to Model Decision I • Assume that each character represents an independent event; hence all relevant vectors can be summed up and evaluated • Simple heuristic by sum and threshold • Decision tree trained on a segmented corpus • Machine learning trained on a segmented corpus?
Simple Sum and Threshold Heuristic • Means of the sums of CB vectors for each class S and -S (mean probability of S = 2.90445651112, of -S = 1.89855870063) • A one-standard-deviation difference between each CB vector sum and these threshold values was used as the segmentation heuristic • 88% accuracy • Error analysis: CB vectors are not linearly separable
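A minimal sketch of this heuristic, assuming a CB's score is the sum of its aligned vector; the two means quoted on this slide are used, while the standard deviation below is a placeholder, not the paper's value.

MEAN_WB, MEAN_NON_WB = 2.90445651112, 1.89855870063   # means quoted above
STD = 0.3                                              # placeholder, assumed

def classify_cb(vector):
    s = sum(vector)                    # summed CB vector
    if s >= MEAN_WB - STD:
        return True                    # treat the CB as a wordbreak
    if s <= MEAN_NON_WB + STD:
        return False                   # treat it as word-internal
    return None                        # undecided: the classes overlap

print(classify_cb([0.9, 0.8, 0.7, 0.9]))   # -> True (sum 3.3 is near the WB mean)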
Decision Tree • A decision tree classifier (YaDT, Ruggieri 2004) is adopted • Trained on a sample of 900,000 CB vectors, with 100,000 boundary vectors used for the testing phase • Achieves up to 97% accuracy in the inside test, including numbers, punctuation, and foreign words.
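A sketch of the decision-tree step; the paper uses YaDT (Ruggieri 2004), for which scikit-learn's DecisionTreeClassifier stands in here purely for illustration, trained on CB feature vectors labeled WB / non-WB (the toy data is invented).

from sklearn.tree import DecisionTreeClassifier

def train_cb_tree(vectors, labels):
    # vectors: CB feature vectors; labels: True where the CB is a wordbreak
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(vectors, labels)
    return clf

X = [[0.9, 0.8, 0.7, 0.9], [0.8, 0.9, 0.9, 0.8],    # clear wordbreaks
     [0.1, 0.0, 0.1, 0.2], [0.2, 0.1, 0.0, 0.1]]    # clearly word-internal
y = [True, True, False, False]
clf = train_cb_tree(X, y)
print(clf.predict([[0.7, 0.8, 0.6, 0.9]]))          # -> [ True]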
Evaluation: SigHAN Bakeoff • Note that our method is NOT designed for the SigHAN bakeoff, where resources are devoted to fine-tuning for a small extra edge in scoring • This radical model aims to be robust in real-world situations, where it must perform reliably without extra tuning when encountering different texts • No manual pre-processing; texts are input as seen
Evaluation • Closed test, but without any lexical knowledge
Discussion • The method is basically sound • We still need to develop an effective algorithm for adaptation to new variants • Automatic pre-processing of punctuation marks and foreign symbols should improve performance • What role should lexical knowledge play? • The assumption that each character is an independent event may be incorrect
How to Model Decision II • Assume that a string of characters is not a sequence of independent events; hence certain combinations (as well as single characters) can contribute to the WB decision. • One possible implementation: the c's act as committee members, with the decision made by vote • Five voting blocks, decided by simple majority: [c-2 c-1], [c-1], [c-1 c+1], [c+1], [c+1 c+2] • Context: c-2 c-1 CB c+1 c+2
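One possible Python sketch of the committee vote, assuming the five voting blocks are the left bigram, left unigram, straddling bigram, right unigram, and right bigram around the CB, and that per-block votes have already been collected from a segmented corpus; the lookup table here is hypothetical.

def committee_decision(c_minus2, c_minus1, c_plus1, c_plus2, block_votes):
    blocks = [
        c_minus2 + c_minus1,   # bigram to the left of the CB
        c_minus1,              # unigram immediately to the left
        c_minus1 + c_plus1,    # bigram straddling the CB
        c_plus1,               # unigram immediately to the right
        c_plus1 + c_plus2,     # bigram to the right of the CB
    ]
    # block_votes maps a block string to True (WB) or False (non-WB);
    # unseen blocks abstain.
    votes = [block_votes[b] for b in blocks if b in block_votes]
    return sum(votes) > len(votes) / 2 if votes else None

votes = {"研究": True, "究": True, "究生": False, "生": True, "生命": True}
print(committee_decision("研", "究", "生", "命", votes))   # -> True (4 of 5 blocks vote WB)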
Conclusion I • We propose a radical but elegant model for Chinese word segmentation • The task is reduced to binary classification of CBs into WBs and non-WBs • The model does not presuppose any lexical knowledge and relies only on the distributional information of characters as the context of CBs
Conclusion II • In principle, this model should be robust and scalable across all different variants of texts • Preliminary experimental results are promising yet leave room for improvement • Work is still ongoing • You are welcome to adopt this model and experiment with your favorite algorithm!