Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation
Wei Qiao, Maosong Sun and Wolfgang Menzel
State Key Lab of Intelligent Tech. & Sys., Tsinghua University; Department of Informatics, Hamburg University
Part Ⅰ Background
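Introduction
• Chinese word segmentation
• Combination ambiguity
  • 火把 (torch) vs. 火 (fire) 把 (make)
• Overlapping ambiguity
  • a. 先解决其主要问题,再解决其次要问题 ("First solve its main problems, then its secondary problems"): 其 次要 (the subordinate)
  • b. 首先要关注整体,其次要注意细节 ("First attend to the whole; secondly, pay attention to the details"): 其次 要 (secondly we should)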
Related Terms
• Overlapping ambiguity string (OAS)
  • Length; order; intersection length; structure
• Maximal overlapping ambiguity string (MOAS)
• True / pseudo ambiguity MOAS
  • True MOAS (TM), e.g. 其次要: 其次 要 & 其 次要
  • Pseudo MOAS (PM), e.g. 部长篇小说: 部 (measure word) 长篇小说
[Diagram: examples labelled order 1 and order 2; structure notation 0-2, 1-3]
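The terms above can be made concrete with a small sketch. It assumes a working definition that the slides do not spell out: two lexicon words "cross" when they overlap without one containing the other, and a maximal chain of crossing words spans one MOAS. The function name `find_moas` and the toy lexicons are hypothetical, and this is not necessarily the detection procedure used by the authors.

```python
def find_moas(text, lexicon):
    """Return (start, end, string) for each maximal span covered by a chain
    of crossing lexicon words. Assumed working definition, not the authors' code."""
    if not lexicon:
        return []
    max_len = max(len(w) for w in lexicon)

    # All multi-character lexicon matches, ordered by start, then by end.
    matches = []
    for i in range(len(text)):
        for j in range(i + 2, min(len(text), i + max_len) + 1):
            if text[i:j] in lexicon:
                matches.append((i, j))

    # Union-find over matches: crossing words end up in the same group.
    parent = list(range(len(matches)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a in range(len(matches)):
        s1, e1 = matches[a]
        for b in range(a + 1, len(matches)):
            s2, e2 = matches[b]
            # Crossing: partial overlap, neither word contains the other.
            if s1 < s2 < e1 < e2:
                parent[find(a)] = find(b)

    # Span and size of each group; a group needs at least two words.
    groups = {}
    for idx, (s, e) in enumerate(matches):
        root = find(idx)
        lo, hi, n = groups.get(root, (s, e, 0))
        groups[root] = (min(lo, s), max(hi, e), n + 1)

    return sorted((lo, hi, text[lo:hi]) for lo, hi, n in groups.values() if n >= 2)


# Toy lexicons, for illustration only.
print(find_moas("再解决其次要问题", {"其次", "次要", "解决"}))
print(find_moas("部长篇小说", {"部长", "长篇", "长篇小说", "小说"}))
```

The pairwise crossing check is quadratic in the number of matches, which is acceptable for a sentence-level illustration; a corpus-scale scan would need a sweep over sorted intervals instead.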
Previous Work
• [Sun et al., 1999]
  • 100-million-character corpus
  • A core set of MOASs was found
• [Li et al., 2003]
  • 650-million-character corpus
  • A similar method was used to improve segmenter performance
Motivation
• Two basic issues remain unsolved in their work:
  • Only news data were included, so the results need further validation
  • The core of pseudo OA strings still has to be determined, both for general-purpose and for domain-specific text
Part Ⅱ Statistical Properties of MOAS
• From General Corpus
• From Domain-specific Corpus
From General Corpus
• Data set
  • CBC: 929,963,468 characters
  • Rich in content (from the 1920s), covering categories such as novels, essays, news, ...
• Chinese word list
  • Peking University, with 74,191 entries
• A total of 733,066 distinct MOAS types were found automatically in CBC
From General Corpus • Detailed Distribution • Perspective 1: Length
From General Corpus • Perspective 2: Order
From General Corpus • Perspective 3: Intersection Length
From General Corpus • Perspective 4: Structure distribution
From General Corpus
• Top N frequent MOAS: core candidates
  • Top 40,000: 80.39% coverage
  • Top 7,000: 60.43% coverage
  • Top 3,500: 50.78% coverage
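Coverage figures of this kind can be read as the share of all MOAS tokens accounted for by the N most frequent MOAS types. The sketch below shows that arithmetic under this assumption; the function name `top_n_coverage` and the toy counts are hypothetical and are not taken from CBC.

```python
from collections import Counter


def top_n_coverage(type_counts, n):
    """Fraction of all tokens covered by the n most frequent types."""
    total = sum(type_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in type_counts.most_common(n))
    return covered / total


# Toy frequencies, invented for illustration only.
toy_counts = Counter({"其次要": 5, "出国门": 3, "部长篇小说": 2})
for n in (1, 2, 3):
    print(n, top_n_coverage(toy_counts, n))
```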
From General Corpus
• Stability vs. corpus size
[Figures: number of MOAS vs. corpus size; number of top-N MOAS vs. corpus size; top 7,000]
From General Corpus
• Pseudo MOAS detection
  • Relaxed definition of "pseudo", e.g. 出国门: 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) with small probability
  • 5,507 PM and 1,439 TM judged by hand
  • Token coverage of PM and TM over CBC
From Domain-specific Corpora
• Domain-specific corpora
  • Ency55: 90.02 million characters
  • Web55: 54.97 million characters
• Common parts
From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=3,500)
From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=7,000)
From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=40,000)
From Domain-specific Corpora
• PM and TM distribution over domain corpora
• 42% of overlapping ambiguities in any Chinese text can be resolved with 100% correctness.
Part Ⅲ Disambiguation
Disambiguation Method
• Current performance on OA
  • Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs
    e.g. 公安局 长 是 主管 这一 事故 的
    "The police chief (公安 局长) is the person who is in charge of this accident."
  • Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs
    e.g. 核电站的特殊性 质
    "The special properties (特殊 性质) of the nuclear power station"
Disambiguation Method
• Performance of CRF-based [Lafferty 2001] CWS on OAs
  e.g. 这一 现状 先 天地 决定 了 他们 的 使命
  "This situation congenitally (先天 地) makes them take on the mission."
• About 2% of OASs are mistakenly segmented; it is a net gain
Disambiguation Method
• Individual-based method
  • Simple table lookup: record the PMs and their correct segmentations in a table
• Advantages
  • Satisfactory token coverage of MOASs
  • Full correctness for segmentation of pseudo MOASs
  • Low cost in time and space
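A minimal sketch of the table-lookup idea follows. The table contents reuse two PM examples from the slides; the function name `apply_pm_table` and the output format are hypothetical. As a simplification, the sketch matches table keys as plain substrings (longest first) and does not verify that a hit is a maximal OAS in its context, which a real system would need to do.

```python
# Pseudo-MOAS table: string -> its recorded segmentation.
# Entries come from the slide examples; a real table would hold thousands.
PM_TABLE = {
    "部长篇小说": ["部", "长篇小说"],
    "出国门": ["出", "国门"],
}


def apply_pm_table(text, table):
    """Split text into ("raw", str) and ("resolved", [words]) chunks.

    Raw chunks are left for an ordinary segmenter; resolved chunks carry
    the segmentation stored in the table.
    """
    if not table:
        return [("raw", text)] if text else []
    max_len = max(len(key) for key in table)
    chunks = []
    raw_start = 0
    i = 0
    while i < len(text):
        hit = None
        # Try the longest candidate first.
        for length in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + length]
            if candidate in table:
                hit = candidate
                break
        if hit is None:
            i += 1
            continue
        if raw_start < i:
            chunks.append(("raw", text[raw_start:i]))
        chunks.append(("resolved", list(table[hit])))
        i += len(hit)
        raw_start = i
    if raw_start < len(text):
        chunks.append(("raw", text[raw_start:]))
    return chunks


print(apply_pm_table("他写了一部长篇小说", PM_TABLE))
```

With the table held in a hash map, each position costs at most `max_len` lookups, which is consistent with the low time and space cost claimed for the method.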
Conclusion
• An extension of [Sun et al., 1999]
  • Adjusts the existing results on a large corpus
  • Further verifies the properties on domain-specific corpora
• A disambiguation strategy is proposed
  • Over 42% of overlapping ambiguities can be resolved without any mistake
  • Will be more effective when facing running text
References
• Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of 18th International Conference of ICML, pages 282-289.
• Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese)
• Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN'2003, pages 1-7.
• Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338.
• Sun M.S., C.N. Huang, and B.K.Y. T'sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese)
• Sun M.S., Z.P. Zuo, and B.K.Y. T'sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)
Thank you. Any comments?