Wei Qiao, Maosong Sun and Wolfgang Menzel

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation
Wei Qiao, Maosong Sun and Wolfgang Menzel
State Key Lab of Intelligent Tech. & Sys., Tsinghua University
Department of Informatics, Hamburg University

Part Ⅰ Background

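Introduction
• Chinese word segmentation
• Combination ambiguity: 火把 (torch) vs. 火 (fire) + 把 (make)
• Overlapping ambiguity: the string 其次要 is segmented differently in the two sentences below
  a. 先解决其主要问题，再解决其次要问题 (first solve its main problem, then solve its secondary problems): 其 | 次要 (the subordinate)
  b. 首先要关注整体，其次要注意细节 (first attend to the whole, secondly attend to the details): 其次 | 要 (secondly we should)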

Related Terms
• Overlapping ambiguity string (OAS)
  • Properties: length; order; intersection length; structure
• Maximal overlapping ambiguity string (MOAS)
• True / pseudo ambiguity MOAS
  • True MOAS (TM), e.g. 其次要: 其 | 次要 and 其次 | 要
  • Pseudo MOAS (PM), e.g. 部长篇小说: 部 (measure word) | 长篇小说
• [Figure: order-1 and order-2 examples; structure "0-2, 1-3" over character positions 0 to 3]
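The terms above can be made concrete with a small sketch. It assumes one simplified reading of the definitions: two lexicon words "overlap" when their character spans cross (neither contains the other), and a MOAS is the union of a maximal chain of crossing words. The function names and the tiny lexicon are hypothetical and are not taken from the slides.

```python
from itertools import combinations

def word_spans(text, lexicon):
    """All (start, end) spans of multi-character lexicon words in text."""
    max_len = max(len(w) for w in lexicon)
    return [(i, j)
            for i in range(len(text))
            for j in range(i + 2, min(len(text), i + max_len) + 1)
            if text[i:j] in lexicon]

def crosses(a, b):
    """True if the two spans overlap without one containing the other."""
    (s1, e1), (s2, e2) = sorted((a, b))
    return s1 < s2 < e1 < e2

def find_moas(text, lexicon):
    """Return sorted (start, string) pairs, one per maximal chain of crossing words."""
    spans = word_spans(text, lexicon)
    parent = list(range(len(spans)))

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(spans)), 2):
        if crosses(spans[i], spans[j]):
            parent[find(i)] = find(j)

    groups = {}
    for idx, span in enumerate(spans):
        groups.setdefault(find(idx), []).append(span)

    result = []
    for members in groups.values():
        if len(members) > 1:  # a lone word is not ambiguous
            start = min(s for s, _ in members)
            end = max(e for _, e in members)
            result.append((start, text[start:end]))
    return sorted(result)

# Hypothetical mini-lexicon for illustration only.
LEXICON = {"其次", "次要", "解决", "问题", "部长", "长篇", "长篇小说", "小说"}
print(find_moas("再解决其次要问题", LEXICON))
print(find_moas("一部长篇小说", LEXICON))
```

The pairwise comparison is quadratic in the number of word spans per sentence, which is acceptable for a sketch but would likely need a sweep over sorted spans for corpus-scale counting.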

Previous Work
• [Sun et al., 1999]
  • 100-million-character corpus
  • A core set of MOASs is identified
• [Li et al., 2003]
  • 650-million-character corpus
  • A similar method is used to improve segmenter performance

Motivation
• Two basic issues remain unsolved in the previous work:
  • Only news data are included, so the results need further validation
  • The core of pseudo OA strings still has to be determined, both for general-purpose and for domain-specific text

Part Ⅱ Statistical Properties of MOAS • From General Corpus • From Domain-specific Corpus

From General Corpus
• Data set
  • CBC: 929,963,468 characters
  • Rich in content (from the 1920s onward), covering categories such as novels, essays, news, ...
• Chinese word list
  • Peking University, with 74,191 entries
• A total of 733,066 distinct MOAS types are found automatically in CBC

From General Corpus • Detailed Distribution • Perspective 1: Length

From General Corpus • Perspective 2: Order

From General Corpus • Perspective 3: Intersection Length

From General Corpus • Perspective 4: Structure distribution

From General Corpus
• Top N frequent MOAS (core candidates)
  • Top 40,000: ~80.39%
  • Top 7,000: ~60.43%
  • Top 3,500: ~50.78%
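Assuming the percentages on this slide are cumulative token coverage (the share of all MOAS occurrences accounted for by the N most frequent MOAS types), the calculation could be sketched as follows. The function name and the counts are invented for illustration and are not CBC figures.

```python
from collections import Counter

def top_n_coverage(type_counts, n):
    """Fraction of all MOAS tokens accounted for by the n most frequent types."""
    total = sum(type_counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in Counter(type_counts).most_common(n))
    return covered / total

# Toy counts, invented for illustration only (not figures from CBC).
toy_counts = {"其次要": 50, "部长篇小说": 30, "出国门": 15, "公安局长": 5}
for n in (1, 2, 4):
    print(n, top_n_coverage(toy_counts, n))
```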

From General Corpus
• Stability vs. corpus size (top 7,000)
  • Number of MOAS vs. corpus size
  • Number of top-N MOAS vs. corpus size

From General Corpus
• Pseudo MOAS detection
  • Relaxed definition of "pseudo", e.g. 出国门: 出 | 国门 (go abroad) in almost all cases; 出国 | 门 (the way to go abroad) only with small probability
  • 5,507 PM and 1,439 TM judged by hand
  • Token coverage of PM and TM over CBC

From Domain-specific Corpora • Domain-Specific Corpora • Ency55: 90.02 million characters • Web55: 54.97 million characters • Common Parts

From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=3,500)

From Domain-specific Corpora
• PM and TM distribution over domain corpora
• 42% of overlapping ambiguities in any Chinese text can be resolved with 100% accuracy.

Part Ⅲ Disambiguation

Disambiguation Method
• Current performance on OA
  • Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs
    e.g. 公安局长是主管这一事故的: The police chief (公安局长) is the person who is in charge of this accident.
  • Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs
    e.g. 核电站的特殊性质: The special properties (特殊性质) of a nuclear power station

Disambiguation Method
• Performance of CRF-based [Lafferty et al., 2001] CWS on OAs
  e.g. 这一现状先天地决定了他们的使命: This situation congenitally (先天地) determines their mission
• About 2% of OAS are mistakenly segmented; it is a net gain

Disambiguation Method
• Individual-based method
  • Simple table lookup: record the PMs and their correct segmentations in a table
• Advantages
  • Satisfactory token coverage of MOASs
  • Full correctness for the segmentation of pseudo MOASs
  • Low cost in time and space
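The table lookup described above might look roughly like the sketch below. It assumes the MOAS has already been located in the input and that anything not in the table is handed to some other segmenter; the table contents, function name, and `fallback` parameter are hypothetical, and the two entries simply reuse the PM examples from earlier slides.

```python
# Hypothetical pseudo-MOAS table: MOAS string -> its single correct segmentation.
PSEUDO_TABLE = {
    "部长篇小说": ("部", "长篇小说"),
    "出国门": ("出", "国门"),
}

def segment_moas(moas, table=PSEUDO_TABLE, fallback=None):
    """Segment a MOAS by table lookup.

    Returns the stored segmentation for a known pseudo MOAS. For anything
    else, returns fallback(moas) if a fallback segmenter is given, or None
    to signal that the string is left unresolved.
    """
    segmentation = table.get(moas)
    if segmentation is not None:
        return list(segmentation)
    return fallback(moas) if fallback is not None else None

print(segment_moas("部长篇小说"))              # found in the table
print(segment_moas("其次要"))                  # true MOAS, not in the table
print(segment_moas("其次要", fallback=list))   # toy fallback: split into characters
```

A dictionary lookup keeps the per-MOAS cost constant on average, and memory grows linearly with the number of recorded PMs, which is in line with the low time and space cost claimed on the slide.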

Conclusion
• An extension of [Sun et al., 1999]
  • Adjusts the existing results using large corpora
  • Further verifies the properties on domain-specific corpora
• A disambiguation strategy is proposed
  • Over 42% of overlapping ambiguities can be resolved without any mistake
  • Will be more effective when facing running text

References
• Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning (ICML), pages 282-289.
• Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese)
• Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN’2003, pages 1-7.
• Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338.
• Sun M.S., C.N. Huang, and B.K.Y. T’sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese)
• Sun M.S., Z.P. Zuo, and B.K.Y. T’sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)

Thank you! Any comments?
