
Wei Qiao, Maosong Sun and Wolfgang Menzel

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation. Wei Qiao, Maosong Sun and Wolfgang Menzel. State Key Lab of Intelligent Tech. & Sys., Tsinghua University; Department of Informatics, Hamburg University.


Presentation Transcript


  1. Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation • Wei Qiao, Maosong Sun and Wolfgang Menzel • State Key Lab of Intelligent Tech. & Sys., Tsinghua University; Department of Informatics, Hamburg University

  2. Part Ⅰ Background

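  3. Introduction • Chinese word segmentation • Combination ambiguity: 火把 (torch) vs. 火 (fire) 把 (make) • Overlapping ambiguity: a. 先解决其主要问题,再解决其次要问题 (first solve its main problems, then solve its subordinate problems): 其 次要 (the subordinate) b. 首先要关注整体,其次要注意细节 (first attend to the whole; secondly, pay attention to the details): 其次 要 (secondly we should)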

  4. Related Terms • Overlapping ambiguity string (OAS) • Properties: length; order; intersection length; structure • Maximal overlapping ambiguity string (MOAS) • True / pseudo ambiguity MOAS • e.g. 其次要 (true MOAS, TM): 其次 要 & 其 次要 • e.g. 部长篇小说 (pseudo MOAS, PM): 部 (measure word) 长篇小说 • [Figure: word spans 0-2 and 1-3 over character positions 0 to 3, labeled order 1 and order 2]
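
The terms on this slide can be illustrated with a small sketch. It assumes a toy word list and treats a MOAS as the full stretch covered by a group of mutually crossing dictionary words; the function name `find_moas`, the `max_word_len` parameter and the lexicon contents are hypothetical and are not taken from the presentation.

```python
from itertools import combinations


def find_moas(sentence, lexicon, max_word_len=4):
    """Return (start, end, text) for each maximal overlapping ambiguity string.

    Assumption: two dictionary words "cross" when they share characters but
    neither contains the other; a MOAS is the union of a connected group of
    crossing words. This is an illustrative reading, not the authors' code.
    """
    n = len(sentence)
    # All multi-character lexicon words, as half-open spans, sorted by start.
    spans = [(i, j)
             for i in range(n)
             for j in range(i + 2, min(n, i + max_word_len) + 1)
             if sentence[i:j] in lexicon]

    parent = list(range(len(spans)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    crossing = set()
    for a, b in combinations(range(len(spans)), 2):
        (s1, e1), (s2, e2) = spans[a], spans[b]
        if s1 < s2 < e1 < e2:          # true overlap, not nesting
            parent[find(a)] = find(b)
            crossing.update((a, b))

    groups = {}
    for idx in crossing:
        root = find(idx)
        s, e = spans[idx]
        lo, hi = groups.get(root, (s, e))
        groups[root] = (min(lo, s), max(hi, e))
    return sorted((lo, hi, sentence[lo:hi]) for lo, hi in groups.values())


# Toy lexicon (hypothetical); a real run would use a full word list.
toy_lexicon = {"首先", "关注", "整体", "其次", "次要", "注意", "细节",
               "部长", "长篇", "小说", "长篇小说"}
moas_a = find_moas("首先要关注整体,其次要注意细节", toy_lexicon)
moas_b = find_moas("部长篇小说", toy_lexicon)
```

Nested words such as 长篇 inside 长篇小说 are deliberately not counted as crossings, since nesting corresponds to combination ambiguity rather than overlapping ambiguity.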

  5. Previous Work • [Sun et al., 1999] • 100 million characters • A core set of MOASs is identified • [Li et al., 2003] • 650 million characters • A similar method is used to improve segmenter performance

  6. Motivation • Two basic issues remain unsolved in their work: • Only news data were included, so the results need further validation • Determining the core set of pseudo OA strings, both for general-purpose and for domain-specific text

  7. Part Ⅱ Statistical Properties of MOAS • From General Corpus • From Domain-specific Corpus

  8. From General Corpus • Data set • CBC: 929,963,468 characters • Rich in content (from the 1920s onward), covering many categories such as novels, essays, news, ... • Chinese word list • Peking University, with 74,191 entries • A total of 733,066 distinct MOAS types are found automatically in CBC

  9. From General Corpus • Detailed Distribution • Perspective 1: Length

  10. From General Corpus • Perspective 2: Order

  11. From General Corpus • Perspective 3: Intersection Length

  12. From General Corpus • Perspective 4: Structure distribution

  13. From General Corpus • Top N frequent MOAS (core candidates) and their coverage • Top 3,500: 50.78% • Top 7,000: 60.43% • Top 40,000: 80.39%
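
The coverage figures on this slide are cumulative shares of the most frequent MOAS types. A minimal sketch of that arithmetic follows, assuming coverage means the fraction of all MOAS occurrences accounted for by the top N types; the function name and the toy counts are hypothetical and do not reproduce the CBC data.

```python
from collections import Counter


def top_n_coverage(freq, n):
    """Fraction of all occurrences covered by the n most frequent types."""
    total = sum(freq.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in Counter(freq).most_common(n))
    return covered / total


# Toy counts (made up for illustration only).
toy_counts = {"A": 50, "B": 30, "C": 15, "D": 5}
coverage_top2 = top_n_coverage(toy_counts, 2)
```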

  14. From General Corpus • Stability vs. corpus size (top 7,000) • [Figures: number of MOAS vs. corpus size; number of top N MOAS vs. corpus size]

  15. From General Corpus • Pseudo MOAS detection • Relaxed definition of “pseudo”, e.g. “出国门”: 出 国门 (go abroad) in almost all cases; 出国 门 (the way to go abroad) with small probability • 5,507 PM and 1,439 TM judged by hand • Token coverage of PM and TM over CBC

  16. From Domain-specific Corpora • Domain-Specific Corpora • Ency55: 90.02 million characters • Web55: 54.97 million characters • Common Parts

  17. From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=3,500)

  18. From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=7,000)

  19. From Domain-specific Corpora • Frequent MOAS Coverage in Domain Specific Corpora (N=40,000)

  20. From Domain-specific Corpora • PM and TM distribution over domain corpora • 42% of overlapping ambiguities in any Chinese text can be resolved with 100% accuracy.

  21. Part Ⅲ Disambiguation

  22. Disambiguation Method • Current performance on OA • Performance of ICTCLAS1.0 (http://www.nlp.org.cn) on OAs, e.g. 公安局 长 是 主管 这一 事故 的: "The police chief (公安 局长) is the person in charge of this accident." • Performance of MSR-Seg1.0 (http://research.microsoft.com/-S-MSRSeg) on OAs, e.g. 核电站的特殊性 质: "The special properties (特殊 性质) of the nuclear power station"

  23. Disambiguation Method • Performance of CRF-based [Lafferty 2001] CWS on OAs, e.g. 这一 现状 先 天地 决定 了 他们 的 使命: "This situation congenitally (先天 地) makes them take on the mission." • About 2% of OAS are mistakenly segmented; it is a net gain

  24. Disambiguation Method • Individual-based method • Simple table lookup: record the PMs and their correct segmentations in a table • Advantages • Satisfactory token coverage of MOASs • Full correctness for the segmentation of pseudo MOASs • Low cost in time and space
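
The table lookup on this slide could be sketched as follows. This is only an illustration under stated assumptions: the table maps each pseudo MOAS to its single recorded segmentation, matching is done by plain substring search (a real system would first confirm that the string is a MOAS in context), and everything outside a match is handed to some other segmenter. The names `segment_with_pm_table`, `pm_table` and `fallback` are hypothetical.

```python
def segment_with_pm_table(sentence, pm_table, fallback):
    """Segment `sentence`, forcing the recorded split for every PM found.

    pm_table: dict mapping a pseudo MOAS string to its list of words.
    fallback: callable that segments any text not covered by the table.
    """
    # Prefer longer table entries when several start at the same position.
    keys = sorted(pm_table, key=len, reverse=True)
    words = []
    i = 0
    pending = 0  # start of the text not yet handed to the fallback
    while i < len(sentence):
        match = next((k for k in keys if sentence.startswith(k, i)), None)
        if match is None:
            i += 1
            continue
        if pending < i:
            words.extend(fallback(sentence[pending:i]))
        words.extend(pm_table[match])
        i += len(match)
        pending = i
    if pending < len(sentence):
        words.extend(fallback(sentence[pending:]))
    return words


# Toy table with one entry from the slides; `list` stands in for a real
# segmenter and simply splits the remaining text into single characters.
toy_pm_table = {"部长篇小说": ["部", "长篇小说"]}
toy_result = segment_with_pm_table("一部长篇小说", toy_pm_table, list)
```

Each lookup is a dictionary access, which is in keeping with the low time and space cost listed among the advantages.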

  25. Conclusion • An extension of [Sun et al., 1999] • Adjusts the existing results using larger corpora • Further verifies the properties on domain-specific corpora • A disambiguation strategy is proposed • Over 42% of overlapping ambiguities can be resolved without any mistake • Will be more effective when facing running text

  26. References • Lafferty J., A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of 18th International Conference of ICML, pages 282-289. • Li R., S.H. Liu, S.W. Ye, and Z.Z. Shi. 2001. A method for resolving overlapping ambiguities in Chinese word segmentation based on SVM and k-NN. Journal of Chinese Information Processing, 15(6): 13-18. (In Chinese) • Li M., J.F. Gao, C.N. Huang, and J.F. Li. 2003. Unsupervised training for overlapping ambiguity resolution in Chinese word segmentation. In Proceedings of SIGHAN’2003, pages 1-7. • Sun M.S. and Z.P. Zuo. 1998. Overlapping ambiguities in Chinese text. Quantitative and Computational Studies on the Chinese Language, pages 323-338. • Sun M.S., C.N. Huang, and B.K.Y. T’sou. 1997. Using character bigram for ambiguity resolution in Chinese word segmentation. Computer Research and Development, 34(5): 332-339. (In Chinese) • Sun M.S., Z.P. Zuo, and B.K.Y. T’sou. 1999. The role of high frequent maximal crossing ambiguities in Chinese word segmentation. Journal of Chinese Information Processing, 13(1): 27-37. (In Chinese)

  27. Thank you. Any comments?
