
Sixth SIGHAN Workshop on Chinese Language Processing Jia Lu, Masayuki Asahara, Yuji Matsumoto

Analyzing Chinese Synthetic Words with Tree-based Information and a Survey on Chinese Morphologically Derived Words. Sixth SIGHAN Workshop on Chinese Language Processing. Jia Lu, Masayuki Asahara, Yuji Matsumoto. 黃挺豪, 20080328.


Presentation Transcript


  1. Analyzing Chinese Synthetic Words with Tree-based Information and a Survey on Chinese Morphologically Derived Words. Sixth SIGHAN Workshop on Chinese Language Processing. Jia Lu, Masayuki Asahara, Yuji Matsumoto. 黃挺豪, 20080328

  2. Introduction
     • Goal
       • What is the internal structure of Chinese synthetic words? Ex. 電腦 → 電/腦; 接駁車 → 接駁/車
       • What are their semantic and syntactic types? Ex. 電腦 → 電/腦, where 電 modifies 腦 (modifier-head, 偏正)
     • No single segmentation standard

  3. Definition of Chinese Words
     • Single-morpheme words
       • One-character words: 人、馬、車
       • One-morpheme words: 鸚鵡、翡翠、鴛鴦
       • Transliteration words: 肯德基、阿斯匹靈
     • Synthetic words

  4. Classification of Chinese synthetic words

  5. Compound Words
     • Subject-predicate (主謂)
       VS: 搬運/工、裁判/員
       SV: 胃/下垂、地/震
     • Verb-object (動賓)
       OV: 黨/代表
       VO: 理/髮、反/政府
     • Verb-modification (動詞偏正)
       VX: 放大/器、沖印/店
       XV: 自動/控制、批/發

  6. Compound Words (Cont.)
     • Predicate-complement (述補)
       VV: 跑/出來、打發/掉
       VA: 染/紅
     • Parallel-combination (聯合): 開/發、學/習、國/家、兄/弟、中/日/韓
     • Noun-modification (名詞偏正): 電/腦、書/架、汽車/站

  7. Morphologically Derived Words
     • Merging: 中學 + 小學 → 中小學; 上文 + 下文 → 上下文
     • Reduplication: 雄糾糾、研究研究
     • Affixation
       Prefix: 副/主席、總/工程
       Infix: 看不到、聽得見
       Suffix: 調查/局、安全/廳

  8. Exceptions
     • Abbreviations: 中共 → 中國共產黨
     • Factoids: 2007.1.30、五點半、三塊五毛六
     • Idioms, proverbs, sayings and poems: 門可羅雀、先天下之憂而憂

  9. Previous Research
     • Andi Wu, 2003: 烤麵包器 [toaster] → 烤 [bake] / 麵包 [bread] / 器 [machine]
     • C. Huang, 1997: 北京市安全廳 → <w2><w1><w0>北京</w0><w0>市</w0></w1><w1><w0>安全</w0><w0>廳</w0></w1></w2>
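The nested `<w2>/<w1>/<w0>` markup in the C. Huang example can be read as a tree. The sketch below parses that one string with Python's standard `xml.etree.ElementTree` and turns it into nested lists. The helper name `to_tree` is made up for illustration, and treating the markup as well-formed XML is an assumption, not something the slide states.

```python
import xml.etree.ElementTree as ET

# The annotated word exactly as shown on the slide.
ANNOTATED = ("<w2><w1><w0>北京</w0><w0>市</w0></w1>"
             "<w1><w0>安全</w0><w0>廳</w0></w1></w2>")

def to_tree(elem):
    """Hypothetical helper: leaf tags become strings, inner tags become lists."""
    children = list(elem)
    if not children:
        return elem.text
    return [to_tree(child) for child in children]

root = ET.fromstring(ANNOTATED)
tree = to_tree(root)
surface = "".join(root.itertext())  # concatenating the leaves gives back the word

print(tree)     # [['北京', '市'], ['安全', '廳']]
print(surface)  # 北京市安全廳
```

Read this way, the tag depth (`w0`, `w1`, `w2`) corresponds to the level of each node in the word-internal tree.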

  10. Synthetic word analysis with tree-based structure information
     • Assumption: words already in the system dictionary can serve as components of other, unknown synthetic words → classify all synthetic words in the dictionary
     • Focus on 3-character words
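A minimal sketch of the dictionary assumption for three-character words: list the two binary splits of a word and keep only those whose parts are both already in the dictionary. The function name `candidate_splits` and the toy dictionary are hypothetical; the slide does not describe the authors' actual lookup procedure.

```python
def candidate_splits(word, dictionary):
    """Return the binary splits of a 3-character word whose parts are all known words."""
    if len(word) != 3:
        raise ValueError("this sketch only handles 3-character words")
    splits = [(word[:1], word[1:]),   # 1 + 2
              (word[:2], word[2:])]   # 2 + 1
    return [(left, right) for left, right in splits
            if left in dictionary and right in dictionary]

# Toy dictionary, invented for illustration only.
toy_dictionary = {"接", "駁", "接駁", "車"}

print(candidate_splits("接駁車", toy_dictionary))  # [('接駁', '車')]
```

With this toy dictionary only the 2+1 split survives, because 駁車 is not a listed word.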

  11. Tree structure

  12. Synthetic word analysis with tree-based structure information
     • Rules take one of the following forms: A + B → Category, or A + B + C → Category
     • A, B and C are parts of speech, affixation, or other properties of word components.
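Rules of the form A + B → Category can be pictured as a lookup table keyed by the properties of the components. The sketch below is only an illustration of that shape: the rule entries, the property labels (`"V"`, `"N"`, `"PREFIX"`, `"SUFFIX"`) and the function `classify` are all hypothetical placeholders, since the actual rule set is not given in the transcript.

```python
# Hypothetical rule table: tuple of component properties -> category.
# These entries are placeholders, not the authors' rules.
RULES = {
    ("PREFIX", "N"): "Affixation (prefix)",
    ("N", "SUFFIX"): "Affixation (suffix)",
    ("V", "N"): "Verb-object",
    ("N", "N"): "Noun-modification",
    ("N", "N", "N"): "Parallel-combination",
}

def classify(properties):
    """Look up a two- or three-component property sequence; 'unknown' if no rule fits."""
    return RULES.get(tuple(properties), "unknown")

print(classify(["V", "N"]))        # Verb-object
print(classify(["N", "N", "N"]))   # Parallel-combination
print(classify(["A", "A"]))        # unknown
```

A real rule set would need a way to resolve conflicts, since the same property pair can correspond to more than one category in the classification on the earlier slides.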

  13. Annotation of morphologically derived words in dictionary

  14. Annotation of morphologically derived words in dictionary (Cont.)

  15. Preprocess
     • Noun-modification words
     • Affixation Table Map (from 1000 words)
       Ex. Prefix: 副手、副隊長、副班長
           Suffix: 跳舞、現代舞、芭蕾舞
     • Parallel combination, Reduplication
     • Match the fixed format
       Ex. 研究研究: ABAB; 試看看: BAA
       AA, ABAB, AABB, AXA, AXAY, XAYA, AAB and ABB.
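Matching a word against fixed formats such as ABAB or ABB can be sketched as a simple template check. In the sketch below, each pattern letter is treated as a variable bound to one character, and different letters must bind different characters; that distinctness requirement is an assumption of this sketch, as are the names `matches` and `reduplication_patterns`.

```python
REDUP_PATTERNS = ["AA", "ABAB", "AABB", "AXA", "AXAY", "XAYA", "AAB", "ABB"]

def matches(word, pattern):
    """True if the word fits the pattern with a consistent one-to-one letter binding."""
    if len(word) != len(pattern):
        return False
    binding = {}
    for char, var in zip(word, pattern):
        # setdefault binds the variable on first sight, then checks consistency.
        if binding.setdefault(var, char) != char:
            return False
    # Assumption: distinct pattern letters stand for distinct characters.
    return len(set(binding.values())) == len(binding)

def reduplication_patterns(word):
    """Return every listed format the word fits."""
    return [p for p in REDUP_PATTERNS if matches(word, p)]

print(reduplication_patterns("研究研究"))  # ['ABAB']
print(reduplication_patterns("試看看"))    # ['ABB']
print(reduplication_patterns("電腦"))      # []
```

Under this letter assignment 試看看 and 雄糾糾 both come out as ABB, because the repeated character is the second one seen.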

  16. Experiment on Morphologically derived words
     • Corpus: Chinese Gigaword (CGW)
     • 3-char = 1+2 or 2+1 (rarely 1+1+1)
     • The word ABC (Ex. 交響曲)
     • Mi-pre: the mutual information for A and BC
     • Mi-suf: the mutual information for AB and C
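The slide names Mi-pre and Mi-suf but does not give a formula. A minimal sketch, assuming the usual pointwise mutual information log2(p(xy) / (p(x) p(y))) estimated from relative corpus frequencies, looks like this. The counts are invented toy numbers, not Chinese Gigaword statistics, and `pmi` and `mi_pre_suf` are hypothetical names.

```python
import math

def pmi(freq_joint, freq_x, freq_y, total):
    """Pointwise mutual information from raw counts (assumed definition)."""
    p_xy = freq_joint / total
    p_x = freq_x / total
    p_y = freq_y / total
    return math.log2(p_xy / (p_x * p_y))

def mi_pre_suf(word, freq, total):
    """Mi-pre scores the split A|BC, Mi-suf scores the split AB|C, for word = ABC."""
    a, bc = word[0], word[1:]
    ab, c = word[:2], word[2]
    mi_pre = pmi(freq[word], freq[a], freq[bc], total)
    mi_suf = pmi(freq[word], freq[ab], freq[c], total)
    return mi_pre, mi_suf

# Toy counts, invented for illustration only.
toy_total = 2 ** 20
toy_freq = {"交響曲": 64, "交": 4096, "響曲": 128, "交響": 128, "曲": 2048}

mi_pre, mi_suf = mi_pre_suf("交響曲", toy_freq, toy_total)
print(mi_pre, mi_suf)  # 7.0 8.0
```

A higher value for one split means its two parts co-occur more often than chance would predict; how the authors combined or thresholded the two scores is not shown on the slide.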

  17. Experiment on Morphologically derived words (Cont.)

  18. Using SVM
     • Features
       • Internal parts: A, C, BC, AB, ABC
       • POS of each internal part: pos(A), pos(C), pos(BC), pos(AB), pos(ABC)
       • Frequency of each part in Chinese Gigaword: fre(A), fre(C), fre(BC), fre(AB), fre(ABC)
       • Mutual information of internal parts: Mi-pre(A-BC), Mi-suf(AB-C)
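The feature list above can be assembled into one record per word. The sketch below builds such a record as a plain dictionary; the function name `extract_features`, the `"UNK"` fallback tag, and the toy POS tags and counts are assumptions for illustration and do not reproduce the authors' feature encoding.

```python
def extract_features(word, pos, freq, mi_pre, mi_suf):
    """Collect the 17 features named on the slide for a 3-character word ABC."""
    parts = {
        "A": word[0],
        "C": word[2],
        "BC": word[1:],
        "AB": word[:2],
        "ABC": word,
    }
    features = {}
    for name, part in parts.items():
        features[f"part({name})"] = part
        features[f"pos({name})"] = pos.get(part, "UNK")  # assumed fallback tag
        features[f"fre({name})"] = freq.get(part, 0)
    features["Mi-pre(A-BC)"] = mi_pre
    features["Mi-suf(AB-C)"] = mi_suf
    return features

# Toy lookups, invented for illustration only.
toy_pos = {"交": "V", "曲": "N", "交響": "N", "交響曲": "N"}
toy_freq = {"交響曲": 64, "交": 4096, "交響": 128, "曲": 2048}

features = extract_features("交響曲", toy_pos, toy_freq, mi_pre=7.0, mi_suf=8.0)
print(len(features))  # 17
```

The string-valued features would still need to be turned into numbers, for example by one-hot encoding, before being passed to an SVM implementation.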

  19. Result

  20. Result Discussion
     • Unbalanced results, because prefixes are much rarer (9.78%)
     • Wrong cases: 主色調、土坷垃、小賣部、山大王、市中心、菲軍方、零備件、學聯會
     • 大批量: the probability that 量 is a suffix is too high
     • 羅影劇: 羅影 is not in the dictionary

  21. Future Work
     • Other morphological internal structures
     • Use a thesaurus that contains syntactic information about words or characters
     • Build a Chinese synthetic word dictionary

  22. Thanks!
