Analyzing Chinese Synthetic Words with Tree-based Information and a Survey on Chinese Morphologically Derived Words
Sixth SIGHAN Workshop on Chinese Language Processing
Jia Lu, Masayuki Asahara, Yuji Matsumoto
黃挺豪 20080328
Introduction
• Goal
  • What is the internal structure of Chinese synthetic words? Ex. 電腦 → 電/腦; 接駁車 → 接駁/車
  • What are their semantic and syntactic types? Ex. 電腦 → 電/腦, where 電 modifies 腦 (modifier-head, 偏正)
• No single segmentation standard
Definition of Chinese Words
• Single-morpheme words
  • One-character words: 人、馬、車
  • One-morpheme words: 鸚鵡、翡翠、鴛鴦
  • Transliteration words: 肯德基、阿斯匹靈
• Synthetic words
Compound Words
• Subject-predicate (主謂)
  VS: 搬運/工、裁判/員
  SV: 胃/下垂、地/震
• Verb-object (動賓)
  OV: 黨/代表
  VO: 理/髮、反/政府
• Verb-modification (動詞偏正)
  VX: 放大/器、沖印/店
  XV: 自動/控制、批/發
Compound Words (Cont.)
• Predicate-complement (述補)
  VV: 跑/出來、打發/掉
  VA: 染/紅
• Parallel-combination (聯合)
  開/發、學/習、國/家、兄/弟、中/日/韓
• Noun-modification (名詞偏正)
  電/腦、書/架、汽車/站
Morphologically Derived Words
• Merging
  中學 + 小學 → 中小學
  上文 + 下文 → 上下文
• Reduplication
  雄糾糾、研究研究
• Affixation
  Prefix: 副/主席、總/工程
  Infix: 看不到、聽得見
  Suffix: 調查/局、安全/廳
Exceptions
• Abbreviations
  中共 → 中國共產黨
• Factoids
  2007.1.30、五點半、三塊五毛六
• Idioms, proverbs, sayings and poems
  門可羅雀、先天下之憂而憂
Previous Research
• Andi Wu, 2003
  烤麵包器 [toaster] → 烤 [bake] / 麵包 [bread] / 器 [machine]
• C. Huang, 1997
  北京市安全廳 → <w2><w1><w0>北京</w0><w0>市</w0></w1><w1><w0>安全</w0><w0>廳</w0></w1></w2>
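The nested markup in the C. Huang example can be read as a tree whose tag number is the height of each node. A minimal sketch of that reading follows; the nested-list input format and the function names are assumptions made for illustration, and the height-based numbering is inferred from this single example rather than taken from the original work.

```python
def height(node):
    """Height of a node: 0 for a leaf string, else 1 + tallest child."""
    if isinstance(node, str):
        return 0
    return 1 + max(height(child) for child in node)


def to_markup(node):
    """Render a nested-list word tree as <wN>...</wN> markup."""
    level = height(node)
    if isinstance(node, str):
        inner = node
    else:
        inner = "".join(to_markup(child) for child in node)
    return f"<w{level}>{inner}</w{level}>"


# 北京市安全廳 as (北京 + 市) + (安全 + 廳)
tree = [["北京", "市"], ["安全", "廳"]]
print(to_markup(tree))
```

For this tree the function reproduces the markup string shown on the slide.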
Synthetic word analysis with tree-based structure information
• Assumption: words already in the system dictionary can be components of other, unknown synthetic words → classify all synthetic words in the dictionary
• Focus on 3-character words
Synthetic word analysis with tree-based structure information (Cont.)
• Rules take one of the following forms: A + B ➔ Category, or A + B + C ➔ Category
• A, B and C are parts of speech, affixation or other properties of word components.
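Rules of the form A + B ➔ Category can be sketched as a lookup table keyed by the properties of the components. The specific rules, property labels and function name below are hypothetical placeholders chosen for illustration; the slides do not list the actual rule set.

```python
# Hypothetical rule table: component properties -> category.
# These entries are illustrative only, not the authors' rules.
RULES = {
    ("prefix", "N"): "affixation (prefix)",
    ("N", "suffix"): "affixation (suffix)",
    ("N", "N"): "noun-modification",
    ("V", "N", "suffix"): "affixation (suffix)",
}


def classify(component_properties):
    """Return the category for a sequence of component properties, or None."""
    return RULES.get(tuple(component_properties))


print(classify(["N", "N"]))
print(classify(["V", "N", "suffix"]))
print(classify(["A", "A"]))  # no rule applies
```

A table like this keeps two-part and three-part rules in one structure, since the key length distinguishes A + B from A + B + C.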
Annotation of morphologically derived words in dictionary (Cont.)
Preprocess
• Noun-modification words
• Affixation Table Map (from 1000 words)
  Ex. Prefix: 副手、副隊長、副班長
  Suffix: 跳舞、現代舞、芭蕾舞
• Parallel combination, reduplication
• Match the fixed format
  Ex. 研究研究: ABAB; 試看看: BAA
  Formats: AA, ABAB, AABB, AXA, AXAY, XAYA, AAB and ABB.
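Matching a word against fixed reduplication formats can be sketched as variable binding: each pattern letter stands for one character, and repeated letters must bind to the same character. The sketch below is one possible reading, not the authors' implementation. It assumes that different letters must bind to different characters, so the formats stay mutually exclusive, and it treats letters purely as variable names, so 試看看 is reported under the three-character shape ABB.

```python
PATTERNS = ["AA", "ABAB", "AABB", "AXA", "AXAY", "XAYA", "AAB", "ABB"]


def matches(word, pattern):
    """True if word fits pattern, with one distinct character per letter."""
    if len(word) != len(pattern):
        return False
    binding = {}
    for ch, sym in zip(word, pattern):
        # Bind the letter on first sight; later uses must agree.
        if binding.setdefault(sym, ch) != ch:
            return False
    # Assumption: different letters stand for different characters.
    return len(set(binding.values())) == len(binding)


def reduplication_pattern(word):
    """Return the first matching format, or None if none applies."""
    for pattern in PATTERNS:
        if matches(word, pattern):
            return pattern
    return None


for w in ["研究研究", "試看看", "看一看", "電腦"]:
    print(w, reduplication_pattern(w))
```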
Experiment on Morphologically derived words
• Corpus: Chinese Gigaword (CGW)
• 3-char = 1+2 or 2+1 (rarely 1+1+1)
• The word ABC (Ex. 交響曲)
• Mi-pre: the mutual information between A and BC
• Mi-suf: the mutual information between AB and C
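The slides do not give the formula behind Mi-pre and Mi-suf. A minimal sketch, assuming the common pointwise mutual information estimate from corpus counts, might look as follows. The function names, the base-2 logarithm and every count in the toy table are made up for illustration and are not Chinese Gigaword statistics.

```python
import math


def pmi(joint, left, right, total):
    """Pointwise MI from raw counts: log2( P(xy) / (P(x) * P(y)) )."""
    return math.log2(joint * total / (left * right))


def mi_pre(freq, word, total):
    """MI between A and BC for a three-character word ABC."""
    return pmi(freq[word], freq[word[0]], freq[word[1:]], total)


def mi_suf(freq, word, total):
    """MI between AB and C for a three-character word ABC."""
    return pmi(freq[word], freq[word[:2]], freq[word[2]], total)


# Toy counts, invented for this example only.
freq = {"交": 400, "響曲": 20, "交響": 50, "曲": 200, "交響曲": 10}
total = 10000

print(mi_pre(freq, "交響曲", total))
print(mi_suf(freq, "交響曲", total))
```

Comparing the two scores gives a rough signal of whether the word splits more naturally as 1+2 or as 2+1.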
Using SVM
• Features
  • Internal parts: A, C, BC, AB, ABC
  • POS of each internal part: pos(A), pos(C), pos(BC), pos(AB), pos(ABC)
  • Frequency of each part in Chinese Gigaword: fre(A), fre(C), fre(BC), fre(AB), fre(ABC)
  • Mutual information of internal parts: Mi-pre(A-BC), Mi-suf(AB-C)
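The feature list above can be sketched as a function that turns one three-character word into a feature dictionary. The lookup tables, the "UNK" fallback tag, the `str(...)` key names and the toy values are all assumptions made for illustration; the slides do not say how features were encoded or which SVM toolkit was used.

```python
def extract_features(word, pos, freq, mi_pre, mi_suf):
    """Build the 17 features listed above for a three-character word ABC."""
    parts = {
        "A": word[0],
        "C": word[2],
        "BC": word[1:],
        "AB": word[:2],
        "ABC": word,
    }
    feats = {}
    for name, part in parts.items():
        feats[f"str({name})"] = part                 # the internal part itself
        feats[f"pos({name})"] = pos.get(part, "UNK")  # hypothetical fallback tag
        feats[f"fre({name})"] = freq.get(part, 0)     # corpus frequency
    feats["Mi-pre(A-BC)"] = mi_pre
    feats["Mi-suf(AB-C)"] = mi_suf
    return feats


# Toy lookup tables, invented for this example only.
pos = {"交響曲": "N", "曲": "N"}
freq = {"交": 400, "響曲": 20, "交響": 50, "曲": 200, "交響曲": 10}

feats = extract_features("交響曲", pos, freq, mi_pre=3.64, mi_suf=3.32)
for key, value in feats.items():
    print(key, value)
```

A dictionary like this mixes string-valued and numeric features, so it would need to be converted to a numeric vector (for example with scikit-learn's `DictVectorizer`) before being passed to an SVM.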
Result Discussion
• Unbalanced results, because prefixes are much rarer (9.78%)
• Wrong cases: 主色調、土坷垃、小賣部、山大王、市中心、菲軍方、零備件、學聯會
• 大批量: the probability that 量 is a suffix is too high
• 羅影劇: 羅影 is not in the dictionary
Future Work
• Other morphological internal structures
• A thesaurus containing syntactic information about words or characters, to support the analysis
• Build a Chinese synthetic word dictionary