HanNanum Project Sangwon Park 2010.11.24
Contents
• The result of applying plug-in component based architecture
• Key differences from the previous HanNanum (jhannanum ver.0.7.4)
• GUI demo
• A measurement of the morphological analyzer
  • Features of Korean morphological analysis
  • Measurement 1. Strict criteria
  • Measurement 2. Loose criteria
The result of applying plug-in component based architecture
• HanNanum ver.0.8 was released
  • Plug-in component based architecture
  • Faster analysis speed
    • Object based communication
    • Reduced overhead between components
  • More accurate results
    • Several bugs were fixed
• GUI Demo
  • It helps people understand the concept of the HanNanum workflow
  • People can test various workflows for their own purposes
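As a rough illustration of the plug-in idea (this is not HanNanum's actual API), a workflow can be modelled as an ordered list of components that hand an in-memory object to the next stage, rather than re-serializing text between stages. Every class and function name below is hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical names throughout; this is not the HanNanum API.


@dataclass
class Sentence:
    """In-memory object handed from one plug-in to the next."""
    text: str
    eojeols: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)


# A plug-in is any callable that takes a Sentence and returns a Sentence.
Plugin = Callable[[Sentence], Sentence]


def whitespace_splitter(s: Sentence) -> Sentence:
    # Toy stand-in for a preprocessing plug-in: split on whitespace.
    s.eojeols = s.text.split()
    return s


def eojeol_counter(s: Sentence) -> Sentence:
    # Toy stand-in for a later stage that reads the previous stage's output.
    s.notes.append(f"{len(s.eojeols)} eojeols")
    return s


class Workflow:
    """Runs the registered plug-ins in order on one shared object."""

    def __init__(self) -> None:
        self.plugins: List[Plugin] = []

    def append(self, plugin: Plugin) -> "Workflow":
        self.plugins.append(plugin)
        return self

    def run(self, text: str) -> Sentence:
        sentence = Sentence(text)
        for plugin in self.plugins:
            sentence = plugin(sentence)
        return sentence


result = Workflow().append(whitespace_splitter).append(eojeol_counter).run("집에 가시는")
print(result.eojeols, result.notes)
```

Swapping, adding, or removing a stage only changes the calls to `append`, which is the kind of experimentation the GUI demo is meant to make easy.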
GUI Demo
[Screenshot of the GUI demo with labeled areas: Workflow, Information of a plug-in, Plug-in Pool, Workflow control, Input & Output]
A measurement of the morphological analyzer
[Diagram label: POS Tagger]
A measurement of the morphological analyzer
• Features of Korean morphological analysis
  • 가시는
    • 가시/noun + 는/josa (thorn, prickle)
    • 가시/verb + 는/eomi (leave, disappear)
    • 가/verb + 시/eomi + 는/eomi (go)
    • 갈/verb + 시/eomi + 는/eomi (grind, sharpen)
  • Ambiguity of part-of-speech
  • Ambiguity of morpheme segmentation
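The two kinds of ambiguity can be reproduced with a small sketch that enumerates every split of an eojeol into entries of a toy lexicon, with every tag each entry allows. The lexicon below is an assumption made for illustration. The sketch does not restore irregular stems (so it cannot produce the 갈 analysis) and applies no tag-connection rules, so it over-generates.

```python
from typing import Dict, List, Tuple

# Toy lexicon (an assumption for illustration): surface form -> possible tags.
LEXICON: Dict[str, List[str]] = {
    "가시": ["noun", "verb"],
    "가": ["verb"],
    "시": ["eomi"],
    "는": ["josa", "eomi"],
}

Analysis = List[Tuple[str, str]]  # sequence of (morpheme, tag) pairs


def analyses(word: str) -> List[Analysis]:
    """Enumerate every split of `word` into lexicon entries, with every tag."""
    if not word:
        return [[]]  # one way to analyze the empty string: no morphemes
    results: List[Analysis] = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        for tag in LEXICON.get(prefix, []):
            for rest in analyses(word[end:]):
                results.append([(prefix, tag)] + rest)
    return results


candidates = analyses("가시는")
for cand in candidates:
    print(" + ".join(f"{m}/{t}" for m, t in cand))
```

With this lexicon the eojeol has two segmentations (가 + 시 + 는 and 가시 + 는), and each segmentation carries several tag assignments, some of which a real analyzer would reject.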
Evaluation Metrics
[Diagram label: POS Tagger]
• Input
  • 집에 가시는
• Output
  • 집에
    • 집/pvg+에/ecx
    • 집/pvg+에/jca
  • 가시는
    • 가시/ncn+는/jxc
    • 갈/pvg+시/ep+는/etm
    • 가/pvg+시/ep+는/etm
    • 가/px+시/ep+는/etm
• Correct Analysis
  • 집에
    • 집/ncn+에/jca
  • 가시는
    • 가/pvg+시/ep+는/etm
Evaluation Metrics
• Measurement 1. Strict criteria
  • An analysis result is counted as correct only when it is exactly the same as the corpus annotation.
  • The measurement can be performed automatically on a large amount of test data.
  • These criteria have not been used in papers on Korean morphological analyzers.
• Measurement 2. Loose criteria
  • There can be several correct answers for an input eojeol.
  • Only a few tags, such as {N, P, M, I, J, E, X, F, S}, are considered.
  • Most papers use these criteria and report that their analyzers show around 98% accuracy.
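The difference between the two criteria can be sketched on the 집에 가시는 example. Two simplifications are assumptions of this sketch rather than statements from the slides: the coarse tag is taken to be the upper-cased first letter of the fine-grained tag, and the loose check compares against a single gold analysis instead of a set of acceptable answers.

```python
from typing import Dict, List, Tuple

Morph = Tuple[str, ...]  # (morpheme, tag)


def parse(analysis: str) -> List[Morph]:
    """Turn '가/pvg+시/ep+는/etm' into [('가', 'pvg'), ('시', 'ep'), ('는', 'etm')]."""
    return [tuple(part.rsplit("/", 1)) for part in analysis.split("+")]


def coarse(tag: str) -> str:
    # Assumption: the coarse category is the first letter of the fine tag.
    return tag[0].upper()


def strict_match(candidate: str, gold: str) -> bool:
    # Morphemes and fine-grained tags must all be identical.
    return parse(candidate) == parse(gold)


def loose_match(candidate: str, gold: str) -> bool:
    # Morphemes must be identical; tags are compared by coarse category only.
    cand = [(m, coarse(t)) for m, t in parse(candidate)]
    ref = [(m, coarse(t)) for m, t in parse(gold)]
    return cand == ref


output: Dict[str, List[str]] = {
    "집에": ["집/pvg+에/ecx", "집/pvg+에/jca"],
    "가시는": ["가시/ncn+는/jxc", "갈/pvg+시/ep+는/etm",
            "가/pvg+시/ep+는/etm", "가/px+시/ep+는/etm"],
}
gold: Dict[str, str] = {"집에": "집/ncn+에/jca", "가시는": "가/pvg+시/ep+는/etm"}

for eojeol, candidates in output.items():
    strict_hits = [c for c in candidates if strict_match(c, gold[eojeol])]
    loose_hits = [c for c in candidates if loose_match(c, gold[eojeol])]
    print(eojeol, len(strict_hits), len(loose_hits))
```

In this sketch 가/px+시/ep+는/etm passes the loose check but fails the strict one, because px and pvg share the coarse category P. No candidate for 집에 passes either check, since the stem is tagged as a predicate rather than a noun.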
Measurement 1. Strict criteria
Input Data
• Test Corpus: BORA Corpus 2, aligned morpheme analysis corpus
• Test Set: 20 sentences, each with more than 10 eojeols, from each of 68 documents
• # of sentences: 1360
• # of eojeols: 25515
Result
• # of generated eojeols: 74415
• # of eojeols which are restored and segmented correctly: 23605
• # of eojeols which are tagged correctly: 19147
• Precision: 19147/25515 (0.75)
• Recall: 19147/74415 (0.26)
• F-measure: 0.38
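The reported figures can be checked with a few lines of arithmetic. The sketch below follows the slide's labeling, where "Precision" is the correctly tagged count divided by the number of eojeols in the test set and "Recall" is the same count divided by the number of generated eojeols. The more common convention assigns the two names the other way round, but the F-measure is unaffected because the harmonic mean is symmetric. The function and parameter names are hypothetical.

```python
from typing import Tuple


def slide_metrics(correct: int, test_eojeols: int, generated: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f_measure) using the slide's labeling."""
    precision = correct / test_eojeols  # labeled "Precision" on the slide
    recall = correct / generated        # labeled "Recall" on the slide
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure


p, r, f = slide_metrics(correct=19147, test_eojeols=25515, generated=74415)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # prints "0.75 0.26 0.38"
```

The low second ratio reflects that the analyzer generates roughly three candidate analyses per input eojeol (74415 for 25515), of which at most one can match the corpus exactly.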
Measurement 2. Loose criteria
Larger Morpheme Dictionary
• The morpheme dictionary was extended with the corpus
• 29098 morpheme+tag entries were added
Input
• Test Corpus: BORA Corpus 2, aligned morpheme analysis corpus
• Test Set: 2 sentences, each with more than 10 eojeols, from each of 68 documents
• # of sentences: 136
• # of eojeols: 2527
Result
• # of generated eojeols: 30536
• # of eojeols which are restored and segmented correctly: 2340
• # of eojeols which are tagged correctly: 2041
• Precision: 2041/2527 (0.81)
• Recall: 2041/30536 (0.07)
• F-measure: 0.12
Thank you HAPPY CILAB