KKAP: KAIST Korean Analysis Platform
Morphological Analyzer, POS Tagger, Parser
Sangwon Park
January 20, 2011
Research Goal
• The goal of this research is to develop KKAP (KAIST Korean Analysis Platform), an infrastructure for Korean natural language analysis.
• KKAP will be flexible and easy to use so that it can be applied widely in various areas. The platform will include a morphological analyzer, a POS tagger, a parser, etc.
KKAP: KAIST Korean Analysis Platform
Workflow for Korean Analysis
• Phase 1. Text Preprocessing
• Phase 2. Morphological Analysis
• Phase 3. POS Tagging
• Phase 4. Parsing
• The phases are built from Major Plugins and Supplement Plugins taken from a Plugin Pool.
• Plugin Pool (Phase 1 to Phase 4 plugins): Sentence Segmentation, Auto Spacing, Input Filter, Unknown Term Processing, Chart-based Morph Analyzer, Tag Mapper, Noun Extraction, HMM-based POS Tagging, Noun Phrase Extractor, Verb Phrase Extractor, Chart Parser
• Korean Document (input): 7일 저녁 발표예정인 노벨문학상의 유력 수상자로 고은 시인이 거론되고 있다. AP통신은 스웨덴의 노벨상 관측통들 사이에 한국의 고은 시인이 시리아의 시인 아도니스와 함께 올해 노벨상 수상 가능성이 큰 후보로 가장 많이 거론됐다고 전했다. …
  (English: The poet Ko Un is being mentioned as a leading candidate for the Nobel Prize in Literature, to be announced on the evening of the 7th. AP reported that among Swedish Nobel Prize watchers, the Korean poet Ko Un, together with the Syrian poet Adonis, was mentioned most often as a candidate likely to win this year's prize. …)
• Analyzed Korean Document (output): 7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm 노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca 고은/nq 시인/ncn+이/jcc 거론/ncpa+되/xsv+고/ecc 있/paa+다/ef ./sf 통신은 통/ncn+신/ncn+은/jxc 스웨덴/nq+의/jcm 노벨상/ncn 관측통/ncn+들/xsn 사이/ncn+에/jca …
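The analyzed output on this slide writes each eojeol as morpheme/tag pairs joined by "+". As a rough illustration of how a downstream plugin such as Noun Extraction might consume that notation, here is a minimal Python sketch. The function names `parse_eojeol` and `extract_nouns` are hypothetical and not part of KKAP, eojeol analyses are assumed to be separated by whitespace, and the choice of the tag prefixes "nc" and "nq" as "nouns" is an assumption about the tag set made only for this example.

```python
def parse_eojeol(analysis):
    """Split one eojeol analysis 'm1/t1+m2/t2' into [(m1, t1), (m2, t2)].

    Assumption: '+' only separates morphemes and the tag follows the last '/'.
    """
    pairs = []
    for chunk in analysis.split("+"):
        morpheme, tag = chunk.rsplit("/", 1)
        pairs.append((morpheme, tag))
    return pairs


def extract_nouns(analyses, prefixes=("nc", "nq")):
    """Collect morphemes whose tag starts with one of the given prefixes.

    The default prefixes are an illustrative assumption, not KKAP's definition.
    """
    return [
        morpheme
        for analysis in analyses
        for morpheme, tag in parse_eojeol(analysis)
        if tag.startswith(prefixes)
    ]


# First few eojeols of the slide's analyzed output, whitespace-separated.
sample = ("7/nnc+일/nbu 저녁/ncn 발표예정/ncpa+이/jp+ㄴ/etm "
          "노벨문학상/nq+의/jcm 유력/ncps 수상자/ncn+로/jca")

for eojeol in sample.split():
    print(parse_eojeol(eojeol))
print(extract_nouns(sample.split()))
```

Splitting on the last "/" rather than the first keeps a morpheme that itself contains a slash intact; a literal "+" morpheme would need extra handling that this sketch omits.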
Target Users
• The Korean parser can support other research that needs Korean analysis.
• The major goal is to make the parser useful for the research projects shown in the diagram.
• I plan to work on a dependency parser so that I can follow up on and improve our laboratory's previous research and the existing parser.
[Diagram labels: HanNanum Parser; Smart Calendar Project, Korean E-mail Analysis; Multi-lingual Knowledge Sync. on Wikipedia, Korean Wikipedia Analysis]
Korean Syntactic Tagged Corpus
• KAIST Syntactic Tagged Corpus
• http://bora.or.kr
• Corpus 5. Manual sentence analysis corpus
• 31,091 sentences from 97 different sources
• Length: 1 to 33 Eojeols; average 11.35 Eojeols
• Related documents
• Kong Joo Lee, ByungGyu Chang, Gil Chang Kim, “Bracketing Guidelines for Korean Syntactic Tree Tagged Corpus Version 1”, KAIST CS Department Technical Report, CS/TR-97-112, 1997 (in Korean)
• ByungGyu Chang, Kong Joo Lee, Gil Chang Kim, “Design and Implementation of Tree Tagging Workbench To Build a Large Tree Tagged Corpus of Korean”, Proceedings of the Conference on Hangul and Korean Language Information Processing, pp. 421-429, 1997 (in Korean)
Korean Syntactic Tagged Corpus
• KAIST Syntactic Tagged Corpus
[4226] ; 물론 꼭 필요할 땐 어디서든지 부르짖어야지요.
(English: Of course, when it is truly necessary, one should cry out wherever one is.)
((((((물론/mag )0Mag- ((((((꼭/mag )0Mag- ((필요/ncps )0Ncps+ 하/xsm )1Paa )MgPaa+ ㄹ/etm )emPaa- (때/nbn )0Nbn )EmNbn+ ㄴ/jxt )jtNbn- ((((어디/npd )0Npd+ 에서/jca )jaNpd+ 든지/jxc )jxNpd- (부르짖/pvg )0Pvg )JxPvg )JtPvg )MgPvg+ 어야지/ef )efPvg+ 요/jxf )jfPvg+ (./sf )0Sf )sfPvg )S
0 : 0Mag -> mag
1 : 0Mag -> mag
2 : 0Ncps -> ncps
3 : 1Paa -> 0Ncps+xsm
4 : MgPaa -> 0Mag 1Paa
5 : emPaa -> MgPaa+etm
6 : 0Nbn -> nbn
7 : EmNbn -> emPaa 0Nbn
8 : jtNbn -> EmNbn+jxt
9 : 0Npd -> npd
10 : jaNpd -> 0Npd+jca
11 : jxNpd -> jaNpd+jxc
12 : 0Pvg -> pvg
13 : JxPvg -> jxNpd 0Pvg
14 : JtPvg -> jtNbn JxPvg
15 : MgPvg -> 0Mag JtPvg
16 : efPvg -> MgPvg+ef
17 : jfPvg -> efPvg+jxf
18 : 0Sf -> sf
19 : sfPvg -> jfPvg+0Sf
20 : S -> sfPvg
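The numbered rules on this slide appear to follow from the bracketed tree by a post-order walk: each closing bracket carries its node label, a trailing "+" on a label seems to mark attachment inside the same eojeol, and a trailing "-" seems to mark an eojeol boundary. The Python sketch below reproduces the rule list under that reading. The reading itself is inferred from this single example rather than from the bracketing guidelines, and `extract_rules` is a hypothetical helper name.

```python
import re

# Tokens: "(", a ")" fused with its node label (e.g. ")0Mag-"), or a morpheme/tag leaf.
TOKEN = re.compile(r"\(|\)[^\s()]*|[^\s()]+")


def extract_rules(bracketed):
    """Return 'Label -> children' rules in post-order (assumed convention)."""
    rules = []
    stack = [[]]  # one list of (symbol, joiner) children per open bracket
    for tok in TOKEN.findall(bracketed):
        if tok == "(":
            stack.append([])
        elif tok.startswith(")"):
            children = stack.pop()
            label, joiner = tok[1:], ""
            if label.endswith(("+", "-")):
                label, joiner = label[:-1], label[-1]
            rhs = ""
            for i, (symbol, join) in enumerate(children):
                rhs += symbol
                if i < len(children) - 1:
                    # "+" joins within an eojeol; anything else becomes a space.
                    rhs += "+" if join == "+" else " "
            rules.append(f"{label} -> {rhs}")
            stack[-1].append((label, joiner))
        else:
            # A leaf contributes only its tag to the parent rule.
            stack[-1].append((tok.rsplit("/", 1)[1], ""))
    return rules


sample = (
    "((((((물론/mag )0Mag- ((((((꼭/mag )0Mag- ((필요/ncps )0Ncps+ 하/xsm )1Paa "
    ")MgPaa+ ㄹ/etm )emPaa- (때/nbn )0Nbn )EmNbn+ ㄴ/jxt )jtNbn- "
    "((((어디/npd )0Npd+ 에서/jca )jaNpd+ 든지/jxc )jxNpd- (부르짖/pvg )0Pvg "
    ")JxPvg )JtPvg )MgPvg+ 어야지/ef )efPvg+ 요/jxf )jfPvg+ (./sf )0Sf )sfPvg )S"
)

for number, rule in enumerate(extract_rules(sample)):
    print(f"{number} : {rule}")
```

The sketch assumes that parentheses never occur as morphemes and that every leaf contains a "/"; real corpus files would need those cases handled.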
Korean Syntactic Tagged Corpus
• Sejong Syntactic Tagged Corpus
• I got the latest release from the National Institute of the Korean Language this week.
• Released in December 2010
• 15 documents
• 433,839 Eojeols / 43,828 sentences
; 프랑스의 세계적인 의상 디자이너 엠마누엘웅가로가 실내 장식용 직물 디자이너로 나섰다.
(English: Emanuel Ungaro, France's world-renowned fashion designer, has stepped forward as a designer of fabrics for interior decoration.)
(S (NP_SBJ (NP (NP_MOD 프랑스/NNP + 의/JKG) (NP (VNP_MOD 세계/NNG + 적/XSN + 이/VCP + ᆫ/ETM) (NP (NP 의상/NNG) (NP 디자이너/NNG)))) (NP_SBJ (NP 엠마누엘/NNP) (NP_SBJ 웅가로/NNP + 가/JKS))) (VP (NP_AJT (NP (NP (NP 실내/NNG) (NP 장식/NNG + 용/XSN)) (NP 직물/NNG)) (NP_AJT 디자이너/NNG + 로/JKB)) (VP 나서/VV + 었/EP + 다/EF + ./SF)))
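Unlike the KAIST sample, the Sejong tree puts the label right after the opening bracket, and each lowest-level node holds one eojeol written as morpheme/TAG items separated by " + ". A minimal Python sketch of reading the eojeol-level nodes out of such a tree follows. The helper name `eojeol_nodes` is hypothetical, and the sketch assumes that parentheses never occur as morphemes and that morphemes appear only in the lowest-level nodes.

```python
import re

# Tokens: "(", ")", or any run of non-space, non-parenthesis characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def eojeol_nodes(tree_text):
    """Return [(phrase_label, [(morpheme, tag), ...]), ...] in sentence order."""
    nodes = []
    labels = []           # labels of the currently open constituents
    expect_label = False  # True right after "("
    current = None        # the eojeol node being filled, if any
    for tok in TOKEN.findall(tree_text):
        if tok == "(":
            expect_label = True
        elif tok == ")":
            labels.pop()
            current = None
        elif expect_label:
            labels.append(tok)
            expect_label = False
            current = None
        elif tok != "+":  # a bare "+" only separates morphemes
            if current is None:
                current = (labels[-1], [])
                nodes.append(current)
            morpheme, tag = tok.rsplit("/", 1)
            current[1].append((morpheme, tag))
    return nodes


sample = (
    "(S (NP_SBJ (NP (NP_MOD 프랑스/NNP + 의/JKG) (NP (VNP_MOD 세계/NNG + 적/XSN + 이/VCP + ᆫ/ETM) "
    "(NP (NP 의상/NNG) (NP 디자이너/NNG)))) (NP_SBJ (NP 엠마누엘/NNP) (NP_SBJ 웅가로/NNP + 가/JKS))) "
    "(VP (NP_AJT (NP (NP (NP 실내/NNG) (NP 장식/NNG + 용/XSN)) (NP 직물/NNG)) "
    "(NP_AJT 디자이너/NNG + 로/JKB)) (VP 나서/VV + 었/EP + 다/EF + ./SF)))"
)

for label, morphemes in eojeol_nodes(sample):
    print(label, morphemes)
```

For a dependency parser, these eojeol-level nodes and their function labels (for example `NP_SBJ`, `NP_AJT`) would be a natural starting point, although converting constituents to dependencies needs head rules that this sketch does not attempt.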