2009. 6. 18.
ISO/TC37/SC4/WG2 Word Segmentation Project Editorial Meeting
Word Segmentation in Korean
Hansaem Kim, The National Institute of the Korean Language
Contents for further work (09.4.24.)
Part 1
1. WU, WSU: check
2. Figure 1: change and check
3. Figure 4
   1) lemma: delete
   2) other lexical items -> other character strings
   3) word forms -> lexical items
   4) bound morpheme: delete
Part 2
1. Terms and definitions: added, e.g. bunsetsu, eojeol, character, etc.
2. Properties of CJK: add to the introductory part
3. In "Scope": Chinese scripts -> Chinese characters
4. Application of the Chinese general rules to JK for combinations of Chinese characters
5. Add examples of agglutinative units in JK
Table of contents (Part 2)
Foreword
1. Introduction: Kim
   1) Differences among CJK
   2) Interaction of CJK (nouns w/ Chinese characters)
2. Scope: Choi
   Application oriented; refer to MAF, SynAF, etc.; linguistic layer & processing (vertical)
3. Terms and definitions
   Bunsetsu: Kanzaki
   Eojeol: Kim
4. Overview and motivation: Kanzaki (main), Sun, Kim
   Mapping table of CJK POS scheme (+ examples and definitions)
5. Chinese word segmentation
   5.1 General rules for identifying WUs in Chinese text
   5.2 Typology of WUs in Chinese
6. Japanese
7. Korean
Terms and definitions
Word unit (WU)
1. Distinction between 'word unit' and 'word segmentation unit'?
   Y: + terms and definition of WSU
   N: correct the definition of WU
2. MWE (phrasal compound, fragment of sentence, ...) ⊂ lexical item?
   Y: no change, or change 'lexical items' into 'lexical items including MWEs'
   N: change 'lexical items' into 'lexical items, MWEs'
Essential concept systems (Figure 4): changed
[Figure 4 labels: Word segmentation unit; Miscellaneous character strings; Word forms]
Word segmentation for CJK (Part2)
Introduction
1) Differences among CJK
2) Interaction of CJK (nouns w/ Chinese characters)
See the document.
Terms and definitions
Eojeol: a linguistic unit separated by white space in Korean text, consisting of a word followed by one or more particles or endings, or of a word alone.
Example: In the sentence “나는 점심을 먹었다.”, “나 (I)” is a pronoun, “는” is a particle, “점심 (lunch)” is a noun, “을” is a particle, and “먹 (eat)” is a verbal stem followed by the endings “었” and “다”. The sentence contains three eojeols: “나는”, “점심을”, and “먹었다”.
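The eojeol definition can be illustrated with a minimal Python sketch. The function name `split_eojeols` and the `ANALYSIS` table are hypothetical; the morpheme breakdown is copied by hand from the example rather than computed, since real morphological analysis needs a lexicon and is outside this sketch.

```python
def split_eojeols(sentence: str) -> list[str]:
    """Split Korean text into eojeols, i.e. units separated by white space.

    Assumption: sentence-final punctuation is stripped so that it does not
    stay attached to the last eojeol.
    """
    stripped = (token.strip(".?!,") for token in sentence.split())
    return [token for token in stripped if token]


# Hand-written analysis of the example sentence (not produced by an analyzer).
ANALYSIS = {
    "나는": [("나", "pronoun"), ("는", "particle")],
    "점심을": [("점심", "noun"), ("을", "particle")],
    "먹었다": [("먹", "verbal stem"), ("었", "ending"), ("다", "ending")],
}

eojeols = split_eojeols("나는 점심을 먹었다.")
print(eojeols)  # ['나는', '점심을', '먹었다']
for eojeol in eojeols:
    print(eojeol, ANALYSIS[eojeol])
```

White space alone yields the three eojeols; splitting each eojeol into word plus particle(s) or ending(s) is the harder step that the word segmentation rules address.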
Overview and motivation
Mapping table of CJK POS scheme
7.1.1. Punctuation
Blank spaces and punctuation marks act as separators between word segmentation units in computer processing. The punctuation marks used as separators include the full stop (.), question mark (?), exclamation mark (!), comma (,), middle dot (․), colon (:), slash (/), quotation marks (“ ”, ‘ ’), brackets (( ), { }, [ ]), dash (―), hyphen (-), swung dash (~), and ellipsis (……), among others. Korean punctuation marks are listed in the “Korean language regulations”.
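As a rough illustration of 7.1.1, the sketch below splits text on white space and on the marks listed above. The names `SEPARATORS` and `split_on_separators` are hypothetical, the list covers only the marks named on the slide (not the full "Korean language regulations"), and the choice to drop the marks rather than keep them as units of their own is an assumption.

```python
import re

# Separator marks taken from the list in 7.1.1 (assumed incomplete: "etc.").
SEPARATORS = ".?!,․:/“”‘’(){}[]―-~…"

# One character class: any listed mark or any white space, repeated.
_SEPARATOR_RUN = re.compile("[" + re.escape(SEPARATORS) + r"\s]+")


def split_on_separators(text: str) -> list[str]:
    """Return the character strings found between separator marks."""
    return [piece for piece in _SEPARATOR_RUN.split(text) if piece]


print(split_on_separators("밥, 사자/사과 (GPS)!"))
# ['밥', '사자', '사과', 'GPS']
```

Treating the hyphen as an unconditional separator also splits hyphenated strings, so a real system would likely need context-dependent handling for some of these marks.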
7.1.2. Combination of characters
7.1.2.1. Numeric character strings: 1984, 2009
7.1.2.2. Foreign character strings: GPS, EU, 同意
7.1.2.3. Hangeul (Korean alphabet) characters (C & V): ㄱㄴㄷ, 가
7.1.2.4. Combination of character strings or other symbols: [abc], {라}
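The four categories in 7.1.2 can be approximated with Unicode checks, as in the following sketch. The function names and the returned labels are hypothetical, and the rule that any non-Hangeul letter counts as "foreign" is a simplifying assumption, not a definition taken from the slides.

```python
def is_hangeul(ch: str) -> bool:
    """True for Hangeul syllables and jamo (consonant and vowel letters)."""
    code = ord(ch)
    return (
        0xAC00 <= code <= 0xD7A3      # Hangul Syllables, e.g. 가
        or 0x3130 <= code <= 0x318F   # Hangul Compatibility Jamo, e.g. ㄱ
        or 0x1100 <= code <= 0x11FF   # Hangul Jamo
    )


def classify(string: str) -> str:
    """Assign a character string to one of the 7.1.2 categories."""
    if not string:
        raise ValueError("empty string")
    if all(ch in "0123456789" for ch in string):
        return "numeric"
    if all(is_hangeul(ch) for ch in string):
        return "hangeul"
    # Assumption: letters outside Hangeul (Latin, Chinese characters) are foreign.
    if all(ch.isalpha() and not is_hangeul(ch) for ch in string):
        return "foreign"
    return "combination"


for example in ["1984", "GPS", "同意", "ㄱㄴㄷ", "가", "[abc]", "{라}"]:
    print(example, classify(example))
```

Strings that mix scripts or include symbols, such as [abc] and {라}, fall through to the combination category.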
7.1.3. Word
7.1.3.1. Simplex: 사자, 밥
7.1.3.2. Compound: 농목장, 검붉다
7.1.3.3. Derivation: 풋사과, 신사적, 동의하다
7.1.3.4. Abbreviation: 건교위, 노찾사
7.1.3.5. Idiomatic expression w/ Chinese characters: 와신상담(臥薪嘗膽), 오십보백보(五十步百步)
7.1.4. Combination of words (MWEs)
7.1.4.1. Phrasal compound
   1) General phrasal compound: 주민 번호
   2) Terminology: 민주 국가, 계급 사회
   3) Expressions related to proper nouns: 예술의 전당
7.1.4.2. Idiom
   1) Lexical idiom: 무릎을 꿇다
   2) Grammatical idiom: ~로 인해, ~을 위해
7.1.4.3. Fixed expression (proverb, motto, etc.): 낫 놓고 기역 자도 모른다
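One way an application might recognise such MWEs is greedy longest-match lookup over a sequence of eojeols. The sketch below is an illustration only, not a method described in the slides: `MWE_LEXICON` and `group_mwes` are hypothetical names, the lexicon holds just the examples above, and matching is on exact surface forms, so inflected variants (for instance a different ending on 꿇다) would not be found.

```python
# Tiny lexicon built from the 7.1.4 examples; each MWE is a tuple of eojeols.
MWE_LEXICON = {
    ("주민", "번호"),
    ("민주", "국가"),
    ("계급", "사회"),
    ("예술의", "전당"),
    ("무릎을", "꿇다"),
    ("낫", "놓고", "기역", "자도", "모른다"),
}


def group_mwes(eojeols: list[str], lexicon=MWE_LEXICON) -> list[str]:
    """Merge eojeol runs that exactly match a lexicon entry, longest first."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    units = []
    i = 0
    while i < len(eojeols):
        # Try the longest candidate first, down to two eojeols.
        for n in range(min(max_len, len(eojeols) - i), 1, -1):
            candidate = tuple(eojeols[i:i + n])
            if candidate in lexicon:
                units.append(" ".join(candidate))
                i += n
                break
        else:
            # No multiword match starts here: keep the single eojeol.
            units.append(eojeols[i])
            i += 1
    return units


print(group_mwes(["예술의", "전당", "근처"]))
# ['예술의 전당', '근처']
```

Grammatical idioms such as ~로 인해 attach to a preceding word and would need pattern-based rather than exact-match lookup.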
Overall typology (See the document.)
1. Noun
   1.1 Common noun
   1.2 Proper noun
   1.3 Bound noun
2. Pronoun
3. Numeral
4. Verb
5. Auxiliary verb
6. Copula
7. Adjective
8. Auxiliary adjective
9. Adnoun
10. Adverb
11. Exclamation
12. Particle
   12.1 Case particle
   12.2 Auxiliary particle