Some Thoughts on an International Standard for Word Segmentation
Sun Maosong, Tsinghua University
February 1, 2007, Xishuangbanna
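Purpose
Word segmentation
• 据认为,俄罗斯方面已经进入控制首都的最后阶段。 (Chinese: “It is believed that the Russian side has entered the final stage of taking control of the capital.”)
• 据 认为 , 俄罗斯 方面 已经 进入 控制 首都 的 最后 阶段 。
• ロシア側は首都制圧の最終段階に入ったとみられる。 (Japanese, same meaning)
• ロシア 側 は 首都 制圧 の 最終 段階 に 入った と み られる 。
Asia: Japanese, Vietnamese, Thai, Korean (*), etc.
Within China: ethnic minority languages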
Purpose
• Why “word segmentation”, and for what purpose?
• To define what the output should be for an input text after word segmentation, pursuing the greatest possible consistency of segmentation within and among texts, so as to meet the requirements of a variety of natural language information processing applications, both mono-lingual and multi-lingual
• Without word segmentation there is no “word”, and hence no content computing or management of text in NLP
• A benchmark for word segmentation evaluation
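To make the benchmark idea concrete, here is a minimal sketch of one common way to score a segmenter against a reference segmentation: word-level precision, recall and F1 over character spans. The slides do not specify a metric, so the metric, the helper names `spans` and `prf`, and the "predicted" output below are all assumptions for illustration; only the gold segmentation comes from the example sentence above.

```python
def spans(tokens):
    """Map a token list to the set of (start, end) character spans it covers."""
    out = set()
    start = 0
    for tok in tokens:
        out.add((start, start + len(tok)))
        start += len(tok)
    return out


def prf(gold, pred):
    """Word-level precision, recall and F1 of `pred` against `gold`."""
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and predicted segmentations cover different text")
    g, p = spans(gold), spans(pred)
    correct = len(g & p)  # words whose both boundaries match the reference
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Gold segmentation from the slide; the prediction is a made-up system output
# that wrongly merges the first two words.
gold = "据 认为 , 俄罗斯 方面 已经 进入 控制 首都 的 最后 阶段 。".split()
pred = "据认为 , 俄罗斯 方面 已经 进入 控制 首都 的 最后 阶段 。".split()

precision, recall, f1 = prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Scoring on spans rather than on token strings means a word only counts as correct when both of its boundaries agree with the reference, which is the property a segmentation benchmark needs.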
Word Seg. Standard Series
• Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 1: General principles and methods
Proposed by: CNIS, China
Project leaders: Prof. Sun Maosong (Tsinghua U., China), Prof. Sue Ellen Wright (Kent State University, USA), Prof. Budin (Vienna U., Austria)
Experts and actively involved scholars: Dr. Galinski; experts from USA (ANSI,…); Prof. Benjamin Tsou (City U of Hong Kong); Prof. Chu-ren Huang (Academia Sinica, Taipei); Prof. Virach (Thailand); Ms. Song Min (CNIS, China); ……
Word Seg. Standard Series
• Language resource management - Word segmentation of written texts for mono-lingual and multi-lingual information processing - Part 2: Word Segmentation for Chinese, Japanese and Korean
Proposed by: CNIS, China
Project leaders: Prof. Sun Maosong (Tsinghua U., China), Prof. Choi (KAIST, Korea), Dr. Isahara (NICT, Japan)
Experts and actively involved scholars: Dr. Galinski; experts from USA (ANSI,…); Prof. Benjamin Tsou (City U of Hong Kong); Prof. Chu-ren Huang (Academia Sinica, Taipei); Ms. Song Min (CNIS, China); ……
Related Activities
• Early August 2004, Paris: NWIP at the ISO TC37 meetings
• Late Jan. 2005: NWIP approved by ISO
• Late April 2005, Yantai: about 30 Chinese linguists and computational linguists held two days of discussions on Chinese word segmentation
• Early July 2005, CNIS: discussion with Dr. Galinski and Prof. Choi
• Late July 2005, CNIS: discussion with Prof. Budin
• Late July 2005, Japan: at the invitation of Dr. Isahara of NICT, a two-day discussion among Prof. Choi, Dr. Isahara and Sun Maosong
Related Activities
• August 25, 2005, Warsaw: ISO TC37 meeting, one-day meeting on the word segmentation standard
• Oct. 2005, Jeju, Korea: discussion in two related workshops (organized by Prof. Choi, presented by Sun Maosong; Dr. Isahara also attended): (1) Workshop on Asian Language Resources (ALR); (2) SIGHAN workshop (3 hours of intensive discussion)
• Nov. 2005, Tokyo: EFTerm
• Jan. 17, 2006, Beijing: small-scale workshop with Chinese scholars
• Jan. 20, 2006, Jeju, Korea: ISO TC37 SC4 meeting
• August 2006, Beijing, ......
Today’s Discussion: Focus on Part 1: General Principles and Methods
Concept system (62 core concepts, 13 peripheral)
• Word: A basic grammatical unit of a language, and a relatively independent carrier of meaning, that can stand alone to make up sentences. The unit is intuitively and mentally available to native speakers. In the context of a given language, a word is codified as a lexeme in the lexicon, with at least one part of speech. A word consists of at least one morpheme.
• Lexeme: A basic abstract unit of the lexicon which may be realized in different word forms. A simple(r) lexeme can be part of another, complex lexeme (through the processes of derivation and compounding), and free morphemes are the simplest lexemes. In its broader sense, lexeme is also used synonymously with word.
• Word form: The concretely realized grammatical form of a word, or equivalently of a lexeme in the lexicon, according to its grammatical categories in the context of a sentence.
(English) find, found, and finding are word forms of the lexeme FIND
General principles and methods
4.1 Principles in applying this Standard to the text
4.1.1 Principle of full coverage: The standard should be applicable to any text that needs word segmentation.
4.1.2 Principle of consistency: The standard should be applied in a consistent way to any text, and the output of applying the standard should also be consistent.
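The consistency principle can be checked mechanically on a segmented corpus. The sketch below is only one possible check, under the assumption that a corpus is a list of token lists: it reports strings that appear as a single unit in one place and as a sequence of several units elsewhere. The function names and the two-sentence toy corpus are hypothetical and not part of the Standard.

```python
from collections import defaultdict


def boundaries(tokens):
    """Return the character offsets at which a token starts or ends."""
    cuts = {0}
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)
    return cuts


def inconsistent_units(corpus):
    """List strings segmented as one unit somewhere and as several units elsewhere."""
    vocab = {tok for sent in corpus for tok in sent if len(tok) > 1}
    seen = defaultdict(set)  # string -> subset of {"whole", "split"}
    for sent in corpus:
        text = "".join(sent)
        cuts = boundaries(sent)
        for word in vocab:
            start = text.find(word)
            while start != -1:
                end = start + len(word)
                # Only occurrences aligned with token boundaries are comparable.
                if start in cuts and end in cuts:
                    inner = any(start < c < end for c in cuts)
                    seen[word].add("split" if inner else "whole")
                start = text.find(word, start + 1)
    return sorted(w for w, kinds in seen.items() if kinds == {"whole", "split"})


# Toy corpus (invented): the same string is one unit in the first sentence
# and three units in the second.
corpus = [
    ["傣族风情园", "令", "人", "流连忘返"],
    ["傣族", "风情", "园", "很", "大"],
]
print(inconsistent_units(corpus))
```

A report like this does not say which segmentation is right; it only flags the places where a guideline decision is needed.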
General principles and methods
4.2 The universal principle of morphology: All languages have words and all languages have morphemes.
General principles and methods
4.3 Principles for validating the word-hood of a linguistic unit
4.3.1 Principles from the linguistic perspective
In general, all the linguistic principles regarding word formation hold.
(1) Principle of bound morpheme.
(2) Principle of lexical integrity hypothesis.
(3) Principle of unpredictability of a word meaning from its subparts.
(4) Principle of idiomatization.
(5) Principle of collocation.
(6) Principle of unproductivity.
General principles and methods
4.3.2 Principles from the practical (pragmatic) perspective
(1) Principle of frequency: Frequency is a basic index of the degree of lexicalization of a linguistic unit.
(2) Gestalt principle in cognitive linguistics: Things are likely to be perceived as a whole.
(3) Principle of prototype members in categories: According to the prototype theory of the mental lexicon, prototype members of a category are more salient than non-prototype members, are more accurately remembered in short-term memory, and are more easily retained and accessed in long-term memory by humans.
(4) Principle of language economy: For a linguistic unit, if its inclusion in the lexicon can decrease the difficulty of later linguistic analysis, then it is likely to be a lexical item.
Example: 大中小学 (“universities, secondary schools and primary schools”)
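As a rough illustration of the frequency principle, the following sketch counts how often each character n-gram occurs in raw, unsegmented text, so that a candidate unit can be compared with its competitors. The slides give no procedure for this; the helper name `ngram_counts` and the three toy strings are invented for the example, and a real study would use a large representative corpus.

```python
from collections import Counter


def ngram_counts(texts, n):
    """Count every character n-gram of length `n` in a list of raw strings."""
    counts = Counter()
    for text in texts:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


# Invented toy strings; not corpus data.
texts = ["大中小学教师", "全国大中小学", "大学和中学"]
counts = ngram_counts(texts, 4)
print(counts["大中小学"], counts["中小学教"])
```

Raw frequency alone is a weak signal, which is presumably why the slide lists it next to the other pragmatic principles rather than on its own.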
General principles and methods
4.4 The full entry principle of the lexicon: All the words which ‘exist’ are listed in the lexicon. The lexicon should be dynamic, adapting to changes in language usage.
General principles and methods
4.5 Principles for word segmentation output
(1) Principle of granularity.
• 傣族风情园令人流连忘返. (“The Dai folk-custom garden makes visitors linger, reluctant to leave.”)
• 傣族风情园 | 令 | 人 | 流连忘返.
• 傣族 | 风情 | 园 | 令 | 人 | 流连忘返.
• (((傣 族) 风情) 园) 令 人 流连忘返.
(2) Principle of the scope maximization of affixation.
(3) Principle of the scope maximization of compounding with respect to a lexicon.
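The bracketed form in the granularity example can be read as a tree from which segmentations of different grain are derived. The sketch below assumes a nested-list encoding of those brackets and a hypothetical `depth` parameter that says how many bracket levels to open; neither is prescribed by the Standard, and the final full stop is left out for brevity.

```python
from typing import List, Union

# A node is either a plain string or a list of nodes (one bracket level).
Node = Union[str, List["Node"]]


def text_of(node: Node) -> str:
    """Concatenate all the characters under a node."""
    if isinstance(node, str):
        return node
    return "".join(text_of(child) for child in node)


def cut(node: Node, depth: int) -> List[str]:
    """Segment one node, opening at most `depth` levels of brackets."""
    if isinstance(node, str) or depth == 0:
        return [text_of(node)]
    units: List[str] = []
    for child in node:
        units.extend(cut(child, depth - 1))
    return units


def segment(sentence: List[Node], depth: int) -> List[str]:
    """Segment a sentence given as a list of top-level nodes."""
    units: List[str] = []
    for item in sentence:
        units.extend(cut(item, depth))
    return units


# (((傣 族) 风情) 园) 令 人 流连忘返
sentence = [[[["傣", "族"], "风情"], "园"], "令", "人", "流连忘返"]

print(" | ".join(segment(sentence, 0)))  # 傣族风情园 | 令 | 人 | 流连忘返
print(" | ".join(segment(sentence, 2)))  # 傣族 | 风情 | 园 | 令 | 人 | 流连忘返
```

Depth 0 and depth 2 reproduce the two flat segmentations shown on the slide, so a single nested annotation can serve applications that need different granularities.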
General principles and methods
4.6.1 General architecture for word segmentation
(1) a dictionary, built on a representative corpus, with high coverage of texts and, where applicable, with morphological structures for some lemmas.
(2) a word formation specification.
(3) a complete prefix/semi-prefix list.
(4) a complete suffix/semi-suffix list.
(5) a complete free morpheme list.
(6) a complete bound morpheme list.
(7) special morpheme lists that have special functions in the process of word segmentation, for example, inflectional affixes for verbs in Japanese.
(8) corpora: to support the quantitative analysis of the lexicon (but not as a part of the Standards).
General principles and methods
4.6.2 The role and makeup of the lexicon
• The lexicon serves as the foundation and gold standard for word segmentation, so as to keep segmentation as consistent as possible.
• Regular word forms are in general not included in the lexicon.
• Two lexical items which are homographic should be kept as two separate entries in the lexicon.
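To show how a lexicon can determine the segmentation output, here is a sketch of forward maximum matching, a common dictionary-driven baseline. It is not described in these slides and is used here only for illustration; `forward_max_match` is a hypothetical helper, and the toy lexicon entries are taken from the granularity example.

```python
def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation that prefers the longest lexicon entry."""
    max_len = max((len(entry) for entry in lexicon), default=1)
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                units.append(piece)
                i += size
                break
    return units


text = "傣族风情园令人流连忘返"
small_lexicon = {"傣族", "风情", "园", "令", "人", "流连忘返"}
large_lexicon = small_lexicon | {"傣族风情园"}

print(" | ".join(forward_max_match(text, small_lexicon)))
# 傣族 | 风情 | 园 | 令 | 人 | 流连忘返
print(" | ".join(forward_max_match(text, large_lexicon)))
# 傣族风情园 | 令 | 人 | 流连忘返
```

Adding one entry to the lexicon changes the output from the fine-grained to the coarse-grained segmentation, which is why the makeup of the lexicon matters so much for consistency.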