ACL/COLING’06 Workshop on Multilingual Language Resources and Interoperability
The Role of Lexical Resources in CJK Natural Language Processing
Jack Halpern (春遍雀來)
The CJK Dictionary Institute (CJKI) (日中韓辭典研究所)
Various Challenges
• Identifying and processing the large number of orthographic variants in Japanese and alternate character forms in the CJK languages
• The lack of easily available, comprehensive lexical resources, especially lexical databases, comparable to those for the major European languages
• Accurate conversion between Simplified and Traditional Chinese
• The morphological complexity of Japanese and Korean
• Accurate word segmentation and disambiguation of ambiguous segmentation strings
• The difficulty of lexeme-based retrieval and CJK cross-language information retrieval (CLIR)
• Chinese and Japanese proper nouns, which are very numerous and difficult to detect without a lexicon
• Automatic recognition of terms and their variants
Named Entity Extraction
• The number of personal names and their variants (e.g. there are over a hundred ways to spell Mohammed) is probably in the billions
• CJKI maintains databases of millions of proper nouns
• Named entity contextual clues (NECC): keywords or syntactic structures that co-occur with proper nouns and help identify them
• Named entity recognition (NER), especially of personal names and place names, is an area in which lexicon-driven methods have a clear advantage over probabilistic methods, and in which lexical resources should play a central role
Linguistic Issues in Chinese
• A major issue for Chinese segmentors is how to treat compound words and multiword lexical units (MWUs)
• For example, 录像带 'videotape' and 机器翻译 'machine translation' are not tagged as single segments in Chinese Gigaword
• The lexicons used by Chinese segmentors are often small-scale or incomplete; our testing of various Chinese segmentors has shown that coverage of MWUs is often limited
• Chinese linguists disagree on the concept of wordhood in Chinese, and various theories, such as the Lexical Integrity Hypothesis, have been proposed
• The "correct" segmentation can depend on the application, and there are various segmentation standards. For example, a search engine user looking for 录像带 is not normally interested in 录像 and 带 per se, unless they are part of 录像带
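The effect of MWU coverage on segmentation can be illustrated with a small sketch. It uses forward maximum matching, a simple greedy technique chosen here only for illustration; it is not the segmentor tested by the author, and both lexicons are hypothetical toy data built from the examples on the slide.

```python
def forward_max_match(text, lexicon, max_len=6):
    """Greedy longest-match segmentation.

    At each position, take the longest lexicon entry that matches;
    characters not covered by the lexicon become single-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicons: one without MWUs, one with them.
small_lexicon = {"录像", "带", "机器", "翻译"}
large_lexicon = small_lexicon | {"录像带", "机器翻译"}

for word in ("录像带", "机器翻译"):
    print(word, forward_max_match(word, small_lexicon),
          forward_max_match(word, large_lexicon))
```

With the small lexicon each MWU is split into its components; once the MWUs are listed as entries, they come out as single units.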
Lexeme
• A lexeme is the smallest distinctive unit associating meaning with form
• Predicting compositionality is not trivial and is often impossible
• Lexical items like 机器翻译 represent stand-alone, well-defined concepts and should be treated as single units
Multilevel Segmentation
• Chinese MWUs can consist of nested components that can be segmented in different ways at different levels, to satisfy the requirements of different segmentation standards
• 北京日本人学校 (multiword lexemic)
• 北京 + 日本人 + 学校 (lexemic)
• 北京 + 日本 + 人 + 学校 (sublexemic)
• 北京 + [日本 + 人] [学 + 校] (morphemic)
• [北 + 京] [日 + 本 + 人] [学 + 校] (submorphemic)
• Slide annotations: "MT, NER preferred" and "speech technology preferred"
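One way to hold all five levels in a single structure is a tree in which each node records the level at which it splits into its children. The sketch below is only an illustration: the `Node` class, the `split_level` field, and the `segment` function are hypothetical names, and the split levels are read off the five segmentations listed on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional

LEVELS = ["multiword lexemic", "lexemic", "sublexemic",
          "morphemic", "submorphemic"]


@dataclass
class Node:
    text: str
    # Index into LEVELS at which this node breaks into its children;
    # None means the node is never split further.
    split_level: Optional[int] = None
    children: List["Node"] = field(default_factory=list)


def segment(node, level):
    """Return the segmentation of node.text at the given level index."""
    if node.children and node.split_level is not None and level >= node.split_level:
        tokens = []
        for child in node.children:
            tokens.extend(segment(child, level))
        return tokens
    return [node.text]


school = Node("北京日本人学校", 1, [
    Node("北京", 4, [Node("北"), Node("京")]),
    Node("日本人", 2, [
        Node("日本", 4, [Node("日"), Node("本")]),
        Node("人"),
    ]),
    Node("学校", 3, [Node("学"), Node("校")]),
])

for index, name in enumerate(LEVELS):
    print(f"{name}: {' + '.join(segment(school, index))}")
```

Storing the finest analysis once and projecting it to a coarser level on demand lets one lexicon entry serve applications that follow different segmentation standards.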
Neologisms (new words and new senses)
• The problem of incorrect segmentation is especially obvious in the case of neologisms
• 电脑迷 diànnǎomí 'cyberphile'
• 电子商务 diànzǐshāngwù 'e-commerce'
• 追车族 zhuīchēzú 'auto fan'
Chinese-to-Chinese Conversion (C2C)
• The conversion can be implemented on three levels
• 1. Code Conversion
• numerous one-to-many ambiguities
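The one-to-many problem can be made concrete with a small sketch, assuming that code conversion works character by character. The table `SC_TO_TC_CHARS` is a hypothetical toy mapping holding a few well-known cases (发 corresponds to both 發 and 髮; 干 to 幹, 乾, and 干); a real table would be far larger.

```python
from itertools import product

# Hypothetical toy table: each Simplified character maps to every
# Traditional character it may correspond to.
SC_TO_TC_CHARS = {
    "头": ["頭"],
    "发": ["發", "髮"],
    "干": ["幹", "乾", "干"],
}


def code_conversion_candidates(text):
    """List every Traditional string that character-level mapping allows."""
    options = [SC_TO_TC_CHARS.get(ch, [ch]) for ch in text]
    return ["".join(combo) for combo in product(*options)]


print(code_conversion_candidates("头发"))   # 'hair'
print(code_conversion_candidates("干燥"))   # 'dry'
```

A character-level converter has to pick one candidate without knowing the word, so 头发 'hair' may come out as 頭發 rather than the correct 頭髮.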
C2C (cont.)
• 2. Orthographic Conversion
• operates on meaningful linguistic units, equivalent to lexemes
• must be done with a segmentor
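A minimal sketch of word-level conversion follows. It assumes the text has already been segmented upstream (the segmentor itself is not shown), and the two tables are hypothetical toy data, not CJKI's mapping tables.

```python
# Hypothetical word-level table: whole units map to whole units.
SC_TO_TC_WORDS = {
    "头发": "頭髮",   # 'hair'
    "发展": "發展",   # 'develop'
    "干燥": "乾燥",   # 'dry'
    "干部": "幹部",   # 'cadre'
}

# Hypothetical character-level fallback with one default per character.
SC_TO_TC_CHAR_DEFAULT = {"头": "頭", "发": "發", "干": "幹"}


def convert_chars(token):
    """Character-by-character fallback, used when a word is not listed."""
    return "".join(SC_TO_TC_CHAR_DEFAULT.get(ch, ch) for ch in token)


def orthographic_convert(tokens):
    """Convert pre-segmented Simplified tokens, preferring word-level entries."""
    return [SC_TO_TC_WORDS.get(token, convert_chars(token)) for token in tokens]


tokens = ["头发", "很", "干燥"]          # assumed output of a segmentor
print("".join(orthographic_convert(tokens)))
print(convert_chars("头发很干燥"))        # character-level result for contrast
```

Because the lookup key is the whole word, the ambiguous characters 发 and 干 are resolved by the entry they occur in rather than by a fixed default.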
C2C (cont.)
• 3. Lexemic Conversion
• maps SC and TC lexemes that are semantically
Traditional Chinese Variants
• Traditional Chinese has numerous variant character forms
Orthographic Variation in Japanese
• Highly irregular orthography
• Four scripts are used to write Japanese: kanji, hiragana, katakana, and the Latin alphabet
• Example: 取り扱い, 取扱い, 取扱, とり扱い, 取りあつかい, とりあつかい
• The Japanese IR problem: 金の卵を産む鶏 can be written in 4 × 3 × 2 = 24 ways
• 'egg' has four variants (卵, 玉子, たまご, タマゴ)
• 'chicken' has three (鶏, にわとり, ニワトリ)
• 'to lay' has two (産む, 生む)
• 金の卵を生むニワトリ: 398 Google results
• 金の玉子を産む鶏: 12 Google results
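The count of 24 is simply the product of the variant counts for the three content words, which can be checked by enumeration. The sketch below uses only the variants listed on the slide; the list names are illustrative.

```python
from itertools import product

EGG = ["卵", "玉子", "たまご", "タマゴ"]
LAY = ["産む", "生む"]
CHICKEN = ["鶏", "にわとり", "ニワトリ"]

# Every combination of one 'egg', one 'to lay', and one 'chicken' variant.
variants = [
    f"金の{egg}を{lay}{chicken}"
    for egg, lay, chicken in product(EGG, LAY, CHICKEN)
]

print(len(variants))
for phrase in variants:
    print(phrase)
```

A search for any single spelling matches only one of these strings, which is why the two queries on the slide return such different result counts.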
Okurigana Variants
• Inflectional endings
• 書き著しませんでした
Cross-Script Orthographic Variation
• 人参 means 'carrot' in Japanese
• Google hit counts: 67,500; 66,200; 58,000
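Kana Variants
• Optional vowel lengthening, long vowels, multiple katakana renderings, and archaic usage in which two kana are both pronounced roughly as zu
• Normalization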
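One small piece of kana normalization, folding katakana to hiragana, can be sketched with Unicode arithmetic: the katakana block U+30A1 to U+30F6 sits exactly 0x60 above the corresponding hiragana. This is only an illustration; it does not handle the long-vowel or archaic-spelling variants listed above, which need lexical knowledge, and the function name is hypothetical.

```python
KATAKANA_START = 0x30A1   # ァ
KATAKANA_END = 0x30F6     # ヶ
KANA_OFFSET = 0x60        # distance from a katakana to its hiragana


def katakana_to_hiragana(text):
    """Fold katakana to hiragana; all other characters pass through."""
    folded = []
    for ch in text:
        code = ord(ch)
        if KATAKANA_START <= code <= KATAKANA_END:
            folded.append(chr(code - KANA_OFFSET))
        else:
            folded.append(ch)
    return "".join(folded)


print(katakana_to_hiragana("ニンジン"))
print(katakana_to_hiragana("金の卵を生むニワトリ"))
```

Folding the two kana scripts together before indexing makes the hiragana and katakana spellings of a word compare equal; the kanji spelling still needs a lexicon lookup.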
Lexicon-driven Normalization
• Convert variants to a standardized form for indexing
• Normalize queries for dictionary lookup
• Normalize all source documents
• Identify forms as members of a variant group
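The four steps above can be sketched as a lookup against variant groups. The groups below reuse variants from the Japanese slides; choosing the first form as the standardized one is an arbitrary assumption made for this example, and all names are hypothetical.

```python
# Hypothetical variant lexicon: standardized form -> all members of its group.
VARIANT_GROUPS = {
    "取り扱い": ["取り扱い", "取扱い", "取扱", "とり扱い", "取りあつかい", "とりあつかい"],
    "卵": ["卵", "玉子", "たまご", "タマゴ"],
    "鶏": ["鶏", "にわとり", "ニワトリ"],
}

# Invert the lexicon so that any variant can be looked up directly.
VARIANT_TO_STANDARD = {
    variant: standard
    for standard, members in VARIANT_GROUPS.items()
    for variant in members
}


def normalize(term):
    """Map a term to its standardized form; unknown terms are unchanged."""
    return VARIANT_TO_STANDARD.get(term, term)


def same_variant_group(a, b):
    """True when both terms are listed in the same variant group."""
    return a in VARIANT_TO_STANDARD and normalize(a) == normalize(b)


def build_index(documents):
    """Index pre-tokenized documents under normalized terms."""
    index = {}
    for doc_id, terms in documents.items():
        for term in terms:
            index.setdefault(normalize(term), set()).add(doc_id)
    return index


docs = {"d1": ["たまご", "ニワトリ"], "d2": ["玉子"], "d3": ["取扱"]}
index = build_index(docs)
print(sorted(index[normalize("卵")]))
print(sorted(index[normalize("とりあつかい")]))
```

Normalizing both the indexed documents and the query with the same table means a query in any one spelling retrieves documents written in any other.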
Orthographic Variation in Korean
• Far less than in Japanese
• Loanword 'cake': 케이크 (ke i keu) and 케잌 (ke ik)
• Personal name 'Clinton': 클린턴 (keul rin teon) and 클린톤 (keul rin ton)
• Mixed script 'shirt': 와이셔츠 (wai-syeo cheu) and Y 셔츠 (wai-syeo cheu)
• Spelling conventions differ between North and South Korea
• Old and new orthography differ
The Role of Lexical Databases
• Disk storage is no longer a major issue
• CJKI specializes in CJK and Arabic computational lexicography
• Orthographic normalization and named entity extraction
• The small-scale lexical resources currently used by many NLP tools are inadequate for these tasks
• Lexicon-driven techniques have proven their effectiveness, so there is no need to rely excessively on probabilistic methods
• Up-to-date lexical resources are the key to achieving major enhancements in NLP technology