The Role of Lexical Resources in CJK Natural Language Processing

ACL/COLING’06 Workshop on Multilingual Language Resources and Interoperability
Jack Halpern（春遍雀來）
The CJK Dictionary Institute (CJKI) (日中韓辭典研究所)

Various Challenges
• Identifying and processing the large number of orthographic variants in Japanese, and alternate character forms in CJK languages
• The lack of easily available comprehensive lexical resources, especially lexical databases, comparable to those for the major European languages
• Accurate conversion between Simplified and Traditional Chinese
• The morphological complexity of Japanese and Korean
• Accurate word segmentation and disambiguation of ambiguous segmentation strings
• The difficulty of lexeme-based retrieval and CJK cross-language information retrieval (CLIR)
• Chinese and Japanese proper nouns, which are very numerous and difficult to detect without a lexicon
• Automatic recognition of terms and their variants

Named Entity Extraction
• The number of personal names and their variants (e.g. over a hundred ways to spell Mohammed) is probably in the billions
• CJKI maintains databases of millions of proper nouns
• Use of keywords or syntactic structures that co-occur with proper nouns, which we refer to as named entity contextual clues (NECC)
• Named entity recognition (NER), especially of personal names and place names, is an area in which lexicon-driven methods have a clear advantage over probabilistic methods and in which lexical resources should play a central role

Linguistic Issues in Chinese
• A major issue for Chinese segmentors is how to treat compound words and multiword lexical units (MWU)
• 录像带 and 机器翻译 are not tagged as segments in Chinese Gigaword
• The lexicons used by Chinese segmentors are small-scale or incomplete. Our testing of various Chinese segmentors has shown that coverage of MWUs is often limited.
• Chinese linguists disagree on the concept of wordhood in Chinese. Various theories such as the Lexical Integrity Hypothesis have been proposed.
• The "correct" segmentation can depend on the application, and there are various segmentation standards. For example, a search engine user looking for 录像带 is not normally interested in 录像 and 带 per se, unless they are part of 录像带.
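The effect of MWU coverage on segmentation can be illustrated with a small sketch. This is not the segmentor evaluated in the talk; it is a plain forward maximum-matching routine, and the function name `max_match` and both toy lexicons are assumptions made up for illustration.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching: at each position take the longest
    lexicon entry that matches, falling back to a single character."""
    max_len = max((len(word) for word in lexicon), default=1)
    segments = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                segments.append(candidate)
                i += size
                break
    return segments


# Hypothetical toy lexicons: one without MWUs, one with them.
SMALL_LEXICON = {"录像", "带", "机器", "翻译"}
LARGE_LEXICON = SMALL_LEXICON | {"录像带", "机器翻译"}

print(max_match("录像带", SMALL_LEXICON))    # ['录像', '带']
print(max_match("录像带", LARGE_LEXICON))    # ['录像带']
print(max_match("机器翻译", LARGE_LEXICON))  # ['机器翻译']
```

With the smaller lexicon the MWU is split into its components; adding the MWU as an entry is enough for the same routine to keep it whole.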

Lexeme
• Lexemes are the smallest distinctive units associating meaning with form
• Predicting compositionality is not trivial and often impossible
• Lexical items like 机器翻译 represent stand-alone, well-defined concepts and should be treated as single units

Multilevel Segmentation
• Chinese MWUs can consist of nested components that can be segmented in different ways at different levels to satisfy the requirements of different segmentation standards
• 北京日本人学校: multiword lexemic
• 北京 + 日本人 + 学校: lexemic
• 北京 + 日本 + 人 + 学校: sublexemic
• 北京 + [日本 + 人] [学 + 校]: morphemic
• [北 + 京] [日 + 本 + 人] [学 + 校]: submorphemic
• Level preferences noted on the slide: MT, NER preferred; speech technology preferred
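One way to picture the nesting is to store each level as a list of segments and check that every boundary at a coarser level is also a boundary at the next finer level. The sketch below does that for the slide's example; the names `LEVELS`, `boundaries`, and `is_nested` are hypothetical, and the table is hand-entered rather than produced by a segmentor.

```python
# Segmentations of the slide's example, ordered from coarsest to finest.
LEVELS = {
    "multiword lexemic": ["北京日本人学校"],
    "lexemic": ["北京", "日本人", "学校"],
    "sublexemic": ["北京", "日本", "人", "学校"],
    "morphemic": ["北京", "日本", "人", "学", "校"],
    "submorphemic": ["北", "京", "日", "本", "人", "学", "校"],
}


def boundaries(segments):
    """Return the character offsets at which one segment ends and the next begins."""
    offsets = set()
    position = 0
    for segment in segments[:-1]:
        position += len(segment)
        offsets.add(position)
    return offsets


def is_nested(levels):
    """True if each level's boundaries are a subset of the next finer level's."""
    cuts = [boundaries(segments) for segments in levels.values()]
    return all(coarse <= fine for coarse, fine in zip(cuts, cuts[1:]))


print(sorted(boundaries(LEVELS["lexemic"])))     # [2, 5]
print(sorted(boundaries(LEVELS["sublexemic"])))  # [2, 4, 5]
print(is_nested(LEVELS))                         # True
```

Because the levels are nested, an application can pick the level its segmentation standard requires without re-segmenting the text.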

Neologisms (new words; new senses)
• The problem of incorrect segmentation is especially obvious in the case of neologisms
• 电脑迷 diànnǎomí cyberphile
• 电子商务 diànzǐ shāngwù e-commerce
• 追车族 zhuīchēzú auto fan

Chinese-to-Chinese Conversion (C2C)
• The conversion can be implemented on three levels
• 1. Code Conversion
• Numerous one-to-many ambiguities
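The one-to-many problem at the code conversion level can be seen in a minimal sketch. The table below is a tiny hand-picked sample, not CJKI data, and the names `CODE_TABLE`, `code_convert`, and `ambiguous_chars` are assumptions for illustration. Picking the first candidate for every character is the naive strategy that produces the errors.

```python
# Simplified -> Traditional candidates per character (illustrative sample only).
CODE_TABLE = {
    "头": ["頭"],
    "发": ["發", "髮"],        # 'to emit' vs. 'hair'
    "干": ["幹", "乾", "干"],  # several Traditional counterparts
}


def code_convert(text):
    """Character-by-character conversion that always takes the first candidate."""
    return "".join(CODE_TABLE.get(ch, [ch])[0] for ch in text)


def ambiguous_chars(text):
    """List the characters in text that have more than one Traditional candidate."""
    return [ch for ch in text if len(CODE_TABLE.get(ch, [ch])) > 1]


print(code_convert("头发"))     # 頭發 (the expected Traditional form is 頭髮)
print(ambiguous_chars("头发"))  # ['发']
```

Resolving 发 correctly requires knowing which word it belongs to, which a character table alone cannot supply.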

C2C (cont.)
• 2. Orthographic Conversion
• Operates on meaningful linguistic units, equivalent to lexemes
• Must be done with a segmentor

C2C (cont.)
• 3. Lexemic Conversion
• Maps SC and TC lexemes that are semantically
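The difference between the orthographic and lexemic levels can be sketched with two word-level tables. This is an illustration under stated assumptions: the input is taken to be already segmented by some upstream segmentor, the table entries are a hand-picked sample, and `ORTHOGRAPHIC`, `LEXEMIC`, and `convert` are hypothetical names.

```python
# Word-level Simplified -> Traditional mappings (illustrative sample only).
# Orthographic: same word, Traditional characters.
ORTHOGRAPHIC = {"头发": "頭髮", "软件": "軟件", "计算机": "計算機"}
# Lexemic: a different word that is preferred for the same concept.
LEXEMIC = {"软件": "軟體", "计算机": "電腦"}


def convert(segments, lexemic=False):
    """Convert pre-segmented Simplified Chinese word by word.

    With lexemic=True, a lexemic mapping takes priority over the
    orthographic one whenever the lexemic table has an entry."""
    converted = []
    for segment in segments:
        if lexemic and segment in LEXEMIC:
            converted.append(LEXEMIC[segment])
        else:
            converted.append(ORTHOGRAPHIC.get(segment, segment))
    return "".join(converted)


words = ["计算机", "软件"]
print(convert(words))                # 計算機軟件
print(convert(words, lexemic=True))  # 電腦軟體
print(convert(["头发"]))             # 頭髮
```

Looking up whole words resolves the 发 ambiguity that character-level conversion cannot, which is why this level depends on a segmentor.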

Traditional Chinese Variants
• Traditional Chinese has numerous variant character forms

Orthographic Variation in Japanese
• Highly irregular orthography
• Four scripts are used to write Japanese: kanji, hiragana, katakana, and the Latin alphabet
• 取り扱い, 取扱い, 取扱, とり扱い, 取りあつかい, とりあつかい
• A problem for Japanese information retrieval (IR)
• 金の卵を産む鶏 can be written in 4 × 3 × 2 = 24 ways
• 'egg' has four variants (卵, 玉子, たまご, タマゴ)
• 'chicken' has three (鶏, にわとり, ニワトリ)
• 'to lay' has two (産む, 生む)
• 金の卵を生むニワトリ: 398 Google results
• 金の玉子を産む鶏: 12 Google results
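The count of 24 follows from multiplying the variant counts of the three words. A short sketch that enumerates the combinations from the variants listed on the slide (the names `EGG`, `LAY`, `CHICKEN`, and `phrase_variants` are assumptions for illustration):

```python
from itertools import product

# Variant spellings taken from the slide.
EGG = ["卵", "玉子", "たまご", "タマゴ"]
LAY = ["産む", "生む"]
CHICKEN = ["鶏", "にわとり", "ニワトリ"]


def phrase_variants():
    """Generate every spelling of the phrase by combining the word variants."""
    return [f"金の{egg}を{lay}{chicken}"
            for egg, lay, chicken in product(EGG, LAY, CHICKEN)]


variants = phrase_variants()
print(len(variants))                       # 24
print("金の卵を生むニワトリ" in variants)  # True
print("金の玉子を産む鶏" in variants)      # True
```

A query in any one of these spellings matches only documents that happen to use the same spelling, unless the variants are normalized or expanded.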

Okurigana Variants
• Inflectional endings
• 書き著しませんでした

Cross-Script Orthographic Variation
• 人參 is the Japanese word for 'carrot'
• Google results: 67500, 66200, 58000

Kana Variants
• Optional vowel lengthening, long vowels, multiple katakana spellings, archaic kana usage, and two kana that both sound like zu
• Normalization

Lexicon-driven Normalization
• Convert variants to a standardized form for indexing
• Normalize queries for dictionary lookup
• Normalize all source documents
• Identify forms as members of a variant group
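The steps above can be sketched as a lookup from each variant to a standardized form. The variant groups below come from the Japanese examples earlier in the talk; which member serves as the standardized form is an assumption here, as are the names `VARIANT_GROUPS`, `normalize`, and `same_group`. A production lexicon would hold far more groups.

```python
# Standardized form -> all members of its variant group (illustrative sample).
VARIANT_GROUPS = {
    "取り扱い": ["取り扱い", "取扱い", "取扱", "とり扱い", "取りあつかい", "とりあつかい"],
    "卵": ["卵", "玉子", "たまご", "タマゴ"],
}

# Inverted lookup: every variant points to its standardized form.
VARIANT_TO_STANDARD = {
    variant: standard
    for standard, variants in VARIANT_GROUPS.items()
    for variant in variants
}


def normalize(token):
    """Map a token to its standardized form; unknown tokens pass through."""
    return VARIANT_TO_STANDARD.get(token, token)


def same_group(a, b):
    """True if both tokens normalize to the same standardized form."""
    return normalize(a) == normalize(b)


print(normalize("とりあつかい"))     # 取り扱い
print(normalize("タマゴ"))           # 卵
print(same_group("玉子", "たまご"))  # True
```

Applying `normalize` to both indexed documents and incoming queries means a search for any member of a group retrieves documents containing any other member.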

Orthographic Variation in Korean
• Far less variation than in Japanese
• Loanword 'cake': 케이크 (ke i keu) and 케잌 (ke ik)
• Personal name 'Clinton': 클린턴 (keul rin teon) and 클린톤 (keul rin ton)
• Mixed script 'shirt': 와이셔츠 (wai-syea cheu) and Y 셔츠 (wai-syea cheu)
• Spelling differs between North and South Korea
• Old and new orthographies differ

The Role of Lexical Databases
• Disk storage is no longer a major issue
• CJKI specializes in CJK and Arabic computational lexicography
• Orthographic normalization and named entity extraction
• The small-scale lexical resources currently used by many NLP tools are inadequate for these tasks
• Lexicon-driven techniques have proven their effectiveness, so there is no need to rely overly on probabilistic methods
• Up-to-date lexical resources are the key to achieving major enhancements in NLP technology
