
From Synergy to Knowledge: Integrating Multiple Language Resources
Part II: Creating Synergy and Multi-functionality of Language Resources
Chu-Ren Huang, Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm

Outline
• From Language Resources to Language Technology
• A word’s company
• Classical Paradigm of Language Resource Development
• A new paradigm: Integrating Multiple Language Resources
• Introduction: CGW Corpus
• Chinese WordSketch: Integrating multiple resources
• Wen-Guo: Merging different resources to create new synergy
C.R. Huang, 4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007

From Language Resources to Language Technology
• Language Modeling and Knowledge Generation: How can linguistic models and/or generalizations be acquired from language resources?
• Sharability: Can two or more resources be combined to create bigger and better resources?
• Re-usability: Can a resource be used for a purpose other than the one it was designed for?

A Word’s Company: Corpus KeyWord In Context (KWIC) and the Coloured Pens
• 1 political association
• 2 social event
• 3 group of people
• 4 person in an agreement/dispute
• 5 to be party to something...
• The coloured pens method, from Kilgarriff et al. 2005

A Word’s Company Automatically Detected: WordSketch with BNC Data

Sketch Engine and Chinese WordSketch
• Sketch Engine: http://www.sketchengine.co.uk
  • Developed by a team led by Adam Kilgarriff
  • A new corpus viewing tool
  • Discovering grammatical information from a gigantic corpus
• Chinese WordSketch by Academia Sinica: http://www.ling.sinica.edu.tw/wordsketch (for Taiwan only)
  • Academia Sinica, Taiwan (Huang, Smith, Ma, Simon; 黃居仁，史尚明，馬偉雲，石穆)

Classical Paradigm of Language Resource Development
• Data Collection and Preparation:
  • Design criteria: by humans
  • Data collection: executed or supervised by humans
  • Digitization: input and/or proofreading by humans
• Knowledge Enrichment: tagging and structural annotation
  • Knowledge source: by humans
  • Representational standard and annotation: by humans
• The quality and speed of human labor become the bottleneck of language resource development

Current Challenges to Corpus and Language Resource Research
• Corpus size is too small:
  • Disambiguation
  • Collocation
  • Grammatical functions and other dependencies
  usually require a corpus size of 100 million words or above to yield significant distributional information.
• Resource development is slow and tedious
  • Semantic role tagging
  • POS tagging post-processing

Estimating Corpus Scale for Automatic Extraction of Linguistic Knowledge
How many events do we need to establish a reliable description of a word from a corpus automatically?
• Grammatical information based on word-word collocation
  • V＋N: 「開立」＋「發票」 (kai1li4 'to issue' + fa1piao4 'invoice')
  • A＋N: 「不實」＋「發票」 (bu4shi2 'false' + fa1piao4 'invoice')
• Collocational information between any given two mid-frequency words (frequency rank 10,000 or above)
  • that occur within a 10-word window of the keyword (5 before and 5 after)
  • requires a corpus size of 1 billion words or above
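The 10-word window described above (5 words before and 5 after the keyword) can be illustrated with a minimal sketch. The function name, the `span` parameter, and the toy token list are assumptions made for illustration; they are not part of the CKIP or Sketch Engine tooling.

```python
from collections import Counter

def window_cooccurrences(tokens, keyword, span=5):
    """Count tokens that fall within `span` positions before or after each hit of `keyword`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - span)                 # window start, clipped at the text start
        hi = min(len(tokens), i + span + 1)   # window end (exclusive), clipped at the text end
        for j in range(lo, hi):
            if j != i:                        # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts

# Hypothetical pre-segmented text; a real run would use the segmented corpus.
tokens = ["公司", "開立", "發票", "給", "客戶", "，", "客戶", "收到", "不實", "發票"]
counts = window_cooccurrences(tokens, "發票")
```

With mid-frequency words, most windows contain neither member of a given pair, which is why reliable pair counts only emerge at very large corpus sizes.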

Classical Chinese Corpora: Million-Word Scale (M = million)

A New Paradigm: Integrating Multiple Language Resources
From Synergy to Knowledge
• Integrate multiple existing (language) resources to create a new resource
  • Allow resources to scale up beyond existing resources
  • Generate new knowledge which does not exist in any individual resource
• General methodology (without too much additional manual work):
  • merging existing, similarly annotated resources, or
  • creating an overall conceptual framework for different knowledge/language resources to be integrated
  • automatically

From Synergy to Knowledge
• When A and B have synergy, we say in Chinese that A and B bring out the advantages of each other
• Knowledge is what we know about the world, either descriptive or explanatory
• Knowledge cannot be created from nothing; it comes from
  • keen observation of facts
  • sharp reasoning when we put two or more facts together
• Different language resources can be put together to
  • facilitate observation of facts, and
  • create an environment where different linguistic facts can be more easily associated (for knowledge discovery)

Synergy: Integrating Different Types of Language Resources
Research based on the Chinese Gigaword Corpus
• Chinese Gigaword Corpus: Introduction
• Implementation of fully automatic corpus tagging
• Word Sketch Engine: Introduction
• Chinese Word Sketch
• Integrating corpus program with lexico-grammatical information

Introduction: CGW Corpus I
Chinese Gigaword Second Edition (2005)
• Produced and released by the Linguistic Data Consortium (LDC) in 2003 (first edition).
• Newswire text data in Chinese.
• Second edition contains additional data collected after the publication of the first edition.
• Three distinct international sources:
  • Central News Agency of Taiwan
  • Xinhua News Agency of Beijing
  • Zaobao Newspaper of Singapore

Introduction: CGW Corpus II
Table 1. Coverage of Chinese GigaWord Corpus

Introduction: CGW Corpus III
Markup Structure
All text data are presented in SGML form, using a very simple, minimal markup structure.

Introduction: CGW Corpus IV
Statistics
Table 2. Content of data from each source (unit: million)

CGW after fully automatic tagging

II. 1. Corpus Preparation: (Almost) Fully Automatic Segmentation and Tagging
• Strategy (Ma and Chen 2005): an HMM method for POS tagging of words that exist in the basic lexicon, and a morpheme-analysis-based method (Tseng and Chen 2002) to predict POS tags for new words.
• Integrating language resources
  • Sinica lexicon with 80,000 word entries.
  • A set of 50,000 words collected from Sinica Corpus 3.0 (a balanced corpus of 10 million words).
  • 5,000 new words from the Xinhua new-words dictionary.
• Tagset: adopting the Sinica Tagset as a uniform tagging set.

Preparation: Implementation
• Environment: 2 PCs (2.8 GHz CPU)
• Time consumed: over 3 days
• Output:
  • 462 million words of CNA
  • 252 million words of XIN
Ma and Huang 2006 (LREC 2006)
See http://ckipsvr.iis.sinica.edu.tw/ for a demo of the CKIP Segmentation program

Preparation: Tagging
Segmented and Tagged Article

Summary of Fully Tagged CGW Corpus
• Fully segmented and tagged with the Sinica tagset by Academia Sinica
• Being processed by PKU with their tagset
• Potentially the most important source for processing and comparative studies of Mandarin Chinese
• Will be available from LDC in 2007.

CWS and Integration of Corpus Search Engine with Lexico-grammatical Information
• Overview
  • A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behavior.
  • The Word Sketch Engine takes as input a corpus of any language and a corresponding set of grammar patterns, and generates word sketches for the words of that language.
  • We synergize rich lexicon-based grammatical information (ICG, Chen and Huang 1992) with stochastic information.

Word Sketch
Word Sketch Engine (Kilgarriff et al.)
Register for trial usage at http://www.sketchengine.co.uk
• A versatile corpus viewing and searching tool
• The Word Sketch Engine takes as input a corpus of any language and a corresponding set of grammar patterns, and generates word sketches for the words of that language.
• Based on pre-defined context-free rules to identify grammatical functions (relations)
• Ranked by saliency: frequency-adjusted MI (based on Dekang Lin’s definition of pair-wise MI)
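The slide does not spell out the saliency formula. As a rough sketch only, the code below assumes one common reading of "frequency-adjusted MI": a Lin-style mutual information score over (word, relation, word) triples, multiplied by the log of the triple frequency plus one. The function name, the exact adjustment, and the toy triples are all assumptions, not the Sketch Engine's documented implementation.

```python
import math
from collections import Counter

def salience_scores(triples):
    """Score each (head, relation, dependent) triple by MI times log(frequency + 1)."""
    triple_c = Counter(triples)
    head_rel = Counter((w1, r) for w1, r, _ in triples)   # ||w1, r, *||
    rel_dep = Counter((r, w2) for _, r, w2 in triples)    # ||*, r, w2||
    rel = Counter(r for _, r, _ in triples)               # ||*, r, *||
    scores = {}
    for (w1, r, w2), f in triple_c.items():
        # Lin-style MI for a dependency triple.
        mi = math.log2(f * rel[r] / (head_rel[(w1, r)] * rel_dep[(r, w2)]))
        # Assumed frequency adjustment; the "+ 1" keeps singletons from scoring zero.
        scores[(w1, r, w2)] = mi * math.log(f + 1)
    return scores

# Hypothetical extracted triples.
triples = (
    [("eat", "obj", "rice")] * 8
    + [("eat", "obj", "thing")] * 2
    + [("buy", "obj", "rice")] * 2
    + [("buy", "obj", "thing")] * 8
)
scores = salience_scores(triples)
```

In this toy data "rice" is a salient object of "eat" (positive score) while "thing" is not (negative score), even though both occur with the verb.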

Design Criteria of Sketch Engine
• Grammatical relations (GRs) are the information of most interest to both HLT and linguistic research
• However, GRs can only be discovered from collocational data, which requires a very large corpus and high-quality annotation at the same time: a seemingly unsolvable dilemma
• There is a solution when the corpus is big enough
  • Context-free patterns allow fairly reliable extraction of a substantial number of relations, if not all
  • When enough instances of relations are extracted, the saliency ranking correctly picks out the distributional tendencies and allows users to ignore idiosyncrasies/errors.

WordSketch’s Approach: From Lexical Types to Relation Types
• BNC has 100,000,000 words
  • 939,028 word types
  • 70,000,000 tuples (relations) extracted
  • More than 70 relations per lemma
• For CWS II and the CGW corpus (CNA data)
  • 1,917,093 word types
  • 59,183,238 tuples (e.g. <eat, obj, rice>)
  • More than 30 relations per lemma
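The "relations per lemma" figures above can be checked with simple arithmetic. The slide gives word-type counts rather than lemma counts, so this sketch divides tuples by word types as an approximation; the variable names are illustrative.

```python
# Figures as quoted on the slide.
bnc_tuples, bnc_types = 70_000_000, 939_028
cna_tuples, cna_types = 59_183_238, 1_917_093

# Average number of extracted relations per word type (used here as a stand-in for lemma).
bnc_ratio = bnc_tuples / bnc_types   # roughly 74.5
cna_ratio = cna_tuples / cna_types   # roughly 30.9

print(f"BNC: {bnc_ratio:.1f} relations per type")
print(f"CGW (CNA): {cna_ratio:.1f} relations per type")
```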

Chinese WordSketch: An Overview
• Concordance
• WordSketch
• Sketch Difference
• Thesaurus


CWS: SketchDiff
Comparing the behaviors of two words

CWS: Thesaurus of 快樂 (kuai4le4, 'happy')

Application to Chinese Corpus: Comparing Thesaurus
We shall know a word by the company it keeps

Context-free Patterns: Does Quality of Grammatical Knowledge Matter?
• The implementation of CWS I simply adopts English-like CF grammatical patterns (since Chinese and English supposedly share very similar PS rules)
• However, the result was not very satisfactory
  • Missing a lot of relations, such as objects which do not appear right next to a verb
  • Misclassifying topicalized objects as subjects
  • Missing objects in non-canonical positions

Linguistic Knowledge Should Solve the Above Problems
Comprehensive lexical knowledge of verb frames exists
• Information-based Case Grammar (ICG)
• Encoded on over 40,000 verbs in the Sinica Lexicon
ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)
EXPERIENCER<GOAL[PP[對]]<VI
EXPERIENCER<VI<<GOAL[PP[於]]
THEME<GOAL[PP{對、以}]<VI
THEME<VI<<GOAL[PP[於]]
THEME<VI<<SOURCE[PP{自、於}]
THEME<SOURCE[PP{歸、為}]<VI

Comparing Lexical Knowledge Between CWS I and CWS II
• CWS I: 11 definitions, 11 patterns
  • One single pattern for the verb-object relation
• CWS II: 32 definitions, 80 patterns
  • 20 patterns for the verb-object relation
  • 59,183,238 tuples (e.g. <eat, obj, rice>) from 496,465,879 words
• English has 39 definitions, 40 patterns

Synergy among Tagging, Statistics, and Linguistic Knowledge
• Collocations are identified with context-free rules in the Word Sketch Engine
• Collocating pattern for object from CSE I:
  1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2: "Na" [tag!= "Na"]
• Challenge: long-distance relations
  • 全穀麵包，吃了很健康。 quan.gu mian.bao, chi le hen jian.kang ('Whole-grain bread is healthy to eat.')
  • 有人嘗試要將這荷花分類，卻越分越累。 you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei ('Someone tried to classify these lotus flowers, but the more they sorted, the more tired they got.')
  • 他只吃了一口飯… ta zhi chi le yi kou fan ('He ate only one mouthful of rice...')
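To make the limitation of the CSE I object pattern concrete, here is a rough sketch that re-expresses it as a Python regular expression over a space-joined tag sequence. This is only an approximation of how a corpus query engine would evaluate it. The function name and the POS tags assigned to the example sentences (Nh, Da, Neu, Nf, PERIODCATEGORY) are assumptions for illustration.

```python
import re

# Verb (VB/VC/VJ), optional aspect marker, optional noun, optional DE, optional noun,
# then an Na object that is followed by a token whose tag is not Na.
OBJECT_PATTERN = re.compile(
    r"\b(?P<verb>V[BCJ]) (?:Di )?(?:N[abc] )?(?:DE )?(?:N[abc] )?"
    r"(?P<obj>Na)(?= (?!Na\b)\w)"
)

def find_objects(tagged):
    """Return (verb, object) word pairs matched by the pattern in a list of (word, tag)."""
    tags = " ".join(tag for _, tag in tagged)
    pairs = []
    for m in OBJECT_PATTERN.finditer(tags):
        # The number of spaces before a character offset equals the token index.
        v = tags[:m.start("verb")].count(" ")
        o = tags[:m.start("obj")].count(" ")
        pairs.append((tagged[v][0], tagged[o][0]))
    return pairs

adjacent = [("他", "Nh"), ("吃", "VC"), ("了", "Di"), ("飯", "Na"), ("。", "PERIODCATEGORY")]
distant = [("他", "Nh"), ("只", "Da"), ("吃", "VC"), ("了", "Di"),
           ("一", "Neu"), ("口", "Nf"), ("飯", "Na"), ("。", "PERIODCATEGORY")]

adjacent_pairs = find_objects(adjacent)
distant_pairs = find_objects(distant)
```

The object is found when it directly follows the verb and aspect marker, but the intervening numeral and measure word in 他只吃了一口飯 block the match, which is the kind of missed relation the slide describes.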

Integrating Prior Knowledge in Processing
Knowledge Source
• Information-based Case Grammar (ICG, Chen and Huang 1992)
• Encoded on over 40,000 verbs in the Sinica Lexicon
• ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)
EXPERIENCER<GOAL[PP[對]]<VI
EXPERIENCER<VI<<GOAL[PP[於]]
THEME<GOAL[PP{對、以}]<VI
THEME<VI<<GOAL[PP[於]]
THEME<VI<<SOURCE[PP{自、於}]
THEME<SOURCE[PP{歸、為}]<VI

Integrating Prior Knowledge in Processing: Examples
• 村莊(object) 明天將被夷為平地(VB11) cunzhuang mingtian jiang bei yiweipingdi ('The village will be razed to the ground tomorrow.')
  • begin time1 location time1 adv? passive_prep adv_string 1:"V[BCJ].*" [tag!="DE"]
• 大量的遊客破壞(VC2) 公園景觀(object) daliang de youke pohuai gongyuan jingguan ('Large numbers of tourists damage the park scenery.')
  • 1:"VC.*" (particle|prep)? NP not_noun
  • (NP is defined as “…noun_modifier{0,2} 2:noun…”.)

Integrating Prior Knowledge in Processing: Partial Result
• Object Recall Comparison

Verb                    CSE I     CSE II
hong2 (red)                 0          0
pao3 (run)                  0      8,704
kan4 (look)            32,350     64,096
da3 (hit)              26,016     47,182
song4 (give)                0     76,378
shuo1 (say)                 0     20,350
xiang1xin4 (believe)        0     52,373
quan4 (persuade)            0      3,852

Integrating Prior Knowledge in Processing: Partial Result II
• Most salient objects for chi1 「吃」 in CSE II
• Those among the top 20 salient objects from CSE I, but not CSE II
  • 飯 fan4 rice 802 70.96 (4)
  • 虧 kui disadvantage 329 59.24 (12)
  • 苦頭 ku3tou2 suffering 194 58.71 (14)

Applications: Chinese WordSketch
• A test version of Chinese Word Sketch is available
• A permanent version of CWS will be available from Academia Sinica soon: http://wordsketch.ling.sinica.edu.tw

Application: Resolving Nominalization
• Chinese verbs are nominalized without overt marking
• Resolving categorical ambiguity with distributional information only
• Two approaches: HMM and Bayesian classifier
  • HMM: N-grams
  • Classifier: left and right contexts, plus the verb's own sub-class, weighted
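The classifier approach can be illustrated with a minimal naive Bayes sketch over the feature types the slide names: left context, right context, and verb sub-class. The slide does not say how features were weighted, so this sketch uses unweighted features with add-one smoothing; the class name, feature encoding, and training examples are hypothetical, and this is not the implementation evaluated in Ma and Huang 2006.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over string features, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.label_counts = Counter()
        self.feat_counts = defaultdict(Counter)  # label -> feature -> count
        self.vocab = set()

    def fit(self, examples):
        for feats, label in examples:
            self.label_counts[label] += 1
            for f in feats:
                self.feat_counts[label][f] += 1
                self.vocab.add(f)

    def predict(self, feats):
        total = sum(self.label_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            denom = sum(self.feat_counts[label].values()) + self.alpha * len(self.vocab)
            for f in feats:
                score += math.log((self.feat_counts[label][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Hypothetical training data: L = tag to the left, R = tag to the right, SUB = verb sub-class.
train = [
    (["L=DE", "R=COMMA", "SUB=VC"], "N"),  # after DE: likely nominalized
    (["L=DE", "R=VH", "SUB=VC"], "N"),
    (["L=D", "R=Na", "SUB=VC"], "V"),      # adverb before, noun after: likely verbal
    (["L=Nh", "R=Na", "SUB=VC"], "V"),
]
clf = NaiveBayes()
clf.fit(train)
nominal = clf.predict(["L=DE", "R=VH", "SUB=VC"])
verbal = clf.predict(["L=D", "R=Na", "SUB=VC"])
```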

Nominalization Results (Ma and Huang 2006)
• Best overall HMM performance: 69%
• Best overall Bayesian classifier performance: 74%

Mining Cross-Strait Lexical Difference
• Strategy: use a pair of known contrasting words as seeds and look up their Sketch Difference
  • Clinton: 克林頓 ke4 vs. 柯林頓 ke1
• What is found: other translations unique to either the PRC or Taiwan
• 克林頓 (PRC) only and/or patterns (vs. 柯林頓 only):

葉利欽      88   54.6   Yeltsin    葉爾勤 (3)
布什        65   49.7   Bush       布希 (4)
萊溫斯基    10   41.3   Lewinsky   呂茵斯基/呂女 (1)
戈爾        20   39.4   Gore       高爾 (2)
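The seed-based strategy amounts to comparing the collocate lists of the two seed words and keeping what appears with only one of them. The sketch below is a simplified stand-in for the Sketch Difference view: the function name is hypothetical, the first number in each row above is read as a frequency (an assumption), and the shared collocate 總統 with its counts is invented to show the filtering.

```python
def exclusive_collocates(sketch_a, sketch_b):
    """Split two collocate->count mappings into the items unique to each side."""
    only_a = {w: n for w, n in sketch_a.items() if w not in sketch_b}
    only_b = {w: n for w, n in sketch_b.items() if w not in sketch_a}
    return only_a, only_b

# Collocates in and/or patterns with each seed; 總統 and its counts are hypothetical.
prc_seed = {"葉利欽": 88, "布什": 65, "萊溫斯基": 10, "戈爾": 20, "總統": 50}   # 克林頓
tw_seed = {"葉爾勤": 3, "布希": 4, "呂茵斯基": 1, "高爾": 2, "總統": 40}        # 柯林頓

only_prc, only_tw = exclusive_collocates(prc_seed, tw_seed)
# Rank the PRC-only collocates by count, most frequent first.
ranked_prc = sorted(only_prc, key=only_prc.get, reverse=True)
```

The collocates that survive on each side are candidates for region-specific translations, which can then be paired up manually.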

Adventures in Wen-Land: 文國尋寶記
http://www.sinica.edu.tw/Wen/
• Integrating the following
  • Corpora: Sinica Corpus, Textbook Corpus (3 different editions), Tang poems, Dream of the Red Chamber, On the Water Margin…
  • Lexicon: General, Classifier, Idiom (成語)
• Linked with a corpus/lexicon interface
• Developed by: Huang, Fengju Lo, Hui-chun Hsiao, and a team of teachers

The Substantive Issues
Language Resources Used in WenGuo
• Textual databases (of classical texts)
• Text corpora
• Linguistic and philological knowledge from previous research
• LKB extracted and composed from the above

Adventures in Wen-Land (2001)

Adventures in Wen-Land
• What: a virtual theme park for on-line Chinese language learning and teaching.
• How: the end product of a National Digital Museum Project sponsored by the National Science Council, ROC (A Linguistic and Literary KnowledgeNet for Elementary School Children)
• When: completed in spring 2001
