From Synergy to Knowledge: Integrating Multiple Language Resources
Part II: Creating Synergy and Multi-functionality of Language Resources
Chu-Ren Huang, Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm
Outline
• From Language Resources to Language Technology
• A word’s company
• Classical Paradigm of Language Resource Development
• A new paradigm: Integrating Multiple Language Resources
• Introduction: CGW Corpus
• Chinese WordSketch: Integrating multiple resources
• Wen-Guo: Merging different resources to create new synergy
C.R. Huang, 4th National NLP Research Symposium, De La Salle Univ., Manila, June 14-16 2007
From Language Resources to Language Technology
• Language modeling and knowledge generation: How can linguistic models and generalizations be acquired from language resources?
• Sharability: Can two or more resources be combined to create bigger and better resources?
• Re-usability: Can a resource be used for a purpose other than the one it was designed for?
A word’s company: Corpus KeyWord In Context (KWIC) and the coloured pens
• 1 political association
• 2 social event
• 3 group of people
• 4 person in an agreement/dispute
• 5 to be party to something...
• The coloured pens method, from Kilgarriff et al. 2005
A Word’s Company Automatically Detected: WordSketch with BNC Data
Sketch Engine and Chinese WordSketch
• Sketch Engine http://www.sketchengine.co.uk, developed by a team led by Adam Kilgarriff
• A new corpus viewing tool
• Discovers grammatical information from a gigantic corpus
• Chinese WordSketch by Academia Sinica http://www.ling.sinica.edu.tw/wordsketch (for Taiwan only)
• Academia Sinica, Taiwan (Huang, Smith, Ma, Simon; 黃居仁, 史尚明, 馬偉雲, 石穆)
Classical Paradigm of Language Resource Development
• Data collection and preparation:
• Design criteria: by humans
• Data collection: executed or supervised by humans
• Digitization: input and/or proofreading by humans
• Knowledge enrichment: tagging and structural annotation
• Knowledge source: by humans
• Representational standard and annotation: by humans
• The quality and speed of human labor become the bottleneck of language resource development
Current Challenges to Corpus and Language Resource Research
• Corpus size is too small:
• Disambiguation
• Collocation
• Grammatical functions and other dependencies
• These usually require a corpus of 100 million words or more to yield significant distributional information.
• Resource development is slow and tedious:
• Semantic role tagging
• POS tagging post-processing
Estimating Corpus Scale for Automatic Extraction of Linguistic Knowledge
How many events do we need to establish a reliable description of a word from a corpus automatically?
• Grammatical information based on word-word collocation
• V+N: 「開立」 (issue) + 「發票」 (invoice)
• A+N: 「不實」 (false) + 「發票」 (invoice)
• Collocational information between any two given mid-frequency words (frequency rank 10,000 or above)
• That occur within a 10-word window of the keyword (5 before and 5 after)
• Requires a corpus of 1 billion words or more
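The 10-word window above (5 tokens before and 5 after the keyword) can be sketched as a simple co-occurrence counter. This is a minimal illustration, not the method used in the talk: the function name `window_cooccurrences` and the toy segmented sentence are assumptions made for the example.

```python
from collections import Counter


def window_cooccurrences(tokens, keyword, span=5):
    """Count tokens found within `span` positions before or after each
    occurrence of `keyword` (span=5 gives the 10-word window)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != keyword:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i:  # skip the keyword occurrence itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical, already-segmented sentence built around the slide's examples.
tokens = "公司 昨天 開立 不實 發票 給 客戶".split()
counts = window_cooccurrences(tokens, "發票")
narrow = window_cooccurrences(tokens, "發票", span=1)
```

With mid-frequency words, most windows contain neither member of a given pair, which is why reliable counts for a specific pair need a corpus on the scale the slide describes.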
Classical Chinese Corpora: Million-Word Scale (M = million)
A New Paradigm: Integrating Multiple Language Resources
From Synergy to Knowledge
• Integrate multiple existing (language) resources to create a new resource
• Allow resources to scale up beyond existing resources
• Generate new knowledge that does not exist in any individual resource
• General methodology (without too much additional manual work):
• Merging existing, similarly annotated resources, or
• Creating an overall conceptual framework for different knowledge/language resources to be integrated
• Automatically
From Synergy to Knowledge
• When A and B have synergy, we say in Chinese that A and B bring out the advantages of each other
• Knowledge is what we know about the world, either descriptive or explanatory
• Knowledge cannot be created from nothing; it comes from
• Keen observation of facts
• Sharp reasoning when we put two or more facts together
• Different language resources can be put together to
• Facilitate observation of facts, and
• Create an environment where different linguistic facts can be more easily associated (for knowledge discovery)
Synergy: Integrating different types of language resources
Research based on Chinese Gigaword Corpus
• Chinese Gigaword Corpus: Introduction
• Implementation of fully automatic corpus tagging
• Word Sketch Engine: Introduction
• Chinese Word Sketch
• Integrating corpus program with lexico-grammatical information
Introduction: CGW Corpus I
Chinese Gigaword Second Edition (2005)
• Produced and released by the Linguistic Data Consortium (LDC); the first edition appeared in 2003.
• Newswire text data in Chinese.
• The second edition contains additional data collected after the publication of the first edition.
• Three distinct international sources:
• Central News Agency of Taiwan
• Xinhua News Agency of Beijing
• Zaobao Newspaper of Singapore
Introduction: CGW Corpus II
Table 1. Coverage of Chinese GigaWord Corpus
Introduction: CGW Corpus III
Markup Structure: All text data are presented in SGML form, using a very simple, minimal markup structure.
Introduction: CGW Corpus IV
Statistics: Table 2. Content of data from each source (Unit: million)
CGW after fully automatic tagging
II.1. Corpus Preparation: (Almost) Fully Automatic Segmentation and Tagging
• Strategy (Ma and Chen 2005): an HMM method for POS tagging of words in the basic lexicon, and a morpheme-analysis-based method (Tseng and Chen 2002) to predict POS tags for new words.
• Integrating language resources:
• Sinica lexicon with 80,000 word entries.
• A set of 50,000 words collected from Sinica Corpus 3.0 (a 10-million-word balanced corpus).
• 5,000 new words from the Xinhua new-words dictionary.
• Tagset: the Sinica tagset is adopted as the uniform tagset.
Preparation: Implementation
• Environment: 2 PCs (2.8 GHz CPU)
• Time consumed: over 3 days
• Output:
• 462 million words of CNA
• 252 million words of XIN
Ma and Huang 2006 (LREC 2006). See http://ckipsvr.iis.sinica.edu.tw/ for a demo of the CKIP segmentation program.
Preparation: Tagging
Segmented and Tagged Article
Summary of Fully Tagged CGW Corpus
• Fully segmented and tagged with the Sinica tagset by Academia Sinica
• Being processed by PKU with their own tagset
• Potentially the most important source for processing and comparative studies of Mandarin Chinese
• Will be available from LDC in 2007.
CWS and Integration of Corpus Search Engine with Lexico-grammatical Information
• Overview
• A word sketch is a one-page, automatic, corpus-derived summary of a word's grammatical and collocational behavior.
• The Word Sketch Engine takes as input a corpus of any language and a corresponding set of grammar patterns, and generates word sketches for the words of that language.
• We synergize rich lexicon-based grammatical information (ICG, Chen and Huang 1992) with stochastic information.
Word Sketch
Word Sketch Engine (Kilgarriff et al.). Register for trial usage at http://www.sketchengine.co.uk
• A versatile corpus viewing and searching tool
• The Word Sketch Engine takes as input a corpus of any language and a corresponding set of grammar patterns, and generates word sketches for the words of that language.
• Based on pre-defined context-free rules to identify grammatical functions (relations)
• Ranked by saliency: frequency-adjusted MI (based on Dekang Lin’s definition of pair-wise MI)
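The saliency ranking can be sketched as follows. This is a minimal sketch assuming that the frequency adjustment multiplies Lin-style pairwise MI over (word, relation, word) triples by the logarithm of the triple frequency; the slide does not give the exact formula, so that choice, the function name `salience_scores`, and the toy triples are all assumptions.

```python
import math
from collections import Counter


def salience_scores(triples):
    """Score each (head, relation, dependent) triple.

    MI follows the pairwise form
        log2( f(w1,r,w2) * f(*,r,*) / (f(w1,r,*) * f(*,r,w2)) )
    and is then multiplied by log(f + 1) as an assumed frequency adjustment.
    """
    triple_freq = Counter(triples)
    head_rel = Counter((w1, r) for w1, r, _ in triples)
    rel_dep = Counter((r, w2) for _, r, w2 in triples)
    rel = Counter(r for _, r, _ in triples)
    scores = {}
    for (w1, r, w2), f in triple_freq.items():
        mi = math.log2(f * rel[r] / (head_rel[(w1, r)] * rel_dep[(r, w2)]))
        scores[(w1, r, w2)] = mi * math.log(f + 1)
    return scores


# Hypothetical extracted triples (counts are invented for illustration).
triples = (
    [("吃", "obj", "飯")] * 8
    + [("吃", "obj", "虧")] * 2
    + [("煮", "obj", "飯")] * 2
    + [("買", "obj", "書")] * 8
)
scores = salience_scores(triples)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Plain MI favors rare pairs; the frequency factor pulls well-attested collocates back up the ranking, which is the motivation for adjusting MI at all.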
Design Criteria of Sketch Engine
• Grammatical relations (GRs) are the information of most interest to both HLT and linguistic research
• However, GRs can only be discovered from collocational data, which requires a very large corpus and high-quality annotation at the same time: a seemingly unsolvable dilemma
• There is a solution when the corpus is big enough
• Context-free patterns allow fairly reliable extraction of a substantial number of relations, if not all
• When enough instances of relations are extracted, the saliency ranking correctly picks out the distributional tendencies and allows users to ignore idiosyncrasies and errors.
WordSketch’s Approach: From Lexical Types to Relation Types
• BNC has 100,000,000 words
• 939,028 word types
• 70,000,000 tuples (relations) extracted
• More than 70 relations per lemma
• For CWS II and the CGW corpus (CNA data)
• 1,917,093 word types
• 59,183,238 tuples (e.g. <eat, obj, rice>)
• More than 30 relations per lemma
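The "relations per lemma" figures can be checked from the counts on the slide. The sketch below divides tuples by word types, on the assumption that word types stand in for lemmas here; the variable names are illustrative only.

```python
# Counts copied from the slide.
bnc_tuples, bnc_types = 70_000_000, 939_028
cna_tuples, cna_types = 59_183_238, 1_917_093

# Average number of extracted relations per word type.
bnc_ratio = bnc_tuples / bnc_types
cna_ratio = cna_tuples / cna_types

print(f"BNC: {bnc_ratio:.1f} relations per type")
print(f"CGW (CNA): {cna_ratio:.1f} relations per type")
```

Both averages agree with the slide's "more than 70" and "more than 30" claims.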
Chinese WordSketch: An Overview
• Concordance
• WordSketch
• Sketch Difference
• Thesaurus
CWS: SketchDiff
Comparing the behaviors of two words
CWS: Thesaurus of 快樂 (happy)
Application to Chinese Corpus: Comparing Thesaurus
We shall know a word by the company it keeps
Context-free patterns: Does Quality of Grammatical Knowledge Matter?
• The implementation of CWS I simply adopted English-like CF grammatical patterns (since Chinese and English supposedly share very similar PS rules)
• However, the result was not very satisfactory
• Missing a lot of relations, such as objects that do not appear right next to a verb
• Mis-classifying topicalized objects as subjects
• Missing objects in non-canonical positions
Linguistic Knowledge Should Solve the above Problems
Comprehensive lexical knowledge of verb frames exists
• Information-based Case Grammar (ICG)
• Encoded on over 40,000 verbs in the Sinica Lexicon
ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)
EXPERIENCER<GOAL[PP[對]]<VI
EXPERIENCER<VI<<GOAL[PP[於]]
THEME<GOAL[PP{對、以}]<VI
THEME<VI<<GOAL[PP[於]]
THEME<VI<<SOURCE[PP{自、於}]
THEME<SOURCE[PP{歸、為}]<VI
Comparing Lexical Knowledge Between CWS I and CWS II
• CWS I: 11 definitions, 11 patterns
• One single pattern for the verb-object relation
• CWS II: 32 definitions, 80 patterns
• 20 patterns for the verb-object relation
• 59,183,238 tuples (e.g. <eat, obj, rice>) from 496,465,879 words
• English has 39 definitions, 40 patterns
Synergy among tagging, statistics, and linguistic knowledge
• Collocations are identified with context-free rules in the Word Sketch Engine
• Collocating pattern for Object from CSE I:
1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2:"Na" [tag!="Na"]
• Challenge: long-distance relations
• 全穀麵包,吃了很健康。 quan.gu mian.bao, chi le hen jian.kang (Whole-grain bread is healthy to eat.)
• 有人嘗試要將這荷花分類,卻越分越累。 you ren chang.shi yao jiang zhe he.hua fen.lei, que yue fen yue lei (Someone tried to classify these lotuses, but grew more tired the more they classified.)
• 他只吃了一口飯… ta zhi chi le yi kou fan (He ate only one mouthful of rice…)
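To see why the single CSE I object pattern misses such cases, the pattern can be approximated with a regular expression over a sentence's tag sequence. This is only a rough emulation under stated assumptions: the real engine evaluates its own query language, and the function name `extract_objects` and the POS tags assigned to the toy sentences are illustrative guesses.

```python
import re

# Approximation of: 1:"V[BCJ]" "Di"? "N[abc]"? "DE"? "N[abc]"? 2:"Na" [tag!="Na"]
# Each tag in the encoded string is followed by one space, so "VC " is one token.
OBJECT_PATTERN = re.compile(
    r"(?:^|(?<= ))(V[BCJ]) (?:Di )?(?:N[abc] )?(?:DE )?(?:N[abc] )?(Na) (?=\S)(?!Na )"
)


def extract_objects(tagged):
    """Return (verb, object) word pairs from a list of (word, tag) tuples."""
    tag_string = "".join(tag + " " for _, tag in tagged)
    pairs = []
    for m in OBJECT_PATTERN.finditer(tag_string):
        # The number of spaces before a match offset equals the token index.
        verb_idx = tag_string.count(" ", 0, m.start(1))
        obj_idx = tag_string.count(" ", 0, m.start(2))
        pairs.append((tagged[verb_idx][0], tagged[obj_idx][0]))
    return pairs


# Hypothetical taggings: the object is adjacent to the verb in the first
# sentence and separated by a numeral and a measure word in the second.
adjacent = [("他", "Nh"), ("吃", "VC"), ("飯", "Na"), ("。", "PERIODCATEGORY")]
distant = [("他", "Nh"), ("只", "Da"), ("吃", "VC"), ("了", "Di"),
           ("一", "Neu"), ("口", "Nf"), ("飯", "Na"), ("。", "PERIODCATEGORY")]

found_adjacent = extract_objects(adjacent)
found_distant = extract_objects(distant)
```

In this emulation the adjacent object is found, while the numeral and measure word in the second sentence block the match, which is the kind of gap the additional CWS II patterns are meant to close.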
Integrating Prior Knowledge in Processing: Knowledge Source
• Information-based Case Grammar (ICG, Chen and Huang 1992)
• Encoded on over 40,000 verbs in the Sinica Lexicon
• ICG Basic Patterns for Stative Pseudo-transitive Verb (VI)
EXPERIENCER<GOAL[PP[對]]<VI
EXPERIENCER<VI<<GOAL[PP[於]]
THEME<GOAL[PP{對、以}]<VI
THEME<VI<<GOAL[PP[於]]
THEME<VI<<SOURCE[PP{自、於}]
THEME<SOURCE[PP{歸、為}]<VI
Integrating Prior Knowledge in Processing: Examples
• 村莊(object) 明天 將 被 夷為平地(VB11) cunzhuang mingtian jiang bei yiweipingdi (The village will be razed to the ground tomorrow.)
• begin time1 location time1 adv? passive_prep adv_string 1:"V[BCJ].*" [tag!="DE"]
• 大量 的 遊客 破壞(VC2) 公園景觀(object) daliang de youke pohuai gongyuan jingguan (Large numbers of tourists damage the park scenery.)
• 1:"VC.*" (particle|prep)? NP not_noun
• (NP is defined as “…noun_modifier{0,2} 2:noun…”.)
Integrating Prior Knowledge in Processing: Partial Result
• Object Recall Comparison
Verb                    CSE I     CSE II
hong2 (red)                 0          0
pao3 (run)                  0      8,704
kan4 (look)            32,350     64,096
da3 (hit)              26,016     47,182
song4 (give)                0     76,378
shuo1 (say)                 0     20,350
xiang1xin4 (believe)        0     52,373
quan4 (persuade)            0      3,852
Integrating Prior Knowledge in Processing: Partial Result II
• Most salient objects for chi1 「吃」 (eat) in CSE II
• Those among the top 20 salient objects from CSE I, but not in CSE II:
• 飯 fan4 (rice) 802 70.96 (4)
• 虧 kui (disadvantage) 329 59.24 (12)
• 苦頭 ku3tou2 (suffering) 194 58.71 (14)
Applications: Chinese WordSketch
• A test version of Chinese Word Sketch is available
• A permanent version of CWS will be available from Academia Sinica soon: http://wordsketch.ling.sinica.edu.tw
Application: Resolving Nominalization
• Chinese verbs are nominalized without overt marking
• Resolving categorical ambiguity with distributional information only
• Two approaches: HMM and Bayesian classifier
• HMM: N-grams
• Classifier: left and right contexts, plus the verb's own sub-class, weighted
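The classifier approach can be illustrated with a small naive Bayes model over three features: left context, right context, and verb sub-class. This is a sketch under assumptions, not the system of Ma and Huang 2006: the class name `NaiveBayes`, the per-feature `weights` mechanism, the add-one smoothing, and the training examples are all invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Naive Bayes over named categorical features with optional weights."""

    def __init__(self, alpha=1.0, weights=None):
        self.alpha = alpha                    # additive smoothing constant
        self.weights = weights or {}          # feature name -> weight (default 1.0)
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # (label, name) -> value counts
        self.values = defaultdict(set)        # feature name -> values seen

    def fit(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            for name, value in features.items():
                self.feature_counts[(label, name)][value] += 1
                self.values[name].add(value)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for name, value in features.items():
                count = self.feature_counts[(label, name)][value]
                # One extra slot in the denominator covers unseen values.
                denom = n + self.alpha * (len(self.values[name]) + 1)
                weight = self.weights.get(name, 1.0)
                score += weight * math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: tags of the neighboring words and the verb's sub-class.
train = [
    ({"left": "DE", "right": "PERIOD", "subclass": "VC"}, "nominal"),
    ({"left": "DE", "right": "VH", "subclass": "VC"}, "nominal"),
    ({"left": "Nh", "right": "Na", "subclass": "VC"}, "verbal"),
    ({"left": "D", "right": "Na", "subclass": "VC"}, "verbal"),
]
model = NaiveBayes(weights={"left": 1.0, "right": 1.0, "subclass": 0.5})
model.fit(train)
after_de = model.predict({"left": "DE", "right": "PERIOD", "subclass": "VC"})
before_noun = model.predict({"left": "D", "right": "Na", "subclass": "VC"})
```

The weights let one feature count for more or less than the others, which is one plausible reading of "weighted" on the slide.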
Nominalization Results (Ma and Huang 2006)
• Best overall HMM performance: 69%
• Best overall Bayesian classifier performance: 74%
Mining Cross-Strait Lexical Difference
• Strategy: use a pair of known contrasting words as seeds and look them up in SketchDifference
• Clinton: 克林頓 (ke4) vs. 柯林頓 (ke1)
• What is found: other translations unique to either the PRC or Taiwan
• and/or patterns found with 克林頓 (PRC) only (vs. 柯林頓 only):
葉利欽 88 54.6 Yeltsin 葉爾勤 (3)
布什 65 49.7 Bush 布希 (4)
萊溫斯基 10 41.3 Lewinsky 呂茵斯基/呂女 (1)
戈爾 20 39.4 Gore 高爾 (2)
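The seed-pair strategy amounts to comparing the collocate lists of two words in the same relation and keeping what appears with only one of them. The sketch below assumes the collocates are available as word-to-frequency dictionaries; the function name `unique_collocates` is hypothetical, the frequencies for 克林頓 come from the slide, and the frequencies for 柯林頓 and the shared collocate 總統 are placeholders.

```python
def unique_collocates(colloc_a, colloc_b):
    """Split two word->frequency dictionaries into the collocates that
    occur with only the first seed and with only the second seed."""
    only_a = {w: f for w, f in colloc_a.items() if w not in colloc_b}
    only_b = {w: f for w, f in colloc_b.items() if w not in colloc_a}
    return only_a, only_b


# and/or collocates of the PRC seed 克林頓 (frequencies from the slide,
# plus one hypothetical shared collocate).
prc = {"葉利欽": 88, "布什": 65, "萊溫斯基": 10, "戈爾": 20, "總統": 50}
# and/or collocates of the Taiwan seed 柯林頓 (placeholder frequencies).
taiwan = {"葉爾勤": 1, "布希": 1, "呂茵斯基": 1, "高爾": 1, "總統": 1}

only_prc, only_taiwan = unique_collocates(prc, taiwan)
```

Collocates shared by both seeds drop out, leaving candidate region-specific translations on each side.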
Adventures in Wen-Land: 文國尋寶記 http://www.sinica.edu.tw/Wen/
• Integrating the following:
• Corpora: Sinica Corpus, Textbook Corpus (3 different editions), Tang poems, Dream of the Red Chamber, Water Margin…
• Lexicon: General, Classifier, Idiom (成語)
• Linked with a corpus/lexicon interface
• Developed by: Huang, Fengju Lo, Hui-chun Hsiao, and a team of teachers
The Substantive Issues: Language Resources Used in WenGuo
• Textual databases (of classical texts)
• Text corpora
• Linguistic and philological knowledge from previous research
• LKB extracted and composed from the above
Adventures in Wen-Land (2001)
Adventures in Wen-Land
• What: a virtual theme park for on-line Chinese language learning and teaching.
• How: the end product of a National Digital Museum Project sponsored by the National Science Council, ROC (A Linguistic and Literary KnowledgeNet for Elementary School Children)
• When: completed in spring 2001