Language Archiving: Document Annotation and Corpus Linguistics. Keh-Jiann Chen, Institute of Information Science, Academia Sinica
The goals of NDAP are (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”]): • Preserving national cultural collections. • Popularizing fine cultural holdings. • Strengthening cultural heritage as well as guiding cultural development. • Popularizing knowledge and improving information sharing. • Enhancing education and learning. • Bootstrapping cultural and value-added industries. • Improving literacy, creativity and quality of life. • Promoting international cooperation and resource sharing.
Digital Archives and TSL coordinates (quoted from [Hsieh 2002, “Digital Media, Informatics, and Cultural Heritage”]):
Language Archiving is a Collection of Linguistic Resources • The collection of a linguistic archive (such as a balanced corpus) is guided by a set of design criteria • Design criteria define natural classes of texts in a collection • Each criterion establishes a dimension for comparative studies • www.sinica.edu.tw/SinicaCorpus
How to make a single archive more versatile • One Corpus or Many Corpora? • Or How to make a Balanced Corpus Biased? • With Textual Markup Information (e.g. Metadata) • genre, style, mode, topic, medium etc. • word, part-of-speech, structure tags, semantic tags • Alignment for heterogeneous corpora
Creating Synergy from Uniform Resource Type • Each document is marked up with textual description features: topic, style etc. • Each feature selects a subset of documents • Sub-corpora (or new archives) can be created online according to user’s specification
Creating Synergy from Uniform Resource Type • Classical Chinese Corpora • http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html • Corpus of Formosan Austronesian Languages • Under construction, part of the National Digital Archive Initiative • Lexical Databases of other Sino-Tibetan and Tibeto-Burmese Languages
Creating Synergy from Heterogeneous Resource Type • Bi-lingual or multi-lingual corpora • Text and speech aligned corpora • Synchronized corpora collected from different areas
How to create a balanced corpus? Creation of the Sinica Corpus: a word-segmented modern Chinese corpus with POS tagging
Introduction • TEI : A corpus is a body of texts put together in a principled way, typically in order to construct a sample of a given language or sublanguage. • It must be representative and balanced if it claims to faithfully represent the facts in that language or sublanguage [Sinclair 87].
Introduction • Sinica balanced corpus • Texts are classified according to 5 different features: (1) Genre (2) Style (3) Mode (4) Topic (5) Medium • Word segmentation standard • Segmentation standard for Chinese language processing • Http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm • Part-of-speech tagging • 46 syntactic categories
Sinica Corpus • philosophy 10% • natural sciences 10% • social 35% • arts 5% • general/leisure 20% • literature 20%
%% 文類 Genre=報導reportage
%% 文體 Style=記敘 Description
%% 語式 Mode= written
%% 主題 Topic=訊息 Message
%% 媒體 Medium=報紙 Newspaper
%% 姓名 Author’s name=
%% 性別 Gender=男女
%% 國籍 Nationality=中華民國Chinese
%% 母語 Mother tongue=中文Chinese
%% 出版單位 Publisher=中研院週報Academia Sinica
%% 出版地 Place=台北市台灣Taipei Taiwan
%% 出版日期 date=1994
%% 版次 version=
%% 標題 Title=國史研習會:中國宗教與社會
1. 。(PERIODCATEGORY) 由(P) 本(Nes) 院(Nc) 歷史(Na) 語言(Na) 研究所(Nc) 主辦(VC) ,(COMMACATEGORY)
***********************************************
2. ,(COMMACATEGORY) 台灣(Nc) 大學(Nc) 歷史系(Nc) 暨(Caa) 研究所(Nc) 與(Caa) 清華(Nb) 大學(Nc) 歷史系(Nc) 暨(Caa) 研究所(Nc) 協辦(VC) 之(DE) 「(PARENTHESISCATEGORY) 國史(Na) 研習會(Na) :(COLONCATEGORY)
***********************************************
3. :(COLONCATEGORY) 中國(Nc) 宗教(Na) 與(Caa) 社會(Na) 」(PARENTHESISCATEGORY) ,(COMMACATEGORY)
***********************************************
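A tagged line in the sample above is a sequence of `word(TAG)` tokens separated by spaces. As a minimal sketch of how such a line could be read back into (word, tag) pairs, assuming the tag is always the content of the last pair of parentheses in a token (the function name `parse_tagged_line` is hypothetical, not part of the Sinica Corpus tools):

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split a line of 'word(TAG)' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Sentence numbers ("1.") and asterisk separators carry no tag: skip them.
        if not token.endswith(")") or "(" not in token:
            continue
        cut = token.rfind("(")  # the tag sits inside the last parentheses
        word, tag = token[:cut], token[cut + 1:-1]
        if word and tag:
            pairs.append((word, tag))
    return pairs


sample = "1. 由(P) 本(Nes) 院(Nc) 歷史(Na) 語言(Na) 研究所(Nc) 主辦(VC) ,(COMMACATEGORY)"
pairs = parse_tagged_line(sample)
print(pairs[:3])
```

Splitting on the last opening parenthesis rather than the first keeps punctuation tokens such as `「(PARENTHESISCATEGORY)` intact.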
Introduction • Motivations for designing a corpus management system • It is hard to collect, maintain, classify, and tag a large amount of text without a management system. • Automate the word segmentation and tagging processes. • Maintain the precision and consistency of data collection. • Handle out-of-vocabulary words.
Database for Texts [Diagram: the construction system stores each text as a record in the text database; each record has fields for the text ID, the text, the tagged text, and the features.]
Construction Flow [Flowchart, components in the order shown: Internet (WWW) and text files • Text Collection Module • Text Database (SQL) • Unknown Word Identification Module (outputs text and new words) • Inspection System: New Word Editor (outputs revised new words) • Domain Lexicons • Word Segmentation and POS-tagging Module (outputs tagged text) • Inspection System: Tagged Text Editor (outputs revised tagged text)]
Text Collection Module • Purpose: semi-automatically collect texts of various kinds from the WWW. • Features: automatic feature extraction and document classification.
Unknown Word Identification Module • Identify new words before word segmentation • Methods: • Detect the existence of unknown words • Apply statistical rules and morphological rules to identify unknown words
Word Segmentation & Tagging Module • Following the word segmentation standard for information processing, the segmentation program segments the input text and tags the resulting words with their parts of speech. • Method: word matching against the lexicon and the newly identified words. • Segmentation process: longest matching, with heuristic rules to resolve segmentation ambiguities. • POS tagging: a bi-gram model for resolving POS ambiguities.
Word Segmentation & Tagging Module (cont) • Additional features: a user-defined dictionary or domain dictionary can be incorporated to improve word segmentation accuracy. • Domain dictionary: e.g. a medical dictionary or a dictionary of computing terminology. • Extracted unknown words: new words, such as personal names, occur in text all the time. The unknown word identification process extracts them, and they supplement the dictionary.
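[Diagram: the general lexicon, the domain lexicon, and the unknown words extracted from the text feed the word segmentation and tagging step, which turns text into tagged text. Example: 台大本學期舉辦減重班 ("National Taiwan University is holding a weight-loss class this semester") becomes 台大(Nc) 本(Nes) 學期(Na) 舉辦(VC) 減重班(Na)]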
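The longest-matching step can be sketched as a forward maximum match over the merged lexicons. This is an illustration under stated assumptions, not the Sinica segmenter itself: the three small word sets are made up for the slide's example sentence, the heuristic rules for resolving ambiguities are left out, and `longest_match_segment` and `max_len` are hypothetical names.

```python
from typing import List, Set


def longest_match_segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward segmentation: at each position take the longest lexicon word."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon has it.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop cannot stall.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


general_lexicon = {"本", "學期", "舉辦"}   # toy stand-in for the general lexicon
domain_lexicon = {"台大"}                  # toy stand-in for a domain lexicon
unknown_words = {"減重班"}                 # toy stand-in for extracted unknown words
lexicon = general_lexicon | domain_lexicon | unknown_words

segments = longest_match_segment("台大本學期舉辦減重班", lexicon)
print(segments)
```

Without 減重班 in the set of extracted unknown words, the same input falls back to single characters for that span, which is why unknown words are identified before segmentation.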
Inspection System • Purpose: to assure the quality of the corpus, the automatically processed texts need to be verified by human experts. An inspection system was therefore designed to speed up the verification process. • Major functions: • Editing functions: errors in word breaks, POS tags, features, and sentence breaks can be fixed with a mouse click. • Reminder functions: the system highlights common errors, prefixes, and suffixes in the text. • Short-term memory: the system recalls the most recent modifications and fixes errors of the same type automatically.
Inspection System (cont) • Provide lexical information and examples: • Friendly user interface:
J塑膠(Na) 皮(Na)→塑膠皮(Na)
J公文(Na) 包(VC)→公文包(Na)
J村(Nc) 上(Ncd)→村上(Nb)
J毛利(Na) 遜(VH)→毛利遜(Nb)
J吉姆(Nb) 毛利遜(Nb)→吉姆毛利遜(Nb)
D世界級(Na)→世界(Nc) 級(Na)
D科學方法(Na)→科學(Na) 方法(Na)
D三代(Nd)→三(Neu) 代(Na)
D交互作用(Na)→交互(VH) 作用(Na)
D如一(VH)→如(P) 一(Neu)
C改變(VC)→改變(Na)
C傳統(VH)→傳統(Na)
C企畫(VC)→企畫(Na)
C自然(D)→自然(VH)
C起來(VA)→起來(Di)
F反射(VJ)→反射(VJ)[+nom]
F遮雨(VA)→遮雨(VA)[+nom]
F保持(VJ)[+nom]→保持(VJ)
F萊特班(Na)→萊特班(Na)[+prop]
F感動(VHC)→感動(VHC)[+nom]
Corpus Management System • Advantages: • The corpus management system speeds up corpus construction and reduces human effort. • It also increases the precision and consistency of word segmentation and POS tagging. • The database system supports searching, managing, retrieving, and reorganizing texts.
Using Corpora • Reorganizing sub-corpora • Searching tools
Reorganizing sub-corpora • Sub-corpora can be reorganized according to different features. • Sport corpus • Spoken corpus • Corpus of the most recent three months • News corpus • Corpus of poetry
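Because every document carries textual description features, a sub-corpus is simply the subset of documents whose features match a user's specification. A minimal sketch, assuming each document is stored as a dictionary of features; the field names, feature values, and the function `select_subcorpus` are hypothetical and only illustrate the idea.

```python
from typing import Dict, List

Document = Dict[str, str]


def select_subcorpus(docs: List[Document], **criteria: str) -> List[Document]:
    """Keep the documents whose features equal every requested value."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Toy records; real feature values would come from the text database.
docs = [
    {"id": "t1", "genre": "reportage", "mode": "written", "topic": "sports", "medium": "newspaper"},
    {"id": "t2", "genre": "poetry", "mode": "written", "topic": "literature", "medium": "book"},
    {"id": "t3", "genre": "conversation", "mode": "spoken", "topic": "sports", "medium": "audio"},
]

sport_corpus = select_subcorpus(docs, topic="sports")
spoken_corpus = select_subcorpus(docs, mode="spoken")
written_sport = select_subcorpus(docs, topic="sports", mode="written")
print([d["id"] for d in sport_corpus])
```

Each feature selects a subset, and combining features intersects the subsets, so one balanced archive can serve as many specialized corpora.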
Corpus Searching Tools • KWIC search: key word vector and what it matches • [代表, N, φ, φ]: every occurrence of the word 代表 daibiao tagged as a noun • [φ, VA, φ, 1]: all monosyllabic intransitive verbs (VA) • [φ, φ, +fw, φ]: all foreign words • [..化, V, φ, 3]: all trisyllabic verbs with the suffix 化 hua '-ize'
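A key word vector has four slots: word pattern, part of speech, feature, and syllable count, with φ as a wildcard. The sketch below shows one way such a vector could be matched against tagged tokens. It rests on several assumptions that the slides do not confirm: φ is represented by `None`, the word pattern is treated as a regular expression in which `.` stands for any one character, the part-of-speech slot matches as a tag prefix (so `N` covers `Na`, `Nc`, and so on), and one Chinese character counts as one syllable. `Token`, `matches`, and `kwic` are hypothetical names.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

PHI = None  # stands in for the wildcard φ


@dataclass
class Token:
    word: str
    tag: str
    features: Set[str] = field(default_factory=set)


Vector = Tuple[Optional[str], Optional[str], Optional[str], Optional[int]]


def matches(token: Token, vector: Vector) -> bool:
    """Check one token against [word pattern, POS, feature, syllable count]."""
    word_pat, tag_prefix, feature, length = vector
    if word_pat is not PHI and re.fullmatch(word_pat, token.word) is None:
        return False
    if tag_prefix is not PHI and not token.tag.startswith(tag_prefix):
        return False
    if feature is not PHI and feature not in token.features:
        return False
    if length is not PHI and len(token.word) != length:
        return False
    return True


def kwic(tokens: List[Token], vector: Vector, window: int = 2):
    """Return (left context, key word, right context) for every matching token."""
    hits = []
    for i, tok in enumerate(tokens):
        if matches(tok, vector):
            left = [t.word for t in tokens[max(0, i - window):i]]
            right = [t.word for t in tokens[i + 1:i + 1 + window]]
            hits.append((left, tok.word, right))
    return hits


tokens = [Token("台大", "Nc"), Token("本", "Nes"), Token("學期", "Na"),
          Token("舉辦", "VC"), Token("減重班", "Na")]
print(kwic(tokens, (PHI, "VC", PHI, PHI)))
```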
Corpus Searching Tools • Filtering • The filtering methods include: • random sampling, • removing redundant samples, • removing irrelevant samples by restricting the content in the window of key words. • Displaying, printing, and storing • The resulting KWIC files can be displayed on screen, printed, or stored for future processing.
Corpus Searching Tools • Statistics: • Statistical functions provide the distributions of words and categories occurring within the context window of key words. • For instance, the category distribution of the word 把 ba (category, tag, frequency, %): • preposition, P, 2704, 92.57 • measure, Nf, 211, 7.22 • transitive verb, Vc, 3, 0.10 • determiner, Neqb, 2, 0.07 • noun, Na, 1, 0.03
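The percentages in the distribution for 把 follow directly from the raw frequencies (2921 occurrences in total). A small sketch of such a statistics function, with the tag list rebuilt from the frequencies on the slide; `category_distribution` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def category_distribution(tags: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Return (tag, frequency, percentage) rows, most frequent first."""
    counts = Counter(tags)
    total = sum(counts.values())
    return [(tag, n, round(100 * n / total, 2)) for tag, n in counts.most_common()]


# One tag per occurrence of 把, rebuilt from the frequencies on the slide.
ba_tags = ["P"] * 2704 + ["Nf"] * 211 + ["Vc"] * 3 + ["Neqb"] * 2 + ["Na"] * 1

distribution = category_distribution(ba_tags)
for tag, freq, pct in distribution:
    print(f"{tag:5s} {freq:5d} {pct:6.2f}")
```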
Corpus Searching Tools • Collocation finding • The system finds collocations of the key words by computing the mutual information [Church & Hanks 90] of the key words with the words or parts of speech in a user-defined window. • Mutual information: $I(x,y) = \log \frac{P(x,y)}{P(x)\,P(y)}$ • $I(x,y) \gg 0$: x and y are strongly associated. • $I(x,y) \approx 0$: x and y are unrelated. • $I(x,y) \ll 0$: x and y are mutually exclusive.
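As a rough sketch of how the mutual information score could be computed for one key word, the code below counts co-occurrences inside a symmetric window and estimates every probability as a count divided by the corpus size. These estimation choices, the base-2 logarithm, and the toy token list are assumptions for illustration; the slides do not specify how the system estimates the probabilities. `pmi_scores` is a hypothetical name.

```python
import math
from collections import Counter
from typing import Dict, List


def pmi_scores(tokens: List[str], key: str, window: int = 10) -> Dict[str, float]:
    """Score every word seen within `window` tokens of `key` by mutual information."""
    n = len(tokens)
    unigram = Counter(tokens)
    cooc = Counter()
    for i, word in enumerate(tokens):
        if word != key:
            continue
        for j in range(max(0, i - window), min(n, i + window + 1)):
            if j != i:
                cooc[tokens[j]] += 1
    p_key = unigram[key] / n
    scores = {}
    for word, count in cooc.items():
        p_xy = count / n          # estimated joint probability
        p_y = unigram[word] / n   # estimated marginal probability
        scores[word] = math.log2(p_xy / (p_key * p_y))
    return scores


# Toy corpus: 構成 always appears next to 威脅, 的 only sometimes.
toy = ["構成", "威脅", "的", "人", "的", "書", "構成", "威脅"]
scores = pmi_scores(toy, "威脅", window=1)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```

Sorting the scores in descending order gives a ranked collocation list of the kind shown for 威脅.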
Examples • The top 16 collocations of ‘威脅’ within the window of distance 10. • 1. 飽受 2. 恫嚇 3. 綑綁 4. 構成 • 5. 嚴重 6. 崩坍 7. 恐怖 8. 恐嚇 • 9. 遭受 10. 刀槍 11. 滾滾 12. 安全 • 13. 尖刀 14. 健康 15. 成全 16. 備受
Corpus Linguistics • A corpus provides ample examples of word uses and syntactic patterns. It also reflects the real uses of the language and their frequency distribution. • Comparative studies can be made within KWIC results or between sub-corpora. • Automatic knowledge extraction techniques can be applied to a corpus to reduce manual effort.
Lexicography • Corpus provides ample examples of different word uses and syntactic patterns. • Corpus reflects the real uses of the language and their frequency distribution. • Collocations show idiomatic patterns and they are the most important uses of a word. • Examples can be extracted from corpora. • Senses and syntactic functions can be ordered according to their frequencies. • CoBuild, Oxford, EDR, Collocation Dictionary of Noun and Measure Words are examples of using corpora for editing dictionaries.
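Language Modeling • Markov language model: the probabilities are estimated from corpora. • $P(W_1 W_2 \ldots W_m) = P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_m \mid W_1 W_2 \ldots W_{m-1})$ • N-gram model: $P(W_1 W_2 \ldots W_m) \approx P(W_1)\,P(W_2 \mid W_1)\,P(W_3 \mid W_1 W_2) \cdots P(W_m \mid W_{m-n+1}, \ldots, W_{m-1})$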
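For n = 2 the model reduces to a bi-gram model, whose probabilities can be estimated from a corpus by relative frequency. The sketch below is a minimal illustration under stated assumptions: maximum-likelihood estimates with no smoothing, a `<s>` start symbol added to each sentence, and a two-sentence toy corpus. `train_bigram` and `sentence_prob` are hypothetical names.

```python
from collections import Counter
from typing import List, Tuple


def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count histories and bi-grams over segmented sentences."""
    history, bigram = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent
        for prev, cur in zip(padded, padded[1:]):
            history[prev] += 1
            bigram[(prev, cur)] += 1
    return history, bigram


def sentence_prob(sent: List[str], history: Counter, bigram: Counter) -> float:
    """P(W1..Wm) as the product of P(Wi | Wi-1), estimated by relative frequency."""
    prob = 1.0
    padded = ["<s>"] + sent
    for prev, cur in zip(padded, padded[1:]):
        if history[prev] == 0:
            return 0.0  # unseen history: no estimate without smoothing
        prob *= bigram[(prev, cur)] / history[prev]
    return prob


corpus = [["台大", "舉辦", "減重班"], ["台大", "舉辦", "研習會"]]
history, bigram = train_bigram(corpus)
p_seen = sentence_prob(["台大", "舉辦", "減重班"], history, bigram)
p_unseen = sentence_prob(["減重班", "舉辦"], history, bigram)
print(p_seen, p_unseen)
```

An unsmoothed model gives probability zero to any sentence containing an unseen bi-gram, so a practical system would add some form of smoothing.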
Language Modeling • Applications of language modeling: • Input methods: speech recognition, character recognition, spelling checking, phonetic input, … • Data compression: Huffman coding, arithmetic coding, … • Categorization: text classification, POS tagging, sense disambiguation, word segmentation, …
Machine Translation • IBM [Brown et al. 1990] used the bilingual Hansard corpus to build translation models. • Translating a French sentence F into an English sentence E is equivalent to finding the E that maximizes P(E|F). By Bayes' rule, P(E|F) = P(E)*P(F|E)/P(F), and P(F) is fixed for a given F, so this is the E that maximizes P(E)*P(F|E). • P(E) is estimated from a bi-gram model. • P(F|E) is estimated from an aligned bilingual corpus.