Current Status and Future of Language Resources in Taiwan Chu-Ren Huang Institute of Linguistics, Academia Sinica Symposium on Language Resources in Asia January 19, 2001, Tokyo, Japan
Languages of Concern • Modern Mandarin Chinese • Archaic, Ancient, and Near Modern Chinese (the diachronic record of three thousand years of Chinese) • Formosan Languages (endangered; one of the richest branches of the Austronesian languages)
Sharable Resources for Chinese Computational Linguistics • Corpora • Lexicons • Procedures • http://rocling.iis.sinica.edu.tw/ROCLING/
Sharable Resources for Chinese Computational Linguistics--Corpora • -Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus) • -Sinica Treebank • -Standard Segmentation Corpus • -ROCLING Corpus • -Mandarin-Across-Taiwan (MAT) Speech Database
Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus) • 5 million words, segmented and tagged • Direct WWW Access • -http://www.sinica.edu.tw/~tibe/2-words/modern-words/index.html OR • -http://www.sinica.edu.tw/ftms-bin/kiwi.sh • License Information • -http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm
Sinica Treebank 1.0 • 38,725 trees, 239,532 words • Direct WWW Access (1,000 sample trees) • -http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm • License Information • -http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm
Mandarin-Across-Taiwan (MAT) Speech Database • Speech files are collected through telephone networks. The content includes spontaneous speech (short answers) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, phonetically balanced sentences). • MAT-160 (160 speakers) • -http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm • MAT-2000 • -http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm
A Database of Chinese Characters (i.e. Kanji) • For each character, the component composition (部件組成) information is important • Over 10,000 components (部件) have been identified for Chinese, roughly 2,000 of them productive • http://www.dmpo.sinica.edu.tw:8000/~words/sou/sou.html • Optional: radicals, number of strokes, variants
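To illustrate why component composition is a useful thing to store per character, here is a minimal sketch of a component table with an inverted index. The four entries and the names `COMPONENTS` and `build_index` are illustrative assumptions; they are not taken from the Academia Sinica database, whose actual schema is not described in these slides.

```python
from collections import defaultdict

# Hypothetical toy data: character -> list of components (部件).
# These decompositions are common textbook examples, not database records.
COMPONENTS = {
    "好": ["女", "子"],
    "明": ["日", "月"],
    "林": ["木", "木"],
    "晴": ["日", "青"],
}

def build_index(components):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for char, parts in components.items():
        for part in parts:
            index[part].add(char)
    return index

index = build_index(COMPONENTS)
# Look up every character in the toy table that contains the component 日.
print(sorted(index["日"]))
```

With an index like this, a character that has no assigned code point could still be described and searched by the components it is built from.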
Sharable Resources for Chinese Computational Linguistics--Procedures • Segmentation Standard for Chinese Language Processing • Segmentation Standard • -http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm • Standard Segmentation Corpus (2 million words, segmented) • -http://godel.iis.sinica.edu.tw/ROCLING/corpus98/segcorp_E.htm • Standard Segmentation Lexicon (42,138 entries, with frequency) • -http://godel.iis.sinica.edu.tw/ROCLING/corpus98/segdic_E.htm • Segmentation Program (free download) • -http://godel.iis.sinica.edu.tw/CKIP/ws/
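As a rough illustration of how a segmentation lexicon can drive word segmentation, the sketch below uses greedy forward maximum matching. This is a generic textbook technique chosen for brevity, not the algorithm of the downloadable segmentation program; the function `segment`, the toy lexicon, and its frequency counts are all invented for the example.

```python
def segment(text, lexicon, max_len=4):
    """Greedy forward maximum matching.

    At each position, take the longest substring (up to max_len characters)
    found in the lexicon; fall back to a single character if nothing matches.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical toy lexicon: word -> frequency (counts are made up).
# This greedy matcher only tests membership and ignores the frequencies.
toy_lexicon = {
    "中央": 120, "研究": 300, "研究院": 80, "中央研究院": 40,
    "語言": 150, "語言學": 60, "研究所": 90,
}

print(segment("中央研究院語言學研究所", toy_lexicon, max_len=5))
```

A frequency-aware segmenter could use the counts to choose between competing segmentations instead of always preferring the longest match; that refinement is left out of this sketch.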
Sharable Resources in Languages Other than Modern Mandarin • Classical Chinese Corpora • http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html • Corpus of Formosan Austronesian Languages • Under construction, part of the National Digital Archive Initiative • Lexical Databases of other Sino-Tibetan and Tibeto-Burman Languages
Synchronic and Diachronic Chinese Corpora • Three Projects Sponsored by the CCK Foundation (1990-1995) • Chu-Ren Huang, Keh-jiann Chen and Pei-chuan Wei, Academia Sinica • Paul Thompson, SOAS, University of London • Chaofen Sun, Stanford University
Mechanisms for Scholarly Exchange and Collaboration • Department of International Programs, NSC • http://www.nsc.gov.tw/int/2_cooperation/index_02.html • Canada: NRC • France: CNRS • Japan: EAACST • Germany: DFG, DAAD, DKFG • Netherlands: NWO, IIAS • USA: NSF, NIH • UK: Royal Society of London, etc.
Other Resources in our area: Singapore (K.T. Lua) • Consortium of Asian Language Resources • http://cslp.comp.nus.edu.sg/cslp/index.htm • ---Last Updated Oct. 1999 • ----Contains detailed information on about 50 (mostly Chinese) linguistic resources, including comprehensive reviews as well as license information
Other Resources: HowNet: An attribute-based Semantic Network (Dong Zhendong) http://www.keenage.com
Future 1. Linguistic Ontology: Wordnets --Bi- or Multi-lingual Wordnets in EuroWordNet style --Collaboration among Chinese-speaking communities (Academia Sinica, City University of Hong Kong, Peking University)
Future • 2. Language Archives under the Digital Archive National Project • --Digital Archive Initiatives started in 2001 • --The Language Resource Project (PI: Huang) includes 3 corpus projects: 20th Century Taiwan Mandarin, Near Modern Chinese (17th-18th century), and a pilot project on Formosan language corpora • --Expected to become a National Project in 2002
Future 3. A universal and sharable scheme for encoding Chinese characters 4. Join the Open Language Archives Community (OLAC) http://www.language-archives.org 5. Participation in and conformance to International Standards for Language Engineering (ISLE)