Current Status and Future of Language Resources in Taiwan Chu-Ren Huang

Institute of Linguistics, Academia Sinica. Symposium on Language Resources in Asia, January 19, 2001, Tokyo, Japan.

Presentation Transcript


  1. Current Status and Future of Language Resources in Taiwan
     Chu-Ren Huang, Institute of Linguistics, Academia Sinica
     Symposium on Language Resources in Asia, January 19, 2001, Tokyo, Japan

  2. Languages of Concern
     • Modern Mandarin Chinese
     • Archaic, Ancient, and Near Modern Chinese (the diachronic record of three thousand years of Chinese)
     • Formosan languages (endangered; one of the richest branches of the Austronesian languages)

  3. Sharable Resources for Chinese Computational Linguistics
     • Corpora
     • Lexicons
     • Procedures
     • http://rocling.iis.sinica.edu.tw/ROCLING/

  4. Sharable Resources for Chinese Computational Linguistics--Corpora
     • Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
     • Sinica Treebank
     • Standard Segmentation Corpus
     • ROCLING Corpus
     • Mandarin-Across-Taiwan (MAT) Speech Database

  5. Academia Sinica Balanced Corpus of Mandarin Chinese (Sinica Corpus)
     • 5 million words, segmented and tagged
     • Direct WWW access:
       - http://www.sinica.edu.tw/~tibe/2-words/modern-words/index.html, or
       - http://www.sinica.edu.tw/ftms-bin/kiwi.sh
     • License information:
       - http://rocling.iis.sinica.edu.tw/ROCLING/corpus98/sinicor_E.htm

  6. Sinica Treebank 1.0
     • 38,725 trees, 239,532 words
     • Direct WWW access (1,000 sample trees): http://godel.iis.sinica.edu.tw/CKIP/trees1000.htm
     • License information: http://rocling.iis.sinica.edu.tw/ROCLING/Treebank/Treebank-E.htm

  7. Mandarin-Across-Taiwan (MAT) Speech Database
     • Speech files are collected over telephone networks. The content includes spontaneous speech (short answer statements) and read speech (numbers, Mandarin syllables, words of 2 to 4 syllables, and phonetically balanced sentences).
     • MAT-160 (160 speakers): http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm
     • MAT-2000: http://rocling.iis.sinica.edu.tw/ROCLING/MAT/index_cf.htm

  8. A Database of Chinese Characters (i.e. Kanji)
     • For each character, the component composition (部件組成) information is important
     • Over 10,000 components (部件) have been identified for Chinese, roughly 2,000 of them productive
     • http://www.dmpo.sinica.edu.tw:8000/~words/sou/sou.html
     • Optional: radicals, number of strokes, variants
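
To make the idea of component composition concrete, the following is a minimal Python sketch of how per-character component records might be stored and queried. The record layout, the field names, and the three sample entries are assumptions made for illustration only; they are not taken from the Academia Sinica database, whose actual schema is not described on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CharacterEntry:
    """Hypothetical record: one character with its component composition."""
    char: str
    components: tuple      # components (部件) that make up the character
    radical: str = ""      # optional information, as listed on the slide
    strokes: int = 0       # optional: number of strokes


# Illustrative sample entries (not drawn from the real database).
SAMPLE_ENTRIES = [
    CharacterEntry("明", ("日", "月"), radical="日", strokes=8),
    CharacterEntry("林", ("木", "木"), radical="木", strokes=8),
    CharacterEntry("相", ("木", "目"), radical="目", strokes=9),
]


def build_component_index(entries):
    """Map each component to the set of characters that contain it."""
    index = defaultdict(set)
    for entry in entries:
        for component in entry.components:
            index[component].add(entry.char)
    return index


def characters_with(index, component):
    """Return the characters containing `component`, sorted by code point."""
    return sorted(index.get(component, set()))


if __name__ == "__main__":
    idx = build_component_index(SAMPLE_ENTRIES)
    print(characters_with(idx, "木"))
```

An inverted index of this kind is one simple way to ask which characters share a given component, which is the sort of query that component information makes possible.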

  9. Sharable Resources for Chinese Computational Linguistics--Procedures
     Segmentation Standard for Chinese Language Processing
     • Segmentation Standard: http://godel.iis.sinica.edu.tw/ROCLING/juhuashu1.htm
     • Standard Segmentation Corpus (2 million words, segmented): http://godel.iis.sinica.edu.tw/ROCLING/corpus98/segcorp_E.htm
     • Standard Segmentation Lexicon (42,138 entries, with frequency): http://godel.iis.sinica.edu.tw/ROCLING/corpus98/segdic_E.htm
     • Segmentation Program (free download): http://godel.iis.sinica.edu.tw/CKIP/ws/
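
As a rough illustration of how a lexicon with frequencies can drive word segmentation, here is a minimal Python sketch of unigram maximum-probability segmentation using dynamic programming. This is a generic textbook baseline, not the algorithm of the CKIP segmentation program and not an implementation of the segmentation standard. The toy lexicon, its frequency counts, and the parameters `max_word_len` and `unknown_count` are all hypothetical.

```python
import math

# Hypothetical toy lexicon: word -> frequency count (values invented for illustration).
TOY_LEXICON = {
    "中文": 60, "中": 30, "文": 20,
    "語言": 50, "語": 5, "言": 3,
    "資源": 40, "資": 2, "源": 2,
}


def segment(text, lexicon, max_word_len=4, unknown_count=0.5):
    """Split `text` into the word sequence with the highest unigram probability.

    Characters missing from the lexicon are kept as single-character words
    with a small assumed count, so every input can be segmented.
    """
    total = sum(lexicon.values())
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)

    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            if word in lexicon:
                count = lexicon[word]
            elif len(word) == 1:
                count = unknown_count   # fallback for unknown characters
            else:
                continue                # unknown multi-character strings are skipped
            score = best[start][0] + math.log(count / total)
            if score > best[end][0]:
                best[end] = (score, start)

    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    pos = n
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


if __name__ == "__main__":
    print(segment("中文語言資源", TOY_LEXICON))
```

With real data, the counts would come from a frequency lexicon such as the one listed above, and the output could be checked against a segmented corpus.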

  10. Sharable Resources in Languages Other than Modern Mandarin
      • Classical Chinese corpora: http://www.sinica.edu.tw/~tibe/2-words/old-words/index.html
      • Corpus of Formosan Austronesian Languages (under construction, part of the National Digital Archive Initiative)
      • Lexical databases of other Sino-Tibetan and Tibeto-Burman languages

  11. Synchronic and Diachronic Chinese Corpora
      • Three projects sponsored by the CCK Foundation (1990-1995)
      • Chu-Ren Huang, Keh-jiann Chen, and Pei-chuan Wei, Academia Sinica
      • Paul Thompson, SOAS, University of London
      • Chaofen Sun, Stanford University

  12. Mechanisms for Scholarly Exchange and Collaboration
      • Department of International Programs, NSC: http://www.nsc.gov.tw/int/2_cooperation/index_02.html
      • Canada: NRC
      • France: CNRS
      • Japan: EAACST
      • Germany: DFG, DAAD, DKFG
      • Netherlands: NWO, IIAS
      • USA: NSF, NIH
      • UK: Royal Society of London, etc.

  13. Other Resources in Our Area: Singapore (K.T. Lua)
      • Consortium of Asian Language Resources: http://cslp.comp.nus.edu.sg/cslp/index.htm
      • Last updated Oct. 1999
      • Contains detailed information on about 50 (mostly Chinese) linguistic resources, including comprehensive reviews as well as license information

  14. Other Resources: HowNet: An Attribute-based Semantic Network (Dong Zhendong) http://www.keenage.com

  15. Future
      1. Linguistic Ontology: Wordnets
         • Bi- or multilingual wordnets in EuroWordNet style
         • Collaboration among Chinese-speaking communities (Academia Sinica, City University of Hong Kong, Peking University)

  16. Future
      2. Language Archives under the Digital Archive National Project
         • Digital Archive Initiatives started in 2001
         • The Language Resource Project (PI: Huang) includes 3 corpus projects:
           - 20th-century Taiwan Mandarin
           - Near Modern Chinese (17th-18th century)
           - Pilot project on Formosan language corpora
         • Expected to become a National Project in 2002

  17. Future
      3. A universal and sharable scheme for encoding Chinese characters
      4. Join the Open Language Archives Community (OLAC): http://www.language-archives.org
      5. Participation in and conformance to International Standards for Language Engineering (ISLE)
