
Infrastructures for the Korean Language



  1. Infrastructures for the Korean Language • Key-Sun Choi

  2. Academic Society • SIG-Korean Language Computing under Korea Information Science Society • 300 members • Korea Information Society • linguistics oriented

  3. KIBS: Korea Information Base and Systems • Purpose: • To improve Korean Language Processing Technology • To promote the Korean Software Industry • In the planning phase (1993), targeted at Hangul Wordprocessor, Machine Translation and Korean Linguistic Research • 1995 - 1997 (Phase 1): “word” • Joint project of two ministries + Industry • Ministry of Science & Technology, Ministry of Culture • 1998 - 2000 (Phase 2): “sentence” • Ministry of Science & Technology + Industry only • To be evaluated in October 2000 • 2001 - 2003 (Phase 3): “discourse” - not decided • http://kibs.kaist.ac.kr/

  4. King Sejong Project • Purpose • To promote Korean language research on the linguistics side • To prepare for language planning • for the unification of South and North Korea • for international use of Korean • Sponsor: Ministry of Culture • Period: 1998 - 2007 (10 years) • Items • corpus, dictionary, internationalization, terminology, education, font, old Korean, old Chinese characters • http://www.sejong.or.kr/

  5. KIBS: Architecture (diagram; layers run from End User down to System)
     Application Level • Automatic Speech Translation • Word processor • MT system • Information Retrieval System
     Engine Level, User (Programmer) • Spell checker • Style checker • MT engine • IR engine • UI engine
     Distributed Resource Management System
     Engine Module Level • MA1, TA1, PA1, WSD1, DA1, RM1 • MA2, TA2, PA2, WSD2, DA2, RM2
     Knowledge Level, User (lexicographer) • Quality Management System • Electronic Dictionary • Common Knowledge • Terminology • Ontology • Domain Knowledge
     Knowledge Source Level, User (Dictionary) • corpus • Knowledge extractor • Basic DB • Tagging Support Tool • Master DB • MRD • Terminology DB

  6. KIBS: Introduction • Title of Project • KIBS I: Integrated Korean Information Base • KIBS II: On Development of Deep-Level Processing and Quality Management Technology for Very Large Korean Information Base • Outline • Term: 1994.12.4 ~ 2004.9.30 (10 years) • Sponsor: Ministry of Science and Technology • Staff: 50 persons/year

  7. The Goal of the First Step
     The Development of an Integrated Environment and Support Management System • Standard Module Interface • Corpus and Electronic Dictionary Development and Management System • Korean Part-of-Speech Tagging System • Korean Syntactic Tagging System • Korean/English Alignment System
     The Standardization & the Specification for Korean Information Base • Terminological Data Base Development and Management System • Standard Korean Input/Output Environment • Standardized Methodology for the Construction of a Balanced Corpus • Part-Of-Speech Transfer Dictionary Rules and an Example Package
     The Construction of Korean Information Base • Tree-Tagged Corpus • Word-Level Narrative Speech Data Base • Hand-written Hangul scripts of high frequency

  8. The Goal of the Second Step
     Development/Management System of Electronic Dictionary for Sentence Analysis/Generation (100,000 entries) • Syntactic Information Base for Syntactic Analysis/Generation • Semantic Information Base for Semantic Analysis/Generation • Additional Information on Language and GUI for Developing Applications
     Terminology Dictionary and Development/Management System • Terminology Entries • Domain-specific Corpus for Terminology Building • Sublanguage Analysis and Extraction of Terminology
     Quality Management System for Language Information Processing • Development/Management System for Information Base • Development of Integrated Management System for Distributed Resources

  9. Development Tools • Korean Concordance Program (KCP) • Compound Noun Browser • Corpus Browser • Corpus Browser by Category • Automatic English-to-Korean Transliteration System (TLEK) • KAIST Ontology Browser • Korean Morphological Analyser • Korean Tagger • Korean Syntactic Analyser • Editing Support Tools for Electronic Dictionary

  10. Results & Distribution • Major Results • The first (KIBS I): 1997.6. ~ present (80 sites) • Text corpus: 10 million word phrases • POS-tagged corpus: 1 million word phrases • Syntactic-structure-tagged corpus: 10 thousand sentences • TDMS, Speech DB samples, Hand-written character DB samples • The second (KIBS II): 1998.12. ~ present (140 sites) • Raw corpus: 10 million word phrases; POS-tagged corpus: 200 thousand word phrases • The third (KIBS III): 2000 (pending) • Proper nouns: 10 thousand entries; Compound nouns: 20 thousand entries; Verb sentence pattern dictionary: 3 thousand entries, ... • Plan to maintain and distribute ...

  11. Integration of Electronic Dictionaries • Dictionaries: 420K entries in total (current estimate) • Machine Readable Dictionary (Hangul Society): 200K entries • Compound Noun, Proper Noun Classification, Internal Semantic Structure: 50K entries • Searched Compound Noun, Proper Noun: open • Verb Subcategorization: 10K frames (K-J comparison) • Thesaurus: Korean-Japanese-Chinese-English, 150K entries (quality not yet good) • Usage from corpus for each sense • Functional words • Problems • Standardization of sense classification • Character code: Korean, Japanese, Chinese, … (most important problem), now being converted to Unicode

  12. Open through the web: • Corpus KWIC for Korean and Japanese • http://morph.kaist.ac.kr/kcp/ • Korean morphological analysis service • http://morph.kaist.ac.kr/ • By email: send a text file and receive its POS tagging in reply • Graphic editor/debugger for Korean morphology • Project Status • http://kibs.kaist.ac.kr/
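
The KWIC (keyword-in-context) view named on this slide can be illustrated with a minimal sketch. It is not the KCP implementation: the function names `kwic` and `format_kwic`, the `width` parameter, and the sample sentence are all assumptions made for illustration, and matching is done on whole whitespace-delimited tokens (word phrases) with no morphological analysis.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit.

    Hypothetical helper: matches whole tokens only.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def format_kwic(hits):
    """Right-align the left contexts so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in hits), default=0)
    return [f"{left:>{pad}}  [{kw}]  {right}" for left, kw, right in hits]


# Toy input; a real concordancer would read a corpus file instead.
text = "the corpus is tagged and the corpus is distributed to each site"
tokens = text.split()

for line in format_kwic(kwic(tokens, "corpus", width=2)):
    print(line)
```

Because Korean text is already spaced into word phrases, the same token-level matching applies to it unchanged; matching on stems rather than surface forms would need the output of a morphological analyser.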

  13. KORTERM • Korea Terminology Center for Language and Knowledge Engineering • http://korterm.org/ (English) • http://korterm.or.kr/ (Korean) • http://eafterm.org/ (East Asian Terminology)

  14. Goals of KORTERM • Promote distribution, publication and application in language and knowledge engineering through world-wide terminology collection and its standardization and harmonization in the local society • Achieve high-quality, highly reliable terminology, with its infrastructure and systems, through education and consultation on terminology R&D methodology for each subject field • Center of Terminology and Knowledge Engineering

  15. Phases and Subjects of KORTERM
     Phase 4 (2008 - ) Maintenance and Extension
     Phase 3 (2004-2007) Operation • Continuous Extension and Management • Terminology Study Promotion • Distribution of Terminology Information Base • Continuous Terminology Extension and Management • Multi-lingual Terminology Integration • Terminology Collection (Humanity and Social Science) • Maintenance and Extension • Large-Scale Knowledge Base for Terminology • Terminology Education Curriculum Development • Application Product Development
     Phase 2 (2001-2003) Value-Added Working System • Value-Added Terminology Integration • Terminology Collection (Extended S&T) • Extension & Maintenance (Industry Standards) • High-Quality Terminology • Application in Language Industry • Verification for High-Reliability and Distribution
     Phase 1 (1998-2000) R&D Environment and Basic Data Collection • Integration of Working Terminology • Terminology Collection (Basic S&T, Industry Standard, Economics) • Electronic Terminology (Publication) • R&D Environment (System Standardization) • Terminology Theory and Education Infrastructure

  16. R & D (1) • Basic Data (Corpus) • Corpus for Each Subject Domain • Electronic Dictionary for Basic Vocabulary • Everyday Vocabulary consists of General Vocabulary and Everyday Terminology • Internationalization of Korean Language • South-North Korean Terminology Standardization, Korean language Input Methods • Korean Language Engineering • Standardized Term Use for Information Retrieval, Machine Translation and Document Classification

  17. R & D (2) • Language Engineering • Information Retrieval: • Effective Internet Information Creation and Information/Knowledge Acquisition • Multi-lingualism • Machine Translation: • Efficient Information Generation through Terminology and Vocabulary Collection and Standardization • Wordprocessor: • High Productivity by Spelling Correction, Summarization and Efficient Use.

  18. R & D (3) • Language, Information and Terminology • Language Education: • Technical Thinking and Technical Communication • Terminology-based Education • Language Study: • Domain-specific Language Study

  19. Terminology Sponsors • Support from Government, Organization and Industry according to each specialty • Ministry of Culture and Tourism (KORTERM Center Operation) • Ministry of Science and Technology (R&D Fund) • Ministry of Information and Telecommunication (R&D Fund) • Ministry of Diplomacy and Trade • Ministry of Industry and Resource • Ministry of Education • Korea Science and Technology Foundation (Event Support)

  20. Task Configuration (diagram) • R&D • Industry • Living • Communication Use • Terminology Information Environment • Application • Application-Specific Dictionary • Language Education Adaptable to Student • Language Education Environment • Language & Knowledge Product • Terminology Symbolization • Grid Size Controller • Terminology Access Standard Channel • R&D Environment • International Term Standard • Terminology Standard • Terminological Conceptual Space • Standardization & Harmonization • Terminology Base (Collection) • Non-standards

  21. Large-Scale Speech/Language/Image DB Construction and Evaluation • Supported by Ministry of Science and Technology • Two-Year Project (1999.10-2001.10)

  22. Goals
     Final Goal: Speech/Language/Image Evaluation Standardization Organization • Working Group Organization • Survey and Planning
     Specification Standardization • Language: IR Test Suite and Evaluation Model Recommendation; MT Test Suite and Evaluation Model Recommendation • Speech: Sentence-unit Speech DB; Prosody for Speech Synthesis • Image: Image Attribute Format; Color-Lexical Entry; MPEG7 Specification
     Test Suite • Language: IR/QA 90 queries / 200K documents; MT 5,000 sentences • Speech: word-unit telephone speech DB, 100 token * 500 • Image: 300 kinds of images with Meta Data

  23. Question-Answering IR Test Suites • Test Suites for IR/QA • Documents • 207,067 records (370MB) • Newspapers • Query Generation • 90 queries (selected by analyzing 300 quiz queries) • Queries for WH-questions and other various types of answers • for NLP problem solving • Relevant document set built to include the answer • using four commercial IR systems with 16 methods
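
One plausible reading of the last two bullets is pooling: collect the top-ranked documents from every system/method run, then keep those that contain the answer. The slide does not say this is how the set was built, so the sketch below is an assumption throughout, including the function names `pool_candidates` and `relevant_set`, the `depth` cutoff, the plain substring test, and the toy documents and runs.

```python
def pool_candidates(runs, depth=10):
    """Union of the top-`depth` document IDs over all ranked runs.

    `runs` maps a (system, method) label to a ranked list of document IDs.
    """
    pooled = set()
    for ranking in runs.values():
        pooled.update(ranking[:depth])
    return pooled


def relevant_set(pooled, documents, answer):
    """Keep the pooled documents whose text contains the answer string."""
    return {doc_id for doc_id in pooled if answer in documents[doc_id]}


# Toy collection and runs (hypothetical data, not from the test suite).
documents = {
    "d1": "The institute is located in Daejeon.",
    "d2": "The corpus contains newspaper text.",
    "d3": "Daejeon hosts several research centers.",
    "d4": "The capital city is Seoul.",
}
runs = {
    ("system_a", "method_1"): ["d1", "d2", "d4"],
    ("system_b", "method_1"): ["d3", "d1", "d2"],
}

pooled = pool_candidates(runs, depth=2)
print(sorted(relevant_set(pooled, documents, "Daejeon")))
```

With four systems and 16 methods the `runs` dictionary would simply hold more entries; a human check of the pooled documents would still be needed, since a substring match cannot tell whether a document actually answers the question.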

  24. English-Korean MT Test Suites • Type Classification: About 300 Kinds • Test Sentences and Test Queries: 5,000 Records • Extracted from textbooks and grammar books (1999-2000) • To be extracted from real usage such as the web and newspapers (2000-2001) • Evaluation by Yes/No Questions • Tested on 4 commercial English-Korean MT systems
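
Evaluation by yes/no questions lends itself to a simple tally: each test sentence belongs to a type, an evaluator answers yes or no for each system's output, and the share of "yes" answers per system and type gives a score. The sketch below assumes that reading; the function name `pass_rates`, the record layout, and the system and type labels are hypothetical, not taken from the test suite.

```python
from collections import defaultdict


def pass_rates(judgments):
    """Share of 'yes' judgments per (system, type).

    `judgments` is an iterable of (system, type, passed) triples,
    where `passed` is True for a 'yes' answer.
    """
    totals = defaultdict(lambda: [0, 0])  # (system, type) -> [yes count, total]
    for system, ptype, passed in judgments:
        cell = totals[(system, ptype)]
        cell[0] += int(passed)
        cell[1] += 1
    return {key: yes / total for key, (yes, total) in totals.items()}


# Toy judgments (hypothetical labels and answers).
judgments = [
    ("mt_a", "relative_clause", True),
    ("mt_a", "relative_clause", False),
    ("mt_a", "passive", True),
    ("mt_b", "relative_clause", True),
]

for (system, ptype), rate in sorted(pass_rates(judgments).items()):
    print(f"{system}  {ptype}: {rate:.2f}")
```

Keeping the scores per type rather than as one overall figure shows which of the roughly 300 types a given system handles poorly.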

  25. MT Evaluation Workbench

  26. Image Meta Data Editor • Meta data Input Workbench • by XML
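
As a rough illustration of metadata entered as XML and then used for retrieval, the sketch below builds one record and searches a list of records by keyword. The element names (`image`, `title`, `colors`, `color`, `keywords`, `keyword`) are invented for this example; they are not the project's format and not MPEG-7, and the helper names and sample values are likewise assumptions.

```python
import xml.etree.ElementTree as ET


def build_image_record(image_id, title, colors, keywords):
    """Build one metadata record as an XML element (hypothetical schema)."""
    record = ET.Element("image", attrib={"id": image_id})
    ET.SubElement(record, "title").text = title
    colors_el = ET.SubElement(record, "colors")
    for name in colors:
        ET.SubElement(colors_el, "color").text = name
    keywords_el = ET.SubElement(record, "keywords")
    for word in keywords:
        ET.SubElement(keywords_el, "keyword").text = word
    return record


def find_by_keyword(records, keyword):
    """Return the IDs of records that carry the given keyword."""
    return [
        rec.get("id")
        for rec in records
        if any(k.text == keyword for k in rec.iter("keyword"))
    ]


records = [
    build_image_record("img001", "Palace gate", ["red", "green"], ["palace", "gate"]),
    build_image_record("img002", "Mountain trail", ["green"], ["mountain"]),
]

print(ET.tostring(records[0], encoding="unicode"))
print(find_by_keyword(records, "palace"))
```

Storing color names as their own elements would let a retrieval tool filter on them in the same way as on keywords.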

  27. Image Retrieval by Meta data

  28. http://korterm.kaist.ac.kr/ksurimal/
