With 6,500 languages in the world, we must explore new ways to learn, document, and share our linguistic knowledge. John J. Kovarik NSA/CSS Senior Language Technology Authority.
Unlocking and Sharing LCTL Linguistic Knowledge • Keywords: CFG parsing, language generation, computational linguistics • CALICO ’05, University of Michigan, Ann Arbor, MI, May 17-20, 2005
The Challenges of Learning and Sharing Knowledge of an LCTL in the 21st Century John J. Kovarik National Security Agency
Presentation Overview • General LCTL Challenges • Challenges of Learning Mongolian • Recipe for New Approach • Khalkha Mongolian Parts of Speech • Mongolian Morphological Affixes • Method of Lexical Knowledge Representation • Analyze, Parse, Build Grammar Model, Test • Iterate Repeatedly
LCTL Learning Challenges • Fewer Learned Resources to Learn from • Less Recognition Nationally • Fewer Opportunities to Document What’s Learned • Very Few Students to Learn from You • Almost All Learning Done Manually • Few Reliable 21st Century Applications • Microsoft IME • Font
Mongolian Learning Challenges • Input Method Editor (IME) • Microsoft IME • Keyboard arranged for native Mongols • American Mongolists prefer phonetic keyboard • “a” key on Mongolian keyboard mapped to ASCII “a” etc. • Fonts commonly used on Internet • Russian Cyrillic fonts are commonly used • “|” and “0” commonly substituted for “ү” and “ө” • “у” and “о” often freely extended to “ү” and “ө”
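The font problem above means that a word found on the Internet may spell “ү” and “ө” with plain “у” and “о”. One way a tool might cope is to expand each word into its candidate spellings and keep the ones a lexicon recognizes. The following is a minimal Python sketch under that assumption; the function names and the tiny lexicon are hypothetical and are not taken from the author's Perl software.

```python
from itertools import product

# Assumed mapping, based on the slide: plain Cyrillic "у" and "о" may stand in
# for the Mongolian-specific letters "ү" and "ө".
AMBIGUOUS = {"у": ("у", "ү"), "о": ("о", "ө")}


def spelling_variants(word):
    """Return every spelling obtained by optionally restoring ү/ө."""
    choices = [AMBIGUOUS.get(ch, (ch,)) for ch in word]
    return ["".join(combo) for combo in product(*choices)]


def restore(word, lexicon):
    """Return the variants of `word` that are present in `lexicon`."""
    return [variant for variant in spelling_variants(word) if variant in lexicon]
```

The number of variants doubles with every ambiguous letter, so a practical tool would likely prune candidates, for example with vowel harmony; that refinement is not shown here.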
Recipe for a New Approach • Take a student with a computational linguistics background • Infuse with curiosity and energy • Stir in access to the Internet • Add Mongolian syntax and morphology • Create morphological analyzer, context free parser, and grammatical generator for Mongolian • Resulting lexicons, software, and grammar models can be used by other linguistically adept students
Khalkha Mongolian Parts of Speech • Declinable Nouns • Declinable Adjectives • Inflected Verbs • Unchanging Adverbs • Declinable Converbs • Unchanging Postpositions • Unchanging Conjunctions • Unchanging Particles
Mongol Morphological Affixes • 27 verbal suffixes denoting tense and mood • 2 verb infixes denoting verb manner • Consultative • Passive • 6 verb paradigms or verb types • 3 irregular common verbs • 6 cases in singular and plural number • Both nouns and adjectives are declined
Lexical Knowledge Representations • Unchanging adverbs, conjunctions, particles, etc. and irregular verb forms (unchanging.txt file) • Lemmas of declinable nouns and adjectives (declinables.txt file) • Inflected verbs and nominalized verbs (regvb.txt file) • Affix files (casendings.txt, reflex.txt, infixes.txt, vbforms.txt)
Some Examples
• declinables.txt file
  N нэр
  Q хэн
• regverb.txt file
  V ир
  V өс
• Affix files
  • casendings.txt
    g ний
    d д
    a ыг
    b оос
  • reflex.txt
    аа
    ээ
    оо
  • infixes.txt
    C лц
    R лд
    P гд
  • vbforms.txt
    ipf нө
    i1p в
    i3p чээ
    Ypf охгүй
• unchanging.txt file
  Pg->талаар
  Pc->холбогдуулан
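The sample entries above suggest two simple record shapes: a tag followed by a form, and a tag joined to a word with "->". A minimal Python sketch of reading both into lookup tables follows. The one-record-per-line layout is an assumption, since the slide does not show the files' actual line structure, and the function names are hypothetical.

```python
def parse_pairs(text):
    """Parse 'TAG form' lines into a {form: tag} dictionary."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            tag, form = parts
            table[form] = tag
    return table


def parse_arrows(text):
    """Parse 'TAG->word' lines into a {word: tag} dictionary."""
    table = {}
    for line in text.splitlines():
        tag, sep, word = line.strip().partition("->")
        if sep and tag and word:
            table[word] = tag
    return table


# Sample data copied from the slide.
declinables = parse_pairs("N нэр\nQ хэн")
case_endings = parse_pairs("g ний\nd д\na ыг\nb оос")
unchanging = parse_arrows("Pg->талаар\nPc->холбогдуулан")
```

Keying each table by surface form makes the later lookup of a stripped lemma or affix a single dictionary access.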
Merge Morphology Knowledge with the Power of the Computer • Wrote yalgah.pl to serve as a tireless lexical pedagogue • Searches each word for identifiable affixes by comparison with the lexical knowledge affix files • Matches the resulting lemma against the lexical knowledge files of declinables, verbs, and unchanging words • Depending on whether the lemma can be matched, outputs: • Matched: word/part of speech tag to the standard output file, plus an entry in the expository lexicon • Unmatched: lemma to the Out Of Vocabulary (oov) file, noting the affixes found
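The lookup described above can be illustrated with a short Python sketch. It is not yalgah.pl: it strips at most one suffix, uses only the sample entries shown on the slides, and assumes that a tag is formed by joining the part of speech letter to the affix code (for example N + d = Nd). All names are hypothetical.

```python
# Sample lexical knowledge copied from the slides.
DECLINABLES = {"нэр": "N", "хэн": "Q"}
VERBS = {"ир": "V", "өс": "V"}
UNCHANGING = {"талаар": "Pg", "холбогдуулан": "Pc"}
CASE_ENDINGS = {"ний": "g", "д": "d", "ыг": "a", "оос": "b"}
VERB_FORMS = {"нө": "ipf", "в": "i1p", "чээ": "i3p", "охгүй": "Ypf"}

AFFIX_TABLES = (
    ("case ending", CASE_ENDINGS, DECLINABLES),
    ("verb form", VERB_FORMS, VERBS),
)


def analyze(word):
    """Return (tag, notes); tag is None when the word is out of vocabulary."""
    # Whole-word matches come first: unchanging words and bare stems.
    for table in (UNCHANGING, DECLINABLES, VERBS):
        if word in table:
            return table[word], []
    notes = []
    for label, affixes, stems in AFFIX_TABLES:
        for ending, code in affixes.items():
            if word.endswith(ending) and len(word) > len(ending):
                stem = word[: -len(ending)]
                if stem in stems:
                    # Assumed tag scheme: POS letter followed by affix code.
                    return stems[stem] + code, []
                notes.append(f"possible {label} <{code}>-<{ending}>")
    return None, notes


def tag_text(words):
    """Split words into tagged output and an out-of-vocabulary list."""
    tagged, oov = [], []
    for word in words:
        tag, notes = analyze(word)
        if tag is not None:
            tagged.append(f"{word}/{tag}")
        else:
            oov.append((word, notes))
    return tagged, oov
```

Calling `tag_text(["нэрд", "өснө", "талаар", "номд"])` tags the first three words and sends "номд" to the out-of-vocabulary list with a note that a case ending was recognized but the stem was not. Stacked affixes, such as a verb form followed by a case ending and a reflexive ending, would need repeated stripping, which this sketch omits.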
Additional Outputs
• Expository Morphology File (named morphlex.txt)
  IR->verb command imperative 2nd person singular
  IREEREY->converb future perfect continuative
  IREG->verb command concessive 3rd person singular/plural
  BAGA->adjective
  HURAL->noun nominative
  IH->adjective
  AJILDAA->reflexive noun dative-locative
  ORLOO->verb indicative second past
• Out Of Vocabulary File (named oov)
  [C = : = > 5 = 0 E 0 0 A 0 0 ] (UNKNOWNAHAASAA) WORD 0 LINE 2 FALLS OUTSIDE OF VOCABULARY
  possible reflexive ending <0 0 >-<AA>
  possible declinable case ending <b>-<0 0 A >-<AAS>
  possible verbal part of speech <Ypf >-<0 E >-<AH>
  possible participial/converbal stem <C = : = > 5 = >--<UNKNOWN>
Feed Analytic Output to Parser
• Developed context-free grammar (CFG) rules for both discourse and newspaper texts
  S->Sbj Prd
  S->Prd
  Sbj->Nn
  Sbj->NP
  NP->Tg Nn
  NP->Tg Ng Nn
  Prd->J
• Wrote parse.pl to validate CFG rules against input text tagged by part of speech
• When a sentence can be fully parsed, outputs a parse tree and an English gloss
  Working on "BAGA HURAL IH AJILDAA ORLOO ."
  ENGLISH GLOSS: large hural great work began .
  The sentence does parse.
  Branch nodes on tree:
  S -> (Sbj Prd)
  Sbj -> (NP)
  NP -> (J Nn)
  Prd -> (NPd Vi2p)
  NPd -> (J Nd)
  POS: J Nn J Nd Vi2p
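To make the parsing step concrete, here is a minimal Python sketch of a top-down, backtracking recognizer over part of speech tags. It is not parse.pl: the rule set is limited to the branch nodes shown for the example sentence plus two of the sample rules, and the function names are hypothetical. The approach assumes the grammar has no left-recursive rules, which would make this naive strategy loop.

```python
# Rules taken from the example parse tree and sample rule list on the slide.
GRAMMAR = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["NP"], ["Nn"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}


def expand(symbol, tags, pos):
    """Yield (tree, next_pos) for each way `symbol` covers tags[pos:next_pos]."""
    if symbol not in GRAMMAR:                 # terminal: a part of speech tag
        if pos < len(tags) and tags[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Extend partial matches one right-hand-side symbol at a time.
        partial = [([], pos)]
        for part in rhs:
            partial = [
                (children + [tree], nxt)
                for children, at in partial
                for tree, nxt in expand(part, tags, at)
            ]
        for children, end in partial:
            yield (symbol, children), end


def parse(tags, start="S"):
    """Return the first tree covering all tags, or None if none exists."""
    for tree, end in expand(start, tags, 0):
        if end == len(tags):
            return tree
    return None
```

For the tag sequence of the example sentence, `parse("J Nn J Nd Vi2p".split())` returns a nested tuple whose branch nodes match those listed on the slide; a sequence the rules cannot cover returns `None`, which signals that a new rule is needed.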
Feed Output to Generator • Wrote gramgen.pl to generate sentences based on lexical knowledge, morphological knowledge, and syntactic knowledge gained • Output routinely reviewed for accuracy and Chomskian explanatory adequacy of the grammar models created for the parser and generator engines
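Generation can be sketched as the reverse of parsing: expand the start symbol with randomly chosen rules, then pick a word for each part of speech tag. The Python sketch below is not gramgen.pl; it reuses the rules from the example parse tree and the transliterated words from the expository morphology file, and the function name is hypothetical.

```python
import random

# Rules from the example parse tree and sample rule list on the slides.
GRAMMAR = {
    "S": [["Sbj", "Prd"], ["Prd"]],
    "Sbj": [["NP"], ["Nn"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}

# Words grouped by part of speech tag, taken from the example sentence.
LEXICON = {
    "J": ["BAGA", "IH"],
    "Nn": ["HURAL"],
    "Nd": ["AJILDAA"],
    "Vi2p": ["ORLOO"],
}


def generate(symbol="S", rng=random):
    """Expand `symbol` into a list of words by random rule choice."""
    if symbol in LEXICON:                     # part of speech tag: pick a word
        return [rng.choice(LEXICON[symbol])]
    words = []
    for part in rng.choice(GRAMMAR[symbol]):  # nonterminal: pick one rule
        words.extend(generate(part, rng))
    return words
```

Passing a seeded `random.Random` instance as `rng` makes a batch of generated sentences reproducible, which helps when the same hundred examples are reviewed before and after a rule change.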
Iterative Process • First, take a new newspaper article or dialogue and run the morphological analyzer on it until all words are within vocabulary (no output in the oov [Out Of Vocabulary] file) • Run the output through the parser, creating new CFG rules until the new text parses • Run the generator for a hundred or more examples to ensure adequacy of the new rules
Morpho-analyzer, Parser, Generator Software Led This Student to Deeper Understanding of Mongolian • A linguistically adept learner can thus write software that helps one learn more deeply and more quickly • Language tool development is thus grounded in gaining and applying language knowledge in a systematic and linguistically principled manner for oneself and others
Contact Information John Kovarik • Email: kovarik@afterlife.ncsc.mil • Home Page: http://www.worldnet.att/~kovariks • Phone: 443-479-7188