Using Finite State Technology in a Tool for Linguistic Exploration

Kemal Oflazer, Mehmet Erbaş, Müge Erdoğmuş. Sabancı University, Istanbul, Turkey.


Presentation Transcript


  1. Using Finite State Technology in a Tool for Linguistic Exploration • Kemal Oflazer, Mehmet Erbaş, Müge Erdoğmuş • Sabancı University, Istanbul, Turkey

  4. Background and Motivation • LingBrowser is an active and interactive tool for students of (introductory) linguistics to explore linguistic information on real text as opposed to canned examples. • A showcase for the natural language processing resources and technology (for Turkish in this case) • A testbed for the use of NLP in (native and foreign) language learning.

  5. Background and Motivation • Joint work with UC Berkeley, recently funded by US and Turkish NSF as a 3-year joint project, as a follow-up project to: • TELL – Turkish Electronic Living Lexicon (US NSF) • A Unified Electronic Lexicon Of Turkish (US and Turkish NSF)

  6. Turkish • Agglutinative morphology with many morphophonological processes • e.g., vowel harmony • Pronunciation (phoneme selection/stress position) is a function of morphological structure and function, and of lexical semantics • Lots of derivational processes • Semi/non-lexicalized collocations • Free constituent order
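
As a rough illustration of the vowel harmony mentioned on this slide, the sketch below attaches the locative suffix (written +DA in the lexical forms later in the talk) to a bare stem. It is a textbook-style simplification written for this transcript, not the authors' two-level rules: the function name `locative` is hypothetical, only lowercase stems are handled, and exceptional stems are ignored.

```python
# Toy model of two-way vowel harmony and consonant voicing assimilation
# for the Turkish locative suffix -DA (surface forms: -da, -de, -ta, -te).
# Simplified on purpose: no exceptional stems, no possessive "n" insertion.

BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("çfhkpsşt")


def locative(stem: str) -> str:
    """Return stem + locative suffix, resolving D and A from the stem."""
    vowels = [ch for ch in stem if ch in BACK_VOWELS or ch in FRONT_VOWELS]
    if not vowels:
        raise ValueError(f"stem has no vowel: {stem!r}")
    # A is realized as "a" after a back vowel and as "e" after a front vowel.
    vowel = "a" if vowels[-1] in BACK_VOWELS else "e"
    # D is realized as "t" after a voiceless consonant and as "d" otherwise.
    consonant = "t" if stem[-1] in VOICELESS else "d"
    return stem + consonant + vowel


if __name__ == "__main__":
    for stem in ["ev", "okul", "kitap", "süt"]:
        print(stem, "->", locative(stem))
```

The same stem-dependent choice is what the archiphoneme symbols D and A stand for in lexical forms such as ev+Hn+DA.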

  7. LingBrowser Functionality (Current Prototype) • Access to linguistic information in arbitrary Turkish Web content and text • Lexical • Phonological: phonemes, syllables, stress position • Morphological: lexical and surface morpheme structure, morphological features encoded • Semantic: dictionary access, WordNet access, root word translation

  8. LingBrowser Functionality (Ongoing Work and Future) • Access to linguistic information in arbitrary Turkish Web content and text • Multi-word constructs • Named-entity identification • Surface syntax: NP extraction and structure display, surface syntactic relations • Lexical translation/paraphrasing • Phrasal translation

  9. LingBrowser Prototype

  10. LingBrowser Prototype • Morphological Analysis

  11. LingBrowser Prototype • Surface Morpheme Structure

  12. LingBrowser Prototype • Lexical Morpheme Structure

  13. LingBrowser Prototype • Aligned Lexical Surface Structure

  14. LingBrowser Prototype • Pronunciation Representation (SAMPA) • Interleaved • Parallel

  15. LingBrowser Prototype • WordNet Lookups (via aligned Turkish and English WordNets) • English translations/glosses of the root word • Turkish synonyms

  16. LingBrowser Prototype • Word Concordances • Morphological Concordance • All forms with the selected root / POS combination are listed in context • one can see possible objects of a verb regardless of the inflected/derived form it appears in • Much more meaningful for languages like Turkish, Finnish, etc.
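
The idea of a morphological concordance can be sketched as follows: instead of keying contexts on the surface word, key them on the (root, POS) pair supplied by the morphological analyzer. This is a minimal sketch with hand-written analyses; the function `build_concordance`, the token tuples, and the window size are assumptions for illustration, not LingBrowser's actual data structures.

```python
from collections import defaultdict

# Each token is (surface form, root, part of speech). In a real system the
# root and POS would come from a morphological analyzer; here they are
# written by hand for a four-word toy text.
TOKENS = [
    ("kitabı", "kitap", "Noun"),
    ("okudum", "oku", "Verb"),
    ("gazete", "gazete", "Noun"),
    ("okuyor", "oku", "Verb"),
]


def build_concordance(tokens, window=1):
    """Map each (root, POS) pair to a list of (left, surface, right) contexts."""
    index = defaultdict(list)
    for i, (surface, root, pos) in enumerate(tokens):
        left = " ".join(t[0] for t in tokens[max(0, i - window):i])
        right = " ".join(t[0] for t in tokens[i + 1:i + 1 + window])
        index[(root, pos)].append((left, surface, right))
    return index


if __name__ == "__main__":
    concordance = build_concordance(TOKENS)
    for left, surface, right in concordance[("oku", "Verb")]:
        print(f"{left:>10} [{surface}] {right}")
```

Looking up ("oku", "Verb") lists both okudum and okuyor with their neighbours, which is how the objects of a verb can be seen regardless of its inflected form.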

  17. (Prototype) Implementation • LingBrowser (indirectly) employs almost all the finite state language resources we have built over the last 10 years • All built using Xerox xfst, lexc and twolc • Indirectly via a database interface

  18. Finite State Transducers Employed • A total of 750 xfst regular expressions plus 100K root words (mostly proper names), spread over about 50 files • [Diagram: stack of transducers ending at the surface form. Labelled components: Lexicon and Morphotactic Constraints Transducer, Two-Level Rule Transducer, Modified Inverse Two-Level Rule Transducer, Filters, (Duplicating) Filters, Exceptional Phonology Transducer, SAMPA Mapping Transducer, Syllabification Transducer, Stress Computation Transducer]

  19. Finite State Transducers Employed • Two-Level Morphological Analyzer: 1M states, 1.6M transitions • Surface form: evinde • Feature form: ev+Noun+A3sg+P3sg+Loc, ev+Noun+A3sg+P2sg+Loc • [Diagram: transducer stack as on slide 18]

  20. Finite State Transducers Employed • Lexical Morphemes Transducer: ~400K states, 1M transitions • Surface form: evinde • Lexical morpheme sequence: ev+sH+nDA, ev+Hn+DA • [Diagram: transducer stack as on slide 18]

  21. Finite State Transducers Employed • Surface Morphemes Transducer: ~560K states, 1.4M transitions • Surface form: evinde • Surface morpheme sequence: ev+i+nde, ev+in+de • [Diagram: transducer stack as on slide 18]

  22. Finite State Transducers Employed • Pronunciation Lexicon Transducer: ~6.5M states, 8.5M transitions • Surface form: evinde • Pronunciation: e – v i n – "d e • [Diagram: transducer stack as on slide 18]
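
The syllable boundaries in the pronunciation above (e, vin, de) follow the usual textbook rule for Turkish: every syllable has exactly one vowel, and a syllable onset takes at most one consonant. The sketch below applies that rule procedurally. It is an assumption-laden stand-in for the Syllabification Transducer named in the diagram, not a description of how that transducer is built, and it does not compute stress.

```python
VOWELS = set("aeıioöuü")


def syllabify(word: str) -> list[str]:
    """Split a lowercase Turkish word into syllables.

    Rule used: a boundary falls before the single consonant that
    immediately precedes each non-initial vowel, or directly between
    two adjacent vowels.
    """
    boundaries = []
    seen_vowel = False
    for i, ch in enumerate(word):
        if ch not in VOWELS:
            continue
        if seen_vowel:
            # Give the new syllable a one-consonant onset if one is available.
            boundaries.append(i if word[i - 1] in VOWELS else i - 1)
        seen_vowel = True

    pieces, start = [], 0
    for b in boundaries:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces


if __name__ == "__main__":
    for w in ["evinde", "kitaplar", "saat"]:
        print(w, "->", "-".join(syllabify(w)))
```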

  23. Finite State Transducers Employed • Aligned pairs transducer • Input is the surface form • Output is a representation of the aligned lexical-surface feasible pairs; e.g., for evinde we want to produce two alignments (lexical form over surface form, with 0 marking an empty surface symbol): • ev+Hn+DA over ev0in0de • ev+sH+nDA over ev00i0nde

  24. Aligned-pairs Transducer • We use a modified version of the two-level rule transducer • Feasible pair a:b is replaced with "a-b":b • A rule like a:b => LC _ RC is rewritten as "a-b":b => LC' _ RC', where the contexts are in terms of the new feasible pairs • Let’s call this the AlignedTwoLevelTransducer

  25. Aligned-pairs Transducer • A MapToPairs transducer maps each lexical symbol in the original grammar to the representations of the feasible pairs in the original grammar in which it is the lexical side • e.g., if we have A:a, A:e and A:0 as three feasible pairs with A on the lexical side, • then MapToPairs maps A to "A-a", "A-e" and "A-0"
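
The MapToPairs relation described above can be pictured as a plain lookup table from each lexical symbol to the names of the feasible pairs that have it on the lexical side. The sketch below builds such a table from a list of pairs; the function name `map_to_pairs` and the list-of-tuples input are illustrative assumptions, and the real MapToPairs is an xfst transducer rather than a dictionary.

```python
from collections import defaultdict


def map_to_pairs(feasible_pairs):
    """Group feasible pairs by lexical symbol.

    feasible_pairs: iterable of (lexical, surface) symbol pairs, with "0"
    standing for the empty surface symbol.
    Returns a dict mapping each lexical symbol to a list of "lexical-surface"
    pair names, in input order.
    """
    table = defaultdict(list)
    for lexical, surface in feasible_pairs:
        table[lexical].append(f"{lexical}-{surface}")
    return dict(table)


if __name__ == "__main__":
    pairs = [("A", "a"), ("A", "e"), ("A", "0"), ("D", "d"), ("D", "t")]
    print(map_to_pairs(pairs))
```

Each renamed symbol keeps the lexical side of its pair visible in the symbol name itself.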

  26. Aligned-pairs Transducer • [Diagram: the Lexicon Transducer, relating Feature Symbols to Lexical Symbols]

  27. Aligned-pairs Transducer • The new transducer accepts all lexical symbol sequences allowed by the morphotactic constraints. • [Diagram: extracting the lower side of the Lexicon Transducer gives Lexicon Transducer.l, which relates Lexical Symbols to Lexical Symbols]

  28. Aligned-pairs Transducer • This transducer maps lexical symbol sequences to valid possible feasible pair sequences. • [Diagram: MapToPairs combined with Lexicon Transducer.l, relating Lexical Symbols to Feasible-pair symbols]

  29. Aligned-pairs Transducer • This transducer accepts all potentially valid feasible pair sequences. • [Diagram: extracting the upper side of MapToPairs combined with Lexicon Transducer.l gives the Feasible-pair sequence Recognizer, which relates Feasible-pair symbols to Feasible-pair symbols]

  30. Aligned-pairs Transducer • This transducer maps surface forms to feasible pair sequences subject to morphographemic and morphotactic constraints. • [Diagram: the Feasible-pair sequence Recognizer combined with the AlignedTwoLevelTransducer gives the Feasible-pair sequence Transducer, which relates Feasible-pair symbols to Surface Symbols]

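  31. Aligned-pairs Transducer • For the surface form evinde, the Feasible-pair sequence Transducer (built with the AlignedTwoLevelTransducer) produces two symbol-by-symbol alignments, lexical form over surface form: • ev+Hn+DA over ev0in0de • ev+sH+nDA over ev00i0nde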
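
The effect of the construction on the preceding slides can be imitated by brute force on a toy scale: enumerate the lexical sequences the lexicon allows, expand each lexical symbol into its feasible surface realizations, and keep the expansions whose surface side spells the input. The sketch below does that for evinde. The two-entry `LEXICON`, the `FEASIBLE` table, and the function `align` are assumptions made for illustration, and no two-level rule contexts are modelled, so on a larger lexicon this would overgenerate where the real transducer would not.

```python
from itertools import product

# Toy lexicon: lexical morpheme sequences permitted by the morphotactics.
LEXICON = ["ev+Hn+DA", "ev+sH+nDA"]

# Toy feasible pairs: lexical symbol -> possible surface symbols,
# with "0" standing for the empty surface symbol.
FEASIBLE = {
    "e": ["e"],
    "v": ["v"],
    "n": ["n"],
    "+": ["0"],
    "s": ["s", "0"],
    "H": ["ı", "i", "u", "ü", "0"],
    "D": ["d", "t"],
    "A": ["a", "e"],
}


def align(surface: str):
    """Return (lexical, aligned_surface) pairs whose surface side spells `surface`."""
    results = []
    for lexical in LEXICON:
        choices = [FEASIBLE[symbol] for symbol in lexical]
        for realization in product(*choices):
            aligned = "".join(realization)
            # Dropping the 0 placeholders must give back the input word.
            if aligned.replace("0", "") == surface:
                results.append((lexical, aligned))
    return results


if __name__ == "__main__":
    for lexical, aligned in align("evinde"):
        print(lexical)
        print("|" * len(lexical))
        print(aligned)
        print()
```

Because lexical and aligned strings have the same length, position i of one lines up with position i of the other, which is what the vertical bars on the original slide depict.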

  32. Implementation • Other resources used • Turkish WordNet aligned with the English WordNet • Current prototype was implemented in 4 months as a senior project, on the MS .NET platform • Now being ported to Java

  33. Implementation • Text is annotated in the background with multiple threads • All text items are reverse indexed on relevant features (morphemes, features, syllables, phonemes, etc.) for fast search, e.g., • Find all bi-syllabic words with an open syllable ending in “a” • Find all words with bi-syllabic roots with a long root-final vowel • Find all finite verbs in future tense with 3rd person plural agreement • Find all words using the lexical morpheme “+sHz” • Find all words in which lexical “+sH” is aligned with surface “00u” • Find all words with syllables with multi-consonant codas
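
Reverse indexing of this kind can be sketched as an inverted index from feature labels to the set of words carrying them, with a multi-feature query answered by set intersection. The class `FeatureIndex`, the label strings such as "lexmorph:+sHz", and the hand-assigned features below are hypothetical; the slides do not say how LingBrowser stores or names its index keys.

```python
from collections import defaultdict


class FeatureIndex:
    """Inverted index: feature label -> set of words that have the feature."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, word, features):
        """Register a word under every feature label it carries."""
        for feature in features:
            self._postings[feature].add(word)

    def find(self, *features):
        """Return the words that carry all of the given features."""
        if not features:
            return set()
        sets = [self._postings.get(f, set()) for f in features]
        return set.intersection(*sets)


def demo_index():
    """Build a tiny index with hand-assigned (illustrative) features."""
    index = FeatureIndex()
    index.add("evsiz", ["lexmorph:+sHz", "syllables:2"])
    index.add("susuz", ["lexmorph:+sHz", "syllables:2"])
    index.add("gelecekler", ["tag:Verb", "tag:Fut", "tag:A3pl", "syllables:4"])
    return index


if __name__ == "__main__":
    index = demo_index()
    print(sorted(index.find("lexmorph:+sHz")))
    print(sorted(index.find("tag:Fut", "tag:A3pl")))
```

Intersecting posting sets keeps each query proportional to the size of the matching sets rather than to the size of the text.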

  36. Future Functionality • Lexical paraphrasing • evimizdekiler → (those things) in our house • Gets nasty when multiple derivations are present • Finlandiyalılaştıramadıklarımızdanmışsınızcasına → (behaving) as one of those who we could not convert into a Finn(ish citizen) • Tree transducers • Extensive explanatory feedback • Morphographemics (why is lexical s deleted?) • show triggering contexts in addition to the rule • Pronunciation (why is this syllable stressed?) • show exceptional stress morphemes and explain their interaction

  37. Future Functionality • Drills • Generate surface form from lexical form • Segment into surface morphemes • Identify morphosyntactic features encoded by morphemes • Generate surface form from a set of features

  38. Future Functionality • Surface syntactic relations • Example text: Eski Mısır kültüründe, çocuğa akıllı küçük denilmekteydi. Küçük yetişkin deyimi geleneksel toplumların çocuğu yetişkin yaşamına teşvik eden işleriyle kabul gördü. Ortaçağ'da ise, Avrupa'da çocuklara küçük hayvanlar denildi. Sanayileşme bu kültürel ayırımı hayata geçirerek çocuğu yetişkin yaşamından kopardı. Çocukluğu yetişkinlikten ayrı bir döneme indirgemek, çocukların geleceğe uyumlarını güçleştirecektir. Kaldı ki, bilgi toplumunda öylesi bir soyutlamanın, yani çocukluğun yetişkinlikten ayrı tutulmasının, imkansız denecek hale geldiği ise, açık bir gerçektir... • Example relation displayed: sanayi+Noun+..^DB+Verb+Become..^DB+Noun+Inf+..+Nom (Sanayileşme) is the Subject of kop+Verb^DB+Verb+Caus+Past+A3sg (kopardı)

  39. Planned Deployment • We expect to have a version to be tested in Sharon Inkelas’ Linguistics course at Berkeley, by Fall 2006.

  40. Summary • LingBrowser is an active and interactive tool for linguistic exploration on real (Turkish) text • Query • Search • See explanations • Extensive use of finite state language resources • Being extended to include additional functionality.
