Building a transducer: Xerox Finite State Toolkit

Meeting the scientific responsibilities of documentation efforts: Lessons from NSF-DEL projects
Jonathan D Amith, Gettysburg College (LSA 2011)

Documentation of a morphologically complex language: Oapan Nahuatl
Documentation of a morphologically isolating language: Yoloxóchitl Mixtec

Oapan Nahuatl
• Complex morphology (approximately 10 possible affixes on a given word)
• Surface forms are often very distinct from underlying representations: {no+ikxi+pah+pa:ka+s+keh} > noxí:pa:seh
• Highly productive noun incorporation
• Rich, productive derivational morphology (applicatives, causatives, anticausatives, nominalization, adjectivization, verbalization)
• Documentation strategy
  • Shallow orthography
  • Necessity of limiting lexicon headwords (e.g., to terms in the corpus)
  • Dictionary subentries reflect the nontransparent meaning of affixed stems (directionals, reduplications, nonreferential objects)
  • Extensive cross-referencing of roots in a separate field of dictionary entries is necessary
  • Reliance on a transducer (or parser) for parsing/glossing of corpus texts

Yoloxóchitl Mixtec
• Simple, isolating morphology (maximum 3 or 4 clitics/affixes on any given word)
• Surface forms vary little from underlying representations (some enclitic-induced vowel harmonization, palatalization, and labialization)
• Compound words are rare, while phrasal collocations are prevalent
• Opaque and little-studied derivational morphology:
  (1) tonal variation (tio1ko4 ‘ant’; ña1ña4tio4ko4 ‘ant eater’; i3ta2tio14ko3 ‘ant flower’; ta1xi3 ‘to be fired [from a job]’; ta3xi3 ‘to fire [from a job]’)
  (2) consonant alternation (kwa3chi3 ‘to fall to pieces: potential stem’; cha4chi3 ‘to fall to pieces: habitual stem’)
• Documentation strategy
  • Deep orthography
  • Lexicon is finite, naturally limited in number of headwords
  • Dictionary subentries reflect the nontransparent meaning of collocations and phrases
  • Cross-referencing of headwords (not roots) is marked by XML tagging of one element (non-headword) in collocation subentries
  • Regular expression searches are sufficient to generate concordances

Minimal derivational and inflectional morphology (segmental and tonal)
Segmental:
• Derivational: basically limited to prefixing for causative [sa4-] and inchoative [ku3-, ndu3-]
• Inflectional: completive (ni1-), iterative (nda3-), verbal arguments and nominal possessors (pronominal enclitics), intensifier clitics (=ndi42, =lu3)
Tonal:
• Derivational: adjectivization/attribution (often 1st mora high tone [V4]); intransitive/transitive (V1 ~ V3)
• Inflectional: habitual verbs (change 1st mora tone to high [V4]), completive verbs (ni1 ~ V[stem]1/add low tone to first mora), negation (first mora of potential stem has rising 14 tone), 1st person (add low tone to final 4 or 3 [31 → 2])

1. Building a transducer: Xerox Finite State Toolkit
Offers the ability to parse (up) and generate (down).

2. Generation of paradigms and forms not in the corpus: Advantages
Overgeneration (e.g., uninhibited intervocalic ‘k’ deletion) produces unacceptable forms that may reveal errors in the grammar (e.g., insufficient encoding of conditioning elements and rule ordering) or incompleteness of the lexicon (lack of specificity, e.g., stems that do not accept certain inflections).
Rule-governed paradigm generation produces documents that can first be checked for accuracy by native speakers (leading to corrections in the grammar and lexicon) and then regenerated to produce paradigms for archiving.

• Phonological changes (palatalization, labialization, vowel harmonization) are generally simply expressed, optional or variable, and always the result of encliticization.
• No tone sandhi or floating tones.
• Tonal elision occurs only on final tones: ku3xi2 ← ku3xi3 + 2 (written ku3xi(3)=2).
• Palatalization is simply expressed:
  • Before an enclitic with initial /a/ or /o/: Cii=[ao] → Cyaa/Cyoo; Ci’i=[ao] → Cya’a/Cyo’o; and CVCi=[ao] → CVCya/CVCyo
• Similarly, labialization is also simply expressed:
  • Before an enclitic with initial /a/ or /e/: Coo/Cuu=[ae] → Cwaa/Cwee; Co’o/Cu’u=[ae] → Cwa’a/Cwe’e; and CVCo/CVCu=[ae] → CVCw[ae]
• Vowel harmony, apart from the processes associated with palatalization and labialization (e.g., so1ko1=e1 → so1kwe1 ‘its shoulder’), occurs with =eT (inanimate pronominal enclitic) after a stem ending in /a/: xa1a1=e1 → xe1e1 ‘it arrives there’

High level of affixing [reduplication, directionals] is presented in subentries, whereas word compounds [e.g., N+V, N+N] are represented in distinct entries.
High level of collocations found in concordances (panel to right) and tagged for cross-referencing in subentries (e.g., <vmix/> as in <mix>i3in3</mix> <vmix>i3ni2</vmix>)

<lx>i3in3</lx>
<lx_cita>i3in3</lx_cita>
<uid>0001</uid>
<gloss>one</gloss>
<pos>Adj-quant</pos>
<sen>one (the number, in counting)</sen>
<fr_mGroup>
<ex_m>Ta1 lu3u3 i4ndu'3u4 ja4si4ki34 ndi4 ni1-ki3xa3a4=ra2 ka'4bi3=ra2: "i3in3, u1bi1, u1ni1..."</ex_m>
<ex_e>The child who was playing started to count: "one, two, three ..."</ex_e>
<sen>(with intensifier : <mix>i3in3=ni42</mix>) [lit., ‘one only’] only one (emphasizing that there isn’t and shouldn’t be anything else)</sen>
<ex_m>I3in3=ni42=a2 nda3ta3'bi4=a2 ji1ni4=un4..., ka4chi2=na1 ji4'in4=ra3</ex_m>
<ex_e>Just one thing (emphasizing that there should be no other distractions) you should be thinking about, they told him</ex_e>
<col><mix>i3in3=ni42</mix><vmix>i3ni2</vmix> | [lit., ‘one only heart’] very confident, very assured (e.g., to be awaiting something that one is sure will happen, although its occurrence does not depend on one’s actions)</col>
<ex_m>Xi14ni2=o4 ndi4... yo4o4 ndi4 i3in3=ni42 i3ni2 yo4o4 ba1xi3=o4 tan3</ex_m>
<ex_e>Who knows (what will happen) ..., we came here quite sure of it and …,</ex_e>
<root>i3in3</root>

<uid>01706</uid>
<lx>ki:sa</lx>
<pos>V1</pos>
<infl>class-3a</infl>
<sen>to emerge, come into view; to come up, to rise (sun)</sen>
<ex_n>Xe ki:sa to:nahli. Ma tihchiaka:n.</ex_n>
<ex_e>The sun still hasn't come up. Let's wait for it.</ex_e>
<sen>to go away (an ill effect or sth harmful)</sen>
<ex_n>Xtlatskwepo:nalti mokone:w, ma ki:sa i:tlatsiwis.</ex_n>
<ex_e>Whip your child hard so his laziness leaves him.</ex_e>
<sen>(with intraverse directional prefix : <nawa>wa:lki:sa</nawa>) to emerge; to come out of (an enclosed space)</sen>
<ex_n>O:wa:lki:s, koxtoya i:cha:n.</ex_n>
<ex_e>He emerged, he was sleeping inside his house.</ex_e>
<sen>(with directional affix : <nawa>o:ki:sako</nawa>) to pass from one side to another (e.g., a nail through a wall)</sen>
<ex_n>Yo:ki:sako. Sahkó:n ma noka:wa!</ex_n>
<ex_e>It's come through to this side (e.g., a nail hammered through a board or wall). Just leave it like that!</ex_e>
<sen>with monomoraic reduplicant and coda /h/ : <nawa>kíki:sa</nawa> | to roam, to move from place to place (an animal that does not stay when left to graze)</sen>
<ex_n>Kimima:ka:wan un bwe:yestih kintekipano:ltiah de o:n xweka yawih, de o:n xkíki:sah.</ex_n>
<ex_e>They let the oxen they work with loose, [only] the ones that do not go far away, the ones that do not roam.</ex_e>
<sem>motion</sem>
<root>ki:sa</root>

The relative costs and benefits of a deep versus shallow orthography for Yoloxóchitl Mixtec

Deep orthography
Advantages:
1. Maintains a constant orthographic representation for stems, independent of the effects of enclitics.
2. Facilitates dictionary lookup of stems, which are transparent to the reader.
3. Facilitates generation of concordances through regular expression searches; obviates the need for a parser or transducer.
Disadvantages:
1. May prove too abstract for native speaker education.
2. To produce the surface forms, a person needs a fairly sophisticated level of linguistic knowledge, a level that may be absent from younger generations.
3. The selection of the underlying form, particularly for some clitics, may be somewhat arbitrary.
4. Speaker (sociolinguistic) variability will not be represented in the orthography.
5. Analysis of the relative frequency of forms that are the result of variable or optional rules is not possible.

Shallow orthography
Advantages:
1. Best in terms of readability and production (writing) for native speakers taught to read and write.
2. Captures phonological variation among speakers, particularly in terms of processes that are optional or variable (e.g., palatalization, labialization, vowel harmonization), and thus facilitates sociolinguistic analysis.
3. No knowledge of underlying roots is necessary (e.g., to represent elicited tones).
Disadvantages:
1. Dictionary lookup is complicated by the fact that basic forms (lemmas) are often not immediately apparent.
2. Concordances can be produced only through the use of a parser that identifies/parses out stems from surface forms where they are not immediately obvious.
3. Relationships among words (e.g., shared roots) may be obscure.

Text Morphological Annotation: Advantages
• Words that do not parse can be tagged as:
  1) Misspelled (leading to correction of the transcription)
  2) Morphologically unanalyzable (leading to correction of the morphological grammar or word classification in the lexicon)
  3) Absent from the lexicon (leading to enrichment of the dictionary with new lemmas)

3. Parsed texts with automated dictionary lookup
1. Facilitates reading by language learners.
2. Generates archival documents: relevant dictionary entries are archived along with the parsed and glossed text.

Conclusions

Orthography: Orthographies should be goal oriented, designed with particular objectives or audiences in mind.
1. Thus flexibility is key to a successful choice in a documentation project. The best strategy is to develop an orthography that can be easily transformed into systems that target different potential goals (analysis, revitalization, education).
2. The relative advantages of a shallow orthography (reading) versus a deep orthography (analysis and lookup) should be weighed against various factors:
  2a. The variation between underlying and surface forms and therefore the relative abstractness of a deep orthography.
  2b. The resources required and available to facilitate processing (parsing or generation) of a practical orthography.

Regular expressions are easily employed with a deep orthography to generate concordances:
\b[Ii]3in\(?3\)?(=\w+?)?(=\w+?)? [\w\(\)-=']+? [\w\(\)-=']+?(?=[ ,.\r])
• \b[Ii]3in\(?3\)? : target word
• (=\w+?)?(=\w+?)? : optional clitics (2)
• [\w\(\)-=']+? [\w\(\)-=']+?(?=[ ,.\r]) : following two words, up to a comma, period, or end-of-line
The list may be grouped by unique matches, with the number of occurrences given before the 3-word sequence. This facilitates discovery of set phrases/collocations with non-transparent meaning.

Lexicography: Thoroughness is a goal, but it is only superficially related to “number of entries”, a figure that depends as much on the organizational decisions of the lexicographer as on the completeness of the research. Other considerations include:
1. The semantic richness of the dictionary, superficially related to the ratio of subentries to entries.
2. The degree to which key diagnostic features (e.g., “nouns” that are predicate/attributive-only) have been identified, researched, and included in dictionary entries.
3. The intricacy and denseness of the “word net” that links entries (by semantic field, diagnostics, PoS, etymology).

Corpus: Whereas size matters (e.g., a target of 100 hours), this is simply a baseline for other considerations:
1. Language death/endangerment is uneven in its effects: a corpus should be oriented toward endangered genres of discourse and threatened domains of cultural knowledge. It should be ethnographically informed.
2. The corpus transcription should be linked to sound (time-coded) and searchable (i.e., either parsed or transparent).
3. Elicitation is part of corpus formation, and elicitation sessions should be recorded and archived with proper metadata.

Grammar: Whereas at one level documentation may be considered as providing primary resources for future descriptive work (e.g., once the language has died), accurate documentation depends on a synergy with concurrent descriptive work.
1. For accurate documentation, a clear phonology and morphology need to be developed and clearly documented.
2. Whereas grammar should be corpus based, elicitation should complement the corpus and be recorded and archived.
3. In the best of cases the morphological grammar should be written as an executable program, thus allowing the linguist to test the analysis of morphology against a corpus.
4. If possible the electronic grammar should be developed to be as extensible as possible and to function both to parse and generate.
5. Elements of grammatical analysis (e.g., unaccusative vs. unergative verbs) should be encoded in dictionary entries (either PoS or diagnostics).

Extensibility of electronic grammar
An electronic grammar can be written in such a way as to facilitate extension to variants of a target language. For example, instead of simply writing the conditional ending as -sia (as it is in Oapan Nahuatl verb morphology), write it as -skia with an obligatory k-deletion rule k → ø / s ___ i. The rule can be commented out for Nahuatl variants where the conditional is always -skia. Note that in Oapan k → ø / s ___ e is, however, optional, leading to alternative irrealis plural forms -skeh and -seh.
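The extensibility idea above can be sketched in a few lines of Python. This is not xfst syntax and not the grammar used in the project; it is a minimal illustration using ordinary regular expressions, and the names `Rule`, `surface_forms`, `OAPAN`, and `SKIA_VARIANT` are hypothetical. It applies the two k-deletion rules to the endings only (-skia, -skeh), treats the rule before /e/ as optional, and models "commenting out" a rule with an `enabled` flag.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rule:
    """One rewrite rule; all field names are illustrative assumptions."""
    name: str
    pattern: str            # regex for the segment to rewrite, including context
    replacement: str
    optional: bool = False  # optional rules also keep the unchanged form
    enabled: bool = True    # False plays the role of commenting the rule out


def surface_forms(underlying, rules):
    """Apply the rules in order and return every resulting surface form."""
    forms = {underlying}
    for rule in rules:
        if not rule.enabled:
            continue
        rewritten = set()
        for form in forms:
            rewritten.add(re.sub(rule.pattern, rule.replacement, form))
            if rule.optional:
                rewritten.add(form)
        forms = rewritten
    return sorted(forms)


# k -> 0 / s ___ i (obligatory in Oapan) and k -> 0 / s ___ e (optional in Oapan)
K_BEFORE_I = Rule("k-deletion before i", r"(?<=s)k(?=i)", "")
K_BEFORE_E = Rule("k-deletion before e", r"(?<=s)k(?=e)", "", optional=True)

OAPAN = [K_BEFORE_I, K_BEFORE_E]
# Hypothetical variant in which the conditional is always -skia: only the
# first rule is switched off; the second is left untouched for illustration.
SKIA_VARIANT = [replace(K_BEFORE_I, enabled=False), K_BEFORE_E]

print(surface_forms("skia", OAPAN))         # ['sia']
print(surface_forms("skeh", OAPAN))         # ['seh', 'skeh']
print(surface_forms("skia", SKIA_VARIANT))  # ['skia']
```

Keeping the underlying ending as -skia and moving the variation into a switchable rule means the lexicon stays the same across variants; only the rule list changes.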
