Search and Decoding in Speech Recognition: Words and Transducers
Introduction • From Ch. 1 on regular expressions we saw how easy it is to search for the plural of woodchuck (woodchucks). • However, searching for the plural of fox, fish, peccary, or wild goose is not as trivial as just tacking on an s. • fox \'fäks\, noun, plural fox·es also fox; often attributive. Etymology: Middle English, from Old English; akin to Old High German fuhs "fox" and perhaps to Sanskrit puccha "tail". • fish \'fish\, noun, plural fish or fish·es; often attributive. Etymology: Middle English, from Old English fisc; akin to Old High German fisc "fish", Latin piscis. • pec·ca·ry \'pe-k&-rE\, noun, plural -ries. Etymology: of Cariban origin; akin to Suriname Carib paki:ra "peccary". Any of several largely nocturnal gregarious American mammals resembling the related pigs: (a) a grizzled animal (Tayassu tajacu) with an indistinct white collar; (b) a blackish animal (Tayassu pecari) with a whitish mouth region. • goose \'güs\, noun, plural geese \'gEs\. Etymology: Middle English gos, from Old English gOs; akin to Old High German gans "goose", Latin anser, Greek chEn.
Introduction • Required knowledge to correctly search for singulars and plurals in English: • Orthographic rules: words ending in –y are pluralized by changing the –y to –i and adding –es. • Morphological rules: tell us that fish has a null plural and that the plural of goose is formed by changing the vowel. • Morphological parsing: recognizing that a word (like foxes) breaks down into component morphemes (fox and –es) and building a structured representation of it. • Parsing means taking an input and producing some sort of linguistic structure for it. • Parsing can be thought of, in broad terms, as producing structures based on morphology, syntax, semantics, or discourse, and producing a string, a tree, or a network.
Introduction • Morphological parsing (or stemming) applies to many affixes other than plurals. • Example: parsing any English verb ending in –ing (e.g., going, talking, congratulating) into its verbal stem plus the –ing morpheme. • going ⇨ VERB-go + GERUND-ing • Morphological parsing is important for speech and language processing: • Part-of-speech tagging • Dictionaries (spell-checking) • Machine translation
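As a rough illustration of this parse (a minimal sketch, not the chapter's algorithm), one can undo the plain concatenation of –ing with a regular expression; the VERB-…/GERUND-… output format follows the example above:

```python
import re

def parse_gerund(word):
    """Toy parse of a verb ending in -ing into stem + GERUND marker.

    This naive rule only undoes plain concatenation (go + ing -> going);
    it does not handle spelling changes such as consonant doubling
    (running -> run) or e-deletion (merging -> merge).
    """
    m = re.match(r"^([a-z]+?)ing$", word)
    if m:
        return f"VERB-{m.group(1)} + GERUND-ing"
    return word  # not a gerund by this crude test

print(parse_gerund("going"))           # VERB-go + GERUND-ing
print(parse_gerund("talking"))         # VERB-talk + GERUND-ing
print(parse_gerund("congratulating"))  # VERB-congratulat + GERUND-ing (shows the limits)
```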
Introduction • To solve the morphological parsing problem one could just store all the plural forms of English nouns and –ing forms of English verbs in a dictionary, as in English speech recognition tasks. • For many natural language processing applications this is not possible because –ing is a productive suffix: that is, it applies to every verb. • Similarly, –s applies to almost every noun. • Productive suffixes apply to new words: • Example: fax and faxing • New words (e.g., acronyms and proper nouns) are created constantly, and we need to add the plural morpheme –s to each. • The plural form of new nouns depends on the spelling/pronunciation of the singular form (e.g., for nouns ending in –z, the plural is formed by adding –es). • In other languages (e.g., Turkish) one cannot list all the morphological variants of every word: • Turkish verbs have 40,000 possible forms, not counting derivational suffixes.
Outline • Survey of morphological knowledge for English and some other languages • Introduction of the finite-state transducer as the key algorithm for morphological parsing. • Finite-state transducers are key algorithms for speech and language processing. • Related algorithms: • Stemming: mapping from a word to its root or stem. Important for information retrieval tasks. • We need to know whether two words have a similar root despite their surface differences. • Example: sang and sung. The word sing is called the common lemma of these words, and mapping from all of them to sing is called lemmatization (see the toy sketch below).
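A toy illustration of the difference between stemming and lemmatization, assuming a tiny hand-made exception table (not a real lexicon) for irregular forms like sang/sung:

```python
# Toy lemmatizer: irregular forms are looked up, otherwise a crude
# suffix-stripping "stemmer" is used.  The tiny exception table and
# suffix list here are illustrative assumptions, not a real lexicon.
IRREGULAR = {"sang": "sing", "sung": "sing", "geese": "goose", "caught": "catch"}

def lemmatize(word):
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

print(lemmatize("sang"))  # sing
print(lemmatize("sung"))  # sing
print(lemmatize("cats"))  # cat
```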
Outline • Tokenization or word segmentation – an algorithm related to morphological parsing, defined as the task of separating out (tokenizing) words from running text. • English text separates words by white space, but: • New York, rock ‘n’ roll – are considered single words • I’m – is considered two words, “I” and “am” • For many applications we need to know how similar two words are orthographically. • Morphological parsing is one method for computing similarity; • another is comparison of strings of letters via the minimum edit distance algorithm (sketched below).
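As a preview, a standard dynamic-programming implementation of minimum edit distance looks roughly like this (unit costs are assumed here; the textbook also discusses a variant where substitution costs 2):

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum edit distance by dynamic programming.

    Costs are parameters; unit costs give classic Levenshtein distance.
    """
    n, m = len(source), len(target)
    # dist[i][j] = cost of transforming source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * del_cost
    for j in range(1, m + 1):
        dist[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                      # deletion
                dist[i][j - 1] + ins_cost,                      # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost)  # substitution or copy
            )
    return dist[n][m]

print(min_edit_distance("intention", "execution"))  # 5 with unit costs
```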
Survey of English Morphology • Morphology is the study of the way words are built up from smaller meaning-bearing units, called morphemes. • A morpheme is often defined as the minimal meaning-bearing unit in a language. • mor·pheme \'mor-fEm\, noun. Etymology: French morphème, from Greek morphē "form": a distinctive collocation of phonemes (such as the free form pin or the bound form -s of pins) having no smaller meaningful parts.
Survey of English Morphology • Example: • fox consists of a single morpheme: fox. • cats consists of two morphemes: cat and –s. • Two broad classes of morphemes: • Stems – the main morpheme of a word, and • Affixes – add additional meaning to the word. • Prefixes – precede the stem: unbuckle • Suffixes – follow the stem: eats • Infixes – are inserted inside the stem: the infix -um- in Tagalog humingi (from the stem hingi) • Circumfixes – precede and follow the stem: ge- … -t in gesagt (German past participle of sagen)
Survey of English Morphology • A word can have more than one affix: • rewrites: • Prefix – re • Stem – write • Suffix – s • unbelievably: • Prefix – un • Stem – believe • Suffixes – able, ly • English does not tend to stack more than four or five affixes. • Turkish can have words with nine or ten affixes – languages like Turkish are called agglutinative languages.
Survey of English Morphology • There are many ways to combine morphemes to create a word. Four methods are common and play an important role in speech and language processing: • Inflection • Combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem and usually filling some syntactic function like agreement. • Example: • –s: plural of nouns • –ed: past tense of verbs
Survey of English Morphology • Derivation • Combination of a word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict. • Example: • computerize – verb • computerization – noun • Compounding • Combination of multiple word stems together. • Example: • doghouse – dog + house • Cliticization • Combination of a word stem with a clitic. A clitic is a morpheme that acts syntactically like a word but is reduced in form and attached (phonologically and sometimes orthographically) to another word. • Example: • I’ve = I + ’ve = I + have
Inflectional Morphology • English has a relatively simple inflectional system; only • Nouns • Verbs • Adjectives (sometimes) can be inflected. • The number of possible inflectional affixes is quite small. • Nouns (English): • Plural • Possessive • Many (but not all) nouns can either • appear in the bare stem or singular form, or • take a plural suffix.
Inflectional Morphology • The regular plural is spelled: • –s • –es after words ending in • –s (ibis/ibises) • –z (waltz/waltzes) • –sh (thrush/thrushes) • –ch (finch/finches) • –x (box/boxes), sometimes • Nouns ending in –y preceded by a consonant change the –y to –i (butterfly/butterflies). • The possessive suffix is realized by apostrophe + –s for • regular singular nouns (llama’s), and • plural nouns not ending in –s (children’s), and often by a • lone apostrophe after • regular plural nouns (llamas’), and some • names ending in –s or –z (Euripides’ Comedies).
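A minimal sketch of these regular plural spelling rules as code (covering only the regular cases above; irregular nouns such as goose/geese would need a lexicon):

```python
def pluralize(noun):
    """Regular English plural spelling, covering only the rules listed above."""
    # -y preceded by a consonant -> -ies (butterfly -> butterflies)
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    # -es after sibilant-like endings (ibis -> ibises, finch -> finches, ...)
    if noun.endswith(("s", "z", "sh", "ch", "x")):
        return noun + "es"
    return noun + "s"

for w in ["cat", "ibis", "waltz", "thrush", "finch", "box", "butterfly"]:
    print(w, "->", pluralize(w))
```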
Inflectional Morphology • English inflection of verbs is more complicated than nominal inflection. • English has three kinds of verbs: • Main verbs (eat, sleep, impeach) • Modal verbs (can, will, should) • Primary verbs (be, have, do) • We are concerned with main and primary verbs because these have inflectional endings. • Of these verbs, a large class are regular (all verbs in this class have the same endings marking the same functions).
Regular Verbs • Regular verbs have four morphological forms. • For regular verbs we can predict the other forms by adding one of three predictable endings and making some regular spelling changes.
Regular Verbs • Since regular verbs • cover the majority of verbs and forms, and • the regular class is productive, they are significant in the morphology of English. • A productive class is one that automatically includes any new words that enter the language.
Irregular Verbs • Irregular verbs are those that have some more or less idiosyncratic forms of inflection. • English irregular verbs • often have five different forms, but can have • as many as eight (e.g., the verb be), or • as few as three (e.g., cut or hit). • They constitute a smaller class of verbs, estimated to be about 250.
Usage of Morphological Forms for Irregular Verbs • The –s form: • Used in the “habitual present” to distinguish the third-person singular ending, “She jogs every Tuesday”, from the other choices of person and number, “I/you/we/they jog every Tuesday”. • The stem form: • Used in the infinitive form, and also after certain other verbs: “I’d rather walk home, I want to walk home”. • The –ing participle is used in the progressive construction to mark a present or ongoing activity, “It is raining”, or when the verb is treated as a noun (this particular kind of nominal use of a verb is called gerund use: “Fishing is fine if you live near water”). • The –ed participle is used in the perfect construction, “He’s eaten lunch already”, or the passive construction, “The verdict was overturned yesterday”.
Spelling Changes • A number of regular spelling changes occur at morpheme boundaries. • Example: • A single consonant letter is doubled before adding the –ing and –ed suffixes: beg/begging/begged • If the final letter is “c”, the doubling is spelled “ck”: picnic/picnicking/picnicked • If the base ends in a silent –e, it is deleted before adding –ing and –ed: merge/merging/merged • Just as for nouns, the –s ending is spelled • –es after verb stems ending in –s (toss/tosses) • –z (waltz/waltzes) • –sh (wash/washes) • –ch (catch/catches) • –x (tax/taxes), sometimes. • Also like nouns, verbs ending in –y preceded by a consonant change the –y to –i (try/tries). • (A code sketch of the –ing/–ed rules follows.)
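A minimal sketch of the –ing/–ed spelling changes above (deliberately naive: the consonant-doubling test ignores stress, so it over-applies to verbs like visit):

```python
VOWELS = "aeiou"

def add_suffix(stem, suffix):
    """Apply the spelling changes above when adding -ing or -ed.

    Covers only the three rules listed on this slide; the final
    consonant-vowel-consonant test for doubling ignores stress.
    """
    if stem.endswith("c"):                                  # picnic -> picnicking
        return stem + "k" + suffix
    if stem.endswith("e") and suffix in ("ing", "ed"):      # merge -> merging
        return stem[:-1] + suffix
    if (len(stem) >= 3 and stem[-1] not in VOWELS
            and stem[-2] in VOWELS and stem[-3] not in VOWELS):
        return stem + stem[-1] + suffix                     # beg -> begging
    return stem + suffix

for stem in ["beg", "picnic", "merge", "walk"]:
    print(stem, "->", add_suffix(stem, "ing"), "/", add_suffix(stem, "ed"))
```

The –es and –y/–i rules for verb –s forms mirror the noun pluralization sketch shown earlier.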
Derivational Morphology • Derivation is the combination of a word stem with a grammatical morpheme, • usually resulting in a word of a different class, • often with a meaning hard to predict exactly. • English inflection is relatively simple compared to other languages. • Derivation in English is quite complex.
Derivational Morphology • A common kind of derivation in English is the formation of new nouns • from verbs, or • adjectives, called nominalization. • Example: • The suffix –ation produces nouns from verbs, often ones ending in the suffix –ize (computerize → computerization).
Derivational Morphology • Adjectives can also be derived from nouns and verbs.
Complexity of Derivation in English Language • There are a number of reasons for the complexity of derivation in English: • It is generally less productive: • A nominalizing suffix like –ation, which can be added to almost any verb ending in –ize, cannot be added to absolutely every verb. • Example: we can’t say *eatation or *spellation (* marks words that are not well-formed in English). • There are subtle and complex meaning differences among nominalizing suffixes. • Example: sincerity vs. sincereness
Cliticization • A clitic is a unit whose status lies in between that of an affix and a word. • Phonological behavior: • Short • Unaccented • Syntactic behavior: • Acts like a word, for example as a: • Pronoun, • Article, • Conjunction, • Verb
Cliticization • Ambiguity • She’s → she is or she has
Agreement • In English, plural is marked on both nouns and verbs. • Consequently the subject noun and the main verb have to agree in number: • both must be either singular or plural.
Finite-State Morphological Parsing • The goal of morphological parsing is to take input forms like those in the first column of the following table and produce output forms like those in the second column (e.g., cats mapped to cat +N +PL).
Finite-State Morphological Parsing • The second/fourth column of the table on the previous slide contains the stem of each word as well as assorted morphological features. These features provide additional information about the stem. • Example: • +N – the word is a noun • +SG – the word is singular • Some of the input forms may be ambiguous: • caught • goose • For now we will consider the goal of morphological parsing as merely listing all possible parses. The task of disambiguating among morphological parses will be discussed in Chapter 5.
Requirements of a Morphological Parser • Lexicon: the list of stems and affixes, together with basic information about them. • Example: whether a stem is a noun or a verb, etc. • Morphotactics: the model of morpheme ordering that explains which classes of morphemes can follow other classes of morphemes inside a word. • Example: the English plural morpheme follows the noun and does not precede it. • Orthographic rules: spelling rules that are used to model the changes that occur in a word, usually when two morphemes combine. • Example: the y → ie spelling rule, as in city + -s → cities.
Requirements of a Morphological Parser • In the next section we will present: • A representation of a simple lexicon for the sub-problem of morphological recognition • FSAs built to model morphotactic knowledge • The finite-state transducer (FST), introduced as a way of modeling morphological features in the lexicon
Lexicon • A lexicon is a repository for words. • The simplest possible lexicon would consist of an explicit list of every word of the language. • Every word means including • abbreviations: AAA • proper names: Jane or Beijing, etc. • Example: • a, AAA, AA, Aachen, aardvark, aardwolf, aba, abaca, aback, … • For the various reasons discussed above, it is in general inconvenient or even impossible to list every word in a language. • Computational lexicons are usually structured with • a list of each of the stems and affixes of the language, and • a representation of the morphotactics. • One of the most common ways to model morphotactics is with the finite-state automaton.
Example of FSA for English nominal inflection • This FSA assumes that the lexicon includes • regular nouns (reg-noun) that take the • regular –s plural: cat, dog, aardvark • (ignoring for now that the plural of words like fox has an inserted e: foxes), and • irregular noun forms that don’t take –s, both • singular (irreg-sg-noun): goose, mouse, sheep • plural (irreg-pl-noun): geese, mice, sheep
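A minimal encoding of this nominal-inflection FSA as code; the state names and the tiny sample sub-lexicons are illustrative assumptions, not the textbook's exact figure:

```python
# Word classes (tiny sample sub-lexicons, assumed for illustration).
REG_NOUN   = {"cat", "dog", "aardvark", "fox"}
IRREG_SG   = {"goose", "mouse", "sheep"}
IRREG_PL   = {"geese", "mice", "sheep"}
PLURAL_SUF = {"s"}

# Morphotactic FSA over those classes:
#   q0 -reg-noun-> q1 -plural(-s)-> q2,  q0 -irreg-sg/irreg-pl-> q2;
# q1 and q2 accept (a bare regular noun is also a word).
def accepts(morphemes):
    state = "q0"
    for m in morphemes:
        if state == "q0" and m in REG_NOUN:
            state = "q1"
        elif state == "q0" and (m in IRREG_SG or m in IRREG_PL):
            state = "q2"
        elif state == "q1" and m in PLURAL_SUF:
            state = "q2"
        else:
            return False
    return state in {"q1", "q2"}

print(accepts(["cat", "s"]))    # True
print(accepts(["goose"]))       # True
print(accepts(["geese", "s"]))  # False
```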
Example for English verbal inflection • This lexicon has three stem classes: • reg-verb-stem • irreg-verb-stem • irreg-past-verb-form • and four affix classes: • –ed: past • –ed: participle • –ing: participle • –s: third person singular
Examples of Verb Inflections
English Derivational Morphology • As discussed earlier in this chapter, English derivational morphology is significantly more complex than English inflectional morphology. • FSAs that model derivational morphology thus tend to be quite complex. • Some models of English derivation are based on the more complex context-free grammars.
Morphotactics of English Adjectives • Example of a simple case of derivation, from Antworth (1990): • big, bigger, biggest • happy, happier, happiest, happily • unhappy, unhappier, unhappiest, unhappily • clear, clearer, clearest, clearly, unclear, unclearly • cool, cooler, coolest, coolly • red, redder, reddest • real, unreal, really
Problem Issues • While the FSA on the previous slide will • recognize all the adjectives in the table presented earlier, • it will also recognize ungrammatical forms like • unbig, unfast, oranger, or smally. • adj-root would have to distinguish adjectives that • can occur with un– and –ly: clear, happy, and real • from those that cannot: big, small, etc. • This simple example gives an idea of the complexity to be expected from English derivation.
Derivational Morphology Example 2 • This FSA models a number of derivational facts: • the generalization that any verb ending in –ize can be followed by the nominalizing suffix –ation • –al or –able → –ity or –ness • See Exercise 3.1 to discover some of the individual exceptions to many of these constraints. • Example: • fossil → fossilize → fossilization • equal → equalize → equalization • formal → formalize → formalization • realize → realizable → realization • natural → naturalness • casual → casualness • FSA model for another fragment of English derivational morphology
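The –ize → –ation generalization from the examples above can be written as a one-line spelling rule (purely illustrative, and it over-generates in exactly the way Exercise 3.1 asks about):

```python
def nominalize_ize(verb):
    """Map an -ize verb to its -ization noun (fossilize -> fossilization).

    Illustrative only: the rule over-generates and knows nothing about
    which -ize verbs actually take -ation.
    """
    if verb.endswith("ize"):
        return verb[:-1] + "ation"   # drop the final 'e', add 'ation'
    raise ValueError("expected a verb ending in -ize")

for v in ["fossilize", "equalize", "formalize"]:
    print(v, "->", nominalize_ize(v))
```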
Solving the Problem of Morphological Recognition • Using the FSAs above, one could solve the problem of morphological recognition: • given an input string of letters, does it constitute a legitimate English word or not? • We take the morphotactic FSAs and plug each “sub-lexicon” into the FSA, • expanding each arc (e.g., the reg-noun-stem arc) with all the morphemes in that class. • The resulting FSA can then be defined at the level of the individual letter.
Solving the Problem of Morphological Recognition • The noun-recognition FSA is produced by expanding the nominal inflection FSA of the previous slide with sample regular and irregular nouns for each class. • We can use the figure below to recognize strings like aardvarks by simply starting at the initial state and comparing the input letter by letter with each word on each outgoing arc, and so on, just as we saw in Ch. 2.
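One way to sketch this expansion in code is to compile the legal stem(+s) spellings into a letter-level trie and walk the input letter by letter; the sample lexicon and the trie representation are assumptions made for illustration:

```python
# Expand the nominal-inflection FSA into a letter-level recognizer by
# compiling legal stem(+s) sequences into a trie.
REG_NOUNS = ["cat", "dog", "aardvark"]
IRREG_SG  = ["goose", "sheep", "mouse"]
IRREG_PL  = ["geese", "sheep", "mice"]

END = object()  # marker for "a word may end here"

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

# Legal surface forms under the FSA: reg-noun, reg-noun + s, irreg-sg, irreg-pl.
LEXICON = build_trie(REG_NOUNS + [w + "s" for w in REG_NOUNS] + IRREG_SG + IRREG_PL)

def recognize(word):
    """Follow the input letter by letter through the compiled automaton."""
    node = LEXICON
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

print(recognize("aardvarks"))  # True
print(recognize("geese"))      # True
print(recognize("gooses"))     # False
```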
Finite-State Transducers (FST) • An FSA can represent the morphotactic structure of a lexicon and thus can be used for word recognition. • In this section we introduce the finite-state transducer and show how it can be applied to morphological parsing. • A transducer maps between one representation and another; • a finite-state transducer or FST is a type of finite automaton which maps between two sets of symbols. • An FST can be visualized as a two-tape automaton which recognizes or generates pairs of strings. • Intuitively, we can do this by labeling each arc in the FSM (finite-state machine) with two symbol strings, one from each tape. • In the figure in the next slide an FST is depicted where each arc is labeled by an input and an output string, separated by a colon.
A Finite-State Transducer (FST) • An FST has a more general function than an FSA: • where an FSA defines a formal language by defining a set of strings, an FST defines a relation between sets of strings. • An FST can be thought of as a machine that reads one string and generates another.
A Finite-State Transducer (FST) • FST as recognizer: • a transducer that takes a pair of strings as input and outputs • accept if the string pair is in the string-pair language, and • reject if it is not. • FST as generator: • a machine that outputs pairs of strings of the language; its output is • a yes or no, and • a pair of output strings. • FST as translator: • a machine that reads a string and outputs another string. • FST as set relater: • a machine that computes relations between sets.
A Finite-State Transducer (FST) • All four categories of FST on the previous slide have applications in speech and language processing. • For morphological parsing (and for many other NLP applications) we • apply the FST translator metaphor: • Input: a string of letters • Output: a string of morphemes
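A toy FST in the translator sense, reading surface letters and writing a lexical parse; the machine below handles only cat/cats, and its states, arcs, and +N/+SG/+PL tags are assumptions chosen to keep the sketch runnable:

```python
# A toy FST used as a translator: it reads surface letters and writes a
# lexical parse.  Only "cat"/"cats" are covered by this tiny machine.
TRANSITIONS = {
    (0, "c"): ("c", 1),
    (1, "a"): ("a", 2),
    (2, "t"): ("t", 3),
    (3, "s"): (" +N +PL", 4),
}
# Output emitted when the input ends in a given final state.
FINAL_OUTPUT = {3: " +N +SG", 4: ""}

def transduce(word):
    state, output = 0, []
    for ch in word:
        if (state, ch) not in TRANSITIONS:
            return None                 # reject
        out, state = TRANSITIONS[(state, ch)]
        output.append(out)
    if state not in FINAL_OUTPUT:
        return None
    output.append(FINAL_OUTPUT[state])
    return "".join(output)

print(transduce("cats"))  # cat +N +PL
print(transduce("cat"))   # cat +N +SG
print(transduce("dog"))   # None (not in this toy machine)
```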
Formal Definition of FST
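The definition itself appears only as a figure on this slide; a standard formal definition, roughly following Jurafsky and Martin, is: an FST is a tuple (Q, Σ, Δ, q0, F, δ, σ) where • Q is a finite set of states • Σ is a finite input alphabet • Δ is a finite output alphabet • q0 ∈ Q is the start state • F ⊆ Q is the set of final states • δ(q, w) is the transition function: given a state q ∈ Q and an input string w, it returns a set of new states • σ(q, w) is the output function: given a state q and an input string w, it returns a set of output strings over Δ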
Properties of FST • FSAs are isomorphic to regular languages ⇔ FSTs are isomorphic to regular relations. • FSTs and regular relations are closed under union. • In general, FSTs are not closed under difference, complementation, and intersection. • In addition to union, FSTs have two closure properties that turn out to be extremely useful: • Inversion: the inversion of a transducer T, written T⁻¹, simply switches the input and output labels. Thus if T maps from the input alphabet I to the output alphabet O, T⁻¹ maps from O to I. • Composition: if T1 is a transducer from I1 to O1 and T2 a transducer from O1 to O2, then T1∘T2 maps from I1 to O2.
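A toy illustration of inversion and composition, representing tiny finite relations as Python sets of string pairs (an implementational convenience; real FST toolkits such as OpenFst operate on compiled automata instead):

```python
# Tiny string relations represented as sets of (input, output) pairs.
T1 = {("cat", "chat"), ("dog", "chien")}     # English -> French (toy)
T2 = {("chat", "Katze"), ("chien", "Hund")}  # French -> German (toy)

def invert(T):
    """Inversion: swap input and output labels."""
    return {(o, i) for (i, o) in T}

def compose(T1, T2):
    """Composition: relate x to z whenever T1 maps x->y and T2 maps y->z."""
    return {(x, z) for (x, y1) in T1 for (y2, z) in T2 if y1 == y2}

print(invert(T1))       # {('chat', 'cat'), ('chien', 'dog')}
print(compose(T1, T2))  # {('cat', 'Katze'), ('dog', 'Hund')}
```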