Search and Decoding in Speech Recognition Words and Transducers
Outline Veton Këpuska
Introduction Veton Këpuska
Introduction • From Ch. 1 (regular expressions): we saw how easy it is to search for the plural of woodchuck (woodchucks). • However, searching for the plural of fox, fish, peccary, or wild goose is not as trivial as just tacking on an s. • Main Entry: fox; Pronunciation: 'fäks; Function: noun; Inflected Form(s): plural fox·es also fox; Usage: often attributive; Etymology: Middle English, from Old English; akin to Old High German fuhs fox and perhaps to Sanskrit puccha tail • Main Entry: fish; Pronunciation: 'fish; Function: noun; Inflected Form(s): plural fish or fish·es; Usage: often attributive; Etymology: Middle English, from Old English fisc; akin to Old High German fisc fish, Latin piscis • Main Entry: pec·ca·ry; Pronunciation: 'pe-k&-rE; Function: noun; Inflected Form(s): plural -ries; Etymology: of Cariban origin; akin to Suriname Carib paki:ra peccary; Definition: any of several largely nocturnal gregarious American mammals resembling the related pigs: as a: a grizzled animal (Tayassu tajacu) with an indistinct white collar b: a blackish animal (Tayassu pecari) with a whitish mouth region • Main Entry: goose; Pronunciation: 'güs; Function: noun; Inflected Form(s): plural geese /'gEs/; Etymology: Middle English gos, from Old English gOs; akin to Old High German gans goose, Latin anser, Greek chEn
Introduction • Knowledge required to correctly search for singulars and plurals in English: • Orthographic rules: words ending in -y are pluralized by changing the -y to -i and adding -es. • Morphological rules: tell us that fish has a null plural and that the plural of goose is formed by changing the vowel. • Morphological parsing: recognizing that a word (like foxes) breaks down into component morphemes (fox and -es) and building a structured representation of it. • Parsing means taking an input and producing some sort of linguistic structure for it. • The term parsing is used broadly here, for producing any such structure.
Introduction • Morphological parsing (or stemming) applies to many affixes other than plurals. • Example: parsing any English verb ending in -ing (e.g., going, talking, congratulating) into its verbal stem plus the -ing morpheme. • going ⇨ VERB-go + GERUND-ing • Morphological parsing is important for speech and language processing: • Part-of-speech tagging • Dictionaries (spell-checking) • Machine translation
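The going ⇨ VERB-go + GERUND-ing example on the slide above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mini-lexicon VERB_STEMS and the function name parse_ing are hypothetical, and a real parser would use a full dictionary and a finite-state transducer rather than string slicing.

```python
# Hypothetical mini-lexicon of verb stems (an assumption for this sketch).
VERB_STEMS = {"go", "talk", "congratulate", "beg", "merge"}

def parse_ing(word):
    """Return 'VERB-stem + GERUND-ing' for an -ing form, or None if no known stem fits."""
    if not word.endswith("ing"):
        return None
    base = word[:-3]
    # Candidate stems: the bare base, and the base with a silent -e restored.
    candidates = [base, base + "e"]
    # If the base ends in a doubled letter, also try the undoubled form (begg -> beg).
    if len(base) >= 2 and base[-1] == base[-2]:
        candidates.append(base[:-1])
    for stem in candidates:
        if stem in VERB_STEMS:
            return f"VERB-{stem} + GERUND-ing"
    return None

print(parse_ing("going"))           # VERB-go + GERUND-ing
print(parse_ing("congratulating"))  # VERB-congratulate + GERUND-ing
```

Because the lookup is driven by a lexicon, a word such as sing, which merely happens to end in -ing, is not parsed as a stem plus suffix.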
Introduction • To solve the morphological parsing problem, one could just store all the plural forms of English nouns and all the -ing forms of English verbs in a dictionary, as is done, for example, in English speech recognition tasks. • For many natural language processing applications this is not possible, because -ing is a productive suffix: it applies to every verb, so handling it requires knowing the rules for adding it. • Similarly, -s applies to almost every noun. • Productive suffixes apply to new words: • Example: fax and faxing • New words (e.g., acronyms and proper nouns) are created constantly, and each needs the plural morpheme -s. • The plural form of a new noun depends on the spelling/pronunciation of the singular form (e.g., for nouns ending in -z the plural is formed by adding -es rather than -s). • In other languages (e.g., Turkish) one cannot list all the morphological variants of every word: • Turkish verbs have 40,000 possible forms, not counting derivational suffixes.
Noun • Most of us learned the classic definition of noun back in elementary school, where we were told simply that - “a noun is the name of a person, place, or thing.” • That's not a bad beginning; it even clues us in to the origin of the word, since noun is derived ultimately from the Latin word nōmen, which means ‘name’. Veton Këpuska
noun • any member of a class of words that can function as the main or only elements of subjects of verbs (A dog just barked), or of objects of verbs or prepositions (to send money from home), and that in English can take plural forms and possessive endings (Three of his buddies want to borrow John's laptop). Nouns are often described as referring to persons, places, things, states, or qualities, and the word noun is itself often used as an attributive modifier, as in noun compound; noun group. Veton Këpuska
Verb • The key word in most sentences, the word that reveals what is happening, is the verb. It can declare something: • You ran, • ask a question • Did you run?, • convey a command • Run faster!, or • express a wish • May this good weather last!, or • a possibility • If you had run well, you might have won; • if you run better tomorrow, you may win. Veton Këpuska
Verb • You cannot have a complete English sentence without at least one verb. Verb • any member of a class of words that function as the main elements of predicates, that typically express action, state, or a relation between two things, and that may be inflected for tense, aspect, voice, mood, and to show agreement with their subject or object. Veton Këpuska
The definitions of noun and verb were taken from dictionary.com Veton Këpuska
Outline • Survey of morphological knowledge for English • Introduction of the finite-state transducer as the key algorithm for morphological parsing. • Finite-state transducers are key algorithms for speech and language processing. • Related algorithms: • Stemming: mapping from a word to its root or stem. Important for Information Retrieval tasks. • Need to know whether two words have a similar root despite their surface differences. • Example: sang and sung. The word sing is called the common lemma of these words, and mapping from all of these to sing is called lemmatization.
Outline • Tokenization or Word Segmentation: an algorithm related to morphological parsing, defined as the task of separating out (tokenizing) words from running text. • English text separates words by white space, but: • “New York” and “rock ‘n’ roll” are each considered a single word • I’m is considered two words, “I” and “am” • … etc. • For many applications we need to know how similar two words are orthographically. • Morphological parsing is one method for computing similarity. • Another is comparing the two strings of letters via the minimum edit distance algorithm.
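The minimum edit distance mentioned on the slide above can be computed with a standard dynamic-programming table. The sketch below assumes unit costs for insertion and deletion; the function name min_edit_distance and the sub_cost parameter are choices made for this illustration, not something the slides specify.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum number of edit operations turning source into target.

    Insertions and deletions cost 1; a substitution costs sub_cost
    (1 gives the Levenshtein distance, 2 treats it as delete + insert).
    """
    n, m = len(source), len(target)
    # dist[i][j] = cost of converting source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(1, m + 1):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,                            # deletion
                dist[i][j - 1] + 1,                            # insertion
                dist[i - 1][j - 1] + (0 if same else sub_cost) # substitution or match
            )
    return dist[n][m]

print(min_edit_distance("sang", "sung"))  # 1
print(min_edit_distance("fox", "foxes"))  # 2
```

A small distance between sang and sung reflects their orthographic similarity, even though no affix separates them.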
Morphological Parsing • Morphological parsing, in natural language processing, is the process of determining the morphemes from which a given word is constructed. It must be able to distinguish between orthographic rules and morphological rules. For example, the word 'foxes' can be decomposed into 'fox' (the stem), and 'es' (a suffix indicating plurality). • The generally accepted approach to morphological parsing is through the use of a finite state transducer (FST), which inputs words and outputs their stem and modifiers. The FST is initially created through algorithmic parsing of some word source, such as a dictionary, complete with modifier markups. Veton Këpuska
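To make the idea of an FST that "inputs words and outputs their stem and modifiers" concrete, here is a toy sketch. It assumes a two-word lexicon and a tag format (+N +SG, +N +PL) chosen for illustration; build_fst and transduce are hypothetical names, and real systems compile much larger lexicons and spelling rules with dedicated finite-state toolkits.

```python
def build_fst(lexicon):
    """Build a tiny deterministic transducer from {stem: plural_suffix}.

    Returns (transitions, finals) where
      transitions[(state, char)] = (output_string, next_state)
      finals[state] = tag string emitted when the input ends in that state.
    """
    transitions, finals = {}, {}
    next_state = 1  # state 0 is the start state
    for stem, suffix in lexicon.items():
        state = 0
        # Stem letters are copied to the output; suffix letters emit nothing.
        for ch, out in [(c, c) for c in stem] + [(c, "") for c in suffix]:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = (out, next_state)
                next_state += 1
            state = transitions[key][1]
            if out and state not in finals and len(stem) == 0:
                pass  # unreachable for non-empty stems; kept for clarity
        # Mark the end of the stem and the end of stem+suffix as final states.
        stem_state = 0
        for ch in stem:
            stem_state = transitions[(stem_state, ch)][1]
        finals[stem_state] = " +N +SG"
        finals[state] = " +N +PL"
    return transitions, finals

def transduce(word, transitions, finals):
    """Map a surface word to 'stem +N +SG/PL', or None if it is not accepted."""
    state, output = 0, []
    for ch in word:
        if (state, ch) not in transitions:
            return None
        emitted, state = transitions[(state, ch)]
        output.append(emitted)
    if state not in finals:
        return None
    return "".join(output) + finals[state]

# Hypothetical lexicon: each stem with its regular plural suffix.
TRANS, FINALS = build_fst({"fox": "es", "cat": "s"})
print(transduce("foxes", TRANS, FINALS))  # fox +N +PL
print(transduce("cat", TRANS, FINALS))    # cat +N +SG
```

The sketch only handles stems with a non-empty plural suffix; a null plural such as fish would need a nondeterministic machine or ambiguous output.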
Survey of English Morphology Veton Këpuska
Survey of English Morphology • Morphology is the study of the way words are built up from smaller meaning-bearing units, called morphemes. • A morpheme is often defined as the minimal meaning-bearing unit in a language. • Main Entry: mor·pheme; Pronunciation: 'mor-"fEm; Function: noun; Etymology: French morphème, from Greek morphE form; Definition: a distinctive collocation of phonemes (as the free form pin or the bound form -s of pins) having no smaller meaningful parts
Survey of English Morphology • Example: • fox consists of a single morpheme: fox. • cats consists of two morphemes: cat and -s. • Two broad classes of morphemes: • Stems: the main morpheme of a word, and • Affixes: add additional meaning to the word. • Prefixes: precede the stem: unbuckle • Suffixes: follow the stem: eats • Infixes: inserted in the stem: humingi (Tagalog, a Philippine language: the infix -um- inserted in the stem hingi, “borrow”) • Circumfixes: precede and follow the stem: gesagt (German past participle of sagen)
Survey of English Morphology • A word can have more than one affix: • rewrites: • Prefix - re • Stem - write • Suffix - s • unbelievably: • Prefix - un • Stem - believe • Suffix - able, ly • English language does not tend to stack more than four or five affixes • Turkish can have words with nine or ten affixes – languages like Turkish are called agglutinative languages. Veton Këpuska
ag·glu·ti·na·tive • Pronunciation: \ə-ˈglü-tən-ˌā-tiv, -ə-tiv\ • Function: adjective • Date: 1634 • 1: adhesive 2: characterized by linguistic agglutination
Survey of English Morphology • There are many ways to combine morphemes to create a word. Four methods are common and play an important role in speech and language processing: inflection, derivation, compounding, and cliticization. • Inflection • Combination of a word stem with a grammatical morpheme, usually resulting in a word of the same class as the original stem, and usually filling some syntactic function like agreement. • Example: • -s: plural of nouns • -ed: past tense of verbs.
Survey of English Morphology • Derivation • Combination of word stem with a grammatical morpheme, usually resulting in a word of a different class, often with a meaning hard to predict. • Example: • Computerize – verb • Computerization – noun. • Compounding • Combination of multiple word stems together. • Example: • Doghouse: dog + house. • Cliticization • Combination of a word stem with a clitic. A clitic is a morpheme that acts syntactically like a word, but is reduced in form and attached (phonologically and sometimes orthographically) to another word. • Example: • I’ve = I + ‘ve = I + have Veton Këpuska
Inflectional Morphology Veton Këpuska
Inflectional Morphology • English has a relatively simple inflectional system: only nouns, verbs, and (sometimes) adjectives can be inflected. • The number of possible inflectional affixes is quite small.
Inflectional Morphology: Nouns • English nouns have only two kinds of inflection: • Plural • Possessive • Many (but not all) nouns can either • appear in the bare stem or singular form, or • take a plural suffix
Inflectional Morphology: Nouns • The regular plural is spelled: • -s • -es after words ending in • -s (ibis/ibises) • -z (waltz/waltzes) • -sh (thrush/thrushes) • -ch (finch/finches) • and sometimes -x (box/boxes) • Nouns ending in -y preceded by a consonant change the -y to -i (butterfly/butterflies). • The possessive suffix is realized by apostrophe + -s for • Regular singular nouns (llama’s), and • Plural nouns not ending in -s (children’s), and often by a • Lone apostrophe after • Regular plural nouns (llamas’), and some • Names ending in -s or -z (Euripides’ comedies).
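The spelling rules on the slide above translate almost directly into code. The sketch below is a simplification under stated assumptions: it always adds -es after -x (the slide says "sometimes"), it knows nothing about irregular nouns such as goose or fish, and the function names regular_plural and possessive are hypothetical.

```python
import re

def regular_plural(noun):
    """Spell the regular plural of a singular English noun."""
    # Consonant + y: change y to i and add -es (butterfly -> butterflies).
    if re.search(r"[^aeiou]y$", noun):
        return noun[:-1] + "ies"
    # -es after -s, -z, -sh, -ch, and (in this sketch, always) -x.
    if re.search(r"(s|z|sh|ch|x)$", noun):
        return noun + "es"
    return noun + "s"

def possessive(noun, plural=False):
    """Spell the possessive: lone apostrophe after plurals in -s, otherwise 's."""
    if plural and noun.endswith("s"):
        return noun + "'"
    return noun + "'s"

for word in ["ibis", "waltz", "thrush", "finch", "box", "butterfly", "llama"]:
    print(word, "->", regular_plural(word))
print(possessive("llama"), possessive("llamas", plural=True),
      possessive("children", plural=True))
```

Names ending in -s or -z (Euripides’) are left out because the slide marks that usage as optional rather than rule-governed.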
Inflectional Morphology: Verbs • English verbal inflection is more complicated than nominal inflection (e.g., regular and irregular verbs). • English has three kinds of verbs: • Main verbs (eat, sleep, impeach) • Modal verbs (can, will, should) • Primary verbs (be, have, do) • We are concerned with main and primary verbs because these have inflectional endings. • Of these verbs, a large class is regular (all verbs in this class have the same endings marking the same functions).
Inflectional Morphology Regular & Irregular Verbs Veton Këpuska
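Regular Verbs • Regular verbs have four morphological forms: the stem, the -s form, the -ing participle, and the past form or -ed participle (e.g., walk, walks, walking, walked). • For regular verbs, knowing the stem lets us predict the other forms by adding one of three predictable endings (-s, -ing, -ed) and making (some) regular spelling changes.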
Regular Verbs • Regular verbs are significant in the morphology of English because • they cover the majority of verbs and forms, and • the regular class is productive. • A productive class is one that automatically includes any new words that enter the language.
Irregular Verbs • Irregular verbs are those that have some more or less idiosyncratic forms of inflection. • English irregular verbs • often have five different forms, but can have • as many as eight (e.g., the verb be), or • as few as three (e.g., cut or hit). • They constitute a much smaller class of verbs, estimated at about 250.
Usage of Morphological Forms for Irregular Verbs • The -s form: • Used in the “habitual present” to distinguish the third-person singular ending (“She jogs every Tuesday”) from the other choices of person and number (“I/you/we/they jog every Tuesday”). • The stem form: • Used in the infinitive, and also after certain other verbs (“I’d rather walk home”, “I want to walk home”). • The -ing participle: • Used in the progressive construction to mark a present or ongoing activity (“It is raining”), or when the verb is treated as a noun; this particular kind of nominal use of a verb is called gerund use (“Fishing is fine if you live near water”). • The -ed participle: • Used in the perfect construction (“He’s eaten lunch already”) or the passive construction (“The verdict was overturned yesterday”).
Spelling Changes • A number of regular spelling changes occur at morpheme boundaries. • Examples: • A single consonant letter is doubled before adding the -ing and -ed suffixes: beg/begging/begged • If the final letter is “c”, the doubling is spelled “ck”: picnic/picnicking/picnicked • If the base ends in a silent -e, it is deleted before adding -ing and -ed: merge/merging/merged • Just as for nouns, the -s ending is spelled -es after verb stems ending in • -s (toss/tosses) • -z (waltz/waltzes) • -sh (wash/washes) • -ch (catch/catches) • and sometimes -x (tax/taxes). • Also like nouns, verbs ending in -y preceded by a consonant change the -y to -i (try/tries).
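The spelling changes on the slide above can be sketched as a handful of regular-expression checks. Several simplifications are assumptions of this sketch, not claims from the slides: consonant doubling is applied only to one-syllable stems of the form consonants + vowel + consonant (real doubling also depends on stress), a -y to -ied rule is added to ed_form so that try gives tried, and irregular or primary verbs such as be are out of scope. All function names are hypothetical.

```python
import re

def s_form(stem):
    """Third-person singular: jog -> jogs, toss -> tosses, try -> tries."""
    if re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    if re.search(r"(s|z|sh|ch|x)$", stem):
        return stem + "es"
    return stem + "s"

def _base_for_vowel_suffix(stem):
    """Apply the spelling change needed before -ing or -ed."""
    if re.search(r"[^aeiou]e$", stem):        # silent -e is deleted: merge -> merg
        return stem[:-1]
    if re.search(r"[aeiou]c$", stem):         # final c "doubles" as ck: picnic -> picnick
        return stem + "k"
    if re.fullmatch(r"[^aeiou]*[aeiou][^aeiouwxy]", stem):
        return stem + stem[-1]                # single final consonant doubles: beg -> begg
    return stem

def ing_form(stem):
    return _base_for_vowel_suffix(stem) + "ing"

def ed_form(stem):
    if re.search(r"[^aeiou]y$", stem):        # assumption added here: try -> tried
        return stem[:-1] + "ied"
    return _base_for_vowel_suffix(stem) + "ed"

for verb in ["walk", "beg", "picnic", "merge", "try"]:
    print(verb, s_form(verb), ing_form(verb), ed_form(verb))
```

Keeping the shared changes in one helper mirrors the slide's point that -ing and -ed trigger the same adjustments at the morpheme boundary.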
Derivational Morphology Veton Këpuska
Derivational Morphology • Derivation is the combination of a word stem with a grammatical morpheme, • usually resulting in a word of a different class, • often with a meaning that is hard to predict exactly. • English inflection is relatively simple compared to other languages. • Derivation in English, by contrast, is quite complex.
Derivational Morphology • A common kind of derivation in English is nominalization: the formation of new nouns, often from • verbs, or • adjectives. • Example: • The suffix -ation produces nouns from verbs, often those ending in the suffix -ize (computerize → computerization).
Derivational Morphology • Adjectives can also be derived from nouns and verbs Veton Këpuska
Complexity of Derivation in English Language • There are a number of reasons for the complexity of derivation in English: • It is generally less productive than inflection: • Even a nominalizing suffix like -ation, which can be added to almost any verb ending in -ize, cannot be added to absolutely every verb. • Example: we can’t say *eatation or *spellation (* marks forms that are not words of English). • There are subtle and complex meaning differences among nominalizing suffixes. • Example: sincerity vs. sincereness
Cliticization Veton Këpuska
Cliticization • clitic, noun (linguistics): a morpheme that functions like a word but does not appear as an independent word; it is always attached to a following or preceding word. In English, the possessive -'s is an example. • cliticization, noun: the process or an instance of a word becoming a clitic
Cliticization • A clitic is a unit whose status lies between that of an affix and that of a word. • Phonological behavior (like affixes): • Short • Unaccented • Syntactic behavior (like words), acting as: • Pronouns • Articles • Conjunctions • Verbs
Cliticization • Proclitics: clitics preceding a word • Enclitics: clitics following a word • Ambiguity: • She’s → she is or she has
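The ambiguity of She’s can be made explicit by having a clitic expander return every candidate reading instead of committing to one. The table CLITIC_EXPANSIONS and the function expand_clitic below are hypothetical names for a minimal sketch; it covers only three English enclitics and does not model the possessive reading of 's.

```python
# Hypothetical table of enclitics and their possible full forms.
CLITIC_EXPANSIONS = {
    "'ve": ["have"],
    "'m": ["am"],
    "'s": ["is", "has"],  # ambiguous; the possessive reading is not modeled here
}

def expand_clitic(token):
    """Return a list of (host, full_form) readings for a token.

    A token without a known clitic yields a single reading (token, None).
    """
    token = token.replace("\u2019", "'")  # normalize the curly apostrophe
    for clitic, full_forms in CLITIC_EXPANSIONS.items():
        if token.endswith(clitic) and len(token) > len(clitic):
            host = token[: -len(clitic)]
            return [(host, full) for full in full_forms]
    return [(token, None)]

print(expand_clitic("I've"))   # [('I', 'have')]
print(expand_clitic("She's"))  # [('She', 'is'), ('She', 'has')]
```

Choosing between the readings needs context from the rest of the sentence, which is why the function returns all of them.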