Analogy in morphology: Only a beginning
John Goldsmith, The University of Chicago and CNRS/MoDyCo
Analogy in grammar: Form and acquisition, Max Planck Institute for Evolutionary Anthropology, Leipzig, September 2006
Outline of talk • Word segmentation problem • Minimum Description Length (MDL) framework • Learning morphological structure: analogy takes us only so far
[Figure: a signature represented as a finite state automaton]
Input: inprincipioerailverbo → language-independent device → Output: in principio era il verbo
Word segmentation
Work by Michael Brent and by Carl de Marcken in the mid-1990s at MIT.
A lexicon is a pair of objects (L, p_L): a set L ⊆ A*, and a probability distribution p_L, defined on A*, for which L is the support of p_L. We call the elements of L the words.
• We insist that A ⊆ L: all individual letters are words.
• We define a language as a subset of L*; its members are sentences.
• Each sentence can be uniquely associated with an utterance (an element of A*) by a mapping F:
F: L* → A*. If F(S) = U, then we say that S is a parse of U.
F: L* → A*. We pull back the measure from the space of letters to the space of words.
Different lexicons lead to different probabilities of the data. Given an utterance U, the probability of a string of letters is the probability assigned to its best parse.
Class of models originally studied in the word segmentation problem
• Our data is a finite string (“corpus”) over a finite alphabet;
• we find the best parse for the string;
• the probability of the parse is the product of the probabilities of its words;
• the words are assigned a maximum-likelihood probability of the simplest sort.
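As an illustration of this class of models, here is a minimal sketch (not Brent's or de Marcken's actual algorithm) of how a fixed lexicon with unigram word probabilities assigns a probability to an unsegmented letter string: dynamic programming finds the parse whose word-probability product is highest. The toy lexicon and the helper names (`best_parse`, etc.) are invented for illustration.

```python
import math

def best_parse(utterance, lexicon):
    """Find the most probable segmentation of `utterance` into words of `lexicon`.

    lexicon: dict mapping word -> probability (a unigram distribution).
    Returns (parse, log_probability). The probability of the whole string
    is the product of the probabilities of the words of its best parse.
    """
    n = len(utterance)
    best = [(-math.inf, [])] * (n + 1)   # best (log-prob, parse) reaching each position
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for start in range(end):
            word = utterance[start:end]
            if word in lexicon and best[start][0] > -math.inf:
                score = best[start][0] + math.log(lexicon[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[n][1], best[n][0]

# Toy lexicon: invented probabilities, just for illustration.
lexicon = {"in": 0.2, "principio": 0.1, "era": 0.2, "il": 0.3, "verbo": 0.2}
parse, logp = best_parse("inprincipioerailverbo", lexicon)
print(parse, logp)   # ['in', 'principio', 'era', 'il', 'verbo'], log-probability of the best parse
```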
Results
• The Fulton County Grand Ju ry said Friday an investi gation of At l anta 's recent prim arye lectionproduc ed no e videnc e that any ir regul ar it i e s took place .
• Thejury further s aid in term - end present ment sthatthe City Ex ecutiveCommit t e e ,which had over - all charg eofthee lection , d e serv e s the pra is e and than ksofthe City of At l antafortheman ner in whichthee lection was conduc ted.
Some chunks are too big; some chunks are too small.
From Encarta: trained on the first 150 articles La funzione originari a dell'abbigliament o fu for s e quell a di pro t egger e il corpo dalle av vers i tà del c li ma . Ne i paesi cal di uomini e donn e indoss ano gonn ell in i mor bi di e drappeggi ati e per i z om i . In generale gli abit ant i delle zon ecal d e non port ano ma i più di due stra t i di vestit i. Al contr ari o, nei luog h i dove il c li ma è più rigid o sono diffus i abiti ader enti e a più stra ti . C omun e alle due tradizion i è tuttavi a l' abitudin e di ricor re re a mantell i per ri par arsi dagli e le ment i.
3 major categories of failures of MDL word-discovery
• Many failures of word-discovery are correct discoveries of morphemes (word-pieces): investi-gation, pro-t-egger-e.
• Many (though fewer) failures of word-discovery are discoveries of pairs of words that frequently appear together (for example, ofthe).
• Many failures are too short to be likely words.
As we add more linguistic sophistication to the class of models considered, MDL makes increasingly better predictions.
Part 2: Minimum Description Length (MDL) Analysis
Jorma Rissanen (1989), Stochastic Complexity in Statistical Inquiry.
Synthetic a priori
• The mind’s construction of the world is its best understanding of what the senses provide it with.
• The real world is the one that is most probable, given our observations.
Bayesian, maximum a posteriori reasoning
Bayes’ Rule (D = Data, H = Hypothesis)
Definition: pr(A|B) = pr(A & B) / pr(B)
So pr(H|D) = pr(H & D) / pr(D) = pr(D|H) · pr(H) / pr(D)
If reality is the most probable hypothesis, given the evidence...
• we must find the hypothesis for which pr(D|H) · pr(H) is a maximum (D = Data, H = Hypothesis).
• How do we calculate the probability of our hypothesis about what reality is, pr(H)? (rationalism)
• How do we calculate the probability of our observations given our understanding of reality, pr(D|H)? (empiricism)
• How do we calculate the probability of our hypothesis about what reality is? Assign a (“prior”) probability to all hypotheses, based on their coherence. Measure the coherence; call it an evaluation metric. Kraft’s inequality: if grammars have the “prefix property” (guaranteed local punctuation), then we can assign pr(G) = 2^(−length(G)).
• How do we calculate the probability of our observations, given our understanding of reality? Insist that your grammars be probabilistic: they assign a probability to their generated output.
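For reference, the standard fact behind that last move (a background note, not on the original slide): Kraft’s inequality says that the lengths of any prefix-free binary code can be turned into a (sub-)probability distribution, which is what licenses setting pr(G) = 2^(−length(G)) for grammars written in a prefix code.

```latex
% Kraft's inequality for a prefix-free binary code over grammars G_1, G_2, \dots
\sum_i 2^{-\operatorname{length}(G_i)} \;\le\; 1,
\qquad\text{so}\qquad
\operatorname{pr}(G_i) = 2^{-\operatorname{length}(G_i)}
\ \text{is a legitimate (sub-)probability distribution over grammars.}
```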
Usage of MDL
If the description length of data D, given model M, is defined as
  DL(D, M) = −log pr(D|M) (the inverse log probability assigned to D by M) + the compressed length of M,
then the process of word-learning is unambiguously one of increasing the probability of the data, with the length of M serving as a stopping criterion.
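A minimal sketch of how that objective is used to compare models (the candidate models and their bit counts below are invented for illustration; this is not the Linguistica implementation): score each candidate by −log₂ pr(D|M) plus the encoded length of M, and keep the model with the smallest total. Minimizing this sum is the same as maximizing pr(D|M) · 2^(−length(M)), i.e. the MAP choice under the Kraft prior above.

```python
def description_length(neg_log2_prob_of_data, model_length_bits):
    """Total description length: bits to encode the data given the model,
    plus bits to encode the model itself."""
    return neg_log2_prob_of_data + model_length_bits

# Hypothetical candidates: (name, -log2 pr(D|M) in bits, compressed length of M in bits).
candidates = [
    ("no analysis",          5000.0,  200.0),
    ("some morphology",      4300.0,  600.0),
    ("too much morphology",  4250.0, 1500.0),
]

best = min(candidates, key=lambda m: description_length(m[1], m[2]))
for name, data_bits, model_bits in candidates:
    print(f"{name:22s} DL = {description_length(data_bits, model_bits):7.1f} bits")
print("MDL choice:", best[0])   # the smallest total description length wins
```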
Essence of MDL 2. MDL
Naïve MDL (3. Morphology)
Corpus: jump, jumps, jumping, laugh, laughed, laughing, sing, sang, singing, the, dog, dogs (total: 62 letters)
Analysis:
• Stems: jump, laugh, sing, sang, dog (20 letters)
• Suffixes: s, ing, ed (6 letters)
• Unanalyzed: the (3 letters)
Total: 29 letters
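Counting in the same naïve spirit can be sketched in a few lines (the helper names are invented; later slides replace raw letter counts with bit counts): the unanalyzed lexicon must spell out every word in full, whereas the analyzed lexicon spells out each stem and suffix only once.

```python
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

# Naive "description length": just count the letters that must be listed.
unanalyzed_cost = sum(len(w) for w in corpus)

stems      = ["jump", "laugh", "sing", "sang", "dog"]
suffixes   = ["s", "ing", "ed"]
unanalyzed = ["the"]
analyzed_cost = sum(len(x) for x in stems + suffixes + unanalyzed)

print("letters with no analysis:  ", unanalyzed_cost)
print("letters with the analysis: ", analyzed_cost)   # far smaller: the analysis pays for itself
```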
1st approximation: a morphology is
• a list of stems,
• a list of affixes (prefixes, suffixes), and
• a list of pointers indicating which combinations are permissible.
Unlike the word segmentation problem, we now have no obvious search heuristics. These heuristics are therefore very important, but I will not talk about them here.
Model/heuristic (3. Morphology)
Size of the model (3. Morphology)
M[orphology] = { Stems T, Affixes F, Signatures S }
The length of M sums the lengths of the stems, the affixes, and the signatures; the signatures are the extensive part of the model.
What is a signature, and what is its length?
What is a signature? 3. Morphology
What is the length (= information content) of a signature?
A signature is an ordered pair of two sets of pointers: (i) a set of pointers to stems, and (ii) a set of pointers to affixes.
The length of a pointer p is −log freq(p).
So the total length of the signatures is the sum, over all signatures σ, of the lengths of σ's stem pointers and affix pointers:
  Σ_{σ ∈ Signatures} [ Σ_{t ∈ Stems(σ)} −log freq(t) + Σ_{f ∈ Affixes(σ)} −log freq(f) ]
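Here is a minimal sketch of that computation (the data structures and frequencies are invented for illustration, not Linguistica's internals): each signature is a pair (stems, affixes), a pointer to an item costs −log₂ of that item's frequency, and the signature component's length is the sum of those pointer costs over all signatures.

```python
import math

def pointer_length(freq):
    """Length in bits of a pointer to an item used with relative frequency `freq`."""
    return -math.log2(freq)

def signatures_length(signatures, stem_freq, affix_freq):
    """Total length of the signature component: for each signature, sum the
    lengths of its stem pointers and its affix pointers."""
    total = 0.0
    for stems, affixes in signatures:
        total += sum(pointer_length(stem_freq[t]) for t in stems)
        total += sum(pointer_length(affix_freq[f]) for f in affixes)
    return total

# Invented toy figures: relative frequencies of stems and affixes in the corpus.
stem_freq  = {"jump": 0.3, "laugh": 0.3, "dog": 0.4}
affix_freq = {"NULL": 0.5, "s": 0.2, "ing": 0.2, "ed": 0.1}

signatures = [
    ({"jump", "laugh"}, {"NULL", "s", "ing", "ed"}),   # signature NULL.s.ing.ed
    ({"dog"},           {"NULL", "s"}),                # signature NULL.s
]
print(round(signatures_length(signatures, stem_freq, affix_freq), 2), "bits")
```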
Generation 1: Linguistica (http://linguistica.uchicago.edu) (3. Morphology)
Initial pass: assumes that words are composed of 1 or 2 morphemes; finds all cases where signatures exist with at least 2 stems and 2 affixes.
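A rough sketch of such an initial pass (a simplification, not the Linguistica code): split each word at every position into a candidate stem plus suffix, collect for each stem the set of suffixes it takes, group stems by identical suffix sets, and keep only the groups, i.e. candidate signatures, that have at least 2 stems and 2 affixes.

```python
from collections import defaultdict

def candidate_signatures(words, min_stems=2, min_affixes=2):
    """Group stems by the exact set of suffixes they occur with (their signature),
    keeping only signatures with enough stems and affixes to be interesting."""
    suffixes_of = defaultdict(set)
    for w in set(words):
        for cut in range(1, len(w) + 1):
            stem, suffix = w[:cut], w[cut:] or "NULL"
            # Record every possible split; implausible stems are filtered out below
            # because they never end up in a signature with enough stems and affixes.
            suffixes_of[stem].add(suffix)
    by_signature = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        by_signature[frozenset(sufs)].add(stem)
    return {sig: stems for sig, stems in by_signature.items()
            if len(stems) >= min_stems and len(sig) >= min_affixes}

words = ["jump", "jumps", "jumping", "laugh", "laughs", "laughing", "dog", "dogs"]
for sig, stems in candidate_signatures(words).items():
    print(sorted(sig), "<-", sorted(stems))   # e.g. ['NULL', 'ing', 's'] <- ['jump', 'laugh']
```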
Generation 1 (3. Morphology)
It then refines this initial approximation in a large number of ways, always trying to decrease the description length of the initial corpus.
French roots 3. Morphology
4. Detect allomorphy (3. Morphology)
Signature: <e>ion . NULL
Stems: composite, concentrate, corporate, détente, discriminate, evacuate, inflate, opposite, participate, probate, prosecute, tense
What is this? composite and composition: composition is analyzed as composit + ion, where composit is composite with its final e removed.
It infers that ion deletes a stem-final ‘e’ before attaching.
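A minimal sketch of how such an allomorphy pattern can be spotted (invented helper, not the Linguistica implementation): for stems ending in e, check whether the corpus contains the form obtained by dropping that e and adding ion; if this holds for many stems, posit a suffix ion that deletes a stem-final e, written <e>ion.

```python
def find_e_deletion_before_ion(stems, vocabulary):
    """Return the stems ending in 'e' whose e-less form + 'ion' is an attested word,
    evidence for a suffix <e>ion that deletes a stem-final 'e'."""
    hits = []
    for stem in stems:
        if stem.endswith("e") and (stem[:-1] + "ion") in vocabulary:
            hits.append((stem, stem[:-1] + "ion"))
    return hits

stems = ["composite", "concentrate", "discriminate", "evacuate", "participate", "tense"]
vocabulary = {"composition", "concentration", "discrimination",
              "evacuation", "participation", "tension"}
for stem, derived in find_e_deletion_before_ion(stems, vocabulary):
    print(f"{stem} -> {derived}   (analyzed as {stem[:-1]} + ion)")
```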
Swahili verb (4. Morphology)
Subject marker + Tense marker + Object marker + Root + Voice (active/passive) + Final vowel
Signature: reduces false positives 4. Morphology
Generalize the signature… (4. Morphology)
Sequential FSA: each state has a unique successor.
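As an illustration of this generalization, here is a small sketch of a sequential FSA for the Swahili verb template shown above (the morpheme inventories are a toy subset, invented for illustration): each state has a unique successor, and the arc to that successor offers a choice among the morphemes of one slot.

```python
# A sequential FSA: states 0 -> 1 -> 2 -> ...; each state has a unique successor,
# and the arc to that successor carries the morphemes allowed in that slot.
# Toy Swahili-like morpheme inventories, for illustration only.
SLOTS = [
    ("subject marker", ["ni", "u", "a", "tu", "wa"]),
    ("tense marker",   ["na", "li", "ta"]),
    ("object marker",  ["", "ni", "ku", "m"]),   # "" = no object marker
    ("root",           ["pend", "som", "pik"]),
    ("voice",          ["", "w"]),                # "" = active, "w" = passive
    ("final vowel",    ["a"]),
]

def parse(verb):
    """Try to read `verb` by passing through the slots in order; return the
    slot-by-slot analysis, or None if the string cannot be parsed."""
    def helper(rest, slot_index, analysis):
        if slot_index == len(SLOTS):
            return analysis if rest == "" else None
        name, morphemes = SLOTS[slot_index]
        for m in morphemes:
            if rest.startswith(m):
                result = helper(rest[len(m):], slot_index + 1, analysis + [(name, m)])
                if result is not None:
                    return result
        return None
    return helper(verb, 0, [])

print(parse("ninakupenda"))   # ni-na-ku-pend-a: "I love you"
```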
Alignments 4. Morphology