Automatic Morphology and Minimum Description Length John Goldsmith Department of Linguistics
Today’s plan 1 A computer program -- what it looks like, what it does. 2 The framework -- Minimum Description Length 3 Situate MDL within a linguistic context... Comparison with Early Generative Grammar 4 Situate MDL within a broader intellectual context 5 More substantive description of Automorphology’s design 6 The broader perspective
Today’s plan 1 A computer program -- what it looks like, what it does. 2 The framework -- Minimum Description Length 3 Situate MDL within a linguistic context... 4 Comparison with Early Generative Grammar 5 Situate MDL within a broader intellectual context 6 More substantive description of Automorphology’s design 7 Consequences for anthropology
WinAutomorphology 1 • A version available on the web at http://humanities.uchicago.edu/faculty/goldsmith • A C++ Windows program that accepts data as input and provides a morphological analysis....
raw data → Automorphology → analyzed data
The Big Questions: • What do you have to put into a program like that? How much do you have to put into a program like that? • That is, does it have to have a lot of innate knowledge? Does it help for it to have a lot of innate knowledge? • If you build such a program, how do you know if it does it the same way as a child?
What do we want? If you give the program a computer file containing Tom Sawyer, it should tell you that the language has a category of words that take the suffixes ing, s, ed, and NULL, and another category that takes the suffixes 's, s, and NULL. If you give it Jules Verne, it should tell you there's a category with the suffixes a, aient, ait, ant (chanta, chantaient, chantait, chantant).
And it should tell you about irregular stem allomorphy if your language contains it.
That's what AutoMorphology does. How much data do you need? • You get reasonable results fast with 5,000 words, but results are much better with 50,000 words, and better still with 500,000 words (length of corpus).
Unsupervised learning... • No prepared corpus; no tagging; just the facts. • The goal is to reconstruct the logic of linguistics in a quantitative fashion (to the extent that is necessary).
Unsupervised learning • A fully explicit linguistic hypothesis. • A device (an algorithm) with immediate practical uses. • Arguably the embodiment of linguistic theory: the explicit and quantifiable specification of the relationship between data and analysis (grammar).
Turning to the problem of learning morphology...
For the purposes of version 1 of AutoMorphology, I will restrict myself to Indo-European languages, and in general languages in which the average number of suffixes per word is not greater than 2. (We drop this requirement in AutoMorphology 2.)
Minimum Description Length Jorma Rissanen (1989) Analysis Analyzer Data Select the analyzer and analysis such that the sum of their lengths is a minimum.
[Diagram: the same Data paired with many alternative Analysis + Analyzer combinations, etc.]
The challenge Is to find a means of quantifying • the length of an analyzer, and • the length of an analysis
“Compressed form of data?” Think of data as a dense, rich, detailed description (evidence), and Think of compressed form as • Description in high level language + • Description of the particulars of the event in question (a.k.a. boundary conditions, etc.)...
“Analyzer” Is the set of statements that allows translation between high-level and low-level descriptions.
Minimizing sum of length of Analyzer + Compressed form of data = Aim for conciseness in high-level description + Principles of analysis
Don’t overlook the fact... …that the goal of MDL analysis is nothing less than the solution of the problem of induction. How do we justify generalization, given evidence?
the problem of induction
Speech → child/linguistic theory → grammar
Data → scientist → theory
Sense → brain → thought/percept
Evidence → mind → belief
Morphological analyser Data Morphological analysis of that corpus “signature”
Very simply put... • Just state “ed” “s” “ing” “heit” “ité” once in the grammar; • pay for its occurrence (how many bits does it take to pay for those few letters) just once; • then make repeated reference (use pointers) to those entries.
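The saving from stating each morpheme once can be illustrated with a rough letter count. This is only a sketch: the toy word list, the flat cost of log2(26) bits per letter, and the function name `letters_cost` are assumptions made for illustration, and the cost of the pointers is deliberately left out.

```python
import math

BITS_PER_LETTER = math.log2(26)  # assumed flat cost per letter; not from the talk

def letters_cost(strings):
    """Bits needed to spell out every string once, letter by letter."""
    return sum(len(s) for s in strings) * BITS_PER_LETTER

# Hypothetical toy data: three stems, each occurring with NULL, ed, ing, s.
stems = ["jump", "walk", "look"]
suffixes = ["", "ed", "ing", "s"]  # "" stands for NULL
words = [stem + suffix for stem in stems for suffix in suffixes]

# Unanalyzed: every word is spelled out in full.
unanalyzed = letters_cost(words)
# Analyzed: each stem and each suffix is spelled out only once.
analyzed = letters_cost(stems) + letters_cost(suffixes)

print(f"unanalyzed: {unanalyzed:.1f} bits")
print(f"analyzed:   {analyzed:.1f} bits (before pointer costs)")
```

The gap between the two figures is the budget available to pay for the pointers that tie stems to suffixes.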
References, pointers... • Are not free. • Information theory tells us exactly what they cost. The fundamental measure is Shannon’s: a pointer to an item of reference frequency P out of a universe of N possibilities is of length: log (N/P)
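As a small numerical illustration of that cost, the function below computes log (N/P) for a single pointer. The name `pointer_length` and the choice of base 2 (so that the answer comes out in bits) are assumptions made for this sketch.

```python
import math

def pointer_length(count, total):
    """Length in bits of a pointer to an item referenced `count` times
    out of `total` references, using the log(N/P) measure (base 2 assumed)."""
    if count <= 0 or count > total:
        raise ValueError("count must be between 1 and total")
    return math.log2(total / count)

print(pointer_length(1, 1024))    # a rare item: 10.0 bits
print(pointer_length(512, 1024))  # a frequent item: 1.0 bit
```

Frequent items get short pointers and rare items get long ones, which is why an analysis that reuses a few common suffixes is cheap to refer to.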
Summing over all items, and weighting by count gives us the famous formula:
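Reading the pointer cost log (N/P) literally, and writing c_i for the count of item i, N for the total of all counts, and p_i for c_i / N (this notation is assumed here, not taken from the talk), the weighted sum would presumably be:

```latex
\sum_{i} c_i \log \frac{N}{c_i}
\;=\; N \sum_{i} p_i \log \frac{1}{p_i},
\qquad p_i = \frac{c_i}{N}
```

The right-hand side is N times the Shannon entropy of the distribution over items.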
A probabilistic morphology: • Assigns a probability to all words that it can generate; and these probabilities must add up to 1.0. • A word is three choices: • choice of signature • choice of stem within signature • choice of suffix within signature
Each of those is assigned a probability, based on counts. • Probability of a signature
Similarly, the probability of a stem is the number of times of its occurrence divided by the number of occurrences of that signature in the corpus.
Likewise for the suffixes… If the analysis is wrong, the numbers will be much worse than if it’s right. “The numbers”: a model of the frequencies of words.
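The three choices can be sketched in a few lines. The toy corpus and all names below are hypothetical. The signature probability is assumed to be the signature's share of all word tokens, since the talk's own formula for it is not given here. The stem and suffix probabilities follow the count-based definition above.

```python
from collections import Counter

# Hypothetical analyzed corpus: one (signature, stem, suffix) triple per word token.
tokens = [
    ("NULL.ed.ing", "jump", "NULL"),
    ("NULL.ed.ing", "jump", "ed"),
    ("NULL.ed.ing", "walk", "ing"),
    ("NULL.ed.ing", "walk", "NULL"),
    ("NULL.s", "dog", "s"),
    ("NULL.s", "dog", "NULL"),
    ("NULL.s", "cat", "s"),
    ("NULL.s", "dog", "s"),
]

sig_counts = Counter(sig for sig, _, _ in tokens)
stem_counts = Counter((sig, stem) for sig, stem, _ in tokens)
suffix_counts = Counter((sig, suf) for sig, _, suf in tokens)

def word_probability(sig, stem, suffix):
    """P(signature) * P(stem | signature) * P(suffix | signature), from counts."""
    p_sig = sig_counts[sig] / len(tokens)  # assumed definition, see lead-in
    p_stem = stem_counts[(sig, stem)] / sig_counts[sig]
    p_suffix = suffix_counts[(sig, suffix)] / sig_counts[sig]
    return p_sig * p_stem * p_suffix

# Every word the morphology can generate: any stem of a signature
# combined with any suffix of that same signature.
generable = {
    (sig, stem, suf)
    for (sig, stem) in stem_counts
    for (suf_sig, suf) in suffix_counts
    if sig == suf_sig
}
total = sum(word_probability(*word) for word in generable)

print(word_probability("NULL.s", "dog", "s"))  # 0.28125
print(round(total, 9))                         # 1.0
```

Note that the model also gives probability to words it can generate but that never occurred in the toy corpus, such as cat + NULL.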
Maximum Likelihood • The best morphology is the one that assigns the highest probability to the observed data. • …known in the biz as Maximum Likelihood.
Compare with Early Generative Grammar (EGG) Data Linguistic Theory Preference: A1/A2 Analysis 1 Analysis 2
[Diagram: three roles for a linguistic theory. (1) Data → Linguistic theory → Analysis. (2) Data + Analysis → Linguistic theory → Yes/No. (3) Data + Analysis 1 + Analysis 2 → Linguistic theory → 1 is better / 2 is better.]
Implicit in EGG was the notion... that the best Linguistic Theory could be selected by... getting a set of n candidate LTs; submitting to each a set of corpora; searching (using unknown heuristics) for the best analyses of each corpus within each LT. The LT for which the sum total of all of the analyses is smallest wins.
No cost to UG • In EGG, there was no cost associated with the size of UG -- in effect, no plausibility measure.
In MDL, in contrast…. • we can argue for a grammar for a given corpus. • We can also argue at the Linguistic Theory level if we so choose...
Select n corpora, and select that LT on the basis of LT’s length plus the length of all of the grammars derived from it, plus the lengths of the compressed corpora derived from those grammars. • Pick the LT with the shortest sum total length.
Distinction between heuristics and “theory” • In the context of MDL, the heuristics are extratheoretical, but from the point of view of the (psycho-)linguist, they are very important. • The heuristics propose; the theory disposes.
Stems with their signatures
abrupt NULL ly ness.
abs ence ent.
absent -minded NULL ia ly.
absent-minded NULL ly
absentee NULL ism
absolu NULL e ment.
absorb ait ant e er é ée
abus ait er
abîm e es ée.
Now build up signature collection... Top 10, 100K words
1 .NULL.ed.ing. 65 1214
2 .NULL.ed.ing.s. 27 1464
3 .NULL.s. 290 8184
4 .'s.NULL.s. 27 2645
5 .NULL.ed.s. 26 541
6 .NULL.ly. 128 2124
7 .NULL.ed. 87 767
8 .'s.NULL. 75 3655
9 .NULL.d.s. 14 510
10 .NULL.ing. 62 983
Verbose signature... .NULL.ed.ing. 58 heap check revolt plunder look obtain escort proclaim arrest gain destroy stay suspect kill consent knock track succeed answer frighten glitter....
Stem allomorphy In a corpus of French, we find pairs of stems: ç:c/_# 10 commenç\commenc menaç\menac renonç\renonc avanç\avanc annonç\annonc s'effaç\s'effac enfonç\enfonc recommenç\recommenc perç\perc forç\forc lanç\lanc
Heuristics Find more than one stem that commutes with more than one suffix
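One way to picture this heuristic is a brute-force pass over every possible stem/suffix cut. The sketch below is an illustration under that assumption, not a description of how Automorphology itself searches. The function name `find_signatures`, its thresholds, and the word list are all hypothetical.

```python
from collections import defaultdict

def find_signatures(words, min_stems=2, min_suffixes=2):
    """Group candidate stems by the exact set of suffixes they occur with,
    keeping only suffix sets shared by several stems."""
    suffixes_of = defaultdict(set)
    for word in set(words):
        # Try every cut point; an empty remainder is recorded as NULL.
        for cut in range(1, len(word) + 1):
            stem, suffix = word[:cut], word[cut:]
            suffixes_of[stem].add(suffix or "NULL")

    stems_of = defaultdict(set)
    for stem, suffixes in suffixes_of.items():
        if len(suffixes) >= min_suffixes:
            stems_of[tuple(sorted(suffixes))].add(stem)

    # Keep a signature only if more than one stem commutes with it.
    return {sig: stems for sig, stems in stems_of.items()
            if len(stems) >= min_stems}

words = ["jump", "jumped", "jumping", "walk", "walked", "walking"]
for signature, stems in find_signatures(words).items():
    print(".".join(signature), sorted(stems))
# NULL.ed.ing ['jump', 'walk']
```

Cuts such as jum + p/ped/ping are considered too, but they are discarded here because no second stem shares that suffix set.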
Negotiate for where the stem/suffix break should be: mea, christia, roma, reig, rui, saxo, tow all take the suffixes n/ns.