Linguistica
• INPUT: a text file, typically 5,000 to 1,000,000 words
• OUTPUT: a partial morphological analysis of most of the words in the corpus
• Unsupervised
• No dictionary
• No morphological rules
• MDL framework (Rissanen 1989)
Learning Morphology

The Problem
• Determination of the correct morphological split of individual words into stem and suffixes.
• Establishment of accurate categories of stems based on the range of suffixes they accept.

Four Approaches
• Identify morpheme boundaries (and hence morphemes) on the basis of the degree of predictability of the (n+1)st letter given the first n letters (Z. Harris 1955, 1967).
• Identify bigrams and trigrams that have a high probability of being morpheme-internal.
• Discover patterns of phonological relationships between pairs of related words.
• Seek the analysis that is globally most concise (Goldsmith 2001).
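
The first approach can be illustrated with a small sketch. It assumes one simple reading of "predictability": count how many distinct letters follow each prefix in the vocabulary, and treat a spike in that count as a candidate morpheme boundary. The function names and the toy word list are illustrative choices, not taken from Harris or from Linguistica.

```python
from collections import defaultdict

def successor_counts(words):
    """Map each prefix to the set of letters that follow it in the vocabulary."""
    followers = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            followers[w[:i]].add(w[i])
        followers[w].add("#")  # "#" marks end of word (an assumed convention)
    return followers

def successor_profile(word, followers):
    """Number of distinct continuations after each non-empty prefix of `word`."""
    return [len(followers[word[:i]]) for i in range(1, len(word) + 1)]

# Toy vocabulary, chosen only for illustration.
words = ["jump", "jumps", "jumped", "jumping",
         "walk", "walks", "walked", "walking"]

followers = successor_counts(words)
profile = successor_profile("jumping", followers)

# The position with the most continuations is the candidate boundary.
cut = profile.index(max(profile)) + 1
stem, suffix = "jumping"[:cut], "jumping"[cut:]
print(profile, stem, suffix)
```

On this toy list the count is low inside "jump", rises after it, and falls again inside "ing", so the sketch proposes the split jump + ing. Real corpora give much noisier profiles.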

Minimum Description Length Model: 4 Components
• A model of a set of data assigns a probability distribution to the sample space from which the data is drawn.
• The model can be used to assign a compressed length to the data using information-theoretic notions.
• The model can itself be assigned a length.
• The optimal analysis of the data is the one for which the sum of the length of the compressed data and the length of the model is smallest.
• In other words, we seek the most compact representation of the model and the data together.
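
A minimal sketch of the two-part comparison, under loud assumptions: the data are encoded with unigram probabilities estimated from the tokens themselves, and the model (a lexicon) is charged a flat log2(27) bits per letter plus one end marker per entry. That lexicon cost is a hypothetical stand-in, not the encoding Linguistica uses. The toy corpus and the two candidate analyses are also invented for illustration.

```python
from collections import Counter
from math import log2

def compressed_length(tokens):
    """Bits needed to encode the tokens under their own unigram distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(-log2(counts[t] / total) for t in tokens)

def lexicon_length(entries, bits_per_letter=log2(27)):
    """Hypothetical model cost: each distinct entry costs its letters plus an end marker."""
    return sum((len(e) + 1) * bits_per_letter for e in set(entries))

corpus = ["jump", "jumps", "jumped",
          "walk", "walks", "walked",
          "laugh", "laughs", "laughed"]

# Analysis A: every word is an unanalysed atom.
a_total = lexicon_length(corpus) + compressed_length(corpus)

# Analysis B: every word is a stem plus a suffix ("" stands for NULL).
splits = [(w[:-2], "ed") if w.endswith("ed")
          else (w[:-1], "s") if w.endswith("s")
          else (w, "")
          for w in corpus]
stems = [t for t, _ in splits]
suffixes = [f for _, f in splits]
b_total = (lexicon_length(stems) + lexicon_length(suffixes)
           + compressed_length(stems) + compressed_length(suffixes))

# MDL picks the analysis with the smallest model + data length.
best = min([("A", a_total), ("B", b_total)], key=lambda pair: pair[1])
print(round(a_total, 1), round(b_total, 1), best[0])
```

In this toy case the compressed data cost is the same under both analyses, and the stem-plus-suffix analysis wins because its lexicon is much shorter.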

An Example Model
• List of stems: the set of unanalysed words plus the material that precedes the final suffix of any analysed word
• List of suffixes that occur with at least one stem
• List of signatures: each stem is associated with a list of observed suffixes. This is the stem’s signature. This list is created using pointers.

MDL Example
STEMS (9): cat, dog, hat, John, jump, laugh, sav, the, walk
AFFIXES (6): NULL, ed, ing, s, e, es

MDL Example: Signatures
S1: stems ptr(cat) ptr(dog) ptr(hat); suffixes ptr(NULL) ptr(s)
S2: stems ptr(sav); suffixes ptr(e) ptr(es) ptr(ing)
S3: stems ptr(jump) ptr(laugh) ptr(walk); suffixes ptr(NULL) ptr(ed) ptr(ing) ptr(s)
S4: stems ptr(John) ptr(the)
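
The signatures in this example can be written down as a small data structure. The sketch below reads each signature as a list of stems paired with a list of suffixes, which is an interpretation of the slide layout. It uses the empty string for NULL and assumes that the unanalysed words in S4 take only the NULL suffix, which the slide does not state.

```python
# Each signature pairs a list of stems with a list of suffixes.
# "" stands for the NULL suffix.
signatures = {
    "S1": (["cat", "dog", "hat"], ["", "s"]),
    "S2": (["sav"], ["e", "es", "ing"]),
    "S3": (["jump", "laugh", "walk"], ["", "ed", "ing", "s"]),
    "S4": (["John", "the"], [""]),  # assumption: unanalysed words take only NULL
}

def expand(signatures):
    """Generate every word licensed by the signatures (stem + suffix)."""
    return sorted(stem + suffix
                  for stems, suffixes in signatures.values()
                  for stem in stems
                  for suffix in suffixes)

words = expand(signatures)
all_stems = {t for stems, _ in signatures.values() for t in stems}
all_suffixes = {f for _, suffixes in signatures.values() for f in suffixes}

print(len(all_stems), len(all_suffixes), len(words))
print(words)
```

Nine stems and six suffixes are stored once each, yet together with the four signatures they account for every word form in the example.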

Notation
t: a stem
f: a suffix
s: a signature
T: set of stems in the corpus
F: set of suffixes in the corpus
S: set of signatures in the corpus
<T>, <F>, <S>: cardinalities of T, F, S
[t], [f]: frequency of t, f in the corpus
W: set of words in the corpus
[W]: length of the corpus
<W>: vocabulary size

A signature comprises two lists:
• List of pointers to stems
• List of pointers to suffixes
To specify a list of length N we need L(N) bits, where L(N) ≈ log2(N).
A pointer to a stem t has length -log2(P(t)) bits, where P(t) = [t]/[W].
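
The two costs can be computed directly. In this sketch the corpus length and the stem frequencies are hypothetical numbers chosen so that the arithmetic comes out in whole bits, and the helper names are illustrative. Only the stem list of a signature is costed, since the slide gives the pointer formula for stems.

```python
from math import log2

def list_length(n):
    """Approximate bits needed to state that a list has n entries: L(N) ~ log2(N)."""
    return log2(n)

def pointer_length(count, corpus_length):
    """Bits for a pointer to an item seen `count` times: -log2([t]/[W])."""
    return -log2(count / corpus_length)

corpus_length = 1024                                # hypothetical [W]
stem_counts = {"jump": 8, "laugh": 4, "walk": 16}   # hypothetical [t] values

stem_pointer_bits = {t: pointer_length(c, corpus_length)
                     for t, c in stem_counts.items()}

# Cost of the stem list: its length header plus one pointer per stem.
stem_list_bits = list_length(len(stem_counts)) + sum(stem_pointer_bits.values())

print(stem_pointer_bits)
print(round(stem_list_bits, 3))
```

Frequent stems get short pointers and rare stems get long ones, so a signature that groups frequent stems is cheaper to write down.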