Variant definitions of pointer length in MDL

Aris Xanthos, Yu Hu, and John Goldsmith, University of Chicago.


Presentation Transcript


  1. Variant definitions of pointer length in MDL • Aris Xanthos, Yu Hu, and John Goldsmith • University of Chicago

  2. Degrees of freedom in MDL modeling • MDL does not specify the form of the grammar being inferred. • Carl de Marcken (1996) • There are alternatives to pointers for representing connections. • Different representations may lead to different grammars.

  3. Linguistica (Goldsmith 2001) • Website: linguistica.uchicago.edu • Data: corpus segmented into words • Model: • List of stems • List of suffixes • List of signatures • A sample signature: { walk, jump, ... }{ ed, ing, ... }

  4. Reminder: MDL analysis • Corpus $C$ • Two or more competing models describing $C$ • Model $M$ assigns a probability to $C$: $\mathrm{pr}(C \mid M)$ • Compressed length of $C$ given $M$: $L(C \mid M) = -\log_2 \mathrm{pr}(C \mid M)$ • Length of model $M$: $L(M)$ • Description length of $C$ given $M$: $DL(C \mid M) = L(C \mid M) + L(M)$
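The bookkeeping on this slide can be sketched in a few lines of Python. The probabilities and model lengths below are invented for illustration and do not come from the presentation.

```python
import math


def compressed_length(prob_corpus_given_model: float) -> float:
    """L(C | M) = -log2 pr(C | M), measured in bits."""
    return -math.log2(prob_corpus_given_model)


def description_length(prob_corpus_given_model: float, model_length_bits: float) -> float:
    """DL(C | M) = L(C | M) + L(M)."""
    return compressed_length(prob_corpus_given_model) + model_length_bits


# Two hypothetical models of the same corpus (all numbers are made up).
dl_a = description_length(prob_corpus_given_model=2 ** -40, model_length_bits=10.0)
dl_b = description_length(prob_corpus_given_model=2 ** -30, model_length_bits=25.0)

# The model with the smaller total description length is preferred.
best = "A" if dl_a < dl_b else "B"
```

Model B fits the corpus better (a shorter compressed length), but its larger model cost makes its total description length higher, so model A is preferred in this toy case.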

  5. Learning process • Bootstrapping heuristic: word = stem + suffix • Successive heuristics propose modifications. • MDL sanctions modifications. • Compute L( corpus | model ) + L( model ) before and after modification. • If it results in a decrease in DL, retain modification, otherwise discard it.
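A minimal sketch of the accept-or-discard loop described on this slide. The function name `mdl_search` and the toy model are hypothetical; Linguistica's actual heuristics and data structures are not shown in the presentation.

```python
from typing import Callable, Iterable, Tuple, TypeVar

M = TypeVar("M")


def mdl_search(
    model: M,
    proposals: Iterable[Callable[[M], M]],
    description_length: Callable[[M], float],
) -> Tuple[M, float]:
    """Try each proposed modification and keep it only if DL decreases."""
    best_dl = description_length(model)
    for propose in proposals:
        candidate = propose(model)
        candidate_dl = description_length(candidate)
        if candidate_dl < best_dl:
            # DL went down: retain the modification.
            model, best_dl = candidate, candidate_dl
        # Otherwise the modification is discarded and the model is unchanged.
    return model, best_dl


# Toy stand-in: the "model" is a number and its "DL" is its distance from 3.
final_model, final_dl = mdl_search(
    model=0,
    proposals=[lambda m: m + 2, lambda m: m + 5, lambda m: m + 1],
    description_length=lambda m: abs(m - 3),
)
```

In the toy run the first and third proposals lower the cost and are kept, while the second raises it and is dropped.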

  6. Length of the morphology • L( morphology ) = sum of the lengths of lists (stems, suffixes, signatures) • Length of a list = sum of the lengths of elements in it + small cost for list structure • Length of a stem / suffix is proportional to the number of symbols in it.
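As a rough illustration of the list-length computation, the sketch below charges a fixed number of bits per symbol plus a flat structure cost. Both constants are assumptions chosen for the example (a uniform code over a 26-letter alphabet and a 1-bit list overhead); the slide only says the cost is proportional to the number of symbols.

```python
import math
from typing import Iterable

# Assumed constants, for illustration only.
BITS_PER_SYMBOL = math.log2(26)   # uniform cost over a 26-letter alphabet
LIST_STRUCTURE_COST = 1.0         # flat charge for the list structure


def list_length(items: Iterable[str]) -> float:
    """Sum of the per-symbol costs of the elements plus the structure cost."""
    return sum(len(item) * BITS_PER_SYMBOL for item in items) + LIST_STRUCTURE_COST


stems = ["walk", "jump", "great"]   # 13 symbols in total
suffixes = ["ed", "ing", "est"]     # 8 symbols in total

stems_bits = list_length(stems)
suffixes_bits = list_length(suffixes)
```

The signature list is not covered here because its elements are pointers rather than strings of symbols.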

  7. Length of the morphology (2) • A signature specifies that a set of stems associates with a set of suffixes. [Figure: a signature { walk, jump, ... }{ ed, ing, ... } shown alongside the list of stems { walk, jump, great, ... } and the list of suffixes { ed, ing, est, ... }]

  8. Length of the morphology (3) • A pointer is a symbol that stands for a given morpheme. • The information content of a pointer to a morpheme $m$ is $-\log_2 \mathrm{pr}(m)$ • The more probable the morpheme, the smaller the cost of a pointer to it.
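The pointer cost can be illustrated with a short sketch. It assumes pr(m) is estimated as a relative frequency, which the slide does not specify, and the counts are invented.

```python
import math
from typing import Dict


def pointer_costs(morpheme_counts: Dict[str, int]) -> Dict[str, float]:
    """Cost in bits of a pointer to each morpheme: -log2 pr(m).

    pr(m) is taken to be count(m) / total count (an assumption).
    """
    total = sum(morpheme_counts.values())
    return {m: -math.log2(c / total) for m, c in morpheme_counts.items()}


# Invented suffix counts, 16 occurrences in total.
counts = {"ed": 8, "ing": 4, "est": 2, "s": 2}
costs = pointer_costs(counts)
```

With these counts a pointer to the most frequent suffix, ed, costs 1 bit, while pointers to the rarest ones, est and s, cost 3 bits each.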

  9. Length of the morphology (4) • Length of signature = sum of lengths of 2 lists of pointers (to stems and to suffixes) • Length of each list = sum of information cost of pointers in it + small cost for list structure
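A sketch of the signature length as two lists of pointers. The probabilities and the 1-bit list-structure cost are assumptions made for the example.

```python
import math
from typing import Dict, Iterable

LIST_STRUCTURE_COST = 1.0  # assumed flat charge per list


def pointer_list_length(morphemes: Iterable[str], prob: Dict[str, float]) -> float:
    """Sum of the pointer costs -log2 pr(m), plus the list-structure cost."""
    return sum(-math.log2(prob[m]) for m in morphemes) + LIST_STRUCTURE_COST


def signature_length(
    stems: Iterable[str],
    suffixes: Iterable[str],
    stem_prob: Dict[str, float],
    suffix_prob: Dict[str, float],
) -> float:
    """Length of a signature: a list of stem pointers plus a list of suffix pointers."""
    return pointer_list_length(stems, stem_prob) + pointer_list_length(suffixes, suffix_prob)


# Invented probabilities for illustration.
stem_prob = {"walk": 0.5, "jump": 0.25, "great": 0.25}
suffix_prob = {"ed": 0.5, "ing": 0.25, "est": 0.25}

sig_bits = signature_length(["walk", "jump"], ["ed", "ing"], stem_prob, suffix_prob)
```

Each list here costs 1 + 2 bits of pointers plus 1 bit of structure, so the signature { walk, jump }{ ed, ing } comes to 8 bits under these assumptions.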

  10. Compressed length of the corpus [Figure: the corpus ("walking in the...") shown against the morphology: signatures { }{ } together with the list of stems { walk, jump, great, ... } and the list of suffixes { ed, ing, est, ... }]

  11. Compressed length of the corpus (2) • Compressed length of a word $w$ = information content of pointer to signature $\sigma$ + information content of pointer to stem $t$ given $\sigma$ + information content of pointer to suffix $f$ given $\sigma$ = $-\log_2 \mathrm{pr}(\sigma) - \log_2 \mathrm{pr}(t \mid \sigma) - \log_2 \mathrm{pr}(f \mid \sigma)$ • $L(\text{corpus} \mid \text{morphology})$ = sum of the lengths of the individual words
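The word and corpus lengths above can be sketched as follows. Each word is represented only by the three probabilities in the formula; the values are invented and the function names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def word_length(pr_sig: float, pr_stem_given_sig: float, pr_suffix_given_sig: float) -> float:
    """-log2 pr(sigma) - log2 pr(t | sigma) - log2 pr(f | sigma), in bits."""
    return (
        -math.log2(pr_sig)
        - math.log2(pr_stem_given_sig)
        - math.log2(pr_suffix_given_sig)
    )


def corpus_length(analyses: Iterable[Tuple[float, float, float]]) -> float:
    """L(corpus | morphology): the sum of the lengths of the individual words."""
    return sum(word_length(*analysis) for analysis in analyses)


# One (pr(sigma), pr(t | sigma), pr(f | sigma)) triple per word; values are made up.
analyses = [
    (0.5, 0.25, 0.5),   # 1 + 2 + 1 = 4 bits
    (0.5, 0.5, 0.5),    # 1 + 1 + 1 = 3 bits
    (0.25, 0.5, 0.25),  # 2 + 1 + 2 = 5 bits
]
total_bits = corpus_length(analyses)
```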

  12. Alternatives to pointers • There are alternatives to pointers for representing connections in the morphology. [Figure: a signature σ encoded as a binary string 1 1 0 … 0 over the list of (all) stems { walk, jump, great, ..., chin }]

  13. List of pointers vs. binary strings • The number of symbols in a binary string is constant and equal to the total number of stems. • The information content of the string depends on the distribution of 0's and 1's in it: it equals the total number of stems times the entropy of the string.
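A small sketch of the binary-string cost, reading "entropy of string" as the entropy of the empirical distribution of 0's and 1's in it. That reading and the function names are assumptions; the paper gives the exact definition.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a 0/1 source that emits 1 with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def binary_string_length(total_stems: int, stems_in_signature: int) -> float:
    """Total number of stems times the entropy of the 0/1 string."""
    p_one = stems_in_signature / total_stems
    return total_stems * binary_entropy(p_one)


# With 8 stems in the morphology:
half = binary_string_length(total_stems=8, stems_in_signature=4)    # 8 * H(1/2)
sparse = binary_string_length(total_stems=8, stems_in_signature=1)  # 8 * H(1/8)
```

A string that marks half of the stems costs the full 8 bits, while one that marks a single stem costs less because its 0's and 1's are far from evenly distributed.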

  14. Expected difference in DL • Theoretical inference (see details in paper): • Binary strings are shorter when: the distribution of stems tends to be uniform; the distribution of the number of stems being pointed to tends to be uniform. • Lists of pointers are shorter when: the distribution of stems departs from uniformity; the average number of stems being pointed to is small.

  15. A specific example • Current state of the morphology: { walk, jump, ... }{ ed, ing } and { walks, broke, ... }{ } • Proposed modification: walks = walk + s

  16. A specific example (2) • State of the morphology after modification: { ..., jump, ... }{ ed, ing }, { walk }{ ed, ing, s }, and { walks, broke, ... }{ } • Cost: pointers to ed, ing and s • Savings: the string walks, a pointer to it

  17. Crucial difference • The compressed length of binary strings is independent of the frequency of the items being pointed to. • This encoding does not favor the creation of pointers to frequent items (or the deletion of pointers to rare items).
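The contrast on this slide can be made concrete with a toy comparison. The stem counts are invented, pr(stem) is assumed to be a relative frequency, and the binary-string cost is taken to be the string length times the entropy of its 0/1 distribution.

```python
import math
from typing import List, Set

# Invented stem counts, 128 occurrences in total.
stem_counts = {"walk": 64, "jump": 32, "great": 16, "chin": 16}
all_stems = list(stem_counts)
total = sum(stem_counts.values())


def pointer_cost(stem: str) -> float:
    """-log2 pr(stem), with pr estimated as a relative frequency (assumption)."""
    return -math.log2(stem_counts[stem] / total)


def bit_string(stems_in_signature: Set[str]) -> List[int]:
    """One bit per stem in the morphology: 1 if the signature contains it."""
    return [1 if stem in stems_in_signature else 0 for stem in all_stems]


def binary_string_length(bits: List[int]) -> float:
    """Number of bits times the entropy of the 0/1 distribution in the string."""
    p = sum(bits) / len(bits)
    if p in (0.0, 1.0):
        return 0.0
    return len(bits) * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))


# A signature containing only the frequent stem vs. only the rare stem.
pointer_frequent = pointer_cost("walk")                      # cheap pointer
pointer_rare = pointer_cost("chin")                          # expensive pointer
string_frequent = binary_string_length(bit_string({"walk"}))
string_rare = binary_string_length(bit_string({"chin"}))
```

The pointer to the frequent stem is cheaper than the pointer to the rare one, whereas the two binary strings cost exactly the same, since each contains one 1 and three 0's regardless of which stem is marked.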

  18. Conclusion • There is more than one way of representing the connections between items in a grammar. • The choice of a representation can have important consequences on the grammar being induced. • Mathematical details can be found in the paper.
