
Using eigenvectors of a bigram-induced matrix to represent and infer syntactic behavior

Mikhail Belkin and John Goldsmith, The University of Chicago, July 2002





Presentation Transcript


  1. Using eigenvectors of a bigram-induced matrix to represent and infer syntactic behavior. Mikhail Belkin and John Goldsmith, The University of Chicago, July 2002

  2. Dual motivation • Unsupervised learning of syntactic behavior of words • Solving a problem in the unsupervised learning of morphology: disambiguating morphs

  3. Disambiguating morphs? • Automatic learning of morphology can provide us with a signature associated with a given stem: • Signature = alphabetized list of affixes associated with a given stem in a corpus.

  4. For example: Signature NULL.ed.ing.s: • aid, ask, call, claim, help, kick Signature NULL.ed.ing: • add, assist, attend, consider Signature NULL.s: • achievement, acre, action, administrator, affair
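
As a minimal sketch of the definition above, the following groups affixes by stem and joins them alphabetically. The `signatures` function and the `(stem, affix)` input format are assumptions made for illustration; the slides do not say how the morphological analysis itself is obtained.

```python
from collections import defaultdict

def signatures(analyses):
    """Map each stem to its signature: its affixes, alphabetized and joined by dots."""
    affixes_by_stem = defaultdict(set)
    for stem, affix in analyses:
        affixes_by_stem[stem].add(affix)
    # Plain string sorting puts the upper-case "NULL" before lower-case affixes.
    return {stem: ".".join(sorted(affixes)) for stem, affixes in affixes_by_stem.items()}

# Hypothetical (stem, affix) pairs; "NULL" marks the bare stem.
analyses = [
    ("ask", "NULL"), ("ask", "ed"), ("ask", "ing"), ("ask", "s"),
    ("add", "NULL"), ("add", "ed"), ("add", "ing"),
    ("acre", "NULL"), ("acre", "s"),
]
sig = signatures(analyses)
print(sig)
```

Stems that share a signature string fall into the same group, which is how lists like those on this slide would be produced.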

  5. The signature NULL.ed.ing is much more truly a subsignature of NULL.ed.ing.s than NULL.s is, because the suffix s is ambiguous (noun, verb).

  6. How can we determine whether a given morph (“ed”, “s”) represents more than 1 morpheme? • I don’t think that we can do this on the basis of morphological information.

  7. Goal: describe syntactic behavior in a way that depends only on a corpus. • That is, in a fashion that is language-independent but corpus-dependent, though the global structure induced from two corpora of the same language will be very similar.

  8. French. Regions labeled in the plot: finite verbs, plural nouns, fem. sg. nouns

  9. With such a method… We can look at words formed with the “same” suffix, putting words into buckets based on the signature their stem is in: • Bucket 1 (NULL.ed.ing.s): aided, asked, called • Bucket 2 (NULL.ed.ing): added, assisted, attended. Q: Do the average positions of the buckets form a tight cluster?

  10. If the average locations of the buckets of -ed words form a tight cluster, then -ed is not ambiguous. If the average locations of the buckets (from distinct signatures) do not form a tight cluster, the morpheme is not the same across signatures.

  11. Method • Not a clustering method; neither top-down nor bottom-up. • Two-step procedure: 1. Construct a nearest-neighbor graph. 2. Reduce the graph to 2 dimensions by means of eigenvector decomposition.

  12. Nearest neighbors Following a long list of researchers: • We begin by assuming that a word W’s distribution can be described by a vector L describing all of its left-hand neighbors and a vector R describing all of its right-hand neighbors.

  13. V = size of the corpus’s vocabulary. L_w and R_w are vectors that live in R^V. If the vocabulary is ordered alphabetically, then L_w = (4, 0, 0, 0, …), where the entries are, in order: # of occurrences of “a” before w, # of occurrences of “abandoned” before w, # of occurrences of “abatuna” before w, …
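
The left-neighbor vectors can be sketched as follows, assuming the corpus is already tokenized into a flat list of words. The `left_vectors` function and the toy sentence are illustrative only and not taken from the presentation.

```python
from collections import Counter, defaultdict

def left_vectors(tokens):
    """Return (vocab, L), where L[w][i] counts occurrences of vocab[i] immediately before w."""
    vocab = sorted(set(tokens))              # alphabetical ordering of the vocabulary
    counts = defaultdict(Counter)
    for left, word in zip(tokens, tokens[1:]):
        counts[word][left] += 1              # one bigram (left, word)
    L = {w: [counts[w][v] for v in vocab] for w in vocab}
    return vocab, L

# Toy corpus, invented for the example.
tokens = "a dog saw a cat and a dog ran".split()
vocab, L = left_vectors(tokens)
print(vocab)
print(L["dog"])
```

Swapping the roles of `left` and `word` in the loop would give the right-neighbor vectors R_w in the same way.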

  14. Similarity of syntactic behavior is modeled as closeness of L-vectors …where “closeness” of 2 vectors is modeled as the angle between them.
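
A small sketch of the angle computation, using the standard relation cos θ = (u · v) / (|u| |v|). Returning π/2 for an all-zero vector is an assumption made here to keep the function total; the slides do not address that case.

```python
import math

def angle(u, v):
    """Angle in radians between two count vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return math.pi / 2                    # assumption: treat empty vectors as orthogonal
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp rounding error
    return math.acos(cosine)

print(angle([4, 0, 0], [8, 0, 0]))   # same direction, different frequency
print(angle([1, 0], [0, 1]))         # no shared left-hand neighbors
```

Because only the direction of a vector matters, a frequent word and a rare word with the same proportions of left-hand neighbors come out as maximally similar.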

  15. Construct a (non-directed) graph: Its vertices are the words W in V. For each word W: • Pick the K most-similar words (K = 20, 50) (by angle of L-vector) • Add an edge to the graph connecting W to each of those words.
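
The graph construction might look like the sketch below. It ranks candidates by cosine (largest cosine means smallest angle) and breaks ties alphabetically; the tie-breaking rule, the `knn_graph` name, and the toy vectors are assumptions for illustration.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def knn_graph(vectors, k):
    """Undirected K-nearest-neighbor graph, returned as a set of two-word frozensets."""
    edges = set()
    for w, vec in vectors.items():
        others = [x for x in vectors if x != w]
        # Most similar first; alphabetical order makes ties deterministic.
        others.sort(key=lambda x: (-cosine(vec, vectors[x]), x))
        for neighbor in others[:k]:
            edges.add(frozenset((w, neighbor)))   # undirected: {w, neighbor}
    return edges

# Hypothetical L-vectors for four words.
vectors = {"cat": [3, 1, 0], "dog": [2, 1, 0], "ran": [0, 0, 4], "saw": [0, 1, 3]}
edges = knn_graph(vectors, k=1)
print(edges)
```

Since an edge is added whenever either word picks the other, a word can end up with more than K edges in the resulting graph.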

  16. Canonical matrix representation of a graph: M(i,j) = 1 iff there is an edge connecting wi and wj – that is, iff wi and wj are similar words as regards how they interact with the word immediately to the left.

  17. Where is this matrix M? • It’s a point in a space of dimension V(V-1)/2. Not very helpful, really. • How can we optimally reduce it to a space of small dimension? • Find the eigenvectors of the normalized Laplacian of the graph. See Chung; Malik and Shi; Belkin and Niyogi (references in written version).

  18. A graph and its matrix M • The degree of a vertex (= word) is the number of edges adjacent (linked) to it. • Notice that this is not fixed across words. • The degree of vertex vi is the sum of the entries of the ith row.

  19. The Laplacian of the graph: Let D be the V×V diagonal matrix whose diagonal entry D(i,i) = degree of vi. D - M is the Laplacian of the graph. Its rows sum to 0.
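
A minimal sketch of D - M on a four-vertex example, with M stored as a list of rows. The adjacency matrix is made up for illustration and assumes no self-loops.

```python
def laplacian(M):
    """Return D - M for a symmetric 0/1 adjacency matrix M given as a list of rows."""
    n = len(M)
    degrees = [sum(row) for row in M]         # degree of vertex i = sum of row i
    return [[(degrees[i] if i == j else 0) - M[i][j] for j in range(n)]
            for i in range(n)]

# Hypothetical graph: a triangle on vertices 0, 1, 2 plus vertex 3 attached to vertex 2.
M = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
lap = laplacian(M)
for row in lap:
    print(row, "row sum:", sum(row))
```

Each row sums to 0 because the degree on the diagonal is cancelled by one -1 for every neighbor.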

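  20. Normalized Laplacian: • For each i, divide all entries in the ith row of D - M by √d(i). • For each i, divide all entries in the ith column by √d(i). • Result: Diagonal elements are all 1. • Generally: the (i,j) entry is 1 if i = j, -1/√(d(i)·d(j)) if wi and wj are connected by an edge, and 0 otherwise.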
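
The row and column scaling can be sketched directly from the adjacency matrix. This assumes every vertex has at least one edge and that there are no self-loops; the function name and the example matrix are illustrative.

```python
import math

def normalized_laplacian(M):
    """Scale row i and column j of D - M by 1/sqrt(d(i)) and 1/sqrt(d(j))."""
    n = len(M)
    d = [sum(row) for row in M]
    if any(x == 0 for x in d):
        raise ValueError("every vertex needs at least one edge")
    return [[(1.0 if i == j else 0.0) - M[i][j] / math.sqrt(d[i] * d[j])
             for j in range(n)]
            for i in range(n)]

# Hypothetical graph: a triangle on vertices 0, 1, 2 plus vertex 3 attached to vertex 2.
M = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
N = normalized_laplacian(M)
for row in N:
    print([round(x, 3) for x in row])
```

The result stays symmetric, which is what lets a symmetric eigen-solver be used on it.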

  21. Eigenvector decomposition • The eigenvectors form a spectrum, ranked by their eigenvalues. • Eigenvalues run from 0 to 2 (the normalized Laplacian is positive semi-definite). • The eigenvector with eigenvalue 0 (the “zeroth”) reflects each word’s frequency. • But the one with the next smallest eigenvalue (the “first”) gives us a good representation of the words…

  22. …in the sense that the values associated with each word show how close the words are in the original graph. We can graph the first two eigenvectors of the Left (or Right) graph: each word is located at the coordinates corresponding to it in the eigenvector(s):
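
One way to obtain such coordinates is sketched below with NumPy's symmetric eigen-solver. The choice of `numpy.linalg.eigh`, the `spectral_coordinates` name, and the six-vertex example graph are assumptions for illustration; the slides do not say which solver was used.

```python
import numpy as np

def spectral_coordinates(M, dims=2):
    """Eigenvalues of the normalized Laplacian and the eigenvectors after the zeroth."""
    M = np.asarray(M, dtype=float)
    d = M.sum(axis=1)                          # vertex degrees (assumed nonzero)
    inv_sqrt = 1.0 / np.sqrt(d)
    N = np.eye(len(M)) - inv_sqrt[:, None] * M * inv_sqrt[None, :]
    eigenvalues, eigenvectors = np.linalg.eigh(N)   # eigenvalues in ascending order
    # Column 0 is the "zeroth" eigenvector; keep the next `dims` columns as coordinates.
    return eigenvalues, eigenvectors[:, 1:1 + dims]

# Hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3.
M = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
eigenvalues, coords = spectral_coordinates(M)
print(np.round(eigenvalues, 3))
print(np.round(coords, 3))       # row i = (x, y) position of vertex i
```

In this toy graph the first coordinate has one sign on one triangle and the opposite sign on the other, so vertices that are close in the graph land close together in the plot. The overall sign of an eigenvector is arbitrary, so only relative positions are meaningful.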

  23. Spanish (left). Regions labeled in the plot: masculine plurals, fem. plurals, feminine sg. nouns, masc. sg. nouns, past participles, finite verbs

  24. German (left). Regions labeled in the plot: neuter sg. nouns; numbers, centuries; fem. sg. nouns; names of places

  25. English (right). Regions labeled in the plot: nouns; modals; prepositions + of + “to”

  26. English (left). Regions labeled in the plot: infinitives; past verbs; modals; the +

  27. Results of experiment • If we define the size of the minimal box that includes all of the vocabulary as being 1 by 1, then we find a small (< 0.10) average distance to mean for unambiguous suffixes (e.g., -ed (English), -ait (French)), and only for them.

  28. Measure • To repeat: we find the “virtual” location of the conflation of all of the stems of a given signature, plus the suffix in question, e.g., NULL.ed.ing_ed • We do this for all signatures containing “ed” • We compute average distance to the mean.
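
The measure can be sketched as below, taking the bucket locations as given. How the “virtual” location of a conflated bucket is computed is not spelled out in the slides, so the coordinates, the function names, and the data here are hypothetical.

```python
import math

def to_unit_box(point, vocabulary_points):
    """Rescale a point so the minimal box around the whole vocabulary is 1 by 1."""
    xs = [p[0] for p in vocabulary_points]
    ys = [p[1] for p in vocabulary_points]
    width = (max(xs) - min(xs)) or 1.0         # guard against a degenerate box
    height = (max(ys) - min(ys)) or 1.0
    return ((point[0] - min(xs)) / width, (point[1] - min(ys)) / height)

def average_distance_to_mean(points):
    """Mean Euclidean distance from each point to the centroid of all points."""
    mean_x = sum(p[0] for p in points) / len(points)
    mean_y = sum(p[1] for p in points) / len(points)
    return sum(math.hypot(p[0] - mean_x, p[1] - mean_y) for p in points) / len(points)

# Hypothetical eigenvector coordinates for the vocabulary and for two -ed buckets.
vocabulary_points = [(0.0, 0.0), (10.0, 5.0), (3.0, 1.0), (7.0, 4.0)]
bucket_locations = {"NULL.ed.ing.s_ed": (4.5, 2.5), "NULL.ed.ing_ed": (5.5, 2.5)}

scaled = [to_unit_box(p, vocabulary_points) for p in bucket_locations.values()]
spread = average_distance_to_mean(scaled)
print(spread)
```

With these invented numbers the spread falls under the 0.10 threshold, which under the criterion on these slides would count the suffix as unambiguous.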

  29. Table columns: Average <= 0.10; Average > 0.10

  30. Conclusion • The technique appears to work appropriately for the task. • But we suspect that the actual use of the technique is much more interesting and open-ended than this (simple) application suggests.
