
Learning Random Walk Models for Inducing Word Dependency Distributions


Presentation Transcript


  1. Learning Random Walk Models for Inducing Word Dependency Distributions. Kristina Toutanova, Christopher D. Manning, Andrew Y. Ng

  2. Word-specific information is important for many NLP disambiguation tasks • Example: language modeling for speech recognition. "I would like to make a collect …": should the next word be call, coal, or cow? • We need good estimates of P(call | collect), P(coal | collect), and P(cow | collect)

  3. Sparseness: 1 million words is very little data • The number of parameters to be estimated is very large – for a vocabulary size of 30,000 words, we have 900 million pair-wise parameters • Pair-wise statistics involving two words are very sparse, even on topics central to the domain of the corpus. Examples from WSJ (1 million words): • stocks plummeted – 2 occurrences • stocks stabilized – 1 occurrence • stocks rose – 50 occurrences • stocks skyrocketed – 0 occurrences • stocks laughed – 0 occurrences

  4. It must be possible to use related words instead • The hypothesis is that similar words (according to various measures) have similar distributions. • There are many ways of defining related words, e.g. morphology (stabilize, stabilized, stabilizing), synonyms (rise, climb), "is-a" relationships (rise, skyrocket) • Similarity with respect to some features or distributions extracted from a corpus (examples later) • Co-occurrence in a huge corpus (the Web)

  5. Using multiple similarity measures and chaining inferences (Figure: an inference chain linking the observed pair "stocks rose" through the related words rise and skyrocket to the unseen pair "stocks skyrocketed")

  6. Random walks in the space of words • We propose to use a Markov Chain model in the state space of words, connected by different links according to their relatedness. • The target word dependency distribution is estimated as the stationary distribution of the Markov Chain. • This allows us to make inferences based on different relatedness concepts and to chain inferences together. • We automatically learn weights for the different information sources from data.
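As a minimal illustration of reading off a target distribution from a chain's limiting behaviour, the sketch below (plain NumPy, not code from the paper) approximates the stationary distribution of a small Markov chain by power iteration; the transition matrix and the word labels in the comment are invented for the example.

```python
import numpy as np

def stationary_distribution(T, tol=1e-10, max_iter=10_000):
    """Approximate the stationary distribution of a Markov chain by
    power iteration. T[i, j] is the probability of moving from state i
    to state j; each row of T must sum to 1."""
    n = T.shape[0]
    pi = np.full(n, 1.0 / n)      # start from the uniform distribution
    for _ in range(max_iter):
        nxt = pi @ T              # one step of the chain
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return nxt

# Toy 3-state chain; the states could stand for e.g. rise, climb, skyrocket.
T = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.7, 0.1],
              [0.3, 0.3, 0.4]])
print(stationary_distribution(T))
```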

  7. Related work: PageRank • A Markov Chain with transition probabilities given by the Web graph and a manually tuned probability ε of resetting to the uniform distribution • With probability 1 − ε, choose a link uniformly at random from the out-links of the current page • With probability ε, pick a web page from the entire collection uniformly at random
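A hedged sketch of the PageRank-style walk just described, with the reset probability written as eps; the link matrix is assumed to be a row-stochastic matrix of out-link probabilities rather than real Web data.

```python
import numpy as np

def pagerank(link_matrix, eps=0.15, n_steps=100):
    """PageRank-style walk: with probability 1 - eps follow an out-link
    of the current page, with probability eps jump to a page chosen
    uniformly at random. Rows of link_matrix are out-link distributions."""
    n = link_matrix.shape[0]
    uniform = np.full((n, n), 1.0 / n)
    T = (1.0 - eps) * link_matrix + eps * uniform   # combined transition matrix
    pi = np.full(n, 1.0 / n)
    for _ in range(n_steps):
        pi = pi @ T                                 # iterate the walk
    return pi
```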

  8. This work • In our work, the Markov Chain transition matrix is defined using different "link types". • The probability of following each link type is learned from data. • This allows for a richer class of models incorporating multiple knowledge sources.

  9. Word dependency probabilities for structural ambiguities • Structural ambiguities: He broke [the window] [with a hammer] vs. He broke [the window] [with the white curtains] • Good probability estimates of P(hammer | broke, with) and P(curtains | window, with) will help with disambiguation

  10. The problem: prepositional phrase attachment • Attachment of a prepositional phrase introduced by a preposition such as with, of, in, for, etc. • Binary decision – attach to the preceding noun phrase or to the verb phrase • We represent examples using only the head words (the main words of each phrase): She [broke] [the window] [with [a hammer]] gives V=broke, N1=window, P=with, N2=hammer • The class variable is the attachment, Att=va or Att=na • We learn a generative model P(V, N1, P, N2, Att)
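To make the decision rule concrete, here is a small sketch (not the paper's implementation) of how a generative model over (V, N1, P, N2, Att) can be used to pick an attachment; joint_prob is a hypothetical callable standing in for whatever smoothed estimate of the joint probability the model provides.

```python
def choose_attachment(v, n1, p, n2, joint_prob):
    """Pick the attachment with the higher probability under a
    generative model P(V, N1, P, N2, Att). joint_prob(v, n1, p, n2, att)
    is assumed to return that joint probability."""
    p_va = joint_prob(v, n1, p, n2, "va")   # attach the PP to the verb
    p_na = joint_prob(v, n1, p, n2, "na")   # attach the PP to the noun
    return "va" if p_va >= p_na else "na"

# e.g. choose_attachment("broke", "window", "with", "hammer", joint_prob)
```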

  11. Sparseness of distributions • Example dependencies: returning rates to normal (va), pushing rate of inflation (na) • We will learn Markov Chains whose stationary distributions are good estimates of these word dependency distributions.

  12. Basic form of learned Markov Chains • Example: v:hang n1:painting p:with n2:peg; we estimate P(n2=peg | v=hang, p=with, va), written P_with,va(n2=peg | v=hang) • States are colored to show whether the word is a head or a dependent, e.g. hang is a head and peg is a dependent in a with-type dependency • Links among states represent different kinds of relatedness • The initial state distribution places probability 1 on hang (Figure: a state space containing hang, fasten, hook, hooks, ...)

  13. An example Markov Chain: P_with(n2:peg | v:hang) • We include the empirical P(head | dependent) and P(dependent | head) distributions as link types (Figure: empirical-distribution links among the states hang, fasten, peg, pegs, rivet, hook, hooks)

  14. An example Markov Chain: P_with(n2:peg | v:hang) • We have seen pegs as a dependent of hang in the training corpus • peg and pegs have the same root form • So the probability of peg given hang will be high

  15. An example Markov Chain: P_with(n2:peg | v:hang) • The empirical probability of rivet given hang is high • rivet and peg are semantically related • The probability of peg given hang will again be high

  16. An example Markov Chain: P_with(n2:peg | v:hang) • (Figure: another chained path through the states hang, fasten, rivet, hook, hooks, peg) • The probability of peg given hang will again be high
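The toy sketch below mimics the chained inference of slides 13–16: a first step follows an empirical P(dependent | head) link out of hang, a second step follows a morphology or semantic-similarity link, and probability mass ends up on peg even though the pair was never observed directly. All link sets and numbers here are invented for illustration.

```python
# Hand-made toy links (not learned from any corpus).
empirical = {"hang": {"pegs": 0.5, "rivet": 0.5}}       # seen as dependents of hang
related   = {"pegs": {"peg": 1.0},                      # same root form
             "rivet": {"peg": 0.6, "rivet": 0.4}}       # semantically related words

def two_step_prob(head, dependent):
    """P(dependent | head) after a walk of exactly two link steps."""
    total = 0.0
    for mid, p1 in empirical.get(head, {}).items():
        total += p1 * related.get(mid, {}).get(dependent, 0.0)
    return total

print(two_step_prob("hang", "peg"))   # 0.5*1.0 + 0.5*0.6 = 0.8
```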

  17. Form of learned transition distributions • A list of link types is given. • Each link type is a transition distribution from states to states and is given by a matrix T_i. • The transition probabilities of the learned MC are given by a mixture of the link matrices, P(s' | s) = Σ_i λ_i T_i(s, s'), where the link weights λ_i ≥ 0 sum to one and are learned from data.

  18. Training the model • The model is trained by maximizing the conditional log-likelihood of a set of development samples. • For computational reasons, we fix the maximum number of steps in the Markov Chain to a small number d, called the maximum degree of the walk. • The stopping probability is another trainable parameter of the MC.
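A sketch of this model form under the assumptions stated above: the transition matrix is a learned mixture of link matrices, the walk stops with a trainable probability at each step, and it is truncated after d steps. The exact parameterization used in the paper may differ in detail; this is illustrative only.

```python
import numpy as np

def walk_distribution(link_matrices, lambdas, start, d, p_stop):
    """Distribution over the state at which a walk stops. The walk
    starts from the distribution `start`, at each step stops with
    probability p_stop and otherwise follows the mixture transition
    matrix sum_i lambdas[i] * link_matrices[i], and is forced to stop
    after at most d steps."""
    T = sum(l * M for l, M in zip(lambdas, link_matrices))  # mixture of link types
    current = start.astype(float)     # probability mass still walking
    stopped = np.zeros_like(current)  # probability mass that has stopped
    for _ in range(d):
        stopped += p_stop * current
        current = (1.0 - p_stop) * (current @ T)
    return stopped + current          # remaining mass stops at step d
```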

  19. Evaluation data • Four-tuples of words and the correct attachment, extracted from the WSJ part of the Penn Treebank • 20,801 training samples, 4,039 development samples, and 3,097 test samples • Average human accuracy of 88.2% given only the four words (Ratnaparkhi 94) • In many cases the meaning is pretty much the same either way, e.g. They built a house in London • In other cases it is difficult to decide from the head words only (93% human accuracy if the whole sentence is seen) • Simple baselines: 59% accuracy for always choosing noun attachment, 72.2% for choosing the most common attachment for each preposition

  20. Baseline MC model • Only includes empirical distributions from the training corpus: links from head states to dependent states, consisting of an empirical-distribution link and back-off links, with the corresponding transition probabilities • A random walk with these links is the same as the familiar linear interpolation models
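For concreteness, a small sketch of linear interpolation over back-off levels, which is what a walk over such back-off links computes; the particular back-off chain and the weights shown are hypothetical.

```python
def interpolated_prob(word, distributions, weights):
    """Linear interpolation of empirical distributions at several
    back-off levels; `weights` are assumed to sum to 1."""
    return sum(w * dist.get(word, 0.0)
               for w, dist in zip(weights, distributions))

# Hypothetical back-off chain for P(n2 = "hammer" | v = "broke", p = "with"):
#   p_full    = empirical P(n2 | v, p)
#   p_prep    = empirical P(n2 | p)
#   p_unigram = empirical P(n2)
# interpolated_prob("hammer", [p_full, p_prep, p_unigram], [0.6, 0.3, 0.1])
```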

  21. Accuracy of Baseline • Collins & Brooks 95 (1-NN): 84.15% • Abney et al. 99 (boosting): 84.40% • Vanschoenwinkel et al. 03 (SVM): 84.80% • Baseline: 85.85% • Human: 88.20%

  22. Adding morphology and WordNet synonyms • Morphology links for verbs and nouns – the probability of transitioning to a word with the same root form is proportional to the smoothed empirical count of that word • Synonym links for verbs and nouns – uniform probability of transitioning to any synonym with respect to the top three senses of the word in WordNet (Figure: morphology links such as hang–hangs and nail–nails; synonym links among words such as fasten and rivet)
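One plausible way to construct such synonym links with NLTK's WordNet interface is sketched below; the paper does not prescribe this tooling, and the uniform weighting over synonyms from the top three senses simply follows the slide's description.

```python
from nltk.corpus import wordnet as wn

def synonym_links(word, pos="v", n_senses=3):
    """Uniform transition distribution from `word` to the synonyms
    appearing in its top n_senses WordNet senses ('v' = verb, 'n' = noun)."""
    synonyms = set()
    for synset in wn.synsets(word, pos=pos)[:n_senses]:
        for lemma in synset.lemma_names():
            if lemma != word:
                synonyms.add(lemma)
    if not synonyms:
        return {}
    p = 1.0 / len(synonyms)
    return {syn: p for syn in synonyms}

# e.g. synonym_links("hang") assigns equal probability to each synonym found.
```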

  23. Accuracy using morphology and synonyms (maximum degree d=3) • Error reduction from the baseline, measured against the human upper bound, is 28.6%, but it is significant only at level 0.1

  24. Using unannotated data • Unannotated data is potentially useful for extracting information about words. • We use the BLLIP corpus, consisting of about 30 million words of automatically parsed text (Charniak 00). • We extract 567,582 noisy examples of the same form as the training set (called BLLIP-PP). • We use this data in two ways: • Create links using the empirical distribution in BLLIP-PP • Create distributional similarity links

  25. Distributional similarity links from BLLIP-PP • Distributional similarity of heads with respect to dependents and vice versa (Dagan 99, Lee 99) • Based on the Jensen-Shannon divergence of the empirical distributions: JS(p, q) = ½ KL(p ‖ m) + ½ KL(q ‖ m), where m = ½ (p + q) • Distributional similarity of heads with respect to the distribution over another variable, not involved in the dependency relation of interest
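A sketch of the Jensen-Shannon computation and of one possible way to turn it into a link weight; the exponential transform and its beta parameter are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    given as aligned NumPy arrays that each sum to 1."""
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                    # 0 * log(0) is taken to be 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def similarity_weight(p, q, beta=10.0):
    """One way to map a divergence to a similarity-link weight."""
    return np.exp(-beta * js_divergence(p, q))
```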

  26. Accuracy using BLLIP-PP (maximum degree d=3) • Error reduction from the baseline is 12%; measured against the human upper bound, it is 72%

  27. Summary • We built Markov Chain models defined using a base set of links. • This allowed inference based on different relatedness concepts and chaining inferences together. • We automatically learned weights for the different information sources. • Large improvements were possible on the prepositional phrase attachment task. • The methods may be applicable to learning random walks on the web (cf. PageRank/HITS).
