Learning Language from Distributional Evidence • Christopher Manning, Depts of CS and Linguistics, Stanford University • Workshop on Where Does Syntax Come From?, MIT, Oct 2007
There’s a lot to agree on! [C. D. Yang. 2004. UG, statistics or both? Trends in CogSci 8(10)] • Both endowment (priors/biases) and learning from experience contribute to language acquisition • “To be effective, a learning algorithm … must have an appropriate representation of the relevant … data.” • Languages, and hence models of language, have intricate structure, which must be modeled.
More points of agreement • Learning language structure requires priors or biases, to work at all, but especially at speed • Yang is “in favor of probabilistic learning mechanisms that may well be domain-general” • I am too. • Probabilistic methods (and especially Bayesian prior + likelihood methods) are perfect for this! • Probabilistic models can achieve and explain: • gradual learning and robustness in acquisition • non-homogeneous grammars of individuals • gradual language change over time • [and also other stuff, like online processing]
The disagreements are important, but two levels down • In his discussion of Saffran, Aslin, and Newport (1998), Yang contrasts “statistical learning” (of syllable bigram transition probabilities) with the use of UG principles such as one primary stress per word. • I would place both parts in the same probabilistic model • Stress is a cue for word learning, just as syllable transition probabilities are a cue (and it’s very subtle in speech!) • Probability theory is an effective means of combining multiple, often noisy, sources of information with prior beliefs • Yang keeps probabilities outside of the grammar, by suggesting that the child maintains a probability distribution over a collection of competing grammars • I would place the probabilities inside the grammar • That is more economical, explanatory, and effective.
The central questions • What representations are appropriate for human languages? • What biases are required to learn languages successfully? • Linguistically informed biases – but perhaps fairly general ones are enough • How much of language structure can be acquired from the linguistic input? • This gives a lower bound on how much is innate.
1. A mistaken meme: language as a homogeneous, discrete system • Joos (1950: 701–702): • “Ordinary mathematical techniques fall mostly into two classes, the continuous (e.g., the infinitesimal calculus) and the discrete or discontinuous (e.g., finite group theory). Now it will turn out that the mathematics called ‘linguistics’ belongs to the second class. It does not even make any compromise with continuity as statistics does, or infinite-group theory. Linguistics is a quantum mechanics in the most extreme sense. All continuities, all possibilities of infinitesimal gradation, are shoved outside of linguistics in one direction or the other.” • [cf. Chambers 1995]
The quest for homogeneity • Bloch (1948: 7): • “The totality of the possible utterances of one speaker at one time in using a language to interact with one other speaker is an idiolect. … The phrase ‘with one other speaker’ is intended to exclude the possibility that an idiolect might embrace more than one style of speaking.” • Sapir (1921: 147) • “Everyone knows that language is variable”
Variation is everywhere • The definition of an idiolect fails, as variation occurs even within the usage of a single speaker in one style. • As least as: • Black voters also turned out at least as well as they did in 1996, if not better in some regions, including the South, according to exit polls. Gore was doing as least as well among black voters as President Clinton did that year. • (Associated Press, 2000)
Linguistic Facts vs. Linguistic Theories • Weinreich, Labov and Herzog (1968) see 20th century linguistics as having gone astray by mistakenly searching for homogeneity in language, on the misguided assumption that only homogeneous systems can be structured • Probability theory provides a method for describing structure in variable systems!
The need for probability models inside syntax The motivation comes from two sides: • Categorical linguistic theories claim too much: • They place a hard categorical boundary of grammaticality, where really there is a fuzzy edge, determined by many conflicting constraints and issues of conventionality vs. human creativity • Categorical linguistic theories explain too little: • They say nothing at all about the soft constraints which explain how people choose to say things • Something that language educators, computational NLP people – and historical linguists and sociolinguists dealing with real language – usually want to know about
Clausal argument subcategorization frames • Problem: in context, language is used more flexibly than categorical constraints suggest • E.g., most subcategorization frame ‘facts’ are wrong. Pollard and Sag (1994), inter alia, on regard vs. consider: • *We regard Kim to be an acceptable candidate • We regard Kim as an acceptable candidate • But from The New York Times: • As 70 to 80 percent of the cost of blood tests, like prescriptions, is paid for by the state, neither physicians nor patients regard expense to be a consideration. • Conservatives argue that the Bible regards homosexuality to be a sin. • And the same pattern repeats for many verbs and frames…
Probability mass functions: subcategorization of regard
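A probability mass function like the one this slide charts can be estimated directly from corpus counts. A minimal sketch: the frame labels and counts below are invented for illustration, not the actual data behind the slide.

```python
from collections import Counter

# Hypothetical counts of subcategorization frames observed with "regard"
# in some corpus; the numbers are invented for illustration.
frame_counts = Counter({
    "regard NP as NP": 80,
    "regard NP as AdjP": 30,
    "regard NP to-infinitive": 6,  # 'ungrammatical' by the books, but attested
    "regard NP NP": 2,
})

total = sum(frame_counts.values())
pmf = {frame: n / total for frame, n in frame_counts.items()}

# A pmf sums to 1; rare frames get small probability mass instead of a '*'.
assert abs(sum(pmf.values()) - 1.0) < 1e-9
print(max(pmf, key=pmf.get))  # → regard NP as NP
```

The point of the representation: the disfavored *regard NP to-infinitive* frame is not categorically excluded, it just sits in the low-probability tail.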
Bresnan and Nikitina (2003) on the Dative Alternation • Pinker (1981), Krifka (2001): verbs of instantaneous force allow dative alternation but not “verbs of continuous imparting of force” like push • “As Player A pushed him the chips, all hell broke loose at the table.” • www.cardplayer.com/?sec=afeature&art id=165 • Pinker (1981), Levin (1993), Krifka (2001): verbs of instrument of communication allow dative shift but not verbs of manner of speaking • “Hi baby.” Wade says as he stretches. You just mumble him an answer. You were comfy on that soft leather couch. Besides … • www.nsyncbitches.com/thunder/fic/break.htm • In context such usages are unremarkable! • It’s just the productivity and context-dependence of language • Examples are rare → because these are gradient constraints. Here data is gathered from a really huge corpus: the web.
The disappearing hard constraints of categorical grammars • We see the same thing over and over • Another example is constraints on Heavy NP Shift [cf. Wasow 2002] • You start with a strong categorical theory, which is mostly right • People point out exceptions and counterexamples • You either weaken it in the face of counterexamples • Or you exclude the examples from consideration • Either way you end up without an interesting theory • There’s little point in aiming for explanatory adequacy when the descriptive adequacy of the representations used [as opposed to particular descriptions] just isn’t there. • There is insight in the probability distributions!
Explaining more:What do people say? • What people say has two parts: • Contingent facts about the world • People have been talking a lot about Iraq lately • The way speakers choose to express ideas using the resources of the language • People don’t often put that clauses pre-verbally: • It appears almost certain that we will have to take a loss • That we will have to take a loss appears almost certain • The latter is properly part of people’s Knowledge of Language. Part of syntax.
Variation is part of competence[Labov 1972: 125] • “The variable rules themselves require at so many points the recognition of grammatical categories, of distinctions between grammatical boundaries, and are so closely interwoven with basic categorical rules, that it is hard to see what would be gained by extracting a grain of performance from this complex system. It is evident that [both the categorical and the variable rules proposed] are a part of the speaker’s knowledge of language.”
What do people say? • Simply delimiting a set of grammatical sentences provides only a very weak description of a language, and of the ways people choose to express ideas in it • Probability densities over sentences and sentence structures can give a much richer view of language structure and use • In particular, we find that (apparently) categorical constraints in one language often reappear as the same soft generalizations and tendencies in other languages • [Givón 1979, Bresnan, Dingare, and Manning 2001] • Linguistic theory should be able to uniformly capture these constraints, rather than only recognizing them when they are categorical
Explaining more: what determines ditransitive vs. NP PP for dative verbs [Bresnan, Cueni, Nikitina, and Baayen 2005] • Build a mixed effects [logistic regression] model over a corpus of examples • The model is able to pull apart the correlations between various predictive variables • Explanatory variables: • Discourse accessibility, definiteness, pronominality, animacy (Thompson 1990, Collins 1995) • Differential length in words of recipient and theme (Arnold et al. 2000, Wasow 2002, Szmrecsanyi 2004b) • Structural parallelism in dialogue (Weiner and Labov 1983, Bock 1986, Szmrecsanyi 2004a) • Number, person (Aissen 1999, 2003; Haspelmath 2004; Bresnan and Nikitina 2003) • Concreteness of theme • Broad semantic class of verb (transfer, prevent, communicate, …)
Explaining more: what determines ditransitive vs. NP PP for dative verbs [Bresnan, Cueni, Nikitina, and Baayen 2005] • What does one learn? • Almost all the predictive variables have independent significant effects • Only a couple fail to: e.g., number of recipient • Shows that reductionist theories of the phenomenon that try to reduce things to one or two factors are wrong • The first object NP is preferred to be: • Given, animate, definite, pronominal, shorter • The model predicts whether to use a double object or NP PP construction correctly 94% of the time • It captures much of what is going on in this choice • These factors exceed in importance the differences due to individual variation
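The modeling recipe above can be sketched end to end: fit a logistic regression by gradient ascent on synthetic data whose two predictors loosely mirror recipient pronominality and theme/recipient length difference. Everything below (the data generator, the weights) is invented for illustration; it is not the Bresnan et al. model or corpus.

```python
import math
import random

random.seed(0)

# Synthetic instances (invented): each is
# (recipient_is_pronoun, length(theme) - length(recipient), chose double object)
def make_instance():
    pron = random.random() < 0.5
    len_diff = random.gauss(2.0 if pron else -1.0, 1.0)
    # Pronominal and relatively short recipients favor the double-object frame.
    score = 1.5 * pron + 0.8 * len_diff - 0.5
    label = 1 if random.random() < 1 / (1 + math.exp(-score)) else 0
    return (1.0 if pron else 0.0, len_diff, label)

data = [make_instance() for _ in range(2000)]

# Fit a logistic regression by plain batch gradient ascent on log-likelihood.
w = [0.0, 0.0, 0.0]  # bias, pronominality, length difference
for _ in range(500):
    grad = [0.0, 0.0, 0.0]
    for pron, diff, y in data:
        p = 1 / (1 + math.exp(-(w[0] + w[1] * pron + w[2] * diff)))
        for j, x in enumerate((1.0, pron, diff)):
            grad[j] += (y - p) * x
    w = [wi + 0.0002 * g for wi, g in zip(w, grad)]

# Training accuracy of the fitted model (prediction threshold 0.5).
acc = sum(
    (1 / (1 + math.exp(-(w[0] + w[1] * pr + w[2] * d))) > 0.5) == (y == 1)
    for pr, d, y in data
) / len(data)
print(round(acc, 2))
```

Both fitted coefficients come out positive, mirroring the slide's point: each predictor has an independent effect, and together they predict the construction choice well above chance.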
Statistical parsing models also give insight into processing [Levy 2004] • A realistic model of human sentence processing must explain: • Robustness to arbitrary input, accurate disambiguation • Inference on the basis of incomplete input [Tanenhaus et al. 1995, Altmann and Kamide 1999, Kaiser and Trueswell 2004] • Sentence processing difficulty is differential and localized • On the traditional view, resource limitations, especially memory, drive processing difficulty • Locality-driven processing [Gibson 1998, 2000]: multiple and/or more distant dependencies are harder to process • Easy: the reporter who attacked the senator • Hard: the reporter who the senator attacked
Expectation-driven processing [Hale 2001, Levy 2005] • Alternative paradigm: expectation-based models of syntactic processing • Expectations are weighted averages over probabilities • Structures we expect are easy to process • Modern computational linguistics techniques of statistical parsing provide a precise psycholinguistic model • The model matches the empirical results of many recent experiments better than traditional memory-limitation models
Example: Verb-final domains – locality predictions and empirical results • [Konieczny 2000] looked at reading times at German final verbs • Locality-based models (Gibson 1998) predict difficulty for longer clauses • But Konieczny found that final verbs were read faster in longer clauses • Er hat die Gruppe geführt (He led the group) – predicted easy, read slow • Er hat die Gruppe auf den Berg geführt (He led the group to the mountain) – predicted hard, read fast • Er hat die Gruppe auf den sehr schönen Berg geführt (He led the group to the very beautiful mountain) – predicted hard, read fastest
Deriving Konieczny’s results • Seeing more = having more information • More information = more accurate expectations • After Er hat die Gruppe auf den Berg …, what comes next? An NP? A goal PP? A locative PP? The verb? An ADVP? • Once we’ve seen a goal PP we’re unlikely to see another • So the expectation of the remaining possibilities, including the verb, goes up • Rigorously tested: word probabilities P(wi | w1 … wi−1) computed from a PCFG derived empirically from a syntactically annotated corpus of German (the NEGRA treebank)
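The expectation-sharpening argument can be made concrete with surprisal, −log P(next | context): the less surprising the verb, the faster it should be read. The toy conditional distributions below are invented for illustration; Levy's model derives the real ones from a NEGRA-trained PCFG.

```python
import math

# Toy conditional distributions over what comes next in a German verb-final
# clause (probabilities invented; not derived from the NEGRA treebank).
# After "Er hat die Gruppe": a goal PP, a locative PP, and the verb compete.
after_np = {"PP-goal": 0.35, "PP-loc": 0.25, "verb": 0.25, "other": 0.15}
# After "Er hat die Gruppe auf den Berg": a second goal PP is unlikely,
# so expectation concentrates on the verb.
after_pp = {"PP-goal": 0.05, "PP-loc": 0.10, "verb": 0.75, "other": 0.10}

def surprisal(dist, event):
    """Surprisal in bits: -log2 P(event | context)."""
    return -math.log2(dist[event])

# The verb is *less* surprising, hence predicted faster, in the longer clause.
assert surprisal(after_pp, "verb") < surprisal(after_np, "verb")
print(round(surprisal(after_np, "verb"), 2),
      round(surprisal(after_pp, "verb"), 2))  # → 2.0 0.42
```

This is exactly the reversal of the locality prediction: extra material before the verb lowers the verb's surprisal rather than raising its cost.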
Predictions from the Hale 2001 / Levy 2004 model • Er hat die Gruppe (auf den (sehr schönen) Berg) geführt • [Chart: locality-based difficulty at the final verb on an ordinal 1–3 scale, against the model’s predictions] • Locality-based models (e.g., Gibson 1998, 2000) would violate monotonicity
2. & 3. Learning sentence structure from distributional evidence • Start with raw language, learn syntactic structure • Some have argued that learning syntax from positive data alone is impossible: • Gold, 1967: non-identifiability in the limit • Chomsky, 1980: the poverty of the stimulus • Many others have felt it should be possible: • Lari and Young, 1990 • Carroll and Charniak, 1992 • Alex Clark, 2001 • Mark Paskin, 2001 • … but it is a hard problem
Language learning idea 1: Lexical affinity models • Words select other words on syntactic grounds • Link up pairs with high mutual information • [Yuret, 1998]: greedy linkage • [Paskin, 2001]: iterative re-estimation with EM • Evaluation: compare linked pairs (undirected) to a gold standard • Example: congress narrowly passed the amended bill • Accuracy: Paskin, 2001 – 39.7; Random – 41.7
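The linking criterion can be sketched directly: estimate pointwise mutual information between co-occurring word pairs and link the high-PMI pairs. The three-sentence corpus below is invented for illustration; the real models estimate these statistics from large corpora.

```python
import math
from collections import Counter

# Tiny toy corpus (invented for illustration).
sents = [
    "congress narrowly passed the amended bill".split(),
    "congress passed the bill".split(),
    "the senate passed the amended bill".split(),
]

word_n = Counter(w for s in sents for w in s)
pair_n = Counter()
for s in sents:
    for i, a in enumerate(s):
        for b in s[i + 1:]:
            pair_n[frozenset((a, b))] += 1  # unordered co-occurring pairs

n_words = sum(word_n.values())
n_pairs = sum(pair_n.values())

def pmi(a, b):
    """Pointwise mutual information (bits) of an unordered word pair."""
    p_ab = pair_n[frozenset((a, b))] / n_pairs
    return math.log2(p_ab / ((word_n[a] / n_words) * (word_n[b] / n_words)))

# "amended" and "bill" co-occur in every sentence containing "amended":
print(round(pmi("amended", "bill"), 2))
```

As the next slides point out, a high PMI value flags association but not necessarily syntactic selection, which is what motivates moving to word classes.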
Idea: Word classes • Mutual information between words does not necessarily indicate syntactic selection • cf. expect brushbacks but no beanballs: individual words like brushbacks are entwined with semantic facts about the world • Syntactic classes, like NOUN and ADVERB, are bleached of word-specific semantics • We could build dependency models over word classes [cf. Carroll and Charniak, 1992] • congress narrowly passed the amended bill → NOUN ADVERB VERB DET PARTICIPLE NOUN
Problems: Word class models • Too simple a model – doesn’t work much better supervised • No representation of valence/distance of arguments • E.g., NOUN NOUN VERB gives no way to distinguish the two dependency analyses of stock prices fell • Accuracy: Carroll and Charniak, 92 – 44.7; Random – 41.7; Adjacent Words – 53.2
Bias: Using better dependency representations [Klein and Manning 2004] • Model head–argument dependencies with valence and distance • Accuracy: Adjacent Words – 55.9; Klein/Manning (DMV) – 63.6
Idea: Can we learn phrase structure constituency via distributional clustering? [Finch and Chater 92, Schütze 93, many others] • E.g., from contexts in the president said that the downturn was over, words cluster into classes: {president, governor}, {the, a}, {said, reported}
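The clustering idea can be sketched as: represent each word by counts of its (left, right) contexts and compare words by cosine similarity. The toy corpus below is invented for illustration.

```python
import math
from collections import Counter

# Toy corpus (invented): words occurring in similar contexts group together.
sents = [
    "the president said that the downturn was over",
    "the governor said that the strike was over",
    "a president reported that the downturn was over",
    "a governor reported that growth was strong",
]

# contexts[w] counts the (left neighbor, right neighbor) pairs of w.
contexts = {}
for s in sents:
    toks = ["<s>"] + s.split() + ["</s>"]
    for i in range(1, len(toks) - 1):
        contexts.setdefault(toks[i], Counter())[(toks[i-1], toks[i+1])] += 1

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

# "president" and "governor" share contexts like (the, said) and
# (a, reported), so they are distributionally close; "said" is not.
assert cosine(contexts["president"], contexts["governor"]) > \
       cosine(contexts["president"], contexts["said"])
```

Scaled up, similarity structure like this is what clustering algorithms exploit to recover word classes without supervision.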
Distributional learning • There is much debate in the child language acquisition literature about distributional learning and the possibility of kids successfully using it: • [Maratsos and Chalkley 1980] suggest it • [Pinker 1984] suggests it’s impossible (too many correlations; too abstract properties needed) • [Redington et al. 1998] say it succeeds because there are dominant cues relevant to language • [Mintz et al. 2002] look at the distributional structure of the input • Speaking in terms of engineering, let me just tell you that it works really well! • It’s one of the most successful techniques that we have – my group uses it everywhere for NLP.
Idea: Distributional syntax? [Klein and Manning NIPS 2003] • Can we use distributional clustering for learning syntax? • Example: factory payrolls fell in september • Span fell in september – context payrolls __ • Span payrolls fell in – context factory __ sept
Problem: Identifying constituents • Distributional classes are easy to find… • e.g. clusters like {the final vote, two decades, most people}, {the final, the initial, two of the}, {of the, with a, without many}, {in the end, on time, for now}, {decided to, took most of, go with} • … but figuring out which of these sequences are constituents is hard • [Scatter plot: frequent sequences projected onto principal components 1 and 2, with constituents (NP, PP, VP: +) and distituents (−) marked]
A nested distributional model: the Constituent-Context Model (CCM) • Every span ⟨i,j⟩ of the sentence is labeled constituent (+) or distituent (−), and each label generates both the span’s yield and its context: • P(S|T) = ∏⟨i,j⟩ P(yield(i,j) | label of ⟨i,j⟩ in T) · P(context(i,j) | label of ⟨i,j⟩ in T) • E.g., for factory payrolls fell in september with fell in september a constituent, the + label generates both the yield fell in september and the context payrolls __
Initialization: A little UG? • [Charts: two initial distributions over bracketings – tree uniform vs. split uniform]
Results: Constituency • [Figure: CCM parse vs. treebank parse]
Combining the two models [Klein and Manning ACL 2004] • Evaluated on both constituency and dependency accuracy • For reference, supervised PCFG constituency recall is 92.8 • Qualitative improvements: subject-verb groups gone, modifier placement improved
Beyond surface syntax… [Levy and Manning, ACL 2004] • Syntactic category, parent, grandparent (subj vs. obj extraction; VP finiteness) • Head words (wanted vs. to vs. eat) • Presence of daughters (NP under S) • Syntactic path (Gildea and Jurafsky 2002): <SBAR,S,VP,S,VP> origin? • Plus: feature conjunctions, specialized features for expletive subject dislocations, passivizations, passing featural information properly through coordinations, etc., etc. • cf. Campbell (2004 ACL) – a lot of linguistic knowledge
Might we also learn a linking to argument structure distributionally? • Why it might be possible – instances of open: The bottle opened easily. She opened the door with a key. The door opened as they approached. Fortunately, the corkscrew opened the bottle. He opened the bottle with a corkscrew. She opened the bottle very carefully. This key opens the door of the cottage. He opened the door. • Word distributions: she, he – agent; door, bottle – patient; key, corkscrew – instrument • Allowed linkings: { agent=subject, patient=object }; { patient=subject }; { agent=subject, patient=object, instrument=obl_with }; { instrument=subject, patient=object }
Probabilistic model learning [Grenager and Manning 2006 EMNLP] • Given a set of observed verb instances, what are the most likely model parameters? • Use unsupervised learning in a structured probabilistic graphical model • A good application for EM! • E-step: compute conditional distributions over possible role vectors for each instance • M-step: a trivial (closed-form) computation • And we repeat • [Graphical model: a verb v (e.g. give) with a latent linking l such as { 0=subj, 1=obj2, 2=obj1 }, an ordering o, and per-argument variables si, ri, wi pairing syntactic positions (subj, np, obj1, obj2) with words (plunge, today, them, test)]
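The EM loop described above can be sketched on a deliberately simplified stand-in model: a mixture over latent linkings, each of which emits observed syntactic frames. This is not Grenager and Manning's full model; the frame strings, linking names, and counts below are all invented.

```python
import random

random.seed(0)

# Invented observations: each verb instance shows one syntactic frame.
frames = (["subj_obj2_obj1"] * 46) + (["subj_obj1_to"] * 19) + (["subj_obj1"] * 35)
frame_types = sorted(set(frames))

linkings = ["L_ditransitive", "L_to_pp"]  # hypothetical latent linkings

# Random soft initialization of P(frame | linking), then normalize.
emit = {l: {f: random.random() for f in frame_types} for l in linkings}
for l in linkings:
    z = sum(emit[l].values())
    emit[l] = {f: p / z for f, p in emit[l].items()}
prior = {l: 1.0 / len(linkings) for l in linkings}

for _ in range(50):
    # E-step: posterior over linkings for each observed instance.
    counts = {l: {f: 0.0 for f in frame_types} for l in linkings}
    for f in frames:
        z = sum(prior[l] * emit[l][f] for l in linkings)
        for l in linkings:
            counts[l][f] += prior[l] * emit[l][f] / z
    # M-step: closed-form re-estimation from expected counts.
    for l in linkings:
        total = sum(counts[l].values())
        prior[l] = total / len(frames)
        emit[l] = {f: c / total for f, c in counts[l].items()}

print({l: round(p, 2) for l, p in prior.items()})
```

The structure is the same as in the full model: the E-step distributes each instance softly over latent analyses, and the M-step re-estimates multinomial parameters from the expected counts.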
Semantic role induction results • The model achieves some traction, but it’s hard • Learning becomes harder with greater abstraction • This is the right research frontier to explore! • Verb give – linkings: {0=subj, 1=obj2, 2=obj1} 0.46; {0=subj, 1=obj1, 2=to} 0.19; {0=subj, 1=obj1} 0.05; … • Verb pay – linkings: {0=subj, 1=obj1} 0.32; {0=subj, 1=obj1, 2=for} 0.21; {0=subj} 0.07; {0=subj, 1=obj1, 2=to} 0.05; {0=subj, 1=obj2, 2=obj1} 0.05; … • Roles for give: 0 – it, he, bill, they, that, …; 1 – power, right, stake, …; 2 – them, it, him, dept., … • Roles for pay: 0 – it, they, company, he, …; 1 – $, bill, price, tax, …; 2 – stake, gov., share, amt., …
Conclusions • Probabilistic models give precise descriptions of a variable, uncertain world • There are many phenomena in syntax that cry out for non-categorical or probabilistic representations of language • Probabilistic models can – and should – be used over rich linguistic representations • They support effective learning and processing • Language learning does require biases or priors • But a lot more can be learned from modest amounts of input than people have thought • There’s not much evidence of a poverty of the stimulus preventing such models being used in acquisition.