Day 2: Pruning continued; begin competition models Roger Levy University of Edinburgh & University of California – San Diego
Today • Concept from probability theory: marginalization • Complete Jurafsky 1996: modeling online data • Begin competition models
Marginalization • In many cases, a joint probability distribution will be more “basic” than the raw distribution of any member variable • Imagine two dice X and Y with a weak spring attached • No independence → the joint distribution is the more basic object • The resulting distribution over Y alone, P(Y) = Σ_x P(X = x, Y), is known as the marginal distribution • Calculating P(Y) this way is called marginalizing over X
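A minimal sketch of marginalization in Python (the spring-coupled joint distribution below is invented for illustration: outcomes where the two faces are close get extra weight, so X and Y are not independent):

```python
import itertools

# Joint distribution over two spring-coupled dice X and Y: outcomes whose
# faces are close together get extra weight, so X and Y are NOT independent
weights = {(x, y): 1.0 / (1 + abs(x - y))
           for x, y in itertools.product(range(1, 7), repeat=2)}
total = sum(weights.values())
joint = {xy: w / total for xy, w in weights.items()}   # P(X, Y), sums to 1

# Marginalizing over X: P(Y = y) = sum_x P(X = x, Y = y)
marginal_Y = {y: sum(joint[(x, y)] for x in range(1, 7)) for y in range(1, 7)}

print(marginal_Y)                # the marginal distribution of Y
print(sum(marginal_Y.values()))  # 1.0 (up to rounding): a proper distribution
```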
Today • Concept from probability theory: marginalization • Complete Jurafsky 1996: modeling online data • Begin competition models
Modeling online parsing • Does this sentence make sense? The complex houses married and single students and their families. • How about this one? The warehouse fires a dozen employees each year. • And this one? The warehouse fires destroyed all the buildings. • fires can be either a noun or a verb. So can houses: [NP The complex] [VP houses married and single students…]. • These are garden path sentences • Originally taken as some of the strongest evidence for serial processing by the human parser (Frazier and Rayner 1987)
Limited parallel parsing • Full-serial: keep only one incremental interpretation • Full-parallel: keep all incremental interpretations • Limited parallel: keep some but not all interpretations • In a limited parallel model, garden-path effects can arise from the discarding of a needed interpretation • kept: [S [NP The complex houses …] …] • discarded: [S [NP The complex] [VP houses…] …]
Modeling online parsing: garden paths • Pruning strategy for limited ranked-parallel processing • Each incremental analysis is ranked • Analyses falling below a threshold are discarded • In this framework, a model must characterize • The incremental analyses • The threshold for pruning • Jurafsky 1996: partial context-free parses as analyses • Probability ratio as pruning threshold • Ratio defined as P(I) : P(I_best) • (Gibson 1991: complexity ratio for pruning threshold)
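A minimal sketch of this pruning strategy in Python, assuming probabilities have already been assigned to each incremental analysis (the beam contents, probabilities, and threshold value below are hypothetical):

```python
def prune(ranked_analyses, ratio_threshold):
    """Beam pruning: discard any analysis I whose probability ratio
    P(I) : P(I_best) falls below 1 : ratio_threshold."""
    best = max(p for _, p in ranked_analyses)
    return [(analysis, p) for analysis, p in ranked_analyses
            if p > 0 and best / p <= ratio_threshold]

# Hypothetical incremental analyses of "the complex houses ..."
beam = [("houses as noun: [NP the complex houses ...]", 2.67e-3),
        ("houses as verb: [NP the complex][VP houses ...]", 1.0e-5)]
print(prune(beam, ratio_threshold=100.0))  # verb analysis discarded (ratio 267:1)
```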
Garden path models 1: N/V ambiguity • Each analysis is a partial PCFG tree • Tree prefix probability is used to rank analyses • Partial rule probabilities marginalize over rule completions (for nodes still undergoing expansion)* • *implications for the granularity of structural analysis
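A toy illustration of how a partial rule's probability marginalizes over its possible completions (the grammar fragment is invented, except that 0.92 echoes the P(S → NP …) value cited two slides below):

```python
# Toy PCFG: probabilities of the expansions of S (illustrative numbers,
# except 0.92, which echoes P(S -> NP ...) as cited on a later slide)
rules = {
    ("S", ("NP", "VP")):     0.92,
    ("S", ("VP",)):          0.05,
    ("S", ("S", "CC", "S")): 0.03,
}

def partial_rule_prob(lhs, daughters_so_far, rules):
    """P(LHS -> daughters_so_far ...): sum over every rule whose
    right-hand side begins with the daughters expanded so far."""
    k = len(daughters_so_far)
    return sum(p for (l, rhs), p in rules.items()
               if l == lhs and rhs[:k] == daughters_so_far)

print(partial_rule_prob("S", ("NP",), rules))  # 0.92: only S -> NP VP matches
```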
N/V ambiguity (2) • Partial CF tree analysis of the complex houses… • Analysis of houses as a verb has much lower probability than analysis as a noun (> 250 : 1) • Hypothesis: the low-ranking alternative is discarded
N/V ambiguity (3) • Note that the top-down vs. bottom-up question is immediately implicated by this theory • Jurafsky includes the cost of generating the initial NP under the S • of course, it’s a small cost, since P(S → NP …) = 0.92 • If parsing were bottom-up, that cost would not yet have been explicitly calculated
Garden path models II • The most famous garden paths: reduced relative clauses (RRCs) versus main clauses (MCs) • From the valence + simple-constituency perspective, the MC and RRC analyses of The horse raced past the barn fell differ in two places: • constituency: p ≈ 1 (MC) vs. p = 0.14 (RRC) • valence: best, intransitive p = 0.92 (MC) vs. transitive p = 0.08 (RRC)
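Multiplying the constituency and valence probabilities for each analysis gives the ratio cited on the next slide (a worked check using the values above):

```latex
\frac{P(\text{MC})}{P(\text{RRC})}
  = \frac{1 \times 0.92}{0.14 \times 0.08}
  = \frac{0.92}{0.0112}
  \approx 82
```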
Garden path models II (2) • An 82 : 1 probability ratio means that the lower-probability analysis is discarded • In contrast, some RRCs do not induce garden paths: The bird found in the room died.* • Here, found is preferentially transitive (0.62) • As a result, the probability ratio is much closer (≈ 4 : 1) • Conclusion within the pruning theory: the beam threshold is between 4 : 1 and 82 : 1 • (granularity issue: when exactly does the probability cost of valence get paid? cf. the complex houses) • *note also that Jurafsky does not treat found as having POS ambiguity
Notes on the probabilistic model • Jurafsky 1996 is a product-of-experts (PoE) model • Expert 1: the constituency model • Expert 2: the valence model • PoEs are flexible and easy to define, but… • the Jurafsky 1996 model is actually deficient (it loses probability mass) due to relative frequency estimation
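A toy numeric illustration of the deficiency point (all numbers invented): multiplying two separately normalized experts, with no renormalization, leaves total probability mass below 1 over the shared analysis space:

```python
# Two experts, each a proper distribution over the same two analyses
constituency = {"MC": 0.9, "RRC": 0.1}
valence      = {"MC": 0.9, "RRC": 0.1}

# Product-of-experts score for each analysis, with no renormalization
# (as when each expert is estimated separately by relative frequency)
poe = {a: constituency[a] * valence[a] for a in constituency}

print(poe)                # {'MC': 0.81, 'RRC': 0.01}
print(sum(poe.values()))  # 0.82 < 1: probability mass has been lost
```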
Notes on the probabilistic model (2) • Jurafsky 1996 predated most work on lexicalized parsers (Collins 1999, Charniak 1997) • In a generative lexicalized parser, valence and constituency are often combined through decomposition & Markov assumptions, e.g., P(RHS | LHS, head) = P(H | LHS, head) × Π_i P(D_i | LHS, H, head), sometimes approximated with Markov assumptions over the daughters D_i • The use of decomposition makes it easy to learn non-deficient models
Jurafsky 1996 & pruning: main points • Syntactic comprehension is probabilistic • Offline preferences explained by syntactic + valence probabilities • Online garden-path results explained by same model, when beam search/pruning is assumed
General issues • What is the granularity of incremental analysis? • In [NP the complex houses], complex could be an adjective (= the houses are complex) • complex could also be a noun (= the houses of the complex) • Should these be distinguished, or combined? • When does the valence probability cost get paid? • What is the criterion for abandoning an analysis? • Should the number of maintained analyses affect processing difficulty as well?
Today • Concept from probability theory: marginalization • Complete Jurafsky 1996: modeling online data • Begin competition models
General idea • Disambiguation: when different syntactic alternatives are available for a given partial input, each alternative receives support from multiple probabilistic information sources • Competition: the different alternatives compete with each other until one wins, and the duration of competition determines processing difficulty
Origins of competition models • Parallel competition models of syntactic processing have their roots in lexical access research • Initial question: process of word recognition • are all meanings of a word simultaneously accessed? • or are only some (or one) meanings accessed? • Parallel vs. serial question, for lexical access
Origins of competition models (2) • Testing access models: priming studies show that subordinate (= less frequent) meanings are accessed as well as dominant (= more frequent) meanings • Also, lexical decision studies show that more frequent meanings are accessed more quickly
Origins of competition models (3) • Lexical ambiguity in reading: does the amount of time spent on a word reflect its degree of ambiguity? • Readers spend more time reading equibiased ambiguous words than non-equibiased ambiguous words (eye-tracking studies) • Different meanings compete with each other: Of course the pitcher was often forgotten… (pitcher = ballplayer? container?) • Rayner and Duffy (1986); Duffy, Morris, and Rayner (1988)
Competition in syntactic processing • Can this idea of competition be applied to online syntactic comprehension? • If so, then multiple interpretations of a partial input should compete with one another and slow down reading • does this amount to increased difficulty of comprehension? • [compare with other types of difficulty, e.g., memory overload]
Constraint types • Configurational bias: MV vs. RR • Thematic fit (of the initial NP to the verb’s roles) • i.e., Plaus(verb, noun), ranging from … • Tense bias of the verb: simple past vs. past participle • i.e., P(past | verb)* • Support from the word by • i.e., P(MV | <verb, by>) [not conditioned on the specific verb] • That these factors can affect processing in the MV/RR ambiguity is motivated by a variety of previous studies (MacDonald et al. 1993, Burgess et al. 1993, Trueswell et al. 1994, cf. Ferreira & Clifton 1986, Trueswell 1996); a sketch of how such constraints might compete follows below • *technically not calculated this way, but this would be the rational reconstruction
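A minimal sketch of constraint competition, loosely in the spirit of normalized-recurrence models (McRae, Spivey-Knowlton & Tanenhaus 1998, which is not part of these slides); every weight, support value, update rule, and settling criterion below is invented for illustration:

```python
# Competition between two alternatives, MV (index 0) and RR (index 1),
# each supported by several weighted probabilistic constraints.

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

# constraint name -> (weight, [support for MV, support for RR]); all invented
constraints = {
    "configurational bias": (0.4, normalize([0.92, 0.08])),
    "thematic fit":         (0.2, normalize([0.30, 0.70])),
    "tense bias":           (0.2, normalize([0.60, 0.40])),
    "by-support":           (0.2, normalize([0.20, 0.80])),
}

def settle(constraints, criterion=0.95, max_cycles=1000):
    """Iterate until one alternative's activation passes the criterion;
    the number of cycles models the duration of competition."""
    for cycle in range(1, max_cycles + 1):
        # Each alternative integrates weighted support from all constraints
        act = [sum(w * s[a] for w, s in constraints.values()) for a in (0, 1)]
        if max(act) >= criterion:
            return act, cycle
        # Feedback: each constraint drifts toward the stronger alternative
        for name, (w, s) in constraints.items():
            constraints[name] = (w, normalize([s[a] + act[a] * w * s[a]
                                               for a in (0, 1)]))
    return act, max_cycles

activations, cycles = settle(dict(constraints))
print(activations, cycles)  # longer settling = more competition = more difficulty
```

On this view, processing difficulty falls out of the dynamics directly: the more evenly the constraints split their support between MV and RR, the more cycles the system needs to settle, mirroring the claim on the previous slides that the duration of competition determines difficulty.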