

  1. For the course Cognitive models of language and beyond, University of Amsterdam, February 28, 2011. Connectionist models of language, part II: The Hierarchical Prediction Network

  2. Recap: Problems of the Simple Recurrent Network • The SRN does not learn the kind of generalizations that are typical of language users, who substitute words or entire phrases in different sentence positions (I catch the train → I catch the last bus). • Such generalizations presuppose a representation involving phrasal syntactic categories and a neural mechanism for substitution, both of which are lacking in the SRN. • We need a connectionist network that can deal with the phrasal structure of language and that can learn (syntactic) categories. • It should be based on a neural theory of the acquisition of syntax.

  3. Is there a neural story behind the acquisition of a hierarchical syntax? [Figure: parse tree for "Jill reads a book": S → NP VP, VP → V NP, NP → Det N] A neural or connectionist account of phrase structure must cope with these questions: • What are the neural correlates of syntactic categories and phrase structure rules? • How are syntactic categories acquired, and how do they become abstract? • How do local trees combine to parse a sentence within a neural architecture? → the binding problem • How does the brain represent hierarchical constituent structure? This work develops a theory and a computational model of the neural instantiation of syntax.

  4. Conceiving of neural grammar in analogy to the visual hierarchy [Figure: hierarchy from phonemes to words to phrases (PP: in the house; NP: red carpet; VP: play game) up to the sentence "Jane plays a game in the house"] • Progressively larger receptive fields (spatial and temporal compression) • Progressively more complex, and more invariant, representations → If visual categories are hierarchically represented by neural assemblies, then why not syntactic categories?

  5. Parse trees in the visual domain: fragment-based hierarchical object recognition (Ullman, 2007) • Decomposition of the object results in a hierarchical object representation based on informative fragments. • Visual object recognition involves the construction of parse trees. • Rules of a `visual grammar' conjunctively bind simple categories into more complex categories (they do feature binding). • Visual categories become progressively more invariant. • Lacking: a temporal element and the ability for recursion.

  6. The Memory Prediction Framework (MPF) • The MPF is a neurally motivated algorithm for pattern recognition by the neocortex [Hawkins & Blakeslee, 2004]. • Cortical categories (columns) represent temporal sequences of patterns. • Categories become progressively more temporally compressed and invariant as one moves up the cortical hierarchy. • Hierarchical temporal compression allows top-down prediction and anticipation of future events. • Parallel: phrasal categories temporally compress word sequences (NP → Det Adj N).

  7. Syntactic categories are prototypical and graded • For a cognitive theory of syntax one must give up the notion that syntactic categories are discrete, set-theoretical entities. • Syntactic categories are prototypical (Lakoff, 1987): a nose is a more typical noun than a walk, or time. • Category membership is graded ("nouniness"), exhibits typicality effects, and defines a similarity metric (e.g. adverb and adjective are more similar than adverb and pronoun). • Children's categories are different from adult categories: there must be an incremental learning path from concrete, item-based categories to abstract adult categories. → Underlying the conventional syntactic categories is a continuous category space.

  8. A hypothesis for a neural theory of grammar • The language area of the cortex contains local neuronal assemblies that function as neural correlates of graded syntactic categories. • Hierarchical temporal compression of assemblies in higher cortical layers is responsible for phrase structure encoding [Memory Prediction Framework; Hawkins, 2004]. • The topological and hierarchical arrangement of the syntactic assemblies in the cortical hierarchy constitutes a grammar: grammar acquisition amounts to learning a topology. • Assemblies can be dynamically and serially bound to each other, enabling (phrasal) substitution (and recursion).

  9. The Hierarchical Prediction Network (HPN): from theory to model. Features: • No labels, no fixed associations with non-terminals • Prototypical, graded categories in a continuous category space • Hierarchical temporal compression • Pointers stored in local memories • Dynamic, serial binding [Figure: HPN network fragment with compressor nodes corresponding to S, NP (Det Adj N) and VP]

  10. Syntactic categories are regions in a continuous "substitution space" in HPN [Figure: simple, lexical units (e.g. under, eat, tomato, the, happy) and complex units (X1, X2, X3) positioned in regions loosely corresponding to VP/verb, NP/noun and PP/preposition] • A node's position in substitution space defines its graded membership of one or more syntactic categories in a continuous space. • Substitutability is given by the topological distance in substitution space, and it is learned gradually. • Labels are whatever the linguist projects onto this space.

  11. Temporal integration in compressor nodes [Figure: a compressor node X1 with a root and slots (slot1, slot2, slot3) above a layer of lexical input nodes w1 … w8] • A compressor node fires after all its slots are activated in a specific order (temporal integration). • Binding happens via slots: the probability of binding a node to a slot is a function of the distance between their representations in substitution space (see the sketch below).
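A minimal sketch of the distance-based binding step, assuming node and slot representations are points in a low-dimensional substitution space and that the binding probability falls off with Euclidean distance. The dimensionality, the softmax form and the names (X1, slot indices) are illustrative assumptions, not the exact HPN formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 2  # dimensionality of substitution space (an assumption; not given on the slide)

# Lexical input nodes and compressor-node slots start at random positions.
nodes = {w: rng.normal(size=DIM) for w in ["Sue", "eats", "a", "tomato"]}
slots = {("X1", 1): rng.normal(size=DIM), ("X1", 2): rng.normal(size=DIM)}

def binding_probability(node_vec, target_slot, slots):
    """Probability of binding a node to a slot, decreasing with the distance
    between their representations in substitution space (here a softmax over
    negative Euclidean distances, normalised over the competing slots --
    one concrete choice among several possible ones)."""
    weights = {s: np.exp(-np.linalg.norm(node_vec - v)) for s, v in slots.items()}
    return weights[target_slot] / sum(weights.values())

p = binding_probability(nodes["tomato"], ("X1", 2), slots)
print(f"P(bind 'tomato' to slot 2 of X1) = {p:.3f}")
```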

  12. Parsing with HPN • A derivation in HPN is a connected trajectory through the nodes of the physical network that binds a set of nodes together through pointers stored in the slots. • A node's state is characterized by a start index j, an active slot i and a slot index n. [Figure: an HPN grammar with compressor nodes X, Y, Z and lexical nodes Sue, eats, a, tomato; the bindings for the derivation of "Sue eats a tomato" yield the bracketing (Sue (eats (a tomato)))]

  13. Left corner parsing [Figure: successive left-corner parse states for "John loves Mary" with the rules S → NP VP and VP → V NP] • One of many parse strategies, and a psychologically plausible one. • Combines bottom-up with top-down processing, and proceeds incrementally, from left to right. • The "left corner" is the first symbol on the right-hand side of a rule. Words bottom-up "enable" the application of rules, as the sketch below illustrates.
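To make the left-corner strategy concrete, here is a minimal recogniser sketch for the toy grammar in the figure (S → NP VP, VP → V NP). The rule table, lexicon and control loop are illustrative assumptions and not the HPN parser itself; they only show how a word enables a rule via its left corner, after which the rest of the rule is predicted top-down.

```python
# Toy grammar from the slide: the left corner of a rule is the first
# symbol on its right-hand side.
RULES = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "NP"]],
}
LEXICON = {"John": "NP", "Mary": "NP", "loves": "V"}

def lc_parse(goal, words):
    """Recognise `goal` at the front of `words`; return the leftover words,
    or None on failure. Bottom-up step: the next word's category is a left
    corner; top-down step: the remainder of the enabled rule is predicted."""
    if not words:
        return None
    cat, rest = LEXICON[words[0]], words[1:]
    return lc_project(cat, goal, rest)

def lc_project(cat, goal, words):
    if cat == goal:                      # the left corner completes the goal
        return words
    for lhs, rhss in RULES.items():
        for rhs in rhss:
            if rhs[0] == cat:            # cat enables the rule lhs -> rhs
                rest = words
                for sym in rhs[1:]:      # parse the predicted remainder
                    rest = lc_parse(sym, rest)
                    if rest is None:
                        break
                else:
                    result = lc_project(lhs, goal, rest)
                    if result is not None:
                        return result
    return None

print(lc_parse("S", ["John", "loves", "Mary"]) == [])   # True: sentence accepted
```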

  14. Substitution versus dynamic binding • In a symbolic grammar, substitution between categories depends on their labels, which are supposed to be innate. • In a connectionist model there are no labels: substitutability relations must be learned (as distances in the topology)! • Substitution is a mechanistic operation, corresponding to dynamic, serial binding [Roelfsema, 2006]. • A node binds to a slot by transmitting its identity (plus a state index) to the slot, which then stores this as a pointer (cf. short-term memory in pre-synaptic connections [Barak & Tsodyks, 2007]). • The stored pointers to bound nodes connect a trajectory through the network, from which a parse tree can be reconstructed → a distributed stack (sketched below).
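A minimal illustration of the "distributed stack" idea: each slot keeps only a pointer (node identity plus a state index) to whatever was bound to it, and the tree is read off afterwards by following those pointers. The class layout, the node names X1–X3 and the binding order are assumptions for illustration.

```python
# Each compressor node has numbered slots; a slot stores a pointer to the
# node (plus state index) that was dynamically bound to it during the
# derivation -- a stack distributed over local memories.
class Node:
    def __init__(self, name, n_slots=0):
        self.name = name
        self.slots = [None] * n_slots            # local short-term memories

    def bind(self, slot_index, other, state=0):
        self.slots[slot_index] = (other, state)  # store a pointer, not a copy

    def to_tree(self):
        if not self.slots:
            return self.name
        return (self.name, [ptr[0].to_tree() for ptr in self.slots])

# Derivation of "Sue eats a tomato" as a trajectory of bindings.
sue, eats, a, tomato = (Node(w) for w in ["Sue", "eats", "a", "tomato"])
np_node = Node("X1", 2); vp_node = Node("X2", 2); s_node = Node("X3", 2)
np_node.bind(0, a); np_node.bind(1, tomato)
vp_node.bind(0, eats); vp_node.bind(1, np_node)
s_node.bind(0, sue); s_node.bind(1, vp_node)

print(s_node.to_tree())
# ('X3', ['Sue', ('X2', ['eats', ('X1', ['a', 'tomato'])])])
```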

  15. Parse and learn cycle • Initialize random node and slot representations. • For every sentence in the training corpus: • Find the most probable parse using a left-corner chart parser. (The parse probability is the product of the binding probabilities, which are proportional to metric distances.) • Strengthen the bindings of the best parse by moving each bound node n and slot s closer to each other in substitution space: Δn = λs and Δs = λn. • As the topology self-organizes, substitutability relations between units are learned, and hence a grammar.
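A sketch of the strengthening step. The slide writes the update as Δn = λs, Δs = λn; the code below uses the difference form Δn = λ(s − n), Δs = λ(n − s), an interpretation chosen so that the two points visibly move closer together, which is an assumption about the exact rule rather than a statement of it.

```python
import numpy as np

def strengthen_binding(node_vec, slot_vec, lam=0.1):
    """After finding the most probable parse, move each bound node and its
    slot closer together in substitution space (difference-form update;
    see the lead-in for the assumption being made)."""
    delta_n = lam * (slot_vec - node_vec)
    delta_s = lam * (node_vec - slot_vec)
    return node_vec + delta_n, slot_vec + delta_s

n = np.array([0.0, 1.0])
s = np.array([1.0, 0.0])
for _ in range(5):
    n, s = strengthen_binding(n, s)
print(np.linalg.norm(n - s))   # the distance shrinks as the topology self-organizes
```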

  16. Generalization: from concrete to abstract [Figure: two stages in which the words the, dog, cat and feed become bound to the slots of compressor node X1] • Start with random node representations. Parsing a corpus causes the nodes to become interconnected, reflecting the corpus distribution (the topology self-organizes). • If two words occur in the same contexts they tend to bind to the same slots. The slots mediate the formation of abstract category regions in space through generalization. • HPN shows how constructions with slots (I want X) are learned: stage 2 in the developmental stages of Usage Based Grammar (Tomasello, 2003).

  17. Demo…

  18. Experiment with an artificial CFG • HPN learns a topology from an artificial corpus that reflects the original categories of the CFG grammar. • 1000 sentences were generated by the same artificial grammar with relative clauses as in [Elman, 1991], e.g. "boy who chases dog walks". • HPN was initialized with 10 compressor nodes with 2 slots and 5 with 3 slots. • All lexical and compressor nodes have random initial representations. • Next-word prediction is more difficult to do in HPN; work in progress.

  19. Experiment with the Eve corpus from CHILDES • 2000 sentences from the second half of the Eve corpus. • Brackets are available from CHILDES, but not of good quality; they were binarized. • HPN was initialized with 120 productions with 2 slots. • Result: reasonable clustering of part-of-speech categories. • Note that the SRN cannot work with realistic corpora.

  20. Conclusions • Dynamic binding between nodes in HPN adds to the expressive power of neural networks, and allows explicit representation of hierarchical constituent structure. • Continuous category representations enable incremental learning of syntactic categories, from concrete to abstract. • HPN offers a novel perspective on grammar acquisition as a self-organizing process of topology construction. This is a biologically plausible way of learning. • HPN creates a synthesis between the connectionist and rule-based approaches to syntax (it encodes and learns context-free `rules' in the compressor nodes). → This answers the critique of [Fodor & Pylyshyn, 1988] that connectionist networks are not systematic.

  21. For the course Cognitive models of language and beyond, University of Amsterdam, February 28, 2011. Connectionist models of language, part III: Episodic grammar

  22. Limitations of HPN: context-freeness • HPN cannot do contextual conditioning, because parser decisions depend only on the distance between two nodes (HPN is `context free'). • Yet, for realistic language processing, and in particular for left-corner parsing, one must be able to condition on the contextual (lexical and structural) history. • Connectionism puts a constraint on the parser: conditioning events must be locally accessible. • Find a neurally plausible solution for conditioning on sentence context: use episodic memory.

  23. Episodic and semantic memory • Language processing makes use of two memory systems. • Semantic memory is a person's general world knowledge, including language, in the form of concepts (of objects, processes and ideas) that are systematically related to each other (e.g. the concept "bread"). • Episodic memory is a person's memory of personally experienced events or episodes, embedded in a temporal and spatial context (e.g. me lining up in front of the bakery). • In the language domain, semantic memory encodes abstract, rule-based linguistic knowledge (a grammar), and episodic memory encodes memories of concrete sentence fragments (exemplars). • HPN implements a semantic memory of a (context-free) grammar.

  24. The semantic-episodic memory distinction maps onto the debate between rule-based and exemplar-based cognitive modeling • There is an interesting parallel between episodic versus semantic memory processes and rule-based versus exemplar-based language processing. • There is evidence for abstract, rule-based grammars in the tradition of generative grammar (e.g., [Marcus, 2001]). • Usage Based Grammar [Tomasello, 2000] emphasizes the item-based nature of language, with a role for concrete constructions larger than rules. • This suggests that a semantic memory of abstract grammatical relations and an episodic memory of concrete sentences interact in language processing.

  25. Properties of episodic memory • All episodic experiences that can be consciously remembered leave physical memory traces in the brain. • Episodic memories are content addressable: their retrieval can be primed by cues from semantic memory (for instance the memory of a smell) → priming effects. • Sequentiality: episodes are construed as temporal sequences that bind together static semantic elements, within a certain context [Eichenbaum, 2004]. • The chronological order of episodic memories is preserved. → HPN can be enriched with an episodic memory to model the interaction between semantic and episodic memory in language processing, and at the same time to solve the problem of context.

  26. Episodic memory traces in HPN [Figure: HPN network with lexical nodes boy, girl, who, dances, likes, tango, mango and the traces left by two derivations] • Episodic traces are stored in the local memories of visited nodes. • After a successful derivation, the local short-term memories in the slots must be made permanent. • The traces prime derivations of processed sentences (content addressability).

  27. Trace encoding in a symbolic episodic grammar • Episodic traces after processing the sentences "girl who dances likes tango" and "boy likes mango". [Figure: treelets for the productions S → NP VP, NP → N RC, NP → N, VP → VT NP, RC → WHO VI and the lexical rules, each annotated with its traces, e.g. 1-1 and 2-1 on S → NP VP] • In the symbolic, supervised case nodes become treelets that correspond one-to-one to CFG rules. • Traces are encoded as x-y, where x = sentence number in the corpus and y = position in the derivation (top-down or left-corner); a sketch of this encoding follows below.
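A minimal sketch of the x-y trace encoding for the two example sentences, assuming each treelet (one per CFG production) keeps a list of (sentence number, derivation position) pairs. The treelet sequences below are simplified top-down orderings for illustration and need not match the exact positions shown in the figure.

```python
from collections import defaultdict

# Each treelet corresponds to a CFG production and stores episodic traces
# as (sentence_nr, position_in_derivation) pairs.
traces = defaultdict(list)

def store_derivation(sentence_nr, treelet_sequence):
    """Record the order in which treelets were visited while deriving
    one training sentence."""
    for position, treelet in enumerate(treelet_sequence, start=1):
        traces[treelet].append((sentence_nr, position))

# Sentence 1: "girl who dances likes tango" (simplified treelet sequence)
store_derivation(1, ["S -> NP VP", "NP -> N RC", "N -> girl", "RC -> WHO VI",
                     "WHO -> who", "VI -> dances", "VP -> VT NP",
                     "VT -> likes", "NP -> N", "N -> tango"])
# Sentence 2: "boy likes mango"
store_derivation(2, ["S -> NP VP", "NP -> N", "N -> boy", "VP -> VT NP",
                     "VT -> likes", "NP -> N", "N -> mango"])

print(traces["S -> NP VP"])   # [(1, 1), (2, 1)]
print(traces["VT -> likes"])  # [(1, 8), (2, 5)]
```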

  28. Trace encoding for a left corner derivation [Figure: left-corner derivation of "girl who dances likes tango", with traces 1-1 … 1-20 distributed over the treelets and goal states (START*, NP*, RC*, S*, VP*), and the parser operations shift (sh), project (pr) and attach (att) marked on the arcs] • Treelets have a register that keeps track of the execution of operations.

  29. Parsing as priming activation of traces • When parsing a novel sentence, the traces in a visited treelet are primed, and trigger memories of stored sentences (derivations). • The traces e_x receive an activation value A whose strength depends on the common history (CH) of the pending derivation d with the stored derivation x. • CH is measured as the number of derivation steps shared between d and x. • Every step in the derivation is determined by a competition between the activated traces of different exemplars.

  30. Probabilistic episodic grammar • A derivation in the episodic grammar is a sequence of visits to treelets. • The probability of continuing the derivation from treelet t_k to treelet t_{k+1} is the summed activation of the traces in t_k that prefer t_{k+1}, normalized by the summed activation of all traces in t_k: P(t_{k+1} | t_k) = Σ_{e_x ∈ E(t_k, t_{k+1})} A(e_x) / Σ_{e_x ∈ E(t_k)} A(e_x), where E(t_k, t_{k+1}) is the set of traces in t_k with a preference for t_{k+1}, E(t_k) is the set of all traces in t_k, and A(e_x) is the activation of trace e_x in t_k.
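A sketch of this continuation probability, assuming the trace activations A(e_x) have already been computed; the trace identifiers, activation values and preference test are illustrative assumptions.

```python
def continuation_probability(treelet_traces, activation, prefers_next):
    """P(t_{k+1} | t_k): summed activation of the traces in t_k that prefer
    t_{k+1}, divided by the summed activation of all traces in t_k."""
    total = sum(activation[t] for t in treelet_traces)
    preferred = sum(activation[t] for t in treelet_traces if prefers_next(t))
    return preferred / total if total > 0 else 0.0

# Traces in the current treelet t_k, identified as (sentence, position) pairs.
treelet_traces = [(1, 4), (2, 4), (3, 7)]
activation = {(1, 4): 2.0, (2, 4): 0.5, (3, 7): 1.0}   # A(e_x), assumed values
# Suppose the traces from sentences 1 and 3 continue to the candidate t_{k+1}.
prefers = lambda t: t[0] in {1, 3}
print(continuation_probability(treelet_traces, activation, prefers))  # 0.857...
```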

  31. Training the episodic grammar • For the parsing task, the episodic grammar is trained on Wall Street Journal (WSJ) sections 2-21 and evaluated on WSJ section 22. Training proceeds as follows: • One treelet is created for every unique context-free production in the treebank. • For every sentence in the corpus, determine the sequential order of the treelets in its derivation d. • For every treelet in derivation d, store a trace x-y consisting of the sentence number plus the position in the derivation. • After training, the probability of a given parse of a test sentence can be computed by dynamically updating the trace activations: • For every step in a given derivation, check whether the traces in the current treelet are successors of traces in the previous treelet. • If so, increase their CH by 1; otherwise, set their CH to 0. A sketch of this bookkeeping follows below.
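A sketch of the common-history bookkeeping during a derivation step, directly following the two bullet points above. Traces are (sentence, position) pairs; the successor test ((x, y−1) present in the previous treelet) and the accumulation of the predecessor's CH are assumptions about the exact procedure.

```python
def update_common_history(prev_traces, cur_traces, ch):
    """For each trace x-y in the current treelet, check whether its
    predecessor (x, y-1) occurred in the previous treelet of the derivation.
    If so, its common history grows by one; otherwise it is reset to 0."""
    prev_set = set(prev_traces)
    new_ch = {}
    for (sent, pos) in cur_traces:
        if (sent, pos - 1) in prev_set:
            new_ch[(sent, pos)] = ch.get((sent, pos - 1), 0) + 1
        else:
            new_ch[(sent, pos)] = 0
    return new_ch

prev = [(1, 3), (2, 6)]
cur = [(1, 4), (2, 2), (5, 9)]
ch = {(1, 3): 2, (2, 6): 0}
print(update_common_history(prev, cur, ch))
# {(1, 4): 3, (2, 2): 0, (5, 9): 0}
```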

  32. Reranking with the episodic grammar [Figure: a 3rd-party parser trained on WSJ sections 2-21 produces an N-best list of parses with probabilities; the episodic probabilistic grammar, also trained on WSJ sections 2-21, reranks this list (e.g. the parse ranked 1st by the parser ends up 3rd after reranking)] • Test sentence (WSJ section 22): "Silver performed quietly". • Evaluation: compare the most probable parse with the Gold standard parse (S (NP (NN Silver)) (VP (VBD performed) (ADVP (RB quietly))) (. .)) using PARSEVAL: LP = #matching constituents / #constituents in the model parse; LR = #matching constituents / #constituents in the Gold standard parse. • Example scores: LR = 89.88, LP = 90.10, F = 89.99. A sketch of the LP/LR computation follows below.
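The PARSEVAL measures defined above can be computed directly from labelled constituent spans. A minimal sketch, with hypothetical span sets (the ADJP/ADVP mismatch is an invented example, not taken from the slide):

```python
def parseval(model_constits, gold_constits):
    """Labelled precision, recall and F1 over constituents, each represented
    as a (label, start, end) span."""
    matching = len(set(model_constits) & set(gold_constits))
    lp = matching / len(model_constits)   # #matching / #constituents in model parse
    lr = matching / len(gold_constits)    # #matching / #constituents in gold parse
    f1 = 2 * lp * lr / (lp + lr) if lp + lr else 0.0
    return lp, lr, f1

# Hypothetical spans for "Silver performed quietly".
gold  = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("ADVP", 2, 3)]
model = [("S", 0, 3), ("NP", 0, 1), ("VP", 1, 3), ("ADJP", 2, 3)]
print(parseval(model, gold))   # (0.75, 0.75, 0.75)
```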

  33. Experiments • Precision and recall scores of the top-down episodic (TDE) reranker and the left-corner episodic (LCE) reranker as a function of the maximum common history considered. Nbest = 5; λ0 = 4; λ1 = λ2 = λ3 = 0.2. • Best F-score for the TDE reranker: F = 90.36 for history = 5; for the LCE reranker: F = 90.61 for history = 8. Both are better than Charniak '99.

  34. Robustness F-scores of the left-corner episodic reranker when the output list of the 3rd party parser is varied between the 5, 10 and 20 best parses.

  35. Discontiguities and the shortest-derivation reranker • When discontiguous episodes are taken into account, the F-score improves to F = 90.68 (Nbest = 5; d = 0.95; f = 0.6). • The shortest-derivation episodic reranker selects parses from the N-best list according to a preference for derivations that use the fewest episodes. Its best F-score is F = 90.44 (better than Charniak '99).

  36. Parsing results compared to state-of-the-art

  37. Relation to Data Oriented Parsing (DOP) • The episodic grammar gives a neural perspective on parsing with larger fragments, in terms of episodic memory retrieval. • Complementary: in DOP the substitution of an arbitrarily large subtree is conditioned on a single nonterminal; in the episodic parser the application of a local treelet is conditioned on an arbitrarily large episode. • But: the shortest-derivation variant does use large fragments. • Advantages over DOP: • Every exemplar is stored only once in the network, so space complexity is linear in the number of exemplars. • Content addressability: episodes are reconstructed from traces, obviating a search through an external memory.

  38. Relation to history-based parsing • Like state-of-the-art history-based parsers (e.g., [Collins, 2003; Charniak, 1999]), the episodic grammar makes use of lexical and structural conditioning context. • Yet, no preprocessing of labels is needed. • It conditions on arbitrarily long histories at no cost to the grammar size, because the history is implicit in the representation: there is no need to form equivalence classes. • The association between a conditioning event and the sentence from which it originates is preserved: this makes it possible to exploit discontiguous episodes.

  39. Conclusions • The episodic grammar clarifies the trade-off between rule-based and exemplar-based models of language, and unites them under a single, neurally plausible framework. • It proposes and evaluates an original hypothesis about the representation of episodic memory in the brain: the emerging picture of episodic memory is that of a life-long thread spun through semantic memory. • It fits with the "reinstatement hypothesis of episodic retrieval": priming of traces reactivates the cortical circuits that were involved in encoding the episodic memory. • It fits with hippocampal models of episodic memory [e.g., Eichenbaum, 2004; Levy, 1996]: special `context neurons' uniquely identify (part of) an episode, and function as a neural correlate of a counter.

  40. Work in progress • Developing an episodic chart parser that computes parse probabilities through activation spreading to (the traces in) its states. • Integrating episodic memory into the probabilistic HPN chart parser; doing unsupervised grammar induction with episodic-HPN. • Want to join for a project? Contact me at gideonbor@gmail.com • Homepage: staff.science.uva.nl/~gideon

  41. Thank you! References: • Bod, R. (1998). Beyond Grammar: An Experience-Based Theory of Language. CSLI Publications, Stanford, CA. • Borensztajn, G., Zuidema, W., & Bod, R. (2009). The hierarchical prediction network: towards a neural theory of grammar acquisition. Proceedings of the 31st Annual Meeting of the Cognitive Science Society. • Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211. • Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning. • Fodor, J. A., & Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3-71. • Hadley, R. F. (1994). Systematicity in connectionist language learning. Mind and Language. • Hawkins, J., & Blakeslee, S. (2004). On Intelligence. New York: Henry Holt and Company. • Tomasello, M. (2003). Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge, MA: Harvard University Press.
