Recursive Neural Networks and Connectionist Systems for NLP: The First-Pass Attachment Disambiguation Problem. Fabrizio Costa, University of Florence
Overview • What is this talk about: • brief review of connectionist architectures for NLP • introduction of a connectionist recursive system for syntactic parsing • Hybrid model: • Dynamic grammar: strong incrementality hypothesis • Recursive connectionist predictor • Investigations: • linguistic preferences expressed by the system • Advances: • enhancing the performance of the proposed model through information reduction and domain partitioning
Connectionism and NLP • Even if strongly criticized (Fodor, Pylyshyn. Connectionism and Cognitive Architecture: A Critical Analysis. Cognition, 1988), connectionist systems are applied to a variety of NLP tasks: • grammaticality prediction • case role assignment • syntactic parsers • multi-feature systems (lexical, syntactic, semantic)
Advantages • Parallel satisfaction of multiple and diverse constraints • in contrast, rule-based sequential processing needs to compute solutions that are later discarded if they do not meet the requirements • Develop and process distributed representations for complex structures
Connectionist Architectures • Brief review of: • Neural Networks • Recurrent Networks • Recursive Auto-Associative Memory (RAAM) • Introduction to Recursive Neural Networks
Neural Networks • What is an Artificial Neural Network (ANN)? • A simple model: the perceptron • [Diagram: inputs x1 ... xn with weights w1 ... wn feed a single unit that produces the output y] • y = Σi=1..n wi xi − θ
What can it learn? • A perceptron can learn linear decision boundaries from examples • Ex: two-dimensional input • [Plot: training examples of class C1 and class C2 in the (x1, x2) plane, separated by the decision line w1 x1 + w2 x2 − θ = 0]
How can it learn? • Iterative algorithm: • at every presentation of an input the procedure reduces the error • [Plot: the decision boundary in the (x1, x2) plane being adjusted over iterations 1, 2 and 3]
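A minimal sketch of the perceptron learning rule described above, on toy two-dimensional data (the data and variable names are illustrative, not from the talk):

```python
# Perceptron learning rule: adjust weights and threshold on every misclassified example.
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """X: examples (rows), y: labels in {-1, +1}. Learns w and theta for sign(w.x - theta)."""
    w = np.zeros(X.shape[1])
    theta = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if w @ xi - theta > 0 else -1
            if pred != target:            # update only when the prediction is wrong
                w += lr * target * xi
                theta -= lr * target
    return w, theta

# Two linearly separable classes in the (x1, x2) plane.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, theta = train_perceptron(X, y)
print(w, theta)   # defines the separating line w1*x1 + w2*x2 - theta = 0
```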
Getting more complex • Multilayer Feed-forward Network • hidden layer • non-linearity • [Diagram: inputs x1 ... xn feed the input layer, followed by hidden layer 1, hidden layer 2 and the output layer]
Signal Flow • Learning takes place in 2 phases: • Forward phase (compute prediction) • Backward phase (propagate error signal) • [Diagram: forward and backward signal flow through the network over inputs x1 ... xn]
Decision Boundaries • Now decision boundaries can be made arbitrarily complex by increasing the number of neurons in the hidden layer.
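A minimal sketch of the forward pass of such a network with one hidden layer (sizes and random weights are illustrative stand-ins, not from the talk); the non-linear hidden layer is what allows the more complex boundaries:

```python
# Forward pass of a small multilayer feed-forward network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, W2, b2):
    h = sigmoid(W1 @ x + b1)      # hidden layer: non-linear features of the input
    return sigmoid(W2 @ h + b2)   # output layer

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)   # 8 hidden units over 2 inputs
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)
print(mlp_forward(np.array([0.5, -1.0]), W1, b1, W2, b2))
```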
ANN and NLP • Since ANNs can potentially learn any mapping from input values to any desired output... • ...why not apply them to NLP problems? • How can we code linguistic information to make it processable by an ANN? • What are the properties of linguistic information?
Neural Networks and NLP • Problems: • in the simplest formulation, NLP input consists of a sequence of tokens of variable length • the output at any time depends on input received an arbitrary number of time-steps in the past • standard neural networks can directly process only sequences of fixed size • Idea: • introduce an explicit state representation • recycle this state as input for the next time-step
Recurrent Networks • [Diagram: Input Units feed Hidden Units, which feed Output Units; Holding Units copy the hidden and output activations for the next time step] • Rumelhart, Hinton, Williams. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986
Working Principle • A complete forward propagation (from input to output) takes 1 time step • the backward connections copy activations to the Holding Units for the next time step • at the first time step all Holding Units have a fixed conventional value (generally all 0s) • at a generic time step: • the Hidden Units receive input from both the Input Units and the appropriate Holding Units • the Output Units receive input from both the Hidden Units and the appropriate Holding Units
Simple Recurrent Networks • J.L. Elman proposed and studied the properties of Simple Recurrent Networks applied to linguistic problems • Elman. Finding Structure in Time. Cognitive Science, 1990 • Elman. Distributed Representations, Simple Recurrent Networks and Grammatical Structure. Machine Learning, 1991 • Simple Recurrent Networks are recurrent networks that • have a single set of holding units for the hidden layer, called the context layer • truncate the backpropagation of the error signal at the context layer
Simple Recurrent Networks • [Diagram: Input Units and the Context Layer feed the Hidden Units, which feed the Output Units; the Context Layer holds a copy of the previous hidden activations]
Elman Experiment • Task: predict next token in a sequence representing a sentence • Claim: in order to exhibit good prediction performance the system has to • learn syntactic constraints • learn to preserve relevant information of previous inputs in the Context Layer • learn to discard irrelevant information
Elman Setting • Simplified grammar capable of generating embedded relative clauses • 23 words plus the end-of-sentence marker ‘.’ • number agreement • verb argument structure • interaction with relative clauses • the agent or subject in an RC is omitted • center embedding • viable sentences • the end-of-sentence marker cannot occur at all positions in the sentence
Network and Training • Local representations for input and output • Hidden layer with 70 units • Training set: 4 sets of 10,000 sentences • sets increasingly “difficult”, with a higher percentage of sentences containing RCs • 5 epochs
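A minimal sketch of one time step of such a simple recurrent network, using the layer sizes above (24 local, i.e. one-hot, input/output units for the 23 words plus ‘.’, and 70 hidden units); the weights here are random, untrained stand-ins:

```python
# One Elman-style time step: the hidden units read the current (one-hot) word and the
# context layer, i.e. the previous hidden state; the output scores the next word.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_in, n_hid, n_out = 24, 70, 24          # 23 words + end-of-sentence marker, 70 hidden units
rng = np.random.default_rng(1)
W_in  = rng.normal(scale=0.1, size=(n_hid, n_in))
W_ctx = rng.normal(scale=0.1, size=(n_hid, n_hid))
W_out = rng.normal(scale=0.1, size=(n_out, n_hid))

def srn_step(x_t, context):
    hidden = sigmoid(W_in @ x_t + W_ctx @ context)
    output = sigmoid(W_out @ hidden)
    return output, hidden                # the new hidden state is copied into the context

context = np.zeros(n_hid)                # conventional all-zero context at the first step
for token_id in [3, 17, 5]:              # a toy token sequence
    x = np.zeros(n_in)
    x[token_id] = 1.0
    next_word_scores, context = srn_step(x, context)
```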
Results • On sentences from the training set: • the network has learned to predict number agreement between verb and noun and to respect all the other constraints present in the grammar • Good generalization to other sentences not present in the training set
Criticisms • Possibility of rote learning • there are 23 vocabulary symbols but only 10 different classes (i.e. ‘boy’, ‘dog’, ‘girl’ are equivalent) • the number of distinct sentence patterns is small (there are 400 different sentences of 3 words but only 18 different patterns if we consider the equivalences) • there are very few sentence patterns in the test set that are not in the training set
Criticisms • The simplified error backpropagation algorithm doesn’t allow learning of long-term dependencies • the backpropagation of the error signal is truncated at the context layer • This makes computation simpler (no need to store the history of activations) and local in time • The calculated gradient forces the network to transfer information about the input into the hidden layer only if that information is useful for the current output; if it is useful more than one time step in the future, there is no guarantee that it will be preserved
Processing Linguistic Data • Linguistic data such as syntactic information is “naturally” represented in a structured way • [Figure: flat info vs. syntactic info for “The servant of the actress” - the flat word sequence next to its parse tree (NP (D The) (N servant) (PP (P of) (NP (D the) (N actress))))]
Processing structured data • It is possible to “flatten” the information and transform it into a vector representation, e.g. the tree with root A, children B and C, and B dominating D E F G H becomes the string (A(B(DEFGH)C)) • BUT • flattening moves dependencies “further” apart, making them more difficult to process • We need to directly process structured information!
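A tiny illustration of this point, using the example tree above and a simplified serialization without brackets (purely illustrative code, not from the talk): nodes that are adjacent in the tree can end up far apart in the flat sequence.

```python
# C is a direct child of the root A, but in the flattened sequence the whole
# B subtree is serialized in between, pushing A and C far apart.
def flatten(tree):
    """tree is either a leaf label (str) or a (label, children) pair."""
    if isinstance(tree, str):
        return [tree]
    label, children = tree
    return [label] + [tok for child in children for tok in flatten(child)]

tree = ("A", [("B", ["D", "E", "F", "G", "H"]), "C"])
flat = flatten(tree)
print(flat)                                # ['A', 'B', 'D', 'E', 'F', 'G', 'H', 'C']
print(flat.index("C") - flat.index("A"))   # A and C are 7 positions apart
```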
How to process structured information with a connectionist architecture? • Idea: recursively compress subtrees into distributed vector representations • Pollack. Recursive Distributed Representations. Artificial Intelligence, 1990 • An autoencoder net learns how to compress the fields of a node into a label and uncompress the label back into the fields
Recursive Auto-Associative Memory (RAAM) • [Diagram: an autoencoder whose input is the pair (left, right), whose hidden layer is the compressed whole, and whose output reconstructs (left, right)]
Example • Encoding A((BC)D): r is the compressed representation of (B C), q compresses (r D), and p compresses (A q), i.e. the whole tree • [Diagram: the tree with terminals A, B, C, D and internal codes p, q, r]
Training • It is moving-target training • when learning (B C) → r → (B C), the representation of r changes • this changes the target for another example, (r D) → q → (r D) • this can cause instability, so the learning rate has to be kept small and hence we have very long training times
Coding and Decoding • A RAAM can be viewed as 2 nets trained simultaneously: • an encoding network that learns the reduced representation r for (B C) • a decoding network that takes the reduced representation r as input and decodes it back to (B C) • Both encoding and decoding are recursive operations • The encoding net knows, from the structure of the tree, how many times it must recursively compress representations • The decoding network has to decide whether a decoded field is a terminal node or an internal node that should be further decoded
Decoding • Pollack’s solution: • use binary codes for terminal nodes • internal representations have codes with values in [0,1] but not close to 0 or 1 • Decision rule: • if a decoded field has all its values close to 0 or 1 then it is a terminal node and it is not decoded further
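A minimal, untrained sketch of the RAAM encoder/decoder and of the terminal test above; the 16-unit reduced representation matches the experimental setting described below, but the weights are random stand-ins and the threshold in the terminal test is illustrative:

```python
# RAAM sketch: the encoder compresses two child representations into one parent code;
# the decoder expands a code back into its two fields; the decision rule treats a
# decoded field as a terminal when all its values are close to 0 or 1.
import numpy as np

DIM = 16                                  # size of every (reduced) representation

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
W_enc = rng.normal(scale=0.1, size=(DIM, 2 * DIM))    # (left, right) -> whole
W_dec = rng.normal(scale=0.1, size=(2 * DIM, DIM))    # whole -> (left, right)

def encode(left, right):
    return sigmoid(W_enc @ np.concatenate([left, right]))

def decode(whole):
    fields = sigmoid(W_dec @ whole)
    return fields[:DIM], fields[DIM:]

def is_terminal(code, eps=0.2):
    return bool(np.all((code < eps) | (code > 1 - eps)))

# Bottom-up encoding of the tree A((BC)D) from the earlier example:
A, B, C, D = (rng.integers(0, 2, DIM).astype(float) for _ in range(4))  # binary terminals
r = encode(B, C)          # r encodes (B C)
q = encode(r, D)          # q encodes (r D)
p = encode(A, q)          # p encodes the whole tree
print(is_terminal(B))     # True: binary terminal codes pass the test
```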
Experiment • Task: coding and decoding of compositional propositions • Ex: • Pat thought that John loved Mary • (thought Pat (loved John (Mary)))
Experimental Setting • The network has: • 48 input units • 16 hidden units • 48 output units • Training set: 13 complex propositions
Results • After training the network is able to code and decode some (not all) novel propositions • Cluster analysis shows that similar trees are encoded with similar codes • Ex: the codes for (loved John Pat) and (loved Pat Mary) are more similar to each other than to any of the codes for other trees
Further results • Pollack tried to test whether a network could be trained to manipulate reduced representations without decoding them • Ex: train a network to transform the reduced representation of (is liked X Y) into (likes Y X) • Results: • if trained on 12 of the 16 possible propositions • the network generalizes correctly to the other 4 propositions • Chalmers experimented with a network that transforms reduced representations from passive to active structures
Problems • Generalization problems for new structures • Very long training times • Information storage limits: • all configurations have to be stored in fixed-size vector representations • as the height grows, the number of possible trees grows exponentially • the numerical precision limits are exceeded even by small trees • Unknown (but believed to be poor) scaling properties
Recursive Neural Network • We want to • directly process complex tree structures • work in a supervised framework • We use Recursive Neural Networks (RNN) specialized for tree structures • Frasconi, Gori, Sperduti. A General Framework for Adaptive Processing of Data Structures. IEEE Transactions on Neural Networks, 1998
What is an RNN? • Recursive Neural Networks for trees are composed of several replicas of a Recursive Network and one Output Network • [Diagram: the Recursive Network maps a node’s label encoding and its children’s states (1st child state ... last child state) to the node state; the Output Network maps the root state to the output]
How does an RNN process tree data structures? • General processing step: • Structure unfolding • Prediction phase: • Recursive state update • Learning phase: • Backpropagation through structure
Structure unfolding • [Animation over several slides: the parse tree of “It has no bearing on” (S, VP, NP and PP nodes over the tags VBZ, DT, NN, IN, PRP) is unfolded bottom-up, each node being replaced by a replica of the recursive network, with the output network attached at the root]
Prediction phase: Information Flow • [Animation over several slides: activations flow bottom-up through the unfolded network for “It has no bearing on”, from the leaves to the root state, which feeds the output network]
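A minimal sketch of the recursive state update behind the unfolding and prediction slides above, assuming a fixed maximum out-degree and all-zero states for missing children; the layer sizes, the tree encoding and the single-output network are illustrative assumptions, not the setup of the talk:

```python
# Recursive Neural Network sketch: one replica of the recursive network per tree node
# computes the node state from the node label and its children's states (bottom-up,
# i.e. the "information flow" of the prediction phase); the output network reads the
# root state.
import numpy as np

STATE, LABEL, MAX_CHILDREN = 10, 5, 4
rng = np.random.default_rng(3)
W_label = rng.normal(scale=0.1, size=(STATE, LABEL))
W_child = rng.normal(scale=0.1, size=(STATE, MAX_CHILDREN * STATE))
W_out   = rng.normal(scale=0.1, size=(1, STATE))

def node_state(label_vec, child_states):
    """Recursive network replica: missing children get the conventional zero state."""
    kids = child_states + [np.zeros(STATE)] * (MAX_CHILDREN - len(child_states))
    return np.tanh(W_label @ label_vec + W_child @ np.concatenate(kids))

def unfold(tree):
    """tree = (label_vector, [subtrees]); states are computed bottom-up from the leaves."""
    label_vec, children = tree
    return node_state(label_vec, [unfold(c) for c in children])

def predict(tree):
    root_state = unfold(tree)                        # the root state summarizes the tree
    return 1.0 / (1.0 + np.exp(-(W_out @ root_state)[0]))

# A toy two-level tree with random label encodings.
leaf = (rng.normal(size=LABEL), [])
tree = (rng.normal(size=LABEL), [leaf, leaf])
print(predict(tree))
```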