Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data

Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
Department of Systems and Computer Science, Università di Firenze, Italy
http://www.dsi.unifi.it/neural/

Massimiliano Pontil
Department of Computer Science, University College London, UK
Structured Data
• In many applications it is useful to represent the objects of the domain as structured data (trees, graphs, …)
• Structured representations better capture the important relationships between the sub-parts that compose an object
Natural Language: Parse Trees
[Figure: parse tree of the sentence "He was previous vice president."]
Structural Genomics: Protein Contact Maps
Document Processing: XY-Trees
[Figure: a document page recursively segmented into an XY-tree, each node labeled with a vector of region coordinates]
Predictive Toxicology, QSAR: Chemical Compounds as Graphs
• A compound is represented as a labeled tree, e.g. CH3(CH(CH3,CH2(CH2(CH3))))
• Atom-group labels are encoded as vectors: [-1,-1,-1,1]([-1,1,-1,-1]([-1,-1,-1,1],[-1,-1,1,-1]([-1,-1,1,-1]([-1,-1,-1,1]))))
Ranking vs. Preference
[Figure: ranking assigns a position (1,…,5) to every alternative; preference only singles out the best one]
Preference on Structured Data
The Target Space: Classification, Regression and Ranking
• Supervised learning task: f : X → Y
• Classification: Y is a finite unordered set
• Regression: Y is a metric space (e.g. the reals)
• Ranking and Preference: Y is a finite ordered set, but a non-metric space
Learning on Structured Data
• Learning algorithms on discrete structures often derive from vector-based methods
• Both Kernel Machines and RNNs are suitable for learning on structured domains
Kernels vs. RNNs
• Kernel Machines
  • Very high-dimensional feature space
  • How to choose the kernel? Prior knowledge, fixed representation
  • Minimize a convex functional (SVM)
• Recursive Neural Networks
  • Low-dimensional space
  • Task-driven: the representation depends on the specific learning task
  • Learn an implicit encoding of the relevant information
  • Problem of local minima
A Kernel for Labeled Trees
• Feature space: the set of all tree fragments (subtrees), with the only constraint that a father cannot be separated from his children
• Φn(t) = number of occurrences of tree fragment n in t: a tree becomes a "bag of fragments"
• A tree is represented by Φ(t) = [Φ1(t), Φ2(t), Φ3(t), …]
• K(t,s) = Φ(t)·Φ(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001)
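As an illustration, here is a minimal Python sketch of the Collins & Duffy dynamic program (the Node class and helper names are ours, not from the paper): C(n1,n2) counts the tree fragments common to the subtrees rooted at n1 and n2, and K(t,s) sums it over all pairs of internal nodes.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def production(n):
    # A node together with the ordered labels of its children
    return (n.label, tuple(c.label for c in n.children))

def internal_nodes(t):
    if t.children:
        yield t
    for c in t.children:
        yield from internal_nodes(c)

def is_preterminal(n):
    return bool(n.children) and all(not c.children for c in n.children)

def C(n1, n2, memo):
    # Number of common tree fragments rooted at n1 and n2
    key = (id(n1), id(n2))
    if key not in memo:
        if production(n1) != production(n2):
            memo[key] = 0
        elif is_preterminal(n1):
            memo[key] = 1
        else:
            v = 1
            for c1, c2 in zip(n1.children, n2.children):
                v *= 1 + C(c1, c2, memo)  # keep or drop each child fragment
            memo[key] = v
    return memo[key]

def tree_kernel(t, s):
    memo = {}
    return sum(C(n1, n2, memo)
               for n1 in internal_nodes(t) for n2 in internal_nodes(s))
```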
Recursive Neural Networks
• Composition of two adaptive functions: a transition function φ_w : X → R^n and an output function o_w′ : R^n → O (the output space)
• Both φ and o are implemented by feedforward NNs
• Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data
Recursive Neural Networks
[Figure: the encoding network is unfolded through the structure of a labeled tree; a prediction phase through the output network is followed by an error-correction phase]
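A minimal numpy sketch of the unfolding, assuming trees of bounded outdegree K, a tanh transition function, and a scalar linear output layer; the dimensions, initialization, and tree encoding are illustrative, not the architecture actually used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 10, 4, 2   # state size, label size, max outdegree (all assumed)

W_lab = rng.normal(scale=0.1, size=(n, m))                      # label weights
W_ch = [rng.normal(scale=0.1, size=(n, n)) for _ in range(K)]   # one matrix per child slot
b = np.zeros(n)
w_out = rng.normal(scale=0.1, size=n)                           # linear output layer

def phi(tree):
    """Transition function, unfolded bottom-up over the tree.
    A tree is (label_vector, [children]); missing children contribute zero."""
    label, children = tree
    h = W_lab @ label + b
    for i, child in enumerate(children):
        h += W_ch[i] @ phi(child)
    return np.tanh(h)

def o(state):
    # Output function: here a scalar, used as a utility in the preference model
    return w_out @ state

# Example: encode a tiny two-leaf tree and compute its output
leaf = (np.ones(m), [])
tree = (np.ones(m), [leaf, leaf])
print(o(phi(tree)))
```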
Preference Models
• Kernel preference model: binary classification of pairwise differences between instances
• RNN preference model: probabilistic model to find the best alternative
• Both models use a utility function to evaluate the importance of an element
Utility Function Approach
• Model the importance of an object by a utility function U : X → R, with x > z ↔ U(x) > U(z)
• If U is linear: U(x) > U(z) ↔ w^T x > w^T z
• U can also be modelled by a neural network
• Ranking and preference problems: learn U, then sort by U(x) (e.g. U(x)=11 beats U(z)=3)
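In code, the approach amounts to scoring and sorting; the weight vector below is a hypothetical learned model, not one from the paper.

```python
import numpy as np

w = np.array([0.5, -1.0, 2.0])   # learned weights (hypothetical)
U = lambda x: float(w @ x)       # linear utility U(x) = w^T x

alternatives = [np.array([1.0, 0.0, 3.0]), np.array([0.0, 2.0, 1.0])]
ranked = sorted(alternatives, key=U, reverse=True)   # best alternative first
```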
Kernel Preference Model
• Let x1 be the best of (x1,…,xr); create a pair (x1, xj) for each of x2,…,xr
• If U is linear this gives a set of constraints: U(x1) > U(xj) ↔ w^T x1 > w^T xj ↔ w^T(x1−xj) > 0 for j=2,…,r
• Each difference x1−xj can be seen as a positive example: binary classification of differences between instances
• With the map x → Φ(x) the process is easily kernelized
• Note: this model never considers all the alternatives together, only two at a time
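A sketch of the two ingredients, under the same illustrative conventions as above: building the positive difference examples from one forest, and expanding the inner product of two implicit differences into four base-kernel evaluations, which is all the kernelization requires.

```python
import numpy as np

def difference_examples(phi_best, phi_others):
    # One positive example Phi(x1) - Phi(xj) per lower-ranked alternative
    return [phi_best - phi_j for phi_j in phi_others]

def diff_kernel(K, x1, xj, z1, zk):
    # <Phi(x1)-Phi(xj), Phi(z1)-Phi(zk)> expanded by bilinearity,
    # so the explicit feature map Phi is never needed
    return K(x1, z1) - K(x1, zk) - K(xj, z1) + K(xj, zk)
```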
RNN Preference Model
• Given a set of alternatives (x1,x2,…,xr), U is modelled by a recursive neural network architecture
• Compute U(xi) = o(φ(xi)) for i=1,…,r and normalize the utilities with a softmax function
• The error (yi − oi) is backpropagated through the whole network
• Note: the softmax function compares all the alternatives together at once
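A sketch of the forest-level loss, assuming the correct alternative has index 0; note how the gradient couples all r utilities through the softmax, unlike the pairwise kernel model.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max())   # numerically stable softmax
    return e / e.sum()

def forest_loss_and_grad(utilities, best=0):
    """Cross-entropy over one forest; utilities[i] = U(x_i) = o(phi(x_i))."""
    p = softmax(np.asarray(utilities, dtype=float))
    y = np.zeros_like(p)
    y[best] = 1.0
    loss = -np.log(p[best])
    grad = p - y   # dLoss/dU_i = -(y_i - o_i), backpropagated through the RNN
    return loss, grad
```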
Learning Problems
• First pass attachment: modelling of a psycholinguistic phenomenon
• Reranking task: reranking the parse trees output by a statistical parser
First Pass Attachment (FPA)
[Figure: incremental parse of "It has no bearing on …" with four alternative attachment sites for the incoming word]
• The grammar introduces ambiguities: for each word there is a set of alternative attachments, but only one is correct
• The first pass attachment can therefore be modelled as a preference problem
Heuristics for Prediction Enhancement
• Evaluation measure = (# correct trees ranked in first position) / (total number of sets)
• Modularization: group the words into 10 classes (verbs, articles, …) and learn a different classifier for each class, specializing the FPA prediction per word class
• Tree reduction: remove the nodes of the parse tree that are not important for choosing between the alternatives
Experimental Setup
• Wall Street Journal (WSJ) section of the Penn Treebank, a realistic corpus of natural language
• 40,000 sentences, 1 million words; average sentence length: 25 words
• Standard benchmark in computational linguistics
• Training on sections 2-21, validation on section 24, test on section 23
Voted Perceptron (VP)
• FPA on the WSJ yields about 100 million trees for training
• Voted Perceptron used instead of SVM (Freund & Schapire, Machine Learning 1999)
• Online algorithm for binary classification based on the perceptron algorithm: simple and efficient
• Prediction value: a vote-weighted sum over all intermediate training weight vectors
• Performance comparable to maximal-margin classifiers (SVM)
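A compact sketch of the algorithm in its linear form (the experiments use it with the tree kernel, where mistake examples are stored instead of explicit weight vectors).

```python
import numpy as np

def train_voted_perceptron(X, y, epochs=1):
    """Voted perceptron (Freund & Schapire, 1999): keep every intermediate
    weight vector w_k together with a vote c_k, the number of examples it
    classified correctly before the next mistake."""
    w, c = np.zeros(X.shape[1]), 1
    weights, votes = [], []
    for _ in range(epochs):
        for x, t in zip(X, y):      # labels t in {-1, +1}
            if t * (w @ x) <= 0:    # mistake: freeze current vector, then update
                weights.append(w.copy())
                votes.append(c)
                w, c = w + t * x, 1
            else:
                c += 1
    weights.append(w.copy())
    votes.append(c)
    return weights, votes

def vp_predict(weights, votes, x):
    # Vote-weighted majority over all intermediate perceptrons
    score = sum(c * np.sign(w @ x) for w, c in zip(weights, votes))
    return 1 if score >= 0 else -1
```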
Kernel VP vs. RNNs
Kernel VP vs. RNNs: Modularization
Small Datasets, No Modularization
Complexity Comparison
• VP does not scale linearly with the number of training examples, as RNNs do
• Computational cost on small datasets (5 splits of 100 sentences): about a week on a 2 GHz CPU, with CPU(VP) ≈ CPU(RNN)
• Large datasets (all 40,000 sentences): VP took over 2 months to complete one epoch on a 2 GHz CPU, while the RNN learns in 1-2 epochs, about 3 days on a 2 GHz CPU
• VP improves smoothly with respect to the number of training iterations
Reranking Task
• Rerank the parse trees generated by a statistical parser
• Same problem setting as FPA: preference on forests
• 1 forest per sentence instead of 1 forest per word, so the computational cost is lower
Evaluation: Parseval Measures
• Standard evaluation measures: Labeled Precision (LP), Labeled Recall (LR), Crossing Brackets (CBs)
• They compare a parse produced by the parser with a hand parsing of the sentence
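Under the usual representation of a parse as a set of labeled brackets (label, start, end), the measures reduce to simple set operations; this is a sketch, not a reference implementation.

```python
def parseval(candidate, gold):
    """candidate, gold: sets of (label, start, end) constituents."""
    correct = len(candidate & gold)
    lp = correct / len(candidate)   # Labeled Precision
    lr = correct / len(gold)        # Labeled Recall
    return lp, lr

def crossing_brackets(candidate, gold):
    # A candidate bracket crosses a gold one if their spans overlap
    # without either span containing the other
    def crosses(a, b):
        (_, s1, e1), (_, s2, e2) = a, b
        return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1
    return sum(any(crosses(c, g) for g in gold) for c in candidate)
```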
Reranking Task
Why Do RNNs Outperform Kernel VP?
• Hypothesis 1: the kernel function induces a feature space that is not focused on the specific learning task
• Hypothesis 2: the kernel preference model is worse than the RNN preference model
Linear VP on the RNN Representation
• Checking hypothesis 1: train VP on the representation learned by the RNN
• The tree kernel is replaced by a linear kernel: the state vectors that the RNN computes for the parse trees are given as input to VP
• Linear VP is thus trained on the RNN state vectors
Linear VP on the RNN Representation
Conclusions
• RNNs show better generalization properties, also on small datasets, and at a smaller computational cost
• The problem is neither the kernel function nor the VP algorithm, as the linear-VP-on-RNN-representation experiment shows
• The problem is the preference model: the kernel preference model does not consider all the alternatives together, but only two at a time, as opposed to the RNN model
Acknowledgements
Thanks to: Alessio Ceroni, Alessandro Vullo, Andrea Passerini, Giovanni Soda