
Sauro Menchetti, Fabrizio Costa, Paolo Frasconi Department of Systems and Computer Science

Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data. Sauro Menchetti, Fabrizio Costa, Paolo Frasconi Department of Systems and Computer Science Università di Firenze, Italy http://www.dsi.unifi.it/neural/ Massimiliano Pontil





Presentation Transcript


  1. Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data
     Sauro Menchetti, Fabrizio Costa, Paolo Frasconi, Department of Systems and Computer Science, Università di Firenze, Italy, http://www.dsi.unifi.it/neural/
     Massimiliano Pontil, Department of Computer Science, University College London, UK

  2. Structured Data
     • In many applications…
     • … it is useful to represent the objects of the domain as structured data (trees, graphs, …)
     • … which better capture important relationships between the sub-parts that compose an object
     ANNPR 2003, Florence, 12-13 September 2003

  3. Natural Language: Parse Trees
     [Figure: parse tree of the sentence "He was previous vice president ." with constituents S, VP, NP, ADVP, NP and part-of-speech tags PRP VBD RB NN NN .]

  4. Structural Genomics: Protein Contact Maps

  5. Document Processing: XY-Trees
     [Figure: XY-tree of a document page with numeric node labels, e.g. 0 0.00 1.00 0.00 1.00]

  6. Predictive Toxicology, QSAR: Chemical Compounds as Graphs
     • Compound written as a tree term: CH3(CH(CH3,CH2(CH2(CH3))))
     • Same tree with vector node labels: [-1,-1,-1,1]([-1,1,-1,-1]([-1,-1,-1,1],[-1,-1,1,-1]([-1,-1,1,-1]([-1,-1,-1,1]))))

  7. Ranking vs. Preference
     [Figure: a set of items numbered 1 to 5 under Ranking, contrasted with Preference]

  8. Preference on Structured Data

  9. Classification, Regression and Ranking
     • Supervised learning task
        • f: X→Y
     • Classification
        • Y is a finite unordered set
     • Regression
        • Y is a metric space (reals)
     • Ranking and Preference
        • Y is a finite ordered set
        • Y is a non-metric space
     [Figure: The Target Space, relating Classification (finite, unordered), Regression (metric space) and Ranking and Preference (finite, ordered, non-metric space)]

  10. Learning on Structured Data
      • Learning algorithms on discrete structures often derive from vector-based methods
      • Both Kernel Machines and RNNs are suitable for learning on structured domains
      [Figure: Conventional Learning Algorithms, two routes labeled 1 and 2]

  11. Kernels vs. RNNs
      • Kernel Machines
         • Very high-dimensional feature space
         • How to choose the kernel? Prior knowledge, fixed representation
         • Minimize a convex functional (SVM)
      • Recursive Neural Networks
         • Low-dimensional space
         • Task-driven: representation depends on the specific learning task
         • Learn an implicit encoding of relevant information
         • Problem of local minima

  12. A Kernel for Labeled Trees
      • Feature Space
         • Set of all tree fragments (subtrees), with the only constraint that a parent cannot be separated from its children
         • Φn(t) = number of occurrences of tree fragment n in t
      • Bag of "something"
         • A tree is represented by Φ(t) = [Φ1(t), Φ2(t), Φ3(t), …]
         • K(t,s) = Φ(t)∙Φ(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001)
      [Figure: a labeled tree over A, B, C mapped by Φ to its tree fragments]
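The dynamic-programming idea behind K(t,s) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the Collins & Duffy style recursion with no decay factor, treats a "production" as a node label plus the ordered labels of its children, and gives leaves no production of their own. The names `Node`, `production` and `tree_kernel` are hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Node:
    label: str
    children: tuple = ()


def nodes(t):
    """Yield every node of tree t in pre-order."""
    yield t
    for c in t.children:
        yield from nodes(c)


def production(n):
    """A node label together with the ordered labels of its children."""
    return (n.label, tuple(c.label for c in n.children))


def tree_kernel(t, s):
    """Count pairs of identical fragments shared by t and s (assumed recursion)."""
    memo = {}

    def common(a, b):
        # Fragments must contain a whole production, so leaves contribute 0.
        if not a.children or not b.children or production(a) != production(b):
            return 0
        key = (id(a), id(b))
        if key not in memo:
            value = 1
            # Each child is either cut off (the 1) or expanded further (common).
            for ca, cb in zip(a.children, b.children):
                value *= 1 + common(ca, cb)
            memo[key] = value
        return memo[key]

    return sum(common(a, b) for a, b in product(list(nodes(t)), list(nodes(s))))


small = Node("A", (Node("B"), Node("C")))
large = Node("A", (Node("B", (Node("D"),)), Node("C")))
print(tree_kernel(small, large), tree_kernel(large, large))
```

The memo table is what keeps the cost proportional to the number of node pairs rather than to the (exponential) number of fragments.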

  13. Recursive Neural Networks
      • Composition of two adaptive functions
         • φ: transition function, φ_w: X → R^n
         • o: output function, o_w′: R^n → O
      • The functions φ and o are implemented by feedforward NNs
      • Both RNN parameters and representation vectors are found by maximizing the likelihood of the training data
      [Figure: a labeled tree with nodes A, B, C, D mapped to the output space]
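The composition o(φ(x)) can be illustrated with a small forward pass. This is a sketch under stated assumptions, not the network used in the talk: the transition function is taken to be a single tanh layer over a one-hot label plus position-specific child weights, the output function is linear, and trees are plain `(label, [children])` tuples. `RecursiveNet`, `state_dim` and `max_children` are hypothetical names; training is not shown.

```python
import math
import random


def make_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class RecursiveNet:
    def __init__(self, labels, state_dim=4, max_children=2, seed=0):
        rng = random.Random(seed)
        self.index = {lab: i for i, lab in enumerate(labels)}
        self.w_label = make_matrix(state_dim, len(labels), rng)
        # One weight matrix per child position (assumed bounded out-degree).
        self.w_child = [make_matrix(state_dim, state_dim, rng)
                        for _ in range(max_children)]
        self.w_out = [rng.uniform(-0.1, 0.1) for _ in range(state_dim)]

    def phi(self, tree):
        """Transition function: state of a node from its label and child states."""
        label, children = tree
        onehot = [0.0] * len(self.index)
        onehot[self.index[label]] = 1.0
        total = matvec(self.w_label, onehot)
        for k, child in enumerate(children):
            contrib = matvec(self.w_child[k], self.phi(child))
            total = [a + b for a, b in zip(total, contrib)]
        return [math.tanh(a) for a in total]

    def output(self, tree):
        """Output function applied to the state of the root."""
        return sum(w * x for w, x in zip(self.w_out, self.phi(tree)))


net = RecursiveNet(["A", "B", "C"])
tree = ("A", [("B", []), ("C", [])])
print(net.phi(tree), net.output(tree))
```

The same weights are reused at every node, so the network unfolds to match the shape of each input tree.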

  14. Recursive Neural Networks
      [Figure: network unfolding over a labeled tree with nodes A, B, C, D, E and an output network; panels: Prediction Phase, Error Correction]

  15. Preference Models
      • Kernel Preference Model
         • Binary classification of pairwise differences between instances
      • RNNs Preference Model
         • Probabilistic model to find the best alternative
      • Both models use a utility function to evaluate the importance of an element

  16. Utility Function Approach
      • Modelling the importance of an object
         • Utility function U: X→R
         • x>z ↔ U(x)>U(z)
      • If U is linear
         • U(x)>U(z) ↔ wTx>wTz
      • U can also be modelled by a neural network
      • Ranking and preference problems
         • Learn U and then sort by U(x)
      [Figure: example with U(x)=11 and U(z)=3]

  17. Kernel Preference Model
      • x1 = best of (x1,…,xr)
      • Create a set of pairs between x1 and x2,…,xr
      • Set of constraints if U is linear
         • U(x1)>U(xj) ↔ wTx1>wTxj ↔ wT(x1-xj)>0 for j=2,…,r
      • x1-xj can be seen as a positive example
      • Binary classification of differences between instances
      • x → Φ(x): the process can be easily kernelized
      • Note: this model does not take into consideration all the alternatives together, but only two by two
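The constraints wT(x1-xj)>0 can be turned into a small training loop. The sketch below is illustrative only: it assumes explicit feature vectors, a plain (non-voted, non-kernelized) perceptron update on the difference vectors, and the convention that the best alternative is stored at index 0. All function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def pairwise_differences(alternatives, best_index=0):
    """Build x1 - xj for every non-best alternative xj."""
    best = alternatives[best_index]
    return [[b - a for b, a in zip(best, alt)]
            for i, alt in enumerate(alternatives) if i != best_index]


def train_preference_perceptron(sets, dim, epochs=10):
    """Enforce w.(x1 - xj) > 0 with perceptron updates on violated pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for alternatives in sets:
            for d in pairwise_differences(alternatives):
                if dot(w, d) <= 0:
                    w = [wi + di for wi, di in zip(w, d)]
    return w


def predict_best(w, alternatives):
    """Pick the alternative with the highest linear utility w.x."""
    return max(range(len(alternatives)), key=lambda i: dot(w, alternatives[i]))


training_sets = [
    [[2, 0], [1, 1], [0, 3]],   # best alternative first
    [[3, 1], [1, 2], [2, 5]],
]
w = train_preference_perceptron(training_sets, dim=2)
print(w, predict_best(w, [[0, 0], [5, 0]]))
```

Each update only ever sees one pair, which is the "two by two" limitation noted on the slide.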

  18. RNNs Preference Model
      • Set of alternatives (x1,x2,…,xr)
      • U modelled by a recursive neural network architecture
      • Compute U(xi) = o(φ(xi)) for i=1,…,r
      • Softmax function
      • The error (yi - oi) is backpropagated through the whole network
      • Note: the softmax function compares all the alternatives together at once
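The softmax formula itself did not survive in the transcript. The sketch below assumes the standard form o_i = exp(U(x_i)) / Σ_j exp(U(x_j)) with a one-hot target y and a negative log-likelihood loss, under which the error signal at the utilities is exactly (y_i - o_i). The function names are hypothetical.

```python
import math


def softmax(utilities):
    """Normalize utilities into probabilities over all alternatives."""
    m = max(utilities)  # subtract the maximum for numerical stability
    exps = [math.exp(u - m) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]


def preference_error(utilities, correct_index):
    """Return the negative log-likelihood and the error vector (y_i - o_i)."""
    o = softmax(utilities)
    y = [1.0 if i == correct_index else 0.0 for i in range(len(o))]
    loss = -math.log(o[correct_index])
    error = [yi - oi for yi, oi in zip(y, o)]
    return loss, error


loss, error = preference_error([0.0, 0.0], correct_index=0)
print(loss, error)
```

Because the normalizer sums over every alternative, changing the utility of one candidate changes the error signal of all the others.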

  19. Learning Problems
      • First Pass Attachment
         • Modeling of a psycholinguistic phenomenon
      • Reranking Task
         • Reranking the parse trees output by a statistical parser

  20. First Pass Attachment (FPA)
      • The grammar introduces some ambiguities
         • A set of alternatives for each word but only one is correct
      • The first pass attachment can be modelled as a preference problem
      [Figure: partial parse trees for "It has no bearing" with four numbered candidate attachments (1 to 4) for the next word "on"]

  21. Heuristics for Prediction Enhancement
      • Evaluation Measure = (# correct trees ranked in first position) / (total number of sets)
      • Specializing the FPA prediction for each class of word
         • Group the words in 10 classes (verbs, articles, …)
         • Learn a different classifier for each class of words
      • Removing nodes from the parse tree that aren't important for choosing between different alternatives
         • Tree reduction
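The evaluation measure on this slide is a simple ratio and can be computed as below. This is a minimal sketch: each set of alternatives is assumed to be given as a list of model scores plus the index of the correct tree, and ties are broken in favor of the first alternative. The name `first_position_accuracy` is hypothetical.

```python
def first_position_accuracy(scored_sets):
    """Fraction of sets whose correct tree receives the highest score.

    scored_sets: list of (scores, correct_index) pairs, one per set of alternatives.
    """
    hits = 0
    for scores, correct_index in scored_sets:
        top = max(range(len(scores)), key=scores.__getitem__)
        if top == correct_index:
            hits += 1
    return hits / len(scored_sets)


example = [
    ([0.1, 0.9], 1),        # correct tree ranked first
    ([0.5, 0.2, 0.1], 0),   # correct tree ranked first
    ([0.3, 0.7], 0),        # correct tree ranked second
]
print(first_position_accuracy(example))
```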

  22. Experimental Setup
      • Wall Street Journal (WSJ) Section of Penn TreeBank
         • Realistic Corpus of Natural Language
         • 40,000 sentences, 1 million words
         • Average sentence length: 25 words
      • Standard Benchmark in Computational Linguistics
         • Training on sections 2-21, test on section 23 and validation on section 24

  23. Voted Perceptron (VP)
      • FPA + WSJ = 100 million trees for training
      • Voted Perceptron instead of SVM (Freund & Schapire, Machine Learning 1999)
         • Online algorithm for binary classification of instances based on the perceptron algorithm (simple and efficient)
         • Prediction value: weighted sum of all training weight vectors
         • Performance comparable to maximal-margin classifiers (SVM)
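The bookkeeping behind "weighted sum of all training weight vectors" can be sketched as follows. Assumptions: explicit feature vectors rather than a kernel, labels in {-1, +1}, and each intermediate weight vector weighted by the number of examples it survived. Note that this follows the slide's weighted-sum description; the voted rule of Freund & Schapire takes the sign of each vector's prediction before weighting. Function names are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_voted_perceptron(examples, dim, epochs=1):
    """Return the list of (weight_vector, survival_count) pairs."""
    w = [0.0] * dim
    count = 0
    history = []
    for _ in range(epochs):
        for x, y in examples:
            if y * dot(w, x) <= 0:
                # Mistake: store the old vector with its count, then update.
                history.append((w, count))
                w = [wi + y * xi for wi, xi in zip(w, x)]
                count = 1
            else:
                count += 1
    history.append((w, count))
    return history


def prediction_value(history, x):
    """Sum of w.x over all stored vectors, weighted by survival counts."""
    return sum(c * dot(w, x) for w, c in history)


data = [([1, 0], 1), ([0, 1], -1), ([2, 1], 1), ([1, 3], -1)]
history = train_voted_perceptron(data, dim=2, epochs=2)
print(prediction_value(history, [1, 0]), prediction_value(history, [0, 1]))
```

In the kernelized setting each stored vector is represented by the training examples that caused mistakes, which is why the cost of a prediction grows with the number of mistakes made so far.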

  24. Kernel VP vs. RNNs

  25. Kernel VP vs. RNNs

  26. Kernel VP vs. RNNs: Modularization

  27. Small Datasets: No Modularization

  28. Complexity Comparison
      • Unlike RNNs, VP does not scale linearly with the number of training examples
      • Computational cost
         • Small datasets
            • 5 splits of 100 sentences ~ a week @ 2GHz CPU
            • CPU(VP) ~ CPU(RNN)
         • Large datasets (all 40,000 sentences)
            • VP took over 2 months to complete an epoch @ 2GHz CPU
            • RNN learns in 1-2 epochs ~ 3 days @ 2GHz CPU
      • VP is smooth with respect to training iterations

  29. Reranking Task
      • Reranking problem: rerank the parse trees generated by a statistical parser
      • Same problem setting as FPA (preference on forests)
      • 1 forest/sentence vs. 1 forest/word (less computational cost involved)
      [Figure: Statistical Parser]

  30. Evaluation: Parseval Measures
      • Standard evaluation measure
         • Labeled Precision (LP)
         • Labeled Recall (LR)
         • Crossing Brackets (CBs)
      • Compare a parse from a parser with a hand parse of a sentence
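The three measures can be sketched on bracket lists. The slide gives no formulas, so the code assumes the usual definitions: LP is matched constituents over constituents in the parser output, LR is matched constituents over constituents in the hand parse, and a crossing bracket is a parser constituent whose span overlaps a hand-parse span without either containing the other. Constituents are assumed to be `(label, start, end)` triples with half-open word spans; the function names are hypothetical.

```python
from collections import Counter


def labeled_precision_recall(candidate, gold):
    """Return (LP, LR) for two lists of (label, start, end) constituents."""
    cand, ref = Counter(candidate), Counter(gold)
    matched = sum((cand & ref).values())
    return matched / sum(cand.values()), matched / sum(ref.values())


def crossing_brackets(candidate, gold):
    """Count candidate constituents that cross at least one gold constituent."""
    def crosses(a, b):
        (_, s1, e1), (_, s2, e2) = a, b
        return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

    return sum(1 for c in candidate if any(crosses(c, g) for g in gold))


gold = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5)]
candidate = [("S", 0, 5), ("NP", 0, 1), ("VP", 1, 3), ("NP", 3, 5)]
lp, lr = labeled_precision_recall(candidate, gold)
print(lp, lr, crossing_brackets(candidate, gold))
```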

  31. Reranking Task

  32. Why do RNNs outperform Kernel VP?
      • Hypothesis 1
         • Kernel Function: feature space not focused on the specific learning task
      • Hypothesis 2
         • Kernel Preference Model worse than RNNs preference model

  33. Linear VP on RNN Representation
      • Checking hypothesis 1
         • Train VP on RNN representation
         • The tree kernel is replaced by a linear kernel
         • State vector representation of parse trees generated by RNN as input to VP
      • Linear VP is trained on RNN state vectors

  34. Linear VP on RNN Representation

  35. Conclusions
      • RNNs show better generalization properties…
         • … also on small datasets
         • … at smaller computational cost
      • The problem is…
         • … neither the kernel function
         • … nor the VP algorithm
         • Reasons: linear VP on RNN representation experiment
      • The problem is…
         • … the preference model!
         • Reasons: the kernel preference model does not take into consideration all the alternatives together, but only two by two, as opposed to the RNN

  36. Acknowledgements
      Thanks to: Alessio Ceroni, Alessandro Vullo, Andrea Passerini, Giovanni Soda
