
Fabio Massimo Zanzotto ART Group Dipartimento di Ingegneria dell’Impresa

Distributed Tree Kernels and Distributional Semantics: Between Syntactic Structures and Compositional Distributional Semantics. Fabio Massimo Zanzotto, ART Group, Dipartimento di Ingegneria dell’Impresa, University of Rome “Tor Vergata”.


Presentation Transcript


  1. Distributed Tree Kernels and Distributional Semantics: Between Syntactic Structures and Compositional Distributional Semantics. Fabio Massimo Zanzotto, ART Group, Dipartimento di Ingegneria dell’Impresa, University of Rome “Tor Vergata”

  2. Prequel

  3. Recognizing Textual Entailment (RTE). P1: T1 → H1, with T1 “Farmers feed cows animal extracts” and H1 “Cows eat animal extracts”. The task (Dagan et al., 2005): given a text T and a hypothesis H, decide if T implies H • RTE as a classification task • Selecting the best learning algorithm • Defining the feature space

  4. Recognizing Textual Entailment (RTE). P1: T1 → H1, with T1 “Farmers feed cows animal extracts” and H1 “Cows eat animal extracts”

  5. P3: T3 H3 P1: T1  H1 P2: T2  H2 T1 T2 T3 “They feed dolphins fish” “Mothersfeedbabies milk” “Farmersfeedcowsanimalextracts” H1 H2 H3 “Cowseatanimalextracts” “Fisheatdolphins” “Babieseat milk” Learning RTE Classifiers: the featurespace Training examples Rules with Variables(First-orderrules) RelevantFeatures feed eat feed eat feed eat X X X Y Y Y X X Y Y Y X Classification

  6. Learning RTE Classifiers: the feature space. Rules with variables (first-order rules). [Figure: a first-order rule over parse trees (S → NP VP; VP → VB NP NP) relating “feed” and “eat” through the variables X and Y; RTE 2 results.] Zanzotto & Moschitti, Automatic learning of textual entailments with cross-pair similarities, Coling-ACL, 2006

  7. Adding semantics: Distributional Semantics. [Figure: two tree rewrite rules with variable X, one pairing “killed” with “died” and one pairing “murdered” with “died”, linked by distributional semantics.] Promising! Mehdad, Moschitti, Zanzotto, Syntactic/Semantic Structures for Textual Entailment Recognition, Proceedings of NAACL, 2010

  8. Compositional Distributional Semantics (CDS). Mitchell & Lapata (2008) set a general model for bigrams that assigns a distributional meaning to a sequence of two words “x y”: z = f(x, y, R, K), where • R is the relation between x and y • K is external knowledge. An active research area!

  9. Compositional Distributional Semantics (CDS). A “distributional” semantic space; composing “distributional” meaning. [Figure: vectors for “moving”, “car” and “hands”, and for the composed “moving car” and “moving hands”.]

  10. Compositional Distributional Semantics (CDS). Mitchell & Lapata (2008) set a general model for bigrams that assigns a distributional meaning to a sequence of two words “x y”: z = f(x, y, R, K), where • R is the relation between x and y • K is external knowledge. Example: x = “moving”, y = “hands”, z = “moving hands”.
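As a concrete reading of the general model z = f(x, y, R, K), the sketch below shows the two simplest instances discussed by Mitchell & Lapata (2008), vector addition and element-wise multiplication, both of which ignore R and K. The three-dimensional vectors for "moving" and "hands" are made-up toy values, not data from the talk.

```python
import numpy as np

def additive(x, y):
    """Additive composition: z = x + y (R and K are ignored)."""
    return x + y

def multiplicative(x, y):
    """Multiplicative composition: z = x * y, element by element."""
    return x * y

# Hypothetical distributional vectors over three context dimensions.
moving = np.array([0.2, 0.7, 0.1])
hands = np.array([0.5, 0.4, 0.0])

z_add = additive(moving, hands)        # candidate vector for "moving hands"
z_mul = multiplicative(moving, hands)  # keeps only the shared contexts
print(z_add, z_mul)
```

The multiplicative variant zeroes any dimension that is zero in either word, which is why it behaves like an intersection of contexts.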

  11. CDS: Full Additive Model. The full additive model: z = A_R x + B_R y. Matrices A_R and B_R can be estimated with: • positive examples taken from dictionaries • multivariate regression models. Example of a dictionary entry: contact /ˈkɒntækt/ [kon-takt] 2. close interaction. Zanzotto, Korkontzelos, Fallucchi, Manandhar, Estimating Linear Models for Compositional Distributional Semantics, Proceedings of the 23rd COLING, 2010
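To make the estimation step concrete, here is a minimal sketch of fitting a full additive model z = A x + B y by ordinary least squares. It assumes training triples (x, y, z) are already available as rows of three matrices; the synthetic data, the dimension `d` and the function names are illustrative assumptions, and plain least squares stands in for whatever regression variant the cited paper actually used.

```python
import numpy as np

def estimate_full_additive(X, Y, Z):
    """Least-squares estimate of A, B such that Z ~ X A^T + Y B^T.

    X, Y, Z are (n_examples, d) arrays whose rows are the vectors of the
    first word, the second word and the target meaning."""
    d = X.shape[1]
    XY = np.hstack([X, Y])                       # (n, 2d) stacked inputs
    W, *_ = np.linalg.lstsq(XY, Z, rcond=None)   # (2d, d) solution
    return W[:d].T, W[d:].T                      # A and B, each (d, d)

def compose(A, B, x, y):
    """Apply the full additive model to one pair of word vectors."""
    return A @ x + B @ y

# Synthetic check: generate data from known matrices and recover them.
rng = np.random.default_rng(0)
d, n = 5, 200
A_true, B_true = rng.normal(size=(d, d)), rng.normal(size=(d, d))
X, Y = rng.normal(size=(n, d)), rng.normal(size=(n, d))
Z = X @ A_true.T + Y @ B_true.T

A_hat, B_hat = estimate_full_additive(X, Y, Z)
print(np.abs(A_hat - A_true).max(), np.abs(B_hat - B_true).max())
```

With real dictionary-derived examples the targets are noisy, so the recovered matrices would only approximate the best linear fit.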

  12. CDS: Recursive Full Additive Model. Let’s scale up to sentences by recursively applying the model! [Figure: the tree of “cows eat animal extracts” with relations VN, VN and NN, composed by nested applications of f.] Let’s apply it to RTE: extremely poor results.

  13. Recursive Full Additive Model: a closer look. Evaluating the similarity between f(«cows eat animal extracts») and f(«chickens eat beef extracts»). Ferrone & Zanzotto, Linear Compositional Distributional Semantics and Structural Kernels, Proceedings of Joint Symposium of Semantic Processing, 2013

  14. Recursive Full Additive Model: a closer look. [Figure: the similarity split into “structure” and “meaning” components, annotated with “?” and “<1”.] Ferrone & Zanzotto, Linear Compositional Distributional Semantics and Structural Kernels, Proceedings of Joint Symposium of Semantic Processing, 2013

  15. The prequel … • Recognizing Textual Entailment • Distributional Semantics • Feature Spaces of the Rules with Variables • Binary CDS • Recursive CDS • adding distributional semantics (structure, meaning)

  16. Structure: Distributed Tree Kernels

  17. Structure: Tree Kernels. [Figure: the parse tree T of “Farmers feed cows animal extracts” and its tree fragments t_i, t_j, …] Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  18. Tree Kernels in Smaller Vectors. [Figure: the parse tree T of “Farmers feed cows animal extracts” and its tree fragments t_i, t_j, … mapped to small vectors.] CDS desiderata: • Vectors are smaller • Vectors are obtained with a Compositional Function. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  19. Names for the «Distributed» World. As we are encoding trees in small vectors, the tradition is distributed structures (Plate, 1994): • Distributed Tree Kernels (DTK) • Distributed Trees (DT) • Distributed Tree Fragments (DTF). Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  20. DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base of R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  21. DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base that embeds R^m in R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  22. Compositionally building Distributed Tree Fragments. Basic elements: • N, a set of nearly orthogonal random vectors for node labels • a basic vector composition function with some ideal properties. A distributed tree fragment is the application of the composition function on the node vectors, according to the order given by a depth-first visit of the tree. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012
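The claim that random vectors are "nearly orthogonal" can be checked numerically. The sketch below assumes Gaussian vectors normalized to unit length, which is one common way to obtain such a set; the sizes `m` and `d` and the helper name are illustrative choices, not values from the talk.

```python
import numpy as np

def random_unit_vectors(m, d, seed=0):
    """Draw m Gaussian vectors in R^d and scale each one to unit norm."""
    rng = np.random.default_rng(seed)
    V = rng.normal(size=(m, d))
    return V / np.linalg.norm(V, axis=1, keepdims=True)

V = random_unit_vectors(m=50, d=4096)
G = V @ V.T                                    # Gram matrix of dot products
off_diag = G[~np.eye(len(G), dtype=bool)]      # products of distinct vectors
max_off = np.abs(off_diag).max()
print("largest |dot product| between distinct vectors:", max_off)
```

Dot products between distinct vectors shrink roughly like 1/sqrt(d), so a larger d gives a set that is closer to orthonormal.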

  23. Building Distributed Tree Fragments. Properties of the ideal function: • Approximation • Non-commutativity with a very high degree k • Non-associativity • Bilinearity. We demonstrated that DTFs are a nearly orthonormal base. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  24. DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base that embeds R^m in R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  25. Building Distributed Trees. Given a tree T, the distributed representation of its subtrees is the vector obtained as a (weighted) sum of the distributed tree fragments of the subtrees in S(T), where S(T) is the set of the subtrees of T. [Figure: the parse tree of “Farmers feed cows animal extracts” and an example of S(·) applied to a small tree.] Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  26. Building Distributed Trees: a more efficient approach. The distributed tree is computed from a function s(n) over the nodes n ∈ N(T), where N(T) is the set of nodes of T and s(n) is defined by cases: one value if n is terminal, another if n → c1…cm. Computing a Distributed Tree is linear with respect to the size of N(T). Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012
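The recursive computation can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes s(n) is the zero vector for terminals and, for n → c1…cm, the label vector of n composed in turn with (c_i + s(c_i)) for each child; the decay factor λ is fixed to 1; and shuffled circular convolution is used as the composition function. The tree encoding as nested tuples, the dimension `D` and all function names are assumptions made for the example. The check at the end compares the one-pass result with an explicit sum over every tree fragment.

```python
import itertools
import numpy as np

D = 1024
rng = np.random.default_rng(0)
P1, P2 = rng.permutation(D), rng.permutation(D)   # two fixed shuffles
_label_vecs = {}

def vec(label):
    """Random unit vector for a node label, drawn once and cached."""
    if label not in _label_vecs:
        v = rng.normal(size=D)
        _label_vecs[label] = v / np.linalg.norm(v)
    return _label_vecs[label]

def compose(a, b):
    """Shuffled circular convolution, computed with the FFT."""
    return np.fft.irfft(np.fft.rfft(a[P1]) * np.fft.rfft(b[P2]), n=D)

def label(node):
    return node if isinstance(node, str) else node[0]

def distributed_tree(tree):
    """One post-order pass: each node is visited once, s(n) values are summed."""
    total = np.zeros(D)
    def s(node):
        nonlocal total
        if isinstance(node, str):              # terminal: no fragment rooted here
            return np.zeros(D)
        out = vec(node[0])
        for child in node[1:]:
            out = compose(out, vec(label(child)) + s(child))
        total = total + out
        return out
    s(tree)
    return total

def fragments(node):
    """Vectors of all tree fragments rooted at node, built one at a time."""
    if isinstance(node, str):
        return []
    options = [[vec(label(c))] + fragments(c) for c in node[1:]]
    result = []
    for choice in itertools.product(*options):
        out = vec(node[0])
        for c in choice:
            out = compose(out, c)
        result.append(out)
    return result

def nodes(node):
    yield node
    if not isinstance(node, str):
        for child in node[1:]:
            yield from nodes(child)

tree = ("S", ("NP", "Farmers"), ("VP", ("V", "feed"), ("NP", "cows")))
dt = distributed_tree(tree)
all_frags = [f for n in nodes(tree) for f in fragments(n)]
print(len(all_frags), np.allclose(dt, sum(all_frags)), float(dt @ dt))
```

Because the composition is bilinear, the single pass yields the same vector as the explicit enumeration while visiting each node once. The dot product of the distributed tree with itself is then expected to lie near the fragment count, with an error that depends on D.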

  27. DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base that embeds R^m in R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels. Property 1 (Nearly Unit Vectors); Property 2 (Nearly Orthogonal Vectors). Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

  28. Task-based Analysis for x • Tasks: Recognizing Textual Entailment, Question Classification • with these realizations of the ideal function: • Shuffled normalized element-wise product • Shuffled circular convolution. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012
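The two realizations named on this slide can be sketched and tested against the ideal properties listed earlier (bilinearity, non-commutativity, non-associativity). This is an illustrative implementation under assumptions: the shuffles are two fixed random permutations, and the scale `gamma = sqrt(D)` for the element-wise product is an assumed normalization chosen so that composed unit vectors keep a norm near 1; the exact normalization used in the paper may differ.

```python
import numpy as np

D = 2048
rng = np.random.default_rng(0)
P1, P2 = rng.permutation(D), rng.permutation(D)   # two fixed random shuffles

def shuffled_circular_convolution(a, b):
    """Circular convolution of the shuffled inputs, computed with the FFT."""
    return np.fft.irfft(np.fft.rfft(a[P1]) * np.fft.rfft(b[P2]), n=D)

def shuffled_elementwise_product(a, b, gamma=np.sqrt(D)):
    """Element-wise product of the shuffled inputs, rescaled by gamma.

    gamma = sqrt(D) is an assumption that keeps norms close to 1."""
    return gamma * a[P1] * b[P2]

def unit(v):
    return v / np.linalg.norm(v)

a, b, c = (unit(rng.normal(size=D)) for _ in range(3))

props, norms = {}, {}
for f in (shuffled_circular_convolution, shuffled_elementwise_product):
    bilinear = np.allclose(f(a + b, c), f(a, c) + f(b, c))
    commutative = np.allclose(f(a, b), f(b, a))
    associative = np.allclose(f(f(a, b), c), f(a, f(b, c)))
    props[f.__name__] = (bilinear, commutative, associative)
    norms[f.__name__] = float(np.linalg.norm(f(a, b)))
    print(f.__name__, props[f.__name__], norms[f.__name__])
```

Applying different permutations to the two arguments is what breaks commutativity and associativity; without the shuffles both operations would be commutative.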

  29. Remarks • Distributed Tree Kernels (DTK) approximate Tree Kernels • Distributed Trees (DT) can be efficiently computed • Distributed Tree Fragments (DTF) are a nearly orthonormal base that embeds R^m in R^d

  30. Side effect: reduced time complexity • Tree kernels (TK) (Collins & Duffy, 2001) have quadratic complexity • Current techniques control this complexity (Moschitti, 2006), (Rieck et al., 2010), (Shin et al., 2011) • DTKs change the complexity as they can be used with Linear Kernel Machines. Notation: n is the number of training examples; |N(T)| is the number of nodes of the tree T. Zanzotto & Dell'Arciprete, Distributed Convolution Kernels on Countable Sets, Journal of Machine Learning Research, accepted conditioned to minor revisions

  31. Sequel • Towards Structured Prediction: • Distributed Representation Parsing • Generalizing the theory: • Distributed Convolution Kernels on Countable Sets • Adding back distributional semantics: • Distributed Smoothed Tree Kernels

  32. Sequel • Towards Structured Prediction: • Distributed Representation Parsing • Generalizing the theory: • Distributed Convolution Kernels on Countable Sets • Adding back distributional semantics: • Distributed Smoothed Tree Kernels

  33. Distributed Representation Parsing (DRP): the idea. Distributed Tree Encoder (DT). Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

  34. Distributed Representation Parsing (DRP): the idea. Symbolic route: Symbolic Parser (SP) followed by the Distributed Tree Encoder (DT). Distributed Representation Parsing (DRP): Sentence Encoder (D) followed by the Transducer (P). Example input: «We booked the flight». Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

  35. DRP: Sentence Encoder • Non-Lexicalized Sentence Models • Bag-of-postags • N-grams of postags • Lexicalized Sentence Models • Unigrams • Unigrams + N-grams of postags. [Figure: the DRP pipeline, Sentence Encoder (D) followed by Transducer (P), applied to «We booked the flight».] Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

  36. DRP: Transducer. Estimation: Principal Component Analysis and Partial Least Square Estimation, with T = PS. Approximation: Moore-Penrose pseudoinverse (Penrose, 1955) to derive P, where k is the number of selected singular values. [Figure: the DRP pipeline, Sentence Encoder (D) followed by Transducer (P), applied to «We booked the flight».] Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013
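One way to read the estimation T = PS is as a least-squares problem solved with a rank-k pseudoinverse of S. The sketch below assumes that the columns of S are sentence encodings and the columns of T are the corresponding distributed trees; the function names, matrix sizes and synthetic data are illustrative assumptions rather than the setup of the cited paper.

```python
import numpy as np

def truncated_pinv(S, k):
    """Moore-Penrose pseudoinverse of S restricted to its k largest singular values."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    k = min(k, int(np.sum(sigma > 1e-12)))      # never invert (near-)zero values
    return Vt[:k].T @ np.diag(1.0 / sigma[:k]) @ U[:, :k].T

def estimate_transducer(T, S, k):
    """Estimate P in T = P S as P = T S^+ with a rank-k pseudoinverse."""
    return T @ truncated_pinv(S, k)

# Synthetic check: m-dimensional sentence encodings, d-dimensional trees.
rng = np.random.default_rng(0)
m, d, n = 20, 30, 100
P_true = rng.normal(size=(d, m))
S = rng.normal(size=(m, n))                     # one column per sentence
T = P_true @ S                                  # one column per distributed tree

P_full = estimate_transducer(T, S, k=m)         # all singular values kept
P_low = estimate_transducer(T, S, k=5)          # smoother, rank-5 estimate
print(np.abs(P_full - P_true).max(), np.linalg.matrix_rank(P_low))
```

Keeping fewer singular values trades reconstruction accuracy for robustness to noise in S, which is why k is treated as a parameter to tune.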

  37. Experimental set-up • Data • English Penn Treebank with standard split • Distributed trees with 3 values of λ (0, 0.2, 0.4) and 2 models (Unlexicalized/Lexicalized) • Dimension of the reduced space (4,096 and 8,192) • System Comparison • Distributed Symbolic Parser DSP(s) = DT(SP(s)) • Symbolic Parser: Bikel Parser (Bikel, 2004) with Collins Settings (Collins, 2003) • Parameter Estimation • Parameters • k for the pseudo-inverse • j for the sentence encoders D • Maximization of the similarity (see parsing performance) on Section 24. Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

  38. «Distributed» Parsing Performance. Evaluation measure. [Results for Unlexicalized Trees and Lexicalized Trees.] Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

  39. Sequel • Towards Structured Prediction: • Distributed Representation Parsing • Generalizing the theory: • Distributed Convolution Kernels on Countable Sets • Adding back distributional semantics: • Distributed Smoothed Tree Kernels

  40. Distributed Convolution Kernels on Countable Sets. The following general property holds: [equation relating CK and DCK], where • CK is a convolution kernel • DCK is the related distributed convolution kernel. Implemented Distributed Convolution Kernels: • Distributed Tree Kernel • Distributed Subpath Kernel • Distributed Route Kernel • Distributed String Kernel • Distributed Partial Tree Kernel. Zanzotto & Dell'Arciprete, Distributed Convolution Kernels on Countable Sets, Journal of Machine Learning Research, accepted conditioned to minor revisions

  41. Sequel • Towards Structured Prediction: • Distributed Representation Parsing • Generalizing the theory: • Distributed Convolution Kernels on Countable Sets • Adding back distributional semantics: • Distributed Smoothed Tree Kernels

  42. Going back to RTE and distributional semantics. [Figure: two tree rewrite rules with variable X, one pairing “killed” with “died” and one pairing “murdered” with “died”, linked by distributional semantics.] Promising! Mehdad, Moschitti, Zanzotto, Syntactic/Semantic Structures for Textual Entailment Recognition, Proceedings of NAACL, 2010

  43. A Novel Look at the Recursive Full Additive Model. [Figure: the similarity split into “structure” and “meaning” components, annotated with “?” and “<1”.] Ferrone & Zanzotto, Linear Compositional Distributional Semantics and Structural Kernels, Proceedings of Joint Symposium of Semantic Processing, 2013

  44. A Novel Look at the Recursive Full Additive Model. Choosing: [equation by cases, with one value if Struct_i = Struct_j and another if Struct_i ≠ Struct_j]. Zanzotto, Ferrone, Baroni, When the whole is not greater than the sum of its parts: A decompositional look at compositional distributional semantics, re-submitted

  45. «Convolution Conjecture». The similarity equations between two vectors/tensors obtained with CDSMs can be decomposed into operations performed on the subparts of the input phrases. Compositional Distributional Models based on linear algebra and Convolution Kernels are intimately related. For example: Convolution Kernel and Recursive Full Additive Model. Zanzotto, Ferrone, Baroni, When the whole is not greater than the sum of its parts: A decompositional look at compositional distributional semantics, re-submitted

  46. Distributed Smoothed Tree Kernels. [Figure: lexicalized tree fragments whose root carries a head word, such as S:killed and S:murdered; synt(·) returns the syntactic fragment without the head and head(·) returns the head word, e.g. “killed”.] Ferrone, Zanzotto, Towards Syntax-aware Compositional Distributional Semantic Models, Proceedings of CoLing, 2014

  47. Distributed Smoothed Tree Kernels. In general, for a lexicalized tree, we define: [equation]. Ferrone, Zanzotto, Towards Syntax-aware Compositional Distributional Semantic Models, Proceedings of CoLing, 2014

  48. Distributed Smoothed Tree Kernels. Distributed Smoothed Tree: [equation]. The resulting dot (Frobenius) product: [equation]. Ferrone, Zanzotto, Towards Syntax-aware Compositional Distributional Semantic Models, Proceedings of CoLing, 2014

  49. What’s next...

  50. Prequel and Sequel • Distributional Semantics • Recognizing Textual Entailment • Feature Spaces of the Rules with Variables • Binary CDS • Tree Kernels • Recursive CDS • adding distributional semantics (structure, meaning) • Distributed Tree Kernels • Distributed Convolution Kernels on Countable Sets • Distributed Representation Parsing
