Fabio Massimo Zanzotto ART Group Dipartimento di Ingegneria dell’Impresa

Distributed Tree Kernels and Distributional Semantics: Between Syntactic Structures and Compositional Distributional Semantics Fabio Massimo Zanzotto ART Group Dipartimento di Ingegneria dell’Impresa University of Rome “Tor Vergata”

Prequel

Recognizing Textual Entailment (RTE) P1: T1 → H1 T1: “Farmers feed cows animal extracts” H1: “Cows eat animal extracts” The task (Dagan et al., 2005): given a text T and a hypothesis H, decide if T implies H • RTE as a classification task • Selecting the best learning algorithm • Defining the feature space

Recognizing Textual Entailment (RTE) P1: T1 → H1 T1: “Farmers feed cows animal extracts” H1: “Cows eat animal extracts”

P3: T3 H3 P1: T1  H1 P2: T2  H2 T1 T2 T3 “They feed dolphins fish” “Mothersfeedbabies milk” “Farmersfeedcowsanimalextracts” H1 H2 H3 “Cowseatanimalextracts” “Fisheatdolphins” “Babieseat milk” Learning RTE Classifiers: the featurespace Training examples Rules with Variables(First-orderrules) RelevantFeatures feed eat feed eat feed eat X X X Y Y Y X X Y Y Y X Classification

Learning RTE Classifiers: the feature space • Rules with Variables (First-order rules) • RTE 2 Results [Figure: syntactic tree pair for a feed/eat rule with variables X and Y] Zanzotto & Moschitti, Automatic learning of textual entailments with cross-pair similarities, Coling-ACL, 2006

Adding semantics: Distributional Semantics [Figure: tree pairs with variable X for killed/died and for murdered/died, related through Distributional Semantics] Promising!!! Mehdad, Moschitti, Zanzotto, Syntactic/Semantic Structures for Textual Entailment Recognition, Proceedings of NAACL, 2010

Compositional Distributional Semantics (CDS) Mitchell & Lapata (2008) set a general model for bigrams that assigns a distributional meaning to a sequence of two words “x y”: • R is the relation between x and y • K is external knowledge An active research area!

Compositional Distributional Semantics (CDS) A “distributional” semantic space • Composing “distributional” meaning [Figure: vectors for car, moving, hands, moving car, moving hands]

Compositional Distributional Semantics (CDS) Mitchell & Lapata (2008) set a general model for bigrams that assigns a distributional meaning to a sequence of two words “x y”: • R is the relation between x and y • K is external knowledge [Figure: f maps x = moving and y = hands to z = moving hands]
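
For reference, the general model named on this slide can be written out as below. The first line is reconstructed from the symbols that survive on the slide (x, y, z, f, R, K); the two instances are the additive and multiplicative models discussed by Mitchell & Lapata (2008). The vector notation is an assumption, not copied from the slide.

```latex
\[
  \vec{z} = f(\vec{x}, \vec{y}, R, K)
\]
\[
  \text{additive: } \vec{z} = \vec{x} + \vec{y}
  \qquad
  \text{multiplicative: } \vec{z} = \vec{x} \odot \vec{y}
\]
```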

CDS: Full Additive Model The full additive model: z = A_R x + B_R y Matrices A_R and B_R can be estimated with: • positive examples taken from dictionaries • multivariate regression models Example dictionary entry: contact /ˈkɒntækt/ [kon-takt] 2. close interaction Zanzotto, Korkontzelos, Fallucchi, Manandhar, Estimating Linear Models for Compositional Distributional Semantics, Proceedings of the 23rd COLING, 2010
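
A minimal sketch of how the two matrices of a full additive model could be estimated by multivariate least-squares regression. It assumes the model has the form z = A x + B y for a fixed relation R, and that training triples (x, y, z) are stacked row-wise; the function names and the synthetic data are hypothetical and only illustrate the idea.

```python
import numpy as np

def estimate_full_additive(X, Y, Z):
    """Least-squares estimate of A and B such that Z ≈ X A^T + Y B^T.

    X, Y, Z have shape (n_examples, d): rows are the vectors of the first
    word, the second word, and the target meaning (e.g. from a dictionary).
    """
    XY = np.hstack([X, Y])                      # (n, 2d): each row is [x ; y]
    W, *_ = np.linalg.lstsq(XY, Z, rcond=None)  # (2d, d), solves XY @ W ≈ Z
    d = X.shape[1]
    A = W[:d].T                                 # (d, d), multiplies x
    B = W[d:].T                                 # (d, d), multiplies y
    return A, B

def compose(A, B, x, y):
    """Apply the estimated model to one pair of word vectors."""
    return A @ x + B @ y

# Synthetic check: generate data from known matrices and recover them.
rng = np.random.default_rng(0)
d, n = 5, 200
A_true = rng.normal(size=(d, d))
B_true = rng.normal(size=(d, d))
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d))
Z = X @ A_true.T + Y @ B_true.T
A_hat, B_hat = estimate_full_additive(X, Y, Z)
```

With real distributional vectors the fit is only approximate, and a regularized or reduced-rank regression may be preferable to plain least squares.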

CDS: Recursive Full Additive Model Let’s scale up to sentences by recursively applying the model! [Figure: dependency structure of “cows eat animal extracts” with relations VN, VN, NN, and the recursive application of f] Let’s apply it to RTE: extremely poor results

Recursive Full Additive Model: a closer look Evaluating the similarity between f(«cows eat animal extracts») and f(«chickens eat beef extracts») Ferrone & Zanzotto, Linear Compositional Distributional Semantics and Structural Kernels, Proceedings of Joint Symposium of Semantic Processing, 2013

Recursive Full Additive Model: a closer look [Figure: terms of the similarity labeled “structure” and “meaning”, with annotations “?” and “<1”] Ferrone & Zanzotto, Linear Compositional Distributional Semantics and Structural Kernels, Proceedings of Joint Symposium of Semantic Processing, 2013

The prequel … • Recognizing Textual Entailment • Distributional Semantics • Feature Spaces of the Rules with Variables • Binary CDS • Recursive CDS • structure • adding distributional semantics • meaning

structure Distributed Tree Kernels

structure Tree Kernels [Figure: the parse tree T of “Farmers feed cows animal extracts” mapped to a vector indexed by its tree fragments … ti … tj …] Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

Tree Kernels in Smaller Vectors [Figure: the parse tree T of “Farmers feed cows animal extracts” and its tree fragments … ti … tj … mapped to a smaller vector] CDS desiderata: - Vectors are smaller - Vectors are obtained with a Compositional Function Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

Names for the «Distributed» World As we are encoding trees in small vectors, the tradition is distributed structures (Plate, 1994) • Distributed Tree Kernels (DTK) • Distributed Trees (DT) • Distributed Tree Fragments (DTF) Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base of R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base that embeds R^m in R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

Compositionally building Distributed Tree Fragments Basic elements: • N, a set of nearly orthogonal random vectors for node labels • a basic vector composition function with some ideal properties A distributed tree fragment is the application of the composition function on the node vectors, according to the order given by a depth-first visit of the tree. Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012
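
The first basic element can be illustrated with a short sketch: random vectors in a high-dimensional space are nearly orthogonal to each other. The sampling scheme (Gaussian entries, then normalization), the dimension 4096, and the helper name are assumptions made for the illustration, not necessarily the choices made in the paper.

```python
import numpy as np

def random_label_vectors(labels, d, seed=0):
    """Assign one random unit vector of dimension d to each node label."""
    rng = np.random.default_rng(seed)
    vecs = rng.normal(size=(len(labels), d))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # unit length
    return dict(zip(labels, vecs))

labels = ["S", "NP", "VP", "VB", "NNS", "NN"]
N = random_label_vectors(labels, d=4096)

# Gram matrix of the label vectors: close to the identity matrix.
M = np.stack([N[label] for label in labels])
gram = M @ M.T
off_diag = np.abs(gram - np.eye(len(labels))).max()
```

The largest off-diagonal entry shrinks roughly as 1/sqrt(d), which is why the set behaves like an approximately orthonormal base as d grows.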

Building Distributed Tree Fragments Properties of the ideal function • Approximation • Non-commutativity with a very high degree k • Non-associativity • Bilinearity We demonstrated that DTFs are a nearly orthonormal base Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base that embeds R^m in R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

Building Distributed Trees Given a tree T, the distributed representation of its subtrees is the vector: [Equation: sum over the subtrees in S(T) of their distributed tree fragments] where S(T) is the set of the subtrees of T [Figure: S(T) for the parse tree of “Farmers feed cows animal extracts”] Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

Building Distributed Trees: a more efficient approach N(T) is the set of nodes of T; s(n) is defined as: … if n is terminal, … if n → c1…cm Computing a Distributed Tree is linear with respect to the size of N(T) Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012
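
A sketch of the linear-time idea, under stated assumptions: s(n) is taken to be the zero vector for terminals and, for n → c1…cm, the left-to-right composition of the vector of n with (ci + s(ci)) for each child; no decay factor is used. The composition function is a stand-in (circular convolution of permuted inputs), and all names are hypothetical. The brute-force function enumerates every tree fragment explicitly and is there only to check the recursion on a small tree.

```python
import numpy as np

D = 1024                                   # vector dimension (assumed)
rng = np.random.default_rng(0)
P1, P2 = rng.permutation(D), rng.permutation(D)
LABEL_VECTORS = {}

def vec(label):
    """Random unit vector for a node label, created on first use."""
    if label not in LABEL_VECTORS:
        v = rng.normal(size=D)
        LABEL_VECTORS[label] = v / np.linalg.norm(v)
    return LABEL_VECTORS[label]

def compose(a, b):
    """Stand-in bilinear composition: circular convolution of permuted inputs."""
    return np.fft.irfft(np.fft.rfft(a[P1]) * np.fft.rfft(b[P2]), n=D)

def distributed_tree(tree):
    """Sum of s(n) over all nodes, visiting each node exactly once."""
    total = np.zeros(D)

    def s(node):
        nonlocal total
        label, children = node
        if not children:                   # terminal node
            return np.zeros(D)
        out = vec(label)
        for child in children:             # n -> c1 ... cm, left to right
            out = compose(out, vec(child[0]) + s(child))
        total = total + out
        return out

    s(tree)
    return total

def fragment_vectors(node):
    """Vectors of all tree fragments rooted at node (explicit enumeration)."""
    label, children = node
    if not children:
        return []
    partial = [vec(label)]
    for child in children:
        options = [vec(child[0])] + fragment_vectors(child)
        partial = [compose(p, o) for p in partial for o in options]
    return partial

def distributed_tree_bruteforce(tree):
    """Same vector as distributed_tree, summing every fragment one by one."""
    total = sum(fragment_vectors(tree), np.zeros(D))
    for child in tree[1]:
        total = total + distributed_tree_bruteforce(child)
    return total

def leaf(word):
    return (word, [])

tree = ("S", [("NP", [("NNS", [leaf("Farmers")])]),
              ("VP", [("VB", [leaf("feed")]),
                      ("NP", [("NNS", [leaf("cows")])]),
                      ("NP", [("NN", [leaf("animal")]),
                              ("NNS", [leaf("extracts")])])])])
dt = distributed_tree(tree)
```

The two functions agree because the composition is bilinear: expanding the products of sums in s(n) yields exactly one term per tree fragment rooted at n.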

DTK: Expected properties and challenges • Compositionally building Distributed Tree Fragments • Distributed Tree Fragments are a nearly orthonormal base that embeds R^m in R^d • Distributed Trees can be efficiently computed • DTKs should approximate Tree Kernels Property 1 (Nearly Unit Vectors) Property 2 (Nearly Orthogonal Vectors) Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012

Task-based Analysis for x • Recognizing Textual Entailment • Question Classification • with these realizations of the ideal function: • Shuffled normalized element-wise product • Shuffled circular convolution Zanzotto & Dell'Arciprete, Distributed Tree Kernels, Proceedings of ICML, 2012
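
The two realizations named on this slide can be sketched as follows. The assumptions are: “shuffled” means that each argument is first reordered by its own fixed random permutation, and the normalization of the element-wise product is a constant factor sqrt(d), which keeps the norm close to 1 for Gaussian-like unit vectors. The exact definitions in the paper may differ; names and dimension are hypothetical.

```python
import numpy as np

d = 2048
rng = np.random.default_rng(1)
p1, p2 = rng.permutation(d), rng.permutation(d)   # fixed shuffles

def random_unit():
    v = rng.normal(size=d)
    return v / np.linalg.norm(v)

def shuffled_circular_convolution(a, b):
    """Circular convolution of the shuffled arguments, computed via FFT."""
    return np.fft.irfft(np.fft.rfft(a[p1]) * np.fft.rfft(b[p2]), n=d)

def shuffled_normalized_product(a, b):
    """Element-wise product of the shuffled arguments, rescaled by sqrt(d)."""
    return np.sqrt(d) * a[p1] * b[p2]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a, b, c = random_unit(), random_unit(), random_unit()
conv_ab = shuffled_circular_convolution(a, b)
conv_ba = shuffled_circular_convolution(b, a)
prod_ab = shuffled_normalized_product(a, b)
prod_ba = shuffled_normalized_product(b, a)
```

Using two different permutations is what makes both operations non-commutative: the vector for “a then b” is nearly orthogonal to the vector for “b then a”, while both stay bilinear and approximately norm-preserving.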

Remarks • Distributed Tree Kernels (DTK) approximate Tree Kernels • Distributed Trees (DT) can be efficiently computed • Distributed Tree Fragments (DTF) are a nearly orthonormal base that embeds R^m in R^d

Side effect: reduced time complexity • Tree kernels (TK) (Collins & Duffy, 2001) have quadratic complexity • Current techniques control this complexity (Moschitti, 2006), (Rieck et al., 2010), (Shin et al., 2011) • DTKs change the complexity as they can be used with Linear Kernel Machines n: # of training examples |N(T)|: # of nodes of the tree T Zanzotto & Dell'Arciprete, Distributed Convolution Kernels on Countable Sets, Journal of Machine Learning Research, Accepted conditioned to minor revisions

Sequel • Towards Structured Prediction: • Distributed Representation Parsing • Generalizing the theory: • Distributed Convolution Kernels on Countable Sets • Adding back distributional semantics: • Distributed Smoothed Tree Kernels

Distributed Representation Parsing (DRP): the idea Distributed Tree Encoder (DT) Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

Distributed Representation Parsing (DRP): the idea [Figure: the sentence «We booked the flight» processed either by the Symbolic Parser (SP) followed by the Distributed Tree Encoder (DT), or by Distributed Representation Parsing (DRP): Sentence Encoder (D) and Transducer (P)] Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

DRP: Sentence Encoder • Non-Lexicalized Sentence Models: Bag-of-postags; N-grams of postags • Lexicalized Sentence Models: Unigrams; Unigrams + N-grams of postags [Figure: Distributed Representation Parsing (DRP): «We booked the flight», Sentence Encoder (D), Transducer (P)] Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

DRP: Transducer Estimation: Principal Component Analysis and Partial Least Square Estimation T = PS Approximation: Moore-Penrose pseudoinverse (Penrose, 1955) to derive … where k is the number of selected singular values [Figure: Distributed Representation Parsing (DRP): «We booked the flight», Sentence Encoder (D), Transducer (P)] Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013
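
A minimal sketch of the estimation step, assuming that the columns of S are sentence encodings, the columns of T are the corresponding distributed trees, and the transducer is obtained as P = T S⁺ with a pseudoinverse truncated to the k largest singular values. The matrix orientation, the function names and the toy data are assumptions for illustration.

```python
import numpy as np

def truncated_pinv(S, k):
    """Moore-Penrose pseudoinverse of S using only the k largest singular values."""
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    return Vt[:k].T @ np.diag(1.0 / sigma[:k]) @ U[:, :k].T

def estimate_transducer(T, S, k):
    """Solve T ≈ P S for P in the least-squares sense: P = T S^+_(k)."""
    return T @ truncated_pinv(S, k)

# Toy check: with k equal to the full rank of S, a known P is recovered.
rng = np.random.default_rng(0)
m, d, n = 8, 16, 50          # encoder dim, distributed-tree dim, n. of sentences
P_true = rng.normal(size=(d, m))
S = rng.normal(size=(m, n))  # one column per sentence encoding
T = P_true @ S               # one column per distributed tree
P_hat = estimate_transducer(T, S, k=m)
```

Choosing k smaller than the rank discards the directions with the smallest singular values, which acts as a regularizer when S is noisy or nearly rank-deficient.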

Experimental set-up • Data: English Penn Treebank with standard split; Distributed trees with 3 values of λ (0, 0.2, 0.4) and 2 models (Unlexicalized/Lexicalized); Dimension of the reduced space (4,096 and 8,192) • System Comparison: Distributed Symbolic Parser DSP(s) = DT(SP(s)); Symbolic Parser: Bikel Parser (Bikel, 2004) with Collins Settings (Collins, 2003) • Parameter Estimation: parameters k for the pseudo-inverse and j for the sentence encoders D; maximization of the similarity (see parsing performance) on Section 24 Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

«Distributed» Parsing Performance • Evaluation Measure • Unlexicalized Trees • Lexicalized Trees Zanzotto & Dell'Arciprete, Transducing Sentences to Syntactic Feature Vectors: an Alternative Way to "Parse"?, Proceedings of the ACL-Workshop on CVSC, 2013

Distributed Convolution Kernels on Countable Sets The following general property holds: … where • CK is a convolution kernel • DCK is the related distributed convolution kernel Implemented Distributed Convolution Kernels: • Distributed Tree Kernel • Distributed Subpath Kernel • Distributed Route Kernel • Distributed String Kernel • Distributed Partial Tree Kernel Zanzotto & Dell'Arciprete, Distributed Convolution Kernels on Countable Sets, Journal of Machine Learning Research, Accepted conditioned to minor revisions

Going back to RTE and distributional semantics [Figure: tree pairs with variable X for killed/died and for murdered/died, related through Distributional Semantics] Promising!!! Mehdad, Moschitti, Zanzotto, Syntactic/Semantic Structures for Textual Entailment Recognition, Proceedings of NAACL, 2010

A Novel Look at the Recursive Full Additive Model [Figure: terms of the similarity labeled “structure” and “meaning”, with annotations “?” and “<1”] Ferrone & Zanzotto, Linear Compositional Distributional Semantics and Structural Kernels, Proceedings of Joint Symposium of Semantic Processing, 2013

A Novel Look at the Recursive Full Additive Model Choosing: … if Struct_i = Struct_j, … if Struct_i ≠ Struct_j Zanzotto, Ferrone, Baroni, When the whole is not greater than the sum of its parts: A decompositional look at compositional distributional semantics, re-submitted

«Convolution Conjecture» The similarity equations between two vectors/tensors obtained with CDSMs can be decomposed into operations performed on the subparts of the input phrases. Compositional Distributional Models based on linear algebra and Convolution Kernels are intimately related. For example: Convolution Kernel, Recursive Full Additive Model Zanzotto, Ferrone, Baroni, When the whole is not greater than the sum of its parts: A decompositional look at compositional distributional semantics, re-submitted

Distributed Smoothed Tree Kernels [Figure: lexicalized tree fragments rooted at S:killed and S:murdered; synt(·) gives the syntactic structure of a fragment and head(·) gives its head word, e.g. killed] Ferrone, Zanzotto, Towards Syntax-aware Compositional Distributional Semantic Models, Proceedings of CoLing, 2014

Distributed Smoothed Tree Kernels In general, for a lexicalized tree: … we define … Ferrone, Zanzotto, Towards Syntax-aware Compositional Distributional Semantic Models, Proceedings of CoLing, 2014

Distributed Smoothed Tree Kernels • Distributed Smoothed Tree • The resulting dot (Frobenius) product Ferrone, Zanzotto, Towards Syntax-aware Compositional Distributional Semantic Models, Proceedings of CoLing, 2014
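
A small sketch of why a Frobenius product is a natural fit here. It assumes that a lexicalized fragment is represented as the outer product of a vector for its syntactic structure, synt(·), and the distributional vector of its head word, head(·); this representation and all vectors below are illustrative assumptions. Under that assumption the Frobenius product of two fragments factorizes into a structure similarity times a head-word similarity.

```python
import numpy as np

def smoothed_fragment(synt_vec, head_vec):
    """Matrix for one lexicalized fragment: outer product synt ⊗ head."""
    return np.outer(synt_vec, head_vec)

def frobenius(A, B):
    """Frobenius (dot) product of two matrices of the same shape."""
    return float(np.sum(A * B))

rng = np.random.default_rng(0)
d_synt, d_head = 64, 10

# Same syntactic structure for both fragments (hypothetical random vector).
synt = rng.normal(size=d_synt)
synt /= np.linalg.norm(synt)

# Hypothetical distributional vectors for two related head words.
killed = rng.normal(size=d_head)
murdered = killed + 0.1 * rng.normal(size=d_head)

F_killed = smoothed_fragment(synt, killed)
F_murdered = smoothed_fragment(synt, murdered)

similarity = frobenius(F_killed, F_murdered)
factorized = float(synt @ synt) * float(killed @ murdered)
```

Fragments with different structure would contribute a value close to zero if their synt vectors are nearly orthogonal, so only the head-word similarity of structurally matching fragments survives in the sum.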

What’s next...

Prequel: • Distributional Semantics • Recognizing Textual Entailment • Feature Spaces of the Rules with Variables • Binary CDS • Tree Kernels • Recursive CDS • adding distributional semantics • structure • meaning Sequel: • Distributed Tree Kernels • Distributed Convolution Kernels on Countable Sets • Distributed Representation Parsing
