This talk explores the relationship between semantic features and distributional dimensions in linguistics and computational linguistics/AI, addressing theoretical and methodological challenges. It reviews the characteristics of semantic features and their uses in lexical semantics, event structure, and inference; contrasts distributional dimensions with semantic features; and examines how distributional representations support semantic inference.
Are Distributional Dimensions Semantic Features?
Katrin Erk, University of Texas at Austin
Meaning in Context Symposium, München, September 2015
Joint work with Gemma Boleda
Semantic features by example: Katz & Fodor
• The different meanings of a word are characterized by lists of semantic features
Semantic features
• In linguistics: Katz & Fodor, Wierzbicka, Jackendoff, Bierwisch, Pustejovsky, Asher, …
• In computational linguistics/AI: Schank, Wilks, Masterman, Sowa, …
• Schank: Conceptual Dependencies
• "drink" in preference semantics (Wilks): ((*ANI SUBJ) (((FLOW STUFF) OBJE) (MOVE CAUSE)))
Semantic features: Characteristics
• Primitive (not themselves defined), unanalyzable
• Small set
• Lexicalized in all languages
• Combined, they characterize the semantics of all lexical expressions in all languages
• Precise, fixed meaning, which is not part of language (Wilks: not so)
• Individually enable inferences
• Feature lists or complex graphs
Compiled from: Wierzbicka, Geeraerts, Schank
Uses of semantic features
• Event structure in the lexical semantics of verbs (Levin):
  change-of-state verbs: [[x ACT] CAUSE [BECOME [y <result-state>]]]
• Handle polysemy (Pustejovsky, Asher)
• Characterize selectional constraints (e.g. in VerbNet)
• Characterize synonyms, also cross-linguistically (application: translation)
• Enable inferences: John is a bachelor ⇒ John is unmarried, John is a man
Are distributional dimensions semantic features?
Alligator, computed from UKWaC + Wikipedia + BNC + Gigaword, 2-word window, PPMI transform (a toy sketch of this kind of pipeline follows below):
believe-v 0.794065, american-a 2.245667, kill-v 1.946722, consider-v 0.047781, seem-v 0.410991, turn-v 0.919250, side-n 0.098926, serve-v 0.479459, involve-v 0.435661, report-v 0.483651, little-a 1.175299, big-a 1.468021, water-n 1.806485, attack-n 1.795050, much-a 0.011354, …
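As a rough illustration of the kind of pipeline behind a vector like the one above, here is a minimal Python sketch, under simplifying assumptions, of counting co-occurrences in a symmetric 2-word window and applying a PPMI transform; the toy corpus and all function names are invented for illustration, not the actual setup used for the alligator vector.

```python
from collections import Counter, defaultdict
import math

def count_cooccurrences(sentences, window=2):
    """Count target-context co-occurrences within a symmetric word window."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

def ppmi_vectors(counts):
    """Turn raw co-occurrence counts into Positive PMI weights."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    target_totals = {t: sum(ctx.values()) for t, ctx in counts.items()}
    context_totals = Counter()
    for ctx in counts.values():
        context_totals.update(ctx)
    vectors = {}
    for t, ctx in counts.items():
        vec = {}
        for c, n in ctx.items():
            pmi = math.log2((n * total) / (target_totals[t] * context_totals[c]))
            if pmi > 0:                      # keep only positive associations
                vec[c] = pmi
        vectors[t] = vec
    return vectors

# Tiny toy corpus standing in for UKWaC + Wikipedia + BNC + Gigaword
corpus = [
    "the alligator attacked near the water".split(),
    "the big alligator seemed to attack".split(),
]
vectors = ppmi_vectors(count_cooccurrences(corpus, window=2))
print(sorted(vectors["alligator"].items(), key=lambda kv: -kv[1]))
```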
Are distributional dimensions semantic features?
• "[The] differences between vector space encoding and more familiar accounts of meaning is easy to exaggerate. For example, a vector space encoding is entirely compatible with the traditional doctrine that concepts are 'bundles' of semantic features. Indeed, the latter is a special case of the former, the difference being that […] semantic dimensions are allowed to be continuous." (Fodor and Lepore 1999: All at Sea in Semantic Space)
• (About connectionism and particularly Churchland, not distributional models)
Are distributional dimensions semantic features?
• If so, they either address or inherit methodological problems:
  • Coverage of a realistic vocabulary
  • Empirically determining semantic features
  • Meaning creep: predicates used in Cyc did not stay stable in their meaning over the years (Wilks 2008)
Are distributional dimensions semantic features?
• If so, they inherit theoretical problems:
  • Lewis 1970: "Markerese"
  • Fodor et al 1980, Against Definitions; Fodor and Lepore 1999, All at Sea in Semantic Space
  • Asymmetry between words and primitives: what makes the primitives more basic?
  • Also, how can people communicate if their semantic spaces differ?
Outline • Differences between distributional dimensions and semantic features • Redefining the dichotomy • No dichotomy after all • Integrated inference
Semantic features: Characteristics
• Primitive (not themselves defined), unanalyzable
• Small set
• Lexicalized in all languages
• Combined, they characterize semantics of all lexical expressions in all languages
• Precise, fixed meaning, not part of language
• Individually enable inferences
• Feature lists or complex graphs
Neither primitive nor with a fixed meaning
• Not unanalyzable: any distributional feature can in principle be a distributional target
• Compare: target and dimensions as a graph, with similarity determined on the basis of random walks (see the sketch below)
[Diagram: a graph with a target node and dimension nodes d1, d2, d3, plus a further node dd1, a dimension of a dimension]
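To make the graph view concrete, here is a minimal sketch (assuming NumPy; the node names and edge weights are invented) of random-walk-with-restart scores over a small target–dimension graph, in which the dimension d1 is itself linked to a further dimension dd1 and is therefore not treated as unanalyzable.

```python
import numpy as np

# Toy graph: target and dimensions as nodes, co-occurrence strengths as edges.
# dd1 is a "dimension of a dimension": d1 is itself analyzable as a target.
nodes = ["target", "d1", "d2", "d3", "dd1"]
edges = {
    ("target", "d1"): 1.0, ("target", "d2"): 0.5, ("target", "d3"): 0.8,
    ("d1", "dd1"): 1.2,
}

idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for (a, b), w in edges.items():
    A[idx[a], idx[b]] = A[idx[b], idx[a]] = w
P = A / A.sum(axis=1, keepdims=True)          # row-stochastic transition matrix

def random_walk_scores(P, start, restart=0.15, steps=50):
    """Random walk with restart from a start node; the resulting scores give
    a graph-based notion of similarity between the target and the dimensions."""
    v = np.zeros(P.shape[0])
    v[start] = 1.0
    r = v.copy()
    for _ in range(steps):
        r = restart * v + (1 - restart) * (P.T @ r)
    return r

scores = random_walk_scores(P, idx["target"])
print(dict(zip(nodes, scores.round(3))))
```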
Neither primitive nor with a fixed meaning
• But are they treated as unanalyzed in practice?
• Features in a vector are usually not analyzed further
• SVD, topic modeling, and prediction-based models induce latent features by exploiting the distributional properties of the features themselves
• Are latent features unanalyzable? No, they remain linked to the original dimensions (see the sketch below)
• No fixed meaning: distributional features can be ambiguous
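A minimal sketch of the point about latent features, with an invented toy PPMI matrix and NumPy: SVD-induced dimensions are not unanalyzable, because each latent dimension can be traced back to a weighted combination of the original context words.

```python
import numpy as np

# Toy PPMI matrix: rows = targets, columns = context words (invented values)
targets = ["alligator", "crocodile", "freedom"]
contexts = ["water-n", "attack-n", "grant-v", "violate-v"]
M = np.array([
    [1.8, 1.7, 0.0, 0.0],
    [1.6, 1.9, 0.0, 0.1],
    [0.0, 0.2, 1.5, 1.4],
])

# Truncated SVD induces latent features for the targets
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
latent = U[:, :k] * s[:k]                     # targets in the latent space
print(np.round(latent, 2))

# Each latent dimension is a weighted combination of the original context
# words, so we can inspect which contexts load most strongly on it.
for d in range(k):
    loadings = sorted(zip(contexts, Vt[d]), key=lambda x: -abs(x[1]))
    print(f"latent dim {d}:", [(c, round(w, 2)) for c, w in loadings[:2]])
```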
Then is it "Markerese"?
• Inference = deriving something non-distributional from distributional representations
• Inference from relations to other words:
  • "X cause Y" and "Y trigger X" occur with similar X and Y, hence they are probably close in meaning
  • "alligator" appears in a subset of the contexts of "animal", hence alligators are probably animals
• Inference from co-occurrence with extralinguistic information:
  • Distributional vectors linked to images for the same target
  • Alligators are similar to crocodiles, crocodiles are listed in the ontology as animals, hence alligators are probably animals
No individual inferences
• The distributional representation as a whole, in the aggregate, allows for inferences using aggregate techniques (see the sketch below):
  • Distributional similarity
  • Distributional inclusion
  • Whole-vector mappings to visual vectors
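A minimal sketch of two of these aggregate modes of inference over sparse vectors with invented weights: whole-vector cosine similarity, and a simple inclusion score in the spirit of distributional inclusion measures (the particular formula here is an illustrative simplification, not a specific published measure).

```python
import math

def cosine(u, v):
    """Similarity computed over whole vectors (dicts of dimension -> weight)."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def inclusion(u, v):
    """How much of u's weight mass falls on dimensions that v also has.
    A high value suggests u's contexts are a subset of v's
    (e.g. alligator included in animal)."""
    covered = sum(w for d, w in u.items() if d in v)
    total = sum(u.values())
    return covered / total if total else 0.0

# Invented toy vectors
alligator = {"water-n": 1.8, "attack-n": 1.7, "kill-v": 1.9}
animal    = {"water-n": 1.0, "attack-n": 0.8, "kill-v": 1.1, "eat-v": 1.5}
print("cosine:", round(cosine(alligator, animal), 3))
print("inclusion:", round(inclusion(alligator, animal), 3))
```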
No individual inferences • Feature-based inference possible with “John Doe” features: • Take text representation • Take apart into features that are individually almost meaningless • Aggregate of such features allows for inferences
Outline • Differences between distributional dimensions and semantic features • Redefining the dichotomy • No dichotomy after all • Integrated inference
Redefining the dichotomy
• Not semantic features versus distributional dimensions, but individual features versus aggregate features
• Individual features:
  • Individually allow for inferences
  • May be relevant to grammar
  • Are introspectively salient
  • Not necessarily primitive
  • Also hypernyms and synonyms
• Aggregate features:
  • May be individually almost meaningless
  • Allow for aggregate inference
• Two modes of inference: individual and aggregate
Individual features in distributional representations
• Some distributional dimensions can be cognitively relevant features
• Thill et al 2014: because distributional models focus on how words are frequently used, they point to how humans experience concepts
• Freedom (features from Baroni & Lenci 2010):
  • positive events: guarantee, secure, grant, defend, respect
  • negative events: undermine, deny, infringe on, violate
Individual features in distributional representations
• Approaches that find cognitively plausible features distributionally:
  • Almuhareb & Poesio 2004
  • Cimiano & Wenderoth 2007
  • Schulte im Walde et al 2008: German association norms
  • Baroni et al 2010: STRUDEL
  • Baroni & Lenci 2010: Distributional Memory
  • Devereux et al 2010: dependency paths extracted from Wikipedia
Individual features in distributional representations
• Difficult: only a small fraction of human-elicited features can be retrieved
• Baroni et al 2010: distributional features tend to be different from human-elicited features
  • preference for "'actional' and 'situated' descriptions"
  • motorcycle:
    • elicited: wheels, dangerous, engine, fast
    • distributional: ride, sidecar, park, road
Outline • Differences between distributional dimensions and semantic features • Redefining the dichotomy • No dichotomy after all • Integrated inference
Not a competition
• Use both kinds of features!
• Computational perspective:
  • Distributional features are great: learned automatically, enable many inferences
  • Human-defined semantic features are great: less noisy, enable inferences with more certainty, enable inferences that distributional models do not provide
• How can we integrate the two?
Speculation: Learning both individual and aggregate features • Learner makes use of features from textual environment • Some features almost meaningless, others more meaningful • Some of them relevant to grammar (CAUSE, BECOME) • Both meaningful and near-meaningless features enter aggregate inference • Only certain features allow individual inference • (Unclear: This should not be feature lists, there is structure! But where does that fit in this picture?)
Outline • Differences between distributional dimensions and semantic features • Redefining the dichotomy • No dichotomy after all • Integrated inference
Inferring individual features from aggregates
• Johns and Jones 2012: compute the weight of the feature bird for nightingale as the summed similarity of nightingale to known birds (see the sketch below)
• Fagarasan/Vecchi/Clark 2015: learn a mapping from distributional vectors to vectors of individual features
• Herbelot/Vecchi 2015: learn a mapping from distributional space to a "set-theoretic space", vectors of quantified individual features (ALL apes are muscular, SOME apes live on coasts)
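A minimal sketch of the Johns and Jones style inference from the first bullet: the weight of an unobserved feature for a target is taken to be the similarity-weighted sum over words known to carry that feature. The vectors and feature norms below are invented toy values, not their data or exact model.

```python
import math

def cosine(u, v):
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    norm = lambda x: math.sqrt(sum(w * w for w in x.values()))
    return dot / (norm(u) * norm(v)) if u and v else 0.0

def infer_feature_weight(target_vec, feature, known_words, vectors, feature_norms):
    """Weight of `feature` for the target = similarity-weighted sum over words
    that are known (from human-elicited norms) to carry that feature."""
    return sum(
        cosine(target_vec, vectors[w]) * feature_norms[w].get(feature, 0.0)
        for w in known_words
    )

# Invented distributional vectors and human-elicited feature norms
vectors = {
    "robin":       {"fly-v": 1.2, "sing-v": 1.5, "tree-n": 0.9},
    "sparrow":     {"fly-v": 1.1, "sing-v": 0.8, "tree-n": 1.0},
    "nightingale": {"fly-v": 1.0, "sing-v": 1.7, "tree-n": 0.8},
}
feature_norms = {"robin": {"bird": 1.0}, "sparrow": {"bird": 1.0}}

weight = infer_feature_weight(vectors["nightingale"], "bird",
                              ["robin", "sparrow"], vectors, feature_norms)
print("weight of 'bird' for 'nightingale':", round(weight, 3))
```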
Inferring individual features from aggregates
• Gupta et al 2015: regression to learn properties of unknown cities/countries from those of known cities/countries (see the sketch below)
• Snow/Jurafsky/Ng 2006: infer the location of a word in the WordNet hierarchy using a distributional co-hyponymy classifier
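In the spirit of Gupta et al 2015 (whose actual data and regression models are richer), a minimal least-squares sketch that predicts a referential attribute of an unseen city from its distributional vector alone; all vectors and attribute values are invented.

```python
import numpy as np

# Invented distributional vectors for cities with a known attribute
known_vectors = np.array([
    [0.2, 1.1, 0.4],   # "paris"
    [0.3, 0.9, 0.5],   # "london"
    [0.8, 0.1, 1.2],   # "innsbruck"
])
known_attribute = np.array([16.0, 16.1, 11.8])   # e.g. log population

# Least-squares regression from vector dimensions to the attribute
w, *_ = np.linalg.lstsq(known_vectors, known_attribute, rcond=None)

# Predict the attribute for an unseen city from its distributional vector
munich = np.array([0.4, 0.8, 0.6])
print("predicted attribute for 'munich':", round(float(munich @ w), 2))
```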
Individual features influencing aggregate representations
• Andrews/Vigliocco/Vinson 2009, Roller/Schulte im Walde 2013: topic modeling, including known individual features of words in the text
• Faruqui et al 2015: update vector representations to better match known synonymy, hypernymy, and hyponymy information (see the retrofitting sketch below)
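A minimal sketch of a retrofitting-style update in the spirit of Faruqui et al 2015: each vector is pulled toward the average of its lexicon neighbours while staying anchored to its original distributional position. The lexicon, vectors, and weighting scheme below are toy assumptions, not their released resources.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10, alpha=1.0):
    """Iteratively nudge each word vector toward the average of its lexicon
    neighbours, while keeping it close to its original (distributional) vector."""
    new_vecs = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new_vecs]
            if word not in new_vecs or not nbrs:
                continue
            beta = 1.0 / len(nbrs)                       # uniform edge weights
            neighbour_sum = beta * sum(new_vecs[n] for n in nbrs)
            new_vecs[word] = (alpha * vectors[word] + neighbour_sum) / (alpha + 1.0)
    return new_vecs

# Invented distributional vectors and a tiny synonymy lexicon
vectors = {
    "alligator": np.array([1.8, 0.2, 1.7]),
    "crocodile": np.array([1.5, 0.1, 1.9]),
    "gator":     np.array([0.9, 0.9, 0.9]),
}
lexicon = {"gator": ["alligator", "crocodile"]}
print("retrofitted 'gator':", np.round(retrofit(vectors, lexicon)["gator"], 2))
```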
Individual features influencing aggregate representations
• Boyd-Graber/Blei/Zhu 2007:
  • WordNet hierarchy as part of a topic model
  • Generate a word: choose a topic, then walk down the WordNet hierarchy based on the topic
  • Aim: best WordNet sense for each word in context
• Riedel et al 2013, Rocktäschel et al 2015: Universal Schema (see the sketch below)
  • A relation is characterized by a vector of named-entity pairs (the entity pairs that fill the relation)
  • Both human-defined and corpus-extracted relations
  • Matrix factorization over the union of human-defined and corpus-extracted relations
  • Predict whether a relation holds of an entity pair
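A sketch of the Universal Schema idea with an invented toy matrix: entity pairs are rows, and the columns mix a knowledge-base relation with corpus-extracted surface patterns; a low-rank logistic factorization is trained on observed cells plus sampled negative cells. Riedel et al use a ranking (BPR) objective and far more data; the logistic loss and all names below are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Rows: entity pairs. Columns: union of a KB relation and corpus-extracted
# surface patterns. All pairs, relations, and cells are invented toy data.
pairs = ["(Austin, Texas)", "(Munich, Bavaria)", "(Erk, UT Austin)"]
relations = ["located_in", "X is a city in Y", "X works at Y"]
positives = {(0, 0), (0, 1), (1, 1), (2, 2)}     # observed (pair, relation) cells
negatives = {(2, 0), (2, 1), (0, 2), (1, 2)}     # sampled negative cells
cells = list(positives | negatives)              # (1, 0) is held out, unobserved

k = 2
P = rng.normal(scale=0.1, size=(len(pairs), k))       # entity-pair embeddings
R = rng.normal(scale=0.1, size=(len(relations), k))   # relation embeddings
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
lr = 0.1

# Logistic matrix factorization by stochastic gradient descent
for step in range(5000):
    i, j = cells[rng.integers(len(cells))]
    label = 1.0 if (i, j) in positives else 0.0
    err = sigmoid(P[i] @ R[j]) - label
    P[i], R[j] = P[i] - lr * err * R[j], R[j] - lr * err * P[i]

# Query the held-out cell: does located_in hold of (Munich, Bavaria)?
# The surface pattern "X is a city in Y" shares latent structure with the
# KB relation, which is what lets the model make this kind of prediction.
print(round(float(sigmoid(P[1] @ R[0])), 3))
```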
Conclusion
• Distributional features are not semantic features:
  • Not primitive
  • Inference from relations between word representations and from co-occurrence with extra-linguistic information
  • Not (necessarily) individually meaningful
  • Inference from the aggregate of features
• Two modes of inference: individual and aggregate
• Use both individual and aggregate features
• How to integrate the two, and infer one from the other?
References
• Almuhareb, A., & Poesio, M. (2004). Attribute-based and value-based clustering: An evaluation. Proceedings of EMNLP, 1–8.
• Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463–498.
• Asher, N. (2011). Lexical Meaning in Context: A Web of Words. Cambridge University Press.
• Baroni, M., Murphy, B., Barbu, E., & Poesio, M. (2010). Strudel: A corpus-based semantic model based on properties and types. Cognitive Science, 34(2), 222–254.
• Baroni, M., & Lenci, A. (2010). Distributional memory: A general framework for corpus-based semantics. Computational Linguistics, 36(4), 673–721.
• Bierwisch, M. (1969). On certain problems of semantic representation. Foundations of Language, 5, 153–184.
• Boyd-Graber, J., Blei, D. M., & Zhu, X. (2007). A topic model for word sense disambiguation. Proceedings of EMNLP.
References
• Cimiano, P., & Wenderoth, J. (2007). Automatic acquisition of ranked qualia structures from the Web. Proceedings of ACL, 888–895.
• Devereux, B., Pilkington, N., Poibeau, T., & Korhonen, A. (2010). Towards unrestricted, large-scale acquisition of feature-based conceptual representations from corpus data. Research on Language and Computation, 7(2–4), 137–170.
• Fagarasan, L., Vecchi, E., & Clark, S. (2015). From distributional semantics to feature norms: Grounding semantic models in human perceptual data. Proceedings of IWCS.
• Faruqui, M., Dodge, J., Jauhar, S., Dyer, C., Hovy, E., & Smith, N. (2015). Retrofitting word vectors to semantic lexicons. Proceedings of NAACL.
• Fodor, J., Garrett, M. F., Walker, E. C. T., & Parkes, C. H. (1980). Against definitions. Cognition, 8(3), 263–367.
• Fodor, J., & Lepore, E. (1999). All at sea in semantic space: Churchland on meaning similarity. The Journal of Philosophy, 96(8), 381–403.
• Geeraerts, D. (2009). Theories of Lexical Semantics. Oxford University Press.
References
• Gupta, A., Boleda, G., Baroni, M., & Pado, S. (2015). Distributional vectors encode referential attributes. Proceedings of EMNLP.
• Herbelot, A., & Vecchi, E. M. (2015). Building a shared world: Mapping distributional to model-theoretic semantic spaces. Proceedings of EMNLP.
• Jackendoff, R. (1990). Semantic Structures. MIT Press.
• Johns, B. T., & Jones, M. N. (2012). Perceptual inference through global lexical similarity. Topics in Cognitive Science, 4(1), 103–120.
• Katz, J. J., & Fodor, J. A. (1963). The structure of a semantic theory. Language, 39(2), 170–210.
• Lewis, D. (1970). General semantics. Synthese, 22(1), 18–67.
• Pustejovsky, J. (1991). The Generative Lexicon. Computational Linguistics, 17(4).
References
• Rappaport Hovav, M., & Levin, B. (2001). An event structure account of English resultatives. Language, 77(4).
• Riedel, S., Yao, L., McCallum, A., & Marlin, B. (2013). Relation extraction with matrix factorization and universal schemas. Proceedings of NAACL.
• Rocktäschel, T., Singh, S., & Riedel, S. (2015). Injecting logical background knowledge into embeddings for relation extraction. Proceedings of NAACL.
• Roller, S., & Schulte im Walde, S. (2013). A multimodal LDA model integrating textual, cognitive and visual modalities. Proceedings of EMNLP.
• Schank, R. (1969). A conceptual dependency parser for natural language. Proceedings of COLING.
• Schulte im Walde, S., Melinger, A., Roth, M., & Weber, A. (2008). An empirical characterisation of response types in German association norms. Research on Language and Computation, 6(2), 205–238.
References
• Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence. Proceedings of ACL-COLING, 801–808.
• Sowa, J. (1992). Logical structures in the lexicon. In J. Pustejovsky & S. Bergler (Eds.), Lexical Semantics and Knowledge Representation (LNCS Vol. 627, pp. 39–60).
• Thill, S., Pado, S., & Ziemke, T. (2014). On the importance of a rich embodiment in the grounding of concepts: Perspectives from embodied cognitive science and computational linguistics. Topics in Cognitive Science, 6(3), 545–558.
• Wierzbicka, A. (1996). Semantics: Primes and Universals. Oxford University Press.
• Wilks, Y. (2008). What would a Wittgensteinian computational linguistics be like? Presented at the AISB workshop on computers and philosophy, Aberdeen.