RANLP tutorial, September 2013, Hissar, Bulgaria. The Analytics of Word Sociology. Violeta Seretan, Department of Translation Technology, Faculty of Translation and Interpreting, University of Geneva. 8 September 2013.
Keywords • computer science • linguistics • computational linguistics • statistics • inferential statistics • syntactic parsing • dependency parsing • shallow parsing • chunking • POS-tagging • lemmatization • tokenisation • type vs. token • distribution • Zipf's law • hypothesis testing • statistical significance • null hypothesis • association measure • collocation extraction • mutual information • log-likelihood ratio • entropy • contingency table • co-occurrence • collocation • extraposition • long-distance dependency • n-gram • precision, recall, F-measure
Outline • Introduction • Terminology clarification • Theoretical description • Practical accounts • Behind the curtains: the maths and stats • Wrap up and outlook
Objectives • Understand the concept of collocation and its relevance for the fields of linguistics, lexicography and natural language processing. • Become aware of the definitorial and terminological issues, the description of collocations in terms of semantic compositionality, and the relation with other multi-word expressions. • Understand the basic architecture of a collocation extraction system. • Become familiar with the most influential work in the area of collocation extraction. • Get (more than) an overview of the underlying technology – in particular, the statistical computation details.
Social Analytics “Measuring + Analyzing + Interpreting interactions and associations between people, topics and ideas.” (http://en.wikipedia.org/wiki/Social_analytics) Image sources: http://www.submitedge.com, http://irevolution.net
You shall know someone … … by the company they keep http://flowingdata.com
Word Sociology • Barnbrook (1996) Language and Computers, Chapter 5, «The sociology of words»: • collocation analysis: «automatic quantitative analysis and identification of word patterns around words of interest» [Diagram: a `node’ word surrounded by its collocates, collocate word 1 … collocate word n]
You shall know a word… … by the company it keeps! (Firth, 1957) Seretan and Wehrli (2011): FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora [Screenshot: collocates displayed around a hidden `node’ word; `node’ word = ?]
Collocation analysis: Key concepts • Node word: the word under investigation • Collocate: the “word patterns” around the node word • Association measure (AM): Evert (2004): “a formula that computes an association score from the frequency information […]” • Collocation extraction [from corpora]: the task of automatically identifying genuine associations of words in corpora
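A minimal sketch of what an association measure looks like in code, assuming pointwise mutual information as the formula (it appears in the keyword list, but this slide does not commit to any particular AM). The function name `pmi`, its parameter names and the frequencies in the usage line are hypothetical, chosen only for illustration.

```python
import math

def pmi(pair_freq, node_freq, collocate_freq, sample_size):
    """Pointwise mutual information (base 2) of a node-collocate pair.

    All four arguments are raw frequency counts; sample_size is the
    total number of observations the counts were drawn from.
    """
    if min(pair_freq, node_freq, collocate_freq, sample_size) <= 0:
        raise ValueError("all frequencies must be positive")
    # Co-occurrence frequency expected if the two words were independent.
    expected = node_freq * collocate_freq / sample_size
    # Score > 0: the pair occurs more often than chance would predict.
    return math.log2(pair_freq / expected)

# Invented counts, for illustration only.
score = pmi(pair_freq=30, node_freq=100, collocate_freq=150, sample_size=100_000)
```

With these invented counts the expected co-occurrence frequency is 0.15, so the observed 30 co-occurrences give a clearly positive score. Other AMs (e.g., the log-likelihood ratio) take the same frequency information as input and differ only in the formula.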
Relevance for Linguistics • Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, … • Words are “separated in meaning at the collocational level” (Firth, 1968, 180) • Word collocation is one of the most important forms of text cohesion: is a passage of language “a unified whole or is just a collection of unrelated sentences”? (Halliday and Hasan, 1976, 1) • Collocations are found at the intersection of lexicon and grammar: “semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments” (Sinclair, 1991, 110) • Collocations [“idioms of encoding”] are expressions “which are larger than words, which are like words in that they have to be learned separately as individual whole facts about the language” (Fillmore et al., 1988, 504) • “We acquire collocations, as we acquire other aspects of language, through encountering texts in the course of our lives” (Hoey, 1991, 219).
Relevance for Linguistics (cont.) • Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, … • In the Meaning-Text Theory (e.g., Mel’čuk, 1998), collocations are described by means of lexical functions (associating meaning and the utterance expressing that meaning): Magn(problem) = big, Magn(rain) = heavy, Magn(injury) = serious • Collocations are often between words which share a positive or a negative connotation (semantic prosody – e.g., Louw, 1993). [Screenshot: FipsCoView]
Relevance for Lexicography • Dictionaries of co-occurrences/collocations/cum-corpus • “Collocation is the way words combine in a language to produce natural-sounding speech and writing” (Lea and Runcie, 2002) • “Advanced learners of second language have great difficulty with nativelike collocation and idiomaticity. Many grammatical sentences generated by language learners sound unnatural and foreign.” (Ellis, 2008) • Dictionaries shown: Sinclair, 1987; Benson et al., 1986; OCDSE (Lea and Runcie, 2002)
Relevance for Lexicography (cont.) http://dictionary.reverso.net/english-cobuild
Relevance for Lexicography (cont.) • Dictionaries of co-occurrences/collocations/cum-corpus • Dictionaries shown: Charest et al., 2012; Beauchesne, 2001
Relevance for Natural Language Processing • Machine translation: EN ask a question – FR poser `put’ une question – ES hacer `make’ una pregunta “collocations are the key to producing more acceptable output” (Orliac and Dillinger, 2003) • Natural language generation: EN to brush one’s teeth – *to wash one’s teeth “In the generation literature, the generation of collocations is regarded as a problem” (Heid and Raab, 1989) “However, collocations are not only considered useful, but also a problem both in certain applications (e.g. generation, […] machine translation […])” (Heylen et al., 1994)
Relevance for Natural Language Processing (cont.) • Syntactic parsing • Word sense disambiguation: break: about 50 senses; record: about 10 senses; to break a world record: 1 sense (verb-object collocation break – record) “a polysemous word exhibits essentially only one sense per collocation” (Yarowsky, 2003)
Relevance for Natural Language Processing (cont.) • OCR: distinguish between homographs: terse/tense, gum/gym, deaf/dear, cookie/rookie, beverage/leverage (Examples from Yarowsky, 2003) • Speech recognition: distinguish between homophones: aid/aide, cellar/seller, censor/sensor, cue/queue, pedal/petal (Examples from Yarowsky, 2003) (Examples from Church and Hanks, 1990)
Relevance for Natural Language Processing (cont.) • Text summarisation collocations capture the gist of a document (the most typical and salient phrases): be city, have population, people live, county seat, known as, be capital city, large city, city population, close to, area of city, most important, city name, most famous, located on coast (Examples from Seretan, 2011) • Text classification collocations are words which are characteristic of a body of texts • Context-sensitive dictionary look-up Context: The point doesn’t bear any relation to the question we are discussing. Idea: Display the subentry bear – relation instead of the entry for bear (Example from Michiels, 1998)
Etymology • cum ‘together’ • locare ‘to locate’ (from locus ‘place’) General meaning: collocated things (set side by side) Specific meaning: collocated words in a sentence Note: In French, two different forms exist: colocation ‘flatsharing’/collocation. http://www.collinsdictionary.com
One term – two acceptations • Broad acceptation: semantic collocation (doctor – hospital – nurse – …) “Collocation is the cooccurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening.” (Sinclair 1991:170) • Narrow acceptation: typical syntagm (“conventional way of saying”) “co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern” (Cowie 1978:132) Note: The current literature uses the term co-occurrence to refer to the first acceptation. The term collocation is reserved exclusively for the second acceptation.
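The broad acceptation can be sketched directly from Sinclair's proximity criterion: count every word that occurs with at most four words intervening between it and the node word. The function name `collocates_in_span`, the whitespace tokenisation and the example sentence are assumptions made for this illustration only.

```python
from collections import Counter

def collocates_in_span(tokens, node, max_intervening=4):
    """Count the words found near each occurrence of `node`.

    A word qualifies if at most `max_intervening` words stand between
    it and the node word, on either side.
    """
    counts = Counter()
    reach = max_intervening + 1  # maximum distance in token positions
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - reach)
        end = min(len(tokens), i + reach + 1)
        for j in range(start, end):
            if j != i:
                counts[tokens[j]] += 1
    return counts

# Invented sentence, tokenised naively on whitespace.
tokens = "the doctor asked the nurse to call the hospital at once".split()
counts = collocates_in_span(tokens, "doctor")
```

In this sentence, nurse falls inside the span of doctor while hospital (six words intervening) does not. Note that function words such as the are counted as well: a span alone says nothing about syntactic pattern, which is what separates this broad acceptation from the narrow one.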
Collocation definitions • Collocations are actual words in habitual company. (Firth, 1968, 182) • We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun. (Hausmann, 1989, 1010) • a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133) • a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988) • A collocation is an arbitrary and recurrent word combination. (Benson, 1990) • Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170) • The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222) • recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
Collocation definitions (cont.) • Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5) • A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151) • Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507) • We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7) • A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9) • lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
Features: Unit • Children memorise not only single words, but also groups (chunks) of words. • Collocations are prefabricated units available as blocks (cf. the idiom principle): “The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments.” (Sinclair, 1991, 110) • “semi-finished products” of language (Hausmann, 1985, 124); “déjà-vu”.
Features: Recurrent, typical • Collocations are actual words in habitual company. (Firth, 1968, 182) • typical, specific and characteristic combination of two words (Hausmann, 1985) • We shall call collocation a characteristic combination of two words […]. (Hausmann, 1989, 1010) • a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133) • A collocation is an arbitrary and recurrent word combination. (Benson, 1990) • recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143) • A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151) • Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507) • We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7)
Features: Arbitrary • typical, specific and characteristic combination of two words (Hausmann, 1985) • A collocation is an arbitrary and recurrent word combination (Benson, 1990) • The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222) • recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143) • Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5) • Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507) • lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
Features: Unpredictable • “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988): “With an encoding idiom, by contrast, we have an expression which language users might or might not understand without prior experience, but concerning which they would not know that it is a conventional way of saying what it says” (Fillmore et al., 1988, 505) • […] these affinities can not be predicted on the basis of semantic or syntactic rules, but can be observed with some regularity in text (Cruse, 1986) • A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9)
Features: Made up of two or more words • Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170) • co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern (Cowie 1978:132) • a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988) • A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151) • the components of a collocation can again be collocational themselves: next to the German collocation Gültigkeit haben (n + v), we have allgemeine Gültigkeit haben [lit., ‘general validity have’], with allgemeine Gültigkeit, a collocation (n + a), as a component (Heid, 1994, 232). • In most of the examples, collocation patterns are restricted to pairs of words, but there is no theoretical restriction to the number of words involved (Sinclair, 1991, 170).
Summing up… • prefabricated unit • made up of two or more words • recursive • recurrent/typical • arbitrary • unpredictable • partly transparent • syntactically motivated • worth storing in a lexicon • asymmetric (base + collocate) But ultimately, the exact definition of collocations varies according to the application needs: “the practical relevance is an essential ingredient of their definition” (Evert, 2004, 75).
Prehistory • Collocations were already known and studied by the ancient Greeks (Gitsaki, 1996). • Pedagogical interest in collocations: Harold Palmer (1877–1949): “polylogs”, “known units”; Albert Sydney Hornby (1898–1978): Idiomatic and Syntactic English Dictionary (1942), A Learner’s Dictionary of Current English (1948), Advanced Learner’s Dictionary of Current English (1952), Oxford Advanced Learner’s Dictionary (multiple prints); Anthony P. Cowie; Peter Howarth; Michael Lewis: “islands of reliability” • Linguistic interest in collocations: “groupements usuels” (usual combinations), opposed to “groupements passagers” (temporary/free combinations) (Bally, 1909); “lexikalische Solidaritäten” (lexical solidarity) (Coseriu, 1967).
Syntactic characterisation Distinction between lexical and grammatical collocations (Benson et al., 1986) • Lexical collocations involve open-class words only (nouns, verbs, adjectives, most adverbs); most collocations are of this type • Grammatical collocations may contain function words (prepositions, conjunctions, pronouns, auxiliary verbs, articles): apathy towards, agreement that, in advance, angry at, afraid that (Examples from Benson et al., 1986)
Syntactic characterisation (cont.) Syntactic configurations relevant for collocations: • “We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun.” (Hausmann, 1989, 1010), i.e., N-A, N-V, V-N, V-Adv, A-Adv, N-P-N • BBI dictionary (Benson et al., 1986): many types, including: A-N, N-N, N-P:of-N, N-V, V-N, V-P-N, Adv-A, V-Adv, N-P, N-Conj, P-N, A-P, A-Conj • Unrestricted typology: “The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure.” (Fontenelle, 1992, 222)
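A restricted typology of this kind amounts to a filter over part-of-speech patterns. The sketch below assumes candidates that are already lemmatised and tagged by hand, with a simplified tag set (N, A, V, Adv, P) and the elements given in the order of Hausmann's list; reading "noun + (prep) + noun" as covering both N-P-N and N-N is also an interpretation made here, not something the slide states. All names are hypothetical.

```python
# Tag patterns for Hausmann's configurations a) to f).
# ("N", "N") is included on the assumption that the preposition
# in "noun + (prep) + noun" is optional.
HAUSMANN_PATTERNS = {
    ("N", "A"), ("N", "V"), ("V", "N"), ("V", "Adv"), ("A", "Adv"),
    ("N", "P", "N"), ("N", "N"),
}

def matches_configuration(candidate, patterns=HAUSMANN_PATTERNS):
    """Return True if the candidate's tag sequence is an allowed pattern.

    `candidate` is a tuple of (lemma, tag) pairs.
    """
    return tuple(tag for _, tag in candidate) in patterns

# Hand-tagged candidates, for illustration only.
candidates = [
    (("smoker", "N"), ("heavy", "A")),     # N-A
    (("break", "V"), ("record", "N")),     # V-N
    (("county", "N"), ("seat", "N")),      # N-N
    (("in", "P"), ("advance", "N")),       # P-N: not in Hausmann's list
]
kept = [c for c in candidates if matches_configuration(c)]
```

The P-N candidate in advance is rejected here although the BBI typology would accept it, which shows how much the choice of typology shapes what counts as a candidate. Swapping in a different pattern set is a one-line change.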
Semantic characterisation • The collocation is a semantic unit: “a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components” (Choueka, 1988) • “the noncompositionality of a string must be considered when assessing its holism” (Moon, 1998, 8) • Is the meaning of a collocation obtained by the composition of the meanings of individual words?
Semantic characterisation (cont.) • Collocations occupy the grey area of a continuum of compositionality: regular combinations (transparent) – collocations – idiomatic expressions (opaque) • Collocations are partly compositional (Meaning-Text Theory): B: base – autosemantic (semantic head); A: collocate – synsemantic (semantically dependent). Example: heavy smoker [Diagram: the meaning ‘A B’ in relation to the meanings ‘A’ and ‘B’]
Semantic characterisation (cont.) • “the meaning of a collocation is not a straightforward composition of the meaning of its parts” (Manning and Schütze, 1999, 172–173) “there is always an element of meaning added to the combination” (1999, 184). The meaning of a collocation like white wine contains an added element of connotation with respect to the connotation of wine and white together. • “the individual words in a collocation can contribute to the overall semantics of the compound” (McKeown and Radev, 2000, 507). Example: white wine [Diagram: the meaning ‘A B’ in relation to the meanings ‘A’ and ‘B’]
Semantic characterisation (cont.) • Easy to decode, difficult to encode: “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988). Example: entertain hope [Diagram: the meaning ‘A B’ in relation to the meanings ‘A’ and ‘B’]
Collocations vs. idioms [Diagram: possible relations between the set of collocations and the set of idioms] “fall somewhere along a continuum between free word combinations and idioms” (McKeown and Radev, 2000, 509) “The term collocation will be used to refer to sequences of lexical items which habitually co-occur, but which are nonetheless fully transparent in the sense that each lexical constituent is also a semantic constituent.” (Cruse, 1986, 40)
Collocations vs. idioms (cont.) [Diagram: possible relations between the set of collocations and the set of idioms, including idioms as a subset of collocations] “I will use the term collocation as the most general term to refer to all types of fixed combinations of lexical items; in this view, idioms are a special subclass of collocations” (van der Wouden, 1997, 9). “Idiomaticity applies to encoding for collocations, but not to decoding” (Fillmore et al., 1988).
Collocations vs. other types of MWEs • Multi-word expressions (MWE) cover a broad spectrum of phenomena: named entities (European Union), compounds (wheel chair), verb-particle constructions (give up), light-verb constructions (take a bath), ... Note: While theoretically appealing, fine-grained distinctions are less important in practice. All expressions share the same fate: lexicon → special treatment. They are equally important; what changes is their share in language.
Predominance of collocations • “collocations make up the lion’s share of the phraseme [MWE] inventory, and thus deserve our special attention” (Mel’čuk 1998, 24). • “no piece of natural spoken or written English is totally free of collocation” (Lea and Runcie, 2002, vii) • “In all kinds of texts, collocations are indispensable elements with which our utterances are very largely made” (Kjellmer 1987:140) • Les députés réformistes surveilleront de près les mesures que prendra le gouvernement au sujet du rôle que jouera le Canada dans le maintien de la paix […] (Hansard Corpus)
Quiz agreement
English • Choueka (1988): Looking for needles in a haystack … pre-processing: none (plain text); candidates: sequences of adjacent words up to 7 words long; ranking: raw frequency • Kjellmer (1994): A Dictionary of English Collocations. pre-processing: none (plain text); candidates: sequences of adjacent words; ranking: raw frequency • Justeson and Katz (1995): Technical terminology: Some linguistic properties and an algorithm for identification in text. pre-processing: NP chunking (patterns containing N, A, P); candidates: n-grams; ranking: raw frequency. Example: central processing unit
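The plain-text, raw-frequency approach can be sketched in a few lines. This is an illustration of the general idea (adjacent word sequences up to 7 words long, ranked by raw frequency), not a reimplementation of any of the systems above; the function name, the minimum length of 2, the frequency threshold and the toy text are all assumptions.

```python
from collections import Counter

def ngram_frequencies(tokens, min_len=2, max_len=7):
    """Count every sequence of adjacent tokens of length min_len..max_len."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

# Toy text, tokenised naively on whitespace.
tokens = ("the central processing unit is fast and "
          "the central processing unit is small").split()
freqs = ngram_frequencies(tokens)

# Rank by raw frequency and keep sequences seen more than once.
recurrent = [(gram, freq) for gram, freq in freqs.most_common() if freq > 1]
```

Even on this toy text the limits of raw frequency show: central processing unit is recurrent, but so are fragments such as the central and unit is. This is the kind of noise that linguistic pre-processing, such as the NP chunking used by Justeson and Katz, is meant to filter out.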