Corpora and Statistical Methods
Word Sense Disambiguation
Albert Gatt
What are word senses? • Cognitive definition: • mental representation of meaning • used in psychological experiments • relies on introspection (notoriously deceptive) • Dictionary-based definition: • adopt sense definitions in a dictionary • most frequently used resource is WordNet
WordNet • Taxonomic representation of words (“concepts”) • Each word belongs to a synset, which contains near-synonyms • Each synset has a gloss • Words with multiple senses (polysemy) belong to multiple synsets • Synsets organised by hyponymy (IS-A) relations • Also, other lexical relations, depending on category
How many senses? • Example: interest • pay 3% interest on a loan • showed interest in something • purchased an interest in a company • the national interest… • have X’s best interest at heart • have an interest in word senses • the economy is run by business interests
WordNet entry for interest (noun) • a sense of concern with and curiosity about someone or something … (Synonym: involvement) • the power of attracting or holding one’s interest… (Synonym: interestingness) • a reason for wanting something done (Synonym: sake) • a fixed charge for borrowing money… • a diversion that occupies one’s time and thoughts … (Synonym: pastime) • a right or legal share of something; a financial involvement with something (Synonym: stake) • (usually plural) a social group whose members control some field of activity and who have common aims (Synonym: interest group)
Some issues • Are all these really distinct senses? Is WordNet too fine-grained? • Would native speakers distinguish all these as different? • Cf. the distinction between sense ambiguity and underspecification (vagueness): • one could argue that there are fewer senses, but these are underspecified out of context
Translation equivalence • Many WSD applications rely on translation equivalence • Given: parallel corpus (e.g. English-German) • if word w in English has n translations in German, then each translation represents a sense • e.g. German translations of interest: • Zins: financial charge (WN sense 4) • Anteil: stake in a company (WN sense 6) • Interesse: all other senses
Some terminology • WSD Task: given an ambiguous word, find the intended sense in context • Sense tagging: task of labelling words as belonging to one sense or another. • needs some a priori characterisation of senses of each relevant word • Discrimination: • distinguishes between occurrences of words based on senses • not necessarily explicit labelling
Some more terminology • Two types of WSD task: • Lexical sample task: focuses on disambiguating a small set of target words, using an inventory of the senses of those words. • All-words task: focuses on entire texts and a lexicon, where every word in the text has to be disambiguated • Serious data sparseness problems!
Approaches to WSD • All methods rely on training data. Basic idea: • Given word w in context c • learn how to predict sense s of w based on various features of w and its context • Supervised learning: training data is labelled with correct senses • can do sense tagging • Unsupervised learning: training data is unlabelled • but many other knowledge sources used • cannot do sense tagging, since this requires a priori senses
Supervised learning • Words in training data labelled with their senses • She pays 3% interest/INTEREST-MONEY on the loan. • He showed a lot of interest/INTEREST-CURIOSITY in the painting. • Similar to POS tagging • given a corpus tagged with senses • define features that indicate one sense over another • learn a model that predicts the correct sense given the features
Features (e.g. plant) • Neighbouring words: • plant life • manufacturing plant • assembly plant • plant closure • plant species • Content words in a larger window • animal • equipment • employee • automatic
Other features • Syntactically related words • e.g. object, subject… • Topic of the text • is it about SPORT? POLITICS? • Part-of-speech tag, surrounding part-of-speech tags • (a feature-extraction sketch follows below)
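A minimal sketch of how features like these might be extracted, assuming a pre-tokenised, POS-tagged sentence; the window size and feature names are illustrative choices, not part of the original slides:

```python
def extract_features(tokens, pos_tags, target_index, window=3):
    """Build a feature dict for the ambiguous word at target_index.

    tokens:   words of the sentence
    pos_tags: POS tags aligned with tokens
    window:   how many words either side count as 'larger window' features
    """
    features = {}
    # Immediately neighbouring words (collocational features)
    if target_index > 0:
        features["w-1"] = tokens[target_index - 1]
    if target_index + 1 < len(tokens):
        features["w+1"] = tokens[target_index + 1]
    # POS of the target word and of its left neighbour
    features["pos"] = pos_tags[target_index]
    if target_index > 0:
        features["pos-1"] = pos_tags[target_index - 1]
    # Content words in a larger window (bag-of-words features)
    lo = max(0, target_index - window)
    hi = min(len(tokens), target_index + window + 1)
    for w in tokens[lo:target_index] + tokens[target_index + 1:hi]:
        features["bow=" + w.lower()] = True
    return features

# e.g. "... manufacturing plant closure ...": w-1=manufacturing and
# w+1=closure both point towards the factory sense of "plant".
```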
Some principles proposed (Yarowsky 1995) • One sense per discourse: • typically, all occurrences of a word will have the same sense in the same stretch of discourse (e.g. same document) • One sense per collocation: • nearby words provide clues as to the sense, depending on the distance and syntactic relationship • e.g. plant life: all (?) occurrences of plant+life will indicate the botanic sense of plant
Training data • SENSEVAL • Shared Task competition • datasets available for WSD, among other things • annotated corpora in many languages • (NB: SENSEVAL has now merged into the broader SemEval tasks) • Pseudo-words • create a training corpus by artificially conflating words • e.g. replace all occurrences of man and hammer with man-hammer • an easy way to create training data (see the sketch below) • Multi-lingual parallel corpora • translated texts aligned at the sentence level • translation indicates sense
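A minimal sketch of pseudo-word creation, assuming a tokenised corpus; the word pair follows the slide's man/hammer example, and the replacement label is an illustrative choice:

```python
def make_pseudoword_corpus(sentences, w1="man", w2="hammer"):
    """Conflate all occurrences of w1 and w2 into the pseudo-word w1-w2,
    keeping the original word as the 'correct sense' label."""
    pseudo = w1 + "-" + w2
    data = []
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok in (w1, w2):
                conflated = sent[:i] + [pseudo] + sent[i + 1:]
                # the original token serves as the gold sense label
                data.append((conflated, i, tok))
    return data

corpus = [["the", "man", "walked"], ["she", "bought", "a", "hammer"]]
for sentence, idx, gold in make_pseudoword_corpus(corpus):
    print(sentence, "->", gold)
```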
SemCor corpus • Corpus consisting of files from the Brown Corpus (1m words) • SemCor contains 352 files, 360k words total: • 186 files contain sense tags for all content words • The remainder only have sense tags for verbs • Originally created using the WordNet 1.6 sense inventory; since updated to the more recent versions of WordNet
SemCor Example (= Brown file a01)
<s snum=1> [...]
<wf cmd=ignore pos=DT>an</wf>
<wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf>
<wf cmd=ignore pos=IN>of</wf>
<wf cmd=done pos=NN lemma=atlanta wnsn=1 lexsn=1:15:00::>Atlanta</wf>
<wf cmd=ignore pos=POS>'s</wf>
<wf cmd=done pos=JJ lemma=recent wnsn=2 lexsn=5:00:00:past:00>recent</wf>
<wf cmd=done pos=NN lemma=primary_election wnsn=1 lexsn=1:04:00::>primary_election</wf>
[...] </s>
SemCor example <wf cmd=done pos=NN lemma=investigation wnsn=1 lexsn=1:09:00::>investigation</wf> • Senses in WordNet: • S: (n) probe, investigation (an inquiry into unfamiliar or questionable activities) "there was a congressional probe into the scandal" • S: (n) investigation, investigating (the work of inquiring into something thoroughly and systematically) • This occurrence involves the first sense
Data representation • Example sentence: An electric guitar and bass player stand off to one side... • Target word: bass • Possible senses: fish, musical instrument... • Relevant features are represented as vectors (see the sketch below)
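The vector from the original slide is not reproduced in this export; a minimal sketch of what such a representation might look like, assuming a ±2-word window of words and POS tags (the exact feature set is an illustrative assumption):

```python
# Target: "bass" in "An electric guitar and bass player stand off to one side"
# Features: [w-2, pos-2, w-1, pos-1, w+1, pos+1, w+2, pos+2], plus a sense label
feature_vector = ["guitar", "NN", "and", "CC", "player", "NN", "stand", "VB"]
sense_label = "bass-MUSIC"
```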
Part 1 Supervised methods I: Naive Bayes
Naïve Bayes Classifier • Identify the features (F) • e.g. surrounding words • other cues apart from surrounding context • Combine evidence from all features • Decision rule: decide on sense s′ iff $s' = \arg\max_{s_k} P(s_k \mid F)$ • Example: drug. F = words in context • medication sense: price, prescription, pharmaceutical • illegal substance sense: alcohol, illicit, paraphernalia
Naive Bayes Classifier • Problem: we usually don’t know the probability of a sense given the features, $P(s_k \mid F)$ • In our corpus, we have access to $P(F \mid s_k)$ • We can compute $P(s_k \mid F)$ from $P(F \mid s_k)$ • For this we use Bayes’ theorem
Deriving Bayes’ rule from the multiplication rule • Recall that: $P(A \cap B) = P(A \mid B)\,P(B)$ • Given the symmetry of intersection, the multiplication rule can be written in two ways: $P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$ • Bayes’ rule involves the substitution of one equation into the other, to replace $P(A \cap B)$: $P(B \mid A) = \dfrac{P(A \mid B)\,P(B)}{P(A)}$
Deriving P(A) • Often, it’s not clear where P(A) should come from • we start out from conditional probabilities! • Given that we have two sets of outcomes of interest, A and B, P(A) can be derived from the following observation: $A = (A \cap B) \cup (A \cap \neg B)$ • i.e. the events in A are made up of those which are only in A (but not in B) and those which are in both A and B.
Finding P(A) -- I [Venn diagram: overlapping sets A and B] An outcome in A must be in either A alone or in both A and B, since A is composed of these two sets.
Finding P(A) -- II Step 1: Applying the addition rule: $P(A) = P(A \cap B) + P(A \cap \neg B) = P(A \mid B)\,P(B) + P(A \mid \neg B)\,P(\neg B)$ Step 2: Substituting into Bayes’ equation to replace P(A): $P(B \mid A) = \dfrac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid \neg B)\,P(\neg B)}$
Using Bayes’ rule for WSD • We usually don’t know $P(s_k \mid F)$, but we can compute it from what the training data gives us, $P(s_k)$ (the prior) and $P(F \mid s_k)$: $P(s_k \mid F) = \dfrac{P(F \mid s_k)\,P(s_k)}{P(F)}$ • $P(F)$ can be eliminated because it is constant for all senses in the corpus, giving the decision rule $s' = \arg\max_{s_k} P(F \mid s_k)\,P(s_k)$
The independence assumption • It’s called “naïve” because: $P(F \mid s_k) = \prod_{f_j \in F} P(f_j \mid s_k)$ • i.e. all features are assumed to be independent of one another, given the sense • Obviously, this is often not true. • e.g. finding illicit in the context of drug may not be independent of finding pusher. • cf. our discussion of collocations! • Also, topics often constrain word choice.
Training the naive Bayes classifier • We need to compute: • $P(s)$ for all senses s of w • $P(f_j \mid s)$ for all features $f_j$ • both can be estimated from a sense-tagged corpus by relative frequency (maximum likelihood): $P(s) = \dfrac{C(s)}{C(w)}$ and $P(f_j \mid s) = \dfrac{C(f_j, s)}{C(s)}$, where C(s) counts occurrences of w in sense s (a sketch follows below)
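A minimal sketch of training and applying the classifier, assuming sense-labelled training instances represented as lists of context words; add-one smoothing and log probabilities (to avoid underflow) are implementation choices of mine, not part of the slides:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(instances):
    """instances: list of (context_words, sense) pairs for one target word."""
    sense_counts = Counter(sense for _, sense in instances)
    feature_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in instances:
        for w in words:
            feature_counts[sense][w] += 1
            vocab.add(w)
    return sense_counts, feature_counts, vocab

def classify(context_words, sense_counts, feature_counts, vocab):
    total = sum(sense_counts.values())
    best_sense, best_score = None, float("-inf")
    for sense, count in sense_counts.items():
        score = math.log(count / total)            # log P(s), the prior
        denom = sum(feature_counts[sense].values()) + len(vocab)
        for w in context_words:
            # add-one smoothed log P(f_j | s)
            score += math.log((feature_counts[sense][w] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

train = [(["price", "prescription", "pharmaceutical"], "MEDICATION"),
         (["alcohol", "illicit", "paraphernalia"], "ILLEGAL_SUBSTANCE")]
model = train_naive_bayes(train)
print(classify(["illicit", "price", "alcohol"], *model))
```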
Part 2 Supervised methods II: Information-theoretic and collocation-based approaches
Information-theoretic measures • Find the single, most informative feature to predict a sense. • E.g. using a parallel corpus: • prendre (FR) can translate as take or make • prendre une décision: make a decision • prendre une mesure: take a measure [to…] • Informative feature in this case: direct object • mesure indicates take • décision indicates make • Problem: need to identify the correct value of the feature that indicates a specific sense.
Brown et al’s algorithm • Given: translations T of word w • Given: values X of a useful feature (e.g. mesure, décision as values of DO) • Step 1: random partition P of T • While improving, do: • create partition Q of X that maximises I(P;Q) • find a partition P of T that maximises I(P;Q) comment: relies on mutual info to find clusters of translations mapping to clusters of feature values (see the sketch below)
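A minimal sketch of this flip-flop iteration, assuming observations are (translation, feature-value) pairs; the brute-force search over two-way partitions is a simplification of mine (realistic implementations use a more efficient splitting procedure):

```python
import math
from itertools import combinations
from collections import Counter

def mutual_information(pairs, t_group, x_group):
    """I(P;Q), where t_group / x_group map translations / feature values
    to cluster 0 or 1, and pairs are (translation, value) observations."""
    n = len(pairs)
    joint = Counter((t_group[t], x_group[x]) for t, x in pairs)
    pt = Counter(t_group[t] for t, _ in pairs)
    px = Counter(x_group[x] for _, x in pairs)
    return sum((c / n) * math.log2((c / n) / ((pt[i] / n) * (px[j] / n)))
               for (i, j), c in joint.items())

def two_way_partitions(values):
    """Yield every non-trivial assignment of values to clusters 0 and 1."""
    values = sorted(values)
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            yield {v: (0 if v in subset else 1) for v in values}

def flip_flop(pairs, iterations=5):
    translations = {t for t, _ in pairs}
    feature_values = {x for _, x in pairs}
    p = next(two_way_partitions(translations))   # step 1: initial partition P of T
    q = None
    for _ in range(iterations):
        # create partition Q of X that maximises I(P;Q)
        q = max(two_way_partitions(feature_values),
                key=lambda cand: mutual_information(pairs, p, cand))
        # find partition P of T that maximises I(P;Q)
        p = max(two_way_partitions(translations),
                key=lambda cand: mutual_information(pairs, cand, q))
    return p, q

pairs = [("make", "décision")] * 5 + [("take", "mesure")] * 5
print(flip_flop(pairs))
```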
Using dictionaries and thesauri • Lesk (1986): one of the first to exploit dictionary definitions • the definition corresponding to a sense can contain words which are good indicators for that sense • Method: • Given: ambiguous word w with senses s1…sn and glosses g1…gn • Given: the word w in context c • compute the overlap between c and each gloss • select the maximally matching sense (see the sketch below)
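A minimal sketch of this overlap computation, assuming senses are given as (label, gloss) pairs; the stop-word list and whitespace tokenisation are illustrative simplifications:

```python
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "he"}

def tokenize(text):
    return {w for w in text.lower().split() if w not in STOP_WORDS}

def lesk(context, senses):
    """Pick the sense whose gloss overlaps most with the context.

    context: the sentence containing the ambiguous word (a string)
    senses:  list of (sense_label, gloss) pairs
    """
    context_words = tokenize(context)
    best_sense, best_overlap = None, -1
    for label, gloss in senses:
        overlap = len(context_words & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = label, overlap
    return best_sense

senses = [("bank-FINANCE", "a financial institution that accepts deposits"),
          ("bank-RIVER", "sloping land beside a body of water")]
print(lesk("he sat on the bank of the river and watched the water", senses))
```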
Expanding a dictionary • Problem with Lesk: • often dictionary definitions don’t contain sufficient information • not all words in dictionary definitions are good informants • Solution: use a thesaurus with subject/topic categories • e.g. Roget’s thesaurus • http://www.gutenberg.org/cache/epub/22/pg22.html.utf8
Using topic categories • Suppose every sense sk of word w has subject/topic tk • w can be disambiguated by identifying the words related to tk in the thesaurus • Problems: • general-purpose thesauri don’t list domain-specific topics • several potentially useful words can be left out • e.g. … Navratilova plays great tennis … • the proper name here is useful as an indicator of the topic SPORT
Expanding a thesaurus: Yarowsky 1992 • Given: context c and topic t • For all contexts and topics, compute p(c|t) using Naïve Bayes • by comparing words pertaining to t in the thesaurus with words in c • if p(c|t) > α, then assign topic t to context c • For all words in the vocabulary, update the list of contexts in which the word occurs. • Assign topic t to each word in c • Finally, compute p(w|t) for all w in the vocabulary • this gives the “strength of association” of w with t
Yarowsky 1992: some results

WORD      SENSE          ROGET TOPIC    ACCURACY
star      space object   UNIVERSE       96%
star      celebrity      ENTERTAINER    95%
star      shape          INSIGNIA       82%
sentence  punishment     LEGAL_ACTION   99%
sentence  set of words   GRAMMAR        98%
Bootstrapping • Yarowsky (1995) suggested the one sense per discourse/collocation constraints. • Yarowsky’s method: • select the strongest collocational feature in a specific context • disambiguate based only on this feature • (similar to the information-theoretic method discussed earlier)
One sense per collocation • For each sense s of w, initialise F, the collocations found in the dictionary definition for s • One sense per collocation: • identify the set of contexts containing collocates of s • for each sense s of w, update F to contain those collocates f such that $\dfrac{P(s \mid f)}{P(s' \mid f)} > \alpha$ for all s′ ≠ s (where α is a threshold)
One sense per discourse • For each document: • find the majority sense of w out of those found in previous step • assign all occurrences of w the majority sense • This is implemented as a post-processing step. Reduces error rate by ca. 27%.
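A minimal sketch combining the two heuristics, assuming each occurrence is a (doc_id, collocate-set) pair and that seed collocations per sense are given (e.g. from dictionary definitions); the decision details are simplified relative to Yarowsky (1995):

```python
from collections import Counter, defaultdict

def bootstrap_label(occurrences, seeds):
    """occurrences: list of (doc_id, set_of_collocates)
    seeds: dict sense -> set of seed collocations
    Returns a sense label (or None) for each occurrence."""
    # One sense per collocation: label occurrences that contain a seed collocate
    labels = []
    for doc_id, collocates in occurrences:
        matched = [s for s, f in seeds.items() if collocates & f]
        labels.append(matched[0] if len(matched) == 1 else None)
    # One sense per discourse (post-processing): majority sense per document
    by_doc = defaultdict(Counter)
    for (doc_id, _), label in zip(occurrences, labels):
        if label is not None:
            by_doc[doc_id][label] += 1
    return [by_doc[doc_id].most_common(1)[0][0] if by_doc[doc_id] else label
            for (doc_id, _), label in zip(occurrences, labels)]

occurrences = [("d1", {"life", "species"}), ("d1", {"closure"}),
               ("d2", {"manufacturing", "assembly"})]
seeds = {"plant-BOTANY": {"life", "species"},
         "plant-FACTORY": {"manufacturing", "equipment"}}
print(bootstrap_label(occurrences, seeds))
```

Note how the second occurrence in document d1 has no matching collocate, but inherits the document's majority sense (plant-BOTANY) in the post-processing step.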
Part 3 Unsupervised disambiguation
Preliminaries • Recall: unsupervised learning can do sense discrimination not tagging • akin to clustering occurrences with the same sense • e.g. Brown et al 1991: cluster translations of a word • this is akin to clustering senses
Brown et al’s method • Preliminary categorisation: • (1) set P(w|s) randomly for all words w and senses s of w • (2) compute, for each context c of w, the probability P(c|s) that the context was generated by sense s • Use (1) and (2) as a preliminary estimate; re-estimate iteratively to find the best fit to the corpus (an EM-style sketch follows below)
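A minimal sketch of this iterative re-estimation in EM style, assuming contexts are bags of words and a fixed number of senses; the initialisation, smoothing, and fixed iteration count are simplifications of mine:

```python
import math
import random
from collections import defaultdict

def em_sense_discrimination(contexts, n_senses=2, iterations=20, seed=0):
    """contexts: list of word lists, one per occurrence of the ambiguous word.
    Returns a sense distribution P(s|c) for each context."""
    rng = random.Random(seed)
    vocab = sorted({w for c in contexts for w in c})
    # Step (1): set P(w|s) randomly, then normalise per sense
    p_w = []
    for _ in range(n_senses):
        raw = {w: rng.random() for w in vocab}
        z = sum(raw.values())
        p_w.append({w: v / z for w, v in raw.items()})
    p_s = [1.0 / n_senses] * n_senses
    posteriors = []
    for _ in range(iterations):
        # E-step, i.e. step (2): P(s|c) proportional to P(s) * prod_w P(w|s)
        posteriors = []
        for c in contexts:
            scores = [p_s[s] * math.prod(p_w[s][w] for w in c)
                      for s in range(n_senses)]
            z = sum(scores) or 1.0
            posteriors.append([sc / z for sc in scores])
        # M-step: re-estimate P(s) and P(w|s) from the soft assignments
        for s in range(n_senses):
            weight = sum(post[s] for post in posteriors)
            p_s[s] = weight / len(contexts)
            counts = defaultdict(float)
            for c, post in zip(contexts, posteriors):
                for w in c:
                    counts[w] += post[s]
            total = sum(counts.values()) or 1.0
            p_w[s] = {w: (counts[w] + 1e-9) / total for w in vocab}
    return posteriors

contexts = [["fish", "river"], ["guitar", "play"], ["fish", "boat"], ["play", "music"]]
for c, post in zip(contexts, em_sense_discrimination(contexts)):
    print(c, [round(p, 2) for p in post])
```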
Characteristics of unsupervised disambiguation • Can adapt easily to new domains, not covered by a dictionary or pre-labelled corpus • Very useful for information retrieval • If there are many senses (e.g. 20 senses for word w), the algorithm will split contexts into fine-grained sets • NB: can go awry with infrequent senses
Part 4 Some issues with WSD
The task definition • The WSD task traditionally assumes that a word has one and only one sense in a context. • Is this true? • Is it possible to see evidence of co-activation (one word displaying more than one sense)? • this would bring competition to the licensed trade • competition = “act of competing” • competition = “people/organisations who are competing”
Systematic polysemy • Not all senses are so easy to distinguish. E.g. competition in the “agent competing” vs “act of competing” sense. • The polysemy here is systematic • Compare bank/bank where the senses are utterly distinct (and most linguists wouldn’t consider this a case of polysemy, but homonymy) • Can translation equivalence help here? • depends if polysemy is systematic in all languages