Special Topics in Computer Science. Advanced Topics in Information Retrieval. Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution. Alexander Gelbukh, www.Gelbukh.com
Previous Chapter: Conclusions • Reducing synonyms can help IR • Better matching • Ontologies are used. WordNet • Morphology is a variant of synonymy • widely used in IR systems • Precise analysis: dictionary-based analyzers • Quick-and-dirty analysis: stemmers • Rule-based stemmers. Porter stemmer • Statistical stemmers
Previous Chapter: Research topics • Construction and application of ontologies • Building morphological dictionaries • Treatment of unknown words with morphological analyzers • Development of better stemmers • Statistical stemmers?
Contents • Tagging: for each word, determine its POS (Part of Speech: noun, ...) and grammatical characteristics • WSD (Word Sense Disambiguation): for each word, determine which homonym is used • Anaphora resolution: for a pronoun (it, ...), determine what it refers to
Tagging: The problem • Ambiguity of parts of speech • rice flies like sand • = insects living in rice consider sand good? • = rice can fly similarly to sand? • ... insect of a container with rice...? • We can fly like sand ... We think fly like sand... • Ambiguity of grammatical characteristics • He has read the book • He will read the book... He read the book • Very frequent phenomenon, at nearly every word!
Tagger... • A program that looks at the context and decides what the part of speech (and other characteristics) is • Input: He will read the book • Morphological analysis: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...> • Tagger's task: pick one tag for each word • Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...
...Tagger • Input of the tagger: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...> • Task: choose one! • Output: He<...> will<Va> read<Vinf> the<...> • How do we do it? • He will<Ns> is not possible, so will<Va> • will<Va> read implies read<Vinf> • This is simple, but imagine He is ambiguous too... combinatorial explosion
Applications • Used for word sense disambiguation: • Oil well in Mexico is used. • Oil is used well in Mexico. • For stemming and lemmatization • Important for matching in information retrieval • Greatly speeds up syntactic analysis • Tagging is local • No need to process the whole sentence to find that a certain tag is incorrect
How: Parsing? • We can find all the syntactic structures • Only the correct variants will enter a syntactic structure • will + Vinf forms a syntactic unit • will + Vpa does not • Problems • Computationally expensive • What to do with ambiguities? • rice flies like sand • Depends on what you need
Statistical tagger • Example: TnT tagger • Based on a Hidden Markov Model (HMM) • Idea: • Some words are more probable after some other words • Find these probabilities • Guess a word's tag if you know the nearby ones • Problem: • Letter strings denote meanings • "x is more probable after y" holds for meanings (tags), not for strings • so we must guess what we cannot see: the meanings
Hidden Markov Model: Idea • A system changes its state • What a person thinks • Random... but not completely (how?) • In each state, it emits an output • What he says when he thinks something • Random... but somehow (?) depends on what he thinks • We know the sequence of produced outputs • Text: we can see it! • Guess what the underlying states were • Hidden: we cannot see them
Hidden Markov Model: Hypotheses • A finite set of states: q1 ... qN (invisible) • POS and grammatical characteristics (language) • A finite set of observations: v1 ... vM • Strings we see in the corpus (language) • A random sequence of states xi • POS tags in the text • Probabilities of state transitions P(xi+1 | xi) • Language rules and use • Probabilities of observations P(vk | xi) • words expressing the meanings: Vinf: ask, V3: asks
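As a concrete illustration of where these two probability tables come from, here is a minimal sketch (in Python, with an invented toy corpus) of estimating them by simple counting over a POS-tagged training set; real taggers such as TnT add smoothing for unseen words and tag pairs:

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Count transitions P(x_i+1 | x_i) and emissions P(v_k | x_i)
    over a POS-tagged corpus, then normalize counts to probabilities."""
    trans = defaultdict(Counter)    # tag -> counts of following tags
    emit = defaultdict(Counter)     # tag -> counts of emitted words
    for sent in tagged_sentences:   # sent = [(word, tag), ...]
        for (_, tag), (_, next_tag) in zip(sent, sent[1:]):
            trans[tag][next_tag] += 1
        for word, tag in sent:
            emit[tag][word.lower()] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    return ({t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

# Toy training corpus, invented for illustration only
# (tags follow the slides' notation, extended with Pron and Det)
corpus = [
    [("He", "Pron"), ("will", "Va"), ("read", "Vinf"),
     ("the", "Det"), ("book", "Ns")],
    [("He", "Pron"), ("read", "Vpa"), ("the", "Det"), ("book", "Ns")],
]
P_trans, P_emit = estimate_hmm(corpus)
print(P_trans["Va"])    # {'Vinf': 1.0}: after an auxiliary, infinitive
print(P_emit["Vinf"])   # {'read': 1.0}
```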
Hidden Markov Model: Problem • The same observation corresponds to different meanings • Vinf: read, Vpp: read • Looking at what we can see, guess what we cannot • This is why it is called hidden • Given a sequence of observations oi • The text: a sequence of letter strings. Training set • Guess the sequence of states xi • The POS of each word • Our hypotheses on the xi depend on each other • Highly combinatorial task
Hidden Markov Model: Solutions • Need to find the parameters of the model: • P(xi+1 | xi) • P(vk | xi) • Optimally: so as to maximize the probability of generating this specific output • Optimization methods from Operations Research are used • More details? Not so simple...
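One standard answer to the decoding problem ("guess the hidden states") is the Viterbi algorithm, which finds the most probable tag sequence by dynamic programming instead of enumerating every combination of tags. A minimal sketch, assuming the probability tables produced by the counting step above plus a P_start table of initial-tag probabilities; a real implementation would work in log-space to avoid underflow:

```python
def viterbi(words, tags, P_start, P_trans, P_emit):
    """Most probable tag sequence under an HMM, found by dynamic
    programming: no combinatorial explosion."""
    # best[i][t] = (probability of the best path that ends in tag t
    #               at position i, backpointer to the previous tag)
    best = [{t: (P_start.get(t, 0.0)
                 * P_emit.get(t, {}).get(words[0], 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # pick the best previous tag for each current tag
            p, prev = max((best[i - 1][s][0]
                           * P_trans.get(s, {}).get(t, 0.0)
                           * P_emit.get(t, {}).get(words[i], 0.0), s)
                          for s in tags)
            column[t] = (p, prev)
        best.append(column)
    # Trace the backpointers from the most probable final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```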
Brill Tagger (rule-based) • Eric Brill • Makes an initial assumption about POS tags in the text • Uses context-dependent rewriting rules to correct some tags • Applies them iteratively • Learns the rules from a training corpus • The rules are in human-understandable form • You can correct them manually to improve the tagger • Unlike HMMs, whose parameters are not human-understandable
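A toy sketch of the transformation-based idea: start from a naive most-frequent-tag assignment, then let context-dependent rewriting rules patch the mistakes. The two rules and the lexicon below are invented for illustration, not Brill's actual learned rules:

```python
def initial_tags(words, most_frequent_tag):
    # Step 1: naive assumption, tag each word with its most frequent tag
    return [most_frequent_tag.get(w, "Ns") for w in words]

RULES = [
    # (from_tag, to_tag, condition on the previous tag)
    ("Ns", "Va", lambda prev: prev == "Pron"),   # "he will": auxiliary
    ("Vpa", "Vinf", lambda prev: prev == "Va"),  # "will read": infinitive
]

def brill_tag(words, most_frequent_tag):
    tags = initial_tags(words, most_frequent_tag)
    # Step 2: rewrite tags wherever a rule's context matches
    for frm, to, cond in RULES:
        for i in range(1, len(tags)):
            if tags[i] == frm and cond(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

lexicon = {"he": "Pron", "will": "Ns", "read": "Vpa",
           "the": "Det", "book": "Ns"}
print(brill_tag(["he", "will", "read", "the", "book"], lexicon))
# [('he', 'Pron'), ('will', 'Va'), ('read', 'Vinf'),
#  ('the', 'Det'), ('book', 'Ns')]
```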
Word Sense Disambiguation • Query: international bank in Seoul • Bank: • financial institution • river shore • place to store something • ... • Korean 한 원: • Korean $ • superior official • 한상용 ... • ... • Hotel located at the beautiful bank of Han river. • Relevant for the query? • The POS is the same, so a tagger will not distinguish them
Applications • Translation • 대원군 = Great Governor of the Court • 만원 = 10 thousand won • international bank = banco internacional • river bank = orilla del río • Information retrieval • Document retrieval: is it really useful? Same info • Passage retrieval: can prove very useful! • Semantic analysis
Representation of word senses • Explanations. Semantic dictionaries • Bank1 is an institution to keep money • Bank2 is a sloping edge of a river • Synsets and ontology: WordNet (HowNet: Chinese) • Synonyms: {bank, shore} • WordNet terminology: synset #12345 • Corresponds to all ways to call a concept • Relationships: #12345 IS_PART_OF #67890 {river, stream} • #987 IS_A #654 {institution, organization} • WordNet also has glosses
Task • Given a text (probably POS-tagged) • Tag each word with its synset number #123 or dictionary sense number bank1 • Input: • Mary keeps the money in a bank. • Han river’s bank is beautiful. • Output: • Mary keeps<1> the money<1> in a bank<1> • Han river’s bank<2> is beautiful.
Lesk algorithm • Michael Lesk • Explanatory dictionary • Bank1 is an institution to keep money • Bank2 is a sloping edge of a river • Mary keeps her money (savings) in a bank. • Choose the sense whose definition has more words in common with the immediate context (see the sketch below) • Improvements (Pedersen, Gelbukh & Sidorov) • Use synonyms when there are no direct matches • Use synonyms of synonyms, ...
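A minimal sketch of the basic Lesk overlap count, using the two toy definitions of bank from this slide; the stopword list and the scoring are simplified, and Lesk's original algorithm also compares glosses of the other context words:

```python
def lesk(word, context, definitions):
    """Simplified Lesk: pick the sense whose dictionary gloss
    shares the most words with the immediate context."""
    stop = {"a", "an", "the", "of", "to", "is", "in", "her", "it"}
    ctx = {w.lower() for w in context} - stop
    best_sense, best_overlap = None, -1
    for sense, gloss in definitions[word].items():
        overlap = len(ctx & ({g.lower() for g in gloss.split()} - stop))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

defs = {"bank": {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}}
print(lesk("bank", "Mary keeps her money in a bank".split(), defs))
# bank1 (the gloss of bank1 shares 'money' with the context)
```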
Other word relatedness measures • Lexical chains in WordNet • The length of the path in the graph of relationships • Mutual information: frequent co-occurrences • Collocations (Bolshakov & Gelbukh) • Keep in bank1 • Bank2 of river • Very large dictionary of such combinations • Number of words in common between explanations • Recursive: common words or related words (Gelbukh & Sidorov)
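For the WordNet path-length measure, NLTK exposes a ready-made path_similarity score. A small illustration, assuming the NLTK WordNet data is installed; note that the sense numbering (bank.n.01 vs. bank.n.02) can differ between WordNet versions:

```python
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

river_bank = wn.synset("bank.n.01")   # sloping land beside a body of water
money_bank = wn.synset("bank.n.02")   # financial institution
river = wn.synset("river.n.01")

# Path similarity: inversely related to the length of the shortest
# path between the two synsets in the relationship graph
print(river.path_similarity(river_bank))   # closer in the hierarchy
print(river.path_similarity(money_bank))   # farther away
```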
Other methods • Hidden Markov Models • Logical reasoning
Yarowsky’s Principles • David Yarowsky • One sense per text! • One sense per collocation • I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near Han river. • 3 context words (money, international, capital) vote for ‘institution’, one (river) votes for ‘shore’ • Institution wins! So: bank1 is located near Han river.
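A minimal sketch of the "one sense per text" voting step: each occurrence of the ambiguous word first gets a local sense guess (e.g. from its collocations), and the majority sense is then imposed on all occurrences in the document. The function and data names are illustrative:

```python
from collections import Counter

def vote_senses(local_senses):
    """Yarowsky-style 'one sense per text': each occurrence votes
    with its locally guessed sense; the winning sense is then
    assigned to every occurrence in the document."""
    votes = Counter(s for s in local_senses if s is not None)
    winner, _ = votes.most_common(1)[0]   # assumes at least one vote
    return [winner] * len(local_senses)

# Local evidence from collocations: 3 votes for bank1, 1 for bank2
local = ["bank1", "bank1", "bank1", "bank2"]
print(vote_senses(local))   # ['bank1', 'bank1', 'bank1', 'bank1']
```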
Anaphora resolution • Mainly pronouns. • Also co-reference: when do two words refer to the same thing? • John took cake from the table and ate it. • John took cake from the table and washed it. • Translation into Spanish: it = la ‘she’ (the table) or lo ‘he’ (the cake) • Methods: • Dictionaries • Different sources of evidence • Logical reasoning
Applications • Translation • Information retrieval: • Can improve frequency counts (?) • Passage retrieval: can be very important
Mitkov’s knowledge-poor method • Ruslan Mitkov • Combined rule-based and statistics-based approach • Uses simple information on POS and general word classes • Combines different sources of evidence (see the sketch below)
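A toy sketch of knowledge-poor evidence combination in the spirit of Mitkov's approach: every candidate antecedent collects points from cheap indicators, and the best-scoring candidate wins. The specific indicators and weights below are simplified illustrations, not Mitkov's actual set:

```python
def score_candidates(pronoun_pos, candidates):
    """Score candidate antecedents with simple, knowledge-poor
    indicators; the highest-scoring candidate is chosen."""
    scored = []
    for c in candidates:
        score = 0
        score += 2 if c["is_subject"] else 0     # subjects preferred
        score += 1 if c["definite"] else 0       # definite NPs preferred
        score -= pronoun_pos - c["position"]     # recency: prefer nearby
        score += 1 if c["repeated"] else 0       # frequently mentioned NPs
        scored.append((score, c["text"]))
    return max(scored)

# "John took cake from the table and washed it."
candidates = [
    {"text": "John", "position": 0, "is_subject": True,
     "definite": True, "repeated": True},
    {"text": "the table", "position": 5, "is_subject": False,
     "definite": True, "repeated": False},
]
print(score_candidates(7, candidates))   # (-1, 'the table')
```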
Hidden Anaphora • John bought a house. The kitchen is big. • = that house’s kitchen • John was eating. The food was delicious. • = “that eating” ’s food • John was buried. The widow was mad with grief. • = “that burying” ’s death’s widow • Intersection of scenarios of the concepts (Gelbukh & Sidorov) • a house has a kitchen • burying results from death & a widow results from death
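A toy sketch of the scenario-intersection idea: a definite noun phrase with no explicit antecedent is linked to a context concept whose scenario contains it, directly or via one intermediate concept. The scenario dictionary is invented for illustration:

```python
# Toy scenario dictionary (in the spirit of Gelbukh & Sidorov);
# the entries below are invented for illustration only.
SCENARIOS = {
    "house": {"kitchen", "roof", "door"},
    "eat": {"food", "table"},
    "bury": {"death", "grave"},
    "death": {"widow", "grief"},
}

def link_definite_np(noun, context_concepts):
    """Link a definite NP to a context concept whose scenario
    contains it, following scenario links one step deep."""
    for concept in context_concepts:
        scenario = SCENARIOS.get(concept, set())
        if noun in scenario:
            return concept                      # direct: house -> kitchen
        for inner in scenario:                  # indirect: bury -> death -> widow
            if noun in SCENARIOS.get(inner, set()):
                return f"{concept} -> {inner}"
    return None

print(link_definite_np("kitchen", ["house"]))   # house
print(link_definite_np("widow", ["bury"]))      # bury -> death
```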
Evaluation • Senseval and TREC international competitions • Korean track available • Human-annotated corpus • Very expensive • Inter-annotator agreement is often low! • A program cannot do what humans cannot do • Apply the program and compare with the corpus • Accuracy • Sometimes the program cannot tag a word • Precision, recall
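A small sketch of how precision and recall are computed when the program is allowed to leave some words untagged (here None marks a refusal); plain accuracy would simply be correct/total over all tokens:

```python
def evaluate(predicted, gold):
    """Compare a tagger's output against a human-annotated corpus.
    `predicted` may contain None where the program refused to tag."""
    attempted = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    correct = sum(p == g for p, g in attempted)
    precision = correct / len(attempted)   # correct among attempted
    recall = correct / len(gold)           # correct among all tokens
    return precision, recall

pred = ["Ns", "Va", None, "Vinf", "Det"]
gold = ["Ns", "Va", "Vpa", "Vinf", "Ns"]
print(evaluate(pred, gold))   # (0.75, 0.6)
```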
Research topics • Too many to list • New methods • Lexical resources (dictionaries) • = Computational linguistics
Conclusions • Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning • Useful in translation, information retrieval, and text understanding • Dictionary-based methods: good but expensive • Statistical methods: cheap but sometimes imperfect... though not always (if very large corpora are available)
Thank you! Till May 31? June 1? 6 pm