Special Topics in Computer Science. Advanced Topics in Information Retrieval. Lecture 9: Natural Language Processing and IR. Tagging, WSD, and Anaphora Resolution. Alexander Gelbukh, www.Gelbukh.com
Previous Chapter: Conclusions • Reducing synonyms can help IR • Better matching • Ontologies are used. WordNet • Morphology is a variant of synonymy • widely used in IR systems • Precise analysis: dictionary-based analyzers • Quick-and-dirty analysis: stemmers • Rule-based stemmers. Porter stemmer • Statistical stemmers
Previous Chapter: Research topics • Construction and application of ontologies • Building morphological dictionaries • Treatment of unknown words with morphological analyzers • Development of better stemmers • Statistical stemmers?
Contents • Tagging: for each word, determine its POS (Part of Speech: noun, ...) and grammatical characteristics • WSD (Word Sense Disambiguation): for each word, determine which homonym is used • Anaphora resolution: for a pronoun (it, ...), determine what it refers to
Tagging: The problem • Ambiguity of parts of speech • rice flies like sand • = insects living in rice consider sand good? • = rice can fly similarly to sand? • ... insect of a container with rice...? • We can fly like sand ... We think fly like sand... • Ambiguity of grammatical characteristics • He has read the book • He will read the book... He read the book • Very frequent phenomenon, at nearly every word!
Tagger... • A program that looks at the context and decides what the part of speech (and other characteristics) is • Input: He will read the book • Morphological analysis: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...> • Tagger's task: pick one tag for each word • Tags: Ns = noun singular, Va = verb auxiliary, Vpa = verb past, Vpp = verb past participle, Vinf = verb infinitive, ...
...Tagger • Input of the tagger: He<...> will<Ns | Va> read<Vpa | Vpp | Vinf> the<...> • Task: choose one! • Output: He<...> will<Va> read<Vinf> the<...> • How do we do it? • He will<Ns> is not possible, so will<Va> • will<Va> read implies read<Vinf> • This is simple, but imagine He is ambiguous too... combinatorial explosion
Applications • Used for word sense disambiguation: • Oil well in Mexico is used. • Oil is used well in Mexico. • For stemming and lemmatization • Important for matching in information retrieval • Greatly speeds up syntactic analysis • Tagging is local • No need to process the whole sentence to find that a certain tag is incorrect
How: Parsing? • We can find all the syntactic structures • Only the correct variants will enter a syntactic structure • will + Vinf forms a syntactic unit • will + Vpa does not • Problems • Computationally expensive • What to do with ambiguities? • rice flies like sand • Depends on what you need
Statistical tagger • Example: TnT tagger • Based on a Hidden Markov Model (HMM) • Idea: • Some words are more probable after some other words • Find these probabilities • Guess a word's tag if you know the nearby ones • Problem: • Letter strings denote meanings • "x is more probable after y" holds for meanings (tags), not for strings • so we must guess what we cannot see: the meanings
Hidden Markov Model: Idea • A system changes its state • What a person thinks • Random... but not completely (how?) • In each state, it emits an output • What he says when he thinks something • Random... but somehow (?) depends on what he thinks • We know the sequence of produced outputs • Text: we can see it! • Guess what the underlying states were • Hidden: we cannot see them
Hidden Markov Model: Hypotheses • A finite set of states: q1 ... qN (invisible) • POS and grammatical characteristics (language) • A finite set of observations: v1 ... vM • Strings we see in the corpus (language) • A random sequence of states xi • POS tags in the text • Probabilities of state transitions P(xi+1 | xi) • Language rules and use • Probabilities of observations P(vk | xi) • words expressing the meanings: Vinf: ask, V3: asks
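As a concrete illustration of where these two probability tables come from, here is a minimal sketch (in Python, with an invented toy corpus) of estimating them by simple counting over a POS-tagged training set; real taggers such as TnT add smoothing for unseen words and tag pairs:

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Count transitions P(x_i+1 | x_i) and emissions P(v_k | x_i)
    over a POS-tagged corpus, then normalize counts to probabilities."""
    trans = defaultdict(Counter)    # tag -> counts of following tags
    emit = defaultdict(Counter)     # tag -> counts of emitted words
    for sent in tagged_sentences:   # sent = [(word, tag), ...]
        for (_, tag), (_, next_tag) in zip(sent, sent[1:]):
            trans[tag][next_tag] += 1
        for word, tag in sent:
            emit[tag][word.lower()] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()}

    return ({t: normalize(c) for t, c in trans.items()},
            {t: normalize(c) for t, c in emit.items()})

# Toy training corpus, invented for illustration only
# (tags follow the slides' notation, extended with Pron and Det)
corpus = [
    [("He", "Pron"), ("will", "Va"), ("read", "Vinf"),
     ("the", "Det"), ("book", "Ns")],
    [("He", "Pron"), ("read", "Vpa"), ("the", "Det"), ("book", "Ns")],
]
P_trans, P_emit = estimate_hmm(corpus)
print(P_trans["Va"])    # {'Vinf': 1.0}: after an auxiliary, infinitive
print(P_emit["Vinf"])   # {'read': 1.0}
```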
Hidden Markov Model: Problem • The same observation corresponds to different meanings • Vinf: read, Vpp: read • Looking at what we can see, guess what we cannot • This is why it is called hidden • Given a sequence of observations oi • The text: a sequence of letter strings. Training set • Guess the sequence of states xi • The POS of each word • Our hypotheses on the xi depend on each other • Highly combinatorial task
Hidden Markov Model: Solutions • Need to find the parameters of the model: • P(xi+1 | xi) • P(vk | xi) • Optimally: so as to maximize the probability of generating this specific output • Optimization methods from Operations Research are used • More details? Not so simple...
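One standard answer to the decoding problem ("guess the hidden states") is the Viterbi algorithm, which finds the most probable tag sequence by dynamic programming instead of enumerating every combination of tags. A minimal sketch, assuming the probability tables produced by the counting step above plus a P_start table of initial-tag probabilities; a real implementation would work in log-space to avoid underflow:

```python
def viterbi(words, tags, P_start, P_trans, P_emit):
    """Most probable tag sequence under an HMM, found by dynamic
    programming: no combinatorial explosion."""
    # best[i][t] = (probability of the best path that ends in tag t
    #               at position i, backpointer to the previous tag)
    best = [{t: (P_start.get(t, 0.0)
                 * P_emit.get(t, {}).get(words[0], 0.0), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # pick the best previous tag for each current tag
            p, prev = max((best[i - 1][s][0]
                           * P_trans.get(s, {}).get(t, 0.0)
                           * P_emit.get(t, {}).get(words[i], 0.0), s)
                          for s in tags)
            column[t] = (p, prev)
        best.append(column)
    # Trace the backpointers from the most probable final tag
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]
```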
Brill Tagger (rule-based) • Eric Brill • Makes an initial assumption about POS tags in the text • Uses context-dependent rewriting rules to correct some tags • Applies them iteratively • Learns the rules from a training corpus • The rules are in human-understandable form • You can correct them manually to improve the tagger • Unlike HMMs, whose parameters are not human-understandable
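A toy sketch of the transformation-based idea: start from a naive most-frequent-tag assignment, then let context-dependent rewriting rules patch the mistakes. The two rules and the lexicon below are invented for illustration, not Brill's actual learned rules:

```python
def initial_tags(words, most_frequent_tag):
    # Step 1: naive assumption, tag each word with its most frequent tag
    return [most_frequent_tag.get(w, "Ns") for w in words]

RULES = [
    # (from_tag, to_tag, condition on the previous tag)
    ("Ns", "Va", lambda prev: prev == "Pron"),   # "he will": auxiliary
    ("Vpa", "Vinf", lambda prev: prev == "Va"),  # "will read": infinitive
]

def brill_tag(words, most_frequent_tag):
    tags = initial_tags(words, most_frequent_tag)
    # Step 2: rewrite tags wherever a rule's context matches
    for frm, to, cond in RULES:
        for i in range(1, len(tags)):
            if tags[i] == frm and cond(tags[i - 1]):
                tags[i] = to
    return list(zip(words, tags))

lexicon = {"he": "Pron", "will": "Ns", "read": "Vpa",
           "the": "Det", "book": "Ns"}
print(brill_tag(["he", "will", "read", "the", "book"], lexicon))
# [('he', 'Pron'), ('will', 'Va'), ('read', 'Vinf'),
#  ('the', 'Det'), ('book', 'Ns')]
```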
Word Sense Disambiguation • Query: international bank in Seoul • Bank: • financial institution • river shore • place to store something • ... • Korean 한 원: • Korean $ • superior official • 한상용 ... • ... • Hotel located at the beautiful bank of Han river. • Relevant for the query? • The POS is the same, so a tagger will not distinguish them
Applications • Translation • 대원군 = Great Governor of the Court • 만원 = 10 thousand won • international bank = banco internacional • river bank = orilla del río • Information retrieval • Document retrieval: is it really useful? Same info • Passage retrieval: can prove very useful! • Semantic analysis
Representation of word senses • Explanations. Semantic dictionaries • Bank1 is an institution to keep money • Bank2 is a sloping edge of a river • Synsets and ontology: WordNet (HowNet: Chinese) • Synonyms: {bank, shore} • WordNet terminology: synset #12345 • Corresponds to all ways to call a concept • Relationships: #12345 IS_PART_OF #67890 {river, stream} • #987 IS_A #654 {institution, organization} • WordNet also has glosses
Task • Given a text (probably POS-tagged) • Tag each word with its synset number #123 or dictionary sense number bank1 • Input: • Mary keeps the money in a bank. • Han river’s bank is beautiful. • Output: • Mary keeps<1> the money<1> in a bank<1> • Han river’s bank<2> is beautiful.
Lesk algorithm • Michael Lesk • Explanatory dictionary • Bank1 is an institution to keep money • Bank2 is a sloping edge of a river • Mary keeps her money (savings) in a bank. • Choose the sense whose definition has more words in common with the immediate context (see the sketch below) • Improvements (Pedersen, Gelbukh & Sidorov) • Use synonyms when there are no direct matches • Use synonyms of synonyms, ...
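A minimal sketch of the basic Lesk overlap count, using the two toy definitions of bank from this slide; the stopword list and the scoring are simplified, and Lesk's original algorithm also compares glosses of the other context words:

```python
def lesk(word, context, definitions):
    """Simplified Lesk: pick the sense whose dictionary gloss
    shares the most words with the immediate context."""
    stop = {"a", "an", "the", "of", "to", "is", "in", "her", "it"}
    ctx = {w.lower() for w in context} - stop
    best_sense, best_overlap = None, -1
    for sense, gloss in definitions[word].items():
        overlap = len(ctx & ({g.lower() for g in gloss.split()} - stop))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

defs = {"bank": {
    "bank1": "an institution to keep money",
    "bank2": "a sloping edge of a river",
}}
print(lesk("bank", "Mary keeps her money in a bank".split(), defs))
# bank1 (the gloss of bank1 shares 'money' with the context)
```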
Other word relatedness measures • Lexical chains in WordNet • The length of the path in the graph of relationships • Mutual information: frequent co-occurrences • Collocations (Bolshakov & Gelbukh) • Keep in bank1 • Bank2 of river • Very large dictionary of such combinations • Number of words in common between explanations • Recursive: common words or related words (Gelbukh & Sidorov)
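For the WordNet path-length measure, NLTK exposes a ready-made path_similarity score. A small illustration, assuming the NLTK WordNet data is installed; note that the sense numbering (bank.n.01 vs. bank.n.02) can differ between WordNet versions:

```python
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

river_bank = wn.synset("bank.n.01")   # sloping land beside a body of water
money_bank = wn.synset("bank.n.02")   # financial institution
river = wn.synset("river.n.01")

# Path similarity: inversely related to the length of the shortest
# path between the two synsets in the relationship graph
print(river.path_similarity(river_bank))   # closer in the hierarchy
print(river.path_similarity(money_bank))   # farther away
```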
Other methods • Hidden Markov Models • Logical reasoning
Yarowsky’s Principles • David Yarowsky • One sense per text! • One sense per collocation • I keep my money in the bank1. This is an international bank1 with a great capital. The bank2 is located near Han river. • 3 context words (money, international, capital) vote for ‘institution’, one (river) votes for ‘shore’ • Institution wins! So: bank1 is located near Han river.
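A minimal sketch of the "one sense per text" voting step: each occurrence of the ambiguous word first gets a local sense guess (e.g. from its collocations), and the majority sense is then imposed on all occurrences in the document. The function and data names are illustrative:

```python
from collections import Counter

def vote_senses(local_senses):
    """Yarowsky-style 'one sense per text': each occurrence votes
    with its locally guessed sense; the winning sense is then
    assigned to every occurrence in the document."""
    votes = Counter(s for s in local_senses if s is not None)
    winner, _ = votes.most_common(1)[0]   # assumes at least one vote
    return [winner] * len(local_senses)

# Local evidence from collocations: 3 votes for bank1, 1 for bank2
local = ["bank1", "bank1", "bank1", "bank2"]
print(vote_senses(local))   # ['bank1', 'bank1', 'bank1', 'bank1']
```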
Anaphora resolution • Mainly pronouns. • Also co-reference: when do two words refer to the same thing? • John took cake from the table and ate it. • John took cake from the table and washed it. • Translation into Spanish: it = la ‘she’ (the table) or lo ‘he’ (the cake) • Methods: • Dictionaries • Different sources of evidence • Logical reasoning
Applications • Translation • Information retrieval: • Can improve frequency counts (?) • Passage retrieval: can be very important
Mitkov’s knowledge-poor method • Ruslan Mitkov • Combined rule-based and statistics-based approach • Uses simple information on POS and general word classes • Combines different sources of evidence (see the sketch below)
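A toy sketch of knowledge-poor evidence combination in the spirit of Mitkov's approach: every candidate antecedent collects points from cheap indicators, and the best-scoring candidate wins. The specific indicators and weights below are simplified illustrations, not Mitkov's actual set:

```python
def score_candidates(pronoun_pos, candidates):
    """Score candidate antecedents with simple, knowledge-poor
    indicators; the highest-scoring candidate is chosen."""
    scored = []
    for c in candidates:
        score = 0
        score += 2 if c["is_subject"] else 0     # subjects preferred
        score += 1 if c["definite"] else 0       # definite NPs preferred
        score -= pronoun_pos - c["position"]     # recency: prefer nearby
        score += 1 if c["repeated"] else 0       # frequently mentioned NPs
        scored.append((score, c["text"]))
    return max(scored)

# "John took cake from the table and washed it."
candidates = [
    {"text": "John", "position": 0, "is_subject": True,
     "definite": True, "repeated": True},
    {"text": "the table", "position": 5, "is_subject": False,
     "definite": True, "repeated": False},
]
print(score_candidates(7, candidates))   # (-1, 'the table')
```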
Hidden Anaphora • John bought a house. The kitchen is big. • = that house’s kitchen • John was eating. The food was delicious. • = “that eating” ’s food • John was buried. The widow was mad with grief. • = “that burying” ’s death’s widow • Intersection of scenarios of the concepts (Gelbukh & Sidorov) • a house has a kitchen • burying results from death & a widow results from death
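A toy sketch of the scenario-intersection idea: a definite noun phrase with no explicit antecedent is linked to a context concept whose scenario contains it, directly or via one intermediate concept. The scenario dictionary is invented for illustration:

```python
# Toy scenario dictionary (in the spirit of Gelbukh & Sidorov);
# the entries below are invented for illustration only.
SCENARIOS = {
    "house": {"kitchen", "roof", "door"},
    "eat": {"food", "table"},
    "bury": {"death", "grave"},
    "death": {"widow", "grief"},
}

def link_definite_np(noun, context_concepts):
    """Link a definite NP to a context concept whose scenario
    contains it, following scenario links one step deep."""
    for concept in context_concepts:
        scenario = SCENARIOS.get(concept, set())
        if noun in scenario:
            return concept                      # direct: house -> kitchen
        for inner in scenario:                  # indirect: bury -> death -> widow
            if noun in SCENARIOS.get(inner, set()):
                return f"{concept} -> {inner}"
    return None

print(link_definite_np("kitchen", ["house"]))   # house
print(link_definite_np("widow", ["bury"]))      # bury -> death
```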
Evaluation • Senseval and TREC international competitions • Korean track available • Human-annotated corpus • Very expensive • Inter-annotator agreement is often low! • A program cannot do what humans cannot do • Apply the program and compare with the corpus • Accuracy • Sometimes the program cannot tag a word • Precision, recall
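A small sketch of how precision and recall are computed when the program is allowed to leave some words untagged (here None marks a refusal); plain accuracy would simply be correct/total over all tokens:

```python
def evaluate(predicted, gold):
    """Compare a tagger's output against a human-annotated corpus.
    `predicted` may contain None where the program refused to tag."""
    attempted = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    correct = sum(p == g for p, g in attempted)
    precision = correct / len(attempted)   # correct among attempted
    recall = correct / len(gold)           # correct among all tokens
    return precision, recall

pred = ["Ns", "Va", None, "Vinf", "Det"]
gold = ["Ns", "Va", "Vpa", "Vinf", "Ns"]
print(evaluate(pred, gold))   # (0.75, 0.6)
```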
Research topics • Too many to list • New methods • Lexical resources (dictionaries) • = Computational linguistics
Conclusions • Tagging, word sense disambiguation, and anaphora resolution are cases of disambiguation of meaning • Useful in translation, information retrieval, and text understanding • Dictionary-based methods: good but expensive • Statistical methods: cheap but sometimes imperfect... though not always (if very large corpora are available)
Thank you! Till May 31? June 1? 6 pm