Alexander Gelbukh Gelbukh

Special Topics in Computer Science • Advanced Topics in Information Retrieval • Lecture 8: Natural Language Processing and IR. Synonymy, Morphology, and Stemming • Alexander Gelbukh • www.Gelbukh.com

Previous Chapter: Conclusions • Parallel computing can improve • response time for each query and/or • throughput: number of queries processed with same speed • Document partitioning is simple • good for distributed computing • Term partitioning is good for some data structures • Distributed computing is MIMD computing with slow communication • SIMD machines are good for Signature files • Both are out of favor now

Previous Chapter: Research topics • How to evaluate the speedup • New algorithms • Adaptation of existing algorithms • Merging the results is a bottleneck • Meta search engines • Creating large collections with judgements • Is recall important?

Problem • Recall image retrieval: • Find images similar in color, size, ... • Find photos of Korean President ? • Find nice girls ? (Don’t show ugly ones!) • Looks very stupid • Lacks understanding • Too difficult • Text retrieval is no exception • Find stories with sad beginning and happy end ? • Lacks understanding • Difficult but possible

Possible? • Text is intended to facilitate understanding • Supposedly, even partial understanding should help • Degrees of understanding: • Character strings (what is used now): well, geese, him • Words (often used now): goose, he • Concepts: hole in the ground (well), Roh Moo-Hyun • Complex concepts: oil well, hot dog • Situations (sentences, paragraphs) • The story (direct meaning) • The message (pragmatics, intended impact)

Easy? • Main problems: • Multiple ways to say the same • Query does not match the doc • Difficult to specify all variants • Ambiguity of the text • False alarms in matching • Lack of implicit knowledge of the computer • The computer “does not understand” the message • Difficult to make inferences • Natural Language Processing tries to solve them

Solutions • Multiple ways to say the same? • Normalizing: transforming to a “standard” variant • Ambiguity of the text? • Ambiguity resolution • Normalizing to one of the variants • Perhaps the main problem in natural language processing • Lack of implicit knowledge of the computer? • Dictionaries, grammars • Knowledge of language structure is needed in all tasks • Knowledge of the world is useful for advanced tasks • Knowledge of language use is a substitute

Synonymy • Multiple ways to say the same • Or at least when the difference does not matter • Can be substituted in any (many?) context • Lexical synonymy • Woman / female, professor / teacher • Dictionaries • Phrase-level or sentence-level synonymy • They gave me a book / I was given a book by them • Syntactic analyzers • Semantic-level synonymy • Reasoning

Not only synonymy • Multiple ways to say • the same (synonymy) • less: more general (hypernymy) • more: more specific (hyponymy) • Complete synonyms are rare • professor / teacher • Abbreviations are usually (almost) complete synonyms • When the differences do not matter, can be treated as synonymy • But: different data structures and methods

Lexical-level synonymy • Lexical synonymy • Woman / female • Mixed-type synonymy: USA / United States • Morphology is a kind of synonymy (actually hyponymy) • ‘geese’ = ‘goose’ + ‘many’ • Russian ‘knigu’ = ‘kniga’ + ‘accusative role’ • the “second” part of the meaning is either not important or is another term • Morphology is a very common problem in IR

Lexical synonymy • Woman / female • Dictionaries • Synonym dictionaries • WordNet • Automatic learning of synonymy • Clustering of contexts • If the contexts are very similar, then possible synonyms • Problem: preserves meaning? Monday / Tuesday • An interesting solution: compare dictionary definitions

Uses in IR • Query expansion • Add synonyms of the word to the query and process normally • Flexible, slow • Best for lexical synonymy: few synonyms, doubtful • Reducing at index time • When reading the documents, reduce each word to a “standard” synonym • Fast, rigid • Best for morphology: many synonyms, less doubtful • Hierarchical indexing
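The two strategies on this slide can be contrasted in a small sketch. The synonym table, the table of "standard" variants, and the three documents are hypothetical toy data, not part of the lecture; a real system would use a synonym dictionary and a morphological analyzer or stemmer instead.

```python
from collections import defaultdict

# Hypothetical toy resources (assumptions for illustration only).
SYNONYMS = {"woman": {"female"}, "female": {"woman"}}
STANDARD = {"geese": "goose", "cats": "cat", "dogs": "dog"}

def build_index(docs, normalize=lambda w: w):
    """Inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[normalize(word)].add(doc_id)
    return index

def search_expanded(index, word):
    """Query expansion: look up the word and each of its synonyms."""
    hits = set()
    for term in {word} | SYNONYMS.get(word, set()):
        hits |= index.get(term, set())
    return hits

def reduce_word(word):
    """Index-time reduction: map a word to its 'standard' variant."""
    return STANDARD.get(word, word)

docs = {1: "a woman feeds geese", 2: "female goose", 3: "dogs bark"}
plain = build_index(docs)                           # index left untouched
reduced = build_index(docs, normalize=reduce_word)  # words reduced while reading

print(search_expanded(plain, "woman"))   # documents 1 and 2
print(reduced[reduce_word("geese")])     # documents 1 and 2
```

Expansion leaves the index alone and pays at query time, which is why it is flexible but slow; reduction fixes the choice of variants when the index is built, which is why it is fast but rigid.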

Hierarchical indexing (Gelbukh, Sidorov, Guzman-Arenas 2002) • Tree of concepts • Living things • Animals • a. Cat, b. cats • a. Dog, b. dogs • Persons • a. Professor, b. professors • a. Student, b. students • Order vocabulary by the order of the leaves of tree • Query expansion is done by ranges: • cat: 1, living things: 1-4
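A minimal sketch of the range idea, assuming each leaf concept (with all its word forms) gets one number in left-to-right order; it is an illustration of the slide, not the implementation from the cited paper. The postings table and document names are hypothetical.

```python
# Concept tree from the slide: (name, children).
TREE = ("living things", [
    ("animals", [("cat", []), ("dog", [])]),
    ("persons", [("professor", []), ("student", [])]),
])

def assign_ranges(node, ranges=None, counter=None):
    """Number the leaves left to right; every node gets (first_leaf, last_leaf)."""
    if ranges is None:
        ranges, counter = {}, [0]
    name, children = node
    if not children:
        counter[0] += 1
        ranges[name] = (counter[0], counter[0])
    else:
        spans = [assign_ranges(child, ranges, counter)[child[0]] for child in children]
        ranges[name] = (spans[0][0], spans[-1][1])
    return ranges

RANGES = assign_ranges(TREE)

# Hypothetical postings: leaf number -> documents mentioning that concept.
POSTINGS = {1: {"d1"}, 2: {"d2"}, 3: {"d3"}, 4: {"d1", "d4"}}

def search(concept):
    """Expand a concept to its leaf range and merge the postings."""
    first, last = RANGES[concept]
    hits = set()
    for leaf in range(first, last + 1):
        hits |= POSTINGS.get(leaf, set())
    return hits

print(RANGES["cat"], RANGES["living things"])  # (1, 1) (1, 4)
print(search("animals"))                       # documents d1 and d2
```

Because the vocabulary is ordered like the leaves, expanding a general concept never needs a list of synonyms: it is a single contiguous range.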

Morphology • One of the large concerns in IR • Can be done • precisely • approximately (quick-and-dirty) • Level of generalization • inflection: student – students • derivation: study – student • Ambiguity • all variants • one variant

... morphology • Result is • The unique ID • The dictionary form • A “stem”: part of the same string

Morphological analyzers • Precise analysis • Ambiguous • Give all variants • Tables: to table or the table? • Spanish charlas: charla ‘talk’ or charlar ‘to talk’ • Russian dush: dush ‘shower’ or dusha ‘soul’ • Common in languages with developed morphology • For short words, some 3 – 5 – 10 variants • Dictionaries are used

Morphological system • Dictionary specifies: • Stem: bak-, ask- • POS (part of speech): verb • Inflection class (what endings it accepts): 1, 2 • Tables of endings specify • Paradigms: • -e -es -ed -ed -ing • -, -s -ed -ed -ing • Meanings: participle, ...

... morphological system • Algorithm • Decompose the word into an existing stem and ending • Check compatibility of stem and ending • Give the stem ID and ending meaning • Ambiguous • Many variants of decompositions • Many stems with different IDs • Many endings with different meaning • -ed: past or participle • Problem: words absent in dictionary
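The algorithm can be sketched with the two stems and two paradigms from the previous slide. The stem IDs (BAKE, ASK) and the names of the ending meanings other than "participle" are assumptions made for the example.

```python
# Dictionary: stem -> list of (stem ID, part of speech, inflection class).
STEMS = {
    "bak": [("BAKE", "verb", 1)],
    "ask": [("ASK", "verb", 2)],
}

# Tables of endings per inflection class: (ending, meaning).
# Meaning labels are hypothetical except "participle".
ENDINGS = {
    1: [("e", "base form"), ("es", "3rd person singular"),
        ("ed", "past"), ("ed", "participle"), ("ing", "-ing form")],
    2: [("", "base form"), ("s", "3rd person singular"),
        ("ed", "past"), ("ed", "participle"), ("ing", "-ing form")],
}

def analyze(word):
    """Return all (stem ID, POS, ending meaning) variants for the word."""
    variants = []
    for cut in range(1, len(word) + 1):          # try every decomposition
        stem, ending = word[:cut], word[cut:]
        for stem_id, pos, infl_class in STEMS.get(stem, []):
            for candidate, meaning in ENDINGS[infl_class]:
                if candidate == ending:          # stem and ending are compatible
                    variants.append((stem_id, pos, meaning))
    return variants

print(analyze("baked"))  # two variants: past and participle
print(analyze("asks"))   # one variant
print(analyze("xyz"))    # [] : word absent from the dictionary
```

The empty result for an unknown word is exactly the problem named on the slide; the two results for "baked" show the -ed ambiguity.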

Stemming • Substitute for real analysis • Both inflection and derivation • Quick-and-dirty • Only one variant • Result: a part of the string • gene, genial -> gen- • Cheap development • bad results • simple description. Standard • Often used in academic research • Used to be used in real systems, but now less

Porter stemmer • Martin Porter, 1980 • Standard stemmer • Provides equal basis for evaluation of different IR programs • Uses “measure” m: • [C](VC){m}[V], where C is a run of consonants and V is a run of vowels • m=0 TR, EE, TREE, Y, BY. • m=1 TROUBLE, OATS, TREES, IVY. • m=2 TROUBLES, PRIVATE, OATEN, ORRERY.
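The measure can be computed by counting vowel-to-consonant transitions. The sketch below follows Porter's convention that Y counts as a vowel when it follows a consonant; the function names are chosen for this example, and words are given in lower case.

```python
def is_consonant(word, i):
    """Porter's definition: not a, e, i, o, u, and not y preceded by a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    """Number m of VC sequences when the word is written as [C](VC){m}[V]."""
    m = 0
    prev_vowel = False
    for i in range(len(word)):
        consonant = is_consonant(word, i)
        if consonant and prev_vowel:   # a vowel run has just been closed by a consonant
            m += 1
        prev_vowel = not consonant
    return m

for w in ("tree", "by", "trouble", "ivy", "troubles", "orrery"):
    print(w, measure(w))   # 0, 0, 1, 1, 2, 2
```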

... Porter stemmer • Step 1a • SSES -> SS caresses -> caress • IES -> I ponies -> poni ties -> ti • SS -> SS caress -> caress • S -> cats -> cat
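Step 1a needs no conditions, so it reduces to trying the suffixes from longest to shortest. A minimal sketch, with a function name chosen for this example:

```python
def step1a(word):
    """Porter step 1a: plural endings, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]      # SSES -> SS
    if word.endswith("ies"):
        return word[:-2]      # IES -> I
    if word.endswith("ss"):
        return word           # SS -> SS (protects 'caress' from the next rule)
    if word.endswith("s"):
        return word[:-1]      # S -> (nothing)
    return word

for w in ("caresses", "ponies", "ties", "caress", "cats"):
    print(w, "->", step1a(w))
```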

... Porter stemmer • Step 1b • (m>0) EED -> EE feed -> feed agreed -> agree • (*v*) ED -> plastered -> plaster bled -> bled • (*v*) ING -> motoring -> motor sing -> sing

... Porter stemmer • Step 1b, continued: if the 2nd or 3rd rule was successful • AT -> ATE conflat(ed) -> conflate • BL -> BLE troubl(ed) -> trouble • IZ -> IZE siz(ed) -> size • (*d and not (*L or *S or *Z)) -> single letter • hopp(ing) -> hop • tann(ed) -> tan • fall(ing) -> fall • hiss(ing) -> hiss • fizz(ed) -> fizz • (m=1 and *o) -> E • fail(ing) -> fail • fil(ing) -> file
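Step 1b and its follow-up rules can be sketched together. The conditions are read as in Porter's paper: *v* means the stem contains a vowel, *d means it ends in a double consonant, and *o means it ends consonant-vowel-consonant where the last consonant is not W, X or Y. Function names are chosen for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    m, prev_vowel = 0, False
    for i in range(len(word)):
        consonant = is_consonant(word, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def has_vowel(stem):                       # condition *v*
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def ends_double_consonant(stem):           # condition *d
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)

def ends_cvc(stem):                        # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def tidy(stem):
    """Follow-up rules, applied only after ED or ING was removed."""
    if stem.endswith(("at", "bl", "iz")):
        return stem + "e"
    if ends_double_consonant(stem) and stem[-1] not in "lsz":
        return stem[:-1]
    if measure(stem) == 1 and ends_cvc(stem):
        return stem + "e"
    return stem

def step1b(word):
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            return tidy(stem) if has_vowel(stem) else word
    return word

for w in ("agreed", "bled", "hopping", "falling", "filing", "failing"):
    print(w, "->", step1b(w))
```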

... Porter stemmer • Step 1c • (*v*) Y -> I • happy -> happi • sky -> sky
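Step 1c is a single conditional rule. A short sketch, with the vowel test written as in Porter's definition and names chosen for this example:

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def has_vowel(stem):
    """Condition *v*: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))

def step1c(word):
    """(*v*) Y -> I: replace final y only if the rest of the word has a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word

print(step1c("happy"))  # happi
print(step1c("sky"))    # sky
```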

... Porter stemmer • Step 2 • (m>0) ATIONAL -> ATE relational -> relate • (m>0) TIONAL -> TION conditional -> condition rational -> rational • (m>0) ENCI -> ENCE valenci -> valence • (m>0) ANCI -> ANCE hesitanci -> hesitance • (m>0) IZER -> IZE digitizer -> digitize • (m>0) ABLI -> ABLE conformabli -> conformable • (m>0) ALLI -> AL radicalli -> radical • (m>0) ENTLI -> ENT differentli -> different • (m>0) ELI -> E vileli -> vile • (m>0) OUSLI -> OUS analogousli -> analogous • (m>0) IZATION -> IZE vietnamization -> vietnamize • (m>0) ATION -> ATE predication -> predicate • (m>0) ATOR -> ATE operator -> operate • (m>0) ALISM -> AL feudalism -> feudal • (m>0) IVENESS -> IVE decisiveness -> decisive • (m>0) FULNESS -> FUL hopefulness -> hopeful • (m>0) OUSNESS -> OUS callousness -> callous • (m>0) ALITI -> AL formaliti -> formal • (m>0) IVITI -> IVE sensitiviti -> sensitive • (m>0) BILITI -> BLE sensibiliti -> sensible
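A rule list like this one is naturally table-driven. The sketch below assumes the usual reading of the algorithm: only the longest matching suffix is considered, and if its condition fails the word is left unchanged. The helper name apply_rules and its min_measure parameter are introduced for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    m, prev_vowel = 0, False
    for i in range(len(word)):
        consonant = is_consonant(word, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

STEP2_RULES = [
    ("ational", "ate"), ("tional", "tion"), ("enci", "ence"), ("anci", "ance"),
    ("izer", "ize"), ("abli", "able"), ("alli", "al"), ("entli", "ent"),
    ("eli", "e"), ("ousli", "ous"), ("ization", "ize"), ("ation", "ate"),
    ("ator", "ate"), ("alism", "al"), ("iveness", "ive"), ("fulness", "ful"),
    ("ousness", "ous"), ("aliti", "al"), ("iviti", "ive"), ("biliti", "ble"),
]

def apply_rules(word, rules, min_measure):
    """Apply the rule with the longest matching suffix if measure(stem) > min_measure."""
    for suffix, replacement in sorted(rules, key=lambda rule: -len(rule[0])):
        if word.endswith(suffix):
            stem = word[:-len(suffix)]
            if measure(stem) > min_measure:
                return stem + replacement
            return word          # longest suffix matched but its condition failed
    return word

def step2(word):
    return apply_rules(word, STEP2_RULES, min_measure=0)

for w in ("relational", "conditional", "rational", "vietnamization", "sensibiliti"):
    print(w, "->", step2(w))
```

The same helper would serve the unconditional-replacement rules of Step 3 with their own table, and most of Step 4 with min_measure=1; the ION rule of Step 4 needs its extra (*S or *T) check and is not covered here.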

... Porter stemmer • Step 3 • (m>0) ICATE -> IC triplicate -> triplic • (m>0) ATIVE -> formative -> form • (m>0) ALIZE -> AL formalize -> formal • (m>0) ICITI -> IC electriciti -> electric • (m>0) ICAL -> IC electrical -> electric • (m>0) FUL -> hopeful -> hope • (m>0) NESS -> goodness -> good

... Porter stemmer • Step 4 • (m>1) AL -> revival -> reviv • (m>1) ANCE -> allowance -> allow • (m>1) ENCE -> inference -> infer • (m>1) ER -> airliner -> airlin • (m>1) IC -> gyroscopic -> gyroscop • (m>1) ABLE -> adjustable -> adjust • (m>1) IBLE -> defensible -> defens • (m>1) ANT -> irritant -> irrit • (m>1) EMENT -> replacement -> replac • (m>1) MENT -> adjustment -> adjust • (m>1) ENT -> dependent -> depend • (m>1 and (*S or *T)) ION -> adoption -> adopt • (m>1) OU -> homologou -> homolog • (m>1) ISM -> communism -> commun • (m>1) ATE -> activate -> activ • (m>1) ITI -> angulariti -> angular • (m>1) OUS -> homologous -> homolog • (m>1) IVE -> effective -> effect • (m>1) IZE -> bowdlerize -> bowdler

... Porter stemmer • Step 5a • (m>1) E -> probate -> probat rate -> rate • (m=1 and not *o) E -> cease -> ceas • Step 5b • (m > 1 and *d and *L) -> single letter • controll -> control • roll -> roll
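Steps 5a and 5b can be sketched as one function that first tries to drop a final E and then reduces a final double L. The condition *o is read as in Porter's paper (consonant-vowel-consonant, last consonant not W, X or Y); names are chosen for this example.

```python
def is_consonant(word, i):
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(word):
    m, prev_vowel = 0, False
    for i in range(len(word)):
        consonant = is_consonant(word, i)
        if consonant and prev_vowel:
            m += 1
        prev_vowel = not consonant
    return m

def ends_cvc(stem):                        # condition *o
    n = len(stem)
    return (n >= 3 and is_consonant(stem, n - 3) and not is_consonant(stem, n - 2)
            and is_consonant(stem, n - 1) and stem[-1] not in "wxy")

def step5(word):
    # Step 5a: drop final e if m > 1, or if m = 1 and the stem does not end cvc.
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            word = stem
    # Step 5b: (m > 1 and *d and *L) -> single letter.
    if measure(word) > 1 and word.endswith("ll"):
        word = word[:-1]
    return word

for w in ("probate", "rate", "cease", "controll", "roll"):
    print(w, "->", step5(w))
```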

Statistical stemmers • Take a list of words • Construct a model of language that “generates” it • The “best” one • The simplest one? How to find? • List of stems, list of endings • Determine their probabilities • Usage statistics • Decompose any input string into a stem and an ending • Take the most probable variant
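Only the last two bullets are sketched here: choosing the most probable decomposition once stem and ending probabilities are known. How to learn the model from an unsegmented word list is the open question on the slide and is not attempted; instead, the probabilities are estimated from a small hand-segmented sample, which is a hypothetical assumption of this example, as is the independence of stem and ending.

```python
from collections import Counter

# Hypothetical hand-segmented sample: (stem, ending).
SEGMENTED = [
    ("walk", ""), ("walk", "s"), ("walk", "ing"),
    ("talk", ""), ("talk", "ed"),
    ("jump", "ed"), ("jump", "ing"),
]

def estimate(pairs):
    """Relative frequencies of stems and of endings."""
    n = len(pairs)
    stems = Counter(stem for stem, _ in pairs)
    endings = Counter(ending for _, ending in pairs)
    return ({s: c / n for s, c in stems.items()},
            {e: c / n for e, c in endings.items()})

def decompose(word, p_stem, p_ending):
    """Most probable (stem, ending) split; the whole word if no split is known."""
    best, best_p = (word, ""), 0.0
    for cut in range(1, len(word) + 1):
        stem, ending = word[:cut], word[cut:]
        p = p_stem.get(stem, 0.0) * p_ending.get(ending, 0.0)
        if p > best_p:
            best, best_p = (stem, ending), p
    return best

P_STEM, P_ENDING = estimate(SEGMENTED)
print(decompose("talking", P_STEM, P_ENDING))  # ('talk', 'ing'), a pair never seen together
print(decompose("xyz", P_STEM, P_ENDING))      # ('xyz', ''), unknown word left whole
```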

Research topics • Construction and application of ontologies • Building of morphological dictionaries • Treatment of unknown words with morphological analyzers • Development of better stemmers • Statistical stemmers?

Conclusions • Reducing synonyms can help IR • Better matching • Ontologies are used. WordNet • Morphology is a variant of synonymy • widely used in IR systems • Precise analysis: dictionary-based analyzers • Quick-and-dirty analysis: stemmers • Rule-based stemmers. Porter stemmer • Statistical stemmers

Thank you! Till May 24? 25?, 6 pm
