Managing Morphologically Complex Languages in Information Retrieval Kal Järvelin & Many Others University of Tampere
1. Introduction • Morphologically complex languages • unlike English, Chinese • rich inflectional and derivational morphology • rich compound formation • U. Tampere experiences 1998 - 2008 • monolingual IR • cross-language IR • focus: Finnish, Germanic languages, English
Methods for Morphology Variation Management • Reductive methods: stemming (rule-based); lemmatization (rules + dictionary) • Generative methods: inflectional stem generation (rules + dictionary), yielding inflectional stems or enhanced inflectional stems; word form generation (rule-based), as FCG or generation of all forms
Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion
2. Normalization • Reductive methods, conflation • stemming • lemmatization • + conflation -> simpler searching • + smaller index • + provides query expansion • Stemming available for many languages (e.g. Porter stemmer) • Lemmatizers less available and more demanding (dictionary requirement)
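As an illustration of the two reductive approaches, here is a minimal Python sketch: rule-based stemming via NLTK's Snowball stemmer (which includes Finnish), and lemmatization mocked with a toy lookup table, since a real lemmatizer needs a full morphological dictionary. The example words and the table are illustrative only.

```python
from nltk.stem.snowball import SnowballStemmer

# Rule-based stemming: strips suffixes without a dictionary.
stemmer = SnowballStemmer("finnish")
print(stemmer.stem("taloissa"))          # "taloissa" is an inflected form of "talo" (house)

# Lemmatization maps surface forms to dictionary headwords (lemmas).
# A real lemmatizer (e.g. a TWOL analyser) needs a full morphological lexicon;
# this toy lookup table only stands in for it.
LEMMA_TABLE = {"taloissa": "talo", "lääkkeet": "lääke"}

def lemmatize(token: str) -> str:
    # Unknown tokens fall through unchanged (they become OOV keys).
    return LEMMA_TABLE.get(token, token)

print(lemmatize("lääkkeet"))
```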
Alkula 2001 • Boolean environment, inflected index, Finnish: • manual truncation vs. automatic stemming • stemming improves P and hurts R • many derivatives are lost • Boolean environment, infl vs. lemma index, Finnish: • manual truncation vs. lemmatization • lemmatization improves P and hurts R • many derivatives are lost, others correctly avoided • Differences not great between automatic methods
Kettunen & al 2005 • Ranked retrieval, Finnish: • Three problems • how do lemmatization and inflectional stem generation compare in a best-match environment? • is a stemmer realistic for handling Finnish morphology? • is simulated truncation feasible in a best-match system? • Lemmatized vs. inflected-form vs. stemmed index.
Kettunen & al. 2005 • Results (InQuery/TUTK/graded-35/regular):
Method        Index            MAP    Change %
FinTWOL       lemmas           35.0   --
Inf Stem Gen  inflected forms  34.2   -2.3
Porter        stemmed          27.7   -20.9
Raw           inflected forms  18.9   -46.0
• But inflectional stem generation & expansion lead to very long queries (thousands of words); weaker generation variants give shorter queries but progressively deteriorating results.
MonoIR: Airio 2006 InQuery/CLEF/TD/TWOL&Porter&Raw
CLIR: Inflectional Morphology • NL queries contain source keys in inflected form • Dictionary headwords are in basic form (lemmas) • The significance of the problem varies by language • Stemming • stem both the dictionary and the query words • but this may produce far too many translations • Stemming in dictionary translation is best applied after translation.
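A minimal sketch of the "translate first, then stem" order, assuming a toy Finnish-English dictionary keyed by lemmas (the dictionary contents are invented for illustration); the target-side translations are stemmed so that they match a stemmed English index.

```python
from nltk.stem.snowball import SnowballStemmer

# Toy bilingual dictionary (Finnish lemma -> English translations), invented for illustration.
FI_EN = {
    "lääke": ["medication", "drug"],
    "sydän": ["heart"],
    "vaiva": ["ailment", "complaint", "trouble"],
}

en_stemmer = SnowballStemmer("english")

def translate_then_stem(source_lemmas):
    """Translate source lemmas first, then stem the target-language words
    so they match a stemmed English index."""
    query_terms = []
    for lemma in source_lemmas:
        for translation in FI_EN.get(lemma, [lemma]):  # untranslatable keys pass through as OOVs
            query_terms.append(en_stemmer.stem(translation))
    return query_terms

print(translate_then_stem(["lääke", "sydän", "vaiva"]))
```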
Lemmatization in CLIR • Lemmatization • gives easy access to dictionary headwords • but tokens may be ambiguous • dictionary translations are not always in basic form • lemmatizer's dictionary coverage matters • insufficient coverage -> non-lemmatized source keys (OOVs) • overly broad coverage -> too many senses provided
CLIR Findings: Airio 2006 English -> X InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter
Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion
3. Compounds • Compounds, compound word types • determinative: Weinkeller, vinkällare, life-jacket • copulative: schwarzweiss, svartvit, black-and-white • compositional: Stadtverwaltung, stadsförvaltning • non-compositional: Erdbeere, jordgubbe, strawberry • Note on spelling: compound word components are written together (if not -> phrases)
Compound Word Translation • Not all compounds are in the dictionary • some languages form compounds very productively • small dictionaries: atomic words, old non-compositional compounds • large dictionaries: many compositional compounds added • Compounds remove phrase identification problems, but cause translation and query formulation problems
Joining Morphemes • Joining morphemes complicate compound analysis & translation • Joining morpheme types in Swedish: <omission> flicknamn, -s rättsfall, -e flickebarn, -a gästabud, -u gatubelysning, -o människokärlek • Joining morpheme types in German: -s Handelsvertrag, -n Affenhaus, -e Gästebett, -en Fotographenausbildung, -er Gespensterhaus, -es Freundeskreis, -ens Herzensbrecher, <omission> Sprachwissenschaft • Suggestive finding: treatment of joining morphemes improves MAP by 2% (Hedlund 2002, SWE->ENG, 11 queries)
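The following Python sketch shows one simple way to handle joining morphemes when splitting compounds before translation; the lexicon, the morpheme list and the vowel-restoration step are toy assumptions for illustration, not the UTAClir implementation.

```python
# Toy Swedish lexicon and joining-morpheme list, invented for illustration.
LEXICON = {"rätt", "fall", "flicka", "namn", "gata", "belysning"}
JOINING = ["s", "e", "a", "u", "o", ""]   # "" covers plain concatenation / omission

def split_compound(word):
    """Try every split point; allow a joining morpheme between the parts and
    crudely restore the modifier's final vowel (e.g. gatu- -> gata, flick- -> flicka)."""
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if tail not in LEXICON:
            continue
        for j in JOINING:
            if not head.endswith(j):
                continue
            base = head[:len(head) - len(j)] if j else head
            for candidate in (base, base + "a"):   # naive vowel restoration
                if candidate in LEXICON:
                    return candidate, tail
    return None

print(split_compound("gatubelysning"))   # ('gata', 'belysning') with this toy lexicon
print(split_compound("flicknamn"))       # ('flicka', 'namn')
```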
Compound Processing, 2 • A Finnish natural language query: lääkkeet sydänvaivoihin (medicines for heart problems) • Output of morphological analysis: lääke; sydänvaiva, sydän, vaiva • Dictionary translation and the output of component tagging: • lääke -> medication, drug • sydänvaiva -> "not in dict" • sydän -> heart • vaiva -> ailment, complaint, discomfort, inconvenience, trouble, vexation • Many ways to combine components in the query
Compound Processing, 3 • Sample English CLIR query: • #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation )) • i.e. translating as if source compounds were phrases • Source compound handling may vary here: • #sum( #syn( medication drug ) #syn(#uw3( heart ailment ) #uw3( heart complaint ) #uw3( heart discomfort ) #uw3( heart inconvenience ) #uw3( heart trouble ) #uw3( heart vexation ))) • #uw3 = proximity operator for three intervening words, free word order • i.e. forming all proximity combinations as synonym sets.
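A sketch of how the second query form above can be built mechanically: every combination of the compound components' translations is wrapped in a #uwN proximity expression and the combinations are collected into one #syn set. The function and variable names are hypothetical; only the InQuery-style operators come from the example above.

```python
from itertools import product

def compound_synset(component_translations, window=3):
    """Combine the translations of compound components into a #syn set of
    #uwN proximity expressions (one expression per translation combination)."""
    combinations = product(*component_translations)
    proximities = [f"#uw{window}( {' '.join(combo)} )" for combo in combinations]
    return f"#syn( {' '.join(proximities)} )"

# 'sydänvaiva' split into 'sydän' + 'vaiva', each component translated separately.
heart = ["heart"]
trouble = ["ailment", "complaint", "discomfort", "inconvenience", "trouble", "vexation"]

query = f"#sum( #syn( medication drug ) {compound_synset([heart, trouble])} )"
print(query)
```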
Compound Processing, 4 • No clear benefits were seen from using proximity combinations. • Nor did we observe a great effect from changing the proximity operator (OD vs. UW) • Some monolingual results follow (Airio 2006)
[Figure: monolingual results for English, Swedish and Finnish; morphological complexity increases from English to Finnish]
Hedlund 2002 • Compound translation as compounds: • 47 German CLEF 2001 topics, English document collection • comprehensive dictionary (many compounds) vs. small dictionary (no compounds) • mean AP 34.7% vs. 30.4% • the dictionary matters ... • Alternative approach: if a compound is not translatable, split it and translate the components
CLEF Ger -> Eng InQuery/UTAClir/CLEF/Duden/TWOL/UW 5+n
CLIR Findings: Airio 2006 English -> InQuery/UTAClir/CLEF/GlobalDix/TWOL&Porter
[Figure: CLIR results for Eng->Fin, Eng->Ger and Eng->Swe]
Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion
4. Generative Methods • Variation handling (recap): • Reductive methods: stemming (rule-based); lemmatization (rules + dictionary) • Generative methods: inflectional stem generation (rules + dictionary), yielding inflectional stems or enhanced inflectional stems; word form generation (rule-based), as FCG or generation of all forms
Generative Methods: inflectional stems • Instead of normalization, generate inflectional stems for an inflected-form index • then use the stems to harvest full forms from the index • long queries ...
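A minimal sketch of the harvesting step, assuming the index vocabulary is available as a list and that matching is plain prefix matching on hand-written inflectional stems; both assumptions are simplifications for illustration.

```python
# Toy vocabulary of an inflected-form index; in practice this comes from the index itself.
INDEX_VOCAB = ["talo", "talon", "taloa", "talossa", "taloissa", "taloihin",
               "taulu", "taulun"]

def harvest_forms(inflectional_stems, vocab=INDEX_VOCAB):
    """Expand generated inflectional stems into the full surface forms that
    actually occur in the index (this is what makes the queries so long)."""
    return sorted({term for term in vocab
                   for stem in inflectional_stems if term.startswith(stem)})

# Two hand-written, illustrative inflectional stems for 'talo' (house).
forms = harvest_forms(["talo", "taloi"])
print("#syn( " + " ".join(forms) + " )")
```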
... OR ... • Instead of normalization, generate full inflectional forms for an inflectional index. • Long queries? Sure! • Sounds absolutely crazy ...
... BUT! • Are morphologically complex languages really that complex in IR in practice? • Instead of full form generation, generate only a sufficient set of forms -> FCG (frequent case generation) • In Finnish, 9-12 forms cover 85% of all occurrences of nouns
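A deliberately naive sketch of the FCG idea for Finnish nouns: attach only the most frequent case endings to a vowel stem like 'talo'. The ending list is a rough illustration, and a real generator must handle consonant gradation and stem alternation, which this sketch ignores.

```python
# Roughly a dozen frequent Finnish noun forms (singular and a few plural cases).
FREQUENT_ENDINGS = [
    "", "n", "a", "ssa", "sta", "lla", "lle", "ksi", "na",   # singular cases
    "t", "jen", "ja", "issa",                                # a few plural cases
]

def frequent_case_forms(stem: str):
    """Attach frequent case endings to a simple vowel stem.
    Words with consonant gradation or stem alternation need a real generator."""
    return [stem + ending for ending in FREQUENT_ENDINGS]

print(frequent_case_forms("talo"))
```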
Kettunen & al 2006: Finnish IR (monolingual) • MAP by relevance level:
Method     Liberal   Normal   Stringent
TWOL       37.8      35.0     24.1
FCG12      32.7      30.0     21.4
FCG6       30.9      28.0     21.0
Snowball   29.8      27.7     20.0
Raw        19.6      18.9     12.4
Kettunen & al 2007: Other Languages IR (monolingual, long queries) • MAP by language (the /n suffix indicates the FCG variant, as FCG12/FCG6 above):
Method     Swe       Ger       Rus
TWOL       32.6      39.7      ..
FCG        30.6 /4   38.0 /4   32.7 /2
FCG        29.1 /2   36.8 /2   29.2 /6
Snowball   28.5      39.1      34.7
Raw        24.0      35.9      29.8
Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion
5. Query Structures • Translation ambiguity such as ... • Homonymy: homophony, homography • Examples: platform, bank, book • Inflectional homography • Examples: train, trains, training • Examples: book, books, booking • Polysemy • Examples: back, train • ... a problem in CLIR.
Ambiguity Resolution • Methods • Part-of-speech tagging (e.g. Ballesteros & Croft '98) • Corpus-based methods (Ballesteros & Croft '96; '97; Chen & al. '99) • Query Expansion • Collocations • Query structuring - the Pirkola Method (1998)
Query Structuring Concepts? • From weak to strong query structures by recognition of ... • concepts • expression weights • phrases, compounds • Queries may be combined ... query fusion • [Diagram: decision tree over concept recognition, weighting and phrase recognition, ranging from the unstructured query #sum(a b c d e) to the fully structured #wsum(1 3 #syn(a #3(b c)) 1 #syn(d e))]
Structured Queries in CLIR • CLIR performance (Pirkola 1998, 1999) • English baselines, manual Finnish translations • Automatic dictionary translation FIN -> ENG • natural language queries (NL) vs. concept queries (BL) • structured vs. unstructured translations • single words (NL/S) vs. phrases marked (NL/WP) • general and/or special dictionary translation • 500,000-document TREC subcollection • probabilistic retrieval (InQuery) • 30 health-related requests
The Pirkola Method • All translations of all senses provided by the dictionary are incorporated in the query • All translations of each source language word are combined with the synonym operator; the synonym groups are combined with #and or #sum • this effectively provides disambiguation
An Example • Consider the Finnish natural language query: • lääke sydänvaiva [= medicine heart_problem] • Sample English CLIR query: • #sum( #syn( medication drug ) heart #syn( ailment, complaint, discomfort, inconvenience, trouble, vexation ) ) • Each source word forming a synonym set
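A small Python sketch of the structuring step, assuming the per-word translation lists are already available (the helper name is hypothetical): each source word's alternatives go into a #syn group and the groups are combined with #sum, as in the example above; replacing #sum with #and gives the variant used in the structured runs later.

```python
def pirkola_query(translations_per_source_word, combiner="#sum"):
    """Wrap each source word's translation alternatives in a #syn group and
    combine the groups with #sum (or #and), following the structured-query idea."""
    groups = []
    for alternatives in translations_per_source_word:
        if len(alternatives) == 1:
            groups.append(alternatives[0])
        else:
            groups.append(f"#syn( {' '.join(alternatives)} )")
    return f"{combiner}( {' '.join(groups)} )"

translations = [
    ["medication", "drug"],                                          # lääke
    ["heart"],                                                       # sydän
    ["ailment", "complaint", "discomfort",
     "inconvenience", "trouble", "vexation"],                        # vaiva
]
print(pirkola_query(translations))
```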
TREC Query Translation Test Set-up • English requests were manually translated into Finnish; from the translated Finnish requests, Finnish NL and BL (concept) queries were formed and translated back to English with the general and/or medical dictionary • the English requests yield the baseline queries • all queries were run with InQuery on a Unix server
Unstructured NL/S Queries • #sum(tw11, tw12, ..., tw21, tw22, ..., twn1, ..., twnk) • Only 38% of the average baseline precision (special & general dictionary)
Structured Queries w/ Special Dictionary • #and(#syn(tw11, tw12, ...), #syn(tw21, tw22, ...), ..., #syn(twn1, ..., twnk)) • 77% of the average baseline precision (special & general dictionary) • Structure doubles precision in all cases
Transit CLIR – Query Structures Average precision for the transitive, bilingual and monolingual runs of CLEF 2001 topics (N = 50)
Transitive CLIR Effectiveness Lehtokangas & al 2008
Agenda • 1. Introduction • 2. Reductive Methods • 3. Compounds • 4. Generative Methods • 5. Query Structures • 6. OOV Words • 7. Conclusion