VIII Iberoamerican Conference on Artificial Intelligence, Sevilla, 2002
Terminology Retrieval: towards a synergy between thesaurus and free text searching
Anselmo Peñas, Felisa Verdejo and Julio Gonzalo
Dpto. Lenguajes y Sistemas Informáticos, UNED
Overview
• Motivation
• Objectives
• Proposed approach: Terminology Retrieval
• Website Term Browser
• Evaluation
• Conclusions
Multilingual Thesaurus
60. EDUCATIONAL SYSTEM
Education
  NT1 adult education
    RT adult (10)
    RT lifelong learning
  NT1 basic education
    RT* transition from basic to secondary education
    RT didactic continuity (50)
  NT1 distance education
    UF distance learning
    UF distance study
    UF distance training
    UF ODL
    UF open and distance learning
  NT1 informal education
  NT1 lifelong learning
    UF continuing education
    UF lifelong education
    UF recurrent education
    RT adult education
  (…)
Designed for
• Indexing and searching in a specific subject area
• Vocabulary control
• Promoting consistency
• Cross-language
• Guiding users about which terms to use
• Navigating the thesaurus
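The vocabulary control that the thesaurus excerpt provides can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function names `build_entry_index` and `preferred_descriptor` are assumptions made here, not part of the thesaurus or of the system presented in the talk. UF ("used for") entries are mapped to their preferred descriptor, which is what a searcher needs in order to know which term to use.

```python
# Hypothetical in-memory form of part of the excerpt above.
# UF = non-preferred entry terms, RT = related terms.
THESAURUS = {
    "distance education": {
        "UF": ["distance learning", "distance study", "distance training",
               "ODL", "open and distance learning"],
        "RT": [],
    },
    "lifelong learning": {
        "UF": ["continuing education", "lifelong education",
               "recurrent education"],
        "RT": ["adult education"],
    },
    "adult education": {"UF": [], "RT": ["adult", "lifelong learning"]},
}


def build_entry_index(thesaurus):
    """Map every descriptor and every UF variant to its preferred descriptor."""
    index = {}
    for descriptor, relations in thesaurus.items():
        index[descriptor.lower()] = descriptor
        for variant in relations["UF"]:
            index[variant.lower()] = descriptor
    return index


def preferred_descriptor(term, index):
    """Return the preferred descriptor for a term, or None if it is unknown."""
    return index.get(term.lower())


entry_index = build_entry_index(THESAURUS)
print(preferred_descriptor("ODL", entry_index))
print(preferred_descriptor("e-learning", entry_index))
```

A term outside the controlled vocabulary, such as "e-learning" in this toy data, simply returns nothing, which illustrates why less specialized users get poorer results from thesaurus-only searching.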
Multilingual Thesaurus Problems
• Construction & management (high cost)
• Indexing
  • Manual keyword assignment
  • Errors in automatic keyword assignment
• Domain specific
  • A new domain needs a new thesaurus
• Specialist oriented (users must know the preferred descriptors)
  • Less specialized audiences get poorer results
Objectives
• Develop a model
  • to help users express and refine their information needs
  • to help users overcome language barriers
• Bring the collection terminology to users
  • Morpho-syntactic, semantic & translingual variations
  • Without the need for thesaurus construction
• Establish an appropriate evaluation framework
Proposed approach
[Diagram: Terminology Retrieval & Term Browsing (Website Term Browser) placed between Free Text Searching and Controlled Vocabulary Searching. Other labels: NLP Techniques, Automatic Terminology Extraction, Information Retrieval.]
Terminology Retrieval
From Automatic Terminology Extraction...
Obtain lists of terms relevant for a specific domain
• Term Extraction
• Term Weighting
• Term Selection
... to Terminology Retrieval
Retrieve terms relevant for an information need
• The user query points to the relevant terms
• No truncation of terminology lists
• Favor recall by relaxing term extraction patterns
... & Browsing
• Navigate through relevant terminology
• Access information from retrieved terms
• Bridge the gap between query and collection vocabularies
• Cross-language
Terminology Retrieval Requires
• Phrase indexing and retrieval
• Query expansion and translation
  • To retrieve terminology variations
    • Morpho-syntactic variations
    • Semantic variations
    • Translingual variations
  • Noise in retrieval
    • Ambiguity reduction
    • Co-occurrence of expansion words in the same phrase
Indexing Steps
• Text pre-processing and listing of words
• Word tagging (oriented to phrase detection)
• Phrase detection & lemmatization of components
• Document indexing & statistics (document frequency)
• Phrase selection (subsumption & lexicalization degree)
• Phrase indexing
[Diagram: index structure linking lemmas, phrases and documents]
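The lemma, phrase and document links produced by the indexing steps can be sketched as two small inverted indexes. This is a minimal illustration under stated assumptions, not the authors' implementation: phrase detection is assumed to have already happened, the `LEMMAS` table is a toy stand-in for a real lemmatizer, and the names `lemmatize` and `index_phrases` are hypothetical. Phrase selection (subsumption and lexicalization degree) is not modelled.

```python
from collections import defaultdict

# Toy lemma table standing in for a real lemmatizer (an assumption).
LEMMAS = {"treaties": "treaty", "tests": "test", "bans": "ban"}


def lemmatize(word):
    """Lower-case a word and map it to its lemma when the toy table knows it."""
    word = word.lower()
    return LEMMAS.get(word, word)


def index_phrases(docs):
    """Build lemma -> phrases and phrase -> documents indexes.

    `docs` maps a document id to the phrases already detected in it.
    Phrases are stored as tuples of component lemmas.
    """
    lemma_to_phrases = defaultdict(set)
    phrase_to_docs = defaultdict(set)
    for doc_id, phrases in docs.items():
        for phrase in phrases:
            key = tuple(lemmatize(word) for word in phrase.split())
            phrase_to_docs[key].add(doc_id)
            for lemma in key:
                lemma_to_phrases[lemma].add(key)
    return lemma_to_phrases, phrase_to_docs


docs = {
    "d1": ["nuclear test ban treaty", "adult education"],
    "d2": ["Nuclear Test Ban Treaties", "distance education"],
}
lemma_to_phrases, phrase_to_docs = index_phrases(docs)

# Document frequency of a phrase is the number of documents containing it.
print(len(phrase_to_docs[("nuclear", "test", "ban", "treaty")]))
print(sorted(lemma_to_phrases["education"]))
```

Storing phrases by the lemmas of their components is what lets a later lookup go from any query lemma to every phrase that contains it, and from there to the documents.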
Ambiguity Reduction
Query expansion and translation of "Tratados de Prohibición de Pruebas Nucleares"
• Tratados
  Expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar
  Translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty
• Prohibición
  Expansion: embargo, entredicho, interdicción, interdicto, proscripción
  Translation: ban, interdiction, prohibition, proscription
• Pruebas
  Expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo
  Translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try
• Nucleares
  Expansion: nuclear
  Translation: nuclear
Candidate readings
• Nuclear taste proscription process?
• Nuclear test ban treaty?
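The idea of reducing ambiguity through co-occurrence of expansion words in the same phrase can be sketched as follows. This is a hedged illustration, not the system's actual algorithm: the candidate sets are a subset of the translations listed above, `INDEXED_PHRASES` is an invented toy collection, and both the `min_words` threshold and the ranking by number of covered query words are assumptions made for the example.

```python
# Subset of the translation candidates listed above, per query word.
CANDIDATES = {
    "tratados": {"accord", "pact", "process", "treat", "treaty"},
    "prohibición": {"ban", "interdiction", "prohibition", "proscription"},
    "pruebas": {"proof", "sample", "taste", "test", "trial"},
    "nucleares": {"nuclear"},
}

# Hypothetical phrases found in the indexed collection (lemmatized).
INDEXED_PHRASES = [
    ("nuclear", "test", "ban", "treaty"),
    ("nuclear", "test"),
    ("wine", "taste"),
    ("peace", "process"),
]


def retrieve_phrases(candidates, phrases, min_words=2):
    """Keep phrases in which candidates of several query words co-occur.

    A query word is "covered" by a phrase when at least one of its
    candidates appears in that phrase. Phrases covering fewer than
    `min_words` query words are treated as noise and dropped.
    """
    hits = []
    for phrase in phrases:
        lemmas = set(phrase)
        covered = sum(1 for cands in candidates.values() if cands & lemmas)
        if covered >= min_words:
            hits.append((covered, phrase))
    # Phrases covering more query words first, ties broken alphabetically.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return hits


for covered, phrase in retrieve_phrases(CANDIDATES, INDEXED_PHRASES):
    print(covered, " ".join(phrase))
```

A reading such as "nuclear taste proscription process" is never produced, because no phrase in the collection combines those candidates; isolated matches such as "wine taste" are filtered out because they cover only one query word.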
[Diagram: system architecture. Indexing: tokenising (tok1, tok2, tok3), then lemmatising with a lexicon (lem11, lem12, ...), feeding a phrase index and a document index. Retrieval: the query lemmas are expanded and translated (exp11, exp12, ...; tran11, tran12, ...) using EWN (EuroWordNet) and dictionaries; phrase retrieval and document retrieval follow, then term ranking and document ranking, which return terms and documents.]
[Screenshot: Website Term Browser. Labels: query in Spanish; hierarchy of terms (English, Spanish, Catalan); ranking of documents.]
[Screenshot: retrieved term variations]
- Translingual
- Morpho-syntactic variations (permutation, insertion)
- Semantic variations
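Permutation and insertion variations can be illustrated with a simple lemma-level test. This is a sketch under assumptions, not the matching rule used by the Website Term Browser: the function name `is_variant` and the `max_insertions` limit are hypothetical, and function words such as "of" are counted as insertions here for simplicity.

```python
def is_variant(query_lemmas, phrase_lemmas, max_insertions=2):
    """Return True if the phrase is a permutation/insertion variant of the query.

    Every query lemma must occur in the phrase, in any order (permutation),
    and the phrase may contain at most `max_insertions` additional words
    (insertion).
    """
    query = set(query_lemmas)
    phrase = list(phrase_lemmas)
    if not query <= set(phrase):
        return False
    extra_words = [word for word in phrase if word not in query]
    return len(extra_words) <= max_insertions


# Permutation: same lemmas, different order.
print(is_variant(["adult", "education"], ["education", "adult"]))
# Permutation plus insertion: "education of adults", lemmatized.
print(is_variant(["adult", "education"], ["education", "of", "adult"]))
# Not a variant: one query lemma is missing.
print(is_variant(["adult", "education"], ["adult", "literacy"]))
```

Semantic and translingual variations are not covered by this test; they depend on the expansion and translation resources applied before matching.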
Evaluation of Terminology Retrieval
Compare
• Terminology Retrieval over 42,406 web pages (200 MB)
• Hand-crafted Multilingual Thesaurus (1051 descriptors)
Evaluation of Terminology Retrieval
Recall of mono-lexical terms (lemmas)
• Monolingual: 85% - 95%
• Translingual: 55% - 65%
Recall of poly-lexical terms (phrases)
• Monolingual: 40% - 65%
• Translingual: 10% - 45%
Loss of recall due to
• Phrase extraction (mainly POS tagging): 3% - 17%
• Phrase indexing (mainly lemmatization): 2% - 34%
• Phrase selection: 12% - 37%
• Lack of connections between different languages in EWN
• Lack of adjective hierarchies in EWN
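Recall against the thesaurus can be read as the fraction of thesaurus descriptors that terminology retrieval manages to recover. The snippet below is a minimal sketch of that measure on invented toy data; the exact matching procedure used in the evaluation is not given in the slides, so the case-insensitive exact match and the function name `recall` are assumptions.

```python
def recall(descriptors, retrieved_terms):
    """Fraction of thesaurus descriptors found among the retrieved terms."""
    wanted = {d.lower() for d in descriptors}
    found = {t.lower() for t in retrieved_terms}
    if not wanted:
        return 0.0
    return len(wanted & found) / len(wanted)


# Toy data, not figures from the evaluation.
descriptors = ["adult education", "distance education",
               "lifelong learning", "basic education"]
retrieved = ["Adult education", "distance education",
             "lifelong learning", "school"]
print(recall(descriptors, retrieved))
```

Measuring recall separately after each processing stage (extraction, indexing, selection) is one way to attribute the losses listed above to a specific stage.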
Conclusions
A search model based on extraction, retrieval and browsing of terminology has been developed
• User oriented
• Interaction over terminological information
• An intermediate way between free-text searching and thesaurus-guided searching
  • Without the need for thesaurus construction
  • Bringing the collection terminology to users
    • Morpho-syntactic & semantic variations
    • Translinguality
Conclusions
An evaluation framework for Terminology Retrieval and Term Browsing has been established
• It points the way to improving Terminology Retrieval
• Users appreciate Term Browsing
• WTB (Website Term Browser) phrasal information can substantially complement the document ranking provided by search engines