
Corpus-based Terminology Extraction applied to Information Access

Corpus Linguistics 2001, Lancaster, UK. Corpus-based Terminology Extraction applied to Information Access. Anselmo Peñas, Felisa Verdejo and Julio Gonzalo, NLP Group, Dpto. Lenguajes y Sistemas Informáticos, UNED, Spain.


Presentation Transcript


  1. Corpus Linguistics 2001, Lancaster, UK Corpus-based Terminology Extraction applied to Information Access Anselmo Peñas, Felisa Verdejo and Julio Gonzalo NLP Group, Dpto. Lenguajes y Sistemas Informáticos, UNED, Spain

  2. Content • Introduction • Resources, Tools and Corpora • Terminology Extraction (TE) • Evaluation of the TE procedure • Terminology-based Information Access • Conclusions

  3. Introduction: Framework • The European Treasury Browser (ETB) project • Web site of Educational Resources (primary and secondary school) • Context of New Technologies • Objective: to build the structures to organise and retrieve educational resources • Similar systems • The Educational Resources Information Centre • The British Education Index

  4. Introduction: use of Thesauri Thesauri • Definition: controlled vocabulary, structured in relations • Structure: descriptors and relations (NT: narrower term, BT: broader term, RT: related term) Existing educational thesauri • Don’t cover primary and secondary school vocabulary within the new technologies context • Construction of a multilingual thesaurus is needed for the ETB project purposes Terminology Lists

  5. Objectives of the work • To build the Spanish list of candidate terms for the ETB multilingual thesaurus • To develop a general procedure to obtain terminology lists • In an automatic way • Independently of the application domain • To explore effective ways of Information Retrieval • using the terminology lists instead of a thesaurus • to bridge the gap between the users’ language and the collection’s language

  6. Content • Introduction • Resources, Tools and Corpora • Terminology Extraction (TE) • Evaluation of the TE procedure • Terminology-based Information Access • Conclusions

  7. Resources and Tools Resources • Semantic network: EuroWordNet • Monolingual dictionary (VOX) • Bilingual dictionary (VOX) Tools • Tokeniser • Morphological analyser • POS tagger • Shallow parser (based on syntactic patterns)

  8. Corpora • Corpus of educational resources: 1,075 documents (670,646 words) from • Programa de Nuevas Tecnologías (http://www.pntic.mec.es/main_recursos.html) • Aldea Global (http://sauce.pntic.mec.es/~alglobal) • Corpus of international news: 7,364 documents (2.9 million words) • (http://www.elpais.es/internac) Pre-processing: HTML tag treatment, language detection, detection of repeated pages and chunks, etc.

  9. Content • Introduction • Resources, Tools and Corpora • Terminology Extraction (TE) • Evaluation of the TE procedure • Terminology-based Information Access • Conclusions

  10. Terminology Extraction (TE) Terminology List: list of mono-lexical and poly-lexical terms which are common in a specific domain Steps of Terminology Extraction 1. Term detection 2. Term weighting 3. Term selection

  11. 1. Term Detection (mono-lexical) (Over both corpora, Educational Resources and International News) Processing • Tokenising • Lemmatising, Tagging • Removal of erroneous strings, abbreviations and words from other languages • Extraction of nouns, verbs and adjectives Result • List of candidate lemmas with their: • Term frequency (any form) in both collections • Document frequency in both collections
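The counting step on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tokeniser, lemmatiser and tagger have already produced (lemma, tag) pairs, and the tag names `N`, `V` and `A` and the function name `count_frequencies` are hypothetical.

```python
from collections import Counter

def count_frequencies(documents):
    """documents: list of documents, each a list of (lemma, tag) pairs.

    Returns (term_freq, doc_freq) Counters over lemmas tagged N, V or A.
    """
    keep = {"N", "V", "A"}  # nouns, verbs, adjectives (assumed tag names)
    term_freq, doc_freq = Counter(), Counter()
    for doc in documents:
        lemmas = [lemma for lemma, tag in doc if tag in keep]
        term_freq.update(lemmas)      # every occurrence counts
        doc_freq.update(set(lemmas))  # each document counts once per lemma
    return term_freq, doc_freq

# Toy input, invented for illustration.
docs = [
    [("educación", "N"), ("a", "Prep"), ("distancia", "N")],
    [("educación", "N"), ("primario", "A"), ("educación", "N")],
]
tf, df = count_frequencies(docs)
```

Running the same function over each corpus separately gives the per-collection frequencies the slide lists as the result.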

  12. 1. Term Detection (poly-lexical) (Over Educational Resources corpus) Processing • Tokenising, Lemmatising, Tagging • Shallow parsing (syntactic pattern recognition) Syntactic patterns for Spanish terminological phrases • N N • N A • N [A] Prep N [A] • N [A] Prep Art N [A] • N [A] Prep V • N [A] Prep V N [A] Example • ... como/CS en/Prep la/Art educación/N a/Prep distancia/N ,/Punc el/Art ministerio/N ... • Detected term: educación a distancia • Pattern: N Prep N Result • List of candidate terminological phrases with their: • Term frequency in the collection • Document frequency in the collection
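A rough sketch of the pattern recognition on this slide, assuming the tagged sentence is available as (word, tag) pairs. The pattern strings are one reading of the list on the slide, where "[A]" marks an optional adjective; the function names `expand` and `detect_phrases` are hypothetical, and a real shallow parser would also handle lemmas and overlapping matches more carefully.

```python
from itertools import product

# One reading of the slide's patterns; "[A]" is an optional adjective.
PATTERNS = [
    "N N", "N A", "N [A] Prep N [A]", "N [A] Prep Art N [A]",
    "N [A] Prep V", "N [A] Prep V N [A]",
]

def expand(pattern):
    """Expand optional elements into every concrete tag sequence."""
    options = []
    for item in pattern.split():
        if item.startswith("["):
            options.append([(), (item[1:-1],)])  # absent or present
        else:
            options.append([(item,)])
    return {sum(choice, ()) for choice in product(*options)}

TAG_SEQUENCES = set().union(*(expand(p) for p in PATTERNS))

def detect_phrases(tagged):
    """tagged: list of (word, tag). Returns every span whose tags match."""
    found = []
    for start in range(len(tagged)):
        for seq in TAG_SEQUENCES:
            span = tagged[start:start + len(seq)]
            if tuple(tag for _, tag in span) == seq:
                found.append(" ".join(word for word, _ in span))
    return found

# The example sentence fragment from the slide.
sentence = [
    ("como", "CS"), ("en", "Prep"), ("la", "Art"), ("educación", "N"),
    ("a", "Prep"), ("distancia", "N"), (",", "Punc"), ("el", "Art"),
    ("ministerio", "N"),
]
phrases = detect_phrases(sentence)
```

On the slide's fragment, the only matching span is "educación a distancia", via the N Prep N expansion of the third pattern.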

  13. 2. Term weighting Empirical measure • Proportional to • term frequency in the domain corpus • document frequency in the domain corpus • Inversely proportional to • term frequency in other domain • Normalisation
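The slide names the factors of the measure but not its formula. The sketch below is one hypothetical way to combine them, offered only to make the proportionalities concrete: the product and ratio, the `+ 1` smoothing, and the max-based normalisation are all assumptions, not the measure used in the work.

```python
def raw_weight(tf_domain, df_domain, tf_other):
    """Grows with domain term and document frequency, shrinks with
    term frequency in the other domain (hypothetical combination)."""
    return tf_domain * df_domain / (1 + tf_other)  # +1 avoids division by zero

def normalise(weights):
    """Scale weights into [0, 1] by dividing by the largest one."""
    top = max(weights.values())
    return {term: w / top for term, w in weights.items()}

# term: (tf in domain, df in domain, tf in other domain); invented numbers.
counts = {
    "alumno": (120, 40, 10),
    "gobierno": (15, 5, 900),
}
weights = normalise({t: raw_weight(*c) for t, c in counts.items()})
```

With these invented counts, a term that is frequent in the educational corpus and rare in the news corpus ends up far above one that is mostly a news word.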

  14. 3. Term Selection • Removal of infrequent terms in the study domain • Removal of very frequent terms in other domains • Ranking of terms according to their weight • Selection of top terms in the terminology list (thresholds to obtain 2,000 / 3,000 terms from the 75,000 detected terms) • Addition of phrases with relevant components
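The selection steps above can be sketched as a filter, a sort and a cut-off. The thresholds, the function names, and the reading of "relevant components" as "at least one component word is a selected term" are assumptions made for illustration.

```python
def select_terms(candidates, min_domain_tf, max_other_tf, top_n):
    """candidates: dict term -> (weight, tf_domain, tf_other).

    Drops terms that are infrequent in the study domain or very frequent
    in the other domain, then returns the top_n terms by weight.
    """
    kept = {
        term: weight
        for term, (weight, tf_domain, tf_other) in candidates.items()
        if tf_domain >= min_domain_tf and tf_other <= max_other_tf
    }
    ranked = sorted(kept, key=kept.get, reverse=True)
    return ranked[:top_n]

def add_phrases(selected, phrases):
    """Append phrases that have at least one selected term as a component."""
    chosen = set(selected)
    extra = [p for p in phrases
             if p not in chosen and any(w in chosen for w in p.split())]
    return selected + extra

# Invented candidates: term -> (weight, tf in domain, tf in other domain).
candidates = {
    "currículo": (0.9, 50, 2),
    "presidente": (0.8, 30, 700),
    "tiza": (0.7, 1, 0),
    "aula": (0.6, 40, 5),
}
selected = select_terms(candidates, min_domain_tf=3, max_other_tf=100, top_n=2)
final_list = add_phrases(selected, ["aula virtual", "política exterior"])
```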

  15. Content • Introduction • Resources, Tools and Corpora • Terminology Extraction (TE) • Evaluation of the TE procedure • Terminology-based Information Access • Conclusions

  16. Evaluation: Visual exploration Automatic generation of result pages in HTML Purpose • To help in the decisions of the prototype development • To evaluate the measures and techniques and to suggest improvements or modifications • To give further information to documentalists in order to assist final decisions in thesaurus construction

  17. Evaluation: Visual exploration

  18. Evaluation: Precision • Manual classification of the 2,856 selected terms • Example candidates shown on the slide: Proyectos curriculares (Proyecto curricular), Proyecto curricular, Ciencias sociales, Sistema operativo, Profesorado materiales ¿?, Alumnos ingleses, Biblioteca nacional • With little effort, a large number of accurate terms is proposed to documentalists

  19. Evaluation: Precision [Plot: precision (% of selected terms which are appropriate terms) against the number of selected candidates] • Higher precision at the top of the ranking • With a lower number of candidates, the precision increases
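The quantity plotted on this slide can be computed as below. This is a minimal sketch, assuming the manual classification is available as a ranked list of booleans; the judgements shown are invented, not the evaluation data.

```python
def precision_at(judgements, k):
    """judgements: ranked list of booleans (True = appropriate term).

    Returns the percentage of appropriate terms among the top k candidates.
    """
    top = judgements[:k]
    return 100.0 * sum(top) / len(top)

# Invented judgements, best-ranked candidate first.
ranked_judgements = [True, True, True, False, True, False, False, True]
p_top4 = precision_at(ranked_judgements, 4)
p_all = precision_at(ranked_judgements, 8)
```

Evaluating `precision_at` over a range of cut-offs gives a curve of the kind the slide describes, where smaller cut-offs tend to show higher precision when good terms are ranked first.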

  20. Content • Introduction • Resources, Tools and Corpora • Terminology Extraction (TE) • Evaluation of the TE procedure • Terminology-based Information Access • Conclusions

  21. Terminology-based Information Access Terminology Extraction in Information Retrieval provides: • At Indexing: to add poly-lexical terms to the indexes without the explosion of n-grams • Term browsing: to navigate through the terminology and access the documents from the terms (without the use of thesauri)

  22. Terminology-based Information Access • A difference from TE: terminology list truncation (since the query supplies the relevant terms, the task is now concerned with recall rather than precision of terms) • A new task: to retrieve terminology • Poly-lexical terms are retrieved from mono-lexical ones
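Retrieving poly-lexical terms from mono-lexical ones can be pictured as a lookup from component words to the phrases that contain them. The sketch below assumes the phrases and query words are already lemmatised; the function names and the ranking by number of matched words are hypothetical choices.

```python
from collections import defaultdict

def build_component_index(phrases):
    """Map each component word to the set of phrases that contain it."""
    index = defaultdict(set)
    for phrase in phrases:
        for word in phrase.split():
            index[word].add(phrase)
    return index

def retrieve_terms(index, query_words):
    """Return phrases containing any query word, most matched words first."""
    hits = defaultdict(int)
    for word in set(query_words):
        for phrase in index.get(word, ()):
            hits[phrase] += 1
    return sorted(hits, key=lambda p: (-hits[p], p))

# Invented terminology list.
term_index = build_component_index(
    ["educación a distancia", "educación primaria", "sistema operativo"]
)
retrieved = retrieve_terms(term_index, ["educación", "distancia"])
```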

  23. Terminology-based Information Access Terminology retrieval To bridge the gap between • Collection terminology • Query terms Requires • Query expansion • Query translation But this produces noise in the retrieval However, phrases provide an excellent way to reduce ambiguity (Ballesteros & Croft, 1998)

  24. Terminology-based Information Access Example: expansion and translation of the query “Tratados de Prohibición de Pruebas Nucleares” • Tratados: expansion: acuerdo, capitulación, concertación, convenio, cuidar, pacto, manejar, procesar; translation: accord, discourse, handle, manage, pact, process, treat, treatise, treaty • Prohibición: expansion: embargo, entredicho, interdicción, interdicto, proscripción; translation: ban, interdiction, prohibition, proscription • Pruebas: expansion: cata, catadura, degustación, ensayo, escandallo, experimento, gustación, muestreo, tanteo; translation: demonstrate, establish, exhibit, experiment, experimentation, fall, fitting, indicate, point, present, proof, prove, run, sample, sampling, shew, show, taste, test, trial, try • Nucleares: expansion: nuclear; translation: nuclear Which combination? • Nuclear test ban treaty? • Nuclear fitting interdiction manage? • Nuclear taste proscription process?
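The noise problem in this example, and the way phrases can reduce it, can be sketched as follows. Word-by-word translation yields every combination of alternatives; keeping only combinations that occur as known phrases discards the implausible ones. The alternatives are a small subset of those on the slide, and the one-entry terminology set and function names are hypothetical.

```python
from itertools import product

def candidate_phrases(alternatives):
    """alternatives: one list of candidate words per phrase position.

    Returns every combination, one word taken from each position.
    """
    return [" ".join(words) for words in product(*alternatives)]

def filter_by_terminology(candidates, terminology):
    """Keep only the combinations attested in the terminology list."""
    return [c for c in candidates if c in terminology]

# A subset of the translations shown on the slide, in English word order.
alternatives = [
    ["nuclear"],
    ["test", "fitting", "taste"],
    ["ban", "interdiction", "proscription"],
    ["treaty", "manage", "process"],
]
terminology = {"nuclear test ban treaty"}  # hypothetical terminology list

combinations = candidate_phrases(alternatives)
plausible = filter_by_terminology(combinations, terminology)
```

Even this reduced example produces 27 combinations, of which only one survives the phrase filter.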

  25. Content • Introduction • Resources, Tools and Corpora • Terminology Extraction (TE) • Evaluation of the TE procedure • Terminology-based Information Access • Conclusions

  26. Conclusions • Extraction of relevant terms in Spanish for the ETB project domain (primary and secondary school / new technologies) • Automatic process from free resources such as web pages • Exploring contexts and statistical data via Internet • Development of a search engine based on terminology extraction • Using terminology lists as an intermediate way between free searching and thesaurus-guided searching • Without needing thesaurus construction • Bridging the distance between the terms used in the query and the terminology used in the collection (even in different languages)

  27. Thanks for your attention
