Automatic multi-label subject indexing in a multilingual environment

Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy. Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany. ECDL 2003: Trondheim, Norway, 18th August 2003.

Presentation Transcript


  1. Automatic multi-label subject indexing in a multilingual environment. Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy; Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany. ECDL 2003: Trondheim, Norway, 18th August 2003

  2. Agenda
     Section bar: Introduction · Automatic Indexing · Evaluation · Outlook · Discussion
     • Introduction:
       • Subject Indexing
     • Automatic Indexing
       • Document representation model
       • Integration of background knowledge
     • Evaluation
       • Test document set
       • Results
     • Outlook
     • Questions and Discussion

  3. Subject Indexing
     • “Subject indexing is the act of describing a document (or any information resource) in terms of its subject content”
     • Purpose: facilitate high-precision retrieval of references on a particular subject
     • Full-text search: retrieval is based only on word occurrences in the text, which often leads to low-precision results

  4. Subject Indexing at the FAO
     Multilingual! Multiple labels! Professional indexer
     Controlled vocabulary:
     • RICE word tree: BT cereals; BT plant products; UF paddy; RT oryza; RT rice flour; RT rice straw
     • INDIA word tree: BT south asia; BT asia; NT andhra pradesh; NT arunachal pradesh; NT assam; NT bihar …
     Resources, metadata record:
     • Title: Indian rice production
     • Author: …
     • Subject: Rice flour, …
     • Geographic Cov.: Bihar …
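
The word trees on this slide use the usual thesaurus relation codes: BT (broader term), NT (narrower term), UF (used for) and RT (related term). As a minimal sketch, assuming a plain dictionary layout that is not taken from AGROVOC itself, the two entries could be held and queried as follows. `THESAURUS`, `related` and `resolve_preferred` are hypothetical names introduced only for illustration.

```python
# Hypothetical in-memory layout for the two thesaurus entries shown on the slide.
# Only the terms visible on the slide are included.
THESAURUS = {
    "rice": {
        "BT": ["cereals", "plant products"],
        "UF": ["paddy"],
        "RT": ["oryza", "rice flour", "rice straw"],
    },
    "india": {
        "BT": ["south asia", "asia"],
        "NT": ["andhra pradesh", "arunachal pradesh", "assam", "bihar"],
    },
}


def related(term, relation):
    """Return the terms linked to `term` by `relation` (empty list if none)."""
    return THESAURUS.get(term, {}).get(relation, [])


def resolve_preferred(word):
    """Map a word to its preferred descriptor, following UF entries."""
    for descriptor, relations in THESAURUS.items():
        if word == descriptor or word in relations.get("UF", []):
            return descriptor
    return None


print(related("rice", "BT"))
print(resolve_preferred("paddy"))
```

The UF lookup is the part that matters for indexing: a document that says "paddy" should end up under the descriptor "rice".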

  5. Subject Indexing at the FAO
     Large amounts of information:
     • Over 400,000 web pages
     • Numerous repositories of online publications
     • Bibliographical databases
     • Rapidly growing!
     Professional indexing:
     • Labor intensive
     • Expensive
     • Information grows faster than professional indexing is possible
     Need for automatic help in indexing and classification

  7. Automatic Text Categorization
     [Figure: documents handled by a human indexer; documents handled by an automatic classifier]

  8. Automatic Text Categorization
     [Figure: representation method (document, word vector); Support Vector Machines (SVM); pre-classified documents; automatic classifier; document]

  9. Automatic Text Categorization: Word Vector Representation
     Example text: "The rice production …… India … farmers grow … water irrigation … produce rice flour and … new production lines …"
     [Figure: document, word stemming, word vector]
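
A minimal sketch of the step from document to stemmed word vector. The stemmer here is a deliberately naive suffix stripper written for this example, not the stemmer used in the talk (which the slide does not name), and the sample sentence is only modelled on the slide's text fragment.

```python
import re
from collections import Counter

# Naive suffix list for illustration only; a real system would use a proper
# stemming algorithm such as Porter's.
SUFFIXES = ("ing", "ers", "er", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def word_vector(text):
    """Lower-case, tokenize on letters, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(token) for token in tokens)


doc = ("The rice production in India: farmers grow rice and produce "
       "rice flour on new production lines")
vector = word_vector(doc)
print(vector.most_common(3))
```

Stemming merges surface forms such as "farmers" and "farm" into one dimension, which keeps the vector smaller and the counts less sparse.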

  10. Automatic Text Categorization: Word Vector Processing
      [Figure: word vector, stopword removal, word vector, pruning, word vector]
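
The two processing steps can be sketched as below. The slide does not give the stopword list or the pruning criterion, so both are assumptions here: a tiny illustrative stopword set, and pruning by a minimum total count over the corpus (`min_total` is a hypothetical parameter).

```python
from collections import Counter

STOPWORDS = {"the", "and", "in", "of", "a", "to"}  # tiny illustrative list


def remove_stopwords(vector):
    """Drop very common function words from a word vector."""
    return Counter({t: c for t, c in vector.items() if t not in STOPWORDS})


def prune(vectors, min_total=2):
    """Drop terms whose total count over the whole corpus is below min_total."""
    totals = Counter()
    for vector in vectors:
        totals.update(vector)
    keep = {t for t, c in totals.items() if c >= min_total}
    return [Counter({t: c for t, c in v.items() if t in keep}) for v in vectors]


corpus = [
    Counter({"the": 4, "rice": 3, "production": 2, "farm": 1}),
    Counter({"the": 2, "rice": 1, "irrigation": 1, "and": 1}),
]
cleaned = prune([remove_stopwords(v) for v in corpus], min_total=2)
print(cleaned)
```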

  11. Automatic Text Categorization: Bag of Words Representation
      • Word vector of document 1
      • Weighting of word vectors with term frequency-inverse document frequency (tf-idf)
      • |D|: number of documents
      • df(t): number of documents in which the word t occurs
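
The weighting formula on this slide is not preserved in the transcript; only the definitions of |D| and df(t) are. The sketch below therefore assumes one common tf-idf variant, tfidf(d, t) = tf(d, t) · log(|D| / df(t)), which may differ from the variant used in the talk.

```python
import math
from collections import Counter


def tfidf(vectors):
    """Weight raw term counts with the assumed variant tf * log(|D| / df(t))."""
    n_docs = len(vectors)  # |D|
    df = Counter()         # df(t): number of documents containing term t
    for vector in vectors:
        df.update(vector.keys())
    return [
        {t: tf * math.log(n_docs / df[t]) for t, tf in vector.items()}
        for vector in vectors
    ]


corpus = [
    Counter({"rice": 3, "production": 2}),
    Counter({"rice": 1, "irrigation": 1}),
]
weighted = tfidf(corpus)
print(weighted)
```

A term that occurs in every document, such as "rice" in this toy corpus, gets weight 0 because it does not help to tell documents apart.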

  12. Automatic Text Categorization: Integration of Background Knowledge
      • Background knowledge represented in the form of an ontology O:
        • Set of concepts C
        • Concept hierarchy ≤C
        • Lexicon Lex
      • AGROVOC as ontology
      [Figure: concept hierarchy with the nodes Root, Plant products, Cereals, Rice, Rice flour (related to Rice), Asia, India, China; lexicon entries for Rice: EN: Rice, EN: paddy, FR: Riz, ES: Arroz]
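
The three parts of the ontology (concepts, hierarchy, lexicon) map naturally onto a small data structure. This is a sketch under assumptions: the `Ontology` class is hypothetical, and the parent links are one plausible reading of the slide's figure, whose edges are not preserved in the transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    parent: dict = field(default_factory=dict)   # concept -> direct superconcept
    lexicon: dict = field(default_factory=dict)  # (language, word) -> concept

    def superconcepts(self, concept, max_depth=None):
        """Walk upwards in the hierarchy, at most max_depth steps."""
        found = []
        while concept in self.parent and (max_depth is None or len(found) < max_depth):
            concept = self.parent[concept]
            found.append(concept)
        return found

    def lookup(self, language, word):
        """Return the concept a word refers to in a given language, if any."""
        return self.lexicon.get((language, word.lower()))


# Assumed reading of the figure: Root > Plant products > Cereals > Rice,
# and Root > Asia > India / China.
onto = Ontology(
    parent={
        "plant products": "root", "asia": "root",
        "cereals": "plant products", "rice": "cereals",
        "india": "asia", "china": "asia",
    },
    lexicon={
        ("en", "rice"): "rice", ("en", "paddy"): "rice",
        ("fr", "riz"): "rice", ("es", "arroz"): "rice",
    },
)
print(onto.lookup("fr", "Riz"), onto.superconcepts("rice"))
```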

  13. Automatic Text Categorization: Integration of Background Knowledge
      • Word vector with ontology integration
      • Parameter: maximum integration depth: 1
      • Integration strategy: add concepts!
      • Other strategies:
        • Replace
        • Only (document is represented only by its concepts, hence language independent!)
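
The three integration strategies can be sketched as follows. The slide only defines "only"; the readings used here for "add" (keep all words and add their concepts) and "replace" (swap a word for its concepts when it has any, keep it otherwise), and the interpretation of the depth parameter as the number of superconcept levels added, are assumptions. The concept identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical lexicon and hierarchy, for illustration only.
LEXICON = {"rice": "C_RICE", "paddy": "C_RICE", "riz": "C_RICE", "arroz": "C_RICE"}
PARENT = {"C_RICE": "C_CEREALS", "C_CEREALS": "C_PLANT_PRODUCTS"}


def concepts_for(word, max_depth):
    """Return the word's concept plus up to max_depth superconcepts."""
    concept = LEXICON.get(word)
    if concept is None:
        return []
    result = [concept]
    for _ in range(max_depth):
        concept = PARENT.get(concept)
        if concept is None:
            break
        result.append(concept)
    return result


def integrate(word_vector, strategy="add", max_depth=1):
    """Combine a word vector with ontology concepts using the given strategy."""
    if strategy not in ("add", "replace", "only"):
        raise ValueError(f"unknown strategy: {strategy}")
    out = Counter()
    for word, count in word_vector.items():
        concepts = concepts_for(word, max_depth)
        for concept in concepts:
            out[concept] += count
        keep_word = strategy == "add" or (strategy == "replace" and not concepts)
        if keep_word:
            out[word] += count
    return out


vector = Counter({"rice": 2, "paddy": 1, "farm": 1})
for name in ("add", "replace", "only"):
    print(name, dict(integrate(vector, name)))
```

With "only", the word "farm" disappears because the toy lexicon has no concept for it, which illustrates the information loss that a pure concept representation can cause.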

  14. Automatic Text Categorization: Binary Support Vector Machines
      [Figure: document word vectors of class c and class ĉ, separated by a maximum margin hyperplane]
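
To make the idea of a binary maximum-margin classifier concrete, here is a didactic soft-margin linear SVM trained by subgradient descent on the hinge loss. It is not the SVM implementation used in the talk (the slides do not name one); the function names, hyperparameters and the toy data are all assumptions.

```python
def train_linear_svm(samples, labels, epochs=200, lr=0.1, reg=0.01):
    """Subgradient descent on reg/2 * ||w||^2 + hinge loss.

    samples: list of equal-length feature lists.
    labels: +1 for class c, -1 for the complement class.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The regularization term shrinks w on every step.
            w = [wi - lr * reg * wi for wi in w]
            if margin < 1:  # inside the margin or misclassified
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Toy two-dimensional "word vectors": class +1 is heavy on feature 0.
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
print([predict(w, b, x) for x in X])
```

For multi-label indexing, one such binary classifier per class is a common arrangement, though the slides do not state how the multi-label case was set up.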

  16. Evaluation
      • Bag-of-words representation, training of SVM
      • Goal: to achieve the best possible approximation!
      [Figure: training documents, Support Vector Machines, test documents]

  17. Evaluation: Performance measures
      [Table with a "Class" column; contents not preserved]
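
The table on this slide is not preserved in the transcript, so the measures actually reported are unknown. As an assumption, the sketch below computes per-class precision, recall and F1, which are commonly used for text categorization; the toy labels are invented for illustration.

```python
def precision_recall_f1(true_labels, predicted_labels, cls):
    """Per-class scores; each argument holds one set of labels per document."""
    tp = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if cls in predicted and cls in truth:
            tp += 1
        elif cls in predicted:
            fp += 1
        elif cls in truth:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


truth = [{"rice", "india"}, {"rice"}, {"india"}, {"rice"}]
predicted = [{"rice"}, {"rice", "india"}, {"india"}, set()]
p, r, f = precision_recall_f1(truth, predicted, "rice")
print(p, r, f)
```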

  18. The test document set
      • FAO library catalogue: journals, proceedings, articles, many other resources
      • Indexed with keywords from AGROVOC, a multilingual thesaurus (> 16000 classes)
      • In 3 languages: English, French, Spanish
      • Requirement for test set: > 50 documents per class
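
The requirement of more than 50 documents per class can be sketched as a filter over the catalogue. The function names are hypothetical, and dropping documents that end up without any eligible label is an assumption about how the test set might be built, not something the slide states.

```python
from collections import Counter


def eligible_classes(doc_labels, min_docs=50):
    """Return the classes that label strictly more than min_docs documents."""
    counts = Counter(label for labels in doc_labels for label in set(labels))
    return {label for label, n in counts.items() if n > min_docs}


def restrict(doc_labels, classes):
    """Keep only eligible labels and drop documents left without any label."""
    kept = [[label for label in labels if label in classes] for labels in doc_labels]
    return [labels for labels in kept if labels]


# Toy catalogue: one label list per document.
catalogue = [["rice", "india"]] * 51 + [["rice"]] * 10 + [["assam"]] * 50
classes = eligible_classes(catalogue)
test_set = restrict(catalogue, classes)
print(sorted(classes), len(test_set))
```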

  19. Evaluation
      • 3 evaluation settings:
        • Single-label vs. multi-label classification
        • Language recognition (single-label case; the only label is the language of the document)
        • Integration of background knowledge for the single-label case

  20. Evaluation: Results. Single-label vs. multi-label classification

  21. Evaluation: Results. Integration of background knowledge
      • English document set
      • Single-label case only
      [Chart label: reference value (no integration)]

  22. Evaluation: Conclusion
      • Support vector machines behave robustly across different languages
      • Results are comparatively good considering the inconsistency among human indexers
      • Ontology integration offers promising possibilities for future work

  24. Outlook
      • Representing a document’s word vector only with its concepts found in the ontology: language-independent document representation!
      • Possibility to
        • train the SVM in one language only
        • classify documents in any language (provided by the multilingual ontology)
        • classify multilingual documents
      • Result: language-independent text classifier
      • Further investigation necessary on
        • performance loss in case of total concept representation
        • performance with other document sets
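
A minimal sketch of why a concept-only representation is language independent: documents in different languages collapse onto the same concept vector, so a classifier trained on one language can be applied to the others. The concept identifiers are hypothetical, and the lexicon entries for "India" extend the slide's Rice / Riz / Arroz example for illustration.

```python
import re
from collections import Counter

# Hypothetical multilingual lexicon: (language, word) -> concept identifier.
LEXICON = {
    ("en", "rice"): "C_RICE", ("fr", "riz"): "C_RICE", ("es", "arroz"): "C_RICE",
    ("en", "india"): "C_INDIA", ("fr", "inde"): "C_INDIA", ("es", "india"): "C_INDIA",
}


def concept_vector(text, language):
    """Represent a text only by the ontology concepts its words refer to."""
    words = re.findall(r"\w+", text.lower())
    return Counter(LEXICON[(language, w)] for w in words if (language, w) in LEXICON)


english = concept_vector("Rice production in India", "en")
french = concept_vector("Production de riz en Inde", "fr")
spanish = concept_vector("Producción de arroz en la India", "es")
print(english == french == spanish)
```

Words without a lexicon entry ("production" here) are dropped entirely, which is the performance risk the slide flags for total concept representation.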

  26. References
      • More on automatic classification: http://www.aifb.uni-karlsruhe.de/WBS/aho/
      • More on knowledge management: http://www.fzi.de/wim/index.html
      • More on ontologies and ontology engineering: http://kaon.semanticweb.org
      • More on FAO: AGROVOC online: http://www.fao.org/agrovoc; Waicent Portal: http://www.fao.org/waicent/index_en.asp
