Automatic multi-label subject indexing in a multilingual environment
Boris Lauser, Food and Agriculture Organization (FAO) of the UN, Rome, Italy
Andreas Hotho, University of Karlsruhe, Karlsruhe, Germany
ECDL 2003: Trondheim, Norway, 18th August 2003
Agenda
• Introduction
  • Subject Indexing
• Automatic Indexing
  • Document representation model
  • Integration of background knowledge
• Evaluation
  • Test document set
  • Results
• Outlook
• Questions and Discussion
Subject Indexing
• “Subject indexing is the act of describing a document (or any information resource) in terms of its subject content”
• Purpose: facilitate high-precision retrieval of references on a particular subject
Full text search
• Retrieval based only on word occurrences in the text often leads to low-precision results
Subject Indexing at the FAO
Multilingual! Multiple labels! Professional indexer
Controlled vocabulary resources
• RICE word tree
  • BT cereals
  • BT plant products
  • UF paddy
  • RT oryza
  • RT rice flour
  • RT rice straw
• INDIA word tree
  • BT south asia
  • BT asia
  • NT andhra pradesh
  • NT arunachal pradesh
  • NT assam
  • NT bihar
  • …
Metadata record
Title: Indian rice production
Author: …
Subject: Rice flour, …
Geographic Cov.: Bihar …
Subject Indexing at the FAO
Large amounts of information
• Over 400,000 web pages
• Numerous repositories of online publications
• Bibliographical databases
• Rapidly growing!
Professional indexing
• Labor intensive
• Expensive
• Information grows faster than it can be professionally indexed
Need for automatic help in indexing and classification
Automatic Text Categorization
[Figure: documents handled by a human indexer; documents handled by an automatic classifier]
Automatic Text Categorization
• Representation method: document word vector
• Support Vector Machines (SVM)
[Figure: pre-classified documents, automatic classifier, document]
Automatic Text Categorization
Word Vector Representation
• Example text: "The rice production … India … farmers grow … water irrigation … produce rice flour and … new production lines …"
• Word stemming
[Figure: document converted to word vector]
Automatic Text Categorization
Word Vector Processing
• Stopwords
• Pruning
[Figure: word vector after each processing step]
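The preprocessing steps named on the last two slides (word stemming, stopword removal, pruning) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the stopword list, the crude suffix-stripping `naive_stem`, and the `min_count` pruning threshold are all assumptions made up for this example. A real system would more likely use an established stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption, not from the slides).
STOPWORDS = {"the", "and", "a", "of", "in", "to"}

def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def word_vector(text):
    """Lowercase, tokenize, drop stopwords, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOPWORDS)

def prune(vectors, min_count=2):
    """Drop terms whose total count over the whole corpus is below min_count."""
    totals = Counter()
    for vec in vectors:
        totals.update(vec)
    return [
        Counter({t: c for t, c in vec.items() if totals[t] >= min_count})
        for vec in vectors
    ]

docs = [
    "The rice production in India",
    "Farmers grow rice",
    "Production lines produce rice flour",
]
vectors = prune([word_vector(d) for d in docs])
```

After pruning, only "rice" and "production" remain in this toy corpus, because every other stem occurs just once.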
Automatic Text Categorization
Bag of Words Representation
• Word vector of document 1
• Weighting of word vectors with term frequency – inverse document frequency (tf-idf)
  • |D|: number of documents
  • df(t): number of documents in which word t occurs
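The weighting formula itself did not survive in this transcript, only the definitions of |D| and df(t). The sketch below assumes one common tf-idf variant, log(1 + tf(d,t)) · log(|D| / df(t)); the slide may have used a different one. The function name `tfidf` and the toy vectors are hypothetical.

```python
import math
from collections import Counter

def tfidf(doc_vectors):
    """Weight raw term counts with an assumed tf-idf variant:
    log(1 + tf) * log(|D| / df(t))."""
    n_docs = len(doc_vectors)          # |D|
    df = Counter()                     # df(t): documents containing t
    for vec in doc_vectors:
        df.update(vec.keys())
    return [
        {t: math.log(1 + tf) * math.log(n_docs / df[t]) for t, tf in vec.items()}
        for vec in doc_vectors
    ]

weighted = tfidf([
    {"rice": 2, "india": 1},
    {"rice": 1, "water": 3},
])
```

A term that occurs in every document, such as "rice" here, gets weight zero, so it no longer helps to tell documents apart.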
Automatic Text Categorization
Integration of Background Knowledge
• Background knowledge is represented in the form of an ontology O:
  • Set of concepts C
  • Concept hierarchy ≤C
  • Lexicon Lex
AGROVOC as ontology
[Figure: concept tree with the labels Root, Plant products, Asia, Cereals, India, China, Rice, "related", Rice flour; lexicon entries EN: Rice, EN: paddy, FR: Riz, ES: Arroz]
Automatic Text Categorization
Integration of Background Knowledge
Word vector with ontology integration
• Parameter: maximum integration depth: 1
• Integration strategy: add concepts!
• Other strategies:
  • Replace
  • Only (document is represented only by its concepts: language independent!)
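The three integration strategies can be sketched with a toy ontology. Everything concrete here is an assumption for illustration: the `PARENT` hierarchy and `LEXICON` are a hand-made fragment loosely following the AGROVOC example, the `C:` prefix for concept features is invented, and "depth" is read as the number of superconcept levels added above the matched concept, which may differ from the authors' definition.

```python
from collections import Counter

# Hypothetical ontology fragment: concept -> superconcept.
PARENT = {
    "rice": "cereals",
    "cereals": "plant products",
    "plant products": "root",
    "india": "asia",
    "asia": "root",
}
# Hypothetical multilingual lexicon: term -> concept.
LEXICON = {"rice": "rice", "paddy": "rice", "riz": "rice",
           "arroz": "rice", "india": "india"}

def concepts_for(term, max_depth):
    """Concept for a term plus up to max_depth superconcepts."""
    found, concept, depth = [], LEXICON.get(term), 0
    while concept is not None and depth <= max_depth:
        found.append(concept)
        concept = PARENT.get(concept)
        depth += 1
    return found

def integrate(word_vector, strategy="add", max_depth=1):
    """Combine word counts with concept counts.
    add: words + concepts; replace: matched words become concepts;
    only: concepts alone."""
    if strategy not in ("add", "replace", "only"):
        raise ValueError("unknown strategy: " + strategy)
    result, matched = Counter(), set()
    for term, count in word_vector.items():
        for concept in concepts_for(term, max_depth):
            matched.add(term)
            result["C:" + concept] += count
    if strategy != "only":
        for term, count in word_vector.items():
            if strategy == "add" or term not in matched:
                result[term] += count
    return dict(result)

doc = {"paddy": 2, "farmer": 1}
added = integrate(doc, "add")
replaced = integrate(doc, "replace")
concepts_only = integrate(doc, "only")
```

With the "only" strategy, `integrate({"riz": 1}, "only", 0)` and `integrate({"arroz": 1}, "only", 0)` yield the same vector, which is what makes that representation language independent.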
Automatic Text Categorization
Binary Support Vector Machines
[Figure: document word vectors of class c and class ĉ, separated by a maximum margin hyperplane]
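To make the binary SVM idea concrete, here is a minimal linear SVM trained by subgradient descent on the regularized hinge loss. It is a teaching sketch under stated assumptions, not the SVM implementation used in the talk: the learning rate `eta`, regularization `lam`, epoch count, and the 2-D toy points are all invented, and a production system would use a dedicated SVM library.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, eta=0.1, epochs=100):
    """Subgradient descent on lam/2 * |w|^2 + hinge loss; labels are +1/-1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            # A point violates the margin if y * (w.x + b) < 1.
            violated = yi * (dot(w, xi) + b) < 1
            # Regularization step: shrink w towards zero.
            w = [(1 - eta * lam) * wj for wj in w]
            if violated:
                # Hinge-loss step: push the hyperplane away from the point.
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b

def predict(w, b, x):
    """+1 for class c, -1 for the complement class."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy, linearly separable "document vectors" (hypothetical data).
X = [[2, 1], [1, 2], [2, 2], [-2, -1], [-1, -2], [-2, -2]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
```

Only points that lie on or inside the margin trigger an update, which mirrors the role of support vectors in the maximum margin picture.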
Evaluation
• Bag of words representation, training of SVM
• Goal: to achieve the best possible approximation!
[Figure: training documents, Support Vector Machines, test documents]
Evaluation: Performance measures
[Figure: performance measures per class; details not preserved]
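The measures shown on this slide are not preserved in the transcript. Precision and recall per class are the usual choice for text categorization, so the sketch below assumes those; the function name and the toy label sets are hypothetical.

```python
def precision_recall(true_labels, predicted_labels, cls):
    """Per-class precision and recall over parallel lists of label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if cls in pred and cls in truth:
            tp += 1          # correctly assigned
        elif cls in pred:
            fp += 1          # assigned but wrong
        elif cls in truth:
            fn += 1          # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

truth = [{"rice", "india"}, {"rice"}, {"maize"}, {"india"}]
pred = [{"rice"}, {"rice", "india"}, {"rice"}, {"india"}]
rice_scores = precision_recall(truth, pred, "rice")
india_scores = precision_recall(truth, pred, "india")
```

Because documents carry sets of labels, the same function covers both the single-label and the multi-label setting.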
The test document set
FAO library catalogue
• Journals
• Proceedings
• Articles
• Many other resources
Indexed with keywords from AGROVOC, a multilingual thesaurus (> 16000 classes)
In 3 languages
• English
• French
• Spanish
Requirement for test set: > 50 documents per class
Evaluation
3 evaluation settings:
• Single-label vs. multi-label classification
• Language recognition (single-label case; the only label is the language of the document)
• Integration of background knowledge for the single-label case
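Since an SVM is a binary classifier, multi-label classification is commonly handled by training one binary classifier per class (one-vs-rest). The slides do not spell out the decomposition, so the following is a sketch under that assumption; the helper names and the score threshold are hypothetical.

```python
def one_vs_rest_targets(label_sets):
    """Turn multi-label annotations into one +1/-1 target list per class."""
    classes = sorted(set().union(*label_sets))
    return {
        cls: [1 if cls in labels else -1 for labels in label_sets]
        for cls in classes
    }

def assign_labels(scores, threshold=0.0):
    """Keep every class whose binary classifier score exceeds the threshold."""
    return {cls for cls, score in scores.items() if score > threshold}

label_sets = [{"rice", "india"}, {"rice"}, {"india"}]
targets = one_vs_rest_targets(label_sets)
labels = assign_labels({"rice": 0.7, "india": -0.2})
```

In the single-label case one would instead pick only the class with the highest score.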
Evaluation: Results
Single-label vs. multi-label classification
[Figure: results not preserved]
Evaluation: Results
Integration of background knowledge
• English document set
• Single-label case only
[Figure: results not preserved; includes a reference value (no integration)]
Evaluation: Conclusion
• Support vector machines behave robustly across different languages
• Results are comparatively good given the inconsistency among human indexers
• Ontology integration offers promising future possibilities
Outlook
Representing a document’s word vector only with its concepts found in the ontology! Language-independent document representation
• Possibility to
  • train SVM in one language only
  • classify documents in any language (provided by the multilingual ontology)
  • classify multilingual documents
Language-independent text classifier
• Further investigation necessary on
  • performance loss in case of total concept representation
  • performance with other document sets
References
• More on automatic classification: http://www.aifb.uni-karlsruhe.de/WBS/aho/
• More on knowledge management: http://www.fzi.de/wim/index.html
• More on ontologies and ontology engineering: http://kaon.semanticweb.org
• More on FAO
  • AGROVOC online: http://www.fao.org/agrovoc
  • Waicent Portal: http://www.fao.org/waicent/index_en.asp