Computer Aided Document Indexing System (CADIS) with Eurovoc

Bojana Dalbelo Bašić
Faculty of Electrical Engineering and Computing
University of Zagreb
bojana.dalbelo@fer.hr

Marko Tadić
Faculty of Humanities and Social Sciences
University of Zagreb
marko.tadic@ffzg.hr

Bruxelles, 2006-03-10

Project AIDE
• idea for a project
  • September 2004, conference at JRC, Ispra
• interdisciplinary collaboration of 3 institutions
  • Croatian Information Documentation Referral Agency (HIDRA)
  • Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb
  • Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

AIDE – collaborating institutions
• HIDRA
  • collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia
  • coordinator: Maja Cvitaš, M.A.
• ZEMRIS
  • research in the field of artificial intelligence, neural networks, machine learning, data and text mining
  • coordinators: prof. Bojana Dalbelo Bašić and Jan Šnajder
• ZZL
  • computational linguistic research and building language technologies for Croatian
  • coordinator: prof. Marko Tadić

AIDE – project objective
Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

AIDE – how?
• automatic indexing, how?
  • a program which "learns to index"
  • Joint Research Centre of the EC (JRC), Ispra, Italy
    • at least 10,000 manually indexed documents
    • 3-5 descriptors per document
    • 10-15 documents per descriptor
    • indexed documents stored in XML format
    • Steinberger (2003)
• compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors
• situation with Croatian documentation in 2004
  • only a few hundred documents had been indexed
  • manual indexing: painfully slow

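The corpus requirements listed on this slide (number of documents, descriptors per document, documents per descriptor) can be checked with a few counts. The following is only a minimal sketch: the function name `corpus_stats`, the dictionary layout and the toy descriptors are assumptions made for illustration, not part of CADIS or of the JRC tooling.

```python
from collections import Counter
from statistics import mean


def corpus_stats(indexed_docs):
    """Summarise a manually indexed corpus.

    indexed_docs maps a document id to the list of descriptors assigned
    to that document (hypothetical layout, chosen for this sketch).
    """
    docs_per_descriptor = Counter()
    for descriptors in indexed_docs.values():
        # Count each descriptor at most once per document.
        docs_per_descriptor.update(set(descriptors))
    return {
        "documents": len(indexed_docs),
        "mean_descriptors_per_document": mean(
            [len(set(d)) for d in indexed_docs.values()]
        ),
        "mean_documents_per_descriptor": mean(
            list(docs_per_descriptor.values())
        ),
    }


# Toy corpus with invented descriptors, far below the 10,000-document target.
toy = {
    "doc1": ["fisheries policy", "EU law", "trade agreement"],
    "doc2": ["EU law", "customs tariff", "trade agreement"],
    "doc3": ["fisheries policy", "EU law", "environmental protection"],
}
stats = corpus_stats(toy)
print(stats)
```

Comparing these figures with the targets on the slide shows how far a corpus is from being usable as training material.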
AIDE – how?
• how could we speed up the manual indexing?
• plan:
  • develop a workstation for computer aided document indexing
  • conduct research and development of algorithms in the field of computational linguistics/language technologies
  • build that knowledge into the workstation and turn it into a Computer Aided Document Indexing System (CADIS)

CADIS: two windows
• Eurovoc browser window
• Document window

Document Window

CADIS features
• Enhanced user interface
  • list of descriptors appearing in the document

CADIS features
• Descriptors and non-descriptors marked in the document

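Marking descriptors and non-descriptors in a document can be illustrated with plain term matching. This is a hedged sketch, not the CADIS implementation: `mark_terms` and the thesaurus entries below are invented for illustration, and the code matches surface forms only, so inflected forms (as in Croatian) would need the linguistic module mentioned elsewhere in these slides.

```python
import re


def mark_terms(text, descriptors, non_descriptors):
    """Find thesaurus terms in text.

    descriptors: iterable of preferred terms.
    non_descriptors: dict mapping a non-preferred term to its descriptor.
    Returns (start, end, matched_text, kind, descriptor) tuples in text order.
    """
    terms = {d.lower(): ("descriptor", d) for d in descriptors}
    terms.update(
        {n.lower(): ("non-descriptor", d) for n, d in non_descriptors.items()}
    )
    # Try longer terms first so multiword terms win over their prefixes.
    alternatives = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in alternatives) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for m in pattern.finditer(text):
        kind, descriptor = terms[m.group(1).lower()]
        hits.append((m.start(), m.end(), m.group(1), kind, descriptor))
    return hits


# Invented toy entries, not taken from the real Eurovoc thesaurus.
hits = mark_terms(
    "The fishing industry pays a customs tariff; fishing is regulated.",
    descriptors=["fishing industry", "customs tariff"],
    non_descriptors={"fishing": "fishing industry"},
)
for hit in hits:
    print(hit)
```

Each hit carries the descriptor it resolves to, which is what a list of "descriptors appearing in the document" would be built from.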
CADIS features
• Lists of n-grams

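A list of n-grams for a document can be produced by sliding a window of n tokens over the text and counting the results. The sketch below assumes a simple word tokenizer and lowercasing; the slides do not say how CADIS tokenizes or filters n-grams, and `ngram_counts` is a name chosen for this example.

```python
import re
from collections import Counter


def ngram_counts(text, n):
    """Count word n-grams in text (lowercased, punctuation ignored)."""
    tokens = re.findall(r"\w+", text.lower())
    # zip over n shifted copies of the token list yields each window of n tokens.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)


doc = "customs tariff rules apply; the customs tariff changes yearly"
bigrams = ngram_counts(doc, 2)
for gram, count in bigrams.most_common(3):
    print(count, gram)
```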
CADIS features
• Integration of corpus analysis
  • greyed n-grams are statistically relevant in the corpus

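The slides do not state which statistic CADIS uses to decide that an n-gram is relevant in the corpus. As a stand-in, the sketch below ranks a document's n-grams by a smoothed tf-idf score against a reference corpus; the scoring formula, the function names and the toy texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the word n-grams of text as space-joined strings."""
    tokens = re.findall(r"\w+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relevant_ngrams(document, corpus, n=2, top=3):
    """Rank the document's n-grams by smoothed tf-idf (assumed measure)."""
    # Document frequency: in how many corpus texts each n-gram occurs.
    df = Counter()
    for other in corpus:
        df.update(set(ngrams(other, n)))
    tf = Counter(ngrams(document, n))
    total = len(corpus)
    scores = {
        g: count * math.log((1 + total) / (1 + df[g]))
        for g, count in tf.items()
    }
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda g: (-scores[g], g))[:top]


corpus = [
    "the customs tariff is published",
    "the law is published yearly",
    "the budget is published",
]
document = "the customs union and the customs union treaty is published"
top = relevant_ngrams(document, corpus)
print(top)
```

N-grams that occur in every corpus text, such as "is published" here, score zero and drop out of the ranking, while n-grams frequent in the document but rare in the corpus rise to the top.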
CADIS features
• Manual marking of significant n-grams
  • an important step towards automatic indexing

Eurovoc browser window

Further development
• CADIS for other languages?
  • already available for Croatian and English
  • usable for other languages without the linguistic module
  • cooperation with the respective language technology experts is needed to develop a linguistic module for other languages
  • partners for EU project proposals for the next step
• AIDE
  • research on machine learning and text mining
  • use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc
  • establish a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

http://textmining.zemris.fer.hr

Conclusion
• CADIS is unique in Europe
• Web info at:
  • HIDRA: www.hidra.hr/hidra/aide/aide.htm
  • ZEMRIS: textmining.zemris.fer.hr
• for download contact: bojana.dalbelo@fer.hr