
Project AIDE

Computer Aided Document Indexing System (CADIS) with Eurovoc. Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo@fer.hr; Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic@ffzg.hr.


Presentation Transcript


  1. Computer Aided Document Indexing System (CADIS) with Eurovoc • Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo@fer.hr • Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic@ffzg.hr • Bruxelles, 2006-03-10

  2. Project AIDE • idea for a project • September 2004, conference at JRC, Ispra • interdisciplinary collaboration of 3 institutions • Croatian Information Documentation Referral Agency (HIDRA) • Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb • Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb

  3. AIDE – collaborating institutions • HIDRA • collecting, processing, providing public access to and promoting the official documentation of the Republic of Croatia • coordinator Maja Cvitaš, M.A. • ZEMRIS • research in the field of artificial intelligence, neural networks, machine learning, data and text mining • coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder • ZZL • computational linguistic research and building language technologies for Croatian • coordinator prof. Marko Tadić

  4. AIDE – project objective • Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus

  5. AIDE – how? • automatic indexing, how? • a program which “learns to index” • Joint Research Centre of the EC (JRC), Ispra, Italy • at least 10,000 manually indexed documents • 3-5 descriptors per document • 10-15 documents per descriptor • indexed documents stored in XML format • Steinberger (2003) • compiling a corpus of Croatian indexed documents for machine learning of automatic indexing with Eurovoc descriptors • situation with Croatian documentation in 2004: • only a few hundred documents were indexed • manual indexing: painfully slow
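
The corpus figures on slide 5 can be checked mechanically once indexed documents are available as XML. The following is a minimal sketch, not the project's tooling. The XML layout (`document` elements with `descriptor` children), the sample data, and the function name `corpus_stats` are assumptions for illustration; the slides only state that indexed documents are stored in XML. Reading "3-5 descriptors per document" and "10-15 documents per descriptor" as per-item thresholds is also an assumption.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical XML layout; the real schema is not given in the slides.
SAMPLE = """
<corpus>
  <document id="d1">
    <descriptor>fisheries policy</descriptor>
    <descriptor>EU law</descriptor>
    <descriptor>trade</descriptor>
  </document>
  <document id="d2">
    <descriptor>EU law</descriptor>
    <descriptor>trade</descriptor>
  </document>
</corpus>
"""

def corpus_stats(xml_text, min_desc=3, max_desc=5, min_docs_per_desc=10):
    """Report documents outside the descriptor range and thinly covered descriptors."""
    root = ET.fromstring(xml_text.strip())
    docs_per_descriptor = Counter()
    out_of_range = []
    n_docs = 0
    for doc in root.iter("document"):
        n_docs += 1
        # A set, so a descriptor repeated within one document counts once.
        descriptors = {d.text.strip() for d in doc.iter("descriptor") if d.text}
        docs_per_descriptor.update(descriptors)
        if not (min_desc <= len(descriptors) <= max_desc):
            out_of_range.append(doc.get("id"))
    rare = sorted(d for d, n in docs_per_descriptor.items() if n < min_docs_per_desc)
    return {"documents": n_docs, "out_of_range": out_of_range, "rare_descriptors": rare}

stats = corpus_stats(SAMPLE)
```

On a real corpus, `rare_descriptors` would list the descriptors that still need more manually indexed examples before training.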

  6. AIDE – how? • how could we speed up the manual indexing? • plan: • to develop a workstation for computer aided document indexing • to conduct research and development of algorithms in the field of computational linguistics/language technologies • to build that knowledge into the workstation and turn it into a Computer Aided Document Indexing System (CADIS)

  7. CADIS: two windows • Eurovoc browser window • Document window

  8. Document Window


  10. CADIS features • Enhanced user interface • list of descriptors appearing in document

  11. CADIS features • Descriptors and non-descriptors marked in document
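
Marking descriptors and non-descriptors in a document amounts to locating thesaurus terms in the text. The sketch below shows one simple way to do this with surface string matching. The mini-thesaurus, the sample sentence, and the function name `mark_terms` are hypothetical, and the slides do not describe how CADIS performs the matching.

```python
import re

# Hypothetical mini-thesaurus: each non-descriptor maps to its preferred descriptor.
DESCRIPTORS = {"fishing industry", "environmental protection"}
NON_DESCRIPTORS = {"fisheries": "fishing industry"}

def mark_terms(text):
    """Return (start, end, matched_text, kind, descriptor) for each thesaurus term found."""
    terms = {t: ("descriptor", t) for t in DESCRIPTORS}
    terms.update({t: ("non-descriptor", d) for t, d in NON_DESCRIPTORS.items()})
    # Longest terms first so multi-word terms win over shorter terms they contain.
    alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    hits = []
    for m in re.finditer(rf"\b(?:{alternatives})\b", text, flags=re.IGNORECASE):
        kind, descriptor = terms[m.group(0).lower()]
        hits.append((m.start(), m.end(), m.group(0), kind, descriptor))
    return hits

hits = mark_terms("Fisheries and environmental protection are regulated.")
```

Surface matching of this kind misses inflected forms, which matters for a morphologically rich language such as Croatian; handling them would need lemmatization or similar linguistic processing that this sketch does not attempt.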

  12. CADIS features • Lists of n-grams

  13. CADIS features • Integration of corpus analysis • greyed n-grams are statistically relevant in the corpus
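
The n-gram lists and the corpus-based relevance marking can be illustrated with a short sketch. The slides do not say which statistic CADIS uses, so the frequency ratio below is a stand-in chosen for simplicity. The function names `ngrams` and `relevant_ngrams` and the `min_ratio` threshold are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relevant_ngrams(doc_tokens, corpus_docs, n=2, min_ratio=2.0):
    """Rank document n-grams by relative frequency in the document vs. the corpus."""
    doc_counts = Counter(ngrams(doc_tokens, n))
    corpus_counts = Counter(g for d in corpus_docs for g in ngrams(d, n))
    doc_total = sum(doc_counts.values()) or 1
    corpus_total = sum(corpus_counts.values())
    scored = []
    for gram, count in doc_counts.items():
        doc_freq = count / doc_total
        # Add 1 to both counts so n-grams unseen in the corpus do not divide by zero.
        corpus_freq = (corpus_counts[gram] + 1) / (corpus_total + 1)
        ratio = doc_freq / corpus_freq
        if ratio >= min_ratio:
            scored.append((gram, round(ratio, 2)))
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Calling `relevant_ngrams(tokens, corpus)` with a tokenized document and a list of tokenized corpus documents returns the n-grams that are unusually frequent in that document. A real system would probably filter out function words and use a proper association or significance measure.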

  14. CADIS features • Manual marking of significant n-grams: an important step towards automatic indexing

  15. Eurovoc browser window

  16. Further development • CADIS for other languages? • already available for Croatian and English • usable for other languages without the linguistic module • cooperation needed with the respective language technology experts to develop linguistic modules for other languages • partners for EU project proposals for the next step • AIDE • research on machine learning and text mining • use that knowledge to turn the workstation into an intelligent system for Automatic Indexing of Documents with Eurovoc • establishing a publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia

  17. http://textmining.zemris.fer.hr

  18. Conclusion • CADIS is unique in Europe • Web info at: • HIDRA: www.hidra.hr/hidra/aide/aide.htm • ZEMRIS: textmining.zemris.fer.hr • for download contact: bojana.dalbelo@fer.hr

