Computer Aided Document Indexing System for Accessing Legislation
A Joint Venture of Flanders and Croatia
Bojana Dalbelo Bašić, Faculty of Electrical Engineering and Computing, University of Zagreb, bojana.dalbelo@fer.hr
Marko Tadić, Faculty of Humanities and Social Sciences, University of Zagreb, marko.tadic@ffzg.hr
Marie-Francine Moens, Centre for Law and IT / Dept. of Computer Science, Katholieke Universiteit Leuven, marie-francine.moens@law.kuleuven.ac.be
Leuven, 2007-05-22
Talk overview • document indexing and computer aided document indexing • project AIDE • CADIS workstation: features • project CADIAL • eCADIS workstation: additional features • machine learning techniques • future developments • conclusions
Computer Aided Document Indexing • document indexing • attachment of descriptors from a controlled thesaurus to a document • descriptors = labels representing the content of a document • necessary for document retrieval in many document collections • parliamentary documentation • legislation • technical documentation • … • usually done manually • tedious, error prone, slow (max. 30-40 documents/day) • could computers be of any help in this process? • if we build a Computer Aided Document Indexing System (CADIS)
Project AIDE in Croatia • idea for a project • September 2004 • interdisciplinary collaboration of 3 institutions • Croatian Information Documentation Referral Agency (HIDRA) • Department of Electronics, Microelectronics, Computer and Intelligent Systems (ZEMRIS), Faculty of Electrical Engineering and Computing, University of Zagreb • Institute of Linguistics (ZZL), Faculty of Humanities and Social Sciences, University of Zagreb
AIDE – collaborating institutions • HIDRA • collecting, processing, providing public access and promotion of the official documentation of the Republic of Croatia • coordinator Maja Cvitaš, M.A. • ZEMRIS • research in the field of artificial intelligence, neural networks, machine learning, data and text mining • coordinators prof. Bojana Dalbelo Bašić and Jan Šnajder, M.Sc. • ZZL • computational linguistic research and building language technologies for Croatian • coordinator prof. Marko Tadić
AIDE – project objective • Development of an intelligent system for automatic indexing of the official documentation of the Republic of Croatia with descriptors from the Eurovoc thesaurus
AIDE – how? • AIDE = Automatic Indexing of Documents with Eurovoc • automatic indexing, how? • program which “learns to index” documents • conference at the Joint Research Center of EC (JRC), Ispra, Italy, 2004-09 • at least 10,000 manually indexed documents • 3-5 descriptors per document • 10-15 documents per descriptor • indexed documents stored in XML format • Steinberger (2003) • compiling a corpus of Croatian manually indexed documents for machine learning of automatic indexing with Eurovoc descriptors • situation with Croatian documentation in 2004-09 • only a few hundred documents had been indexed • manual indexing: painfully slow • how could we speed up the manual indexing?
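The training-data requirements on this slide (descriptors per document, documents per descriptor) can be checked on a corpus with a few lines of code. The following is a minimal sketch, assuming the manually indexed corpus has already been loaded into a mapping from document id to a set of descriptors; the function name `corpus_stats` and the toy data are hypothetical and not part of AIDE.

```python
from collections import Counter

def corpus_stats(indexed_docs):
    """Return (number of documents, mean descriptors per document,
    mean documents per descriptor) for a manually indexed corpus.

    indexed_docs: dict mapping a document id to an iterable of descriptors
    (hypothetical in-memory form; the slides only say the data is XML).
    """
    docs_per_descriptor = Counter()
    total_assignments = 0
    for descriptors in indexed_docs.values():
        unique = set(descriptors)
        docs_per_descriptor.update(unique)
        total_assignments += len(unique)
    n_docs = len(indexed_docs)
    mean_desc_per_doc = total_assignments / n_docs if n_docs else 0.0
    mean_docs_per_desc = (
        total_assignments / len(docs_per_descriptor) if docs_per_descriptor else 0.0
    )
    return n_docs, mean_desc_per_doc, mean_docs_per_desc

# Toy corpus, invented for illustration only.
toy_corpus = {
    "doc1": {"fisheries policy", "EU law"},
    "doc2": {"EU law", "customs tariff", "import"},
    "doc3": {"EU law"},
}
stats = corpus_stats(toy_corpus)
```

Comparing the two means against the 3-5 and 10-15 ranges quoted above gives a quick indication of whether a corpus is large and dense enough for training.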
AIDE – activities • investigate and develop algorithms in the field of computational linguistics/language technologies • include that knowledge into the Computer Aided Document Indexing System (CADIS) • demonstration of CADIS in the European Parliament (2006-03-10)
CADIS: two parallel windows • Eurovoc browser window • Document window
Document Window
CADIS features • Enhanced user interface • list of descriptors literally appearing in the document
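Listing the descriptors that appear literally in a document amounts to a whole-phrase search of the text against the thesaurus. The following is a minimal sketch, assuming plain case-insensitive matching on word boundaries; the function name and sample data are hypothetical, and the slides do not describe how CADIS itself performs the match. For an inflected language such as Croatian, matching on lemmas rather than surface forms would presumably be needed.

```python
import re

def literal_descriptors(text, descriptors):
    """Return the descriptors whose exact wording occurs in the text.

    Matching is case-insensitive and restricted to whole words, so the
    descriptor "import" does not match inside "imports".
    """
    lowered = text.lower()
    found = []
    for descriptor in descriptors:
        # (?<!\w) and (?!\w) act as word boundaries that also work for
        # multi-word descriptors.
        pattern = r"(?<!\w)" + re.escape(descriptor.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            found.append(descriptor)
    return found

# Invented example data.
sample_text = "The new customs tariff applies to fish imports."
sample_descriptors = ["customs tariff", "import", "fish"]
matches = literal_descriptors(sample_text, sample_descriptors)
```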
CADIS features • Descriptors and non-descriptors marked in document
CADIS features • Lists of n-grams
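A list of word n-grams with their frequencies can be produced as follows. This is a minimal sketch with a naive tokenizer (lowercased runs of word characters); the tokenization and the function name are assumptions, not the CADIS implementation.

```python
import re
from collections import Counter

def word_ngrams(text, n):
    """Count the word n-grams of a text.

    Tokens are lowercased runs of word characters; a text shorter than
    n tokens yields an empty Counter.
    """
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Invented example sentence.
bigram_counts = word_ngrams("The customs tariff and the customs union", 2)
```

Sorting the result with `bigram_counts.most_common()` gives the kind of ranked list an indexer could scan for candidate terms.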
CADIS features • Integration of corpus analysis • greyed n-grams are statistically relevant in the corpus, i.e. collocations
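The slides do not say which statistic decides whether an n-gram is a collocation. One common choice is pointwise mutual information (PMI), which compares how often two words occur together with how often they would co-occur by chance. The following is a minimal sketch for bigrams under that assumption; the function name, the `min_count` threshold and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score bigrams by pointwise mutual information (in bits).

    PMI(a, b) = log2( P(a, b) / (P(a) * P(b)) ), with probabilities
    estimated from relative frequencies in the token list. Bigrams seen
    fewer than min_count times are skipped, because PMI is unreliable
    for rare events.
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (a, b), count in bigram_counts.items():
        if count < min_count:
            continue
        p_ab = count / n_bigrams
        p_a = unigram_counts[a] / n_unigrams
        p_b = unigram_counts[b] / n_unigrams
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

# Invented token sequence.
sample_tokens = ("european union law and european union policy "
                 "and national law").split()
pmi_scores = bigram_pmi(sample_tokens)
```

A bigram with a clearly positive score co-occurs more often than chance would predict and is a candidate collocation.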
CADIS features • Manual marking of significant n-grams • important step towards further refinement of automatic indexing
Eurovoc browser window
AIDE – activities • investigate and develop algorithms in the field of computational linguistics/language technologies • include that knowledge into the Computer Aided Document Indexing System (CADIS) • demonstration of CADIS in the European Parliament (2006-03-10) • ca 10,000 Croatian documents indexed in HIDRA using the CADIS workstation during 2006 • joint project proposal with Katholieke Universiteit Leuven for the CADIAL project
CADIAL project • Computer Aided Document Indexing for Accessing Legislation • a joint Flemish-Croatian project • Department International Flanders, grant no. KRO/009/06 • partners: • Katholieke Universiteit Leuven (prof. Marie-Francine Moens) • University of Zagreb, HIDRA (prof. Bojana Dalbelo Bašić) • started: 2007-03 • duration: 2 years • web: www.cadial.org • the goal: publicly accessible service for automatic indexing of the official documentation of the Republic of Croatia • new version of CADIS (eCADIS) is one of the modules in this project • planned as a web-based service
CADIAL project 2 • used the 10,000 manually indexed documents to train the system for automatic indexing of documents in Croatian • used the 20,000 manually indexed documents from Acquis to train the system for automatic indexing of documents in English • included that training data into the next version: eCADIS (-version)
eCADIS () features • Automatic suggestion of relevant descriptors, i.e. automatic indexing • application of machine learning techniques
eCADIS () features • Compare it to manually attached indexes…
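Comparing automatic suggestions with the manually attached descriptors is usually expressed as precision, recall and F1 over the two descriptor sets. The following is a minimal sketch; the slides do not state which measures the project used, and the function name and example descriptors are hypothetical.

```python
def compare_indexing(suggested, manual):
    """Return (precision, recall, F1) of suggested descriptors
    measured against the manually assigned ones."""
    suggested, manual = set(suggested), set(manual)
    hits = len(suggested & manual)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(manual) if manual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# Invented example: four suggestions, three manual descriptors, two in common.
precision, recall, f1 = compare_indexing(
    ["EU law", "fisheries policy", "import", "customs tariff"],
    ["EU law", "fisheries policy", "fishing quota"],
)
```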
eCADIS () features • Manual marking of inappropriate suggestions • another step in further refinement of automatic indexing
eCADIS () on document in English
eCADIS () on document in English • Automatic suggestion of relevant descriptors, i.e. automatic indexing
eCADIS () on document in English • Compare it to manually attached indexes…
Training the classifiers • already existing classifiers • profile classifier (Steinberger 2003) • K-nearest neighbours • binary classifiers • SVM, Logistic Regression, Rocchio, Bayes, … • classifiers used for the preliminary training • ca 3500 independent binary classifiers • need to be further evaluated • Logistic Regression used for 10,000 documents in Croatian • SVM used for 20,000 documents in English • features • tokens, lemmas, stems, character n-grams • various feature selection methods and their combinations: χ², ig, mi…
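With one independent binary classifier per descriptor, feature selection is done separately for each descriptor: every candidate feature is scored by how strongly its presence is associated with that descriptor, and only the top-scoring features are kept. The following is a minimal sketch using the chi-square statistic on a 2×2 contingency table, a standard choice in text categorization. The function names and toy data are hypothetical, and documents are assumed to be represented as sets of features (tokens, lemmas, stems or character n-grams).

```python
def chi_square(docs, labels, term):
    """Chi-square association between a term and one descriptor.

    docs:   list of feature sets, one per document
    labels: list of booleans, True if the descriptor is assigned
    Uses chi2 = N * (A*D - C*B)^2 / ((A+C) * (B+D) * (A+B) * (C+D)),
    where A/B count documents containing the term with/without the
    descriptor and C/D count documents lacking the term likewise.
    """
    a = b = c = d = 0
    for features, positive in zip(docs, labels):
        has_term = term in features
        if has_term and positive:
            a += 1
        elif has_term:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denominator if denominator else 0.0

def select_features(docs, labels, k):
    """Keep the k terms with the highest chi-square score."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda term: chi_square(docs, labels, term),
                    reverse=True)
    return ranked[:k]

# Toy data for a single descriptor, invented for illustration.
toy_docs = [{"fish", "quota"}, {"fish", "tariff"},
            {"tariff", "import"}, {"import", "quota"}]
toy_labels = [True, True, False, False]
top_terms = select_features(toy_docs, toy_labels, 2)
```

Note that chi-square rewards negative association as well: in the toy data "import" scores as high as "fish" because it perfectly predicts the absence of the descriptor.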
Further development of eCADIS • training with new features and feature selection methods • collocations, word n-grams, chunks • new measures for evaluation of results • sensitive to thesaurus hierarchy • web-interface for eCADIS for inclusion into the CADIAL system • eCADIS for other languages • now only Croatian and English (-version) covered • usable for other languages as it is, but less efficient without the linguistic module • no list of lemmas, only types • poor statistics for n-grams • cooperation with language technology experts in different languages for development of linguistic modules
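An evaluation measure that is sensitive to the thesaurus hierarchy could, for example, give partial credit when a suggested descriptor is the parent or a child of a manually assigned one instead of counting it as a plain error. The following is only an illustrative sketch of that idea: the slides do not define the planned measures, and the function name, the `partial` weight of 0.5 and the toy hierarchy are all assumptions.

```python
def hierarchical_precision(suggested, manual, parent, partial=0.5):
    """Precision with partial credit for near misses in the hierarchy.

    parent maps a descriptor to its broader term. A suggestion scores
    1.0 if it was manually assigned, `partial` if it is the parent or
    a child of a manually assigned descriptor, and 0.0 otherwise.
    """
    suggested, manual = set(suggested), set(manual)
    if not suggested:
        return 0.0

    def is_neighbour(descriptor):
        broader_is_manual = parent.get(descriptor) in manual
        narrower_is_manual = any(parent.get(m) == descriptor for m in manual)
        return broader_is_manual or narrower_is_manual

    credit = 0.0
    for descriptor in suggested:
        if descriptor in manual:
            credit += 1.0
        elif is_neighbour(descriptor):
            credit += partial
    return credit / len(suggested)

# Toy hierarchy, invented for illustration (not actual Eurovoc structure).
toy_parent = {"sea fishing": "fisheries",
              "fish farming": "fisheries",
              "fisheries": "agriculture"}
score = hierarchical_precision(["fisheries", "customs tariff"],
                               ["sea fishing"], toy_parent)
```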
Further development of eCADIS • … eCADIS for other languages • training the automatic indexing system for other languages • enables automatic suggestions of relevant descriptors in new, unseen documents • analysis of manual markings • descriptors, word n-grams, suggestions • promote the use of eCADIS in other countries beyond the scope of the CADIAL project • e.g. Belgium (Flanders) • linguistic module for Dutch and French needed • computational linguistics expertise • training data from Acquis can be used to make an automatic indexing system for Dutch and French • machine learning expertise
Conclusion • CADIAL • a joint Flemish-Croatian project sponsored by the Flemish government • better public access to Croatian official documentation • faster and improved document indexing • automatic content metadata generation (Semantic Web) • easier document retrieval and exploration of legislation • multilingual access via the standardized EU thesaurus Eurovoc • a test-case for the usage of such a system in Flanders • Web information on the CADIAL project and eCADIS • www.cadial.org • contact: • bojana.dalbelo@fer.hr • marie-france.moens@law.kuleuven.ac.be