AGORA2007 : Adaptative software for layout analysis of document images Contact : JY Ramel (ramel@univ-tours.fr) N. Journet (journet@univ-tours.fr)
Context of this work
• With the CESR during the BVH project
• During French national research projects
• Construction of automatic indexing tools
• Adapted to degraded documents
• For segmentation (text/graphics separation)
• For automatic information retrieval
• For text transcription
• Automatic meta-data extraction and indexing
Meta-data production
• Manually (CESR website)
  • Author, year, editor…
  • Semantic description of the graphical parts (Iconclass)
  • Specific meta-data for experts
  • Keywords
• Automatically with AGORA (AGORA software)
  • Positions of EoC: dropcaps, portraits, paragraphs, …
  • Transcription of the text parts
  • Specific information about EoC
• EoC = Element of Content
First step: EoC separation
• 3 types of EoC (figure labels: Text, Image, Noise)
Second step: Recognition of EoC
• Figure labels: Title margin, Noise, Caption, Labels, Text, Dropcap
AGORA: Interactive extraction of EoC
• The user gives information:
  • about the size of characters
  • about the size of graphics
  • about the spaces between letters, between words, …
• Proposition of a text/graphics segmentation
AGORA: Interactive recognition of EoC
• How can a label be associated with each EoC?
• It is impossible to foresee all user needs
• AGORA needs the user to teach it how to recognize the desired EoC
• AGORA philosophy:
  • The user shows examples of EoC, which drives the interactive construction of recognition scenarios
  • A scenario is a set of extraction rules
  • Manual modification of a scenario is still possible
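The idea that a scenario is a set of extraction rules can be sketched as a small data structure. This is only an illustration under assumptions: the feature names (`top`, `left`), the interval form of each rule, the label `Dropcap`, and the first-match policy are all hypothetical and are not taken from AGORA itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Rule:
    """One extraction rule: a feature must fall inside [low, high] (assumed form)."""
    feature: str  # hypothetical feature name, e.g. "top" or "left"
    low: float
    high: float

    def matches(self, block: Dict[str, float]) -> bool:
        value = block.get(self.feature)
        return value is not None and self.low <= value <= self.high


@dataclass
class LabelRules:
    """All rules that a block must satisfy to receive a given label."""
    label: str
    rules: List[Rule]


def apply_scenario(scenario: List[LabelRules], block: Dict[str, float]) -> Optional[str]:
    """Return the first label whose rules all match, or None (assumed policy)."""
    for entry in scenario:
        if all(rule.matches(block) for rule in entry.rules):
            return entry.label
    return None


# Hypothetical scenario with a single label defined by two rules.
scenario = [
    LabelRules("Dropcap", [Rule("top", 0.0, 0.3), Rule("left", 0.0, 0.2)]),
]
```

With this sketch, `apply_scenario(scenario, {"top": 0.1, "left": 0.1})` yields the `Dropcap` label, while a block that violates either rule is left unlabeled. Editing the `low` and `high` bounds by hand corresponds to the manual modification of a scenario mentioned above.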
Global methodology
• Example selection: as many examples as necessary (4 examples selected in 3 images)
• Automatic insertion of the rules in the scenario
• Preview
• Obtained results
• Manual modification of the rules
Automatic creation of recognition rules
• Features shown: distance from the top, distance from the left
• Statistics shown: avg = 0.46, std = 0.41; avg = 0.51, std = 0.07
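One way to picture how an average and a standard deviation computed over the selected examples could become a rule is sketched below. The slide only shows the avg and std values; the acceptance interval `avg ± k * std`, the multiplier `k`, and the function names are assumptions made for illustration, not AGORA's documented behaviour.

```python
from statistics import mean, pstdev
from typing import Dict, Iterable


def learn_rule(values: Iterable[float], k: float = 2.0) -> Dict[str, float]:
    """Summarize example feature values as avg, std and an assumed interval."""
    samples = list(values)
    if not samples:
        raise ValueError("at least one example is required")
    avg = mean(samples)
    std = pstdev(samples)  # population standard deviation of the examples
    # Assumption: a block is accepted when its value lies within k std of avg.
    return {"avg": avg, "std": std, "low": avg - k * std, "high": avg + k * std}


def accepts(rule: Dict[str, float], value: float) -> bool:
    """Check whether a feature value falls inside the learned interval."""
    return rule["low"] <= value <= rule["high"]


# Hypothetical normalized "distance from the left" of two selected examples.
rule = learn_rule([0.4, 0.6], k=1.0)
```

Under these assumptions a small standard deviation gives a tight, discriminative rule, while a large one gives an interval so wide that the feature barely constrains the result.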
Manual modification of the rules (figure label: ERROR)
AGORA outputs
• Set of EoC (here EoC = TEXT)
• 1 XML file for each EoC
• Information about the position and the original image corresponding to the block
• Information about the lines in a text block
• Information about the words in a line
• Information about the characters in the words
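The block, line, word, character hierarchy described above could be read back along the following lines. The actual AGORA XML schema is not given in the slides, so every element name (`block`, `line`, `word`, `char`), every attribute (`label`, `image`, `x`, `y`, `w`, `h`), and the sample file content are hypothetical placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one TEXT block with one line, two words, three characters.
SAMPLE = """\
<block label="TEXT" image="page.png" x="10" y="20" w="300" h="40">
  <line x="10" y="20" w="300" h="40">
    <word x="10" y="20" w="60" h="40">
      <char x="10" y="20" w="30" h="40"/>
      <char x="40" y="20" w="30" h="40"/>
    </word>
    <word x="80" y="20" w="30" h="40">
      <char x="80" y="20" w="30" h="40"/>
    </word>
  </line>
</block>
"""


def summarize_block(xml_text: str) -> dict:
    """Count lines, words and characters in one (assumed) EoC description."""
    root = ET.fromstring(xml_text)
    return {
        "label": root.get("label"),
        "image": root.get("image"),
        "lines": len(root.findall("line")),
        "words": len(root.findall("line/word")),
        "chars": len(root.findall("line/word/char")),
    }


summary = summarize_block(SAMPLE)
```

Keeping one file per EoC, as the slide describes, means each file can be parsed independently of the rest of the page.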
AGORA outputs
• Extracted labels
• Extracted images
Strengths of AGORA2007
• An assistant guides the user through the different steps
• Automatic processing of complete works (using a scenario)
• Constant visualisation of processing results
• Easy to use (no specific knowledge is necessary)