Simultaneous detection of vertical and horizontal text lines based on perceptual organisation
Claudie Faure, CNRS-LTCI, TELECOM-ParisTech
Nicole Vincent, CRIP5, Université Paris Descartes
Context
• Historical Medical Digital Library: Medic@ (BIUM: Bibliothèque InterUniversitaire Médicale) http://www.bium.univ-paris5.fr/histmed/medica.htm
• Document image analysis
• Information search
• Visualisation of the collections
The readers’ needs
• Find documents
• Find information in the documents
• Textual information
• Visual information (illustrations, decoration, drop caps, …)
Figure & Caption detection
Detection of vertical and horizontal text lines
Origins of the method:
• The caption lines
• Perceptual grouping
Preprocessing
• Web image
• Binarisation
• Connected components
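The preprocessing chain can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the slides do not say how the image is binarised or which connectivity is used, so the fixed threshold of 128, the 4-connectivity, and the names `binarise` and `connected_components` are all hypothetical.

```python
from collections import deque

def binarise(gray, threshold=128):
    """Return a 0/1 grid where 1 marks dark (ink) pixels. The threshold is an assumed value."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def connected_components(binary):
    """Find 4-connected ink regions and return their bounding boxes (x0, y0, x1, y1)."""
    height = len(binary)
    width = len(binary[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited ink pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:
                cx, cy = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Tiny grey-level image with two separate dark strokes.
gray = [
    [255,   0, 255, 255],
    [255,   0, 255,  10],
    [255, 255, 255,  10],
]
boxes = connected_components(binarise(gray))
```

Only the bounding boxes are kept here because the later grouping steps, as described in the slides, reason about the size and position of each connected component (CC).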
Graphics segmentation • Size: graphics (CCG) • Shape: rules, frames • Location: merge CCG • Text components (?)
Labelled connected components (1)
• Labels: NNE, NNS
• Grouping by proximity (Ex. 1, Ex. 2)
Labelled connected components (2)
• Complementary labels:
  • No East neighbour
  • No South neighbour
  • Dot
  • Nearest neighbour of several CCs
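The labelling step might look like the sketch below. It assumes that NNE and NNS stand for the nearest neighbour to the East and to the South, which the slides suggest but do not spell out. The overlap requirement, the gap measure, and the function names are hypothetical choices made for the example.

```python
def overlap(a0, a1, b0, b1):
    """True if the closed intervals [a0, a1] and [b0, b1] intersect."""
    return a0 <= b1 and b0 <= a1

def nearest_east(i, boxes):
    """Index of the closest box lying to the right of boxes[i] and overlapping it vertically."""
    x0, y0, x1, y1 = boxes[i]
    best, best_gap = None, None
    for j, (bx0, by0, bx1, by1) in enumerate(boxes):
        if j == i or bx0 <= x1 or not overlap(y0, y1, by0, by1):
            continue
        gap = bx0 - x1  # horizontal white space between the two boxes
        if best_gap is None or gap < best_gap:
            best, best_gap = j, gap
    return best

def nearest_south(i, boxes):
    """Index of the closest box lying below boxes[i] and overlapping it horizontally."""
    x0, y0, x1, y1 = boxes[i]
    best, best_gap = None, None
    for j, (bx0, by0, bx1, by1) in enumerate(boxes):
        if j == i or by0 <= y1 or not overlap(x0, x1, bx0, bx1):
            continue
        gap = by0 - y1  # vertical white space between the two boxes
        if best_gap is None or gap < best_gap:
            best, best_gap = j, gap
    return best

def label_components(boxes):
    """Attach NNE and NNS labels to every box; None encodes 'no East/South neighbour'."""
    return [{"NNE": nearest_east(i, boxes), "NNS": nearest_south(i, boxes)}
            for i in range(len(boxes))]

# Boxes are (x0, y0, x1, y1) with y growing downwards.
boxes = [(0, 0, 4, 9), (7, 0, 11, 9), (0, 14, 4, 23), (30, 0, 34, 9)]
labels = label_components(boxes)
```

In this sketch the complementary labels "No East neighbour" and "No South neighbour" correspond to a `None` value. The "Dot" and "Nearest neighbour of several CCs" labels are not modelled.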
Grouping by proximity and continuity of direction
1. Creating alignments: CC labels
2. Expanding alignments
3. Merging alignments
Incremental grouping process:
• Easier to control
• No jump from local CCs to text lines
• Several levels of decision
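As a rough illustration of step 1 only, horizontal alignments could be created by chaining CCs through their East-neighbour labels. The slides do not give the actual rule, so the chaining criterion, the input format (a list where entry `i` is the index of the East neighbour of CC `i`, or `None`), and the name `create_alignments` are assumptions. Expanding and merging are not shown.

```python
def create_alignments(nne):
    """Chain CCs through East-neighbour links and return chains of at least two CCs."""
    # CCs that are the East neighbour of some other CC cannot start a chain.
    has_west = {j for j in nne if j is not None}
    alignments = []
    for start in range(len(nne)):
        if start in has_west:
            continue
        chain, visited = [], set()
        current = start
        # Follow the links eastwards; the visited set guards against malformed cyclic input.
        while current is not None and current not in visited:
            chain.append(current)
            visited.add(current)
            current = nne[current]
        if len(chain) > 1:
            alignments.append(chain)
    return alignments

# CC 0 -> CC 1 -> CC 3 form one chain; CC 2 is isolated.
alignments = create_alignments([1, 3, None, None])
```

A CC that is the nearest neighbour of several CCs ends up in several chains here. Deciding between such chains would belong to a later stage of the incremental process.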
Conflicts
• ConflitHV: CC.Vline ≠ null AND CC.Hline ≠ null (the CC belongs to both a vertical and a horizontal line)
• Vline is eliminated if: #CCs in Vline (6) < #CCs in Hlines (13)
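The elimination rule can be sketched as below. It assumes that a conflict arises when a CC belongs to both a vertical and a horizontal line, and that "#CCs in Hlines" is the summed size of all horizontal lines sharing a CC with the vertical line. Both readings, and the name `eliminate_conflicting_vlines`, are assumptions. The slide does not say what happens when the vertical line is the larger one, so it is simply kept.

```python
def eliminate_conflicting_vlines(hlines, vlines):
    """Drop each vertical line that has fewer CCs than the horizontal lines it conflicts with."""
    kept = []
    for vline in vlines:
        members = set(vline)
        # Horizontal lines sharing at least one CC with this vertical line.
        rivals = [h for h in hlines if members & set(h)]
        rival_size = sum(len(h) for h in rivals)
        if rivals and len(vline) < rival_size:
            continue  # the vertical line loses the conflict
        kept.append(vline)
    return kept

# Same counts as on the slide: a 6-CC vertical line crossing horizontal lines totalling 13 CCs.
hlines = [[0, 1, 2, 3, 4, 5, 6], [10, 11, 12, 13, 14, 15]]
vlines = [[0, 10, 20, 30, 40, 50], [100, 101, 102]]
surviving = eliminate_conflicting_vlines(hlines, vlines)
```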
Typographic conditions
• Word spacing / character size: D > 2 * lineHeight
• Continuity: h2 < 1.5 * h1 (versus h2 > 1.5 * h1)
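The two thresholds can be read as predicates that gate the grouping. In this sketch, `gap` plays the role of D, and a gap above 2 * lineHeight or a height h2 above 1.5 * h1 is assumed to block the grouping. The slide only shows the inequalities, so that interpretation, the treatment of the equality cases, and the function names are assumptions.

```python
def spacing_ok(gap, line_height):
    """Word-spacing condition: the gap D must not exceed twice the line height."""
    return gap <= 2 * line_height

def continuity_ok(h1, h2):
    """Continuity condition: the next height h2 must stay below 1.5 times h1."""
    return h2 < 1.5 * h1

def can_group(gap, line_height, h1, h2):
    """Both typographic conditions must hold for two elements to be grouped."""
    return spacing_ok(gap, line_height) and continuity_ok(h1, h2)

print(can_group(gap=15, line_height=10, h1=10, h2=12))  # prints True
print(can_group(gap=25, line_height=10, h1=10, h2=12))  # prints False: gap too wide
print(can_group(gap=15, line_height=10, h1=10, h2=16))  # prints False: height jump
```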
Separators
• Text lines do not straddle separators
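One way to enforce this constraint is to refuse a merge whenever the bounding box of the merged line would intersect a separator. This is a sketch under that assumption. The box format (x0, y0, x1, y1) and the name `straddles` are hypothetical, and the slides do not say how separators are represented.

```python
def straddles(box_a, box_b, separator):
    """True if the bounding box enclosing box_a and box_b intersects the separator box."""
    x0, y0 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    x1, y1 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    sx0, sy0, sx1, sy1 = separator
    return x0 <= sx1 and sx0 <= x1 and y0 <= sy1 and sy0 <= y1

left = (0, 0, 40, 10)
right = (60, 0, 100, 10)
rule_between = (50, -20, 51, 30)   # vertical rule lying between the two boxes
rule_outside = (120, -20, 121, 30)  # vertical rule to the right of both boxes

print(straddles(left, right, rule_between))  # prints True: the merge is rejected
print(straddles(left, right, rule_outside))  # prints False: the merge is allowed
```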
Caption lines
• For each side of a Figure: the nearest text line
• Confidence of text lines:
  • +1 for the closest line to the Figure
  • +1 if the Figure and the text line are centred
• Caption line candidates: confidence > 0
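The scoring can be sketched as follows. The side assignment, the edge-to-edge distance, the centring tolerance `tol`, and all function names are assumptions made for the example. The slides only state the two +1 rules and the threshold confidence > 0.

```python
def side_and_distance(fig, line):
    """Return (side, distance) of a text line relative to the figure, or None if they overlap."""
    fx0, fy0, fx1, fy1 = fig
    lx0, ly0, lx1, ly1 = line
    if ly0 > fy1:
        return "below", ly0 - fy1
    if ly1 < fy0:
        return "above", fy0 - ly1
    if lx0 > fx1:
        return "right", lx0 - fx1
    if lx1 < fx0:
        return "left", fx0 - lx1
    return None

def is_centred(fig, line, side, tol):
    """Compare centres along the axis parallel to the shared side, within an assumed tolerance."""
    if side in ("above", "below"):
        return abs((fig[0] + fig[2]) / 2 - (line[0] + line[2]) / 2) <= tol
    return abs((fig[1] + fig[3]) / 2 - (line[1] + line[3]) / 2) <= tol

def caption_candidates(fig, lines, tol=5):
    """Map line index -> confidence for the nearest line on each side, keeping confidence > 0."""
    nearest = {}  # side -> (distance, line index)
    for idx, line in enumerate(lines):
        placed = side_and_distance(fig, line)
        if placed is None:
            continue
        side, dist = placed
        if side not in nearest or dist < nearest[side][0]:
            nearest[side] = (dist, idx)
    if not nearest:
        return {}
    closest = min(dist for dist, _ in nearest.values())
    scores = {}
    for side, (dist, idx) in nearest.items():
        confidence = 0
        if dist == closest:
            confidence += 1  # closest line to the figure over all sides
        if is_centred(fig, lines[idx], side, tol):
            confidence += 1  # figure and text line are centred
        if confidence > 0:
            scores[idx] = confidence
    return scores

fig = (100, 100, 200, 200)
lines = [
    (110, 210, 190, 220),  # just below the figure and centred
    (20, 30, 90, 40),      # above, far away, off-centre
    (215, 100, 225, 200),  # vertical line on the right, centred
    (110, 240, 190, 250),  # below, but not the nearest on that side
]
candidates = caption_candidates(fig, lines)
```

With these boxes, the line directly below the figure scores 2, the vertical line on the right scores 1, and the other two lines are not retained.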
Results
• 52 pages with vertical and horizontal lines
• Web images (different sizes and resolutions) from Medic@
• 22 books (19th century)
• First caption lines (102): 31 horizontal lines, 71 vertical lines
Conclusion
• How do readers detect text lines?
• Perceptually-based method
• Reliable results
• Material for further investigations
• How do readers associate Figure and Caption?
• Spatial reasoning
• Visual contrast
• Word spotting (« Fig---- »)