Text detection in images taken from a video stream for semantic indexing (Détection des textes dans les images issues d'un flux vidéo pour l'indexation sémantique)

5 December 2003
Christian Wolf, http://rfv.insa-lyon.fr/~wolf
Thesis advisor: Jean-Michel Jolion
Laboratoire d'Informatique en Images et Systèmes d'information (LIRIS), FRE 2672 CNRS
Bât. Jules Verne, INSA de Lyon, 69621 Villeurbanne cedex

The framework of the thesis
• 2 industrial contracts with France Télécom: ECAV I and ECAV II (“Enrichissement du Contenu Audio-Visuel”)
• Collaboration with the Language and Media Processing Laboratory, University of Maryland
• 2 research internships: 2001, character segmentation; 2002, video indexing (TREC)

Indexing using text
[Figure: in the indexing phase, text detected in video frames (e.g. “Patrick Mayhew”, “Min. chargé de l'Irlande du Nord”, “ISRAEL Jerusalem”, “montage T. Nouel”) is stored as keywords; a keyword-based search for “Patrick Mayhew” returns the matching result.]

Plan
• Introduction
• Detection in still images
• Detection in video sequences
• Character segmentation
• Experimental results
• Conclusion
Introduction Still images Videos Character segmentation Results Conclusion

Videos vs. scanned documents
• Temporal aspects
• Complex and moving background
• Artificial shadows
• Low resolution

What is text? - character segmentation
• Scene text
• Artificial text

What is text? - texture
Example: Gabor energy features on a text image
[Figure panels: original image; filter tuned to the example text; Gabor energy; thresholded Gabor energy]
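As an illustration of the Gabor energy feature named on this slide, here is a minimal pure-Python sketch. It assumes the usual definition (the magnitude of the responses of an even and an odd Gabor filter); the function names, kernel size, wavelength and sigma are hypothetical choices for the example, not parameters from the thesis.

```python
import math

def gabor_pair(size, wavelength, sigma, theta=0.0):
    """Build an even (cosine) and an odd (sine) Gabor kernel; size must be odd."""
    half = size // 2
    even, odd = [], []
    for y in range(-half, half + 1):
        even_row, odd_row = [], []
        for x in range(-half, half + 1):
            # Coordinate along the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            envelope = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            phase = 2.0 * math.pi * xr / wavelength
            even_row.append(envelope * math.cos(phase))
            odd_row.append(envelope * math.sin(phase))
        even.append(even_row)
        odd.append(odd_row)
    # Remove the mean of the even kernel so that flat regions give no response.
    mean = sum(sum(row) for row in even) / (size * size)
    even = [[v - mean for v in row] for row in even]
    return even, odd

def gabor_energy_at(img, cy, cx, even, odd):
    """Gabor energy at pixel (cy, cx): sqrt(even_response^2 + odd_response^2)."""
    half = len(even) // 2
    e = o = 0.0
    for dy in range(-half, half + 1):
        for dx in range(-half, half + 1):
            p = img[cy + dy][cx + dx]
            e += p * even[dy + half][dx + half]
            o += p * odd[dy + half][dx + half]
    return math.sqrt(e * e + o * o)

# Vertical stripes with a period of 4 pixels stand in for text strokes.
striped = [[255 if (x // 2) % 2 else 0 for x in range(11)] for _ in range(11)]
flat = [[128] * 11 for _ in range(11)]
even, odd = gabor_pair(size=7, wavelength=4.0, sigma=2.0)
```

A filter tuned to the stroke period responds strongly on the striped patch and stays near zero on the flat one, which is what makes the thresholded energy usable as a text indicator.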

What is text? - contrast & geometry
[Figure panels: example image; accumulated horizontal Sobel edges]
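The accumulated horizontal Sobel edges can be sketched as follows. This is an assumption-laden illustration: it takes the absolute response of the 3x3 Sobel kernel for the horizontal derivative and sums it over a horizontal sliding window; the window size and the function names are hypothetical.

```python
def horizontal_sobel(img):
    """Absolute response of the 3x3 Sobel kernel for the derivative along x."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            out[y][x] = abs(gx)
    return out

def accumulate_rows(grad, half_width):
    """Sum the gradient magnitudes over a window of 2*half_width+1 pixels per row."""
    h, w = len(grad), len(grad[0])
    acc = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            lo, hi = max(0, x - half_width), min(w, x + half_width + 1)
            acc[y][x] = sum(grad[y][lo:hi])
    return acc

# Rows 3..5 contain vertical strokes (a crude stand-in for a text line).
image = [[(255 if (x // 2) % 2 else 0) if 3 <= y <= 5 else 0 for x in range(16)]
         for y in range(9)]
accumulated = accumulate_rows(horizontal_sobel(image), half_width=3)
```

The accumulation turns isolated stroke edges into a compact high-valued band along the text line, while rows without strokes stay at zero.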

A text detection system for videos
[Figure: processing pipeline. Stages: initial frame integration (averaging); detection per single frame; tracking of text occurrences; suppression of false alarms; image enhancement (multiple frame integration); binarization; OCR (example output: “Soukaina Oufkir”)]

2 algorithms for still images
Method 1 (local contrast):
• Calculate a text probability image according to a text model (1 value/pixel).
• Separate the probability values into 2 classes by finding the optimal threshold.
• Post-processing.
Method 2 (learning):
• Calculate a text feature image (N values/pixel).
• Classify each pixel in the feature image.
• Post-processing.

The local contrast method
• Calculate a text probability image according to a text model (1 value/pixel): F. LeBourgeois.
• Separate the probability values into 2 classes: Fisher/Otsu.
• Post-processing:
  • Mathematical morphology
  • Geometrical constraints
  • Verification of special cases
  • Combination of rectangles
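The Fisher/Otsu step that separates the probability values into 2 classes can be sketched as below. The sketch assumes the values have been quantised to integers in the range 0 to 255 and uses the standard between-class variance criterion; the function name is hypothetical.

```python
def otsu_threshold(values, levels=256):
    """Return the threshold t maximising the between-class variance.

    Class 0 contains the values <= t, class 1 the values > t.
    """
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    total = len(values)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    w0 = 0      # number of samples in class 0
    sum0 = 0    # sum of the values in class 0
    for t in range(levels - 1):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, the criterion is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance up to the constant factor 1/total^2.
        score = w0 * w1 * (mean0 - mean1) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t

# Two well separated groups of "text probability" values.
threshold = otsu_threshold([10] * 50 + [200] * 50)
```

Because the threshold is always computed, the method produces a "text" class even when the image contains no text, which matches the false-alarm behaviour attributed to the assumption of text presence.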

Properties of the local contrast method
• High detection accuracy (accurate localization).
• Not very sensitive to the type of text.
• Low computational complexity (very fast!).
• False alarms due to the assumption of text presence.
• Geometrical constraints are imposed in the post-processing step.

Method 2: why learning?
• Goal: increase the precision (decrease the number of false alarms) of the detection algorithm by learning the characteristics of text.
• More complex text models are very difficult to derive analytically.
• The advent of support vector machine (SVM) learning and its ability to generalize even in high-dimensional spaces opened the door to complex decision functions and feature models.
• Drawback:
  • Specialization to a specific type of text (generalization)? Text exists in a wide variety of forms, fonts, sizes, orientations and deformations (especially scene text).

Geometrical features
• Learning gray values and edge maps alone may not generalize enough.
• Texture alone is not reliable, especially if the text is short.
• Geometry is a valuable feature.
• State of the art: enforce geometrical constraints in the post-processing step (mathematical morphology).
• We propose the usage of geometrical features very early in the detection process, i.e. not during post-processing.

Geometrical features: baseline
• Text consists of:
  • A high density of strokes in the direction of the text baseline.
  • A consistent baseline (a rectangular region with an upper and lower border).
• Two detection philosophies:
  • Detection of the baseline directly, before detecting the text region.
  • Detection of the baseline as the boundary area of the detected text region, in order to refine the detection quality.

Estimation of the text rectangle height
[Figure panels: original image; accumulated gradients]

Features
• Mode width (= rectangle height)
• Mode height (= contrast)
• Difference in height (left-right)
• Mode mean
• Mode standard deviation
• Difference in mode width

Learning with Support Vector Machines
• Training image database: positive samples, negative samples.
• Bootstrapping, cross-validation.
• Classification step: a reduction of the computational complexity is necessary:
  • Sub-sampling of the pixels to classify (4x4)
  • Approximation of the SVM model by SVM regression.

Tracking the text appearances
[Figure: text occurrences plotted against the frame number (time)]
• List of rectangles detected for the current frame.
• List containing the most recent rectangle of each text occurrence.
• The integration of the two lists is done using greedy search in the overlap matrix.
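A greedy search in the overlap matrix could look like the sketch below. Several details are assumptions made for illustration: rectangles are (left, top, right, bottom) tuples, the overlap measure is the intersection area divided by the area of the smaller rectangle, and `min_overlap` is a hypothetical acceptance threshold.

```python
def overlap(a, b):
    """Intersection area divided by the area of the smaller rectangle."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return iw * ih / min(area_a, area_b)

def greedy_match(tracked, detected, min_overlap=0.5):
    """Pair tracked occurrences with current detections, best overlap first.

    Returns (matches, new), where matches is a list of (tracked_index,
    detected_index) pairs and new lists the detections left unmatched,
    i.e. candidates for new text occurrences.
    """
    entries = sorted(
        ((overlap(t, d), i, j)
         for i, t in enumerate(tracked)
         for j, d in enumerate(detected)),
        reverse=True,
    )
    used_t, used_d, matches = set(), set(), []
    for score, i, j in entries:
        if score < min_overlap:
            break  # all remaining entries overlap even less
        if i in used_t or j in used_d:
            continue  # row or column already consumed by a better match
        matches.append((i, j))
        used_t.add(i)
        used_d.add(j)
    new = [j for j in range(len(detected)) if j not in used_d]
    return matches, new

tracked = [(0, 0, 10, 10), (20, 20, 40, 30)]
detected = [(21, 20, 41, 30), (1, 0, 11, 10), (100, 100, 120, 110)]
matches, new = greedy_match(tracked, detected)
```

Greedy selection is not guaranteed to find the globally best assignment, but it is simple and fast, which suits per-frame processing.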

Tracking: content verification
• Frequently, text occurrences appear at the same location without a significant temporal pause between them.
• Verification of the text box contents: L2 comparison of a signature vector (vertical projection profile of the Sobel edges).
[Figure cases: different text; fading text; same text]
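The content check can be sketched as follows, assuming that the vertical projection profile is the vector of column sums of the edge map, that both text boxes have the same size, and that `max_distance` is a hypothetical decision threshold.

```python
import math

def vertical_projection(edges):
    """Signature vector: sum of the edge magnitudes in each column."""
    return [sum(column) for column in zip(*edges)]

def l2_distance(a, b):
    """Euclidean (L2) distance between two signature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_text(edges_a, edges_b, max_distance):
    """Decide whether two text boxes show the same text."""
    signature_a = vertical_projection(edges_a)
    signature_b = vertical_projection(edges_b)
    return l2_distance(signature_a, signature_b) <= max_distance

# Toy edge maps: the stroke sits in a different column in box_c.
box_a = [[0, 5, 0], [0, 5, 0]]
box_b = [[0, 5, 0], [0, 5, 0]]
box_c = [[5, 0, 0], [5, 0, 0]]
```

Comparing profiles rather than raw pixels keeps the test cheap and makes it less sensitive to small changes in the moving background.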

Enhancement
• Input: detected text occurrence.
• Super-resolution (interpolation): bi-linear interpolation, bi-cubic splines.
• Multiple frame integration: averaging.
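The two enhancement operations named here, averaging over frames and bi-linear interpolation, can be sketched independently of each other. The sketch assumes the text box is already aligned across frames and uses an integer magnification factor; the order in which the system applies the two steps is not specified in this transcript.

```python
def average_frames(frames):
    """Pixel-wise mean of several aligned gray-level images of the same text box."""
    n = len(frames)
    h, w = len(frames[0]), len(frames[0][0])
    return [[sum(f[y][x] for f in frames) / n for x in range(w)]
            for y in range(h)]

def bilinear_upscale(img, factor):
    """Magnify an image by an integer factor using bi-linear interpolation."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h * factor):
        sy = min(y / factor, h - 1)          # source row, clamped to the border
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(x / factor, w - 1)      # source column, clamped
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

averaged = average_frames([[[0, 10]], [[10, 30]]])
enlarged = bilinear_upscale([[0, 10]], factor=2)
```

Averaging suppresses a moving background behind static text, and interpolation gives the later binarization step more pixels per character stroke.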

Adaptive binarization
• Niblack's adaptive method: [equation]
• Sauvola's improvement: [equation]
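For reference, the forms of these two thresholds usually cited in the literature are T = m + k·s (Niblack) and T = m·(1 + k·(s/R − 1)) (Sauvola), where m and s are the mean and standard deviation of the gray values in a local window. The sketch below evaluates both for a single window; the parameter values (k = -0.2, k = 0.5, R = 128) are common defaults and are assumptions, not values taken from the slides.

```python
import statistics

def niblack_threshold(window, k=-0.2):
    """Niblack: T = m + k * s, with m and s the window mean and std deviation."""
    m = statistics.mean(window)
    s = statistics.pstdev(window)
    return m + k * s

def sauvola_threshold(window, k=0.5, R=128):
    """Sauvola: T = m * (1 + k * (s / R - 1)); R is the dynamic range of s."""
    m = statistics.mean(window)
    s = statistics.pstdev(window)
    return m * (1 + k * (s / R - 1))

# In a flat window (s = 0) Niblack puts the threshold at the mean itself,
# so noise gets binarized; Sauvola lowers it to m * (1 - k).
flat_window = [100] * 9
text_window = [20, 20, 20, 220, 220, 220, 220, 220, 220]
```

The flat-window case shows the motivation for Sauvola's change: with the threshold pulled well below the mean, low-contrast background regions are no longer split into spurious dark pixels.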

Our solution: contrast maximization
• Quantities used: the contrast at the center of the image, the contrast of the window, the maximum local contrast.
• We keep the following pixels: [equation]
• Threshold: [equation]

Character segmentation: examples
[Figure panels: original image; Fisher/Otsu; Fisher/Otsu (windowed); Yanowitz-B.; Yanowitz-B. + post-proc.; Niblack; Sauvola et al.; contrast maximization]

Modeling text with a Markov random field
• Collaboration with the Laboratory for Language and Media Processing, University of Maryland (David Doermann).
• Binarization as a Bayesian maximum a posteriori estimation problem using a Markov random field model.
• Prior: models the prior knowledge on the spatial relationships in the image as an MRF.
• Likelihood: depends on the observation and noise model. In our case: Gaussian noise, corrected by Niblack's threshold surface.

The prior knowledge
• The clique energies (4x4) are learned and interpolated from training data.
• Optimization of the energy function with simulated annealing.
[Figure: the clique labelings of the repaired pixel before and after flipping it. All 16 cliques favor the change of the pixel.]

Evaluation measures
[Figure: detection rectangles and ground truth rectangles]
• ICDAR: 1-1 matches; overlap information only
• CRISP: 1-1, 1-M, M-1 matches; thresholded matches; no overlap information
• AREA: 1-1, 1-M, M-1 matches; thresholded matches; overlap information
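To make "overlap information" and "thresholded matches" concrete, here is a minimal sketch for a single ground-truth/detection pair. It assumes area-based recall and precision and two hypothetical thresholds; it does not reproduce the exact ICDAR, CRISP or AREA definitions, and it ignores the 1-M and M-1 cases.

```python
def area(r):
    """Area of a rectangle given as (left, top, right, bottom)."""
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def intersection_area(a, b):
    """Area of the intersection of two rectangles (0 if they are disjoint)."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))

def area_recall_precision(ground_truth, detection):
    """Overlap information: fraction of the ground truth that is covered
    (recall) and fraction of the detection that is correct (precision)."""
    inter = intersection_area(ground_truth, detection)
    return inter / area(ground_truth), inter / area(detection)

def is_match(ground_truth, detection, t_recall=0.8, t_precision=0.4):
    """Thresholded 1-1 match: both overlap ratios must pass their threshold."""
    recall, precision = area_recall_precision(ground_truth, detection)
    return recall >= t_recall and precision >= t_precision

gt = (0, 0, 10, 10)
too_tall = (0, 0, 10, 20)   # covers the text but is twice as high
shifted = (5, 0, 15, 10)    # covers only half of the text
```

A measure that keeps the overlap ratios can distinguish the first detection (full recall, half precision) from a perfect one, whereas a purely thresholded measure counts both simply as matches.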

• AIM2: Commercials
• AIM3: News
• AIM4: Cartoons, News
• AIM5: News

Detection in still images
[Results shown for: local contrast; SVM learning]

[Results shown for: local contrast; SVM learning]

The influence of falling generality
[Results shown for: local contrast; SVM learning]

Detection in video sequences

OCR results
• Binarization methods compared: local contrast based binarization; Sauvola et al.; Bayesian estimation using a Markov random field prior (MRF).
• Recognition by ABBYY FineReader 5.0.

TREC 2002
[Figure: examples labeled “Oil”, “Air plane”, “Airline”, “Dance”, “Music”, “Energy Gas”]
The type of videos present in the collection does not favor the use of recognized text: text is only rarely present.

Conclusion
• We developed a new system for detection, tracking, enhancement and binarisation of text.
• Detection performance is high due to the integration of several types of features at a very early stage. The learning method is less sensitive to textured noise in the image.
• We proposed a new evaluation method which takes into account several measures of detection quality.
• We derived a new binarisation method adapted to the type of text found in videos.
• 2 patents
• 2 publications in international journals (+1 submitted)
• 3 publications in international conferences
• 6 publications in national conferences

Outlook
• Possible improvement of the features (e.g. contrast normalization, non-linear texture filters).
• Integration of different feature types (statistical, structural, ...).
• Multi-orientation processing is not yet complete (new training set, implementation of the post-processing).
• Adaptation of the tracking algorithm to general types of motion.
• OCR on low-resolution grayscale images.
• Usage of a priori knowledge on text in order to decrease the number of false alarms.
• Integration of the detected text into an indexing/browsing/segmentation framework.
