On Combining Multiple Segmentations in Scene Text Recognition Lukáš Neumann and Jiří Matas Centre for Machine Perception, Department of Cybernetics Czech Technical University, Prague
Talk Overview • End-to-End Scene Text Recognition - Problem Introduction • The TextSpotter System • Character Detection as Extremal Region (ERs) Selection • Line Formation & Character Recognition • Character Ordering • Optimal Sequence Selection • Experiments
End-to-End Scene Text Recognition Input: Digital image (BMP, JPG, PNG) / video (AVI) Lexicon-free method Output: Set of words in the image; word = (horizontal) rectangular bounding box + text content Bounding Box = [240; 1428; 391; 1770] Content = "TESCO"
System Overview • Multi-scale Character Detection [1] with Gaussian Pyramid (new) • Text Line Formation [2] • Character Recognition [3] • Optimal Sequence Selection (new) [1] L. Neumann, J. Matas, “Real-time scene text localization and recognition”, CVPR 2012 [2] L. Neumann, J. Matas, “Text localization in real-world images using efficiently pruned exhaustive search”, ICDAR 2011 [3] L. Neumann, J. Matas, “A method for text localization and recognition in real-world images”, ACCV 2010
Character Detection - Thresholding Input image (PNG, JPEG, BMP) → 1D projection into <0; 255> (grey scale, hue, …) → Extremal regions at a threshold (θ = 50, 100, 150, 200)
Extremal Regions (ER) • Let image I be a mapping I : Z² → S, where S is a totally ordered set, e.g. <0, 255> • Let A be an adjacency relation (e.g. 4-neighbourhood) • Region Q is a contiguous subset w.r.t. A • (Outer) region boundary δQ is the set of pixels adjacent to but not belonging to Q • An Extremal Region is a region for which there exists a threshold that separates the region from its boundary: ∀p ∈ Q, ∀q ∈ δQ : I(p) < I(q) • Assuming a character is an ER, 3 parameters still have to be determined: • Threshold • Mapping to a totally ordered set (colour space projection) • Adjacency relation
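The ER property above can be checked directly on a region mask. The following is a minimal sketch, not the paper's implementation: it tests whether a candidate region is extremal at a given threshold, with the outer boundary taken under 4-adjacency; all function names are illustrative.

```python
import numpy as np

def outer_boundary(mask):
    """Pixels 4-adjacent to the region but not belonging to it.
    (np.roll wraps at the image border; fine for interior regions.)"""
    grown = mask.copy()
    for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
        grown |= np.roll(mask, shift, axis=axis)
    return grown & ~mask

def is_extremal_region(img, mask, theta):
    """ER property at threshold theta: I(p) < theta <= I(q) for every
    pixel p inside the region and every pixel q on its outer boundary."""
    return bool(img[mask].max() < theta <= img[outer_boundary(mask)].min())
```

Note that a single region is typically extremal for a whole interval of thresholds, which is exactly why the detector must later decide which thresholds to keep.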
ER Detection - Threshold Selection • Character boundaries are often fuzzy • It is very difficult to determine the threshold value locally; a typical document processing pipeline (image → binarization → OCR) leads to inferior results • Thresholds that most probably correspond to a character segmentation are selected using a CSER classifier [1]; multiple hypotheses are generated for each character [1] L. Neumann and J. Matas, "Real-time scene text localization and recognition", CVPR 2012
ER Detection – Threshold Selection • p(r|character) estimated at each threshold for each region • Only regions corresponding to local maxima selected by the detector • Incrementally computed descriptors used for classification [1] • Aspect ratio • Compactness • Number of holes • Horizontal crossings • Trained AdaBoost classifier with decision trees calibrated to output probabilities • Linear complexity, real-time performance (300ms on an 800x600px image) [1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012
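Three of the four descriptors can be illustrated with a naive, non-incremental computation over a boolean region mask; the actual detector maintains them in O(1) per added pixel as the threshold grows. The hole count is omitted here because it needs connected-component labelling of the background, and the compactness definition (√area / perimeter) is one common variant, assumed for this sketch.

```python
import numpy as np

def region_descriptors(mask):
    """Naive computation of three CSER-style descriptors from a boolean
    region mask (illustrative; the real detector is incremental)."""
    ys, xs = np.nonzero(mask)
    h = ys.max() - ys.min() + 1
    w = xs.max() - xs.min() + 1
    area = int(mask.sum())
    # perimeter: number of exposed pixel edges under 4-connectivity
    # (np.roll wraps at the border; fine for interior regions)
    perim = 0
    for shift, axis in ((1, 0), (-1, 0), (1, 1), (-1, 1)):
        perim += int((mask & ~np.roll(mask, shift, axis=axis)).sum())
    aspect_ratio = w / h
    compactness = np.sqrt(area) / perim
    # horizontal crossings: 0 -> 1 transitions in each row
    crossings = [int(np.count_nonzero(np.diff(row.astype(int)) == 1))
                 for row in mask]
    return aspect_ratio, compactness, crossings
```

Because every descriptor depends only on simple counts (bounding box, area, edge exposure, row transitions), each can be updated in constant time when one pixel joins the region, which is what makes the linear-complexity, real-time claim possible.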
ER Detection - Color Space Projection • Color space projection maps a color image into a totally ordered set • Trade-off between recall and speed (although can be easily parallelized) • Standard channels (R, G, B, H, S, I) of RGB / HSI color space • 85.6% characters detected in the Intensity channel, combining all channels increases the recall to 94.8% Source Image Intensity Channel (no threshold exists for the letter “A”) Red Channel
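A projection simply maps every pixel to a scalar from a totally ordered set, one map per channel; ER detection then runs on each projection independently. A rough sketch under simple HSI approximations (hue is left out because it is circular and needs extra handling; the function name is illustrative):

```python
import numpy as np

def channel_projections(rgb):
    """Map an H x W x 3 RGB image into several totally ordered channels:
    R, G, B plus intensity and saturation (simple HSI-style formulas)."""
    r, g, b = (rgb[..., i].astype(np.float64) for i in range(3))
    intensity = (r + g + b) / 3.0
    # saturation: 1 - min(R,G,B)/I, guarded against division by zero
    saturation = 1.0 - np.minimum.reduce([r, g, b]) / np.maximum(intensity, 1e-9)
    return {"R": r, "G": g, "B": b, "I": intensity, "S": saturation}
```

Running the detector once per channel is what trades recall for speed: each projection costs another pass, but the passes are independent and parallelize trivially.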
ER Detection - Gaussian Pyramid • Pre-processing with a Gaussian pyramid alters the adjacency relation • At each level of the pyramid only a certain interval of character stroke widths is amplified • Not a major overhead, as each level is 4 times faster than the previous one; total processing takes ~ 4/3 of the first level (1 + ¼ + ¼² + …) Characters formed of multiple small regions Multiple characters joined together
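The cost argument is a geometric series: each level has a quarter of the pixels of the previous one, so processing all levels is bounded by 4/3 of the cost of the first. A toy sketch, using 2×2 mean pooling as a stand-in for proper Gaussian blur plus subsampling:

```python
import numpy as np

def image_pyramid(img, levels):
    """Build a simple pyramid by 2x2 mean pooling (a stand-in for
    Gaussian blur + subsampling; each level has 1/4 the pixels)."""
    out = [img]
    for _ in range(levels - 1):
        a = out[-1]
        h, w = (a.shape[0] // 2) * 2, (a.shape[1] // 2) * 2
        a = a[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        out.append(a)
    return out

# total work relative to level 0: 1 + 1/4 + 1/16 + ... -> 4/3
total_cost = sum(0.25 ** k for k in range(8))
```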
Character Recognition • Regions agglomerated into text line hypotheses by exhaustive search [1] • Each segmentation (region) labeled by a FLANN classifier trained on synthetic data [2] • Multiple mutually exclusive segmentations with different label(s) present in each text line hypothesis [1] Neumann, Matas, "Text localization in real-world images using efficiently pruned exhaustive search", ICDAR 2011 [2] Neumann, Matas, "A method for text localization and recognition in real-world images", ACCV 2010
Character Ordering • Region A is a predecessor of a region B if A immediately precedes B in a text line • Approximated by a heuristic function based on text direction and mutual overlap • The relation induces a directed graph for each text line
Optimal Sequence Selection • The final region sequence of each text line is selected as an optimal path in the graph, maximizing the total score • Unary terms • Text line positioning (prefers regions which "sit nicely" in the text line) • Character recognition confidence • Binary terms (region pair compatibility score) • Threshold interval overlap (prefers neighboring regions with similar thresholds) • Language model transition probability (2nd order character model) (Example recognized word: "Accommodation")
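Because the predecessor relation induces a directed acyclic graph, the highest-scoring sequence can be found by dynamic programming in topological order. A sketch, with hypothetical score callbacks standing in for the unary and binary terms listed above:

```python
def best_sequence(nodes, edges, unary, binary):
    """Select the highest-scoring path through a character-ordering DAG.
    nodes: region ids in topological order; edges: dict node -> successors;
    unary/binary: score functions (stand-ins for text-line positioning,
    OCR confidence, threshold overlap and language-model terms)."""
    # best[n] = (best score of any path ending at n, that path)
    best = {n: (unary(n), [n]) for n in nodes}
    for n in nodes:  # topological order: predecessors already final
        score_n, path_n = best[n]
        for m in edges.get(n, ()):
            cand = score_n + binary(n, m) + unary(m)
            if cand > best[m][0]:
                best[m] = (cand, path_n + [m])
    return max(best.values(), key=lambda t: t[0])
```

Paths are allowed to start and end anywhere, which matches the setting above: mutually exclusive segmentations mean the winning path picks exactly one region per character position and skips the competing hypotheses.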
Experiments ICDAR 2011 Dataset – Text Localization [1] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform”, CVPR 2010
Experiments ICDAR 2011 Dataset – Text Localization [1] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, "Scene text detection using graph model built upon maximally stable extremal regions", Pattern Recognition Letters, 2013 [2] A. Shahab, F. Shafait, and A. Dengel, "ICDAR 2011 robust reading competition challenge 2: Reading text in scene images", ICDAR 2011 [3] L. Neumann and J. Matas, "Real-time scene text localization and recognition", CVPR 2012 [4] C. Yi and Y. Tian, "Text string detection from natural scenes by structure-based partition and grouping", Image Processing, 2011 [5] S. M. Hanif and L. Prevost, "Text detection and localization in complex scene images using constrained adaboost algorithm", ICDAR 2009
Experiments ICDAR 2011 Dataset – End-to-End Text Recognition Percentage of words correctly recognized without any error – case-sensitive comparison (ICDAR 2003 protocol) [1] L. Neumann and J. Matas, “Real-time scene text localization and recognition”, CVPR 2012
Sample Results on the ICDAR 2011 Dataset (recognized outputs shown in the sample images: "FREEDON", "chips", "cut", "CABOT", "PLACF")
Conclusions • Multi-scale processing / Gaussian pyramid improves text localization results without a significant impact on speed • Combining several channels and postponing the decision about character detection parameters (e.g. binarization threshold) to a later stage improves localization and OCR accuracy • Current state of the method • The method placed second in the ICDAR 2013 Text Localization competition, 1.4% worse than the winner (f-measure); unfortunately, end-to-end text recognition is not part of the competition • Online demo available at http://www.textspotter.org/ • OpenCV implementation of the character detector in progress by the open source community • Future work • OCR accuracy improvement • Overcoming limitations of CC-based methods (e.g. non-robustness caused by a single pixel)