A New Approach for Video Text Detection and Localization
M. Cai, J. Song and M.R. Lyu
VIEW Technologies, The Chinese University of Hong Kong
Related work
• Text area detection
  • Uncompressed-domain methods
    • Texture-based
    • Color-based
    • Edge-based
  • Compressed-domain methods
    • DCT coefficients
    • Number of intra-coded blocks in P-/B-frames
• Text string localization
  • Bottom-up scheme
  • Top-down scheme
Language-independent characteristics
• Contrast
  • An adaptive contrast threshold is chosen according to the background complexity.
• Color
  • Color bleeding is caused by compression.
• Orientation
  • Text has a well-defined size and orientation, which makes it easy to understand.
• Stationary location
  • Text stays at a fixed position for a relatively long time.
Workflow
• Sampling & color space conversion
• Video text detection and localization on every sampled frame
• Multi-frame comparison
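The three workflow stages could be wired together roughly as follows. This is a minimal plain-Python sketch, not the authors' code: the sampling interval `step`, the BT.601 luma weights used for the color space conversion, the per-frame detector passed in as `detect`, and the multi-frame comparison rule (a box is kept only if it overlaps a box from the previous sampled frame with IoU of at least `min_iou`, reflecting the stationary-location property) are all assumptions.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x0, y0, x1, y1), exclusive upper bounds
Pixel = Tuple[int, int, int]      # (R, G, B)
Gray = List[List[float]]


def to_luma(frame: Sequence[Sequence[Pixel]]) -> Gray:
    """Color space conversion: RGB to luma with ITU-R BT.601 weights (assumed)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def run_workflow(
    frames: Sequence[Sequence[Sequence[Pixel]]],
    detect: Callable[[Gray], List[Box]],  # hypothetical per-frame detector
    step: int = 5,                        # hypothetical sampling interval
    min_iou: float = 0.5,                 # hypothetical persistence threshold
) -> List[Box]:
    # 1. Sampling and color space conversion.
    sampled = [to_luma(f) for f in frames[::step]]
    # 2. Text detection and localization on every sampled frame.
    per_frame = [detect(img) for img in sampled]
    # 3. Multi-frame comparison: keep boxes that reappear in the next sampled frame.
    stable: List[Box] = []
    for prev, cur in zip(per_frame, per_frame[1:]):
        for box in cur:
            if box not in stable and any(iou(box, p) >= min_iou for p in prev):
                stable.append(box)
    return stable
```

With `step=5`, only frames 0, 5, 10 and so on are processed; a caption that stays in place survives the comparison, while a box seen in only one sampled frame is dropped.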
A sequential multi-resolution paradigm
[Figure: at each level l = 1, 2, …, n, the original image is resized to Size / f(l); edge detection produces an edge map, text area detection yields text regions, and text string localization refines them; the region coordinates are scaled back by f(l) to the original coordinates. The outputs of all levels form the final text regions with original coordinates.]
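One way to read the multi-resolution diagram is sketched below: at each level l the image is shrunk to Size / f(l), a single-level detector runs on it, and the resulting boxes are scaled back by f(l) to original coordinates. The subsampling method, the choice f(l) = 2^(l-1), the detector `detect_level`, and the rule for merging levels (skip a box that overlaps an earlier one with IoU of at least `min_iou`) are hypothetical, not taken from the slides.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), exclusive upper bounds
Image = List[List[float]]


def downscale(img: Image, f: int) -> Image:
    """Shrink the image to Size / f by keeping every f-th row and column (assumed)."""
    return [row[::f] for row in img[::f]]


def upscale_box(box: Box, f: int, w: int, h: int) -> Box:
    """Map a box found at Size / f back to original coordinates, clipped to w x h."""
    x0, y0, x1, y1 = box
    return (x0 * f, y0 * f, min(x1 * f, w), min(y1 * f, h))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def multi_resolution(
    img: Image,
    detect_level: Callable[[Image], List[Box]],  # hypothetical single-level detector
    levels: int = 3,
    factor: Callable[[int], int] = lambda level: 2 ** (level - 1),  # hypothetical f(l)
    min_iou: float = 0.5,
) -> List[Box]:
    h, w = len(img), len(img[0])
    final: List[Box] = []
    for level in range(1, levels + 1):  # levels are processed sequentially
        f = factor(level)
        for box in detect_level(downscale(img, f)):
            big = upscale_box(box, f, w, h)
            # Skip regions that an earlier level already found.
            if all(iou(big, kept) < min_iou for kept in final):
                final.append(big)
    return final
```

Coarser levels let the same fixed-size operators respond to larger characters, which is the usual motivation for such a pyramid.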
Text detection
• Edge detection
  • Sobel edge detector
• Local thresholding
  • Adaptive to background complexity
• Text-like area recovery
  • Enhances the density of text areas
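A plain-Python sketch of the Sobel edge detector that yields the edge-strength map used by the later steps. Taking the gradient magnitude sqrt(Gx² + Gy²) as the edge strength and leaving border pixels at zero are assumptions; the slides name only the operator.

```python
import math
from typing import List

Image = List[List[float]]

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) gradients.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_edge_map(img: Image) -> Image:
    """Return the gradient magnitude at each interior pixel; borders stay 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    p = img[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * p
                    gy += SOBEL_Y[dy][dx] * p
            out[y][x] = math.hypot(gx, gy)
    return out
```

On a vertical step from 0 to 10, the two interior pixels on either side of the step both get strength 40, while a flat image produces an all-zero map.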
[Figure: (a) concentric kernel and window; (b) a window on a multi-line text area and the horizontal projection in it; (c) local threshold selection from the edge-strength histogram (count vs. edge strength, 0 to MAX), divided into a low part and a high part]
Local thresholding
• Use a small kernel (shown in gray) to scan the whole edge map row by row.
• In the larger window surrounding the kernel, classify the background as "Clear" or "Noisy".
• For a Clear background, determine the local threshold from the low part of the edge-strength histogram in the larger window; for a Noisy background, use the high part.
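The local thresholding procedure could be sketched as follows. Only the overall structure comes from the slides (a kernel scanned row by row, a larger surrounding window, a Clear/Noisy decision, and the low or high part of the window's edge-strength histogram). The kernel size `k`, the 3k × 3k window, the edge-density rule `noisy_density` for calling a window Noisy, and splitting the histogram at its median and using the mean of each part as the threshold are all assumptions made for illustration.

```python
from statistics import mean, median
from typing import List

Image = List[List[float]]


def local_threshold(edge: Image, k: int = 4, noisy_density: float = 0.3) -> List[List[int]]:
    """Binarize an edge-strength map with a per-kernel threshold.

    k is a hypothetical kernel size; the window is assumed to be 3k x 3k.
    """
    h, w = len(edge), len(edge[0])
    out = [[0] * w for _ in range(h)]
    for ky in range(0, h, k):          # scan the edge map row by row
        for kx in range(0, w, k):
            # Window centred on the kernel, clipped at the image border.
            y0, y1 = max(0, ky - k), min(h, ky + 2 * k)
            x0, x1 = max(0, kx - k), min(w, kx + 2 * k)
            vals = [edge[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            strengths = [v for v in vals if v > 0]
            if not strengths:
                continue               # no edges near this kernel
            # Hypothetical rule: many edge pixels in the window means "Noisy".
            noisy = len(strengths) / len(vals) > noisy_density
            # Hypothetical split of the edge-strength histogram at its median.
            mid = median(strengths)
            low = [v for v in strengths if v <= mid]
            high = [v for v in strengths if v > mid] or [mid]
            t = mean(high) if noisy else mean(low)
            for y in range(ky, min(h, ky + k)):
                for x in range(kx, min(w, kx + k)):
                    out[y][x] = int(edge[y][x] > 0 and edge[y][x] >= t)
    return out
```

A Clear window gets a low threshold so faint text strokes survive, while a Noisy window gets a high one so background clutter is cut.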
Thresholding result comparison
[Figure: a video image, its global thresholding result, and its local thresholding result]
[Figure: edge map before and after recovery]
Text-like area recovery
• Labeling: classify each current edge pixel as "TEXT" or "NON_TEXT" based on its local density.
• Recovery/suppression:
  • Bring back the lower-strength edge pixels neighboring the TEXT edge pixels.
  • Suppress the NON_TEXT edge pixels.
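The labeling and recovery/suppression steps might look like this in a minimal sketch. The density neighbourhood `radius`, the TEXT threshold `min_density`, the 8-neighbourhood used for recovery, and the minimum strength `weak_t` a pixel needs to be brought back are hypothetical parameters, not values from the slides.

```python
from typing import List

Grid = List[List[int]]


def recover_text_like(
    binary: Grid,                  # thresholded edge map (1 = edge pixel)
    strength: List[List[float]],   # raw edge strengths before thresholding
    radius: int = 2,               # hypothetical density neighbourhood radius
    min_density: float = 0.2,      # hypothetical TEXT density threshold
    weak_t: float = 1.0,           # hypothetical minimum strength for recovery
) -> Grid:
    h, w = len(binary), len(binary[0])

    def density(y: int, x: int) -> float:
        cells = [binary[j][i]
                 for j in range(max(0, y - radius), min(h, y + radius + 1))
                 for i in range(max(0, x - radius), min(w, x + radius + 1))]
        return sum(cells) / len(cells)

    # Labeling: an edge pixel is TEXT if its neighbourhood is dense enough.
    text = [[binary[y][x] == 1 and density(y, x) >= min_density for x in range(w)]
            for y in range(h)]
    # Suppression: NON_TEXT edge pixels are simply not copied to the output.
    out = [[int(t) for t in row] for row in text]
    # Recovery: bring back weaker edge pixels in the 8-neighbourhood of TEXT pixels.
    for y in range(h):
        for x in range(w):
            if not text[y][x]:
                continue
            for j in range(max(0, y - 1), min(h, y + 2)):
                for i in range(max(0, x - 1), min(w, x + 2)):
                    if strength[j][i] >= weak_t:
                        out[j][i] = 1
    return out
```

An isolated edge pixel is suppressed even if it is strong, while a weak pixel touching a dense cluster of edges is restored.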
Coarse-to-fine text localization
• Projection-based top-down localization.
• Designed to handle complex text layouts.
Localization loop (flowchart):
  1. Initialization: the whole edge map is the only region in the processing array.
  2. If the array is empty, terminate; otherwise, pop the first region from the processing array.
  3. Compute the horizontal and vertical projections of the region. If the region is divisible, add each sub-region to the processing array and return to step 2.
  4. For an indivisible region, check its aspect ratio: add it to the resulting text regions if the check passes; otherwise, discard it as a false region.
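The localization loop resembles a recursive projection cut, sketched below with an explicit processing queue. The divisibility rule (a gap of at least `min_gap` zero rows or columns in the projection profile), trying the horizontal projection before the vertical one, and the `min_aspect` width-to-height test for discarding false regions are assumptions, not values given in the slides.

```python
from collections import deque
from typing import Deque, List, Tuple

Region = Tuple[int, int, int, int]  # (x0, y0, x1, y1), exclusive upper bounds


def runs(profile: List[int], min_gap: int = 1) -> List[Tuple[int, int]]:
    """Non-zero runs of a projection profile; gaps shorter than min_gap are bridged."""
    out: List[Tuple[int, int]] = []
    for i, v in enumerate(profile):
        if not v:
            continue
        if out and i - out[-1][1] < min_gap:
            out[-1] = (out[-1][0], i + 1)   # extend the current run
        else:
            out.append((i, i + 1))          # start a new run
    return out


def localize(edge: List[List[int]], min_aspect: float = 1.5, min_gap: int = 1) -> List[Region]:
    h, w = len(edge), len(edge[0])
    queue: Deque[Region] = deque([(0, 0, w, h)])  # initialization: the whole edge map
    results: List[Region] = []
    while queue:                                   # terminate when the array is empty
        x0, y0, x1, y1 = queue.popleft()           # pop the first region
        # Horizontal projection: one sum per row.
        rr = runs([sum(edge[y][x0:x1]) for y in range(y0, y1)], min_gap)
        if not rr:
            continue                               # region holds no edge pixels
        if len(rr) > 1:                            # divisible: add the sub-regions
            queue.extend((x0, y0 + a, x1, y0 + b) for a, b in rr)
            continue
        y0, y1 = y0 + rr[0][0], y0 + rr[0][1]      # trim empty rows
        # Vertical projection: one sum per column.
        cr = runs([sum(edge[y][x] for y in range(y0, y1)) for x in range(x0, x1)], min_gap)
        if len(cr) > 1:
            queue.extend((x0 + a, y0, x0 + b, y1) for a, b in cr)
            continue
        x0, x1 = x0 + cr[0][0], x0 + cr[0][1]      # trim empty columns
        # Indivisible region: keep it only if it is wide enough to be a text line;
        # otherwise it is discarded as a false region.
        if (x1 - x0) / (y1 - y0) >= min_aspect:
            results.append((x0, y0, x1, y1))
    return results
```

Every split yields strictly smaller regions, so the queue always empties and the loop terminates. In practice `min_gap` would be set larger than typical letter spacing so that words are not cut into characters.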
Localization steps
[Figure: panels (1) to (4) showing the localization steps]
Performance statistics
Statistics over 10 news videos:
• Processing time per frame: 0.25 s (Pentium III 1 GHz CPU)
• Detection rate: 93.6%
• Detection accuracy: 87.2%
• Localization accuracy: > 90%