TEXT EXTRACTION FROM IMAGES AND VIDEOS Marios Anthimopoulos Computational Intelligence Laboratory Institute of Informatics and Telecommunications
Outline • VOCR - overview • Text Detection • Text Tracking • Text Segmentation • Proposed methodology – Overview • Text areas detection • Text lines detection • Multiresolution analysis • Text Segmentation • Evaluation strategy • Experimental results • Publications • Future work
VOCR overview • VOCR: A research area which attempts to develop a computer system with the ability to automatically read from videos the textual content visually embedded in complex backgrounds • Artificial (superimposed, graphic, caption, overlay) Text: artificially superimposed on images or video frames at the time of editing. Artificial text usually underscores or summarizes the video’s content. This makes artificial text particularly useful for building keyword indexes.
VOCR overview • VOCR: A research area which attempts to develop a computer system with the ability to automatically read from videos the textual content visually embedded in complex backgrounds • Scene text: naturally occurs in the field of view of the camera during video capture. Scene text occurring on signs, banners, etc. may also give keywords that describe the content of a video sequence
VOCR overview • Challenges for VOCR: • Lower resolution: video frames are typically captured at resolutions of 320 × 240 or 640 × 480 pixels, while document images are typically digitized at resolutions of 300 dpi or greater • Unknown text color: text can have arbitrary and non-uniform color. • Unknown text size, position, orientation, layout: captions lack the structure usually associated with documents. • Unconstrained background: the background can have colors similar to the text color. The background may include streaks that appear very similar to character strokes. • Color bleeding: lossy video compression may cause colors to run together. • Low contrast: low bit-rate video compression can cause loss of contrast between character strokes and the background.
VOCR overview • Basic steps of a VOCR system: • Text detection: spatial text detection in every frame; outputs a bounding box for every text line. • Text tracking + enhancement: temporal text detection from frame to frame, with multi-frame integration for image enhancement; outputs an enhanced image for every text line. • Text segmentation: binarization and resolution enhancement; outputs a b/w image for every text line. • Text recognition: outputs ASCII characters for every text line, yielding the final text.
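The pipeline above can be sketched as a chain of stubs. The function names and the simple averaging/thresholding placeholders are illustrative stand-ins, not the actual system; the final recognition step (an OCR engine) is omitted.

```python
import numpy as np

def detect_text(frame):
    """Spatial text detection (stub): one bounding box per text line."""
    # Hypothetical fixed box (x, y, width, height); a real detector
    # would use an edge-based method such as the one in this deck.
    return [(10, 20, 200, 30)]

def track_and_enhance(frames, boxes):
    """Temporal tracking + multi-frame integration (stub): one enhanced
    image per tracked text line (here: per-pixel average of the crops)."""
    enhanced = []
    for (x, y, w, h) in boxes:
        crops = [f[y:y + h, x:x + w] for f in frames]
        enhanced.append(np.mean(crops, axis=0))
    return enhanced

def segment_text(image):
    """Binarization (stub): produce a b/w image for the text line."""
    return (image > image.mean()).astype(np.uint8)

def vocr(frames):
    """Detection -> tracking/enhancement -> segmentation; recognition
    (an OCR engine reading the b/w images) is left out of this sketch."""
    boxes = detect_text(frames[0])
    return [segment_text(img) for img in track_and_enhance(frames, boxes)]
```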
Text Detection Text detection methods can generally be classified into two categories: • Bottom-up methods: segment images into regions and group "character" regions into words. To some degree these methods can avoid an explicit text detection step, but due to the difficulty of developing efficient segmentation algorithms for text in complex backgrounds, they are not robust for detecting text in many camera-based images and videos. • Top-down methods: first detect text regions in images using filters and then perform bottom-up techniques inside the text regions. These methods can process more complex images than bottom-up approaches. Top-down methods are further divided into two categories: heuristic methods, which use heuristic filters, and machine learning methods, which use trained filters.
Text Detection Bottom-up method: Lienhart and Stuber [1] regard text regions as connected components (CCs) with the same or similar color and size, and apply motion analysis to enhance the text extraction results for a video sequence. The input image is segmented based on the monochromatic nature of the text components using a split-and-merge algorithm. Segments that are too small or too large are filtered out. After dilation, motion information and contrast analysis are used to enhance the extracted results. [1] R. Lienhart and F. Stuber, "Automatic text recognition in digital videos", Technical Report TR-1995-036, Department for Mathematics and Computer Science, University of Mannheim, 1995.
Text Detection Top-down methods: Du et al. [2] use multistage pulse code modulation (MPCM) to locate potential text regions in colour video images. A sequence of spatial filters is then applied to remove noisy regions. [2] Y. Du, C.-I Chang, P. D. Thouin, "Automated system for text detection in individual video images", Journal of Electronic Imaging, 12(3), 410-422, 2003.
Text Detection Top-down methods: Zhong et al. [3] use the DCT coefficients of compressed JPEG or MPEG files to distinguish the texture of textual regions from non-textual regions. [3] Y. Zhong, H. Zhang, A. K. Jain, "Automatic Caption Localization in Compressed Video", IEEE Trans. Pattern Analysis and Machine Intelligence, 22(4): 385-392, 2000.
Text Detection Top-down methods: Lienhart and Wernicke [4] use gradient features fed to a neural network. For each 20x10 pixel window at each scale, the network's confidence value for text is added to a saliency map S, which is finally binarized. [4] R. Lienhart and A. Wernicke, "Localizing and Segmenting Text in Images and Videos", IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, 2002.
Text Tracking Text tracking: temporal detection of text in video sequences. • Every-frame detection: all pairs of bounding boxes with non-zero overlap must have the difference in size below a certain threshold, the difference in position below a certain threshold, and the size of the overlap area above a certain threshold. • Periodical detection: a box in frame t is recognized in frame t+k, moved by a vector (Δx, Δy), if a correlation measure Cor is lower than a threshold.
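The every-frame matching rules can be expressed as a predicate over pairs of boxes. The concrete threshold values below are assumptions, since the slides only say "a certain threshold":

```python
def boxes_match(box_a, box_b, size_tol=0.2, pos_tol=10, overlap_tol=0.5):
    """Decide whether two bounding boxes (x, y, w, h) from consecutive
    frames belong to the same text line.  Threshold values are
    illustrative, not taken from the slides."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    # Overlap area (zero if the boxes are disjoint)
    ox = max(0, min(xa + wa, xb + wb) - max(xa, xb))
    oy = max(0, min(ya + ha, yb + hb) - max(ya, yb))
    overlap = ox * oy
    if overlap == 0:
        return False
    # Difference in size below a threshold (relative area difference)
    if abs(wa * ha - wb * hb) / max(wa * ha, wb * hb) > size_tol:
        return False
    # Difference in position below a threshold (pixels)
    if abs(xa - xb) > pos_tol or abs(ya - yb) > pos_tol:
        return False
    # Overlap area high enough relative to the smaller box
    return overlap / min(wa * ha, wb * hb) >= overlap_tol
```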
Text Tracking Image enhancement by multi-frame integration: if F_i, i = 1, …, T, are the tracked instances of a text box in T different frames, the final enhanced image is obtained by fusing them pixel by pixel.
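A minimal sketch of multi-frame integration, assuming per-pixel averaging as the fusion operator (the slide's exact formula is not reproduced in this transcript; averaging suppresses the changing background while the static text reinforces itself):

```python
import numpy as np

def integrate_frames(tracked_boxes):
    """Fuse the T tracked instances F_i of a text box into one enhanced
    image by per-pixel averaging.  A per-pixel minimum (keeping the
    darkest value) is another operator used in the literature."""
    stack = np.stack(tracked_boxes).astype(np.float64)
    return stack.mean(axis=0)
```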
Text Segmentation • Global binarization: Otsu's method, the most effective global thresholding technique. • Local binarization: Niblack: very fast and proven effective for VOCR. It uses a shifting window covering at least 1-2 characters and applies the threshold T = m + k·s, where m is the local mean, s the local standard deviation and k = -0.2. It has problems in areas containing no text. Sauvola: solves this problem by assuming that text is dark on a bright background, with T = m·(1 - k·(1 - s/R)), where R = 128 and k = 0.5. It has problems when this hypothesis does not hold (even after reversing the image).
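The two local thresholds can be written directly from the formulas above; `window` stands for the pixel neighborhood covered by the shifting window:

```python
import numpy as np

def niblack_threshold(window, k=-0.2):
    """Niblack: T = m + k*s over a local window (k = -0.2 as in the slide)."""
    m, s = window.mean(), window.std()
    return m + k * s

def sauvola_threshold(window, k=0.5, R=128):
    """Sauvola: T = m * (1 - k*(1 - s/R)) with R = 128, k = 0.5 as in the
    slide.  In flat regions (s close to 0) the threshold drops well
    below the mean, so uniform background is not binarized as noise."""
    m, s = window.mean(), window.std()
    return m * (1 - k * (1 - s / R))
```

On a flat window of constant value 100, Niblack returns the mean itself (so any noise crosses the threshold), while Sauvola returns 50, illustrating why Sauvola handles no-text areas better.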
Proposed methodology - Overview • The proposed methodology exploits the fact that text lines produce strong, horizontally aligned vertical edges and follow specific shape restrictions. Using edges as the prominent feature of our system lets us detect characters of different fonts and colors, since every character, regardless of its font or color, must present strong edges in order to be readable. • The whole algorithm is applied in a multiresolution fashion to ensure text detection across a range of text sizes.
Proposed methodology - Overview • Flowchart of the text detection methodology: • Map generation: the edge map is generated using the Canny edge detector. • Dilation: a dilation by a cross-shaped 5x21 element is performed to connect the character contours of every text line. • Opening: a morphological opening is used to remove noise and smooth the shape of the candidate text areas. The element used here is also cross-shaped, with size 11x45. • Projection analysis: edge projections are computed, and rows or columns with values under a threshold are discarded. Boxes with more than one text line are divided and some noisy areas are eliminated. • Scale integration: the methodology described above is applied to the image at different scales and the results are finally fused to the initial scale.
Proposed methodology - Text areas detection • We use the Canny edge detector applied to greyscale images. Canny uses Sobel masks to find the edge magnitude of the image and then applies non-maxima suppression and hysteresis thresholding. With these two post-processing operations the Canny edge detector manages to remove non-maxima pixels while preserving the connectivity of the contours.
Proposed methodology - Text areas detection • After computing the Canny edge map, a dilation by a 5x21 element is performed to connect the character contours of every text line. Experiments showed that a cross-shaped element gives better results.
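Dilation by a cross-shaped element can be sketched in plain NumPy as the union of horizontal and vertical shifts of the edge map (a real implementation would use a morphology library; this version simply clips at the image border):

```python
import numpy as np

def dilate_cross(edges, arm_h=5, arm_w=21):
    """Binary dilation by a cross-shaped element of height arm_h and
    width arm_w (5x21 in the slides): every edge pixel is smeared
    arm_w//2 pixels left/right and arm_h//2 pixels up/down, which
    connects the contours of adjacent characters into one blob."""
    h, w = edges.shape
    e = edges.astype(bool)
    out = np.zeros_like(e)
    for d in range(-(arm_w // 2), arm_w // 2 + 1):   # horizontal arm
        lo, hi = max(d, 0), min(w, w + d)
        out[:, lo:hi] |= e[:, lo - d:hi - d]
    for d in range(-(arm_h // 2), arm_h // 2 + 1):   # vertical arm
        lo, hi = max(d, 0), min(h, h + d)
        out[lo:hi, :] |= e[lo - d:hi - d, :]
    return out
```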
Proposed methodology - Text areas detection • Then a morphological opening is used to remove noise and smooth the shape of the candidate text areas. The element used here is also cross-shaped, with size 11x45.
Proposed methodology - Text areas detection • Every component with height less than 11 or width less than 45 is suppressed. • Connected component analysis then helps us compute the initial bounding boxes of the candidate text areas.
Proposed methodology - Text lines detection • To increase precision and reject false alarms we use a method based on horizontal and vertical projections. • The horizontal edge projection of every box is computed, and lines with projection values below a threshold are discarded. • Boxes with more than one text line are divided, and lines containing only noise are also discarded.
Proposed methodology - Text lines detection • Boxes which do not contain text are usually split into a number of boxes of very small height and are discarded at the next stage due to geometrical constraints. A box is discarded if: • its height is lower than a threshold (set to 12); • its height is greater than a threshold (set to 48); • its width/height ratio is lower than a threshold (set to 1.5).
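These geometric constraints amount to a small predicate, using the threshold values given in the slide:

```python
def keep_box(width, height, min_h=12, max_h=48, min_ratio=1.5):
    """Geometric constraints from the slides: discard boxes that are
    too short, too tall, or not wide enough relative to their height."""
    return min_h <= height <= max_h and width / height >= min_ratio
```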
Proposed methodology - Text lines detection • Then, a similar procedure with vertical projection follows • The vertically divided parts remain connected if the distance between them is less than a threshold which depends on the height of the candidate text line (set to 1.5*height)
Proposed methodology - Text lines detection • The whole procedure of horizontal and vertical projections is repeated three times in order to segment even the most complicated text areas, and yields the final bounding boxes.
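A sketch of one horizontal-projection pass; the threshold choice (a fraction of the maximum projection value) is an assumption, since the slides only say "below a threshold":

```python
import numpy as np

def split_lines(edge_box, frac=0.1):
    """Horizontal projection analysis: sum edge pixels per row, discard
    rows whose projection falls below a fraction of the maximum, and
    return the (top, bottom) row ranges of the remaining runs -- one
    range per candidate text line."""
    proj = edge_box.sum(axis=1)
    keep = proj > frac * proj.max()
    lines, start = [], None
    for r, k in enumerate(keep):
        if k and start is None:
            start = r
        elif not k and start is not None:
            lines.append((start, r))
            start = None
    if start is not None:
        lines.append((start, len(keep)))
    return lines
```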
Proposed methodology - Multiresolution analysis • The sizes of the structuring elements for the morphological operations and the geometrical constraints give the algorithm the ability to detect text only in a specific range of character heights (12-48 pixels). • To overcome this limitation we adopt a multiresolution approach: the algorithm described above is applied to the image at different resolutions and the results are finally fused to the initial resolution.
Proposed methodology - Multiresolution analysis • We chose to use two resolutions for this approach: the initial one, and one with a scale factor of 0.6. In this way the system can detect characters with height up to 80 pixels, which was considered sufficient. (Example images: coarse resolution vs. fine resolution.)
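The scale fusion can be sketched as follows; `detect` and `image_pyramid` are hypothetical stand-ins for the single-scale detector and the resized images:

```python
def to_initial_scale(box, scale):
    """Map a box detected at a coarser scale back to the initial
    resolution by dividing its coordinates by the scale factor."""
    x, y, w, h = box
    return (round(x / scale), round(y / scale),
            round(w / scale), round(h / scale))

def detect_multiscale(detect, image_pyramid, scales=(1.0, 0.6)):
    """Run the single-scale detector on each resolution and fuse the
    boxes at the initial scale.  With scale 0.6, a 48 px character at
    the coarse scale corresponds to 48/0.6 = 80 px at the initial one,
    matching the 80 px limit quoted in the slide."""
    boxes = []
    for img, s in zip(image_pyramid, scales):
        boxes += [to_initial_scale(b, s) for b in detect(img)]
    return boxes
```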
Proposed methodology - Text Segmentation • Text segmentation: produce a b/w image for every detected text block. • We calculate the mean intensity value inside and outside the yellow box and compare the two values to decide between normal and inverse text; inverse text is inverted before binarization.
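A sketch of the polarity decision, assuming the "outside" region is a thin band around the box (the slides show a yellow box but do not give the band width):

```python
import numpy as np

def normalize_polarity(frame, box, margin=4):
    """Compare the mean intensity inside the detected box with the mean
    in a thin band just outside it; if the inside is brighter, the text
    is treated as inverse and the crop is inverted, so segmentation
    always sees dark text on a bright background.  The band width
    (margin) is an assumption, not a value from the slides."""
    x, y, w, h = box
    inside = frame[y:y + h, x:x + w].astype(float)
    outer = frame[max(0, y - margin):y + h + margin,
                  max(0, x - margin):x + w + margin].astype(float)
    mean_outside = (outer.sum() - inside.sum()) / (outer.size - inside.size)
    if inside.mean() > mean_outside:   # bright text on dark: inverse
        return 255 - inside
    return inside
```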
Proposed methodology - Text Segmentation We based text segmentation on the adaptive binarization technique [5]: [5] B. Gatos, I. Pratikakis and S. J. Perantonis, "Adaptive Degraded Document Image Binarization", Pattern Recognition, Vol. 39, pp. 317-327, 2006.
Proposed methodology - Text Segmentation Original image Gray scale image First draft binarization Background surface Final binary image
Proposed methodology - Evaluation strategy • A text line must influence the final evaluation measure in proportion to the number of characters it contains, not the number of its pixels. • The number of characters in a box cannot be determined by the detection algorithm, but it can be approximated by the width/height ratio of the bounding box.
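A character-weighted detection rate along these lines might look like the sketch below; the exact evaluation formula is not given in the slides:

```python
def detection_rate(boxes):
    """boxes: list of (matched, width, height), one entry per
    ground-truth text line.  Each line is weighted by width/height --
    a proxy for its character count -- instead of its pixel area."""
    total = sum(w / h for _, w, h in boxes)
    hit = sum(w / h for m, w, h in boxes if m)
    return hit / total
```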
Proposed methodology - Evaluation strategy • Recognition: • Edit distance is used to compare the detected text against the correct text: we calculate the minimum number of character insertions, deletions and replacements needed to correct the resulting text. • We normalize the edit distance to a 0-100 scale.
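The edit-distance computation, together with one plausible 0-100 normalization (the slides do not give the exact normalization formula):

```python
def edit_distance(a, b):
    """Minimum number of character insertions, deletions and
    replacements turning string a into string b (Levenshtein DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # replacement
        prev = cur
    return prev[-1]

def recognition_score(detected, correct):
    """Edit distance mapped to 0-100 (100 = perfect match); normalizing
    by the longer string's length is an assumption."""
    if not detected and not correct:
        return 100.0
    return 100.0 * (1 - edit_distance(detected, correct)
                    / max(len(detected), len(correct)))
```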
Proposed methodology - Experimental results • Corpus: 3 sets of video frames (720x480) have been used, captured from TRECVID 2005 and 2006 (http://www-nlpir.nist.gov/projects/trecvid/) • Set1 contains text in many different sizes as well as some scene text • Set2 contains images with very large fonts and also some scene text • Set3 contains artificial text with small fonts
Proposed methodology - Publications • M. Anthimopoulos, B. Gatos, I. Pratikakis, "Multiresolution text detection in video frames", Second International Conference on Computer Vision Theory and Applications (VISAPP), Barcelona, Spain, March 8-11, 2007. • M. Anthimopoulos, B. Gatos, I. Pratikakis, S. J. Perantonis, "Detecting text in video frames", Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications (SPPRA), Innsbruck, Austria, February 2007. • M. Anthimopoulos, B. Gatos, I. Pratikakis, "Text detection in video frames", accepted for publication in Proc. 11th Pan-Hellenic Conference on Informatics (PCI 2007), Patras, May 2007.
Proposed methodology - Future work • Future work: • Exploit the color homogeneity of text • Temporal text detection from frame to frame • Multi-frame integration for image enhancement