Localization, Extraction and Recognition of Text in Telugu Document Images
Chandra Kanth Chereddi, Department of CSE, College of Engineering, Osmania University, Hyderabad 500007, India, chandra-kanth@ieee.org
K. Nikhil Shanker, Department of CSE, Mahatma Gandhi Institute of Technology, Hyderabad, India, nikhil.shanker@acm.org
Atul Negi*, Department of CIS, University of Hyderabad, Hyderabad 500046, India, atulcs@uohyd.ernet.in
*Author for correspondence
This work is supported by the Resource Center for Indian Language Technology Solutions (Telugu), University of Hyderabad, established by the Ministry of Communications and Information Technology, New Delhi, Government of India.
About Telugu Script
• Consists of rounded shapes (no vertical strokes)
• Characters may be basic vowel/consonant shapes or may be composed by compounding shapes (Negi et al., ICDAR 2001, shows examples)
• The example on the slide shows glyphs in bounding boxes in a word pronounced “Maa-tru-gee-ta”
Symbol classes shown on the slide:
• Acchulu: vowel sound symbols (16)
• Hallulu: consonant sound symbols (38)
• Maatra: vowel sound modifying symbols for Hallulu (16)
• Voththulu: core consonant sound symbols
Overall Process (flowchart)
Top-level stages: Telugu Document Image, Segmentation, Recognition (OCR), Recognized Text
Recognition (OCR) blocks: Isolated Character; 16 Dim. Zoning Vector; k=5 NN Thresholded Match; If Candidate is Found; Cavity Vectors; Cavity Vector Match / Relaxed Matching; Non-linear Shape Normalization; Template Matching; Recognized Character
Segmentation Example
Top right: input image. Left: gradient magnitude image. Right: Hough transformed text words in bounding boxes.
Notice the non-Manhattan layout, the noisy element in the picture region, and the isolated text at bottom left.
Segmentation Example Continued
Top left: input image. Bottom: cleaned image output with word and glyph bounding boxes isolated.
Notice the removal of the noisy element in the picture region and the retention of the isolated text at bottom left in the output.
Recognition-Zoning
A 4x4 grid is shown at left. Each grid zone measures the foreground pixel density, which gives a 16-dimensional feature vector.
We find the k=5 nearest neighbors as candidates; those at a distance greater than 2.5 times the distance of the nearest are eliminated.
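The zoning step and the thresholded k=5 match can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Euclidean distance, and the convention that ink pixels are 1 are assumptions, and the glyph is assumed to be at least 4x4 pixels.

```python
import numpy as np

def zoning_vector(glyph, grid=4):
    """Foreground density of each cell of a grid x grid partition (ink = 1)."""
    glyph = np.asarray(glyph, dtype=float)
    rows = np.array_split(np.arange(glyph.shape[0]), grid)
    cols = np.array_split(np.arange(glyph.shape[1]), grid)
    # Row-major order: 16 values for the default 4x4 grid.
    return np.array([glyph[np.ix_(r, c)].mean() for r in rows for c in cols])

def candidates(query, templates, labels, k=5, ratio=2.5):
    """Keep the k nearest templates, then drop any that lie farther
    than ratio * (distance of the nearest one)."""
    # Assumption: Euclidean distance; the slides do not name the metric.
    dist = np.linalg.norm(np.asarray(templates) - query, axis=1)
    nearest_k = np.argsort(dist)[:k]
    limit = ratio * dist[nearest_k[0]]
    return [(labels[i], float(dist[i])) for i in nearest_k if dist[i] <= limit]
```

With a nearest distance of 1.0, a template at distance 2.0 survives the cut and one at distance 8.0 does not, so the candidate list can be shorter than five.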
Recognition-Cavity Vectors
Cavities are structural features that can discriminate between shapes which are otherwise very similar, like the confusing pair shown above.
For surviving candidates, a binary-valued cavity vector is generated based on the positional information of cavities, as shown (top left).
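One way to picture a positional, binary-valued cavity vector is sketched below. The slides do not define how cavities are detected or how positions are encoded, so everything here is an assumption made for illustration: a cavity is taken to be a background region fully enclosed by ink, and its position is the cell of a hypothetical 2x2 grid that contains its centroid.

```python
from collections import deque
import numpy as np

def cavity_vector(glyph, grid=2):
    """1 in each grid cell that holds the centroid of an enclosed hole.
    Assumption: only closed holes count; open concavities are ignored."""
    g = np.asarray(glyph).astype(bool)
    h, w = g.shape
    padded = np.pad(g, 1, constant_values=False)  # background frame
    seen = padded.copy()                          # ink is never visited

    def flood(start):
        """4-connected flood fill over unvisited background pixels."""
        seen[start] = True
        queue, region = deque([start]), []
        while queue:
            r, c = queue.popleft()
            region.append((r, c))
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if 0 <= nr < h + 2 and 0 <= nc < w + 2 and not seen[nr, nc]:
                    seen[nr, nc] = True
                    queue.append((nr, nc))
        return region

    flood((0, 0))  # marks all background connected to the outside
    vec = np.zeros(grid * grid, dtype=int)
    for r in range(1, h + 1):
        for c in range(1, w + 1):
            if not seen[r, c]:  # background not reachable from outside
                region = np.array(flood((r, c))) - 1  # undo padding offset
                cr, cc = region.mean(axis=0)
                vec[int(cr * grid / h) * grid + int(cc * grid / w)] = 1
    return vec
```

Two glyphs whose zoning vectors are close can still differ in this vector, for example when one has a closed loop in the upper left and the other leaves that loop open.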
Recognition-Controlled Nonlinear Shape Normalization
Left: glyph “chE” scaled (clockwise from top left) (a) linearly, (b) nonlinearly, (c) nonlinearly with parameter = 0.04, (d) nonlinearly with parameter = 0.06.
Nonlinearity control: the parameter is based on the ratio of the standard deviation and the mean of the crossing count statistic used for normalization (follows Lee and Park 1994).
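A rough sketch of crossing-count based nonlinear normalization is given below, in the spirit of density equalization. It is not the controlled scheme from the slides: how the control parameter enters the mapping is not stated there, so the `smooth` term is a hypothetical stand-in, and `variation` computes standard deviation divided by mean as one reading of the ratio mentioned above.

```python
import numpy as np

def crossing_counts(glyph, axis):
    """Background-to-ink transitions: axis=1 gives one count per row,
    axis=0 gives one count per column (ink = 1)."""
    g = np.pad(np.asarray(glyph).astype(int), 1)  # zero frame catches edge strokes
    rising = np.diff(g, axis=axis) == 1
    return rising.sum(axis=axis)[1:-1]            # drop the padding lines

def variation(counts):
    """Standard deviation over mean of the crossing counts (assumed reading)."""
    mean = counts.mean()
    return counts.std() / mean if mean > 0 else 0.0

def resample_index(counts, size, smooth):
    """Source index for each of `size` output lines, equalizing the
    cumulative density counts + smooth (smooth must be > 0)."""
    density = counts + smooth
    cum = np.cumsum(density) / density.sum()
    targets = (np.arange(size) + 0.5) / size
    return np.minimum(np.searchsorted(cum, targets), len(counts) - 1)

def normalize(glyph, size, smooth=1.0):
    """Resample to size x size; lines with more crossings get more room."""
    g = np.asarray(glyph)
    rows = resample_index(crossing_counts(g, axis=1), size, smooth)
    cols = resample_index(crossing_counts(g, axis=0), size, smooth)
    return g[np.ix_(rows, cols)]
```

A larger `smooth` pushes the mapping toward plain linear scaling, while a value near zero lets the crossing counts dominate, which is the kind of trade-off a nonlinearity control has to settle.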
Results and Discussion
• Text Extraction and Segmentation
  • Text segmentation copes with non-Manhattan layouts
  • Accomplishes separation of graphics from text
  • Isolates word and glyph bounding boxes effectively
• Recognition
  • Tested on India Today magazine pages
  • More than 94% accuracy
  • Novel “Cavity” structural features used
Telugu OCR from your browser: visit http://www.lihkin.net/velugu