
Document Processing Methods for Telugu and other SE Asian Scripts

Authors: Atul Negi, VSR Sowri, K Mohan Rao. Presented by: Atul Negi, Dept of CIS, University of Hyderabad (atulcs@uohyd.ernet.in).


Presentation Transcript


  1. Document Processing Methods for Telugu and other SE Asian Scripts Authors: Atul Negi, VSR Sowri, K Mohan Rao Presented by: Atul Negi, Dept of CIS, University of Hyderabad atulcs@uohyd.ernet.in

  2. SE Asian Scripts • Complex arrangement of connected components • Problems: difficulty in identifying word and text-line boundaries; touching characters • Nature of scripts: consonants combined with vowels, and a large number of distinct symbols

  3. SE Asian Scripts (contd.) • SE Asian scripts such as Telugu, Kannada and Sinhala are rounded in nature. • We base our work on the Telugu script, which is orthographically similar to many SE Asian scripts.

  4. About Telugu Script • Symbol classes: Acchulu, vowel sound symbols (16); Hallulu, consonant sound symbols (38); Maatra, vowel sound modifying symbols for Hallulu (16); Voththulu, core consonant sound symbols • Consists of rounded shapes (no vertical strokes) • Characters may be basic vowel/consonant shapes or may be composed by compounding shapes ([NCK 01] shows examples) • The example on the slide shows glyphs in bounding boxes in a word pronounced “Maa-tru-gee-ta”

  5. Some Features of Telugu Script • Telugu is a phonetic script in which each character represents a spoken syllable. • Letters are curved, with no vertical linear strokes and no shirorekha (head line). • 16 vowels, 36 consonants • The Telugu OCR system of [NCK 01] reduced about 10,000 possible symbols to about 400 glyphs • A glyph represents a single connected component, but is NOT a character

  6. More Features of Telugu Script • Orthography is compositional, with vowel sound symbols (maatras) modifying basic consonants. • Pure consonant sounds can be symbolized as vottus and can be combined with other consonant or vowel-modified consonant symbols. • A character is made from a combination of the above • Vottus and maatras can be positioned at locations surrounding the base character

  7. OCR Efforts in Telugu: Brief Review of Recognition Approaches • [RD 77] Rajasekharan and Deekshatulu 1977 • [SSP 95] Sukhswami, Seetharamulu, and Pujari 1995 • [NCK 01] Negi, Chakravarthy and Krishna 2001 • [NCS 02] Negi, Chakravarthy and Suresh Kumar 2002 • [P 02] Pujari et al. 2002 • [C R M N] Chakravarthy et al. 2002 • [VP 02] Vasantha and Patvardhan 2002 • [NKC 03] Negi, Kasinadhuni, Chandrakanth 2003

  8. Focus on Text Line and Character Segmentation Issues • This presentation focuses on two contributions: • Text line extraction: by clustering connected components based on their spatial properties. • Character segmentation: Drop Fall method and White stream method

  9. Text Line Segmentation

  10. Motivation (Text Line Segmentation) • Text-line and text-column extraction are crucial in PLA. • Affects the word and character level analysis. • Helps in logical grouping of individual glyphs into characters. • Simplifies the determination of the logical sequence of characters. • Can be used to reduce the search space of OCR.

  11. Overview of Text Line Segmentation Approaches • The approach in [NKC 03] is very complex, with high time complexity • Pixel Projection Profile Approaches • Simpler, but do not work well with complex layouts, overlapping lines, or skew • Bounding Box Projection Approaches • More efficient, work well in certain conditions • Limited by uneven white spacing • Bounding Box Co-occurrences (this work)

  12. Text Line Segmentation Using BB Projections • Heuristics based on bounding box (BB) projections • The idea is to extract runs of adjacent scan lines with zero BB count between BB peak lines • White space between text lines is broken, uneven and not contiguous because of the vottus and maatras lying between text lines. • Characters from adjacent text lines may touch • Further heuristics estimate the interfering characters from the BB projections, but results are not very good because the estimation is difficult
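
A minimal sketch of the BB projection idea described on this slide, in Python. The representation of a bounding box as an inclusive `(top, bottom)` row range and the function names `bb_projection` and `zero_runs` are assumptions made for illustration; the slides do not specify them. The second call shows how a single vottu-like box lying between two lines removes the zero-count gap, which is the failure mode the slide describes.

```python
def bb_projection(boxes, height):
    """Count, for every scan line, how many bounding boxes cover it."""
    counts = [0] * height
    for top, bottom in boxes:              # inclusive row range of one box
        for y in range(top, bottom + 1):
            counts[y] += 1
    return counts


def zero_runs(counts):
    """Return (start, end) row ranges in which no bounding box is present."""
    runs, start = [], None
    for y, c in enumerate(counts):
        if c == 0 and start is None:
            start = y                      # a white gap begins here
        elif c != 0 and start is not None:
            runs.append((start, y - 1))    # the gap ended on the previous row
            start = None
    if start is not None:
        runs.append((start, len(counts) - 1))
    return runs


# Two clean text lines: a zero-count gap (rows 7..9) separates them.
clean = [(2, 5), (3, 6), (10, 14), (11, 13)]
print(zero_runs(bb_projection(clean, 18)))

# A box spanning rows 6..10 (e.g. a vottu) bridges the gap between the lines.
print(zero_runs(bb_projection(clean + [(6, 10)], 18)))
```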

  13. Co-occurrence “A measure of OVERLAP between different connected components.” • It is based on the spatial relationships of connected components. • It’s symmetric in nature. • Two types: • Horizontal co-occurrence • Vertical co-occurrence • Co-occurrence defines 3 different spatial relationships between components.

  14. Horizontal Co-occurrence • Total Inclusion

  15. Horizontal Co-occurrence • Partial-inclusion

  16. Horizontal Co-occurrence • No Relation
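
The three horizontal co-occurrence relations can be sketched as an interval comparison. This is an illustrative reading, not the authors' definition: it assumes that horizontal co-occurrence compares the row ranges of two bounding boxes, each given as an inclusive `(top, bottom)` pair, and the function name `h_cooccurrence` is hypothetical. The figures that define the relations are not part of this transcript.

```python
def h_cooccurrence(a, b):
    """Classify two boxes by the overlap of their row ranges (top, bottom).

    Returns the relation name and the number of shared rows.
    """
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if overlap <= 0:
        return "no relation", 0
    smaller_height = min(a[1] - a[0], b[1] - b[0]) + 1
    if overlap == smaller_height:
        # The shorter box lies completely inside the rows of the taller one.
        return "total inclusion", overlap
    return "partial inclusion", overlap


print(h_cooccurrence((10, 30), (15, 25)))
print(h_cooccurrence((10, 30), (25, 40)))
print(h_cooccurrence((10, 30), (35, 50)))
```

Because only `min` and `max` of the two ranges are used, swapping the arguments gives the same answer, which matches the slide's remark that co-occurrence is symmetric.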

  17. Vertical Co-occurrence

  18. Vertical Co-occurrence

  19. Text-line extraction using co-occurrence • Text-line extraction problem is formulated as: Identifying all the connected components which belong to the same text-line and obtaining the boundaries of text-lines by considering the bounding boxes of components. • Two major steps: • Computation of horizontal co-occurrence matrix for each pair of components. • Clustering of connected components based on the h-cooccurrence matrix.

  20. Text-line extraction - Clustering • Let P, Q be two CCs in the document image: P < Q; P, Q ∈ C; P ∈ T_k; Q ∈ ? • If h-co-occurrence(P,Q) = total inclusion, add Q • If h-co-occurrence(P,Q) = partial inclusion, add Q (*) • If h-co-occurrence(P,Q) = no relation, check the next component • (*) conditional on the overlap being greater than ½(height) • Post-processing step
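
One possible reading of this clustering rule, as a Python sketch. Several details are assumptions, since the slide leaves them open: boxes are inclusive `(top, bottom)` row ranges, each text line is seeded by the first unassigned component in top-to-bottom order, candidates are compared against the seed only, and "½(height)" is taken to mean half the height of the candidate Q. The post-processing step is not shown.

```python
def relation(a, b):
    """Return ('total' | 'partial' | 'none', shared_rows) for two row ranges."""
    overlap = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if overlap <= 0:
        return "none", 0
    if overlap == min(a[1] - a[0], b[1] - b[0]) + 1:
        return "total", overlap
    return "partial", overlap


def cluster_lines(boxes):
    """Group boxes into text lines; returns lists of indices into `boxes`."""
    order = sorted(range(len(boxes)), key=lambda i: boxes[i])
    assigned, lines = set(), []
    for p in order:
        if p in assigned:
            continue
        line = [p]                          # P seeds a new text line T_k
        assigned.add(p)
        for q in order:
            if q in assigned:
                continue
            kind, overlap = relation(boxes[p], boxes[q])
            q_height = boxes[q][1] - boxes[q][0] + 1
            # Total inclusion: always add. Partial inclusion: add only if the
            # overlap exceeds half of Q's height (assumed meaning of the rule).
            if kind == "total" or (kind == "partial" and overlap > q_height / 2):
                line.append(q)
                assigned.add(q)
        lines.append(sorted(line))
    return lines


boxes = [(10, 30), (12, 28), (24, 34), (40, 60), (45, 55), (36, 50)]
print(cluster_lines(boxes))   # two groups: indices 0-2 and indices 3-5
```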

  21. Text-line extraction - Results

  22. Experimental Results – Hand Written Document Image

  23. Experimental Results – Kannada Document Image

  24. Experimental Results – Tamil Document Image

  25. Character Segmentation

  26. Character Segmentation • An operation that seeks to decompose an image of a sequence of characters into sub-images of individual symbols.

  27. Character Segmentation methods

  28. How can we segment characters? • Successful segmentation mainly involves two steps: • 1. Locating a segmentation point • 2. Generating a segmentation path • Drop Fall methods attempt to do both

  29. Hybrid Drop Fall Method • Segments the characters by following the contour of the image. • Advanced version of Hit and Deflect strategy. • Follows a set of rules that maximizes the chances that it will hit and deflect its way to an accurate path.

  30. Drop Fall builds a path by mimicking an object falling or rolling in between the two characters • There are 8 varieties of Drop Fall methods which differ in directions, starting points and set of rules. • Path generated by a drop fall can be seen in fig given below

  31. Locating the segmentation point • Pixels are scanned row by row until a black boundary pixel is found that has another black boundary pixel to its right, with the two separated by at least one white pixel. • This white pixel is then used as the starting point from which the marble is rolled down
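
The scan described on this slide can be sketched as follows. The image representation (a list of rows with 1 for black and 0 for white) and the function name `find_start` are assumptions; where the white gap is wider than one pixel, this sketch returns the white pixel immediately to the right of the left black run, which is one of several possible choices.

```python
def find_start(img):
    """Return (row, col) of the first white pixel that sits between two black
    pixels on the same row, scanning rows top to bottom; None if there is none.
    """
    for y, row in enumerate(img):
        x, w = 0, len(row)
        while x < w:
            if row[x] == 0:
                x += 1
                continue
            while x < w and row[x] == 1:   # skip the black run on the left
                x += 1
            gap = x                        # first white pixel after that run
            while x < w and row[x] == 0:   # skip the white gap
                x += 1
            if x < w:                      # a black pixel closes the gap
                return (y, gap)
    return None


img = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
]
print(find_start(img))   # (2, 2): row 2 is the first row with black on both sides
```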

  32. Starting point for Drop fall

  33. Incorrect segmentation of touching characters can be seen in the figure on this slide. • An incorrect starting point leads to an incorrect segmentation path.

  34. Drop Fall Path Generation • The algorithm first looks for a white pixel in its surroundings, and cuts through a black pixel only if it cannot find one. • The figure on this slide shows the direction in which the algorithm moves, depending on the current pixel position and its surroundings
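
The movement rules themselves are in a figure that is not part of this transcript, so the sketch below substitutes the rule order commonly described for top-left drop fall: down, down-right, down-left, right, left, and a straight cut downward when all of these are black. This order, the ban on stepping straight back to the previous pixel (added here so the path cannot oscillate), and the name `drop_fall` are assumptions and may differ from the authors' rules. The image is a list of rows with 1 for black and 0 for white.

```python
def drop_fall(img, start):
    """Trace a segmentation path from `start` (row, col) to the bottom row."""
    h, w = len(img), len(img[0])

    def white(y, x):
        return 0 <= x < w and y < h and img[y][x] == 0

    y, x = start
    prev = None
    path = [(y, x)]
    while y < h - 1:
        # Candidate moves in order of preference.
        moves = [(y + 1, x), (y + 1, x + 1), (y + 1, x - 1), (y, x + 1), (y, x - 1)]
        for ny, nx in moves:
            if (ny, nx) != prev and white(ny, nx):
                break
        else:
            ny, nx = y + 1, x          # no white neighbour: cut through black
        prev = (y, x)
        y, x = ny, nx
        path.append((y, x))
    return path


img = [
    [0, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
]
print(drop_fall(img, (0, 2)))
```

In this toy image the path slides into the valley between the two strokes and then cuts straight down through the joined part at column 3.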

  35. Top Left Drop Fall • Input: Image • 1. Binarize the input image • 2. Locate the segmentation point (x, y) using drop fall • 3. Generate the segmentation path using the rules specified in the previous slide • Output: Segmented image

  36. Characters segmented using top left Drop Fall: (standard drop fall)

  37. Top left drop fall fails to segment touching characters when the first character contains a Talakattu or has a concave shape. • E.g.: incorrect segmentation of touching characters using top left drop fall

  38. Top Right Drop Fall • Identical to top left drop fall except that it starts from the top right of the connected component. • Input: Touching character image • Binarise the input image • Flip the image vertically (that is, about its vertical axis, mirroring left to right) • Locate the segmentation point • Generate the segmentation path • Re-flip the image and obtain the segmented image.

  39. Top Right Drop Fall

  40. Bottom Left Drop Fall • Identical to standard drop fall except that it starts from the bottom left of the connected component • Input: Touching characters • Binarise the input image • Flip the image horizontally (that is, about its horizontal axis, mirroring top to bottom) • Locate the segmentation point • Generate the segmentation path • Re-flip the image horizontally and obtain the segmented image.
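
The flip-based variants can be sketched as one wrapper around any top-left segmenter. This is an illustration under assumptions: `segment` stands for a hypothetical function that takes a binary image (list of rows) and returns a path of `(row, col)` points, and the wrapper maps the path back to the original coordinates instead of re-flipping the image. Mirroring left to right gives a top-right start, mirroring top to bottom gives a bottom-left start, and both together give a bottom-right start.

```python
def flipped_variant(segment, img, left_right=False, top_bottom=False):
    """Run a top-left segmenter on a mirrored copy of `img` and map the
    resulting path back into the coordinates of the original image.
    """
    h, w = len(img), len(img[0])
    work = [row[::-1] if left_right else row[:] for row in img]
    if top_bottom:
        work = work[::-1]
    path = segment(work)
    return [
        (h - 1 - y if top_bottom else y, w - 1 - x if left_right else x)
        for y, x in path
    ]


# Stand-in segmenter for demonstration only: walks down the first column.
def first_column(img):
    return [(y, 0) for y in range(len(img))]


img = [[0, 0, 0, 1], [0, 0, 1, 1], [0, 1, 1, 1]]
print(flipped_variant(first_column, img, left_right=True))    # top-right start
print(flipped_variant(first_column, img, top_bottom=True))    # bottom-left start
```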

  41. Bottom Left Drop Fall Method • Touching Characters segmented using Bottom left drop fall

  42. Bottom left Drop Fall • Fails to segment the touching characters when the bottom half of the first character consists of curves or grooves

  43. Bottom right drop fall Method

  44. Characters segmented using Bottom right drop fall • Cases where Bottom right drop fall fails to segment the touching characters

  45. Advanced Drop Fall methods • Similar to the Drop Fall method in locating the segmentation point, but they follow a different set of rules while generating the segmentation path. • While generating the path, the algorithm looks for white pixels; when it cannot find a white pixel it moves onto a black pixel, and once it is on black pixels it looks only for black pixels.

  46. Difference between drop fall and Advanced drop fall segmentation paths

  47. Advanced Top left Drop Fall • Characters segmented using Advanced top left drop fall • Incorrectly segmented characters using Advanced top left drop fall

  48. Advanced Top right drop fall • Identical to Top right drop fall except the segmentation path generated is different. • Characters segmented using Advanced top right drop fall

  49. Advanced bottom left drop fall • Characters segmented using Advanced bottom left drop fall • Incorrectly segmented characters using Advanced bottom left drop fall

  50. Advanced bottom right drop fall (ABRD) • Characters segmented using ABRD • Incorrectly segmented characters using ABRD
