
Layout Analysis for Document Image Segmentation and Classification

This lecture introduces the concepts of structural and functional layout analysis for document images, including segmentation and classification. It covers techniques for finding structural blocks, handling noise and variations in font, and interactive zoning. The lecture also discusses the challenges of automatic math zone identification and statistics for skew and spacing. Examples of document analysis using Docstrum and area Voronoi diagrams are presented.


Presentation Transcript


  1. Document Image Analysis, Lecture 21: Introduction to Layout. Richard J. Fateman, University of California – Berkeley; Henry S. Baird, Xerox Palo Alto Research Center. UC Berkeley CS294-9 Fall 2000

  2. Page layout analysis • Structural (Physical, Geometric) Layout Analysis [Segmentation] • Functional (Syntactic, Logical) Layout Analysis [Classification] • Read-order determination

  3. Structural • Isolation of columns, paragraphs, lines, words, tables, figures. Maybe letters. • Without some layout analysis, much of the previous work would be impossible! • Without layout analysis, what is the sequence of words in a multi-column format?

  4. Functional • Typically domain dependent • May require merging or splitting of syntactic components • Encoding into ODA (Office Document Architecture) or SGML (a DTD describes components like section, title, ...)

  5. Functional Components • First page of a technical article may have • Title • Author • Abstract, body/column1 body/column2 footnotes • Pagination • Journal name/volume/date… • Business letter might have • Sender • Date • Logo • Recipient • Body • Signature

  6. Finding structural blocks

  7. Common Approaches • Top Down analysis • Horizontal and vertical profiles • Recursive: columns, paragraphs/lines/words • As illustrated earlier • Bottom Up analysis • Use adjacency based on • Pixels / morphology of dilation (millions) • RLE/ merge lines (thousands) • Connected Components (hundreds) • Look at the background (shape-directed covers) • Also, human hints.
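
The top-down branch above (profiles, then recursive cuts) can be sketched roughly as follows. This is a minimal illustration, not the lecture's code: the page is assumed to be a list of rows of 0/1 pixels, and the names `profile`, `ink_spans`, `xy_cut` and the gap parameters are hypothetical.

```python
def profile(img, axis):
    """Ink count per row (axis=0) or per column (axis=1) of a 0/1 image."""
    if axis == 0:
        return [sum(row) for row in img]
    return [sum(col) for col in zip(*img)]


def ink_spans(prof, min_gap):
    """Return [start, end) spans of ink separated by at least min_gap empty slots."""
    spans, start, gap = [], None, 0
    for i, v in enumerate(prof):
        if v > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                spans.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        spans.append((start, len(prof) - gap))
    return spans


def xy_cut(img, col_gap=2, row_gap=1):
    """Cut into columns first, then cut each column into lines.

    Blocks are (row0, row1, col0, col1), listed column by column,
    which doubles as a naive read order.
    """
    blocks = []
    for c0, c1 in ink_spans(profile(img, 1), col_gap):
        column = [row[c0:c1] for row in img]
        for r0, r1 in ink_spans(profile(column, 0), row_gap):
            blocks.append((r0, r1, c0, c1))
    return blocks


page = [
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 1, 1],
]
blocks = xy_cut(page)
```

Only two levels of recursion are shown (columns, then lines); words would be a third cut with a smaller horizontal gap.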

  8. Standard images…the Scanned Input

  9. Smear character boxes

  10. Smear words to get lines

  11. Smear lines to get paragraphs
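
The smearing steps on the last three slides can be approximated with a simple run-length fill. The sketch below is a simplified variant, assuming a 0/1 image as a list of rows; `smear_row`, `smear`, `h_gap` and `v_gap` are illustrative names, and applying the horizontal pass before the vertical one is a choice made here, not something the slides specify.

```python
def smear_row(row, max_gap):
    """Fill white runs of length <= max_gap that lie between two ink pixels."""
    out = list(row)
    last_ink = None
    for i, v in enumerate(row):
        if v:
            if last_ink is not None and i - last_ink - 1 <= max_gap:
                for j in range(last_ink + 1, i):
                    out[j] = 1
            last_ink = i
    return out


def smear(img, h_gap=0, v_gap=0):
    """Smear horizontally, then (optionally) vertically."""
    out = [smear_row(r, h_gap) for r in img]
    if v_gap > 0:
        cols = [smear_row(c, v_gap) for c in zip(*out)]
        out = [list(r) for r in zip(*cols)]
    return out


# One text line: four "characters" forming two "words".
line = [[1, 0, 1, 0, 0, 0, 1, 0, 1]]
words = smear(line, h_gap=1)       # character boxes merge into word boxes
whole_line = smear(line, h_gap=3)  # word boxes merge into one line box
```

A larger `h_gap` merges characters into words and then words into lines; `v_gap` plays the same role for lines and paragraphs.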

  12. Issues: • Sensitivity to noise. Solutions: • Clean up via kfill or similar filtering, ruthlessly • Divide the page (artificially) and keep the noise from affecting the document globally • Slanted lines. Solution(s): • Deskew (since it is not too hard(?)) • Use nearest neighbors “docstrum” • Concave regions (text flow around a box). Solution(?) look at background • Variation in font, spacing can throw off analysis • Allow for local analysis

  13. Interactive semi-automatic zoning (RJF)

  14. Zoom in

  15. Scroll around

  16. View individual pixels

  17. Semi… Turn up the noise filter until we start to kill some of the punctuation. How? As we turn up the threshold, the number of connected components drops, then reaches a stable plateau after the noise is gone, and then drops again as we remove punctuation, the dots above the “i” etc.
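
One way to automate the plateau search described above is sketched below. It assumes the noise filter simply discards connected components whose pixel area is below a threshold, which is a stand-in for whatever filter the interactive tool used; `surviving`, `first_plateau` and `min_len` are hypothetical names, and the component sizes are made-up illustrative values.

```python
def surviving(sizes, threshold):
    """Number of components whose area is at least `threshold`."""
    return sum(1 for s in sizes if s >= threshold)


def first_plateau(sizes, thresholds, min_len=3):
    """Return (lo, hi, count) for the first run of >= min_len thresholds
    over which the component count does not change, or None."""
    counts = [surviving(sizes, t) for t in thresholds]
    start = 0
    for i in range(1, len(counts) + 1):
        if i == len(counts) or counts[i] != counts[start]:
            if i - start >= min_len:
                return thresholds[start], thresholds[i - 1], counts[start]
            start = i
    return None


# Illustrative areas: specks (1-2 px), punctuation (9-10 px), characters (38-50 px).
sizes = [1, 1, 2, 1, 2, 9, 10, 9, 40, 45, 50, 42, 38]
plateau = first_plateau(sizes, list(range(1, 31)))
```

Any threshold inside the returned range removes the specks while keeping the punctuation; going past its upper end starts to remove dots and commas.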

  18. auto… Turn the horizontal smear knob until the number of components drops suddenly from about 3000 to about 600. Character boxes have been merged into word boxes. Turn the horizontal smear knob further until the number of components drops from about 600 to about 100. Word boxes have become line boxes.

  19. matic… Tweak the vertical smear knob. Lines become paragraphs. (Turn further, and paragraphs become columns.)
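
The "turn the knob until the count drops suddenly" idea can be illustrated in one dimension. The sketch assumes that a smear of width w closes every gap of at most w pixels, so the number of boxes on a line is one plus the number of wider gaps; the function names and the gap values are hypothetical.

```python
def boxes_after_smear(gaps, w):
    """Boxes left on one line after closing all gaps of width <= w."""
    return 1 + sum(1 for g in gaps if g > w)


def knob_with_biggest_drop(gaps, max_w):
    """Smallest-to-largest sweep; return (w, box count) at the sharpest drop."""
    counts = [boxes_after_smear(gaps, w) for w in range(max_w + 1)]
    drops = [counts[w - 1] - counts[w] for w in range(1, max_w + 1)]
    w = 1 + drops.index(max(drops))
    return w, counts[w]


# Gaps between 11 character boxes: 2 px inside words, 8-9 px between words.
gaps = [2, 2, 2, 8, 2, 2, 9, 2, 2, 2]
knob, n_boxes = knob_with_biggest_drop(gaps, max_w=10)
```

On a real page the same sweep would be run over connected-component counts of the smeared image rather than over a list of gaps.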

  20. Specify read order

  21. Interactive functional tagging: mark subject/author/etc.? Here we attempt automatic identification of math… Automatic math zone. This is a challenge because the zone is in two parts, containing the math … f(p)=F(p)

  22. Docstrum / L. O’Gorman. 5 nearest neighbors (ogorman93)

  23. Example of “spectrum”. Each point represents the distance and angle between a connected component (cc) and one of its nearest neighbors. N^2, but not so bad.
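
A brute-force version of the nearest-neighbor step might look like this. It is a sketch under assumptions: components are reduced to centroids, all pairwise distances are computed (the N^2 cost mentioned above), and angles are folded into [-90, 90) because a neighbor pair has no direction. `fold`, `docstrum` and `k` are names chosen here, not taken from O'Gorman's paper.

```python
import math


def fold(angle_deg):
    """Map an angle to [-90, 90): a neighbour pair has no direction."""
    return (angle_deg + 90.0) % 180.0 - 90.0


def docstrum(centroids, k=5):
    """(distance, angle) to the k nearest neighbours of every centroid."""
    points = []
    for i, (x, y) in enumerate(centroids):
        neighbours = sorted(
            (math.hypot(u - x, v - y),
             fold(math.degrees(math.atan2(v - y, u - x))))
            for j, (u, v) in enumerate(centroids) if j != i
        )
        points.extend(neighbours[:k])
    return points


# Four character centroids on one slightly slanted text line.
centroids = [(0, 0), (10, 1), (20, 2), (30, 3)]
pairs = docstrum(centroids, k=1)
```

Plotting `pairs` as a scatter of distance against angle gives the kind of "spectrum" shown on the slide.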

  24. Statistics for skew and spacing. Set the knobs?
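
A rough way to read skew and spacing off the nearest-neighbor data is to take the peak of the angle histogram and then the median distance of pairs near that angle. The sketch assumes its input is a list of (distance, angle) pairs with angles already folded into [-90, 90); the bin width, the `within` tolerance and the sample pairs are arbitrary illustrative choices.

```python
from collections import Counter


def estimate_skew_and_spacing(pairs, angle_bin=2.0, within=15.0):
    """Skew = centre of the fullest angle bin (degrees);
    spacing = median distance of pairs within `within` degrees of the skew."""
    bins = Counter(round(a / angle_bin) for _, a in pairs)
    skew = bins.most_common(1)[0][0] * angle_bin
    along = sorted(d for d, a in pairs if abs(a - skew) <= within)
    spacing = along[len(along) // 2]
    return skew, spacing


pairs = [
    (10.0, 4.1), (10.2, 3.9), (9.8, 4.3), (10.1, 4.0), (20.0, 4.2),  # along the line
    (30.0, -86.0), (31.0, -85.5),                                    # to the next line
]
skew, spacing = estimate_skew_and_spacing(pairs)
```

The same statistics could then be used to set the smear knobs automatically instead of by hand.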

  25. Extract Lines, group to paragraphs • Statistically close enough horizontally to be words, then lines • Statistically close enough and parallel enough and the same length as… group two lines into the same text block. • (arguably saving time by not deskewing; dealing with non-constant skew) Example follows.

  26. Sections with different skew. 6 business cards, nearest-neighbor vectors

  27. Extracted text lines, blocks. Useful? General?

  28. Does Docstrum work? • Great on this page of business cards • An attempt to remove the assumption of most previous work that layout was “Manhattan” • Largely skew-independent, but • Useless if characters are not separated

  29. Area Voronoi Diagram (Kise) • Start with connected components • Compute area ratios from pairs of neighboring connected components • Adaptively compute thresholds of intercharacter and interline gaps

  30. Point Voronoi diagram

  31. Area Voronoi Diagram • Define the distance d between a point p and a figure g to be the minimum distance of p from any point in g
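
Written out, the distance defined above and the region it induces look as follows. The symbols q, g_i and V are notation introduced here for illustration, and the Euclidean norm is assumed.

```latex
d(p, g) = \min_{q \in g} \lVert p - q \rVert,
\qquad
V(g_i) = \{\, p \;:\; d(p, g_i) \le d(p, g_j) \ \text{for all } j \ne i \,\}
```

The boundaries between the regions V(g_i) are the edges of the area Voronoi diagram.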

  32. Computing an approximate area Voronoi diagram • Compute the point Voronoi diagram from a sampled set of points on the boundary of each figure. • Delete Voronoi edges generated from point-to-point on the same figure. Advantage: we are not abstracting shapes into points (centroids) or into rectangles.
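
The two steps above can be imitated on a pixel grid without a computational-geometry library. This is a swapped-in discrete approximation, not the exact construction on the slide: every grid cell is labelled with the figure that owns its nearest boundary sample, and an edge is kept only where adjacent cells belong to different figures, which is the grid analogue of deleting same-figure edges. All names and the sample data are hypothetical.

```python
def nearest_figure(p, samples):
    """samples: (x, y, figure_id) boundary points; id of the closest one."""
    px, py = p
    return min(samples, key=lambda s: (s[0] - px) ** 2 + (s[1] - py) ** 2)[2]


def area_voronoi_edges(samples, width, height):
    """Label each grid cell by nearest figure; return (labels, edges).

    An edge is a pair of 4-adjacent cells with different figure labels.
    Cells whose nearest samples differ but lie on the same figure get the
    same label, so same-figure edges never appear.
    """
    label = [[nearest_figure((x, y), samples) for x in range(width)]
             for y in range(height)]
    edges = set()
    for y in range(height):
        for x in range(width):
            for dx, dy in ((1, 0), (0, 1)):
                u, v = x + dx, y + dy
                if u < width and v < height and label[y][x] != label[v][u]:
                    edges.add(((x, y), (u, v)))
    return label, edges


# Two figures, each sampled at two boundary points.
samples = [(0, 0, 0), (0, 2, 0), (6, 0, 1), (6, 2, 1)]
label, edges = area_voronoi_edges(samples, width=7, height=3)
```

The brute-force nearest-sample search is quadratic in practice; a real implementation would compute the point Voronoi diagram directly, as the slide describes.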

  33. Example. The points don’t show here… All we have to do now is decide which of the (many) Voronoi edges are appropriate for segmentation.

  34. Features for selecting edges • Delete edges in narrow spaces, because they are merely separating characters or words. • Delete edges which divide two components of about equal area. • Delete edges that don’t form loops. Characters in the same font but in different columns will be in different segments. Characters, even if they are close to a (large) halftone figure, will be separated from the figure. Find the threshold based on the frequency of distances.

  35. Example

  36. Area Voronoi diagram

  37. After deleting edges

  38. Imposing loop conditions, pasting back the text (etc).

  39. Errors: fragmentation, over-merging

  40. Impressive

  41. Reminder: Without layout analysis • Reading across columns • Misplacing captions • Misplacing footnotes • Misunderstanding page numbers (which should be REMOVED in the reformatting process) • Need extraction of biblio data: title, author, abstract, keywords • Nearly every subsequent step is compromised by lack of context.

  42. A Diversion: Separating Math from Text • Why separate math from text? • Types of mathematics encountered • Previous Work • Two approaches • post-processing commercial OCR • character-based (details!) • Errors and their correction • Ambiguities

  43. Why separate math/text/images/.. • OCR programs do not work for math: [equation image] becomes, in TextBridge, [OCR output image] • Designation as a “picture” is only a partial solution

  44. Mathematics on a Page. Inline math is harder to pick out because it may look like italic text

  45. Previous Work • Isolation by hand (most math parser papers) • Texture/ statistics based heuristics • useful for display math “paragraphs” • not useful for in-line math • Character based pseudo-parsing (but without font information or true parsing feedback) • Incomplete

  46. Proposal: Post-Processing of OCR • Start with commercial best-effort recognition • Reprocess the intermediate data structure (e.g. for TextBridge, the XDOC file) • Accept recognition of text zones with high recognition certainty. (Lines with no errors surrounded by lines with no errors are considered solved)
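
The acceptance rule in the last bullet is easy to state in code. The sketch assumes each OCR line has already been reduced to a count of suspected errors (how that count is derived from the XDOC file is not covered here), and it treats the first and last lines as needing only one clean neighbor; both are assumptions, as is the name `solved_lines`.

```python
def solved_lines(error_counts):
    """True for each line that has no errors and whose neighbours have none."""
    n = len(error_counts)
    clean = [e == 0 for e in error_counts]
    return [
        clean[i]
        and (i == 0 or clean[i - 1])
        and (i == n - 1 or clean[i + 1])
        for i in range(n)
    ]


# Line 3 has two suspected errors, so lines 2-4 stay "uncertain".
status = solved_lines([0, 0, 0, 2, 0, 0])
```

Lines marked False, together with their clean neighbors, form the uncertain regions that get re-examined as possible math.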

  47. Separate uncertain areas • Re-consider “the rest of the image” as potential mathematics zones: uncertain regions (including nearby “certain” characters/lines) • Isolate characters, identify fonts, etc. • Play out heuristic rules for separating text and math zones. • Consider eradicating math and re-submitting text; separately recognizing math and reinserting in XDOC

  48. Alternatively, Starting from our own naïve OCR • Connected component recognition • Separate characters by initial classification • Repeatedly re-examine via rules • Determine text zones, remove math / feed remainder to commercial OCR • How best to blank-out math? XXX • Most likely human interaction remains

  49. Two bags: Math vs Text • Initially Math • + - = / Greek, scientific symbols, 0-9, italics, bold, (), [], sin, cos, tan, dots, commas, decimal points • Initially Text • Roman Letters, junk
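
The initial sort into the two bags could be written as a lookup like the one below. It is only a sketch of the first pass: tokens are assumed to arrive as recognized strings with optional font flags, "scientific symbols" are not modelled, and `initial_bag`, `MATH_CHARS` and `MATH_WORDS` are names invented here. Later rule-based passes would move items between bags.

```python
MATH_CHARS = set("+-=/()[].,0123456789")
MATH_WORDS = {"sin", "cos", "tan"}


def is_greek(ch):
    """True for characters in the basic Greek block (Alpha .. omega)."""
    return "\u0391" <= ch <= "\u03c9"


def initial_bag(token, italic=False, bold=False):
    """First-pass guess: 'math' or 'text'. Empty tokens count as junk (text)."""
    if not token:
        return "text"
    if italic or bold or token in MATH_WORDS:
        return "math"
    if all(ch in MATH_CHARS or is_greek(ch) for ch in token):
        return "math"
    return "text"


bags = [initial_bag(t) for t in ["3.14", "sin", "the", "\u03c0", "="]]
```

Italic words such as emphasized text start in the math bag under this rule, which is one source of the errors that later passes have to correct.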

  50. Sample Text Bag
