
Word Spotting: Indexing Handwritten Manuscripts

Word Spotting: Indexing Handwritten Manuscripts. Michael D. Fecina IST 497/597 22-JAN-02. History. OCR was used in the past for indexing machine typed letters and documents OCR does not work well with handwritten documents because of: noise (ink marks, perhaps)


Presentation Transcript


  1. Word Spotting: Indexing Handwritten Manuscripts Michael D. Fecina IST 497/597 22-JAN-02

  2. History • OCR was used in the past for indexing machine typed letters and documents • OCR does not work well with handwritten documents because of: • noise (ink marks, perhaps) • variations among writing styles • inconsistencies in formation of letters/words

  3. More history … • OCR segments a page into words, then breaks each word into its characters • OCR is successful with clean machine fonts against a clean background • Character segmentation is too difficult with handwritten documents

  4. Motivation • To efficiently index historical handwritten documents • To simplify reading documents where the handwriting is particularly hard to read • Eventually, just as with images, it is hoped that automatic indexing of documents will be available

  5. Specific important documents to be indexed … • W.E.B. Du Bois • Washington and other Presidents’ writings located in the Library of Congress • Over 6,400 scanned 8-bit grey level images of Washington’s manuscripts • Serve as valuable resources for scholars as well as others who wish to consult original source material

  6. What is word spotting? • A method by which handwritten material can be indexed • Assumes documents are written by the same person • Assumes that variations between occurrences of the same word are minimal • This second assumption does not always hold (a significant contributor to error)

  7. More about word spotting • Avoid recognizing the words • Use word images • What is difficult about it?? • Segmenting the page into words • Ascenders, descenders • Noise, inconsistencies • Matching the words effectively

  8. Methodology • Obtain grey level image of document. • Reduce image by ½ using Gaussian filtering and sub-sampling. • Image is then binarized by thresholding (characters = white, background = black). • Binary image is segmented into words (word images).
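The first three steps above can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the kernel width (`sigma`, `radius`), the `threshold` value, and all function names are assumptions, and ink is assumed to be darker than the page in the scanned image.

```python
import numpy as np

def gaussian_kernel(sigma=1.0, radius=2):
    """1-D Gaussian kernel, normalized to sum to 1 (sigma/radius are assumed values)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(img, sigma=1.0, radius=2):
    """Separable Gaussian blur; borders are handled by replicating edge pixels."""
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(img.astype(float), radius, mode="edge")
    # Filter along rows first, then along columns.
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def preprocess(gray, threshold=128):
    """Blur, keep every second pixel in each direction, then threshold.

    Returns a boolean image in which True marks ink ("white" in the
    slide's convention) and False marks background ("black").
    """
    half = smooth(gray)[::2, ::2]
    return half < threshold
```

Blurring before sub-sampling avoids aliasing thin pen strokes when the image is halved.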

  9. Methodology • Each word image is tested against every other word image, but candidate pairs are first pruned based on image area and aspect ratio. • Matching produces equivalence classes. • Top n equivalence classes are chosen. Top s classes are removed and noted as stop words. Then, the user provides the ASCII equivalent for the remaining top m classes.
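The pruning step could look roughly like the sketch below. The slides do not give the actual criterion, so the relative tolerances `area_tol` and `aspect_tol` and the function name `candidate_pairs` are hypothetical.

```python
def candidate_pairs(sizes, area_tol=0.25, aspect_tol=0.25):
    """Return index pairs (i, j) of word images that are worth matching.

    sizes: list of (height, width) for each word image.
    A pair is kept only if both area and aspect ratio differ by at most
    the given relative tolerance (assumed values, not from the slides).
    """
    pairs = []
    for i in range(len(sizes)):
        hi, wi = sizes[i]
        for j in range(i + 1, len(sizes)):
            hj, wj = sizes[j]
            area_i, area_j = hi * wi, hj * wj
            asp_i, asp_j = wi / hi, wj / hj
            similar_area = abs(area_i - area_j) <= area_tol * max(area_i, area_j)
            similar_aspect = abs(asp_i - asp_j) <= aspect_tol * max(asp_i, asp_j)
            if similar_area and similar_aspect:
                pairs.append((i, j))
    return pairs
```

For example, two word images of sizes (20, 100) and (21, 98) survive pruning, while (20, 100) and (20, 40) do not, so the expensive matcher never sees the second pair.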

  10. Details of Word Segmentation • Spacing between characters is smaller than spacing between words • If two white pixels are separated by less than a certain distance k, the intermediate pixels are made white • Done in both the horizontal and vertical directions so that descenders are captured
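A minimal sketch of the gap-filling rule above, assuming a boolean image where True means white (ink). The function names and the choice to apply the vertical pass to the result of the horizontal pass are assumptions; the slides do not say how the two directions are combined.

```python
import numpy as np

def fill_gaps_1d(line, k):
    """Make white every pixel lying between two white pixels less than k apart."""
    out = line.copy()
    on = np.flatnonzero(line)          # positions of white pixels
    for a, b in zip(on[:-1], on[1:]):  # consecutive white pixels
        if b - a < k:
            out[a:b] = True
    return out

def smear(binary, k_horizontal, k_vertical):
    """Apply the gap-filling rule along rows, then along columns."""
    out = np.array([fill_gaps_1d(row, k_horizontal) for row in binary])
    out = np.array([fill_gaps_1d(col, k_vertical) for col in out.T]).T
    return out
```

After smearing, each connected white region can be treated as one word image; choosing k between the typical character gap and the typical word gap is what keeps letters together and words apart.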

  11. Word segmentation … • Errors do occur using this algorithm (e.g. the dot over an i or j) • However, a minimum length is required. This prevents the dots of i and j from becoming separate word images • If large gaps are left in some instances of a word but not in others, those instances are segmented differently and treated as different words

  12. Senior Document

  13. Segmented Senior Document

  14. Two primary algorithms used for word matching: • EDM (Euclidean Distance Mapping (D. 1980)) • Fast, but assumes that no distortions have occurred except for relative translation • Does well matching words with relatively low variation relative to the template • SLH (Scott and Longuet-Higgins (1991)) • Assumes an affine transformation between the words • Slow, computationally expensive in current implementations

  15. Matching with EDM • Aligning – vertical alignment by baseline, horizontal by coinciding left sides. (thus vertical alignment > horizontal alignment) • XOR image is computed – XOR corresponding pixels to produce the difference between the images • The XOR image alone is not a good measure of image difference, since it gives equal weight to isolated pixels and blobs …
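The alignment and XOR steps above can be sketched as follows. As a simplifying assumption, the bottom row of each word image is treated as its baseline, which is not true for words with descenders; the function name is hypothetical.

```python
import numpy as np

def align_and_xor(a, b):
    """XOR two boolean word images after aligning left edges and bottom rows.

    Assumption: the bottom row of each image stands in for the baseline.
    The smaller image is padded with background (False) to a common size.
    """
    h = max(a.shape[0], b.shape[0])
    w = max(a.shape[1], b.shape[1])

    def place(img):
        canvas = np.zeros((h, w), dtype=bool)
        # Bottom-left placement: rows end at the canvas bottom, columns start at 0.
        canvas[h - img.shape[0]:, :img.shape[1]] = img
        return canvas

    return np.logical_xor(place(a), place(b))
```

True pixels in the result are those present in exactly one of the two images. Counting them would weight a stray pixel the same as a pixel inside a large mismatched region, which is the weakness the slide points out.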

  16. XOR for Lloyd What’s in each one, but not both, of the images …

  17. EDM Step • EDM is computed by assigning to each white pixel in the image its minimum distance to a black pixel • A white pixel inside a blob will get a larger distance than an isolated white pixel • An error measure (E_EDM) can now be calculated by summing the distance values over all pixels
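A brute-force sketch of the distance map and error measure described above, intended only for small images. Surrounding the image with a one-pixel black border (so that every white pixel has some black pixel to measure against) is an assumption, as are the function names. For real images, a library distance transform such as `scipy.ndimage.distance_transform_edt` would be the practical choice.

```python
import numpy as np

def distance_map(xor_img):
    """Give each white (True) pixel its Euclidean distance to the nearest
    black (False) pixel; black pixels get 0."""
    # Assumption: the area outside the image counts as black.
    padded = np.pad(xor_img.astype(bool), 1, constant_values=False)
    white = np.argwhere(padded)
    black = np.argwhere(~padded)
    dist = np.zeros(padded.shape)
    if len(white):
        # All white-to-black offsets, then the minimum distance per white pixel.
        diff = white[:, None, :] - black[None, :, :]
        nearest = np.sqrt((diff ** 2).sum(axis=2)).min(axis=1)
        dist[white[:, 0], white[:, 1]] = nearest
    return dist[1:-1, 1:-1]

def edm_error(xor_img):
    """Sum of the distance values over all pixels."""
    return distance_map(xor_img).sum()
```

With this measure a single isolated white pixel contributes 1, while the centre of a 3 x 3 white blob contributes 2, so blobs count for more than the same number of scattered pixels.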

  18. Forming Blobs using EDM • The distance between every white pixel and the nearest black pixel is computed • Pixels whose distance is below a threshold are assumed to be noise.

  19. Problems with EDM • EDM does not discriminate well between good and bad matches • Fails when there is significant distortion in the words • Need for matching algorithm that models some variation -> SLH

  20. SLH Matching technique • Affine transformation allows for scaling and shear deformations in both directions • Much more accurate than the Euclidean Distance Mapping technique • Computationally slow and expensive because the SVD (Singular Value Decomposition) must be computed for a large matrix
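The SVD step mentioned above comes from the Scott and Longuet-Higgins feature-association algorithm, sketched below for two small point sets. How feature points are sampled from the word images, how the affine transformation is then fitted, and how the match error is computed are not given in the slides and are not shown; `sigma` and the function name are assumptions.

```python
import numpy as np

def slh_correspondences(pts_a, pts_b, sigma=2.0):
    """Pair up points of two sets with the Scott and Longuet-Higgins method.

    pts_a, pts_b: sequences of (x, y) coordinates.
    Returns a list of (i, j) index pairs.
    """
    a = np.asarray(pts_a, dtype=float)
    b = np.asarray(pts_b, dtype=float)
    # Proximity matrix: Gaussian-weighted squared distance between every pair.
    r2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    G = np.exp(-r2 / (2.0 * sigma ** 2))
    # Replace the singular values of G by ones.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    P = U @ Vt
    # Accept (i, j) when P[i, j] is the largest entry in both its row and column.
    matches = []
    for i in range(P.shape[0]):
        j = int(np.argmax(P[i]))
        if int(np.argmax(P[:, j])) == i:
            matches.append((i, j))
    return matches
```

The proximity matrix has one row and one column per feature point, so word images with many points lead to the large SVD that makes the method slow.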

  21. Differences (+/-) • EDM does not account for any distortions and thus performs poorly when handwriting is bad • SLH almost always produces the correct rankings even if the handwriting is bad • Two areas need to be improved with both: • Speed and word validity discrimination

  22. Tests . . . • Two documents, “Senior” and “Hudson”, were compared using both matching algorithms • The statistical information for both documents is as follows:

  23. Test Information • Since the SLH algorithm is slow but more accurate than the EDM, EDM was applied first • A cut off threshold was used to limit the number of classes (words) displayed, and remained constant in both tests

  24. Senior Document - classes stop words

  25. EDM on Senior • The EDM algorithm performed quite well on Senior Document • Average precision of 78% • Since the handwriting is good, this performance was expected • Remember that EDM does not account for much variation in word images

  26. EDM matches for Lloyd 50% correct

  27. EDM problems . . . • The algorithm performs poorly in that it cannot discriminate well between valid and invalid words.

  28. The Hudson Document

  29. EDM on Hudson • Average precision was 57.9%, much lower than that of the Senior Document (78%) • Difference in precision attributed to the handwriting • Difficult to read even for humans looking at greyscale images at 300 dpi

  30. Problems with EDM/Hudson

  31. SLH matches for Lloyd 62.5 % correct 3/8 incorrect

  32. SLH on Senior • Proved to be very accurate, yet as mentioned before, slow … • Average precision of SLH was 86.3%, compared to 78.7% for EDM • SLH recorded the word rankings correctly, and also showed a much greater discrimination in match error

  33. SLH on Hudson • Very difficult because of writing, but ranking proved to be much better than EDM • Performance on templates like “they” was good – probably because they are simple, repetitive words • Correct ranking for the word “Standard”

  34. “Standard” template with SLH

  35. Current matching techniques • SLH is more than reasonably accurate, but slow in current implementations. References report this will change ... • Require matching every word against every other word • O(N^2) • N = 220 x 6400 ~ 1.4 x 10^6 words, so N^2 ~ 10^12 matches !!!
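The order of magnitude on this slide can be checked with a short calculation. Reading 220 as an approximate number of words per page and 6,400 as the number of scanned pages is an assumption; the slide does not label the two factors.

```python
words_per_page = 220   # assumed meaning of "220"
pages = 6400           # number of scanned Washington images
n_words = words_per_page * pages

all_pairs = n_words ** 2                       # every word against every word
unordered_pairs = n_words * (n_words - 1) // 2  # each pair counted once

print(n_words)          # 1408000
print(all_pairs)        # 1982464000000
print(unordered_pairs)
```

Either way of counting gives on the order of 10^12 comparisons, which is why pruning and faster matchers matter.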

  36. Main problems with wordspotting • Handwriting style

  37. Main problems with wordspotting • Word scaling and connection of ascenders, descenders

  38. Main problems with wordspotting • Skew and Noise

  39. Recent Work • Fixed bugs and some problems with the algorithm • Recently segmented the 6400 scanned images of George Washington’s documents successfully • It’s impractical (too labor intensive) to compute segmentation statistics on the entire collection.

  40. Future Work • Continue working on new methods to match words effectively and efficiently • Possible ideas for better matching techniques include: • Combining multiple word features • Language probability for automatic indexing • Integrate into indexing scheme (back of book index)

  41. Conclusion • EDM works reasonably well for matching words, but SLH is better since it accounts for variations • SLH pays the price – computationally expensive • Future work is needed; but progress thus far is very encouraging

  42. Questions? Thanks for listening to me blob about Wordspotting. Any Comments/Questions?
