Word Spotting: Indexing Handwritten Manuscripts Michael D. Fecina IST 497/597 22-JAN-02
History • OCR was used in the past for indexing machine-typed letters and documents • OCR does not work well with handwritten documents because of: • noise (ink marks, perhaps) • variations among writing styles • inconsistencies in formation of letters/words
More history … • OCR segments a page into words, then breaks each word into its characters • OCR is successful with clean machine fonts against a clean background • Character segmentation is too difficult with handwritten documents
Motivation • To efficiently index historical handwritten documents • To simplify reading documents where the handwriting is particularly hard to read • Eventually, just as with images, it is hoped that automatic indexing of documents will be available
Specific important documents to be indexed … • W.E.B. Du Bois • Washington and other Presidents’ writings located in the Library of Congress • Over 6,400 scanned 8-bit grey level images of Washington’s manuscripts • Serve as valuable resources for scholars as well as others who wish to consult original source material
What is word spotting? • A method by which handwritten material can be indexed • Assumes documents are written by the same person • Assumes that variations between occurrences of the same word are minimal • This assumption does not always hold (a significant contributor to error)
More about word spotting • Avoid recognizing the words • Use word images • What is difficult about it?? • Segmenting the page into words • Ascenders, descenders • Noise, inconsistencies • Matching the words effectively
Methodology • Obtain grey level image of document. • Reduce image by ½ using Gaussian filtering and sub-sampling. • Binarize the image by thresholding (characters = white, background = black). • Segment the binary image into words (word images).
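The preprocessing steps on this slide could be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the 3x3 binomial kernel standing in for the Gaussian filter, the threshold of 128, the function names, and the assumption of dark ink on a light page (image given as a list of rows of 0-255 grey values) are all illustrative choices.

```python
# Hypothetical sketch: smooth, halve, and binarize a grey-level page image.
KERNEL = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]  # assumed 3x3 approximation of a Gaussian


def smooth(img):
    """Blur with the 3x3 kernel, clamping coordinates at the image border."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += KERNEL[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = total // 16  # kernel weights sum to 16
    return out


def halve(img):
    """Sub-sample by keeping every second row and column."""
    return [row[::2] for row in img[::2]]


def binarize(img, threshold=128):
    """Dark ink becomes white (1); light background becomes black (0)."""
    return [[1 if v < threshold else 0 for v in row] for row in img]


def preprocess(img, threshold=128):
    return binarize(halve(smooth(img)), threshold)
```

Smoothing before sub-sampling keeps thin pen strokes from vanishing when every second pixel is dropped.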
Methodology • Each word image is tested against every other word image, but pairs are pruned based on image area and aspect ratio. • Matching produces equivalence classes. • The top n equivalence classes are chosen. The top s classes are removed and noted as stop words. The user then provides the ASCII equivalent for the remaining top m classes.
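Pruning by area and aspect ratio might look something like the sketch below. The slides do not give the actual criteria, so the relative tolerances (25%) and the function names are assumptions chosen only to illustrate the idea of skipping pairs that cannot plausibly be the same word.

```python
# Hypothetical sketch: skip pairs whose sizes are too different to be the same word.
def similar_size(a, b, area_tol=0.25, aspect_tol=0.25):
    """a and b are (width, height) of two word images; tolerances are assumed."""
    area_a, area_b = a[0] * a[1], b[0] * b[1]
    aspect_a, aspect_b = a[0] / a[1], b[0] / b[1]
    area_ok = abs(area_a - area_b) <= area_tol * max(area_a, area_b)
    aspect_ok = abs(aspect_a - aspect_b) <= aspect_tol * max(aspect_a, aspect_b)
    return area_ok and aspect_ok


def candidate_pairs(sizes):
    """Return index pairs (i, j) with i < j that survive pruning."""
    return [(i, j)
            for i in range(len(sizes))
            for j in range(i + 1, len(sizes))
            if similar_size(sizes[i], sizes[j])]
```

For example, with sizes `[(100, 20), (110, 20), (30, 20)]` only the first two word images are close enough in area to be compared.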
Details of Word Segmentation • Spacing between characters is smaller than spacing between words • If two white pixels are separated by less than a certain distance k, the intermediate pixels are made white • Done in both the horizontal and vertical directions, to capture descenders
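The gap-filling rule can be sketched as below, assuming a binary image stored as a list of rows with 1 = white (ink) and 0 = black. Treating "separated by less than k" as a difference in pixel positions smaller than k, and using separate horizontal and vertical values of k, are assumptions made for this illustration.

```python
# Hypothetical sketch of the gap-filling (smearing) step.
def close_gaps_row(row, k):
    """Whiten the pixels between two white pixels whose positions differ by less than k."""
    out = list(row)
    last_white = None
    for x, v in enumerate(row):
        if v == 1:
            if last_white is not None and x - last_white < k:
                for g in range(last_white + 1, x):
                    out[g] = 1
            last_white = x
    return out


def close_gaps(img, k_h, k_v):
    """Apply the rule along rows, then along columns."""
    rows = [close_gaps_row(r, k_h) for r in img]
    cols = [close_gaps_row(list(c), k_v) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]
```

After this step, each connected white region is a candidate word image.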
Word segmentation … • Errors do occur using this algorithm (e.g. the dot over an i or j) • A minimum length is therefore required, which keeps the dots of i and j from becoming separate word images • If large gaps are left in some instances of a word but not in others, the instances are segmented as different words
Two primary algorithms used for word matching: • EDM (Euclidean Distance Mapping (D. 1980)) • Fast, but assumes that no distortions have occurred except for relative translation • Does well matching words with relatively low variation relative to the template • SLH (Scott and Longuet-Higgins (1991)) • Assumes an affine transformation between the words • Slow, computationally expensive in current implementations
Matching with EDM • Aligning – vertical alignment by baseline, horizontal by coinciding left sides (thus vertical alignment > horizontal alignment) • XOR image is computed – corresponding pixels are XORed to produce the difference between the images • The XOR image alone is not a good measure of image difference, since equal weight is given to isolated pixels and blobs …
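As a small illustration of the XOR step, assuming two binary word images that have already been aligned and padded to the same size (the function name is hypothetical):

```python
# Hypothetical sketch: pixelwise XOR of two aligned binary word images.
def xor_image(a, b):
    """Return 1 where exactly one of the two images is white, else 0."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must be aligned and of equal size")
    return [[pa ^ pb for pa, pb in zip(ra, rb)] for ra, rb in zip(a, b)]
```

Simply counting the white pixels of the result would treat one large blob and the same number of scattered pixels as equally bad, which is the weakness noted on the slide.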
XOR for Lloyd • What is in one image or the other, but not both …
EDM Step • EDM is computed by assigning to each white pixel in the image its minimum distance to a black pixel • A white pixel inside a blob gets a larger distance than an isolated white pixel • An error measure, E_EDM, can now be calculated by summing the distance measures for each pixel
Forming Blobs using EDM • The distance between every white pixel and the nearest black pixel is computed • Pixels with distance < threshold are assumed to be noise.
Problems with EDM • EDM does not discriminate well between good and bad matches • Fails when there is significant distortion in the words • Need for a matching algorithm that models some variation -> SLH
SLH Matching technique • Affine transformation allows for scaling and shear deformations in both directions • Much more accurate than the Euclidean Distance Mapping technique • Computationally slow and expensive because the SVD (Singular Value Decomposition) must be computed for a large matrix
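A brute-force sketch of the error measure follows. It assumes the distance map is taken over the XOR image (1 = white, 0 = black) and that the image contains at least one black pixel; it is written for clarity, not speed, and the function names are illustrative.

```python
import math


# Hypothetical sketch: Euclidean distance map and summed error on an XOR image.
def edm(xor_img):
    """Distance from each white (1) pixel to its nearest black (0) pixel; black pixels get 0."""
    h, w = len(xor_img), len(xor_img[0])
    black = [(y, x) for y in range(h) for x in range(w) if xor_img[y][x] == 0]
    if not black:
        raise ValueError("need at least one black pixel")
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if xor_img[y][x] == 1:
                dist[y][x] = min(math.hypot(y - by, x - bx) for by, bx in black)
    return dist


def edm_error(xor_img):
    """Sum of the distance values over all pixels."""
    return sum(sum(row) for row in edm(xor_img))
```

A 3x3 white blob contributes more than nine isolated white pixels would, because its centre pixel lies two pixels from the nearest black pixel rather than one.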
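To make the affine model concrete, the sketch below applies a general 2D affine map to a set of points. It illustrates only the kind of deformation SLH matching is said to allow, not the SLH algorithm itself; the parameter names a, b, c, d, tx, ty are assumptions for this example.

```python
# Hypothetical sketch: a 2D affine map x' = a*x + b*y + tx, y' = c*x + d*y + ty.
def affine(points, a, b, c, d, tx=0.0, ty=0.0):
    """Apply the affine map to each (x, y) point."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


# Stretch horizontally by 2 and shear x by half of y, as a slanted, wider
# rendering of the same word might require.
corners = [(1, 0), (0, 1), (1, 1)]
stretched = affine(corners, a=2, b=0.5, c=0, d=1)
```

Pure translation is the special case a = d = 1 and b = c = 0, which is the only variation EDM matching tolerates.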
Differences (+/-) • EDM does not account for any distortions and thus performs poorly when handwriting is bad • SLH almost always produces the correct rankings even if the handwriting is bad • Two areas need to be improved with both: • Speed and word validity discrimination
Tests . . . • Two documents, “Senior” and “Hudson”, were compared using both matching algorithms • The statistical information for both documents is as follows:
Test Information • Since the SLH algorithm is slow but more accurate than EDM, EDM was applied first • A cutoff threshold was used to limit the number of classes (words) displayed; it remained constant in both tests
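One way to read "EDM was applied first" is as a two-stage cascade: a cheap measure discards unlikely candidates and the expensive measure ranks the rest. The sketch below shows that pattern only; the function name, the error-function parameters, and the use of the cutoff as a pre-filter are assumptions, not details taken from the slides.

```python
# Hypothetical sketch: cheap pre-filter followed by an expensive ranking.
def two_stage_match(template, candidates, cheap_error, costly_error, cutoff):
    """Keep candidates whose cheap error is within cutoff, then rank them by costly error."""
    survivors = [c for c in candidates if cheap_error(template, c) <= cutoff]
    return sorted(survivors, key=lambda c: costly_error(template, c))


# Toy usage with numbers standing in for word images.
ranked = two_stage_match(
    10, [9, 30, 12, 11],
    cheap_error=lambda t, c: abs(t - c),
    costly_error=lambda t, c: (t - c) ** 2,
    cutoff=3,
)
```

The expensive measure then runs only on the survivors rather than on every candidate.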
Senior Document - classes stop words
EDM on Senior • The EDM algorithm performed quite well on Senior Document • Average precision of 78% • Since the handwriting is good, this performance was expected • Remember that EDM does not account for much variation in word images
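The slides do not say how average precision was computed. A common information-retrieval definition, assumed here purely for illustration, averages the precision at each rank where a correct match appears:

```python
# Hypothetical sketch: average precision over one ranked list of matches.
def average_precision(ranked_relevance):
    """ranked_relevance: booleans, best-ranked match first; True means a correct match."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0
```

Under this definition a ranking that places all correct matches first scores 1.0, and each incorrect word ranked above a correct one lowers the score.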
EDM matches for Lloyd • 50% correct
EDM problems . . . • The algorithm performs poorly in that it cannot discriminate well between valid and invalid words.
EDM on Hudson • Average precision was 57.9%, much lower than that of the Senior Document (78%) • Difference in precision attributed to the handwriting • Difficult to read even for humans looking at greyscale images at 300 dpi
SLH matches for Lloyd • 62.5% correct (3/8 incorrect)
SLH on Senior • Proved to be very accurate, yet as mentioned before, slow … • Average precision of SLH was 86.3%, compared to 78.7% for EDM • SLH recorded the word rankings correctly, and also showed a much greater discrimination in match error
SLH on Hudson • Very difficult because of writing, but ranking proved to be much better than EDM • Performance on templates like “they” was good – probably because they are simple, repetitive words • Correct ranking for the word “Standard”
Current matching techniques • SLH is more than reasonably accurate, but slow in current implementations. References report this will change ... • Both techniques require matching every word against every other word • O(N^2) • N = 220 x 6400 ≈ 1.4 x 10^6, so N^2 ~ 10^12 matches!
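The order of magnitude can be checked with a few lines of arithmetic. Reading 220 x 6400 as the number of word images N (for instance about 220 words on each of 6,400 pages) is an assumption; the slide gives only the two factors.

```python
# Hypothetical sketch: how many pairwise matches an all-against-all comparison needs.
def match_counts(per_page=220, pages=6400):
    """Return (N, N^2, number of unordered pairs) for N = per_page * pages."""
    n = per_page * pages
    return n, n * n, n * (n - 1) // 2


n, n_squared, unordered_pairs = match_counts()
```

Counting each pair only once halves the work but leaves it at roughly 10^12 matches.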
Main problems with wordspotting • Handwriting style
Main problems with wordspotting • Word scaling and connection of ascenders, descenders
Main problems with wordspotting • Skew and Noise
Recent Work • Fixed bugs and some problems with algorithm • Recently have successfully segmented the 6400 scanned images of George Washington’s documents • It is impractical (too labor intensive) to compute segmentation statistics on the entire collection.
Future Work • Continue working on new methods to match words effectively and efficiently • Possible ideas for better matching techniques include: • Combining multiple word features • Language probability for automatic indexing • Integrate into indexing scheme (back of book index)
Conclusion • EDM works reasonably well for matching words, but SLH is better since it accounts for variations • SLH pays the price – computationally expensive • Future work is needed; but progress thus far is very encouraging
Questions? Thanks for listening to me blob about word spotting. Any comments or questions?