Versatile Document Image Content Extraction Henry S. Baird Michael A. Moll Jean Nonnemaker Matthew R. Casey Don L. Delorenzo
Document Image Content Extraction Problem • Given an image of a document, find regions containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc.
Difficulties • Vast diversity of document types • Arduous data collection • How big is a representative training set? • Expense of preparing correctly labeled “ground-truthed” samples • Lack of consensus on how to evaluate performance
Our Research Goals • Versatility First • Beware "brittle" or narrow approaches • Develop methods that work across the broadest possible spectrum of document and image types • Voracious Classifiers • Belief that the accuracy of a classifier has more to do with its training data than with any other consideration • Want to train on extremely large (and representative) data sets (in reasonable amounts of time) • Extremely High-Speed Classification • Ideally, perform nearly at I/O rates (as fast as images can be read) • Too ambitious?
Related Strategies (for the future) • Amplification • Real ground-truthed training samples are hard to find, expensive to generate, and difficult to ensure coverage for • Want to use real samples as 'seeds' for massive synthetic generation of pseudo-randomly perturbed samples for use in supplementary training • Confidence Before Accuracy • Confidence may be more important than accuracy, since even modest accuracy (across all cases) can be useful • Near-Infinite Space • Design for best performance in the near future, when main memory will be orders of magnitude larger and faster • Data-Driven Design • Avoid arbitrary engineering decisions such as the choice of features, instead allowing the training data to determine them
Document Images • Range of document and image types • Color, grey-level, black and white • Any size or resolution • Many file formats (TIFF, JPEG, PNG, etc) • Pre-processing step converts each image into a three-channel color PNG file in HSL (Hue, Saturation, Luminance) color space (sketched below) • Bi-level and gray images will map primarily into the Luminance component
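A minimal sketch of this preprocessing step, assuming Pillow is available. Pillow has no native HSL mode, so the conversion goes through Python's colorsys; the function name and channel ordering here are illustrative assumptions, not the authors' code.

```python
import colorsys
from PIL import Image

def to_hsl_png(in_path, out_path):
    """Convert any input image to a three-channel PNG carrying H, S, L."""
    img = Image.open(in_path).convert("RGB")  # bi-level/gray promote to RGB
    w, h = img.size
    out = Image.new("RGB", (w, h))            # channels reused to hold H, S, L
    src, dst = img.load(), out.load()
    for y in range(h):
        for x in range(w):
            r, g, b = (c / 255.0 for c in src[x, y])
            hue, lum, sat = colorsys.rgb_to_hls(r, g, b)  # note: HLS order
            dst[x, y] = (int(hue * 255), int(sat * 255), int(lum * 255))
    out.save(out_path, "PNG")
```

For a gray or bi-level page, r = g = b, so hue and saturation are zero and all the information lands in the luminance channel, matching the bullet above.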
Document Image Content Types • Now: Handwriting, Machine Print, Line Art, Photos, Junk/Noise, Blank • Soon: Maps, Mathematical Equations, Engineering Drawings, Chemical Diagrams • Gathering a large collection of electronic images: 7123 page images so far • We attempt to collect samples of each content type in black and white, grey-scale, and color • Avoid arbitrary image-processing decisions of our own • Carefully "zoned" images are a rare commodity • Our software accepts existing ground truth in the form of rectangular zones • We have developed a ground-truthing tool for images that are not zoned
Statistical Framework for Classification • Each training & test sample will be a pixel, not a region • Want to avoid the arbitrariness and restrictiveness associated with the choice of a limited class of shapes • Since each image can contain millions of pixels, it is easy to exceed 1 billion training samples • This policy was suggested by Thomas Breuel
Example of Classifying Pixels • [Figure: an input document image alongside its per-pixel classification output, with regions labeled Photo, Machine Print, and Handwriting; ~75% of pixels classified correctly]
Features • Simple, local features of each pixel • Average luminosity of every pixel in 1x1, 3x3, 9x9, and 27x27 boxes around the given pixel (sketched below) • Average luminosity of 20 pixels on either side of the given pixel along horizontal and vertical lines and lines of slope +/- 1 and 2 • Also, for each box and line, the average change and maximum change from one pixel to a neighbor • Choice of features is merely expedient • Expect to refine indefinitely
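A sketch of the box features, assuming the luminance channel is available as a 2-D numpy array; a summed-area table makes each box average O(1) per pixel. The helper name and the edge-clipping behavior are illustrative assumptions.

```python
import numpy as np

def box_means(lum, sizes=(1, 3, 9, 27)):
    """Per-pixel average luminance over k x k boxes, clipped at image edges."""
    h, w = lum.shape
    sat = np.zeros((h + 1, w + 1))            # summed-area table, zero border
    sat[1:, 1:] = lum.cumsum(0).cumsum(1)
    feats = []
    for k in sizes:
        r = k // 2
        y0 = np.clip(np.arange(h) - r, 0, h); y1 = np.clip(np.arange(h) + r + 1, 0, h)
        x0 = np.clip(np.arange(w) - r, 0, w); x1 = np.clip(np.arange(w) + r + 1, 0, w)
        area = (y1 - y0)[:, None] * (x1 - x0)[None, :]
        box = sat[y1][:, x1] - sat[y1][:, x0] - sat[y0][:, x1] + sat[y0][:, x0]
        feats.append(box / area)              # mean luminance per pixel
    return np.stack(feats, axis=-1)           # shape (h, w, len(sizes))
```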
A Nearly Ideal Classifier (For Our Purposes) • k Nearest Neighbor (kNN) Algorithm • Asymptotic error rate is no worse than twice the Bayes error • Generalizes directly to more than two classes (unlike, e.g., SVMs) • Competitive in accuracy with more recently developed methodologies • We have implemented a straightforward exhaustive 5-NN search • Aware of many techniques (editing, tree search, etc) for speeding up kNN that work well in practice, but they do not appear to help in the broadest range of cases • We hope to exploit highly non-uniform data distributions • Explore using hashing techniques with geometric tree search • Intrinsic dimensionality of the data seems low
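A minimal sketch of an exhaustive 5-NN classifier of the kind described, with a majority vote over the nearest training pixels; this is an illustration, not the authors' implementation.

```python
import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, query, k=5):
    """Exhaustive k-NN: one distance computation per training sample."""
    d2 = ((train_X - query) ** 2).sum(axis=1)  # squared Euclidean distances
    nearest = np.argpartition(d2, k)[:k]       # indices of the k smallest
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]
```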
Adaptive k-d Trees • Recursively partition the set of points in stages • At each stage, divide one partition into two sub-partitions • Assume it is possible to choose cuts to achieve balance • Ensures find operations run in O(log n) time at worst • Final partitions are generally hyperrectangles • This approximates kNN under the infinity norm • The pruning power of k-d trees (Bentley) speeds up range searches • Given a search point (a test sample), it is fast to find the enclosing hyperrectangle • Adaptive k-d trees guarantee shallow trees
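A minimal sketch of the adaptive construction: cycling through the feature axes and cutting each partition at the median is one standard way to achieve the balance assumed above. The dict-based node representation and names are illustrative.

```python
import numpy as np

def build_adaptive_kdtree(points, depth=0):
    """Median cuts keep the two sub-partitions balanced, so depth is O(log n)."""
    if len(points) <= 1:
        return {"leaf": points}
    axis = depth % points.shape[1]            # cycle through the feature axes
    pts = points[points[:, axis].argsort()]
    mid = len(pts) // 2                       # median cut, chosen from the data
    return {"axis": axis, "cut": pts[mid, axis],
            "lo": build_adaptive_kdtree(pts[:mid], depth + 1),
            "hi": build_adaptive_kdtree(pts[mid:], depth + 1)}
```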
We Use Non-adaptive k-d Trees • Constructed in a manner similar to adaptive k-d trees, except that the distribution of the data is ignored in generating cuts • If upper and lower bounds are known for each feature, cuts can be placed at the midpoints of these bounds • There is no balance guarantee, and the time- and space-optimality properties of adaptive k-d trees are lost • However, the values of the cut thresholds can be predicted, and as a result the total number of cuts, r, is known • The hyperrectangle any sample lies in can be computed in O(r) time • Computing the cuts is so fast it is negligible
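A sketch of the non-adaptive variant: the midpoint cuts are fixed by the feature bounds alone, so a sample's hyperrectangle falls out of r data-independent comparisons. The round-robin cut order and helper names are illustrative assumptions.

```python
import numpy as np

def cell_bits(x, lo, hi, rounds):
    """One bit per cut: repeated midpoint halving, cycling over dimensions."""
    lo, hi = np.asarray(lo, float).copy(), np.asarray(hi, float).copy()
    bits = []
    for _ in range(rounds):
        for d in range(len(x)):
            mid = (lo[d] + hi[d]) / 2.0       # midpoint cut: ignores the data
            if x[d] >= mid:
                bits.append(1); lo[d] = mid   # sample in the upper half
            else:
                bits.append(0); hi[d] = mid   # sample in the lower half
    return bits                               # r = rounds * len(x) bits total
```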
Bit-Interleaving Addresses • Partitions can be addressed using bit-interleaving • [Figure: a recursively partitioned 2-D feature space with cells labeled 1-6; reading the 0/1 cut outcomes in order yields each cell's address, e.g. 100111]
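Packing the cut bits from the sketch above, taken in round-robin order across the dimensions, into an integer gives the bit-interleaved address; a small illustrative helper:

```python
def bit_interleaved_address(bits):
    """Pack interleaved cut bits into one integer address."""
    addr = 0
    for b in bits:
        addr = (addr << 1) | b   # e.g. [1,0,0,1,1,1] -> 0b100111 = 39
    return addr
```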
Assumption of Bit-Interleaving • Since in this context we expect our data to be non-uniformly distributed in feature space, only a small fraction of partitions should contain any training data • Therefore very few distinct bit-interleaved addresses should occur, making it possible to use a dictionary data structure to store them • Experiments show that the number of occupied partitions, as a function of bit-interleaved address length, grows only cubically (far better than exponentially!)
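A sketch of the resulting approximate search, reusing the cell_bits and bit_interleaved_address helpers from the sketches above: training pixels are bucketed by address in an ordinary dictionary, and a query is answered by exhaustive kNN within its own cell. This illustrates the idea under those assumptions; it is not the authors' code.

```python
from collections import Counter, defaultdict

def build_index(train_X, train_y, lo, hi, rounds):
    index = defaultdict(list)
    for x, y in zip(train_X, train_y):
        addr = bit_interleaved_address(cell_bits(x, lo, hi, rounds))
        index[addr].append((x, y))     # only occupied cells are ever stored
    return dict(index)

def approx_knn(index, q, lo, hi, rounds, k=5):
    addr = bit_interleaved_address(cell_bits(q, lo, hi, rounds))
    cell = index.get(addr, [])         # the cell may miss some true neighbors
    near = sorted(cell, key=lambda xy: sum((a - b) ** 2 for a, b in zip(xy[0], q)))[:k]
    return Counter(y for _, y in near).most_common(1)[0][0] if near else None
```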
Initial Results and Analysis • n = 844,525 training points, d = 15 features, tested on 192,405 points • Brute-force 5-NN classified 70% correctly but required 163 billion distance calculations • Hashing bit-interleaved addresses of length r = 32 classified 66% correctly, with a speedup of 148 times • Of course, this only approximates kNN • A query can hash into a cell that does not contain its true k nearest neighbors
Approximate, but Quite Good • [Figure: the same input image and per-pixel classification output as before (Photo, Machine Print, Handwriting); ~75% of pixels classified correctly]
Future Work • Much larger-scale experiments • Wider range of content types • More and better features • Explore refined approximate kNN methods • How does the BIA (bit-interleaved address) approach behave under vastly larger data sets? • Paging hash tables that grow too large • Exploit style consistency (isogeny) • Compare to CARTs, Locality-Sensitive Hashing, etc
Thank You! Henry S. Baird Michael A. Moll