Document Content Analysis for Digital Archives. Eric Saund Perceptual Document Analysis Area Intelligent Systems Laboratory Palo Alto Research Center. Digital Archives. Index. Metadata layer. Content layer. Tasks. Operations. -browse by topic, type, etc. -search for known items
Document Content Analysis for Digital Archives • Eric Saund • Perceptual Document Analysis Area • Intelligent Systems Laboratory • Palo Alto Research Center
Digital Archives
Index
Metadata layer
Content layer
Tasks
Operations
-browse by topic, type, etc.
-search for known items
-search for items meeting criteria
-find duplicate items
-find similar items
-follow links
-establish links
-apply logical rules
-edit metadata
-casual browsing
-look up information
-follow trails
-compose narratives
-form and organize collections
-distribute
-assemble timelines
All enabled by Metadata
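Several of the operations listed on this slide run purely against the metadata layer. A minimal sketch of two of them, searching for items meeting criteria and finding duplicate items, assuming metadata records are plain dictionaries. The field names, the sample records, and the helpers `search` and `find_duplicates` are hypothetical illustrations, not part of the talk:

```python
from collections import defaultdict

def search(records, **criteria):
    """Return records whose metadata matches every given field value."""
    return [r for r in records
            if all(r.get(name) == value for name, value in criteria.items())]

def find_duplicates(records, fields=("title", "date")):
    """Group record IDs that share the same values for the chosen fields."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r.get(f) for f in fields)
        groups[key].append(r["id"])
    # Only groups with more than one member are duplicate candidates.
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical records, for illustration only.
records = [
    {"id": 1, "title": "Site invoice", "date": "2004-04-21", "type": "invoice"},
    {"id": 2, "title": "Site invoice", "date": "2004-04-21", "type": "invoice"},
    {"id": 3, "title": "Field notes", "date": "2004-05-02", "type": "memo"},
]

invoices = search(records, type="invoice")
duplicates = find_duplicates(records)
```

Both helpers only see what the metadata records hold, so an operation that needs a field nobody filled in cannot be answered this way.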
Metadata
Metadata as a static record
Title: Sarix neob
Date: 37-23-55
Media: niobium
Format: jnb
Author: Rsi Liwer
Text: “aliirn xeca sarlia isyb...”
Index ID: 34962s
Metadata as an interface
computeSimilarityTo()
containsEntity?()
fitsSlotInModel?()
extractTextAfterImageCleanup()
functions applied to item content
pointer to item
Automatic Content Analysis
Two major problems with metadata:
1. Extracting metadata from raw content items.
2. Metadata is always incomplete for some purposes.
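The "metadata as an interface" idea can be sketched as an object that keeps the static fields plus a handle to the content, and answers questions by running functions over that content on demand. This is only an illustration under stated assumptions: the class `ItemMetadata`, the word-overlap (Jaccard) similarity, and the substring entity test are stand-ins chosen here, not the methods behind the slide's `computeSimilarityTo()` or `containsEntity?()`:

```python
from dataclasses import dataclass, field
from typing import Dict

def _tokens(text: str) -> set:
    """Lowercased whitespace-separated words of a text."""
    return set(text.lower().split())

@dataclass
class ItemMetadata:
    index_id: str
    content: str  # stands in for the "pointer to item"
    static: Dict[str, str] = field(default_factory=dict)  # title, date, ...

    def compute_similarity_to(self, other: "ItemMetadata") -> float:
        """Jaccard overlap of word sets (an assumed similarity measure)."""
        a, b = _tokens(self.content), _tokens(other.content)
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def contains_entity(self, entity: str) -> bool:
        """Naive case-insensitive substring test, standing in for entity detection."""
        return entity.lower() in self.content.lower()

# Hypothetical items, for illustration only.
a = ItemMetadata("a1", "invoice for modular building delivery")
b = ItemMetadata("b2", "invoice for building skirting")
similarity = a.compute_similarity_to(b)
has_entity = a.contains_entity("Modular")
```

Because the answers are computed from the content when asked, new questions can be supported by adding functions rather than by re-cataloging every item.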
State of the Art
topics entities text appearance, layout
• document image analysis
• photographic image analysis
• video/film analysis
• audio analysis
• web site analysis
genre category functional roles who what where when genre scenes who, what, ... genre speech/music speaker ID transcription
When OCR Works... APR 21 2004 17:38 FR ---- 203 749 4519 TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -I Page: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp 10 Riverview Drive Danbury, CT 06810 PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 DOWN PAYMENT 0. 00 BUILDING DELIVERY 0. 00 BUILDING DELIVERY 400.00 BLOCK AND LEVEL 0. 00 BLOCK AND LEVEL 2,100.00 ANCHOR/TIE DOWN 780 00 DECKING 950. 00 / ELECTRICAL 1, 350. 00 / PLUMBING 3, 025. 00 INSTALLATION SITE MANAGEMENT 1,100 00 SKIRTING- VINYL 1,360. 00 TOTAL DUE THIS INVOICE 63,767.00
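Even in this fairly clean OCR output, the amounts come back with stray spaces and dropped decimal points ("26, 351. 00", "780 00"). A minimal sketch of one way to sanity-check such text, assuming every amount on the invoice carries exactly two decimal digits. The helper `normalize_amount` is hypothetical, and the amount strings are copied from the OCR output above:

```python
import re
from decimal import Decimal

def normalize_amount(raw: str) -> Decimal:
    """Keep only the digits and treat the last two as cents (assumption)."""
    digits = re.sub(r"\D", "", raw)
    return Decimal(digits) / 100

# Amount strings exactly as they appear in the OCR output.
line_items = [
    "26,351.00", "26, 351. 00", "0. 00", "0. 00", "400.00", "0. 00",
    "2,100.00", "780 00", "950. 00", "1, 350. 00", "3, 025. 00",
    "1,100 00", "1,360. 00",
]
total_due = "63,767.00"

items_sum = sum(normalize_amount(a) for a in line_items)
totals_agree = items_sum == normalize_amount(total_due)
```

A check like this uses the document's own arithmetic to confirm that the recognized amounts are mutually consistent, without needing the original image.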
How People See a Document
Font / Layout / Symbol
Structural Elements and Relations
Category Type
Relational Context
Pattern of Fax ID Line
Header alignment
Graphical logo
Redacting markings
Address block
Graphic separator
Repeated elements
Hand-drawn graphical annotation
Handwritten Textual Annotation
Tabular Layout
Amount Field
Textual Field Indicator
• Invoice
• Bill
• Itemized purchase listing
• Annotated document
• Construction project
• Supplier relationship
• Inventory & materials management
Technology Ecology
Industry
Characteristics: • engineering-based • robust • limited capabilities
Paying Customer: • businesses • consumers • government
• Document Imaging • Transaction Processing • Workflow Systems • Database Vendors • Business Software • Business Process Outsourcing • Advertising/Search
Academia
Characteristics: • science-based • toy problems • fragile
Paying Customer: • government • industry
• Computer Vision • Document Recognition • Information Retrieval • Machine Learning • Speech Recognition • Natural Language • Artificial Intelligence
Hobbyists
• museums • schools • local governments • NGOs • individuals • startups • boutique companies • shoestring projects in Academia and Industry
A Hobby Project
Wanted: Document Capture Station + Collection Comprehension Engine
Collection Comprehension Engine
OCR
Document Structure Modeling
Image Processing
Automatic Cataloging
Classification
Genre Tagging
Clustering
Visualization
GUI
Document Collection
Linking
Conclusion
The hobby stage brings together kindred spirits.