
Document Content Analysis for Digital Archives

Eric Saund, Perceptual Document Analysis Area, Intelligent Systems Laboratory, Palo Alto Research Center.


Presentation Transcript


  1. Document Content Analysis for Digital Archives • Eric Saund • Perceptual Document Analysis Area • Intelligent Systems Laboratory • Palo Alto Research Center

  2. Digital Archives
     Index • Metadata layer • Content layer
     Tasks / Operations:
     • browse by topic, type, etc.
     • search for known items
     • search for items meeting criteria
     • find duplicate items
     • find similar items
     • follow links
     • establish links
     • apply logical rules
     • edit metadata
     • casual browsing
     • look up information
     • follow trails
     • compose narratives
     • form and organize collections
     • distribute
     • assemble timelines
     All enabled by metadata.

  3. Metadata
     Metadata as a static record:
     • Title: Sarix neob
     • Date: 37-23-55
     • Media: niobium
     • Format: jnb
     • Author: Rsi Liwer
     • Text: “aliirn xeca sarlia isyb...”
     • Index ID: 34962s
     Metadata as an interface:
     • computeSimilarityTo()
     • containsEntity?()
     • fitsSlotInModel?()
     • extractTextAfterImageCleanup()
     • functions applied to item content
     • pointer to item
     Automatic Content Analysis
     Two major problems with metadata:
     1. Extracting metadata from raw content items.
     2. Metadata is always incomplete for some purposes.
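The contrast on this slide between metadata as a static record and metadata as an interface can be sketched in code. What follows is only an illustration, not the speaker's implementation. The class names `StaticRecord` and `ContentItem` are hypothetical, and so are the snake_case method names (Python identifiers cannot contain `?`). The word-overlap (Jaccard) similarity is an assumed stand-in for whatever content analysis a real system would apply.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StaticRecord:
    """Metadata as a static record: fields filled in once, then only looked up."""
    title: str
    date: str
    media: str
    fmt: str
    author: str
    text: str
    index_id: str


@dataclass
class ContentItem:
    """Metadata as an interface: answers are computed from the content on demand.

    `content` stands in for the slide's "pointer to item"; here it is just a string.
    """
    content: str

    def _tokens(self) -> set[str]:
        # Crude tokenisation (lowercase, split on whitespace); an assumption.
        return set(self.content.lower().split())

    def compute_similarity_to(self, other: ContentItem) -> float:
        # Jaccard overlap of word sets, a placeholder for real content analysis.
        a, b = self._tokens(), other._tokens()
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    def contains_entity(self, name: str) -> bool:
        # Exact single-token match only; real entity detection is far richer.
        return name.lower() in self._tokens()


# The static record reproduces the placeholder values shown on the slide.
record = StaticRecord(
    title="Sarix neob", date="37-23-55", media="niobium", fmt="jnb",
    author="Rsi Liwer", text="aliirn xeca sarlia isyb...", index_id="34962s",
)

# The interface answers questions the record never anticipated.
invoice_a = ContentItem("invoice from GE Capital Corp")
invoice_b = ContentItem("invoice from Danbury")
similarity = invoice_a.compute_similarity_to(invoice_b)
```

The static record can only return what was stored in it. The interface can be asked new questions later, which is one way to read the slide's second problem, that metadata is always incomplete for some purposes.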

  4. State of the Art
     Analysis types: • document image analysis • photographic image analysis • video/film analysis • audio analysis • web site analysis
     Outputs: topics; entities; text; appearance, layout; genre; category; functional roles; who, what, where, when; genre; scenes; who, what, ...; genre; speech/music; speaker ID; transcription

  5. When OCR Works... APR 21 2004 17:38 FR ---- 203 749 4519 TO 4264 P.02/06 * 9STCapitalModularSpace SALE INVOICE _ jz5| g'" ni'idspace.com -I Page: 1 FAX TO:_ BILL TO: REMIT TO: ACCOUNT NO.: ;m11 GE Capital Corp 10 Riverview Drive Danbury, CT 06810 PO NUMBER: per Chad LOCATION OF UNITS: SAME AS ABOVE UNIT NO.: 076613 SERIAL NO.: SM069A 26,351.00 4,;- UNIT NO.: 076614 SERIAL NO.: SM0G9B 26, 351. 00 DOWN PAYMENT 0. 00 BUILDING DELIVERY 0. 00 BUILDING DELIVERY 400.00 BLOCK AND LEVEL 0. 00 BLOCK AND LEVEL 2,100.00 ANCHOR/TIE DOWN 780 00 DECKING 950. 00 / ELECTRICAL 1, 350. 00 / PLUMBING 3, 025. 00 INSTALLATION SITE MANAGEMENT 1,100 00 SKIRTING- VINYL 1,360. 00 TOTAL DUE THIS INVOICE 63,767.00

  6. How People See a Document
     Headings: Category Type • Structural Elements and Relations • Relational Context
     Structural elements:
     • Font / Layout / Symbol Pattern of Fax ID Line
     • Header alignment
     • Graphical logo
     • Redacting markings
     • Address block
     • Graphic separator
     • Repeated elements
     • Hand-drawn graphical annotation
     • Handwritten Textual Annotation
     • Tabular Layout
     • Amount Field
     • Textual Field Indicator
     Other labels:
     • Invoice
     • Bill
     • Construction project
     • Supplier relationship
     • Itemized purchase listing
     • Inventory & materials management
     • Annotated document

  7. Technology Ecology
     Industry
     • Characteristics: engineering-based; robust; limited capabilities
     • Paying customer: businesses; consumers; government
     • Fields: Document Imaging; Transaction Processing; Workflow Systems; Database Vendors; Business Software; Business Process Outsourcing; Advertising/Search
     Academia
     • Characteristics: science-based; toy problems; fragile
     • Paying customer: government; industry
     • Fields: Computer Vision; Document Recognition; Information Retrieval; Machine Learning; Speech Recognition; Natural Language; Artificial Intelligence
     Hobbyists
     • museums
     • schools
     • local governments
     • NGOs
     • individuals
     • startups
     • boutique companies
     • shoestring projects in Academia and Industry

  8. A Hobby Project
     Wanted: Document Capture Station + Collection Comprehension Engine

  9. Collection Comprehension Engine
     • OCR
     • Document Structure Modeling
     • Image Processing
     • Automatic Cataloging
     • Classification
     • Genre Tagging
     • Clustering
     • Visualization GUI
     • Document Collection
     • Linking

  10. Conclusion
      The hobby stage brings together kindred spirits.
