
Automated Building of OAI Compliant Repository from Legacy Collection

ELPUB 2006, June 14-16, Bansko, Bulgaria.


Presentation Transcript


    1. Automated Building of OAI Compliant Repository from Legacy Collection

    Kurt Maly (Maly@cs.odu.edu), Department of Computer Science, Old Dominion University. May 2006.

    2. Contents

    - Introduction
    - Background
    - System Architecture
    - Metadata Extraction Approach
    - Experiments
    - Screenshots

    3. Introduction

    Key problem: extracting metadata from a legacy collection.
    - OCR alone is not sufficient to make ‘legacy’ documents searchable.
    - Manual metadata extraction is costly and time-consuming: it would take about 60 employee-years to create metadata for 1 million documents (estimate made by Lou Rosenfeld at the DCMI 2003 workshop).
    - Automatic extraction tools are essential for rapid dissemination at reasonable cost.

    4. Background: Digital Library and OAI-PMH

    Digital Library (DL)
    - A DL is a network-accessible, searchable collection of digital information.
    - A DL provides a way to store, organize, preserve, and share information.
    Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
    - A framework that provides interoperability among heterogeneous DLs.
    - Based on metadata harvesting.
    - Defines two roles: Data Providers and Service Providers.
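
    As a rough illustration of metadata harvesting, the sketch below builds the GET URLs a Service Provider might send to a Data Provider. The verbs and the `metadataPrefix` and `identifier` arguments come from the OAI-PMH specification; the endpoint and the record identifier are hypothetical and do not appear in the presentation.

```python
from urllib.parse import urlencode

# Hypothetical Data Provider endpoint (an assumption, not from the slides).
BASE_URL = "http://repository.example.org/oai"


def build_request(verb, **arguments):
    """Return the GET URL for a single OAI-PMH request."""
    params = {"verb": verb}
    params.update(arguments)
    return BASE_URL + "?" + urlencode(params)


# Harvest all records as unqualified Dublin Core.
list_records_url = build_request("ListRecords", metadataPrefix="oai_dc")

# Fetch one record; the identifier is made up for illustration.
get_record_url = build_request(
    "GetRecord",
    identifier="oai:repository.example.org:1234",
    metadataPrefix="oai_dc",
)
```

    A Service Provider would issue such requests periodically and index the returned XML; the Data Provider only has to answer them.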

    5. Background: Metadata Extraction

    Rule-based Approach
    Basic idea
    - Use a set of rules, derived from human observation, that define how to extract metadata. For example, a rule may be “The first line is the title”.
    Pros & Cons
    - No need for training from samples.
    - Can extract different metadata from different documents.
    - Rule writing may require significant technical expertise.
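
    A minimal sketch of what such rules could look like in code. Only the first rule ("the first line is the title") comes from the slide; the author and date rules, and the sample page, are assumptions added for illustration.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)


def extract_metadata(lines):
    """Apply hand-written rules to the OCR'd lines of one metadata page."""
    text_lines = [line.strip() for line in lines if line.strip()]
    metadata = {}

    # Rule 1 (from the slide): the first non-empty line is the title.
    if text_lines:
        metadata["title"] = text_lines[0]

    # Rule 2 (assumed): a line starting with "by" names the author.
    for line in text_lines[1:]:
        match = re.match(r"by\s+(.+)", line, flags=re.IGNORECASE)
        if match:
            metadata["author"] = match.group(1)
            break

    # Rule 3 (assumed): the first "<month>[,] <year>" string is the date.
    for line in text_lines[1:]:
        match = re.search(rf"\b(?:{MONTHS}),?\s+\d{{4}}\b", line)
        if match:
            metadata["date"] = match.group(0)
            break

    return metadata


sample_page = [
    "Automated Building of OAI Compliant Repository",
    "",
    "by Kurt Maly",
    "Old Dominion University",
    "May, 2006",
]
result = extract_metadata(sample_page)
```

    Each rule is cheap to write, but a collection with many layouts needs many such rules, which is where the expertise cost noted above comes from.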

    6. Background: Metadata Extraction (cont.)

    Machine-Learning Approach
    Basic idea
    - Learn the relationship between input and output from samples and make predictions for new data.
    Pros & Cons
    - Good adaptability, but it has to be trained from samples, which is time-consuming.
    - Performance degrades with increasing heterogeneity.
    - Difficult to add new fields to be extracted.
    - Difficult to select the right features for training.
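
    The slides do not say which learning method is meant, so the sketch below uses a 1-nearest-neighbour classifier purely as a stand-in. The two features (relative font size, vertical position on the page) and all training values are invented for illustration.

```python
import math

# Toy training samples: (relative font size, vertical position 0..1) -> field label.
# Every value here is an assumption made up for this example.
TRAINING = [
    ((1.8, 0.10), "title"),
    ((1.6, 0.15), "title"),
    ((1.1, 0.30), "author"),
    ((1.0, 0.35), "author"),
    ((0.9, 0.80), "date"),
    ((0.9, 0.85), "date"),
]


def predict(features):
    """Label a text line with the label of its nearest training sample."""
    nearest = min(TRAINING, key=lambda sample: math.dist(sample[0], features))
    return nearest[1]
```

    Adding a new field means collecting and labelling new samples, and choosing features that separate it from the existing fields, which reflects the drawbacks listed on the slide.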

    7. Background: Document Classification

    Classify document pages into groups based on their visual similarity:
    - the geometrical arrangement of components
    - the typographic features, such as font
    Existing Approaches
    - MXY-Tree: recursively cuts a page into blocks using separators (e.g. lines) as well as white space. The page is converted to a tree.
    - M*N bins: cuts a page into m*n equal-size bins; each bin is either a text bin (if more than half of it is text) or a white-space bin.
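
    The M*N bins idea can be sketched as follows. The page representation (a grid of 0/1 pixels where 1 marks text) and the reading of "more than half" as a pixel ratio are assumptions; the slides do not specify either.

```python
def page_to_bins(page, m, n):
    """Cut a binary page image into m rows by n columns of bins.

    `page` is a list of pixel rows (1 = text pixel, 0 = white); it is assumed
    to have at least m rows and n columns. Returns an m x n grid whose cells
    are "T" (text bin) or "W" (white-space bin).
    """
    height, width = len(page), len(page[0])
    grid = []
    for i in range(m):
        top, bottom = i * height // m, (i + 1) * height // m
        row = []
        for j in range(n):
            left, right = j * width // n, (j + 1) * width // n
            cells = [page[y][x] for y in range(top, bottom) for x in range(left, right)]
            # A bin counts as text when more than half of its pixels are text.
            text_ratio = sum(cells) / len(cells)
            row.append("T" if text_ratio > 0.5 else "W")
        grid.append(row)
    return grid


tiny_page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
bins = page_to_bins(tiny_page, 2, 2)
```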

    8. System Architecture

    9. System Architecture (cont.)

    Main components:
    - Scan and OCR: Commercial OCR software is used to scan the documents.
    - Metadata Extractor: Extracts metadata using rules and machine learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format.
    - OAI layer: Makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses.
    - Search Engine
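
    The mapping and encoding steps might look roughly like this. The local field names and the mapping table are hypothetical; the two namespace URIs are the standard ones for `oai_dc` and the Dublin Core element set. A full OAI-PMH response would wrap this element in further protocol envelope elements, which are omitted here.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from local database fields to Dublin Core elements.
FIELD_TO_DC = {
    "title": "title",
    "author": "creator",
    "report_date": "date",
    "abstract": "description",
}


def to_oai_dc(record):
    """Encode one extracted-metadata record (a dict) as an oai_dc XML string."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        value = record.get(field)
        if value:  # skip fields the extractor did not find
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


sample_xml = to_oai_dc(
    {"title": "Sample Report", "author": "Doe, J.", "report_date": "2006-05"}
)
```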

    10. Template-Based Metadata Extraction

    11. Template-Based Metadata Extraction- Document Classification

    Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata):
    - the geometrical arrangement of metadata fields on the metadata page
    - the typographic features, such as font size, text alignment, and text height
    Metadata pages are identified by a set of rules.

    12. Template-Based Metadata Extraction- Document Classification

    Components: Document Pages; MXY-Tree; m*n bins; Similarity Integration.
    Future: sim = a*sim_tree + b*sim_bin.
    Current:
    - Convert the two pages to two MXY-Trees and compute the edit distance D between the trees. If D > min(length of tree1, length of tree2) / 4, return 0.
    - Convert the two pages into m*n bins and compute the similarity sim (the percentage of bins with the same type). If sim < 0.7, return 0. Else return 1.
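
    The decision rule above can be sketched as follows. The tree edit distance is assumed to be computed elsewhere and passed in, "length of tree" is treated as a plain number, and the weights in the future formula are placeholders; none of these details are given in the slides.

```python
def bin_similarity(bins_a, bins_b):
    """Fraction of bins with the same type in both pages (flat lists of "T"/"W")."""
    same = sum(1 for a, b in zip(bins_a, bins_b) if a == b)
    return same / len(bins_a)


def same_class(tree_distance, tree1_length, tree2_length, bins_a, bins_b):
    """Current rule: return 1 if the two pages are judged similar, else 0."""
    # Step 1: reject when the MXY-Tree edit distance D is too large.
    if tree_distance > min(tree1_length, tree2_length) / 4:
        return 0
    # Step 2: reject when fewer than 70% of the bins agree.
    if bin_similarity(bins_a, bins_b) < 0.7:
        return 0
    return 1


def weighted_similarity(sim_tree, sim_bin, a=0.5, b=0.5):
    """Future rule: sim = a*sim_tree + b*sim_bin (a and b are placeholder weights)."""
    return a * sim_tree + b * sim_bin


page_a = list("TTWWTTWWTT")
page_b = list("TTWWTTWWTW")  # differs from page_a in one bin out of ten
decision = same_class(2, 12, 10, page_a, page_b)
```

    The current rule is a hard two-stage filter, so a page pair must pass both tests; the future formula would instead trade one measure off against the other.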

    13. Template sample

    14. Experiments- Document Classification

    - Downloaded 7413 documents from the DTIC collection.
    - Randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups.

    15. Experiments- Metadata Extraction

    Selected 100 documents from DTIC, divided them into 7 classes, and created a template for each class.

    16. Template-based experiment

    17. Screenshots – OAI

    18. Screenshots – Search Engine

    19. Conclusions

    - We describe how to automate the task of converting an existing corpus into an OAI-compliant repository.
    - We propose a metadata extraction approach that addresses the challenge of achieving acceptable accuracy on a large, heterogeneous collection of documents.
