ELPUB 2006, June 14-16, Bansko, Bulgaria
1. Automated Building of OAI Compliant Repository from Legacy Collection
Kurt Maly (Maly@cs.odu.edu), Department of Computer Science, Old Dominion University, May 2006
2. Contents
Introduction
Background
System Architecture
Metadata Extraction Approach
Experiments
Screenshots
3. Introduction
Key problem: extracting metadata from a legacy collection.
OCR alone is not sufficient to make legacy documents searchable.
Manual metadata extraction is costly and time-consuming: creating metadata for 1 million documents would take about 60 employee-years (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
Automatic extraction tools are essential for rapid dissemination at reasonable cost.
4. Background: Digital Library and OAI-PMH
Digital Library (DL)
  A DL is a network-accessible and searchable collection of digital information.
  A DL provides a way to store, organize, preserve, and share information.
Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH)
  A framework that provides interoperability among heterogeneous DLs.
  Based on metadata harvesting.
  Participants: Data Providers and Service Providers.
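As a rough illustration of metadata harvesting, the sketch below builds two OAI-PMH request URLs the way a Service Provider might send them to a Data Provider. The verbs `Identify` and `ListRecords` and the `metadataPrefix` argument come from the OAI-PMH protocol; the base URL and the helper name `oai_request` are assumptions made for this example only.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a data provider (not taken from the slides).
BASE_URL = "http://example.org/oai"


def oai_request(verb, **params):
    """Build an OAI-PMH request URL for the given verb and its arguments."""
    query = {"verb": verb}
    query.update(params)
    return BASE_URL + "?" + urlencode(query)


# Ask the repository to describe itself.
identify_url = oai_request("Identify")
# Harvest all records in unqualified Dublin Core.
list_url = oai_request("ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```

A harvester would fetch these URLs over HTTP and parse the XML responses; that part is omitted here.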
5. Background: Metadata Extraction
Rule-based approach
  Basic idea: use a set of rules, derived from human observation, that define how to extract metadata. For example, a rule may be "The first line is the title".
  Pros and cons:
    No need for training from samples.
    Can extract different metadata from different documents.
    Rule writing may require significant technical expertise.
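A minimal sketch of what such rules could look like in code. Only the "first line is the title" rule comes from the slide; the date rule, the function name `extract_by_rules`, and the sample text are assumptions added for illustration.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Hypothetical rule: a line of the form "Month YYYY" or "Month, YYYY" is the date.
DATE_RULE = re.compile(rf"^({MONTHS}),?\s+\d{{4}}$")


def extract_by_rules(page_text):
    """Apply two hand-written rules to the text of a cover page."""
    lines = [ln.strip() for ln in page_text.splitlines() if ln.strip()]
    metadata = {}
    # Rule from the slide: the first non-empty line is the title.
    if lines:
        metadata["title"] = lines[0]
    # Assumed rule: the first later line matching DATE_RULE is the date.
    for ln in lines[1:]:
        if DATE_RULE.match(ln):
            metadata["date"] = ln
            break
    return metadata


sample_page = """Automated Building of OAI Compliant Repository from Legacy Collection
Kurt Maly
Old Dominion University
May, 2006
"""
extracted = extract_by_rules(sample_page)
print(extracted)
```

Each new layout typically needs its own rules, which is where the expertise cost noted above comes from.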
6. Background: Metadata Extraction (cont.)
Machine-learning approach
  Basic idea: learn the relationship between input and output from samples, and make predictions for new data.
  Pros and cons:
    Good adaptability, but it has to be trained from samples, which is time-consuming.
    Performance degrades with increasing heterogeneity.
    Difficult to add new fields to be extracted.
    Difficult to select the right features for training.
7. Background: Document Classification
Classify document pages into groups based on their visual similarity:
  the geometrical arrangement of components
  the typographic features, such as font
Existing approaches:
  MXY-Tree: recursively cuts a page into blocks by separators (e.g. lines) as well as white space. A page is converted to a tree.
  m*n bins: cuts a page into m*n equal-size bins; a bin is either a text bin (if more than half of it is text) or a white-space bin.
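The m*n bins idea can be sketched as follows. The page representation (rows of 0/1 cells, where 1 marks text), the labels "T" and "W", and the function name `to_bins` are assumptions for this example; the slide only specifies the "more than half is text" criterion.

```python
def to_bins(page, m, n):
    """Cut a binary page into m rows by n columns of bins.

    `page` is a list of rows of 0/1 cells, where 1 marks text.
    Returns an m x n grid of labels: "T" (text bin) or "W" (white-space bin).
    """
    height, width = len(page), len(page[0])
    grid = []
    for i in range(m):
        top, bottom = i * height // m, (i + 1) * height // m
        row = []
        for j in range(n):
            left, right = j * width // n, (j + 1) * width // n
            cells = [page[y][x] for y in range(top, bottom) for x in range(left, right)]
            # A bin is a text bin when more than half of its cells are text.
            row.append("T" if sum(cells) * 2 > len(cells) else "W")
        grid.append(row)
    return grid


page = [
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 0],
]
bins = to_bins(page, 2, 2)
print(bins)
```

Bins are exactly equal in size only when the page dimensions are divisible by m and n; otherwise the integer division above makes them approximately equal.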
8. System Architecture
9. System Architecture (cont.)
Main components:
  Scan and OCR: commercial OCR software is used to scan the documents.
  Metadata Extractor: extracts metadata using rules and machine-learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format.
  OAI layer: makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses.
  Search Engine
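The mapping and encoding steps might look roughly like this. The two namespace URIs are the standard ones for `oai_dc` and Dublin Core; the local field names (`title`, `author`, `report_date`), the mapping table, and the function name `to_oai_dc` are hypothetical, since the slides do not show the database schema.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from local database fields to Dublin Core elements.
FIELD_TO_DC = {"title": "title", "author": "creator", "report_date": "date"}


def to_oai_dc(record):
    """Encode one extracted record as an oai_dc XML fragment."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for field, dc_name in FIELD_TO_DC.items():
        values = record.get(field, [])
        if isinstance(values, str):
            values = [values]  # allow single or repeated values
        for value in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{dc_name}")
            child.text = value
    return ET.tostring(root, encoding="unicode")


record = {
    "title": "Automated Building of OAI Compliant Repository from Legacy Collection",
    "author": ["Kurt Maly"],
    "report_date": "2006-05",
}
xml_text = to_oai_dc(record)
print(xml_text)
```

In a full OAI layer this fragment would sit inside the `metadata` element of a `GetRecord` or `ListRecords` response; the response envelope is left out here.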
10. Template-Based Metadata Extraction
11. Template-Based Metadata Extraction – Document Classification
Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata):
  the geometrical arrangement of metadata fields on the metadata page
  the typographic features, such as font size, text alignment, and text height
Metadata pages are identified by a set of rules.
12. Template-Based Metadata Extraction – Document Classification (cont.)
Diagram: document pages, MXY-Tree, m*n bins, similarity, integration.
Current:
  Convert the two pages to two MXY-Trees and compute the edit distance D between the trees. If D > min(length of tree1, length of tree2) / 4, return 0.
  Convert the two pages into m*n bins and compute the similarity sim (the percentage of bins with the same type). If sim < 0.7, return 0; else return 1.
Future: sim = a*sim_tree + b*sim_bin.
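One possible reading of the current decision procedure, sketched in Python. It assumes the two tests are applied in sequence (both must pass), that the tree edit distance is computed elsewhere and passed in, and that the "length" of a tree is a single number such as its node count. The function names and the "T"/"W" bin labels are illustrative, not from the slides.

```python
def bin_similarity(bins_a, bins_b):
    """Fraction of positions where two equally sized bin grids have the same type."""
    pairs = [
        (a, b)
        for row_a, row_b in zip(bins_a, bins_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(a == b for a, b in pairs) / len(pairs)


def same_class(tree_distance, tree1_len, tree2_len, bins_a, bins_b, bin_threshold=0.7):
    """Return 1 if two pages are judged visually similar, else 0."""
    # Tree test: reject when D > min(length of tree1, length of tree2) / 4.
    if tree_distance > min(tree1_len, tree2_len) / 4:
        return 0
    # Bin test: reject when fewer than 70% of the bins have the same type.
    if bin_similarity(bins_a, bins_b) < bin_threshold:
        return 0
    return 1


bins_a = [["T", "W"], ["W", "W"]]
bins_b = [["T", "W"], ["W", "T"]]
similar = same_class(2, 12, 10, bins_a, bins_b)
different = same_class(3, 12, 10, bins_a, bins_b)
print(similar, different)
```

The planned weighted combination would replace the two hard thresholds with a single score; the weights a and b are not given in the slides.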
13. Template sample
14. Experiments – Document Classification
Downloaded 7413 documents from the DTIC collection.
Randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups.
15. Experiments – Metadata Extraction
Selected 100 documents from DTIC, divided them into 7 classes, and created a template for each class.
16. Template-based experiment
17. Screenshots – OAI
18. Screenshots – Search Engine
19. Conclusions
We describe how to automate the task of converting an existing corpus into an OAI-compliant repository.
We propose a metadata extraction approach that addresses the challenge of achieving the desired accuracy for a large, heterogeneous collection of documents.