Automated Building of OAI Compliant Repository from Legacy Collection Kurt Maly Maly@cs.odu.edu Department of Computer Science Old Dominion University May, 2006 ELPUB 2006 June 14-16 Bansko Bulgaria
Contents
• Introduction
• Background
• System Architecture
• Metadata Extraction Approach
• Experiments
• Screenshots
Introduction
• Key problem: extracting metadata from a legacy collection
  • OCR alone is not sufficient to make legacy documents searchable.
  • Manual metadata extraction is costly and time-consuming.
    • Creating metadata for 1 million documents would take about 60 employee-years (estimate by Lou Rosenfeld at the DCMI 2003 workshop).
• Automatic extraction tools are essential for rapid dissemination at reasonable cost.
Background: Digital Library and OAI-PMH
• Digital Library (DL)
  • A DL is a network-accessible, searchable collection of digital information.
  • A DL provides a way to store, organize, preserve, and share information.
• The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs.
  • Based on metadata harvesting
  • Two roles: Data Providers and Service Providers
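Note: in OAI-PMH, a service provider harvests a data provider by sending HTTP requests that carry a `verb` argument. The following is a minimal sketch of how such request URLs are formed. The base URL and the helper name `oai_request_url` are assumptions for illustration; they do not come from the slides.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a data provider (not taken from the slides).
BASE_URL = "http://example.org/oai"


def oai_request_url(base_url, verb, **arguments):
    """Build an OAI-PMH request URL: the verb plus its arguments as a query string."""
    query = {"verb": verb}
    query.update(arguments)
    return base_url + "?" + urlencode(query)


# "Identify" asks the repository to describe itself.
identify_url = oai_request_url(BASE_URL, "Identify")
# "ListRecords" harvests records; oai_dc is the Dublin Core metadata format.
list_url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

print(identify_url)
print(list_url)
```

A service provider would issue requests like these periodically and index the XML responses it receives.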
Background: Metadata Extraction
• Rule-based Approach
  • Basic idea
    • Use a set of rules, derived from human observation, that define how to extract metadata.
    • For example, a rule may be “The first line is the title”.
  • Pros & Cons
    • No need for training from samples
    • Can extract different metadata from different documents
    • Rule writing may require significant technical expertise
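Note: the rule-based idea can be sketched as a table of small functions, one per metadata field. This is only an illustration, not the rule language used in the presented system. The rule names, the date pattern, and the sample page are all assumptions.

```python
import re

# Matches dates such as "May, 2006" or "June 2004" (an illustrative pattern only).
DATE_PATTERN = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December),?\s+\d{4}\b"
)


def first_nonblank_line(lines):
    """Rule: 'the first line is the title'."""
    for line in lines:
        if line.strip():
            return line.strip()
    return None


def first_date(lines):
    """Rule: the first 'Month, Year' expression on the page is the date."""
    for line in lines:
        match = DATE_PATTERN.search(line)
        if match:
            return match.group(0)
    return None


# Hypothetical rule set: one rule per metadata field.
RULES = {"title": first_nonblank_line, "date": first_date}


def extract_metadata(page_text):
    """Apply every rule to the page and collect the results."""
    lines = page_text.splitlines()
    return {field: rule(lines) for field, rule in RULES.items()}


sample_page = """
Automated Building of OAI Compliant Repository from Legacy Collection
Kurt Maly
Old Dominion University
May, 2006
"""
metadata = extract_metadata(sample_page)
print(metadata)
```

Adding a field means adding one more rule to the table, which is why no training is needed, but each rule has to be written and tuned by hand.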
Background: Metadata Extraction (cont.)
• Machine-Learning Approach
  • Basic idea
    • Learn the relationship between input and output from samples, and make predictions for new data.
  • Pros & Cons
    • Good adaptability, but it has to be trained from samples, which is time-consuming
    • Performance degrades with increasing heterogeneity
    • Difficult to add new fields to be extracted
    • Difficult to select the right features for training
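Note: to make "learn from samples, predict for new data" concrete, here is a toy sketch that labels lines with a 1-nearest-neighbour classifier. The slides do not say which learning method was used, so the classifier, the four features, and the training pages below are all assumptions chosen for brevity.

```python
def line_features(line, index, total):
    """Hand-picked features: position on page, length, uppercase and digit ratios."""
    words = line.split()
    letters = [c for c in line if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    digit_ratio = sum(c.isdigit() for c in line) / len(line) if line else 0.0
    position = index / max(total - 1, 1)
    return (position, min(len(words), 20) / 20, upper_ratio, digit_ratio)


def train(labeled_pages):
    """Turn labeled pages (lists of (line, label) pairs) into feature/label samples."""
    samples = []
    for page in labeled_pages:
        for index, (line, label) in enumerate(page):
            samples.append((line_features(line, index, len(page)), label))
    return samples


def predict(samples, line, index, total):
    """Return the label of the training sample closest in feature space."""
    features = line_features(line, index, total)

    def squared_distance(sample):
        return sum((a - b) ** 2 for a, b in zip(sample[0], features))

    return min(samples, key=squared_distance)[1]


# Hypothetical training data: two hand-labeled metadata pages.
training_pages = [
    [("Survey of Digital Library Systems", "title"),
     ("Jane Doe", "author"),
     ("March 2001", "date")],
    [("Metadata Harvesting in Practice", "title"),
     ("John Smith", "author"),
     ("July 1999", "date")],
]
samples = train(training_pages)

new_page = ["Legacy Collections and Open Archives", "Alex Roe", "June 2004"]
predictions = [predict(samples, line, i, len(new_page)) for i, line in enumerate(new_page)]
print(predictions)
```

The sketch also shows the listed drawbacks: a page laid out differently from the training pages would be mislabeled, and supporting a new field requires new labeled samples and possibly new features.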
Background: Document Classification
• Classify document pages into groups based on their visual similarity:
  • the geometrical arrangement of components
  • the typographic features, such as font
• Existing Approaches
  • MXY-Tree
    • Recursively cuts a page into blocks using separators (e.g. lines) as well as white space. The page is converted into a tree.
  • m*n bins
    • Cuts a page into m*n equal-size bins; a bin is either a text bin (if more than half of it is text) or a white-space bin.
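Note: the m*n bins idea can be sketched on a page represented as a grid of 0/1 values, where 1 marks a text pixel. The page representation and the similarity measure (the fraction of bins with the same label) are assumptions made for this sketch; the slides do not specify how two binned pages are compared.

```python
def page_to_bins(page, m, n):
    """Cut a page (rows of 0/1, 1 = text) into m*n bins labeled 'T' (text) or 'W' (white)."""
    height, width = len(page), len(page[0])
    bins = []
    for i in range(m):
        top, bottom = i * height // m, (i + 1) * height // m
        row = []
        for j in range(n):
            left, right = j * width // n, (j + 1) * width // n
            cells = [page[y][x] for y in range(top, bottom) for x in range(left, right)]
            # A bin is a text bin only if more than half of its cells are text.
            row.append("T" if 2 * sum(cells) > len(cells) else "W")
        bins.append(row)
    return bins


def bin_similarity(bins_a, bins_b):
    """Assumed measure: fraction of bins that carry the same label on both pages."""
    flat_a = [label for row in bins_a for label in row]
    flat_b = [label for row in bins_b for label in row]
    matches = sum(a == b for a, b in zip(flat_a, flat_b))
    return matches / len(flat_a)


page_a = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
page_b = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 0],
]
bins_a = page_to_bins(page_a, 2, 2)
bins_b = page_to_bins(page_b, 2, 2)
similarity = bin_similarity(bins_a, bins_b)
print(bins_a, bins_b, similarity)
```

In this example the lower-left bin of `page_a` is exactly half text, so it counts as white space, which shows how the "more than half" threshold behaves at the boundary.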
System Architecture
System Architecture (cont.)
• Main components:
  • Scan and OCR: Commercial OCR software is used to scan the documents.
  • Metadata Extractor: Extracts metadata using rules and machine learning techniques. The extracted metadata are stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to Dublin Core format.
  • OAI layer: Makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database, and encodes the metadata in XML as responses.
  • Search Engine
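Note: the mapping and encoding steps of the OAI layer can be sketched as follows. The two namespace URIs are the standard ones for `oai_dc` and Dublin Core. The local field names, the mapping table, and the function `record_to_oai_dc` are assumptions for illustration, not the system's actual schema or code.

```python
import xml.etree.ElementTree as ET

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("oai_dc", OAI_DC_NS)
ET.register_namespace("dc", DC_NS)

# Hypothetical mapping from local database fields to Dublin Core elements.
FIELD_TO_DC = {"title": "title", "author": "creator", "report_date": "date"}


def record_to_oai_dc(record):
    """Encode one extracted record as an oai_dc XML fragment."""
    root = ET.Element("{%s}dc" % OAI_DC_NS)
    for field, value in record.items():
        dc_name = FIELD_TO_DC.get(field)
        if dc_name is None:
            continue  # fields without a Dublin Core mapping are skipped
        child = ET.SubElement(root, "{%s}%s" % (DC_NS, dc_name))
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical row as it might be read from the local database.
record = {
    "title": "Automated Building of OAI Compliant Repository from Legacy Collection",
    "author": "Kurt Maly",
    "report_date": "2006-05",
    "internal_id": "42",
}
xml_text = record_to_oai_dc(record)
print(xml_text)
```

A full OAI layer would wrap such fragments in the protocol's response envelope for each request it answers; that part is omitted here.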
Template-Based Metadata Extraction
Template-Based Metadata Extraction - Document Classification
• Classify documents into groups based on the visual similarity of their metadata pages (pages rich in metadata):
  • the geometrical arrangement of metadata fields on the metadata page
  • the typographic features, such as font size, text alignment, and text height
• Metadata pages are identified by a set of rules.
Template-Based Metadata Extraction - Document Classification
[Diagram labels: Document Pages; MXY-Tree; m*n bins; Similarity Integration]
Template sample
Experiments - Document Classification
• Downloaded 7413 documents from the DTIC collection
• Randomly selected 200, 400, 800, 1200, 2000, 3000, 4000, 5000, and 6000 documents and classified them into groups
Experiments - Metadata Extraction
• Selected 100 documents from DTIC, divided them into 7 classes, and created a template for each class
Template-based experiment
Screenshots – OAI
Screenshots – Search Engine
Conclusions
• We describe how to automate the task of converting an existing corpus into an OAI-compliant repository.
• We propose a metadata extraction approach that addresses the challenge of achieving the desired accuracy for a large, heterogeneous collection of documents.