Tools for Extracting Metadata and Structure from DTIC Documents Digital Library Group Department of Computer Science Old Dominion University December, 2004
Problem Statement • Manual metadata extraction and logical structure extraction are expensive • Metadata improves discovery and interoperability (OAI-PMH) • Logical structure supports preservation and different presentation formats (e.g., mobile devices in the future)
Motivations – Metadata Extraction • Using metadata helps resource discovery • A company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching, verifying and organizing files (estimate by Mike Doane at the DCMI 2003 workshop) • Using metadata helps make collections interoperable with OAI-PMH • On the other hand, creating metadata manually for a large collection is expensive • It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop)
Motivations – Logical Structure Extraction • Converting a document into XML format with logical structure helps information preservation • Information in a document remains accessible, and the document can still be presented appropriately, when the software needed to open the original is no longer available • Converting a document into XML format with logical structure helps information presentation • With different XSL stylesheets, an XML document can be presented differently • An XML document can be presented differently on different devices such as web browsers, PDAs, etc. • It allows different users to have different levels of access. For example, a registered user may see all parts of a document while a guest can access only the introduction section • Converting a document into XML format with logical structure helps information discovery • It allows retrieval based on logical components, for example, searching only in the introduction • It allows some special searches such as equation search
Objectives • To develop a flexible and adaptable approach for extracting metadata from physical collections, with a focus on DTIC (Defense Technical Information Center) collections. • To develop techniques to extract the basic logical structure of scanned full-text documents. • To develop techniques to extract and represent complex objects such as equations, figures, etc.
Background • Metadata Extraction • Rule-based approach • Machine-Learning approach • Hidden Markov Model • Support Vector Machine • Logical Structure Extraction • Basic logical structure extraction • Reference Extraction & Reference Linking • OAI and Digital Library Note: OAI-PMH, the Open Archives Initiative Protocol for Metadata Harvesting, is a framework that provides interoperability among distributed collections.
Background - Metadata Extraction • Rule-based approach • Machine-Learning approach • Hidden Markov Model • Support Vector Machine
Metadata Extraction: Rule-based • Basic idea: • Use a set of rules, based on human observation, to define how to extract metadata. • For example, a rule may be "The first line is the title". • Advantages • Straightforward to implement • No need for training • Disadvantages • Lacks adaptability (works only for similar documents) • Difficult to work with a large number of features • Difficult to tune the system when errors occur because the rules are usually fixed
Metadata Extraction: Rule-based • Related work • Automated labeling algorithms for biomedical document images (Kim J, 2003) • Extracts metadata from the first pages of biomedical journals • Accuracy: title 100%, author 95.64%, abstract 95.85%, affiliation 63.13% (76 articles were used for testing) • Document Structure Analysis Based on Layout and Textual Features (Stefan Klink, 2000) • Extracts metadata from the U-Wash document corpus with 979 journal pages • Good results for some elements (e.g., page number: 90% recall and 98% precision) but poor results for others (abstract: 35% recall and 90% precision; biography: 80% recall and 35% precision)
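A minimal sketch of what a fixed rule set can look like, assuming the input is the list of text lines from a document's first page. Rule 1 is the example rule from the slide; Rule 2 (a date pattern) and the function name extract_by_rules are hypothetical additions for illustration only.

```python
import re

# Hypothetical date rule: a month name followed by a four-digit year.
DATE_PATTERN = re.compile(
    r"\b(January|February|March|April|May|June|July|August|"
    r"September|October|November|December),?\s+\d{4}\b"
)

def extract_by_rules(lines):
    """Apply two hand-written rules to the lines of a first page."""
    nonblank = [ln.strip() for ln in lines if ln.strip()]
    metadata = {}
    # Rule 1 (from the slide): the first line is the title.
    if nonblank:
        metadata["title"] = nonblank[0]
    # Rule 2 (illustrative assumption): the first later line that
    # contains a month name and a year is the date.
    for ln in nonblank[1:]:
        match = DATE_PATTERN.search(ln)
        if match:
            metadata["date"] = match.group(0)
            break
    return metadata

page = [
    "",
    "Tools for Extracting Metadata and Structure from DTIC Documents",
    "Digital Library Group",
    "Old Dominion University",
    "December, 2004",
]
result = extract_by_rules(page)
print(result)
```

The sketch also shows the disadvantage listed above: a page whose first line is a report number rather than the title would be mislabeled, and the only fix is to edit the rules by hand.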
Metadata Extraction: Machine-Learning Approach • Basic idea: • Learn the relationship between input and output from samples and make predictions for new data • This approach has good adaptability but it has to be trained from samples.
HMM - Metadata Extraction • A document is a sequence of words produced by some hidden states (title, author, etc.) • The parameters of the HMM are learned from samples in advance. • Metadata extraction means finding the most probable sequence of states (title, author, etc.) for a given sequence of words.
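Finding the most probable state sequence is usually done with the Viterbi algorithm. The following is a sketch under stated assumptions, not the system described in these slides: the three states, all probabilities, and the tiny vocabulary are made-up toy values (in practice they would be learned from labeled samples), and unseen words receive a small floor probability.

```python
import math

def viterbi(words, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable state sequence for `words` (log-space Viterbi)."""
    def log_emit(state, word):
        # Words missing from a state's vocabulary get a small floor probability.
        return math.log(emit_p[state].get(word, floor))

    # best[t][s]: log-probability of the best path ending in state s at position t
    best = [{s: math.log(start_p[s]) + log_emit(s, words[0]) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit(s, words[t])
            back[t][s] = prev
    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

# Toy parameters (assumptions for illustration, not learned values).
states = ["title", "author", "date"]
start_p = {"title": 0.8, "author": 0.1, "date": 0.1}
trans_p = {
    "title": {"title": 0.7, "author": 0.25, "date": 0.05},
    "author": {"title": 0.05, "author": 0.7, "date": 0.25},
    "date": {"title": 0.1, "author": 0.1, "date": 0.8},
}
emit_p = {
    "title": {"challenges": 0.2, "in": 0.2, "building": 0.2,
              "federation": 0.2, "services": 0.2},
    "author": {"kurt": 0.25, "maly": 0.25, "mohammad": 0.25, "zubair": 0.25},
    "date": {"2003": 1.0},
}

words = ["challenges", "in", "building", "federation", "kurt", "maly", "2003"]
labels = viterbi(words, states, start_p, trans_p, emit_p)
print(list(zip(words, labels)))
```

Working in log space avoids numeric underflow on long word sequences, which matters once whole header pages are decoded rather than seven words.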
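[Figure: HMM example. The word sequence "… Challenges in Building Federation … Kurt Maly … 2003" is labeled with hidden states (title, author, etc.). Source text: Challenges in Building Federation Services over Harvested Metadata, Kurt Maly, Mohammad Zubair, 2003]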
HMM - Metadata Extraction • Related work • K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. • Result: an overall accuracy of 90.1% was reported
Support Vector Machine - general • Overview • It was introduced by Vapnik in the late 1970s • It is now receiving increasing attention • It is widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc. • A list of SVM applications is available at http://www.clopinet.com/isabelle/Projects/SVM/applist.html • It is also used in text analysis (Joachims 1998, etc.) and metadata extraction (Han 2003).
Support Vector Machine - general • Many decision boundaries can separate these two classes • Which one should we choose? [Figure: points from Class 1 and Class 2 with several candidate decision boundaries] Courtesy: Martin Law
Support Vector Machine - general • Basic idea • Choose the hyperplane that separates the two classes with the largest margin [Figure: Class 1 and Class 2 separated by a hyperplane; the support vectors and the margin are marked]
Support Vector Machine - general • Binary classifier (classifies data into two classes) • It represents data with pre-defined features • It finds the hyperplane with the largest margin that separates the two classes in the training samples • It classifies data into two classes based on which side of the hyperplane they fall on. [Figure: lines plotted by font size and line number, with the separating hyperplane and its margin] The figure shows an SVM example that classifies a line into two classes, title and not title, using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line. Red dot: title; blue dot: not title.
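The largest-margin criterion can be illustrated without training anything. The sketch below does not implement an SVM solver; it only compares two hand-picked candidate hyperplanes on made-up (font size, line number) points and keeps the one with the larger margin, then classifies a new line by which side it falls on. All data values and the names geometric_margin and classify are assumptions for illustration.

```python
import math

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane w.x + b = 0.

    A negative result means at least one point lies on the wrong side.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    )

def classify(w, b, x):
    """Return +1 (title) or -1 (not title) depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy data: each point is (font size, line number); +1 = title, -1 = not title.
points = [(18, 1), (16, 2), (10, 5), (10, 8), (9, 12)]
labels = [1, 1, -1, -1, -1]

# Two candidate boundaries that both separate the classes.
candidates = {
    "A": ((1.0, 0.0), -13.0),  # the line "font size = 13"
    "B": ((1.0, 0.0), -15.0),  # the line "font size = 15"
}
margins = {
    name: geometric_margin(w, b, points, labels)
    for name, (w, b) in candidates.items()
}
best = max(margins, key=margins.get)
best_w, best_b = candidates[best]
print(margins, best, classify(best_w, best_b, (17, 1)))
```

An actual SVM searches over all hyperplanes for the one that maximizes this quantity; the points that attain the minimum distance are the support vectors.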
Multi-Class SVMs • Combining binary SVMs into a multi-class classifier • One-vs-rest • Classes: in this class or not in this class • Positive training samples: data in this class • Negative training samples: the rest • K binary SVMs (K is the number of classes) • One-vs-one • Classes: in class one or in class two • Positive training samples: data in class one • Negative training samples: data in class two • K(K-1)/2 binary SVMs
SVM - Metadata Extraction • Basic idea • Classes correspond to metadata elements • Extracting metadata from a document means classifying each line (or block) into the appropriate class. • For example, extracting the document title from a document means classifying each line to see whether it is part of the title or not • Related work • Automatic Document Metadata Extraction Using Support Vector Machine (H. Han, 2003) • An overall accuracy of 92.9% was reported
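The two decompositions differ only in how the binary training sets are built. A minimal sketch, assuming labeled samples are given as (features, class) pairs; the sample values and function names are hypothetical, and the binary SVM training step itself is left out.

```python
from itertools import combinations

def one_vs_rest_tasks(samples):
    """Build one binary task per class: this class vs. everything else."""
    classes = sorted({label for _, label in samples})
    tasks = {}
    for c in classes:
        positives = [x for x, label in samples if label == c]
        negatives = [x for x, label in samples if label != c]
        tasks[c] = (positives, negatives)
    return tasks

def one_vs_one_tasks(samples):
    """Build one binary task per unordered pair of classes."""
    classes = sorted({label for _, label in samples})
    tasks = {}
    for a, b in combinations(classes, 2):
        positives = [x for x, label in samples if label == a]
        negatives = [x for x, label in samples if label == b]
        tasks[(a, b)] = (positives, negatives)
    return tasks

# Toy samples: (font size, line number) features with a metadata class.
samples = [
    ((18, 1), "title"),
    ((12, 3), "author"),
    ((12, 4), "author"),
    ((10, 7), "abstract"),
    ((10, 20), "other"),
]
ovr = one_vs_rest_tasks(samples)
ovo = one_vs_one_tasks(samples)
k = len(ovr)
print(k, len(ovo))
```

With K = 4 classes this yields 4 one-vs-rest tasks and 4 * 3 / 2 = 6 one-vs-one tasks; one-vs-one trains more classifiers, but each sees only the samples of two classes.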
Logical Structure Extraction • Physical Structure
Structure Extraction • Logical Structure
Digital Library and OAI • Digital Library (DL) • A DL is a network-accessible and searchable collection of digital information. • A DL provides a way to store, organize, preserve and share information. • Interoperability problem • DLs are usually created separately, using different technologies and different metadata schemas.
Open Archives Initiative (OAI) • The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a framework that provides interoperability among heterogeneous DLs. • It is based on metadata harvesting: a service provider can harvest metadata from a data provider. • A data provider accepts OAI-PMH requests and provides metadata over the network • A service provider issues OAI-PMH requests to get metadata and builds services on it. • Each data provider can support its own metadata formats, but it must support at least the Dublin Core (DC) metadata set.
Dublin Core Metadata Set • It supports 15 elements • Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights • All fields are optional • http://dublincore.org/documents/dces/
What does it mean to make an existing digital library OAI-enabled? • An OAI layer on top of the digital library exposes metadata to OAI service providers (DC and parallel metadata sets) • ONLY METADATA [Figure: Digital Library, OAI Layer, Storage]
OAI Request and OAI Response - OAI Request for Metadata is embedded in HTTP. - OAI Response to OAI Request is encoded in XML. - XML Schema specification for OAI Response is provided in OAI-PMH document. RCDL 2003, St. Petersburg
OAI Mechanics • The request is encoded in HTTP • The response is encoded in XML • XML Schemas for the responses are defined in the OAI-PMH document Courtesy: Michael Nelson
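The request/response mechanics can be sketched as follows. The base URL is a hypothetical placeholder, and the response is a hand-written, abbreviated sample (a real response also carries elements such as responseDate and request); only the ListRecords verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications. No network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

BASE_URL = "http://example.org/oai"  # hypothetical data provider endpoint

def build_request(base_url, verb, **arguments):
    """Encode an OAI-PMH request as an HTTP GET URL."""
    return base_url + "?" + urlencode({"verb": verb, **arguments})

request_url = build_request(BASE_URL, "ListRecords", metadataPrefix="oai_dc")

# Abbreviated, hand-written sample of an XML response (illustration only).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Tools for Extracting Metadata and Structure from DTIC Documents</dc:title>
          <dc:creator>Digital Library Group</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""

NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def parse_records(xml_text):
    """Return (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for record in root.findall(".//oai:record", NAMESPACES):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NAMESPACES)
        title = record.findtext(".//dc:title", namespaces=NAMESPACES)
        records.append((identifier, title))
    return records

records = parse_records(SAMPLE_RESPONSE)
print(request_url)
print(records)
```

A service provider would fetch request_url over HTTP and pass the body to parse_records; that step is omitted here so the sketch runs offline.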
Overall Approach & Architecture* *This is our overall vision and only some components of this architecture are being implemented as part of the current contract
Overall Approach & Architecture • Main components: • Scan and OCR: Commercial OCR software is used to scan the documents. • Metadata Extractor: Extracts metadata using rules and machine learning techniques. The extracted metadata is stored in a local database. To support Dublin Core, it may be necessary to map the extracted metadata to the Dublin Core format. • Object Digitization: Converts documents into XML format for better preservation and better presentation. The main tasks: • Extraction of complex objects such as figures • Extraction of document logical structure • Extraction of references and reference linking. • OAI layer: Makes the digital collection interoperable. The OAI layer accepts all OAI requests, gets the information from the database and encodes the metadata in XML format as responses. • Search Engine
Metadata Extraction • A challenge is how to reach the desired accuracy for a large heterogeneous collection • Manually defining a set of rules that covers all situations in advance is difficult • Machine learning • Requires many labeled samples; for example, an HMM-based name recognizer used training data of about 1.2 million words to achieve high accuracy (according to Douglas E. Appelt). Accuracy is the ratio of the number of items tagged correctly to the total number of items.
Metadata Extraction • Feasible solution • Classify documents into classes • Documents in the same class have a similar layout • Work on each document class separately instead of on the whole large collection.
Metadata Extraction (Cont.) • Overall Approach for Handling a Large Collection • Manual Classification • This approach assumes it is possible to manually classify the large set of documents into classes (based on time period, source organization, etc.) • For each class, randomly select, say, 100 documents and develop a template. Evaluate the template by statistical sampling and refine it until the error is under a tolerance level. Then apply the refined template to the whole set. • Auto-Classification • This approach assumes it is not possible to classify the large set of documents manually. In this case we develop a higher-level set of rules for classification on a smaller sample, and evaluate the classification approach by statistical sampling. • Next, develop the template for each class, apply it, and refine it as outlined in the manual classification approach.
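The mapping step mentioned for the Metadata Extractor could look like the sketch below. The local field names and the choice of target elements are assumptions for illustration; the slides do not specify the extractor's actual schema. Only the Dublin Core element names themselves are taken from the standard.

```python
# Hypothetical mapping from locally extracted fields to Dublin Core elements.
LOCAL_TO_DC = {
    "title": "title",
    "author": "creator",
    "abstract": "description",
    "report_date": "date",
    "organization": "publisher",
}

def to_dublin_core(record):
    """Map an extracted record to Dublin Core.

    Every DC element is optional and repeatable, so each value is a list
    and unmapped or empty fields are simply dropped.
    """
    dc = {}
    for field, value in record.items():
        element = LOCAL_TO_DC.get(field)
        if element is None or not value:
            continue
        dc.setdefault(element, []).append(value)
    return dc

extracted = {
    "title": "Tools for Extracting Metadata and Structure from DTIC Documents",
    "author": "Digital Library Group",
    "report_date": "December, 2004",
    "page_count": 36,   # no Dublin Core equivalent assumed in this sketch
    "abstract": "",     # empty values are skipped
}
dc_record = to_dublin_core(extracted)
print(dc_record)
```

Keeping the mapping in a table rather than in code makes it easy to add a parallel metadata set next to Dublin Core later.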
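The accuracy measure defined above translates directly into code. A minimal sketch, assuming predicted and expected tags are given as equal-length lists; the tag values are made up.

```python
def accuracy(predicted, expected):
    """Number of items tagged correctly divided by the total number of items."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)

expected_tags = ["title", "title", "author", "author", "date"]
predicted_tags = ["title", "title", "author", "title", "date"]
score = accuracy(predicted_tags, expected_tags)
print(score)
```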
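The evaluate-and-refine loop for one document class can be sketched as follows. Everything here is an illustrative assumption: a template is modeled as a function from a document to an extracted title, refinement (a manual step in practice) is a stand-in function, and the toy documents, names and tolerance are made up.

```python
import random

def estimate_error(template, documents, sample_size, rng):
    """Estimate a template's error rate on a random sample of documents."""
    sample = rng.sample(documents, min(sample_size, len(documents)))
    errors = sum(1 for doc in sample if template(doc) != doc["expected"])
    return errors / len(sample)

def refine_until_ok(template, refine, documents, tolerance,
                    sample_size=100, max_rounds=10, seed=0):
    """Sample, evaluate and refine until the error is within the tolerance."""
    rng = random.Random(seed)
    for round_no in range(1, max_rounds + 1):
        error = estimate_error(template, documents, sample_size, rng)
        if error <= tolerance:
            return template, error, round_no
        template = refine(template)  # stands in for manual template editing
    raise RuntimeError("template did not reach the tolerance level")

# Toy templates: the first guesses the first line; the refined one
# skips a leading report-number line.
def first_line(doc):
    return doc["lines"][0]

def skip_report_number(doc):
    return [ln for ln in doc["lines"] if not ln.startswith("Report No.")][0]

# Toy document class: 50 plain documents and 150 with a report number on top.
documents = (
    [{"lines": ["Title %d" % i, "Author"], "expected": "Title %d" % i}
     for i in range(50)]
    + [{"lines": ["Report No. %d" % i, "Title %d" % i], "expected": "Title %d" % i}
       for i in range(50, 200)]
)

template, error, rounds = refine_until_ok(
    first_line, lambda current: skip_report_number, documents, tolerance=0.05
)
print(template.__name__, error, rounds)
```

Once the loop returns, the accepted template would be applied to the whole class; in the auto-classification variant the same loop runs per class after the classification rules have themselves been validated by sampling.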
Preliminary Experiments • Performance Measures • SVM Experiments with different data sets • Pure rule-based experiment