Metadata Extraction for NASA Collection. June 21, 2007. Kurt Maly, Steve Zeil, Mohammad Zubair {maly, zeil, zubair}@cs.odu.edu. Outline: Metadata Extraction Project; System Overview; Demo; What Can ODU Do for NASA; Current Status and Required Enhancements; Why ODU; Cost Estimate.
Metadata Extraction for NASA Collection • June 21, 2007 • Kurt Maly, Steve Zeil, Mohammad Zubair • {maly, zeil, zubair}@cs.odu.edu
Outline • Metadata Extraction Project • System overview • Demo • What can ODU do for NASA • Current Status and Required enhancements • Why ODU • Cost Estimate
ODU Metadata Extraction System • Input: PDF documents, processed through OCR (Optical Character Recognition) • Output: metadata in XML format, easily processed for uploading into any database (demo: first document)
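The output step on this slide can be illustrated with a minimal sketch. It assumes the extracted metadata is available as a flat mapping of field names to strings; the element names (`metadata`, `title`, `author`, `date`) and the sample values are hypothetical and are not the schema used by the ODU system.

```python
import xml.etree.ElementTree as ET


def metadata_to_xml(fields):
    """Serialize a flat dict of metadata fields to an XML string.

    The root element name "metadata" is a hypothetical choice for this sketch.
    """
    root = ET.Element("metadata")
    for name, value in fields.items():
        child = ET.SubElement(root, name)
        child.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative record only; not taken from the NASA collection.
record = {
    "title": "Example Report Title",
    "author": "Doe, J.",
    "date": "2007-06-21",
}
xml_text = metadata_to_xml(record)
```

Because the result is plain XML, a downstream loader can parse it with any standard XML library and map the fields onto database columns.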
System Overview • Processing has two main branches: • Documents with forms (RDPs) • Documents without forms
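The two-branch split could be sketched as a simple dispatcher over the OCR text of each page. This is an assumption about how such routing might work, not the ODU implementation: it relies only on the fact that the SF-298 form carries the printed title "Report Documentation Page", and the function name `choose_branch` is hypothetical.

```python
# Title printed on the SF-298 form; used here as an assumed detection cue.
RDP_MARKER = "REPORT DOCUMENTATION PAGE"


def choose_branch(page_texts):
    """Return "form" if any OCR'd page looks like an RDP, else "non-form"."""
    for text in page_texts:
        # Collapse whitespace and case so OCR line breaks do not hide the marker.
        normalized = " ".join(text.upper().split())
        if RDP_MARKER in normalized:
            return "form"
    return "non-form"
```

A real system would need a more tolerant match, since OCR errors can corrupt the marker text itself.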
Demo (additional documents)
What Can ODU Do for NASA • Automate processing of form-containing documents at the NASA site • Automate document processing for 80% of the collection with a minimal set of metadata • Provide an interface for human intervention for the remaining 20% • Develop a general reporting tool for management on the accuracy of the process
Current Status • Completely automated software: drop in a PDF file; process it and produce output metadata in XML format • Easy installation process (less than 5 minutes) • Default set of templates for RDP-containing documents and for non-form documents • Statistical models of the NASA collection (30,000 documents): phrase dictionaries for personal authors and corporate authors; length and English-word presence for title and abstract; structure of dates and report numbers
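The statistical checks listed on this slide (length, English-word presence, date structure) might look like the sketch below. Every constant here is a hypothetical stand-in: in practice the length range, the word list, and the date pattern would be derived from the 30,000-document sample rather than hard-coded.

```python
import re

# Hypothetical stand-ins for statistics learned from the collection.
TITLE_LENGTH_RANGE = (3, 30)  # plausible title length, in words
COMMON_WORDS = {"the", "of", "and", "for", "in", "a", "on",
                "analysis", "flow", "study"}
DATE_PATTERN = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}$"
)


def title_looks_plausible(title, min_english_fraction=0.3):
    """Check word count and the fraction of known English words."""
    words = re.findall(r"[A-Za-z]+", title.lower())
    low, high = TITLE_LENGTH_RANGE
    if not (low <= len(words) <= high):
        return False
    english = sum(1 for word in words if word in COMMON_WORDS)
    return english / len(words) >= min_english_fraction


def date_looks_plausible(value):
    """Check that a date string has the assumed "Month YYYY" structure."""
    return DATE_PATTERN.match(value.strip()) is not None
```

Checks of this kind do not prove a value is correct; they only flag extracted values that are unlike what the collection normally contains.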
Current Status • Metadata extraction results for 25 documents randomly selected from the NASA collection • * Notes: • Accuracy is defined as successful completion of the extractor with reasonable metadata values extracted • “Reasonable” means that the values could be automatically processed into a standard format (see required enhancements) • Accuracy for documents without an RDP could be improved with additional templates (see required enhancements)
Current Status • Documents with RDP forms • Extracts high-quality metadata for 2 variants of SF-298 • Tested on 154 NASA documents • Documents without RDP forms • Extracts moderate-quality metadata for 9 common document layouts • Tested on 574 NASA documents
Required Enhancements • Develop complete template set • Standardize output and integrate with existing process at NASA site • Provide tutorial for operation and template writing
Required Enhancements • Develop statistical model of target collection • Write default template set to cover at least 80% of known collection • Provide oracle for detection of problem cases
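One way to connect the statistical model to the 80% coverage target is to rank layouts by frequency and write templates for the most common ones first. The sketch below assumes a per-layout document count is available; the function name and the layout labels in the usage example are hypothetical.

```python
def layouts_for_coverage(layout_counts, target=0.80):
    """Pick the most frequent layouts until the target share is covered.

    Returns the chosen layout names and the fraction of documents they cover.
    """
    total = sum(layout_counts.values())
    if total == 0:
        return [], 0.0
    chosen = []
    covered = 0
    ranked = sorted(layout_counts.items(), key=lambda item: item[1], reverse=True)
    for layout, count in ranked:
        if covered / total >= target:
            break
        chosen.append(layout)
        covered += count
    return chosen, covered / total


# Illustrative counts only; not measured from the NASA collection.
chosen, coverage = layouts_for_coverage({"A": 50, "B": 35, "C": 10, "D": 5})
```

With these illustrative counts, the two most common layouts already exceed the target, so templates for the rarer layouts could be deferred to the human-intervention path.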
Required Enhancements • Develop interface for showing scoring of output and location in document • Develop interactive modules for correcting metadata • Develop driver for creating output in desired format
Required Enhancements • Develop statistical description of input flow of documents • Develop statistical descriptions of output flow of metadata records: accuracy; computer time to process; human time to validate/correct
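The three output measures named on this slide could be summarized from a per-document processing log. The sketch below assumes each log entry records whether the extracted metadata was accepted and how much computer and human time was spent; the field names and sample values are hypothetical.

```python
from statistics import mean


def summarize(records):
    """Summarize accuracy and average processing cost over a batch of records."""
    return {
        "accuracy": mean(1.0 if r["correct"] else 0.0 for r in records),
        "mean_cpu_seconds": mean(r["cpu_seconds"] for r in records),
        "mean_human_seconds": mean(r["human_seconds"] for r in records),
    }


# Illustrative log entries only; not measured results.
log = [
    {"correct": True, "cpu_seconds": 12.0, "human_seconds": 0.0},
    {"correct": False, "cpu_seconds": 18.0, "human_seconds": 90.0},
]
summary = summarize(log)
```

A management reporting tool would likely track these figures over time and per template, but that breakdown is beyond this sketch.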
Why - software from ODU • Research, new technology • ODU digital library research group is world class and has made many contributions to advancing the field. $2.5M in funding over the last five years from various agencies: National Science Foundation, Andrew Mellon Foundation, Los Alamos, Sandia National Laboratory, Air Force Research Laboratory, NASA Langley, DTIC, and IBM • The state of the art in automated metadata extraction works well for homogeneous collections but is not effective for large, evolving, heterogeneous collections (such as NASA’s) • Need for new methods, techniques, and processes
Why - software from ODU • Inexpensive (relatively) • ODU is a university with low overhead (43%) • Universities can employ students on assistantships rather than full-time salaries • The department adds matching tuition waivers for research assistants, which is a big incentive for students to apply for research work • Faculty are among the best in the field and require only partial funding
Why - software from ODU • Long-term software maintenance through the department • Department commits to continuity independent of the faculty on the project • Department will find and assign a faculty member and a student who can become conversant with the code and maintain it (not evolve it) • Other faculty are likely to be interested in evolving the code given appropriate funding
Cost of Possible Project • For a 15-month project on a significant collection, done in isolation, best-estimate cost for NASA: $160,000 • For the same 15-month project done in parallel with DTIC (and possibly GPO), cost for NASA: $90,000