Heuristic Approach for Automatic Metadata Capture of E-books/Journals

ARD Prasad, DRTC, Indian Statistical Institute, Bangalore

Agenda • Earlier Experiment with printed books • Present Experiment with E-Books & E-Journals

Heuristics for Printed Books • Heuristics for the ... • Title page • Verso of the title page

Methodology for Printed Books • Scan the title page • OCR the image • Generate the output in HTML • Apply Heuristics to HTML pages • Identify the bibliographic elements
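The step "Apply Heuristics to HTML pages" could be sketched as follows. This is only an illustration: it assumes the OCR output marks font sizes with inline `style="font-size:NNpt"` attributes, which the slides do not state, and the class name `FontSizeExtractor` is hypothetical.

```python
from html.parser import HTMLParser
import re

# Assumption: the OCR tool writes sizes as inline CSS, e.g. style="font-size:28pt".
FONT_SIZE = re.compile(r"font-size\s*:\s*(\d+(?:\.\d+)?)\s*pt", re.I)
# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "meta", "link", "input"}


class FontSizeExtractor(HTMLParser):
    """Collect (text, font_size) pairs from an OCR-generated HTML page."""

    def __init__(self):
        super().__init__()
        self.stack = []   # font size in effect for each open element
        self.lines = []   # extracted (text, font_size) pairs, in page order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = dict(attrs).get("style") or ""
        match = FONT_SIZE.search(style)
        inherited = self.stack[-1] if self.stack else None
        # An element without its own size inherits the enclosing one.
        self.stack.append(float(match.group(1)) if match else inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text:
            size = self.stack[-1] if self.stack else None
            self.lines.append((text, size))


def extract_lines(html):
    parser = FontSizeExtractor()
    parser.feed(html)
    return parser.lines
```

The resulting list of text runs with their font sizes is the kind of input that font-based heuristics can work on; positional information would need to come from the OCR tool as well.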

Heuristics for Verso of the Title Page • Identify date, edition, etc. • Check whether prenatal cataloging (cataloging-in-publication data) is available • Identify the bibliographic elements in the prenatal catalog entry • Cross-check them against the identifications from the title page • Resolve any conflicts
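The cross-check between the title page and the prenatal catalog entry on the verso can be illustrated with a small sketch. The slides do not say how conflicts are resolved, so the policy here (prefer the verso value and report the field for review) is purely an assumption, as are the function names and the sample values.

```python
def normalize(value):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def reconcile(title_page, verso):
    """Merge two dicts of bibliographic elements.

    Returns (merged, conflicts), where conflicts lists the fields on which
    the two sources disagree after normalization.
    """
    merged, conflicts = {}, []
    for field in sorted(set(title_page) | set(verso)):
        a, b = title_page.get(field), verso.get(field)
        if a and b and normalize(a) != normalize(b):
            conflicts.append(field)
            merged[field] = b        # assumed policy: trust the verso entry
        else:
            merged[field] = a or b   # agreement, or only one source has it
    return merged, conflicts


# Hypothetical values, for illustration only.
from_title_page = {"title": "Sample  Title", "publisher": "Example Press"}
from_verso = {"title": "sample title", "publisher": "Example Publishing",
              "date": "2005"}
merged, conflicts = reconcile(from_title_page, from_verso)
```

Returning the list of conflicting fields alongside the merged record keeps the resolution step visible, so a different policy (or manual review) can be substituted.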

Generating Bibliographic Records • Once the bibliographic elements are identified, generate bibliographic records in: • ISO 2709 • Dublin Core
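As an illustration of the record-generation step, the sketch below serializes identified elements as simple Dublin Core XML using Python's standard library. The `metadata` wrapper element, the function name, and the sample values are assumptions; the slides do not specify the output layout.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_dublin_core(elements):
    """Serialize a dict of Dublin Core element names to an XML string.

    Values may be a single string or a list of strings (e.g. several creators).
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")  # assumed wrapper element
    for name, value in elements.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = v
    return ET.tostring(root, encoding="unicode")


# Hypothetical record, for illustration only.
record = to_dublin_core({
    "title": "Sample Title",
    "creator": ["A. N. Author", "Second Author"],
    "publisher": "Example Press",
    "date": "2005",
})
```

Repeatable elements such as `creator` are emitted once per value, which matches how Dublin Core allows any element to repeat.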

Sample Heuristics for Identifying Title • Order of the bibliographic elements • Titles are found in the upper or upper-middle portion of the title page. • The title appears first on the title page (75.15 per cent); in a few cases the author or series occupies the first position. • The title is set in the largest font on the page (94.99 per cent), compared with the font sizes of the other fields.

If the title and subtitle occur on the same line, they are separated by “:” (colon) or “-” (hyphen). • A title need not contain only alphabetic characters; it may include numerals and punctuation marks such as commas and hyphens. • Titles often contain terms such as “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
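The title heuristics above can be combined into a simple scoring function. The following is a minimal sketch, not the authors' implementation: the weights are arbitrary assumptions (the slides give only the observed frequencies), and `TextLine`, `pick_title`, and `split_title` are hypothetical names.

```python
from dataclasses import dataclass
import re


@dataclass
class TextLine:
    text: str
    font_size: float  # point size reported for the line
    y_frac: float     # vertical position: 0.0 = top of page, 1.0 = bottom


def pick_title(lines):
    """Return the line most likely to be the title, or None for an empty page."""
    if not lines:
        return None
    max_font = max(line.font_size for line in lines)

    def score(indexed):
        idx, line = indexed
        s = 0.0
        if line.font_size == max_font:
            s += 3.0   # largest font: the strongest signal in the slides
        if line.y_frac <= 0.5:
            s += 1.5   # upper or upper-middle portion of the page
        if idx == 0:
            s += 1.0   # first element on the page
        return s

    _, best = max(enumerate(lines), key=score)
    return best


def split_title(text):
    """Split 'Title: Subtitle' or 'Title - Subtitle' into (title, subtitle)."""
    # A hyphen only counts as a separator when surrounded by spaces,
    # so hyphenated words such as "E-books" stay intact.
    parts = re.split(r"\s*:\s*|\s+-\s+", text, maxsplit=1)
    subtitle = parts[1].strip() if len(parts) > 1 else None
    return parts[0].strip(), subtitle


# Hypothetical title page, for illustration only.
page = [
    TextLine("A. N. Author", 14, 0.10),
    TextLine("Sample Title: An Introduction", 28, 0.25),
    TextLine("Example Press", 12, 0.90),
]
title_line = pick_title(page)
title, subtitle = split_title(title_line.text)
```

Weighting the font-size signal more heavily than position reflects the higher frequency reported for it, but the exact numbers would need tuning against real title pages.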

Heuristics for Other Elements • Subtitles • Edition • Volume • Authors/Contributors • Publisher • Place • Year • Series

Present Experiment • E-Books (from sites like amazon.com) • E-Journals (non-OAI-compliant)

Methodology • Template based Identification • Heuristic based Identification

Disadvantages of Template-Based Approach • A template must be created for every new site • A site may change its appearance, so more than one template may be needed for each site or journal

Methodology • Study a few sites to develop heuristics • Use a web crawler to probe the site • Identify the files containing documents (filter out irrelevant files) • Apply heuristics to the files containing e-documents • Generate Dublin Core records
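The filtering step in this methodology might look like the sketch below, which keeps only crawled URLs whose file extension suggests an e-document. The extension list and function names are assumptions; the slides do not say which criteria were used to separate documents from irrelevant files.

```python
from urllib.parse import urlparse
import posixpath

# Assumed set of extensions that may hold e-documents; adjust per site.
DOCUMENT_EXTENSIONS = {".pdf", ".ps", ".html", ".htm"}


def is_candidate(url):
    """True if the URL path ends in an extension that may hold a document."""
    path = urlparse(url).path
    ext = posixpath.splitext(path)[1].lower()
    return ext in DOCUMENT_EXTENSIONS


def filter_candidates(urls):
    """Keep crawled URLs worth passing on to the metadata heuristics."""
    return [u for u in urls if is_candidate(u)]


# Hypothetical crawl result, for illustration only.
crawled = [
    "http://example.org/journal/vol1/article1.pdf",
    "http://example.org/style/site.css",
    "http://example.org/images/logo.png",
    "http://example.org/journal/vol1/article2.HTML?view=full",
]
candidates = filter_candidates(crawled)
```

Parsing the URL first means query strings and fragments do not interfere with the extension check. An extension test alone is a coarse filter; content-based checks would be needed for pages served without file extensions.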

Welcome to the International Conference on Semantic Web & Digital Libraries, 21st – 23rd February, 2007, Indian Statistical Institute, Bangalore • Thank You
