Heuristic Approach for Automatic Metadata Capture of E-books/Journals • ARD Prasad • DRTC, Indian Statistical Institute, Bangalore
Agenda • Earlier Experiment with printed books • Present Experiment with E-Books & E-Journals
Heuristics for Printed Books • Heuristics for the ... • Title page • Verso of the title page
Methodology for Printed Books • Scan the title page • OCR the image • Generate the output in HTML • Apply Heuristics to HTML pages • Identify the bibliographic elements
Heuristics for Verso of the Title Page • Identify the date, edition, etc. • Check whether prenatal cataloguing (Cataloguing in Publication data) is available • Identify the bibliographic elements in the prenatal catalogue entry • Cross-check them against the identifications made from the title page • Resolve any conflicts between the two
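The verso heuristics above can be illustrated with a minimal Python sketch. The regular expressions, the choice of the latest year as the publication date, and the function names are assumptions made for illustration; the slides do not give the actual rules or the conflict-resolution policy, so the sketch only reports conflicts instead of resolving them.

```python
import re

# Hypothetical patterns: four-digit years from 1500 to 2099, and edition
# statements such as "Second edition" or "2nd ed.".
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
EDITION_RE = re.compile(
    r"\b(\d+(?:st|nd|rd|th)|first|second|third|fourth|fifth|revised)"
    r"\s+ed(?:ition)?\b",
    re.IGNORECASE,
)


def extract_verso_fields(text):
    """Guess the year and edition from OCR text of the verso."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    edition = EDITION_RE.search(text)
    return {
        # Assumption: the latest year printed on the verso is the date.
        "year": max(years) if years else None,
        "edition": edition.group(1).lower() if edition else None,
    }


def cross_check(title_page, verso):
    """Return the fields on which the two sources disagree."""
    conflicts = {}
    for field, verso_value in verso.items():
        title_value = title_page.get(field)
        if title_value is not None and verso_value is not None \
                and title_value != verso_value:
            conflicts[field] = (title_value, verso_value)
    return conflicts


verso = extract_verso_fields("Second edition\nCopyright 1998, 2004 by Example Press")
conflicts = cross_check({"year": 1998, "edition": "second"}, verso)
```

Here `conflicts` lists only the year, since both sources agree on the edition; how such a conflict is settled is left to whatever policy the cataloguer adopts.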
Generating Bibliographic Records • Once the bibliographic elements are identified, generate bibliographic records in: • ISO 2709 • Dublin Core
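As an illustration of the Dublin Core output, the following Python sketch serializes identified elements as XML in the Dublin Core 1.1 element namespace. The `metadata` wrapper element, the function name and the sample values are assumptions; the slides do not show the record layout actually produced.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core Metadata Element Set, version 1.1.
DC_NS = "http://purl.org/dc/elements/1.1/"


def to_dublin_core(elements):
    """Serialize a dict of Dublin Core elements to an XML string.

    A value may be a single string or a list of strings (repeatable element).
    The <metadata> wrapper is a hypothetical container, not a standard one.
    """
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("metadata")
    for name, value in elements.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            child.text = str(item)
    return ET.tostring(root, encoding="unicode")


record = to_dublin_core({
    "title": "Introduction to Graph Theory",
    "creator": ["A. Author", "B. Author"],
    "publisher": "Example Press",
    "date": "2004",
})
```

Repeatable elements such as `creator` are emitted once per value, which matches how Dublin Core handles multiple authors.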
Sample Heuristics for Identifying Title • Order of the bibliographic elements • Titles are found in the upper or upper-middle portion of the title page. • The title appears first on the title page (75.15 per cent); in a few cases the author or series occupies the first position. • The fonts used in the title field are the largest on the page (94.99 per cent) compared with the fonts used in other fields.
• If the title and sub-title occur on the same line, they are separated by “:” (colon) or “-” (hyphen). • A title need not consist only of alphabetic characters; the title string may contain numerals and punctuation marks such as commas and hyphens. • Titles usually contain terms such as “The”, “An”, “Introduction”, “Theory”, “in”, “to”.
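A minimal Python sketch of the title heuristics above: take the largest-font line in the upper part of the page as the title, then split off a sub-title at a colon or a spaced hyphen. The `Line` structure, the `upper_limit` threshold of 0.6 and the sample page are hypothetical; the slides do not describe how the OCR output in HTML is represented.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    font_size: float  # font size reported for the line
    y: float          # vertical position: 0.0 = top of page, 1.0 = bottom


def guess_title(lines, upper_limit=0.6):
    """Return (title, subtitle) guessed from the lines of a title page."""
    non_empty = [ln for ln in lines if ln.text.strip()]
    # Prefer lines in the upper or upper-middle portion of the page.
    candidates = [ln for ln in non_empty if ln.y <= upper_limit] or non_empty
    if not candidates:
        return None, None
    # Largest font wins; ties go to the line nearest the top.
    best = max(candidates, key=lambda ln: (ln.font_size, -ln.y))
    # Split at the first colon or spaced hyphen; "E-Books" stays intact.
    parts = re.split(r"\s*:\s*|\s+-\s+", best.text.strip(), maxsplit=1)
    subtitle = parts[1] if len(parts) > 1 else None
    return parts[0], subtitle


page = [
    Line("Example Series in Mathematics", 10, 0.05),
    Line("Graph Theory: An Introduction", 28, 0.25),
    Line("A. Author", 14, 0.50),
    Line("Example Press", 12, 0.90),
]
title, subtitle = guess_title(page)
```

Because a series statement can precede the title, the sketch ranks by font size before position, in line with the percentages quoted above.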
Heuristics for other elements • Sub titles • Edition • Volume • Authors/ Contributor • Publisher • Place • Year • Series
Present Experiment • E-Books (from sites like amazon.com) • E-Journals (non-OAI-compliant)
Methodology • Template based Identification • Heuristic based Identification
Disadvantages of Template Based Approach • Templates must be created for every new site • A site may change its appearance, so more than one template may have to be developed for each site or journal
Methodology • Study a few sites to develop heuristics • Use a web crawler to probe the site • Identify the files containing documents (filter out irrelevant files) • Apply heuristics to the files containing e-documents • Generate Dublin Core records
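The file-filtering step of the methodology above might look like the following Python sketch, which collects the links on a crawled page and keeps only same-site URLs that look like documents. The extension list, the same-host rule and the names `LinkCollector` and `candidate_documents` are assumptions for illustration; the slides do not specify how irrelevant files are recognized, and fetching the pages is left out.

```python
import posixpath
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical set of extensions treated as e-documents.
DOC_EXTENSIONS = {".pdf", ".ps", ".htm", ".html"}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def candidate_documents(page_html, base_url):
    """Return absolute same-site URLs whose extension suggests a document."""
    collector = LinkCollector()
    collector.feed(page_html)
    base_host = urlparse(base_url).netloc
    found = []
    for href in collector.links:
        url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(url)
        extension = posixpath.splitext(parsed.path)[1].lower()
        if (parsed.netloc == base_host
                and extension in DOC_EXTENSIONS
                and url not in found):
            found.append(url)
    return found


sample_html = (
    '<a href="/vol1/paper1.pdf">Paper 1</a>'
    '<a href="logo.png">Logo</a>'
    '<a href="http://other.example.org/x.pdf">External</a>'
    '<a href="mailto:editor@example.org">Contact</a>'
)
docs = candidate_documents(sample_html, "http://journal.example.org/issues/index.html")
```

The surviving URLs would then be passed to the heuristics, and the identified elements written out as Dublin Core records.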
Welcome to International Conference on Semantic Web & Digital Libraries 21st – 23rd February, 2007 Indian Statistical Institute Bangalore Thank You