Metadata and identifiers for e-journals Copenhagen 13.-14.3.2000 Juha Hakala Helsinki University Library juha.hakala@helsinki.fi
Contents • Introduction • Traditional cataloguing • Full-text indexing • Embedded metadata + Dublin Core • DIEPER choices • Identification of e-journals
Introduction • Metadata = structured description of resource • Structure of metadata is defined in a format • simple formats (AltaVista) • complex formats (MARC) • structured formats (Dublin Core) • Choices have important cost and quality implications (good is not free)
Traditional cataloguing • Routinely done for journals (ISSN DB) • Articles indexed only selectively • Finnish article index Arto: 1,100 journals; 65,000 articles and 10 man-years annually; 40 libraries co-operate in production • Extending MARC cataloguing to all digitised articles is too expensive • Are there any selection criteria for “good material”?
Full-text indexing • Will not replace cataloguing... • In large databases precision is still poor • ...but we should follow what is happening • RDBMSs are becoming document-literate (Oracle Intermedia) • new search techniques (e.g. fuzzy searching) • efficient use of language technologies • knowledge management
Embedded metadata (1) • Three issues to solve: • semantics: in which metadata format should my metadata be? • syntax: is it possible / feasible to embed metadata into this document (does the document format allow inclusion of metadata)? • tools: once the first two issues have been solved, are there tools for creating / harvesting / indexing my metadata?
Embedded metadata - syntax • It must be possible to include metadata in non-compromised form & specify each data element separately • e.g. “This is a Dublin Core identifier element, and there is an ISBN in it” • Most document formats do not allow efficient metadata usage • “flat files”, image formats, Word97
Embedded metadata - syntax (2) • HTML 4.0 • META tag enables sophisticated metadata • Explicit specification for how to embed Dublin Core-based metadata (RFC 2731) • XML/RDF • “Resource Description Framework makes data machine understandable” • very versatile, but may be tough to implement
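The HTML 4.0 option can be illustrated with a minimal sketch that renders a few Dublin Core elements as LINK/META tags in the general style RFC 2731 describes. The function name `dc_meta_tags`, the sample record, and the choice of the DC 1.1 namespace URI for the `schema.DC` link are assumptions made for illustration; the RFC's own examples may use a different schema URI.

```python
from html import escape

# Assumption: the DC 1.1 element-set namespace is used for the schema link.
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def dc_meta_tags(record):
    """Render (element, value) pairs as HTML 4.0 LINK/META tags."""
    lines = [f'<link rel="schema.DC" href="{DC_SCHEMA}">']
    for element, value in record:
        # Escape the value so quotes and ampersands cannot break the attribute.
        content = escape(value, quote=True)
        lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)


# Illustrative record only; the values are not taken from any real catalogue.
record = [
    ("Title", "Metadata and identifiers for e-journals"),
    ("Creator", "Hakala, Juha"),
    ("Language", "en"),
]
print(dc_meta_tags(record))
```

Each element stays separately addressable (`DC.Title`, `DC.Creator`, ...), which is the property the syntax slides ask for.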
Embedded metadata - semantics • Metadata formats tend to be domain specific, complex and hard to learn • Dublin Core as an alternative: • simple (in its basic form) • generic (no domain dependency) • extensible (local elements possible) • Is there any competition left?
Status of Dublin Core Initiative • maintenance in reliable hands • 15 elements stable (DC 1.1) • syntax for HTML 4.0 stable • core qualifiers under development • proposals published in December 1999 • agreement in DC-AC in March 2000 • will result in 50-60 qualifiers
Tools for Dublin Core • Metadata support in Web indexes becoming more popular • Metadata creation emerging in document management systems • Text editors: XML support in place, RDF yet to come
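What a harvesting tool does with embedded metadata can be sketched in a few lines using Python's standard `html.parser`. This is only an illustration of the idea, not a description of any particular Web index; the class name `DCHarvester` and the convention of matching META names that start with `DC.` are assumptions.

```python
from html.parser import HTMLParser


class DCHarvester(HTMLParser):
    """Collect (element, value) pairs from META tags whose name starts with 'DC.'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        # Keep only Dublin Core elements; ignore keywords, description, etc.
        if name.lower().startswith("dc.") and content is not None:
            self.elements.append((name[3:], content))


def harvest(html_text):
    """Return the Dublin Core elements embedded in an HTML page."""
    parser = DCHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.elements


page = (
    '<html><head>'
    '<meta name="DC.Title" content="E-journals">'
    '<meta name="keywords" content="not Dublin Core">'
    '<meta name="DC.Language" content="en">'
    '</head><body></body></html>'
)
print(harvest(page))
```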
DIEPER choices • Document format will be XML/RDF • extensible and open document format that will become very popular in the future • Metadata format will be based on DC • DC tags: Identifier, Title, Creator, Contributor, Publisher, Language, Subject • Local tags: e.g. SerialsNumbering, PlaceOfPublication, SizeSourcePrint
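A record combining DC tags with local tags might be serialised roughly as follows. This is a sketch under stated assumptions, not the DIEPER schema: the namespace `http://example.org/dieper/terms#` and the prefix `dieper` are hypothetical placeholders, the DC element names are lower-cased as is usual in RDF/XML, and the sample values are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
LOCAL = "http://example.org/dieper/terms#"  # hypothetical namespace for local tags

ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)
ET.register_namespace("dieper", LOCAL)


def build_record(about, dc_fields, local_fields):
    """Serialise DC and local elements as one rdf:Description."""
    root = ET.Element(f"{{{RDF}}}RDF")
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": about})
    for name, value in dc_fields:
        ET.SubElement(desc, f"{{{DC}}}{name.lower()}").text = value
    for name, value in local_fields:
        ET.SubElement(desc, f"{{{LOCAL}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")


xml_text = build_record(
    "urn:issn:0002-8231",  # illustrative identifier
    [("Title", "An example journal"), ("Language", "en")],
    [("PlaceOfPublication", "Helsinki")],
)
print(xml_text)
```

Keeping the local tags in their own namespace is one way to get the extensibility the slide mentions without colliding with the 15 DC elements.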
Identifiers for e-journals • Two different issues: • how to identify journals themselves • how to identify articles and possibly sections of articles (table of contents etc.) • Do we need a resolution mechanism (based on DOI or URN)?
E-journals • ISSN must be used, also for digitised journals • digitised version may have the same ISSN as the original paper version • ISSN should not be embedded in issues / articles, since this inflates recall too much (an ISSN search would then return every issue and article) • Broadened scope: serials + integrating resources
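Since the ISSN is the anchor for everything that follows, it helps that an ISSN can be checked mechanically. The sketch below applies the standard ISSN check-digit rule (weights 8 down to 2 on the first seven digits, modulus 11, with `X` standing for 10); the function names are illustrative.

```python
def issn_check_digit(first_seven):
    """Compute the ISSN check character from the first seven digits."""
    total = sum(int(d) * w for d, w in zip(first_seven, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def is_valid_issn(issn):
    """Return True if the string is a well-formed ISSN with a correct check digit."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8:
        return False
    digits = compact[:7]
    if not (digits.isascii() and digits.isdigit()):
        return False
    return issn_check_digit(digits) == compact[7]


# The ISSN that appears inside the SICI example elsewhere in these slides.
print(is_valid_issn("0002-8231"))
```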
Issues & articles • SICI (Serial Item and Contribution Identifier) should be used • ANSI/NISO standard (1996) • http://sunsite.berkeley.edu/SICI/ • Not widely supported yet; e-commerce is likely to change this • need to identify whatever can be sold • SICI generator available
Properties of SICI • Extensible: can identify issue/article/section within article • Can be created automatically (from structured source document) • Complex • 0002-8231(1929)30:1<ZBDMSU>2.0.CO;2-Z • Can be used as URN or DOI
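The structure behind the example string can be made visible with a small parser. This is a simplified sketch: the regular expression and the field names (item segment = ISSN, chronology, enumeration; contribution segment in angle brackets; control segment at the end) reflect one reading of the ANSI/NISO standard and should be checked against it, and the check character is only matched, not verified.

```python
import re

SICI_RE = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])"      # item segment: ISSN
    r"\((?P<chronology>[^)]*)\)"         # item segment: date of issue
    r"(?P<enumeration>[^<]*)"            # item segment: volume:issue
    r"<(?P<contribution>[^>]*)>"         # contribution segment
    r"(?P<csi>\d)\.(?P<dpi>\d)\.(?P<mfi>[A-Z]{2});"  # control segment codes
    r"(?P<version>\d)-(?P<check>[0-9A-Z#])$"         # version and check character
)


def parse_sici(sici):
    """Split a SICI string into named parts; raise ValueError if it does not match."""
    match = SICI_RE.match(sici)
    if match is None:
        raise ValueError(f"not a recognisable SICI: {sici!r}")
    return match.groupdict()


parts = parse_sici("0002-8231(1929)30:1<ZBDMSU>2.0.CO;2-Z")
print(parts["issn"], parts["chronology"], parts["enumeration"], parts["contribution"])
```

Because every part is delimited, a SICI of this kind can be assembled from a structured source document, which is what makes automatic creation feasible.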
URN & DOI • Umbrella systems that provide e.g. persistent linkage between a reference and the resource via a resolution service • DOI is a publisher-driven initiative, URN comes from the Internet community • DOIs can be used as URNs, not vice versa
Digital object identifier • Consists of a prefix and a suffix, separated by a slash • 10.1045/february2000-risher • Suffix may be anything; it gives no hint of its content • Prefix identifies the publisher + indicates where to find a resolution service
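Splitting a DOI into its two parts is straightforward, as this minimal sketch shows. The function name `split_doi` is illustrative; the split is made at the first slash only, on the assumption that the suffix may itself contain slashes.

```python
def split_doi(doi):
    """Return (prefix, suffix), splitting at the first slash only."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"expected 'prefix/suffix', got {doi!r}")
    return prefix, suffix


prefix, suffix = split_doi("10.1045/february2000-risher")
print(prefix)  # the part that identifies the publisher
print(suffix)  # opaque string chosen by the publisher
```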
Uniform resource name • Consists of three parts: • string urn: • Namespace identifier (NID) • Namespace specific string (NSS) • When NID is known, creating URNs from existing identifiers is trivially easy • No hint on where to find a resolution service
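How easy the conversion is can be shown with a short sketch. It follows the general `urn:<NID>:<NSS>` pattern and percent-encodes characters outside the set that the URN syntax specification (RFC 2141) allows in the NSS. The NIDs `issn` and `sici` used here are assumptions for illustration, not a statement about which namespaces are registered.

```python
import string

# Characters that may appear unencoded in the NSS (per a reading of RFC 2141).
NSS_SAFE = set(string.ascii_letters + string.digits + "()+,-.:=@;$_!*'")


def make_urn(nid, identifier):
    """Wrap an existing identifier as urn:<nid>:<nss>, percent-encoding as needed."""
    encoded = "".join(
        ch if ch in NSS_SAFE
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in identifier
    )
    return f"urn:{nid}:{encoded}"


print(make_urn("issn", "0002-8231"))
print(make_urn("sici", "0002-8231(1929)30:1<ZBDMSU>2.0.CO;2-Z"))
```

An ISSN passes through unchanged, while the angle brackets of a SICI have to be encoded.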
Business models • DOI: annual payment for each DOI assigned • no decision yet on the size of the payment • flat fee for publisher ID • URN: no price at all • but someone has to pay for the resolution services
DIEPER policy • URNs will be used, in order to enable URN-based resolution services • ISSN/SICI will be used • ISSN International Centre will assist in creation of URN resolution services • ISSN database will be contacted first, in order to get the address of the resolution service
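The two-step lookup in the last bullet can be sketched as a toy model. Everything here is hypothetical: the dictionary `ISSN_REGISTRY` merely stands in for the ISSN database, the resolver address is a placeholder, and the function relies on the assumption that both an ISSN and a SICI begin with the nine-character ISSN.

```python
# Hypothetical stand-in for the ISSN database: ISSN -> resolution service address.
ISSN_REGISTRY = {
    "0002-8231": "https://resolver.example.org/",
}


def find_resolution_service(urn, registry=ISSN_REGISTRY):
    """Step 1 of resolution: ask the ISSN registry which service handles this URN."""
    parts = urn.split(":", 2)
    if len(parts) != 3 or parts[0].lower() != "urn":
        raise ValueError(f"not a URN: {urn!r}")
    nss = parts[2]
    issn = nss[:9]  # ISSN-based and SICI-based URNs both start with the ISSN
    if issn not in registry:
        raise LookupError(f"no resolution service known for ISSN {issn}")
    return registry[issn]


print(find_resolution_service("urn:issn:0002-8231"))
print(find_resolution_service("urn:sici:0002-8231(1929)30:1%3CZBDMSU%3E2.0.CO;2-Z"))
```

Step 2, sending the full URN to the returned address, is left out because the slides do not describe that protocol.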