Metadata. Andy Powell Technical Development and Research UKOLN University of Bath http://www.ukoln.ac.uk/ a.powell@ukoln.ac.uk. Metadata. What is metadata? an introduction The Dublin Core metadata for the Web Metadata management Models for dealing with Web-site metadata
Metadata Andy Powell Technical Development and Research UKOLN University of Bath http://www.ukoln.ac.uk/ a.powell@ukoln.ac.uk
Metadata • What is metadata? • an introduction • The Dublin Core • metadata for the Web • Metadata management • Models for dealing with Web-site metadata • UKOLN metadata projects • overviews (and problems)
What is metadata? • by definition: ..data about data.. ..data which provides information about a resource.. • by example: • title, author, subject classification, shelf mark • digital format, terms and conditions, location (URL)
What is metadata? (2) • by usage: • Resource discovery • Searching, location • Authentication • Quality/rating • Semantic interoperability • Resource management • User interface • Grouping resources for printing • 3-D visualisations
Range of formats • Alta Vista, NetFirst, Lycos, Dublin Core, IAFA, SOIF, MARC, TEI headers, CIMI [Diagram: formats placed along two axes, Simple to Rich and robot generated to hand crafted]
Where is metadata? • Embedded within resource • HTML <META> tags • Linked to resource • Remote database • distributed • union (centralised)
Who creates metadata? • Publisher side • author • webmaster • institution • Service side • search service • third party creators • robot generated or hand crafted
Dublin Core • 15 element core metadata set • Primarily intended to aid resource discovery on the Web • Main usage currently embedded into HTML META tags • All elements optional and repeatable • Status? • Agreed syntax for embedding in HTML • Still discussion about the use of some of the elements http://www.ukoln.ac.uk/metadata/resources/dc.html
Dublin Core History • 4 DC meetings • Dublin, Warwick, Dublin, Canberra • (DC-5 - Helsinki coming soon) • Mailing list discussions • meta2@lut.ac.uk • W3C interest • RDF (PICS-NG), MCF • Various projects • Still no significant interest yet from the big search engines :-(
DC Elements - 1 • Title • Subject • intended to promote use of controlled vocabularies but in practice likely to be used for uncontrolled list of keywords • Description • abstract • Creator • Publisher
DC Elements - 2 • Contributor • Date • the date ‘the resource was made available in its present form’. Agreed default format uses subset of ISO 8601, e.g. 1997-09-15 • Type • category of resource - document, image, sound, home page, novel, poem, etc. Still much discussion about the content of this element • Format • MIME type • Identifier
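The date format mentioned above (a subset of ISO 8601, e.g. 1997-09-15) can be produced and checked with a few lines of Python. This is a minimal sketch; the function names `dc_date` and `is_dc_date` are illustrative and not part of any Dublin Core tool.

```python
from datetime import date


def dc_date(d: date) -> str:
    """Format a date as YYYY-MM-DD, the ISO 8601 subset shown on the slide."""
    return d.isoformat()


def is_dc_date(text: str) -> bool:
    """Return True only for a real calendar date written exactly as YYYY-MM-DD."""
    try:
        # Round-tripping rejects other ISO 8601 spellings that newer
        # Python versions accept, such as 19970915.
        return date.fromisoformat(text).isoformat() == text
    except ValueError:
        return False


print(dc_date(date(1997, 9, 15)))
print(is_dc_date("15/09/1997"))
```

Checking the value before it is embedded avoids ambiguous forms such as 15/09/97 ending up in a DC.date tag.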
DC Elements - 3 • Source • Language • language of the resource - NOT the metadata • Relation • no guidelines for usage currently • Coverage • separate working party looking at usage • Rights • rights management seen as too complex for DC. This will give a URL to some external information
Simple Example
<HTML><HEAD>
<TITLE>UKOLN Home Page</TITLE>
<META NAME="DC.title" CONTENT="UKOLN: UK Office for Library and Information Networking">
<META NAME="DC.subject" CONTENT="national centre, network information support, library community, awareness, research, information services, public library networking, bibliographic management, distributed library systems, metadata, resource discovery, conferences, lectures, workshops">
<META NAME="DC.description" CONTENT="UKOLN is a national centre for support in network information management in the library and information communities. It provides awareness, research and information services">
<META NAME="DC.creator" CONTENT="Stark, Isobel">
</HEAD> ...
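Tags like those in the example can be generated rather than typed by hand, which avoids quoting mistakes. The sketch below assumes the metadata is held as a list of (element, value) pairs, so that elements stay optional and repeatable; `dc_meta_tags` and `DC_ELEMENTS` are hypothetical names, not part of an existing tool.

```python
from html import escape

# The 15 Dublin Core elements listed on the preceding slides.
DC_ELEMENTS = {
    "title", "subject", "description", "creator", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}


def dc_meta_tags(pairs):
    """Render (element, value) pairs as HTML META tags, one per line."""
    lines = []
    for element, value in pairs:
        name = element.lower()
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        # escape() protects quotes and angle brackets inside the value.
        lines.append(f'<META NAME="DC.{name}" CONTENT="{escape(value)}">')
    return "\n".join(lines)


print(dc_meta_tags([
    ("title", "UKOLN: UK Office for Library and Information Networking"),
    ("creator", "Stark, Isobel"),
]))
```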
Element qualifiers • Need to refine meaning in some cases • TYPE Refines meaning of element - sub-divides element namespace • SCHEME Element value taken from external schema, e.g. LCSH for DC.subject, Z39.53 for DC.language • LANGUAGE Language of element value (not of the resource being described!)
Examples - TYPE
• Original DC.creator tag
<META NAME="DC.creator" CONTENT="Stark, Isobel">
• Non-personal author
<META NAME="DC.creator.corporate" CONTENT="UKOLN Information Services Group">
• Author’s email address
<META NAME="DC.creator.email" CONTENT="isg@ukoln.ac.uk">
Examples - SCHEME
• Library of Congress Subject Heading
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Library information networks -- Great Britain">
<META NAME="DC.subject" CONTENT="(SCHEME=LCSH) Information technology -- higher education">
…or…
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Library information networks -- Great Britain">
<META NAME="DC.subject" SCHEME="LCSH" CONTENT="Information technology -- higher education">
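A consumer of these tags has to cope with the TYPE qualifier in the NAME attribute and with both SCHEME syntaxes. The following sketch shows one way to normalise them; `parse_dc` and the shape of the returned dictionary are assumptions made for illustration, and a scheme given in the CONTENT prefix is arbitrarily allowed to override a SCHEME attribute.

```python
import re

# Matches the "(SCHEME=xxx) value" form embedded in CONTENT.
_SCHEME_PREFIX = re.compile(r"^\(SCHEME=([^)]+)\)\s*(.*)$", re.DOTALL)


def parse_dc(name, content, scheme=None):
    """Split a DC META tag into element, TYPE qualifier, SCHEME and value."""
    parts = name.split(".")
    if len(parts) < 2 or parts[0].upper() != "DC":
        raise ValueError(f"not a Dublin Core name: {name}")
    element = parts[1].lower()
    # Anything after the element name is treated as the TYPE qualifier.
    qualifier = ".".join(parts[2:]) or None
    match = _SCHEME_PREFIX.match(content)
    if match:
        scheme, content = match.group(1), match.group(2)
    return {"element": element, "type": qualifier,
            "scheme": scheme, "value": content}


print(parse_dc("DC.subject",
               "(SCHEME=LCSH) Library information networks -- Great Britain"))
print(parse_dc("DC.creator.email", "isg@ukoln.ac.uk"))
```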
Metadata Management Practical issues of using Dublin Core for Internet resource description... • UKOLN metadata system • Requirements • 3 models for metadata management • Implementation at UKOLN
UKOLN metadata system requirements • Easy to use • Work with a variety of methods of creating HTML • Simple migration to future metadata formats • Separate metadata from resource
Managing Dublin Core (1) HTML authoring tool • Embed by hand using HTML or text editor • Pros… • Simple • May be useful for training and familiarisation • Cons… • May not be possible with all editors • Maintenance problems • Easy to make errors
DC-dot • A Web based tool for creating Dublin Core <meta> tags • Automatic generation of some tags based on content of the resource • Forms based editing of tags • Cut-and-paste output into HTML • Conversion to other formats… • SOIF, ROADS/WHOIS++, USMARC, GILS... http://www.ukoln.ac.uk/metadata/dcdot/
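As an illustration of the kind of format conversion listed above, the sketch below writes a set of attribute/value pairs as a SOIF record, where each attribute carries the byte length of its value in braces. It is not DC-dot's own code; `to_soif`, the attribute names and the example.org URL are placeholders.

```python
def to_soif(url, pairs, template="FILE"):
    """Serialise (attribute, value) pairs as one SOIF object."""
    lines = [f"@{template} {{ {url}"]
    for attribute, value in pairs:
        # SOIF records the size of each value in bytes, not characters.
        size = len(value.encode("utf-8"))
        lines.append(f"{attribute}{{{size}}}:\t{value}")
    lines.append("}")
    return "\n".join(lines) + "\n"


print(to_soif("http://www.example.org/intro.html", [
    ("keywords", "xxx, yyy, zzz"),
    ("author", "Stark, Isobel"),
]))
```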
Managing Dublin Core (2) Web-site management tool • Use Web-site management tool, for example NetObjects Fusion • Pros… • Use of Web-site management tools likely to increase • Object-oriented database approach • Cons… • Proprietary formats • Early days - too early to evaluate use for metadata yet?
Managing Dublin Core (3) On the fly generation • Hold Dublin Core separately and embed on-the-fly using server-side include (SSI) • Pros… • Separates metadata from resource • Future migration fairly simple • Cons… • Performance • Lack of integration with HTML tools • Server specific
UKOLN metadata system (1) • Embed on-the-fly • Apache SSI script • Store metadata using SOIF records • Use MS-Access as tool to create the records • Associate metadata with resource by co-locating them in the Web server filestore
UKOLN metadata system (2)
• intro.html (maintained with an HTML editor), using the Apache syntax for calling a server-side script, <!--#exec cmd="getmeta" -->
<html> <head> <title>…</title> <!--#exec cmd="getmeta" --> </head> ...
• intro.html.soif (created from the MS-Access database)
@FILE { http://www.ukoln.ac. ... keywords{13}: xxx, yyy, zzz description{14}: blah blah b author{13}: Stark, Isobel ... }
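What a script such as getmeta has to do can be sketched as follows: read the SOIF record that sits next to the page and print the corresponding META tags. This is not the UKOLN script. The function names, the attribute-to-element mapping in `SOIF_TO_DC` and the example.org URL are assumptions, and the parser handles only a single well-formed SOIF object.

```python
import re
from html import escape

_HEADER = re.compile(rb"\s*@(\w+)\s*\{\s*(\S+)[ \t]*\r?\n")
_ATTR = re.compile(rb"\s*([A-Za-z][\w-]*)\{(\d+)\}:[ \t]")

# Hypothetical mapping from SOIF attribute names to Dublin Core names.
SOIF_TO_DC = {
    "keywords": "DC.subject",
    "description": "DC.description",
    "author": "DC.creator",
}


def parse_soif(raw: bytes):
    """Return (template, url, [(attribute, value), ...]) for one SOIF object."""
    header = _HEADER.match(raw)
    if header is None:
        raise ValueError("not a SOIF object")
    pos = header.end()
    pairs = []
    while True:
        attr = _ATTR.match(raw, pos)
        if attr is None:  # reached the closing brace
            break
        size = int(attr.group(2))
        start = attr.end()
        # The declared byte count says how much of the input is the value.
        value = raw[start:start + size].decode("utf-8")
        pairs.append((attr.group(1).decode("ascii"), value))
        pos = start + size
    return header.group(1).decode("ascii"), header.group(2).decode("utf-8"), pairs


def render_meta(pairs, mapping=SOIF_TO_DC):
    """Turn parsed SOIF pairs into META tags, skipping unmapped attributes."""
    return "\n".join(
        f'<META NAME="{mapping[a]}" CONTENT="{escape(v)}">'
        for a, v in pairs if a in mapping
    )


record = (b"@FILE { http://www.example.org/intro.html\n"
          b"keywords{13}:\txxx, yyy, zzz\n"
          b"author{13}:\tStark, Isobel\n}\n")
print(render_meta(parse_soif(record)[2]))
```

Keeping the mapping in one table is what makes later migration to another metadata format a matter of changing the script rather than every page.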
UKOLN metadata system (3) • MS-Access front end... • Filename browser • Text boxes • Name choosers • UKOLN specific metadata
UKOLN metadata system (4) [Diagram, numbered steps 1 to 6, linking: a Web robot, the UKOLN Web server, intro.html containing <!--#exec cmd="getmeta" -->, the SSI script, and intro.html.soif]
Issues • Performance • Interaction with Web caches • Dublin Core vs Alta Vista style metadata
<META NAME="Description" CONTENT="blah, blah">
<META NAME="Keywords" CONTENT="xxx, yyy, zzz">
• Granularity • Which pages should have metadata?
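One way round the Dublin Core vs Alta Vista question is to derive the Alta Vista style tags from the same stored record. The sketch below assumes the record is a list of (element, value) pairs; `altavista_tags` is a hypothetical helper, and using the first description and joining all subjects is just one possible policy.

```python
from html import escape


def altavista_tags(dc_pairs):
    """Derive Description and Keywords META tags from Dublin Core pairs."""
    descriptions = [v for e, v in dc_pairs if e == "description"]
    subjects = [v for e, v in dc_pairs if e == "subject"]
    tags = []
    if descriptions:
        # Only the first description is used.
        tags.append(f'<META NAME="Description" CONTENT="{escape(descriptions[0])}">')
    if subjects:
        # Repeated subject elements are merged into one keyword list.
        tags.append(f'<META NAME="Keywords" CONTENT="{escape(", ".join(subjects))}">')
    return tags


for tag in altavista_tags([("description", "blah, blah"),
                           ("subject", "xxx"), ("subject", "yyy")]):
    print(tag)
```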
What's the point... …of embedding DC <meta> tags? • Alta Vista isn't going to look for them • But, worth doing... • within individual projects • within specific communities (e.g. eLib) • Improve local search facilities • e.g. load SOIF records into a Netscape Catalogue Server • Web-site management benefits
UKOLN Metadata projects • ROADS • Software for Subject Service • DESIRE • European Web indexing • NewsAgent • Current awareness service for Library and Information Staff • BIBLINK • Information flow from publishers to National Bibliographic Agencies
ROADS • Resource Organisation and Discovery in Subject-based Services • Web based tools for Subject Services • SOSIG, ADAM, OMNI, … • Manage and search Internet resource descriptions • ROADS templates (based on IAFA templates) • WHOIS++ http://www.ukoln.ac.uk/roads/
ROADS - WHOIS++ (1) • Simple client-server search and retrieve protocol • Developed originally for ‘white pages’ applications • Offer search facilities across several Subject Services • Distribute a Subject Service across several physical servers • Query routing - centroids and CIP
ROADS - WHOIS++ (2) • Centroid generated by ADAM contains… “you’ll find the string ‘mona’ in the ‘title’ attribute of at least one record in the ADAM database”. [Diagram, numbered steps 1 to 6, linking: Web browser, CGI-based WHOIS++ client, SOSIG, OMNI and ADAM, with CIP sharing of centroids]
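The idea behind a centroid can be sketched in a few lines: for each attribute, keep the set of words that occur in at least one record, then use those sets to decide which servers are worth querying. This is a simplified illustration, not the WHOIS++ or CIP wire format; the function names and sample records are invented.

```python
import re


def build_centroid(records):
    """Map each attribute to the set of words found in any record."""
    centroid = {}
    for record in records:
        for attribute, value in record.items():
            words = re.findall(r"\w+", value.lower())
            centroid.setdefault(attribute, set()).update(words)
    return centroid


def route_query(centroids, attribute, word):
    """Return the names of servers whose centroid contains the word."""
    word = word.lower()
    return sorted(
        name for name, centroid in centroids.items()
        if word in centroid.get(attribute, set())
    )


centroids = {
    "ADAM": build_centroid([{"title": "Mona Lisa"}]),
    "SOSIG": build_centroid([{"title": "Social science gateway"}]),
}
print(route_query(centroids, "title", "mona"))
```

A centroid only says that a match may exist on a server; the query still has to be sent there to retrieve the records.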
DESIRE European Web cataloguing • Subject Services • EuroSOSIG (Bristol), EELS (Lund), Arts (Koninklijke Bibliotheek) • Manually created ROADS templates • European Web Index • based on Nordic Web Index (NWI) • Robot generated, all resources • Multiple servers linked with Z39.50 • GILS http://www.nic.surfnet.nl/surfnet/projects/desire/desire.html
DESIRE - current work (1) • Internationalisation of ROADS • Use of robots to: • aid manual cataloguing of resources • build indexes based on list of URLs in a ROADS database • Robot will use embedded Dublin Core if available
DESIRE - current work (2) • Re-design of EWI robot - including: • support for Dublin Core • EWI records GILS-II compatible • Allow users to search across subject services and the EWI using Z39.50 • by converting ROADS records into GILS records • by building a WHOIS++ to Z39.50 gateway http://roads.ukoln.ac.uk/cgi-bin/egwcgi/egwirtcl/targets.egw
NewsAgent Current awareness service for LIS... • Distributed database • servers at LITC, FD, UKOLN - Z39.50 • metadata (and some full-text) • based on DALI • Mixture of content streams • Variety of access methods • Web, e-mail and Z39.50 clients • user-configurable profiles http://www.ukoln.ac.uk/metadata/NewsAgent/
NewsAgent - Content • Journals • Program, VINE, Journal of Librarianship and Information Science • News and briefing material • LA, IIS, UKOLN (Ariadne), BL, LITC • Web pages • E-mail lists and USENET news
NewsAgent - Harvesting • Web crawler • looking for embedded Dublin Core • Limiting the harvest • simple heuristics • use of Dublin Core Relation element • E-mail parser http://www.ukoln.ac.uk/metadata/NewsAgent/dcusage.html
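The "looking for embedded Dublin Core" step can be sketched with Python's standard HTML parser. This is not the NewsAgent crawler; `DCExtractor` and `extract_dc` are illustrative names, and fetching pages or following Relation links is left out.

```python
from html.parser import HTMLParser


class DCExtractor(HTMLParser):
    """Collect (name, content) pairs from META tags whose name starts with DC."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            self.pairs.append((name, content))


def extract_dc(html_text):
    """Return the embedded Dublin Core pairs found in an HTML document."""
    parser = DCExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.pairs


page = ('<HTML><HEAD><TITLE>UKOLN Home Page</TITLE>'
        '<META NAME="DC.creator" CONTENT="Stark, Isobel">'
        '<META NAME="Keywords" CONTENT="xxx"></HEAD>')
print(extract_dc(page))
```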
BIBLINK Information flow between publishers • traditional • new - CD-ROM or Web (new to publishing) and National Bibliographic Agencies • British Library, UK • Biblioteca Nacional, Madrid, Spain • Bibliothèque Nationale de France, Paris • Koninklijke Bibliotheek, Den Haag, Netherlands • Nasjonalbiblioteket, Rana, Norway • Universitat Oberta de Catalunya, Barcelona, Spain http://www.ukoln.ac.uk/metadata/BIBLINK/
BIBLINK - research • Scope • Electronic publications suitable for inclusion in National Bibliographies • Metadata • Dublin Core (with extensions!), SGML DTD • Identifiers • ISBN, ISSN, SICI, DOI, URN • Transmission • Simple e-mail or Web crawler • Authentication • MD5 hash assigned to each resource
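The authentication step rests on computing an MD5 hash of the resource and comparing it with the value carried in the metadata. A minimal sketch using Python's standard library follows; the function names are illustrative, and how BIBLINK records the checksum is not shown here.

```python
import hashlib
import hmac


def md5_checksum(data: bytes) -> str:
    """Return the MD5 digest of a resource as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


def matches_checksum(data: bytes, expected_hex: str) -> bool:
    """Check a received resource against the checksum in its metadata."""
    return hmac.compare_digest(md5_checksum(data), expected_hex.lower())


resource = b"abc"
checksum = md5_checksum(resource)
print(checksum)
print(matches_checksum(resource, checksum))
```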
BIBLINK - data set • Minimum data set • Author, Title, Publisher, Place of Publication, Price, Extent (size), Keywords, Description, Edition/Version, Date of Publication, System Requirements, Format, Language, Terms and Conditions, Frequency, Identifier, Contributor, Checksum • Similar to DC but some don’t fit…
<META NAME="BIBLINK.placePublication" CONTENT="Bath, UK">
<META NAME="BIBLINK.frequency" CONTENT="monthly">
• Issues over conversion to MARC
BIBLINK - demonstrator • Cataloguing in Publication (CIP) level records • Enhanced records optionally returned to publishers • Conversion on to local MARC format using USEMARCON [Diagram labels: Publishers, Dublin Core, E-mail, NBAs/National Libraries, Dublin Core, UNIMARC, ??MARC]
Conclusions • Think about metadata as a ‘process’ • Dublin Core syntax now stable enough to use • Use within projects initially • Choose metadata management model appropriate to your site • Consider long term maintenance and transition to other formats