Metadata Issues for e-Prints: experiences from setting up an Institutional Repository. Jessie Hey, Research Fellow, TARDis Project, University of Southampton. ePrints UK Workshop, Ashmolean Museum, Oxford, 22 Mar 2004
e-Prints A simple illustration of diversity in metadata! • EPrints (software) • e-Prints (Soton) • ePrints (UK project) • eprints (in URLs, emails) • E-print (Network – US gateway)
Searching for e-Prints in Google • e-Prints 1,200,000 • eprints 225,000
Plam pilot? • Looking for a PDA? • Just try searching for plam pilot on eBay • Even a sale is not incentive enough
Metadata • The modern word for ‘Data about data’ • Generally structured data describing an e-Print in this context • Describing an object such as a journal article or book chapter or thesis
Metadata issues for today • Who needs the quality? • What kind of quality? • How we approached it in TARDis • the depositor • the process • classification • mediation • Balancing demands the pragmatic way
Who needs the quality? Service providers (i.e. search services) • Analysis in both the e-learning and e-prints communities showed concern about whether the quality of metadata in individual databases is good enough to give good search results when they are combined in cross-domain search services • Barton, Jane, Currier, Sarah and Hey, Jessie M.N. (2003) Building quality assurance into metadata creation: an analysis based on the learning objects and e-Prints communities of practice. In: 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice - Metadata Research and Applications, DCMI, 39-48. http://eprints.soton.ac.uk/archive/00000020/
As I am in Oxford… • a tribute in Elvish to JRR Tolkien from the Lord of the Rings
Gandalf on Dublin Core metadata • ‘I cannot read the fiery letters,’ said Frodo in a quavering voice. • ‘No’ said Gandalf ‘but I can. ……this in the Common Tongue is what is said, close enough: • One Ring to rule them all, One Ring to find them, • One Ring to bring them all and in the darkness bind them.’
Standards for e-Prints: Dublin Core Metadata Sets • Define minimal metadata elements for simple resource discovery e.g. title, creator, subject and keywords, publisher, date, rights management • Fundamental building blocks for Open Archives Initiative (OAI) compliant repositories • Software such as GNU EPrints is OAI compliant (in DSpace this may need ‘switching on’) • Full text searching (in latest version) will give additional help to compensate for weaknesses in the metadata
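As a rough illustration of the minimal element set listed above, the sketch below builds a simple Dublin Core record using only the Python standard library. The values are taken from the Barton, Currier and Hey paper cited earlier; the helper name build_dc_record is hypothetical, and the result only approximates the kind of record an OAI compliant repository exposes. It is not the actual output of GNU EPrints or DSpace.

```python
import xml.etree.ElementTree as ET

# Namespaces used for simple Dublin Core records in OAI harvesting.
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_dc_record(fields):
    """Build an oai_dc element from (element_name, value) pairs.

    Hypothetical helper: repeated names (e.g. several creators) are allowed,
    because Dublin Core elements are repeatable.
    """
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for name, value in fields:
        child = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        child.text = value
    return root


record = build_dc_record([
    ("title", "Building quality assurance into metadata creation: an analysis "
              "based on the learning objects and e-Prints communities of practice"),
    ("creator", "Barton, Jane"),
    ("creator", "Currier, Sarah"),
    ("creator", "Hey, Jessie M.N."),
    ("date", "2003"),
    ("identifier", "http://eprints.soton.ac.uk/archive/00000020/"),
])
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Even this small record shows where quality matters: each creator is a separate, consistently formatted element, which is what lets a search service combine records from many repositories.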
Who needs the quality? • Academics (the depositors) need reasonable quality for their publication record whether full text is available or not • Tendency to think a good citation matters less if access leads straight to the full text • An institutional repository needs: • To represent their own work well • To represent their faculty and university well • For publicity and communication • For research assessment and proposals • For promotion
What kind of quality? • Fit for purpose – visibility and citability • Rolls Royce or Volkswagen Golf or a Skoda? • The Rolls Royce may not produce a sustainable repository • Library of Congress had to think again with a backlog of millions • A departmental archive had to scrap its editors (too slow) • Need a model with a light touch
Examples to correct From an academic’s current departmental publication record: • Co-author given as Fadden on older references • Given as McFadden on newer ones • McFadden would not find all his papers!
Examples to correct • Authors are not perfect but neither are information specialists or other sources Recent examples: • Author’s assistant put a conference in year 2400 • ‘Web of Knowledge’ put a conference in 2010 NB Amazon proved useful for checking book information from the title page (new Amazon ‘search inside’ service) but main entries may be less accurate
Quality Assurance Procedures • Would like to pick up errors like these, and obvious examples of metadata in the wrong field e.g. book title used for title of chapter • Options include regular checking (e.g. at or close to time of deposit or for annual reporting) or random checking • Visualisation techniques promising but still expensive
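A light check of this kind could be sketched as follows. This is a minimal illustration, not the TARDis procedure: the field names (year, title, book_title) and the thresholds are assumptions chosen for the example, and a real repository would take them from its own schema.

```python
def check_record(record, current_year, earliest_year=1450, max_years_ahead=1):
    """Return a list of human-readable problems found in one metadata record.

    Hypothetical sketch: `record` is a plain dict, and the year limits are
    illustrative assumptions rather than repository policy.
    """
    problems = []

    # Catch dates such as a conference placed in the year 2400.
    year = record.get("year")
    if year is None:
        problems.append("year missing")
    elif not (earliest_year <= year <= current_year + max_years_ahead):
        problems.append(f"implausible year: {year}")

    # Catch metadata in the wrong field: book title reused as chapter title.
    title = (record.get("title") or "").strip().lower()
    book_title = (record.get("book_title") or "").strip().lower()
    if title and title == book_title:
        problems.append("chapter title is identical to book title")

    return problems


# Checked in 2004, both of the dates from the recent examples are flagged.
print(check_record({"year": 2400, "title": "A conference paper"}, current_year=2004))
print(check_record({"year": 2010, "title": "A conference paper"}, current_year=2004))
```

A check like this is cheap enough to run at deposit time, and the flagged records can then go to a person for the light review described above.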
How we approached it in TARDis • Looked at process from point of view of depositor • to decrease the barriers to deposit • to improve quality by design or example • Looked at metadata required for a good citation • academics using e-print records for many purposes not just visibility • Some information may be easier to strip out if required but harder to add later e.g. • first name or initials – although cultural variations too • journal title or abbreviation
Simple things deter • Questions you can’t answer • No place to put it • Errors which force you to enter it again • On a credit card payment • Date on the card: 06/05 • Date to enter: 06/2005 How many times do I do this incorrectly!
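One way to avoid this kind of deterrent is for the form to accept the value as the user sees it. The sketch below is a hypothetical example, not part of the EPrints software: it accepts both 06/05 and 06/2005, and it assumes that a two-digit year belongs to the 2000s.

```python
import re


def parse_month_year(text):
    """Parse 'MM/YY' or 'MM/YYYY' into a (month, year) tuple.

    Assumption for this sketch: two-digit years are read as 20YY.
    """
    match = re.fullmatch(r"\s*(\d{1,2})\s*/\s*(\d{2}|\d{4})\s*", text)
    if match is None:
        raise ValueError(f"expected MM/YY or MM/YYYY, got {text!r}")
    month = int(match.group(1))
    year = int(match.group(2))
    if year < 100:
        year += 2000  # expand the short form printed on the card
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range: {month}")
    return month, year


# Both forms of the date give the same result.
print(parse_month_year("06/05"))
print(parse_month_year("06/2005"))
```

The same idea applies to deposit forms: where two input formats are unambiguous, accepting both is kinder than rejecting one and forcing the depositor to type it again.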
To help the depositor • Aimed to enter information as the depositor sees it on the full text • Arranged input in the order the information is seen • With relevant information grouped together • With ‘pages’ that are not of daunting size • Fields of a size to view as much of the text as possible
The Process • Added help where examples are useful • Added extra buttons at top to ease navigation • Made fields mandatory only where essential • Tension between full details and deterrent • commentary field currently not included although some might find it useful
Some ‘quality’ traditions may be less practical • Search service recommendations: capitals only for first word of title except proper nouns • Process is generally ‘cut and paste’ so result is variable and advice ignored • Get Caps, non-caps, rarely ALL CAPS • Found in practice likely to be too time consuming to insist • Think retrieval first rather than consistency
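"Retrieval first" can be illustrated with a small sketch: instead of insisting that depositors fix capitalisation, the search side compares titles on a normalised key. The function names are hypothetical and this is not how any particular search service works; it only shows that variable capitals need not hurt retrieval.

```python
def search_key(title):
    """Return a comparison key that ignores case and repeated whitespace."""
    return " ".join(title.casefold().split())


def find_titles(query, titles):
    """Return the titles whose key contains the query's key."""
    key = search_key(query)
    return [title for title in titles if key in search_key(title)]


titles = [
    "Building Quality Assurance Into Metadata Creation",
    "BUILDING QUALITY  ASSURANCE INTO METADATA CREATION",
    "Metadata issues for e-Prints",
]

# Caps, non-caps and ALL CAPS are all found by the same query.
print(find_titles("quality assurance", titles))
```

The stored titles are left exactly as deposited; only the comparison is normalised.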
Classification – a specific area of debate • ePrints UK exploring automatic classification with Dewey • TARDis looked at current practice: • Reviewed subject classification in discipline based and early institutional archives • Found whole variety of choices and levels of complexity
TARDis on subject classification • Discussion of issues and snapshot chart http://tardis.eprints.org • Using basic Library of Congress with view to harvesting e.g. papers in Oceanography • Added search box to find subject • Departments could use an additional scheme if they wish (software option) • Keywords can be added (cut and paste) if available (sometimes papers also have classification categories added for a journal) • Computer classification generally expensive and requires learning examples but accuracy is improving
Mediation • TARDis is experimenting with deposit choices • Branch to: • Self archiving (author or local assistant) with light review as records pass through the submission buffer • Assisted archiving – give us the file with essential details not evident from the full text
Mediation in practice • Current experience: • Assisted archiving often time consuming - meeting the difficult ones - but can add value (e.g. fuller publisher location details such as DOI) • Self archiving less accurate but author may know details which may be missing from full text • Balance likely to change as authors become either more familiar with early deposit or perhaps happy to delegate to save time • Learning curve for us – later may devolve some quality responsibility (use editorial options) • Give additional feedback into software
The challenge of cutting and pasting from PDFs • Sometimes rather like the Hyperbookworms (Jasper Fforde, The Eyre Affair) • Who produce spurious capitals, apostrophes, hyphens • Problems with hyphens, accents and words starting with f! • LaTeX usually the culprit so Humanities have an advantage here
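Some of these artefacts can be repaired automatically. The sketch below is an assumption about one reasonable clean-up, not a description of the TARDis workflow. Words with f often paste as single ligature characters (such as U+FB01 for "fi"), and accents may paste as separate combining marks; Unicode NFKC normalisation repairs both. The function name is hypothetical. Note that rejoining words split by a hyphen at a line break also removes a genuine hyphen that happens to fall at the end of a line, so the result still needs a light review.

```python
import re
import unicodedata


def clean_pasted_text(text):
    """Tidy text pasted from a PDF (hypothetical helper, standard library only)."""
    # NFKC expands ligatures ("\ufb01" becomes "fi") and composes letters
    # with separate combining accents into single characters.
    text = unicodedata.normalize("NFKC", text)
    # Soft hyphens are invisible line-break hints; drop them.
    text = text.replace("\u00ad", "")
    # Rejoin words split by a hyphen at the end of a line.
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # Collapse remaining line breaks and repeated spaces.
    return re.sub(r"\s+", " ", text).strip()


pasted = "Subject classi\ufb01cation of meta-\ndata in a cafe\u0301"
print(clean_pasted_text(pasted))
```

NFKC also rewrites some other compatibility characters (for example superscript digits), which is acceptable for titles and abstracts but is a design choice to check before applying it to formulae.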
Balancing demands the pragmatic way • Author deposit changes the equation • Incentives can increase accuracy • Deposit support • Requests by department or university or funding council for up to date records • Collaboration between author, department and information specialist may be best way forward • Aim: light quality control to achieve visibility and citability
The New World of e-Prints • Not so elegant to work in as an Oxford College Library such as Brasenose • But should be just as satisfying to use as it meets new needs
Thank you • For further information: • TARDis http://tardis.eprints.org/ • e-Prints Soton (Research Soton) http://eprints.soton.ac.uk/ • FAIR: Focus on Access to Institutional Resources Programme • "Improving the Quality of Metadata in Eprint Archives", Marieke Guy and Andy Powell, Ariadne Issue 38, 30 January 2004 • Barton, Jane, Currier, Sarah and Hey, Jessie M.N. (2003) Building quality assurance into metadata creation: an analysis based on the learning objects and e-Prints communities of practice. In: 2003 Dublin Core Conference: Supporting Communities of Discourse and Practice - Metadata Research and Applications, DCMI, 39-48. http://eprints.soton.ac.uk/archive/00000020/