Reference Rot and E-Theses: Threat and Remedy. Peter Burnhill EDINA, University of Edinburgh & for the Hiberlink Team at University of Edinburgh & LANL Research Library. Centre for Service Delivery & Digital Expertise. Hiberlink ETD2014, Leicester UK July 25th 2014.
Reference Rot and E-Theses: Threat and Remedy. Peter Burnhill, EDINA, University of Edinburgh & for the Hiberlink Team at University of Edinburgh & LANL Research Library. Centre for Service Delivery & Digital Expertise. Hiberlink • ETD2014, Leicester UK, July 25th 2014. Funded by the Andrew W. Mellon Foundation.
Overview • The Hiberlink Project & Reference Rot • Evidence of Threat of Reference Rot for the E-Thesis • Our methods, data source & findings • Devising Remedy for Reference Rot in E-Thesis • Proposals for intervention: plug-ins & infrastructural solutions • Next Steps: who (else) wants to take this work forward?
Investigating Reference Rot in Web-Based Scholarly Communication Reference Rot = Link Rot + Content Drift “when links to web resources no longer point to what they once did”
Link Rot
+ Content Drift: what is at the end of the URI has changed, or gone! • (a) Dynamic content, as values on a webpage change over time • (b) Static content, but very different (often unrelated) web pages [Screenshots: http://dl00.org in 2000, 2004, 2005 and 2008]
An International Team at Work, funded by the Andrew W. Mellon Foundation • Los Alamos National Laboratory: • Research Library: Martin Klein, (Rob Sanderson), Harihar Shankar, Herbert Van de Sompel • University of Edinburgh: • Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou • EDINA*: Neil Mayo, Muriel Mewissen (Project Manager), Christine Rees, Tim Stickland, Richard Wincewicz, Peter Burnhill * Centre for Service Delivery & Digital Expertise
What we are doing in Hiberlink, 2013 - 2015 • Creating evidence on extent of ‘Reference Rot’ • Main focus has been on references (& URIs) made in Journal Articles • Includes work on reference rot in Supreme Court judgments with Harvard Law Library & permaCC • ETD2014 is opportunity to look at Reference Rot & the e-Thesis • Understanding the preparation/publication workflow • Identifying opportunity for productive intervention • Prototypes for pro-active archiving to enable remedy • Embedding such ‘solutions’ in existing tools & infrastructure • Raising awareness & seeking collaborative actions …. through events like this
Retrieving thinking about the emerging e-Thesis in 1998 University Theses Online Group, 1994/99 Initiated by U of Edinburgh & UC London, as referenced by Susan Copeland in ‘E-Theses Developments in the UK’ 2003
Measuring the Extent of ‘Reference Rot’ in e-Theses • Data Source • Looked for corpus of e-Theses for our study period of 1997 – 2012 • Interested only in Doctoral Theses/Dissertations • NDLTD Union Catalogue • Basic Method • Define selection and use information in the metadata record • Degree awarded (PhD etc); Department • Date thesis was successfully defended • Link to the full text of the Doctoral Thesis • Download selected e-Thesis from each Institution’s Repository
7,500 E-Theses Downloaded from 5 US Institutions In passing: note decline in numbers indicates ‘lag’ in ingest/availability of e-Theses
Key Aspects of Methodology (Stage 1) • Convert those e-Theses from PDF into XML • pdftohtml -xml • Locate the references & extract each and every URL • Technical challenges: URLs broken across a newline; underscores rendered as images • Use up to 15 regular expressions for matching; regard each match as a URI • UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
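The extraction step above can be illustrated with a minimal Python sketch. It is not the project's code: the two patterns below are assumptions chosen for illustration (the study's own expressions, up to 15 of them, are not reproduced in these slides), and the rule for rejoining a URL split across a newline is only a heuristic that can over-join in some texts.

```python
import re

# Illustrative pattern only (an assumption, not one of the study's expressions).
URL_RE = re.compile(r"https?://[^\s<>()\[\]\x22\x27]+", re.IGNORECASE)

# Heuristic (an assumption): if a line ends inside a URL on typical URL
# punctuation, treat the start of the next line as the continuation.
BROKEN_RE = re.compile(r"(https?://\S*[/\-_?=&])\n(\S)", re.IGNORECASE)


def extract_urls(text):
    """Return candidate URLs in order of first appearance, without duplicates."""
    joined = BROKEN_RE.sub(r"\1\2", text)  # rejoin URLs split by a newline
    seen, urls = set(), []
    for match in URL_RE.finditer(joined):
        url = match.group(0).rstrip(".,;:")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

For example, `extract_urls("See http://example.org/a/\nb.html, and (https://example.com/x).")` rejoins the first URL and strips the trailing comma from it.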
Key Aspects of Methodology (Stage 2) • 47,067 URIs were extracted • These were partitioned into two types: • 1,086 publisher sites, representing very many references to online articles: ‘the scholarly record’ • BTW, who does keep those articles in the Scholarly Record safe? • Ask me for evidence on that! • 45,981 URIs that linked ‘the Web at large’ • to Web content required for scholarship • inc. websites, software, blogs, videos, online debate etc • to that which lacks ‘fixity’ and changes over time • Those c.46,000 are the focus for the Hiberlink Project
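One way to picture the partition above is a simple host lookup, sketched below in Python. The host list is hypothetical and purely illustrative; the slides do not say how publisher sites were identified, so this should be read as an assumption rather than the project's method.

```python
from urllib.parse import urlsplit

# Hypothetical examples of 'scholarly record' hosts; the study's list is not given.
PUBLISHER_HOSTS = {"dx.doi.org", "publisher.example.org"}


def partition(uris, publisher_hosts=PUBLISHER_HOSTS):
    """Split URIs into (scholarly_record, web_at_large) by host name."""
    scholarly, web_at_large = [], []
    for uri in uris:
        host = (urlsplit(uri).hostname or "").lower()
        if host in publisher_hosts:
            scholarly.append(uri)
        else:
            web_at_large.append(uri)
    return scholarly, web_at_large


# The two counts reported on the slide add up to the total extracted.
assert 1_086 + 45_981 == 47_067
```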
Increase in Linking to ‘Web-at-large’ Resources, 1997-2010 beyond the e-journal, to that which lacks ‘fixity’ and changes over time 50% URIs, by Year Thesis Defended (%), 1997 - 2010
But Wide ‘Between-Thesis’ Variation in Number of Web Links. Focus on e-Theses defended from 2003 • 10% of Theses have 25 or more URIs • Median (average) increases from 4 to 5.5 • 75% have 2+ URIs per Thesis [Figure: box plots of medians (averages) & quartiles, with ‘outliers’; axis: Count (Log10), with the value 1373 marked]
Methodology (Stage 3): to discover the answer to 2 questions • Do those links (URIs) still work? Is the URI on the ‘Live Web’? • Allowing up to a maximum of 50 redirects, recording the HTTP transaction chain and regarding a 2XX status code as ‘live’ • Is there a ‘Memento’ of that reference in the ‘Archived Web’? • Memento: a prior version, showing what the Original Resource was like at some time in the past.
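The ‘Live Web’ check (the first of the two questions) can be sketched as follows. This is an illustration under stated assumptions, not the project's tooling: `fetch` is a hypothetical callable supplied by the caller that performs one HTTP request without following redirects and returns the status code and any `Location` header value.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

MAX_REDIRECTS = 50  # the limit stated on the slide

# Hypothetical interface: fetch(uri) -> (status_code, location_or_None)
Fetch = Callable[[str], Tuple[int, Optional[str]]]


def check_live(uri: str, fetch: Fetch,
               max_redirects: int = MAX_REDIRECTS) -> Tuple[bool, List[Tuple[str, int]]]:
    """Follow redirects and return (is_live, chain of (uri, status) pairs)."""
    chain: List[Tuple[str, int]] = []
    current = uri
    # One initial request plus at most max_redirects follow-ups.
    for _ in range(max_redirects + 1):
        status, location = fetch(current)
        chain.append((current, status))
        if 300 <= status < 400 and location:
            current = urljoin(current, location)  # resolve relative redirects
            continue
        return 200 <= status < 300, chain  # a 2XX final status counts as 'live'
    return False, chain  # still redirecting after the limit: not 'live'
```

Passing `fetch` in as a parameter keeps the redirect logic separate from the HTTP library, so the same function can be exercised against recorded transactions.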
Methodology (Stage 3): to discover the answer to 2 questions • Do those links (URIs) still work? Is the URI on the ‘Live Web’? • Is there a ‘Memento’ of that reference in the ‘Archived Web’? • Archival check carried out in June 2014, using an installed version of the Memento tool developed by LANL • http://www.mementoweb.org/guide/quick-intro/ • A ‘Datetime’ version at or near the date the Thesis was defended • Searching across several archives (not just Internet Archive) • Approach first used in pilot work at LANL; UoEdin Language Technology Group: Beatrice Alex, Claire Grover, Richard Tobin, Ke “Adam” Zhou
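To make the ‘Datetime’ lookup concrete, here is a small Python sketch. The Memento protocol (RFC 7089) negotiates in time with an `Accept-Datetime` request header; the first function only formats that header value. The second function picks the archived copy closest to the defence date from a list of (datetime, URI) pairs; that list and the function names are assumptions for illustration, not the interface of the LANL tool.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime(defended):
    """Format a timezone-aware datetime as an Accept-Datetime header value."""
    return format_datetime(defended.astimezone(timezone.utc), usegmt=True)


def nearest_memento(mementos, defended):
    """Return the (datetime, uri) pair closest in time to `defended`, or None.

    `mementos` is a hypothetical list of timezone-aware (datetime, uri) pairs,
    for instance gathered from several web archives.
    """
    if not mementos:
        return None
    return min(mementos, key=lambda pair: abs(pair[0] - defended))
```

A memento ‘at or near’ the defence date may predate or postdate it; how far apart the two dates may be before the copy stops being a fair representation is a policy choice this sketch leaves open.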
A Measure of Reference Rot: Are those references available? [in 6,400 e-Theses defended in 2003-2010 at 5 US universities] 1st Order Indicator of ‘Reference Rot’, after up to 50 redirects: • Less than two-thirds of those links lead to live content • More than one third of references to the Web are subject to ‘rot’
Confirm?: 2/3rds ‘Live’ URI Ratio is the same across the ‘Big 3’, 2003-2010 => ‘On average’ one third of the links in an e-Thesis are ‘rotten’
The older the citation, the less likely it is to be still on the live Web [excluding 0s & 1s: a few theses are unaffected; a few are ruined] We can’t stop that process of rot: Web content changes over time; Reference Rot is an inevitable function of time. [Figure axis: number of months elapsed from Date Thesis Defended until date archives checked (June 2014)]
Searching for ‘Datetime’ Mementos of content in ‘Archived Web’ [in 6,400 e-Theses defended in 2003-2010 at 5 US universities] There seems a 50:50 chance that referenced content is in the ‘Archived Web’. => half of those references are at ‘risk of loss’ Some content is being ‘co-incidentally harvested’ by routine web archiving.
50:50 chance that ‘DateTime Reference’ is ‘Incidentally Archived’
‘Incidental Archiving’ is constant over time (This is an ‘upper bound estimate’, independent of age of e-thesis) We can improve upon this ‘50:50 chance’ by pro-actively archiving what we cite
We already have ‘Lost Content’ for References to Web [in 6,400 e-Theses defended in 2003-2010 at 5 US universities] • 18.4% ‘not live & not found in archive’: judged to be lost forever • 34% ‘live’ & ‘not in archive’: at risk of loss NB: The 34% ‘at risk’ could be saved by pro-active archiving
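The two categories above come from crossing the two Stage 3 answers (live or not, archived or not). A minimal Python sketch of that cross-tabulation follows. The labels ‘lost’ and ‘at risk’ follow the slide; the other two labels and the sample observations are invented for illustration and are not study data.

```python
from collections import Counter


def classify(live, archived):
    """Map the two Stage 3 answers for one URI onto a category."""
    if live and archived:
        return "live & archived"        # label is an assumption
    if live:
        return "at risk"                # live, but no Memento found
    if archived:
        return "rotten but archived"    # label is an assumption
    return "lost"                       # not live & not found in archive


def summarise(observations):
    """Return the percentage of URIs in each category, rounded to 0.1."""
    counts = Counter(classify(live, archived) for live, archived in observations)
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Toy input only: four URIs, one in each category.
toy = [(True, True), (True, False), (False, True), (False, False)]
print(summarise(toy))
```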
Our General Approach Having demonstrated problem exists & is severe • The Web changes over time: reference rot occurs (36.7%) • Incidental archiving via routine of web archiving initiatives delivers no more than 50:50 chance of success • Seek pro-active ‘transactional archiving’ solutions • focus on what is regarded by authors as important • Thereby to remedy the integrity of the scholarly record We aim to embed ‘solutions’ in existing tools & infrastructure
Strategy for Making Remedy • Understand the preparation/publication workflow • identifying where there can be productive intervention • Devise prototypes for pro-active archiving • writing & implementing code! • Propose/test infrastructure for temporal referencing • supporting & using the Memento protocol We are embedding ‘solutions’ in existing tools & infrastructure
Understanding 3 workflows: Rot or Remedy? Identify the Actors • Doctoral Student (& Supervisor): Study -> Preparation -> (Review) -> Submission • Faculty, Examiners & Supervisor: Post-Submission -> Examination -> (Revision) -> Award • University & Library: Post-Award -> Deposit/Ingest -> Provide/Access -> Use Extended length of stages in workflows magnify reference rot & affect Identify the best opportunities for Intervention to make Remedy
‘Work in progress’ to effect Remedy • Hiberlink Plug-in - to enable pro-active archiving • Missing Link - re-factoring the HTML link • HiberActive - enables repositories to ‘stop the rot’ via actively archiving those references in e-theses LANL: Martin Klein, Harihar Shankar, Herbert Van de Sompel UoEd EDINA: Neil Mayo, Tim Stickland, Richard Wincewicz
‘Work in progress’ to effect Remedy (1) • Hiberlink Plug-in - to help authors and middle-folk (publishers/librarians) do the right thing: • Zotero - used by authors to manage references https://www.zotero.org/ • Open Journal System (OJS) - used by OA publishers https://pkp.sfu.ca/ojs/
Hiberlink Plug-in for Zotero: for use during preparation of thesis & before final submission, but also before deposit with Library (& maybe for repair by Library …) • Triggers archiving of referenced web content • Returns Datetime URI for archived content
‘Work in progress’ to effect Remedy (2) • Hiberlink Plug-in - to enable pro-active archiving • Missing Link - re-factor the HTML link that is returned Take simple URI - to French National Library (say) Augment Link with a set of Datetime & location pairs
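As a rough illustration of augmenting a link with Datetime & location pairs, the Python sketch below attaches the pairs to an HTML anchor. The `data-versions` attribute name, its JSON encoding and the archive URI are all hypothetical; the slides do not specify the markup that Missing Link uses.

```python
import json
from html import escape


def augment_link(href, text, versions):
    """Build an <a> element carrying (datetime, archived-URI) pairs.

    `versions` is a list of (datetime_string, uri) pairs. The pairs are stored
    as JSON in a hypothetical `data-versions` attribute.
    """
    payload = json.dumps([{"datetime": d, "uri": u} for d, u in versions])
    return '<a href="{}" data-versions="{}">{}</a>'.format(
        escape(href, quote=True),
        escape(payload, quote=True),  # quotes in the JSON become &quot;
        escape(text),
    )


print(augment_link(
    "http://www.bnf.fr/",
    "French National Library",
    [("2014-07-25", "http://archive.example/20140725/http://www.bnf.fr/")],
))
```

Keeping the original `href` untouched means the link still works for readers and tools that ignore the extra attribute.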
‘Work in progress’ to effect Remedy (3) • Hiberlink Plug-in - to enable pro-active archiving • Missing Link - re-factoring the HTML link First two approaches support ‘perfect scenario’: • All authors archive all their cited URIs • e.g. (but not exclusively) with Hiberlink / Zotero • HiberActive • Enables repositories to ‘stop the rot’ by actively archiving those references in e-theses • A notification hub, a component for the infrastructure • testing workflow with ResourceSync, CORE & external archive programme
Next Steps: who wants to take this work forward? • to ensure references in e-Theses don’t rot • Need to move from the ‘incidental Web archiving’ of cited URIs to pro-active archiving, by student/authors & by libraries • Offer to be an early adopter for these Hiberlink remedies • The Hiberlink Plug-in for Zotero / HiberActive • Email: edina@ed.ac.uk • Subject: Hiberlink ETD • Amend ‘Guidance for ETD Lifecycle Management’
http://hiberlink.org #hiberlink Thank you, Questions welcome • Email: edina@ed.ac.uk
Aside: We would all like to assume that our libraries are ensuring that online e-journal content is being kept safe • But online articles in the Scholarly Record are not in the custody of Libraries, nor on their digital shelves. Picture credit: http://somanybooksblog.com/2009/03/27/library-tour/
Evidence from The Keepers Registry is worrying! • Compare what is being kept by the (10) leading archiving agencies (CLOCKSS, Portico, national libraries etc) with all that have been issued with an ISSN • ‘Ingest Ratio’ = titles being ingested by one or more Keeper / ‘online serials’ in ISSN Register • = 23,268 / 136,965 [in March 2014] => 17% • * We do not know about 83% of e-serials having ISSN * • ‘KeepSafe Ratio’ = ingest by 3+ Keepers = 9,652 / 136,965 => 7% • Title Lists of 3 US research libraries (Columbia, Cornell & Duke), checked in 2011/12: ‘Ingest Ratio’ = 22% to 28%; c.75% unknown fate • User-centric Evidence, usage logs for the UK OpenURL Router* => over two thirds, 68% (36,326 titles), held by none!
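The two ratios above are simple proportions of the ISSN Register count, and the arithmetic can be checked with a few lines of Python. The figures are those quoted on the slide; the variable and function names are introduced here for illustration.

```python
ONLINE_SERIALS = 136_965     # 'online serials' in ISSN Register, March 2014
INGESTED_BY_ANY = 23_268     # titles ingested by one or more Keeper
INGESTED_BY_3_PLUS = 9_652   # titles ingested by three or more Keepers


def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to a whole number."""
    return round(100 * part / whole)


ingest_ratio = percent(INGESTED_BY_ANY, ONLINE_SERIALS)        # 17
keepsafe_ratio = percent(INGESTED_BY_3_PLUS, ONLINE_SERIALS)   # 7
unknown_fate = 100 - ingest_ratio                              # 83
```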