Learn about the established counting methods used by OpenDOAR, the difficulties in counting records, and the strategies implemented. Discover efficient counting methods and how to improve repository count accuracy.
Counting on OpenDOAR Peter Millington SHERPA Technical Development Officer CRC, University of Nottingham peter.millington@nottingham.ac.uk
Background to OpenDOAR • Created in 2005 • Lists over 2320 repositories (2013-07-02) • Manually validated • High quality… • …but we didn’t like to talk about the record counts • Counts not updated after the initial entry • Unless prompted by users • Fixed in 2012 • Record counts updated about every 2 weeks
Established counting methods • Manual inspection • Labour-intensive • Counting OAI-PMH record identifiers • Inefficient • Handling big files • Iterative • Unreliable • File size limits and timeouts • Inaccurate • Need to account for deleted records
How difficult can it be? • SELECT COUNT(*) FROM repository; • Still fast even with added complexity • Statuses, Breakdown by date, etc. • The number is often there on the web page • Headline number, or • “x to y of z” tally, or • Adding up numbers on a “Browse by year” page
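As a rough illustration of why a repository can produce its own count cheaply, the sketch below runs the slide's query, plus a breakdown by status, against a database. The `repository` table comes from the slide; the `status` column and the sample values used with it are assumptions made for this example, not a real repository schema.

```python
import sqlite3

def count_records(conn):
    """Return (total, {status: count}) for a hypothetical `repository` table."""
    # The headline query from the slide.
    total = conn.execute("SELECT COUNT(*) FROM repository").fetchone()[0]
    # "Added complexity": a breakdown by an assumed `status` column.
    by_status = dict(
        conn.execute("SELECT status, COUNT(*) FROM repository GROUP BY status")
    )
    return total, by_status
```

Both queries are single aggregate statements, so the cost stays with the database engine rather than with a harvester paging through records.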
OpenDOAR’s Strategy • Avoid OAI-PMH whenever possible • Use other m2m interfaces, if available/suitable • Screen scrape numbers from web pages • If all else fails, use manual methods • Counts for “full texts” as well, where possible
Generic n records Documents avec texte intégral 229181
Generic x to y of z counters Showing results 1 to 20 of 6727 DSpace Browse Counter is a special case
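A minimal sketch of scraping an "x to y of z" tally, assuming the page text has already been fetched as a string. The pattern and the function name `total_from_tally` are illustrative, not OpenDOAR's actual settings.

```python
import re

# Matches tallies such as "Showing results 1 to 20 of 6727" or "1-20 of 6,727".
TALLY = re.compile(
    r"(\d[\d,]*)\s*(?:to|-)\s*(\d[\d,]*)\s+of\s+(\d[\d,]*)",
    re.IGNORECASE,
)

def total_from_tally(text):
    """Return z from the first 'x to y of z' tally in text, or None if absent."""
    match = TALLY.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return int(match.group(3).replace(",", ""))
```

For the example on the slide, `total_from_tally("Showing results 1 to 20 of 6727")` gives `6727`.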
DSpace totalCnt Add-on NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ] (i.e. Communities in NCKUR [full-text count / total count])
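The bracketed pair in the example above could be read with a sketch like the following. It assumes the first `[a/b]` pair of integers on the page is the full-text count followed by the total, as the page's own legend indicates; the function name is hypothetical.

```python
import re

# A bracketed pair of integers, e.g. "[40782/74662]".
PAIR = re.compile(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]")

def parse_fulltext_and_total(text):
    """Return (full_text_count, total_count) from the first [a/b] pair, or None."""
    match = PAIR.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

The legend `[ 全文筆數/總筆數 ]` contains no digits, so it is not mistaken for a count.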
Generic Sum of List Counters Add up the numbers in brackets EPrints count Browse List is a special case
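Adding up the numbers in brackets can be sketched as below, assuming a browse page whose entries look like `2013 (120)`. The pattern and the function name are illustrative only.

```python
import re

# A number enclosed in round or square brackets, e.g. "(120)" or "[1,045]".
BRACKETED = re.compile(r"[\(\[](\d[\d,]*)[\)\]]")

def sum_bracketed_counts(text):
    """Sum every bracketed number in text, ignoring unbracketed ones such as years."""
    return sum(int(n.replace(",", "")) for n in BRACKETED.findall(text))
```

Only bracketed values are summed, so the year labels themselves do not inflate the total.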
EPrints V.3 Counter Number of items http://eprints.nonesuch.ac.uk/cgi/counter
Generic Sum of Numbers Add up the numbers
Generic HTML tag counting Count item tags in HTML source code
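One way to count item tags in HTML source is sketched here with Python's standard `html.parser`. Which tag and class mark an "item" differs per repository; the `li` tag and `item` class in the usage note are assumptions for illustration.

```python
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Counts start tags with a given name and, optionally, a given class."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.css_class is not None:
            classes = (dict(attrs).get("class") or "").split()
            if self.css_class not in classes:
                return
        self.count += 1

def count_tags(html, tag, css_class=None):
    """Return how many matching tags appear in the HTML source."""
    counter = TagCounter(tag, css_class)
    counter.feed(html)
    counter.close()
    return counter.count
```

For example, `count_tags(page_source, "li", "item")` counts only list entries carrying the assumed `item` class, which helps exclude navigation lists.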
Counting multiple pages Separate pages per letter, document type, etc Issues with Greenstone – lack of predictability
OAI-PMH ListIdentifiers: Simple http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc Count these No resumptionToken
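For a response with no resumptionToken, counting can be sketched as follows. The function takes the response XML as a string and separates live records from those flagged `status="deleted"` in the header, since deleted records would otherwise inflate the count. The function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace of OAI-PMH 2.0 response elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def count_identifiers(xml_text):
    """Return (live, deleted) header counts from one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    live = deleted = 0
    for header in root.iter(OAI_NS + "header"):
        if header.get("status") == "deleted":
            deleted += 1
        else:
            live += 1
    return live, deleted
```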
OAI-PMH ListIdentifiers: Iterative resumptionToken for blocks of identifiers <resumptionToken>193114FUS</resumptionToken>
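The iterative case can be sketched as a loop that keeps requesting blocks until the resumptionToken comes back empty or missing. Here `fetch` is a hypothetical callable, supplied by the caller, that takes a URL and returns the response XML as text (for instance a wrapper around `urllib.request.urlopen`); `max_requests` is an assumed safety limit, not part of the protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def harvest_count(base_url, fetch, metadata_prefix="oai_dc", max_requests=10000):
    """Count non-deleted identifiers across all ListIdentifiers blocks."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    total = 0
    for _ in range(max_requests):
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        total += sum(
            1 for h in root.iter(OAI_NS + "header") if h.get("status") != "deleted"
        )
        token = root.find(".//" + OAI_NS + "resumptionToken")
        # A missing or empty token marks the final block.
        if token is None or not (token.text or "").strip():
            return total
        # resumptionToken is an exclusive argument: send it with the verb only.
        params = {"verb": "ListIdentifiers", "resumptionToken": token.text.strip()}
    raise RuntimeError("request limit reached before the list was complete")
```

Each block costs one HTTP round trip, so a large repository needs many requests, and any timeout along the way invalidates the count.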
OAI-PMH completeListSize <resumptionToken completeListSize="89805" Bingo!
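When the first response carries a `completeListSize` attribute, a single request is enough. The sketch below reads it from the response XML; the function name is illustrative. The attribute is optional in OAI-PMH, and the specification allows the value to be an estimate, so the result is best treated as approximate.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def complete_list_size(xml_text):
    """Return completeListSize from the resumptionToken, or None if not provided."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI_NS + "resumptionToken")
    if token is None:
        return None
    size = token.get("completeListSize")
    if size is None or not size.isdigit():
        return None
    return int(size)
```

A `None` result is the signal to fall back to counting identifiers block by block.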
Twelve count harvesting methods • Generic • Generic n records • Generic x to y of z counters • Generic Sum of List Counters • Generic HTML tag counting • Generic Sum of Numbers • DSpace • DSpace Browse Counter • DSpace totalCnt Add-on • EPrints • EPrints count Browse List • EPrints V.3 Counter • OAI-PMH ListIdentifiers • Simple • Iterative • completeListSize • Manual counting
Efficiency of the methods Iterative OAI-PMH so much slower
Ugent: Numbers galore • DSpace and EPrints: Easily scrapeable counts
Count harvesting issues • No counts visible or harvestable • Static counts – often approx. – e.g. “over 2m items” • Connectivity issues • Infrastructure limitations – e.g. heavy internet traffic • HTTP 401 (unauthorised) & 403 (forbidden) errors • Data hidden in include files (e.g. JavaScript) • Not visible in View Source code • No direct URL known for the pages with counts • Only accessible to human navigators • Remodelled websites – requiring updated settings
Help OpenDOAR count your repository • Display record counts on your home page • Using distinctive wording & highlighting • Ideally in <div id="[ID]"> or <span id="[ID]"> tags • Ensure numbers can be seen in View Source code • Ensure pages & files are not blocked to robots • Grant read-only access if necessary • Implement OAI-PMH properly • Return ListIdentifiers in chunks – not one big file • Include completeListSize in the resumptionToken • Tell us about any changes, so we can update settings
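To show why a count wrapped in a tag with a known id is easy to harvest, here is a minimal sketch using Python's standard `html.parser`. The id `record-count` in the usage note is a made-up example standing in for the slide's `[ID]` placeholder, and the class and function names are illustrative.

```python
import re
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class IdTextExtractor(HTMLParser):
    """Collects the text inside the element whose id matches element_id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.element_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def count_from_id(html, element_id):
    """Return the first number inside the element with this id, or None."""
    parser = IdTextExtractor(element_id)
    parser.feed(html)
    parser.close()
    match = re.search(r"\d[\d,]*", "".join(parser.chunks))
    return int(match.group().replace(",", "")) if match else None
```

With markup such as `<div id="record-count">Total records: <b>6,727</b></div>`, `count_from_id(page_source, "record-count")` returns `6727`, and other numbers on the page are ignored. This only works when the number is present in the served HTML rather than injected by JavaScript.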
Ideas for the Future • Comparing counts from OpenDOAR & ROAR • E.g. Nottm ePrints: 1,239 < 1,277 • E.g. HAL-Inserm: 7,498 > 2,773 • OpenDOAR • Growth charts • Full text counts • Extending OAI-PMH • Statistical features • Trial PSH