
Counting on OpenDOAR

Learn about the established counting methods used by OpenDOAR, the difficulties in counting records, and the strategies implemented. Discover efficient counting methods and how to improve repository count accuracy.



Presentation Transcript


  1. Counting on OpenDOAR • Peter Millington, SHERPA Technical Development Officer, CRC, University of Nottingham • peter.millington@nottingham.ac.uk

  2. Background to OpenDOAR • Created in 2005 • Lists over 2320 repositories (2013-07-02) • Manually validated • High quality… • …but we didn’t like to talk about the record counts • Counts not updated after the initial entry • Unless prompted by users • Fixed in 2012 • Record counts updated about every 2 weeks

  3. Established counting methods • Manual inspection: labour-intensive • Counting OAI-PMH record identifiers: inefficient (handling big files, iterative); unreliable (file size limits and timeouts); inaccurate (need to account for deleted records)

  4. How difficult can it be? • SELECT COUNT(*) FROM repository; • Still fast even with added complexity • Statuses, Breakdown by date, etc. • The number is often there on the web page • Headline number, or • “x to y of z” tally, or • Adding up numbers on a “Browse by year” page
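
As a hedged illustration of the database-side count on this slide, the sketch below runs the `SELECT COUNT(*)` query plus status and date breakdowns against an in-memory SQLite database. The `status` and `year` columns and the sample rows are assumptions invented for the example; a real repository schema will differ.

```python
import sqlite3


def count_records(conn):
    """Return the total plus breakdowns by status and by year.

    Assumes a hypothetical `repository` table with `status` and `year` columns.
    """
    total = conn.execute("SELECT COUNT(*) FROM repository").fetchone()[0]
    by_status = dict(conn.execute(
        "SELECT status, COUNT(*) FROM repository GROUP BY status"))
    by_year = dict(conn.execute(
        "SELECT year, COUNT(*) FROM repository GROUP BY year"))
    return total, by_status, by_year


# Made-up sample data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repository (id INTEGER PRIMARY KEY, status TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO repository (status, year) VALUES (?, ?)",
    [("live", 2011), ("live", 2012), ("deleted", 2012), ("live", 2013)],
)

total, by_status, by_year = count_records(conn)
print(total, by_status, by_year)
```

The point of the slide is that the repository itself can answer this almost instantly, whereas an outside harvester has to reconstruct the same number by other means.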

  5. OpenDOAR’s Strategy • Avoid OAI-PMH whenever possible • Use other m2m interfaces, if available/suitable • Screen scrape numbers from web pages • If all else fails, use manual methods • Counts for “full texts” as well, where possible
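
The strategy above amounts to trying methods in order of preference and keeping the first count that works. A minimal sketch of that idea follows; the function names, the exact ordering, and the stand-in methods are hypothetical and are not OpenDOAR's actual code.

```python
def first_successful_count(methods):
    """Try each (name, callable) in order; return the first usable count.

    A method signals failure by raising an exception or returning None.
    """
    for name, method in methods:
        try:
            count = method()
        except Exception:
            # Sketch only: any failure just means "try the next method".
            continue
        if count is not None:
            return name, count
    return None, None  # everything failed: fall back to manual counting


# Hypothetical stand-ins for real harvesting methods.
def m2m_interface():
    raise ConnectionError("no machine interface available")


def screen_scrape():
    return 6727


def oai_pmh():
    return 6727


result = first_successful_count([
    ("m2m interface", m2m_interface),
    ("screen scrape", screen_scrape),
    ("OAI-PMH", oai_pmh),  # placed last here, since it is to be avoided where possible
])
print(result)
```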

  6. Some examples…

  7. Generic n records • e.g. “Documents avec texte intégral 229181” (French: “Documents with full text 229181”)

  8. Generic x to y of z counters • e.g. “Showing results 1 to 20 of 6727” • DSpace Browse Counter is a special case
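
One plausible way to scrape an “x to y of z” tally is a regular expression that captures the three numbers and keeps the last one. The pattern and function name below are illustrative assumptions, and real pages vary in wording.

```python
import re

# Matches e.g. "1 to 20 of 6727"; commas as thousands separators are tolerated.
X_TO_Y_OF_Z = re.compile(
    r"(\d[\d,]*)\s+to\s+(\d[\d,]*)\s+of\s+(\d[\d,]*)", re.IGNORECASE)


def total_from_tally(text):
    """Return z from an 'x to y of z' tally, or None if no tally is found."""
    match = X_TO_Y_OF_Z.search(text)
    if match is None:
        return None
    return int(match.group(3).replace(",", ""))


total = total_from_tally("Showing results 1 to 20 of 6727")
print(total)
```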

  9. DSpace totalCnt Add-on • e.g. “NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ]” (Chinese: “Communities in NCKUR [40782/74662] [full-text count/total count]”)
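
The bracketed pair on this slide gives the full-text count followed by the total count. A small sketch of pulling both numbers out of such a string is shown below; the regular expression and function name are assumptions for illustration, not part of the add-on.

```python
import re

# First "[a/b]" pair containing digits: a = full-text count, b = total count.
TOTALCNT = re.compile(r"\[\s*(\d+)\s*/\s*(\d+)\s*\]")


def parse_totalcnt(text):
    """Return (full_text_count, total_count), or None if no pair is found."""
    match = TOTALCNT.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


counts = parse_totalcnt("NCKUR中的社群 [40782/74662] [ 全文筆數/總筆數 ]")
print(counts)
```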

  10. Generic Sum of List Counters • Add up the numbers in brackets • EPrints count Browse List is a special case
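
Adding up bracketed counters, for instance on a “Browse by year” page, can be sketched as follows. The sample listing is made up, and the pattern assumes counts appear in round brackets; a page that also puts other numbers in brackets would need a stricter pattern.

```python
import re

# A number in round brackets, e.g. "(120)" or "(1,340)".
BRACKETED = re.compile(r"\((\d[\d,]*)\)")


def sum_bracketed(text):
    """Sum every bracketed number found in the text."""
    return sum(int(n.replace(",", "")) for n in BRACKETED.findall(text))


# Invented browse listing for illustration.
listing = "2013 (120)\n2012 (1,340)\n2011 (95)"
total = sum_bracketed(listing)
print(total)
```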

  11. EPrints V.3 Counter • Number of items • http://eprints.nonesuch.ac.uk/cgi/counter

  12. Generic Sum of Numbers • Add up the numbers

  13. Generic HTML tag counting • Count item tags in HTML source code
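
Tag counting can be approximated with Python's standard `html.parser` module: count each start tag that matches a chosen tag name and, optionally, a CSS class. The `li` tag, the `item` class, and the sample page below are hypothetical; the right tag to count depends on each repository's markup.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count start tags with a given name and (optionally) a given class."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self.count += 1


def count_tags(html, tag, css_class=None):
    parser = TagCounter(tag, css_class)
    parser.feed(html)
    return parser.count


# Invented page fragment for illustration.
page = """
<ul>
  <li class="nav">Home</li>
  <li class="item">Paper A</li>
  <li class="item featured">Paper B</li>
  <li class="item">Paper C</li>
</ul>
"""
item_count = count_tags(page, "li", "item")
print(item_count)
```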

  14. Counting multiple pages • Separate pages per letter, document type, etc. • Issues with Greenstone – lack of predictability

  15. OAI-PMH ListIdentifiers: Simple • http:// ... /oai?verb=ListIdentifiers&metadataPrefix=oai_dc • Count the identifiers returned • No resumptionToken
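
For the simple case, where the whole list arrives in one response, counting reduces to parsing the XML and counting record headers. The sketch below works on an invented response and skips headers marked `status="deleted"`, in line with the earlier point about accounting for deleted records; the function name and sample identifiers are assumptions.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def count_identifiers(xml_text):
    """Return (live_count, has_more) for one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    live = sum(1 for header in root.iter(OAI + "header")
               if header.get("status") != "deleted")
    token = root.find(".//" + OAI + "resumptionToken")
    has_more = token is not None and bool((token.text or "").strip())
    return live, has_more


# Invented response for illustration.
response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header status="deleted"><identifier>oai:example.org:2</identifier></header>
    <header><identifier>oai:example.org:3</identifier></header>
  </ListIdentifiers>
</OAI-PMH>"""

live, has_more = count_identifiers(response)
print(live, has_more)
```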

  16. OAI-PMH ListIdentifiers: Iterative • resumptionToken for blocks of identifiers • <resumptionToken>193114FUS</resumptionToken>
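
The iterative case can be sketched as a loop that keeps requesting the next block until the response carries no (or an empty) resumptionToken. To keep the example runnable offline, the HTTP call is passed in as a `fetch` function and replaced here with canned pages; `fetch`, `count_iteratively`, the `repo.example` URL, and the page contents are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def count_iteratively(fetch, base_url, metadata_prefix="oai_dc", max_requests=10000):
    """Count non-deleted identifiers across all resumptionToken blocks."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    total = 0
    for _ in range(max_requests):
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        total += sum(1 for h in root.iter(OAI + "header")
                     if h.get("status") != "deleted")
        token = root.find(".//" + OAI + "resumptionToken")
        text = (token.text or "").strip() if token is not None else ""
        if not text:
            return total  # no token, or an empty one: the list is complete
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListIdentifiers", "resumptionToken": text}
    raise RuntimeError("too many requests; giving up")


def fake_page(n, token):
    """Build an invented response with n headers and the given token."""
    headers = "".join(
        "<header><identifier>oai:x:%d</identifier></header>" % i for i in range(n))
    return ('<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
            + headers + "<resumptionToken>" + token
            + "</resumptionToken></ListIdentifiers></OAI-PMH>")


pages = {
    "http://repo.example/oai?verb=ListIdentifiers&metadataPrefix=oai_dc":
        fake_page(3, "193114FUS"),
    "http://repo.example/oai?verb=ListIdentifiers&resumptionToken=193114FUS":
        fake_page(2, ""),
}
total = count_iteratively(pages.__getitem__, "http://repo.example/oai")
print(total)
```

Every block costs one request, which is why this method is far slower than reading a single number from a page.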

  17. OAI-PMH completeListSize • <resumptionToken completeListSize="89805" • Bingo!
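
When the first response includes the optional `completeListSize` attribute, a single request is enough. A minimal sketch of reading it follows; the function name and the sample response are assumptions, and since the attribute is optional a missing value is reported as `None`.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def complete_list_size(xml_text):
    """Return completeListSize from the resumptionToken, or None if absent."""
    root = ET.fromstring(xml_text)
    token = root.find(".//" + OAI + "resumptionToken")
    if token is None:
        return None
    size = token.get("completeListSize")
    if size is None or not size.isdigit():
        return None
    return int(size)


# Invented response for illustration.
response = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <resumptionToken completeListSize="89805" cursor="0">193114FUS</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

size = complete_list_size(response)
print(size)
```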

  18. Twelve count harvesting methods • Generic: Generic n records; Generic x to y of z counters; Generic Sum of List Counters; Generic HTML tag counting; Generic Sum of Numbers • DSpace: DSpace Browse Counter; DSpace totalCnt Add-on • EPrints: EPrints count Browse List; EPrints V.3 Counter • OAI-PMH ListIdentifiers: Simple; Iterative; completeListSize • Manual counting

  19. Efficiency of the methods • Iterative OAI-PMH so much slower

  20. Relative Frequency of Methods

  21. Ugent: Numbers galore • DSpace and EPrints: Easily scrapeable counts

  22. Count harvesting issues • No counts visible or harvestable • Static counts – often approx. – e.g. “over 2m items” • Connectivity issues • Infrastructure limitations – e.g. heavy internet traffic • HTTP 401 (unauthorised) & 403 (forbidden) errors • Data hidden in include files (e.g. JavaScript) • Not visible in View Source code • No direct URL known for the pages with counts • Only accessible to human navigators • Remodelled websites – requiring updated settings

  23. Help OpenDOAR count your repository • Display record counts on your home page • Using distinctive wording & highlighting • Ideally in <div id="[ID]"> or <span id="[ID]"> tags • Ensure numbers can be seen in View Source code • Ensure pages & files are not blocked to robots • Grant read-only access if necessary • Implement OAI-PMH properly • Return ListIdentifiers in chunks – not one big file • Include completeListSize in the resumptionToken • Tell us about any changes, so we can update settings
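
To show why an `id` attribute makes a count easy to harvest, the sketch below collects the text inside the element with a given `id` and extracts the first number from it. The `record-count` id and the sample page are hypothetical, and the parser is deliberately simple: it assumes the target element contains no unclosed void tags such as `<br>`.

```python
import re
from html.parser import HTMLParser


class IdTextParser(HTMLParser):
    """Collect the text inside the element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # greater than 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def count_by_id(html, target_id):
    """Return the first number inside the element with this id, or None."""
    parser = IdTextParser(target_id)
    parser.feed(html)
    match = re.search(r"\d[\d,]*", "".join(parser.chunks))
    return int(match.group().replace(",", "")) if match else None


# Invented home page fragment for illustration.
home_page = '<p>Welcome</p><span id="record-count"><b>12,345</b> records</span><p>2013</p>'
record_count = count_by_id(home_page, "record-count")
print(record_count)
```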

  24. Ideas for the Future • Comparing counts from OpenDOAR & ROAR • E.g. Nottm ePrints: 1,239 < 1,277 • E.g. HAL-Inserm: 7,498 > 2,773 • OpenDOAR • Growth charts • Full text counts • Extending OAI-PMH • Statistical features • Trial PSH
