Sometimes, I just want to count things
Peter Millington
SHERPA Technical Development Officer
CRC, University of Nottingham
peter.millington@nottingham.ac.uk
Actually, that’s a lie
• Just give me numbers for OpenDOAR
  • No. of items in ~1,800 repositories
  • Growth rates
  • Number of full texts vs. metadata-only records
• More generally (any database or resource)
  • No. of records in the database
  • No. of records by year, month, etc.
  • No. of records by category
How difficult can it be?
• Screen scraping? – Uh-uh-uh
• OAI-PMH – counting identifiers
  • BIG files – e.g. DSpace – Time out!
  • Iterative chunks – e.g. EPrints – Yawn
  • ‘completeListSize’ attribute – If only…
• ORE is no better – Whatever…
• select count(*) from TABLE; – Duh!
• So back to screen scraping – Sigh
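To illustrate the OAI-PMH point: the protocol lets a repository report the total size of a list through an optional completeListSize attribute on the resumptionToken element, so a harvester could in principle read the count from the first page instead of iterating. The sketch below assumes a trimmed, made-up ListIdentifiers response from a hypothetical repository.example.org; when the attribute is absent (which is allowed), the only fallback is counting headers page by page.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical, trimmed ListIdentifiers response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2011-02-11T00:00:00Z</responseDate>
  <request verb="ListIdentifiers" metadataPrefix="oai_dc">http://repository.example.org/oai</request>
  <ListIdentifiers>
    <header>
      <identifier>oai:repository.example.org:1</identifier>
      <datestamp>2010-01-01</datestamp>
    </header>
    <header>
      <identifier>oai:repository.example.org:2</identifier>
      <datestamp>2010-01-02</datestamp>
    </header>
    <resumptionToken completeListSize="1234" cursor="0">token-abc</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""


def count_from_first_page(xml_text):
    """Return (headers on this page, completeListSize or None)."""
    root = ET.fromstring(xml_text)
    headers = root.findall(f".//{OAI_NS}header")
    token = root.find(f".//{OAI_NS}resumptionToken")
    size = None
    if token is not None and token.get("completeListSize") is not None:
        size = int(token.get("completeListSize"))
    return len(headers), size
```

If the second value comes back as None, the harvester is back to following resumption tokens until the list is exhausted, which is the slow path the slide complains about.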
It should be as easy as …one…
• Simplicity
  • Single SQL SELECT statement
  • Anything more is too complex and so too slow
• Single Call/File
  • No iteration
• Single simple schema
  • XML (+ optional JSON, and other renditions)
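As a rough sketch of the "single SQL SELECT statement" idea, the example below uses an in-memory SQLite database. The table name `items`, the column `date_added` and the sample rows are assumptions made up for illustration; a real resource would have its own schema.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, date_added TEXT)")
conn.executemany(
    "INSERT INTO items (date_added) VALUES (?)",
    [("2008-03-01",), ("2008-11-15",), ("2009-06-30",)],
)


def count_all(connection):
    """Total number of records: one SELECT, no iteration."""
    return connection.execute("SELECT COUNT(*) FROM items").fetchone()[0]


def count_by_year(connection):
    """Records per year added, still a single SELECT."""
    return connection.execute(
        "SELECT strftime('%Y', date_added) AS year, COUNT(*) "
        "FROM items GROUP BY year ORDER BY year"
    ).fetchall()
```

Both counts come from one statement each, which is what keeps the response cheap enough to generate on demand.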
…two…
Target Performance - Rules of Two
• SQL execution: <= 0.2 seconds
• Rendering the output file: <= 2 seconds
• Data points: <= 20
…three
Maximum limits - Rules of Twenty (?)
• SQL execution: <= 2 seconds
• Rendering the output file: <= 20 seconds
• Data points: <= 200
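One way to make the two sets of limits concrete is a small check that classifies a measured response. This is only a sketch: it takes the thresholds on the two slides as applying to SQL execution, rendering and data points in that order, and the dictionary keys and the `classify` function are hypothetical names.

```python
# Thresholds as read from the "Rules of Two" and "Rules of Twenty" slides.
TARGET = {"sql_seconds": 0.2, "render_seconds": 2.0, "data_points": 20}
MAXIMUM = {"sql_seconds": 2.0, "render_seconds": 20.0, "data_points": 200}


def classify(sql_seconds, render_seconds, data_points):
    """Return which budget a measured response fits into."""
    measured = {
        "sql_seconds": sql_seconds,
        "render_seconds": render_seconds,
        "data_points": data_points,
    }
    if all(measured[key] <= TARGET[key] for key in TARGET):
        return "within target"
    if all(measured[key] <= MAXIMUM[key] for key in MAXIMUM):
        return "within maximum"
    return "over limit"
```

For example, `classify(0.1, 1.5, 12)` falls within the target, while a response with 150 data points only fits the maximum limits.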
Actions speak louder than words
• Protocol for Statistical Harvesting (PSH)
  • Base URL + verb + optional arguments
• Specification & Examples
  • http://opendoar.org/demos/psh_prototype.php
• Example Base URL:
  • http://opendoar.org/demos/psh.php
Simplest case - [base url]?verb=Count

<psh>
  <responseDate>2011-02-11T00:05:26Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems">
    <header>
      <setType/>
      <setSpec/>
      <setName/>
      <datestamp/>
      <numItems>1860</numItems>
    </header>
  </Count>
</psh>
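A client could read the total out of a response like this with a few lines of standard-library code. The sketch below assumes the response has no XML namespace and exactly the element names shown on the slide; `total_items` is a hypothetical helper, not part of the protocol.

```python
import xml.etree.ElementTree as ET

# The response from the slide, with attribute spacing restored.
RESPONSE = """<psh>
  <responseDate>2011-02-11T00:05:26Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems">
    <header>
      <setType/>
      <setSpec/>
      <setName/>
      <datestamp/>
      <numItems>1860</numItems>
    </header>
  </Count>
</psh>"""


def total_items(xml_text):
    """Return (countType, numItems) from a simple Count response."""
    root = ET.fromstring(xml_text)
    count = root.find("Count")
    return count.get("countType"), int(count.findtext("header/numItems"))
```

One call, one file, one number: no resumption tokens and no paging.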
Optional Count Arguments
• &countType – ‘units’ for counts
  • e.g. records, repositories, groups, genera, etc.
• &setType – some sort of category
  • e.g. subject, region, social class, etc.
• &dateUnit
  • e.g. decade, year, month
• &dateType
  • e.g. date added, updated, performed, extinct, etc.
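Since a request is just the base URL plus a verb and optional arguments, building one is a matter of URL-encoding a few parameters. In this sketch `psh_url` is a hypothetical helper, and the argument values used in the usage note (`year`, `dateAdded`) are assumptions about what a given resource accepts; the ListDateUnits and ListDateTypes verbs are the way to find the real ones.

```python
from urllib.parse import urlencode

# Example base URL from the slides.
BASE_URL = "http://opendoar.org/demos/psh.php"


def psh_url(base_url, verb, **arguments):
    """Build a PSH request URL; arguments set to None are left out."""
    params = {"verb": verb}
    params.update({k: v for k, v in arguments.items() if v is not None})
    return f"{base_url}?{urlencode(params)}"
```

`psh_url(BASE_URL, "Count")` gives the simplest request, and `psh_url(BASE_URL, "Count", dateUnit="year", dateType="dateAdded")` would ask for a breakdown by year added, assuming the resource supports those values.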
Breakdown by year added

<psh>
  <responseDate>2011-02-11T00:36:24Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems" dateType="dateAdded">
    <header>
      <setType/>
      <setSpec/>
      <setName/>
      <datestamp>2008</datestamp>
      <numItems>298</numItems>
    </header>
    <header>
      <setType/>
      <setSpec/>
      <setName/>
      <datestamp>2009</datestamp>
      <numItems>278</numItems>
    </header>
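A breakdown like this is what growth figures would be computed from. The sketch below turns the two headers shown on the slide into a year-to-count mapping and a year-on-year difference. The slide cuts off before the end of the response, so the closing tags in the sample are added here as an assumption, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The two headers shown on the slide; closing tags are assumed.
RESPONSE = """<psh>
  <responseDate>2011-02-11T00:36:24Z</responseDate>
  <request verb="Count">http://www.opendoar.org/demos/psh.php</request>
  <Count countType="allItems" dateType="dateAdded">
    <header>
      <setType/><setSpec/><setName/>
      <datestamp>2008</datestamp>
      <numItems>298</numItems>
    </header>
    <header>
      <setType/><setSpec/><setName/>
      <datestamp>2009</datestamp>
      <numItems>278</numItems>
    </header>
  </Count>
</psh>"""


def counts_by_datestamp(xml_text):
    """Map each header's datestamp to its numItems."""
    root = ET.fromstring(xml_text)
    return {
        header.findtext("datestamp"): int(header.findtext("numItems"))
        for header in root.iter("header")
    }


def year_on_year_change(counts):
    """Difference between each year's count and the previous year's."""
    years = sorted(counts)
    return {
        later: counts[later] - counts[earlier]
        for earlier, later in zip(years, years[1:])
    }
```

With the two years shown, 2009 added 20 fewer items than 2008.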
Other verbs
• Verbs for listing available argument values
  • ListSetTypes
  • ListDateUnits
  • ListDateTypes
  • ListCountTypes
• Help – Technical help
• Identify – Information about the resource
Some datasets to play with
• OpenDOAR – open access repositories
  • http://opendoar.org/demos/psh.php
• SHERPA/RoMEO – Publishers’ policies
  • http://www.sherpa.ac.uk/romeo/psh.php
• Folk Play Scripts database
  • http://mastermummers.org/scripts/psh.php
• Folk Play Groups & Events
  • http://mastermummers.org/groups/psh.php
How could this be improved?
http://opendoar.org/demos/psh_prototype.php
peter.millington@nottingham.ac.uk