NEEO Technical Workshop 2: Exchange of usage metadata • Sciences Po, Paris, January 15th, 2009 • Benoit PAUWELS, Université Libre de Bruxelles (ULB), Brussels
Plan • Reminder of planning • Problem description • First proposal: OAI exchange of SWUP • Current implementation (OAI/SWUP) at ULB (DSpace) • Proposal variants / issues
EO usage data service • Current ideas for EO: • how many times every item in the IR has been read • which item (and by extrapolation which author, department, …) is the most popular within the institution or within a given research domain • the evolution of usage of the IR as a whole • search results ranked by download frequency of the object files • In more advanced environments, mining of the usage data could yield further value-added services, such as: • the creation of a network of (clusters of) related publications: publications that are read by the same person within a certain amount of time can be considered similar in some way • recommender systems, in which the end user gets a recommendation of other publications of possible interest in relation to a document he wishes to retrieve
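The first and last of the basic ideas above (downloads per item, and ranking search results by download frequency) can be sketched in a few lines. This is only an illustration: the event tuples, item identifiers and function names are hypothetical and do not come from the EO gateway.

```python
from collections import Counter

# Hypothetical usage events: (item identifier, usage type). Values are made up.
events = [
    ("item-A", "download"),
    ("item-B", "download"),
    ("item-A", "download"),
    ("item-A", "abstract_view"),
]


def download_counts(events):
    """Count download events per item, ignoring other usage types."""
    return Counter(item for item, usage in events if usage == "download")


def rank_by_downloads(item_ids, counts):
    """Order search results by download frequency, most downloaded first."""
    return sorted(item_ids, key=lambda item: counts[item], reverse=True)


counts = download_counts(events)
ranked = rank_by_downloads(["item-B", "item-A", "item-C"], counts)
```

Items without any recorded download count as zero and therefore sort last.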
Information of interest • an identification of the object file that was downloaded • an identification of the corresponding item • an indication of the date and time at which this item was downloaded • an identifier of an end user who downloaded this item • an indication of what type of usage has been done (abstract view, download request) • identification of the service from where the usage request was made by the end user • identification of the web page from which the request was initiated • application that has sent the request • Example: • http://bib11.ulb.ac.be:8080/dspace/handle/2013/781 • Download request for same object file from EO portal search result for phrase "wage dispersion and firm performance"
Need for harmonization • This information is stored on the IR platform in log files of all sorts, with all sorts of formatting (Apache log, DSpace log, …) • We want to get this information into the EO gateway in a normalized way: • Decide on exchange format • Decide on way to exchange • First proposal: • SWUP ContextObject • OAI-PMH
OpenURL ContextObject • An OpenURL ContextObject is defined as a data structure that holds information on the following 6 entities: • Referent: this entity corresponds to the resource which this ContextObject is about • ReferringEntity: an entity that references the Referent • Requester: an entity that describes the resource that requests services pertaining to the Referent • ServiceType: type of service requested • Resolver: a resource that can deliver the requested services • Referrer: a resource that generates the ContextObject
OpenURL ContextObject • Each of these entities is described through descriptors, which can be of 4 different types: • identifier: identifier for the entity • metadata-by-val: metadata about the entity; the metadata is included ‘by-value’ in the ContextObject • metadata-by-ref: metadata about the entity; the metadata is available at a network location • private-data: metadata about the entity; the format is not defined within the OpenURL Framework (but rather defined within a specific community) • SWUP = proposal on how to use the OpenURL ContextObject concepts to describe usage events
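The six entities and four descriptor types can be pictured as a small data structure. The sketch below is only a reading aid: the class and method names are hypothetical, the timestamp is a made-up value, and the model deliberately ignores the details of the OpenURL standard and of the SWUP guidelines.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Entity and descriptor names as listed on the slides.
ENTITY_NAMES = (
    "Referent", "ReferringEntity", "Requester",
    "ServiceType", "Resolver", "Referrer",
)
DESCRIPTOR_TYPES = (
    "identifier", "metadata-by-val", "metadata-by-ref", "private-data",
)


@dataclass
class Entity:
    # Each descriptor is a (descriptor type, value) pair.
    descriptors: List[Tuple[str, str]] = field(default_factory=list)

    def add(self, descriptor_type, value):
        """Attach a descriptor, rejecting unknown descriptor types."""
        if descriptor_type not in DESCRIPTOR_TYPES:
            raise ValueError(f"unknown descriptor type: {descriptor_type}")
        self.descriptors.append((descriptor_type, value))
        return self

    def identifiers(self):
        """Return the values of all 'identifier' descriptors."""
        return [v for t, v in self.descriptors if t == "identifier"]


@dataclass
class ContextObject:
    timestamp: str
    entities: Dict[str, Entity] = field(default_factory=dict)

    def entity(self, name):
        """Return the named entity, creating it on first use."""
        if name not in ENTITY_NAMES:
            raise ValueError(f"unknown entity: {name}")
        return self.entities.setdefault(name, Entity())


# Usage: the handle URL is the example from the slides; the timestamp is hypothetical.
co = ContextObject(timestamp="2009-01-15T10:00:00Z")
co.entity("Referent").add(
    "identifier", "http://bib11.ulb.ac.be:8080/dspace/handle/2013/781"
)
```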
WARNING • This is a draft proposal • There are outstanding issues
Information mapped to SWUP • an identification of the object file that was downloaded, and an identification of the corresponding item: Referent • an indication of the date and time at which this item was downloaded: ContextObject attribute • an identifier of an end user who downloaded this item: Requester • an indication of what type of usage has been done (abstract view, download request): ServiceType • identification of the service from where the usage request was made by the end user: Referrer • identification of the web page from which the request was initiated: ReferringEntity • application that has sent the request: ?
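The mapping above could be serialized roughly as follows. This is a hedged sketch, not the SWUP encoding: the namespace and element names follow my reading of the OpenURL XML ContextObject format, the field names of the `event` dictionary are hypothetical, and every value except the handle URL is invented. The guidelines remain the authoritative example.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenURL XML ContextObject format (assumed to apply to SWUP).
CTX_NS = "info:ofi/fmt:xml:xsd:ctx"
ET.register_namespace("ctx", CTX_NS)


def _entity(parent, tag, identifiers):
    """Append an entity element holding one identifier child per value."""
    element = ET.SubElement(parent, f"{{{CTX_NS}}}{tag}")
    for identifier in identifiers:
        ET.SubElement(element, f"{{{CTX_NS}}}identifier").text = identifier
    return element


def usage_event_to_context_object(event):
    """Map one usage event (a dict with hypothetical keys) to a ContextObject."""
    co = ET.Element(
        f"{{{CTX_NS}}}context-object", {"timestamp": event["datetime"]}
    )
    # Referent carries both the object file and the item identifier.
    _entity(co, "referent", [event["object_file_id"], event["item_id"]])
    _entity(co, "referring-entity", [event["referring_page"]])
    _entity(co, "requester", [event["requester_id"]])
    _entity(co, "service-type", [event["usage_type"]])
    _entity(co, "referrer", [event["referring_service"]])
    return co


# All values below are illustrative, except the handle URL taken from the slides.
event = {
    "datetime": "2009-01-15T10:00:00Z",
    "object_file_id": "http://example.org/bitstream/1",
    "item_id": "http://bib11.ulb.ac.be:8080/dspace/handle/2013/781",
    "referring_page": "http://example.org/search?q=wage",
    "requester_id": "requester-1",
    "usage_type": "download",
    "referring_service": "http://example.org/portal",
}
co = usage_event_to_context_object(event)
xml_text = ET.tostring(co, encoding="unicode")
```

The application that sent the request is left out on purpose, since its place in the mapping is still an open question.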
Example • See guidelines
Exchange of SWUPs • OAI-PMH
Implementation in DSpace • University of Minho (PT) • Statistics add-on module • Automatically transforms a DSpace log entry into a specific database entry • [ Massaging within database permitting all sorts of usage reports ] • ULB • Minimal adaptation: HTTP Referer and User-Agent added to db entry • Example of database entries • OAICat software: OAI-PMH data provider (DP) • Crosswalk which transforms db entry into SWUP ContextObject • http://bib15.ulb.ac.be:8080/dspace-oai-downloads/request?verb=ListRecords&metadataPrefix=swup • More info: http://www.bibhost.ulb.ac.be/RDIB/DISpace/DIfusion%201.4.2/Statistics/index.html
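On the harvesting side, a client would build ListRecords requests against this base URL and follow resumption tokens, as in any OAI-PMH harvest. The sketch below only builds URLs and reads a token from a response; it performs no network access. The function names are hypothetical, and whether the ULB data provider supports the optional `from` argument is an assumption.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
# Base URL taken from the slide above.
BASE_URL = "http://bib15.ulb.ac.be:8080/dspace-oai-downloads/request"


def list_records_url(base_url, metadata_prefix="swup",
                     from_date=None, resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    present, no other argument besides the verb may be sent.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if from_date:
            params["from"] = from_date
    return base_url + "?" + urlencode(params)


def next_resumption_token(response_xml):
    """Return the resumption token of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


first_url = list_records_url(BASE_URL)
```

A harvester would fetch `first_url`, hand the response body to `next_resumption_token`, and repeat with the returned token until it gets `None`.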
Proposal variants / issues • Other information of interest • application that has sent the request (User Agent): Referrer • the repository to which the request was sent: Resolver • baseUrl of the fileserver: use? • URL of the request: use? • geographical info in requester: unnecessary? Can be determined in EO gateway, based on IP address of requester (if not encrypted) • OAI identifier: use? • institution identifier: is this not already available on the EO Gateway?
Proposal variants / issues • “Primary” identifier is that of the object file • JISC: publication • Irrelevant discussion? The two identifiers need to be there, however encoded • “For the publication a new namespace is introduced: http://identifier.economistsonline.org/. The idea is that acting on the URI of the publication results in a redirection to the metadata as stored in the EO gateway.” • Using original+enriched metadata, instead of original metadata from IR?
Proposal variants / issues • Could be a big XML payload • Minimally needed: • Identifier of the request • Datetime of the request • Referent: identifier for item and object file • Requester: (encrypted) IP address • Referrer: identifier for User Agent or originating web service • ReferringEntity: identifier for originating web page • ServiceType: identifier • Resolver: identifier for repository • Alternative format to SWUP: one line containing all information (as a variant of the Combined Log Format)
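To give an idea of what the one-line alternative would involve on the receiving side, here is a sketch that parses a line in the standard Apache Combined Log Format. The variant discussed above has not been defined, so this parses the plain format only; the sample line, its path and its addresses are invented.

```python
import re
from datetime import datetime

# Standard Combined Log Format:
# host ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_usage_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Month abbreviations are parsed with the default (English) locale.
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


# Hypothetical sample line; no value here comes from a real log.
SAMPLE_LINE = (
    '192.0.2.1 - - [15/Jan/2009:10:00:00 +0100] '
    '"GET /dspace/bitstream/example.pdf HTTP/1.1" 200 12345 '
    '"http://example.org/search?q=wage+dispersion" "Mozilla/5.0"'
)
entry = parse_usage_line(SAMPLE_LINE)
```

The referer and user-agent fields are where the ReferringEntity and the requesting application would be carried in such a format.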
Proposal variants / issues • Alternatives for exchange: • HTTP / FTP Get of files containing one-line log entries • OAI exchange of files containing one-line log entries • HTTP / FTP Get of OAI-ListRecords-Response formatted files containing SWUP ContextObjects • File nomenclature? • Option 2 requires administration of files (filename - datetime)? • If file exchange, size is less of an issue: we should go for XML formatted information? • Filtering out double clicks • No agreement on double click period (COUNTER, Eprints, LogEC). • What do we do in EO?
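Whatever period is agreed, the double-click filter itself is simple. The sketch below drops a request when the same requester asked for the same object file within the window before it. The 10-second default is an arbitrary placeholder, not a value taken from COUNTER, Eprints or LogEC, and the tuple layout is hypothetical.

```python
from datetime import datetime, timedelta


def filter_double_clicks(events, window=timedelta(seconds=10)):
    """Keep one event per burst of repeated requests.

    events: iterable of (requester, object_file, datetime) tuples.
    A request is dropped when the same requester requested the same
    object file no more than `window` before it.
    """
    last_seen = {}
    kept = []
    for requester, object_file, when in sorted(events, key=lambda e: e[2]):
        key = (requester, object_file)
        previous = last_seen.get(key)
        if previous is None or when - previous > window:
            kept.append((requester, object_file, when))
        # Compare against the latest request, kept or not, so that a
        # rapid series of clicks is counted once.
        last_seen[key] = when
    return kept


t0 = datetime(2009, 1, 15, 10, 0, 0)  # hypothetical timestamps
events = [
    ("a", "file-1", t0),
    ("a", "file-1", t0 + timedelta(seconds=5)),   # double click, dropped
    ("b", "file-1", t0 + timedelta(seconds=2)),   # other requester, kept
    ("a", "file-1", t0 + timedelta(seconds=60)),  # outside window, kept
]
kept = filter_double_clicks(events)
```

Comparing against the most recent request rather than the most recent counted one is a design choice; the alternative would count a steady stream of clicks once per window.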
Proposal variants / issues • Filtering out robot requests • We must set up (and maintain) a filtering algorithm to be used by all partners for distinguishing real downloads from downloads by machines. • Authoritative list of robots? • List of regular expressions, rules • Remove all HEAD requests • Some bots can be recognized by their IP address • Discover bots from mining EO database with usage log entries: • bots can be active day and night, • bots generate many more events than human beings • bots regularly visit the same URLs • LogEC eliminates users who access more than 10% of all items in RePEc within one month
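Three of the rules above (HEAD requests, User-Agent patterns, and a LogEC-style volume threshold) could be combined along these lines. The pattern list is purely illustrative and in no way an authoritative list of robots; the function names and event layout are hypothetical.

```python
import re
from collections import defaultdict

# Illustrative patterns only; a shared, maintained list would replace them.
ROBOT_AGENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in (r"bot", r"crawler", r"spider")
]


def is_robot_request(method, user_agent):
    """Flag HEAD requests and requests whose User-Agent matches a pattern."""
    if method.upper() == "HEAD":
        return True
    return any(p.search(user_agent or "") for p in ROBOT_AGENT_PATTERNS)


def heavy_users(events, total_items, max_share=0.10):
    """Return requesters who accessed more than max_share of all items.

    events: iterable of (requester, item) pairs covering one month.
    """
    items_by_requester = defaultdict(set)
    for requester, item in events:
        items_by_requester[requester].add(item)
    return {
        requester
        for requester, items in items_by_requester.items()
        if len(items) > max_share * total_items
    }


suspects = heavy_users(
    [("a", "i1"), ("a", "i2"), ("b", "i1")], total_items=10
)
```

The other mining-based signals (activity day and night, regular revisits of the same URLs) would need time-series analysis and are not covered here.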
Proposal variants / issues • Exchange of IP addresses of requesters • Infringement of privacy laws? • How to anonymize requester information? Level of encryption?
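One possible answer, offered only as a basis for discussion, is to replace the IP address by a keyed hash before exchange. Note that this is pseudonymization with HMAC-SHA256, not encryption: it cannot be reversed, so the gateway could no longer derive geographical information from the address. Whether it satisfies the applicable privacy laws is not settled here. The function name and key handling are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_ip(ip_address, secret_key):
    """Return a stable pseudonym for an IP address.

    The same address and key always give the same pseudonym, so requests
    by one requester can still be grouped (double clicks, robots).
    secret_key is a bytes value kept private by the data provider.
    """
    return hmac.new(
        secret_key, ip_address.encode("utf-8"), hashlib.sha256
    ).hexdigest()


# Hypothetical key and documentation-range address.
KEY = b"replace-with-a-private-key"
pseudonym = pseudonymize_ip("192.0.2.1", KEY)
```

If every partner used its own key, a requester would get different pseudonyms at different repositories; a shared key would allow grouping across repositories at the cost of weaker protection.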