The MaxData Project aims to maximize library investments in digital collections through better data gathering and analysis. This study compares various data collection methods and develops models to help libraries make informed decisions. The project teams conduct surveys, analyze library data reports, and delve into deep log analysis of usage data. Surveys cover demographics, reading incidents, and critical incidents, providing valuable insights on reading behaviors. Local log data is utilized for database usage, aiding in subscription management and service optimization. Vendor-supplied usage reports and other sources supplement log data for comprehensive analysis. Challenges and solutions related to vendor reports, link resolvers, meta-search engines, and proxy servers are explored. OhioLINK's deep log analysis showcases the value of data in uncovering usage patterns and behaviors. Overall, the project aims to empower libraries with data-driven strategies to enhance user experiences and optimize resources.
Data, Data Everywhere: Making Sense of the Sea of User Data
MaxData Project • Carol Tenopir and Donald W. King • Gayle Baker, UT Libraries • Eleanor Read, UT Libraries • Maribeth Manoff, UT Libraries • David Nicholas, Ciber, University College London • http://web.utk.edu/~tenopir/maxdata/index.htm
MaxData “Maximizing Library Investments in Digital Collections Through Better Data Gathering and Analysis” Funded by Institute of Museum and Library Services (IMLS) 2005-2007
Study Objectives • To compare different methods of data collection • To develop a model that compares costs and benefits to the library of collecting and analyzing data from various methods • To help libraries make the best use of data
Study Teams • Surveys (UT and Ohio Libraries) • Library Data Reports (Vendor-provided and library collected) (UT Libraries) • Deep Log Analysis of raw journal usage data (Ciber and OhioLINK)
Three Types of Questions • Demographic • Recollection • Critical (last) incident of reading
Critical Incident Added to General Survey Questions • Specific (last incident of reading) • Includes all reading: electronic & print, library & personal • Detailed questions about last article read, e.g., purpose, value, time spent, format, how located, source • Last reading = a random sample of readings • Allows detailed analysis
What Surveys Answer that Logs Do Not • Non-library readings • Print as well as electronic readings • Purpose and value of readings • Outcomes of readings
Surveys provide much useful data, but… • Surveys rely on memory and truthfulness • Response rates are falling • Surveys cost your users’ time • Surveys can only be done occasionally • Log reports and raw logs show usage
Local Sources of Use Data • Local log data for databases • Vendor-supplied usage reports • Other sources of data
Local Log Data: Database Use • Environment: Mixture of web-based and locally-loaded resources • Problem: Use data from vendors not available or not uniform • Solution: Log requests for databases from library’s database menu (1999- )
Local Log Data: Process • MySQL and Perl CGI scripts • Log files compiled monthly • Process data with Excel and SAS • Extract, reformat, summarize, graph
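The slides name the tools (Perl CGI scripts, MySQL, monthly log files, Excel and SAS for summaries) but not the scripts themselves. As a minimal sketch of the summarization step, assuming a tab-separated monthly log with a timestamp, client IP, and database name per request (an invented layout, not the UT Libraries format):

```python
# Minimal sketch: summarize one month's database-menu log into per-database
# request counts. The log layout (tab-separated timestamp, client IP,
# database name) is assumed for illustration only.
import csv
from collections import Counter

def summarize_month(log_path, out_path):
    counts = Counter()
    with open(log_path, newline="") as log:
        for timestamp, client_ip, database in csv.reader(log, delimiter="\t"):
            counts[database] += 1
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["database", "requests"])
        for database, requests in counts.most_common():
            writer.writerow([database, requests])

if __name__ == "__main__":
    summarize_month("menu_requests_2005-01.log", "summary_2005-01.csv")
```

A summary file like this can then be pulled into Excel or SAS for the extract/reformat/graph steps the slide describes.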
Uses of Local Log Data • Subscription management • Number of simultaneous users • Pattern of use of a database over time • Continuation decisions • Cost per request • Services management • Use patterns by day, week or semester • Location of users (campus, off-campus, wireless)
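The cost-per-request figure used in continuation decisions is a simple ratio; a worked example with made-up numbers, purely illustrative:

```python
# Illustrative only: cost-per-request metric for continuation decisions.
# The subscription price and request count are invented numbers.
def cost_per_request(annual_cost, requests):
    return annual_cost / requests if requests else float("inf")

print(cost_per_request(12000.00, 4800))  # -> 2.5 dollars per request
```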
Local Log Data: Issues • Logs requests for access, not sessions • No detail on activity once in database • Undercounts: • Aggregators and full-text collections • Bookmarked access • Metasearch • Other sources of usage data supplement log data
Vendor-Supplied Usage Reports • Little post-processing of vendor data until 2002 • Made available upon request • Special attention to “big ticket” items • Full-text • Integrate subscription info with vendor data
Vendor-Supplied Usage Reports: Additional Processing • ARL Supplemental Statistics • Use data for electronic resources requested: • Number of logins (sessions) • Number of queries (searches) • Number of items requested • Fiscal year: July ‘04 – June ‘05
Vendor Reports to Review • University of Tennessee • Reports from 28 of 45 vendors listed as compliant with Release 1 of the COUNTER Code of Practice • Reports from 26 other vendors
The Challenge of Vendor-Supplied Use Reports • Request mode • Delivery • Format • Time period • Subscribed / titles used / all titles
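One common response to this variability is to map each vendor's report onto a single internal schema before comparing titles or packages. A hedged sketch, with invented vendor names and column mappings rather than any real report layout:

```python
# Hedged sketch: map rows from differently formatted vendor reports onto a
# common record so they can be compared. Vendor names and column names are
# invented examples, not actual report layouts.
import csv
from dataclasses import dataclass

@dataclass
class UsageRecord:
    vendor: str
    title: str
    month: str    # e.g. "2004-07"
    metric: str   # "sessions", "searches", or "items"
    count: int

# Per-vendor column mappings (assumed for illustration).
COLUMN_MAPS = {
    "VendorA": {"title": "Journal Title", "month": "Month",  "items": "Full-Text Requests"},
    "VendorB": {"title": "Publication",   "month": "Period", "items": "Article Downloads"},
}

def load_report(vendor, path):
    mapping = COLUMN_MAPS[vendor]
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield UsageRecord(
                vendor=vendor,
                title=row[mapping["title"]],
                month=row[mapping["month"]],
                metric="items",
                count=int(row[mapping["items"]]),
            )

if __name__ == "__main__":
    for record in load_report("VendorA", "vendora_fy2005.csv"):
        print(record)
```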
Other Sources – Link Resolvers (e.g. SFX) • Goes past the database level to use of individual journals • Use is measured the same way across packages • Useful where vendor reports are unavailable or incomplete (Open Access, backfiles) • The more places SFX links are used (catalog, e-journal list), the more complete the data
Other Sources – MetaSearch Engines (e.g. MetaLib) • “Number of searches” data that may not be counted in vendor reports (Z39.50) • Most useful and interesting for seeing how patrons are using federated searching
Other Sources – Proxy Servers (e.g. EZProxy) • Standard web log format captures data for every request to the server – this generates large logs that have to be analyzed • Some libraries send all users (not only remote users) through the proxy server for more complete log data
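Because the proxy writes standard web logs, off-the-shelf parsing works. A minimal sketch that counts requests per client host, assuming the usual Common Log Format fields (not tied to any particular EZProxy configuration):

```python
# Minimal sketch: count requests per client host in a Common Log Format
# proxy log. Assumes the standard 'host ident user [time] "request" status size'
# layout; a production parser would handle malformed lines more carefully.
import re
from collections import Counter

LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def requests_per_host(log_path):
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            match = LOG_LINE.match(line)
            if match:
                counts[match.group("host")] += 1
    return counts

if __name__ == "__main__":
    for host, n in requests_per_host("ezproxy_access.log").most_common(10):
        print(host, n)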
OhioLINK deep log analysis (DLA) showcase • Choice of OhioLINK – oldest big deal, common publisher platform and source of interesting data • Two purposes: 1) to show what kinds of data DLA could generate; 2) to raise the questions that need to be asked • Raw server logs of off-campus use June to December ’04 (to pick up returnees) and on-campus use for October. Logs uniquely contained search and navigational behaviour, too
Metrics • Four ‘use’ metrics employed – number of items/pages viewed, number of sessions conducted, number of items viewed in a session (site penetration) and amount of time spent online. • An ‘item’ might be: a list of journals – (subject or alphabetic), a list of journal issues, a contents page, an abstract or full-text article. • Search or navigational approach used (search engine, subject list of journals etc) • Users: returnees; by subject of journal and sub-net; name and type of institution.
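Sessions are not recorded directly in raw logs; they have to be reconstructed. A hedged sketch of one common approach: group requests by IP address and close a session after 30 minutes of inactivity. The 30-minute cutoff and the input format are assumptions for illustration, not necessarily the thresholds CIBER used:

```python
# Hedged sketch of the sessionisation behind metrics like items-per-session
# and time online: group requests by IP and start a new session after a gap
# of more than 30 minutes (an assumed, conventional cutoff).
from collections import defaultdict
from datetime import timedelta

SESSION_GAP = timedelta(minutes=30)

def build_sessions(requests):
    """requests: iterable of (ip, datetime) pairs parsed from the raw logs."""
    by_ip = defaultdict(list)
    for ip, ts in requests:
        by_ip[ip].append(ts)

    sessions = []  # (ip, items_viewed, duration_in_seconds)
    for ip, times in by_ip.items():
        times.sort()
        start = prev = times[0]
        views = 1
        for ts in times[1:]:
            if ts - prev > SESSION_GAP:
                sessions.append((ip, views, (prev - start).total_seconds()))
                start, views = ts, 0
            views += 1
            prev = ts
        sessions.append((ip, views, (prev - start).total_seconds()))
    return sessions
```

Note that this treats an IP address as a proxy for a user, which is exactly the limitation raised on the later Issues slide.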
Is the resource being used? • Items viewed. 1,215,000 items viewed on-campus (1 month) and 1,894,000 items viewed off-campus (7 months). • Titles used. • Journals available October 2004 = 5,872 • 5,868 journals used if contents lists, abstracts & articles included; 5,193 if only articles included. • 5% of journals accounted for 38% of usage; 10% for 53%, and 50% for 93%.
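Concentration figures such as "5% of journals accounted for 38% of usage" come from ranking per-journal view counts. A small illustration with synthetic counts (not OhioLINK numbers):

```python
# Illustrative only: share of total usage accounted for by the most-used
# fraction of journals, computed from per-journal view counts.
def usage_share(view_counts, top_fraction):
    ranked = sorted(view_counts, reverse=True)
    top_n = max(1, int(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)

# Synthetic, skewed distribution standing in for 1,000 journals:
counts = [1000] * 50 + [100] * 200 + [5] * 750
print(f"Top 5% of journals -> {usage_share(counts, 0.05):.0%} of usage")
```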
Is the resource being used? • Number of journals viewed in a session. • Very pertinent: OhioLINK is all about massive choice • A third of sessions saw no views of items associated with a particular journal • Of the two-thirds of sessions recording a journal item view, half viewed item(s) from 1 journal, 30% from 2 to 3 journals, 14% from 4 to 9 journals and 7% from 10+ • 49% of sessions saw a full-text article viewed, and the average number of articles viewed in a session was just over 2.
Is the resource being used? • Site penetration • 23% viewed 1 item in a session, 40% viewed 2 to 4 items, 21% viewed 5 to 10 items, 9% viewed 11 to 20 and 7% viewed 21+. • Figures quite impressive when compared to other digital libraries. Thus, in the case of EmeraldInsight, 42% of users viewed just one item. Due to the greater level of download freedom offered by OhioLINK?
Is the resource being used? • Returnees (off-campus) • 73% accessed OhioLINK journals once during the seven months (might have also used OhioLINK on campus). 22% came back between 2 to 5 times, 3% between 6 to 15 times and 2% more than 15 times. • Data compromised by floating IP addresses and multi-user machines
What can we learn about the methods used to find articles? • Search engine popularity. • 41% of sessions saw only the search engine being used, and a further 23% of sessions saw the engine used together with either the alphabetic or subject lists. • Users of engines were more likely to look at a wider range of: • Journals. 66% of those using the search engine viewed 2 or more journals, compared to 43% of those using either the alphabetic or subject lists. People using all three methods were most likely to view 10 or more different journals; nearly 1 in 5 did so.
What can we learn about the methods used to find articles? • Users of engines were more likely to look at a wider range of: • Subjects. Those utilising the engine were more likely to have viewed two or more subjects: 54% had done so, compared to 41% of those whose sessions saw use of an alphabetic or subject list. • Older material. Search engine users viewed older material, while those accessing the service via the alphabetical or subject lists were more likely to view very current material.
Issues • This is only pilot data • Caching means not all transactions are recorded in the logs • We are studying the usage patterns of a given IP address, not a given user, with the consequent problems arising from multi-user machines, proxy servers and floating IP addresses • There are problems with calculating session time • However: 1) we use a number of metrics; 2) results will be corroborated by survey techniques; 3) we have three years to perfect our techniques!
References • Nicholas D, Huntington P, Russell B, Watkinson A, Jamali HR, Tenopir C. The big deal: ten years on. Learned Publishing 18(4), October 2005, pp?? • Nicholas D, Huntington P, Jamali HR, Tenopir C. Journal of Documentation 62(2), 2006, pp?? • Nicholas D, Huntington P, Jamali HR, Tenopir C. Finding information in (very large) digital libraries: a deep log approach to determining differences in use according to method of access. Journal of Academic Librarianship, March 2006, pp??