Bandits and Browsing: Data Mining and Network Analysis for Library Collections

Harriett Green, English and Digital Humanities Librarian, University Library; Kirk Hess, Digital Humanities Specialist, University Library; Richard Hislop, Ph.D. candidate, Department of Economics, UIUC. ERRT, April 25, 2012

The Problem • How can users effectively find materials in today’s library collections and digital libraries? • Transformation in the acquisitions, access, and storage of library collections with digital materials, off-site storage, etc. • Availability of immense amounts of data • IR literature: user searching patterns

Project • GOAL: To develop information retrieval and analytical tools that could be incorporated into a possible recommender system • Metadata analysis to help users navigate and retrieve items from the collection • Code libraries will allow interdisciplinary study and research about the library itself. • Network analysis can reveal essential information about the collection's structure.

Project Structure • TEAM: Harriett Green (PI), Kirk Hess, Richard Hislop • SUPPORT: I-CHASS Scalable Research Challenge—Michael Simeone, co-PI • TOOLS: Awarded Start-Up Allocation of 30,000 SUs from XSEDE on the SGI Altix UV Blacklight cluster at Pittsburgh Supercomputing Center with XSEDE consultation support

Questions • What other collection items are like item X? How do we show people these related items? • What is the topic area that people want? How do we show people an estimated result of what they want? • How do we create visualizations and recommendations of items in the collection?

The Beginning: Sample Data Set • Initially ran analyses on a 40,000-item English collection • Quantified inefficiencies in subject headings • Developed prototypes of analyses to run on the full UIUC Library catalog data

XSEDE Analysis • Run analyses on entire UIUC Library catalog data • Conduct network analyses on entire UIUC Library catalog data for subject correlations • Extend betweenness calculation to use weighting based on items checked out together • Find clusters that need to be connected via extra subject headings
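The last bullet above, finding clusters that need to be connected, can be sketched as a connected-components search over a graph whose nodes are subject headings and whose edges join headings that share a bibliographic record. This is a minimal illustration, not the project's code: the sample records and the function names `build_heading_graph` and `connected_clusters` are hypothetical, and a real run would read headings from the catalog.

```python
from collections import defaultdict, deque
from itertools import combinations

# Hypothetical sample data: the subject headings on each bibliographic record.
records = [
    {"Outlaws", "Folklore"},
    {"Folklore", "England"},
    {"Banditry", "Brigands"},
    {"Cartography"},
]

def build_heading_graph(records):
    """Link two headings whenever they appear on the same record."""
    graph = defaultdict(set)
    for headings in records:
        for heading in headings:
            graph[heading]  # touch the key so isolated headings become nodes
        for a, b in combinations(headings, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def connected_clusters(graph):
    """Return the connected components of the heading graph via breadth-first search."""
    seen = set()
    clusters = []
    for start in sorted(graph):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        cluster = []
        while queue:
            node = queue.popleft()
            cluster.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(cluster))
    return clusters

clusters = connected_clusters(build_heading_graph(records))
```

In this toy data "Banditry"/"Brigands" and "Outlaws"/"Folklore"/"England" end up in separate clusters, which is the kind of gap an extra subject heading could close.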

Analysis of subject headings • Simple subject analysis can uncover lesser known correlations

Metadata analysis • Help users and library staff identify and connect search terms to subject headings and metadata in the catalog • Our initial approach: Use correlation of subject headings in bibliographic records • Quantifying Efficiency – ECS and ACS • Result: a recommender system whose analysis provides lists of related topics
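As a rough illustration of how co-occurring subject headings could yield a list of related topics, the sketch below counts how often pairs of headings share a record and ranks neighbours by Jaccard similarity. Jaccard is a stand-in chosen here for simplicity; the slides say only "correlation" and do not specify the measure, and ECS and ACS are not defined in the source, so they are not implemented. The sample records and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: the subject headings on each bibliographic record.
records = [
    {"Outlaws", "Folklore", "England"},
    {"Outlaws", "Folklore"},
    {"Outlaws", "Crime"},
    {"Folklore", "England"},
]

def cooccurrence_counts(records):
    """Count records per heading and records per (sorted) pair of headings."""
    heading_counts = Counter()
    pair_counts = Counter()
    for headings in records:
        heading_counts.update(headings)
        for a, b in combinations(sorted(headings), 2):
            pair_counts[(a, b)] += 1
    return heading_counts, pair_counts

def related_headings(target, heading_counts, pair_counts, top_n=5):
    """Rank headings that co-occur with `target` by Jaccard similarity (assumed measure)."""
    scores = {}
    for (a, b), both in pair_counts.items():
        if target not in (a, b):
            continue
        other = b if a == target else a
        union = heading_counts[target] + heading_counts[other] - both
        scores[other] = both / union
    # Highest similarity first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]

heading_counts, pair_counts = cooccurrence_counts(records)
related = related_headings("Outlaws", heading_counts, pair_counts)
```

For the toy records, a search landing on "Outlaws" would be offered "Folklore" first, then "Crime", then "England".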

Approach: Finding the right questions • Niche topics are important • Some headings are bridges between subjects • Metadata as a network analysis problem
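One common way to make "some headings are bridges" measurable is betweenness centrality: the number of shortest paths between other headings that pass through a given heading. The sketch below is an unweighted version of Brandes' algorithm in plain Python; the edge list is invented for illustration, and the checkout-based weighting the project planned is not included.

```python
from collections import defaultdict, deque

# Hypothetical heading graph: an edge means two headings share at least one record.
edges = [
    ("Outlaws", "Folklore"),
    ("Ballads", "Folklore"),
    ("Folklore", "History"),
    ("History", "Law"),
]

def adjacency(edges):
    """Build a symmetric adjacency map from an undirected edge list."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)

def betweenness(graph):
    """Brandes' algorithm for an unweighted, undirected graph."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        order = []                          # nodes in non-decreasing distance
        preds = {v: [] for v in graph}      # shortest-path predecessors
        sigma = {v: 0 for v in graph}       # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if dist[w] < 0:             # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        for w in reversed(order):           # accumulate from the farthest nodes back
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Each unordered pair was counted from both endpoints.
    return {v: total / 2 for v, total in score.items()}

scores = betweenness(adjacency(edges))
```

Here "Folklore" scores highest because every path from "Outlaws" or "Ballads" to the rest of the graph runs through it, while leaf headings such as "Law" score zero.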

Analyzing Circulation Data • Collection use provides information about how to further improve the catalog • Can identify not only the most-important known links, but find connections that need to be added • Database is represented as a network, with traffic between items that are checked out together
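The circulation network described above can be sketched by treating each checkout transaction as a set of items and counting how often each pair leaves the library together. Pairs with heavy traffic but no shared subject heading are candidates for a new connection. The transaction data, item identifiers, and the `min_weight` threshold below are all hypothetical, chosen only to show the idea.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: items checked out together in one transaction.
checkouts = [
    {"book1", "book2"},
    {"book1", "book2", "book3"},
    {"book2", "book3"},
    {"book1", "book4"},
]

# Hypothetical subject headings assigned to each item in the catalog.
item_subjects = {
    "book1": {"Outlaws"},
    "book2": {"Outlaws", "Folklore"},
    "book3": {"Cartography"},
    "book4": {"Outlaws"},
}

def co_checkout_weights(checkouts):
    """Edge weight = number of transactions in which both items appear."""
    weights = Counter()
    for items in checkouts:
        for a, b in combinations(sorted(items), 2):
            weights[(a, b)] += 1
    return weights

def missing_links(weights, item_subjects, min_weight=2):
    """Pairs borrowed together at least `min_weight` times that share no heading."""
    return sorted(
        pair
        for pair, weight in weights.items()
        if weight >= min_weight
        and not (item_subjects[pair[0]] & item_subjects[pair[1]])
    )

weights = co_checkout_weights(checkouts)
candidates = missing_links(weights, item_subjects)
```

In the toy data, "book2" and "book3" circulate together twice yet have no heading in common, so they surface as a link the catalog does not currently express.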

Analyzing user transactions

Other collection analyses • Collection development can be analyzed over time through the acquisition of authors and titles • Changes in library policy • Effect of converting the collection from Dewey to LOC? • Effect of book location on checkout frequency? (General stacks vs. departmental library vs. high-density storage)

Approaches to Collection Analysis

Challenges for a Library Recommender System • Google/Amazon/Netflix vs. Voyager and VuFind: different approaches to users • Keyword searching: word frequency, Solr sorting by proximity and frequency • Recommender systems: build user profiles, clustering of users and of documents • Easy Search: tracking by simple click-throughs

Future Steps • Analyze data sets from other libraries’ catalogs • Create a suite of tools that libraries can use to calculate and improve economic efficiency • Share code libraries across library systems to reduce the need to re-solve problems (e.g., UTF-8 handling); the code uses CSV files for easy integration • Visualize network diagrams of the data for assessments of collections

QUESTIONS? Thank you! Harriett Green, green19@illinois.edu Kirk Hess, kirkhess@illinois.edu Richard Hislop, rdhislop@gmail.com
