Unlocking New Value from Content Mining JSTOR Usage Data Ron Snyder Director of Advanced Technology, JSTOR NFAIS 2013 Annual Conference February 25, 2013
Who we are ITHAKA is a not-for-profit organization that helps the academic community use digital technologies to preserve the scholarly record and to advance research and teaching in sustainable ways. We pursue this mission by providing innovative services that aid in the adoption of these technologies and that create lasting impact. JSTOR is a research platform that enables discovery, access, and preservation of scholarly content.
JSTOR Archive Stats • JSTOR archive • Started in 1997 • Journals online: 1,700+ • Documents online: 8.4 million • Includes journals, books, and primary sources • 50 million pages (2.5 miles of shelf space) • Disciplines covered: 70+ • Participating institutions: 8,000+ • Countries with participating institutions: 164
JSTOR site activity • User sessions (visits): 661K per day, 1.3M peak • New visits per hour: 38K average, 70K peak • Simultaneous users: 21K average, 44K peak • Page views: 3.4M per day, 7.9M peak • Content accesses: 430K per day, 850K peak • Searches: 456K per day, 1.13M peak
Accumulated logging data • 2 billion visits/sessions • 10 billion page views • 1.25 billion searches • 600 million Google & Google Scholar searches • 580 million PDF downloads
Using logging data to better understand our users and their needs • Our goal is to go from a fragmented and incomplete understanding of users… to a clear(er) understanding of user behaviors and needs
A few things we want to know more about… • How many distinct users? • Where are they coming from? • What content are they accessing? • What content are they unsuccessful in accessing? • How effective are our discovery tools at helping users find content? • How are external discovery tools used? • How do users arrive at content items? • What are users' content-consumption behaviors and preferences? • etc., etc.
Why? • User expectations and a more competitive environment have raised the bar significantly on the need for analytics • Successful enterprises need to use data effectively • The growing volume and diversity of our content have made it harder for users to find what they need
Data reporting/analysis capabilities - 2010 • Predefined reports from the delivery platform • COUNTER reports, primarily intended for participating institutions and participants • Limited in scope, batch oriented • Ad-hoc querying and reporting • SQL queries against the delivery system's relational database • Limited capabilities; difficult to produce reports/analyses combining usage data with content/business data • Turnaround time on requests was typically days or weeks (if there was even bandwidth available) • Limited capacity for longitudinal studies and trend analysis • 2 delivery systems with incompatible logging • Legacy system: 1997 – April 2008 • Current system: April 2008 – present
The problem to be addressed • Our ability to improve services for our users was hampered by weak and inefficient analysis capabilities • Analytics tools and staffing had not kept pace with the volume and complexity of our usage data
What we’re doing about it… • Initiated a data warehouse project at the end of 2010 • Initial focus on ingesting, normalizing, and enriching usage data • Key objective has been to improve access to dimensional usage data for staff, both technical and non-technical • Designed for flexibility and scalability • Increased analytics staffing • Formed Analytics team • Hired a Data Scientist • What is a Data Scientist? Someone skilled in statistical methods/tools, data modeling, data visualization, programming, …
Benefits of improved data and analytics • Personalized user-centric features • Content development • Informed decisions on content additions and enrichment • Outreach and participation • Better matching of subscriptions with institution needs • Improved discovery • Better search relevancy, recommendations • Improved platform features • Fewer actions required by users to find the content they need • Improved access for users (both affiliated and unaffiliated) • More avenues for content access • Tools assisting affiliated users in accessing content when off campus
Our data warehouse project • The project consists of 4 main areas of work, largely performed in parallel • Infrastructure building • Building a flexible and scalable infrastructure for storing and processing Big Data • Data ingest, normalization, and enrichment • ETL (extract, transform, load) of data from many data sources • Tool development • Integrating and building tools supporting a wide range of uses and user capabilities • Analysis and reporting • Using the infrastructure, data, and tools while under development
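To make the ingest/normalization/enrichment work concrete, here is a minimal sketch of a log-enrichment step in Python. The raw log format, field names, and the geo_lookup callable are illustrative assumptions, not JSTOR's actual pipeline.

```python
import json
from datetime import datetime

def enrich_event(raw_line, geo_lookup):
    """Normalize one raw usage-log line and add derived dimensions.

    The field names and the geo_lookup callable are illustrative
    assumptions; the real warehouse schema is much richer.
    """
    event = json.loads(raw_line)
    ts = datetime.fromisoformat(event["timestamp"])
    event.update({
        "year": ts.year,              # time at multiple resolutions
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "geo": geo_lookup(event.get("ip", "")),   # country/region/city
        "is_denied": event.get("status") == "denied",
    })
    return event

# Toy usage with a stubbed geolocation lookup
if __name__ == "__main__":
    line = ('{"timestamp": "2012-03-05T14:22:31", "ip": "192.0.2.10", '
            '"status": "ok", "action": "pdf_download"}')
    print(enrich_event(line, geo_lookup=lambda ip: {"country": "US"}))
```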
Challenges • Big Data problems • Many billions of usage events with dozens of attributes (some with thousands of unique values) • Conventional approaches using RDBMS technologies did not prove well suited for this scale and complexity • Locating and building feeds from authoritative data sources • Redundant and sometimes contradictory data • Poor/non-existent history in many sources • Domain knowledge • Data validation and integrity • No ground truth in many instances • Budget!
Not only Big… but Rich Data as well • For each session / action we want to know things like: • IP address, user agent • Referrer • Action type • Status – successful or denied access (and reasons for deny) • Geographic location (country, region, city, Lat/Lon) • User identities (institution, society, association, MyJSTOR) • User affiliations • Attributes of accessed content • Journal, article, collections, disciplines, publication date, language, release date, publisher, authors • Time (at different resolutions, year, month, day, hour) • Preceding/succeeding actions linked (for click stream analysis) • Search terms used • and many more…
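As a rough illustration, an enriched session action could be modeled with a record type like the one below. The field names are assumptions drawn from the attribute list above; the production schema carries dozens more attributes.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class UsageEvent:
    """One enriched session action, mirroring the attributes listed above.

    Illustrative only: the production Hive schema is far wider, with some
    attributes having thousands of distinct values.
    """
    ip_address: str
    user_agent: str
    referrer: Optional[str]
    action_type: str                        # e.g. search, view, pdf_download
    access_granted: bool
    deny_reason: Optional[str] = None
    country: Optional[str] = None
    institution: Optional[str] = None
    journal: Optional[str] = None
    disciplines: List[str] = field(default_factory=list)
    publication_year: Optional[int] = None
    search_terms: List[str] = field(default_factory=list)
    prev_action_id: Optional[str] = None    # for click-stream linking
    next_action_id: Optional[str] = None
```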
ITHAKA usage data warehouse [diagram] • Inputs: logging data, content metadata, licensing data, eCommerce data, geolocation & DNS data, Register & Read data • Outputs: beta search, analysis, reporting
Our first attempt… • Our initial approach consisted of: • RDBMS (MySQL) • Star schema to implement a multi-dimensional database • Use of a BI tool, such as Pentaho, for OLAP cube generation • Problems encountered included: • Generally poor performance in ETL processing • Table locking issues • Long processing timelines • Relatively poor query performance (still much better than the operational DB, though) • Concerns about the long-term scalability and flexibility of the approach
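For context, a minimal sketch of the kind of star schema this first attempt used: one fact table of usage events joined to dimension tables, with an OLAP-style roll-up query. Table and column names are illustrative, and sqlite3 stands in for MySQL here.

```python
import sqlite3

# Minimal star schema: a usage-event fact table keyed to dimension tables.
# Names are illustrative assumptions, not the actual MySQL schema.
DDL = """
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INT, month INT, day INT);
CREATE TABLE dim_content (content_key INTEGER PRIMARY KEY, journal TEXT, discipline TEXT, pub_year INT);
CREATE TABLE dim_institution (inst_key INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE fact_usage (
    event_id INTEGER PRIMARY KEY,
    date_key INT REFERENCES dim_date(date_key),
    content_key INT REFERENCES dim_content(content_key),
    inst_key INT REFERENCES dim_institution(inst_key),
    action_type TEXT,
    access_granted INT
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

# An OLAP-style roll-up: content accesses per discipline per year.
cursor = conn.execute("""
    SELECT c.discipline, d.year, COUNT(*) AS accesses
    FROM fact_usage f
    JOIN dim_content c ON f.content_key = c.content_key
    JOIN dim_date d ON f.date_key = d.date_key
    GROUP BY c.discipline, d.year
""")
print(cursor.fetchall())
```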
Our technology solution • Open source
Architecture: Why Hadoop? • Rapidly becoming the architecture of choice for Big Data • Modeled after Google's GFS and BigTable infrastructures • Used by Facebook, Yahoo, and many others • Open source • Large and vibrant developer community • Designed from the ground up for scalability and robustness • Fault-tolerant data storage • High scalability • Large (and growing) ecosystem • HDFS – distributed, fault-tolerant file system • Hive – supports high-level queries similar to SQL • HBase – column-oriented data store (like BigTable), supporting billions of rows with millions of columns
Architecture: Why SOLR? • HDFS and Hive provide significant improvements in query times over an equivalent relational database representation • But still not good enough for an interactive application • SOLR is a highly scalable search server supporting fast queries of large datasets • Uses Lucene under the covers • Provides faceting and statistical aggregations • Scales linearly • Hadoop + SOLR is a logical fit for managing Big Data • SOLR is a proven technology with a vibrant OSS community behind it • We've been using it for a number of years for R&D projects such as our Data for Research site (DfR) – http://dfr.jstor.org
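A hedged sketch of what querying the warehouse through Hive can look like from Python, using the PyHive client. The host, the table name usage_events, and the columns are assumptions, not the actual deployment.

```python
# Sketch only: assumes a running HiveServer and a table named usage_events.
from pyhive import hive

conn = hive.Connection(host="hive-gateway.example.org", port=10000,
                       username="analyst")
cursor = conn.cursor()

# Monthly content accesses by country -- the kind of roll-up that is slow
# or impractical against the operational RDBMS.
cursor.execute("""
    SELECT country, year, month, COUNT(*) AS accesses
    FROM usage_events
    WHERE action_type = 'content_access'
    GROUP BY country, year, month
""")
for country, year, month, accesses in cursor.fetchall():
    print(country, year, month, accesses)
```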
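A hedged example of the kind of faceted query SOLR answers in seconds, using Solr's standard HTTP select API via the requests library. The host, core name (usage), and field names are assumptions.

```python
# Sketch only: assumes a Solr core named "usage" with country and
# action_type fields indexed.
import requests

params = {
    "q": "action_type:search",
    "rows": 0,                  # we only want facet counts, not documents
    "wt": "json",
    "facet": "true",
    "facet.field": "country",
    "facet.limit": 10,
}
resp = requests.get("http://solr.example.org:8983/solr/usage/select",
                    params=params, timeout=30)
counts = resp.json()["facet_counts"]["facet_fields"]["country"]
# Solr returns facets as a flat [value, count, value, count, ...] list
for value, count in zip(counts[::2], counts[1::2]):
    print(value, count)
```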
Usage data warehouse architecture [diagram] • JSTOR users interact with the public delivery system (www.jstor.org); its log data and the master logging/reporting DB feed the data warehouse via daily ETL • Legacy system logging data (1997-2008) remains to be ingested (TODO); content metadata and business data are loaded with daily updates • The warehouse is built on the Hadoop Distributed File System (HDFS) across data nodes, with Hive and a SOLR index on top • Tools span a range of ease of use: a web app and tools optimized for web-oriented interactive use serve non/semi-technical users and power users, while off-line, batch-oriented tools serve programmers & data analysts
Progress to date • Production Hadoop and SOLR clusters in place • 25 Hadoop servers • 11 SOLR servers (~3.6 billion index records) • Usage data from April '08 through present has been ETL'd and is available in the warehouse • Represents about 55% of JSTOR historical log data • Hive tables containing enriched usage data have been developed • Highly de-normalized for optimal performance • Hive tables can be queried from a Web interface • Web tool developed for data exploration, filtering, aggregation, and export • Usable by non-technical users • Intended to provide an 80% solution for common organizational usage-data needs
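As an illustration of "highly de-normalized," a partitioned Hive table for enriched usage events might be declared roughly as below. The table, columns, and storage format are assumptions for the sketch, not the production schema.

```python
# Sketch only: declares a denormalized, partitioned Hive table via PyHive.
from pyhive import hive

# Daily ETL would append new (year, month) partitions to this table.
DDL = """
CREATE TABLE IF NOT EXISTS usage_events (
    session_id     STRING,
    action_type    STRING,
    access_granted BOOLEAN,
    country        STRING,
    institution    STRING,
    journal        STRING,
    discipline     STRING,
    pub_year       INT,
    search_terms   ARRAY<STRING>
)
PARTITIONED BY (year INT, month INT)
STORED AS RCFILE
"""

conn = hive.Connection(host="hive-gateway.example.org", port=10000,
                       username="analyst")
conn.cursor().execute(DDL)
```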
Analytics environment - 2012 • Query performance improvements: • Platform RDBMS – hours to days (programmer required) • HDFS/MapReduce – minutes to hours (programmer required) • Hive – minutes to hours (power user) • Web tool (backed by SOLR index) – seconds to minutes (casual user)
Referrer analysis: where JSTOR 'sessions' originated | Jan 2011 – Dec 2011
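A sketch of how sessions might be bucketed by referrer for an analysis like this. The bucket names and matching rules are illustrative, not the categories used in the actual report.

```python
# Illustrative referrer bucketing for a session's first request.
from urllib.parse import urlparse

def classify_referrer(referrer: str) -> str:
    if not referrer:
        return "direct / bookmarked"
    host = urlparse(referrer).netloc.lower()
    if "scholar.google" in host:
        return "Google Scholar"
    if "google." in host:
        return "Google"
    if host.endswith("jstor.org"):
        return "internal"
    return "other external site"

print(classify_referrer("http://scholar.google.com/scholar?q=economics"))
print(classify_referrer(""))
```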
Site Search Activity, by type: 6.3M sessions, 19.8M searches (Mar 5 – Apr 16, 2012)
Search Pages Viewed: 1.3 search results pages viewed per search
Click-Through Rates by Search Position: JSTOR, ~20M searches from March 5 – April 16
Click-Through Rates by Search Position • First 2 pages of search results
Click-Through Rates by Search Position • Positions 10-75 • Note the relative increase in CTR for the last 2 items on each page
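For reference, click-through rate by result position can be computed from click events roughly as follows. The input format and the toy numbers are assumptions, not JSTOR data.

```python
# Illustrative CTR-by-position calculation from clicked result positions.
from collections import Counter

def ctr_by_position(clicks, total_searches):
    """clicks: iterable of 1-based result positions that were clicked."""
    counts = Counter(clicks)
    return {pos: counts[pos] / total_searches for pos in sorted(counts)}

# Toy data: 10 searches, clicks concentrated on the first positions, with a
# small bump at position 25 (the last item on a 25-result page).
clicks = [1, 1, 1, 2, 2, 3, 5, 10, 25, 25]
for pos, rate in ctr_by_position(clicks, total_searches=10).items():
    print(f"position {pos}: CTR {rate:.0%}")
```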
What’s next? • Integrate Business Intelligence Reporting tool with Hadoop/Hive • Good open source options such as Pentaho, Jasper, BIRT • Commercial options also under consideration • Ingest and normalize the pre-2008 log data • Expand data sharing with institutions and publishers • More “push” reporting and dashboards • More data modeling and deep analysis