This research project focuses on large-scale web data analysis from 1997-2007, using transaction logs and search engine data. The goal is to gain insights, understand user behavior, and improve interface/system design and user training.
Web Research - Large-Scale Web Data Analysis • Amanda Spink, Queensland University of Technology • Jim Jansen, The Pennsylvania State University
Web Data Analysis 1997-2007 • Track Web search trends and characteristics • Web query transaction logs collected in 1997, 1999, 2001, 2003, 2004, 2005 and 2006 • Combined dataset of 20 million+ Web searches
Web Search Studies • Web search engines: - Alta Vista - Ask Jeeves - Excite - AlltheWeb - Vivisimo - Dogpile • Transaction log analysis studies • Focus on user search analysis for competitive advantage
Data Collection Methods • Various combinations of methods and approaches • Transaction log analysis • Videotaping and Audio-taping • Think aloud protocols • Usability – HCI techniques • Focus groups • Interviews • Survey • Experiments • Diaries
Data Analysis Methods • Quantitative and statistical analysis • Qualitative analysis – grounded theory • Combination of both methods
Key Issues – Search Studies What is the goal of the project? • Insights, understanding and theory development • User modeling • Trend analysis • Interface/system design • User training
Key Issues – Search Studies • What variables to measure? • How much data is enough? • Methods used – single or multiple? • HCI approach – test interface/system features
Transaction Log Analysis (TLA) • File or log of communications between user and system • File recorded on the server (server-side recording) • Log or file formats vary, but some fields are common to most (e.g., IP address, cookie, time stamp, query, vertical, click-through)
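The common fields listed above can be read into a typed record. The following is a minimal sketch, assuming a tab-delimited line with the fields in the order shown; the field order, the timestamp format, and the names `LogRecord` and `parse_line` are illustrative assumptions, since real log formats vary by search engine.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Assumed field order for a tab-delimited log line; real logs differ by engine.
FIELDS = ("ip", "cookie", "timestamp", "query", "vertical", "click_through")


@dataclass
class LogRecord:
    ip: str
    cookie: str
    timestamp: datetime
    query: str
    vertical: str
    click_through: Optional[str]  # None when no result was clicked


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one tab-delimited line; return None if it is malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != len(FIELDS):
        return None
    ip, cookie, raw_time, query, vertical, click = parts
    try:
        # Assumed timestamp layout, e.g. "2005-05-06 10:15:30".
        timestamp = datetime.strptime(raw_time, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(ip, cookie, timestamp, query, vertical, click or None)


sample = "10.0.0.1\tabc123\t2005-05-06 10:15:30\tcheap flights\tweb\thttp://example.com/"
record = parse_line(sample)
```

Returning `None` for malformed lines lets a later cleaning step count and discard corrupted records instead of stopping the whole run.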
Why Collect and Analyze Log Data? • Gain understanding of user interaction with system and interface • Goal to improve system and interface design, and improve user training. • Transaction log analysis is extensively used in academia and industry
TLA Process • Goals and objectives • Data collection • Log preparation • Data analysis • Making sense
Data Collected • Process of collecting the interaction data for a given period in a transaction log • Collect data on the search episode • User identification • Date • Time • Search session content • Resources accessed (e.g., URLs)
Logging Software • Custom and commercial applications (the Wrapper - http://ist.psu.edu/faculty_pages/jjansen/academic/wrapper.htm) • WinWhatWhere spy software • Morae 1.1 software • Camtasia Studio
Data Preparation • Process of cleaning and preparing the log data for analysis • Loading the log data into a relational database • Cleaning the log (removing corrupted data) • Parsing the log (e.g., removing Web sessions identified as agents) • Normalizing the log
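The preparation steps above can be sketched with Python's built-in `sqlite3` module. This is one possible arrangement, not the authors' procedure: the table layout, the rule that a record with a missing field counts as corrupted, and the cutoff of 100 queries for flagging a user as an automated agent are all assumptions made for illustration.

```python
import sqlite3

AGENT_QUERY_THRESHOLD = 100  # assumed cutoff; choose one that fits the log


def normalize_query(query: str) -> str:
    """Lowercase the query and collapse runs of whitespace."""
    return " ".join(query.lower().split())


def prepare_log(rows):
    """Load (user_id, timestamp, query) tuples into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (user_id TEXT, ts TEXT, query TEXT)")
    # Cleaning: drop records with a missing field (treated here as corrupted),
    # and normalize the query text of the records that remain.
    clean = [
        (user_id, ts, normalize_query(query))
        for user_id, ts, query in rows
        if user_id and ts and query and query.strip()
    ]
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", clean)
    # Parsing: remove users whose query count suggests an automated agent.
    conn.execute(
        "DELETE FROM log WHERE user_id IN ("
        "SELECT user_id FROM log GROUP BY user_id HAVING COUNT(*) > ?)",
        (AGENT_QUERY_THRESHOLD,),
    )
    conn.commit()
    return conn
```

Keeping the prepared log in a relational table means the later term, query and session analyses can each be written as a query over the same cleaned data.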
Log Analysis – Three Levels • Term • Query • Session
Term Level Analysis • Term occurrence • Total terms • High and low usage terms • Term distribution • Co-occurring terms
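Term-level measures such as term occurrence, total terms and co-occurring terms can be computed with standard counters. The sketch below assumes that terms are whitespace-separated and lowercased, which is a simplification; the function name `term_level_stats` and the sample queries are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def term_level_stats(queries):
    """Return (term counts, co-occurring term pair counts) for a list of queries."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Count each unordered pair of distinct terms once per query.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    return term_counts, pair_counts


queries = ["cheap flights", "cheap hotels", "cheap flights london"]
term_counts, pair_counts = term_level_stats(queries)
total_terms = sum(term_counts.values())   # all term occurrences in the log
unique_terms = len(term_counts)           # distinct terms
high_usage = term_counts.most_common(2)   # most frequently used terms
```

Sorting the terms before pairing them means ("cheap", "flights") and ("flights", "cheap") are counted as the same pair.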
Query Level Analysis • Initial query • Subsequent queries • Modified queries and query reformulation • Identical queries • Query complexity • Boolean use • Spelling • Types of queries • Query topics
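Several of the query-level measures above can be approximated by comparing each query with the one before it in the same session. The labels and rules below are assumptions chosen for illustration: a query with the same set of terms as the previous one is called identical, one that shares at least one term is called modified, and Boolean use is detected only for uppercase AND, OR and NOT.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def uses_boolean(query):
    """True if the query contains an uppercase Boolean operator as a token."""
    return any(token in BOOLEAN_OPERATORS for token in query.split())


def classify_queries(session_queries):
    """Label each query in one session relative to the query before it."""
    labels = []
    previous_terms = None
    for query in session_queries:
        terms = set(query.lower().split())
        if previous_terms is None:
            labels.append("initial")
        elif terms == previous_terms:
            labels.append("identical")   # same set of terms as before
        elif terms & previous_terms:
            labels.append("modified")    # shares at least one term
        else:
            labels.append("new")         # no terms in common
        previous_terms = terms
    return labels


session = ["cheap flights", "cheap flights", "cheap flights london", "weather"]
labels = classify_queries(session)
```

Term overlap is only a rough signal of reformulation; a qualitative review of sampled sessions would still be needed to judge why users changed their queries.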
Query Subjects – Alta Vista 2002 & Vivisimo 2004
Alta Vista 2002:
1. People/Places 49.2%
2. Commerce, etc. 12.5%
3. Computers, etc. 12.4%
4. Health/sciences 7.4%
5. Education/Humanities 5%
6. Entertainment, etc. 4.5%
7. Sex/Pornography 3.2%
8. Society/Culture, etc. 3.1%
9. Government 1.5%
10. Performing/Fine Arts 0.6%
Vivisimo 2004:
1. Commerce, etc. 21%
2. Indiscernible 19%
3. People/Places, etc. 15%
4. Computers/Internet 13%
5. Social/Culture 9%
6. Health/Sciences 6%
7. Education/Humanities 5%
8. Sex/Pornography 4%
9. Performing/Fine Arts 3%
10. Government 3%
11. Entertainment, etc. 2%
Web Search Session Level Analysis • Search duration • Search patterns • Successive and multitasking sessions • Page or resource viewing
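Session duration can be derived from time stamps once records are grouped by user. The sketch below assumes that a gap of more than 30 minutes between two records from the same user starts a new session; that cutoff, and the function names, are illustrative assumptions rather than the definition used in the studies.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity cutoff


def session_durations(records):
    """records: iterable of (user_id, datetime). Returns a list of timedeltas."""
    by_user = defaultdict(list)
    for user_id, timestamp in records:
        by_user[user_id].append(timestamp)
    durations = []
    for times in by_user.values():
        times.sort()
        start = last = times[0]
        for timestamp in times[1:]:
            if timestamp - last > SESSION_GAP:
                # Gap too long: close the current session and start another.
                durations.append(last - start)
                start = timestamp
            last = timestamp
        durations.append(last - start)
    return durations


def share_shorter_than(durations, minutes):
    """Fraction of sessions whose duration is below the given number of minutes."""
    limit = timedelta(minutes=minutes)
    return sum(d < limit for d in durations) / len(durations)


records = [
    ("u1", datetime(2005, 5, 6, 10, 0)),
    ("u1", datetime(2005, 5, 6, 10, 2)),
    ("u1", datetime(2005, 5, 6, 11, 0)),
    ("u2", datetime(2005, 5, 6, 9, 0)),
]
durations = session_durations(records)
```

A session with a single record has a duration of zero under this definition, which is one reason short sessions can dominate the distribution.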
Web Session Duration (Minutes) • 56% of sessions less than 1 minute • 72% of sessions less than 5 minutes • 81% of sessions less than 15 minutes • Mean: approx. 58 minutes and 2 seconds (see Jansen, B. J., Spink, A., and Koshman, S. 2007. Web searcher interactions with the Dogpile.com meta-search engine. Journal of the American Society for Information Science and Technology. 58(5), 744-755.)
Transaction Log Analysis (TLA) Methods • Quantitative and statistical analysis – requires software and expertise • Qualitative analysis – requires training • Creativity factor • Combination of quantitative and qualitative methods
TLA Strengths • Data from a large user base • Reasonable and non-intrusive • Less time than other methods • Can be relatively inexpensive
TLA Limitations • Transaction logs do not include user demographic and other data • Lacks data on search reasons and motivations • Incomplete data due to corrupted logging
Conclusions • Search analysis is a complex process with many choices • TLA is a powerful tool • Requires planning, training and expertise • Can be combined with other data collection and analysis techniques
Further Reading • Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Springer. • Jansen, B. J. (2006). Search log analysis: What is it; what's been done; how to do it. Library and Information Science Research, 28(3), 407-432. • Jansen, B. J., Spink, A., & Taksa, I. (forthcoming). Handbook of Web Log Analysis. Idea Group Publishing.
QUESTIONS? Thank You