Amanda Spink : Analysis of Web Searching and Retrieval Larry Reeve INFO861 - Topics in Information Science Dr. McCain - Winter 2004
Background • Amanda Spink • Self-described areas of work: • Information Retrieval • Web Retrieval • Human Information Behavior / Information Seeking • Medical Informatics • Ph.D. 1993 – Rutgers University • Thesis - Feedback in Information Retrieval • Studied under Tefko Saracevic
Background • Amanda Spink • Over 140 papers published • 5th in journal article production • 18th in citation production among U.S. IS faculty • Institute for Scientific Information (ISI) – most highly cited paper in Web Retrieval: • Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web (2000)
Background • Amanda Spink • Associate Professor at University of Pittsburgh • School of Information Sciences • Prior faculty positions • Pennsylvania State University • School of Information Science & Technology • Web Research Group • University of North Texas • School of Library and Information Sciences
Background • Tefko Saracevic • Associate Dean • School of Communication, Information and Library Studies, Rutgers University • Related research • Test and Evaluation of IR systems • Relevance in Information Science • Analysis of web queries
Web Searching and Retrieval • Analyze user queries • Important for building future IR systems on Web • Focus on search terms • Failure analysis in query construction • Term Relevance Feedback (TRF) • Topics / Classification • Use of language
Studies Conducted • U.S. – Excite (www.excite.com) • “51K study” • 51,473 queries • 18,113 users • March 9, 1997 • “1M study” • 1,025,910 queries • 211,063 users • September 16, 1997
Studies Conducted • European - AllTheWeb.com • 1 million queries • 200,000 users • Logs from two days: • February 6, 2001 • May 28, 2002 • Most users from Norway and Germany
Studies Conducted • Issues with Web transaction logs • Where does a session start and end? • Temporal boundary – Spink found 15 min average; others found 5 min, 12 min, 32 min, and 2 hours • Numerical boundary – 100 entries • How to eliminate non-individual users • Meta-search engines, other agents • No insight into the user’s process
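The temporal and numerical boundaries above can be sketched in code. This is a minimal illustration, not the procedure used in the studies: the record layout `(user_id, timestamp, query)`, the function name `split_sessions`, the use of 15 minutes as an idle cutoff, and the reading of the 100-entry boundary as a filter for non-individual users are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def split_sessions(records, gap=timedelta(minutes=15), max_entries=100):
    """Group (user_id, timestamp, query) records into sessions.

    Assumption: a new session starts when a user has been idle longer than `gap`.
    Assumption: sessions with more than `max_entries` entries are treated as
    non-individual users (agents, meta-search engines) and dropped.
    """
    sessions = []
    last_seen = {}  # user_id -> (timestamp of last entry, index of open session)
    for user_id, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user_id)
        if prev is None or ts - prev[0] > gap:
            sessions.append({"user": user_id, "queries": []})
            idx = len(sessions) - 1
        else:
            idx = prev[1]
        sessions[idx]["queries"].append(query)
        last_seen[user_id] = (ts, idx)
    return [s for s in sessions if len(s["queries"]) <= max_entries]
```

Changing `gap` to 5 minutes or 2 hours shows how strongly the session count depends on the chosen boundary, which is the issue the slide raises.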
Findings • Relevance Feedback • Advanced Search Techniques • Term Characteristics • Query Classification • American vs. European
Findings: Relevance Feedback • Term Relevance Feedback (TRF) rarely used • 51K study • 1,597 queries from 823 users (<5% of queries) • Those using TRF had longer sessions • Successful 60% of time • Implications: • Failure rate of 40% may be too high • IR designers could automatically perform TRF
Findings: Relevance Feedback • Mediated searching • 11% of search terms come from TRF • 37% from users, 63% from mediators • 2/3 of TRF contributed positively
Findings: Relevance Feedback • Identified 6 session states • Initial Query, Modified Query, Next Page, New Query, Relevance Feedback, Previous Query • Identified 4 session patterns • Using the 6 session states • Implication: IR designers should accommodate these states and patterns
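One way to picture the six session states is as labels assigned to each log entry in a session. The sketch below is hypothetical: the slides do not say how states were identified, so every rule here is an assumption (an empty entry stands for a relevance feedback request, a repeat of the immediately preceding query stands for a next-page request, a repeat of an earlier query is a previous query, and term overlap with the preceding query marks a modification). The function name `label_states` is also invented for the example.

```python
def label_states(queries):
    """Label each entry of one session with one of the six session states.

    All rules below are illustrative assumptions, not the studies' method.
    """
    labels = []
    seen = []     # normalized non-empty queries seen so far
    prev = None   # most recent non-empty query
    for raw in queries:
        q = " ".join(raw.lower().split())
        if not q:
            # Assumption: an empty entry records a relevance feedback request.
            labels.append("Relevance Feedback")
            continue
        if prev is None:
            label = "Initial Query"
        elif q == prev:
            label = "Next Page"        # same query submitted again
        elif q in seen:
            label = "Previous Query"   # return to an earlier query
        elif set(q.split()) & set(prev.split()):
            label = "Modified Query"   # shares at least one term with the last query
        else:
            label = "New Query"
        labels.append(label)
        seen.append(q)
        prev = q
    return labels
```

A session pattern is then a sequence of these labels, for example Initial Query followed by Next Page followed by Relevance Feedback.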
Findings: Relevance Feedback • [Figure: Relevance Feedback Session Patterns]
Findings: Advanced Search Techniques • Includes: • Boolean operators • Modifiers +, - • Quotes (phrases) • Not often used by Web users, but used more in mediated searching • Boolean <10%, Modifiers 9%, Phrases 6% • Used incorrectly • Boolean: AND: 50%, OR: 28%, AND NOT: 19% • Modifiers: 75% of the time • Phrases: 8% • Users and advanced techniques do not get along!
Findings: Advanced Search Techniques • Boolean, most common problems: • Not capitalizing AND • Confusing the ‘AND’ operator with the ‘and’ conjunction • e.g. Science and Technology vs. Science AND Technology • Modifiers, most common problems: • Modifiers must be written as a prefix, not as mathematical infix • +news +weather rather than news+weather • No space goes between a modifier and its term, whereas Boolean operators require spaces
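The two problem types above lend themselves to a simple check that a search interface could run before submitting a query. This is a rough sketch under stated assumptions: the function name `query_warnings` and the warning strings are invented, and a lowercase "and" cannot be told apart from an intended conjunction, so the check only flags a possible mistake.

```python
import re

def query_warnings(query):
    """Return hints about possible misuse of Boolean operators and modifiers."""
    warnings = []
    # Case-sensitive on purpose: only lowercase operator words are flagged.
    if re.search(r"\b(and|or|not)\b", query):
        warnings.append("lowercase operator word: capitalize it if a Boolean operator is intended")
    # A plus sign squeezed between two terms, as in news+weather.
    if re.search(r"\w\+\w", query):
        warnings.append("infix '+': write the modifier as a prefix, e.g. +news +weather")
    return warnings
```

For example, `query_warnings("Science and Technology")` returns one hint about the lowercase operator word, while `query_warnings("+news +weather")` returns an empty list.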
Findings: Term Characteristics • Terms per query • 1: 26.6%, 2: 31.5%, 3: 18.2%, >7: 1.8% • Mediated searching: 7-15 terms • Distribution of terms not quite Zipf: • Top terms account for 10% of all terms • Single-use terms account for 9% of all terms • Not understood why this occurs
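The figures on this slide are simple counts over the query log, and the sketch below shows how such counts could be produced. It assumes whitespace tokenization and case folding, which the slides do not specify; the function name `term_stats` and the parameter `top_n` (how many terms count as "top terms") are hypothetical.

```python
from collections import Counter

def term_stats(queries, top_n=100):
    """Return (terms-per-query shares, top-term share, single-use-term share).

    Shares of terms are measured against all term occurrences in the log.
    """
    lengths = Counter()  # number of terms in a query -> number of such queries
    terms = Counter()    # term -> number of occurrences
    for q in queries:
        tokens = q.lower().split()
        if not tokens:
            continue
        lengths[len(tokens)] += 1
        terms.update(tokens)
    total_queries = sum(lengths.values())
    total_terms = sum(terms.values())
    if total_terms == 0:
        return {}, 0.0, 0.0
    length_share = {n: c / total_queries for n, c in sorted(lengths.items())}
    top_share = sum(c for _, c in terms.most_common(top_n)) / total_terms
    single_use_share = sum(1 for c in terms.values() if c == 1) / total_terms
    return length_share, top_share, single_use_share
```

Sorting the term counts by rank and plotting them on log-log axes is the usual way to compare such a distribution against Zipf's law, under which frequency falls off roughly in proportion to 1/rank.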
Findings: Query Classification • [Figure: Classification of queries based on Rutgers’ Web Classification]
Findings: Query Classification • What users are looking for is not what is on the Web: • Distribution of content: • 83% Commercial, 6% Educational, 3% Health • Example: 10% of searches are for Health • Searchers find classifications understandable • Relevant to IR system presentation design
Findings: American & European Searching • Commonalities: • Three or fewer terms • American: 80%, European: 85% • Predominantly use English terms • Relevance judgments: less than 15 minutes spent viewing retrieved documents • Information seeking sessions are short
Findings: American & European Searching • Differences • Categories • American: Entertainment, Sex, Commerce • European: People-places-things, Computers, Commerce • American searchers spent more time searching e-commerce sites than European counterparts • Did not examine: • Use of advanced techniques • Relevance feedback • First in initial set of studies?
Findings: Summary • Average number of terms per query is about 2 • TRF is not used often • Boolean operators and modifiers are not used often – users have difficulty using them correctly • Users do not spend much time making relevance judgments • Term frequency distribution: a few terms used often, many terms used only once
Findings: Summary • Most users submitted a single query only and did not follow up with successive queries • Average of 2 result pages viewed • 50% did not go beyond the first page; more than 75% did not go beyond 2 pages
Implications / Further Research • Improve use of advanced search techniques • UI changes, Venn Diagrams • Improve use of relevance feedback • Automatic generation of TRF results • Improve classification of results • UI changes, result overview • Improve understanding of language use • Adapt IR designs to language • Examine cultural differences • TRF, advanced search techniques (same or different)
Amanda Spink - Web Searching and Retrieval • Questions