WIRED - Web Analytics Week • WIRED System Evaluations due now • Web Logs overview • Web Analytics • Understanding Queries • Tracking Users • Web Log Reliability • Web Log Data Mining & KDD
Web Analytics • Evaluation of Web Information Retrieval (& Web Information Seeking) • What can we learn? • Use of IR systems • Web server administration • Who are the users? • Types of users • User situations • How does this affect or help IR?
Web Server Overview • Any application that can serve files using the HTTP protocol • Text, HTML, XHTML, XML… • Graphics • CGI, applets, servlets • Other media & MIME types • e.g., Apache or MS IIS, which primarily serve Web pages • Servers create ASCII text log files showing: • Date, time, bytes transferred, (cache status) • Status/error codes, user IP address, (domain name) • Server method, URI, misc. comments
Web Log Overview • Access Log • Logs information such as the page served and the time it was served • Referer Log • Logs the server and page that linked to the currently served page • Not always present • Can be from any Web site • Agent Log • Logs browser type and operating system • e.g., Mozilla • e.g., Windows
What can we learn from Web logs? • Every time a Web browser requests a file, the request gets logged • Where the user came from • What kind of browser was used to access the server • Referring URL • Every time a page gets served, it gets logged • Request time, serve time, bytes transferred, URI, status code
Web Log Analysis in Action • UT Web log reports (figures in parentheses refer to the 7 days to 28-Mar-2004 03:00) • Successful requests: 39,826,634 (39,596,364) • Average successful requests per day: 5,690,083 (5,656,623) • Successful requests for pages: 4,189,081 (4,154,717) • Average successful requests for pages per day: 598,499 (593,530) • Failed requests: 442,129 (439,467) • Redirected requests: 1,101,849 (1,093,606) • Distinct files requested: 479,022 (473,341) • Corrupt logfile lines: 427 • Data transferred: 278.504 Gbytes (276.650 Gbytes) • Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)
Problems with Web Servers • Actual user or intent not known • Paths difficult to determine • Infrequent access challenging to uncover • No State Information • Server Hits not Representative • Counters inaccurate • DoS attacks, floods, and bandwidth limits can stop “intended” usage • Robots, etc. • ISP Proxy servers • “5.3 Unsound inferences from data that is logged” (Haigh & Megarity, 1998)
Web Server Configuration • Unique file & directory names = “at a glance” analysis • Hierarchical directory structure • Redirect through CGI to find the referrer • Use a database • Store Web content • Record usage data with the context of the content logged • Create state information with programming • Servlets, ActiveX, JavaScript • Custom server or log format • Log rollover, report frequency, special-case testing
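A minimal sketch of the “use a database” idea above, assuming SQLite storage and hypothetical table and field names: each served page is recorded together with the content context it belongs to.

```python
# Sketch only (assumptions: SQLite storage, hypothetical table/field names).
# Illustrates recording usage data with the context of the content served.
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect("usage.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS page_hits (
        ts TEXT, client_ip TEXT, uri TEXT,
        status INTEGER, bytes INTEGER, content_category TEXT
    )
""")

def log_hit(client_ip, uri, status, nbytes, content_category):
    """Record one served page together with its content context."""
    conn.execute(
        "INSERT INTO page_hits VALUES (?, ?, ?, ?, ?, ?)",
        (datetime.now(timezone.utc).isoformat(), client_ip, uri,
         status, nbytes, content_category),
    )
    conn.commit()

log_hit("192.0.2.7", "/10/3/a3-160-e.html", 200, 2308, "catalogue record")
```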
Log File Format • Extended Log File Format - W3C Working Draft WD-logfile-960323
192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 "http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"
• Every server generates slightly different logs • Version & operating system issues • Admin tweaks to log formats • Extended log format most common • WWW Consortium standards (as implemented by Apache)
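A hedged parsing sketch for the sample line above, assuming Python and the Apache-style combined layout (IP, identd, user, timestamp, request, status, bytes, referrer, agent); the field labels are my own, not part of the W3C draft.

```python
# Parse one combined-format log line into named fields.
import re

LINE = ('192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] '
        '"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
        '"http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" '
        '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"')

COMBINED = re.compile(
    r'(?P<ip>\S+) (?P<identd>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

m = COMBINED.match(LINE)
if m:
    hit = m.groupdict()
    print(hit["ip"], hit["uri"], hit["status"], hit["referrer"], hit["agent"])
```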
Let’s Look at some logs • http://www.ischool.utexas.edu/analog-monthly.html • http://www.ischool.utexas.edu/analog-weekly.html
Log Analysis Tools • Analog • Webalizer • Sawmill • WebTrends • AWStats • WWWStat • GetStats • Perl Scripts • Data Mining & Business Intelligence tools
WebTrends • A whole industry of analytics • Most popular commercial application
Measuring Web Site Usage • Now that the Web is a primary source, understanding its use is critical • There are few external cues that a Web site is being used • What - pages and their content/subject • How - browsers • Who - userid or IP • When - trends: daily, weekly, yearly • Where - where the user is and what page they came from
What can’t you measure? • Who the user is (not always knowable) • Whether the user’s needs have changed • Whether they’re using the information • Browsing vs. reading vs. acting on the information • Changes to the site and how they affect each user • Pages not used at all - and why
Analysis of a Very Large Search Log • What kinds of patterns can we find? • Request = query and results page • 280 GB – Six Weeks of Web Queries • Almost 1 Billion Search Requests, ~850M valid, ~575M queries • 285 Million User Sessions (cookie issues) • Large volume, less subject to short-term trends • Why are unique queries important? • Web Users: • Use Short Queries in short sessions - 63.7% one request • Mostly Look at the First Ten Results only • Seldom Modify Queries • Traditional IR Isn’t Accurately Describing Web Search • Phrase Searching Could Be Augmented • Silverstein, Henzinger, Marais, & Moricz (1998)
Analysis of a Very Large Search Log • 2.35 Average Terms Per Query • 0 terms = 20.6% (empty queries) • 1 term = 25.8% • 2 terms = 26.0% (72.4% combined) • Operators Per Query • 0 = 79.6% • Terms Predictable • First Set of Results Viewed Only = 85% • Some (Single-Term Phrase) Query Correlation • Augmentation • Taxonomy Input • Robots vs. Humans
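Not the authors’ code, just a sketch of the kind of tabulation behind figures like these: terms per query, operator use, and repeated queries, computed over a tiny hypothetical query sample.

```python
# Tabulate terms per query, operator use, and repeat queries.
from collections import Counter

OPERATORS = {"AND", "OR", "NOT", "NEAR"}

queries = ["jaguar", "jaguar speed", "", "cheap flights austin",
           "jaguar AND car", "jaguar speed"]

term_counts = Counter()   # how many queries have n terms
op_counts = Counter()     # how many queries use n operators
seen = Counter()          # how often each query string recurs

for q in queries:
    terms = q.split()
    term_counts[len(terms)] += 1
    op_counts[sum(1 for t in terms if t.upper() in OPERATORS or t[:1] in "+-")] += 1
    seen[q.lower()] += 1

total = len(queries)
for n, c in sorted(term_counts.items()):
    print(f"{n} terms: {c / total:.1%} of queries")
print(f"queries with no operators: {op_counts[0] / total:.1%}")
print("queries seen more than once:", sum(1 for c in seen.values() if c > 1))
```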
Web Analytics and IR? • Knowing access patterns of users • Lists of search terms • Numbers of words • Words and concepts to add (synonyms) • Types of queries • Success of searching a site • Was a result link clicked on? • How many pages per user after a search? • Is a new or better search interface needed?
Real Life Information Retrieval • 51K Queries from Excite (1997) • Average Search Terms per Query = 2.21 • Number of Terms • 1 = 31%, 2 = 31%, 3 = 18% (80% combined) • Logic & Modifiers (by User) • Infrequent • AND, “+”, “-” • Logic & Modifiers (by Query) • 6% of Users • Less Than 10% of Queries • Lots of Mistakes • Uniqueness of Queries • 35% successive • 22% modified • 43% identical
Real Life Information Retrieval • Queries per User = 2.8 • Sessions • Flawed Analysis (User ID) • Some Revisits to Query (Result Page Revisits) • Page Views • Accurate, but not by User • Use of Relevance Feedback (“more like this”) • Not Used Much (~11%) • Terms Used Were Typical & Frequent • Mistakes • Typos • Misspellings • Bad (Advanced) Query Formulation • Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)
KDD for Extracting Knowledge • Knowledge extraction, information discovery, information extraction, data archeology, data pattern processing, OLAP, HV statistical analysis • Sounds as if “knowledge” is there to be found. • User and usage context help find the knowledge • Hypothesis before analysis • Why KDD, why now? • Data storage, analysis costs • Visualization
KDD Process • Database for structured data and queries • How it is structured, algorithms for queries • How results can be understood and visualized • Iterative & interactive, hypothesis-driven & hypothesis-generating
KDD Efforts • Data Cleaning • Formulating the Questions • “Finding useful features to represent the data” (p. 30) • Models: • Classification to fit data into pre-defined classes • Regression to fit predictions & values • Clustering to find classes/sets in the data • Summarization to briefly describe data • Dependency discovery of variable relationships • Sequence analysis for time or interaction patterns
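A small illustration of one model type from the list above (clustering), assuming scikit-learn is available; the session features and cluster count are hypothetical.

```python
# Group sessions by how their hits spread across site areas.
from sklearn.cluster import KMeans

# rows = sessions; columns = hits in [catalogue, help pages, search pages]
sessions = [
    [12, 0, 1],
    [10, 1, 2],
    [0, 5, 0],
    [1, 6, 0],
    [3, 3, 9],
]

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(sessions)
for row, label in zip(sessions, model.labels_):
    print(row, "-> cluster", label)
```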
Data Prep for Mining the WWW • Processing the data before mining • WEBMINER system - site topology • Cleaning • User identification • Session identification (episodes) • Path completion
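A minimal sessionization sketch, not WEBMINER itself: users are approximated by (IP, user agent) and sessions are cut at a 30-minute inactivity gap; the hit records are hypothetical.

```python
# Identify users and split their hits into sessions by inactivity gap.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)

hits = [
    ("192.0.2.7", "MSIE 3.01", datetime(1998, 7, 24, 0, 0, 4), "/index.html"),
    ("192.0.2.7", "MSIE 3.01", datetime(1998, 7, 24, 0, 2, 10), "/10/3/a3-160-e.html"),
    ("192.0.2.7", "MSIE 3.01", datetime(1998, 7, 24, 1, 15, 0), "/index.html"),
]

sessions = {}          # (ip, agent) -> list of sessions, each a list of (time, uri)
for ip, agent, t, uri in sorted(hits, key=lambda h: h[2]):
    user = (ip, agent)
    user_sessions = sessions.setdefault(user, [])
    if not user_sessions or t - user_sessions[-1][-1][0] > TIMEOUT:
        user_sessions.append([])            # start a new session
    user_sessions[-1].append((t, uri))

for user, user_sessions in sessions.items():
    print(user, "->", len(user_sessions), "sessions")
```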
Web Usage Mining • VL Verification • Data Mining to Discover Patterns of Use • Pre-Processing • Pattern Discovery • Pattern Analysis • Site Analysis, Not User Analysis • Srivastava, J., Cooley, R., Deshpande, M., & Tan, P. N. (2000)
Web Usage Discovery • Content • Text • Graphics • Features • Structure • Content Organization • Templates and Tags • Usage • Patterns • Page References • Dates and Times • User Profile • Demographics • Customer Information
Web Usage Collection • Types of Data • Web Servers • Proxies • Web Clients • Data Abstractions • Sessions • Episodes • Clickstreams • Page Views • The Tools for Web Use Verification
Web Usage Preprocessing • Usage Preprocessing • Understanding the Web Use Activities of the Site • Extract from Logs • Content Preprocessing • Converting Content Into Formats for Processing • Understanding Content (Working with Dev Team) • Structure Preprocessing • Mining Links and Navigation from Site • Understanding Page Content and Link Structures
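A sketch of the structure-preprocessing step, assuming Python’s standard-library HTML parser: pull outgoing links from a page so the site’s link structure can be mined; the page content here is made up.

```python
# Extract outgoing links (one page's edges in the site link graph).
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page = '<html><body><a href="/search">Search</a> <a href="/help">Help</a></body></html>'
parser = LinkExtractor()
parser.feed(page)
print(parser.links)    # edges from this page in the site's link structure
```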
Web Usage Pattern Discovery • Clustering for Similarities • Pages • Users • Links • Classification • Mapping Data to Pre-defined Classes • Rule Discovery • Association Rules • Computation Intensive • Many Paths to Similar Answers • Pattern Detection • Ordering By Time • Predicting Use With Time
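A toy version of rule discovery over sessions: count page pairs that occur together and report their support; the session data and the support threshold are hypothetical.

```python
# Find page pairs that co-occur in sessions above a minimum support.
from itertools import combinations
from collections import Counter

sessions = [
    {"/home", "/search", "/results"},
    {"/home", "/search"},
    {"/home", "/about"},
    {"/search", "/results", "/item42"},
]

pair_counts = Counter()
for pages in sessions:
    for pair in combinations(sorted(pages), 2):
        pair_counts[pair] += 1

n = len(sessions)
for pair, count in pair_counts.most_common():
    support = count / n
    if support >= 0.5:                      # minimum support threshold
        print(f"{pair[0]} & {pair[1]}: support {support:.0%}")
```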
Web Usage Mining as Evaluation? • Mining Goals • Improved Design • Improved Delivery • Improved Content • Personalization (XMod Data) • System Improvement (Tech Data) • Site Modification (IA Data) • Business Intelligence (Market Data) • Usage Characterization (User Behavior Data)
Web Analytics Wrap-up • What can we learn about users? • What can we learn about services? • How can we help users improve their use? • How can IR models benefit from this analysis? • What kind of improvements to Web IR systems and their interfaces can be taken from this?