
WIRED - Web Analytics Week



Presentation Transcript


  1. WIRED - Web Analytics Week • WIRED System Evaluations due now • Web Logs overview • Web Analytics • Understanding Queries • Tracking Users • Web Log Reliability • Web Log Data Mining & KDD

  2. Web Analytics • Evaluation of Web Information Retrieval (& Web Information Seeking) • What can we learn? • How IR systems are used • Web server administration • Who are the users? • Types of users • User situations • How does it affect or help IR?

  3. Web Server Overview • Any application that can serve files using the HTTP protocol • Text, HTML, XHTML, XML… • Graphics • CGI, applets, servlets • Other media & MIME types • e.g., Apache or MS IIS, which primarily serve Web pages • Servers create ASCII text log files showing: • Date, time, bytes transferred, (cache status) • Status/error codes, user IP address, (domain name) • Server method, URI, misc. comments

  4. Web Log Overview • Access Log • Logs information such as the page served and the time it was served • Referer Log • Logs the name of the server and page that linked to the currently served page • Not always present • Can be from any Web site • Agent Log • Logs browser type and operating system • e.g., Mozilla • e.g., Windows

  5. What can we learn from Web logs? • Every time a Web browser requests a file, the request gets logged • Where the user came from • What kind of browser was used to access the server • Referring URL • Every time a page gets served, it gets logged • Request time, serve time, bytes transferred, URI, status code

  6. Web Log Analysis in Action • UT Web log reports (figures in parentheses refer to the 7 days to 28-Mar-2004 03:00):
  • Successful requests: 39,826,634 (39,596,364)
  • Average successful requests per day: 5,690,083 (5,656,623)
  • Successful requests for pages: 4,189,081 (4,154,717)
  • Average successful requests for pages per day: 598,499 (593,530)
  • Failed requests: 442,129 (439,467)
  • Redirected requests: 1,101,849 (1,093,606)
  • Distinct files requested: 479,022 (473,341)
  • Corrupt logfile lines: 427
  • Data transferred: 278.504 Gbytes (276.650 Gbytes)
  • Average data transferred per day: 39.790 Gbytes (39.521 Gbytes)
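A minimal sketch of how counts like these could be tallied from a raw access log. The file name access.log is a placeholder, and the field positions assume the combined-format sample shown on the Log File Format slide:

```python
# Sketch: tally analog-style summary counts from an access log.
# Assumes one request per line in combined log format; "access.log"
# is an illustrative path, not from the original slides.
from collections import Counter

status_classes = Counter()
distinct_files = set()
bytes_total = 0
corrupt = 0

with open("access.log") as log:
    for line in log:
        parts = line.split()
        try:
            # parts[6] = URI, parts[8] = status code, parts[9] = bytes
            uri, status, nbytes = parts[6], parts[8], parts[9]
            status_classes[status[0] + "xx"] += 1
            distinct_files.add(uri)
            if nbytes.isdigit():          # "-" means nothing transferred
                bytes_total += int(nbytes)
        except IndexError:
            corrupt += 1                  # reported as corrupt logfile lines

successful = status_classes["2xx"]
redirected = status_classes["3xx"]
failed = status_classes["4xx"] + status_classes["5xx"]
print(f"Successful: {successful}  Redirected: {redirected}  Failed: {failed}")
print(f"Distinct files: {len(distinct_files)}  Corrupt lines: {corrupt}")
print(f"Data transferred: {bytes_total / 2**30:.3f} Gbytes")
```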

  7. Problems with Web Servers • Actual user or intent not known • Paths difficult to determine • Infrequent access challenging to uncover • No State Information • Server Hits not Representative • Counters inaccurate • DoS attacks, floods, and bandwidth limits can stop “intended” usage • Robots, etc. • ISP proxy servers • See §5.3, “Unsound inferences from data that is logged,” Haigh & Megarity, 1998

  8. Web Server Configuration • Unique file & directory names = “at a glance” analysis • Hierarchical directory structure • Redirect CGI to find referrer • Use a database • Store Web content • Record usage data along with the context of the content served • Create state information with programming (see the sketch below) • Servlets, ActiveX, JavaScript • Custom server or log format • Log rollover, report frequency, special-case testing
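A minimal sketch of the “create state information” idea: a tiny WSGI app (illustrative, not from the slides; the cookie name sid is made up) that issues a session-ID cookie so later requests can be tied together:

```python
# Sketch: add state to a stateless server by issuing a session-ID
# cookie. Logging the id alongside the usual fields lets analysis
# group requests into true sessions instead of guessing from IPs.
import uuid
from http.cookies import SimpleCookie

def app(environ, start_response):
    cookie = SimpleCookie(environ.get("HTTP_COOKIE", ""))
    if "sid" in cookie:
        sid = cookie["sid"].value      # returning visitor: reuse the id
    else:
        sid = uuid.uuid4().hex         # first request: mint a new id
    headers = [("Content-Type", "text/plain"),
               ("Set-Cookie", f"sid={sid}; Path=/")]
    start_response("200 OK", headers)
    return [f"session {sid}".encode()]

if __name__ == "__main__":
    from wsgiref.simple_server import make_server
    make_server("", 8000, app).serve_forever()
```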

  9. Log File Format • Extended Log File Format - W3C Working Draft WD-logfile-960323 • A sample entry:

  192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] "GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 "http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/l=0/d=1/r=1/e=0/h=10/i=11683503" "Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"

  • Every server generates slightly different logs • Version & operating-system issues • Admin tweaks to log formats • Extended log format most common • WWW Consortium standards (as used by Apache)
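A sketch of pulling the fields out of the sample line above with a regular expression; the group names are my own labels, not part of any standard:

```python
# Sketch: parse one combined-format log line into named fields.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

line = ('192.117.240.3 - - [24/Jul/1998:00:00:04 -0400] '
        '"GET /10/3/a3-160-e.html HTTP/1.0" 200 2308 '
        '"http://www.amicus.nlc-bnc.ca/wbin/resanet/itemdisp/'
        'l=0/d=1/r=1/e=0/h=10/i=11683503" '
        '"Mozilla/2.0 (compatible; MSIE 3.01; Windows 95)"')

fields = LOG_PATTERN.match(line).groupdict()
print(fields["ip"], fields["status"], fields["uri"], fields["agent"])
```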

  10. Let’s Look at some logs • http://www.ischool.utexas.edu/analog-monthly.html • http://www.ischool.utexas.edu/analog-weekly.html

  11. Log Analysis Tools • Analog • Webalizer • Sawmill • WebTrends • AWStats • WWWStat • GetStats • Perl Scripts • Data Mining & Business Intelligence tools

  12. WebTrends • A whole industry of analytics • Most popular commercial application

  13. Measuring Web Site Usage • Now that the Web is a primary source, understanding its use is critical • There are few external cues that a Web site is being used • What - pages and their content/subject • How - browsers • Who - userid or IP • When - trends: daily, weekly, yearly • Where - where the user is and what page they came from
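A sketch of answering the what/how/who/when questions from already-parsed records; the records list and its keys are made-up stand-ins for the output of a parser like the one on the Log File Format slide:

```python
# Sketch: summarize what/how/who/when from parsed log records.
from collections import Counter

records = [  # illustrative data only
    {"uri": "/", "agent": "Mozilla/2.0", "ip": "192.117.240.3", "hour": 9},
    {"uri": "/search", "agent": "Mozilla/2.0", "ip": "10.0.0.7", "hour": 14},
    {"uri": "/", "agent": "MSIE 3.01", "ip": "10.0.0.7", "hour": 14},
]

print("What:", Counter(r["uri"] for r in records).most_common(1))
print("How:", Counter(r["agent"] for r in records).most_common(1))
print("Who:", len({r["ip"] for r in records}), "distinct IPs")
print("When:", Counter(r["hour"] for r in records).most_common(1))
```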

  14. What can’t you measure? • Who the user is • …at least not always • Whether the user’s needs have changed • Whether they’re using the information • Browsing vs. reading vs. acting on the information • Changes to the site and how they affect each user • Pages not used at all - and why

  15. Analysis of a Very Large Search Log • What kinds of patterns can we find? • Request = query and results page • 280 GB – Six Weeks of Web Queries • Almost 1 Billion Search Requests, ~850M valid, ~575M queries • 285 Million User Sessions (cookie issues) • Large Volume, Less Sensitive to Passing Trends • Why are unique queries important? • Web Users: • Use Short Queries in Short Sessions - 63.7% one request • Mostly Look at the First Ten Results only • Seldom Modify Queries • Traditional IR Isn’t Accurately Describing Web Search • Phrase Searching Could Be Augmented • Silverstein, Henzinger, Marais, Moricz (1998)

  16. Analysis of a Very Large Search Log • 2.35 Average Terms Per Query • 0 terms = 20.6% (empty queries) • 1 term = 25.8% • 2 terms = 26.0% (72.4% cumulative) • Operators Per Query • 0 = 79.6% • Terms Predictable • First Set of Results Only Viewed = 85% • Some (Single-Term Phrase) Query Correlation • Augmentation • Taxonomy Input • Robots vs. Humans
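A sketch of how a terms-per-query distribution like the one above could be computed; the queries list is toy data, not the AltaVista log:

```python
# Sketch: terms-per-query distribution from raw query strings.
from collections import Counter

queries = ["web analytics", "", "information retrieval evaluation",
           "apache", "log file format", ""]   # illustrative data

term_counts = Counter(len(q.split()) for q in queries)
total = sum(term_counts.values())

for n in sorted(term_counts):
    pct = 100 * term_counts[n] / total
    print(f"{n} terms: {pct:.1f}%")   # 0-term queries are empty requests

avg = sum(n * c for n, c in term_counts.items()) / total
print(f"average terms per query: {avg:.2f}")
```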

  17. Web Analytics and IR? • Knowing access patterns of users • Lists of search terms • Numbers of words • Words, concepts to add (synonyms) • Types of queries • Success of searching a site • Was a result link clicked on? • How many pages per user after a search? • Is a new or better search interface needed?

  18. Real Life Information Retrieval • 51K Queries from Excite (1997) • Average Search Terms per Query = 2.21 • Number of Terms: 1 = 31%, 2 = 31%, 3 = 18% (80% combined) • Logic & Modifiers (by User) • Infrequent • AND, “+”, “-” • Logic & Modifiers (by Query) • 6% of Users • Less Than 10% of Queries • Lots of Mistakes • Uniqueness of Queries • 35% successive • 22% modified • 43% identical

  19. Real Life Information Retrieval • Queries per User: 2.8 • Sessions • Flawed Analysis (User ID) • Some Revisits to a Query (Result Page Revisits) • Page Views • Accurate, but not by User • Use of Relevance Feedback (“more like this”) • Not Used Much (~11%) • Terms Used: Typical & Frequent • Mistakes • Typos • Misspellings • Bad (Advanced) Query Formulation • Jansen, B. J., Spink, A., Bateman, J., & Saracevic, T. (1998)

  20. KDD for Extracting Knowledge • Knowledge extraction, information discovery, information extraction, data archeology, data pattern processing, OLAP, HV statistical analysis • Sounds as if “knowledge” is there to be found. • User and usage context help find the knowledge • Hypothesis before analysis • Why KDD, why now? • Data storage, analysis costs • Visualization

  21. KDD Process • Database for structured data and queries • How it is structured; algorithms for queries • How results can be understood and visualized • Iterative & interactive, hypothesis-driven & hypothesis-generating

  22. KDD Efforts • Data Cleaning • Formulating the Questions • “Finding useful features to represent the data” p30 • Models: • Classification to fit data into pre-defined classes • Regression to fit predicted values • Clustering to group data into classes found in the data • Summarization to briefly describe data • Dependency discovery of variable relationships • Sequence analysis for time or interaction patterns

  23. Data Prep for Mining the WWW • Processing the data before mining • WEBMINER system - site topology • Cleaning • User identification • Session identification (episodes) - see the sketch below • Path completion
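A minimal sketch of the user- and session-identification steps, assuming (IP, agent) pairs stand in for users and a 30-minute inactivity timeout (a common heuristic, not something the slide specifies):

```python
# Sketch: group requests by (IP, agent) as a stand-in for user
# identification, then split each user's requests into sessions
# wherever the gap between hits exceeds a timeout.
from collections import defaultdict

TIMEOUT = 30 * 60  # seconds; a common heuristic cutoff

def sessionize(requests):
    """requests: iterable of (ip, agent, unix_time, uri) tuples."""
    by_user = defaultdict(list)
    for ip, agent, t, uri in requests:
        by_user[(ip, agent)].append((t, uri))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()                       # order each user's hits by time
        current = [hits[0]]
        for prev, cur in zip(hits, hits[1:]):
            if cur[0] - prev[0] > TIMEOUT:
                sessions.append((user, current))
                current = []              # long gap: start a new session
            current.append(cur)
        sessions.append((user, current))
    return sessions
```

Path completion would then fill in cached pages a user must have revisited (e.g., via the Back button) that never reached the server log.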

  24. Web Usage Mining • VL Verification • Data Mining to Discover Patterns of Use • Pre-Processing • Pattern Discovery • Pattern Analysis • Site Analysis, Not User Analysis • Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.N. - 2000

  25. Web Usage Discovery • Content • Text • Graphics • Features • Structure • Content Organization • Templates and Tags • Usage • Patterns • Page References • Dates and Times • User Profile • Demographics • Customer Information

  26. Web Usage Collection • Types of Data • Web Servers • Proxies • Web Clients • Data Abstractions • Sessions • Episodes • Clickstreams • Page Views • The Tools for Web Use Verification

  27. Web Usage Preprocessing • Usage Preprocessing • Understanding the Web Use Activities of the Site • Extract from Logs • Content Preprocessing • Converting Content Into Formats for Processing • Understanding Content (Working with Dev Team) • Structure Preprocessing • Mining Links and Navigation from Site • Understanding Page Content and Link Structures

  28. Web Usage Pattern Discovery • Clustering for Similarities • Pages • Users • Links • Classification • Mapping Data to Pre-defined Classes • Rule Discovery (see the sketch below) • Association Rules • Computation Intensive • Many Paths to Similar Answers • Pattern Detection • Ordering By Time • Predicting Use With Time
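A sketch of the simplest ingredient of rule discovery: counting page pairs that co-occur in sessions, the support counts behind rules like “visitors to /search also view /results”. The sessions here are toy data:

```python
# Sketch: pairwise page co-occurrence counts across sessions,
# the raw material for association-rule mining.
from collections import Counter
from itertools import combinations

sessions = [  # illustrative data only
    ["/home", "/search", "/results"],
    ["/home", "/results", "/detail"],
    ["/search", "/results"],
]

pair_counts = Counter()
for pages in sessions:
    for a, b in combinations(sorted(set(pages)), 2):
        pair_counts[(a, b)] += 1

n = len(sessions)
for (a, b), count in pair_counts.most_common(3):
    print(f"{a} & {b}: support {count / n:.2f}")
```

Counting all pairs (and longer itemsets) over millions of sessions is what makes this step computation-intensive in practice.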

  29. Web Usage Mining as Evaluation? • Mining Goals • Improved Design • Improved Delivery • Improved Content • Personalization (XMod Data) • System Improvement (Tech Data) • Site Modification (IA Data) • Business Intelligence (Market Data) • Usage Characterization (User Behavior Data)

  30. Web Analytics Wrap-up • What can we learn about users? • What can we learn about services? • How can we help users improve their use? • How can IR models benefit from this analysis? • What kinds of improvements in Web IR systems and their interfaces can be taken from this?
