Web Analytics. Xuejiao Liu INF 385F: WIRED Fall 2004. Outline. Introduction What is Web Analytics Why Web Analytics matters Secondary readings Log file analysis Web usage mining Data preparation KDD process Document access in repositories. Log File Lowdown (Michael Calore, 2001).
Web Analytics Xuejiao Liu INF 385F: WIRED Fall 2004
Outline • Introduction • What is Web Analytics • Why Web Analytics matters • Secondary readings • Log file analysis • Web usage mining • Data preparation • KDD process • Document access in repositories
Log File Lowdown (Michael Calore, 2001) • Log file • What is in a log file • Traffic • Audience • Browsers/Platforms • Errors • Referrers
Log File Lowdown • Sample log file entry: adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] "GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)" • Log File Analyzers • WebTrends, Sawmill, Analog, Webalizer, HTTP-analyze
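To make the fields of the sample entry concrete, here is a minimal Python sketch that splits it into host, timestamp, request, status, size, referrer, and user agent. It assumes the entry follows the common "combined" access log layout and is written with plain ASCII quotes; the names `LOG_PATTERN` and `parse_log_line` are illustrative and do not come from the slides or from any of the analyzers listed.

```python
import re
from datetime import datetime

# Assumed layout (combined log format):
# host ident user [time] "method path protocol" status size "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent; treat it as zero bytes.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


sample = (
    'adsl-63-183-164.ilm.bellsouth.net - - [09/May/2001:13:42:07 -0700] '
    '"GET /about.htm HTTP/1.1" 200 3741 "http://www.e-angelica.com" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"'
)
entry = parse_log_line(sample)
print(entry["host"], entry["path"], entry["status"], entry["size"])
```

The traffic, audience, browser/platform, error, and referrer reports that analyzers produce are aggregations over these parsed fields.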
WebTrends • A log file analyzer • Advantages • Fast and effective • User-friendly interface • Feature-rich • Supports multiple operating systems • Disadvantages • Not free
The KDD Process for Extracting Useful Knowledge from Volumes of Data (Fayyad, U., G. Piatetsky-Shapiro, et al. 1996) • KDD: Knowledge Discovery in Databases • The value of data • Definitions • KDD • Data mining
The KDD Process • The KDD process • 1. Creating a target dataset • 2. Preprocessing and data cleaning • 3. Data reduction and projection • 4. Data mining • Choosing the data mining function • Choosing the data mining algorithm • 5. Interpretation and evaluation
The KDD Process • Data Mining • Data mining involves fitting models to or determining patterns from observed data • Data mining algorithms • The model • The preference criterion • The search algorithm
The KDD Process • Data Mining • Model functions • Classification • Regression • Clustering • Dependency modeling • Link analysis • Goals of data mining • Predictive and descriptive
Data Preparation for Mining World Wide Web Browsing Patterns (Cooley, R. W., B. Mobasher, et al. 1999) • Web usage mining vs. data mining • The WEBMINER process • Preprocessing • Mining algorithms • Pattern analysis
Data Preparation • Preprocessing • Data cleaning • User identification • Session identification • Path completion • Formatting
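The first three preprocessing steps can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the WEBMINER implementation: requests are assumed to be already reduced to `(user_id, timestamp_seconds, url)` tuples, data cleaning is approximated by dropping image requests, and a new session is assumed to start after 30 minutes of inactivity. The function names, the suffix list, and the timeout value are all hypothetical choices made for this example.

```python
from collections import defaultdict

# Assumed inactivity threshold; the slides do not specify one.
SESSION_TIMEOUT_SECONDS = 30 * 60

# Assumed list of embedded-resource suffixes to drop during cleaning.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def clean_requests(requests):
    """Data cleaning: drop requests for embedded images."""
    return [r for r in requests if not r[2].lower().endswith(IGNORED_SUFFIXES)]


def identify_sessions(requests, timeout=SESSION_TIMEOUT_SECONDS):
    """Group each user's requests into sessions split by long gaps."""
    # User identification is assumed done: group hits by the given user_id.
    by_user = defaultdict(list)
    for user_id, timestamp, url in requests:
        by_user[user_id].append((timestamp, url))

    sessions = []
    for user_id, hits in by_user.items():
        hits.sort()
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > timeout:
                # Gap is too long: close the current session, start a new one.
                sessions.append((user_id, current))
                current = []
            current.append(hit)
        sessions.append((user_id, current))
    return sessions


# Hypothetical requests, invented for illustration only.
requests = [
    ("a", 0, "/index.html"),
    ("a", 60, "/logo.gif"),
    ("a", 120, "/about.htm"),
    ("a", 4000, "/index.html"),
    ("b", 30, "/index.html"),
]
sessions = identify_sessions(clean_requests(requests))
for user_id, hits in sessions:
    print(user_id, [url for _, url in hits])
```

Path completion and formatting are left out here because they depend on the site's link structure and on the input format of the chosen mining algorithm.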
Tracking the Growth of a Site (Nielsen, Jakob, 1998) • Exponential growth of the web and the internet • Statistical method • Logarithmic transformation of pageviews so that linear regression applies • Statistical analysis • Hypothesis: the site is growing (number of pageviews and date are correlated) • R² and significance
Tracking the Growth of a Site • R² = 0.96, p < 0.001
Tracking the Growth of a Site • Predict growth rate • Clean noise • Confidence interval
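The log-transform-then-regress method can be illustrated with a short Python sketch. It fits a straight line to log10(pageviews) against time by ordinary least squares and reports the slope, intercept, and R². The monthly figures below are invented purely for illustration and are not the data behind the R² reported on the slides; `fit_exponential_growth` is a hypothetical helper name, and the choice of base 10 is an assumption that does not affect R².

```python
import math


def fit_exponential_growth(times, pageviews):
    """Least-squares fit of log10(pageviews) = slope * time + intercept."""
    logs = [math.log10(v) for v in pageviews]
    n = len(times)
    mean_x = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in times)
    syy = sum((y - mean_y) ** 2 for y in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-predictor regression, R^2 is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical monthly pageview counts (invented for this example).
months = [0, 1, 2, 3, 4, 5]
views = [1000, 1210, 1390, 1720, 2050, 2480]

slope, intercept, r_squared = fit_exponential_growth(months, views)
growth_factor = 10 ** slope  # multiplicative growth per month
print(f"growth per month: {100 * (growth_factor - 1):.1f}%  R^2: {r_squared:.3f}")
```

Because the fit is done on the logarithm, the slope translates into a constant percentage growth per period, which is what makes extrapolating the growth rate straightforward.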
Predicting Document Access in Large, Multimedia Repositories (Recker, M. R. and J. E. Pitkow, 1996) • Patterns of document requests in network-accessible multimedia databases • Main idea • Two related domains: human memory and libraries • Borrow models and research results from them
Predicting Document Access • The model – human memory (Anderson and Schooler) • The relationship of recency and performance is a power function • The relationship of frequency and performance is a power function • Two parameters for performance • Need probability p and Need odds p/(1-p) • The linear function: • Log(Need odds) = a Log(Frequency) + b
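The step from a power function to the linear function can be spelled out. In the sketch below, F stands for frequency and c for a positive scaling constant; c is a symbol introduced here for illustration and does not appear on the slides.

```latex
\begin{align*}
\text{Need odds} &= \frac{p}{1-p} = c\,F^{a} \\
\log(\text{Need odds}) &= \log c + a \log F \\
                       &= a \log(\text{Frequency}) + b, \qquad b = \log c
\end{align*}
```

A power function is therefore a straight line on log-log axes, with the exponent a as the slope. The same argument applies to recency, with the elapsed time in place of F.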
Predicting Document Access • Apply Human Memory Analysis in Document Requests Model • Dataset: log file of Georgia Tech WWW repository • A dynamic information ecology • Frequency analysis • Regression equation: • Log(Need odds) = 0.99 Log(Frequency) – 1.30 • Recency analysis • Regression equation: • Log(Need odds) = -1.15 Log(days) + 0.41 • Combining recency and frequency
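To show how the two regression equations turn into a predicted need probability, here is a small Python sketch. The coefficients are the ones on the slide; everything else is an assumption. In particular, the slides do not state the base of the logarithm, so it is exposed as a parameter (defaulting to the natural log as a guess), and the function names are hypothetical.

```python
import math


def log_odds_from_frequency(frequency, base=math.e):
    """Slide equation: Log(Need odds) = 0.99 Log(Frequency) - 1.30."""
    return 0.99 * math.log(frequency, base) - 1.30


def log_odds_from_recency(days, base=math.e):
    """Slide equation: Log(Need odds) = -1.15 Log(days) + 0.41."""
    return -1.15 * math.log(days, base) + 0.41


def need_probability(log_odds, base=math.e):
    """Invert odds = p / (1 - p) to recover p from log odds."""
    odds = base ** log_odds
    return odds / (1.0 + odds)


# Compare documents accessed often vs. rarely, and recently vs. long ago.
for frequency in (1, 5, 20):
    p = need_probability(log_odds_from_frequency(frequency))
    print(f"frequency={frequency:2d}  need probability={p:.3f}")

for days in (1, 3, 7):
    p = need_probability(log_odds_from_recency(days))
    print(f"days since access={days}  need probability={p:.3f}")
```

Whatever the base, the signs of the slopes carry the qualitative result: predicted need rises with access frequency and falls as the days since the last access increase.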
Predicting Document Access • Conclusion • Recency and frequency of past document access are strong predictors of future document access • Recency proved to be a stronger predictor than frequency • Applications for the design of information systems • Determine optimal ordering of retrieved items • Inform design decisions • Design of caching algorithms