Web Mining: An Overview Of Web Analytics with Examples. Donghui Wu, Ph.D. Oracle Corporation April 16th 2003. Agenda. Web Mining Overview Basic Web Analysis Problems Data Warehouse Solutions Oracle 9iAS Clickstream Intelligence Demo Site Configure Excerpts
Web Mining: An Overview Of Web Analytics with Examples Donghui Wu, Ph.D. Oracle Corporation April 16th 2003
Agenda • Web Mining Overview • Basic Web Analysis Problems • Data Warehouse Solutions • Oracle 9iAS Clickstream Intelligence Demo • Site Configure Excerpts • Site Basic Statistics Examples • Business Scenario Examples
Web Mining Web Mining, generally speaking, is the activity of applying data mining principles and processes to the Web domain. It may tackle the World Wide Web as a whole, or focus on a particular Web site (server) or group of Web sites (servers). In this talk, we limit the scope to Web usage and pattern analysis or, more specifically, Web Log Mining at the enterprise (Web site) level. In industry, it is also referred to as Web Analytics.
Web Analytics • Web Analytics is the monitoring and reporting of Web site usage so that enterprises can better understand the complex interactions between Web visitor actions and Web site offers, and leverage that insight to optimize the site for increased customer loyalty and sales. • From Web Analytics: Making Business Sense of Online Behavior, Aberdeen Group, June 2002
Web Mining and Privacy • Privacy is always a concern for data mining projects. • It is a major concern when analyzing or mining visitors’ online behavior, in particular when profiling visitors or users. • Usually only aggregated information is analyzed, not that of individual visitors or users.
Web Log Data Sources (1) • Web Server Log • This is the log kept at the Web server; it is easy to get and the most widely analyzed. • It is logged at the destination. The analysis is about a particular Web server or servers. • One Web server can host many Web sites, and one Web site may be served by multiple Web servers. • Proxy Server Log • If the Web connection goes through a proxy, every request is logged at the proxy server as well. • It is logged at the origin. The analysis is about a group of users, e.g. all users within a company.
Web Log Data Sources (2) • Client Side Browser Log • Embedded client-side collection. It requires sending simple JavaScript with the response to the browser; the script collects browser information and client-side visitor activity, e.g. mouse movement, and sends it to a collector server for analysis. • Application Log • A Web application usually has its own logs, at various levels of detail and for various purposes.
Web Server Log Analysis and Mining • From now on, we limit our subject to Web Server Log Analysis and Mining only. • The emphasis is on Enterprise Web Analytics. • We will use a fictional site, drugdepo.com, as the sample site, and Oracle 9iAS Clickstream Intelligence to produce the sample analysis.
Web Analytics Task Categories • Site Activity and Operation: site traffic, performance and status • Usage Mining: visitor behavior analysis, referrer analysis, path analysis • User Profiling/Clustering: visitor profiling and segmentation; user profiling and segmentation
Web Analytics Tasks for Business Users • Content effectiveness evaluation • Online marketing campaign analysis • Target marketing analysis • Personalization and recommendation • Cross-sell and up-sell opportunities • Many more…
Data Mining Techniques in Web Analytics The following data mining techniques may be applied to solve those problems: • Association Rule Mining • Clustering / Segmentation • Visitor / User • Pages • Visitor/User Profiling
Web Mining Difficulties • Data size is huge • For a site with 1 million hits per day, the raw log file size can be 500 MB to 1 GB per day, depending on the Web server configuration • Bad records • There are many bad records due to server errors. • Lack of exact information • In many cases, heuristics have to be applied
Web Server Log Format • NCSA Common Log Format • NCSA Extended Common Log Format • W3C Extended Log File Format For more information, see the W3C website
NCSA Common Log Format The following is a line from an Apache server log. It is in NCSA Common Log Format extended with the referrer and browser fields (the NCSA extended, or combined, format), and its fields are separated by spaces. Host Ident Authuser Time Request Status BytesSent Referrer Browser 24.69.48.18 - 709697D0CE694757E034080020CB1B7C [01/Nov/2000:23:59:05 -0800] "GET /products/forms/pdf/256629.pdf HTTP/1.0" 206 308928 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"
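As a minimal sketch of how such a line might be split into the fields named above, the following uses a regular expression. The pattern, the function name `parse_log_line` and the dictionary keys are illustrative assumptions, not part of any Oracle or Apache tool; the trailing referrer and browser fields are treated as optional so that plain Common Log Format lines also match.

```python
import re
from datetime import datetime

# Hypothetical pattern: field names mirror the header row on the slide.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes_sent>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<browser>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of fields, or None for a bad (unparseable) record."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Timestamps look like 01/Nov/2000:23:59:05 -0800
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A "-" in the bytes field means nothing was sent.
    record["bytes_sent"] = 0 if record["bytes_sent"] == "-" else int(record["bytes_sent"])
    return record

# The sample line from the slide.
SAMPLE = ('24.69.48.18 - 709697D0CE694757E034080020CB1B7C '
          '[01/Nov/2000:23:59:05 -0800] '
          '"GET /products/forms/pdf/256629.pdf HTTP/1.0" 206 308928 '
          '"-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98)"')

record = parse_log_line(SAMPLE)
```

Returning `None` for lines that do not match gives a simple way to set bad records aside instead of aborting the whole load.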
Dynamic Page and Parameters • In the previous example, the requested page is a static page. • For dynamic pages (e.g. ASP, JSP, etc.), the request has two parts: the static URL stem and the query, separated by “?” • The query string consists of “parameter=value” pairs. • Parameters provide detailed information about the request.
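The stem/query split described above can be sketched with Python's standard `urllib.parse` module. The example URL and its parameter names (`product_id`, `category`) are made up for illustration and do not come from the demo site.

```python
from urllib.parse import urlsplit, parse_qs

def split_request_url(url):
    """Split a requested URL into its static stem and its query parameters."""
    parts = urlsplit(url)
    # parse_qs maps each parameter name to a list of values,
    # because a parameter may appear more than once in a query string.
    params = parse_qs(parts.query, keep_blank_values=True)
    return parts.path, params

# Hypothetical dynamic-page request.
stem, params = split_request_url("/catalog/item.jsp?product_id=1234&category=vitamins")
```

For a static page the query part is empty, so `params` is simply an empty dictionary.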
Web Log Mining Task Types • Web Log Analyzer • Provides simple statistics, e.g. # of visitors, # of page views, # of sessions, etc. for a given time period • Web Log Mining • Web Usage Mining and Pattern Analysis • E-commerce, Personalization and CRM • Integrating and mining data across the enterprise
Related Terms • Hits • A hit is a single URL request in the server log • Page Views (Page Impressions) • A page view may require multiple requests, e.g. several .gif or .jpeg requests plus one .html request • Data Sent • Visitors (identified and unidentified visitors) • Users (Authenticated Visitors) • Sessions
Data Filtering For data analysis purposes, the following data preparation steps are often applied: • Remove .gif, .jpeg and other non-essential requests from the raw data • Other filtering may also be applied, depending on the task at hand • Apply page construction rules to consolidate records
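A minimal sketch of the first filtering step, assuming requests are judged by file extension alone. The extension set and the function names are illustrative choices; a real site would tune the list to its own content.

```python
import posixpath
from urllib.parse import urlsplit

# Assumed list of extensions that do not count as page requests.
NON_PAGE_EXTENSIONS = {".gif", ".jpeg", ".jpg", ".png", ".css", ".js", ".ico"}

def is_page_request(url):
    """True if the URL stem does not end in a non-essential extension."""
    path = urlsplit(url).path  # drop any query string before checking
    extension = posixpath.splitext(path)[1].lower()
    return extension not in NON_PAGE_EXTENSIONS

def filter_page_requests(urls):
    """Keep only the requests that should be analyzed as pages."""
    return [url for url in urls if is_page_request(url)]

kept = filter_page_requests(
    ["/index.html", "/img/logo.gif", "/photo.JPEG?x=1", "/cart.jsp?id=2"]
)
```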
Basic Processing • Parse the log and resolve the following: • Client IP address • Visitor ID • User ID • Browser and OS • Request • Session
Basic Tasks For any Web Analytics task, the following must be resolved before any analysis is possible: • Visitor identification • User identification / matching • Session Construction • Path Completion
Visitor Identification Methods • Client Hostname or IP Address only • IP Address + Browser String • Query String Parameter • Cookie Value • Visitor Field
IP Method Limitations • Single IP / Multiple Users • A single proxy server can serve many users. • Multiple IP / Single User • A single user may use multiple machines over time, or even in one session. For example, AOL dynamically assigns an IP address to every request. • Always configure your Web server to use a cookie or query string if possible
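One way to combine the identification methods above is to prefer a cookie value when the log has one and fall back to IP address plus browser string otherwise. The sketch below assumes exactly that policy; the function name, the key prefixes and the use of a SHA-1 digest to shorten the fallback key are illustrative choices.

```python
import hashlib

def visitor_id(host, browser, cookie_value=None):
    """Build a visitor key: cookie if available, else IP + browser string."""
    if cookie_value:
        return "cookie:" + cookie_value
    # Fallback heuristic: users behind one proxy with the same browser
    # still collapse into one visitor, as the limitations above warn.
    digest = hashlib.sha1(f"{host}|{browser}".encode("utf-8")).hexdigest()
    return "ipua:" + digest[:16]

with_cookie = visitor_id("24.69.48.18", "Mozilla/4.0", "709697D0CE694757E034080020CB1B7C")
without_cookie = visitor_id("24.69.48.18", "Mozilla/4.0")
```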
Session Identification • Visitor ID and Timeout Period • Once the Visitor ID is constructed, the requests with the same Visitor ID are sequenced by timestamp, i.e. the time each request was made. If the time difference between two consecutive requests is more than, say, 30 minutes, the sequence is broken into two sessions. • Query String Parameter • In the request query string • Cookie Value • Session Field
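The Visitor ID and timeout method can be sketched as follows, assuming each request has already been reduced to a (visitor ID, timestamp, URL) tuple. The function name, the tuple layout and the sample data are assumptions made for illustration; the 30-minute default follows the example in the text.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def build_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (visitor_id, timestamp, url) tuples into per-visitor sessions."""
    by_visitor = defaultdict(list)
    for vid, timestamp, url in requests:
        by_visitor[vid].append((timestamp, url))

    sessions = []
    for vid, hits in by_visitor.items():
        hits.sort(key=lambda hit: hit[0])  # sequence by timestamp
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            # A gap longer than the timeout closes the current session.
            if hit[0] - previous[0] > timeout:
                sessions.append((vid, current))
                current = []
            current.append(hit)
        sessions.append((vid, current))
    return sessions

# Made-up sample requests for two visitors.
sample_requests = [
    ("v1", datetime(2000, 11, 1, 10, 0), "/a"),
    ("v2", datetime(2000, 11, 1, 10, 5), "/x"),
    ("v1", datetime(2000, 11, 1, 11, 0), "/c"),
    ("v1", datetime(2000, 11, 1, 10, 10), "/b"),
]
sessions = build_sessions(sample_requests)
```

In the sample, visitor `v1` ends up with two sessions because the 50-minute gap between `/b` and `/c` exceeds the timeout.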
User Identification • Web Server Authentication • Query String Parameter • Cookie Value • A cookie is a small text file that stores information about a visitor on the user’s PC
Web Analytics Solution Types • Simple Web Log Analyzer • Many free ones, simple parsing and counting • WebTrend Web Log Analyzer • Data Warehouse Solutions • WebTrend E-commerce Server • Oracle 9iAS Clickstream Intelligence • Hosting Solutions • Digimine • Consulting Solutions • Many companies specialize in customized Web Log and Application Log analysis
Web Log Analyzer • A Web Log Analyzer reports simple site usage measures, e.g. # of hits, # of visitors, page sequence, etc. • Methodology: simple parsing and counting • Small and quick, but only produces simple static reports, usually with a large error margin
Data Warehouse Solutions • Load Server Log into Data Warehouse • Integrate with other data, e.g. sales • Support interactive query and OLAP • More accurate analysis and data mining results • Expensive
Simplified DW Scheme: Dimensions • Date • Time • Visitor • User • Browser • Client Host
Simplified DW Scheme: Dimensions (diagram) • Date • Time of Day • Browser • Client Host • User • Visitor • Page • Server • Site • Event • Referrer • Search
Simplified DW Scheme: Facts (diagram) • Impression (page view) fact: Browser, Client Host, Visitor, User, Page, Time to Serve, Referrer, Status, Event, Server, Session ID • Session fact: Session Date, Session Time, Session Visitor ID, Session User ID, Session Duration, # of Impressions, Data Sent, First Impression ID, Last Impression ID, First Referrer
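To illustrate how a Session fact row relates to its Impression rows, here is a minimal sketch that derives the session measures (duration, number of impressions, data sent, first and last impression, first referrer) from a list of impressions. The class and field names are simplified assumptions for this example, not the actual Clickstream Intelligence schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Impression:
    impression_id: int
    session_id: str
    time: datetime
    bytes_sent: int
    referrer: str

@dataclass
class SessionFact:
    session_id: str
    start: datetime
    duration_seconds: float
    num_impressions: int
    data_sent: int
    first_impression_id: int
    last_impression_id: int
    first_referrer: str

def summarize_session(impressions):
    """Aggregate one session's impressions (non-empty list) into a fact row."""
    ordered = sorted(impressions, key=lambda imp: imp.time)
    first, last = ordered[0], ordered[-1]
    return SessionFact(
        session_id=first.session_id,
        start=first.time,
        # Time spent on the last page is unknown, so this is a lower bound.
        duration_seconds=(last.time - first.time).total_seconds(),
        num_impressions=len(ordered),
        data_sent=sum(imp.bytes_sent for imp in ordered),
        first_impression_id=first.impression_id,
        last_impression_id=last.impression_id,
        first_referrer=first.referrer,
    )

# Made-up impressions for one session.
fact = summarize_session([
    Impression(2, "s1", datetime(2000, 11, 1, 10, 5), 250, "/index.html"),
    Impression(1, "s1", datetime(2000, 11, 1, 10, 0), 100, "-"),
])
```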
ETL Process and external data The ETL process can be customized to support business analysis according to: • Web server log format • External customer data • External sales data and marketing data • Other external data sources
Architecture diagram (labels only): Oracle Warehouse Builder • Collector Server • Oracle 9iAS Clickstream Intelligence • Loader • Staging • Star Schema • Partitioning • Oracle 9i
Agenda • Configuration • Basic Site Statistics • Business Scenarios
Site Basic Statistics Site: DrugDepo.com Start Date: October 1 End Date: October 10