Our Topic: Web Usage Mining. Presented by: Wenzhen Xing & Kun Gao. Under the guidance of: Dr. Bettina Berendt. For seminar: Web Mining.
Introduction and Background • More and more organizations rely on the Internet and the World Wide Web to conduct business. • They generate and collect large volumes of data in their daily operations.
These data are generally gathered automatically by Web servers and stored in server access logs. • Mining and analyzing these logs can provide valuable information, e.g. for targeting ads to specific groups of users.
Web mining is the application of data mining techniques to large web data repositories.
The goals of Web mining also include improving Web site design and structure and generating dynamic recommendations. (Session 1)
Overview • Web Mining • Web Usage Mining • Data Sources • Three phases of Web Usage Mining: Preprocessing, Pattern Discovery, Pattern Analysis • Application of related software • Conclusion
Taxonomy of Web Mining
• Web Content Mining
– Agent-based Approach: Intelligent Search Agents; Info. Filtering/Categorization; Personalized Web Agents
– Database Approach: Multilevel Databases; Web Query Systems
• Web Usage Mining
– Data Integration
– Transaction Identification
– Pattern Discovery Tools
– Pattern Analysis Tools
Knowledge Discovery in Databases (process figure) • Data ==> [selection] ==> Target Data ==> [preprocessing] ==> Preprocessed Data ==> [transformation] ==> Transformed Data ==> [data mining] ==> Patterns ==> [interpretation] ==> Knowledge
Classification of Web data: • Content data: any complete or synthetic representation of the resource (the real data), such as HTML documents, images, sound files, etc. • Structure data: data describing the structure and organization of the content through internal tags (intra-page) or hyperlinks (inter-page). • User profile data: demographic information derived from registration. • Usage data: data describing the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.
Data Sources: • Server-level collection: the server stores data about the requests made by clients, so the data generally concern just one source (a single Web site). • Client-level collection: the client itself sends information about the user's behaviour to a repository. This can be implemented with a remote agent (such as JavaScript or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. • Proxy-level collection: information is stored at the proxy, so the data cover several Web sites, but only for users whose Web clients pass through the proxy.
Web Server Log (Session 2)
Web Server Access Logs
• Typical Data in a Server Access Log
looney.cs.umn.edu han - [09/Aug/1996:09:53:52 -0500] "GET mobasher/courses/cs5106/cs5106l1.html HTTP/1.0" 200
mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] "GET / HTTP/1.0" 200 3291
mega.cs.umn.edu njain - [09/Aug/1996:09:53:53 -0500] "GET /images/backgnds/paper.gif HTTP/1.0" 200 3014
mega.cs.umn.edu njain - [09/Aug/1996:09:54:12 -0500] "GET /cgi-bin/Count.cgi?df=CS home.dat\&dd=C\&ft=1 HTTP
mega.cs.umn.edu njain - [09/Aug/1996:09:54:18 -0500] "GET advisor HTTP/1.0" 302
mega.cs.umn.edu njain - [09/Aug/1996:09:54:19 -0500] "GET advisor/ HTTP/1.0" 200 487
looney.cs.umn.edu han - [09/Aug/1996:09:54:28 -0500] "GET mobasher/courses/cs5106/cs5106l2.html HTTP/1.0" 200
. . .
• Access Log Format
IP address | userid | time | method | url | protocol | status | size
mega.cs.umn.edu njain 09/Aug/1996:09:54:31 advisor/csci-faq.html
• Other Server Logs: referrer logs, agent logs
·client IP address or hostname; ·user id ('-' if anonymous); ·access time; ·HTTP request method (GET, POST, HEAD, ...); ·path of the resource on the Web server (identifying the URL); ·the protocol used for the transmission (HTTP/1.0, HTTP/1.1); ·the status code returned by the server as response (200 for OK, 404 for not found, ...); ·the number of bytes transmitted.
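The field list above can be turned into a small parser. The following is a minimal sketch, assuming the exact line layout of the sample entries on the previous slide (host, user id, a dash, bracketed time, quoted request, status, optional size); the names `LogEntry`, `LOG_PATTERN` and `parse_line` are illustrative and not from the original slides. Lines that do not fit the layout, such as truncated entries, are returned as `None` rather than guessed at.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    """One access log entry with the fields listed on the slide."""
    host: str            # client IP address or hostname
    user: str            # user id ('-' if anonymous)
    time: datetime       # access time, including the UTC offset
    method: str          # HTTP request method
    url: str             # path of the requested resource
    protocol: str        # e.g. HTTP/1.0
    status: int          # status code returned by the server
    size: Optional[int]  # bytes transmitted, None if not logged


# Assumed layout: host user - [time] "METHOD url PROTOCOL" status [size]
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<user>\S+) - '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3})(?: (?P<size>\d+))?\s*$'
)


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the layout."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group("size")
    return LogEntry(
        host=match.group("host"),
        user=match.group("user"),
        time=datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        method=match.group("method"),
        url=match.group("url"),
        protocol=match.group("protocol"),
        status=int(match.group("status")),
        size=int(size) if size is not None else None,
    )


if __name__ == "__main__":
    sample = ('mega.cs.umn.edu njain - [09/Aug/1996:09:53:52 -0500] '
              '"GET / HTTP/1.0" 200 3291')
    print(parse_line(sample))
```

Note that month names are parsed with `%b`, which depends on the process locale; the sketch assumes an English or C locale.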
Three Phases • Preprocessing • Pattern discovery • Pattern analysis
Preprocessing • Converts raw usage data into the data abstractions needed for pattern discovery. • The most difficult task in Web usage mining, because the available data are incomplete.
Usage Preprocessing • Single IP Address / Multiple Users (the "AOL effect") - A single proxy server may have several users accessing a Web site, potentially over the same time period. Typical causes: ISP proxy servers and public-access machines.
Usage Preprocessing (cont.) • Multiple IP Addresses / Single User - A user who accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.
Usage Preprocessing (cont.) • Multiple IP Addresses / Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses. • Multiple Agents / Single User - Again, a user who uses more than one browser, even on the same machine, will appear as multiple users.
Usage Preprocessing (cont.) • Solutions to the problem: • Cookies - small pieces of data saved on the client machine – Advantages: Track the same user across multiple sessions. – Disadvantages: Can be declined or deleted. Privacy concerns. • User Login - Require users to log in with an ID and password – Advantages: Unique ID tied to an individual, not a machine or browser. – Disadvantages: Not all users are willing to register.
Usage Preprocessing (cont.) • Solutions to the problem (cont.): • Embedded session ID – Advantages: Can't be turned off. – Disadvantages: Can't track repeat visits. Loses the "first" file access of each session. • Client-side tracking (modified browser) – Advantages: Clean, accurate source of usage data. – Disadvantages: Privacy concerns. Can only track a small percentage of the user population.
Usage Preprocessing (cont.) Other methods: • use session time-outs; • path completion to infer cached references. Example: expanding a session A ==> B ==> C by an access pair (B ==> D) results in A ==> B ==> C ==> B ==> D.
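Both methods on this slide can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: `split_on_timeout` assumes a 30-minute inactivity threshold (a value chosen here for illustration, not stated in the slides), and `complete_path` assumes that when the referrer of a new request is an earlier page of the session, the user stepped back through cached pages that the server never saw. Both function names are hypothetical.

```python
from datetime import datetime, timedelta


def split_on_timeout(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered (timestamp, url) requests into sessions.

    A new session starts whenever the gap to the previous request
    exceeds `timeout` (30 minutes is an assumed default).
    """
    sessions = []
    for timestamp, url in requests:
        if sessions and timestamp - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((timestamp, url))
        else:
            sessions.append([(timestamp, url)])
    return sessions


def complete_path(session, referrer, page):
    """Extend `session` by the access pair (referrer ==> page).

    If the referrer is an earlier page rather than the last one, the
    pages the user presumably backed through are re-inserted first.
    """
    path = list(session)
    if path and path[-1] != referrer and referrer in path:
        # Position of the most recent occurrence of the referrer.
        last = len(path) - 1 - path[::-1].index(referrer)
        # Walk back from the page before the current one to the referrer.
        path.extend(reversed(path[last:-1]))
    path.append(page)
    return path


if __name__ == "__main__":
    # The slide's example: A ==> B ==> C extended by (B ==> D).
    print(" ==> ".join(complete_path(["A", "B", "C"], "B", "D")))

    start = datetime(1996, 8, 9, 9, 0)
    requests = [(start, "A"),
                (start + timedelta(minutes=10), "B"),
                (start + timedelta(minutes=60), "C")]
    print(len(split_on_timeout(requests)), "sessions")
```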
Content Preprocessing • Converts text, images, scripts, and other files such as multimedia into forms that are useful for the Web usage mining process. • Often consists of performing content mining such as classification or clustering (techniques also used in pattern discovery).
Pattern Discovery • Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition.
Pattern Discovery • Statistics • Association Rules • Clustering • Classification • Sequential Patterns • Path Analysis etc...
Pattern Discovery (cont.) • Statistics • The most common method. • Many tools perform this kind of analysis; its aim is to describe the traffic on a Web site, e.g. most visited pages, average daily hits, etc. • Useful for improving system performance, enhancing system security, facilitating site modification, etc. (Session 3 and 5)
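As a rough illustration of the descriptive statistics mentioned above (most visited pages, average daily hits), the sketch below counts hits per page and per day. The function name `traffic_summary` and the `(day, url)` input shape are assumptions made for this example.

```python
from collections import Counter
from datetime import date


def traffic_summary(hits, top_n=3):
    """Summarize traffic from an iterable of (day, url) pairs.

    Returns the `top_n` most visited pages with their hit counts and
    the average number of hits per day on which any hit was logged.
    """
    hits = list(hits)
    page_counts = Counter(url for _, url in hits)
    daily_counts = Counter(day for day, _ in hits)
    average_daily_hits = len(hits) / len(daily_counts) if daily_counts else 0.0
    return page_counts.most_common(top_n), average_daily_hits


if __name__ == "__main__":
    sample = [
        (date(1996, 8, 9), "/"),
        (date(1996, 8, 9), "/"),
        (date(1996, 8, 9), "/images/backgnds/paper.gif"),
        (date(1996, 8, 10), "/"),
    ]
    top_pages, average = traffic_summary(sample)
    print(top_pages, average)
```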
Pattern Discovery (cont.) • Association rules • The main idea is to treat every URL requested by a user during a visit as an item of basket data and to discover relationships between items that meet a minimum support level. • Discovers correlations among references to various pages of a Web site within a single server session. • Useful for restructuring a Web site and as a heuristic for pre-fetching documents to reduce latency.
Association Rules (cont.) • Discover affinities among sets of items across transactions • X ==> Y [support, confidence], where X and Y are sets of items • Examples: • 60% of clients who accessed /products/ also accessed /products/software/webminer.htm. • 30% of clients who accessed /special-offer.html placed an online order in /products/software/.
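For a rule X ==> Y, support is the share of all transactions that contain both X and Y, and confidence is the share of transactions containing X that also contain Y; the percentages in the examples above read as confidence values. The sketch below computes both for one candidate rule over a list of sessions, each treated as a set of URLs. It only evaluates a given rule and does not search for frequent itemsets (as Apriori-style miners do); `rule_stats` is a hypothetical name.

```python
def rule_stats(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent ==> consequent.

    `sessions` is a list of iterables of URLs (one per server session);
    `antecedent` and `consequent` are iterables of URLs.
    """
    x, y = set(antecedent), set(consequent)
    baskets = [set(session) for session in sessions]
    with_x = sum(1 for basket in baskets if x <= basket)
    with_both = sum(1 for basket in baskets if (x | y) <= basket)
    support = with_both / len(baskets) if baskets else 0.0
    confidence = with_both / with_x if with_x else 0.0
    return support, confidence


if __name__ == "__main__":
    webminer = "/products/software/webminer.htm"
    sessions = [
        ["/products/", webminer],
        ["/products/", webminer],
        ["/products/", webminer],
        ["/products/"],
        ["/products/", "/special-offer.html"],
        ["/special-offer.html"],
    ]
    support, confidence = rule_stats(sessions, ["/products/"], [webminer])
    print(f"support={support:.2f} confidence={confidence:.2f}")
```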
Pattern Discovery (cont.) • Clustering • Meaningful clusters of URLs can be created by discovering characteristics they share according to users' behavior. • Usage clusters: useful for market segmentation in e-commerce or for providing personalized Web content to users. • Page clusters: useful for Internet search engines and Web assistance providers.
Pattern Discovery (cont.) • Classification: • Develops a profile of users belonging to a particular class or category. • Requires extraction and selection of the features that best describe the properties of a given class or category.
Pattern Discovery (cont.) • Clustering and Classification: • clients who often access /products/software/webminer.html tend to be from educational institutions. • clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States. • 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.
Pattern Discovery (cont.) • Sequential Patterns • Find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. • The technique discovers time-ordered sequences of URLs followed by past users in order to predict future ones; it is widely used for Web advertising. • Useful for predicting the future behavior of clients.
Sequential Patterns: • 30% of clients who visited /products/software/, had done a search in Yahoo using the keyword “software” before their visit • 60% of clients who placed an online order for WEBMINER, placed another online order for software within 15 days
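Statements of this kind ("x% of clients who did A later did B") can be estimated from time-ordered event sequences, one per client. The sketch below is a simplified illustration: it checks only whether B occurs anywhere after the first A and ignores time windows such as "within 15 days". The function name `followed_by_ratio` and the event labels are hypothetical.

```python
def followed_by_ratio(sequences, first, then):
    """Share of sequences containing `first` in which `then` occurs later.

    `sequences` is a list of time-ordered lists of events (for example
    URLs), one list per client.
    """
    containing_first = 0
    followed = 0
    for sequence in sequences:
        if first in sequence:
            containing_first += 1
            # Only events after the first occurrence of `first` count.
            if then in sequence[sequence.index(first) + 1:]:
                followed += 1
    return followed / containing_first if containing_first else 0.0


if __name__ == "__main__":
    clients = [
        ["order:WEBMINER", "/products/", "order:software"],
        ["order:WEBMINER", "/home"],
        ["order:software", "order:WEBMINER"],
        ["/home"],
    ]
    print(followed_by_ratio(clients, "order:WEBMINER", "order:software"))
```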
Pattern Discovery (cont.) Path Analysis: • Types of Path/Usage Information • Most frequent paths traversed by users • Entry and exit points • Distribution of user session durations / user attrition • Examples: • 60% of clients who accessed /home/products/file1.html followed the path /home ==> /home/whatsnew ==> /home/products ==> /home/products/file1.html • (Olympics Web site) 30% of clients who accessed sport-specific pages started from the Sneakpeek page. • 65% of clients left the site after four or fewer references.
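The kinds of path information listed above can be derived from already-identified sessions with simple counting. The following is a minimal sketch under that assumption; `path_report` and its `short` threshold parameter are illustrative names, and each session is assumed to be a time-ordered list of URLs.

```python
from collections import Counter


def path_report(sessions, short=4):
    """Count entry points, exit points and full paths over all sessions.

    Also returns the share of sessions with at most `short` page
    references, as a rough attrition measure.
    """
    paths = [tuple(session) for session in sessions if session]
    short_sessions = sum(1 for path in paths if len(path) <= short)
    return {
        "entry_points": Counter(path[0] for path in paths),
        "exit_points": Counter(path[-1] for path in paths),
        "paths": Counter(paths),
        "short_share": short_sessions / len(paths) if paths else 0.0,
    }


if __name__ == "__main__":
    sessions = [
        ["/home", "/home/whatsnew", "/home/products"],
        ["/home", "/home/products"],
        ["/home", "/home/whatsnew", "/home/products"],
    ]
    report = path_report(sessions)
    print(report["paths"].most_common(1))
    print(report["entry_points"], report["exit_points"])
```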
Data and Transaction Model for Association Rules • Let L be a set of server access log entries. A log entry l ∈ L has the following components: • the IP address of the client, denoted l.ip • the user id of the client, denoted l.uid • the URL of the page accessed by the client, denoted l.url • the time of access, denoted l.time
Data and Transaction Model for Association Rules • Definition 1 An association transaction t is a triple:
Example: Session Inference with Referrer Log
No. | IP | Time | URL | Referrer | Agent
1 | www.aol.com | 08:30:00 | A | # | Mozilla/2.0; AIX 4.1.4
2 | www.aol.com | 08:30:01 | B | E | Mozilla/2.0; AIX 4.1.4
3 | www.aol.com | 08:30:02 | C | B | Mozilla/2.0; AIX 4.1.4
4 | www.aol.com | 08:30:01 | B | # | Mozilla/2.0; Win 95
5 | www.aol.com | 08:30:03 | C | B | Mozilla/2.0; Win 95
6 | www.aol.com | 08:30:04 | F | # | Mozilla/2.0; Win 95
7 | www.aol.com | 08:30:04 | B | A | Mozilla/2.0; AIX 4.1.4
8 | www.aol.com | 08:30:05 | G | B | Mozilla/2.0; AIX 4.1.4
Identified Sessions:
S1: # ==> A ==> B ==> G from references 1, 7, 8
S2: E ==> B ==> C from references 2, 3
S3: # ==> B ==> C from references 4, 5
S4: # ==> F from reference 6
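One simple heuristic that reproduces the four sessions of this example is sketched below. It is an assumption about how such inference can work, not necessarily the procedure the presenters used: requests are grouped by (IP, agent); "#" is taken to mean that no referrer was logged; a request joins the session of the same IP and agent in which its referrer was requested most recently, and otherwise opens a new session. The name `infer_sessions` and the record layout are illustrative.

```python
def infer_sessions(records):
    """Group log records into sessions using IP, agent and referrer.

    `records` is a list of (number, ip, time, url, referrer, agent)
    tuples in log order. Returns a list of (pages, record_numbers).
    """
    sessions = []
    for position, (number, ip, _time, url, referrer, agent) in enumerate(records):
        key = (ip, agent)
        target = None
        if referrer != "#":
            # Prefer the session where the referrer was seen most recently.
            best = -1
            for session in sessions:
                seen_at = session["last_seen"].get(referrer, -1)
                if session["key"] == key and seen_at > best:
                    best, target = seen_at, session
        if target is None:
            # New session; it starts with "#" or with the external referrer.
            target = {"key": key, "pages": [referrer], "numbers": [], "last_seen": {}}
            sessions.append(target)
        target["pages"].append(url)
        target["numbers"].append(number)
        target["last_seen"][url] = position
    return [(session["pages"], session["numbers"]) for session in sessions]


AIX = "Mozilla/2.0; AIX 4.1.4"
WIN = "Mozilla/2.0; Win 95"
RECORDS = [
    (1, "www.aol.com", "08:30:00", "A", "#", AIX),
    (2, "www.aol.com", "08:30:01", "B", "E", AIX),
    (3, "www.aol.com", "08:30:02", "C", "B", AIX),
    (4, "www.aol.com", "08:30:01", "B", "#", WIN),
    (5, "www.aol.com", "08:30:03", "C", "B", WIN),
    (6, "www.aol.com", "08:30:04", "F", "#", WIN),
    (7, "www.aol.com", "08:30:04", "B", "A", AIX),
    (8, "www.aol.com", "08:30:05", "G", "B", AIX),
]

if __name__ == "__main__":
    for pages, numbers in infer_sessions(RECORDS):
        print(" ==> ".join(pages), "from references", numbers)
```

The timestamps are carried along but not used here; a fuller version would also apply a session time-out.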
Applications for Web-Based Organizations • Electronic Commerce • determine lifetime value of clients • design cross-marketing strategies across products • evaluate promotional campaigns • target electronic ads and coupons at user groups based on their access patterns • predict user behavior based on previously learned rules and users' profiles • present dynamic information to users based on their interests and profiles