Advanced Topics in Data Mining: Web Mining
Web Mining • Applications are being ported to the Web at a rapid pace • On-line services, such as America Online (AOL) and CompuServe (merged into AOL), are anxious to know user access patterns, not just “search” on the Web • How does Amazon do it? • Understanding Web user behavior is important • It can improve Web page organization • It can increase Web server performance • It can exploit Web advertising • It can increase business opportunities
Amazon Web Page Association Rules
More Information Desired • Collecting statistical information (page hits) alone is insufficient because: • The hit frequency of a page depends not only on its content but also on its location • The number of users accessing a page is not available • Information on which pages are accessed together is not available • Data mining in the Web (Web Mining) • Web Access Pattern Collection • Web User Pattern Mining
Web Access Pattern Collection • Server-Based Data Collection • Who is visiting a given Web site, and what are they doing? • Agent-Based Data Collection • Which Web sites has a particular user visited?
Server-Based Data Collection • Examine the logs collected by HTTPd • Access Log (IP, Time, Access Data), Referred Log (A → B), Error Log, … • We can combine some of them for our use if necessary • Problems • The use of proxy servers • The effect of caching
Access Log IP/Domain Name Time Access Data
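The three access-log fields above (IP/domain name, time, access data) can be pulled out of a raw log line as sketched below. This is a minimal sketch that assumes the server writes NCSA Common Log Format; the sample line, the regular expression, and the function name `parse_access_line` are illustrative and do not come from the slides.

```python
import re
from datetime import datetime

# Assumption: NCSA Common Log Format,
#   host ident authuser [time] "request" status size
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_access_line(line):
    """Return host, time, and requested URL for one log line, or None."""
    m = CLF.match(line)
    if m is None:
        return None
    # The request field looks like: METHOD URL PROTOCOL
    parts = m.group("request").split(" ")
    url = parts[1] if len(parts) > 1 else ""
    return {
        "host": m.group("host"),
        "time": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "url": url,
    }

# Hypothetical sample line, for illustration only.
sample = '192.0.2.10 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_access_line(sample)
print(entry["host"], entry["time"].isoformat(), entry["url"])
```

Grouping the parsed entries by host and sorting by time gives a rough per-visitor page sequence, subject to the proxy and caching problems noted above.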
Referred Log • The caching problem is not considered
Server-Based Data Collection • Has to be done in accordance with technology advances • The use of Active Server Pages (Session ID available) • The use of proxy servers • The effect of caching • HTTPd 1.1 • Limitation • Can only capture user behavior while users are within this site
Agent-Based Data Collection • Understanding individual Web behavior requires client-based data collection • Results are useful • Better Personalized Service • Improved Web Page Organization • Better Pricing Policies • Methods • Applets can only read/write files on their source servers • a big security constraint • Using Active Components (ActiveX Control) and PlugIns • APCS (Access Pattern Collection Server)
Agent-Based Data Collection • Very difficult to do for non-registered users in the current Web environment • It has to be conducted with users’ consent • Very dependent upon available Web technologies
Web User Pattern Mining • Web user pattern mining discovers user access patterns in Web servers • Pattern discovery and analysis tools • Some existing Web tools provide mechanisms for reporting user activity in the servers • Web Trends (http://www.webtrends.com.tw/) • Open Market (http://www.openmarket.com/) • Net.Genesis (http://www.netgen.com/)
Path Traversal Patterns Mining • Mining path traversal patterns in a distributed information-providing environment (WWW) where documents or objects are linked together (via hyperlinks) to facilitate interactive access • The solution procedure consists of three steps: • Convert the original sequence of log data into a set of maximal forward references (MF) • Filter out the effect of backward references • Backward references are mainly made for ease of traveling, so concentrate on mining meaningful user access sequences • Some objects are visited because of their locations rather than their content • Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained • Determine the maximal reference sequences from the large reference sequences (trivial)
Step 1: MF References • Suppose the traversal log contains the following traversal path for a user: • A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V • When a backward reference occurs, the forward reference path terminates • The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}
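The conversion in Step 1 can be sketched as follows. This is a minimal sketch, assuming each page is identified by a single letter and the log holds one user's path. The function name `maximal_forward_references` is illustrative. A forward path is emitted whenever a backward reference interrupts forward movement, and once more at the end of the log.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    current = []            # pages on the current forward path
    moving_forward = False  # True if the previous move was a forward reference
    result = []
    for page in path:
        if page in current:
            # Backward reference: the current forward path terminates here.
            if moving_forward:
                result.append("".join(current))
            # Back up to the revisited page and continue from there.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True
    # The log ends while moving forward: emit the last forward path.
    if moving_forward and current:
        result.append("".join(current))
    return result

log = list("ABCDCBEGHGWAOUOV")  # the traversal path from the slide
mf = maximal_forward_references(log)
print(mf)
```

On the slide's traversal path this yields the five maximal forward references listed above.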
Step 1: Arrange Database • Encoding
Step 1: Database Reduction
Step 2: Find Frequent Reference Sequences • Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences) • Full-Scan (FS) Algorithm • FS utilizes key ideas of the DHP algorithm • Selective-Scan (SS) Algorithm • SS reduces the number of database scans
Full-Scan (FS) Algorithm • Generate L1 & Hash Table • Scan DB-1
Generate L1 & Hash Table • Scan DB-1 • h(x, y) = [(order of x) * 23 + (order of y)] mod 17
Generate L2 & Reduce DB • Scan DB-2
Generate C3, L3 & Reduce DB • Scan DB-3
Generate C4, L4 & Reduce DB • Scan DB-4
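The level-wise scans above can be sketched as follows. This sketch is deliberately simplified and is not the FS algorithm itself: it omits the hash table and the database reduction, and keeps only the loop that generates the candidates C(k+1) from Lk and counts them in one scan. It assumes single-letter page names, so a reference sequence is a string and "consecutive subsequence" is a substring test. The input list and the minimum count of 2 are illustrative.

```python
from collections import Counter

def frequent_reference_sequences(references, min_count):
    """Return {sequence: count} for every large (frequent) reference sequence."""
    # Scan 1: count single pages, once per reference.
    counts = Counter(p for ref in references for p in set(ref))
    large = {s: c for s, c in counts.items() if c >= min_count}
    all_large = dict(large)
    while large:
        # Join two large k-references that overlap in k-1 pages.
        candidates = {a + b[-1] for a in large for b in large if a[1:] == b[:-1]}
        # One database scan: count each candidate as a consecutive subsequence.
        counts = Counter()
        for ref in references:
            for cand in candidates:
                if cand in ref:
                    counts[cand] += 1
        large = {s: c for s, c in counts.items() if c >= min_count}
        all_large.update(large)
    return all_large

refs = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large_seqs = frequent_reference_sequences(refs, min_count=2)
print(sorted(large_seqs, key=lambda s: (len(s), s)))
```

With these inputs the longest large reference sequence is ABEG, found on the fourth scan.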
Selective-Scan (SS) Algorithm • Scan DB-3
Step 3: Generate Frequent Traversal Patterns • Maximal Reference Sequences
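Step 3 keeps only the large reference sequences that are not contained in any other large reference sequence. A minimal sketch, again assuming single-letter page names so that containment is a substring test; the input set is an illustrative example and is not taken from the slides' figures.

```python
def maximal_reference_sequences(large_sequences):
    """Keep the sequences that are not a consecutive part of a longer one."""
    seqs = set(large_sequences)
    return sorted(
        s for s in seqs
        if not any(s != t and s in t for t in seqs)
    )

# Hypothetical set of large reference sequences, for illustration only.
large = ["A", "B", "E", "G", "O", "AB", "BE", "EG", "AO", "ABE", "BEG", "ABEG"]
maximal = maximal_reference_sequences(large)
print(maximal)
```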
WAP-Mine Algorithm • The key consideration is how to facilitate the tedious support counting and candidate generation operations in the mining procedure • Given a Web access sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS
WAP-Mine Algorithm • (1) Scan WAS once and find all frequent-1 events • (2) Scan WAS again and construct a WAP-tree • (3) Recursively mine the WAP-tree using conditional search • Result: access patterns
Find All Frequent-1 Events • Min_Sup = 75%
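The first scan can be sketched as below. The slide's example table did not survive, so the four access sequences here are an illustrative assumption, as are the function names. An event is counted once per sequence, and it is frequent when it appears in at least 75% of the sequences. Non-frequent events are then dropped from every sequence before the tree is built.

```python
from collections import Counter
from math import ceil

def frequent_events(sequences, min_sup):
    """Events that occur in at least min_sup (a fraction) of the sequences."""
    counts = Counter(e for seq in sequences for e in set(seq))
    threshold = ceil(min_sup * len(sequences))
    return {e for e, c in counts.items() if c >= threshold}

def keep_frequent(sequences, frequent):
    """Drop non-frequent events, keeping the order of the remaining ones."""
    return ["".join(e for e in seq if e in frequent) for seq in sequences]

# Illustrative access sequences (an assumption, not the slide's table).
was = ["abdac", "eaebcac", "babfaec", "afbacfc"]
frequent = frequent_events(was, min_sup=0.75)
filtered = keep_frequent(was, frequent)
print(sorted(frequent), filtered)
```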
WAP-Tree Construction • Using frequent events to register all count information for further mining
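A count-carrying prefix tree of this kind can be sketched as follows. This is a simplified sketch rather than the full WAP-tree of the original algorithm: it stores one node per distinct prefix with a count, and a header table that lists all nodes sharing an event label. The class and function names and the input sequences (assumed to contain frequent events only) are illustrative.

```python
class Node:
    def __init__(self, label, parent=None):
        self.label = label      # event label, None for the root
        self.count = 0          # number of sequences passing through this node
        self.parent = parent
        self.children = {}      # event label -> child Node

def build_wap_tree(sequences):
    """Insert each sequence as a path from the root, incrementing counts."""
    root = Node(None)
    header = {}                 # event label -> list of nodes with that label
    for seq in sequences:
        node = root
        for event in seq:
            child = node.children.get(event)
            if child is None:
                child = Node(event, node)
                node.children[event] = child
                header.setdefault(event, []).append(child)
            child.count += 1
            node = child
    return root, header

# Hypothetical sequences that already contain frequent events only.
root, header = build_wap_tree(["abac", "abcac", "babac", "abacc"])
print(root.children["a"].count, root.children["b"].count)
```

Sequences that share a prefix share nodes, so the counts needed for conditional search are read from the tree instead of rescanning the database.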
Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on c • Generate Web Access Patterns: ac, bc
Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on ac • Generate Web Access Patterns: aac, bac
Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on bac • Generate Web Access Patterns: abac
Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on abac • No Web Access Patterns are Generated
Mining for Web Transactions • To capture Web customer buying behavior • It is not just a market basket transaction, i.e., the set of items bought by a customer in a single purchase (Association Rules) • It is not just Web user travel patterns (Path Traversal Patterns) • It is an extension of path traversal patterns • Exploring the relationship between traveling and buying
Mining for Web Transactions • Web Transaction → Algorithm WR (Web-transaction-Record) → Web Transaction Records <Path: a Set of Purchases> → Algorithm WTM, MTSPJ, MTSPC → Frequent Transaction Patterns → Web Transaction Association Rules
Mining for Web Transactions • Web-transaction-Record (WR) Algorithm • Extracts meaningful Web transaction records from the given Web transaction • WTM (Web Transaction Mining) Algorithm • Mines Web transaction patterns • MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM
WTM Algorithm • It joins the purchased itemsets to generate candidate transaction patterns • WTM employs a two-level hash tree, called the Web transaction tree, to store candidate transaction patterns • WTM hashes not only each item but also each purchase in the path
WTM Algorithm • Database of Web transactions:
WT_ID  Path   Purchase
100    ABCE   B{i1}, C{i2}, E{i4}
       ABFGH  B{i1}, H{i6}
       ASJL   S{i7}, L{i9}
200    ABCE   B{i1}, C{i2}, E{i4}
       ASJLQ  S{i7}, Q{i10}
300    ABCE   B{i1}, E{i4}
       ABFG   B{i1}, G{i5}
       ASJL   S{i7}, J{i8}, L{i9}
400    ABD    D{i3}
       ABFG   G{i5}
       ASJLQ  S{i7}, J{i8}, Q{i10}
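The database above can be held in a small data structure and queried for the support of a transaction pattern, as sketched below. The slides do not spell out how WTM counts support, so the containment rule here is an assumption: a Web transaction supports a pattern <path: purchases> when one of its records has the pattern path as a consecutive part of its path and includes every purchase in the pattern. The WTM hash tree is not reproduced, and all names are illustrative.

```python
# Web transaction database from the slide: WT_ID -> list of (path, purchases),
# where purchases maps a node on the path to the set of items bought there.
database = {
    100: [("ABCE", {"B": {"i1"}, "C": {"i2"}, "E": {"i4"}}),
          ("ABFGH", {"B": {"i1"}, "H": {"i6"}}),
          ("ASJL", {"S": {"i7"}, "L": {"i9"}})],
    200: [("ABCE", {"B": {"i1"}, "C": {"i2"}, "E": {"i4"}}),
          ("ASJLQ", {"S": {"i7"}, "Q": {"i10"}})],
    300: [("ABCE", {"B": {"i1"}, "E": {"i4"}}),
          ("ABFG", {"B": {"i1"}, "G": {"i5"}}),
          ("ASJL", {"S": {"i7"}, "J": {"i8"}, "L": {"i9"}})],
    400: [("ABD", {"D": {"i3"}}),
          ("ABFG", {"G": {"i5"}}),
          ("ASJLQ", {"S": {"i7"}, "J": {"i8"}, "Q": {"i10"}})],
}

def record_matches(record, path, purchases):
    # Assumed rule: path is a consecutive part of the record's path and
    # every purchased item in the pattern was bought at the same node.
    rec_path, rec_purchases = record
    return path in rec_path and all(
        items <= rec_purchases.get(node, set())
        for node, items in purchases.items()
    )

def support_count(db, path, purchases):
    """Number of Web transactions with at least one matching record."""
    return sum(
        any(record_matches(rec, path, purchases) for rec in records)
        for records in db.values()
    )

count_abce = support_count(database, "ABCE", {"B": {"i1"}, "E": {"i4"}})
count_asjl = support_count(database, "ASJL", {"S": {"i7"}, "L": {"i9"}})
print(count_abce, count_asjl)
```

Under this rule the pattern <ABCE: B{i1}, E{i4}> is supported by transactions 100, 200, and 300, while <ASJL: S{i7}, L{i9}> is supported only by 100 and 300.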