
Advanced Topics in Data Mining: Web Mining


Presentation Transcript


  1. Advanced Topics in Data Mining: Web Mining

  2. Web Mining

  3. Web Mining • Applications are being ported to the Web at a rapid pace • On-line services, such as America Online (AOL) and CompuServe (since merged into AOL), are anxious to know user access patterns, not just what users “search” for on the Web • How does Amazon do it? • Understanding Web user behavior is important • It can improve Web page organization • It can increase Web server performance • It can help exploit Web advertising • It can increase business opportunities

  4. Amazon Web Page Association Rules

  5. More Information Desired • Collecting statistical information (page hits) alone is insufficient, since: • The hit frequency of a page depends not only on its content but also on its location • The number of users accessing a page is not available • Information on which pages are accessed together is not available • Data mining on the Web (Web Mining) • Web Access Pattern Collection • Web User Pattern Mining

  6. Web Access Pattern Collection • Server-Based Data Collection • Who is visiting a given Web site, and what are they doing? • Agent-Based Data Collection • Which Web sites has a particular user visited?

  7. Server-Based Data Collection • Examine the logs collected by HTTPd • Access Log (IP, Time, Access Data), Referred Log (AB), Error Log, … • We can combine some of them for our use if necessary • Problems • The use of proxy servers • The effect of caching

  8. Server-Based Data Collection

  9. Access Log • Fields: IP/Domain Name, Time, Access Data
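
As a rough illustration of the three access-log fields named on this slide, the sketch below parses one line in the widely used Common Log Format. The slides do not say which log format the server used, so the format, the sample line, and the function name `parse_access_line` are all assumptions made for this example.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [time] "request" status bytes
# (assumed format; the slides only name the fields IP/Domain, Time, Access Data)
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def parse_access_line(line):
    """Return a dict with host, time, method, path and status, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    # The request field looks like: METHOD /path PROTOCOL
    method, _, rest = match.group("request").partition(" ")
    path = rest.split(" ")[0] if rest else ""
    return {
        "host": match.group("host"),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": method,
        "path": path,
        "status": int(match.group("status")),
    }


# Hypothetical log line, not taken from the slides.
sample = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_access_line(sample)
print(record["host"], record["time"].isoformat(), record["path"])
```

Grouping such records by host and ordering them by time gives the per-user traversal sequences that the later mining steps consume, subject to the proxy and caching problems noted earlier.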

  10. Referred Log • Caching issues are not considered

  11. Server-Based Data Collection • Has to keep pace with technology advances • The use of Active Server Pages (Session ID available) • The use of proxy servers • The effect of caching • HTTPd 1.1 • Limitation • Can only capture users' behavior while they are within this site

  12. Agent-Based Data Collection • Understanding individual Web behavior requires client-based data collection • Results are useful • Better Personalized Service • Improved Web Page Organization • Better Pricing Policies • Methods • Applets can only read/write files on their source servers (a big security constraint) • Using Active Components (ActiveX Control) and Plug-Ins • APCS (Access Pattern Collection Server)

  13. APCS

  14. APCS

  15. APCS

  16. APCS

  17. APCS

  18. Agent-Based Data Collection • Very difficult to do for non-registered users in the current Web environment • Has to be conducted with users’ consent • Very dependent upon available Web technologies

  19. Web User Pattern Mining • Web user pattern mining aims to discover user access patterns in Web servers • Pattern discovery and analysis tools • Some existing Web tools provide mechanisms for reporting user activity in the servers • Web Trends (http://www.webtrends.com.tw/) • Open Market (http://www.openmarket.com/) • Net.Genesis (http://www.netgen.com/)

  20. Path Traversal Patterns Mining • Mining path traversal patterns in a distributed information-providing environment (WWW), where documents or objects are linked together (via hyperlinks) to facilitate interactive access • The solution procedure consists of three steps: • Convert the original sequence of log data into a set of maximal forward references (MF) • This filters out the effect of backward references, which are mainly made for ease of traveling, so that mining concentrates on meaningful user access sequences • Some objects are visited because of their locations rather than their content • Determine the frequent traversal patterns, i.e., large reference sequences, from the maximal forward references obtained • Determine the maximal reference sequences from the large reference sequences (trivial)

  21. Step1: MF References • Suppose the traversal log contains the following traversal path for a user: • A, B, C, D, C, B, E, G, H, G, W, A, O, U, O, V • When a backward reference occurs, the current forward reference path terminates • The set of maximal forward references is {ABCD, ABEGH, ABEGW, AOU, AOV}
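
The conversion on this slide can be sketched in a few lines of Python. This is a minimal reading of the rule stated above (a forward path ends when a backward reference occurs), not necessarily the exact procedure the authors implemented; the function name and the single-character page labels are assumptions for illustration.

```python
def maximal_forward_references(path):
    """Split one user's traversal path into maximal forward references."""
    results = []
    current = []            # pages on the current forward path
    moving_forward = True   # False while the user is backing up

    for page in path:
        if page in current:
            # Backward reference: the forward path built so far is maximal,
            # but only emit it once per run of backward moves.
            if moving_forward:
                results.append("".join(current))
            # Back up to the revisited page and continue from there.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            current.append(page)
            moving_forward = True

    # A path that ends while still moving forward is also maximal.
    if moving_forward and current:
        results.append("".join(current))
    return results


log = "A B C D C B E G H G W A O U O V".split()
mf_refs = maximal_forward_references(log)
print(mf_refs)  # ['ABCD', 'ABEGH', 'ABEGW', 'AOU', 'AOV']
```

Running it on the traversal path from the slide reproduces the five maximal forward references listed there.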

  22. Step1: Another Example

  23. Step1: Arrange Database Encoding

  24. Step1: Database Reduction

  25. Step2: Find Frequent Reference Sequences • Two algorithms for finding Frequent Traversal Patterns (Frequent Reference Sequences, Frequent Consecutive Subsequences) • Full-Scan (FS) Algorithm • FS utilizes key ideas of the DHP algorithm • Selective-Scan (SS) Algorithm • SS reduces the number of database scans
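
To make "frequent consecutive subsequences" concrete, here is a deliberately simplified level-wise sketch. It is not the FS or SS algorithm: it omits the hashing and database-reduction ideas borrowed from DHP and keeps only Apriori-style pruning. The function names, the absolute support threshold `min_count`, and the reuse of the five maximal forward references from Step 1 as input are assumptions for illustration.

```python
from collections import Counter


def consecutive_subsequences(seq, k):
    """All distinct length-k windows of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def frequent_reference_sequences(mf_refs, min_count):
    """Return {sequence: support} for every frequent consecutive subsequence."""
    frequent = {}
    prev_level = None
    k = 1
    while True:
        counts = Counter()
        for ref in mf_refs:
            for sub in consecutive_subsequences(ref, k):
                # Prune: a length-k sequence can only be frequent if both of
                # its length-(k-1) consecutive parts were frequent.
                if prev_level is None or (
                    sub[:-1] in prev_level and sub[1:] in prev_level
                ):
                    counts[sub] += 1
        level = {s: c for s, c in counts.items() if c >= min_count}
        if not level:
            break
        frequent.update(level)
        prev_level = level
        k += 1
    return frequent


mf_refs = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
large = frequent_reference_sequences(mf_refs, min_count=2)
print(sorted(large, key=lambda s: (len(s), s)))
```

Each pass over `mf_refs` corresponds to one database scan; FS and SS differ in how they cut down the candidates and the number of such scans.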

  26. Full-Scan (FS) Algorithm Generate L1 & Hash Table Scan DB-1

  27. Generate L1 & Hash Table (Scan DB-1) • h(x, y) = [(order of x) * 23 + (order of y)] mod 17
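
The hash function on this slide can be tried out directly. The sketch below assumes that "order of x" means the alphabetical position of a single-letter page name (A = 1, B = 2, ...), which the transcript does not state, and it uses the maximal forward references from Step 1 as sample input. In the DHP idea, a 2-sequence is kept as a candidate only if the bucket it hashes to has a count at least equal to the minimum support.

```python
from collections import Counter

NUM_BUCKETS = 17


def order(page):
    # Assumption: pages are single uppercase letters ordered A=1, B=2, ...
    return ord(page) - ord("A") + 1


def h(x, y):
    """Hash function from the slide: [(order of x) * 23 + (order of y)] mod 17."""
    return (order(x) * 23 + order(y)) % NUM_BUCKETS


def bucket_counts(mf_refs):
    """Count, per bucket, how many references contain a pair hashing there."""
    buckets = Counter()
    for ref in mf_refs:
        pairs = {ref[i:i + 2] for i in range(len(ref) - 1)}
        for pair in pairs:
            buckets[h(pair[0], pair[1])] += 1
    return buckets


def may_be_frequent(pair, buckets, min_count):
    """A pair whose bucket count is below min_count cannot be frequent."""
    return buckets[h(pair[0], pair[1])] >= min_count


mf_refs = ["ABCD", "ABEGH", "ABEGW", "AOU", "AOV"]
buckets = bucket_counts(mf_refs)
print(h("A", "B"), buckets[h("A", "B")])
print(may_be_frequent("AB", buckets, 2), may_be_frequent("CD", buckets, 2))
```

Because different pairs can share a bucket, a passing bucket count only makes a pair a candidate; its true support is still checked in the next scan.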

  28. Generate C2

  29. Generate L2 & Reduce DB Scan DB-2

  30. Generate L2 & Reduce DB Scan DB-2

  31. Generate C3, L3 & Reduce DB Scan DB-3

  32. Generate C4, L4 & Reduce DB Scan DB-4

  33. Selective-Scan (SS) Algorithm Scan DB-3

  34. Step 3: Generate Frequent Traversal Patterns • Maximal Reference Sequences
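
Step 3 keeps only those large reference sequences that are not contained in a longer one. A minimal sketch, assuming sequences are strings of single-character page names so that Python's substring test matches the "consecutive subsequence" notion; the input set is hypothetical sample data, not taken from the slides.

```python
def maximal_reference_sequences(large_sequences):
    """Keep the large sequences that are not inside another large sequence."""
    seqs = set(large_sequences)
    return sorted(
        s for s in seqs
        # `s in t` on strings is a consecutive-subsequence (substring) test.
        if not any(s != t and s in t for t in seqs)
    )


# Hypothetical large reference sequences for illustration.
large = ["A", "B", "E", "G", "O", "AB", "BE", "EG", "AO", "ABE", "BEG", "ABEG"]
maximal = maximal_reference_sequences(large)
print(maximal)  # ['ABEG', 'AO']
```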

  35. WAP-Mine Algorithm • The key consideration is how to facilitate the tedious support counting and candidate generating operations in the mining procedure • Given a Web Access Sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS

  36. WAP-Mine Algorithm • (1) Scan WAS once and find all frequent-1 events • (2) Scan WAS again and construct a WAP-tree • (3) Recursively mine the WAP-tree using conditional search to obtain the access patterns

  37. Find All Frequent-1 Events • Min_Sup = 75%
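
The first scan can be sketched as follows. The four access sequences are sample data assumed for this example (the slide's own table did not survive in the transcript), and the function name is hypothetical. An event counts once per sequence, no matter how often it repeats within that sequence.

```python
from collections import Counter


def frequent_events(sequences, min_sup):
    """Return the events that occur in at least min_sup of the sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))  # count each event once per sequence
    total = len(sequences)
    return {event for event, c in counts.items() if c / total >= min_sup}


# Assumed sample Web access sequences, one string per user session.
was = ["abdac", "eaebcac", "babfaec", "afbacfc"]
frequent = frequent_events(was, min_sup=0.75)
print(sorted(frequent))  # ['a', 'b', 'c']
```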

  38. WAP-Tree Construction • Using frequent events to register all count information for further mining
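
A minimal sketch of the second scan: each sequence is filtered down to its frequent events and inserted into a prefix tree whose nodes carry counts. The header table that links nodes with the same label in the full WAP-tree is omitted here, and the class name, function name, and sample sequences are assumptions for illustration.

```python
class WapNode:
    """One node of a simplified WAP-tree (no header-table links)."""

    def __init__(self, event=None):
        self.event = event
        self.count = 0
        self.children = {}


def build_wap_tree(sequences, frequent):
    """Insert the frequent-event subsequence of each sequence into the tree."""
    root = WapNode()
    for seq in sequences:
        node = root
        for event in seq:
            if event not in frequent:
                continue  # non-frequent events are dropped
            if event not in node.children:
                node.children[event] = WapNode(event)
            node = node.children[event]
            node.count += 1
    return root


# Assumed sample data; 'a', 'b', 'c' are taken to be the frequent events.
was = ["abdac", "eaebcac", "babfaec", "afbacfc"]
tree = build_wap_tree(was, frequent={"a", "b", "c"})
a_node = tree.children["a"]
print(a_node.count, a_node.children["b"].count)
```

Shared prefixes are stored once with a count, which is what lets the later conditional search avoid rescanning the original sequences.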

  39. Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on c • Generate Web Access Patterns: ac, bc

  40. Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on ac • Generate Web Access Patterns: aac, bac

  41. Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on bac • Generate Web Access Patterns: abac

  42. Mining Web Access Patterns from WAP-Tree • Conditional Sequence Based on abac • No Web Access Patterns are Generated

  43. Mining for Web Transactions • To capture Web customer buying behavior • It is not just a market basket transaction, i.e., the set of items bought by a customer in a single purchase (Association Rules) • It is not just Web user travel patterns (Path Traversal Patterns) • It is an extension of path traversal patterns • Exploring the relationship between traveling and buying

  44. Mining for Web Transactions • Web Transaction • Algorithm WR (Web-transaction-Record) • Web Transaction Records <Path: a Set of Purchases> • Algorithm WTM, MTSPJ, MTSPC • Frequent Transaction Patterns • Web Transaction Association Rules

  45. Mining for Web Transactions • Web-transaction-Record (WR) Algorithm • Extract meaningful Web transaction records from the given Web transaction • WTM (Web Transaction Mining) Algorithm • Mining Web Transaction Patterns • MTS (Maximal Transaction Segment) Algorithms are improved versions of WTM

  46. Mining for Web Transactions

  47. Mining for Web Transactions

  48. WTM Algorithm • It joins the purchased itemsets to generate candidate transaction patterns • WTM employs a two-level hash tree, called the Web transaction tree, to store candidate transaction patterns • WTM hashes not only each item but also each purchase in the path

  49. WTM Algorithm • Database of Web transactions:
      WT_ID  Path   Purchase
      100    ABCE   B{i1}, C{i2}, E{i4}
      100    ABFGH  B{i1}, H{i6}
      100    ASJL   S{i7}, L{i9}
      200    ABCE   B{i1}, C{i2}, E{i4}
      200    ASJLQ  S{i7}, Q{i10}
      300    ABCE   B{i1}, E{i4}
      300    ABFG   B{i1}, G{i5}
      300    ASJL   S{i7}, J{i8}, L{i9}
      400    ABD    D{i3}
      400    ABFG   G{i5}
      400    ASJLQ  S{i7}, J{i8}, Q{i10}
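
The database on this slide can be encoded directly to experiment with support counting. The slides do not define how a transaction pattern is matched, so the rule used below is an assumption: a Web transaction supports a pattern if one of its records has a path containing the pattern's path as a consecutive run and includes every purchase named in the pattern. This is plain counting, not the hash-tree-based WTM algorithm, and the function names are hypothetical.

```python
# Each WT_ID maps to its records: (path, {node: set of items bought there}).
DATABASE = {
    100: [
        ("ABCE", {"B": {"i1"}, "C": {"i2"}, "E": {"i4"}}),
        ("ABFGH", {"B": {"i1"}, "H": {"i6"}}),
        ("ASJL", {"S": {"i7"}, "L": {"i9"}}),
    ],
    200: [
        ("ABCE", {"B": {"i1"}, "C": {"i2"}, "E": {"i4"}}),
        ("ASJLQ", {"S": {"i7"}, "Q": {"i10"}}),
    ],
    300: [
        ("ABCE", {"B": {"i1"}, "E": {"i4"}}),
        ("ABFG", {"B": {"i1"}, "G": {"i5"}}),
        ("ASJL", {"S": {"i7"}, "J": {"i8"}, "L": {"i9"}}),
    ],
    400: [
        ("ABD", {"D": {"i3"}}),
        ("ABFG", {"G": {"i5"}}),
        ("ASJLQ", {"S": {"i7"}, "J": {"i8"}, "Q": {"i10"}}),
    ],
}


def record_supports(record, pattern_path, pattern_purchases):
    """Assumed matching rule: path containment plus purchase containment."""
    path, purchases = record
    if pattern_path not in path:
        return False
    return all(
        items <= purchases.get(node, set())
        for node, items in pattern_purchases.items()
    )


def support_count(db, pattern_path, pattern_purchases):
    """Number of Web transactions with at least one supporting record."""
    return sum(
        any(record_supports(r, pattern_path, pattern_purchases) for r in records)
        for records in db.values()
    )


print(support_count(DATABASE, "ABCE", {"B": {"i1"}, "E": {"i4"}}))
print(support_count(DATABASE, "ABFG", {"G": {"i5"}}))
```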

  50. Support Count
