Web Usage Mining Classification • Fang Yao • MEMS 2002 • 185029 Humboldt Uni zu Berlin
Contents: • Definition and the Usages • Outputs of Classification • Methods of Classification • Application to EDOC • Discussion on Incomplete Data • Discussion Questions & Outlook
Definition and the Usages

Classification: a major data mining operation.
• Given one target attribute (e.g. play), predict its value for new instances by means of the other available attributes.
• Example: "People with age less than 40 and salary > 40k trade on-line."

Usages:
• behavior prediction
• improving Web design
• personalized marketing
• ……
Decision Tree: A Small Example (Weather Data, source: Witten & Frank, Table 1.2)

| outlook  | temperature | humidity | windy | play |
|----------|-------------|----------|-------|------|
| sunny    | hot         | high     | false | no   |
| sunny    | hot         | high     | true  | no   |
| overcast | hot         | high     | false | yes  |
| rainy    | mild        | high     | false | yes  |
| rainy    | cool        | normal   | false | yes  |
| rainy    | cool        | normal   | true  | no   |
| ….       | ….          | ….       | ….    | ….   |
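To make the running example concrete, here is the weather data as a small Python structure. This is an illustrative sketch: the variable names are invented, and the rows hidden behind the "…." on the slide are filled in from the canonical table in Witten & Frank.

```python
# The weather data as Python records (rows beyond the slide's excerpt are
# taken from the canonical dataset in Witten & Frank, Table 1.2).
ATTRS = ("outlook", "temperature", "humidity", "windy")  # candidate splits

weather = [
    # (outlook, temperature, humidity, windy, play)
    ("sunny",    "hot",  "high",   "false", "no"),
    ("sunny",    "hot",  "high",   "true",  "no"),
    ("overcast", "hot",  "high",   "false", "yes"),
    ("rainy",    "mild", "high",   "false", "yes"),
    ("rainy",    "cool", "normal", "false", "yes"),
    ("rainy",    "cool", "normal", "true",  "no"),
    ("overcast", "cool", "normal", "true",  "yes"),
    ("sunny",    "mild", "high",   "false", "no"),
    ("sunny",    "cool", "normal", "false", "yes"),
    ("rainy",    "mild", "normal", "false", "yes"),
    ("sunny",    "mild", "normal", "true",  "yes"),
    ("overcast", "mild", "high",   "true",  "yes"),
    ("overcast", "hot",  "normal", "false", "yes"),
    ("rainy",    "mild", "high",   "true",  "no"),
]
# Class distribution over all 14 instances: 9 yes, 5 no.
```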
Outputs of Classification

Decision tree:
  outlook = sunny    → humidity: high → no, ….
  outlook = overcast → yes
  outlook = rainy    → windy: true → no, false → ….

Classification rules:
• If outlook = sunny and humidity = high then play = no
• If outlook = rainy and windy = true then play = no
• If outlook = overcast then play = yes
• If humidity = normal then play = yes
• …
Methods: divide-and-conquer (constructing decision trees)

Step 1: select a splitting attribute. Each of the four attributes (outlook, temperature, humidity, windy) is tried as the root, and the class (yes/no) distribution down each branch determines its information gain:

• Gain(outlook) = 0.247 bits (branches: sunny [2 yes, 3 no], overcast [4 yes, 0 no], rainy [3 yes, 2 no])
• Gain(humidity) = 0.152 bits
• Gain(windy) = 0.048 bits
• Gain(temperature) = 0.029 bits

Outlook has the highest gain, so it becomes the root.
Methods: divide-and-conquer (constructing decision trees)

Calculating the information gain of the outlook split:

Gain(outlook) = info([9,5]) − info([2,3],[4,0],[3,2]) = 0.247 bits

where info([2,3],[4,0],[3,2]) = (5/14)·info([2,3]) + (4/14)·info([4,0]) + (5/14)·info([3,2])

is the informational value of creating a branch on "outlook" (sunny [2 yes, 3 no], overcast [4 yes, 0 no], rainy [3 yes, 2 no]).
Methods: divide-and-conquer (calculating information)

Formula for information value (entropy):

entropy(p₁, p₂, …, pₙ) = −p₁·log p₁ − p₂·log p₂ − … − pₙ·log pₙ

• logarithms are expressed in base 2, so the unit is bits
• the arguments p₁, …, pₙ are fractions that add up to 1

Example: info([2,3]) = entropy(2/5, 3/5) = −(2/5)·log₂(2/5) − (3/5)·log₂(3/5) = 0.971 bits
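The formula translates into a few lines of Python; this sketch (with illustrative function names) reproduces the info([2,3]) example:

```python
from math import log2

def entropy(*fractions):
    """entropy(p1, ..., pn) = -p1*log2(p1) - ... - pn*log2(pn), with 0*log 0 := 0."""
    return -sum(p * log2(p) for p in fractions if p > 0)

def info(counts):
    """Entropy of a class distribution given as raw counts, e.g. info([2, 3])."""
    total = sum(counts)
    return entropy(*(k / total for k in counts))

print(round(info([2, 3]), 3))  # entropy(2/5, 3/5) = 0.971 bits
print(round(info([9, 5]), 3))  # whole weather data: info([9,5]) = 0.940 bits
```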
Methods: divide-and-conquer (calculating information)
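A minimal sketch of the gain computation itself, building on the info helper and the weather records above (names are illustrative); it reproduces all four gains from the previous slides:

```python
from collections import Counter, defaultdict

def gain(data, attr_index, class_index=-1):
    """Gain(A) = info(parent class counts) - weighted average info of the
    class counts in each branch created by splitting on attribute A."""
    parent = Counter(row[class_index] for row in data)
    branches = defaultdict(Counter)
    for row in data:
        branches[row[attr_index]][row[class_index]] += 1
    n = len(data)
    after = sum(sum(c.values()) / n * info(list(c.values()))
                for c in branches.values())
    return info(list(parent.values())) - after

for i, name in enumerate(ATTRS):
    print(f"Gain({name}) = {gain(weather, i):.3f} bits")
# Gain(outlook) = 0.247, Gain(temperature) = 0.029,
# Gain(humidity) = 0.152, Gain(windy) = 0.048
```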
Methods: divide-and-conquer (constructing decision trees)

Tree after the first split:
  outlook = sunny    → ?
  outlook = overcast → yes
  outlook = rainy    → ?
Methods: divide-and-conquer

Step 2: select a daughter attribute for the branch outlook = sunny (five instances remain):

• Gain(humidity) = 0.971 bits
• Gain(temperature) = 0.571 bits
• Gain(windy) = 0.020 bits

Humidity wins. Do this recursively!
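Applied only to the five outlook = sunny instances, the same gain function from above reproduces these numbers (a usage sketch):

```python
sunny = [row for row in weather if row[0] == "sunny"]
for i, name in enumerate(ATTRS):
    if name != "outlook":  # outlook is constant on this branch
        print(f"Gain({name}) = {gain(sunny, i):.3f} bits")
# Gain(temperature) = 0.571, Gain(humidity) = 0.971, Gain(windy) = 0.020
```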
Methods: divide-and-conquer (constructing decision trees)

The final tree:
  outlook = sunny    → humidity: high → no, normal → yes
  outlook = overcast → yes
  outlook = rainy    → windy: true → no, false → yes

Stop rules:
• stop when all leaf nodes are pure
• stop when no more attributes can be split on
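The whole divide-and-conquer procedure, including both stop rules, fits in a short recursive sketch (an ID3-style illustration building on the helpers above, not WEKA's actual implementation):

```python
def build_tree(data, attrs):
    classes = Counter(row[-1] for row in data)
    if len(classes) == 1:                    # stop rule 1: the node is pure
        return next(iter(classes))
    if not attrs:                            # stop rule 2: nothing left to split on
        return classes.most_common(1)[0][0]  # fall back to the majority class
    # Otherwise split on the attribute with the highest information gain.
    best = max(attrs, key=lambda a: gain(data, ATTRS.index(a)))
    i = ATTRS.index(best)
    rest = [a for a in attrs if a != best]
    return {best: {v: build_tree([r for r in data if r[i] == v], rest)
                   for v in {row[i] for row in data}}}

print(build_tree(weather, list(ATTRS)))
# (branch order may vary)
# {'outlook': {'sunny': {'humidity': {'high': 'no', 'normal': 'yes'}},
#              'overcast': 'yes',
#              'rainy': {'windy': {'true': 'no', 'false': 'yes'}}}}
```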
Methods: C4.5

Why C4.5?
• Real-world data is more complicated:
  - numeric attributes
  - missing values
• The final solution needs more operations:
  - pruning
  - from trees to rules
Methods: C4.5

• Numeric attributes: binary splits, with numeric thresholds placed halfway between the observed values
• Missing values: ignoring them loses information; instead, split such instances into partial instances along the branches
• Pruning the decision tree:
  - subtree replacement
  - subtree raising
[Slide figure: subtree raising, where a subtree C is raised to replace its parent B under the root A]
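For a numeric attribute, the threshold search can be sketched like this (an illustrative helper reusing info from above; candidate thresholds sit halfway between adjacent distinct values):

```python
def best_numeric_split(values, classes):
    """Return (threshold, gain) of the best binary split value <= t / value > t."""
    pairs = sorted(zip(values, classes))
    n = len(pairs)
    base = info(list(Counter(classes).values()))
    best_t, best_g = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue
        t = (lo + hi) / 2                    # halfway between the two values
        left = Counter(c for v, c in pairs if v <= t)
        right = Counter(c for v, c in pairs if v > t)
        g = base - (sum(left.values()) / n * info(list(left.values()))
                    + sum(right.values()) / n * info(list(right.values())))
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Illustrative numeric temperatures paired with the play labels:
temps = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 75, 72, 81, 71]
print(best_numeric_split(temps, [row[-1] for row in weather]))
```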
Application in WEKA
Application in WEKA

• Objective: prediction of dissertation reading
• Data: clickstream from the EDOC server log of 30 March
• Method: the J4.8 algorithm (WEKA's J48 implementation of C4.5)
• Attributes (all binary {1,0}): HIST-DISS, OT-PUB-READ, OT-CONF, SH-START, SH-DOCSERV, SH-DISS, OT-BOOKS, SH-START-E, HOME, AU-START, DSS-LOOKUP, SH-OTHER, OTHER, AUHINWEISE, DSS-RVK, AUTBERATUNG, DSS-ABSTR
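The session data would go to WEKA as an ARFF file along these lines. This is a hypothetical sketch: the file name and data rows are invented, only a subset of the attributes is shown, and by default WEKA treats the last attribute as the class.

```
% edoc.arff -- hypothetical sketch of the EDOC clickstream encoding
@relation edoc-sessions

@attribute HIST-DISS   {1,0}
@attribute OT-PUB-READ {1,0}
@attribute DSS-LOOKUP  {1,0}
@attribute DSS-ABSTR   {1,0}

@data
1,0,1,1
0,1,0,0
```

A file like this could then be handed to the J48 learner, e.g. on the command line as `java weka.classifiers.trees.J48 -t edoc.arff`, which prints the learned tree and its evaluation.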
Application in WEKA

Results: [slide figures: the J48 decision trees produced for the target attributes DSS-ABSTR and DSS-LOOKUP]
Discussion on Incomplete Data

Idea: site-centric vs. user-centric data. Models learned from incomplete (site-centric) data are inferior to models learned from complete (user-centric) data.

Example:
User-centric data:
• User 1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
• User 2: Expedia1, Expedia2, Expedia3, Expedia4

Site-centric data (as seen by Expedia alone):
• User 1: Expedia1, Expedia2, Expedia3
• User 2: Expedia1, Expedia2, Expedia3, Expedia4

Source: Padmanabhan, Zheng & Kimbrough (2001)
Discussion on Incomplete Data

Results: [slide figure: lift curves comparing models built from site-centric and user-centric data; source: Padmanabhan, Zheng & Kimbrough (2001), Figures 6.6-6.9]
Discussion Questions & Outlook
• What is the proper target attribute for the analysis of a non-profit site?
• What data would we prefer to have?
• Which improvements could be made to the data?
References:
• Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3, 4.3, 6.1.
• Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
• http://www.cs.cmu.edu/~awm/tutorials