
Web Usage Mining Classification



Presentation Transcript


  1. Web Usage Mining Classification • Fang Yao • MEMS 2002 • 185029 Humboldt Uni zu Berlin

  2. Contents: • Definition and Usages • Outputs of Classification • Methods of Classification • Application to EDOC • Discussion on Incomplete Data • Discussion Questions & Outlook

  3. Definition and Usages
  • Classification is a major data mining operation: given one target attribute (e.g. play), predict its value for new instances by means of the other available attributes.
  • Example of a learned pattern: "People with age less than 40 and salary > 40k trade on-line."
  • Usages: behavior prediction, improving Web design, personalized marketing, …

  4. Decision Tree • A small example: weather data (source: Witten & Frank, Table 1.2)

  outlook   temperature  humidity  windy  play
  sunny     hot          high      false  no
  sunny     hot          high      true   no
  overcast  hot          high      false  yes
  rainy     mild         high      false  yes
  rainy     cool         normal    false  yes
  rainy     cool         normal    true   …
  …         …            …         …      …

  [figure: decision tree with root "outlook" and inner nodes "windy" and "humidity"]

  5. Outputs of Classification
  [figure: decision tree: root "outlook" with branches sunny / overcast / rainy; sunny leads to "humidity", rainy leads to "windy"]
  Classification Rules:
  If outlook = sunny and humidity = high then play = no
  If outlook = rainy and windy = true then play = no
  If outlook = overcast then play = yes
  If humidity = normal then play = yes
  …
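The rule set reads directly as executable logic. A minimal sketch in Python (the function name is mine, not from the slides; only the attributes the rules actually test appear as parameters), checking the rules top to bottom:

```python
def predict_play(outlook, humidity, windy):
    """Apply the classification rules above, checked in the order listed."""
    if outlook == "sunny" and humidity == "high":
        return "no"
    if outlook == "rainy" and windy:
        return "no"
    if outlook == "overcast":
        return "yes"
    if humidity == "normal":
        return "yes"
    return None  # no rule fires; the slide's rule list is truncated ("...")

# First row of the weather data: sunny, high humidity, not windy
print(predict_play("sunny", "high", False))  # -> no
```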

  6. Methods (divide-and-conquer): constructing decision trees
  Step 1: select a splitting attribute.
  [figure: the four candidate splits (outlook, temperature, humidity, windy) with the yes/no labels of the instances reaching each branch]
  Gain(outlook) = 0.247 bits > Gain(humidity) = 0.152 bits > Gain(windy) = 0.048 bits > Gain(temperature) = 0.029 bits

  7. Methods (divide-and-conquer): constructing decision trees
  [figure: the outlook split with the class labels in each branch: overcast 4 yes / 0 no, rainy 3 yes / 2 no, sunny 2 yes / 3 no]
  Calculating the information gain:
  Gain(outlook) = info([9,5]) − info([4,0],[3,2],[2,3]) = 0.247 bits
  where info([4,0],[3,2],[2,3]) = (4/14)·info([4,0]) + (5/14)·info([3,2]) + (5/14)·info([2,3])
  is the informational value of creating a branch on "outlook".

  8. Methods (divide-and-conquer): calculating information
  Formula for information value:
  entropy(p1, p2, …, pn) = −p1 log p1 − p2 log p2 − … − pn log pn
  • Logarithms are expressed in base 2, so the unit is 'bits'.
  • The arguments p1, …, pn are fractions that add up to 1.
  Example: info([2,3]) = entropy(2/5, 3/5) = −(2/5) log(2/5) − (3/5) log(3/5) = 0.97 bits
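The entropy and gain numbers on slides 6 to 8 are easy to verify. A minimal Python sketch (function names are mine), reproducing info([2,3]) = 0.97 and Gain(outlook) = 0.247:

```python
from math import log2

def entropy(*counts):
    """entropy(p1,...,pn) = -sum p_i * log2(p_i), with p_i = count_i / total."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def info(*branches):
    """Weighted average entropy over the branches of a split.

    Each argument is the list of class counts in one branch, e.g.
    info([4, 0], [3, 2], [2, 3]) for the three outlook branches."""
    total = sum(sum(branch) for branch in branches)
    return sum(sum(branch) / total * entropy(*branch) for branch in branches)

print(round(entropy(2, 3), 2))                         # 0.97, i.e. info([2,3])
gain_outlook = entropy(9, 5) - info([4, 0], [3, 2], [2, 3])
print(round(gain_outlook, 3))                          # 0.247 bits
```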

  9. Methods (divide-and-conquer): calculating information

  10. Methods (divide-and-conquer): constructing decision trees
  [figure: partial tree: outlook at the root; the overcast branch already ends in "yes", while the sunny and rainy branches are still unresolved (?)]

  11. Methods (divide-and-conquer)
  Step 2: select a daughter attribute for the branch outlook = sunny.
  [figure: candidate splits of the five sunny instances on temperature, humidity and windy]
  Gain(humidity) = 0.971 bits > Gain(temperature) = 0.571 bits > Gain(windy) = 0.020 bits
  Do this recursively!

  12. Methods (divide-and-conquer): constructing decision trees
  [figure: final tree: outlook at the root; sunny → humidity (high → no, normal → yes); overcast → yes; rainy → windy (false → yes, true → no)]
  Stop rules:
  • stop when all leaf nodes are pure
  • stop when no more attributes can be split
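The whole divide-and-conquer procedure of slides 6 to 12 fits in a few lines. A minimal ID3-style sketch under my own naming, assuming nominal attributes, instances given as dicts, and attrs passed as a set; it implements exactly the two stop rules above and should reproduce the tree above when run on the weather data:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain of splitting the rows on one nominal attribute."""
    g = entropy(labels)
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        g -= len(subset) / len(labels) * entropy(subset)
    return g

def build_tree(rows, labels, attrs):
    """Divide and conquer: split on the best attribute, recurse per branch."""
    if len(set(labels)) == 1:            # stop rule 1: all leaf nodes are pure
        return labels[0]
    if not attrs:                        # stop rule 2: no attribute left to split
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    tree = {}
    for value in set(row[best] for row in rows):
        branch = [(row, lab) for row, lab in zip(rows, labels) if row[best] == value]
        brows, blabels = zip(*branch)
        tree[value] = build_tree(list(brows), list(blabels), attrs - {best})
    return {best: tree}
```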

  13. Methods (C4.5) • Why C4.5?
  • Real-world data is more complicated: numeric attributes, missing values.
  • The final solution needs more operations: pruning, converting trees to rules.

  14. Methods (C4.5)
  • Numeric attributes: binary splits, with numeric thresholds halfway between the observed values
  • Missing values: ignoring them loses information; instead, split affected instances into partial instances
  • Pruning the decision tree: subtree replacement and subtree raising
  [figure: subtree-raising example on a small tree with nodes A, B, C (before and after)]
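The candidate thresholds "halfway between the values" for a numeric attribute can be sketched in two lines (the helper name is mine; the sample values are only for illustration):

```python
def candidate_thresholds(values):
    """Midpoints between adjacent distinct sorted values: the cut points
    a C4.5-style binary split would consider for one numeric attribute."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

print(candidate_thresholds([64, 65, 68, 69, 70]))  # [64.5, 66.5, 68.5, 69.5]
```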

  15. Application in WEKA

  16. Application in WEKA
  Objective: prediction of dissertation reading
  Data: clickstream from the log of EDOC on 30th March
  Method: J4.8 algorithm
  Attributes (all binary {1,0}): HIST-DISS, OT-PUB-READ, OT-CONF, SH-START, SH-DOCSERV, SH-DISS, OT-BOOKS, SH-START-E, HOME, AU-START, DSS-LOOKUP, SH-OTHER, OTHER, AUHINWEISE, DSS-RVK, AUTBERATUNG, DSS-ABSTR
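J4.8 is WEKA's Java reimplementation of C4.5. For readers without WEKA, a comparable experiment can be sketched in Python with scikit-learn's entropy-based decision tree; note this is a swapped-in learner, not the one used on these slides, and the file name is an assumption:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# "edoc_clickstream.csv" is an assumed file name: one row per session,
# one 0/1 column per attribute listed above, DSS-ABSTR as the target.
data = pd.read_csv("edoc_clickstream.csv")
X = data.drop(columns=["DSS-ABSTR"])
y = data["DSS-ABSTR"]

tree = DecisionTreeClassifier(criterion="entropy")  # information gain, as in C4.5
tree.fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # tree in readable form
```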

  17. Application in WEKA
  Result: [figure: J4.8 output for the target attribute DSS-ABSTR]

  18. Application in WEKA
  [figure: J4.8 output for the target attribute DSS-LOOKUP]

  19. Discussion on Incomplete Data
  Idea: site-centric vs. user-centric data. Models built from incomplete (site-centric) data are inferior to those built from complete (user-centric) data.
  Example:
  User-centric data:
  User1: Cheaptickets1, Cheaptickets2, Travelocity1, Travelocity2, Expedia1, Expedia2, Travelocity3, Travelocity4, Expedia3, Cheaptickets3
  User2: Expedia1, Expedia2, Expedia3, Expedia4
  Site-centric data (as seen by Expedia):
  User1: Expedia1, Expedia2, Expedia3
  User2: Expedia1, Expedia2, Expedia3, Expedia4
  Source: Padmanabhan, Zheng & Kimbrough (2001)
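The truncation can be made concrete: a site-centric log is what remains of the user-centric stream after filtering to one site's pages. A minimal sketch (the function name is mine), reproducing User1's site-centric view above:

```python
def site_centric(clickstream, site):
    """Keep only the events a single site observes in its own server log."""
    return [page for page in clickstream if page.startswith(site)]

user1 = ["Cheaptickets1", "Cheaptickets2", "Travelocity1", "Travelocity2",
         "Expedia1", "Expedia2", "Travelocity3", "Travelocity4",
         "Expedia3", "Cheaptickets3"]
print(site_centric(user1, "Expedia"))  # ['Expedia1', 'Expedia2', 'Expedia3']
```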

  20. Discussion on Incomplete Data
  Results: [figure: lift curves comparing models built from site-centric and user-centric data; source: Padmanabhan, Zheng & Kimbrough (2001), figures 6.6 to 6.9]
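For reference, a lift curve of this kind ranks sessions by predicted score and plots the fraction of all positives captured against the fraction of sessions targeted. A self-contained sketch with hypothetical scores and outcomes, not the paper's data:

```python
def lift_curve(scores, labels):
    """Return (fraction targeted, fraction of positives captured) pairs,
    targeting sessions in order of decreasing predicted score."""
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    total_pos = sum(labels)
    captured, points = 0, []
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((i / len(ranked), captured / total_pos))
    return points

# Hypothetical predicted scores and 0/1 outcomes for six sessions
print(lift_curve([0.9, 0.8, 0.3, 0.7, 0.2, 0.6], [1, 1, 0, 0, 0, 1]))
```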

  21. Discussion Questions & Outlook
  • What is the proper target attribute for an analysis of a non-profit site?
  • What data would we prefer to have?
  • Which improvements could be made to the data?

  22. References
  • Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. San Diego, CA: Academic Press. Sections 3.1-3.3, 4.3, 6.1.
  • Padmanabhan, B., Zheng, Z., & Kimbrough, S. (2001). "Personalization from Incomplete Data: What You Don't Know Can Hurt."
  • http://www.cs.cmu.edu/~awm/tutorials
