1 / 17

Mining Click-stream Data With Statistical and Rule-based Methods

Mining Click-stream Data With Statistical and Rule-based Methods. Martin L absk ý, Vladimír Laš , Petr Berka University of Economics, Prague. The C lickstream D ata. 3 617 171 page requests containing : unix time; IP address; session ID; page request; referer

Download Presentation

Mining Click-stream Data With Statistical and Rule-based Methods

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Mining Click-stream Data With Statistical and Rule-based Methods Martin Labský, Vladimír Laš, Petr Berka Universityof Economics, Prague

  2. The Clickstream Data • 3 617 171 page requests containing: unix time; IP address; session ID; page request; referer • 522 410 sessions, out of them only 203 887 with length > 1 • 100 000 training set • 60 000 testing set • 43 887 heldout set Discovery Challenge 2005

  3. Clickstream Data Preprocessing unix time ;IP address ; session ID ; page request; referee 1074589200;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=124 ;www.google.cz; 1074589201;194.213.35.234;3995b2c0599f1782e2b40582823b1c94;/dp/?id=182 ; 1074589202;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/ ;www.seznam.cz; 1074589233;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/dp/?id=148 ;/dp/?id=124; 1074589245;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/ ;/dp/?id=148; 1074589248;194.138.39.56 ;2fd3213f2edaf82b27562d28a2a747aa;/contacts/ ; /; 1074589290;193.179.144.2 ;1993441e8a0a4d7a4407ed9554b64ed1;/sb/ ;/sb/; • Sequences of page visits in each session (same sessionID) were constructed from the www log data • Sequences of page types [start, dp, dp, sb, sb, end] • Sequences of products [start, 124, 148] Discovery Challenge 2005

  4. Predicting New Page in a Sequence • Problem • Observing a sequence of pages A1A2…An-1 what will be the next page An? • Methods • Markov n-gram models • Decision rules Discovery Challenge 2005

  5. Markov N-gram Predictor (1/5) • Probability of a sequence A1A2….An each term (interpolated k-gram distribution) computed as Discovery Challenge 2005

  6. Markov N-gram Predictor (2/5) where n(xy) is the occurrence of sequence xy in data and Discovery Challenge 2005

  7. Markov N-gram Predictor (3/5) • Building model using EM algorithm 1. compute Pi(i=1,…,k) from counts of sequences observed in the training set DTR 2. assign non-zero initial values to weights i 3. repeat 3.1 compute the probability of the holdout set using the interpolated distribution 3.2 modify the weights Discovery Challenge 2005

  8. Markov N-gram Predictor (4/5) • Results for page types Discovery Challenge 2005

  9. Markov N-gram Predictor (5/5) • Results for product types Discovery Challenge 2005

  10. Rule Induction Algorithms (1/5) • “Classical “ Decision rules in the form Ant => Class (p) where Ant is a conjunction of values of input attributes, p = n(Ant Class)/n(Ant) • Decision rules for clickstreams are in the form Ant => page (p) where Ant is a sequence of pages, p = n(Ant//page)/n(Ant) Discovery Challenge 2005

  11. Rule induction algorithms (2/5) • Compositional algorithm • Set-covering algorithm 1. find a rule that covers some positive examples and no negative example of a given class (concept) 2. remove covered examples from the training set DTR 3. if DTR contains some positive examples (not covered so far) goto 1, else end 1. add empty rule to the rule set KB 2. repeat 2.1 find by rule specialization a rule Ant => Class that fulfils the user given criteria on lengths and validity 2.2 if this rule significantly improves the set of rules KB build so far (we test using the chi2 test the difference between the rule validity and the result of classification of an example covered by Ant) then add the rule to KB Discovery Challenge 2005

  12. Rule induction algorithms (3/5) • Compositional algorithm • Set-covering algorithm rule specialization extends the antecedent sequence by any sequence member from left the decision if the antecedent sequence should be specialized is made by a chi2 test adding a rule CDX to the rule DX changes the rule DX into a rule meaning (D but not CD)X rule specialization extends the antecedent sequence by any sequence member from left the decision if new rule should be added is made by a chi2 test Discovery Challenge 2005

  13. Rule induction algorithms (4/5) Rule-based classification • Set-covering algorithm • apply single rule • Compositional algorithm • combine conritbutions of all applicable rules using pseudo-bayesian formula Discovery Challenge 2005

  14. Rule induction algorithms (5/5) Examples of rules • For the page types sequences • dp, sb -> sb (Ant: 5174; AntSuc: 4801; P: 93%) • ct -> end (Ant: 5502; AntSuc: 1759; P: 32%) • faq -> help (Ant: 594; AntSuc: 127; P: 21%) • For the products sequences • loud-speakers -> video (Ant: 14840, AntSuc: 3785, P: 26%) • data cables -> telephones (Ant: 2560, AntSuc: 565, P: 22%) • PC peripheries -> telephones (Ant: 8671, AntSuc: 1823, P: 21%) Discovery Challenge 2005

  15. Results of testing Discovery Challenge 2005

  16. Conclusions • Comparison of methods • N-gram models as exhaustive sets of compositional rules Pi(c|a…b) ~ a…b  c • Set covering algorithm for exhaustive non comp. rules • Compositional algorithm for non exhaustive comp. rules • Comparison of results • N-gram comparable with set covering (slightly better), worst results for compositional algorithm • All algorithms can be applied by web servers to recommend relevant pages to their users, and to identify interesting patterns in their log files. Discovery Challenge 2005

  17. Thank you for your attention.

More Related