September, 13th gR2002, Vienna PAOLO GIUDICI Faculty of Economics, University of Pavia

ASSOCIATION MODELS FOR WEB MINING
Research carried out within the laboratory: Statistical models for data mining (SMDM)

A small sample of web clickstream data (from a logfile)

Analysis of web clickstream data
1. In data matrix form (Giudici and Castelo, 2001; Blanc and Giudici, 2001):
• Association measures
• Association models (graphical association models)
2. In transactional data form (in this talk):
• Association and sequence rules
• Statistical models for sequences

Association measures and models
Based on data arranged in contingency table form. For instance:
• Odds ratios
• Graphical loglinear models
• Recursive logistic regression models
For a review, see Giudici, Applied Data Mining, Wiley, 2003.
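As a small illustration of the first measure listed above, the sample odds ratio of a 2x2 contingency table for two binary page-visit indicators can be sketched as follows. The function name and the counts are hypothetical and are not taken from the talk's data.

```python
def odds_ratio(n11, n10, n01, n00):
    """Sample odds ratio of a 2x2 table for two binary page-visit indicators.

    n11: sessions visiting both pages, n10: only the first page,
    n01: only the second page, n00: neither page.
    """
    if n10 == 0 or n01 == 0:
        raise ValueError("odds ratio undefined when an off-diagonal count is zero")
    return (n11 * n00) / (n10 * n01)


# Hypothetical counts, for illustration only.
theta = odds_ratio(n11=30, n10=10, n01=20, n00=40)
print(theta)
```

A value above 1 suggests positive association between the two visits, a value near 1 suggests independence.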

Association and sequence rules
Implemented in the main data mining software packages.
Based on transactional databases.
Such databases arise, for instance, in:
• Market basket analysis (order does not matter)
• Web clickstream analysis (order matters)
Aim: search for itemsets (groups of events) that occur simultaneously with high frequency.
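The stated aim can be sketched as a level-wise search that keeps only itemsets whose relative frequency reaches a threshold, and builds larger candidates from the survivors. This is a simplified sketch for the case where order does not matter, not the implementation used in any particular software; the function name, baskets and threshold are assumptions for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets with relative frequency >= min_support."""
    n = len(transactions)
    sets = [frozenset(t) for t in transactions]
    items = sorted({i for t in sets for i in t})
    current = [frozenset([i]) for i in items]
    result = {}
    while current:
        # Share of transactions containing each candidate itemset.
        level = {}
        for c in current:
            s = sum(1 for t in sets if c <= t) / n
            if s >= min_support:
                level[c] = s
        result.update(level)
        # Candidates one item larger, built only from surviving itemsets.
        survivors = list(level)
        current = list({a | b for a, b in combinations(survivors, 2)
                        if len(a | b) == len(a) + 1})
    return result


# Hypothetical market baskets, for illustration only.
baskets = [
    {"milk", "coffee", "bread"},
    {"milk", "coffee"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
]
freq = frequent_itemsets(baskets, min_support=0.5)
```

Building candidates only from frequent itemsets is safe because every subset of a frequent itemset is itself frequent.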

Formally:
• A1, ..., Ap: p binary random variables.
• Itemset: a logical expression such as A = (A_j1 = 1, ..., A_jk = 1), with k < p.
• Association rule: a logical relationship between two itemsets, e.g. if A, then B.
• Example: A = (Milk, Coffee), B = (Bread, Biscuits)
• Sequence rule: the relationship is determined by a temporal order.
• Example: A = (Home, Register), B = (P_info)

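Interestingness of a rule A → B
• Support = P(A and B), the relative frequency of transactions containing both A and B
• Confidence = Support (A → B) / Support (A) = P(B | A)
• Lift = Confidence / Support (B)
Apriori search algorithm (Agrawal et al., 1995): based on the support.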
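The three measures can be computed directly from counts of transactions. The sketch below treats the unordered (market basket) case and uses the usual frequency-based definitions; the function name and the baskets are hypothetical.

```python
def rule_measures(transactions, antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))
    n_b = sum(1 for t in transactions if b <= set(t))
    n_ab = sum(1 for t in transactions if (a | b) <= set(t))
    if n_a == 0 or n_b == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_ab / n            # relative frequency of A and B together
    confidence = n_ab / n_a       # support(A -> B) / support(A)
    lift = confidence / (n_b / n) # confidence / support(B)
    return support, confidence, lift


# Hypothetical market baskets, for illustration only.
baskets = [
    {"milk", "coffee", "bread"},
    {"milk", "coffee"},
    {"bread", "biscuits"},
    {"milk", "bread", "biscuits"},
]
support, confidence, lift = rule_measures(baskets, {"milk"}, {"coffee"})
```

A lift above 1 indicates that the consequent is more frequent among transactions containing the antecedent than in the whole database.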

Application to real data
Data set from a logfile of an e-commerce site, kindly supplied by SAS.
Contains the user id (C_VALUE), the time of connection (C_TIME) and the page viewed (C_CALLER).
Number of clicks: 21889; number of visitors (sessions): 1240.

Exploratory step (data selected from a cluster of visitors, N. 3)

Variable   Cluster 1   Cluster 2   Cluster 3   Cluster 4   Overall mean
N.obs      8802        2859        1240        9251
CLICKS     8           22          18          8           10
LENGTH     6 min       17 min      59 min      6 min       10 min
start      h. 18       h. 15       h. 13       h. 10       h. 14
%PURCH     0.034       0.241       0.194       0.039       0.072

Remark
Data could have been transformed from transactional to data matrix format. Doing so, the information on the order of the visited pages would have been lost.
Data matrix format for the considered data:
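To make the remark concrete with made-up sessions (not the talk's data), the sketch below turns ordered click paths into one binary row per visitor. It assumes the data matrix holds one 0/1 indicator per page; the function name and page names are hypothetical.

```python
def to_data_matrix(sessions):
    """Map {visitor id: ordered list of pages} to binary page indicators."""
    pages = sorted({p for path in sessions.values() for p in path})
    rows = {
        visitor: [1 if p in path else 0 for p in pages]
        for visitor, path in sessions.items()
    }
    return pages, rows


# Two hypothetical visitors who see the same pages in a different order.
sessions = {
    "u1": ["home", "register", "p_info"],
    "u2": ["p_info", "home", "register"],
}
pages, rows = to_data_matrix(sessions)
print(pages)
print(rows)
```

Both visitors end up with the same row, which is exactly the loss of order information noted in the remark.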

Application of the apriori algorithm
Most frequent indirect sequences of order 2
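A minimal sketch of how the support of indirect sequences of order 2 could be counted, assuming that an indirect sequence A → B means that B is visited at some point after A in the same session, not necessarily immediately. The function names and sessions are hypothetical, and no support threshold is applied here.

```python
from collections import Counter


def indirect_pairs(path):
    """Ordered pairs (a, b) with b visited some time after a in one session."""
    seen, pairs = set(), set()
    for page in path:
        pairs.update((a, page) for a in seen if a != page)
        seen.add(page)
    return pairs


def indirect_sequence_support(sessions):
    """Share of sessions containing each indirect sequence of order 2."""
    n = len(sessions)
    counts = Counter()
    for path in sessions:
        counts.update(indirect_pairs(path))
    return {pair: c / n for pair, c in counts.items()}


# Hypothetical sessions, for illustration only.
sessions = [
    ["home", "program", "product"],
    ["home", "product"],
    ["program", "home"],
]
support = indirect_sequence_support(sessions)
```

Each pair is counted at most once per session, so the result is a share of sessions rather than a share of clicks.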

Most frequent indirect sequences of any order

Proposal: direct sequences
• Only “subsequent” (consecutive) visits are considered
• We have inserted two fictitious (deterministic) pages: (start_session; end_session)
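The proposal can be sketched as follows: each session is padded with the two fictitious pages, and only consecutive pairs are counted. Support and confidence are taken here as shares of sessions, which is an assumption; the function name and sessions are hypothetical.

```python
from collections import Counter


def direct_rules(sessions):
    """Support and confidence of direct (consecutive) sequences of order 2."""
    n = len(sessions)
    pair_sessions, page_sessions = Counter(), Counter()
    for path in sessions:
        # Pad with the two fictitious pages so entry and exit become rules too.
        full = ["start_session", *path, "end_session"]
        pair_sessions.update(set(zip(full, full[1:])))
        page_sessions.update(set(full))
    return {
        (a, b): (c / n, c / page_sessions[a])  # (support, confidence)
        for (a, b), c in pair_sessions.items()
    }


# Hypothetical sessions, for illustration only.
sessions = [
    ["home", "program", "product"],
    ["home", "product"],
    ["program", "home"],
]
rules = direct_rules(sessions)
```

Because start_session occurs in every session, the confidence of a rule starting from it equals its support.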

Most frequent direct sequences of order 2

Towards a global model: graphical representation of direct association rules

Link analysis representation

Global models for web mining
Sequence rules are an instance of a local model (or pattern, see Hand et al., 2001) of data mining. A local model draws statistical conclusions on parts of the dataset, rather than on the whole.
Link analysis is an example of a global descriptive model.
We have considered two global inferential models:
- probabilistic expert systems
- Markov chains

Probabilistic expert systems
Graphical models that describe (recursive) dependencies between (binary) random variables.
They can be described by a directed conditional independence graph, which specifies the factorisation of the joint probability distribution.
They ARE NOT directly comparable with sequence rules, which are local indexes for studying dependencies between events (itemsets).
They are built from contingency table data, and thus DO NOT model the order of visits to pages.
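The factorisation implied by a directed graph is the product, over the variables, of each variable's probability given its parents. A minimal sketch for binary variables follows; the graph, the conditional probabilities and the function name are hypothetical and do not come from the talk.

```python
from itertools import product


def joint_probability(assignment, parents, cpt):
    """P(x) as the product of P(x_j | parents(x_j)) over binary variables.

    cpt[node] maps a tuple of parent values to P(node = 1 | parents).
    """
    prob = 1.0
    for node, value in assignment.items():
        pa_values = tuple(assignment[p] for p in parents[node])
        p_one = cpt[node][pa_values]
        prob *= p_one if value == 1 else 1.0 - p_one
    return prob


# Hypothetical graph home -> product -> addcart with made-up probabilities.
parents = {"home": (), "product": ("home",), "addcart": ("product",)}
cpt = {
    "home": {(): 0.8},
    "product": {(0,): 0.1, (1,): 0.5},
    "addcart": {(0,): 0.0, (1,): 0.25},
}
p = joint_probability({"home": 1, "product": 1, "addcart": 0}, parents, cpt)

# The factorised probabilities of all 2^3 configurations sum to one.
total = sum(
    joint_probability(dict(zip(parents, values)), parents, cpt)
    for values in product([0, 1], repeat=len(parents))
)
```

The variables indicate whether a page was visited at all, so the order of visits plays no role in this model.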

Probabilistic expert systems: structural learning

Probabilistic expert systems: quantitative learning

Markov chains for web mining
Ideal to model dependencies between events. The order of the chain parallels the order of a sequence rule.
Data have been structured in the following form:
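As an illustration with made-up sessions (not the talk's data), the transition probabilities of a first-order chain can be estimated as relative frequencies of consecutive page pairs, with start_session and end_session added as extra states. The function name and the sessions are hypothetical.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order transition probabilities from click paths."""
    counts = defaultdict(Counter)
    for path in sessions:
        full = ["start_session", *path, "end_session"]
        for a, b in zip(full, full[1:]):
            counts[a][b] += 1
    # Normalise each row of counts so that it sums to one.
    return {
        a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
        for a, nxt in counts.items()
    }


# Hypothetical sessions, for illustration only.
sessions = [
    ["home", "program", "product"],
    ["home", "product"],
    ["program", "home"],
]
P = transition_probabilities(sessions)
```

The row for start_session describes entrance to the site, and the probabilities of moving to end_session describe exit from it.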

Results from Markov chains (entrance to the site: start session)

Exit from the site (end session)

Most likely paths
[Figure: path diagram over the pages Start_session, Home, Program, Product and P_info, with transition probabilities 17.80%, 45.81%, 70.18% and 26.73%]
Markov chains ARE DIRECTLY comparable with direct sequence rules. E.g. for the most likely path: from start_session, the highest confidence is with home (45.81%), then program (20.39%), product (78.09%) and addcart (28.79%).
There are small differences, due to the fact that the apriori algorithm considers only rules with support higher than a fixed threshold (e.g. 5%).
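One way to read off a most likely path is to start from start_session and repeatedly follow the highest transition probability. The sketch below does this greedily and stops at end_session or when a page would repeat; the stopping rule, the function name and the transition probabilities are assumptions, not the talk's estimates.

```python
def most_likely_path(P, start="start_session", end="end_session", max_steps=10):
    """Follow the highest transition probability from each page in turn."""
    path = [start]
    while path[-1] != end and path[-1] in P and len(path) <= max_steps:
        row = P[path[-1]]
        nxt = max(row, key=row.get)
        if nxt in path:  # stop rather than cycle through the same pages
            break
        path.append(nxt)
    return path


# Hypothetical transition probabilities, for illustration only.
P = {
    "start_session": {"home": 0.5, "program": 0.3, "product": 0.2},
    "home": {"program": 0.4, "product": 0.35, "end_session": 0.25},
    "program": {"product": 0.7, "home": 0.2, "end_session": 0.1},
    "product": {"end_session": 0.6, "home": 0.4},
}
path = most_likely_path(P)
print(" -> ".join(path))
```

The greedy path is locally most likely at each step; it need not be the path with the highest overall probability.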

Essential references
Agrawal, R., Mannila, H., Srikant, R., Toivonen, H. and Verkamo, A.I. (1995) Fast discovery of association rules, in: Advances in knowledge discovery and data mining, AAAI/MIT Press, Cambridge.
Giudici, P. (2003) Applied Data Mining. Wiley, London.
Giudici, P. and Castelo, R. (2001) Association models for web mining. Journal of Knowledge discovery and data mining, 5, pp. 183-196.
Hastie, T., Tibshirani, R. and Friedman, J. (2001) The elements of statistical learning: data mining, inference and prediction. Springer-Verlag.
Hand, D.J., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, MIT Press, New York.

THANKS FOR THE ATTENTION!
Comments to: giudici@unipv.it
www.baystat.it/giudici/index.htm
