Model-Based Clustering and Visualization of Navigation Patterns on a Web Site I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White Presented by Motaz El Saban
Outline of the talk • Introduction and problem definition. • Model-based clustering. • Model learning. • Application to Msnbc.com IIS log data. • Data visualization. • Scalability. • Why mixtures of first-order Markov models? • Conclusions. • Future work.
Introduction • A new methodology for analyzing web navigation patterns (a form of human behavior in digital environments). • Patterns: sequences of URL categories traversed by users, stored in web-server logs over a 24-hour period on msnbc.com. • Functionality: • clustering users based on navigation patterns; • visualization (the WebCANVAS tool).
Web data analysis approach • Clustering: • partition users into clusters so that users with similar dynamic behavior fall in the same cluster. • Visualization: • display the behavior of the users within each cluster.
Related Work • Most previous work on web navigation patterns and visualization uses non-probabilistic methods [YAN96] [CHE98], mostly finding rules that govern navigation patterns. • Other work has used probabilistic methods for predicting user behavior on web pages, but not for clustering: random walk models [HUB97], Markov models for pre-fetching pages [PAD96], and modeling the next probable link with a kth-order Markov model [BOR00]. • These approaches use a single Markov model for all users, as opposed to first clustering the users.
Related Work • On the clustering side, [FU00] applied BIRCH to cluster user web navigation patterns. • For sequence-based clustering and visualization of web navigation, no previously known work uses probabilistic clustering. • Rather, user history has been visualized using visual metaphors of maps, paths, and signposts [WEX99]. • [MIN99] uses planar graphs to visualize crowds of users at particular web pages.
What do we mean by pattern?
Challenges • Web navigation patterns are dynamic; no static technique (e.g., histograms) can capture them → Markov models. • Different users have heterogeneous dynamic behavior → mixture of models. • Large data size. • The proposed algorithm for learning the mixture of 1st-order Markov models has runtime O(KNL + KM²), where K = # clusters, N = # sequences, L = average sequence length, and M = # of web page categories. For typically small M, the algorithm scales linearly with N and K. • Hierarchical clustering methods scale as O(N²).
Model-Based Clustering • Assume the data is generated as follows: • a user arrives at the web site and is assigned to one of K clusters with some probability, and • given that the user is in this cluster, his behavior is generated from a statistical model specific to that cluster. • Let X be a multivariate random variable taking on values corresponding to the behavior of individual users. • Let C be a discrete-valued variable taking on values c_1, …, c_K, corresponding to the unknown cluster assignment for a user.
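A minimal sketch of this generative assumption (all sizes and the Dirichlet-sampled parameters below are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: K clusters over M page categories (illustrative sizes).
K, M = 3, 5
pi = rng.dirichlet(np.ones(K))                   # p(c_k): cluster priors
init = rng.dirichlet(np.ones(M), size=K)         # per-cluster initial-page distributions
trans = rng.dirichlet(np.ones(M), size=(K, M))   # per-cluster transition matrices

def sample_user(length):
    """Sample one user's path: pick a cluster, then walk its Markov chain."""
    k = rng.choice(K, p=pi)                      # assign the user to a cluster
    x = [rng.choice(M, p=init[k])]               # draw the initial page category
    for _ in range(length - 1):
        x.append(rng.choice(M, p=trans[k, x[-1]]))  # first-order Markov step
    return k, x

print(sample_user(6))
```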
Model-Based Clustering • A mixture model for X with K components has the form:

$p(x \mid \Theta) = \sum_{k=1}^{K} p(c_k)\, p(x \mid c_k, \theta_k)$

where $p(c_k)$ is the marginal probability of the kth cluster, $p(x \mid c_k, \theta_k)$ is the statistical model describing the distribution of the variables for users in the kth cluster, and $\theta_k$ denotes the parameters of that component.
Model-Based Clustering • In our case X = (X_1, …, X_L) is a sequence of variables describing the user's path through the website. • X_i takes on some value x_i from the M different page categories. • Each component of the model is a 1st-order Markov model:

$p(x \mid c_k, \theta_k) = p(x_1 \mid \theta_k^{I}) \prod_{i=2}^{L} p(x_i \mid x_{i-1}, \theta_k^{T})$

• where $\theta_k^{I}$ denotes the parameters of the probability distribution over the initial page-category request among users in cluster k, • and $\theta_k^{T}$ denotes the parameters of the probability distributions over transitions from one category to the next by a user in cluster k. • Both distributions are taken to be multinomial.
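A hedged sketch of these two equations in code, using the parameter layout from the sampling sketch above (`pi[k]`, `init[k]`, `trans[k]`):

```python
import numpy as np

def markov_loglik(x, init_k, trans_k):
    """log p(x | c_k, theta_k): first-order Markov log-likelihood of one sequence."""
    ll = np.log(init_k[x[0]])                    # initial page category
    for prev, cur in zip(x[:-1], x[1:]):
        ll += np.log(trans_k[prev, cur])         # one transition per step
    return ll

def mixture_prob(x, pi, init, trans):
    """p(x | Theta) = sum_k p(c_k) p(x | c_k, theta_k)."""
    return sum(p_k * np.exp(markov_loglik(x, i_k, t_k))
               for p_k, i_k, t_k in zip(pi, init, trans))
```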
Model-Based Clustering • The EM algorithm is used to learn the model parameters. • Once learned, the model can assign users to clusters by finding the cluster $c_k$ that maximizes the membership probability:

$p(c_k \mid x, \theta) \propto p(c_k)\, p(x \mid c_k, \theta_k)$

• The user-to-cluster assignment may be soft or hard.
Learning Mixture Models from Data • For a known number K of clusters. • Training data $d^{train} = \{x^1, \ldots, x^N\}$, assumed i.i.d. • MAP estimate of $\theta$:

$\hat{\theta} = \arg\max_{\theta}\, p(\theta \mid d^{train}) = \arg\max_{\theta}\, p(d^{train} \mid \theta)\, p(\theta)$
EM learning algorithm (briefly) • An iterative method for finding local maxima of the MAP problem for $\theta$. • The problem at hand involves two sub-problems: • compute user class assignments (membership probabilities); • compute class parameters. • A chicken-and-egg problem!
EM learning algorithm (briefly) • EM approach: • E-step: given the current value of the parameters $\theta$, assign a user with behavior x to cluster $c_k$ using the membership probabilities. • M-step: pretend that these assignments correspond to real data, and re-estimate $\theta$ as the MAP estimate given this fictitious data. • Stop iterating when two consecutive iterations produce log likelihoods on the training data that differ by less than p% (0.01% in the paper).
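A minimal sketch of one EM iteration under these definitions (reuses `markov_loglik` and the parameter layout from the earlier sketches; the symmetric Dirichlet pseudo-count `alpha` standing in for the paper's prior is an assumption):

```python
import numpy as np

def em_step(seqs, pi, init, trans, alpha=0.1):
    """One EM iteration for the mixture of first-order Markov models."""
    K, M = trans.shape[0], trans.shape[1]

    # E-step: membership probabilities p(c_k | x, theta) for every sequence.
    logp = np.array([[np.log(pi[k]) + markov_loglik(x, init[k], trans[k])
                      for k in range(K)] for x in seqs])
    logp -= logp.max(axis=1, keepdims=True)      # stabilize before exponentiating
    resp = np.exp(logp)
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: re-estimate parameters from the fractional ("fictitious")
    # assignments, smoothed with Dirichlet pseudo-counts alpha.
    new_pi = resp.sum(axis=0) / len(seqs)
    new_init = np.full((K, M), alpha)
    new_trans = np.full((K, M, M), alpha)
    for r, x in zip(resp, seqs):
        for k in range(K):
            new_init[k, x[0]] += r[k]
            for prev, cur in zip(x[:-1], x[1:]):
                new_trans[k, prev, cur] += r[k]
    new_init /= new_init.sum(axis=1, keepdims=True)
    new_trans /= new_trans.sum(axis=2, keepdims=True)
    return new_pi, new_init, new_trans
```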
How to choose K? • Letting the site administrator try several values of K and pick the one most convenient for visualization → too time consuming. Rather, • choose K by finding the model that most accurately predicts $N_t$ new test cases $d^{test} = \{x^{N+1}, \ldots, x^{N+N_t}\}$. That is, choose the model with K clusters that minimizes the out-of-sample predictive log score:

$\mathrm{Score}(K) = -\frac{1}{N_t} \sum_{i=N+1}^{N+N_t} \log p(x^i \mid \hat{\theta}, K)$
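A sketch of this selection criterion, reusing `mixture_prob` from above (`fit_em`, a routine that runs the EM step to convergence for a given K, is assumed rather than shown):

```python
import numpy as np

def predictive_log_score(test_seqs, pi, init, trans):
    """Negative mean log-probability of held-out sequences; lower is better."""
    return -np.mean([np.log(mixture_prob(x, pi, init, trans))
                     for x in test_seqs])

# Hypothetical selection loop over candidate cluster counts:
# best_K = min([20, 40, 60, 80, 100],
#              key=lambda K: predictive_log_score(test_seqs,
#                                                 *fit_em(train_seqs, K)))
```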
Application to Msnbc.com • Each sequence in the data set corresponds to the page views of a user during a twenty-four-hour period. • Each event in the sequence corresponds to a user request for a page; the event denotes a page category rather than a URL. • Example categories: frontpage, news, tech, … • The number of URLs per category ranges from 10 to 5000. • Only the order in which pages are requested is modeled (no durations). • Page requests served via a caching mechanism were not recorded in the server logs and hence are not present in the data.
Application to Msnbc.com • The full data set consists of approximately one million sequences (users), with an average of 5.7 events per sequence. • Model learning for various cluster sizes K is done with a training set of 100,023 sequences. • Model evaluation uses the out-of-sample predictive log score on a separate sample of 98,687 sequences drawn from the original data.
Observation on the model components • Some of the individual model components encode two or more clusters. • Example: consider two clusters: a cluster of users who initially request category a and then choose between categories b and c, and a cluster of users who initially request category d and then choose between categories e and f. • These two clusters can be encoded in a single component of the mixture model, although the sequences for the separate clusters share no common elements. • The presence of multi-cluster components does not affect the out-of-sample predictive log score of a model. • However, it is problematic for visualization purposes.
Observation on the model components • Solutions: • One method is to run the EM algorithm and then post-process the resulting model, separating any multi-cluster components found. • A second method is to allow only one state (category) to have a non-zero probability of being the initial state in each of the 1st-order Markov models (see the sketch below). • The second method has the drawback that users with different initial states but similar paths afterward are divided into separate clusters. • Nonetheless, this potential problem proved fairly insignificant for the Msnbc.com data.
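One way the second method could be imposed during learning (an implementation assumption, not the paper's code) is to clamp each component's initial distribution to a single category after each M-step:

```python
import numpy as np

def clamp_initial_states(init):
    """Allow only the most probable initial category per component."""
    clamped = np.zeros_like(init)
    clamped[np.arange(len(init)), init.argmax(axis=1)] = 1.0
    return clamped
```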
Constrained models • Experimentally, the constrained models have predictive power almost equal to that of the unconstrained models. • However, with this constraint, more components are needed to represent the data than in the unconstrained case. • For this particular data, the constrained 1st-order Markov models reach their limit in predictive accuracy around K = 100, compared to the unconstrained models, which reach their limit around K = 60.
Out-of-sample results
Data Visualization: the WebCANVAS tool • Displays the twenty-four-hour period using 100 clusters. • Each window corresponds to a cluster. • Each row of squares in a cluster corresponds to a user sequence. • WebCANVAS uses hard clustering, assigning each user to a single cluster. • Each square in a row encodes a page request; its category is encoded by the color of the square. • Note that the use of color to encode URL category limits the utility of this tool to domains where the number of categories can be limited to fifty or so.
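A rough matplotlib approximation of this layout, not the actual WebCANVAS implementation (the example sequences and color map are illustrative):

```python
import matplotlib.pyplot as plt
import numpy as np

def draw_cluster(ax, sequences, n_categories, title):
    """One cluster window: one row of colored squares per user sequence."""
    width = max(len(s) for s in sequences)
    grid = np.full((len(sequences), width), np.nan)  # NaN = no request (blank)
    for r, s in enumerate(sequences):
        grid[r, :len(s)] = s                         # color index = page category
    ax.imshow(grid, cmap="tab20", vmin=0, vmax=n_categories - 1)
    ax.set_title(title)
    ax.set_xticks([]); ax.set_yticks([])

fig, ax = plt.subplots()
draw_cluster(ax, [[0, 1, 1, 2], [0, 2], [0, 1, 3, 3, 3]], 5, "cluster 1")
plt.show()
```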
WebCANVAS Display
Discovering unexpected facts • Large groups of people enter msnbc.com on the tech and local pages; • a large group of people navigates from on-air to local; • there is little navigation between the tech and business sections; • and there is a large number of hits to the weather pages.
WebCANVAS tool (model-directed sampling) • The WebCANVAS display performed better, subjectively, than two other methods: • showing the 0th-order and 1st-order Markov models of a cluster; • the "traffic flow movie" of Microsoft Site Server 3.0. • The advantage of model-directed sampling over displaying the models themselves is that the former approach is less sensitive to modeling errors. • That is, by displaying sampled raw data, behaviors in the data that are inconsistent with the model can still be seen and appreciated.
Alternative: Displaying the models themselves
Scalability • Memory requirements of the algorithm are O(NL + KM² + KM), which typically reduces to O(NL) (i.e., the data size) for data sets where M is relatively small. • The runtime of the algorithm per iteration is linear in N and K.
Scalability in K
Scalability in N
Mixtures of 1st-order Markov models: too simple a model? • Sen and Hansen (2001) and Deshpande and Karypis (2001) have shown the 1st-order Markov model to be inadequate for empirically observed page-request sequences. • This is not surprising; for example: • if a user visits a particular page, there tends to be a greater chance of him returning to that same page at a later time; • a 1st-order Markov model cannot capture this type of long-term memory. • However: • though the mixture model is 1st-order Markov within a cluster, the overall unconditional model is NOT 1st-order Markov. • The Msnbc data also differ from typical raw page-request sequences: URL categories yield a relatively small alphabet compared to uncategorized URLs.
Mixtures of 1st-order Markov models: too simple a model? • The combined effects of clustering and a small alphabet tend to produce low-entropy clusters, in the sense that a few (two or three) categories often dominate the sequences within each cluster. • Thus, the tendency to return to a specific page visited earlier in a session can be well approximated by the simple mixture of 1st-order Markov models.
Mixture of 1st-order Markov models vs a single 1st-order Markov model • Mixture model: • looking at the predictive distribution for the next symbol under the mixture model:

$p(x_{L+1} \mid x_1, \ldots, x_L) = \sum_{k=1}^{K} p(c_k \mid x_1, \ldots, x_L)\; p(x_{L+1} \mid x_L, \theta_k^{T})$

• Thus the probability of the next symbol is a weighted combination of the transition probabilities from each of the individual 1st-order component models.
Mixture of 1st-order Markov models vs a single 1st-order Markov model • The weights $p(c_k \mid x_1, \ldots, x_L)$ are the partial membership probabilities of the prefix (history) subsequence $x_1, \ldots, x_L$. • These weights are in turn a function of the history of the sequence (via Bayes' rule), and typically depend strongly on the pattern of behavior before $x_{L+1}$. • This prediction behavior is opposed to the simple predictive distribution of the 1st-order Markov model:

$p(x_{L+1} \mid x_1, \ldots, x_L) = p(x_{L+1} \mid x_L)$
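The contrast in code, as a sketch reusing `markov_loglik` and the parameter layout from the earlier sketches:

```python
import numpy as np

def next_symbol_dist(history, pi, init, trans):
    """p(x_{L+1} | x_1..x_L) under the mixture: a membership-weighted
    combination of each component's transition row from x_L."""
    w = np.array([p_k * np.exp(markov_loglik(history, i_k, t_k))
                  for p_k, i_k, t_k in zip(pi, init, trans)])
    w /= w.sum()                     # p(c_k | history) via Bayes' rule
    return sum(w_k * t_k[history[-1]] for w_k, t_k in zip(w, trans))

# A single first-order Markov model would instead return trans[history[-1]]
# regardless of anything that happened before the last page.
```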
Empirical check of the 1st-order Markov model • Diagnostic check: empirically calculate the run lengths of page categories for several of the most likely clusters. • If the data are generated by a 1st-order Markov model, then the distribution of these run lengths obeys a geometric distribution. • Results are shown, for each cluster, for the three most frequently visited categories that had at least one run length of four or greater. (Categories with run lengths of three or fewer provide relatively uninformative diagnostic plots.)
Empirical check of the 1st-order Markov model • Asterisks mark the empirically observed counts. • The center dotted line on each plot is the expected count as a function of run length under a geometric model, using the empirically estimated self-transition probability of the Markov chain for the corresponding cluster. • The upper and lower dotted lines represent the plus and minus three-sigma sampling deviations for each count under the model.
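A sketch of this diagnostic under the stated assumptions: count maximal runs of a category within observed sequences and compare against geometric expectations given an empirically estimated self-transition probability `p_self`:

```python
from itertools import groupby

def run_length_counts(sequences, category):
    """Count maximal runs of `category` in the sequences, keyed by run length."""
    counts = {}
    for s in sequences:
        for val, grp in groupby(s):
            if val == category:
                n = len(list(grp))
                counts[n] = counts.get(n, 0) + 1
    return counts

def expected_geometric_counts(counts, p_self):
    """Expected counts if run lengths were geometric:
    P(run = n) = p_self**(n - 1) * (1 - p_self)."""
    total = sum(counts.values())
    return {n: total * p_self ** (n - 1) * (1 - p_self) for n in counts}
```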
Conclusions • Used a model-based clustering approach to cluster users based on their web navigation patterns. • Developed a visualization tool that enables web administrators to better understand user behavior. • Used a mixture of 1st-order Markov models for clustering, taking into account the order of page requests. • Experiments suggest that 1st-order Markov mixture components are appropriate for the msnbc.com data. • The algorithm's learning time scales linearly with sample size; in contrast, agglomerative distance-based methods scale quadratically with sample size.
Future Work • Model the duration of each visit. • Avoid the method's limitation to small M by modeling page visits at the URL level. • In one such extension, Markov models could characterize both the transitions among categories and the transitions among pages within a given category. • Alternatively, a hidden-Markov mixture model could learn categories and category transitions simultaneously.