A Markov Model for Web Request Prediction. By Habel Kurian MASTER OF SCIENCE Report Presentation Department of Computing and Information Sciences Kansas State University, Manhattan, Kansas. Committee members: Dr. Daniel Andresen (Advisor) Dr. Gurdip Singh Dr. Mitchell L. Neilsen.
Date: 07-24-08 Time: 10:00 AM
Overview
• Introduction to web prediction models
• Prediction techniques
• Markov property
• Markov tree - All Kth order Markov model
• Efficacy of the model
• Markov tree implementation
• Pruning - Evolutionary model
• Results
• Conclusion and future work
Introduction: Web Prediction Model
Goal: Design a prediction model with better predictive accuracy but lower model complexity.
Use: Identifies web request patterns to predict future user requests. These patterns are captured by processing web server logs (access/referrer) over a period of time.
Prediction profile:
• Point profile
• Path profile
Application:
• Pre-fetching (server side)
• Pre-sending (client side)
• Recommendation systems
• Analysis and design of web sites
Different Types of Prediction Techniques
• Data mining
• Finite state machines
• Neural networks
• Markov based prediction models
  • Different order Markov models
  • Prediction by partial matching
  • Hybrid Markov models
  • All Kth order Markov model
  • Markov tree
Markov Property (stochastic counterpart to a deterministic process in probability theory)
• The description of the present state fully captures all the information that could influence the future evolution of the process.
• Let $P = \{p_1, p_2, p_3, \ldots, p_n\}$ be the set of pages in a web site, let $W$ be a user session (the sequence of pages visited by the user), and let $\mathrm{prob}(p_i \mid W)$ be the probability that the user visits page $p_i$ next. Assuming the user has visited $l$ pages, the page $P_{l+1}$ that the user visits next is estimated by:
$$P_{l+1} = \arg\max_{p \in P} \{ P(P_{l+1} = p \mid W) \} = \arg\max_{p \in P} \{ P(P_{l+1} = p \mid p_l, p_{l-1}, \ldots, p_1) \}$$
Keeping only the $k$ most recent pages gives:
$$P_{l+1} = \arg\max_{p \in P} \{ P(P_{l+1} = p \mid p_l, p_{l-1}, \ldots, p_{l-(k-1)}) \}$$
where $k$ is the number of preceding pages and identifies the order of the Markov model.
• An n-gram is a sequence of n actions (web requests). We try to match the prefix of length n-1 and predict the nth request.
• Track the occurrences of each n-gram. The Markov model helps to do this by assuming that the next request is a function of the current state.
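The estimate above can be illustrated with a minimal Python sketch, assuming sessions are given as lists of page identifiers. The function names `train_kth_order` and `predict_next` and the toy sessions are hypothetical and are not taken from the report's Perl implementation; the sketch only counts (k+1)-grams and returns the most frequent continuation of the last k pages.

```python
from collections import Counter, defaultdict

def train_kth_order(sessions, k):
    """Count how often each page follows each context of k preceding pages."""
    counts = defaultdict(Counter)
    for session in sessions:
        for i in range(k, len(session)):
            context = tuple(session[i - k:i])   # the k preceding pages
            counts[context][session[i]] += 1    # the page that followed them
    return counts

def predict_next(counts, history, k):
    """Return the most frequent successor of the last k pages, or None."""
    context = tuple(history[-k:])
    if len(context) < k or context not in counts:
        return None                              # no prediction can be attempted
    return counts[context].most_common(1)[0][0]

# Toy sessions (illustrative data only).
sessions = [["a", "b", "c"], ["a", "b", "c"], ["a", "b", "d"]]
model = train_kth_order(sessions, k=2)
next_page = predict_next(model, ["x", "a", "b"], k=2)
```

Returning `None` when the context was never seen keeps "no prediction attempted" distinct from a wrong prediction, which matters when accuracy and applicability are measured separately.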
Markov Tree - All Kth Order Markov Model
Figs (a), (b), (c), (d)
a. All Kth-order model: multiple data structures
• State space complexity is high
• Maintenance of different order models (update/prune)
b. Markov tree: single data structure (tree)
• Low model complexity after pruning
• Efficient way of storing state information to perform computation
• Incorporates different order models in one tree
• High applicability and predictive accuracy
• Uses the train-test machine learning paradigm
How to make the Markov model more effective?
The Markov tree has higher or equivalent space complexity when compared to different order models.
Solution:
a. Pruning
b. Clustering (distance/model)
c. Compression
How to improve the accuracy of the model?
a. Tune factors that affect predictive performance
b. Evolutionary model
Parameters Used to Determine the Efficacy of the Model
Model performance
• Accuracy (predictive precision = No. of correct predictions / No. of predictions attempted)
• Coverage of the model (applicability = No. of predictions / No. of requests)
• Number of states
How to characterize individual nodes?
• Confidence (predictive probability)
• Frequency (support)
Factors that affect predictive performance
• Number of predictions (top-n)
• N-gram size
• Prediction window
• Clustering algorithm
• Mistake costs
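The two performance ratios can be computed as in the following Python sketch. It assumes one prediction slot per request, with `None` meaning that no prediction was attempted; the function name `evaluate` and the sample data are illustrative only.

```python
def evaluate(predictions, actual):
    """Return (accuracy, applicability) for aligned prediction/request lists.

    accuracy      = correct predictions / predictions attempted
    applicability = predictions attempted / total requests
    """
    attempted = sum(1 for p in predictions if p is not None)
    correct = sum(
        1 for p, a in zip(predictions, actual) if p is not None and p == a
    )
    accuracy = correct / attempted if attempted else 0.0
    applicability = attempted / len(actual) if actual else 0.0
    return accuracy, applicability

# Illustrative data: three predictions attempted out of four requests.
accuracy, applicability = evaluate(["a", None, "b", "c"], ["a", "x", "y", "c"])
```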
Referrer/Access Logs
Source of browsing/link state information: server logs
Access Log
194.170.246.120 - - [08/Jun/2008:03:32:25 -0500] "GET /~schmidt/CIS200/ch1V9.html HTTP/1.0" 200 57896
Referrer Log
194.170.246.120 - - [08/Jun/2008:03:32:25 -0500] "GET /~schmidt/CIS200/ch1V9.html HTTP/1.0" 200 57896 "http://people.cis.ksu.edu/~schmidt/CIS200/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"
Preprocessing steps
• Remove embedded resources
• Referrer information alone is not sufficient to build the model; however, it can be used to validate the model.
• Assign unique numbers to URLs
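A possible shape for the first preprocessing step is sketched below in Python. The regular expression targets the log layout shown on this slide; the list of suffixes treated as embedded resources and all function names are assumptions, since the report does not specify them.

```python
import re

# Matches the access-log layout above; the trailing referrer/agent pair is optional.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Assumed set of embedded-resource suffixes (not specified in the report).
EMBEDDED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")

def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None

def is_embedded(path):
    """True if the requested path looks like an embedded resource."""
    return path.split("?", 1)[0].lower().endswith(EMBEDDED_SUFFIXES)

def page_requests(lines):
    """Yield (host, path) for every parsed request that is not embedded."""
    for line in lines:
        record = parse_line(line)
        if record and not is_embedded(record["path"]):
            yield record["host"], record["path"]

SAMPLE = ('194.170.246.120 - - [08/Jun/2008:03:32:25 -0500] '
          '"GET /~schmidt/CIS200/ch1V9.html HTTP/1.0" 200 57896')
record = parse_line(SAMPLE)
```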
Implementation: UML sequence diagram
Participants: Server logs (access/referrer), Preprocessing module, Parser, Markov model building algorithm, Pruning module, Testing
Stored files: url_hash.xml, markov_model.xml
Messages: Pre-processed log file fed to parser; Request sequences for user session; Markov tree; Test log file validation; Fitness function (accuracy, applicability); Pruned Markov tree
Implementation
Program modules
• Preprocessing
• Parser
• Model building algorithm
• Pruning
• Testing
Implementation details
Programming language: Perl v5.8.8 (CPAN / XML)
Lines of code: 2500
Storage: XML files
Platform: Linux OS – Beocat
Tree searching: Iterative deepening depth-first search (XML::Twig)
Validation: Document type definition (external DTD) / XML validator (http://www.stg.brown.edu/service/xmlvalid/)
Assigning unique identifiers to URLs
<urlhash>
  <number> 15159 </number>
  <url value = "10000">www.cis.ksu.edu</url>
  <url value = "10001">/people</url>
  <url value = "10002">/BadContent</url>
  <url value = "10003">/BadContent123</url>
  <url value = "10004">/people/faculty</url>
  <url value = "10005">/~virg/</url>
  <url value = "10006">/research/projects</url>
  <url value = "10007">/_external/chert/home.html</url>
  •
  •
  <url value = "15158">/~schmidt/300s05/Lectures/Week13.html</url>
  <url value = "15159">/~ab/Miscellany/rhythm_files/a.html</url>
</urlhash>
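The numbering scheme can be sketched in a few lines of Python. Starting at 10000 mirrors the listing above; the function name `assign_ids` and the choice of numbering in order of first appearance are assumptions, not details confirmed by the report.

```python
def assign_ids(urls, first_id=10000):
    """Map each distinct URL to a unique number, in order of first appearance."""
    table = {}
    for url in urls:
        if url not in table:
            table[url] = first_id + len(table)
    return table

# Repeated URLs keep the identifier they were given the first time.
url_ids = assign_ids(["www.cis.ksu.edu", "/people", "/people", "/BadContent"])
```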
Markov Tree Node
<node>
  <parent>10000</parent>
  <nochild>1</nochild>
  <childcount>1</childcount>
  <selfcount>1</selfcount>
  <nodeid>10002</nodeid>
</node>
Figs (e), (f)
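For readers who want to inspect such a node outside Perl, the following Python sketch reads the same fields with the standard-library `xml.etree.ElementTree` module. This is a stand-in for illustration; the report's implementation uses XML::Twig, and the helper name `parse_node` is hypothetical.

```python
import xml.etree.ElementTree as ET

NODE_XML = """
<node>
  <parent>10000</parent>
  <nochild>1</nochild>
  <childcount>1</childcount>
  <selfcount>1</selfcount>
  <nodeid>10002</nodeid>
</node>
"""

FIELDS = ("nodeid", "parent", "selfcount", "childcount", "nochild")

def parse_node(element):
    """Read the integer counters of one <node> element into a dict."""
    return {name: int(element.findtext(name)) for name in FIELDS}

node = parse_node(ET.fromstring(NODE_XML))
```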
Model Building Algorithm

Build_Markov_Tree(log):
  for each request sequence S associated with a user session of log
    while (there is a subsequence, starting from the first request, that has not been considered)
      for i from 0 to min(|S|, Markov model order) {
        let ss be the subsequence containing the last i items from S
        let p be a pointer to root
        if |ss| == 0
          increment p.selfcount
        else
          for j from first(ss) to last(ss) {
            increment p.childcount
            if not-exist-child(p, j)
              increment p.nochild
              add a new node for j to the list of p's children
            let p point to child j
            if j = last(ss)
              increment p.selfcount
          }
      }
Eg: 1000 1003 1002 1000
a. 1003 1002 1000 -> 1000, 1002 1000, 1003 1002 1000
b. 1000 1003 1002 -> 1002, 1003 1002, 1000 1003 1002
c. 1000 1003 -> 1003, 1000 1003
d. 1000 -> 1000
Fig (g): Markov tree after completing the algorithm for the first part (a) of the sequence
Fig (h): Markov tree after completing the algorithm for all the sequences listed above
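The pseudocode and the worked example can be traced with the following Python sketch. It is one reading of the algorithm, not the report's Perl code: the `Node` class, the dictionary of children, and the model order of 3 are assumptions chosen so that the inserted subsequences match parts (a) to (d) above. The `nochild` counter is represented implicitly by the size of `children`.

```python
class Node:
    """One state of the Markov tree."""
    def __init__(self, nodeid, parent=None):
        self.nodeid = nodeid
        self.parent = parent
        self.selfcount = 0    # sequences that end at this node
        self.childcount = 0   # sequences that pass through to a child
        self.children = {}    # page id -> Node; len() plays the role of nochild

def insert_subsequence(root, subsequence):
    """Walk from the root along the subsequence, updating the counters."""
    p = root
    if not subsequence:
        p.selfcount += 1
        return
    for index, page in enumerate(subsequence):
        p.childcount += 1
        if page not in p.children:
            p.children[page] = Node(page, parent=p)
        p = p.children[page]
        if index == len(subsequence) - 1:
            p.selfcount += 1

def build_markov_tree(sessions, order):
    """Insert, for every prefix of every session, its last 0..order items."""
    root = Node("root")
    for session in sessions:
        for end in range(1, len(session) + 1):
            for i in range(0, min(end, order) + 1):
                insert_subsequence(root, session[end - i:end])
    return root

# The example sequence from the slide, with an assumed model order of 3.
tree = build_markov_tree([[1000, 1003, 1002, 1000]], order=3)
```

In this reading, page 1000 ends two length-1 subsequences, so the node for 1000 directly under the root has a selfcount of 2, and the root's childcount equals the nine non-empty subsequences listed in (a) to (d).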
Markov Tree
Fig (i), node label: 19,7,19
Confidence = selfcount / parent's childcount
Frequency = selfcount = 3
Markov tree in XML format
<?xml version="1.0" encoding="ISO-8859-1"?>
<markovroot>
  <nodeid>10000</nodeid>
  <parent>10000</parent>
  <selfcount>6857</selfcount>
  <childcount>6857</childcount>
  <nochild>1442</nochild>
  <markovchildren>
    <node>
      <markovchildren_1>
        <node>
          <markovchildren_2>
            <node>
              <markovchildren_3>
              </markovchildren_3>
              <parent>10003</parent>
              <nochild>0</nochild>
              <childcount>0</childcount>
              <selfcount>1</selfcount>
              <nodeid>10001</nodeid>
              • • •
      </markovchildren_1>
      <parent>10000</parent>
      <nochild>0</nochild>
      <childcount>0</childcount>
      <selfcount>250</selfcount>
      <nodeid>10003</nodeid>
    </node>
  </markovchildren>
</markovroot>
Pruning (elimination of states having a low contribution towards accuracy)
• Frequency pruning
• Confidence pruning
• Error pruning
• Pessimistic pruning
• Hybrid pruning techniques
• Top-down / bottom-up pruning
Pruning Algorithm

prune(node):
  for i from firstchild(node) to lastchild(node): child_node {
    if (number_of_children of child_node != 0)
      let child be a pointer to child_node
      prune(child_node)
    if (fitness_state(child) is weak)
      prune_operation(child)
      decrement parent child_count
      decrement parent number_of_children
      if (node is root node)
        decrement parent self_count
  }

* fitness_state(child) is a weighted sum of the frequency and confidence of the child node
** prune_operation(child) prunes the child node and all its sub-children
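A bottom-up version of this procedure might look like the Python sketch below. Several details are assumptions because the slide leaves them open: the weights `w_freq` and `w_conf`, the threshold `min_fitness`, the use of raw (unnormalized) frequency in the weighted sum, and the amount by which the parent's childcount is reduced. The root-specific selfcount adjustment from the pseudocode is omitted.

```python
class Node:
    """Minimal tree node with the counters used by the pruning step."""
    def __init__(self, selfcount=0, childcount=0):
        self.selfcount = selfcount
        self.childcount = childcount
        self.children = {}

def fitness(child, parent_total, w_freq, w_conf):
    """Weighted sum of frequency (selfcount) and confidence (assumed weights)."""
    confidence = child.selfcount / parent_total if parent_total else 0.0
    return w_freq * child.selfcount + w_conf * confidence

def prune(node, min_fitness, w_freq=0.5, w_conf=0.5):
    """Recursively remove weak children together with their subtrees."""
    parent_total = node.childcount          # fixed before any sibling is removed
    for page in list(node.children):
        child = node.children[page]
        # Sequences that reached this child, recorded before its subtree shrinks.
        through = child.selfcount + child.childcount
        if child.children:
            prune(child, min_fitness, w_freq, w_conf)
        if fitness(child, parent_total, w_freq, w_conf) < min_fitness:
            del node.children[page]         # drops the child and all sub-children
            node.childcount -= through

# Illustrative tree: a strong child "A" and a weak branch "B" -> "C".
root = Node(childcount=10)
root.children["A"] = Node(selfcount=8)
root.children["B"] = Node(selfcount=1, childcount=1)
root.children["B"].children["C"] = Node(selfcount=1)
prune(root, min_fitness=2.0)
```

Fixing `parent_total` before the loop keeps each child's confidence independent of the order in which its siblings are examined.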
Data set used to build & validate the model (Web server log files, Dept. of Computing and Information Sciences, Kansas State University) Table 1.
Results Predictive success Vs Markov order Graph 1.
Graph 2. File size vs. session interval
Graph 3. Precision graph of Markov tree
Predictive accuracy table 69.9±3.2 Table 2. Updating/Pruning the prediction model improves the performance substantially
Measurement of degree of pruning Table 3.
Graph 4. SPMM
Graph 5. CPMM
Reference: Mukund Deshpande and George Karypis, "Selective Markov model for predicting web-page accesses."
Precision or File Size vs. Pruning Threshold
Graph 6.
• More current training logs imply better adaptation to current browsing behavior
• Pruning thresholds are established by incrementing the minimum confidence/frequency
• The fitness function determines the need for further pruning
File Size or Precision or Applicability Vs Pruning Threshold Graph 7.
Conclusion and Future Work
• The Markov tree is a good alternative to different order prediction models.
• When these models are updated at regular intervals using pruning thresholds, they give good predictive precision, good applicability, and reduced model complexity.
• Machine learning technique / handling log files / Perl (CPAN)
• XML building and validation
• Track the probability of the next item that has never been seen.
• Validate our observations over log files from different sources.
Acknowledgements
• Dr. Daniel Andresen (Major Professor)
• Dr. Gurdip Singh
• Dr. Mitchell L. Neilsen
• Staff members of the Department of Computing and Information Sciences, Kansas State University
• Family and friends