Mining Web Logs for Prediction Models in WWW Caching and Prefetching
From: The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), August 26-29, 2001, San Francisco, California, USA
Authors: Qiang Yang, Haining Henry Zhang, Tianyi Li
Professor: Dr. Yang
Student: Gun-Ren Wang
Outline • Introduction • Page replacement policy • GD-Size & GDSF • Extracting Embedded Objects • Mining Frequent Sequences • Prediction Algorithm • Conclusion
Introduction • As the World Wide Web is growing at a very rapid rate, researchers have designed various effective caching algorithms to reduce network traffic. • An important advantage of the WWW is that many web servers keep an access log of their users. These logs can be used to train a prediction model for future document accesses.
Performance Metrics • Hit Rate (HR): the ratio of the number of requests that hit in the proxy cache to the total number of requests. • Byte Hit Rate (BHR): the ratio of the number of bytes that hit in the proxy cache to the total number of bytes requested.
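A minimal sketch (not from the paper) of how HR and BHR can be computed from a proxy trace, where each request record is assumed to be a (size_in_bytes, was_hit) pair:

def hit_rates(trace):
    """Return (HR, BHR) for a list of (size, hit) request records."""
    total_reqs = len(trace)
    total_bytes = sum(size for size, _ in trace)
    hit_reqs = sum(1 for _, hit in trace if hit)
    hit_bytes = sum(size for size, hit in trace if hit)
    hr = hit_reqs / total_reqs if total_reqs else 0.0
    bhr = hit_bytes / total_bytes if total_bytes else 0.0
    return hr, bhr

print(hit_rates([(1000, True), (5000, False), (200, True)]))  # HR = 2/3, BHR = 1200/6200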
Page replacement policy • Least-Recently-Used (LRU): evicts the document that was requested least recently. • Least-Frequently-Used (LFU): replaces the document that has been accessed the fewest times. • Size: replaces the largest document. • Lowest-Latency-First: aims to minimize the average latency.
GD-Size Based on the original GD algorithm, Cao and Irani incorporated the size factor and introduced the Greedy-Dual-Size algorithm for web caching to improve the efficiency of the original GD algorithm. K(P) = L + C(P) / S(P), where C(P) is the cost to bring document P into the cache, S(P) is the document size, and L is an aging factor that starts at 0 and is updated to the key value of the last replaced document.
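An illustrative Python sketch of the GD-Size key and eviction step (the class and its names are our own, not the authors' code):

class GDSizeCache:
    def __init__(self, capacity):
        self.capacity = capacity   # cache size in bytes
        self.used = 0
        self.L = 0.0               # aging factor, starts at 0
        self.docs = {}             # url -> (key, cost, size)

    def _key(self, cost, size):
        return self.L + cost / size          # K(P) = L + C(P)/S(P)

    def insert(self, url, cost, size):
        # Evict lowest-key documents until the new document fits;
        # L is updated to the key of the last replaced document.
        while self.used + size > self.capacity and self.docs:
            victim = min(self.docs, key=lambda u: self.docs[u][0])
            self.L = self.docs[victim][0]
            self.used -= self.docs[victim][2]
            del self.docs[victim]
        self.docs[url] = (self._key(cost, size), cost, size)
        self.used += size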
GDSF Cherkasova improved the GD-Size algorithm by incorporating a frequency count in the computation of key values; the result is called the Greedy-Dual-Size-Frequency (GDSF) algorithm. K(P) = L + F(P) * C(P) / S(P), where F(P) is the access count of document P; on each access, F(P) = F(P) + 1. We denote this replacement policy as GDSF. When the cost function is set to the document size, K(P) = L + F(P), which achieves the best byte hit rate.
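A small sketch of the GDSF key computation (a hypothetical helper, not the authors' code):

def gdsf_key(L, freq, cost, size):
    # K(P) = L + F(P) * C(P) / S(P); with C(P) = S(P) this reduces to L + F(P),
    # the variant the slide says gives the best byte hit rate.
    return L + freq * cost / size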
Extracting Embedded Objects HTML documents also act as containers of other web objects, such as images, audio and video files. These objects, which are requested as part of their containing HTML documents, are called embedded objects.
Mining Frequent Sequences From the graph, we generate N-gram prediction rules: S1.S2.S3...Sk-1 --> Sk, with conditional probability P(Sk | S1.S2...Sk-1), Conf = count(S1.S2...Sk) / count(S1.S2...Sk-1). If Sk has embedded objects, the following rules can be deduced immediately from the embedded object table (EOT): S1.S2.S3...Sk-1 --> Oi, 0 <= i <= n, where Conf(i) = conf * Pi.
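A minimal sketch (our own, not the authors' code) of mining N-gram rules and their confidence from a set of access sessions:

from collections import Counter

def mine_ngram_rules(sessions, k, min_conf=0.5):
    """sessions: list of page-id lists; returns {(prefix_tuple, next_page): confidence}."""
    prefix_counts = Counter()   # count(S1...Sk-1)
    seq_counts = Counter()      # count(S1...Sk)
    for s in sessions:
        for i in range(len(s) - (k - 1) + 1):
            prefix_counts[tuple(s[i:i + k - 1])] += 1
        for i in range(len(s) - k + 1):
            seq_counts[tuple(s[i:i + k])] += 1
    rules = {}
    for seq, c in seq_counts.items():
        conf = c / prefix_counts[seq[:-1]]
        if conf >= min_conf:
            rules[(seq[:-1], seq[-1])] = conf
    return rules

# Example: with k = 2, the session [A, B, A, C] yields A->B (conf 0.5),
# A->C (conf 0.5) and B->A (conf 1.0).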
Prediction Algorithm • The process of building the set of association rules and the EOT is called training. Once training is finished, we can apply these rules to predict future requests by matching the longest path first. • Let O(i) denote a web object on the server, S(j) be a session for object O(i), and W(i) be the future frequency of requests to object O(i).
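A hypothetical sketch of longest-path-first matching against the mined rules (rules maps a prefix tuple to a list of (predicted page, confidence) pairs):

def predict(rules, recent_pages, max_order):
    """Match the longest suffix of recent_pages that has a rule; return its predictions."""
    for n in range(min(max_order, len(recent_pages)), 0, -1):
        prefix = tuple(recent_pages[-n:])
        if prefix in rules:
            return rules[prefix]   # predictions from the longest matching path
    return []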
Extend GDSF We extend GDSF to incorporate W(p): K(p) = L + (W(p) + F(p)) * C(p) / S(p), which implies that the key value of a page p is determined not only by its past access frequency but also by its predicted future frequency. More generally, K(p) = L + (k * W(p) + (1 - k) * F(p)) * C(p) / S(p), where the weight k lies between 0 and 1.
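A sketch of the extended key computation (the helper name and default weight are assumptions, not from the paper):

def extended_gdsf_key(L, w_future, freq, cost, size, k=0.5):
    # k balances the predicted future frequency W(p) against the past frequency F(p)
    return L + (k * w_future + (1 - k) * freq) * cost / size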
Conclusion We applied association rules mined from web logs to improve the well-known GDSF algorithm. By integrating path-based prediction with caching and prefetching, it is possible to dramatically improve the hit rate while reducing network latency.