Mining Web Logs for Prediction Models in WWW Caching and Prefetching
From: The Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '01), August 26-29, 2001, San Francisco, California, USA
Authors: Qiang Yang, Haining Henry Zhang, Tianyi Li
Professor: Dr. Yang
Student: Gun-Ren Wang
Outline • Introduction • Page replacement policy • GD-Size & GDSF • Extracting Embedded Objects • Mining Frequent Sequences • Prediction Algorithm • Conclusion
Introduction • As the World Wide Web is growing at a very rapid rate, researchers have designed various effective caching algorithms to reduce network traffic. • An important advantage of the WWW is that many web servers keep an access log of their users. These logs can be used to train a prediction model for future document accesses.
Performance Metrics • Hit Rate (HR): the ratio of the number of requests that hit in the proxy cache to the total number of requests. • Byte Hit Rate (BHR): the ratio of the number of bytes that hit in the proxy cache to the total number of bytes requested.
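A minimal sketch (not from the paper) of how HR and BHR can be computed from a proxy trace, where each request record is assumed to be a (size_in_bytes, was_hit) pair:

def hit_rates(trace):
    """Return (HR, BHR) for a list of (size, hit) request records."""
    total_reqs = len(trace)
    total_bytes = sum(size for size, _ in trace)
    hit_reqs = sum(1 for _, hit in trace if hit)
    hit_bytes = sum(size for size, hit in trace if hit)
    hr = hit_reqs / total_reqs if total_reqs else 0.0
    bhr = hit_bytes / total_bytes if total_bytes else 0.0
    return hr, bhr

print(hit_rates([(1000, True), (5000, False), (200, True)]))  # HR = 2/3, BHR = 1200/6200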
Page replacement policy • Least-Recently-Used (LRU): evicts the document that was requested least recently. • Least-Frequently-Used (LFU): replaces the document that has been accessed the fewest times. • Size: replaces the largest document. • Lowest-Latency-First: aims to minimize the average latency.
GD-Size Based on the original GD algorithm, Cao and Irani incorporated the size factor and introduced the Greedy-Dual-Size algorithm for web caching to improve the efficiency of the original GD algorithm. K(P) = L + C(P) / S(P), where C(P) is the cost to bring document P into the cache, S(P) is the document size, and L is an aging factor that starts at 0 and is updated to the key value of the last replaced document.
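An illustrative Python sketch of the GD-Size key and eviction step (the class and its names are our own, not the authors' code):

class GDSizeCache:
    def __init__(self, capacity):
        self.capacity = capacity   # cache size in bytes
        self.used = 0
        self.L = 0.0               # aging factor, starts at 0
        self.docs = {}             # url -> (key, cost, size)

    def _key(self, cost, size):
        return self.L + cost / size          # K(P) = L + C(P)/S(P)

    def insert(self, url, cost, size):
        # Evict lowest-key documents until the new document fits;
        # L is updated to the key of the last replaced document.
        while self.used + size > self.capacity and self.docs:
            victim = min(self.docs, key=lambda u: self.docs[u][0])
            self.L = self.docs[victim][0]
            self.used -= self.docs[victim][2]
            del self.docs[victim]
        self.docs[url] = (self._key(cost, size), cost, size)
        self.used += size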
GDSF Cherkasova improved the GD-Size algorithm by incorporating a frequency count in the computation of key values; the result is called the Greedy-Dual-Size-Frequency (GDSF) algorithm. K(P) = L + F(P) * C(P) / S(P), where F(P) is the access count of document P; on each access, F(P) = F(P) + 1. We denote this replacement policy as GDSF. When the cost function is set to the document size, K(P) = L + F(P), which achieves the best byte hit rate.
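A small sketch of the GDSF key computation (a hypothetical helper, not the authors' code):

def gdsf_key(L, freq, cost, size):
    # K(P) = L + F(P) * C(P) / S(P); with C(P) = S(P) this reduces to L + F(P),
    # the variant the slide says gives the best byte hit rate.
    return L + freq * cost / size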
Extracting Embedded Objects HTML documents also act as containers of other web objects, such as images, audio and video files. These objects, which are requested as part of their containing HTML documents, are called embedded objects.
Mining Frequent Sequences From the graph, we generate N-gram prediction rules: S1.S2.S3...Sk-1 --> Sk, with conditional probability P(Sk | S1.S2...Sk-1), Conf = count(S1.S2...Sk) / count(S1.S2...Sk-1). If Sk has embedded objects, the following rules can be deduced immediately from the embedded object table (EOT): S1.S2.S3...Sk-1 --> Oi, 0 <= i <= n, where Conf(i) = conf * Pi.
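A minimal sketch (our own, not the authors' code) of mining N-gram rules and their confidence from a set of access sessions:

from collections import Counter

def mine_ngram_rules(sessions, k, min_conf=0.5):
    """sessions: list of page-id lists; returns {(prefix_tuple, next_page): confidence}."""
    prefix_counts = Counter()   # count(S1...Sk-1)
    seq_counts = Counter()      # count(S1...Sk)
    for s in sessions:
        for i in range(len(s) - (k - 1) + 1):
            prefix_counts[tuple(s[i:i + k - 1])] += 1
        for i in range(len(s) - k + 1):
            seq_counts[tuple(s[i:i + k])] += 1
    rules = {}
    for seq, c in seq_counts.items():
        conf = c / prefix_counts[seq[:-1]]
        if conf >= min_conf:
            rules[(seq[:-1], seq[-1])] = conf
    return rules

# Example: with k = 2, the session [A, B, A, C] yields A->B (conf 0.5),
# A->C (conf 0.5) and B->A (conf 1.0).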
Prediction Algorithm • The process of building the set of association rules and the EOT is called training. Once training is finished, we can apply these rules to predict future requests by matching the longest path first. • Let O(i) denote a web object on the server, S(j) be a session for object O(i), and W(i) be the future frequency of requests to object O(i).
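A hypothetical sketch of longest-path-first matching against the mined rules (rules maps a prefix tuple to a list of (predicted page, confidence) pairs):

def predict(rules, recent_pages, max_order):
    """Match the longest suffix of recent_pages that has a rule; return its predictions."""
    for n in range(min(max_order, len(recent_pages)), 0, -1):
        prefix = tuple(recent_pages[-n:])
        if prefix in rules:
            return rules[prefix]   # predictions from the longest matching path
    return []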
Extend GDSF We extend GDSF to incorporate W(p): K(p) = L + (W(p) + F(p)) * C(p) / S(p), which implies that the key value of a page p is determined not only by its past access frequency but also by its predicted future frequency. More generally, K(p) = L + (k * W(p) + (1 - k) * F(p)) * C(p) / S(p), where the weight k lies between 0 and 1.
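A sketch of the extended key computation (the helper name and default weight are assumptions, not from the paper):

def extended_gdsf_key(L, w_future, freq, cost, size, k=0.5):
    # k balances the predicted future frequency W(p) against the past frequency F(p)
    return L + (k * w_future + (1 - k) * freq) * cost / size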
Conclusion We applied association rules mined from web logs to improve the well-known GDSF algorithm. By integrating path-based prediction with caching and prefetching, it is possible to dramatically improve the hit rate while reducing network latency.