Using Database Technology to Improve Performance of Web Proxy Servers K. Cheng¹, Y. Kambayashi¹, M. Mohania² ¹Kyoto University, Japan ²Western Michigan University, USA
Caching on web proxy servers
• Improve throughput of proxy servers
• Improve response times for end users
• Bridge the bandwidth gap between WAN and LAN
• Distribute workload from web servers
[Figure: clients reach web servers through a proxy server; the LAN side has higher bandwidth, the WAN side has lower bandwidth, and direct access is marked with an X.]
WebDB'2001, Santa Barbara CA
Characteristics of proxy caching
Limitations of current caching schemes: case 1
• Tom found a very good page “P1” about car models
• John was looking for the same kind of page, but he only found “P2”
• Both “P1” and “P2” were cached, but Tom did not know about “P2” and John did not know about “P1”
• After several days, both were replaced because neither received further visits
• As a result, Tom missed “P2”, John missed “P1”, and the cache missed 2 hits
State-of-the-art caching schemes cannot deal with this case!
Limitations of current caching schemes: case 2
• Suppose the users of a proxy server are mostly interested in “XML” and rarely interested in “Fuzzy”
• Suppose some clients retrieved pages “P1” and “P2”
• After checking the content of “P1” and “P2”, we know that “P1” is an “XML” page and “P2” is a “Fuzzy” page
Should we prefer to cache “P1” or “P2”?
Why can’t current schemes deal with these cases?
• Physical-object-based cache management
  • Content transparency leads to a low utilization rate (Case 1)
    • Approximately 60% of the data in the cache is never used
    • Approximately 90% of the data in the cache is rarely used
  • Usage-based object replacement leads to a needlessly long stay time for irrelevant content (Case 2)
Our solution
• We propose a hierarchical data model for managing web data (physical pages, logical pages and topics)
• Object replacement is based on
  • Link structure (“logical pages”)
  • Semantic similarity with other objects (“topics”)
• Facilitates active access to cache contents
A hierarchical model for web data
[Figure: three layers connected by mappings. Topics (T1, T2) are handled by the topic manager and accessed by navigation; logical pages (L1, L2, L3) are handled by the logical page manager and accessed by search; physical pages (p1 to p6) are handled by the physical page manager and accessed by browsing.]
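The three layers and their mappings can be sketched as plain data structures. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names (PhysicalPage, LogicalPage, Topic), their fields and the helper physical_pages_of are hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class PhysicalPage:
    """One cached file, identified by its URL (hypothetical fields)."""
    url: str
    size: int = 0


@dataclass
class LogicalPage:
    """A group of physical pages that together form one document."""
    name: str
    members: list = field(default_factory=list)  # PhysicalPage objects


@dataclass
class Topic:
    """A group of logical pages about the same subject."""
    name: str
    pages: list = field(default_factory=list)  # LogicalPage objects


def physical_pages_of(topic):
    """Follow both mappings down: topic -> logical pages -> physical pages."""
    seen = {}
    for logical in topic.pages:
        for page in logical.members:
            seen[page.url] = page  # a physical page may be shared, keep one copy
    return list(seen.values())


# Hypothetical example data: p2 is shared by two logical pages.
p1, p2, p3 = (PhysicalPage(f"p{i}") for i in (1, 2, 3))
l1 = LogicalPage("L1", [p1, p2])
l2 = LogicalPage("L2", [p2, p3])
t1 = Topic("T1", [l1, l2])
urls = [p.url for p in physical_pages_of(t1)]
```

Keeping the mappings as simple lists makes sharing explicit: a physical page that belongs to two logical pages appears in both member lists but is returned only once when a topic is expanded.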
Physical pages
[Figure: example physical pages “A” and “B”, labelled with http://www.difa.unibas.it/webdb2001, ../icons/webdblogo.gif and /instructionsPage/index.html.]
Logical page
[Figure: a logical page grouping physical pages “A” and “B”.]
Managing physical pages
• Physical page
  • HTML/plain text file (.html, .txt)
  • Embedded media file (.gif, .png, .wav, .mp3)
  • Application-generated file (.pdf, .ps, .doc)
• Managing physical pages based on
  • URL (protocol, ip, port, path)
  • Physical properties (e.g. size, cost)
  • Usage (frequency, recency)
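As an illustration of URL-based management, a physical page can be keyed by the components of its URL. The sketch below uses Python's standard urllib.parse; the function name url_key and the default-port table are assumptions made for this example, and the host name stands in for the "ip" component without any DNS resolution.

```python
from urllib.parse import urlsplit

# Assumed default ports, used when the URL does not state one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def url_key(url):
    """Split a URL into a (protocol, host, port, path) tuple."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port, parts.path or "/")


key = url_key("http://www.difa.unibas.it/webdb2001")
```

Two URLs that differ only in an explicit default port, such as http://host/ and http://host:80/, map to the same key under this scheme.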
Constructing logical pages
• Basic logical pages
  • Single multimedia document
  • HTML (1) + embedded media files (1..*)
• Extended logical pages
  • Several closely related, directly linked pages
  • E.g. an HTML paper whose sections are spread over different multimedia documents
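A basic logical page, one HTML file plus its embedded media files, could be assembled roughly as follows. This is a sketch under assumptions: the set of tags treated as embedded media, the class EmbeddedMediaParser, the function basic_logical_page and the example URL are all hypothetical. Hyperlinks (a href) are deliberately ignored, since directly linked pages belong to extended logical pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed set of tags whose "src" attribute points at an embedded media file.
MEDIA_TAGS = ("img", "embed", "audio", "video", "source")


class EmbeddedMediaParser(HTMLParser):
    """Collect absolute URLs of media files embedded in an HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.media = []

    def handle_starttag(self, tag, attrs):
        if tag in MEDIA_TAGS:
            src = dict(attrs).get("src")
            if src:
                # Resolve relative references against the page URL.
                self.media.append(urljoin(self.base_url, src))


def basic_logical_page(html_url, html_text):
    """Return the HTML page URL followed by its embedded media URLs."""
    parser = EmbeddedMediaParser(html_url)
    parser.feed(html_text)
    return [html_url] + parser.media


page = basic_logical_page(
    "http://www.example.org/webdb/index.html",
    '<html><body><img src="../icons/logo.gif">'
    '<a href="next.html">next</a></body></html>',
)
```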
Managing topics
• Defining a topic
  • Topic = <id, name, criteria, popularity, date, …>
  • Popularity = f(F, R, P, U)
    F: access frequency of the topic
    R: time interval between the last access time and the current time
    P: number of logical pages belonging to the topic
    U: number of users accessing the topic
• Deciding the membership of a logical page in a topic
  • IR approaches (e.g. K-NN)
  • ML approaches (e.g. Support Vector Machine, SVM)
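To make the K-NN option concrete, here is a minimal sketch of topic assignment by nearest neighbours over bag-of-words vectors with cosine similarity. The slides do not describe the features or the distance measure that were actually used, so the whitespace tokenizer, the function names and the tiny labelled set are all assumptions for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_topic(text, labeled, k=3):
    """Assign the majority topic among the k most similar labeled texts."""
    query = vectorize(text)
    ranked = sorted(
        labeled,
        key=lambda item: cosine(query, vectorize(item[0])),
        reverse=True,
    )
    votes = Counter(topic for _, topic in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled logical pages (text, topic).
labeled = [
    ("xml schema and xml query languages", "XML"),
    ("parsing xml documents with dtd", "XML"),
    ("xml path expressions", "XML"),
    ("fuzzy logic controllers", "Fuzzy"),
    ("fuzzy set membership functions", "Fuzzy"),
]
topic = knn_topic("a query language for xml documents", labeled, k=3)
```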
Definitions
• We use the term “Priority” for object replacement. It is a function of several parameters, e.g. access frequency (F), time interval (R), size of object (S), retrieval cost (C), significance (G).
• Significance: importance of the topic
Caching policy: LRU-SP+
• Topic management
  • Priority = f(F, R, G)
• Logical page management
  • Basic logical pages only
  • Priority = g(F, R)
• Physical page management
  • LRU-SP: size-adjusted and popularity-aware LRU (K. Cheng et al., Compsac’00)
  • Priority = h(F, R, S)
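The slides give only the arguments of the three priority functions, not their definitions. Purely as an illustration of how such functions might be shaped, the sketch below assumes that priority grows with access frequency and significance and shrinks with the time since last access and with object size. These formulas are assumptions made here and should not be read as the LRU-SP+ or LRU-SP definitions.

```python
def topic_priority(freq, recency, significance):
    """f(F, R, G): assumed form, not taken from the slides."""
    return significance * freq / (1.0 + recency)


def logical_priority(freq, recency):
    """g(F, R): assumed form, not taken from the slides."""
    return freq / (1.0 + recency)


def physical_priority(freq, recency, size):
    """h(F, R, S): assumed form that favours frequent, recent, small objects."""
    return freq / ((1.0 + recency) * size)


# Hypothetical values: 4 accesses, 1 time unit since the last access.
t = topic_priority(4, 1, significance=2.0)
l = logical_priority(4, 1)
p = physical_priority(4, 1, size=2)
```

Whatever concrete forms are chosen, only the ordering they induce within each layer matters for picking a replacement candidate.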
Evaluate & add new objects
[Figure: topics T1 and T2 ordered by priority from higher to lower, above logical pages L1, L2, L3 and physical pages P10, P11, P12, P20, P21, P22, P30, P31, P40, P41, P42; a new object “D” arrives, annotated “‘D’ is of higher priority”.]
Replace an object
[Figure: topics T1, T2; logical pages L1, L2, L3; physical pages P10, P11, P12, P20, P21, P22, P23, P30, P31, P40, P41, P42.]
• Choose a candidate topic (T1)
• T1 has 1 logical page (L1), so choose (L1)
• (L1) has 3 physical pages (P10), (P11), (P12), where (P12) is shared by (L2)
• Choose a victim (P*) from (P10), (P11)
• Replace (P*) with the new page
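The replacement walk above can be sketched as a top-down selection: pick the lowest-priority topic, then its lowest-priority logical page, then the lowest-priority physical page that no other logical page shares. The dictionary layout, the function name choose_victim and all priority values below are assumptions for illustration; only the T1/L1/P10, P11, P12 structure follows the slide.

```python
def choose_victim(topics, logical_pages, physical_priority):
    """Pick a physical page to evict, walking topic -> logical -> physical.

    topics:            {topic: (priority, [logical page names])}
    logical_pages:     {logical page: (priority, [physical page names])}
    physical_priority: {physical page: priority}
    """
    # Step 1: candidate topic is the one with the lowest priority.
    topic = min(topics, key=lambda t: topics[t][0])
    # Step 2: within it, the logical page with the lowest priority.
    logical = min(topics[topic][1], key=lambda name: logical_pages[name][0])
    # Step 3: exclude physical pages shared with any other logical page.
    shared = {
        p
        for name, (_, members) in logical_pages.items()
        if name != logical
        for p in members
    }
    candidates = [p for p in logical_pages[logical][1] if p not in shared]
    if not candidates:
        return None  # nothing can be evicted without breaking another page
    # Step 4: the victim is the lowest-priority unshared physical page.
    return min(candidates, key=lambda p: physical_priority[p])


# Hypothetical priorities; P12 is shared by L1 and L2 as on the slide.
topics = {"T1": (1.0, ["L1"]), "T2": (3.0, ["L2", "L3"])}
logical_pages = {
    "L1": (1.0, ["P10", "P11", "P12"]),
    "L2": (2.0, ["P20", "P21", "P22", "P12"]),
    "L3": (2.5, ["P30", "P31"]),
}
physical_priority = {
    "P10": 0.9, "P11": 0.4, "P12": 0.1,
    "P20": 0.8, "P21": 0.7, "P22": 0.6,
    "P30": 0.5, "P31": 0.3,
}
victim = choose_victim(topics, logical_pages, physical_priority)
```

In this example P12 has the lowest physical priority, but it is skipped because evicting it would also damage L2, so the victim comes from P10 and P11.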
Preliminary experiments
• Replay access logs of our proxy server (Squid)
  • 30 clients, 30 days
  • 873,824 requests, 21.30 GB of data
  • 7 topics, priority [1..5]
• Significance factor ([0, 2])
  • Measures the significance of each topic
• Hit Rate (HR)
  • Percentage of requests satisfied by the cache
• Profit Rate (PR)
  • [Formula not preserved; its accompanying note reads “is significance of topic”]
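Measuring hit rate by log replay can be sketched as follows. The example replays a list of requested URLs through a plain LRU cache with a capacity counted in objects; LRU is used here only as a simple stand-in, not LRU-SP+ or LRV, and the function name and the tiny log are assumptions for illustration.

```python
from collections import OrderedDict


def replay_hit_rate(requests, capacity):
    """Replay requests through an LRU cache and return the hit rate in percent."""
    cache = OrderedDict()
    hits = 0
    for url in requests:
        if url in cache:
            hits += 1
            cache.move_to_end(url)  # mark as most recently used
        else:
            cache[url] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return 100.0 * hits / len(requests) if requests else 0.0


# Hypothetical access log of six requests over three distinct objects.
log = ["a", "b", "a", "c", "b", "a"]
small = replay_hit_rate(log, capacity=2)
large = replay_hit_rate(log, capacity=3)
```

Sweeping the capacity over a range of values and plotting the resulting hit rates gives a curve of hit rate against cache space.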
Baseline algorithm: LRV (Rizzo et al. 1998)
• A physical-page-based algorithm
• Uses size (S) to predict further accesses to incoming objects
• Parameters under consideration
  • Access frequency (F)
  • Time interval (R)
  • Size of objects (S)
Results: hit rates
[Figure: hit rate versus cache space (in % of total unique data); annotation: 20% up.]
Results: profit rates
[Figure: profit rate versus cache space (in % of total unique data); annotation: 30% up.]
Conclusion and future work
• The performance of caching proxies can be improved remarkably if cache contents are well organized and managed
• We proposed a hierarchical model and a cache management scheme based on that model
• Future work
  • Tuning various parameters to achieve better performance (logical page clustering, balancing significance and popularity in priorities, etc.)
  • More experiments