Proxy-Server Architectures for OLAP
Panos Kalnis, Dimitris Papadias
THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY
The Problem
• Data warehouses: Large repositories of historical, summarized information
• Distributed: Centralized or decentralized. Static structure!
• WWW: New opportunities to access warehouses. Example: Stock market data
  • Professional brokers: Access the warehouse directly through special-purpose OLAP software
  • Individual investors around the world: Use web browsers. Slow network? Server overloading? Caching?
[Figure: London Stock Market Warehouse, Internet, and OLAP clients; locations labeled Tokyo, Singapore, Hong Kong]
OLAP Cache Servers (OCS)
• Similar to WWW Proxy-Servers
• Geographically distributed and connected through an arbitrary network
• They cache results from OLAP queries
• Can derive new results from the cached data
• Clients connect to an OCS. If the OCS cannot answer, the query is redirected to a neighbor OCS or to the warehouse
• Result: Lower network cost, better scalability, lower response time
[Figure: London Stock Market Warehouse, Internet, OCSs, and OLAP clients; locations labeled Tokyo, Singapore, Hong Kong]
OCS vs. WWW Proxy-Servers
• An OCS has computational capabilities
• The cache admission and replacement policies are optimized for OLAP operations
• An OCS can update its contents incrementally, instead of invalidating the cached data
Background
• Data Cube Lattice: Interdependencies among views
  SELECT P_id, T_id, SUM(Sales) FROM data GROUP BY P_id, T_id
• Client-Server OLAP Caching
  • Watchman: Semantic caching
  • DynaMat: Stores fragments
  • Caching chunks
• OCSs may use any of these methods
• The prototype caches entire views
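The interdependencies in the data cube lattice can be illustrated with a short sketch. It assumes a view is identified by its set of group-by dimensions and that a cached view is stored as a mapping from key tuples to SUM(Sales); the function names `lattice`, `can_derive`, and `roll_up` are hypothetical and not taken from the prototype.

```python
from itertools import combinations


def lattice(dimensions):
    """Enumerate every group-by (view) of the cube, finest first."""
    dims = list(dimensions)
    return [frozenset(c)
            for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]


def can_derive(target_dims, source_dims):
    """A view is derivable from any view that groups by a superset
    of its dimensions (assumption: SUM is the only aggregate)."""
    return frozenset(target_dims) <= frozenset(source_dims)


def roll_up(rows, source_dims, target_dims):
    """Aggregate a finer cached view down to a coarser one.

    rows maps key tuples (ordered as source_dims) to SUM(Sales).
    """
    if not can_derive(target_dims, source_dims):
        raise ValueError("target view is not derivable from source view")
    positions = [list(source_dims).index(d) for d in target_dims]
    result = {}
    for key, sales in rows.items():
        coarse_key = tuple(key[i] for i in positions)
        result[coarse_key] = result.get(coarse_key, 0) + sales
    return result


# Cached result of: SELECT P_id, T_id, SUM(Sales) ... GROUP BY P_id, T_id
by_product_and_time = {("p1", "t1"): 10, ("p1", "t2"): 5, ("p2", "t1"): 7}
by_product = roll_up(by_product_and_time, ("P_id", "T_id"), ("P_id",))
```

Here `by_product` answers the coarser query grouped by P_id alone without contacting the warehouse, which is the sense in which an OCS "can derive new results from the cached data".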
System Architecture
• Multiple levels of caching
• Cooperation among OCSs
• Physical organization and fragmentation may differ in each OCS
• Centralized: Query optimization and cache control in a central site (intranet)
• Semi-centralized: Only query optimization in the central site. Each OCS controls its local cache
• Autonomous: All decisions are taken locally (internet)
Query Optimizer
• A client sends a query q
• Autonomous policy, three cases:
  • OCS has the exact answer
  • OCS cannot answer q
  • OCS can derive q
• Cost = Read + Transfer
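A minimal sketch of how an autonomous OCS might choose a plan for q is given below. The slide only states the three cases and Cost = Read + Transfer; everything else here is an assumption made for illustration: the `Option` class, the `plan_query` function, using the view size as the read cost, a zero transfer cost for local answers, and simply picking the cheapest option.

```python
from dataclasses import dataclass


@dataclass
class Option:
    site: str             # hypothetical label: where the query would be answered
    read_cost: float      # cost of reading (and aggregating) the data there
    transfer_cost: float  # cost of shipping the result to this OCS

    @property
    def total(self):
        # Cost = Read + Transfer, as on the slide
        return self.read_cost + self.transfer_cost


def plan_query(query_dims, local_views, remote_options):
    """Pick the cheapest way to answer q.

    local_views maps frozenset(dimensions) -> view size (assumed read cost).
    remote_options lists Options for neighbor OCSs and the warehouse.
    """
    q = frozenset(query_dims)
    options = list(remote_options)
    # Local views that can answer q: the exact view or any finer ancestor.
    usable = [(size, dims != q) for dims, size in local_views.items() if q <= dims]
    if usable:
        size, derived = min(usable)  # cheapest first; exact wins ties
        site = "local-derived" if derived else "local-exact"
        options.append(Option(site, read_cost=size, transfer_cost=0.0))
    if not options:
        raise LookupError("no site can answer the query")
    return min(options, key=lambda o: o.total)


local = {frozenset({"P_id", "T_id"}): 1000.0}
remote = [Option("neighbor", 50.0, 200.0), Option("warehouse", 10.0, 2000.0)]
derived_plan = plan_query(["P_id"], local, remote)
exact_plan = plan_query(["P_id", "T_id"], local, remote[1:])
```

With these made-up costs, deriving q locally from the large cached view is more expensive than fetching it from the neighbor, so `derived_plan.site` is "neighbor"; when only the warehouse is reachable, the exact local copy wins.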
Query Optimizer (cont.)
• Autonomous: Scalable, easy to implement, high availability
  • Suits large, unstructured, dynamic environments
  • BUT may produce inefficient plans
• Centralized (and semi-centralized):
  • A central site has global information for all OCSs
  • Creates the execution and routing plan for all queries
  • Low availability, low scalability
  • Suitable for intranets
Caching Policy: Autonomous
• Lower Benefit First (LBF): Considers interdependencies, but:
  • Cost() is difficult to calculate. If v cannot be answered locally, we assume that it is answered by the warehouse
  • The complexity of LBF grows quadratically with the number of materialized views
• To admit a new result u, select a victim set and evict it from the cache only if its combined benefit < benefit(u)
• Selecting the victim set: Similar idea to [HRU96]
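The admission test can be sketched as follows. This is not the LBF algorithm itself: the benefit values are taken as given inputs (the slide notes that the underlying cost is hard to compute), and the greedy lowest-benefit-first choice of victims is an assumption for illustration. The function name `admit` and the cache layout are hypothetical.

```python
def admit(cache, capacity, name, size, benefit):
    """Try to admit result `name` into `cache` (name -> (size, benefit)).

    Returns the list of evicted names on success, or None if the new
    result is rejected. The cache is modified only on success.
    """
    if size > capacity:
        return None  # can never fit
    free = capacity - sum(s for s, _ in cache.values())
    victims, victim_benefit = [], 0.0
    # Assumed victim selection: take the lowest-benefit entries first
    # until the new result fits.
    for victim, (v_size, v_benefit) in sorted(cache.items(),
                                              key=lambda kv: kv[1][1]):
        if free >= size:
            break
        victims.append(victim)
        free += v_size
        victim_benefit += v_benefit
    # Evict the set only if its combined benefit < benefit(u).
    if victims and victim_benefit >= benefit:
        return None
    for victim in victims:
        del cache[victim]
    cache[name] = (size, benefit)
    return victims


cache = {"a": (40, 5.0), "b": (40, 20.0)}
first = admit(cache, 100, "u", 50, 10.0)   # evicts "a" (benefit 5 < 10)
second = admit(cache, 100, "w", 60, 8.0)   # would evict "u" (10 >= 8): rejected
```

In this run the first call replaces "a" with "u", while the second call leaves the cache unchanged because the only sufficient victim set is worth more than the incoming result.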
Caching Policy: Centralized
• All decisions are taken at the central site
• The centralized policy uses Smaller Penalty First (SPF)
• Experiments show that the difference between SPF and LBF is not significant
• In general: A bad decision of the caching algorithm does not affect performance significantly, BUT a bad decision of the optimizer has a significant impact
Updates
• Changes are propagated periodically to the warehouse, which computes deltas for its materialized views
• No down time for the OCSs
• An OCS updates its cache on demand: Invalidate vs. incrementally update
• Deltas are treated as normal data
• Deltas are evicted at the end of the update period
• Non-updated results are also evicted
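The choice between invalidating and incrementally updating a cached result can be sketched as below. The sketch assumes a cached view maps key tuples to SUM(Sales), that a delta has the same shape and holds the change for one update period, and that a stale entry is exactly one period behind; `apply_delta` and `refresh_on_demand` are hypothetical names.

```python
def apply_delta(rows, delta):
    """Incrementally update a SUM view by merging the delta into it."""
    updated = dict(rows)
    for key, change in delta.items():
        updated[key] = updated.get(key, 0) + change
    return updated


def refresh_on_demand(entry, current_period, deltas):
    """Bring a cached entry up to date when it is next requested.

    entry is {"rows": ..., "period": ...}; deltas maps period -> delta.
    Returns the refreshed entry, or None if it has to be invalidated.
    """
    if entry["period"] == current_period:
        return entry  # already fresh
    delta = deltas.get(current_period)
    if delta is None:
        return None   # no delta available: invalidate
    return {"rows": apply_delta(entry["rows"], delta),
            "period": current_period}


cached = {"rows": {("p1",): 15, ("p2",): 7}, "period": 1}
delta_for_period_2 = {("p1",): 3, ("p3",): 4}
fresh = refresh_on_demand(cached, 2, {2: delta_for_period_2})
dropped = refresh_on_demand(cached, 2, {})
```

Because the delta is just another small aggregated result, it can be cached, transferred, and evicted like any other view, which matches the slide's remark that deltas are treated as normal data.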
Experimental Setup
• APB and TPC-H
• Cmax = maximum cache size, as a percentage of the entire cube
• 1500 queries at each OCS
[Figure: DCSR vs. Cmax; labels: Worst case OCS configuration, Client-Side-Cache]
Effect of Network Cost
• 3 OCSs; we vary the speed of the links to the DW
• In slow networks, OCSs utilize the contents of their neighbors
• In fast networks, many queries reach the warehouse, because the computation cost is lower
[Figures: DCSR vs. Cmax; Warehouse Hit Ratio vs. Cmax]
Autonomous vs. Semi-centralized
• Centralized Semi-Centralized
• High tightness or many OCSs Autonomous Semi-Centralized
[Figures: DCSR vs. tightness; DCSR vs. # of OCSs (100 OCSs)]
Conclusions
• OCS: Architecture for caching OLAP results
• Beneficial for ad-hoc, geographically dispersed, and possibly mobile users who sporadically need to access a warehouse
• Complementary to both client-side-cache systems and distributed OLAP approaches
• Future work: Prototype on top of a DBMS, support for multiple DWs, finer granularity of cached data, special queries