A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching

A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching Xiaodan Wang, Tanu Malik, Randal Burns Johns Hopkins University Stratos Papadomanolakis, Anastassia Ailamaki Carnegie Mellon University

Overview • Motivation • Data intensive scientific database federations • Mid-tier caching improves scalability • Choosing the unit of cache replacement • Minimize aggregate network traffic • Improve query execution performance • Query prototypes • Cache groups of columns • Adapts to changes in the workload

OpenSkyQuery • Federation of sky surveys (a virtual telescope) • Expected to grow from 30 sites to over 100 • Available over the Internet (community of astronomers, educational users) • Sites are autonomous, heterogeneous, and geographically distributed • Data intensive workload (large data sets, network-bound) • Scaling through mid-tier caching • Minimize network traffic • Offload query processing

Caching Schema • Difficult to achieve good query performance • Caches employ commodity hardware • An index-free environment • Both network and query performance are sensitive to granularity of cache replacement • Fine granularity (column) • Poor network performance at small cache sizes • High I/O overhead • Coarse granularity (table) • Groups unrelated columns • Inefficient query and network performance

Contributions • Cache workload-defined groups of columns (query prototypes) • Adaptive – candidate query prototypes are discovered incrementally from the request stream • Self-organizing – each prototype describes a physical schema optimized for a specific class of queries • Improve in-cache query execution performance without sacrificing network savings

Caching for Network Savings • Identify and cache database objects that provide network savings • Requests that access these objects are serviced from the cache • Reduces contention for network bandwidth • Bypass Yield Caching (Malik et al., ICDE’05) • Caching framework that uses economic principles to maximize network savings • Database objects are ranked by yield (expected network savings per unit of cache space utilized)

Choosing the Unit of Cache Replacement • Semantic caching is unsuitable for Astronomy • Lack locality (objects are rarely reused) • Evaluating query containment is difficult (nested queries, complex joins, and user-defined functions are common) • Employ schema-based caching • Queries reuse the same set of columns • Derive popular columns from the workload • Analogous materialized views

File-Bundling (Otoo et al., SC’04) • Loading only columns with high yield at small cache sizes Cache Q1 Q2 Q3 Q4 Caching columns B, C, H, and I results in no cache hits Solution: cache groups of columns

Caching Groups of Columns • Existing schema-based caching models are static (e.g. CacheTables, MTCache, TimesTen) • Do not account for dynamic workload access patterns • Physical schema of backend database or defined a priori • May group columns that are rarely used together • Query prototypes caching • Identifies the best groupings from the workload • Minimizes query execution cost against prototypes without sacrificing network savings

Query Prototype • Given a query qi, define the Query Access Set, QAS(qi), as the set of attributes accessed by qi • qi and qj share the same query prototype if they access the same attributes (QAS(qi) = QAS(qj)) Example: SELECT objID FROM Galaxy, SpecObj WHERE objID = bestobjID and specclass = 2 and z between 0.121 and 0.127 QAS = {Galaxy:objID, SpecObj:bestobjID, SpecObj:specclass, SpecObj:z}

Query Prototype QAS(Q1) = {R1:A2, R1:A3, R2:B1} QAS(Q2) = {R2:B1, R2:B2, R2:B3} Q1 Q2 Cache Prototype Prototype R1 R2 B1 is replicated in the cache Base Tables

Workload Properties • Read-only queries • One-month trace against the Sloan Digital Sky Survey (SDSS) Data Release 4 – 2TB • 1.4 million queries generating 360GB of network traffic • 1176 query prototypes describe the entire workload • 11 prototypes capture 91% of the queries • 6 prototypes generate 89% of the network traffic

Experiments • Evaluate caching of tables, columns, vertical partitions, and query prototypes • AutoPart (Papadomanolakiset al., SSDBM’04) • An automated partitioning algorithm for large scientific databases • Groups columns in order to improve query execution performance • Produces the best workload-driven, static grouping

Network Savings

Cache Pollution

Query Performance

Discussion • Improving network and query execution performance are complementary goals • Columns should be grouped together at small cache sizes (cache hits suffer due to file-bundling) • Column groupings should be adaptive because • Workload access pattern is dynamic • Indexes are not available

Questions ???

Schema Reuse • Localized to a small subset of tables

Schema Reuse • Similar reuse among columns

Object Reuse • Few objects are reused

SkyQuery • Federation middleware built at Hopkins • Wrapper/Mediator architecture using web services

Load Cost

Scan Cost • Scanning large tables, the useful region is a small fraction • Incur IO overhead for accessing data from extraneous columns • Spatial locality among related columns Q

Join Cost • Joining results for queries that access multiple fragments • Access should be localized to few fragments to minimize join overhead Q

A Workload-Driven Unit of Cache Replacement for Mid-Tier Database Caching