Optimal Distributed Declustering using Replication


Presentation Transcript


  1. Optimal Distributed Declustering using Replication
     Keith Frikken, Purdue University
     Jan 5, 2005, ICDT 2005

  2. Declustering Data
     • Declustering data over multiple disks to improve performance for range queries has been well studied
     • Applications include:
        • Spatio-temporal databases
        • Image and video data
        • Scientific simulation datasets

  3. Goal
     • Divide data uniformly along each dimension to create tiles
     • Put the records contained in each tile on different disks so that I/O can be parallelized
     • Assumptions
        • Data can be tiled in such a way
        • Disks have constant retrieval times
     • Assigning tiles to disks is similar to a coloring problem (disks are colors)
     • A range query is answered optimally if the number of I/O retrievals from any single disk is (# of tiles) / (# of disks)
     • Two approaches:
        • Coloring schemes
        • Replication

  4. Notations
     • k is the number of disks
     • m is the number of tiles in a query
     • r is the level of replication (here, r = 2)
     • Q is the set of all range queries
     • ret(q) is the actual retrieval time of q
     • The optimal retrieval time for a query q is o_q = m/k
     • Additive error: ε = max_{q ∈ Q} {ret(q) − o_q}

  5. Coloring schemes
     • Disk Modulo (DM) [Du and Sobolewski, 1982]
     • Fieldwise XOR (FX) [Kim and Pramanik, 1988]
     • Cyclic Schemes (RPHM, GFIB, EXH) [Prabhakar et al, 1998]
     • Golden Ratio Sequences (GRS) [Bhatia et al, 2000]
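
As a rough illustration of the notation and of a coloring scheme, the sketch below implements Disk Modulo in its usual two-dimensional form, f(x, y) = (x + y) mod k, and brute-forces the additive error over every range query in a small grid. The function names, the grid size parameter n, and the use of ceil(m/k) as the integer optimum are assumptions made for this example; they are not taken from the talk.

```python
from collections import Counter
from itertools import product
from math import ceil


def disk_modulo(x, y, k):
    """Disk Modulo coloring: tile (x, y) is stored on disk (x + y) mod k."""
    return (x + y) % k


def retrieval_time(color, x0, y0, w, h, k):
    """ret(q): the most tiles of the w-by-h query at (x0, y0) held by one disk."""
    loads = Counter(color(x, y, k)
                    for x in range(x0, x0 + w)
                    for y in range(y0, y0 + h))
    return max(loads.values())


def additive_error(color, n, k):
    """Worst ret(q) - ceil(m/k) over all range queries inside an n-by-n grid."""
    worst = 0
    for w, h in product(range(1, n + 1), repeat=2):
        for x0, y0 in product(range(n - w + 1), range(n - h + 1)):
            m = w * h  # number of tiles in the query
            err = retrieval_time(color, x0, y0, w, h, k) - ceil(m / k)
            worst = max(worst, err)
    return worst
```

For example, with k = 4 a 2-by-2 query under Disk Modulo touches disks 0, 1, 1, 2, so ret(q) = 2 while the optimum is 1.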

  6. Other schemes
     • [Atallah and Prabhakar, 2000] developed a scheme for two-dimensional grids with k = 2^n disks that has an additive error of O(log k)
     • [Sinha et al, 2001] proved lower bounds on the additive error of Ω(log k) for 2 dimensions and Ω(log^((d-1)/2) k) for d (>2) dimensions
     • [Chen and Cheng, 2002] showed that an additive error of O(log^(d-1) k) is achievable for any number of dimensions (>2)

  7. Replication
     • Placing records on multiple disks can further improve the performance of declustering schemes
     • Two problems:
        • How to schedule a query (i.e., which tiles are retrieved from each disk)
        • How to use replication to balance load
     • Approaches:
        • Chained Declustering [Hsiao and DeWitt, 1990]
        • Random Duplicate Allocation (RDA) [Sanders et al, 2000], [Sanders, 2001], and [Czumaj and Scheideler, 2003]

  8. Replication Results
     • Chained Declustering
        • Fast scheduling algorithm: O(m+k) time to test whether a specific retrieval time is possible [Aerts et al, 2000]
     • RDA
        • If m ≥ ck(log k), then optimal with high probability [Czumaj and Scheideler, 2003]
        • "Fast" scheduling algorithm: O(ΔkO(1)) time [Czumaj and Scheideler, 2003]
     • Hybrid techniques [Chen and Cheng, 2002]
        • Use GRS with a second, random disk

  9. Our Results
     • We define a new class of schemes called the shift schemes
        • Deterministic
        • Any query with at least k(k-1)ε tiles can be answered optimally
        • Queries can be scheduled in O(m + k log ε) time
        • If a single disk fails, any query with at least k(k-1)ε tiles can still be answered optimally
        • Experimental performance is similar to RDA (better in many cases)

  10. Shift Scheme Definition
     • Use any strong coloring scheme
     • Use a modified chained declustering
        • Defined by a shift value s (where gcd(s,k) = 1)
        • The base scheme is defined by a function f(x,y)
        • The second color is (f(x,y) + s) mod k
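
A minimal sketch of the definition above, assuming the base coloring is passed in as a plain Python function. The names shift_replicas and place, and the choice of Disk Modulo as the base scheme, are hypothetical and only serve the example.

```python
from math import gcd


def shift_replicas(f, s, k):
    """Build a function mapping tile (x, y) to its (primary, secondary) disks.

    f is the base coloring, s is the shift value, k is the number of disks.
    """
    if gcd(s, k) != 1:
        raise ValueError("shift value s must be coprime to k")

    def replicas(x, y):
        primary = f(x, y) % k
        secondary = (primary + s) % k  # second color: (f(x, y) + s) mod k
        return primary, secondary

    return replicas


# Hypothetical base scheme for the example: Disk Modulo, f(x, y) = x + y.
k = 5
place = shift_replicas(lambda x, y: x + y, s=2, k=k)
```

Here place(1, 2) gives (3, 0): the tile's primary copy is on disk 3 and its replica on disk 0. One consequence of gcd(s,k) = 1 is that following the shift from disk to disk visits all k disks before returning to the start, so the disks form a single chain.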

  12. Scheduling
     • A modification of the chained declustering scheduling algorithm can schedule queries in O(m + k log ε) time
     • Essentially, use the previous algorithm to test whether a specific load is possible, and do a binary search over the possible loads
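
The idea of a feasibility test plus binary search can be sketched as follows. This is not the algorithm of [Aerts et al, 2000] or the one from the talk; it is a simple stand-in written for this example. It assumes the disks are listed in chain order, so that position i may pass some of its own primary tiles to position i+1 (wrapping around), and that t[i] counts the query tiles whose primary copy is at position i. Both function names are hypothetical.

```python
def shifts_for_load(t, load):
    """Return shifts d (d[i] tiles move from position i to i+1, cyclically)
    that keep every disk at or below `load`, or None if none exist."""
    k = len(t)
    if k * load < sum(t):
        return None  # total capacity is too small
    carry = 0  # tiles arriving from the previous position
    d = []
    # The first pass finds the carry entering position 0 after wrap-around;
    # the second pass then fixes every d[i] using that carry.
    for _ in range(2):
        d = []
        for ti in t:
            if carry > load:
                return None  # would have to pass on more than ti own tiles
            carry = max(0, ti + carry - load)  # smallest shift that fits
            d.append(carry)
    return d


def min_max_load(t):
    """Binary search for the smallest feasible maximum load per disk."""
    lo = -(-sum(t) // len(t))  # ceil(m / k), the best conceivable load
    hi = max(t)                # always feasible: shift nothing
    while lo < hi:
        mid = (lo + hi) // 2
        if shifts_for_load(t, mid) is not None:
            hi = mid
        else:
            lo = mid + 1
    return lo, shifts_for_load(t, lo)
```

For instance, min_max_load([5, 3, 1, 3]) reaches the optimum of 3 tiles per disk with shifts [2, 2, 0, 0], while min_max_load([6, 0, 0, 0]) can do no better than 3, because the tiles received at position 1 cannot be passed on again.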

  13. Bound (1)
     • There are k disks (D_0, …, D_{k-1})
     • Disk D_i has t_i tiles initially (as the primary disk)
     • The number of tiles is m = t_0 + … + t_{k-1}
     • D_i shifts d_i tiles to D_{i+1} (indices taken mod k)
        • d_i ≤ t_i
     • The goal is to minimize the largest number of tiles at any disk, i.e., max_{0 ≤ i ≤ k-1} {d_{i-1} + t_i − d_i}

  14. Bound (2)
     • Recall,
        • o = m/k
        • max_{0 ≤ i ≤ k-1} {t_i} ≤ o + ε
     • Suppose m ≥ k(k-1)ε
     • Then,
        • o ≥ (k-1)ε
        • Surplus ( ) is bounded by (k-1)ε
        • max_{0 ≤ i ≤ k-1} {d_i} ≤ (k-1)ε ≤ o
     • Two cases:
        • If disk has a surplus
        • If disk has a shortage
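
A small numeric check of the bound, under assumptions made only for this example: o = m/k is an integer, the disks are listed in chain order, and each disk passes on the smallest number of tiles that brings it down to o. The instance t = [4, 4, 4, 0] has k = 4, o = 3 and ε = 1, so m = 12 = k(k-1)ε meets the condition with equality. The helper name shifts_at_optimal is hypothetical.

```python
def shifts_at_optimal(t):
    """Return (o, d): the optimal load o = m/k and the smallest shifts d
    that bring every disk down to o (d[i] tiles go from D_i to D_{i+1})."""
    k, m = len(t), sum(t)
    if m % k != 0:
        raise ValueError("this sketch assumes o = m/k is an integer")
    o = m // k
    carry, d = 0, []
    for _ in range(2):  # second pass accounts for wrap-around to D_0
        d = []
        for ti in t:
            carry = max(0, ti + carry - o)
            d.append(carry)
    return o, d


t = [4, 4, 4, 0]
k = len(t)
o, d = shifts_at_optimal(t)
eps = max(t) - o                                   # here: 1
loads = [t[i] - d[i] + d[i - 1] for i in range(k)]  # d[-1] wraps to D_{k-1}
```

In this instance d = [1, 2, 3, 0], so the largest shift equals (k-1)ε = 3 = o, every d_i stays within t_i, and all four disks end with exactly o = 3 tiles.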

  15. 32 disks

  16. 64 disks

  17. 128 disks

  18. 32 disks, 3 dimensions

  19. Generalizations
     • Permutations
     • Higher levels of replication
     • Survivability
        • If the level of replication is r, the scheme can handle any r-1 failures
        • When r = 2 and a single disk fails:
           • Fast scheduling is still possible
           • Large queries are still optimal

  20. Summary
     • Shift schemes are a new class of schemes
        • Optimal for "large enough" queries
        • Efficient scheduling algorithm
        • Resilient to disk failures
     • Future Work
        • Better analysis of the scheme
        • Choosing shift values
