
Using Space and Attribute Partitioned Partial Replicas for Data Subsetting and Aggregation Queries

Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz


Presentation Transcript


  1. Using Space and Attribute Partitioned Partial Replicas for Data Subsetting and Aggregation Queries
     Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz

  2. Motivation: Data-Driven Science
     Oil Reservoir Management; Magnetic Resonance Imaging
     Data-driven applications from science, engineering, and biomedicine:
     • Large spatio-temporal datasets
     • Several attributes at each point
     ICPP2006

  3. Replication of Scientific Datasets
     • A variety of queries on the same dataset
       • Each requires a different spatio-temporal region and subset of attributes
       • No chunking and indexing strategy can optimize for all
     • Replication: create multiple copies
       • Use different chunking and indexing schemes
       • Large storage overhead

  4. Partial Replication
     • Can we get the benefits of replication without the large overheads?
       • Not all attributes are accessed uniformly
       • Not all spatio-temporal regions are accessed with uniform probability
     • Partial replication: each replica has
       • Only a subset of attributes (attribute partitioned) and/or
       • Only a rectilinear spatio-temporal region (space partitioned)
     • Challenge:
       • No single partial replica may be able to answer the query
       • Can we choose and combine partial replicas to optimize query processing?

  5. Prior Work (CCGRID 05)
     • Query planning with partial replicas
       • Cost models
       • Greedy selection algorithm
     • Only considered space partitioned replicas
     • Considered SQL SELECT queries
     • Implemented as an extension to the Automatic Data Virtualization System (HPDC 04)

  6. Contributions
     • Support combined use of space and attribute partitioned partial replicas
     • Dynamic programming algorithm for selecting the best set of attribute partitioned replicas
     • New greedy strategy for recommending a combination of replicas
     • Extend the replica selection algorithm to address queries with aggregations (replicas may be unevenly stored across storage units)

  7. System Overview
     The Replica Selection Module is tightly coupled with our prior work on supporting SQL SELECT queries on scientific datasets in a cluster environment.

  8. Outline
     • Introduction
       • Motivation
       • Contributions
       • System overview
     • Query execution and algorithm design
       • Uniformly partitioned chunks and select queries
       • Uneven partitioning and aggregation operation
     • Experimental results
     • Related work
     • Conclusions

  9. Uniformly Partitioned Chunks and Select Queries
     • Computing the goodness value
       • goodness = useful data per chunk / cost per chunk
       • Chunk: an atomic unit in space partitioned replicas or a logical unit in attribute partitioned replicas
       • Full chunks and partial chunks of a partial replica
     • Cost per chunk = t_read * n_read + t_seek
       • t_read: average read time for a disk page
       • n_read: number of pages fetched
       • t_seek: average seek time
     • Fragment
       • An intermediate unit between a replica and its chunks
       • A group of full or partial chunks having the same goodness value in a replica
       • goodness per fragment = useful data per fragment / cost per fragment
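
The goodness and cost definitions on slide 9 can be sketched in Python as follows. This is only an illustration: the names `Chunk`, `chunk_cost`, `chunk_goodness`, and `fragment_goodness` are hypothetical, and summing per-chunk costs to obtain the fragment cost is an assumption, since the slide does not say how chunk costs combine within a fragment.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    useful_data: float  # amount of data in the chunk that the query needs
    n_read: int         # number of disk pages fetched to read the chunk


def chunk_cost(chunk: Chunk, t_read: float, t_seek: float) -> float:
    """Cost per chunk = t_read * n_read + t_seek (slide 9)."""
    return t_read * chunk.n_read + t_seek


def chunk_goodness(chunk: Chunk, t_read: float, t_seek: float) -> float:
    """goodness = useful data per chunk / cost per chunk."""
    return chunk.useful_data / chunk_cost(chunk, t_read, t_seek)


def fragment_goodness(chunks: list[Chunk], t_read: float, t_seek: float) -> float:
    """Useful data per fragment / cost per fragment.

    Assumption: the fragment cost is the sum of its chunk costs.
    """
    useful = sum(c.useful_data for c in chunks)
    cost = sum(chunk_cost(c, t_read, t_seek) for c in chunks)
    return useful / cost
```

With `t_read = 1.0` and `t_seek = 4.0`, a full chunk `Chunk(80.0, 4)` scores higher than a partial chunk `Chunk(20.0, 4)`, because both pay the same I/O cost but the partial chunk returns less useful data.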

  10. An Example: Query and Intersecting Replicas
      • Replica 1: 3 full chunks and 2 partial chunks; 3 fragments
      • Composite Replica 2: 10 full chunks; 1 fragment

  11. General Structure of Replica Selection Algorithm

  12. Dynamic Programming Algorithm
      • R: a group of attribute-partitioned replicas (input)
      • R': the optimal combination (output)
      • l: the number of referred attributes in Q
      • M_1..l: the referred attribute list
      Flowchart:
      • Calculate Cost_j,j
      • Foreach k from 2 to l, foreach u from 1 to l-k+1:
        • r1 contains only M_u..v? Yes: calculate Cost_u..v, Loc_u..v->s = -1, Loc_u..v->r = r1
        • No: r2 contains M_u..v? Yes: calculate Cost_u..v, Loc_u..v->s = -1, Loc_u..v->r = r2
        • No: Cost_u..v = ∞
        • Find the q_min = Cost_u..p + Cost_p+1..v; Cost_u..v = q, Loc_u..v->s = p, Loc_u..v->r = -1
      • Output(Loc_1..l)
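
One way to read the flowchart on slide 12 is as an interval dynamic program over the ordered list of referred attributes, sketched below in Python. Several things here are assumptions rather than the authors' implementation: the function name `select_attribute_replicas`, the `read_cost` callable (a stand-in for the cost model), and the simplification that the "r1 contains only" and "r2 contains" cases are both handled by a single subset test, leaving the cost function to penalize replicas that store extra attributes.

```python
import math


def select_attribute_replicas(attrs, replicas, read_cost):
    """Pick a minimum-cost combination of attribute-partitioned replicas.

    attrs:     ordered list of attributes referred to by the query (M_1..l)
    replicas:  dict mapping replica name -> frozenset of stored attributes
    read_cost: hypothetical callable (name, wanted_attrs) -> estimated cost
    Returns (total_cost, plan), where plan is a list of (replica, attributes).
    """
    l = len(attrs)
    cost = [[math.inf] * l for _ in range(l)]
    loc = [[None] * l for _ in range(l)]

    for k in range(1, l + 1):            # interval length
        for u in range(0, l - k + 1):    # interval start (0-based)
            v = u + k - 1
            wanted = frozenset(attrs[u:v + 1])
            # Case 1: a single replica supplies the whole interval.
            for name, stored in replicas.items():
                if wanted <= stored:
                    c = read_cost(name, wanted)
                    if c < cost[u][v]:
                        cost[u][v], loc[u][v] = c, ("replica", name)
            # Case 2: split the interval at p and combine two sub-solutions.
            for p in range(u, v):
                q = cost[u][p] + cost[p + 1][v]
                if q < cost[u][v]:
                    cost[u][v], loc[u][v] = q, ("split", p)

    if math.isinf(cost[0][l - 1]):
        raise ValueError("no combination of replicas covers all attributes")

    def build(u, v):
        kind, val = loc[u][v]
        if kind == "replica":
            return [(val, attrs[u:v + 1])]
        return build(u, val) + build(val + 1, v)

    return cost[0][l - 1], build(0, l - 1)
```

As a toy usage example, take attributes `A`..`D`, replicas `r1 = {A, B}`, `r2 = {C, D}`, and a full copy storing six attributes, with a made-up cost equal to the number of attributes the replica stores. The sketch then prefers reading `A, B` from `r1` and `C, D` from `r2` (cost 4) over reading everything from the full copy (cost 6).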

  13. Greedy Strategy
      • Q: an issued query
      • R: the partial replicas
      • D: the original dataset
      • F: all fragments intersecting with the query boundary
      • Fmax: the fragment with the maximum goodness value in F
      • S: the ordered list of the candidate fragments in decreasing order of their goodness value
      Flowchart:
      • Input Q, R, D
      • Calculate the fragment set F
      • F is null? No:
        • Append Fmax into S
        • Remove Fmax from F
        • Overlap with Fmax exists in F? Yes: subtract the overlap and re-compute the goodness value
      • F is null? Yes: add D if needed
      • Output S
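
The greedy loop on slide 13 can be sketched in Python under strong simplifying assumptions: the query region is modeled as a set of discrete cells, each fragment is the set of cells it can supply, and each fragment keeps a fixed cost after overlap subtraction (the real cost model would re-estimate it). The function name `greedy_select` and the label `"D"` for the original dataset are hypothetical.

```python
def greedy_select(query_cells, fragments, cost):
    """Return an ordered list S of (fragment name, cells it will supply).

    query_cells: set of cell ids requested by the query
    fragments:   dict mapping fragment name -> set of cell ids it stores
    cost:        dict mapping fragment name -> estimated retrieval cost
    """
    # F: fragments intersecting the query, clipped to the query region.
    remaining = {name: set(cells) & set(query_cells)
                 for name, cells in fragments.items()}
    remaining = {name: cells for name, cells in remaining.items() if cells}

    uncovered = set(query_cells)
    selected = []
    while remaining:
        # Fmax: fragment with the highest goodness (useful cells / cost).
        best = max(remaining, key=lambda n: len(remaining[n]) / cost[n])
        best_cells = remaining.pop(best)
        selected.append((best, best_cells))
        uncovered -= best_cells
        # Subtract the overlap with Fmax; goodness is recomputed on the next
        # pass because it depends on the cells a fragment can still supply.
        for name in list(remaining):
            remaining[name] -= best_cells
            if not remaining[name]:
                del remaining[name]

    if uncovered:
        # Add D if needed: the original dataset serves whatever is left.
        selected.append(("D", uncovered))
    return selected
```

For a query over cells 0..9, a fragment `r1` covering cells 0..5 at cost 3 is chosen first, a fragment `r2` covering cells 4..7 at cost 4 then contributes only cells 6 and 7, and the original dataset supplies cells 8 and 9.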

  14. Uneven Partitioning and Aggregation Operations
      • Computing the goodness value
        • Goodness(F) = Σ_{p∈P} data(F) / max_{p∈P} (cost_p(CurLoad) + cost_p(F))
        • P: all available storage nodes
        • CurLoad: current workload across all storage nodes due to previously chosen candidate replicas
      • Cost per fragment = t_read * n_read + t_seek * n_seek + t_filter * n_filter + t_agg * n_agg + t_trans * n_trans
        • t_filter: average filtering time for a tuple
        • n_filter: number of total tuples in all chunks
        • t_agg: average aggregate computation time for a tuple
        • n_agg: number of total useful tuples
        • t_trans: network transfer time for one unit of data
        • n_trans: the amount of data after the aggregate operation
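
The workload-aware goodness on slide 14 can be sketched in Python as below. The function names and the dictionary-based inputs are hypothetical, and reading the numerator as the useful data that fragment F contributes on each node is an interpretation of the slide's summation over P.

```python
def fragment_cost(n, t):
    """Cost of a fragment on one node, following the five terms on slide 14.

    n: counts keyed by "read", "seek", "filter", "agg", "trans"
    t: unit times keyed the same way
    """
    terms = ("read", "seek", "filter", "agg", "trans")
    return sum(t[x] * n[x] for x in terms)


def workload_aware_goodness(nodes, data, frag_cost, cur_load):
    """Goodness(F) = sum of useful data over nodes / slowest node's finish time.

    nodes:     iterable of storage node ids (P)
    data:      dict node -> useful data of F stored on that node
    frag_cost: dict node -> cost of processing F's share on that node
    cur_load:  dict node -> cost already assigned by earlier choices (CurLoad)
    """
    nodes = list(nodes)
    useful = sum(data.get(p, 0.0) for p in nodes)
    finish = max(cur_load.get(p, 0.0) + frag_cost.get(p, 0.0) for p in nodes)
    return useful / finish
```

With two nodes where `p0` already carries a load of 2.0, a fragment stored on `p0` with cost 2.0 scores 100 / 4 = 25, while an equally sized fragment stored on the idle node `p1` scores 100 / 2 = 50, so the less loaded placement is preferred.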

  15. Workload-Aware Greedy Strategy
      • Q: an issued query
      • F: the interesting fragment sets
      • D: the original dataset
      • F: all fragments intersecting with the query boundary
      • Fmax: the fragment with the maximum goodness value in F
      • S: the ordered list of the candidate fragments in decreasing order of their goodness value
      Flowchart:
      • Input Q, F, D
      • Foreach Fi in F: overlap with F-{Fi} exists? No: append Fi into S
      • F is NULL? No:
        • Calculate the current goodness value for Fi in F
        • Append Fmax into S
        • Remove Fmax from F
        • Overlap with Fmax exists in F? Yes: subtract the overlap
      • F is NULL? Yes: add D if needed
      • Output S

  16. Outline
      • Introduction
        • Motivation
        • Contributions
        • System overview
      • Query execution and algorithm design
        • Uniformly partitioned chunks and select queries
        • Uneven partitioning and aggregation operation
      • Experimental results
      • Related work
      • Conclusions

  17. Experimental Setup & Design
      A Linux cluster connected via switched Fast Ethernet. Each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks.
      • Performance evaluation of the combination of space-partitioned and attribute-partitioned replicas, and the benefit of attribute-partitioned replicas
      • Scalability test when increasing the number of nodes hosting the dataset
      • Performance test when query sizes are varied
      • Performance evaluation for aggregate queries with unevenly partitioned replicas

  18.

  19. SELECT attrlist from IPARS where RID in [0,1] and TIME in [1000,1399] and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;

  20. • attr+space part: the combined use of all replicas
      • space part: only use the space-partitioned replicas
      • A run-time optimization

  21. Query:
      SELECT * from IPARS where TIME>=1000 and TIME<=1599 and X>=0 and X<=11 and Y>=0 and Y<=31 and Z>=0 and Z<=31;
      • Up to 4 nodes, query execution time scales linearly.
      • Because seek cost dominates the total I/O overhead, execution time is not reduced by half when using 8 nodes.

  22. Query:
      SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;
      • Our algorithm chose replicas {1,3,4,6} out of all replicas in Table #1.
      • The query filters 83% of the retrieved data when using the original dataset only; in the presence of replicas, it needs to filter only about 50% of the retrieved data.

  23. Aggregate Queries with Unevenly Partitioned Replicas

  24. Aggregate Queries with Unevenly Partitioned Replicas

  25. • Alg: solution by the proposed algorithm
      • Alg+Ref: solution after the refinement step
      • Solution-1 & 2: two manually created solutions

  26. Related Work
      • Replication research
        • Exact copies of portions of data
        • Data availability and reliability
        • Multi-disk system with replicated data
      • Data caching techniques
        • Using aggregate memory and cooperative caches
        • Management and replacement of replicas
      • Our previous work on performance optimization using space partitioned replicas

  27. Conclusions
      • The proposed cost models are capable of estimating execution time trends.
      • The greedy strategy, together with the dynamic programming algorithm, can choose a good set of candidate replicas that decreases the query execution time.
      • Our implementations show good scalability.
      • When data transfer bandwidth is the limiting factor, a combination of space and attribute partitioned replicas should be preferred.
