Scatter-Gather-Merge Algorithm - Shourie Boddupalli
Data Parallelism
• Data parallelism is a form of parallelization in which the data is distributed across multiple processors of a parallel computing environment and processed concurrently.
• A data-parallel framework is attractive for large-scale data processing because it lets an application process huge amounts of data on commodity machines.
Data Warehouse
• A data warehouse is an online repository for decision support applications that answer business queries in a short time.
• Where can data parallelism be used in a warehouse?
 - Star schema
 - Star-join query
Approaches to Processing Star-Join Queries
• Data-parallel framework (e.g., Hive, CloudBase)
 - No need for up-to-date hardware and software.
 - Fault tolerance is provided, with its complexity hidden from the user.
• For join-query processing, however, computational efficiency is still at an immature stage.
Example Query
SELECT D_YEAR, S_NATION, P_CATEGORY
FROM DATE, CUSTOMER, SUPPLIER, PART, LINEORDER
WHERE LO_CUSTKEY = C_CUSTKEY
  AND LO_SUPKEY = S_SUPKEY
  AND LO_PARTKEY = P_PARTKEY
  AND C_REGION = 'AMERICA'
GROUP BY D_YEAR, S_NATION, P_CATEGORY;
Scatter-Gather-Merge
• As the name indicates, the algorithm has three phases: Scatter, Gather, and Merge.
• Key manipulation technique: the basic idea is to join the fact table with n dimension tables within three computational phases.
Scatter-Gather-Merge (contd.)
• During the Scatter phase:
 1) If the input is a tuple of the fact table (FT), the tuple is transformed into one key-value pair per foreign key (n pairs in total).
 2) If the input is a tuple of a dimension table, the tuple is transformed into a single new key-value pair.
• The Gather phase groups the key-value pairs by key and matches fact tuples with dimension tuples.
• The Merge phase produces the final result of the star-join query.
Algorithm
Algorithm 1 (Key manipulation algorithm of Scatter-Gather-Merge)

Scatter(r)
Input: r is a record.
  if (r is a record of the fact table F) then
    for each fki do
      Turn input tuple (fk1, fk2, ..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2, ..., fkn, rF)).
      Store and Distribute ((fki, i), (fk1, fk2, ..., fkn, rF)).
    endfor
  endif
  if (r is a record of dimension table Di) then
    Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
    Store and Distribute ((pki, i), rDi).
  endif

Gather(k, v)
Input: k is a key (join key). v is a set of records that have the same join key.
  Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2, ..., fkn, rF)) where pki = fki.
  Make an output ((fk1, fk2, ..., fkn), (rDi, rF)).
  Store and Distribute ((fk1, fk2, ..., fkn), (rDi, rF)).

Merge(k, v)
Input: k is a key (fk1, fk2, ..., fkn). v is a set of records that have the key.
  Aggregate every record with all ((fk1, fk2, ..., fkn), (rDi, rF)) where 1 ≤ i ≤ n.
  Make an output ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)).
  Store ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)). // final output
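The three functions of Algorithm 1 can be tried out on a toy data set. The following is a minimal in-memory Python sketch, not the presenter's implementation. The record layout (tagged tuples), the helper `group_by_key` that stands in for "Store and Distribute", the zero-based dimension index, and the choice to drop fact tuples that lack a match in some dimension (inner-join behaviour) are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed toy layout (not from the slides):
#   fact record:      ("F", (fk_1, ..., fk_n), r_F)
#   dimension record: ("D", i, pk_i, r_Di), with i counted from 0

def scatter(record):
    """Turn one input record into key-value pairs keyed by (join key, i)."""
    if record[0] == "F":
        _, fks, r_f = record
        # One pair per foreign key, each carrying the whole fact tuple.
        return [((fk, i), ("F", fks, r_f)) for i, fk in enumerate(fks)]
    _, i, pk, r_d = record
    return [((pk, i), ("D", r_d))]

def group_by_key(pairs):
    """Hypothetical stand-in for the framework's distribution step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def gather(key, values):
    """Match dimension records with fact records sharing the join key."""
    _, i = key
    dims = [v[1] for v in values if v[0] == "D"]
    facts = [(v[1], v[2]) for v in values if v[0] == "F"]
    return [(fks, (i, r_d, r_f)) for r_d in dims for fks, r_f in facts]

def merge(fks, values, n):
    """Combine the partial joins of one fact tuple into a final row."""
    by_dim = {i: r_d for i, r_d, _ in values}
    if len(by_dim) < n:  # assumption: unmatched tuples are dropped
        return []
    r_f = values[0][2]
    return [(fks, tuple(by_dim[i] for i in range(n)) + (r_f,))]

def star_join(records, n):
    scattered = (pair for r in records for pair in scatter(r))
    gathered = []
    for key, values in group_by_key(scattered).items():
        gathered.extend(gather(key, values))
    result = []
    for fks, values in group_by_key(gathered).items():
        result.extend(merge(fks, values, n))
    return sorted(result)

records = [
    ("D", 0, 1, "AMERICA"), ("D", 0, 2, "ASIA"),  # dimension 0
    ("D", 1, 10, "MFGR#1"),                       # dimension 1
    ("F", (1, 10), 500), ("F", (2, 10), 700), ("F", (3, 10), 900),
]
joined = star_join(records, n=2)
# joined == [((1, 10), ("AMERICA", "MFGR#1", 500)),
#            ((2, 10), ("ASIA", "MFGR#1", 700))]
```

The fact tuple with foreign keys (3, 10) is still scattered and gathered for dimension 1 even though it can never appear in the result. The sketch also relies on the foreign-key vector being unique per fact tuple, because Merge groups on that vector.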
Notation Used
• Di has the primary key PKi, which is associated with the foreign key FKi of F, where i is the dimension identification number of Di.
• Each tuple of Di is (pki, rDi), where pki is the value of the primary key PKi and rDi is a vector that contains the other attribute values.
• Each tuple of F is (fk1, fk2, ..., fkn, rF), where fki is the value of the foreign key FKi and rF is a vector that contains the other attribute values. Either the vector (fk1, fk2, ..., fkn) is unique in the fact table, or rF contains the primary key.
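IO Reduction Technique
• With the key manipulation technique, each fact tuple generates n intermediate results on the way to the final query result, and this volume needs to be reduced.
• To reduce the number of intermediate results, Bloom filters are introduced: a fact tuple is scattered only if every one of its foreign keys is found in the Bloom filter of the corresponding dimension table.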
Algorithm for IO Reduction
Algorithm 2 (Scatter-Gather-Merge algorithm)

Filter-Construction(r)
Input: r is a record. BFi is a Bloom filter of Di. CDi is the query's restriction on Di.
  if (r is a record of dimension table Di
      and r satisfies CDi) then
    Store and Distribute r.
    Add pki to BFi.
  endif

Scatter(r)
Input: r is a record.
  if (r is a record of the fact table F) then
    for each fki do
      if (fki is not contained in the corresponding BFi) then
        return
      endif
    endfor
    for each fki do
      Turn input tuple (fk1, fk2, ..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2, ..., fkn, rF)).
      Store and Distribute ((fki, i), (fk1, fk2, ..., fkn, rF)).
    endfor
  endif
  if (r is a record of dimension table Di) then
    Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
    Store and Distribute ((pki, i), rDi).
  endif
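The effect of Algorithm 2 on the Scatter step can be sketched in Python as follows. This is an illustrative sketch only: the `BloomFilter` class, its bit-array size and hash count, and the helper names `build_filters` and `scatter_fact` are assumptions, not taken from the slides. Restrictions are modelled as plain Python predicates over the non-key attributes of a dimension tuple.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; the sizes are illustrative assumptions."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

def build_filters(dimensions, conditions):
    """Filter-Construction: keep rows that pass the restriction, record keys."""
    filters, kept = [], []
    for rows, cond in zip(dimensions, conditions):
        bf = BloomFilter()
        passed = [(pk, r_d) for pk, r_d in rows if cond(r_d)]
        for pk, _ in passed:
            bf.add(pk)
        filters.append(bf)
        kept.append(passed)
    return filters, kept

def scatter_fact(fks, r_f, filters):
    """Emit nothing if any foreign key is definitely absent from its filter."""
    if any(fk not in bf for fk, bf in zip(fks, filters)):
        return []
    return [((fk, i), (fks, r_f)) for i, fk in enumerate(fks)]

customers = [(1, "AMERICA"), (2, "ASIA")]
parts = [(10, "MFGR#1")]
filters, kept = build_filters(
    [customers, parts],
    [lambda region: region == "AMERICA", lambda category: True],
)
survivor = scatter_fact((1, 10), 500, filters)
```

A Bloom filter never reports a key it contains as absent, so no qualifying fact tuple is lost. It can report false positives, in which case a non-qualifying fact tuple is still scattered and is only discarded later, when Gather finds no matching dimension record.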
Map-Reduce based Scatter-Gather-Merge Algorithm
• Three phases
 - Filter Construction
 - Scatter & Gather
 - Merge
Map-Reduce based Scatter-Gather-Merge Algorithm

< The Filter-Construction Phase >
Map(k, v)
Input: k is a key. v is a record of a participating dimension table that the star-join query has restrictions on.
  if (v is a record of dimension table Di
      and v satisfies CDi) then
    Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
    Emit ((pki, i), rDi).
  endif

Reduce(k, v)
Input: (k, v) is a filtered record of a dimension table. BF(i,j) is the Bloom filter of Di for the j-th Reduce process.
  Emit ((pki, i), rDi).
  Add pki to BF(i,j).

< The Scatter-and-Gather Phase >
Map(k, v) // scatter function
Input: k is a key. v is a record of the fact table or of a participating dimension table.
  if (v is a record of the fact table F) then
    for each fki do
      if (fki is not contained in the corresponding BF(i,j)) then
        return
      endif
    endfor
    for each fki do
      Turn input tuple (fk1, fk2, ..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2, ..., fkn, rF)).
      Emit ((fki, i), (fk1, fk2, ..., fkn, rF)).
    endfor
  endif
  if (v is a record of dimension table Di) then
    if (there are restrictions on Di) then
      Emit ((pki, i), rDi). // already a key-value pair from the Filter-Construction phase
    else
      Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
      Emit ((pki, i), rDi).
    endif
  endif

Reduce(k, v) // gather function
Input: k is a key (join key). v is a set of records that have the same join key.
  Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2, ..., fkn, rF)) where pki = fki.
  Make an output ((fk1, fk2, ..., fkn), (rDi, rF)).
  Emit ((fk1, fk2, ..., fkn), (rDi, rF)).

< The Merge Phase >
Map(k, v)
Input: k is a key (fk1, fk2, ..., fkn) and v is a value (rDi, rF).
  Emit ((fk1, fk2, ..., fkn), (rDi, rF)).

Reduce(k, v)
Input: k is a key (fk1, fk2, ..., fkn). v is a set of records that have the key.
  Aggregate every record with all ((fk1, fk2, ..., fkn), (rDi, rF)) where 1 ≤ i ≤ n.
  Make an output ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)).
  Emit ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)).
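To see how the three phases chain together as Map-Reduce jobs, the sketch below runs them on a toy, single-process stand-in for a Map-Reduce framework. Everything here is hypothetical scaffolding for illustration: `run_job` imitates map, shuffle and reduce in memory, plain Python sets stand in for the Bloom filters BF(i,j) (so there are no false positives), every dimension goes through the filter job (an unrestricted one gets an always-true condition), and fact tuples without a match in every dimension are dropped.

```python
from collections import defaultdict

def run_job(map_fn, reduce_fn, records):
    """Toy single-process Map-Reduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            shuffled[out_key].append(out_value)
    output = []
    for out_key in sorted(shuffled):
        output.extend(reduce_fn(out_key, shuffled[out_key]))
    return output

def make_filter_job(conditions, key_sets):
    def map_fn(i, row):  # i is the dimension index, row is (pk, r_d)
        pk, r_d = row
        if conditions[i](r_d):
            yield (pk, i), r_d
    def reduce_fn(key, values):
        pk, i = key
        key_sets[i].add(pk)  # a set stands in for the Bloom filter
        for r_d in values:
            yield i, (pk, r_d)
    return map_fn, reduce_fn

def make_scatter_gather_job(key_sets):
    def map_fn(table, row):  # scatter
        if table == "F":
            fks, r_f = row
            if all(fk in key_sets[i] for i, fk in enumerate(fks)):
                for i, fk in enumerate(fks):
                    yield (fk, i), ("F", fks, r_f)
        else:
            pk, r_d = row
            yield (pk, table), ("D", r_d)
    def reduce_fn(key, values):  # gather
        _, i = key
        dims = [v[1] for v in values if v[0] == "D"]
        facts = [v[1:] for v in values if v[0] == "F"]
        for r_d in dims:
            for fks, r_f in facts:
                yield fks, (i, r_d, r_f)
    return map_fn, reduce_fn

def merge_map(fks, value):
    yield fks, value  # identity map, as in the Merge phase

def make_merge_reduce(n):
    def reduce_fn(fks, values):
        by_dim = {i: r_d for i, r_d, _ in values}
        if len(by_dim) == n:  # assumption: inner-join behaviour
            yield fks, tuple(by_dim[i] for i in range(n)) + (values[0][2],)
    return reduce_fn

def star_join(facts, dimensions, conditions):
    n = len(dimensions)
    key_sets = [set() for _ in range(n)]
    dim_records = [(i, row) for i, rows in enumerate(dimensions) for row in rows]
    filtered = run_job(*make_filter_job(conditions, key_sets), dim_records)
    fact_records = [("F", row) for row in facts]
    partial = run_job(*make_scatter_gather_job(key_sets), fact_records + filtered)
    return run_job(merge_map, make_merge_reduce(n), partial)

facts = [((1, 10), 500), ((2, 10), 700), ((3, 10), 900)]
dimensions = [[(1, "AMERICA"), (2, "ASIA")], [(10, "MFGR#1")]]
conditions = [lambda region: region == "AMERICA", lambda category: True]
result = star_join(facts, dimensions, conditions)
# result == [((1, 10), ("AMERICA", "MFGR#1", 500))]
```

Only the fact tuple (1, 10) passes the filter check in the scatter map, so the Scatter-and-Gather job shuffles two fact key-value pairs instead of six. In a real framework each job would read its input from and write its output to distributed storage, and the filters would have to be made available to every map task of the second job.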
Experimental Results
• The experiments show that the Scatter-Gather-Merge algorithm with Bloom filters achieved better query performance than the same algorithm without Bloom filters.
• The same result was obtained when the size of the warehouse was increased.