Tan Hongbing, Liu Sheng†, Chen Haiyan. School of National University of Defense Technology. Modeling and Evaluation for Gather/Scatter Operations in Vector-SIMD Architectures
Presentation Outline: 1. Introduction 2. Models and Verification 3. Evaluation and Results 4. Conclusion
Gather-Scatter in Vector-SIMD architectures
Gathers: vector of addrs → vector register
Scatters: vector register → vector of addrs
– Reads and writes to different sub-banks are performed in parallel
– Multiple reads or writes to the same sub-bank address are combined into a single access
– Reads are overlapped across different gathers; writes are overlapped across different scatters
Definition of Gather-Scatter. Gather: [definition figure not recovered] Scatter: [definition figure not recovered]
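The slide's own definitions did not survive extraction. As a minimal sketch of the conventional meaning of the two operations (not necessarily the exact notation used on the slide), the functions `gather` and `scatter` below are illustrative names, and memory is modeled as a plain Python list:

```python
def gather(memory, addrs):
    """Read memory[addrs[i]] into lane i of a new vector register."""
    return [memory[a] for a in addrs]


def scatter(memory, addrs, vreg):
    """Write lane i of the vector register to memory[addrs[i]]."""
    for a, value in zip(addrs, vreg):
        # Assumption: if two lanes target the same address, the later lane wins.
        memory[a] = value


memory = list(range(100, 116))         # 16 words holding 100..115
vreg = gather(memory, [3, 0, 7, 7])    # lanes may read the same address
scatter(memory, [1, 2, 4, 5], vreg)    # write the lanes back elsewhere
print(vreg, memory[:8])
```

The addresses in a gather or scatter are arbitrary, which is why the number of accesses landing in each memory bank varies from one operation to the next.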
Gather/scatter operations are stochastic and complicated, and their hardware design lacks theoretical analysis and modeling. Open questions:
• What are the possible distributions of access locations for given PE and memory bank counts?
• What is the probability of each distribution?
• How can the hardware implementation be optimized in detail?
The proposed model answers these questions.
Example
- Both the SIMD width and the number of (single-port) memory banks are 4.
- The Distribution of Access Locations is called DAL for short.
- The Maximum Conflicts Per Cycle (MCPC) is equal to the maximum element of the DAL.
(1) MCPC=4, {4,0,0,0}
(2) MCPC=3, {3,1,0,0}
(3) MCPC=2, {2,2,0,0}
(4) MCPC=2, {2,1,1,0}
(5) MCPC=1, {1,1,1,1}
In a DAL such as {2,2,0,0}, the 4 access locations divide into 2 groups that fall in two different memory banks.
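To make the DAL and MCPC definitions concrete, here is a small sketch that derives both from a list of addresses. The bank mapping `address % n_banks` (low-order interleaving) and the names `dal_of` and `mcpc` are assumptions for illustration; the slides do not state how addresses map to banks.

```python
from collections import Counter


def dal_of(addrs, n_banks):
    """Return the DAL: per-bank access counts, sorted in descending order."""
    # Assumption: low-order interleaving, bank index = address mod n_banks.
    per_bank = Counter(a % n_banks for a in addrs)
    counts = [per_bank.get(b, 0) for b in range(n_banks)]
    return sorted(counts, reverse=True)


def mcpc(dal):
    """Maximum Conflicts Per Cycle: the largest element of the DAL."""
    return max(dal)


for addrs in ([0, 4, 8, 12], [0, 4, 1, 2], [0, 1, 2, 3]):
    dal = dal_of(addrs, 4)
    print(addrs, dal, mcpc(dal))
```

With 4 banks, addresses 0, 4, 8 and 12 all fall in bank 0 and give the worst case {4,0,0,0}, while consecutive addresses give the conflict-free {1,1,1,1}.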
Relation among Distributions of Access Locations (DAL)
f(4,1): {1,1,1,1}
f(4,2): f(4,1); {2,1,1,0}; {2,2,0,0}
f(4,3): f(4,2); {3,1,0,0}
f(4,4): f(4,3); {4,0,0,0}
f(7,7): f(7,1); {2,f(5,2)}; {3,f(4,3)}; {4,f(3,3)}; {5,f(2,2)}; {6,f(1,1)}; g(7)
f(a,b) is the set of all DALs for a PEs whose maximum element is less than or equal to b.
Modeling the DAL
⌊A/B⌋ is the integer portion of the quotient of A divided by B.
f(a,b) is the set of all DALs for a PEs whose maximum element is less than or equal to b.
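The slide's closed-form expression is not recoverable, but the recursive structure of f(a,b) described in the examples can be sketched directly: pick the largest element k, then fill the remaining a − k accesses with a DAL whose maximum is at most k. The code below is one possible reading of that relation, not the authors' formula; `pad` is a hypothetical helper that appends zeros for the idle banks.

```python
def f(a, b):
    """All DALs of a accesses with maximum element <= b (zeros omitted)."""
    if a == 0:
        return [()]
    dals = []
    for k in range(1, min(a, b) + 1):
        # k is the largest element; the rest is a DAL of a - k accesses
        # whose own maximum may not exceed k.
        for rest in f(a - k, k):
            dals.append((k,) + rest)
    return dals


def pad(dal, n_banks):
    """Append zeros so the DAL has one entry per memory bank."""
    return dal + (0,) * (n_banks - len(dal))


for b in range(1, 5):
    print(f"f(4,{b}):", [pad(d, 4) for d in f(4, b)])
print("size of f(7,7):", len(f(7, 7)))
```

Each f(4,b) contains f(4,b−1) plus the DALs whose maximum is exactly b, which matches the nesting listed on the slide.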
Modeling the Probability of Access Conflict (PAC)
(1) MCPC=4, {4,0,0,0}
(2) MCPC=3, {3,1,0,0}
(3) MCPC=2, {2,2,0,0}
(4) MCPC=2, {2,1,1,0}
(5) MCPC=1, {1,1,1,1}
Two quantities are modeled: all possible permutations of the j-th DAL, and the probability of the j-th DAL.
Modeling the PAC. The data used in this equation come from D: D[i,j] is the i-th element in the j-th row; O[j] is the number of non-zero elements in the j-th row; G(i,j) is the sum of the first i elements in the j-th row; M(j) is an intermediate variable for calculation; F(m,j) is the number of elements equal to m in the j-th row.
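The equation itself is missing from the extracted text. As a sketch under the assumption that every PE picks a bank uniformly and independently, the probability of a DAL can be computed with a standard counting argument: the number of ways to split the PEs into groups of the given sizes, times the number of distinct ways to assign those group sizes to banks, divided by the total number of outcomes. The function name `dal_probability` is illustrative, and this is not claimed to be the slide's exact formula.

```python
from collections import Counter
from fractions import Fraction
from math import factorial


def dal_probability(dal, n_banks):
    """Probability of a DAL when each PE picks a bank uniformly at random."""
    parts = [c for c in dal if c > 0]
    if len(parts) > n_banks:
        return Fraction(0)            # more groups than banks is impossible
    counts = parts + [0] * (n_banks - len(parts))
    n_pes = sum(counts)
    # Ways to split the PEs into groups of the given sizes (multinomial).
    lane_ways = factorial(n_pes)
    for c in counts:
        lane_ways //= factorial(c)
    # Distinct assignments of the group sizes to banks; equal sizes are
    # interchangeable (the role F(m, j) appears to play on the slide).
    bank_ways = factorial(n_banks)
    for repeats in Counter(counts).values():
        bank_ways //= factorial(repeats)
    return Fraction(lane_ways * bank_ways, n_banks ** n_pes)


for dal in [(4, 0, 0, 0), (3, 1, 0, 0), (2, 2, 0, 0), (2, 1, 1, 0), (1, 1, 1, 1)]:
    print(dal, dal_probability(dal, 4))
```

For 4 PEs and 4 banks the five probabilities sum to 1, and {2,1,1,0} is the most likely distribution.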
Model verification (by Matlab)
Validating the PAC model: the average accuracy of our model on gather/scatter is over 98% (min: 97.3%, max: 100%) when read/write locations are totally random.
Validating the DAL model: the measured and estimated results are identical in all cases.
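The authors verified their model in Matlab; their scripts are not part of the slides. The following Python sketch illustrates the same kind of check under stated assumptions: uniformly random bank selection, a fixed seed, and a standard counting formula (`analytic`, an illustrative name) standing in for the PAC model.

```python
import random
from collections import Counter
from math import factorial


def analytic(dal, n_banks):
    """Probability of a DAL under uniform random access (counting argument)."""
    parts = [c for c in dal if c > 0]
    counts = parts + [0] * (n_banks - len(parts))
    ways = factorial(sum(counts)) * factorial(n_banks)
    for c in counts:
        ways //= factorial(c)
    for repeats in Counter(counts).values():
        ways //= factorial(repeats)
    return ways / n_banks ** sum(counts)


def simulate(n_pes, n_banks, trials, seed=0):
    """Measure how often each DAL occurs for random access locations."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(trials):
        hits = Counter(rng.randrange(n_banks) for _ in range(n_pes))
        tally[tuple(sorted(hits.values(), reverse=True))] += 1
    return {dal: n / trials for dal, n in tally.items()}


measured = simulate(n_pes=4, n_banks=4, trials=20000)
for dal, p in sorted(measured.items(), reverse=True):
    print(dal, f"measured={p:.4f}", f"model={analytic(dal, 4):.4f}")
```

The measured frequencies should approach the model values as the number of trials grows; the trial count here is an arbitrary choice.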
Evaluation and Results
For hardware designers, two common methods can improve gather/scatter performance:
(1) Organizing each memory bank into separate sub-banks
(2) Adding buffers to hold pending memory requests
Evaluation and Results
Analysis of MCPC as the PE:Bank ratio varies. Chart annotations:
- more than 80% of DALs have MCPC<=4
- more than 90% of DALs have MCPC<=3
- more than 90% of DALs have MCPC<=2
The performance of gather/scatter is closely related to the ratio of PEs to memory banks.
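The chart behind these annotations is not recoverable, but the dependence of MCPC on the PE:Bank ratio can be explored with a short sketch. Assuming uniformly random bank selection, the probability that MCPC stays at or below k follows from counting the assignments in which no bank receives more than k accesses. The function name and the 16-PE configurations are illustrative choices, not values taken from the slides.

```python
from fractions import Fraction
from math import comb


def prob_mcpc_at_most(k, n_pes, n_banks):
    """P(MCPC <= k) when each PE picks a bank uniformly at random."""
    # ways[n]: number of ways to place n distinguishable accesses into the
    # banks processed so far with at most k accesses per bank.
    ways = [1] + [0] * n_pes
    for _ in range(n_banks):
        new = [0] * (n_pes + 1)
        for n in range(n_pes + 1):
            for c in range(min(k, n) + 1):
                # Choose which c of the n accesses go to the current bank.
                new[n] += comb(n, c) * ways[n - c]
        ways = new
    return Fraction(ways[n_pes], n_banks ** n_pes)


for n_banks in (16, 32, 64):
    row = [float(prob_mcpc_at_most(k, 16, n_banks)) for k in (2, 3, 4)]
    print(f"16 PEs, {n_banks} banks:", [f"{p:.3f}" for p in row])
```

Lowering the PE:Bank ratio (more banks per PE) shifts probability toward small MCPC values, which is the trend the slide describes.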
Evaluation and Results
Analysis for selecting the proper number of memory banks.
Average Number of Access Conflicts (NAC); φ(k) denotes the set of DALs whose MCPC is k.
Chart annotations: NAC=1.64, NAC=1.32, NAC=2.12; 2.12, 2.59, 3.05, 3.45, 3.76.
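The NAC formula is not present in the extracted text. One plausible reading, given that φ(k) groups the DALs by MCPC, is that NAC is the expected MCPC: the sum over k of k times the total probability of φ(k). The sketch below implements that assumption with uniformly random accesses; the function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import factorial


def dals(a, b):
    """All DALs of a accesses with maximum element <= b (zeros omitted)."""
    if a == 0:
        return [()]
    return [(k,) + rest for k in range(1, min(a, b) + 1) for rest in dals(a - k, k)]


def dal_probability(dal, n_banks):
    """Probability of a DAL when each PE picks a bank uniformly at random."""
    if len(dal) > n_banks:
        return Fraction(0)
    counts = list(dal) + [0] * (n_banks - len(dal))
    ways = factorial(sum(counts)) * factorial(n_banks)
    for c in counts:
        ways //= factorial(c)
    for repeats in Counter(counts).values():
        ways //= factorial(repeats)
    return Fraction(ways, n_banks ** sum(counts))


def nac(n_pes, n_banks):
    """Assumed NAC: expected MCPC, summed over the groups phi(k)."""
    phi = defaultdict(Fraction)       # phi[k]: probability that MCPC == k
    for dal in dals(n_pes, n_pes):
        phi[dal[0]] += dal_probability(dal, n_banks)
    return sum(k * p for k, p in phi.items())


for n_banks in (4, 8, 16):
    print(f"4 PEs, {n_banks} banks: NAC = {float(nac(4, n_banks)):.3f}")
```

Under this reading, 4 PEs with 4 banks give 17/8 = 2.125. The slide lists a value of 2.12, but the configuration it belongs to is not recoverable from the text, so the agreement is only suggestive.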
Evaluation and Results
Chart annotations: 3.05, 1.98, 1.34.
The deeper the buffer array, the shorter the run time. The run time decreases as the ratio of PEs to memory banks decreases.
Evaluation and Results
Performance improvement as the depth of the buffer array varies. Chart annotations: "Dispersive", "Very close".
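The slides do not describe the buffer microarchitecture, so the following is only a toy cycle-level model for building intuition about why deeper buffers shorten run time. Every rule in it is an assumption: each bank has one request buffer of the given depth, a bank retires one request per cycle, and the next gather starts issuing only after all requests of the current one have entered a buffer.

```python
def run_time(gathers, n_banks, depth):
    """Cycles to drain all gathers through per-bank request buffers (toy model)."""
    occupancy = [0] * n_banks   # requests waiting in each bank's buffer
    pending = []                # requests of the current gather not yet buffered
    next_gather = 0
    cycles = 0
    while next_gather < len(gathers) or pending or any(occupancy):
        cycles += 1
        # Start the next gather once the previous one is fully issued.
        if not pending and next_gather < len(gathers):
            pending = list(gathers[next_gather])
            next_gather += 1
        # Issue: a request enters its bank's buffer only if there is room.
        waiting = []
        for bank in pending:
            if occupancy[bank] < depth:
                occupancy[bank] += 1
            else:
                waiting.append(bank)
        pending = waiting
        # Service: every non-empty bank retires one request per cycle.
        for bank in range(n_banks):
            if occupancy[bank] > 0:
                occupancy[bank] -= 1
    return cycles


# Two gathers given as bank indices; each has one two-way conflict.
gathers = [[0, 0, 1, 2], [1, 1, 2, 3]]
print(run_time(gathers, n_banks=4, depth=1))   # 4
print(run_time(gathers, n_banks=4, depth=2))   # 3
```

In this model a depth of 2 lets the second gather start while bank 0 is still draining, saving one cycle. When all requests hit a single bank, the bank itself is the bottleneck and extra depth does not help.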
Conclusion
- This model gives all the possible DALs, the PAC, and related quantities for gather/scatter operations in various situations.
- This model can help users select the optimum number of memory banks and guide designers in selecting the proper number of buffers. For example, with SIMD=16, each bank consisting of 2 sub-banks and a buffer depth of 4 is recommended.
Thank you!