Understanding the SIMD Efficiency of Graph Traversal on GPU
Yichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang and Yi Peng
University of Science and Technology of China
Breadth-first Search (BFS)
[Figure: BFS on an example graph, starting from the source vertex; vertices are labeled with their BFS level, 1 through 4.]
Breadth-first Search (BFS)
BFS_Iteration:
  for u ∈ CurrentFrontier do
    for v ∈ u's neighbors do
      if v has not been labeled then
        label v
        put v in NextFrontier
[Figure: example graph with vertices A through I.]
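The BFS_Iteration pseudocode can be sketched as a sequential, frontier-by-frontier loop. This is a minimal illustration, not the authors' GPU code: the function name `bfs_levels`, the dictionary-based adjacency lists, the small example graph, and the choice to label the source with level 0 are all assumptions made for the sketch.

```python
def bfs_levels(adj, source):
    """Label every reachable vertex with its BFS level, one frontier at a time."""
    level = {source: 0}            # assumption: the source gets level 0
    current_frontier = [source]
    while current_frontier:
        next_frontier = []
        for u in current_frontier:
            for v in adj[u]:
                if v not in level:             # v has not been labeled
                    level[v] = level[u] + 1    # label v
                    next_frontier.append(v)    # put v in NextFrontier
        current_frontier = next_frontier
    return level


# Hypothetical undirected graph, stored as adjacency lists.
adj = {
    "B": ["A", "C"],
    "A": ["B", "D"],
    "C": ["B", "E", "F"],
    "D": ["A"],
    "E": ["C", "G"],
    "F": ["C"],
    "G": ["E"],
}
levels = bfs_levels(adj, "B")
print(levels)
```

Each pass of the `while` loop corresponds to one BFS_Iteration; the inner `for v in adj[u]` loop is the part whose length varies from vertex to vertex.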
Applications of BFS
• Many real-world datasets are represented as graphs
  • VLSI circuits
  • Social relationships
  • Road connections
• Primitive for building complex algorithms
  • Path-finding
  • Belief propagation
  • Points-to Analysis (PTA)
The Problem
• GPUs rely on high SIMD lane occupancy to boost performance
• 100% efficiency is achieved only if all SIMD lanes take the same execution path
  do_something_common();      // 100% utilization
  if (thread_id > 5) {
    do_something_red();       // 37.5% utilization
  } else {
    do_something_blue();      // 62.5% utilization
  }
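The utilization figures on this slide can be reproduced with a small lane-mask calculation. The sketch below assumes an 8-lane SIMD unit with thread IDs numbered 1 through 8, which is what the 37.5% / 62.5% split implies; the lane count, the numbering, and the helper name `lane_utilization` are assumptions, not details stated on the slide.

```python
WARP_LANES = 8  # assumption: 8 lanes, thread IDs 1..8


def lane_utilization(active_mask):
    """Percentage of lanes doing useful work for one instruction stream."""
    return 100.0 * sum(active_mask) / len(active_mask)


thread_ids = range(1, WARP_LANES + 1)
takes_red = [tid > 5 for tid in thread_ids]   # same test as the slide's branch

utilization = {
    "common": lane_utilization([True] * WARP_LANES),
    "red": lane_utilization(takes_red),
    "blue": lane_utilization([not t for t in takes_red]),
}
print(utilization)
```

Because the two branch bodies are executed one after the other, the lanes that did not take a branch sit idle while it runs, and the red and blue percentages add up to 100%.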
Traditional Implementation
The number of sub-iterations depends on the size of u's adjacency list
GPU_BFS_Iteration:
  u = C[tid]
  for v ∈ u's neighbors do
  end for
task1 = 4 sub-iterations
task2 = 2 sub-iterations
…
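To see why unequal adjacency lists hurt the one-thread-per-vertex mapping, the sketch below counts how many lanes are still busy in each sub-iteration. It is an illustrative model only: it assumes the lanes advance in lockstep and that the warp finishes when the longest list is exhausted; the function names are hypothetical.

```python
def thread_per_vertex_trace(degrees):
    """Active-lane count per sub-iteration when lane i walks adjacency list i."""
    steps = max(degrees, default=0)   # the warp runs as long as its longest list
    return [sum(1 for d in degrees if d > step) for step in range(steps)]


def utilization(degrees):
    """Fraction of lane-steps that do useful work."""
    trace = thread_per_vertex_trace(degrees)
    if not trace:                     # nothing to traverse, nothing wasted
        return 1.0
    return sum(trace) / (len(degrees) * len(trace))


# The two tasks from the slide: 4 and 2 sub-iterations.
trace = thread_per_vertex_trace([4, 2])
util = utilization([4, 2])
print(trace, util)
```

For the two tasks on the slide, the second lane is idle during the last two sub-iterations, so only 6 of the 8 lane-steps are useful.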
Visualizing the Irregularity
[Figure: plots of the irregularity in the input graphs, with annotations:]
• Highly skewed; outliers exist
• Distributed over a wide range
• Irregular but concentrated; vertex range < 8
Alternative Way
• Assign a warp of threads to each task
• Vectorize the sub-iterations!
So, what is the relationship between graph topology and SIMD efficiency?
Topology and Utilization
• Assign a group of threads to each vertex
[Figure: a warp divided into groups of threads.]
task1 = 2 sub-iterations
task2 = 1 sub-iteration
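One way to picture the group mapping is to lay an adjacency list out across the lanes of a group, one chunk of S neighbors per sub-iteration. The sketch below is an assumption about how the work is divided (contiguous chunks, with `None` marking an idle lane); the function name `group_schedule` and the vertex names are hypothetical.

```python
def group_schedule(neighbors, group_size):
    """Split one adjacency list into sub-iterations of `group_size` lanes each.

    Lane k of the group handles neighbors k, k + S, k + 2S, ...; a lane with
    nothing left to do in the last sub-iteration is marked None (idle).
    """
    schedule = []
    for base in range(0, len(neighbors), group_size):
        chunk = neighbors[base:base + group_size]
        chunk += [None] * (group_size - len(chunk))   # pad with idle lanes
        schedule.append(chunk)
    return schedule


# The slide's tasks with a group size of 2 (an assumed value of S).
task1 = group_schedule(["v0", "v1", "v2", "v3"], 2)   # 4 neighbors
task2 = group_schedule(["w0", "w1"], 2)               # 2 neighbors
odd = group_schedule(["x0", "x1", "x2"], 2)           # 3 neighbors
print(len(task1), len(task2), odd)
```

With two threads per group, the tasks that needed 4 and 2 sub-iterations now need 2 and 1. The three-neighbor case shows the new cost: one lane of the group is idle in the final sub-iteration.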
Topology and Utilization
Divide the SIMD underutilization into two parts
• InteR-group Underutilization (UR)
• IntrA-group Underutilization (UA)
[Figure: SIMD window.]
Conclusions From the Model
• UR is induced by the heterogeneity of workloads
  • Affected by the graph topology
• UR is sensitive to the group size (S)
  • A large logical SIMD window can narrow the gap
  • When S = 32, UR = 0
• UA is determined by the intrinsic irregularity of vertex degree
  • It can be limited by shrinking S
  • When S = 1, UA = 0
• UR and UA can be converted into each other
Comparing Different Mapping Strategies
[Figure: mapping strategies compared by scalability (poor to good) and expansion rate in ME/s (low to high).]
Evaluating the SIMD Efficiency
• Metrics derived from the model:
  UR = inter-group underutilization
  UA = intra-group underutilization
  ME = mapping efficiency
  UR + UA + ME = 100%
• Captures the utilization trend with increasing S
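The three metrics can be illustrated with a small accounting model. The slides do not give the exact formulas, so the sketch below is one plausible reading, chosen because it reproduces the stated properties (UR = 0 when S = 32, UA = 0 when S = 1, and UR + UA + ME = 100%). Its assumptions: each group of S lanes takes ceil(degree / S) sub-iterations, a warp runs until its slowest group finishes, and only lanes that were assigned a vertex are counted. The function name `simd_breakdown` and the degree list are hypothetical.

```python
def simd_breakdown(degrees, group_size, warp_size=32):
    """Return (UR, UA, ME) in percent for one frontier of vertex degrees."""
    groups_per_warp = warp_size // group_size
    useful = intra_idle = inter_idle = 0
    for start in range(0, len(degrees), groups_per_warp):
        batch = degrees[start:start + groups_per_warp]   # vertices sharing a warp
        steps = [-(-d // group_size) for d in batch]     # ceil(d / S) per group
        warp_steps = max(steps)                          # warp waits for slowest group
        for d, s in zip(batch, steps):
            useful += d                                  # lane-steps visiting an edge
            intra_idle += s * group_size - d             # idle lanes inside a busy group
            inter_idle += (warp_steps - s) * group_size  # whole group waiting on others
    total = useful + intra_idle + inter_idle
    if total == 0:
        return 0.0, 0.0, 0.0
    return tuple(100.0 * x / total for x in (inter_idle, intra_idle, useful))


# Hypothetical frontier: mostly small degrees plus a few large outliers.
degrees = [4, 2, 7, 1, 3, 40, 2, 5] * 4
for s in (1, 2, 4, 8, 16, 32):
    ur, ua, me = simd_breakdown(degrees, s)
    print(f"S={s:2d}  UR={ur:5.1f}%  UA={ua:5.1f}%  ME={me:5.1f}%")
```

Sweeping S in this way shows the conversion between the two kinds of waste under this model: small groups leave whole groups waiting on an outlier vertex (UR), while large groups leave lanes idle inside each group (UA).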
Explaining the Result
[Figure: mapping strategies compared by scalability (poor to good) and expansion rate in ME/s (low to high), annotated in four steps:]
• Alleviates UR while introducing minor UA
• ME stays at a high level (~80%)
• Outweighed by the fast-growing UA
• Does little to help UR but leads to severe UA
Conclusion
• Study the link between graph topology and hardware utilization
• Present a model for analyzing the components of SIMD underutilization
• Discover that SIMD lanes are wasted due to:
  • imbalance of the vertex degree distribution
  • heterogeneity of each vertex degree
• Develop 3 metrics for quantifying SIMD efficiency
• Provide a foundation for developing techniques of static analysis and runtime optimization