Understanding the SIMD Efficiency of Graph Traversal on GPU

Yichao Cheng, Hong An, Zhitao Chen, Feng Li, Zhaohui Wang, Xia Jiang and Yi Peng
University of Science and Technology of China

Breadth-first Search (BFS)
[Figure: example graph with a source vertex; vertices are labeled with their BFS levels, 1 to 4.]

Breadth-first Search (BFS)
BFS_Iteration:
  for u ∈ CurrentFrontier do
    for v ∈ u’s neighbors do
      if v has not been labeled
        label v
        put v in NextFrontier
[Figure: example graph with vertices A to I.]
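The BFS_Iteration pseudocode above can be sketched as a sequential, level-synchronous loop. This is a minimal illustration, not the authors' GPU code: the function name `bfs_levels`, the dictionary-based adjacency list, and the toy graph are all assumptions, and the source is placed at level 0 here, whereas the slide's figure appears to start its labels at 1.

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS: returns {vertex: level}, with the source at level 0."""
    level = {source: 0}
    current_frontier = [source]
    depth = 0
    while current_frontier:
        depth += 1
        next_frontier = []
        for u in current_frontier:            # for u in CurrentFrontier
            for v in adj[u]:                  # for v in u's neighbors
                if v not in level:            # v has not been labeled
                    level[v] = depth          # label v
                    next_frontier.append(v)   # put v in NextFrontier
        current_frontier = next_frontier
    return level


# Hypothetical toy graph (not the one drawn on the slide).
graph = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
levels = bfs_levels(graph, "A")
```

Each pass of the `while` loop corresponds to one BFS_Iteration; on a GPU, the outer `for u` loop is the part that gets distributed across threads.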

Applications of BFS
• Many real-world datasets are represented as graphs
  • VLSI circuits
  • Social relationships
  • Road connections
• Primitive for building complex algorithms
  • Path-finding
  • Belief propagation
  • Points-to Analysis (PTA)

The Problem
• GPU relies on high SIMD lane occupancy to boost performance
• 100% efficiency is achieved only if all SIMD lanes follow the same path
  do_something_common();      // 100% utilization
  if (thread_id > 5) {
    do_something_red();       // 37.5% utilization
  } else {
    do_something_blue();      // 62.5% utilization
  }
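The utilization figures on this slide can be reproduced with a small calculation. The sketch below assumes a SIMD width of 8 lanes with thread ids 1 to 8 (the slide does not state either); under that assumption the `thread_id > 5` branch keeps 3 of 8 lanes busy and the `else` branch keeps the other 5 busy. The helper name `branch_utilization` is hypothetical.

```python
def branch_utilization(lane_ids, predicate):
    """Return (taken, not_taken) lane utilization for a two-way branch.

    While one side of the branch executes, lanes on the other side sit idle,
    so each side's utilization is its share of the lanes.
    """
    lane_ids = list(lane_ids)
    total = len(lane_ids)
    taken = sum(1 for t in lane_ids if predicate(t))
    return taken / total, (total - taken) / total


# Assumption: 8 lanes with ids 1..8.
red, blue = branch_utilization(range(1, 9), lambda t: t > 5)
```

The two sides run one after the other, so the warp spends two passes on work that a uniform branch would finish in one.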

Traditional Implementation
The # of sub-iterations depends on the size of u’s adjacency list
GPU_BFS_Iteration:
  u = C[tid]
  for v ∈ u’s neighbors do
  end for
task 1 = 4 sub-iterations
task 2 = 2 sub-iterations
…
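To see why unequal adjacency lists hurt this mapping, the following sketch estimates lane utilization when every lane walks one vertex's neighbor list serially. It is a simplified model under stated assumptions, not a measurement: lanes run in lockstep, the warp stays busy until its longest list is finished, and the function name `thread_per_vertex_efficiency` is hypothetical.

```python
def thread_per_vertex_efficiency(degrees):
    """Fraction of lane-slots doing useful work with one lane per vertex.

    degrees[i] is the adjacency-list length handled by lane i.
    """
    if not degrees or max(degrees) == 0:
        return 1.0                          # nothing to do, nothing wasted
    sub_iterations = max(degrees)           # warp runs until the longest list ends
    useful = sum(degrees)                   # lane-slots that actually visit an edge
    return useful / (len(degrees) * sub_iterations)


# The two tasks from the slide: 4 and 2 sub-iterations.
eff = thread_per_vertex_efficiency([4, 2])
```

With the slide's two tasks, the second lane idles for two of the four sub-iterations, giving 75% utilization in this model.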

Visualizing the Irregularity
[Figure annotations:]
• Highly skewed; outliers exist
• Distributed across a wide range
• Irregular but concentrated; vertex range < 8

Alternative Way
• Assign a warp of threads to each task
• Vectorize the sub-iterations!
So, what’s the relationship between graph topology and SIMD efficiency?

Topology and Utilization
• Assign a group of threads to each vertex
[Figure: a warp divided into groups of threads.]
task 1 = 2 sub-iterations
task 2 = 1 sub-iteration
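The sub-iteration counts on this slide follow from a ceiling division, sketched below. The degrees 4 and 2 are taken from the earlier "Traditional Implementation" slide; the group size of 2 is an assumption chosen because it reproduces the slide's counts, and `sub_iterations` is a hypothetical helper name.

```python
def sub_iterations(degree, group_size):
    """Sub-iterations needed when group_size lanes share one adjacency list."""
    return (degree + group_size - 1) // group_size   # ceil(degree / group_size)


# Assumption: two tasks with 4 and 2 neighbors, each given a group of 2 threads.
counts = [sub_iterations(d, 2) for d in (4, 2)]
```

Setting the group size to 1 gives back the one-thread-per-vertex counts of 4 and 2.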

Topology and Utilization
Divide the SIMD underutilization into two parts
• Inter-group Underutilization (UR)
• Intra-group Underutilization (UA)
[Figure label: SIMD window]

Conclusions From the Model
• UR is induced by the heterogeneity of workloads
  • Affected by the graph topology
• UR is sensitive to the group size (S)
  • A large logical SIMD window can narrow the gap
  • When S = 32, UR = 0
• UA is determined by the intrinsic irregularity of vertex degree
  • It can be limited by shrinking S
  • When S = 1, UA = 0
• UR and UA can convert to each other
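One way to make these conclusions concrete is the accounting sketch below. The slides do not give the model's formulas, so everything here is an assumed formalization: a warp of `warp_size` lanes is split into groups of S lanes, each group expands one vertex in ceil(degree / S) sub-iterations, and the warp stays occupied until its slowest group finishes. Idle lanes inside a working group are counted as UA, lane-slots of groups waiting for others as UR, and the rest as useful work. The function name `simd_breakdown` is hypothetical.

```python
def simd_breakdown(degrees, group_size, warp_size=32):
    """Return (UR, UA, useful) fractions of one warp's lane-slots.

    Assumed model: one vertex per group, len(degrees) == warp_size // group_size.
    """
    assert warp_size % group_size == 0
    assert len(degrees) == warp_size // group_size
    # Sub-iterations each group needs: ceil(degree / group_size).
    steps = [(d + group_size - 1) // group_size for d in degrees]
    window = max(steps)                     # warp runs as long as its slowest group
    if window == 0:
        return 0.0, 0.0, 1.0                # no work at all, nothing wasted
    total = warp_size * window
    # Intra-group waste: lanes idle in a group's last, partially filled step.
    ua = sum(k * group_size - d for k, d in zip(steps, degrees))
    # Inter-group waste: whole groups idle while waiting for the slowest one.
    ur = sum((window - k) * group_size for k in steps)
    useful = sum(degrees)
    return ur / total, ua / total, useful / total


# Hypothetical workload: an 8-lane warp, groups of 2, degrees 4, 2, 1, 1.
ur, ua, useful = simd_breakdown([4, 2, 1, 1], group_size=2, warp_size=8)
```

In this sketch a single group spanning the whole warp has nobody to wait for, so UR is 0, and a group of one lane never has a partially filled step, so UA is 0, which matches the two boundary cases on the slide.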

Comparing Different Mapping Strategies
[Figure: scalability (good to poor) versus expansion rate in ME/s (high to low).]

Evaluating the SIMD Efficiency
• Metrics derived from the model:
  UR = inter-group underutilization
  UA = intra-group underutilization
  ME = mapping efficiency
  UR + UA + ME = 100%
• Captures the utilization trend with increasing S

Explaining the Result
[Figure: scalability (good to poor) versus expansion rate in ME/s (high to low), with annotations:]
• Alleviates the UR while introducing minor UA
• ME at a high level (~80%)
• Outweighed by the fast-growing UA
• Does little to help UR but leads to severe UA

Conclusion
• Study the link between graph topology and hardware utilization
• Present a model for analyzing the components of SIMD underutilization
• Discover that SIMD lanes are wasted due to:
  • imbalance of vertex degree distribution
  • heterogeneity of each vertex degree
• Develop 3 metrics for quantifying SIMD efficiency
• Provide a foundation for developing techniques of static analysis and runtime optimization

Q&A
