Exploiting the Potential of Modern Supercomputers Through High Level Language Abstractions
Exploit Hierarchical and Irregular Parallelism in UPC
Li Chen
State Key Laboratory of Computer Architecture, Institute of Computing Technology, CAS
Exploit Hierarchical and Irregular Parallelism in UPC-H • Motivations • Why use UPC? • Exploit the tiered network of Dawning 6000 • GASNet support for HPP architecture • Exploit hierarchical data parallelism for regular applications • Shared work list for irregular applications
Deep Memory Hierarchies in Modern Computing Platforms • Traditional multicore processors (e.g., Harpertown, Dunnington) • Many-core accelerators • Intra-node parallelism should be well exploited
HPP Interconnect of Dawning 6000 (figures: HPP architecture vs. traditional cluster) • Global address space, through the HPP controller • Three-tier network • PE: cache coherence, 2 CPUs • HPP: 4 nodes • IB across HPPs • Discrete CPUs: App CPU and OS CPU • Hypernode: discrete OS, single system image (SSI) • Discrete interconnects: data interconnect, OS interconnect, global synchronization
Mapping Hierarchical Parallelism to Modern Supercomputers • Hybrid programming models, MPI+X • MPI+OpenMP/TBB (+OpenACC/OpenCL) • MPI+StarPU • MPI+UPC • Sequoia • Explicitly tune the data layout and data transfer (parallel memory hierarchy) • Recursive task tree, static mapping of tasks • HTA • Data type for hierarchical tiled arrays (multi-level tiling) • Parallel operators: map parallelism statically • X10 • Combines HTA with Sequoia • Abstraction of memory hierarchies: hierarchical place tree (Habanero-Java) • Nested task parallelism; task mapping deferred until launch time
Challenges in Efficient Parallel Graph Processing (borrowed from Andrew Lumsdaine, Douglas Gregor) • Data-driven computations • Parallelism cannot be exploited statically • Computation partitioning is not suitable; expressing the parallelism at a low level is tedious • Unstructured problems • Unstructured and highly irregular data structures • Data partitioning is not suitable and may lead to load imbalance • Poor locality • Data access patterns have little locality; execution is memory-latency and communication dominated • High data access to computation ratio • Explores the structure, not the computation • Dominated by waits for memory fetches
Why Unified Parallel C? • UPC, a parallel extension to ISO C99 • A dialect of PGAS (Partitioned Global Address Space) languages • Important UPC features • Global address space: threads may directly read/write remote data • Partitioned: data is designated as local or global (affinity) • Two kinds of memory consistency (strict and relaxed) • UPC performance benefits over MPI • Permits data sharing, better memory utilization • Looking toward future many-core chips and Exascale systems • Better bandwidth and latency using one-sided messages (GASNet) • No less scalable than MPI (up to 128K threads) • Why use UPC? • Grasps the non-uniform memory access characteristics of modern computers • Programmability very close to shared-memory programming (a minimal example follows)
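As a minimal sketch (my illustration, not from the talk) of these UPC features: the blocked shared array below gives each thread affinity to contiguous chunks, and upc_forall runs each iteration on the thread that owns the element it touches.

#include <upc_relaxed.h>
#include <stdio.h>

#define N 1024
/* Block the shared array so each UPC thread owns contiguous 64-element chunks. */
shared [64] double a[N];

int main(void) {
    int i;
    /* Each iteration executes on the thread with affinity to a[i],
       so every write to a[i] below is a local access. */
    upc_forall (i = 0; i < N; i++; &a[i])
        a[i] = 2.0 * i;

    upc_barrier;
    if (MYTHREAD == 0)
        printf("a[N-1] = %f\n", a[N - 1]);  /* reading remote shared data directly */
    return 0;
}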
The Status of the UPC Community • Portability and usability • Many different UPC implementations and tools: Berkeley UPC, Cray UPC, HP UPC, GCC-based Intrepid UPC, and MTU UPC • Performance tools: the GASP interface and Parallel Performance Wizard (PPW) • Debuggability: TotalView • Interoperability with pthreads/MPI(/OpenMP) • Active development areas in UPC • Hierarchical parallelism, asynchronous execution • Tasking mechanisms: scalable work stealing; hierarchical tasking library; places, async~finish; asynchronous remote methods • Nested parallelism • Instant teams: data-centric collectives • Irregular benchmarks: UTS, MADNESS, GAP • Interoperability: support for hybrid programming with OpenMP and other languages; more convenient support for writing libraries
What is UPC-H? • Developed by the compiler group of ICT • H: heterogeneous, hierarchical • Based on the Berkeley UPC compiler • Features added by ICT • Support for HW features of the Dawning series computers: HPP interconnect, load/store in the physical global address space • Hierarchical data distribution and parallelism: Godson-T (many-core processor), GPU clusters, the D6000 computer and X86 clusters • Shared work list (SWL) support for graph algorithms • Communication optimizations: software cache, message vectorization • Runtime system for heterogeneous platforms: data management
Lack of a BCL Conduit in the UPC System • GASNet: networking for Global-Address Space languages, layered as an Extended API over a Core API plus network conduits • Existing conduits: InfiniBand, inter-Process SHared Memory (PSHM) • Missing: a conduit for BCL, the low-level communication layer of HPP
Implementation of the BCL Conduit (figures: two-tiered vs. three-tiered topology) • Initialization of the tiered network • Construct the topology of the tiered network • Set up reliable datagram service through QP virtualization • Initialize internal data structures such as send buffers • Finalization of communication • Network selection in the core API of GASNet: PSHM, HPP, IB (see the sketch after this slide) • Flow control of messages • Implementation of Active Messages • Short message: NAP • Medium message: NAP • Long message: RDMA + NAP • RDMA Put/Get: RDMA + NAP
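A hedged sketch of the network-selection step (not the actual conduit code; peer_tier, pshm_put, bcl_nap_put, ib_rdma_put and the topology stub are all hypothetical): a put is dispatched to PSHM, HPP/NAP, or IB depending on where the destination sits in the tiered topology built at initialization.

#include <stdio.h>
#include <string.h>
#include <stddef.h>

/* Hypothetical tiers of the Dawning 6000 network, from closest to farthest. */
typedef enum { TIER_PE, TIER_HPP, TIER_IB } tier_t;

/* Stub topology lookup, assuming this process is node 0:
   2 CPUs per PE and 4 nodes (8 processes) per hypernode, as on the HPP slide. */
static tier_t peer_tier(int node) {
    if (node / 2 == 0) return TIER_PE;    /* same cache-coherent PE   */
    if (node / 8 == 0) return TIER_HPP;   /* same hypernode           */
    return TIER_IB;                       /* different hypernode      */
}

/* Stub transports standing in for a PSHM copy, an HPP NAP transfer, and IB RDMA. */
static void pshm_put(void *dst, const void *src, size_t n)    { memcpy(dst, src, n); puts("PSHM"); }
static void bcl_nap_put(void *dst, const void *src, size_t n) { memcpy(dst, src, n); puts("HPP/NAP"); }
static void ib_rdma_put(void *dst, const void *src, size_t n) { memcpy(dst, src, n); puts("IB/RDMA"); }

/* Network selection: pick the cheapest transport for the destination node. */
void conduit_put(int node, void *dst, const void *src, size_t n) {
    switch (peer_tier(node)) {
    case TIER_PE:  pshm_put(dst, src, n);    break;
    case TIER_HPP: bcl_nap_put(dst, src, n); break;
    default:       ib_rdma_put(dst, src, n); break;
    }
}

int main(void) {
    char src[8] = "payload", dst[8];
    conduit_put(1, dst, src, sizeof src);   /* same PE   */
    conduit_put(5, dst, src, sizeof src);   /* same HPP  */
    conduit_put(9, dst, src, sizeof src);   /* via IB    */
    return 0;
}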
BCL Conduit: Latency of Short Messages (charts: intra-HPP and inter-HPP short-message latency)
BCL Conduit: Latency of Barriers (charts: net barrier latency, intra-HPP and inter-HPP)
Summary and Ongoing Work of UPC-H Targeting Dawning 6000 • Summary • The UPC-H compiler now supports the HPP architecture and benefits from the three-tier network • Ongoing work • Optimization of the DMA registration strategy • Evaluation of the HPP-supported barrier and collectives • Full-scale evaluation
Hierarchical Data Parallelism, UPC-H Support for Regular Applications
UPC-H (UPC-Hierarchical/Heterogeneous) Execution Model (figure: at a upc_forall, each UPC thread forks implicit subgroups and implicit threads, which join again at the end of the upc_forall) • Standard UPC is SPMD style and has flat parallelism • UPC-H extension • Mixes SPMD with fork-join • Two approaches to express hierarchical parallelism • Implicit threads (or gtasks), organized in thread subgroups implicitly specified by the data distribution • Explicit low-level gtasks
Multi-level Data Distribution • Data distribution => an implicit thread tree • Example: shared [32][32], [4][4], [1][1] float A[128][128]; (figure: the 128x128 array is cut into 32x32 UPC tiles owned by UPC threads, each UPC tile into 4x4 subgroup tiles, and each subgroup tile into 1x1 thread tiles for the logical implicit threads; a worked example follows)
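As a worked illustration of the distribution above (my own sketch, not a UPC-H API), the owner coordinates at each of the three levels fall out of integer division by the block sizes:

#include <stdio.h>

/* Hypothetical owner coordinates for one element of
   shared [32][32],[4][4],[1][1] float A[128][128]; */
typedef struct {
    int upc_i, upc_j;   /* which 32x32 UPC-thread tile        */
    int sub_i, sub_j;   /* which 4x4 subgroup tile inside it  */
    int thr_i, thr_j;   /* which 1x1 implicit-thread tile     */
} owner_t;

owner_t owner_of(int i, int j) {
    owner_t o;
    o.upc_i = i / 32;        o.upc_j = j / 32;        /* level 1: UPC tiles      */
    o.sub_i = (i % 32) / 4;  o.sub_j = (j % 32) / 4;  /* level 2: subgroup tiles */
    o.thr_i = i % 4;         o.thr_j = j % 4;         /* level 3: thread tiles   */
    return o;
}

int main(void) {
    owner_t o = owner_of(70, 21);
    /* Element (70,21) lands in UPC tile (2,0), subgroup tile (1,5), thread tile (2,1). */
    printf("upc(%d,%d) sub(%d,%d) thr(%d,%d)\n",
           o.upc_i, o.upc_j, o.sub_i, o.sub_j, o.thr_i, o.thr_j);
    return 0;
}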
UPC-H: Mapping a forall Loop to the Implicit Thread Tree (figure: 3-level data distribution + machine configuration + loop iterations => implicit thread tree => CUDA thread tree)
• Leverages an existing language construct, upc_forall
  • Independent loop
  • Pointer-to-shared or integer-type affinity expression
• Example:
  shared [32][32], [4][4],[1][1] float A[128][128];
  … …
  upc_forall(i=0; i<128; i++; continue)
    upc_forall(j=6; j<129; j++; &A[i][j-1])
      ... body...
  => Thread topology: <THREADS, 64, 16>
UPC-H Code for nbody

shared [1024],[128],[1] point P[4096];
shared [1024][1024] point tempf[4096][4096];

for (int time=0; time<1000; time++) {
  upc_forall (int i=0; i<N; i++; &P[i])
    for (int j=0; j<N; j++) {
      if (j!=i) {
        distance = (float)sqrt((P[i].x-P[j].x)*(P[i].x-P[j].x) +
                               (P[i].y-P[j].y)*(P[i].y-P[j].y));
        if (distance!=0) {
          magnitude = (G*m[i]*m[j])/(distance*distance+C*C);
          ……
          tempf[i][j].x = magnitude*direction.x/distance;
          tempf[i][j].y = magnitude*direction.y/distance;
        }
      }
    }
  upc_forall (int i=0; … …)
    … …
}
Overview of the Compiling Support • Based on the Berkeley UPC compiler v2.8.0 • Compiler analysis • Multi-dimensional and multi-level data distribution • Affinity-aware multi-level tiling • UPC-thread tiling • Subgroup tiling, thread tiling • Memory tiling for scratchpad memory • Communication optimization • Message vectorization, loop peeling, static communication scheduling • Data layout optimizations for GPU • Shared memory optimization • Finding better data layouts for memory coalescing: array transpose and structure splitting • Code generation: CUDA, hierarchical parallelism
Affinity-aware Multi-level Loop Tiling (Example)

shared [32][32], [4][4],[1][1] float A[128][128];
……
upc_forall(i=6; i<128; i++; continue)
  upc_forall(j=0; j<128; j++; &A[i-1][j])
    ... ... F[i][j]...

Step 1: iteration space transformation, to make the affinity expression consistent with the data space

upc_forall(i=5; i<127; i++; continue)
  upc_forall(j=0; j<128; j++; &A[i][j])
    ... ... F[i+1][j]... // transformed

Step 2: three-level tiling (actually two levels here)

for (iu=0; iu<128; iu=iu+32)
  for (ju=0; ju<128; ju=ju+32) // upc thread affinity
    if (has_affinity(MYTHREAD, &A[iu][ju])) {
      // for the exposed region
      …dsm_read… F[iu+1:min(128, iu+32)][ju:min(127, ju+31)]
      for (ib=iu; ib<min(128, iu+32); ib=ib+4)
        for (jb=ju; jb<min(128, ju+32); jb=jb+4)
          for (i=ib; i<min(128, ib+4); i=i+1)
            for (j=jb; j<min(128, jb+4); j=j+1)
              if (i>=5 && i<127) // sink guards here!
                ... F[i+1][j]... ;
    } // of upc thread affinity

Step 3: spawn fine-grained threads ……
Memory Optimizations for CUDA • What data should be placed in shared memory? • A 0-1 bin-packing problem over the shared memory's capacity (a knapsack sketch follows this slide) • The profit: reuse degree integrated with the coalescing attribute • Inter-thread reuse and intra-thread reuse • Average reuse degree for a merged region • The cost: the volume of the referenced array region • Prefer inter-thread reuse • Compute the profit and cost for each reference • What is the optimal data layout in the GPU's global memory? • Coalescing attributes of array references • Only contiguity constraints of coalescing are considered • Legality analysis • Cost model and amortization analysis
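The bin-packing view can be made concrete with a small sketch (mine, not the actual compiler heuristic): a 0-1 knapsack over the shared-memory capacity, where each candidate array reference carries a profit (reuse degree weighted by coalescing) and a cost (volume of the referenced region). All names and numbers below are illustrative.

#include <stdio.h>
#include <string.h>

#define MAX_REFS 16
#define SMEM_CAP 48            /* shared-memory capacity in KB (assumption) */

typedef struct {
    const char *name;
    int cost_kb;               /* volume of the referenced array region, in KB */
    int profit;                /* reuse degree integrated with coalescing      */
} ref_t;

/* Classic 0-1 knapsack DP: choose which references go to shared memory. */
int choose_smem(const ref_t *refs, int n, int cap, int *chosen) {
    int best[MAX_REFS + 1][SMEM_CAP + 1];
    memset(best, 0, sizeof best);
    for (int i = 1; i <= n; i++)
        for (int c = 0; c <= cap; c++) {
            best[i][c] = best[i - 1][c];                         /* skip reference i-1 */
            if (refs[i - 1].cost_kb <= c) {
                int take = best[i - 1][c - refs[i - 1].cost_kb] + refs[i - 1].profit;
                if (take > best[i][c]) best[i][c] = take;        /* place it in shared memory */
            }
        }
    /* Backtrack to recover the chosen set. */
    for (int i = n, c = cap; i > 0; i--) {
        chosen[i - 1] = (best[i][c] != best[i - 1][c]);
        if (chosen[i - 1]) c -= refs[i - 1].cost_kb;
    }
    return best[n][cap];
}

int main(void) {
    ref_t refs[] = { {"A_tile", 16, 9}, {"B_tile", 16, 7}, {"C_halo", 24, 8} };
    int chosen[MAX_REFS] = {0};
    int total = choose_smem(refs, 3, SMEM_CAP, chosen);
    for (int i = 0; i < 3; i++)
        if (chosen[i]) printf("put %s in shared memory\n", refs[i].name);
    printf("total profit = %d\n", total);
    return 0;
}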
Overview of the Runtime Support • Multi-dimensional data distribution support • Gtask support on multicore platforms • Workload scheduling, synchronization, topology-aware mapping and binding • DSM system for unified memory management • GPU heap management • Block-based memory consistency • Inter-UPC message generation and data shuffling • Data shuffling to generate data tiles with halos • Data transformations for GPUs • Dynamic data layout transformations • For global memory coalescing, demand driven • Demand-driven data transfer between CPU and GPU
Unified Memory Management • Demand-driven data transfer • Only on the local data space; no software caching of remote data • Consistency maintenance happens at the boundary between CPU code and GPU code • Demand-driven data layout transformation • Redundant data transformations are removed • An extra field records the current layout of each data-tile copy (see the sketch below)
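A minimal sketch of such bookkeeping (the struct and function names are my assumptions, not the UPC-H runtime API): each data-tile copy records which side is up to date and in which layout the GPU copy currently sits, so transfers and layout transformations happen only on demand.

#include <stdio.h>
#include <stddef.h>

typedef enum { LAYOUT_ROW_MAJOR, LAYOUT_TRANSPOSED } layout_t;
typedef enum { LOC_CPU, LOC_GPU } loc_t;

/* Hypothetical descriptor for one data-tile copy managed by the runtime. */
typedef struct {
    void    *cpu_ptr, *gpu_ptr;
    size_t   bytes;
    loc_t    valid_copy;       /* which side holds the up-to-date data    */
    layout_t current_layout;   /* extra field: current layout of the tile */
} tile_t;

/* Stubs standing in for a host-to-device copy and a layout-transform kernel. */
static void copy_to_gpu(tile_t *t)                      { printf("H2D %zu bytes\n", t->bytes); }
static void transform_layout(tile_t *t, layout_t want)  { puts("transform"); t->current_layout = want; }

/* Demand driven: transfer and transform only if the GPU copy is stale
   or laid out differently from what the kernel expects. */
void acquire_on_gpu(tile_t *t, layout_t wanted) {
    if (t->valid_copy != LOC_GPU) {        /* stale GPU copy: transfer once     */
        copy_to_gpu(t);
        t->valid_copy = LOC_GPU;
    }
    if (t->current_layout != wanted)       /* redundant transforms are skipped  */
        transform_layout(t, wanted);
}

int main(void) {
    static char host[4096], dev[4096];
    tile_t t = { host, dev, sizeof host, LOC_CPU, LAYOUT_ROW_MAJOR };
    acquire_on_gpu(&t, LAYOUT_TRANSPOSED);  /* copies and transforms            */
    acquire_on_gpu(&t, LAYOUT_TRANSPOSED);  /* second call: nothing to do       */
    return 0;
}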
UPC-H Performance on GPU Cluster • Uses a 4-node CUDA cluster with Gigabit Ethernet. Each node has • CPUs: 2 dual-core AMD Opteron 880 • GPU: NVIDIA GeForce 9800 GX2 • Compilers: nvcc (2.2) -O3, GCC (3.4.6) -O3 • Performance: 72% on average (chart)
UPC-H Performance on Godson-T (chart: speedup) • The average speedup of the SPM optimization is 2.30; that of double buffering is 2.55
UPC-H Performance on Multi-core Cluster • Hardware and software • Xeon(R) CPU X7550 × 8 = 64 cores/node, 40Gb InfiniBand, ibv conduit, mvapich2-1.4 • Benchmarks • NPB: CG, FT • nbody, MM, Cannon MM • Results • NPB performance: UPC-H reaches 90% of UPC+OpenMP • Cannon MM can leverage optimal data sharing and communication coalescing • Expresses complicated hierarchical data parallelism that is hard to express in UPC+OpenMP
Introduction • Graph • Flexible abstraction for describing relationships between discrete objects • Basis of exploration-based applications (genomics, astrophysics, social network analysis, machine learning) • Graph search algorithms • Important techniques for analyzing the vertices or edges of a graph • Breadth-first search (BFS) is widely used and is the basis of many others (CC, SSSP, best-first search, A*) • Kernel of the Graph500 benchmark
Challenges in Efficient Parallel Graph Processing (revisited; borrowed from Andrew Lumsdaine, Douglas Gregor) • The same challenges as before: data-driven computations, unstructured problems, poor locality, and a high data-access-to-computation ratio • Goal: a global-view, high-level way to express parallelism with user-directed, automatic optimization, instead of tedious low-level code dominated by memory latency and communication
Tedious Optimizations of BFS (a graph algorithm) • Optimizing BFS on clusters (details shown on the slide)
Data-Centric Parallelism Abstraction for Irregular Applications • Amorphous data-parallelism (Keshav Pingali): given a set of active nodes and an ordering on active nodes, amorphous data-parallelism is the parallelism that arises from simultaneously processing active nodes, subject to neighborhood and ordering constraints • In the Galois system • Active elements (activities) • Neighborhood • Ordering • Exploiting such parallelism: a work list (a generic worklist loop is sketched below) • Keeps track of active elements and their ordering • Unordered-set iterator, ordered-set iterator • Conflicts among concurrent operations • Support for speculative execution
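For concreteness, a minimal sketch (mine, not Galois or UPC-H code) of the unordered work-list pattern: pop an active element, apply its operator subject to the neighborhood constraint, and push any newly activated elements.

#include <stdio.h>

#define MAX_WORK 1024

/* A trivial unordered work list; a real system adds locking, chunking,
   and conflict detection for speculative execution. */
typedef struct { int items[MAX_WORK]; int top; } worklist_t;

static void push(worklist_t *wl, int v) { if (wl->top < MAX_WORK) wl->items[wl->top++] = v; }
static int  pop (worklist_t *wl)        { return wl->top ? wl->items[--wl->top] : -1; }

/* Example operator on a tiny 8-node chain graph: the successor of node v. */
static int next_of(int v) { return v + 1 < 8 ? v + 1 : -1; }

int main(void) {
    int visited[8] = {0};
    worklist_t wl = { .top = 0 };
    push(&wl, 0);                        /* initial active element          */
    for (int v; (v = pop(&wl)) != -1; ) {
        if (visited[v]) continue;        /* neighborhood constraint:        */
        visited[v] = 1;                  /* each node is processed once     */
        printf("process %d\n", v);
        int w = next_of(v);
        if (w != -1 && !visited[w])
            push(&wl, w);                /* newly activated element         */
    }
    return 0;
}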
Design Principles of SWL • Programmability • Global-view programming • High-level language abstraction • Flexibility • User control over data locality (when constructing/executing work) • Customizable construction and behavior of work items • Lightweight speculative execution • Triggered by user hints, not purely automatic • Lightweight conflict detection; locking is too costly
SWL Extension in UPC-H • Hides optimization details from users: message coalescing, queue management, asynchronous communication, speculative execution, etc. • Language constructs: 1) specify a work list; 2) user-defined work constructor; 3) two iterators over the work list, a blocking one and a non-blocking one; 4) two kinds of work-item dispatcher; 5) user-assisted speculation: upc_spec_get() / upc_spec_put()
Level Synchronized BFS in SWL, Code Example

In UPC-H on clusters:

while(1) {
  int any_set = 0;
  upc_worklist_foreach(Work_t rcv : list1) {
    size_t ei = g.rowstarts[VERTEX_LOCAL(rcv)];
    size_t ei_end = g.rowstarts[VERTEX_LOCAL(rcv) + 1];
    for ( ; ei < ei_end; ++ei) {
      long w = g.column[ei];
      if (w == rcv) continue;
      Msg_t msg; msg.tgt = w; msg.src = rcv;
      upc_worklist_add(list2, &pred[w], usr_add(msg));
      any_set = 1;
    } // for each row
  } // foreach
  bupc_all_reduce_allI(.....);
  if (final_set[MYTHREAD] == 0) break;
  upc_worklist_exchange(list1, list2);
} // while

User-defined work constructor:

Work_t usr_add(Msg_t msg) {
  Work_t res_work;
  if (!TEST_VISITED(msg.tgt)) {
    pred[msg.tgt] = msg.src;
    SET_VISITED(msg.tgt);
    res_work = msg.tgt;
  } else
    res_work = NULL;
  return res_work;
}

(The Galois shared-memory version was shown alongside for comparison.)
Asynchronous BFS in SWL, Code Example (code panels: asynchronous implementation on shared-memory machines in Galois, and in UPC-H on clusters)
Implementation of SWL • Execution models • SPMD • SPMD + multithreading: master/slave • State transitions: executing, idle, termination detection, exit • Work dispatching • AM-based, distributed • Coalescing of work items and asynchronous transfer (see the sketch below) • Mutual exclusion on the SWL and work-item buffers
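A hedged sketch of the work-item coalescing mentioned above (all names are hypothetical): one outgoing buffer per destination thread accumulates work items and is sent asynchronously as a single message once full, or flushed at the end of a level.

#include <stdio.h>

#define NTHREADS 4
#define CHUNK    8              /* work items coalesced per message (assumption) */

typedef struct { long tgt, src; } work_t;

/* One outgoing buffer per destination UPC thread. */
static work_t out[NTHREADS][CHUNK];
static int    fill[NTHREADS];

/* Stub standing in for an AM-based asynchronous send of a whole chunk. */
static void am_send_chunk(int dest, const work_t *buf, int n) {
    printf("send %d coalesced items to thread %d\n", n, dest);
}

/* Add a work item; flush the buffer for that destination when it is full. */
void worklist_add(int dest, work_t w) {
    out[dest][fill[dest]++] = w;
    if (fill[dest] == CHUNK) {
        am_send_chunk(dest, out[dest], CHUNK);
        fill[dest] = 0;
    }
}

/* Flush all partially filled buffers, e.g. at the end of a BFS level. */
void worklist_flush(void) {
    for (int d = 0; d < NTHREADS; d++)
        if (fill[d]) { am_send_chunk(d, out[d], fill[d]); fill[d] = 0; }
}

int main(void) {
    for (long v = 0; v < 20; v++)
        worklist_add((int)(v % NTHREADS), (work_t){ v, v / 2 });
    worklist_flush();
    return 0;
}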
User-Assisted Speculative Execution • User API • upc_spec_get: get data ownership; transfer the data and get the shadow copy; conflict checking and rollback; enter the critical region • upc_cmt_put: release data ownership; commit the computation • Compiler • Identify the speculative hints (upc_spec_get/put) • Fine-grained atomic protection (full/empty bits) • Runtime system • Two modes: speculative and non-speculative • Rollback of data and computation (a usage sketch follows)
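A possible usage sketch of these hints for a speculative update of shared graph data; the talk does not give the exact signatures of upc_spec_get/upc_cmt_put, so the prototypes, the pred array, and relax_edge below are assumptions.

#include <upc_relaxed.h>
#include <stddef.h>

#define NV 1024
shared [1] long pred[NV];   /* predecessor map; assume -1 means "unvisited" */

/* Assumed signatures: speculatively acquire ownership of a shared location
   (returning a local shadow copy to work on), then commit or roll back. */
void *upc_spec_get(shared void *addr, size_t nbytes);
void  upc_cmt_put(shared void *addr, void *shadow, size_t nbytes);

void relax_edge(long src, long tgt) {
    /* Speculative critical region on pred[tgt]; on a conflict the runtime
       rolls back both the shadow copy and the computation. */
    long *p = (long *)upc_spec_get(&pred[tgt], sizeof(long));
    if (*p == -1)               /* not visited yet: claim it */
        *p = src;
    upc_cmt_put(&pred[tgt], p, sizeof(long));
}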
SPMD Execution, on a Shared Memory Machine and a Cluster • Shared memory machine: Intel Xeon X7550 @ 2.00GHz × 8; cluster: (Intel Xeon E5450 @ 3.00GHz × 2) × 64 nodes; Scale=20, edgefactor=16 • On the shared memory machine, UPC gets very close to OpenMP • On the cluster, UPC is better than MPI: 1) it saves one copy for each work item; 2) frequent polling raises the network throughput
SPMD+MT, on an X86 Cluster • SWL SYNC BFS, Scale=24, EdgeFactor=16 (chart: performance for varying numbers of pthreads per UPC thread)
Strong Scaling of SWL SYNC BFS on D6000 • ICT Loongson-3A V0.5 FPU V0.1 @ 0.75GHz × 2 per node; Scale=24, EdgeFactor=16 • Observations: 1) the MPI conduit has large overhead; 2) the tiered network behaves better when more intra-HPP communication happens
Summary and Future Work on SWL • Summary • Put forward a Shared Work List (SWL) extension to UPC to tackle amorphous data-parallelism • Using SWL, BFS achieves better performance and scalability than MPI at certain scales and runtime configurations • Realizes tedious optimizations with less user effort • Future work • Realize and evaluate the speculative execution support (Delaunay triangulation refinement) • Add a dynamic scheduler to the SWL iterators • Evaluate more graph algorithms
Acknowledgement Shenglin Tang Shixiong Xu Xingjing Lu Zheng Wu Lei Liu Chengpeng Li Zheng Jing