www.uintah.utah.edu
Uintah Runtime System Tutorial
• Outline of the runtime system
• How Uintah achieves parallelism
• Uintah memory management
• Adaptive mesh refinement and load balancing
• Executing tasks – task graph and scheduler
• Heterogeneous architectures: GPUs & Intel MIC
Alan Humphrey, Justin Luitjens,* Dav de St. Germain and Martin Berzins
*NVIDIA, here as an ex-Uintah researcher/developer
Funding: DOE ASCI (97-10), NSF, DOE NETL+NNSA, ARL, NSF, INCITE, XSEDE
Dav de St. Germain's slides go here
• Outline of runtime system
• Uintah parallelism
• DataWarehouse, TaskGraph, etc.
Task-Graph Compilation – Detailed Task Dependency Generation
src/CCA/Components/Schedulers/SchedulerCommon.cc
src/CCA/Components/Schedulers/TaskGraph.cc

void SchedulerCommon::compile() {
  . . . . . . .
  DetailedTasks* dts = graphs[i]->createDetailedTasks(useInternalDeps(), grid, oldGrid);
  . . . . . . .
}

DetailedTasks* TaskGraph::createDetailedTasks(…, …, …) {
  lb->createNeighborhood(grid, oldGrid);  // lb = load balancer
  const set<int> neighborhood_procs = lb->getNeighborhoodProcessors();
  dts_ = scinew DetailedTasks(sc, d_myworld, this, neighborhood_procs, useInternalDeps);

  const PatchSet*    ps = task->getPatchSet();
  const MaterialSet* ms = task->getMaterialSet();

  // only schedule the output task for the subset involving our rank (special case, elided):
  //   const PatchSubset* pss = ps->getSubset(subset);

  for (int p = 0; p < ps->size(); p++) {
    const PatchSubset* pss = ps->getSubset(p);
    // don't create tasks that aren't in our neighborhood or that do not have patches
    if (lb->inNeighborhood(pss) && pss->size() > 0) {
      for (int m = 0; m < ms->size(); m++) {
        const MaterialSubset* mss = ms->getSubset(m);
        createDetailedTask(task, pss, mss);
      }
    }
  }
  . . . . . . .
  lb->assignResources(*dts_);
  dts_->assignMessageTags(d_myworld->myrank());
  dts_->createScrubCounts();
  . . . . . . .
  return dts_;
}
Scalable AMR Regridding Algorithm: Berger-Rigoutsos
• Recursively splits patches based on histograms of refinement flags
• Histogram creation requires all-gathers – O(processors)
• Complex and does not parallelize well
• Produces irregular patch sets
• Two versions – version 2 uses fewer cores to form the histogram
AMR regridding algorithm (Berger-Rigoutsos)
(Figure: an initial box of tagged points, the number of tagged points in each column, and the box split in two; the process is repeated on the two new boxes.)
(1) Tag cells where refinement is needed.
(2) Create a box to enclose the tagged cells.
(3) Split the box along its long direction based on a histogram of tagged cells; this step is repeated at least log(B) times, where B is the number of patches.
(4) Fit new boxes to each split box and repeat the steps as needed.
(A minimal sketch of the split step follows.)
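As a rough illustration of step (3), the sketch below (not Uintah code; Box, splitByHistogram, and the split heuristic are simplified placeholders) builds on a precomputed column histogram of tagged cells along the box's long direction and splits at the column with the fewest tags, one common Berger-Rigoutsos heuristic.

  // Sketch of step (3): split a box along its long direction at the minimum of the
  // column histogram of tagged cells. Illustrative only; the real algorithm also
  // looks for zero entries and inflection points and guards against degenerate splits.
  #include <utility>
  #include <vector>

  struct Box { int lo, hi; };   // 1-D extent along the box's long direction, [lo, hi)

  std::pair<Box, Box> splitByHistogram(const Box& box, const std::vector<int>& tagsPerColumn)
  {
    // tagsPerColumn[i] = number of tagged cells in column box.lo + i
    int cut = 0;
    for (int i = 1; i < (int)tagsPerColumn.size(); ++i)
      if (tagsPerColumn[i] < tagsPerColumn[cut])
        cut = i;                               // split where the fewest tags are
    Box left  = { box.lo, box.lo + cut };      // [lo, lo + cut)
    Box right = { box.lo + cut, box.hi };      // [lo + cut, hi)
    return { left, right };
  }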
Berger-Rigoutsos communication costs
• Every processor has to pass its part of the histogram to every other processor so that the partitioning decision can be made
• This is done using MPI_Allgather – the cost of an allgather on P processors (cores) is log(P); a sketch of this exchange follows
• Version 1: (2B-1) log(P) messages
• Version 2: only certain cores take part [Gunney et al.] – 2P - log(P) - 2 messages
• Alternative: use Berger-Rigoutsos locally on each processor
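A minimal sketch of the exchange described above, assuming each rank owns part of the flag field and computes a partial column histogram; MPI_Allgather gives every rank all partial histograms, which are summed locally so every rank makes the same split decision. This is illustrative only, not the Uintah implementation.

  #include <mpi.h>
  #include <vector>

  std::vector<int> globalHistogram(const std::vector<int>& localHist, MPI_Comm comm)
  {
    int nprocs = 0;
    MPI_Comm_size(comm, &nprocs);

    const int nbins = (int)localHist.size();
    std::vector<int> gathered(nbins * nprocs);

    // Every rank contributes its partial histogram and receives everyone else's.
    MPI_Allgather(localHist.data(), nbins, MPI_INT,
                  gathered.data(),  nbins, MPI_INT, comm);

    // Sum the per-rank contributions bin by bin.
    std::vector<int> global(nbins, 0);
    for (int r = 0; r < nprocs; ++r)
      for (int b = 0; b < nbins; ++b)
        global[b] += gathered[r * nbins + b];

    return global;
  }

An MPI_Allreduce with MPI_SUM over the histogram would achieve the same result in a single collective; the allgather form is shown because it is the one named on the slide.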
Scalable AMR Regridding Algorithm: Tiled Algorithm
• Tiles that contain flags become patches (a minimal sketch follows)
• Simple and easy to parallelize
• Semi-regular patch sets that can be exploited (example: neighbor finding)
• Each core processes a subset of the refinement flags in parallel, which helps produce the global patch set
[Analysis to appear in Concurrency]
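The core idea above can be sketched as follows: cover the domain with fixed-size tiles and turn every tile that contains at least one refinement flag into a patch. This is illustrative only; the tile size, flag layout, and patch types are simplified placeholders, not Uintah's.

  #include <vector>

  struct Tile { int i, j, k; };   // tile index within this rank's grid of tiles

  std::vector<Tile> tilesWithFlags(const std::vector<bool>& flags,   // cell-centered refinement flags
                                   int nx, int ny, int nz,           // cells in this rank's region
                                   int tile)                         // tile edge length in cells
  {
    const int tx = (nx + tile - 1) / tile, ty = (ny + tile - 1) / tile, tz = (nz + tile - 1) / tile;
    std::vector<bool> hit(tx * ty * tz, false);

    // Each flagged cell marks the tile that contains it.
    for (int k = 0; k < nz; ++k)
      for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i)
          if (flags[(k * ny + j) * nx + i])
            hit[((k / tile) * ty + (j / tile)) * tx + (i / tile)] = true;

    // Tiles with at least one flag become (local) patches; a union across ranks
    // then gives the global patch set.
    std::vector<Tile> patches;
    for (int k = 0; k < tz; ++k)
      for (int j = 0; j < ty; ++j)
        for (int i = 0; i < tx; ++i)
          if (hit[(k * ty + j) * tx + i])
            patches.push_back({ i, j, k });
    return patches;
  }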
EXAMPLE – MESH REFINEMENT AROUND A CIRCULAR FRONT
(Figure: patch sets produced by the global Berger-Rigoutsos, local Berger-Rigoutsos, and tiled algorithms.)
More patches are generated by the tiled algorithm, but there is no need to split them to load balance.
Theoretical Models of Refinement Algorithms
C = number of mesh cells in the domain
F = number of refinement flags in the domain
B = number of mesh patches in the domain
Bc = number of coarse mesh patches in the domain
P = number of processing cores
M = number of messages
T(GBRv1) = c1 (F/P) log(B) + c2 M(GBRv1)
T(LBR) = c4 (F/P) log(B/Bc)
T(Tiled) = c5 C/P
T(Split) = c6 B/P + c7 log(P)  – the time to split large patches
Strong Scaling of AMR Algorithms
GBR is global Berger-Rigoutsos. The problem size is fixed as the number of cores increases, so strong scaling should give decreasing execution time.
(Figure: GBRv1 shows no strong scaling; the tiled and local algorithms scale strongly. Dots are data and lines are the models.)
T(GBRv1) = c1 (flags/core) log(npatches) + c2 M(GBRv1)
T(Tiled) = c5 Nmeshcells / P
T(LBR): GBR applied locally on each compute node
Performance Improvements
• Improve load balance: cost-estimation algorithms based on data assimilation; load-balancing algorithms based on patches and a fast space-filling curve algorithm [1,2]
• Hybrid model: message passing between nodes and asynchronous threaded execution of tasks on cores [3]
[1] J. Luitjens, M. Berzins, T.C. Henderson. "Parallel Space Filling Curve Generation Through Sorting," Journal of Concurrency and Computation, Vol. 19, No. 10, pp. 1387-1402, 2007.
[2] J. Luitjens, Q. Meng, M. Berzins, T. Henderson. "Improving the Load Balance of Parallel Adaptive Mesh Refined Simulations," SCI Technical Report UUSCI-2008-007, University of Utah School of Computing, 2008.
[3] Q. Meng, M. Berzins, J. Schmidt. "Using Hybrid Parallelism to Improve Memory Use in Uintah," Proceedings of the TeraGrid 2011 Conference, Salt Lake City, Utah, ACM, July 2011.
Load Balancing Weight Estimation
Need to assign the same workload to each processor.
Algorithmic cost models (based on the discretization method):
• Vary according to the simulation and the machine
• Require accurate information from the user
Time series analysis:
• Used to forecast the execution time for each patch
• Automatically adjusts to the simulation and architecture with no user interaction
Simple exponential smoothing (sketched in the code below):
E(r,t+1) = α O(r,t) + (1 - α) E(r,t) = E(r,t) + α (O(r,t) - E(r,t))
where E(r,t) is the estimated time, O(r,t) is the observed time, α is the decay rate, and (O(r,t) - E(r,t)) is the error in the last prediction.
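A minimal sketch of the forecast cost model described above: per-patch execution times are smoothed exponentially and the smoothed value is used as the patch weight for load balancing. The class and member names here are illustrative, not Uintah's.

  #include <vector>

  class PatchCostForecaster {
  public:
    explicit PatchCostForecaster(int numPatches, double alpha = 0.3)
      : d_alpha(alpha), d_estimate(numPatches, 0.0) {}

    // Called after a timestep with the observed execution time of each patch.
    void update(int patch, double observed) {
      // E_{t+1} = E_t + alpha * (O_t - E_t)
      d_estimate[patch] += d_alpha * (observed - d_estimate[patch]);
    }

    // Weight handed to the load balancer (e.g. to weight a space-filling-curve partition).
    double weight(int patch) const { return d_estimate[patch]; }

  private:
    double d_alpha;                    // decay rate: larger = adapt faster, smooth less
    std::vector<double> d_estimate;    // E_{r,t} per patch
  };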
Comparison between the Forecast Cost Model (FCM) and the Algorithmic Cost Model
(Figure: particles + fluid code, full simulation.)
Uintah Task Scheduler (De-centralized)
• One MPI process per multicore node (MPI + Pthreads)
• All threads access the task queues and process their own MPI
• Tasks for one patch may run on different cores
• Shared data warehouse and task queues per multicore node
• Lock-free data warehouse enables fast memory access for all cores
Uintah Runtime System Data Management
(Diagram: per-node shared objects – Data Warehouse, Task Graph, internal and external task queues, and communication records – with an execution layer that runs on multiple cores. The cycle is: select task & post MPI receives (MPI_IRecv) → check communication records & find ready tasks (MPI_Test) → select & execute task (get/put on the Data Warehouse) → post the task's MPI sends (MPI_ISend).)
New Hybrid Model – Memory Savings: Ghost Cells
              MPI-only         Thread/MPI
Raw data:     49152 doubles    31360 doubles
MPI buffer:   28416 doubles    10624 doubles
Total:        ~75K doubles     ~40K doubles
(Figure: local patch surrounded by ghost cells.)
(Example on Kraken, 12 cores/node, 98K cores: 11% of memory needed.)
Uintah Threaded Scheduler – Task Selection (per thread)
src/CCA/Components/Schedulers/UnifiedScheduler.cc

// Each thread executes this method (t_id), checking task queues for work
void UnifiedScheduler::runTasks(int t_id)
{
  while (numTasksDone < ntasks) {
    while (!havework) {
      /* Run a CPU task that has its MPI communication complete. These tasks
       * go into the external ready queue automatically when their receive count
       * hits 0 in DependencyBatch::received, called when an MPI msg is delivered.
       *
       * NOTE: This is where the execution cycle begins for each GPU-enabled Task.
       */
      if (dts->numExternalReadyTasks() > 0) {
        readyTask = dts->getNextExternalReadyTask();
        if (readyTask != NULL) {
          havework = true;
          numTasksDone++;
          phaseTasksDone[readyTask->getTask()->d_phase]++;
          break;
        }
      }
      .........
      // Other types of work to check for: task->checkExternalDepCount()
      // process MPI recvs... pendingMPIRecvs(), etc.
Uintah Threaded Scheduler – Task Execution (per thread)
src/CCA/Components/Schedulers/UnifiedScheduler.cc

      /* If we have an internally-ready CPU task, initiate its MPI receives,
       * preparing it for the CPU external ready queue. The task is moved to the CPU
       * external-ready queue in the call to task->checkExternalDepCount().
       */
      else if (dts->numInternalReadyTasks() > 0) {
        initTask = dts->getNextInternalReadyTask();
        if (initTask != NULL) {
          if (initTask->getTask()->getType() == Task::Reduction || initTask->getTask()->usesMPI()) {
            // delay reductions until the end of the timestep
            phaseSyncTask[initTask->getTask()->d_phase] = initTask;
            ASSERT(initTask->getRequires().size() == 0);
            initTask = NULL;
          }
          else if (initTask->getRequires().size() == 0) {
            // no external dependencies, so there is no MPI communication to initiate
            initTask->markInitiated();
            // this is where tasks get added to the external ready queue
            initTask->checkExternalDepCount();
            initTask = NULL;
          }
          else {
            havework = true;
            break;
          }
        }
      }
      // otherwise process MPI recvs - call pendingMPIRecvs();
Uintah Threaded Scheduler – Overhead and Challenges
• Multiple MPI communicators: each thread has one communicator (see the sketch after this list)
• Allows multiple global tasks to run simultaneously
• No tags for MPI_Reduce, MPI_Allgather (even in MPI-3)
• Third-party libraries need a communicator: PETSc, hypre
• Modify existing task and infrastructure code
• Thread-safe MPI libraries: MPI_THREAD_MULTIPLE
• Thread-safe tasks: "most" Uintah infrastructure tasks are now thread-safe
• Tasks from component (user) code must also be thread-safe
• Overhead for acquiring locks becomes significant with high thread counts
• Need to move more sections of the infrastructure to mutex-free, concurrent-access data structures
• Multi-threaded coding complexity and often painful debugging
• Many bugs have only manifested at high core counts (e.g. 16-64K cores) with many threads per node (ALCF Mira – 64 threads/node)
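A minimal sketch of the "one communicator per thread" idea named above: duplicate MPI_COMM_WORLD once per worker thread so that collectives issued concurrently from different threads (which have no tag to separate them) cannot interfere. This assumes an MPI library initialized with MPI_THREAD_MULTIPLE and is illustrative only, not the Uintah implementation.

  #include <mpi.h>
  #include <vector>

  std::vector<MPI_Comm> makePerThreadComms(int numThreads)
  {
    std::vector<MPI_Comm> comms(numThreads);
    for (int t = 0; t < numThreads; ++t)
      MPI_Comm_dup(MPI_COMM_WORLD, &comms[t]);   // thread t later uses comms[t] only
    return comms;
  }

  // Example: a global reduction task running on thread t can then safely call
  //   MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comms[t]);
  // while another thread runs a different collective on its own communicator.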
Uintah Threaded Scheduler – Signaling Worker Threads
src/CCA/Components/Schedulers/UnifiedScheduler.cc

void UnifiedScheduler::execute(int tgnum /* = 0 */, int iteration /* = 0 */)
{
  // signal worker threads to begin executing tasks
  if (!d_isInitTimestep && !d_isRestartInitTimestep) {
    for (int i = 0; i < numThreads_; i++) {
      t_worker[i]->resetWaittime(Time::currentSeconds());  // reset wait time
      // send a signal to the threads to wake them up
      t_worker[i]->d_runmutex.lock();
      t_worker[i]->d_idle = false;
      t_worker[i]->d_runsignal.conditionSignal();
      t_worker[i]->d_runmutex.unlock();
    }
  }
  // the main thread also executes tasks (tid 0)
  runTasks(0);
  . . . . . . .
}

void UnifiedSchedulerWorker::run()
{
  Thread::self()->set_affinity(d_id + 1);
  while (true) {
    // wait for the main thread's signal
    d_runsignal.wait(d_runmutex);
    d_runmutex.unlock();
    d_scheduler->runTasks(d_id + 1);
    // signal the main thread for the next group of tasks
    . . . . . . .
  }
}
Lock-Free Data Structures
Global scalability depends on the details of the nodal runtime system. The change from Jaguar to Titan – more and faster cores, and faster communications – broke the runtime system that had previously worked fine with locks.
• Use the atomic instruction set
• Variable reference counting: fetch_and_add, fetch_and_sub, compare_and_swap; reads and writes can proceed simultaneously
• Data warehouse: redesigned variable container – updates use compare_and_swap, reductions use test_and_set
Uintah DWDatabase – Using Atomics
src/CCA/Components/Schedulers/DWDatabase.h

template<class DomainType>
int DWDatabase<DomainType>::decrementScrubCount(const VarLabel* label, int matlIndex,
                                                const DomainType* dom)
{
  …
  int rt = __sync_sub_and_fetch(&(scrubs[idx]), 1);
  …
}

template<class DomainType>
void DWDatabase<DomainType>::putVar(const VarLabel* label, int matlIndex,
                                    const DomainType* dom, Variable* var, bool init)
{
  newdi->var = var;
  …
  DataItem* olddi = __sync_lock_test_and_set(&vars[idx], 0);
  …
  newdi = __sync_lock_test_and_set(&vars[idx], olddi);
}
Scalability Improvement
(Figure: scaling comparison.) The original dynamic MPI-only scheduler ran out of memory on 98K cores and suffered severe load imbalance. The de-centralized MPI/thread hybrid scheduler, with the lock-free data warehouse and the new threaded runtime system, achieved much better CPU scalability.
Scheduling Global Sync Tasks
• Synchronization tasks update a global variable (e.g. MPI_Allreduce) or call a third-party library (e.g. PETSc, hypre)
• Out-of-order execution issues: deadlock and load imbalance (diagram: reductions R1 and R2 issued in different orders on different ranks)
• Solution – task phases: one global task per phase, the global task runs last, and execution within a phase may be out of order
Static Task Graph Execution
1) Static: predetermined execution order – tasks are synchronized, higher waiting times
(Diagram: task dependencies vs. execution order.)
Dynamic Task Graph Execution
1) Static: predetermined execution order – tasks are synchronized, higher waiting times
2) Dynamic: execute when ready – tasks are asynchronous, lower waiting times
(Diagram: task dependencies vs. execution order.)
Multi-threaded Task Graph Execution
1) Static: predetermined execution order – tasks are synchronized, higher waiting times
2) Dynamic: execute when ready – tasks are asynchronous, lower waiting times
3) Dynamic multi-threaded: task-level parallelism, memory savings, decreased communication and load imbalance, multicore friendly, supports GPU tasks
(Diagram: task dependencies vs. execution order.)
Uintah Runtime System: GPUs
NVIDIA Fermi Outline
• Host-to-device memory bandwidth is at most 8 GB/s, BUT …
• Device-memory-to-core bandwidth is 144 GB/s, with less than half of this achieved by memory-bound applications (BLAS level 1 and 2, ~40 GFlops)
Hiding PCIe Latency
• NVIDIA CUDA asynchronous API
• Asynchronous functions provide: memcopies asynchronous with the CPU; concurrent execution of memcopies with kernel(s)
• Stream – a sequence of operations that execute in order on the GPU; operations from different streams can be interleaved
(Diagram: with normal transfers, data transfer and kernel execution are serialized; with page-locked memory and streams, they overlap.)
(A minimal host-side sketch of this overlap follows.)
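A minimal CUDA sketch of the overlap described above: chunks of work are copied H2D, processed, and copied back D2H on independent streams, so transfers for one chunk overlap kernel execution for another. The kernel, chunking, and sizes are placeholders and error checks are omitted; this is not Uintah code.

  #include <cuda_runtime.h>

  __global__ void scaleKernel(double* d, int n, double s)
  {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= s;
  }

  void processChunks(double* hostData, int nChunks, int chunkSize)
  {
    // Page-locked (pinned) host memory is required for truly asynchronous copies.
    cudaHostRegister(hostData, (size_t)nChunks * chunkSize * sizeof(double), cudaHostRegisterDefault);

    cudaStream_t* streams = new cudaStream_t[nChunks];
    double**      dbuf    = new double*[nChunks];

    for (int c = 0; c < nChunks; ++c) {
      cudaStreamCreate(&streams[c]);
      cudaMalloc(&dbuf[c], chunkSize * sizeof(double));

      double* hchunk = hostData + (size_t)c * chunkSize;
      // H2D copy, kernel, and D2H copy are ordered within stream c,
      // but chunk c+1's copies can overlap chunk c's kernel.
      cudaMemcpyAsync(dbuf[c], hchunk, chunkSize * sizeof(double), cudaMemcpyHostToDevice, streams[c]);
      scaleKernel<<<(chunkSize + 255) / 256, 256, 0, streams[c]>>>(dbuf[c], chunkSize, 2.0);
      cudaMemcpyAsync(hchunk, dbuf[c], chunkSize * sizeof(double), cudaMemcpyDeviceToHost, streams[c]);
    }

    cudaDeviceSynchronize();                     // wait for all streams to finish
    for (int c = 0; c < nChunks; ++c) {
      cudaFree(dbuf[c]);
      cudaStreamDestroy(streams[c]);
    }
    cudaHostUnregister(hostData);
    delete[] streams;
    delete[] dbuf;
  }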
GPU Task and Data Management – the framework manages data movement
• Use the CUDA asynchronous API
• Automatically generate CUDA streams for task dependencies
• Concurrently execute kernels and memory copies
• Preload data before the task kernel executes
• Multi-GPU support
(Diagram: existing host memory is pinned with cudaHostRegister(); hostRequires is copied to devRequires with cudaMemcpyAsync(H2D) into a page-locked buffer; the call-back runs the kernel (computation) on the device; devComputes is copied back to hostComputes with an automatic cudaMemcpyAsync(D2H) and the pinned host memory is freed – the result is back on the host.)
(A sketch of this preload-then-execute cycle follows.)
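A sketch of the preload-then-execute pattern in the diagram: the requires data is staged to the device asynchronously on the task's stream, and the task only runs once a non-blocking stream query says the copies have completed. GpuTask and its members are hypothetical, not Uintah types; this is illustrative only.

  #include <cuda_runtime.h>

  struct GpuTask {
    cudaStream_t stream;
    double*      hostRequires;        // existing host memory holding the task's inputs
    double*      devRequires;         // device copy of the inputs
    size_t       nbytes;
    bool         copiesInitiated = false;
  };

  void initiateH2D(GpuTask& t)
  {
    cudaHostRegister(t.hostRequires, t.nbytes, cudaHostRegisterDefault);  // pin existing memory
    cudaMalloc(&t.devRequires, t.nbytes);
    cudaStreamCreate(&t.stream);
    cudaMemcpyAsync(t.devRequires, t.hostRequires, t.nbytes, cudaMemcpyHostToDevice, t.stream);
    t.copiesInitiated = true;
  }

  bool readyToRun(const GpuTask& t)
  {
    // cudaSuccess means all work queued on the stream (the H2D copies) has finished;
    // cudaErrorNotReady means keep the task queued and check again later.
    return t.copiesInitiated && cudaStreamQuery(t.stream) == cudaSuccess;
  }

This mirrors the checkCUDAStreamComplete() / cudaStreamQuery() check shown in the scheduler code two slides below.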
Multistage Task Queue Architecture • Overlap computation with PCIe transfers and MPI communication • Uintah can “pre-load” GPU task data • Scheduler queries task-graph for a task’s data requirements • Migrate data dependencies to GPU and backfill until ready
Uintah Threaded Scheduler – GPU Task Selection (per thread)
src/CCA/Components/Schedulers/UnifiedScheduler.cc

void UnifiedScheduler::runTasks(int t_id)

#ifdef HAVE_CUDA
      /* If it's a GPU-enabled task, assign it to a device, initiate its
       * H2D computes and requires data copies. This is where the execution cycle
       * begins for each GPU-enabled Task.
       */
      if (readyTask->getTask()->usesDevice()) {
        readyTask->assignDevice(currentDevice_);
        currentDevice_++;
        currentDevice_ %= numDevices_;
        gpuInitReady = true;
      }
      /* Check if the highest-priority GPU task's asynchronous H2D copies are
       * completed. If so, then reclaim the streams used for these operations,
       * execute the task and put it into the GPU completion-pending queue.
       */
      else if (dts->numInitiallyReadyDeviceTasks() > 0) {
        readyTask = dts->peekNextInitiallyReadyDeviceTask();
        if (readyTask->checkCUDAStreamComplete()) {  // cudaStreamQuery()
          // All of this GPU task's H2D copies are complete, prepare to run.
          readyTask = dts->getNextInitiallyReadyDeviceTask();
          gpuRunReady = true;
          havework = true;
          break;
        }
      }
#endif
Uintah Threaded Scheduler – GPU Task Selection (per thread)
src/CCA/Components/Schedulers/UnifiedScheduler.cc

void UnifiedScheduler::selectTasks(int t_id)
.........
#ifdef HAVE_CUDA
      else if (gpuInitReady) {
        // initiate all asynchronous H2D memory copies for this task's requires
        readyTask->setCUDAStream(getCudaStream(readyTask->getDeviceNum()));
        postH2DCopies(readyTask);
        preallocateDeviceMemory(readyTask);
        for (int i = 0; i < (int)dws.size(); i++) {
          dws[i]->getGPUDW(readyTask->getDeviceNum())->syncto_device();
        }
        dts->addInitiallyReadyDeviceTask(readyTask);
      }
      else if (gpuRunReady) {
        // otherwise run the task
        runTask(readyTask, currentIteration, t_id, Task::GPU);
        postD2HCopies(readyTask);
        dts->addCompletionPendingDeviceTask(readyTask);
      }
      else if (gpuPending) {
        // recycle this task's D2H copy streams
        reclaimCudaStreams(readyTask);
      }
#endif
GPU DataWarehouse
• Automatic, on-demand variable movement to and from the device
• Implemented interfaces for both CPU and GPU tasks
(Diagram: on the host, CPU tasks and MPI buffers use dw.get()/dw.put() against a hash-map data warehouse; on the device, GPU tasks use a flat-array data warehouse; asynchronous H2D and D2H copies move variables between the two.)
Uintah Threaded Scheduler – GPU DataWarehouse API
src/CCA/Components/Models/Radiation/RMCRT/RayGPUKernel.cu

// Host-side interface
GPUDataWarehouse* old_gdw = old_dw->getGPUDW()->getdevice_ptr();

// Device-side interface
__global__ void rayTraceKernel(patchParams patch,
                               curandState* randNumStates,
                               GPUDataWarehouse* old_gdw,
                               GPUDataWarehouse* new_gdw)
{
  // calculate the thread indices
  int tidX = threadIdx.x + blockIdx.x * blockDim.x + patch.loEC.x;
  int tidY = threadIdx.y + blockIdx.y * blockDim.y + patch.loEC.y;
  .........
  GPUGridVariable<double> divQ;
  new_gdw->get(divQ, "divQ", patch.ID, matl);

  for (int z = patch.lo.z; z < patch.hi.z; z++) {  // loop through z slices
    divQ[c] = tidX + tidY + z;
  }
  divQ[origin] = 4.0 * M_PI * abskg[origin] * (sigmaT4OverPi[origin] - (sumI / nDivQRays));
}

// operator[] overload from: src/CCA/Components/Schedulers/GPUDataWarehouse.cu
__host__ __device__ T& operator[](const int3& idx) {
  // get data - global index
  return d_data[idx.x - d_offset.x
              + d_size.x * (idx.y - d_offset.y + (idx.z - d_offset.z) * d_size.y)];
}
Performance and Scaling Comparisons
• Uintah strong-scaling results when using: multi-threaded MPI; multi-threaded MPI w/ GPU
• GPU-enabled RMCRT: all-to-all communication, 100 rays per cell, 128^3 cells
• Need to port multi-level CPU tasks to the GPU
(Figure: Cray XK7, DOE Titan. Thread/MPI: N = 16 AMD Opteron CPU cores. Thread/MPI + GPU: N = 16 AMD Opteron CPU cores + 1 NVIDIA K20 GPU.)
Uintah Runtime System: Xeon Phi (Intel MIC)
Uintah on Stampede: Host-only Model
Incompressible turbulent flow
• Intel MPI issues beyond 2048 cores (seg faults); MVAPICH2 required for larger core counts
• Using hypre with a conjugate gradient solver, preconditioned with geometric multigrid
• Red-black Gauss-Seidel relaxation on each patch (a minimal sketch follows)
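A minimal sketch of one red-black Gauss-Seidel sweep on a single patch for a 2-D Poisson-like problem, included only to illustrate the smoother named above; hypre's own implementation and Uintah's patch data layout differ.

  #include <vector>

  void redBlackGaussSeidel(std::vector<double>& u, const std::vector<double>& f,
                           int nx, int ny, double h)
  {
    auto idx = [nx](int i, int j) { return j * nx + i; };

    // Two half-sweeps: first the "red" cells ((i+j) even), then the "black" cells.
    // All cells of one color depend only on the other color, so each half-sweep
    // is trivially parallel (e.g. across patches or threads).
    for (int color = 0; color < 2; ++color) {
      for (int j = 1; j < ny - 1; ++j) {
        for (int i = 1; i < nx - 1; ++i) {
          if (((i + j) & 1) != color) continue;
          u[idx(i, j)] = 0.25 * (u[idx(i - 1, j)] + u[idx(i + 1, j)]
                               + u[idx(i, j - 1)] + u[idx(i, j + 1)]
                               - h * h * f[idx(i, j)]);
        }
      }
    }
  }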
Uintah on Stampede: Native Model
• Compile with -mmic (cross-compiling)
• Need to build all of Uintah's required third-party libraries:* libxml2, zlib
• Ran Uintah natively on the Xeon Phi within 1 day (single Xeon Phi card)
* During early access to TACC Stampede
Uintah on Stampede: Offload Model
• Use compiler directives (#pragma)
• Offload target: #pragma offload target(mic:0)
• OpenMP: #pragma omp parallel
• Find copy in/out variables from the task graph
• Functions called on the MIC must be defined with __attribute__((target(mic)))
• Hard for Uintah to use offload mode: highly templated C++ methods must be rewritten in simple C/C++ so they can be called on the Xeon Phi
• Less effort than the GPU port, but still significant work for complex code such as Uintah
(A small sketch of these ingredients follows.)
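A small sketch of the offload-model ingredients listed above, under the assumption of the Intel compiler's offload (LEO) support: an offload pragma with in/out clauses (in a real port these would come from the task graph's requires/computes), an OpenMP loop on the coprocessor, and the target attribute on any function called there. Function and variable names are placeholders; this is not Uintah code.

  #include <omp.h>

  __attribute__((target(mic)))
  double stencil(const double* u, int i) { return 0.5 * (u[i - 1] + u[i + 1]); }

  void smoothOnMIC(const double* src, double* dst, int n)
  {
    #pragma offload target(mic:0) in(src : length(n)) out(dst : length(n))
    {
      dst[0] = src[0];
      dst[n - 1] = src[n - 1];
      #pragma omp parallel for
      for (int i = 1; i < n - 1; ++i)
        dst[i] = stencil(src, i);
    }
  }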
Uintah on Stampede: Symmetric Model
• The Xeon Phi directly calls MPI
• Use Pthreads on both the host CPU and the Xeon Phi: 1 MPI process on the host – 16 threads; 1 MPI process on the MIC – up to 120 threads
• At the time, only Intel MPI was supported:
  mpiexec.hydra -n 8 ./sus -nthreads 16 : -n 8 ./sus.mic -nthreads 120
• Challenges: different floating-point accuracy on the host and the co-processor; result consistency; MPI message mismatch issue from control-related FP operations
Scaling Results on Xeon Phi
• Multiple MIC cards (symmetric model)
• Xeon Phi card: 60 threads per MPI process, 2 MPI processes
• Host CPU: 16 threads per MPI process, 1 MPI process
• Issue: load imbalance – patches profile differently on the host and the Xeon Phi
(Figure: scaling with multiple MIC cards.)
Example: Symmetric Model FP Issue
p = 0.421874999999999944488848768742172978818416595458984375
c = 0.0026041666666666665221063770019327421323396265506744384765625
b = p / c
Host-only model: rank 0 (CPU) and rank 1 (CPU) both compute b = 162, so the MPI message sizes match (the diagram shows the same index range, 161-164, on both ranks).
Symmetric model: rank 0 (CPU) computes b = 162 while rank 1 (MIC) computes b = 161.99999999999999, giving an MPI message-size mismatch.
Control-related FP operations must use a consistent accuracy model. (A small sketch follows.)
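A hypothetical sketch of how the mismatch above becomes an MPI message-size mismatch: if a cell count is derived from b = p / c and then truncated, 162.0 truncates to 162 on one processor while 161.99999999999999 truncates to 161 on the other, so the two ranks post sends/receives of different lengths. Uintah's actual index computation is more involved; this only illustrates why control-related FP needs a consistent accuracy model.

  #include <cmath>
  #include <cstdio>

  int cellsFrom(double p, double c)
  {
    double b = p / c;              // 162.0 on the host, 161.99999999999999 on the MIC (per the slide)
    return (int)b;                 // truncation: 162 vs. 161  ->  message-size mismatch
  }

  int main()
  {
    const double p = 0.421874999999999944488848768742172978818416595458984375;
    const double c = 0.0026041666666666665221063770019327421323396265506744384765625;
    std::printf("cells = %d\n", cellsFrom(p, c));
    // A more robust pattern for control logic is a tolerant round, e.g. (int)std::floor(b + 1e-10),
    // or computing the count once and communicating it rather than recomputing it per device.
    return 0;
  }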