SmartApps: Middleware for Adaptive Applications on Reconfigurable Platforms
Lawrence Rauchwerger, http://parasol.tamu.edu/~rwerger/
Parasol Lab, Dept of Computer Science, Texas A&M
Today: System-Centric Computing
[Diagram labels: Application (algorithm); Development, Analysis & Optimization; Compiler (static); OS; HW; Execution System (OS & Arch); Input Data]
• Classic avenues to performance:
  • Parallel algorithms
  • Static compiler optimization
  • OS support
  • Good architecture
What's missing?
• No global optimization
• No matching between application, OS and HW
• Intractable for the general case
• Compilers are conservative
• OS offers generic services
• Architecture is generic
Our Approach: SmartApps, Application-Centric Computing
[Diagram labels: Application (algorithm); Development, Analysis & Optimization; Compiler (static) + run-time techniques; Run-time System: Execution, Analysis & Optimization; SmartApp: Application, Compiler (run-time), OS (modular), Architecture (reconfigurable); Input Data]
• Application control
• Instance-specific optimization
• Compiler + OS + Architecture + Data + Feedback
SmartApps Architecture
[Flow diagram labels]
• Development stage: STAPL Application; Static STAPL Compiler, augmented with runtime techniques; Compiled code + runtime hooks
• Advanced stages (Smart Application):
  • Get runtime information (sample input, system information, etc.)
  • Compute optimal application and RTS + OS configuration
  • Execute application
  • Continuously monitor performance and adapt as necessary
• Toolbox: DataBase; Predictor & Evaluator; Predictor & Optimizer; Configurer
• Small adaptation (tuning): runtime tuning (w/o recompile)
• Large adaptation (failure, phase change): recompute application and/or reconfigure RTS + OS
• Adaptive Software; Adaptive RTS + OS
Collaborative Effort:
• STAPL (Amato – TAMU)
• STAPL Compiler (Stroustrup/Quinlan TAMU - LLNL), Cohen INRIA, France
• RTS – K42 Interface & Optimizations (Krieger IBM)
• Applications (Amato/Adams TAMU, Novak/Morel LLNL/LANL)
• Validation on DOE extreme HW BlueGene (PERCS?) (Moreira/Krieger)
Texas A&M (Parasol, NE) + IBM + LLNL + INRIA
SmartApps written in STAPL
• STAPL (Standard Template Adaptive Parallel Library):
  • Collection of generic parallel algorithms, distributed containers & run-time system (RTS)
  • Interoperable with sequential programs
  • Extensible and composable by the end user
  • Shared object view: no explicit communication
  • Distributed objects: no replication/coherence
  • High-productivity environment
The STAPL Programming Environment
[Layer diagram labels]
• User Code
• pAlgorithms, pContainers, pRange
• RTS + Communication Library (ARMI)
• Interface to OS (K42)
• OpenMP/MPI/pthreads/native
Algorithm Adaptivity
• Problem: parallel algorithms are highly sensitive to:
  • Architecture: number of processors, memory interconnection, cache, available resources, etc.
  • Environment: thread management, memory allocation, operating system policies, etc.
  • Data characteristics: input type, layout, etc.
• Solution: adaptively choose the best algorithm from a library of options at run time.
Adaptive Framework: Overview of Approach
• Given: multiple implementation choices for the same high-level algorithm.
• STAPL installation: analyze each pAlgorithm's performance on the system and create a selection model.
• Program execution: gather parameters, query the model, and use the predicted algorithm.
[Diagram labels: Installation Benchmarks; Architecture & Environment; Algorithm Performance Model; Data Repository; User Code; Parallel Algorithm Choices; STAPL Adaptive Executable; Data Characteristics; Run-Time Tests; Selected Algorithm]
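The "gather parameters, query model, use predicted algorithm" step can be sketched as follows. This is a minimal illustration, not the STAPL implementation: the model functions, their coefficients, and the parameter names are all hypothetical stand-ins for the models that the installation benchmarks would produce.

```python
from typing import Callable, Dict

# A model maps run-time parameters to a predicted speedup.
Model = Callable[[Dict[str, float]], float]

# Hypothetical per-algorithm models with made-up coefficients.
# A real installation would derive these from benchmark runs.
MODELS: Dict[str, Model] = {
    "replicated_buffer": lambda p: 1.0 + 5.0 * p["sparsity"],
    "selective_privatization": lambda p: 2.5 + 1.0 * p["sparsity"],
    "local_write": lambda p: 3.5 - 0.5 * p["mobility"],
}


def select_algorithm(params: Dict[str, float],
                     models: Dict[str, Model] = MODELS) -> str:
    """Query every model and return the algorithm with the highest
    predicted speedup for the gathered parameters."""
    return max(models, key=lambda name: models[name](params))


# Parameters gathered at run time (values invented for the example).
dense_usage = {"sparsity": 0.9, "mobility": 2.0}
sparse_usage = {"sparsity": 0.05, "mobility": 1.0}
chosen_dense = select_algorithm(dense_usage)
chosen_sparse = select_algorithm(sparse_usage)
```

The selection itself is only a handful of model evaluations, which is what makes it plausible to run it at every loop invocation.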
Results – Current Status
• Investigated three operations:
  • Parallel reductions
  • Parallel sorting
  • Parallel matrix multiplication
• Several platforms:
  • 128-processor SGI Altix
  • 1152-node, dual-processor Xeon cluster
  • 68-node, 16-way IBM SMP cluster
  • HP V Class 16-way SMP
  • Origin 2000
Adaptive Reduction Selection Framework
[Diagram labels: Static Setup Phase; Dynamic Adaptive Phase; Application; Synthetic experiments; Optimizing compiler; Adaptive executable; Model derivation; Characteristics changed?; Algo. selection code; Select algo.; Selected algo.]
Reductions: Frequent Operations
Reduction: an update operation via an associative and commutative operator ⊕: x = x ⊕ expr
Sequential:
  FOR i = 1 to M
    sum = sum + B[i]
Parallel:
  DOALL i = 1 to M
    p = get_pid()
    s[p] = s[p] + B[i]                (partial accumulation)
  sum = s[1] + s[2] + … + s[#proc]    (final)
Irregular reduction: updates of array elements through indirection.
  FOR i = 1 to M
    A[ X[i] ] = A[ X[i] ] + B[i]
• Bottleneck for optimization.
• Many parallelization transformations (algorithms) have been proposed, and none of them always delivers the best performance.
Parallel Reduction Algorithms
• Replicated Buffer: simple, but won't scale when the data access pattern is sparse.
• Replicated Buffer with Links [ICS02]: reduced communication.
• Selective Privatization [ICS02]: reduced communication and memory consumption.
• Local Write [Han & Tseng]: zero communication, extra work.
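To make the trade-off between two of these schemes concrete, here is a minimal sequential simulation of Replicated Buffer and Local Write for the irregular reduction A[X[i]] += B[i]. It is a sketch under stated assumptions: processors are simulated by plain loops, the cyclic iteration schedule and the block ownership rule are chosen for illustration, and Local Write is shown in its simplest owner-computes form.

```python
from typing import List


def sequential_reduction(n: int, x: List[int], b: List[float]) -> List[float]:
    """Reference result: A[X[i]] += B[i] executed in order."""
    a = [0.0] * n
    for i, idx in enumerate(x):
        a[idx] += b[i]
    return a


def replicated_buffer(n: int, x: List[int], b: List[float],
                      num_procs: int) -> List[float]:
    """Each simulated processor accumulates into a private full-size copy
    of A; the copies are merged at the end. Memory and merge cost grow
    with n * num_procs even when few elements are touched."""
    private = [[0.0] * n for _ in range(num_procs)]
    for i, idx in enumerate(x):
        p = i % num_procs  # assumed cyclic iteration schedule
        private[p][idx] += b[i]
    return [sum(private[p][j] for p in range(num_procs)) for j in range(n)]


def local_write(n: int, x: List[int], b: List[float],
                num_procs: int) -> List[float]:
    """Each simulated processor writes only the block of A it owns, so no
    merge is needed. The extra work is that every processor scans the
    iterations to find the ones that target its block."""
    a = [0.0] * n
    for p in range(num_procs):
        for i, idx in enumerate(x):
            if idx * num_procs // n == p:  # assumed block ownership rule
                a[idx] += b[i]
    return a


x = [0, 3, 3, 1, 0, 2]
b = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
expected = sequential_reduction(4, x, b)
```

Both parallel variants produce the same array as the sequential loop; they differ only in where the cost goes (private copies and a merge versus redundant iteration scanning).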
Comparison of Parallel Reduction Algorithms Experimental setup
Observations:
• SelPriv is the best-performing algorithm overall (13/22).
• No single algorithm works well in all cases.
Memory Reference Model
  REAL A[N], pA[N,P]
  INTEGER X[2,M]
  DOALL i = 1, M
    C1 = func1()
    C2 = func2()
    pA[X[1,i], p] += C1
    pA[X[2,i], p] += C2
  DOALL i = 1, N
    A[i] += pA[i, 1:P]
Model parameters:
• N: number of shared data elements.
• Connectivity: $\mathrm{CON} = M / N$ (M: # iterations, N: # shared data elements).
• Mobility: number of distinct reduction elements in one iteration. It affects the iteration replication ratio of Local Write.
• Other work: measured by instrumenting a light-weight timer (~100 clock cycles) in a few iterations.
Model (cont.)
• Sparsity: how efficient is the usage of the replicated arrays?
  $\text{Sparsity} = \frac{\text{\# touched elements in replicated arrays}}{\text{size of replicated arrays}}$
• # Clusters: how efficient are the regional usages?
  # Clusters = number of clusters of the touched elements in the replicated array
[Figure: memory access patterns of Replicated Buffer]
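The model parameters can be computed directly from the index array. The sketch below assumes a block schedule of iterations over processors, takes mobility as the average number of distinct elements per iteration, and counts a cluster as a maximal run of consecutive touched indices; the slides do not spell out these details, so treat all three as assumptions.

```python
from typing import List, Sequence, Set


def connectivity(num_iterations: int, num_elements: int) -> float:
    """Iterations per shared data element (M / N)."""
    return num_iterations / num_elements


def mobility(iterations: Sequence[Sequence[int]]) -> float:
    """Average number of distinct reduction elements per iteration."""
    return sum(len(set(it)) for it in iterations) / len(iterations)


def touched_per_processor(iterations: Sequence[Sequence[int]],
                          num_procs: int) -> List[Set[int]]:
    """Elements each processor touches in its private copy, assuming
    a block schedule of iterations (an assumption of this sketch)."""
    chunk = -(-len(iterations) // num_procs)  # ceiling division
    touched = []
    for p in range(num_procs):
        mine = iterations[p * chunk:(p + 1) * chunk]
        touched.append({idx for it in mine for idx in it})
    return touched


def sparsity(touched: List[Set[int]], num_elements: int) -> float:
    """Touched elements divided by the total size of the replicated arrays."""
    return sum(len(t) for t in touched) / (num_elements * len(touched))


def count_clusters(touched: Set[int]) -> int:
    """Number of maximal runs of consecutive touched indices."""
    ordered = sorted(touched)
    return sum(1 for k, idx in enumerate(ordered)
               if k == 0 or idx != ordered[k - 1] + 1)


# Four iterations, two reduction elements each, N = 10, P = 2.
iters = [(0, 1), (1, 2), (7, 8), (8, 8)]
touched = touched_per_processor(iters, num_procs=2)
```

With these inputs each processor touches one compact region of its private copy, which is the favorable case for Replicated Buffer even though only a quarter of the replicated storage is used.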
Setup Phase: Model Generation
• The setup phase runs off-line.
[Diagram labels: Synthetic Parameter Values; Factorial experiment; Parameterized Synthetic Reduction Loops; Experimental execution; Experimental Speedups; Model Generation; General linear model for each algorithm: Speedup = F(Parameters)]
Synthetic Experiments
Synthetic reduction loop:
  double data[1:N]
  FOR j = 1, N * CON
    FOR i = 1, OTH                 (non-reduction work)
      expr[i] = (memory read, scalar ops)
    FOR i = 1, MOB
      k = index[i,j]
      data[k] += expr[i]
Experimental parameters: N, CON, OTH, MOB; the contents of index[*] are set by Sparsity and # Clusters.
Model Generation
Parameters:
• C: connectivity
• N: size of the reduction array
• M: mobility
• O: non-reduction work / reduction work
• S: sparsity of the replicated array
• L: # clusters
Regression models:
• Match parameters with the speedup of a scheme.
• Starting from a general linear model, terms are selected sequentially.
• Final models contain ~30 terms.
Other method: decision tree classification.
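As a rough illustration of fitting Speedup = F(Parameters), the sketch below fits a plain first-order linear model by ordinary least squares using the normal equations. It deliberately omits the sequential term selection and the interaction terms that would lead to the ~30-term models mentioned above, and the training data is synthetic.

```python
from typing import List, Sequence


def fit_linear_model(rows: Sequence[Sequence[float]],
                     speedups: Sequence[float]) -> List[float]:
    """Least-squares fit of speedup = w0 + w1*p1 + ... + wk*pk.
    Solves the normal equations (X^T X) w = X^T y by Gauss-Jordan
    elimination with partial pivoting. Returns [w0, w1, ..., wk]."""
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * y for x, y in zip(X, speedups))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("parameters are linearly dependent")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                for c in range(col, k + 1):
                    A[r][c] -= factor * A[col][c]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(weights: Sequence[float], params: Sequence[float]) -> float:
    """Evaluate the fitted model for one parameter vector."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], params))


# Synthetic "experiments": speedup = 1 + 2*a + 3*b exactly.
experiments = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
measured = [1.0, 3.0, 4.0, 6.0, 8.0]
weights = fit_linear_model(experiments, measured)
```

One such model would be fitted per algorithm, so that selection reduces to evaluating each model on the same parameter vector and comparing the predictions.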
Evaluation
• Q1: Can the prediction models select the right algorithm for a given loop execution instance?
• Q2: How far from the best possible performance are we when using our prediction models?
  $\text{Effectiveness} = \frac{\text{speedup of the algorithm recommended by the models}}{\text{speedup of the algorithm recommended by the oracle}}$
Evaluation (cont.)
• Q3: How much do our prediction models improve performance over alternative selection methods?
  $\text{Relative speedup} = \frac{\text{speedup of the algorithm chosen by the alternative selection method}}{\text{speedup of the algorithm recommended by our models}}$
• Alternative selection methods:
  • RepBuf: always use Replicated Buffer
  • Random: randomly select algorithms (average used)
  • Default: use SelPriv on HP and LocalWr on IBM
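The two evaluation metrics are simple ratios; a small worked example may help fix their direction. The speedup numbers below are invented for illustration and are not results from the experiments.

```python
from typing import Dict


def effectiveness(measured: Dict[str, float], recommended: str) -> float:
    """Speedup of the model's pick divided by the oracle's (best) speedup.
    A value of 1.0 means the model picked the best algorithm."""
    return measured[recommended] / max(measured.values())


def relative_speedup(alternative_speedup: float,
                     measured: Dict[str, float], recommended: str) -> float:
    """Speedup obtained by an alternative selection method divided by the
    speedup of the model's pick. Values below 1.0 favor the models."""
    return alternative_speedup / measured[recommended]


# Hypothetical measured speedups for one loop instance.
measured = {"RepBuf": 2.0, "SelPriv": 4.0, "LocalWr": 3.0}
recommended = "LocalWr"  # suppose the models picked a suboptimal scheme

eff = effectiveness(measured, recommended)
rel_repbuf = relative_speedup(measured["RepBuf"], measured, recommended)
# "Random" uses the average speedup over all algorithms.
random_avg = sum(measured.values()) / len(measured)
rel_random = relative_speedup(random_avg, measured, recommended)
```

In this made-up instance the models reach 75% of the oracle's speedup while still beating the always-RepBuf policy.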
Adaptive Reductions
Static irregular reduction:
  FOR (t = 1:steps) DO
    FOR (i = 1:M) DO
      access x[ index[i] ]
Adaptive irregular reduction:
  FOR (t = 1:steps) DO
    IF (adapt(t)) THEN
      update index[*]
    FOR (i = 1:M) DO
      access x[ index[i] ]
Phase behavior:
• Reusability = # steps in a phase
• Estimate phase-wise speedups by modeling the overheads of the setup phases of SelPriv and LocalWr.
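One way to read "phase-wise speedup" is that the setup cost of a scheme is paid once per phase and amortized over the steps that reuse the same access pattern. The formula and all timings below are assumptions made for illustration; the slides do not give the actual overhead model.

```python
from typing import Dict, Tuple


def phase_speedup(reusability: int, t_seq: float, t_par: float,
                  t_setup: float) -> float:
    """Assumed model: sequential time for the whole phase divided by
    one-time setup plus the parallel time for every step in the phase."""
    return (reusability * t_seq) / (t_setup + reusability * t_par)


# Hypothetical (per-step parallel time, setup time) for each scheme.
SCHEMES: Dict[str, Tuple[float, float]] = {
    "SelPriv": (2.0, 4.0),   # cheap setup, slower steps
    "LocalWr": (1.0, 28.0),  # expensive setup, faster steps
}
T_SEQ = 8.0  # hypothetical sequential time per step


def best_for_phase(reusability: int) -> str:
    """Pick the scheme with the highest estimated phase-wise speedup."""
    return max(SCHEMES, key=lambda s: phase_speedup(
        reusability, T_SEQ, SCHEMES[s][0], SCHEMES[s][1]))
```

Under these made-up costs, a phase of a single step favors the scheme with the cheap setup, while a long phase favors the scheme with the faster steady state.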
Moldyn
[Figure labels: Time steps; Small phases; Large phases; Adaptation phases]
• The performance of the algorithms does not change much dynamically.
• The Reusability of phases was specified artificially.
PP2D in FEATFLOW
The program (real application):
• PP2D (17K lines): nonlinear coupled equations solver using multi-grid methods.
• The irregular reduction loop in the GUPWD subroutine takes ~11% of program execution time.
• The distributed input has 4 grids; the largest has ~100K nodes.
• The loop is invoked with 4 (fixed) distinct memory access patterns in an interleaved manner.
Instrumentation:
• An algorithm selection module is wrapped around each invocation of the loop.
• The selection for each grid is reused for later instances.
PP2D in FEATFLOW (cont.)
• Notes:
  • RepBuf, SelPriv, and LocalWr correspond to applying a fixed algorithm for all grids.
  • DynaSel dynamically selects once for each grid and reuses the decisions.
  • Relative speedups are normalized to the best of the fixed algorithms.
• Result: our framework
  • Introduces negligible overhead (HP system).
  • Can further improve performance (IBM system).
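The "select once for each grid and reuse the decisions" behavior amounts to memoizing the selection per access pattern. A minimal sketch, assuming each invocation can be tagged with a grid identifier (the class name, the `grid_id` key, and the toy selection rule are all hypothetical):

```python
from typing import Callable, Dict


class DynamicSelector:
    """Wraps a selection function and caches one decision per grid."""

    def __init__(self, select: Callable[[Dict[str, float]], str]) -> None:
        self._select = select
        self._decisions: Dict[int, str] = {}
        self.model_queries = 0  # how often the model was consulted

    def algorithm_for(self, grid_id: int, params: Dict[str, float]) -> str:
        """Return the cached decision for this grid, selecting on first use."""
        if grid_id not in self._decisions:
            self.model_queries += 1
            self._decisions[grid_id] = self._select(params)
        return self._decisions[grid_id]


def toy_select(params: Dict[str, float]) -> str:
    """Stand-in for the real model query (rule invented for the example)."""
    return "SelPriv" if params["nodes"] > 10_000 else "LocalWr"


selector = DynamicSelector(toy_select)
grid_sizes = {0: 100_000.0, 1: 25_000.0, 2: 6_000.0, 3: 1_500.0}
chosen = []
# The loop is invoked with the four patterns in an interleaved manner.
for invocation in range(12):
    grid = invocation % 4
    chosen.append(selector.algorithm_for(grid, {"nodes": grid_sizes[grid]}))
```

Because the four access patterns are fixed, the model is consulted only four times no matter how many times the loop runs, which is consistent with the negligible overhead reported above.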
Adaptive Apps, Adaptive RTS, Adaptive OS
The RTS needs to provide (among others):
• Communication library (ARMI)
• Thread management
• Application-specific scheduling
  • based on the Data Dependence Graph (DDG)
  • based on application-specific policies
  • thread-to-processor mapping
• Memory management
• Application–OS bidirectional interface
Optimizing Communication (ARMI)
• Adaptive RTS: adaptive communication (ARMI)
• Minimize application execution time using application-specific information:
  • Use parallelism to hide latency (MT…)
  • Reduce critical path lengths of applications
  • Selectively use asynchronous/synchronous communication
K42 User-Level Scheduler
• RMI service request threads may be created on:
  • the local dispatcher, and then migrated to the dispatcher of the remote thread
  • the dispatcher of the remote thread
• New scheduling logic in the user-level dispatcher:
  • Currently only a FIFO ReadyQueue implementation is supported
  • Different priority-based scheduling policies are being implemented
SmartApps RTS Scheduler
• Integrating application scheduling with K42
[Diagram labels: User level: user-level threads (scheduled by the K42 user-level scheduler), user-level dispatchers. Kernel level: dispatchers (scheduled by the kernel), K42 kernel]
Priority-based Communication Scheduling
• Based on type of request: SYNC or ASYNC
• SYNC RMI: a new high-priority thread is created
  • The new thread is scheduled to RUN immediately
• ASYNC RMI: a new thread is created
  • The new thread is not scheduled to RUN until the current thread yields voluntarily
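The SYNC/ASYNC rules can be illustrated with a toy single-dispatcher model. This is not the K42 API: the class, its method names, and the choice to put a preempted thread at the front of the READY queue are assumptions of the sketch.

```python
from collections import deque
from typing import Deque, Optional


class Dispatcher:
    """Toy model of one user-level dispatcher with a RUN slot and a
    READY queue. Threads are represented by their names only."""

    def __init__(self, initial_thread: str) -> None:
        self.running: Optional[str] = initial_thread
        self.ready: Deque[str] = deque()

    def on_rmi(self, thread_name: str, sync: bool) -> None:
        """A new thread is created to service an incoming RMI request."""
        if sync and self.running is not None:
            # SYNC: the current thread moves to READY and the new
            # thread runs immediately (front placement is an assumption).
            self.ready.appendleft(self.running)
            self.running = thread_name
        elif sync:
            self.running = thread_name
        else:
            # ASYNC: the new thread waits until the current one yields.
            self.ready.append(thread_name)

    def finish_current(self) -> None:
        """The running thread completes; the next READY thread runs."""
        self.running = self.ready.popleft() if self.ready else None

    def yield_current(self) -> None:
        """The running thread voluntarily yields to the next READY thread."""
        if self.ready and self.running is not None:
            self.ready.append(self.running)
            self.running = self.ready.popleft()


d = Dispatcher("main")
d.on_rmi("async-1", sync=False)
after_async = d.running            # ASYNC does not preempt
d.on_rmi("sync-1", sync=True)
after_sync = d.running             # SYNC runs immediately
ready_after_sync = list(d.ready)
d.finish_current()                 # SYNC request serviced
after_finish = d.running
d.yield_current()                  # only now does the ASYNC thread run
after_yield = d.running
```

The trace shows the asymmetry: the SYNC request is serviced before the earlier ASYNC request, which runs only after the main thread yields.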
Priority-based Communication Scheduling
• Based on application-specified priorities
• Discrete Ordinates Particle Transport Computation (developed in STAPL)
[Figure panels: One sweep; Eight simultaneous sweeps]
Dependence Graph
[Figure: cellset dependence graphs for angle-set A, angle-set B, angle-set C, and angle-set D]
• Numbers are cellset indices
• Colors indicate processors
RMI Request Trace
[Animation frames; legend: Dispatcher, Physical Processor, Ordinary Thread, RMI Thread, RMI Request; processors P1, P2, P3]
• Initial state: each dispatcher has a thread in the RUN state.
• On an RMI request, a new thread is created on the remote dispatcher to service the request.
• For SYNC RMI requests, the currently running thread is moved to the READY state and the new thread is scheduled to RUN.
• For ASYNC RMI requests, the new thread is not scheduled to RUN until the current thread voluntarily yields.
• With multiple pending requests, the scheduling logic prescribed by the application is enforced to order the servicing of RMI requests.
Memory Consistency Issues
• Switching between threads to service RMI requests may cause memory consistency issues.
• Checkpoints need to be defined at which a thread's execution can be stopped to service RMI requests.
  • E.g., completion of a method may be a checkpoint for servicing pending RMI requests.
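A minimal sketch of the checkpoint idea, combined with application-specified ordering of pending requests: incoming RMI requests are queued and serviced only when a local method completes, so no request observes a half-finished update. The class, its method names, and the convention that a lower number means higher priority are assumptions, not part of ARMI or K42.

```python
import heapq
import itertools
from typing import Callable, List, Tuple


class CheckpointedObject:
    """Toy shared object that defers RMI requests to method boundaries."""

    def __init__(self) -> None:
        self.value = 0
        self.log: List[str] = []
        self._pending: List[Tuple[int, int, str, Callable]] = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order

    def post_rmi(self, priority: int, name: str,
                 fn: Callable[["CheckpointedObject"], None]) -> None:
        """Queue a request; lower priority number is serviced first
        (an assumed convention for application-specified priorities)."""
        heapq.heappush(self._pending, (priority, next(self._seq), name, fn))

    def checkpoint(self) -> None:
        """Service all pending requests in priority order."""
        while self._pending:
            _, _, name, fn = heapq.heappop(self._pending)
            fn(self)
            self.log.append(name)

    def local_method(self) -> None:
        """A two-step update. No request is serviced between the steps;
        completion of the method is the checkpoint."""
        self.value += 1
        self.value *= 2
        self.log.append("local_method")
        self.checkpoint()


def add_ten(obj: CheckpointedObject) -> None:
    obj.value += 10


def triple(obj: CheckpointedObject) -> None:
    obj.value *= 3


obj = CheckpointedObject()
obj.post_rmi(2, "low", add_ten)   # arrives first, lower priority
obj.post_rmi(1, "high", triple)   # arrives second, higher priority
obj.local_method()
```

The local update finishes first, and then the queued requests run in priority order rather than arrival order.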