This presentation introduces S3P, a framework that lets signal processing applications self-optimize to any hardware architecture. It combines Lincoln Laboratory system expertise with the MIT LCS FFTW approach to achieve optimal resource allocation.
S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond
MIT Lincoln Laboratory
27 September 2001, HPEC Workshop, Lexington, MA
This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the Department of Defense.
Acknowledgements • Matteo Frigo (MIT/LCS & Vanu, Inc.) • Charles Leiserson (MIT/LCS) • Adam Wierman (CMU)
Outline • Introduction (Problem Statement, S3P Program) • Design • Demonstration • Results • Summary
Example: Optimum System Latency
[Figure: two-component system with Beamform latency = 2/N and Filter latency = 1/N, plotting component and system latency against hardware units (N); constraints hardware < 32 and latency < 8, with the local and global optima marked]
• Simple two-component system
• Local optimum fails to satisfy global constraints
• Need system view to find global optimum
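To make the local-versus-global distinction concrete, here is a minimal sketch in Python of the joint search. The latency models 2/N and 1/N come from the figure; the 32-unit budget is an assumption read off the slide's constraint, and all names are illustrative. A local optimizer that sizes each component in isolation never sees the shared budget or the system latency constraint that this joint search enforces.

    # Minimal sketch: global search over the two-component example.
    # Latency models (2/N, 1/N) and the 32-unit budget are taken from
    # the slide's figure; this is illustrative, not the S3P code.
    HW_BUDGET = 32

    def beamform_latency(n):      # figure's model: 2/N
        return 2.0 / n

    def filter_latency(n):        # figure's model: 1/N
        return 1.0 / n

    # Global optimum: jointly search every feasible split of the budget.
    best = None
    for n_beam in range(1, HW_BUDGET):
        for n_filt in range(1, HW_BUDGET - n_beam + 1):
            total = beamform_latency(n_beam) + filter_latency(n_filt)
            if best is None or total < best[0]:
                best = (total, n_beam, n_filt)

    print("global optimum: latency %.3f with beamform=%d, filter=%d" % best)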
System Optimization Challenge
Signal processing application: Beamform (XOUT = w*XIN) → Filter (XOUT = FIR(XIN)) → Detect (XOUT = |XIN| > c)
Goal: optimal resource allocation (latency, throughput, memory, bandwidth, …) onto a compute fabric (cluster, FPGA, SOC, …)
• Optimizing to system constraints requires a two-way component/system knowledge exchange
• Need a framework to mediate the exchange and perform system-level optimization
S3P: Lincoln Internal R&D Program
• Goal: applications that self-optimize to any hardware
• Combine Lincoln Laboratory expertise in parallel signal processing (Kepner/Hoffmann) with the MIT LCS self-optimizing software approach of FFTW (Leiserson/Frigo)
[Figure: S3P framework — each of M algorithm stages has N candidate processor mappings; candidate mappings are timed and verified to select the best set]
• S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
• Framework exploits a graph theory abstraction
• Broadly applicable to system optimization problems
• Defines clear component and system requirements
Outline • Introduction • Design (Requirements, Graph Theory) • Demonstration • Results • Summary
System Requirements
• Decomposable into Tasks (computation) and Conduits (communication): Beamform (XOUT = w*XIN) → Filter (XOUT = FIR(XIN)) → Detect (XOUT = |XIN| > c)
• Mappable to different sets of hardware
• Measurable resource usage for each mapping
• Each compute stage can be mapped to different sets of hardware and timed
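A minimal Python sketch of the interface these three requirements imply follows; the class and method names (Task, Conduit, map_to, time_once) are illustrative, not PVL's or S3P's actual API.

    # Sketch of the interface implied by the requirements:
    # decomposable, mappable, measurable. Names are illustrative.
    import time

    class Task:
        def __init__(self, name, kernel):
            self.name, self.kernel = name, kernel
            self.cpus = None

        def map_to(self, cpus):           # requirement: mappable
            self.cpus = cpus

        def time_once(self, data):        # requirement: measurable
            start = time.perf_counter()
            out = self.kernel(data, self.cpus)
            return time.perf_counter() - start, out

    class Conduit:
        """Moves data between two task mappings; also mappable and timeable."""
        def __init__(self, src, dst):
            self.src, self.dst = src, dst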
System Graph
[Figure: layered graph for Beamform → Filter → Detect]
• Node is a unique mapping of a task
• Edge is a conduit between a pair of task mappings
• The system graph can store the hardware resource usage of every possible Task and Conduit
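One plausible way to hold this structure in code is a layered graph keyed by (stage, mapping) pairs. The following Python sketch uses illustrative names (node_time, edge_time) and hypothetical mapping labels; later sketches reuse this layout.

    # Sketch of a layered system graph. Each node is one candidate
    # mapping of a stage; each edge is the conduit between mappings of
    # adjacent stages. Mapping labels are hypothetical.
    node_time = {}   # (stage, mapping) -> measured task time
    edge_time = {}   # ((stage, m1), (stage+1, m2)) -> measured conduit time

    stages = ["beamform", "filter", "detect"]
    mappings = {s: ["1cpu", "2cpu", "4cpu"] for s in stages}

    # Reserve a slot for every node and every edge between adjacent stages;
    # the timing phase fills these in.
    for i, s in enumerate(stages):
        for m in mappings[s]:
            node_time[(i, m)] = None
            if i + 1 < len(stages):
                for m2 in mappings[stages[i + 1]]:
                    edge_time[((i, m), (i + 1, m2))] = None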
Path = System Mapping
[Figure: one path through the Beamform → Filter → Detect graph]
• Each path is a complete system mapping; the "best" path is the optimal system mapping
• The graph construct is very general and widely used for optimization problems
• Many efficient techniques exist for choosing the "best" path under constraints, such as dynamic programming
Example: Maximize Throughput
[Figure: system graph annotated with measured times — each node stores the task time for a given mapping (values ranging from 1.5 to 16.0), each edge stores the conduit time for a given pair of mappings, with hardware increasing down the graph]
• Goal: maximize throughput and minimize hardware
• Choose the path with the smallest bottleneck that satisfies the hardware constraint
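A pipeline's throughput is set by its slowest task or conduit, so the search keeps the path whose worst element is smallest, subject to the hardware budget. Here is a hedged, exhaustive Python sketch building on the hypothetical node_time/edge_time layout above (hw, giving each mapping's processor count, is a further assumption); the dynamic program on the next slide avoids this brute-force enumeration.

    # Sketch: exhaustive min-bottleneck search over the layered graph.
    # Assumes node_time, edge_time, stages, mappings as sketched above,
    # plus hw[(stage, mapping)] giving each mapping's processor count.
    from itertools import product

    def bottleneck_search(stages, mappings, node_time, edge_time, hw, budget):
        best_path, best_bottleneck = None, float("inf")
        for path in product(*(mappings[s] for s in stages)):
            nodes = list(enumerate(path))
            if sum(hw[n] for n in nodes) > budget:
                continue                      # violates hardware constraint
            times = [node_time[n] for n in nodes]
            times += [edge_time[(nodes[i], nodes[i + 1])]
                      for i in range(len(nodes) - 1)]
            bottleneck = max(times)           # slowest element limits the rate
            if bottleneck < best_bottleneck:
                best_path, best_bottleneck = path, bottleneck
        # pipeline throughput is 1 / bottleneck time
        return best_path, 1.0 / best_bottleneck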
Path Finding Algorithms
• Graph construct is very general
• Widely used for optimization problems
• Many efficient techniques exist for choosing the "best" path under constraints, such as Dijkstra's algorithm and dynamic programming

Dynamic Programming
    N = total hardware units
    M = number of tasks
    P[i] = number of mappings for task i
    t = M
    pathTable[M][N] = all infinite-weight paths
    for( j : 1..M ){
      for( k : 1..P[j] ){
        for( i : j+1..N-t+1 ){
          if( i - size[k] >= j ){
            if( j > 1 ){
              w = weight[pathTable[j-1][i-size[k]]] + weight[k]
                  + weight[edge[last[pathTable[j-1][i-size[k]]], k]]
              p = addVertex[pathTable[j-1][i-size[k]], k]
            }else{
              w = weight[k]
              p = makePath[k]
            }
            if( weight[pathTable[j][i]] > w ){
              pathTable[j][i] = p
            }
          }
        }
      }
      t = t - 1
    }

Dijkstra's Algorithm
    Initialize graph G
    Initialize source vertex s
    Store all vertices of G in a minimum priority queue Q
    while( Q is not empty ){
      u = pop[Q]
      for( each vertex v adjacent to u ){
        w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
        if( v.totalPathWeight() > w ){
          v.totalPathWeight() = w
          v.predecessor() = u
        }
      }
    }
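For concreteness, here is a hedged, runnable Python rendering of the dynamic program above, minimizing total path weight under a hardware budget. The data layout is an assumption: maps[j] lists (weight, size, id) tuples for task j's candidate mappings, and edge[(id1, id2)] gives the conduit weight. Like the pseudocode, it keeps only the best prefix path per budget, so conduit costs at the join are approximated rather than exhaustively searched.

    # Sketch: dynamic program for the minimum-weight path through the
    # layered system graph under a hardware budget. Data layout assumed.
    import math

    def best_path(maps, edge, budget):
        M = len(maps)
        # table[j][n] = (weight, path) for tasks 0..j using <= n units
        table = [[(math.inf, None)] * (budget + 1) for _ in range(M)]
        for j in range(M):
            for (wk, size, mid) in maps[j]:
                for n in range(size, budget + 1):
                    if j == 0:
                        w, path = wk, [mid]
                    else:
                        prev_w, prev_path = table[j - 1][n - size]
                        if prev_path is None:
                            continue
                        # add task weight plus the conduit from the
                        # prefix's last mapping (as in the pseudocode)
                        w = prev_w + wk + edge[(prev_path[-1], mid)]
                        path = prev_path + [mid]
                    if w < table[j][n][0]:
                        table[j][n] = (w, path)
        return table[M - 1][budget]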
S3P Inputs and Outputs
• Inputs (some required, some optional): application, algorithm information, system constraints, hardware information
• Output: the "best" system mapping
• Can flexibly add information about the application, algorithm, system, and hardware
Outline • Introduction • Design • Demonstration (Application, Middleware, Hardware, S3P) • Results • Summary
S3P Demonstration Testbed
• Multi-stage application: Input → Low Pass Filter → Beamform → Matched Filter
• Middleware (PVL): Task, Conduit, and Map components
• S3P Engine
• Hardware: workstation cluster
Multi-Stage Application
[Figure: signal flow — Input → Low Pass Filter (FIR1, FIR2) → Beamform (multiply by weights) → Matched Filter (FFT, multiply, IFFT), with weight sets W1-W4 and XIN/XOUT at each stage]
• Features
• "Generic" radar/sonar signal processing chain
• Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
• Scalable to any problem size (fully parameterized algorithm)
• Self-validates (built-in target generator)
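As an illustration of the matched filter stage's FFT → multiply → IFFT pattern, here is a standard fast-convolution sketch in Python with numpy; it is the textbook formulation, not the testbed's actual code, and the chirp replica is a toy assumption.

    # Sketch of the matched filter's FFT -> multiply -> IFFT pattern
    # (standard fast convolution; illustrative, not the testbed's code).
    import numpy as np

    def matched_filter(x, replica):
        n = len(x) + len(replica) - 1        # full linear-correlation length
        X = np.fft.fft(x, n)
        R = np.fft.fft(replica, n)
        # correlate with the replica: multiply by its conjugate spectrum
        return np.fft.ifft(X * np.conj(R))

    pulse = np.exp(1j * np.pi * np.linspace(0, 1, 64) ** 2)   # toy chirp
    echo = np.concatenate([np.zeros(100), pulse, np.zeros(100)])
    out = matched_filter(echo, pulse)
    print("peak at sample", int(np.argmax(np.abs(out))))      # target delay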
Parallel Vector Library (PVL)
Class — Description (Parallelism)
• Matrix/Vector — performs matrix/vector algebra on data spanning multiple processors (Data)
• Computation — performs signal/image processing functions on matrices/vectors, e.g. FFT, FIR, QR (Data & Task)
Signal Processing & Control:
• Task — supports algorithm decomposition, i.e. the boxes in a signal flow diagram (Task & Pipeline)
• Conduit — supports data movement between tasks, i.e. the arrows in a signal flow diagram (Task & Pipeline)
Mapping:
• Map — specifies how Tasks, Matrices/Vectors, and Computations are distributed on processors (Data, Task & Pipeline)
• Grid — organizes processors into a 2D layout
• Simple mappable components support data, task, and pipeline parallelism
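The key idea behind the Map class is that distribution is described separately from the algorithm. The following hypothetical Python sketch illustrates that separation only; PVL's actual C++ API is not reproduced here, and the Map/owner_blocks names are invented for the example.

    # Hypothetical sketch of map-based data distribution. A Map says
    # which processors own which block of a vector; the algorithm code
    # stays the same when the map changes. Names are invented.
    class Map:
        def __init__(self, procs):
            self.procs = procs            # processor ranks owning the data

        def owner_blocks(self, length):
            """Split [0, length) evenly across the map's processors."""
            n = len(self.procs)
            bounds = [round(i * length / n) for i in range(n + 1)]
            return {p: (bounds[i], bounds[i + 1])
                    for i, p in enumerate(self.procs)}

    # The same vector under two different mappings; only the Map changes.
    print(Map([0]).owner_blocks(8))        # {0: (0, 8)}
    print(Map([0, 1]).owner_blocks(8))     # {0: (0, 4), 1: (4, 8)}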
Hardware Platform
• Network of 8 Linux workstations • Dual 800 MHz Pentium III processors
• Communication • Gigabit Ethernet, 8-port switch • Isolated network
• Software • Linux kernel release 2.2.14 • GNU C++ compiler • MPICH communication library over TCP/IP
• Advantages • Software tools • Widely available • Inexpensive (high Mflops/$) • Excellent rapid-prototyping platform
• Disadvantages • Non-real-time OS • Non-real-time messaging • Slower interconnect • Difficult to model • Erratic SMP behavior
S3P Engine
Inputs: application program, algorithm information, system constraints, hardware information. Output: "best" system mapping.
• Map Generator constructs the system graph for all candidate mappings
• Map Timer times each node and edge of the system graph
• Map Selector searches the system graph for the optimal set of maps
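A minimal sketch of how those three pieces might compose, in Python, reusing the hypothetical graph layout and bottleneck_search helper sketched earlier; the function signatures are assumptions, not the real S3P interfaces.

    # Sketch of the generate -> time -> select pipeline (illustrative).
    def s3p_engine(stages, mappings, hw, time_task, time_conduit, budget):
        # 1. Map Generator: enumerate nodes (task mappings) and edges (conduits).
        nodes = [(i, m) for i, s in enumerate(stages) for m in mappings[s]]
        edges = [(a, b) for a in nodes for b in nodes if b[0] == a[0] + 1]
        # 2. Map Timer: measure every candidate task and conduit mapping.
        node_time = {n: time_task(n) for n in nodes}
        edge_time = {e: time_conduit(e) for e in edges}
        # 3. Map Selector: search the graph for the best feasible path,
        #    e.g. with the min-bottleneck search sketched earlier.
        return bottleneck_search(stages, mappings, node_time,
                                 edge_time, hw, budget)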
Outline • Introduction • Design • Demonstration • Results (Simulated/Predicted/Measured, Optimal Mappings, Validation and Verification) • Summary
Optimal Throughput
[Table: measured stage and conduit times for the Input, Low Pass Filter, Beamform, and Matched Filter tasks mapped onto 1-4 CPUs, with the best mappings highlighted — 30 msec per pulse (1.6 MHz BW) and 15 msec per pulse (3.2 MHz BW)]
• Vary the number of processors used on each stage
• Time each computation stage and communication conduit
• Find the path with the minimum bottleneck
S3P Timings (4 CPU max)
[Figure: timing bars for the Input, Low Pass Filter, Beamform, and Matched Filter tasks at 1-4 CPUs]
• Graphical depiction of timings (wider is better)
S3P Timings (12 CPU max)
[Figure: timing bars for the Input, Low Pass Filter, Beamform, and Matched Filter tasks at 2-12 CPUs; wider is better]
• The large amount of timing data requires an algorithm to find the best path
Predicted and Achieved Latency (4-8 CPU max)
[Figure: latency (sec) vs. maximum number of processors for the small (48x4K) and large (48x128K) problem sizes]
• Find the path that produces the minimum latency for a given number of processors
• Excellent agreement between S3P-predicted and achieved latencies
Predicted and Achieved Throughput (4-8 CPU max)
[Figure: throughput (pulses/sec) vs. maximum number of processors for the small (48x4K) and large (48x128K) problem sizes]
• Find the path that produces the maximum throughput for a given number of processors
• Excellent agreement between S3P-predicted and achieved throughput
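The latency and throughput searches run over the same timed system graph but score a path differently: latency sums every task and conduit time along the path, while pipeline throughput is the reciprocal of the worst element. A small Python sketch of the two objectives, reusing the hypothetical node_time/edge_time layout from the System Graph sketch (counting conduits toward the bottleneck is an assumption):

    # Sketch: the two objectives evaluated on one candidate path.
    # path_nodes = [(stage, mapping), ...]; layout as sketched earlier.
    def path_latency(path_nodes, node_time, edge_time):
        lat = sum(node_time[n] for n in path_nodes)
        lat += sum(edge_time[(path_nodes[i], path_nodes[i + 1])]
                   for i in range(len(path_nodes) - 1))
        return lat                      # one pulse's end-to-end time

    def path_throughput(path_nodes, node_time, edge_time):
        times = [node_time[n] for n in path_nodes]
        times += [edge_time[(path_nodes[i], path_nodes[i + 1])]
                  for i in range(len(path_nodes) - 1)]
        return 1.0 / max(times)         # rate set by the bottleneck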
SMP Results (16 CPU max)
[Figure: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) problem size]
• SMP overstresses Linux's real-time capabilities
• Poor overall system performance
• Divergence between predicted and measured throughput
Simulated (128 CPU max)
[Figure: throughput (pulses/sec) and latency (sec) vs. maximum number of processors for the small (48x4K) problem size]
• Simulator allows exploration of larger systems
Reducing the Search Space: Algorithm Comparison
[Figure: number of timings required vs. maximum number of processors for each search algorithm]
• Graph algorithms provide baseline performance
• Hill-climbing performance varies as a function of initialization and neighborhood definition (see the sketch below)
• The preprocessor outperforms all other algorithms
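To show why initialization and neighborhood definition matter, here is a hedged Python sketch of hill climbing over per-stage processor counts; the state encoding, the one-processor-move neighborhood, and the iteration budget are all assumptions for illustration. Each call to measure() corresponds to one set of timings, the quantity plotted on the figure's vertical axis.

    # Sketch of hill climbing over system mappings (one of the search
    # strategies compared above; neighborhood and start are assumptions).
    import random

    def hill_climb(stages, measure, budget, start=None, iters=200):
        state = start or [1] * len(stages)       # initialization matters
        best = measure(state)                    # one timing run per call
        for _ in range(iters):
            # neighborhood: move one processor onto or off one stage
            nbr = list(state)
            i = random.randrange(len(nbr))
            nbr[i] = max(1, nbr[i] + random.choice([-1, 1]))
            if sum(nbr) > budget:
                continue                         # hardware constraint
            score = measure(nbr)
            if score < best:                     # e.g. bottleneck time
                state, best = nbr, score
        return state, best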
Future Work
• Program area • Determine how to enable global optimization in other middleware efforts (e.g. PCA, HPEC-SI, …)
• Hardware area • Scale and demonstrate on a larger, real-time system • HPCMO Mercury system at WPAFB • Expect even better results than on the Linux cluster • Apply to parallel hardware • MIT/LCS RAW
• Algorithm area • Exploit ways of reducing the search space • Provide solution "families" via sensitivity analysis
Outline • Introduction • Design • Demonstration • Results • Summary
Summary • System-level constraints (latency, throughput, hardware size, …) necessitate system-level optimization • Application requirements for system-level optimization are • Decomposable into components (input, filtering, output, …) • Mappable to different configurations (# processors, # links, …) • Measurable resource usage (time, memory, …) • S3P demonstrates that global optimization is feasible and can be performed separately from the application