
S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory. 27 September 2001, HPEC Workshop, Lexington, MA.


Presentation Transcript


  1. S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
     Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory
     27 September 2001, HPEC Workshop, Lexington, MA
     This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.

  2. Outline • Introduction (Problem Statement, S3P Program) • Design • Demonstration • Results • Summary

  3. PCA Need: System Level Optimization
     [Figure: PCA layers (Software, Morphware, Hardware) running Applications A and B; a Signal Processing Application made up of PCA components: Beamform (XOUT = w*XIN), Filter (XOUT = FIR(XIN)), Detect (XOUT = |XIN| > c)]
     • Applications built with components
     • Components have a defined scope
     • Capable of local optimization
     • System requires global optimization
     • Not visible to components
     • Too complex to add to application
     • Need system level optimization capabilities as part of PCA

  4. Example: Optimum System Latency
     [Figure: left panel, Component Latency vs. Hardware Units (N) for Beamform (Latency = 2/N) and Filter (Latency = 1/N), with the Local Optimum marked; right panel, System Latency over Beamform Hardware and Filter Hardware, with the constraints Hardware < 32 and Latency < 8 and the Global Optimum marked]
     • Simple two component system
     • Local optimum fails to satisfy global constraints
     • Need system view to find global optimum
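
The two-component example on this slide can be worked through numerically. The sketch below takes the component latency models from the slide (Beamform 2/N, Filter 1/N) and assumes, for illustration only, that system latency is the sum of the two component latencies and that the hardware budget is 31 units (from "Hardware < 32"). The function names are hypothetical and not part of S3P.

```python
def beamform_latency(n):
    # Latency model from the slide: Beamform latency = 2/N
    return 2.0 / n


def filter_latency(n):
    # Latency model from the slide: Filter latency = 1/N
    return 1.0 / n


def system_latency(n_beam, n_filt):
    # Assumption: the two stages run back to back, so latencies add
    return beamform_latency(n_beam) + filter_latency(n_filt)


def best_allocation(max_units):
    # Enumerate every split of the hardware budget between the two stages
    candidates = [
        (n_beam, n_filt)
        for n_beam in range(1, max_units)
        for n_filt in range(1, max_units - n_beam + 1)
    ]
    return min(candidates, key=lambda c: system_latency(*c))


best = best_allocation(31)
print(best, system_latency(*best))
# An even split uses the same hardware but gives a higher system latency
print((15, 16), system_latency(15, 16))
```

Under these assumptions the best split gives more hardware to the slower Beamform stage, which is a decision neither component can make from its own latency curve alone.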

  5. System Optimization Challenge
     [Figure: Signal Processing Application (Beamform XOUT = w*XIN, Filter XOUT = FIR(XIN), Detect XOUT = |XIN| > c) mapped through Optimal Resource Allocation (Latency, Throughput, Memory, Bandwidth …) onto a Compute Fabric (Cluster, FPGA, SOC …)]
     • Optimizing to system constraints requires two way component/system knowledge exchange
     • Need a framework to mediate exchange and perform system level optimization

  6. S3P Lincoln Internal R&D Program
     • Goal: applications that self-optimize to any hardware
     • Combine LL system expertise and LCS FFTW approach
     [Figure: Parallel Signal Processing (Kepner/Hoffmann, Lincoln) and Self-Optimizing Software (Leiserson/Frigo, MIT LCS) feed the S3P Framework, which takes Algorithm Stages 1, 2, …, N and Processor Mappings 1, 2, …, M, performs Time & Verify, and yields the Best Mappings]
     S3P brings self-optimizing (FFTW) approach to parallel signal processing systems
     • Framework exploits graph theory abstraction
     • Broadly applicable to system optimization problems
     • Defines clear component and system requirements

  7. Outline • Introduction • Design (Requirements, Graph Theory) • Demonstration • Results • Summary

  8. System Requirements
     [Figure: signal processing chain Beamform (XOUT = w*XIN), Filter (XOUT = FIR(XIN)), Detect (XOUT = |XIN| > c)]
     Decomposable into Tasks (computation) and Conduits (communication)
     Mappable to different sets of hardware
     Measurable resource usage of each mapping
     • Each compute stage can be mapped to different sets of hardware and timed

  9. System Graph
     [Figure: graph with one column of nodes per task: Beamform, Filter, Detect]
     Node is a unique mapping of a task
     Edge is a conduit between a pair of task mappings
     • System Graph can store the hardware resource usage of every possible Task & Conduit
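
One way to picture the system graph described on this slide is as two lookup tables, one keyed by node (a task together with one of its mappings) and one keyed by edge (a pair of adjacent task mappings). The sketch below is illustrative only: the `SystemGraph` class, its method names, and the timing values are hypothetical and do not come from the S3P code.

```python
from dataclasses import dataclass, field
from itertools import product


@dataclass
class SystemGraph:
    stages: list                                   # ordered task names
    node_time: dict = field(default_factory=dict)  # (task, cpus) -> seconds
    edge_time: dict = field(default_factory=dict)  # (node, node) -> seconds

    def add_mapping(self, task, cpus, seconds):
        # A node is one unique mapping of a task
        self.node_time[(task, cpus)] = seconds

    def add_conduit(self, src, dst, seconds):
        # An edge is a conduit between a pair of task mappings
        self.edge_time[(src, dst)] = seconds

    def mappings(self, task):
        return [node for node in self.node_time if node[0] == task]

    def paths(self):
        # Each path picks one mapping per task: a complete system mapping
        return list(product(*(self.mappings(t) for t in self.stages)))


graph = SystemGraph(stages=["Beamform", "Filter", "Detect"])
for task in graph.stages:
    for cpus in (1, 2):
        graph.add_mapping(task, cpus, seconds=1.0 / cpus)  # made-up times
for a, b in zip(graph.stages, graph.stages[1:]):
    for src in graph.mappings(a):
        for dst in graph.mappings(b):
            graph.add_conduit(src, dst, seconds=0.1)       # made-up times

print(len(graph.node_time), len(graph.edge_time), len(graph.paths()))
```

With three tasks and two mappings each, the graph holds 6 nodes and 8 edges but already describes 8 complete system mappings, which is why storing per-node and per-edge measurements is cheaper than timing every full mapping.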

  10. Path = System Mapping
      [Figure: system graph over Beamform, Filter, Detect]
      Each path is a complete system mapping
      “Best” Path is the optimal system mapping
      • Graph construct is very general and widely used for optimization problems
      • Many efficient techniques for choosing “best” path (under constraints), such as Dynamic Programming

  11. Example: Maximize Throughput
      [Figure: system graph over Beamform, Filter, Detect annotated with task and conduit times; rows further down the graph use More Hardware]
      Node stores task time for each mapping
      Edge stores conduit time for a given pair of mappings
      • Goal: Maximize throughput and minimize hardware
      • Choose path with the smallest bottleneck that satisfies hardware constraint
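
The selection rule on this slide (smallest bottleneck subject to a hardware constraint) can be sketched by brute-force enumeration of paths. All timing values below are invented for illustration and are not the values from the slide's figure; `best_bottleneck_path` is a hypothetical helper, and ties on the bottleneck are broken in favor of less hardware to reflect the "minimize hardware" goal.

```python
from itertools import product


def best_bottleneck_path(stages, task_time, conduit_time, max_cpus):
    """Return (path, (bottleneck, cpus)) for the best feasible path.

    task_time maps (task, cpus) to seconds; conduit_time maps a pair of
    adjacent (task, cpus) nodes to seconds.
    """
    options = [[node for node in task_time if node[0] == t] for t in stages]
    best_path, best_key = None, None
    for path in product(*options):
        cpus = sum(c for _, c in path)
        if cpus > max_cpus:
            continue  # violates the hardware constraint
        times = [task_time[node] for node in path]
        times += [conduit_time[(a, b)] for a, b in zip(path, path[1:])]
        # Throughput is limited by the slowest task or conduit on the path
        key = (max(times), cpus)
        if best_key is None or key < best_key:
            best_path, best_key = path, key
    return best_path, best_key


stages = ["Beamform", "Filter", "Detect"]
task_time = {
    ("Beamform", 1): 4.0, ("Beamform", 2): 2.0,
    ("Filter", 1): 6.0, ("Filter", 2): 3.0,
    ("Detect", 1): 2.0, ("Detect", 2): 1.0,
}
conduit_time = {
    ((a, ca), (b, cb)): 1.0
    for a, b in zip(stages, stages[1:])
    for ca in (1, 2)
    for cb in (1, 2)
}

path, (bottleneck, cpus) = best_bottleneck_path(stages, task_time, conduit_time, 4)
print(path, bottleneck, cpus)
```

Throughput is the reciprocal of the bottleneck time, so minimizing the bottleneck maximizes throughput. Exhaustive enumeration is only practical for small graphs; the path-finding algorithms on the next slide address larger ones.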

  12. Path Finding Algorithms
      • Graph construct is very general
      • Widely used for optimization problems
      • Many efficient techniques for choosing “best” path (under constraints) such as Dijkstra’s Algorithm and Dynamic Programming

      Dynamic Programming:

        N = total hardware units
        M = number of tasks
        Pi = number of mappings for task i

        t = M
        pathTable[M][N] = all infinite weight paths
        for( j:1..M ){
          for( k:1..Pj ){
            for( i:j+1..N-t+1 ){
              if( i-size[k] >= j ){
                if( j > 1 ){
                  w = weight[pathTable[j-1][i-size[k]]] + weight[k] + weight[edge[last[pathTable[j-1][i-size[k]]],k]]
                  p = addVertex[pathTable[j-1][i-size[k]], k]
                }else{
                  w = weight[k]
                  p = makePath[k]
                }
                if( weight[pathTable[j][i]] > w ){
                  pathTable[j][i] = p
                }
              }
            }
          }
          t = t - 1
        }

      Dijkstra’s Algorithm:

        Initialize Graph G
        Initialize source vertex s
        Store all vertices of G in a minimum priority queue Q
        while (Q is not empty)
          u = pop[Q]
          for (each vertex v, adjacent to u)
            w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
            if(v.totalPathWeight() > w)
              v.totalPathWeight() = w
              v.predecessor() = u
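
The Dijkstra pseudocode on this slide differs from the textbook form in that vertices carry weights as well as edges. A minimal runnable sketch of that variant follows. It is not the S3P implementation: the function names, the `src` and `sink` helper vertices, and the timing values are assumptions for illustration. It also pushes vertices onto the queue lazily rather than storing all of them up front, and it minimizes the summed path weight without enforcing any hardware budget (the slide's dynamic programming table is what handles that constraint).

```python
import heapq


def dijkstra_node_weighted(node_weight, edges, source):
    """Shortest paths where both vertices and edges carry weights.

    node_weight maps vertex -> weight; edges maps vertex -> list of
    (neighbor, edge_weight). Returns (total_path_weight, predecessor).
    """
    total = {v: float("inf") for v in node_weight}
    pred = {v: None for v in node_weight}
    total[source] = node_weight[source]
    queue = [(total[source], source)]
    while queue:
        dist, u = heapq.heappop(queue)
        if dist > total[u]:
            continue  # stale queue entry, a shorter path was already found
        for v, w_edge in edges.get(u, []):
            # Same update as the slide: path so far + edge <u,v> + weight of v
            w = dist + w_edge + node_weight[v]
            if w < total[v]:
                total[v] = w
                pred[v] = u
                heapq.heappush(queue, (w, v))
    return total, pred


def path_to(pred, target):
    # Walk predecessor links back to the source
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]


# Made-up task times (vertices) and conduit times (edges)
node_weight = {"src": 0.0, "B1": 4.0, "B2": 2.0, "F1": 6.0, "F2": 3.0, "sink": 0.0}
edges = {
    "src": [("B1", 0.0), ("B2", 0.0)],
    "B1": [("F1", 1.0), ("F2", 2.0)],
    "B2": [("F1", 2.0), ("F2", 1.0)],
    "F1": [("sink", 0.0)],
    "F2": [("sink", 0.0)],
}
total, pred = dijkstra_node_weighted(node_weight, edges, "src")
print(path_to(pred, "sink"), total["sink"])
```

Here the total path weight plays the role of system latency, so the returned path is the minimum-latency mapping for this toy graph.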

  13. S3P Inputs and Outputs
      [Figure: inputs to the S3P Framework, each marked Required or Optional: Application, Algorithm Information, System Constraints, Hardware Information; output: “Best” System Mapping]
      • Can flexibly add information about
        • Application
        • Algorithm
        • System
        • Hardware

  14. Outline • Introduction • Design • Demonstration (Application, Middleware, Hardware, S3P) • Results • Summary

  15. S3P Demonstration Testbed
      • Multi-Stage Application: Input, Low Pass Filter, Beamform, Matched Filter
      • Middleware (PVL): Task, Conduit, Map
      • S3P Engine
      • Hardware (Workstation Cluster)

  16. Multi-Stage Application
      [Figure: signal flow through Input, Low Pass Filter, Beamform and Matched Filter, each stage taking XIN and producing XOUT, built from the kernels FIR1, FIR2, mult, FFT and IFFT with weights W1 to W4]
      Features
      • “Generic” radar/sonar signal processing chain
      • Utilizes key kernels (FIR, matrix multiply, FFT and corner turn)
      • Scalable to any problem size (fully parameterized algorithm)
      • Self validates (built-in target generator)

  17. Parallel Vector Library (PVL)
      Class | Description | Parallelism
      Signal Processing & Control:
        Matrix/Vector | Used to perform matrix/vector algebra on data spanning multiple processors | Data
        Computation | Performs signal/image processing functions on matrices/vectors (e.g. FFT, FIR, QR) | Data & Task
        Task | Supports algorithm decomposition (i.e. the boxes in a signal flow diagram) | Task & Pipeline
        Conduit | Supports data movement between tasks (i.e. the arrows on a signal flow diagram) | Task & Pipeline
      Mapping:
        Map | Specifies how Tasks, Matrices/Vectors, and Computations are distributed on processor | Data, Task & Pipeline
        Grid | Organizes processors into a 2D layout |
      • Simple mappable components support data, task and pipeline parallelism

  18. Hardware Platform
      • Network of 8 Linux workstations
        • Dual 800 MHz Pentium III processors
      • Communication
        • Gigabit ethernet, 8-port switch
        • Isolated network
      • Software
        • Linux kernel release 2.2.14
        • GNU C++ Compiler
        • MPICH communication library over TCP/IP
      Advantages
      • Software tools
      • Widely available
      • Inexpensive (high Mflops/$)
      • Excellent rapid prototyping platform
      Disadvantages
      • Non real-time OS
      • Non real-time messaging
      • Slower interconnect
      • Difficult to model
      • SMP behavior erratic

  19. S3P Engine
      [Figure: inputs Application Program, Algorithm Information, System Constraints and Hardware Information enter the S3P Engine (MapGenerator, MapTimer, MapSelector), which outputs the “Best” System Mapping]
      • Map Generator constructs the system graph for all candidate mappings
      • Map Timer times each node and edge of the system graph
      • Map Selector searches the system graph for the optimal set of maps
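
The three engine stages named on this slide can be read as a generate, time, select pipeline. The sketch below illustrates that flow under stated assumptions and is not the S3P engine's API: the function names, the `measure_task` and `measure_conduit` callbacks, the cost model, and the choice of minimum latency as the selection criterion are all hypothetical.

```python
from itertools import product


def generate_graph(stages, cpu_choices):
    # Map Generator: one node per (stage, cpus), one edge per adjacent pair
    nodes = [(s, c) for s in stages for c in cpu_choices]
    edges = [
        ((a, ca), (b, cb))
        for a, b in zip(stages, stages[1:])
        for ca in cpu_choices
        for cb in cpu_choices
    ]
    return nodes, edges


def time_graph(nodes, edges, measure_task, measure_conduit):
    # Map Timer: record a time for every node and every edge
    node_time = {n: measure_task(*n) for n in nodes}
    edge_time = {e: measure_conduit(*e) for e in edges}
    return node_time, edge_time


def select_map(stages, cpu_choices, node_time, edge_time, max_cpus):
    # Map Selector: exhaustive search for the minimum-latency feasible path
    best = None
    for combo in product(cpu_choices, repeat=len(stages)):
        if sum(combo) > max_cpus:
            continue
        path = list(zip(stages, combo))
        latency = sum(node_time[n] for n in path)
        latency += sum(edge_time[(a, b)] for a, b in zip(path, path[1:]))
        if best is None or latency < best[0]:
            best = (latency, path)
    return best


stages = ["Low Pass Filter", "Beamform", "Matched Filter"]
cpu_choices = (1, 2, 4)
work = {"Low Pass Filter": 8.0, "Beamform": 4.0, "Matched Filter": 12.0}  # made up

nodes, edges = generate_graph(stages, cpu_choices)
node_time, edge_time = time_graph(
    nodes,
    edges,
    measure_task=lambda stage, cpus: work[stage] / cpus,  # stand-in for a real timing run
    measure_conduit=lambda src, dst: 0.5,                 # stand-in for a real timing run
)
print(select_map(stages, cpu_choices, node_time, edge_time, max_cpus=8))
```

In a real run the two callbacks would execute the stage or conduit on the target hardware and return the measured time. Keeping timing separate from selection means the graph is measured once and can then be searched under different constraints.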

  20. Outline • Introduction • Design • Demonstration • Results (Simulated/Predicted/Measured, Optimal Mappings, Validation and Verification) • Summary

  21. Optimal Throughput
      [Figure: system graph over Input, Low Pass Filter, Beamform and Matched Filter with 1 CPU, 2 CPU, 3 CPU and 4 CPU mappings per stage, annotated with measured task and conduit times; highlighted paths: Best 30 msec (1.6 MHz BW) and Best 15 msec (3.2 MHz BW)]
      • Vary number of processors used on each stage
      • Time each computation stage and communication conduit
      • Find path with minimum bottleneck

  22. S3P Timings (4 cpu max)
      [Figure: timings for the Tasks Input, Low Pass Filter, Beamform and Matched Filter at 1 CPU, 2 CPU, 3 CPU and 4 CPU]
      • Graphical depiction of timings (wider is better)

  23. S3P Timings (12 cpu max) (wider is better)
      [Figure: timings for the Tasks Input, Low Pass Filter, Beamform and Matched Filter at 2 CPU, 4 CPU, 6 CPU, 8 CPU and 12 CPU]
      • Large amount of data requires algorithm to find best path

  24. Predicted and Achieved Latency (4-8 cpu max)
      [Figure: two panels, Small (48x4K) Problem Size and Large (48x128K) Problem Size; Latency (sec) vs. Maximum Number of Processors]
      • Find path that produces minimum latency for a given number of processors
      • Excellent agreement between S3P predicted and achieved latencies

  25. Predicted and Achieved Throughput (4-8 cpu max)
      [Figure: two panels, Small (48x4K) Problem Size and Large (48x128K) Problem Size; Throughput (pulses/sec) vs. Maximum Number of Processors]
      • Find path that produces maximum throughput for a given number of processors
      • Excellent agreement between S3P predicted and achieved throughput

  26. SMP Results (16 cpu max)
      [Figure: Large (48x128K) Problem Size; Throughput (pulses/sec) vs. Maximum Number of Processors]
      • SMP overstresses Linux Real Time capabilities
      • Poor overall system performance
      • Divergence between predicted and measured

  27. Simulated (128 cpu max)
      [Figure: two panels for the Small (48x4K) Problem Size; Throughput (pulses/sec) and Latency (sec) vs. Maximum Number of Processors]
      • Simulator allows exploration of larger systems

  28. Reducing the Search Space: Algorithm Comparison
      [Figure: Number of Timings Required vs. Maximum Number of Processors]
      • Graph algorithms provide baseline performance
      • Hill Climbing performance varies as a function of initialization and neighborhood definition
      • Preprocessor outperforms all other algorithms

  29. Future Work
      • Program area
        • Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, …)
      • Hardware area
        • Scale and demonstrate on larger/real-time system
          • HPCMO Mercury system at WPAFB
          • Expect even better results than on Linux cluster
        • Apply to parallel hardware
          • RAW
      • Algorithm area
        • Explore ways of reducing search space
        • Provide solution “families” via sensitivity analysis

  30. Outline • Introduction • Design • Demonstration • Results • Summary

  31. Summary
      • System level constraints (latency, throughput, hardware size, …) necessitate system level optimization
      • Application requirements for system level optimization are
        • Decomposable into components (input, filtering, output, …)
        • Mappable to different configurations (# processors, # links, …)
        • Measurable resource usage (time, memory, …)
      • S3P demonstrates global optimization is feasible separate from the application

  32. Acknowledgements • Matteo Frigo (MIT/LCS & Vanu, Inc.) • Charles Leiserson (MIT/LCS) • Adam Wierman (CMU)
