1 / 40

Flexible Filters for High Performance Embedded Computing

Flexible Filters for High Performance Embedded Computing. Rebecca Collins and Luca Carloni Department of Computer Science Columbia University. Motivation. Stream Programming. Stream Programming model filter : a piece of sequential programming channels : how filters communicate

eljah
Download Presentation

Flexible Filters for High Performance Embedded Computing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University

  2. Motivation

  3. Stream Programming • Stream Programming model • filter: a piece of sequential programming • channels: how filters communicate • token: an indivisible unit of data for a filter • Examples: Signal processing, image processing, embedded applications A B C

  4. Example: Constant False-Alarm Rate (CFAR) Detection guard cells cell under test • with additional factors • number of gates • threshold (μ) • number of guard cells • rows and other dimensional data compare to HPEC Challenge http://www.ll.mit.edu/HPECchallenge/

  5. CFAR Benchmark uInt to float right window left window align data find targets square add Data Stream: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  6. CFAR Benchmark uInt to float right window left window align data find targets square 1 add 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  7. CFAR Benchmark uInt to float right window left window align data find targets square 2 1 add 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  8. CFAR Benchmark 1 uInt to float right window left window align data find targets square 3 2 1 add 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

  9. CFAR Benchmark 11 10 9 8 7 6 uInt to float right window left window align data find targets 13 square 12 3 2 1 11 10 9 8 add t cell under test right window left window 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 t

  10. core 1 core 2 core 3 A B C multi-core chip Mapping A B C Throughput: rate of processing data tokens

  11. multi-core chip Mapping core 1 core 3 A B C Throughput: rate of processing data tokens A C B

  12. Unbalanced Flow • Bottlenecks reduce throughput • Caused by backpressure • inherent algorithmic imbalances • data-dependent computational spikes core 1 core 2 core 3 A B C “wait”

  13. Data Dependent Execution Time: CFAR • Using Set 1 from the HPEC CFAR Kernel Benchmark • Over a block of about 100 cells • Extra workload of 32 microseconds per target

  14. Some other examples: Bloom Filters (Financial, Spam detection) Compression (Image processing) Data Dependent Execution Time

  15. Our Solution: Flexible Filters flex-merge flex-split • Unused cycles on B are filled by working ahead on filter C with the data already present on B • Push stream flow upstream of a bottleneck • Semantic Preservation core 1 core 2 core 3 A B C C

  16. Related Works • Static Compiler Optimizations • StreamIt [Gordon et al, 2006] • Dynamic Runtime Load Balancing • Work Stealing/Filter Migration • [Kakulavarapu et al., 2001] • Cilk [Frigo et al., 1998] • Flux [Shah et al., 2003] • Borealis [Xing et al., 2005] • Queue based load balancing • Diamond [Huston et al., 2005] distributed search, queue based load balancing, filter re-ordering • Combination Static+Dynamic • FlexStream [Hormati et al., 2009] multiple competing programs

  17. Outline • Introduction • Design Flow of a Stream Program with Flexibility • Performance • Implementation of Flexible Filters • Experiments • CFAR Case Study

  18. Design Flow of a Stream Program with Flexible Filters Design stream algorithm • Mapping • Filters • Memory Profile Add Flexibility to Bottlenecks compiler/ design tools

  19. Design Considerations Design stream algorithm • Mapping • Filters • Memory Profile Add Flexibility to Bottlenecks

  20. Adding Flexibility Design stream algorithm • Mapping • Filters • Memory Profile Add Flexibility to Bottlenecks

  21. Outline • Introduction • Design Flow of a Stream Program with Flexibility • Performance • Implementation of Flexible Filters • Experiments • CFAR Case Study

  22. Mapping Stream Programs to Multi-Core Platforms core 1 core 1 core 2 core 3 A B C A B C core 2 pipeline mapping A B C core 1 core 2 core 3 A B C A B C sharing a core SPMD mapping

  23. EB=2 EC=3 EA=2 A1 B1 C1 A2 C2 B2 B3 A3 C3 Throughput: SPMD Mapping core 1 Suppose EA = 2, EB = 2, EC = 3 A B C 1 core 2 A B C 2 core 3 A B C 3 3 tokens processed in 7 timesteps, ideal throughput = 3/7 = 0.429 SPMD mapping

  24. 1 3 2 2 1 3 4 2 A3 A2 A4 B1 B2 B3 C1 C2 Throughput: Pipeline Mapping core 3 core 1 core 2 C A B 1 latency 2 latency 3 latency 2 A1 throughput = 1/3 = 0.333 < 0.429

  25. Data Blocks core 1 core 2 core 3 A B C C • data block: group of data tokens

  26. 1 2 3 2 1 4 3 2 A3 A2 A4 B1 B2 B3 C1 C2 Throughput: Pipeline Augmented with Flexibility core 3 core 1 core 2 C A B 1 C A1 C2 throughput = 2/5 = 0.4 < 0.429 (but > 0.333)

  27. Outline • Introduction • Design Flow of a Stream Program with Flexibility • Performance • Implementation of Flexible Filters • Experiments • CFAR Case Study

  28. Flex-Split Flex-split pop data block b from in n0 = available space on out0 n1 = |b| - n0 send n0 to out0, n1 to out1 send n0 0’s, then n1 1’s to select core 2 core 3 B C C select in0 out0 in out C flex split flex merge in1 out1 C • maintain ordering • based on run-time state of queues

  29. Flex-Merge Flex-merge pop i from select if i is 0, pop token from in0 if i is 1, pop token from in1 push token to out core 2 core 3 B C C select in0 out0 in out Overhead of Flexibility? C flex split flex merge in1 out1 C

  30. output channel 1 output channel 1 flex merge flex merge flex merge filterflex output channel 2 output channel 2 filter flex split output channel n output channel n Multi-Channel Flex-Split and Flex-Merge … select

  31. filter Multi-Channel Flex-Split and Flex-Merge input channel 1 input channel 2 … input channel n

  32. flex merge filterflex filter flex split Multi-Channel Flex-Split and Flex-Merge input channel 1 Centralized input channel 2 … input channel n select

  33. flex merge flex merge filterflex filterflex filter filter flex split flex split βflex split βflex split select Multi-Channel Flex-Split and Flex-Merge input channel 1 Centralized input channel 2 … input channel n select Distributed input channel 1 input channel 2 … input channel n

  34. Outline • Introduction • Design Flow of a Stream Program with Flexibility • Performance • Implementation of Flexible Filters • Experiments • CFAR Case Study

  35. Cell BE Processor • Distributed Memory • Heterogeneous • 8 SIMD (SPU) cores • 1 PowerPC (PPU) • Element Interconnect Bus • 4 rings • 205 Gb/s • Gedae Programming Language Communication Layer

  36. Gedae • Commercial data-flow language and programming GUI • Performance analysis tools

  37. CFAR Benchmark uInt to float right window left window align data find targets square add

  38. By changing threshold, change % targets 1.3 % 7.3 % Additional workload per target 16 µs 32 µs 64 µs Data Dependency % Targets Additional Workload

  39. More Benchmarks

  40. Conclusions • Flexible filters • adapt to data dependent bottlenecks • distributed load balancing • provide speedup without modification to original filters • can be implemented on top of general stream languages

More Related