Enabling Multithreading on CGRAs
Aviral Shrivastava¹, Jared Pager¹, Reiley Jeyapaul¹, Mahdi Hamzeh¹ ², Sarma Vrudhula²
¹Compiler Microarchitecture Lab, ²VLSI Electronic Design Automation Laboratory, Arizona State University, Tempe, Arizona, USA

Need for High Performance Computing
[Figure labels: petaflop, zettaflop]
• Applications that need high performance computing:
  • Weather and geophysical simulation
  • Genetic engineering
  • Multimedia streaming

Need for Power-efficient Performance
[Figure callouts: "2.3% of US Electrical Consumption", "$4 Billion Electricity charges", "ITRS 2010"]
• Power requirements limit the aggressive scaling trends in processor technology
• In high-end servers:
  • Power consumption doubles every 5 years
  • The cost of cooling increases at a similar rate

Accelerators Can Help Achieve Power-efficient Performance
• Power-critical computations can be off-loaded to accelerators
  • Perform application-specific operations
  • Achieve high throughput without loss of CPU programmability
• Existing examples:
  • Hardware accelerator: Intel SSE
  • Reconfigurable accelerator: FPGA
  • Graphics accelerator: NVIDIA Tesla (Fermi GPU)

CGRA: Power-efficient Accelerator
[Figure: an array of PEs with a local instruction memory, a local data memory, and main system memory. Each PE contains a functional unit (FU) and a register file (RF), takes inputs from neighbors and memory, and sends outputs to neighbors and memory.]
PEs communicate through an interconnect network.
• Distinguishing characteristics:
  • Flexible programming
  • High performance
  • Power-efficient computing
• Cons:
  • Compiling a program for a CGRA is difficult
    • Not all applications can be compiled
    • No standard CGRA architecture
    • Requires extensive compiler support for general-purpose computing

Mapping a Kernel onto a CGRA
Loop:
    t1 = (a[i]+b[i])*c[i]
    d[i] = ~t1 & 0xFFFF
[Figure: data-dependency graph (DDG) of the loop, nodes 1 to 9, and its spatial mapping and temporal scheduling onto the PE array]
Given the kernel's DDG:
• Mark source and destination nodes
• Assume a CGRA architecture
• Place all nodes on the PE array
  • Dependent nodes are placed close to their sources
  • Ensure dependent nodes have interconnects to their sources
• Map time slots for each PE execution
  • Dependent nodes cannot execute before their source nodes
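
The scheduling constraint above (a dependent node never runs before its sources) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' mapper: the node names (`ld_a`, `add`, and so on) and the eight-node graph are hypothetical, since the operations behind the slide's nine numbered DDG nodes are not recoverable here, and the sketch only assigns earliest time slots while ignoring PE placement and interconnects.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical DDG for the loop body: node -> set of source (predecessor) nodes.
DDG = {
    "ld_a": set(), "ld_b": set(), "ld_c": set(),
    "add": {"ld_a", "ld_b"},   # a[i] + b[i]
    "mul": {"add", "ld_c"},    # (...) * c[i]
    "not": {"mul"},            # ~t1
    "and": {"not"},            # ... & 0xFFFF
    "st_d": {"and"},           # store to d[i]
}

def asap_schedule(ddg):
    """Assign each node the earliest slot that is after all of its sources."""
    slot = {}
    for node in TopologicalSorter(ddg).static_order():
        slot[node] = max((slot[src] + 1 for src in ddg[node]), default=0)
    return slot

def kernel(a, b, c):
    """Reference semantics of the loop, for checking a mapped schedule's output."""
    return [(~((x + y) * z)) & 0xFFFF for x, y, z in zip(a, b, c)]
```

A real mapper would additionally pick a PE for every node and verify that each edge of the DDG is backed by an interconnect between the chosen PEs.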

Mapped Kernel Executed on the CGRA
Loop:
    t1 = (a[i]+b[i])*c[i]
    d[i] = ~t1 & 0xFFFF
[Figure: the loop's DDG mapped onto the PE array, shown over execution time slots (cycles) 0 to 7]
• After cycle 6, one iteration of the loop completes execution every cycle
• The entire kernel can be mapped onto the CGRA by unrolling 6 times
• Iteration Interval (II) is a measure of mapping quality
• Here, Iteration Interval = 1
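
To make the role of II concrete, the sketch below uses the standard software-pipelining estimate: the first iteration takes the full schedule latency, and each later iteration finishes II cycles after the previous one. The function name and the latency value used in the example are illustrative assumptions, not figures taken from the slides.

```python
def total_cycles(iterations, ii, latency):
    """Estimate cycles to complete `iterations` iterations of a pipelined loop.

    ii      -- iteration interval: cycles between starts of successive iterations
    latency -- cycles one iteration needs from first to last operation (assumed)
    """
    if iterations <= 0:
        return 0
    return latency + (iterations - 1) * ii

# Example with an assumed latency of 7 cycles: halving II nearly halves run time.
cycles_ii1 = total_cycles(100, ii=1, latency=7)
cycles_ii2 = total_cycles(100, ii=2, latency=7)
```

For long loops the latency term becomes negligible, which is why a smaller II is treated as the measure of mapping quality.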

Traditional Use of CGRAs
[Figure: application input flows through a 4x4 array of PEs (E0 to E15) to application output]
• An application is mapped onto the CGRA
• System inputs are given to the application
• Power-efficient application execution is achieved
• Generally used for streaming applications
  • ADRES, MorphoSys, KressArray, RSPA, DART

Envisioned Use of CGRAs
[Figure: a processor runs a program thread and off-loads the kernel to accelerate to a co-processor, a 4x4 array of PEs (E0 to E15)]
• Specific kernels in a thread can be power- or performance-critical
• The kernel can be mapped and scheduled for execution on the CGRA
• Using the CGRA as a co-processor (accelerator):
  • Power-consuming processor execution is saved
  • Better thread performance is realized
  • Overall throughput is increased

CGRA as an Accelerator
[Figure: schedules S1, S2, S3 executed one at a time on a 4x4 array of PEs (E0 to E15)]
Not all PEs are used in each schedule. Thread stalls create a performance bottleneck.
• Application: single thread
  • The entire CGRA is used to schedule each kernel of the thread
  • Only a single thread is accelerated at a time
• Application: multiple threads
  • The entire CGRA is used to accelerate each individual kernel
  • If multiple threads require simultaneous acceleration:
    • Threads must be stalled
    • Kernels are queued to be run on the CGRA

Proposed Solution: Multi-threading on the CGRA
[Figure: schedules S1, S2, S3 and their resized versions S2', S3' sharing a 4x4 array of PEs (E0 to E15), in three states]
  • Threads 1, 2: maximum CGRA utilization
  • Threads 1, 2, 3: shrink-to-fit mapping, maximizing performance
  • Threads 2, 3: expand to maximize CGRA utilization and performance
• Through program compilation and scheduling:
  • Schedule the application onto a subset of PEs, not the entire CGRA
  • Enable dynamic multi-threading without re-compilation
• Facilitate simultaneous execution of multiple schedules:
  • Can increase total CGRA utilization
  • Reduces overall power consumption
  • Increases multi-threaded system throughput

Our Multithreading Technique
• Static compile-time constraints to enable fast run-time transformations
  • Has minimal effect on performance (II)
  • Increases compile time
• Perform fast dynamic transformations
  • Takes linear time to complete with respect to kernel II
  • All schedules are treated independently
Features:
• Dynamic multithreading enabled in linear runtime
• No additional hardware modifications
  • Requires supporting PE interconnects in the CGRA topology
• Works with current mapping algorithms
  • The algorithm must allow for custom PE interconnects

Hardware Abstraction: CGRA Paging
[Figure: a 4x4 array of PEs (e0 to e15) grouped into pages P0 to P3, with local instruction memory, local data memory, and main system memory]
• Page: a conceptual group of PEs
• A page has symmetrical connections to each of the neighboring pages
• No additional hardware 'feature' is required
• Page-level interconnects follow a ring topology
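
A minimal sketch of the paging abstraction, assuming pages are formed from consecutive row-major PE indices (the actual page shapes in the figure may differ) and that ring neighbors are simply the previous and next page index. The function names are hypothetical.

```python
def make_pages(num_pes, pes_per_page):
    """Group PE indices 0..num_pes-1 into equally sized pages (assumed layout)."""
    if num_pes % pes_per_page != 0:
        raise ValueError("PEs do not divide evenly into pages")
    return [list(range(start, start + pes_per_page))
            for start in range(0, num_pes, pes_per_page)]

def ring_neighbors(page, num_pages):
    """Return the (previous, next) page of `page` in a ring of `num_pages` pages."""
    return ((page - 1) % num_pages, (page + 1) % num_pages)

pages = make_pages(16, 4)  # a 4x4 CGRA split into 4 pages of 4 PEs each
```

Because a page is only a grouping of existing PEs and links, nothing in this model requires new hardware, which matches the slide's claim.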

Step 1: Compiler Constraints Assumed during Initial Mapping
[Figure: the 9-node schedule mapped naively across a 4x4 array of PEs (e0 to e15), and mapped under paging constraints onto pages P0 to P3]
Naïve mapping could result in under-used CGRA resources. Our paging methodology helps reduce CGRA resource usage.
• Compile-time assumptions:
  • The CGRA is a collection of pages
  • Each page can interact with only one topologically neighboring page
  • Inter-PE connections within a page are unmodified
• These assumptions:
  • In most cases will not affect mapping quality
  • May help improve CGRA resource usage

Step 2: Dynamic Transformation Enabling Multiple Schedules
[Figure, three animation frames: the 9-node schedule occupying pages of a 4x4 array of PEs (e0 to e15); the pages split and rearranged; the shrunk schedule on two pages, with time slots labeled T0 to T4]
• Example:
  • Application mapped to 3 pages
  • Shrink to execute on 2 pages
• Transformation procedure:
  • Split pages
  • Arrange pages in time order
  • Mirror pages to facilitate shrinking
    • Ensures inter-node dependencies are preserved
  • Shrunk pages are executed on altered time schedules
• Constraints:
  • Inter-page dependencies should be maintained
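
The shrink step can be illustrated with a simplified model. This is a sketch under strong assumptions and should not be read as the authors' transformation: it assumes logical pages depend only on their immediate predecessor, folds them onto fewer physical pages in a zig-zag (so every other pass is mirrored), and assumes the II grows by the number of passes. All names are hypothetical.

```python
from math import ceil

def fold_pages(num_logical, num_physical):
    """Map each logical page to a (physical_page, time_round) pair.

    Odd rounds run in mirrored order, so consecutive logical pages always
    land on the same or an adjacent physical page.
    """
    placement = []
    for page in range(num_logical):
        time_round, offset = divmod(page, num_physical)
        if time_round % 2 == 0:
            physical = offset
        else:
            physical = num_physical - 1 - offset  # mirrored pass
        placement.append((physical, time_round))
    return placement

def shrunk_ii(ii, num_logical, num_physical):
    """Assumed cost model: each physical page hosts ceil(N/M) logical pages in turn."""
    return ii * ceil(num_logical / num_physical)

placement = fold_pages(3, 2)  # the slide's example: 3 pages shrunk onto 2
```

In this model the placement is computed in a single pass over the pages, and mirroring is what keeps communicating pages next to each other after the fold.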

Experiment 1: Compiler Constraints are Liberal
• Mapping quality is measured in Iteration Intervals
  • A smaller II is better
• Constraints can degrade individual benchmark performance by limiting the compiler search space
• Constraints can also improve individual benchmark performance by, ironically, limiting the compiler search space
• On average, performance is minimally impacted

Experimental Setup
[Figure: Threads 1 to 4, each with a kernel to be accelerated, run on four CPU cores that share one CGRA. Single-threaded CGRA: only ONE thread serviced. Multi-threaded CGRA: MULTIPLE threads serviced.]
• CGRA configurations used: 4x4, 6x6, 8x8
• Page configurations: 2, 4, 8 PEs per page
• Number of threads in system: 1, 2, 4, 8, 16
  • Each has a kernel to be accelerated
Experiments:
• Single-threaded CGRA
  • Each thread arrives at its "kernel"
  • The thread is stalled until the kernel executes
• Multi-threaded CGRA
  • The CGRA is used to accelerate kernels as and when they arrive
  • No thread is stalled
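
For orientation, the parameter sweep above can be enumerated as follows. The sketch assumes every combination of the three lists was run, which the slides do not state, and the dictionary keys are hypothetical. Note that 36 PEs do not divide evenly into pages of 8; the slides do not say how that case is handled, so the leftover PEs are only reported.

```python
from itertools import product

CGRA_DIMS = (4, 6, 8)             # N for an NxN CGRA
PES_PER_PAGE = (2, 4, 8)
THREAD_COUNTS = (1, 2, 4, 8, 16)

def configurations():
    """Yield one dict per (CGRA size, page size, thread count) combination."""
    for dim, page_size, threads in product(CGRA_DIMS, PES_PER_PAGE, THREAD_COUNTS):
        num_pes = dim * dim
        yield {
            "cgra": f"{dim}x{dim}",
            "pes_per_page": page_size,
            "threads": threads,
            "pages": num_pes // page_size,
            "leftover_pes": num_pes % page_size,  # non-zero only for 6x6 with 8 PEs/page
        }

configs = list(configurations())
```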

Multithreading Improves System Performance
• Number of threads accessing the CGRA: as the number of threads increases, multithreading provides increasing performance
• CGRA size: as CGRA size increases, multithreading provides better utilization and therefore better performance
• Number of PEs per page: for the set of benchmarks tested, the optimal number of PEs per page is either 2 or 4

Summary
• Power-efficient performance is the need of the future
• CGRAs can be used as accelerators
  • Power-efficient performance can be achieved
  • Usability is limited by compilation difficulties
• Multi-threaded applications need multi-threading capabilities in the CGRA
• We propose a two-step dynamic methodology:
  • Non-restrictive compile-time constraints to schedule the application into pages
  • A dynamic transformation procedure to shrink or expand the resources used by a schedule
• Features:
  • No additional hardware required
  • Improved CGRA resource usage
  • Improved system performance

Future Work
• Use CGRAs as accelerators in systems with inter-thread communication
• Study the impact of compiler constraints on compute-intensive and memory-bound benchmark applications
• Explore thread-level scheduling to improve overall performance

State-of-the-art Multi-threading on CGRAs
[Figure: Data Sets 1 to 3 each pass through pipeline stages Filter 1, Filter 2, Filter 3 to an output, mapped across Cores 1 to 8 and Mem Banks 1 to 4]
• Polymorphic Pipeline Arrays [Park 2009]
  • Enables dynamic scheduling
  • A collection of schedules makes up a kernel
  • Some schedules can be given more resources than others
• Limitations
  • The collection of schedules must be known at compile time
  • Schedules are assumed to be pipeline stages of a single kernel