Analysis of Quasi-Static Scheduling Techniques in a Virtualized Reconfigurable Machine • Yury Markovskiy, Eylon Caspi, Randy Huang, Joseph Yeh, Michael Chu, John Wawrzynek (UC Berkeley BRASS Group) • André DeHon (California Institute of Technology)
Outline • Hardware Virtualization • SCORE model • Run-time scheduler • Fully Dynamic • Quasi-Static • Results • 7x reduction in scheduling overhead • App performance improved by a factor of 2-7. • Conclusion FPGA 2002
Hardware Virtualization • Traditional Mapping Tools • Expose resource constraints to designer • HW virtualization enables: • App compatibility/longevity across a device family • Automatic performance scaling on larger devices
Stream Computation Organized for Reconfigurable Execution (SCORE) (1) • Data-flow based framework: • Programming Model • Execution Environment • Hardware Platform • Programming Model: • Streaming dataflow graph of operators (FSM + datapath) • Dynamic data-dependent behavior • Arbitrary size operators • Run-time representation: • Graph of fixed size compute pages • Akin to virtual memory pages • Run-time scheduling is required to handle dynamic page behavior
Stream Computation Organized for Reconfigurable Execution (SCORE) (2) • Hardware Platform: • uP/Reconfigurable array hybrid • Array: compute pages (CP) and configurable memory blocks (CMB) • Stream interface between resources • Global Controller manages reconfiguration • Scheduler Operation: • Temporal Partitioning • Buffer intermediate results • Resource Allocation/Mapping • Compute pages • Memory segments • Communication channels • Array Reconfiguration
Run-time Scheduler • Run-time scheduling (late binding of resources) • Benefit: automatic performance scaling • Extra burden: scheduler • Complex optimization with multiple simultaneous constraints (CPs, CMBs, and network): an NP-hard problem • Space of scheduling solutions • Range in quality and complexity • Tradeoffs: timeslice vs. asynchronous, dynamic vs. static • What is the right timeslice size? • Depends on an application's run-time behavior • Bounded below by the scheduler overhead
Problem Statement • SCORE Micro-architecture • Parallel reconfiguration of independent CPs/CMBs • Reconfiguration time is thousands of cycles • Problem • Investigate scheduling cost • Reduce it to a minimum (comparable to reconfiguration time) • Understand its effect on application run-times.
Initial Scheduling Solution • Fully Dynamic Scheduler • Perform scheduling operation each timeslice • Version of priority-list scheduling • Availability of input tokens and output space determines the priority • Candidates are chosen by BFS • Fixed timeslice size • Large critical loop
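The priority-list idea on this slide can be illustrated with a minimal sketch. Everything named here is hypothetical (the `Page` record, the `priority` formula, `schedule_timeslice`); the slide only states that token availability and output space set the priority and that candidates are found by BFS. The sketch assumes priority is the smaller of the two quantities and that pages with zero priority are skipped.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Page:
    name: str
    input_tokens: int   # tokens waiting on the page's input streams
    output_space: int   # free slots in the page's output buffers
    successors: list = field(default_factory=list)  # names of downstream pages


def priority(page):
    # Assumed formula: a page can fire only as often as it has both
    # input tokens to consume and output space to write into.
    return min(page.input_tokens, page.output_space)


def schedule_timeslice(pages, roots, num_compute_pages):
    """Pick the pages to load for the next timeslice."""
    # Breadth-first traversal from the graph inputs collects candidates.
    order, seen, queue = [], set(roots), deque(roots)
    while queue:
        name = queue.popleft()
        order.append(name)
        for succ in pages[name].successors:
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    # Keep runnable candidates; the stable sort preserves BFS order among ties.
    runnable = [n for n in order if priority(pages[n]) > 0]
    runnable.sort(key=lambda n: priority(pages[n]), reverse=True)
    return runnable[:num_compute_pages]


# Usage: a four-page graph scheduled onto two physical compute pages.
pages = {
    "a": Page("a", 8, 4, ["b", "c"]),
    "b": Page("b", 0, 4, ["d"]),
    "c": Page("c", 6, 6, ["d"]),
    "d": Page("d", 2, 10, []),
}
chosen = schedule_timeslice(pages, ["a"], num_compute_pages=2)
```

Because this whole procedure runs every timeslice, its cost sits directly in the critical loop, which is the overhead the following slides measure.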
Fully Dynamic Scheduler (1) • Two types of overhead: • Scheduler (avg. 124 Kcycles) • Reconfiguration [array global controller] (avg. 3.5 Kcycles) • Average overhead per timeslice > 127 Kcycles
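A rough feel for what these averages mean per timeslice can be had from a small calculation. The function below is a hypothetical model, not the authors' measurement method: it assumes scheduling and reconfiguration are fully serialized with array execution, and it uses the 250 Kcycle timeslice reported for these experiments. Measured per-application figures will differ from this simple model.

```python
def overhead_fraction(timeslice_cycles, scheduler_cycles, reconfig_cycles):
    """Fraction of wall-clock cycles spent on overhead, assuming the
    overhead never overlaps with array execution (an assumption)."""
    overhead = scheduler_cycles + reconfig_cycles
    return overhead / (timeslice_cycles + overhead)


# Averages taken from the slide, in cycles.
per_timeslice = 124_000 + 3_500
fraction = overhead_fraction(250_000, 124_000, 3_500)
print(per_timeslice, round(fraction, 3))
```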
Fully Dynamic Scheduler (2) • Total Execution Time • Scheduler Overhead is on average 36% of execution time • Timeslice Size = 250 Kcycles
Quasi-Static Scheduler • Static: • Pre-compute schedule from: • Graph topology • Back annotations (I/O rates) • Generate script of configuration commands • Quasi: • Small run-time critical loop: • Query Array • Issue Script Commands • Timeslice size • Dynamically controlled by array hardware stall detect • Hardware continuously (or at small intervals) monitors array activity
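The split between the pre-computed script and the small run-time loop might look roughly like the sketch below. All names (`ScriptStep`, `precompute_script`, `FakeArray`, `run`) are hypothetical, and the temporal partitions are assumed to be given; how they are derived from graph topology and I/O-rate annotations is not shown here. `FakeArray` is only a stand-in that reports a stall after a fixed number of polls.

```python
from dataclasses import dataclass


@dataclass
class ScriptStep:
    load: tuple    # pages to configure at the start of this step
    evict: tuple   # pages to remove before loading


def precompute_script(partitions):
    """Offline: turn an ordered list of temporal partitions into commands."""
    script, resident = [], set()
    for part in partitions:
        part = set(part)
        script.append(ScriptStep(load=tuple(sorted(part - resident)),
                                 evict=tuple(sorted(resident - part))))
        resident = part
    return script


class FakeArray:
    """Stand-in for the array: reports a stall on every Nth poll."""

    def __init__(self, polls_until_stall):
        self.polls_until_stall = polls_until_stall
        self.polls = 0
        self.log = []

    def stalled(self):
        self.polls += 1
        return self.polls % self.polls_until_stall == 0

    def issue(self, step):
        self.log.append((step.evict, step.load))


def run(script, array):
    """Run-time critical loop: issue script commands, then query the array."""
    for step in script:
        array.issue(step)
        # The timeslice ends when the hardware reports a stall,
        # not after a fixed cycle count.
        while not array.stalled():
            pass


script = precompute_script([("a", "b"), ("b", "c"), ("d",)])
array = FakeArray(polls_until_stall=3)
run(script, array)
```

The run-time loop makes no scheduling decisions of its own, which is why its cost can stay close to the reconfiguration time.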
Results (1) • A low overhead scheduling solution • Scheduler overhead (avg. 14 Kcycles) • Reconfiguration (avg. 4 Kcycles) • 7x average reduction in overhead
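As a sanity check, the 7x figure is consistent with the averages reported for the two schedulers. The calculation below takes the ratio of the averaged overheads; the slide's figure may instead be an average of per-application ratios, so treat this as an approximation.

```python
# Average per-timeslice overhead in cycles, from the slides.
dynamic = 124_000 + 3_500       # scheduler + reconfiguration, fully dynamic
quasi_static = 14_000 + 4_000   # scheduler + reconfiguration, quasi-static

reduction = dynamic / quasi_static
print(f"{reduction:.1f}x")  # prints "7.1x"
```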
Results (2) • 4.5x average application speedup • Reduction in overhead AND • Improvement in scheduling quality
Results Summary • Tested applications: • Image de/compression: consist of both dynamic and static rate operators. • All demonstrate similar speedups under the Quasi-Static scheduler. • Performance improvements can be attributed to: • Reduced scheduler overhead • Improved scheduling quality: • Global view of the graph, rather than the local (BFS) view of the dynamic scheduler • Reduction of the lower bound of timeslice size • Expands the space of apps well suited for execution under virtualized hardware • Retained powerful semantics of dynamic data-dependent dataflow
Conclusion • Run-time scheduler • Required for automatic scaling under hardware virtualization • Run-time overhead sets a lower bound on the size of the scheduling step (response time): • Restricts applicability of virtualized hardware • Makes this model impractical for some apps • Low overhead run-time scheduling is achievable: • Without semantic restrictions • With higher (or comparable) scheduling quality • 7x reduction in overhead and simultaneous performance improvement of 2-7x • OS is a viable alternative to manual scheduling.
Thank You • Thanks to: DARPA, Xilinx and STMicro • For more information: http://brass.cs.berkeley.edu/SCORE