A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching. Junghee Lee*, Hyung Gyu Lee*, Soonhoi Ha†, Jongman Kim*, and Chrysostomos Nicopoulos‡. Presented by Junghee Lee.
Introduction • Single core → multi core → many core → massively parallel processing array (MPPA) • Programmable hardware accelerator, e.g., GPGPU • Fusion: powerful cores + H/W accelerator in a single die, e.g., AMD Fusion
MPPA as Hardware Accelerator [Figure: host CPUs connected through a host CPU interface to a massively parallel processing array of core tiles, with device memory and I/O] • Challenges • Expressiveness • Debugging • Memory hierarchy design
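Related Works (compared on expressiveness, debugging, and memory) • GPGPU, AMD Fusion: Expressiveness = SIMD; Debugging = multiple debuggers, event graph; Memory = scratch-pad memory, cache • Tilera: Expressiveness = multi-threading; Debugging = multiple debuggers; Memory = coherent cache • Rigel: Expressiveness = multi-threading; Debugging = not addressed; Memory = software-managed cache • Ambric: Expressiveness = Kahn process network; Debugging = formal model; Memory = scratch-pad memory • Proposed MPPA: Expressiveness = event-driven model; Debugging = inter-module debug, intra-module debug; Memory = scratch-pad memory, prefetching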
Contents • Introduction • Execution Model • Hardware Architecture • Evaluation • Conclusion
Execution Model • Specification • Module = (b, Pi, Po, C, F) • b = Behavior of module • Pi = Input ports • Po = Output ports • C = Sensitivity list • F = Prefetch list • Signal • Net = (d, K) • d = Driver port • K = A set of sink ports • Semantics • A module is triggered when any signal connected to C changes • Function calls and memory accesses are limited to within a module • Non-blocking write and blocking read • The specification can be modified at run time
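The specification tuples above can be sketched as plain data structures. This is only an illustrative sketch, not the authors' implementation: the class names, the "module.port" string convention for port addresses, and the helper `triggered_modules` are all assumptions introduced here. It shows one possible reading of the rule that a module is triggered when a signal connected to its sensitivity list C changes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class Module:
    """Module = (b, Pi, Po, C, F); field names are hypothetical."""
    name: str
    behavior: Callable[[Dict[str, object]], Dict[str, object]]  # b
    inputs: List[str]       # Pi
    outputs: List[str]      # Po
    sensitivity: Set[str]   # C, assumed to be a subset of the input ports
    prefetch: Set[str]      # F


@dataclass
class Net:
    """Net = (d, K); ports are addressed as "module.port" (an assumption)."""
    driver: str             # d
    sinks: Set[str]         # K


def triggered_modules(net: Net, modules: Dict[str, Module]) -> List[str]:
    """Names of modules that have a sensitive input port among the net's sinks."""
    hit: List[str] = []
    for sink in sorted(net.sinks):
        mod_name, port = sink.split(".")
        if port in modules[mod_name].sensitivity and mod_name not in hit:
            hit.append(mod_name)
    return hit
```

In this sketch, a change on a net wakes only the modules whose sink port is listed in C; a sink port outside C receives the value but does not trigger its module.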
Example • Quick sort • Pivot is selected • The given array is partitioned so that • The left segment should contain smaller elements than the pivot • The right segment should contain larger elements than the pivot • Recursively partition the left and right segments • Specifying quick sort • Multi-threading • OK but hard to debug • SIMD • Inefficient due to input dependency • Kahn process network • Impossible due to the dynamic nature
Specify Quick Sort with Event-driven Model • Partition module • b (behavior): select a pivot, partition the input array, instantiate another partition module if necessary • Pi (input port): input array and its position • Po (output port): left and right segments and their positions • C (sensitivity list): input array • F (prefetch list): input array • Collection module • b (behavior): collect segments • Pi (input port): sorted segments and intermediate result • Po (output port): final result and intermediate result • C (sensitivity list): sorted segments • F (prefetch list): sorted segments and intermediate result [Figure: Input array → Partition → Partition → Partition → … → Collection → Final result; Collection also produces and consumes the Intermediate result]
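The Partition and Collection modules above can be mimicked in software with a queue of pending activations. The sketch below is a sequential simulation under stated assumptions, not the MPPA implementation: the function names are hypothetical, the pivot is simply the first element, and the Collection module is reduced to writing values into a result list at their final positions.

```python
from collections import deque


def partition_module(segment, position):
    """Hypothetical behavior b of a Partition module.

    Returns the left segment, the pivot, and the right segment,
    each paired with its position in the final array.
    """
    pivot = segment[0]  # pivot choice is an assumption of this sketch
    left = [x for x in segment[1:] if x < pivot]
    right = [x for x in segment[1:] if x >= pivot]
    pivot_pos = position + len(left)
    return (left, position), (pivot, pivot_pos), (right, pivot_pos + 1)


def event_driven_quicksort(data):
    """Sort by repeatedly activating Partition modules from a ready queue."""
    result = [None] * len(data)         # intermediate result kept by Collection
    ready = deque([(list(data), 0)])    # pending Partition activations
    while ready:
        segment, pos = ready.popleft()
        if not segment:
            continue                    # nothing to sort
        if len(segment) == 1:
            result[pos] = segment[0]    # a sorted segment reaches Collection
            continue
        (left, lpos), (pivot, ppos), (right, rpos) = partition_module(segment, pos)
        result[ppos] = pivot            # the pivot is already in its final place
        ready.append((left, lpos))      # instantiate further Partition modules
        ready.append((right, rpos))
    return result
```

The number of Partition activations depends on the input data, which is the dynamic behavior that the slide argues a static Kahn process network cannot express.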
MPPA Microarchitecture [Figure: array of core tiles attached to the host CPU interface and device memory, with one tile marked E] • Core tiles • All core tiles are identical • Each consists of a uCPU, scratch-pad memory, and peripherals that support the execution model • One core tile is designated as the execution engine (E) • Execution engine • Software running on a core tile • Consists of a scheduler, signal storage, and an interconnect directory • Supports the execution model • If necessary, it is split into multiple instances running on different core tiles
Core Tile Architecture • Scratch-pad memory (SPM): software-managed on-chip SRAM; double buffering, where one buffer is for the current module and the other is for the next module to be prefetched • uCPU: generic small processor, treated as a black box • Context manager: switches the context when the current module finishes and the next module is ready • Prefetcher: stores information about the modules; prefetches the code and data of the next module while the current module is running on the uCPU • Input signal queue: stores the input data; the actual data is stored in the SPM while its information is managed by this queue • Output signal queue: stores the output data; notifies the interconnect directory of the update event when the output is updated • Message handler and message queue: act as the counterpart of the prefetcher by sending data to the requester, and handle the system messages • Network interface: NoC router
Execution Engine • Most of its functionality is implemented in software, while the hardware facilitates communication • The software implementation gives flexibility in the number and location of execution engines • One way to visualize our MPPA is to regard the execution engine as an event-driven simulation kernel • The execution engine interacts with modules running on other core tiles through messages
Components of Execution Engine • Scheduler • Keeps track of the status and location of modules • Maintains three queues: wait, ready and run queue • Signal storage • Stores signal values in the device memory • If a signal is updated but its value is still stored in the node, the signal storage invalidates its value and keeps the location of the latest value • Interconnect directory • Keeps track of connectivity of signals and ports • Maintains the sensitivity list
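The scheduler's three queues can be sketched as follows. The slide only states that wait, ready, and run queues exist; the transitions between them (wait to ready on a sensitive signal update, ready to run on dispatch, run back to wait on completion), the class name, and the method names are assumptions made for illustration.

```python
from collections import deque


class Scheduler:
    """Toy scheduler with wait, ready, and run queues (hypothetical API)."""

    def __init__(self, sensitivity):
        # sensitivity maps a module name to the set of signals in its list C
        self.sensitivity = sensitivity
        self.wait = set(sensitivity)   # modules waiting for an event
        self.ready = deque()           # triggered modules awaiting a core tile
        self.run = {}                  # core tile id -> module currently running

    def signal_updated(self, signal):
        """Move every waiting module that is sensitive to `signal` to ready."""
        for name in sorted(self.wait):  # sorted() copies, so removal is safe
            if signal in self.sensitivity[name]:
                self.wait.remove(name)
                self.ready.append(name)

    def dispatch(self, tile):
        """Assign the next ready module to an idle tile, if there is one."""
        if tile in self.run or not self.ready:
            return None
        name = self.ready.popleft()
        self.run[tile] = name
        return name

    def finished(self, tile):
        """Return the module that ran on `tile` to the wait queue."""
        name = self.run.pop(tile)
        self.wait.add(name)
        return name
```

In this sketch the interconnect directory's role is folded into the `sensitivity` mapping; in the architecture described above it is a separate component of the execution engine.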
Module-Level Prefetching • Hides the overhead of the dynamic scheduling • Prefetches the next module while the current module is running [Figure: timeline across uCPU, Prefetcher, Scheduler, Interconnect Directory, Signal Storage, and Other Node, with the labels "Execute a module" and "Memory access"]
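A simple timing model illustrates why overlapping the fetch of the next module with the execution of the current one raises core utilization. This is a back-of-the-envelope sketch, not the authors' simulator: it assumes a single core tile, two buffers, a fetch of the next module that starts when the current module starts executing, and made-up cycle counts.

```python
def makespan(tasks, prefetch):
    """Total cycles to run `tasks` on one core tile.

    tasks: list of (fetch_cycles, exec_cycles) pairs, one per module.
    prefetch: if True, the fetch of module i+1 overlaps the execution of module i.
    """
    if not tasks:
        return 0
    if not prefetch:
        # Every module is fetched, then executed, strictly in sequence.
        return sum(fetch + run for fetch, run in tasks)
    total = tasks[0][0]  # the first fetch cannot be hidden
    for i, (_, run) in enumerate(tasks):
        next_fetch = tasks[i + 1][0] if i + 1 < len(tasks) else 0
        # The next module starts when both the current run and its fetch are done.
        total += max(run, next_fetch)
    return total


def utilization(tasks, prefetch):
    """Fraction of the makespan during which the uCPU executes module code."""
    busy = sum(run for _, run in tasks)
    span = makespan(tasks, prefetch)
    return busy / span if span else 0.0


# Illustrative numbers only: three modules, 10 cycles to fetch, 40 to execute.
example = [(10, 40), (10, 40), (10, 40)]
print(makespan(example, prefetch=False), makespan(example, prefetch=True))
```

Under these assumptions the fetch cost is hidden whenever a module runs at least as long as the next module takes to fetch, so only the first fetch remains on the critical path.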
Illustrative Example [Figure: quick-sort example with Partition 0 to Partition 5 modules placed on three core tiles, each showing uCPU, Prefetcher, Output Signal Queue, and Message Handler; the execution engine shows the Interconnect Directory, Signal Storage, and the Scheduler's Wait, Ready, and Run queues, with the Collection module held in the queues]
Benchmark • Recognition, Mining and Synthesis (RMS) benchmark • Fine-grained parallelism: dominated by short tasks • Small memory footprint • High run-time scheduling overhead • Task-level parallelism: exhibits dependencies • Hard to implement on a GPGPU
Simulator • In-house cycle-level simulator • Parameters
Utilization [Figure: core utilization (0 to 1.0) for the benchmarks OP, FS, BS, CF, CED, BT, and QS, without and with prefetching]
Scalability [Figure: core utilization (0 to 1.0) and execution time (8000 to 20000 cycles) versus number of core tiles (24 to 64); series: Util (1), Execution time (1), Util (3), Execution time (3)]
Conclusion • This paper proposes a novel MPPA architecture that employs an event-driven execution model • Handles dependencies by dynamic scheduling • Hides dynamic scheduling overhead by module-level prefetching • Future work • Support applications that require a larger memory footprint • Adjust the number of execution engines dynamically • Support inter-module debugging
Questions? Contact info Junghee Lee junghee.lee@gatech.edu Electrical and Computer Engineering Georgia Institute of Technology