Scheduling of Parallelized Synchronous Dataflow Actors

Zheng Zhou*, Karol Desnos**, Maxime Pelcat**, Jean-François Nezan**, William Plishker*, and Shuvra S. Bhattacharyya*
** Institut d'Electronique et de Telecommunications de Rennes, INSA Rennes, CNRS UMR 6164, UEB, Rennes, France
* Maryland DSPCAD Research Group (http://www.ece.umd.edu/DSPCAD/home/dspcad.htm), Department of ECE and Institute for Advanced Computer Studies, University of Maryland, College Park, 20742, USA
Presentation version: 10/24/2013

Outline • Motivation • Background • Related Work • Problem Statement • Solution Approach • Experimental Setup • Experimental Results

Introductions • William Plishker, research associate at University of Maryland. Expertise in dataflow representation and analysis, software defined radio, medical imaging, and high energy physics. • Zheng Zhou, software engineer at Texas Instruments, alumnus of University of Maryland. Expertise in dataflow models, multiprocessor programming, and task scheduling for embedded systems. • Shuvra S. Bhattacharyya, professor at University of Maryland. Expertise in real-time signal processing systems, model-based hardware and software design tools, and dataflow methodologies. • Karol Desnos, PhD student at IETR. Research interests in dataflow models, wireless communication, and memory management of embedded systems. • Maxime Pelcat, associate professor at IETR. Expertise in dataflow models, multimedia, telecommunication, and programming of distributed embedded systems. • Jean-François Nezan, professor at IETR. Expertise in dataflow programming, embedded systems, multicore, and video compression.

Motivation • An actor in the application may have multiple implementations, including sequential and parallel implementations. • Choosing appropriate actor implementations together with the actor execution order has a major impact on the performance of the system implementation. [Figure: an application graph with actors A, B, C, and D, shown next to FFT implementations from [Zhou 2012]]

Background: Dataflow Interchange Format (DIF) [Hsu 2005] • A standard language for specifying mixed-grain dataflow models for digital signal processing (DSP) systems. • Currently supports • Synchronous Dataflow (SDF) • Homogeneous Synchronous Dataflow (HSDF) • Cyclo-static Dataflow (CSDF) • Parameterized Synchronous Dataflow (PSDF) • Multidimensional Synchronous Dataflow (MDSDF) • Boolean Dataflow (BDF) • Enable-Invoke Dataflow (EIDF) • Core Functional Dataflow (CFDF)

Background: TDIF-PPG [Zhou 2012] • TDIF-PPG ("Targeted dataflow interchange format / parallel processing group plug-in") is a dataflow-based software design package. • In Layer 1, the given DSP application is modeled using the DIF language. • In Layer 2, the actor interfaces are defined. • In Layer 3, generic parallel implementations of the actors are developed. • In Layer 4, a platform-specific system implementation is constructed.

Related Work on Static Task Scheduling • Independent Sequential Task Scheduling • Dynamic programming [Dogramaci 1979] • Dependent Sequential Task Scheduling • Modified critical path algorithm (heuristic) [Wu 1990] • Genetic algorithm [Omara 2010] • Independent Parallel Task Scheduling • Approximation algorithm [Nahapetian 2009] • Dependent Parallel Task Scheduling • Network flow [Giaro 2009][Manaa 2010] → Neither interprocessor communication costs nor multiple implementations of an actor are considered

Problem Statement: Parallel Actor Scheduling (PAS) • Given: • A dataflow graph G • A Symmetric Multi-Processing (SMP) platform P • An Actor Acceleration Function C, which gives the time (actual or estimated) that a given actor takes to execute on a given number of processors. • Determine: • How many processors will be used for each actor (processor count assignment). • The processor assignment for each actor. • The starting time of each actor. • We focus here on a special case of the PAS problem called the Fully Parallelized PAS (FPPAS) problem, where only parallel implementations of an actor are considered (we assume every actor has at least one parallel implementation).
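The three outputs listed above (processor count, processor assignment, and starting time per actor) can be pictured with a small feasibility check. This is a minimal sketch, not the authors' formulation: the names `Placement`, `is_feasible`, and `makespan`, the table form of C, and all numbers are hypothetical. It also assumes each actor fires once and ignores interprocessor communication costs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Placement:
    procs: frozenset   # processor indices assigned to the actor
    start: float       # starting time of the actor

def finish(actor, placement, C):
    # C[actor][k] is the execution time of `actor` on k processors.
    return placement.start + C[actor][len(placement.procs)]

def is_feasible(schedule, edges, C, num_procs):
    """Check processor ranges, precedence, and exclusive use of processors."""
    for actor, pl in schedule.items():
        if not pl.procs or any(p < 0 or p >= num_procs for p in pl.procs):
            return False
        if len(pl.procs) not in C[actor]:   # no implementation for this count
            return False
    for u, v in edges:                      # v may start only after u finishes
        if schedule[v].start < finish(u, schedule[u], C):
            return False
    actors = list(schedule)
    for i, a in enumerate(actors):          # shared processors must not overlap in time
        for b in actors[i + 1:]:
            pa, pb = schedule[a], schedule[b]
            overlap = pa.start < finish(b, pb, C) and pb.start < finish(a, pa, C)
            if pa.procs & pb.procs and overlap:
                return False
    return True

def makespan(schedule, C):
    return max(finish(a, pl, C) for a, pl in schedule.items())

# Hypothetical instance: two actors, edge A -> B, four processors.
C = {"A": {1: 8.0, 2: 5.0}, "B": {2: 6.0, 4: 4.0}}
edges = [("A", "B")]
good = {"A": Placement(frozenset({0, 1}), 0.0),
        "B": Placement(frozenset({0, 1, 2, 3}), 5.0)}
bad = {"A": Placement(frozenset({0, 1}), 0.0),
       "B": Placement(frozenset({0, 1, 2, 3}), 4.0)}  # B starts before A ends
```

Here `good` is feasible with a schedule length of 9.0, while `bad` violates the precedence constraint between A and B.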

FPPAS example

Optimal Schedule

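Overall Approach • Phase 1: particle swarm optimization takes the FPPAS instance and proposes processor count assignments. • Phase 2: a heuristic solver or a MIP solver schedules the graph for a given processor count assignment and returns the schedule length to Phase 1.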
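The interplay of the two phases can be sketched as an outer particle swarm loop that calls an inner solver for each candidate processor count assignment. Everything below is an illustrative assumption rather than the authors' implementation: the inner solver `shelf_length` is a simplified stand-in that packs independent actors into floors, the acceleration model `work / p + 0.5 * (p - 1)` is invented, and the swarm parameters are generic textbook values.

```python
import random

def shelf_length(counts, work, num_procs):
    """Phase 2 stand-in: pack independent actors into floors, in index order."""
    now, remaining, height = 0.0, num_procs, 0.0
    for v, p in enumerate(counts):
        duration = work[v] / p + 0.5 * (p - 1)   # hypothetical C(v, p)
        if p > remaining:                        # open a new floor
            now, remaining, height = now + height, num_procs, 0.0
        remaining -= p
        height = max(height, duration)
    return now + height

def pso(work, num_procs, particles=12, iters=40, seed=0):
    """Phase 1: search processor count assignments, scored by phase 2."""
    rng = random.Random(seed)
    n = len(work)

    def decode(x):   # continuous position -> integer processor counts
        return [min(num_procs, max(1, round(xi))) for xi in x]

    def cost(x):
        return shelf_length(decode(x), work, num_procs)

    pos = [[rng.uniform(1, num_procs) for _ in range(n)] for _ in range(particles)]
    vel = [[0.0] * n for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_cost = [cost(p) for p in pos]
    g = min(range(particles), key=pbest_cost.__getitem__)
    gbest, gbest_cost = pbest[g][:], pbest_cost[g]

    for _ in range(iters):
        for i in range(particles):
            for d in range(n):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (0.7 * vel[i][d]
                             + 1.5 * r1 * (pbest[i][d] - pos[i][d])
                             + 1.5 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = min(float(num_procs), max(1.0, pos[i][d] + vel[i][d]))
            c = cost(pos[i])
            if c < pbest_cost[i]:
                pbest[i], pbest_cost[i] = pos[i][:], c
                if c < gbest_cost:
                    gbest, gbest_cost = pos[i][:], c
    return decode(gbest), gbest_cost

work = [8.0, 6.0, 4.0, 10.0]          # hypothetical sequential workloads
counts, length = pso(work, num_procs=8)
```

The point of the split is that the swarm only ever sees a processor count vector and a scalar schedule length, so the inner solver can be swapped (heuristic or MIP) without touching the outer loop.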

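MIP Solver: Computation Usage Graph (CUG) [Figure: a dataflow graph G(V,E) with processor count assignment and the corresponding Computation Usage Graph (CUG); actors A and B appear with CUG vertices A1, A2 and B1, B2, B3]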

MIP Solver: Variables

MIP Solver: Constraints and Objective

Heuristic Solver: "Story Scheduling" • Ranking: 3, 2, 1, 6, 4, 5 (ties are broken by selecting actors with higher processor count) • Free vertex list: 1, 2, 3 • Scheduled vertex list: empty

Story Scheduling Example • Free vertex list: 1, 2, 3 • Scheduled vertex list: 3 • Updated free vertex list: 1, 2 [Figure: schedule chart, time (s) from 0 to 15 versus processor number from 0 to 8; actor 3 on the 1st floor]

Story Scheduling Example • Free vertex list: 2, 1 • Scheduled vertex list: 3, 1 • Updated free vertex list: 2 [Figure: schedule chart; actors 3 and 1 on the 1st floor]

Story Scheduling Example • Free vertex list: 2 • Scheduled vertex list: 3, 1 • Updated free vertex list: 2, 6, 4 [Figure: schedule chart; actors 3 and 1 on the 1st floor]

Story Scheduling Example • Free vertex list: 2, 6, 4 • Selected actor: 2, C(2, 4) = 5 s • Scheduled vertex list: 3, 1, 2 • Updated free vertex list: 6, 4 [Figure: schedule chart; actors 3 and 1 on the 1st floor, actor 2 on the 2nd floor]

Story Scheduling Example • Free vertex list: 6, 4 • Selected actor: 4, C(4, 4) = 5 s • Scheduled vertex list: 3, 1, 2, 4 • Updated free vertex list: 6 [Figure: schedule chart; actors 3 and 1 on the 1st floor, actors 2 and 4 on the 2nd floor]

Story Scheduling Example • Free vertex list: 6 • Scheduled vertex list: 3, 1, 2, 4 • Updated free vertex list: 6, 5 [Figure: schedule chart; actors 3 and 1 on the 1st floor, actors 2 and 4 on the 2nd floor]

Story Scheduling Example • Selected actor: 6, C(6, 5) = 5 s • Scheduled vertex list: 3, 1, 2, 4, 6 • Free vertex list: 5 [Figure: schedule chart; actor 6 opens the 3rd floor]

Story Scheduling Example • Selected actor: 5, C(5, 3) = 5 s • Scheduled vertex list: 3, 1, 2, 4, 6, 5 • Free vertex list: empty [Figure: final schedule chart with actors 1 to 6 placed across three floors]
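The walkthrough above can be reproduced with a short floor-packing sketch. This is one plausible reading of the slides, not the authors' algorithm: the function name `story_schedule` and its arguments are hypothetical, and the ranking is taken as given. The processor counts for actors 2, 4, 5, and 6 and their 5 s execution times are read from the C(actor, count) annotations; the counts and times for actors 1 and 3 and all precedence edges are assumptions chosen to be consistent with the free vertex lists shown.

```python
def story_schedule(rank, procs, time, preds, num_procs):
    """Pack actors into floors; return (scheduling order, start times, length)."""
    done, order, start = set(), [], {}
    now = 0.0
    while len(done) < len(rank):
        # Free vertices, in rank order: unscheduled actors whose
        # predecessors all sit on lower floors.
        free = [v for v in rank if v not in done and preds.get(v, set()) <= done]
        remaining, floor = num_procs, []
        for v in free:
            if procs[v] <= remaining:       # actor fits on the current floor
                floor.append(v)
                start[v] = now
                remaining -= procs[v]
        if not floor:
            raise ValueError("no schedulable actor (cycle or oversized actor)")
        order.extend(floor)
        done.update(floor)
        now += max(time[v] for v in floor)  # floor height = longest actor on it
    return order, start, now

rank = [3, 2, 1, 6, 4, 5]                        # ranking from the slides
procs = {1: 3, 2: 4, 3: 5, 4: 4, 5: 3, 6: 5}     # counts for 1 and 3 are assumed
time = {v: 5.0 for v in rank}                    # times for 1 and 3 are assumed
preds = {4: {1}, 6: {3}, 5: {2, 4}}              # assumed precedence edges

order, start, length = story_schedule(rank, procs, time, preds, num_procs=8)
print(order, length)   # [3, 1, 2, 4, 6, 5] 15.0
```

With these assumptions the scheduling order matches the scheduled vertex list of the example, and the three floors of 5 s each give a schedule length of 15 s. Actor 2 is skipped on the 1st floor because only 3 processors remain after actor 3, which is why actor 1 is placed before it despite its lower rank.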

Simulation Results • Benchmark set: randomly generated SDF graphs using PREESM [Pelcat 2009]

Experimental Setup TI TMS320C6678 Platform • Software Packages: • Code Composer Studio V5.2 • IPC 1.24.2.27 • PDK 1.1.0.2 • SYS/BIOS 6.33.4.39 • BIOS-MCSDK 2.1.0 Beta.

Image Registration [Figure: dataflow graph of the image registration application, with the labels IRSUB, Image Reader (Reference), Image Reader (target), Cascade Gaussian Filter (x2), Difference of Gaussian (x2), Local Extrema Detection (x2), Post-Processing (x2), Descriptor Assignment (x2), Key Points Matching, Matching Refinement, Target Image Transformation, and Image Writer]

Experimental Results • Implementation 1 only explores graph-level parallelism: • all actors are sequential actors; • the application is assigned to 2 DSPs (optimal schedule). • Implementation 2 explores both graph-level parallelism and actor-level parallelism: • the actors in blue are parallel actors; • the application is assigned to 4 DSPs using the scheduling solution obtained from our 2-Phase Scheduling Framework. • Implementation 2 achieves a 1.97x speedup compared to Implementation 1.

Summary • New design methods and algorithms for scheduling parallelized synchronous dataflow actors, and graphs that contain such actors • We have focused on a special case of the PAS problem called the Fully Parallelized PAS (FPPAS) problem • Only parallel implementations of an actor are considered (we assume every actor has at least one parallel implementation). • Solution approach • Decomposition into global and local search phases • Local search is in terms of fixed processor count assignments • Novel MIP and heuristic (“story scheduling”) techniques for local search • Particle swarm optimization for global search process • Experimental results using simulation on random graphs, as well as an off-the-shelf digital signal processor

References 1 • [Zhou 2012] Z. Zhou, C. Shen, W. Plishker, H. Wu, and S. S. Bhattacharyya. Systematic integration of flowgraph- and module-level parallelism in implementation of DSP applications on multiprocessor systems-on-chip. In Proceedings of the International Conference on Signal Processing, pages 402–408, Beijing, China, October 2012. • [Hsu 2005] C. Hsu, M. Ko, and S. S. Bhattacharyya. Software synthesis from the dataflow interchange format. In Proceedings of the International Workshop on Software and Compilers for Embedded Systems, pages 37–49, Dallas, Texas, September 2005. • [Dogramaci 1979] A. Dogramaci and J. Surkis. Evaluation of a heuristic for scheduling independent jobs on parallel identical processors. Management Science 25, 12 (1979), 1208–1216. • [Wu 1990] M.-Y. Wu and D. D. Gajski. Hypertool: a programming aid for message-passing systems. IEEE Transactions on Parallel and Distributed Systems 1, 3 (1990), 330–343.

References 2 • [Pelcat 2009] M. Pelcat, P. Menuet, S. Aridhi, and J.-F. Nezan. Scalable compile-time scheduler for multi-core architectures. In Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pages 1552–1555, 2009. • [Omara 2010] F. A. Omara and M. M. Arafa. Genetic algorithms for task scheduling problem. J. Parallel and Distrib. Comput. 70, 1 (2010), 13–22. • [Nahapetian 2009] A. Nahapetian, P. Brisk, S. Ghiasi, and M. Sarrafzadeh. An approximation algorithm for scheduling on heterogeneous reconfigurable resources. ACM Trans. Embed. Comput. Syst. 9, 1, Article 5 (Oct. 2009), 20 pages. • [Manaa 2010] A. Manaa and C. Chu. Scheduling multiprocessor tasks to minimise the makespan on two dedicated processors. European Journal of Industrial Engineering 4, 3 (2010), 265–279. • [Giaro 2009] K. Giaro, M. Kubale, and P. Obszarski. A graph coloring approach to scheduling of multiprocessor tasks on dedicated machines with availability constraints. Discrete Applied Mathematics 157, 17 (2009), 3625–3630.
