
Many-core processors: the integrated approach to the computational and execution models

Many-core processors: the integrated approach to the computational and execution models. Lorenzo Verdoscia and Roberto Vaccaro Institute for High Performance Computing and Networking National Research Council – Italy lorenzo.verdoscia@na.icar.cnr.it.


Presentation Transcript


  1. Many-core processors: the integrated approach to the computational and execution models Lorenzo Verdoscia and Roberto Vaccaro Institute for High Performance Computing and Networking National Research Council – Italy lorenzo.verdoscia@na.icar.cnr.it L. Verdoscia & R. Vaccaro – Many-core processors: the integrated approach to the computational and execution models

  2. The Landscape of Parallel Computing Research: A View From Berkeley http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html

  3. What is D3AS • From our architectural point of view, this new trend raises at least two questions: • how to exploit such spatial parallelism, • how to program such systems.

  4. What is D3AS • The first question leads us to seriously reconsider the dataflow paradigm, given the fine-grained nature of its operations. • Instead of carrying out a set of operations in sequence, as a von Neumann processor does, a many-core dataflow processor could calculate a function by first connecting and configuring a number of identical simple cores as a dataflow graph and then letting data flow asynchronously through them.

  5. What is D3AS • The second question leads us to seriously reconsider the functional programming style, given its intrinsic simplicity in writing parallel programs. • Functional languages have three key properties that make them attractive for parallel programming: • they have powerful mechanisms for abstracting over both computation and coordination; • they eliminate unnecessary dependencies; • their high-level coordination achieves a largely architecture-independent style of parallelism.

  6. Agenda • The hHLDS model • CHIARA language • Dataflow graph generation and mapping • D3AS general architecture • Future work

  7. D3AS (Demand Data Driven Architecture System): a high performance reconfigurable computing system demonstrator, which exploits FPGA technology, where • the computational model is functional • the execution model is dataflow and whose architecture is highly scalable, with nodes characterized by • dynamic configurability • transparent hardware reconfiguration • Design methodology: • develop the right computation model alongside languages & hardware

  8. The methodological approach

  9. The homogeneous High Level Dataflow System (hHLDS) model • Let $A = \{a_1, \dots, a_n\}$ be the set of actors and $L = \{l_1, \dots, l_n\}$ be the set of links. • A dataflow graph is a labelled directed graph $G = (N, E)$ where $N = A \cup L$ is the set of nodes and $E \subseteq (A \times L) \cup (L \times A)$ is the set of edges. • Firing rules in the classical model: • firing of an actor: a token on each input link and no token on each output link • effect: consumes all input tokens and produces a token on its output link
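
The classical firing rule stated above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Link` and `Actor` are hypothetical, a link is assumed to hold at most one token, and `None` is used to mean "no token".

```python
from dataclasses import dataclass


@dataclass
class Link:
    """A link that holds at most one token; None means the link is empty."""
    token: object = None


@dataclass
class Actor:
    """An actor with any number of input links and a single output link."""
    op: callable
    inputs: list
    output: Link

    def can_fire(self):
        # Classical rule: a token on each input link and no token on the output link.
        inputs_ready = all(link.token is not None for link in self.inputs)
        return inputs_ready and self.output.token is None

    def fire(self):
        args = [link.token for link in self.inputs]
        for link in self.inputs:
            link.token = None               # consume all input tokens
        self.output.token = self.op(*args)  # produce a token on the output link


a, b, out = Link(2), Link(3), Link()
adder = Actor(lambda x, y: x + y, [a, b], out)
if adder.can_fire():
    adder.fire()
```

After firing, both input links are empty and the output link holds the sum; the actor cannot fire again until the output token has been consumed and new inputs have arrived.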

  10. The hHLDS model • Special actors in the classical model are characterized by having heterogeneous I/O conditions. [Figure: the special actors Merge, Switch, Gate, and Decider with their data links and T/F control links]

  11. homogeneous High Level Dataflow System • Any actor has two input links and one output link, and consumes and produces only data tokens. • firing of an actor: a token on each input link • effect: consumes all input tokens and can produce a token on its output link [Figure: example hHLDS graphs for a+b*c and for "if b≤c then a"]
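
A minimal Python sketch of the two examples on this slide follows. It rests on assumptions that the slide does not state: `None` stands for "no token", the comparison actor is assumed to emit a Boolean data token, and `pass_if` is a hypothetical gating operation that forwards its first input only when the second is true.

```python
import operator


def fire(op, left, right):
    """Fire a two-input, one-output actor; it fires only with a token on each input."""
    if left is None or right is None:
        return None  # firing rule not satisfied, nothing is produced
    return op(left, right)


def pass_if(value, condition):
    """Hypothetical operation: forward the value only when the condition token is true."""
    return value if condition else None


def a_plus_b_times_c(a, b, c):
    product = fire(operator.mul, b, c)
    return fire(operator.add, a, product)


def a_if_b_le_c(a, b, c):
    test = fire(operator.le, b, c)  # assumed to be an ordinary (Boolean) data token
    return fire(pass_if, a, test)   # "can produce" a token: None when the test fails
```

Every actor here has exactly two inputs and one output, and control is expressed only through data tokens, which is the homogeneity the slide describes.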

  12. The hHLDS model • Comparison between the two models: input (a, c) b := 1; repeat if a > 1 then a := a \ 2 else a := a * 5; b := b * 3; until b = c; output (d)

  13. CHIARA language • dialect of Backus’s FP • tuple $(O, F, \mathcal{F}, :, D)$ where: • $O$ is a set of objects; • $F$ is a set of functions (or operators) from objects to objects; • $\mathcal{F}$ is a set of functional forms (functionals) from functions to functions; • $:$ is the application operation; • $D$ is a set of function definitions.

  14. CHIARA language • CHIARA objects • Atoms: include integer fixed and floating-point numbers, Boolean constants, characters and strings • Sequences: denoted with angle brackets < 1, 2, 3 > • The empty sequence <> is the only object which is both an atom and a sequence • Undefined special object ⊥ (or UDF) called bottom, which is usually used to denote errors or exceptions. • Sequences are bottom-preserving: < 1, 2, < 3, 5 >, ⊥ > = ⊥
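
The bottom-preserving rule can be illustrated with a short Python sketch. The names `BOTTOM` and `seq` are hypothetical, and representing sequences as Python tuples is an assumption made only for this example.

```python
# Hypothetical stand-in for the undefined object (bottom).
BOTTOM = object()


def seq(*elements):
    """Build a sequence; if any element is bottom, the whole sequence is bottom."""
    if any(element is BOTTOM for element in elements):
        return BOTTOM
    return tuple(elements)


ok = seq(1, 2, seq(3, 5))            # an ordinary nested sequence
bad = seq(1, 2, seq(3, 5), BOTTOM)   # collapses to bottom
nested_bad = seq(1, seq(2, BOTTOM))  # bottom propagates outward from the inner sequence
```

Because the check happens at construction time, an error anywhere inside a nested sequence propagates to the outermost object.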

  15. CHIARA language • CHIARA functions: two kinds of operators can be applied to objects: • Elementary: the commonly used binary operators and some new ones • Combinator: operators that affect the structure of the objects to which they are applied (combine sequences, transpose sequences of sequences, etc.).

  16. CHIARA language Elementary operators

  17. CHIARA language Elementary operators

  18. Combinator operators

  19. Combinator operators

  20. Functional forms • CHIARA functional forms are used to define new functions from existing functions and combinators • Functionals in CHIARA include the functional forms of Backus’s FP and some new ones

  21. Functional forms

  22. Functional forms

  23. Functional forms

  24. CHIARA language • The assembly language • a functionally complete subset of elementary operators is the assembly language for a D3AS many-core processor • more complex functions are obtained by applying the rule of metacomposition • the dataflow graphs that are produced can be directly mapped onto and executed on the hardware

  25. New functions • The def construct permits the definition of new functions from existing functions, combinators, functional forms, and other already defined functions. For example: • def max = (gt ° [1,2] --> 1;2) • max:<5,6> = 6 [Figure: dataflow graph of max with inputs a and b]
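
To make the definition of max above concrete, here is a hedged Python sketch of the FP-style forms it uses: selectors, construction, composition, and condition. The helper names (`sel`, `construct`, `compose`, `cond`, `chiara_max`) are hypothetical and are not CHIARA syntax; sequences are modelled as Python tuples.

```python
def sel(i):
    """Selector i: pick the i-th element (1-based) of a sequence."""
    return lambda xs: xs[i - 1]


def construct(*fs):
    """Construction [f1, ..., fn]: apply every function to the same object."""
    return lambda x: tuple(f(x) for f in fs)


def compose(f, g):
    """Composition f ° g: apply g first, then f."""
    return lambda x: f(g(x))


def cond(p, f, g):
    """Condition (p --> f; g): apply f when p holds, otherwise g."""
    return lambda x: f(x) if p(x) else g(x)


def gt(pair):
    """Elementary operator: is the first element greater than the second?"""
    return pair[0] > pair[1]


# def max = (gt ° [1,2] --> 1;2)
chiara_max = cond(compose(gt, construct(sel(1), sel(2))), sel(1), sel(2))

result = chiara_max((5, 6))  # corresponds to max:<5,6>
```

For the pair <5,6> the test gt fails, so the second selector is applied, which matches the value 6 given on the slide.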

  26. Dataflow graph generation and mapping • Dataflow graph mapping • communication between many-core processors is slower than communication within a many-core processor • the mapping problem is NP-hard

  27. Dataflow graph generation and mapping Compilation process The whole compilation process is composed of two steps: • compilation, producing the dataflow graph from CHIARA programs (function definitions plus expressions to be evaluated) • mapping, aimed at implementing the produced dataflow graph onto the D3AS prototype

  28. Dataflow graph generation and mapping Dataflow graph generation • the CHIARA compiler, in conjunction with front-end tools, generates the Global Dataflow Graph Table (GDGT)

  29. Global Dataflow Graph Table (GDGT)

      Node#  Func  Apply level  Constr level  Insert level  Left In  Right In  Out
      ..     ...   .            .             .             ..       ..        ..
      43     MUL   1            0             0             %1       %30       47
      44     MUL   1            0             0             %2       %30       47
      45     MUL   1            0             0             %3       %30       48
      46     MUL   1            0             0             %4       %30       48
      47     ADD   0            0             1             43       44        49
      48     ADD   0            0             1             45       46        49
      49     ADD   0            0             2             47       48        out
      50     MUL   1            0             0             %1       %40       54
      51     MUL   1            0             0             %2       %40       54
      52     MUL   1            0             0             %3       %40       55
      53     MUL   1            0             0             %4       %40       55
      54     ADD   0            0             1             50       51        56
      55     ADD   0            0             1             52       53        56
      56     ADD   0            0             2             54       55        out
      ..     ...   .            .             .             ..       ..        ..
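
As a rough illustration of how such a table describes a computation, the Python sketch below evaluates the rows for nodes 43 to 49. It assumes, without confirmation from the slides, that operands written `%k` are external input tokens and that numeric operands name the node producing the value; the level columns are ignored, and the input values are made up for the example.

```python
import operator

OPS = {"MUL": operator.mul, "ADD": operator.add}

# (node, func, left_in, right_in, out), copied from rows 43-49 of the table.
ROWS = [
    (43, "MUL", "%1", "%30", 47),
    (44, "MUL", "%2", "%30", 47),
    (45, "MUL", "%3", "%30", 48),
    (46, "MUL", "%4", "%30", 48),
    (47, "ADD", 43, 44, 49),
    (48, "ADD", 45, 46, 49),
    (49, "ADD", 47, 48, "out"),
]


def evaluate(rows, inputs):
    """Evaluate rows in table order; each row only refers to inputs or earlier nodes."""
    values = {}

    def fetch(ref):
        # Assumption: string operands are external inputs, integers are node numbers.
        return inputs[ref] if isinstance(ref, str) else values[ref]

    result = None
    for node, func, left, right, out in rows:
        values[node] = OPS[func](fetch(left), fetch(right))
        if out == "out":
            result = values[node]
    return result


# Made-up input tokens, for illustration only.
total = evaluate(ROWS, {"%1": 1, "%2": 2, "%3": 3, "%4": 4, "%30": 10})
```

The four MUL nodes are independent of each other and the two level-1 ADD nodes depend only on them, so in a dataflow execution they could fire in parallel; the sequential loop here is only a convenience.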

  30. Visualization of Compiler Graph

  31. Dataflow graph generation and mapping • The next step: the compiler extracts two tables from the GDGT: • the Dataflow Graph Description (DGD) table, which contains, for each node, the binary operation and interconnection codes for the Graph Setter of a Processing Subsystem • the Initial Input Value (IIV) table, which contains the binary information about the input program data tokens

  32. Dataflow graph mapping The presence of functionals: • permits the adoption of strategies that try to cluster parallelism exploitation • suggests handy ways to partition the dataflow graph into smaller, loosely connected graphs that can be run on the single platform-processors

  33. D3AS general architecture • Reconfigurable Hardware System (RHS) • Capable of mapping and executing, in a completely asynchronous manner, dataflow graphs created with the hHLDS model. • Constituted by three subsystems: • Actor Realization Subsystem (ARS): capable of creating a one-to-one correspondence between graph actors and Functional Units. • Token flow Realization Subsystem (TRS): implements graph edges. • Graph Mapping Subsystem (GMS): devoted to storing the RHS context information.

  34. D3AS general architecture ■ ARS: constituted by N identical Multipurpose Functional Units (MPFUs). ■ TRS: constituted by 3 sets of N buffer registers and a crossbar switch interconnect. ■ GMS: constituted by a set of buffers and logic circuitry.

  35. D3AS general architecture • Critical parameters in the RHS design: • $N_{MPFU}$: the number of MPFUs constituting the ARS; • $C_{MPFU}$: the logical and functional complexity of the MPFUs; • $INT_{TRS}$: the type of interconnect for the TRS. • The number of MPFUs implementable on a VLSI device depends on: • interconnect complexity; • logical and functional complexity of the MPFU.

  36. D3AS general architecture • RHS/D3AS fundamental building block: Many-core Dataflow Processor (MDP) • A many-core chip replicating the D3AS general architecture with n MPFUs interconnected via a non-blocking crossbar switch network.

  37. D3AS general architecture • Architecture with globally pure dataflow model • N: number of graph actors • n: number of MPFUs per MDP • The RHS is configured by interconnecting K = N/n MDPs with a 2nd-level non-blocking crossbar switch interconnection network.

  38. D3AS general architecture with hybrid dataflow model • N > n • The graph is partitioned into subgraphs, and the RHS is configured by interconnecting m = N/n MDPs with a 2nd-level message-passing interconnection network. • Dataflow graph edges between subgraphs mapped onto different MDPs are virtualized by messages routed through the network. • Communicating Dataflow Processes (CDP)
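
The hybrid configuration can be illustrated with a small Python sketch that counts the MDPs needed and lists the graph edges that would have to become messages. Several things here are assumptions made for the example: the ceiling used when N is not a multiple of n, the naive assignment of actor i to MDP i // n (the slides describe mapping as NP-hard and do not give their strategy), and the function names themselves.

```python
import math


def mdp_count(num_actors, mpfus_per_mdp):
    """Number of MDPs for N actors and n MPFUs per MDP (N/n, rounded up by assumption)."""
    return math.ceil(num_actors / mpfus_per_mdp)


def cross_mdp_edges(edges, mpfus_per_mdp):
    """Edges whose endpoints land on different MDPs under a naive contiguous assignment."""
    return [
        (src, dst)
        for src, dst in edges
        if src // mpfus_per_mdp != dst // mpfus_per_mdp
    ]


# Made-up graph: 8 actors numbered 0..7, MDPs with 4 MPFUs each.
edges = [(0, 1), (1, 4), (4, 5), (3, 7)]
needed = mdp_count(8, 4)
messages = cross_mdp_edges(edges, 4)
```

Edges that stay inside one MDP are served by its crossbar, while the edges returned by `cross_mdp_edges` are the ones that would be virtualized as messages on the second-level network.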

  39. D3AS general architecture demonstrator GIDEL board

  42. Some results • Matrix Multiplication • Given two matrices A(n,n) and B(n,n), their product generates a matrix C(n,n) whose generic element is given by the following formula: $c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj}$, with i, j = 1…n

  43. Some results Matrix Multiplication • we used two values of n: n=32 and n=64

  44. Some results • Matrix Multiplication • we compared the performance of a platform-processor with that of an IA-32 Pentium IV • we measured performance in terms of CPI because our FPGA platform-processor executes an operation in 30 ns against 0.5 ns for the Pentium.

  45. Some results IA-32 Pentium IV vs D3AS

  46. Some results • Zeroes of a function (f = x*x + 3x - 1.75) • assembly code generated by compiling the C source code: 122 sequential assembly code lines
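
The slides do not say which root-finding method the C program used. Purely as an illustration of the task, the Python sketch below finds the zeroes of the same function by bisection; the method, the brackets, and the tolerance are all assumptions.

```python
def f(x):
    return x * x + 3 * x - 1.75


def bisect(func, lo, hi, tol=1e-12):
    """Find a zero of func in [lo, hi] by bisection; the endpoints must bracket a sign change."""
    f_lo = func(lo)
    if f_lo * func(hi) > 0:
        raise ValueError("func(lo) and func(hi) must have opposite signs")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = func(mid)
        if f_mid == 0:
            return mid
        if f_lo * f_mid < 0:
            hi = mid                  # the zero lies in the lower half
        else:
            lo, f_lo = mid, f_mid     # the zero lies in the upper half
    return (lo + hi) / 2


positive_root = bisect(f, 0.0, 1.0)
negative_root = bisect(f, -4.0, -3.0)
```

The two results can be checked against the closed form: the discriminant is 9 + 7 = 16, so the zeroes are (-3 ± 4) / 2, that is 0.5 and -3.5.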

  47. Some results • Zeroes of a function • our compiler generates a GDGT with only 28 micro-instructions organized in 12 sequential steps.

  48. Future work • To evaluate which applications perform better on the architecture with the globally pure dataflow model and which with the hybrid dataflow model. • How to generalize pipelining inside the MDP
