HW-SW Co-Design Framework for Parallel Distributed Computing on NoC-based MPSoC architectures
MPhil/Master dissertation presented by Jaume Joven Murillo and supervised by Dr. Jordi Carrabina Bordoll
Presentation outline
• Introduction
• Basic concepts & state of the art in NoCs and MPSoCs
• Design framework and working methodology
• HW-SW NoC-based MPSoC implementation
• Experimental results
• Conclusions & future work
1. Introduction
1.1 - Introduction & research project analysis
1.2 - Objectives of the research project
Introduction
• The continuous evolution of technology (Moore’s law) means that every IC can contain an ever larger number of transistors (a trend projected until 2020 by the SIA roadmap)
• Productivity gap
• Adopted solutions
  • Component reuse (IP cores)
  • Soft-core processors
  • HW-SW co-design
• Novel design methodologies
  • Communication-centric
• Novel on-chip paradigms
  • Networks-on-Chip (NoCs)
• System-level languages
  • SystemC™, UML, …
• Develop complex ICs with billions of transistors in the near future
  • Multiprocessor System-on-Chip (MPSoC) / multi-cores / chip multiprocessors (CMP)
  • Sea of tiles (IP cores) interconnected by a Network-on-Chip
Objectives of the research project
• Develop a HW-SW co-design framework for parallel distributed computing on-chip, applying platform-based design concepts
• Perform a co-evolution strategy of two concurrent phases (HW-SW)
  • Hardware architecture
    • Scalable distributed-memory NoC-based MPSoC (NUMA)
  • Software framework
    • Software drivers
    • embedded Message Passing Interface (eMPI)
• Run benchmarks & test applications
• Explore concurrency and parallelism in on-chip environments
2. Basic concepts and state of the art in NoCs and MPSoCs
2.1 - On-chip communication schemes
2.2 - Basic concepts on NoCs
2.3 - NoC topologies
2.4 - Switching modes & routing schemes
2.5 - Flow control & micro-network stack
2.6 - State of the art in NoCs/MPSoCs
On-chip communication architectures
• Point-to-point
  • Fixed dedicated wires
  • Not flexible, not shared
  • No reusability
• Bus-based interconnection (OCB)
  • Shared communication infrastructure
  • Multi-level, hierarchical or segmented buses
  • Bus becomes a bottleneck
• On-chip network (NoC)
  • Distributed nature
  • Maximum flexibility & scalability
  • Exploits reusability, parallel operations/transactions
  • Regular geometry
  • Predictable layout and performance
  • Best testability & verification time
  • Must guarantee a certain QoS
Basic concepts on NoCs
• Tile
  • Computational nodes
• Router/Switch
  • Communication nodes
  • Switching and routing strategy
• Network adapter (NA, NI, NIC)
  • Decouples computation from communication
  • Adapts network & tile clock domains (GALS)
• Links
  • Dedicated P2P communication channels
  • Flow control protocol (handshake or credit-based)
• NoC-based systems
  • High degree of composition and traffic diversity
  • Good floorplanning & minimal buffering are desirable
• Conventional/traditional networks
  • Homogeneous and coarse-grained
NoC topologies
• Typical of multiprocessor systems, but now on a chip
• Regular
  • Predictable in terms of
    • Power consumption
    • Performance (bandwidth, latency…)
    • Area usage
  • Good floorplanning
• Non-regular
  • Mixing regular topologies
  • Mesh-Torus, Ring-Mesh, Ring-Hypercube
• Direct
  • At least one tile attached to each node
• Indirect
  • A subset of nodes is not connected to any core
• Topology selection is a trade-off between
  • Network complexity or on-chip area costs
  • Communication requirements or network performance
Switching modes & routing schemes
• Circuit switching
  • Involves establishing & releasing a circuit between source and destination
  • Buffer-less switching scheme
• Packet switching
  • Forwards the data to the next hop
  • Buffering is mandatory
• Different packet switching modes
  • Store-and-forward
    • Stalls at two nodes and the link between them
  • Wormhole
    • Combines packet switching + circuit switching
    • Reduces buffer size
    • Stalls at all nodes and links spanned by the packet
  • Virtual cut-through
    • Next hop must store the whole packet
    • Stalls at the local node
• Buffering
  • Buffer size (width, depth)
  • Location in the router
  • Shared or distributed
  • Affects power consumption & area usage
Switching modes & routing schemes
• Routing schemes
  • Deterministic
    • Path determined by the source & destination addresses
    • Easy to implement
    • Not optimal under congestion
  • Adaptive
    • Path decided on a per-hop basis
    • Complex to implement
    • Must be deadlock- and livelock-free
    • Offers great benefits under congestion
[Figure: XY routing]
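The deterministic scheme above can be illustrated with XY routing on a 2D mesh. The following is a minimal sketch, not the dissertation's router logic: the names `port_t`, `xy_route` and `xy_hops` are hypothetical, and it assumes X grows toward east and Y grows toward north.

```c
#include <assert.h>

/* Output ports of a 2D-mesh router (illustrative names, not from the slides). */
typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Deterministic XY routing: remove the X offset first, then the Y offset.
 * The decision depends only on the current and destination coordinates. */
port_t xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    assert(cur_x >= 0 && cur_y >= 0 && dst_x >= 0 && dst_y >= 0);
    if (dst_x > cur_x) return PORT_EAST;
    if (dst_x < cur_x) return PORT_WEST;
    if (dst_y > cur_y) return PORT_NORTH;
    if (dst_y < cur_y) return PORT_SOUTH;
    return PORT_LOCAL; /* packet has reached its destination tile */
}

/* Walk the path hop by hop and count the router-to-router hops. */
int xy_hops(int src_x, int src_y, int dst_x, int dst_y)
{
    int hops = 0;
    port_t p;
    while ((p = xy_route(src_x, src_y, dst_x, dst_y)) != PORT_LOCAL) {
        switch (p) {
        case PORT_EAST:  src_x++; break;
        case PORT_WEST:  src_x--; break;
        case PORT_NORTH: src_y++; break;
        default:         src_y--; break; /* PORT_SOUTH */
        }
        hops++;
    }
    return hops;
}
```

Because every packet finishes its X movement before turning into Y, the path is always minimal and the same for a given source and destination pair, which is what makes the scheme easy to implement but unable to route around congestion.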
Flow control & micro-network stack
• Flow control protocol (ensures the correct transport of packets)
  • Handshake
    • Request-acknowledge signals (req, ack/nAck)
    • Simpler and cheaper than credit-based
  • Credit-based
    • All network components keep counters for the available buffer space
    • Data received: counter-- | Data sent: counter++ | if counter == 0, the buffer is full
• Network stack layers
  • Transport
    • Network adapter has to pack/unpack messages into network-layer packets
  • Network
    • Where & how a packet is transmitted
  • Data-link
    • Protocol to transmit a flit/phit
  • Physical
    • Number & length of wires
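The credit-based counter rule on this slide can be sketched in software as follows. This is only an illustration of the bookkeeping, under the assumption that the counter tracks free slots in a component's input buffer; the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Free-slot counter for one buffer (hypothetical structure). */
typedef struct {
    int credits;   /* free slots currently available */
    int capacity;  /* total buffer depth */
} credit_counter_t;

void credit_init(credit_counter_t *c, int capacity)
{
    assert(capacity > 0);
    c->credits = capacity;
    c->capacity = capacity;
}

/* counter == 0 means the buffer is full. */
bool credit_buffer_full(const credit_counter_t *c)
{
    return c->credits == 0;
}

/* Data received: one slot is consumed (counter--). Refused when full. */
bool credit_on_receive(credit_counter_t *c)
{
    if (credit_buffer_full(c))
        return false;
    c->credits--;
    return true;
}

/* Data sent onward: one slot is freed (counter++). */
void credit_on_send(credit_counter_t *c)
{
    assert(c->credits < c->capacity);
    c->credits++;
}
```

In hardware the same counter would typically live in the upstream component, which stops transmitting when it reaches zero; the sketch only shows the counting rule itself.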
State of the art in NoCs/MPSoCs
• NoCs have been an emerging, hot topic in recent years
• Research at all stack levels
  • System/application level
    • Design methodologies, co-exploration
    • Programming models & OS support
  • Network adapters
  • Network architecture
  • Link level
• Research on MPSoCs
  • HW-SW interfaces
  • Implementation of parallel programming models
    • Shared memory or message passing
  • ccNUMA MPSoC architecture using NIOS-II
  • MPSoC using segmented buses (HIBI)
3. Design framework and working methodology
3.1 - HW-SW co-design flow
3.2 - Proposed NoC-based MPSoC architecture
3.3 - Prototyping platform
HW-SW co-design flow
• System specification
• Architecture exploration
  • µP, VLIW, DSP…
  • NoC routers, buses
  • NIC interfaces
• Architecture design and HW-SW co-design
  • RTL architecture
  • IP core integration (soft-cores)
  • Software design
    • Benchmarks/applications
    • embedded MPI (eMPI)
    • NIC driver
• Integration and system verification
  • SystemC™
  • On-chip co-debugging
• Functional prototype
[Tool labels from the flow figure: Quartus II + SOPC; Microsoft Visual Studio & Eclipse IDE for NIOS-II; ModelSim, GTKWave, SignalTap; Synplify or Quartus II]
Proposed NoC-based MPSoC architecture
• Distributed-memory NoC-based MPSoC components
  • NoC communication architecture
  • Soft/hard IP core processors (Pi)
  • Distributed memory subsystem (Mi)
  • Network Interface Controller (NICi)
  • Driver for the Network Interface Controller (NIC driver)
  • embedded Message Passing Interface (eMPI)
Proposed NoC-based MPSoC architecture
• NoC topology
  • 2D-Mesh (regular, predictable)
• XY routing
  • Deterministic, minimal & deadlock-free
• Switching mode
  • Ephemeral circuit switching
  • Store & forward
• Flow control
  • 4-phase handshake
• Tile composition
  • NIOS-II soft-core processor
  • On-chip RAM or SSRAM controller
  • NIC interface to the NoC
  • Timer (IRQs, multi-threaded)
  • UART, JTAG, performance counter
Prototyping platform
• Stratix® EP1S25 DSP prototyping/development board
• Altera® FPGA Stratix EP1S25F780C5
  • Contains 25,660 LEs
  • Includes 1,944,576 bits of on-chip memory
    • 224 M512 RAM blocks (32x18b)
    • 138 M4K RAM blocks (128x36b)
    • 2 M-RAM blocks
  • 6 PLLs
  • 597 maximum user I/O pins
• Off-chip memory
  • 2 Mbytes of SSRAM configured as two independent banks
  • 32 Mbits of flash memory
• Other I/O
  • LEDs, RS232, buttons, switches, 7-segment displays
4. HW-SW NoC-based MPSoC implementation
4.1 - NoC-based MPSoC block diagram
4.2 - Communication channel
4.3 - Design of the Network Interface Controller
4.4 - Router design
4.5 - Software design
4.6 - Applications and benchmarks
NoC-based MPSoC block diagram
• Distributed-memory NoC-based MPSoC based on the NIOS-II soft-core processor
• Each NIOS-II Avalon-based tile is generated effortlessly through Quartus II + SOPC
• Our custom HW design
  • Implementation of flow control in each communication channel
  • Design of the Network Interface Controller
  • Design of the router
Communication channel
• Implements a full-duplex 4-phase handshake protocol
  • Between NIC and router, or between routers
  • 4-phase is unambiguous
  • Two independent, synchronous FSMs have been designed
• Packet definition
  • The definition of each subfield
    • XY address, message id, message length, sequence number, flags, priority…
  • Size of each subfield
  • Fixes the router and NIC implementation
• Our packet format for a 2D-Mesh
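To make the packet definition concrete, here is a minimal sketch of packing the listed subfields into a single header word. The slide's actual packet format figure is not available, so the 32-bit width, the field order and every field size below are assumptions chosen only for illustration, as are the names `pkt_header_t`, `header_pack` and `header_unpack`.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header layout (MSB to LSB), 32 bits in total:
 * dst_x:4 | dst_y:4 | msg_id:8 | length:8 | seq:4 | flags:2 | priority:2 */
typedef struct {
    uint8_t dst_x;     /* destination X address, 4 bits */
    uint8_t dst_y;     /* destination Y address, 4 bits */
    uint8_t msg_id;    /* message id, 8 bits */
    uint8_t length;    /* message length, 8 bits */
    uint8_t seq;       /* sequence number, 4 bits */
    uint8_t flags;     /* flags, 2 bits */
    uint8_t priority;  /* priority, 2 bits */
} pkt_header_t;

/* Pack the subfields into one header word. */
uint32_t header_pack(const pkt_header_t *h)
{
    assert(h->dst_x < 16 && h->dst_y < 16 && h->seq < 16);
    assert(h->flags < 4 && h->priority < 4);
    return ((uint32_t)h->dst_x  << 28) | ((uint32_t)h->dst_y  << 24) |
           ((uint32_t)h->msg_id << 16) | ((uint32_t)h->length << 8)  |
           ((uint32_t)h->seq    << 4)  | ((uint32_t)h->flags  << 2)  |
            (uint32_t)h->priority;
}

/* Recover the subfields from a header word. */
pkt_header_t header_unpack(uint32_t w)
{
    pkt_header_t h;
    h.dst_x    = (uint8_t)((w >> 28) & 0xFu);
    h.dst_y    = (uint8_t)((w >> 24) & 0xFu);
    h.msg_id   = (uint8_t)((w >> 16) & 0xFFu);
    h.length   = (uint8_t)((w >> 8)  & 0xFFu);
    h.seq      = (uint8_t)((w >> 4)  & 0xFu);
    h.flags    = (uint8_t)((w >> 2)  & 0x3u);
    h.priority = (uint8_t)(w & 0x3u);
    return h;
}
```

Whatever the real sizes are, the point made on the slide holds: once the subfield widths are chosen, both the router (which reads the XY address) and the NIC (which builds and parses headers) are tied to that layout.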
Design of the Network Interface Controller
• NIC: interface between the tiles and routers of our NoC
  • Decouples the tile’s computation from the NoC’s communication infrastructure
  • Key component for achieving a good packet injection rate into the NoC
  • Builds flits/packets
• Bus peripheral (slave)
  • Polling or IRQs
  • Register memory mappings
  • N+1 bits of addressable bus space
• Custom instruction (CI-based NIC)
  • Attached to the processor datapath
  • Is neither master nor slave
Router design
• Circuit switching
  • Ephemeral circuit switching
  • Two latency cycles
    • One for XY routing
    • Another for PathSwitchMatrix
• XY routing
  • Generates the signals toNorth, toEast, …, toLocal that indicate where the packet will be forwarded
• PathSwitchMatrix
  • Establishes/releases the output communication channel
  • Selects the output according to the request received from MeshXYRouting
Router design
• Packet switching
  • Store and forward
  • Full or shared/unified output queue
  • Now, the latency to traverse the router depends on:
    • FIFO capacity (depth)
    • Output queue policies
      • RR, CBQ, priority queues…
• XY routing
  • Essentially the same as before, but without taking into account the circuitEstablished signals
• PathSwitchMatrix
  • Saves the incoming packet in the FIFO
  • Transmits the packet from the FIFO to the next hop
  • A FIFO controller is needed to perform the 4-phase handshake protocol
  • FIFOs should be mapped to on-chip RAM or implemented using registers
Software design
• HW-SW platform stack view of our distributed-memory NIOS-II-based MPSoC with 2D-Mesh interconnection strategy
• Software components
  • NIC driver: low-level communication API
  • eMPI: high-level communication API for message passing
• Optionally, an operating system (OS) might be included between the HdS (“drivers”) and the high-level communication APIs
Software design
• The NIC software driver contains 3 basic functions
• Interacts transparently with a given NIC component, exploiting all HW capabilities
    volatile int *NIC = (volatile int *) (NIC_BASE);
    volatile int *NIC_TX = (volatile int *) (NIC_BASE + 0x4);
• Status register masks
  • 0x1 dataPending
  • 0x2 txBusy
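A polling-style driver built on these registers might look like the sketch below. It is not the dissertation's driver: the function names are hypothetical, it assumes that the word at `NIC_BASE` is the status register, and it takes the register base as a parameter (rather than the fixed `NIC_BASE`) so the logic can be exercised against a plain array. It also uses `uint32_t` where the slide uses `int`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Status register masks, as listed on the slide. */
#define NIC_STATUS_DATA_PENDING 0x1u
#define NIC_STATUS_TX_BUSY      0x2u

/* Word offsets from the NIC base: +0x0 (assumed status), +0x4 (TX). */
enum { NIC_REG_STATUS = 0, NIC_REG_TX = 1 };

/* True when the NIC reports received data waiting to be read. */
bool nic_data_pending(const volatile uint32_t *nic)
{
    assert(nic != 0);
    return (nic[NIC_REG_STATUS] & NIC_STATUS_DATA_PENDING) != 0;
}

/* Non-blocking send: writes one word to the TX register unless the
 * transmitter is busy. Returns true if the word was accepted. */
bool nic_try_send(volatile uint32_t *nic, uint32_t word)
{
    assert(nic != 0);
    if (nic[NIC_REG_STATUS] & NIC_STATUS_TX_BUSY)
        return false;
    nic[NIC_REG_TX] = word;
    return true;
}
```

On the target, a caller would pass `(volatile uint32_t *)NIC_BASE` and retry `nic_try_send` in a loop, or react to an IRQ instead of polling.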
Software design
• The eMPI software API will be our high-level design language
  • Implements message passing over our on-chip network
• Steps to create our eMPI
  • Select a minimal working subset of standard MPI functions
    • MPI_Init(), MPI_Finalize(), MPI_Comm_size(), MPI_Comm_rank()
    • MPI_Send(), MPI_Recv()
  • Porting process from the de facto standard MPI to our on-chip network
• Lightweight message passing interface with a small memory footprint (~15-20 KB)
Applications and benchmarks
• The software framework lets us run parallel applications on the hardware architecture
• All applications and benchmarks have been implemented using the NIC driver instead of the eMPI software API
• COMMS1 & COMMS2
  • Ping-pong benchmarks
Applications and benchmarks
• Parallelization of the Mandelbrot set
  • Iterative loop using complex numbers
  • Complex numbers are C = a + bi (a, b are C/C++ float or double)
  • Ideal for a message-passing parallelization
• Mandelbrot set: eMPI function calls
[Figure labels: Processor0, Processor1, Processor2, Processor3]
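The iterative loop and one possible way of dividing the work can be sketched as follows. The slides do not say how the image was partitioned among the four processors, so the block-of-rows split in `rows_for_rank` is an assumption, and both function names are hypothetical.

```c
#include <assert.h>

/* Escape-time iteration for C = a + bi: z starts at 0 and is updated as
 * z = z*z + C until |z| > 2 or max_iter iterations have been done. */
int mandel_iters(double a, double b, int max_iter)
{
    double x = 0.0, y = 0.0;
    int n = 0;
    while (n < max_iter && x * x + y * y <= 4.0) {
        double x_next = x * x - y * y + a;
        y = 2.0 * x * y + b;
        x = x_next;
        n++;
    }
    return n;
}

/* Assumed work split: contiguous blocks of image rows per processor.
 * Rank r computes rows [*first, *first + *count). */
void rows_for_rank(int rows, int nprocs, int rank, int *first, int *count)
{
    int base, extra;
    assert(rows >= 0 && nprocs > 0 && rank >= 0 && rank < nprocs);
    base = rows / nprocs;   /* rows every rank gets */
    extra = rows % nprocs;  /* first 'extra' ranks get one more row */
    *count = base + (rank < extra ? 1 : 0);
    *first = rank * base + (rank < extra ? rank : extra);
}
```

Each pixel is computed independently of all others, so the only communication needed is sending the finished rows back to one processor, which is why the problem maps well onto message passing.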
5. Experimental results
5.1 - Hardware costs: area usage
5.2 - Hardware costs: area and power usage
5.3 - Software framework requirements
5.4 - On-chip network: throughput and bandwidth
5.5 - Application results
5.6 - Comparative results
HW costs: area usage
• Router comparison between our Ephemeral Circuit Switching and our Packet Switching with unified/shared output queue
• On a 2D-Mesh, each router has between 3 and 5 ports
• The Ephemeral Circuit Switching router is between 2.5 and 3.8 times smaller than our Packet Switching router with unified/shared output queue
HW costs: area usage
• Evolution of the NxN 2D-Mesh NoC-based MPSoC
• Ratio of HW resources
  • CS: 20% comm. / 80% comp.
  • PS: 45% comm. / 55% comp.
• Ephemeral circuit switching is a low-cost architecture
  • Area resources
  • On-chip memory requirements
[Figures: logic elements (LEs) vs. NxN 2D-Mesh size, for Ephemeral Circuit Switching and for Packet Switching (store and forward)]
HW costs: area and power usage
• 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching
  • Does not use any on-chip memory
  • Communication infrastructure (15%) is extremely small compared to the computational components (85%)
• HW resources distribution
• Running at 20 MHz we can achieve around 60 DMIPS
• Overall system metrics
  • 49.65 mW/MHz
  • 3 DMIPS/MHz
• Dynamic power usage
  • 993.31 mW
  • Static: 548.39 mW
  • Dynamic: 442.92 mW
  • The NoC accounts for only 0.5%
Software framework requirements
• Each processor needs its own RAM
  • Distributed-memory architecture
• At least 64 KB of RAM per processor
  • To load the software framework
  • Application data and algorithm
• On-chip FPGA memory resources
  • High throughput (few cycles to access)
  • Low capacity (~KB)
• External SSRAM available on the prototyping board
  • Low throughput (many cycles to access)
  • Large capacity (~MB)
• Trade-off between capacity and throughput
On-chip network: throughput & bandwidth
• 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching
• Maximum channel bandwidth is about 168.84 Mbps at 63.24 MHz
• Bandwidth decreases with the number of hops (end-to-end flow control)
Application results
• Test of the parallelization of the Mandelbrot set on several architectures
  • Sequential execution on a simple NIOS-II monoprocessor
  • Parallel execution on a dual-core NIOS-II architecture
  • Parallel execution on a 2x2 Mesh NoC-based MPSoC with Ephemeral Circuit Switching
[Figure labels: Speedup ~4x, Speedup ~2x]
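For reference, the speedup figures can be read against the usual definitions of speedup and parallel efficiency. The mapping below is an assumption, since the slide text does not state which label belongs to which architecture: it takes ~2x for the dual-core system (p = 2) and ~4x for the 2x2 mesh (p = 4).

```latex
% T_1: sequential execution time, T_p: execution time on p processors
S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}

% Under the assumed mapping of the reported speedups:
E_2 \approx \frac{2}{2} = 1, \qquad E_4 \approx \frac{4}{4} = 1
```

Under that reading, both parallel configurations scale close to linearly, which is consistent with the Mandelbrot set needing very little communication between processors.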
Comparative results
6. Conclusions & future work
6.1 - Conclusions
6.2 - Future work
Conclusions
• I have proposed a complete HW-SW framework for distributed-memory NoC-based MPSoC architectures
• eMPI is a viable solution for on-chip parallelism using message passing
• The methodology has been formalized as a HW-SW co-design flow
  • Complete system-level design tool chain
  • Validity tested on a physical platform (FPGA)
  • Methodology is also valid for ASIC development
• This research work lets us effortlessly perform distributed parallel computing on a chip
• Useful parallel on-chip platform for many high-performance computing and “low power” emerging applications
  • Multimedia applications
  • Smart cams
  • Software-defined radio
• Lack of verification and support tools to create complex MPSoCs
Future work
• Long term
  • Extend this architecture to implement heterogeneous systems
  • Extend this architecture to a hybrid memory model (shared distributed memory system)
    • Large memory bank as a tile
    • Cache coherence
    • Mechanism to access the shared medium
  • It would be useful to have a complete SystemC™ simulation model
• Evolution of the Ephemeral Circuit Switching architecture
  • Build wormhole packet switching
  • Include a NIC queue in our Ephemeral Circuit Switching architecture
  • Change the fixed PriorityEncoder within PathSwitchMatrix
  • Test our architecture with a bus-slave NIC with IRQs
Future work
• Evolution of the software framework
  • Improve the NIC software driver functions
  • Extend the eMPI SW API with other useful collective communication functions
    • broadcast, scatter, gather, scan, reduce, allreduce, alltoall, reducescatter, barrier synchronization, …
• Application level
  • Take a real application
  • Coarse-grain or fine-grain parallelism
  • Run a GALS scheme with multiple clock domains
The end… Thank you!