Fast Communication for Multi – Core SOPC

FastCommunication for Multi – Core SOPC Technion – Israel Institute of Technology Department of Electrical Engineering High Speed Digital Systems Lab Spring 2007 End Project Presentation Supervisor: Evgeny Fiksman Performed by: Moshe Bino Alex Tikh One year project

Table of Content Introduction Hardware Design Software Design Debug Process Results Future Research Table of Content

Problem statement Single CPU is reaching its technological limits, e.g. heat dissipation and cost/power ratio. Thus parallel computing evolved, utilizing multi core processor paradigm. Three major inter-communication techniques are: Shared memory Remote procedure calls. Message passing – (MPI) Introduction

Project Overview • Mesh topology NoC • Routing nodes • Leaf processor cores • MPI logically defines clusters • Comm • Rank • Cores amount is limited only by chip resources Introduction

The following components are to be implemented: Quad core system. NoC router (4 ports) and infrastructure for fast communication in multi-core system. Chosen MPI functions written in C. Software application demonstrating the advantages of a parallel system (written in C). Project goals Introduction

Block diagram • Multi – core IP’s • Bi - directional link FSL • Local memory • Main core connected to I/O • Multi - clock domain • System on programmable chip implemented on FPGA Introduction

Constrains: FPGA (V2P) maximum clock frequency 400MHz. MicroBlaze (MB) core maximum frequency 100MHz. Router maximum frequency 200MHz Processors Memory size 64kbyte. (code + data). Processor to FSL access time - 3 clock cycles. Maximum FSL buffer depth is 128 - equals 0.5kbyte. Interrupt handle time - 20 clock cycles (no interrupts nesting). Measurement system: Router works at 100MHz frequency. MB works at 25MHz frequency. FSL depth is 64 word - 0.25kbyte Router is designed for relatively small messages – max. 1kbyte due to processors & FPGA chip memory size. System specifications Introduction

Router Implementation Hardware Design

CROSS – BAR • Two main units: • Permission Unit • Port FSM • Time limited • Round Robin arbiter • Port to Port & broadcasting • Smart Connectivity • R – R • R - Core • Modular design Hardware Design

MicroBlaze#2 Application Network Data Physical System software Layers • Application • MPI functions interface • Network • hardware independent implementation • Data • relies on message structure • Physical • designed for FSL bus MicroBlaze#1 MPI Application Network Data Physical Design modularity in hardware and software Software design

MPI Functions set Software design

Debug – Simulation • The Test Bench write messages to the FSL pipe (MB output side) and reads messages from the pipe (MB input side). • Signals can also be viewed in ModelSim Debug Process

Debug process – Real time • Software debug using 2 MB system. • Debug mainly done using printf function for plotting results to monitor trough UART. • Hardware debug was using chip scope application and LEDs for indication Debug Process

Example applicationMatrix - Vector multiplication Results Typical example of highly parallel application. Root processor broadcasts Vector. Selected Matrix Row sent by root to each processor. Each processor computes and returns its result. Computed results are combined into a vector by root processor.

Example applicationMatrix - Vector multiplication Router MPI MPI MPI MPI Results

Matrix - Vector multiplication - results • For simple operations single processor is preferred Results * Time = ticks/clk frequency

Matrix - Vector multiplication - results • When the operation takes more time than the send and receive time the router becomes efficient Results

Router statistics • Transfer time theoretical limit is 8 clock cycle time • The limit is calculated from: • 3 clks for putfsl. • 2 clks for router read & write. • 3 clks for getfsl • total = 8 clks Results

Router statistics • Bcast takes more time then send. • The slope value (8) comes from transfer limit time. Results

Router statistics Results

Future Directions • Improve router performance ~400Mhz • Expand network to more than 4 processors Future research

QUESTIONS ?

Message payload • The Header consist of the fields: • The Tail consist of the fields: *Empty fields where left to allow network and functionality extensions. Introduction

Example 1 • At each time slot part of the message is send to it’s destination as long as the destination port is not busy. • When Port is busy the next requesting port is service (no delay). Hardware Design

Example 2 • If one port has no data (port 2) other ports are serviced by order. Hardware Design

Example 3 • Handling BCAST command and port arbitrating while 2 ports has the same destination. Hardware Design

MPI_Send: composes header and tail, and sends it with the message (body) Sending the message Software Design

Receiving the message Interrupt Vector: receives incoming messages, and stores them in suitable linked list Software Design

Return received message MPI_Recv: message details received from user. Looks for this message in linked list of already received messages Software Design

Fast Communication for Multi – Core SOPC

Fast Communication for Multi – Core SOPC

Presentation Transcript

communication powerpoint (modern) presentation content: 123

PPT Presentation

talk-ppt - PowerPoint Presentation

Multi-core computing

Multi-Core Systems

Multi-core architectures

マルチコア /Multi-Core

Multi-core Programming

Multi-core Programming

Multi-core Programming

Multi-core Programming

Multi-core Programming

Multi-Core Computing

Coincidence Detector on SOPC Final Presentation

Template for PowerPoint Presentation

Multi-Core Development

Fast Communication for Multi – Core SOPC

Kernel-assisted MPI Communication on Multi-core Clusters

Fast Communication

OS Implementation On SOPC Charcterization Presentation

OS Implementation On SOPC Midterm Presentation

Multi-page ppt