Technion – Israel Institute of Technology, Department of Electrical Engineering, High Speed Digital Systems Lab. Spring 2009. Implementing a NoMC on the Gidel platform: end-project presentation. Instructor: Evgeny Fiksman. Students: Meir Cohen, Daniel Marcovitch.
In the previous semester…
• We took the existing “router” and converted it to work on the Altera platform.
• We also prepared the system architecture and microarchitecture.
Problem definition:
• Implement a parallel processing system containing several NoCs, each chip containing several sub-networks of processors. The PC forms part of the network via PCI.
• Write an application that utilizes parallel processing.
• Measure system performance.
This semester…
• Implemented the HW modules needed for larger-scale routing:
  • Added a 5th port to all routers/switches
  • Fabric router
  • InterChip GW
  • PC GW
• Implemented asynchronous MPI commands (both for Nios and for PC)
• Wrote an example application that uses the 64 processors to solve a heat-transfer problem
• Measured system performance
Putting it all together – a general view of the topology
• Each local cluster has 4 processors.
• Each chip has 4 clusters (comms).
• The Gidel board has 4 chips – altogether 64 processors.
• The PC is also part of the network – switching between the 4 FPGAs is done in software, i.e. it forms a “virtual switch”.
New HW modules (1) – Fabric router
• In the “Local router”, forwarding is done by rank, i.e. rank = port.
• In the “Fabric router”, a forwarding table is implemented.
Routing tables
[Figure: routing levels chip / fabric / local; address fields include comm and rank]
• Local router:
  • Same comm – routing by rank.
  • Other comms – to the 5th port.
• Other routers:
  • Routing by comm/chip only.
  • The myComm/myChip entry is used for PC routing.
• Implemented using VHDL’s “generate” statement to reuse existing modules.
• A hex file is created for each router and loaded into ROM using a parameter.
• Grouping (i.e. sub-network prefixes) allows us to use a small routing table (only 8 entries).
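The two forwarding rules above can be sketched in a few lines of Python. This is only an illustration, not the VHDL implementation: the `Address` tuple, the function names, the table keys, and all port numbers in `EXAMPLE_TABLE` are assumptions made for the example. The reuse of the own-chip/own-comm entries for PC routing is not modeled.

```python
from typing import Dict, NamedTuple, Tuple

UPLINK_PORT = 4  # the 5th port (ports 0-3 are assumed to serve the four local ranks)


class Address(NamedTuple):
    chip: int
    comm: int
    rank: int


def local_route(me: Address, dst: Address) -> int:
    """Local router: same chip and comm -> port equals rank; otherwise the 5th port."""
    if (dst.chip, dst.comm) == (me.chip, me.comm):
        return dst.rank
    return UPLINK_PORT


def table_route(table: Dict[Tuple[str, int], int], my_chip: int, dst: Address) -> int:
    """Table-driven router: match on comm inside the own chip, on chip otherwise."""
    key = ("comm", dst.comm) if dst.chip == my_chip else ("chip", dst.chip)
    return table[key]


# Hypothetical 8-entry table for a router on chip 0 (port numbers are made up).
# The ("chip", 0) entry is never reached by table_route for this router; the
# slides say such own-chip/own-comm entries are reused for PC routing.
EXAMPLE_TABLE = {
    ("comm", 0): 0, ("comm", 1): 1, ("comm", 2): 2, ("comm", 3): 3,
    ("chip", 0): 4, ("chip", 1): 4, ("chip", 2): 4, ("chip", 3): 4,
}
```

Because routers above the local level only look at the comm or chip prefix, 4 comm entries plus 4 chip entries are enough, which is consistent with the 8-entry table size quoted above.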
New HW modules (2) – IC GW
[Figure: IC GW block diagram – FIFO, credit counter (inc/dec), local and remote buffers, local and remote credit release]
• Primary/Secondary indicates connectivity rather than implementation.
• The inter-chip interface has increased latency – we use buffers and credits to ensure there is no FIFO overrun.
• The credit counter is initialized with the FIFO size (i.e. 32) as the initial number of credits.
• Since the FIFO size is greater than the end-to-end latency, the block gives 100% throughput.
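The credit mechanism can be illustrated with a small cycle-based simulation. This is a sketch under stated assumptions, not a model of the actual gateway: the link latency is assumed equal in both directions, the receiver is assumed to drain one word per cycle, and the function name and parameters are hypothetical.

```python
from collections import deque


def simulate(fifo_size: int, link_latency: int, cycles: int):
    """Return (throughput in words/cycle, peak remote FIFO occupancy)."""
    credits = fifo_size          # credit counter starts at the FIFO size
    data_in_flight = deque()     # arrival cycle of each word on the link
    credits_in_flight = deque()  # arrival cycle of each released credit
    fifo = 0                     # words currently held in the remote FIFO
    peak = 0
    delivered = 0
    for t in range(cycles):
        # Released credits reach the sender: counter is incremented.
        while credits_in_flight and credits_in_flight[0] <= t:
            credits_in_flight.popleft()
            credits += 1
        # Words reach the remote FIFO.
        while data_in_flight and data_in_flight[0] <= t:
            data_in_flight.popleft()
            fifo += 1
        peak = max(peak, fifo)
        # Receiver pops one word per cycle and releases a credit.
        if fifo > 0:
            fifo -= 1
            delivered += 1
            credits_in_flight.append(t + link_latency)
        # Sender pushes one word per cycle while it holds a credit (decrement).
        if credits > 0:
            credits -= 1
            data_in_flight.append(t + link_latency)
    return delivered / cycles, peak


full_rate, _ = simulate(fifo_size=32, link_latency=10, cycles=10_000)
starved_rate, _ = simulate(fifo_size=8, link_latency=10, cycles=10_000)
```

With 32 credits and a 20-cycle round trip the sender never runs out of credits, so the link stays busy every cycle. With only 8 credits it stalls waiting for credits and the rate drops to about 0.4 words per cycle. In both cases the remote FIFO occupancy can never exceed the number of credits, which is what rules out an overrun.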
New HW modules (2) – IC routing
• IC connectivity itself uses Gidel’s fastest buses:
  1. Neighbour buses between 1-2, 2-3, 3-4
  2. Main bus between 1-4
• Both buses are wide enough to support bi-directional traffic.
• Interface: 32-bit data, ctrl, credit_release, push/pop [total: 35 bits x 2]
New HW modules (3) – PC GW
[Figure labels: ToPCGw, FromPCGw]
• Needed for three reasons:
  1. FromPCGw adds the start/finish “ctrl” signal (it parses the MPI header for the “size” field).
  2. Handling PCI idiosyncrasies (minimum message length).
  3. Using Gidel’s simple FIFO protocol (req/ack) rather than Altera’s FIFO protocol (push/pop).
Testing and debug
• Since the project is multi-layered, debugging can be split into several types:
  • HW (component) issues
  • Connectivity
  • SW (Nios/PC)
• Component testing
  • Small testbenches, each encompassing a single block.
• Connectivity
  • Before running the main application, we ran a connectivity application to check that all Nios processors can communicate with each other.
  • Made a Specman-E simulation emulating the router’s operation while loading and parsing the real hex files.
Testing and debug
• SW/Nios
  • ModelSim was used for logic simulation.
  • Since the system is large and debugging is multi-layered (the application being debugged runs on the Nios processors), we added special debug registers.
  • Each Nios writes to these registers (PIO – parallel I/O) during the application run, publishing its “state”.
  • In addition, debug registers were attached to the main FIFOs to indicate traffic flow (performance counters).
  • When running on the chip itself, these registers are sampled and displayed during the application to indicate the system state.
[Figure labels: PIO, FIFO counters]
Application
• Parallel Jacobi algorithm for approximating the solution of the heat-transfer equation.
• The matrix is distributed among the CPUs.
• CPUs communicate with their neighbors.
• Uses computation-communication overlapping.
• Managed by the host PC.
• Each iteration: compute interior, send/receive boundary, compute boundary.
[Figure: matrix distribution]
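The iteration structure (compute interior, exchange boundary, compute boundary) can be sketched in plain Python. This is a single-process emulation, not the MPI code that ran on the Nios processors: it assumes a 5-point averaging stencil with fixed edge values and a row-wise split of the matrix, and every function name here is hypothetical. The copies taken at the start of `parallel_step` stand in for the asynchronous sends.

```python
def stencil_row(blk, i):
    """Jacobi update of row i from the old values in blk (end columns stay fixed)."""
    old, m = blk[i], len(blk[i])
    inner = [0.25 * (blk[i - 1][j] + blk[i + 1][j] + old[j - 1] + old[j + 1])
             for j in range(1, m - 1)]
    return [old[0]] + inner + [old[m - 1]]


def jacobi_step(grid):
    """Sequential reference: update every interior row of the full grid."""
    return ([grid[0][:]]
            + [stencil_row(grid, i) for i in range(1, len(grid) - 1)]
            + [grid[-1][:]])


def split(grid, p):
    """Give each of p workers a strip of interior rows plus one halo row per side."""
    interior = len(grid) - 2
    b = [1 + k * interior // p for k in range(p + 1)]
    return [[row[:] for row in grid[b[k] - 1:b[k + 1] + 1]] for k in range(p)]


def gather(grid, blocks):
    """Reassemble the full grid from the rows owned by each worker."""
    out = [grid[0][:]]
    for blk in blocks:
        out.extend(row[:] for row in blk[1:-1])
    out.append(grid[-1][:])
    return out


def parallel_step(blocks):
    p = len(blocks)
    # "Send": every worker posts its first and last owned rows (old values).
    first = [blk[1][:] for blk in blocks]
    last = [blk[-2][:] for blk in blocks]
    new_blocks = []
    for k, blk in enumerate(blocks):
        new = [row[:] for row in blk]
        # Compute interior rows: they need no halo data, so on real hardware
        # this work overlaps with the boundary transfer.
        for i in range(2, len(blk) - 2):
            new[i] = stencil_row(blk, i)
        # "Receive": refresh halo rows with the neighbors' old boundary rows.
        if k > 0:
            blk[0] = last[k - 1]
        if k < p - 1:
            blk[-1] = first[k + 1]
        # Compute the boundary rows, which depend on the halo rows.
        for i in {1, len(blk) - 2}:
            new[i] = stencil_row(blk, i)
        new[0], new[-1] = blk[0][:], blk[-1][:]
        new_blocks.append(new)
    return new_blocks
```

A quick check is to run both versions for the same number of steps from the same grid (for example, a hot top edge and zeros elsewhere) and compare `gather(grid, blocks)` with the sequential result. The split requires at least one interior row per worker.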
Performance – application time vs number of iterations
• Measurements were done on a dual-core Pentium processor running at 2.4 GHz.
• The constant offset indicates PCI latency.
• Running time is linear in the number of iterations, as expected:
  time = #Iterations * (communication + calculation) + PCI offset
Performance – throughput vs injection rate
• For a low injection rate, routing isn’t a bottleneck, so the output rate is almost identical to the input rate.
• As the injection rate increases, the router becomes the bottleneck.
• Once the maximum throughput of the router is reached, throughput is constant.
Performance – simplified model – delay (congestion)
• D(p) – delay as a function of the number of packets in the system
• R – average router delay
• L – system latency
• λ – injection rate
• D(p) = R∙p + L
• p = λ∙D(p) [Little’s law]
• Solving the two equations: p = λ∙L/(1 − λ∙R), so D = L/(1 − λ∙R)
• Parameters: R = 50, L = 80
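Substituting Little’s law p = λ∙D into D = R∙p + L gives D∙(1 − λ∙R) = L, i.e. D = L/(1 − λ∙R), which diverges as λ approaches 1/R. The short sketch below evaluates this closed form with the slide’s values R = 50 and L = 80. The function names are hypothetical, and treating λ∙R ≥ 1 as infinite delay is an assumption of this idealized model (it ignores the finite FIFO sizes of the real system).

```python
import math


def packet_delay(lam: float, R: float = 50.0, L: float = 80.0) -> float:
    """Steady-state delay D = L / (1 - lam*R); infinite at or past saturation."""
    load = lam * R
    if load >= 1.0:
        return math.inf
    return L / (1.0 - load)


def packets_in_system(lam: float, R: float = 50.0, L: float = 80.0) -> float:
    """Little's law: p = lam * D."""
    return lam * packet_delay(lam, R, L)
```

With these parameters the model saturates at λ = 1/R = 0.02. At half that rate (λ = 0.01) the delay is already twice the unloaded latency L.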
Performance – packet delay vs number of injection stubs
• With few injecting stubs there is almost no congestion, so delay is constant.
• As we approach maximum throughput, congestion increases and delay increases.
• For a very high injection rate we approach system saturation (since FIFO sizes are finite (32 entries), there is a maximum number of packets in the system at any given moment).
Performance – packet delay vs injection rate
• For a low injection rate there is almost no congestion, so delay is constant.
• We again see an exponential increase which levels off due to system saturation.
Summary/conclusions:
• The original router was robust and easily expanded to support a 5th port and routing tables.
• Debugging software written for this system posed a serious challenge and required a certain measure of innovation.
• Despite being on chip, communication between processors is still a significant factor. Therefore, overall system performance will improve as the communication/calculation ratio decreases.
• For similar reasons, the network can be better used if locality between nodes is exploited.
Next steps:
• Compare topologies (mesh / fat tree).
• Develop software to automatically create topologies out of building blocks.
• Simplify the router and increase throughput.
Questions