
A Study of Cyclops64 Crossbar Architecture and Performance

A Study of Cyclops64 Crossbar Architecture and Performance. Yingping Zhang April, 2005. Overview. Background Architecture Of C64 Crossbar Performance Simulation Test Result Performance Analysis Conclusion Future Work. Background. 1. What is Cyclops64?


Presentation Transcript


  1. A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005

  2. Overview • Background • Architecture Of C64 Crossbar • Performance Simulation • Test Result • Performance Analysis • Conclusion • Future Work

  3. Background 1. What is Cyclops64? • Cyclops64 (C64), also called Blue Gene/C, is part of the IBM Blue Gene project. • It is a supercomputer based on a cellular architecture. Each chip consists of 75~80 custom-designed 64-bit processors. Each processor will have two thread units, two integer units, and a floating point unit. • C64 is expected to reach 1000 teraflops and to be one of the fastest supercomputers in the world. • The architecture was conceived by Cray award winner Monty Denneau. Verification testing and system software development are being done at our CAPSL group. 2. What is the project goal? To study the architecture and performance of the C64 on-chip interconnection network, the crossbar (part of verification testing).

  4. [Figure: Cyclops64 chip block diagram. Labeled blocks: C64 processors (TU and FP units), ICache, Crossbar, Host IF, Mickey tree, Gbit ethernet, Disk, DMA, FIFO (64-bit x 64), DDR2 SDRAM controller, DDR2 SDRAM DIMMs, ASw, FPGA, the other C64 chips. Processor# 80, ICache# 16.] Crossbar port assignment: • Ports 0-79 for C64 processors • Ports 80-83 for mpg ICache • Ports 84, 85 for Host IF • Ports 86-89 for DRAM controller • Ports 90-95 for ASw (a part of the 3D cube network) * The configuration pins are connected to all modules except DDR and Crossbar.

  5. Architecture Of C64 Crossbar • On-chip crossbar: • Provides communication inside a single chip • 96-way crossbar: • 96 input ports, 96 output ports. Each port can connect to any other port and to itself. • All communication among processors, ICaches, SRAM, DRAM, and ASwitches has to go through the crossbar • Pipelined crossbar: • 7 pipeline stages • When fully pipelined, each port outputs one packet per cycle • Peak bandwidth of the crossbar per cycle = number of ports * packet length
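The bandwidth relation on this slide can be written as a one-line helper. This is a minimal sketch: the port count of 96 comes from the slide, while the packet length used below is a placeholder chosen only for illustration and is not a figure from the presentation.

```python
def peak_bandwidth_bits_per_cycle(num_ports: int, packet_bits: int) -> int:
    """Peak bits delivered per cycle when every port outputs one packet per cycle."""
    return num_ports * packet_bits


NUM_PORTS = 96     # from the slide: 96-way crossbar
PACKET_BITS = 64   # hypothetical packet length, for illustration only

print(peak_bandwidth_bits_per_cycle(NUM_PORTS, PACKET_BITS))
```

Multiplying the result by a clock frequency would give bits per second; the clock rate is not stated in the slides, so it is left out here.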

  6. Crossbar Architecture [Figure: crossbar datapath drawn for three C64 MP cores; Port# 96. Blocks in the order they appear: FIFO, MUX, TUnitA, SrcSplit, SrcCtl, TarCtl, Sel, Arbiter, TarCombine, TUnitB. Signal labels: LC, Ws, Wr, Rs, Req, Ack. Bus widths shown: 96, 102+2, 102, 92, 10, 9, 3, 95.]

  7. Crossbar Architecture [Figure: the same datapath as on slide 6, annotated with the pipeline stage numbers 1 to 7.]

  8. Performance Simulation • Performance Measurement • Latency: The time required for a packet to traverse the network from source to destination • Throughput: The rate at which packets are delivered by the network for a particular traffic pattern • Workloads • Synthetic: Random Distributed vs Poisson Distributed • Application Driven: Hello_World, Matrix_Cthread, Laplace_Cthread, Heat_Cthread, Cnet_get_nb, Cnet_put_nb, Dev_Align, Dev_Reset • Simulators • Csim_crossbar • LAST (Both designed by Fei Chen at CAPSL)
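The two measurements defined on this slide can be computed from per-packet records, as in the sketch below. The record layout (injection cycle, delivery cycle) and the normalization of throughput per port per cycle are assumptions made for this example; they are not taken from Csim_crossbar or LAST.

```python
def average_latency(records):
    """Mean number of cycles between injection and delivery.

    records: list of (inject_cycle, deliver_cycle) pairs (hypothetical layout).
    """
    return sum(deliver - inject for inject, deliver in records) / len(records)


def throughput(records, num_ports, cycles):
    """Delivered packets per port per cycle over the measured window."""
    return len(records) / (num_ports * cycles)


# Three packets: two see the 7-cycle minimum, one is delayed by 3 cycles.
sample = [(0, 7), (1, 8), (2, 12)]
print(average_latency(sample))
print(throughput(sample, num_ports=3, cycles=10))
```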

  9. Parameters configuration • Arbitration schemes: Uniformly Random, Circular Matrix, Segmented Matrix, Fixed Priority • Workloads: • Application-driven benchmarks • Synthetic: • Temporal characteristics (1): Uniform Random, Poisson • Spatial distributions (2): Permutation (Neighbor & Tornado), Uniform Random (1) Describe the generation probability of messages over time (2) Determine the communication paths between the sources and destinations
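The synthetic workload parameters above can be sketched as small generator functions. The neighbor and tornado formulas follow the common textbook definitions of those permutations, and the "uniform random" temporal process is modeled as one independent injection decision per cycle; both are assumptions, since the slides do not say how the simulators define them.

```python
import math
import random


def neighbor_dest(src: int, num_ports: int) -> int:
    """Neighbor permutation: each source sends to the next port."""
    return (src + 1) % num_ports


def tornado_dest(src: int, num_ports: int) -> int:
    """Tornado permutation: each source sends almost halfway around."""
    return (src + math.ceil(num_ports / 2) - 1) % num_ports


def uniform_random_dest(num_ports: int, rng: random.Random) -> int:
    """Uniform random spatial distribution: any port is equally likely."""
    return rng.randrange(num_ports)


def bernoulli_inject(rate: float, rng: random.Random) -> bool:
    """Uniform random temporal process: inject with probability `rate` each cycle."""
    return rng.random() < rate


def poisson_arrivals(rate: float, rng: random.Random) -> int:
    """Number of packets generated in one cycle, Poisson with mean `rate`.

    Uses Knuth's multiplication method, which is adequate for small rates.
    """
    limit = math.exp(-rate)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(0)
print([tornado_dest(s, 96) for s in range(4)])
print([poisson_arrivals(0.5, rng) for _ in range(10)])
```

Both permutations map the 96 sources onto 96 distinct destinations, which is the property the later slides rely on when they report zero contention for permutation traffic.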

  10. Test Results: Latency - Synthetic Workloads • Latency of the Uniform Random pattern grows without bound when the injection rate exceeds 0.6 • Latency of Permutation Traffic stays constant at 7 cycles.

  11. Test Results: Throughput - Synthetic Workloads (Cont) • Uniform workload with permutation traffic pattern has throughput that grows linearly with the injection rate • The network is stable

  12. Test Results: Contention - Synthetic Workloads(Cont) • Permutation Traffic has zero contention • Uniform distribution has more contention than POISSON distribution

  13. Performance Analysis One - Synthetic Workloads • The minimum latency through the crossbar is 7 cycles. • The crossbar is a stable network because its throughput does not degrade beyond the saturation point. • Contention at the output ports delays message transfer; permutation traffic has zero contention. • Uniformly random workload with permutation traffic has the best performance. When the injection rate reaches 1.0, its throughput reaches 1.
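The zero-contention claim for permutation traffic can be illustrated with a toy model that only counts same-cycle collisions at the output ports. This is a sketch under simplifying assumptions (no buffering, no pipeline, one packet per source per cycle); it is not the Csim_crossbar or LAST model, and the function names are hypothetical.

```python
import random
from collections import Counter


def count_conflicts(dests):
    """Packets that must wait because another packet targets the same output."""
    return sum(n - 1 for n in Counter(dests).values())


def average_conflicts(num_ports, rate, cycles, pick_dest, seed=0):
    """Average number of conflicting packets per cycle for a destination rule."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        dests = [pick_dest(src, rng)
                 for src in range(num_ports)
                 if rng.random() < rate]
        total += count_conflicts(dests)
    return total / cycles


N = 96
permutation = average_conflicts(N, 1.0, 200, lambda s, rng: (s + 1) % N)
uniform = average_conflicts(N, 1.0, 200, lambda s, rng: rng.randrange(N))
print(permutation, uniform)
```

Because a permutation gives every source a distinct destination, no output port ever receives two packets in the same cycle, whereas uniformly random destinations collide regularly at full injection rate.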

  14. Test Results: Latency - Arbitration Schemes • Fixed Priority Scheme is the worst case; its latency grows without bound at rate 0.5 • Others have very similar latency behavior

  15. Test Results: Throughput - Arbitration Schemes (Cont) • Fixed Priority Scheme is the worst case; the network saturates at rate 0.5 • Others have very similar throughput behavior

  16. Performance Analysis Two - Arbitration Schemes • SLRU, PLRU, CIRC and RAND arbitration schemes show very similar performance behavior under a uniformly random traffic pattern. • Fixed Priority arbitration scheme shows the worst performance behavior under the same conditions.
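One plausible reason fixed priority performs worst is starvation of low-priority requesters, which the sketch below illustrates. The rotating arbiter is a generic round-robin scheme used as a stand-in for a fair policy; whether it matches the CIRC scheme in the simulators is an assumption, and the slides do not describe how any of the schemes are implemented.

```python
def fixed_priority_grant(requests):
    """Grant the lowest-numbered requesting port; higher ports can starve."""
    for port, requesting in enumerate(requests):
        if requesting:
            return port
    return None


class RoundRobinArbiter:
    """Rotating-priority arbiter: the port after the last winner goes first."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.next_port = 0

    def grant(self, requests):
        for offset in range(self.num_ports):
            port = (self.next_port + offset) % self.num_ports
            if requests[port]:
                self.next_port = (port + 1) % self.num_ports
                return port
        return None


# Four sources request the same output on every cycle.
always = [True, True, True, True]
fixed_grants = [fixed_priority_grant(always) for _ in range(8)]
rr = RoundRobinArbiter(4)
rr_grants = [rr.grant(always) for _ in range(8)]
print(fixed_grants)
print(rr_grants)
```

Under fixed priority only port 0 is ever served in this scenario, so packets from the other ports wait indefinitely, while the rotating arbiter serves each port in turn.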

  17. Test Results – Application-Driven Benchmarks • Average reverse latency increases very quickly as the number of packets increases • Forward and reverse traffic have different latency behavior

  18. Performance Analysis - Application-Driven Benchmarks • C64 architecture classifies traffic into: • Class 0 (Forward traffic): messages sent out from processors, like load requests and stores • Class 1 (Reverse traffic): messages sent back to processors, like load returns • Reverse transfer delay is much larger than forward transfer delay • Forward and reverse transfers have similar throughput

  19. Conclusion For Synthetic Workloads Verified: • C64 crossbar is a stable network • The minimum latency of the C64 crossbar is 7 cycles. Discovered: • Traffic pattern, including temporal characteristics and spatial distribution, has a strong effect on the crossbar performance behavior • Permutation spatial traffic has the best latency behavior. It stays at the minimum latency of 7 cycles because it has zero contention. • Uniform random distributed workload has better throughput behavior. • Fixed priority arbitration scheme has the worst performance behavior and the others are very similar For Application-Driven Workload Discovered: • Forward and reverse traffic have different latency behavior but similar throughput behavior • Reverse traffic has worse latency behavior than forward traffic

  20. Future Work Synthetic Workloads • Investigate arbitration schemes under different traffic patterns Application-Driven Workloads • Investigate performance behavior of C64 Crossbar under different configuration constraints • Number of used thread units • Number of involved memory banks • Investigate performance behavior of C64 Crossbar under different arbitration schemes Summary of Performance Analyses Documentation

  21. Acknowledgments • Fei Chen • Yuhei • Dimitri • Joseph • Ted • Prof. Gao • All people in CAPSL group

  22. Question? Thanks!!!
