TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect

  1. TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect
  Heiner Litz, University of Heidelberg

  2. Motivation
  • Future trends
    • More cores: 2-fold increase per year [Asanovic 2006]
    • More nodes: 200,000+ nodes for Exascale [Exascale Rep.]
  • Consequence
    • Exploit fine-grained parallelism
    • Improve serialization/synchronization
  • Requirement
    • Low-latency communication

  3. Motivation
  • Latency lags bandwidth [Patterson, 2004]
  • Memory vs. network
    • Memory bandwidth: 10 GB/s; network bandwidth: 5 GB/s
    • Memory latency: 50 ns; network latency: 1 µs
    • The bandwidth gap is 2x, but the latency gap is 20x

  4. State of the Art
  [Figure: interconnect design space spanning scalability and latency: clusters (Ethernet, InfiniBand), SW DSM, and SMPs (Tilera, Larrabee, QuickPath, HyperTransport); TCCluster is positioned between clusters and SMPs, combining scalability with lower latency.]

  5. Observation
  • Today's CPUs already represent complete cluster nodes:
    • Processor cores
    • Switch
    • Links

  6. Approach
  • Use the host interface as the interconnect
  • Tightly Coupled Cluster (TCCluster)

  7. Background
  • Coherent HyperTransport (cHT)
    • Shared-memory SMPs
    • Cache-coherency overhead
    • Max. 8 endpoints
    • Table-based routing (by nodeID)
  • Non-coherent HyperTransport (ncHT)
    • Subset of cHT
    • I/O devices, Southbridge, ...
    • PCI-like protocol
    • "Unlimited" number of devices
    • Interval routing (by memory address; see the sketch below)
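
Interval routing can be illustrated with a plain C sketch: a packet leaves through the link whose address window contains its target address. The window boundaries below follow the memory-layout slide; the table layout and function names are illustrative assumptions, not HyperTransport specification code.

    /* Sketch of interval (address-range) routing as used by non-coherent
     * HyperTransport: pick the link whose [base, limit) window contains the
     * target address. Windows follow the memory-layout slide; the table
     * itself is an illustrative assumption. */
    #include <stdint.h>

    struct route_entry {
        uint64_t base, limit;   /* physical address window */
        int      link;          /* egress HT link */
    };

    static const struct route_entry routes[] = {
        { 0x000000000ULL, 0x100000000ULL, 0 },  /* 0-4 GB: local DRAM        */
        { 0x100000000ULL, 0x140000000ULL, 1 },  /* 4-5 GB: other local node  */
        { 0x140000000ULL, 0x180000000ULL, 2 },  /* 5-6 GB: remote box window */
    };

    int route(uint64_t addr)
    {
        for (unsigned i = 0; i < sizeof routes / sizeof routes[0]; i++)
            if (addr >= routes[i].base && addr < routes[i].limit)
                return routes[i].link;
        return -1;  /* no matching window */
    }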

  8. Approach
  • Processors pretend to be I/O devices
  • Partitioned global address space
  • Communication via programmed I/O (PIO) writes to memory-mapped I/O (MMIO), as sketched below
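
Once the MMIO window targeting the remote box has been set up (see the memory-layout slide), sending data is just a store to an address inside that window. A minimal sketch follows; the pointer and how it is obtained are assumptions for illustration, not taken from the slides.

    /* Sketch of a PIO write into the remote box through the write-combining
     * MMIO window of the partitioned global address space. The window pointer
     * is assumed to be provided by the driver (see slide 13). */
    #include <stdint.h>
    #include <stddef.h>

    /* remote_window points into the MMIO region (5-6 GB on the memory-layout
     * slide) that interval routing forwards to the remote box. */
    void pio_send(volatile uint64_t *remote_window, size_t offset, uint64_t value)
    {
        /* A plain store turns into a posted non-coherent HT write and is
         * delivered to the remote box's remotely writable memory. */
        remote_window[offset] = value;
        __sync_synchronize();   /* keep the store ordered before later work */
    }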

  9. Routing

  10. Programming Model
  • Remote Store programming model (RSM; sketched below)
    • Each process has local private memory
    • Each process supports remotely writable regions
    • Sending by storing to remote locations
    • Receiving by reading from local memory
    • Synchronization through serializing instructions
  • No support for bulk transfers (DMA)
  • No support for remote reads
  • Emphasis on locality and low-latency reads
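
A minimal sketch of this send/receive pattern, assuming each side has already mapped the peer's remotely writable region (slide 13); the buffer names and the flag-based completion protocol are illustrative assumptions.

    /* Remote-store pattern: send by storing into the peer's writable region,
     * receive by polling local memory, synchronize with a store fence. */
    #include <stdint.h>
    #include <emmintrin.h>            /* _mm_sfence, _mm_pause */

    #define MSG_WORDS 8

    /* Receiver side: its own remotely writable region, read locally. */
    volatile uint64_t rx_data[MSG_WORDS];
    volatile uint64_t rx_flag;

    /* tx_* point into the peer's writable region (mapped by the driver). */
    void send_msg(volatile uint64_t *tx_data, volatile uint64_t *tx_flag,
                  const uint64_t *msg)
    {
        for (int i = 0; i < MSG_WORDS; i++)
            tx_data[i] = msg[i];      /* stores travel as posted HT writes */
        _mm_sfence();                 /* serialize: payload before the flag */
        *tx_flag = 1;
    }

    void recv_msg(uint64_t *msg)
    {
        while (rx_flag == 0)          /* receiving is just a local read */
            _mm_pause();
        for (int i = 0; i < MSG_WORDS; i++)
            msg[i] = rx_data[i];
        rx_flag = 0;                  /* re-arm for the next message */
    }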

  11. Implementation
  [Figure: test system of two Tyan boxes (BOX 0, BOX 1), each a two-socket quad-core Shanghai node pair (node 0, node 1) with Southbridge and reset/power wiring; the boxes are connected through their HTX slots by a non-coherent HT link, 16 bit @ 3.6 Gbit.]

  12. Implementation

  13. Implementation
  • Software-based approach
  • Firmware
    • Coreboot (LinuxBIOS)
    • Link de-enumeration
    • Force links to non-coherent mode
    • Link frequency & electrical parameters
  • Driver
    • Linux-based
    • Topology & routing
    • Manages remotely writable regions (see the sketch below)
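
From user space, obtaining a remotely writable window from the driver might look roughly like the sketch below; the device node name /dev/tccluster0 and the mmap-only interface are assumptions, not the actual TCCluster driver API.

    /* Hedged user-space view of mapping a remotely writable window exported
     * by a hypothetical /dev/tccluster0 device. Not the real driver API. */
    #include <stdint.h>
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    #define WINDOW_SIZE (1UL << 30)   /* example: a 1 GB window */

    volatile uint64_t *map_remote_window(const char *dev)
    {
        int fd = open(dev, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return NULL; }

        /* The driver is assumed to back this mapping with the write-combining
         * MMIO window that interval routing forwards to the remote box. */
        void *p = mmap(NULL, WINDOW_SIZE, PROT_READ | PROT_WRITE,
                       MAP_SHARED, fd, 0);
        close(fd);
        return (p == MAP_FAILED) ? NULL : (volatile uint64_t *)p;
    }

    int main(void)
    {
        volatile uint64_t *win = map_remote_window("/dev/tccluster0");
        if (!win) return 1;
        win[0] = 0x1;                 /* lands in the remote box's RW memory */
        return 0;
    }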

  14. Memory Layout
  [Figure: mirrored physical address maps of BOX 0 and BOX 1]
  • 0 GB - 4 GB: local DRAM of node 0 (WB)
  • 4 GB - 5 GB: node 1 (WB)
  • 5 GB - 6 GB: remotely writable memory (RW mem, UC) on one box, appearing as an MMIO window (WC) on the other
  • Above 6 GB: DRAM hole

  15. Bandwidth – HT800 (16 bit)
  Single-thread message rate: 142 million messages/s

  16. Latency – HT800 (16 bit)
  Software-to-software half round-trip latency: 227 ns

  17. Conclusion
  • Introduced a novel tightly coupled interconnect
  • "Virtually" moved the NIC into the CPU
  • Order-of-magnitude latency improvement
  • Scalable
  • Next steps:
    • MPI support on top of RSM
    • Custom mainboard with multiple links

  18. References
  • [Asanovic, 2006] Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley Technical Report, 2006.
  • [Exascale Rep.] ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems.
  • [Patterson, 2004] Patterson, D. Latency Lags Bandwidth. Communications of the ACM, vol. 47, no. 10, pp. 71-75, October 2004.
