TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect • Heiner Litz • University of Heidelberg
Motivation • Future Trends • More cores, 2-fold increase per year [Asanovic 2006] • More nodes, 200,000+ nodes for Exascale [Exascale Rep.] • Consequence • Exploit fine-grain parallelism • Improve serialization/synchronization • Requirement • Low-latency communication
Motivation • Latency lags Bandwidth [Patterson, 2004] • Memory vs. Network • Memory BW 10 GB/s • Network BW 5 GB/s • Memory Latency 50 ns • Network Latency 1 µs • Bandwidth gap 2x vs. latency gap 20x
State of the Art • [Figure: interconnects and architectures positioned along two axes, Scalability and Lower Latency. Labels: Clusters, Ethernet, Infiniband, SW DSM, SMPs, Tilera, Larrabee, Quickpath, HyperTransport, TCCluster]
Observation • Today’s CPUs represent complete Cluster nodes • Processor cores • Switch • Links
Approach • Use host interface as interconnect • Tightly Coupled Cluster (TCCluster)
Background • Coherent HyperTransport (cHT) • Shared-memory SMPs • Cache coherency overhead • Max. 8 endpoints • Table-based routing (nodeID) • Non-coherent HyperTransport (ncHT) • Subset of cHT • I/O devices, Southbridge, ... • PCI-like protocol • “Unlimited” number of devices • Interval routing (memory address)
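The two routing schemes named on this slide can be contrasted with a small sketch. It is an illustration only: the table contents, link names, and address intervals below are made-up values, not the configuration of any real HyperTransport system.

```python
from bisect import bisect_right

GiB = 1 << 30

# Table-based routing (coherent HT style): destination nodeID -> outgoing link.
# Hypothetical table contents, chosen only for illustration.
NODE_TABLE = {0: "local", 1: "link0", 2: "link1"}

def route_by_node_id(node_id):
    """Look the destination nodeID up directly in the routing table."""
    return NODE_TABLE[node_id]

# Interval routing (non-coherent HT style): each target owns a half-open
# address interval [base, limit). Hypothetical intervals, sorted by base.
INTERVALS = [
    (0 * GiB, 4 * GiB, "local DRAM"),
    (4 * GiB, 5 * GiB, "link0"),
    (5 * GiB, 6 * GiB, "link1"),
]
BASES = [base for base, _, _ in INTERVALS]

def route_by_address(addr):
    """Return the target whose interval contains addr, or None if unclaimed."""
    i = bisect_right(BASES, addr) - 1
    if i >= 0:
        base, limit, target = INTERVALS[i]
        if base <= addr < limit:
            return target
    return None
```

With table routing the number of reachable endpoints is bounded by the nodeID space, whereas interval routing only needs a free address range per device, which is what the slide's “unlimited” refers to.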
Approach • Processors pretend to be I/O devices • Partitioned global address space • Communicate via PIO writes to MMIO
Programming Model • Remote Store Programming Model • Each process has local private memory • Each process supports remotely writable regions • Sending by storing to remote locations • Receiving by reading from local memory • Synchronization through serializing instructions • No support for bulk transfers (DMA) • No support for remote reads • Emphasis on locality, low-latency reads
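The remote store model above can be mimicked in a few lines of single-threaded Python. This is a simulation sketch, not TCCluster code: the `Node` class, the one-byte length flag, and the slot layout are assumptions introduced here to show the send-by-store and receive-by-local-read pattern.

```python
class Node:
    """One process: private memory plus a region that peers may write to."""

    def __init__(self, region_size=64):
        self.private = bytearray(region_size)          # never touched by peers
        self.remote_writable = bytearray(region_size)  # peers store into this

    def remote_store(self, peer, offset, payload):
        """Send: store the payload, then a length flag, into the peer's region.

        Real hardware would need a serializing instruction between the two
        stores; in this single-threaded simulation program order is enough.
        """
        if not 1 <= len(payload) <= 255:
            raise ValueError("payload must be 1..255 bytes")
        if offset < 0 or offset + 1 + len(payload) > len(peer.remote_writable):
            raise ValueError("message does not fit in the remote region")
        peer.remote_writable[offset + 1 : offset + 1 + len(payload)] = payload
        peer.remote_writable[offset] = len(payload)    # flag is written last

    def poll(self, offset):
        """Receive: read only local memory; return None if nothing arrived."""
        length = self.remote_writable[offset]
        if length == 0:
            return None
        data = bytes(self.remote_writable[offset + 1 : offset + 1 + length])
        self.remote_writable[offset] = 0               # mark the slot free
        return data
```

A sender calls `sender.remote_store(receiver, 0, b"ping")` and the receiver later calls `receiver.poll(0)`. The receiver never reads remote memory, which matches the "no remote reads" restriction on the slide.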
Implementation • 2x two-socket quad-core Shanghai Tyan boxes • [Figure: BOX 0 and BOX 1, each with node0, node1, a Southbridge (SB) and an HTX connector; the boxes are joined by an ncHT link (16 @ 3.6 Gbit) and a Reset/PWR connection]
Implementation • Software-based approach • Firmware • Coreboot (LinuxBIOS) • Link de-enumeration • Force non-coherent • Link frequency & electrical parameters • Driver • Linux-based • Topology & Routing • Manages remotely writable regions
Memory Layout • [Figure: physical address maps of BOX 0 and BOX 1 with ticks at 0 GB, 4 GB, 5 GB and 6 GB. Region labels in each box: Local DRAM node 0 (WB), Node1 (WB), RW mem (UC), MMIO (WC), DRAM Hole]
Bandwidth – HT800 (16 bit) • Single-thread message rate: 142 million
Latency – HT800 (16 bit) • Software-to-software half-roundtrip: 227 ns
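The arithmetic behind these two figures can be sketched as follows. Two assumptions are made here that the slides do not state: the half-roundtrip is taken from a ping-pong test as total time divided by twice the iteration count, and the message rate of 142 million is read as messages per second. The 454,000 ns total is a made-up run chosen to reproduce the reported value.

```python
def half_roundtrip_ns(total_ns, iterations):
    """One-way latency estimate: each ping-pong iteration is two message hops."""
    return total_ns / (2 * iterations)

def ns_per_message(messages_per_second):
    """Average spacing between back-to-back messages at a given rate."""
    return 1e9 / messages_per_second

# Hypothetical ping-pong run: 1000 iterations taking 454,000 ns in total.
print(half_roundtrip_ns(total_ns=454_000, iterations=1000))  # 227.0

# Assuming 142 million messages per second from a single thread.
print(round(ns_per_message(142e6), 2))  # 7.04
```

Under these assumptions a single thread issues a new message roughly every 7 ns, far below the 227 ns one-way latency, so many stores are in flight at once.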
Conclusion • Introduced a novel tightly coupled interconnect • “Virtually” moved the NIC into the CPU • Order of magnitude latency improvement • Scalable • Next steps: • MPI over RSM support • Own mainboard with multiple links
References • [Asanovic 2006] Asanovic K, Bodik R, Catanzaro B, Gebis J, et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley Tech Report, 2006. • [Exascale Rep.] ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. • [Patterson, 2004] Patterson D. Latency Lags Bandwidth. Communications of the ACM, vol. 47, no. 10, pp. 71-75, October 2004.