WildFire: A Scalable Path for SMPs
Erik Hagersten and Michael Koster
Presented by Andrew Waterman, ECE259 Spring 2008
Insight and Motivation
• SMP abandoned for more scalable cc-NUMA
  • But SMP bandwidth has scaled faster than CPU speed
• cc-NUMA is scalable but more complicated
  • Program/OS specialization necessary
  • Communication to remote memory is slow
  • May not be optimal for real access patterns
• SMP (UMA) is simpler
  • More straightforward programming model
  • Simpler scheduling, memory management
  • No slow remote memory access
• Why not leverage SMP to the extent that it scales?
Multiple SMP (MSMP)
• Connect few, large SMPs (nodes) together
  • Distributed shared memory
• Weren't we just NUMA-bashing?
  • Several CPUs per node => many local memory refs
  • Few nodes => unscalable coherence protocol OK
WildFire Hardware
• MSMP with 112 UltraSPARCs
  • Four unmodified Sun E6000 SMPs
  • GigaPlane bus (3.2 GB/s within a node)
  • 16 2-CPU or I/O cards per node
• WildFire Interface (WFI) is just another I/O board (cool!)
• SMPs (UMA) connected via WFI == cc-NUMA (!)
  • But this is OK... few, large nodes
• Full cache coherence, both intra- & inter-node
[Figure: WildFire from 30,000 ft (emphasis on a single SMP node)]
WildFire Software
• ABI-compatible with Sun SMPs
  • It's the software, stupid!
• Slightly (allegedly) modified Solaris 2.6
• Threads in same process grouped onto same node
• Hierarchical Affinity Scheduler (HAS)
• Coherent Memory Replication (CMR)
  • OK, so this isn't purely a software technique
Coherent Memory Replication (CMR)
• S-COMA with fixed home locations for each block
  • For those keeping score, that means it's not COMA
• Local physical pages “shadow” remote physical pages
  • Keep frequently-read pages close: less avg. latency
• Implementation: hardware counters
  • CMR page allocation handled within OS
• Coherence still in hardware at block granularity
  • Enabled/disabled at page granularity
• CMR memory allocation adjusts with mem. pressure
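A minimal sketch of how the OS side of CMR might work, assuming the WFI exposes a per-page remote-access counter as the slide describes. The threshold, struct layout, and helper functions here are hypothetical illustrations, not WildFire's actual interface:

```c
/* Hypothetical sketch of an OS-side CMR policy loop. The hardware
 * maintains per-page remote-access counters; the OS periodically
 * scans them and replicates "hot" remote pages into local shadow
 * memory, subject to memory pressure. All names are illustrative. */

#include <stdint.h>
#include <stdbool.h>

#define REPLICATION_THRESHOLD 64  /* remote accesses before replicating */

struct page_stats {
    uint64_t global_addr;      /* home-node (global) page address   */
    uint32_t remote_accesses;  /* hardware counter, reset each scan */
    bool     shadowed;         /* already has a local shadow copy?  */
};

/* Assumed helpers, provided elsewhere in this sketch. */
extern bool     local_memory_pressure_low(void);
extern uint64_t alloc_local_shadow_page(void);
extern void     remap_and_install_lpa2ga(uint64_t lpa, uint64_t ga);

void cmr_scan(struct page_stats *pages, int npages)
{
    for (int i = 0; i < npages; i++) {
        struct page_stats *p = &pages[i];
        if (!p->shadowed &&
            p->remote_accesses >= REPLICATION_THRESHOLD &&
            local_memory_pressure_low()) {
            /* Allocate a local page to shadow the remote one, and
             * record the local-to-global mapping so the hardware can
             * still keep the copy coherent at block granularity. */
            uint64_t lpa = alloc_local_shadow_page();
            remap_and_install_lpa2ga(lpa, p->global_addr);
            p->shadowed = true;
        }
        p->remote_accesses = 0;  /* decay counters between scans */
    }
}
```

The key design point the sketch illustrates: replication is a page-granularity OS policy, while coherence for the shadowed blocks stays entirely in hardware.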
Hierarchical Affinity Scheduling (HAS)
• Exploit locality by scheduling a process on the last node on which it executed
• Only reschedule onto another node when load imbalance exceeds a threshold
• Works particularly well when combined with CMR
  • Frequently-accessed remote pages still shadowed locally after a context switch
  • Lagniappe locality
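A minimal sketch of the HAS decision rule described above, assuming a per-node run-queue length is available. The threshold and helper names are hypothetical, not Solaris 2.6 code:

```c
/* Hypothetical sketch of a hierarchical-affinity scheduling decision:
 * keep a thread on its last node unless the inter-node load imbalance
 * exceeds a threshold. Constants and helpers are illustrative. */

#define NNODES              4
#define IMBALANCE_THRESHOLD 2    /* runnable-thread difference */

struct thread { int last_node; };

extern int node_load(int node);  /* runnable threads on a node */

int has_pick_node(struct thread *t)
{
    int home = t->last_node;

    /* Find the least-loaded node. */
    int best = 0;
    for (int n = 1; n < NNODES; n++)
        if (node_load(n) < node_load(best))
            best = n;

    /* Stay home unless the imbalance is large enough to justify
     * giving up locality (and any CMR shadow pages built up there). */
    if (node_load(home) - node_load(best) > IMBALANCE_THRESHOLD)
        return best;
    return home;
}
```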
WildFire Implementation
[Figure: A single Sun E6000 with WFI. Recall: the WildFire Interface is just one of 16 standard cards on the GigaPlane bus.]
WildFire Implementation
• Network Interface Address Controller (NIAC) + Network Interface Data Controller (NIDC) == WFI
• NIAC interfaces with the GigaPlane bus and handles inter-node coherence
• Four NIDCs talk to the point-to-point interconnect between nodes
  • Three ports per NIDC (one for each remote node)
  • 800 MB/s in each direction with each remote node
WildFire Cache Coherence
• Intra-node coherence: bus + snoopy
• Inter-node (global) coherence: directory
  • Directory state kept at a block's home node
  • Directory cache (SRAM) backed by memory
  • Home node determined by high-order address bits
• MOSI
  • Nothing special since scalability not an issue
  • Blocking directory, 3-stage WB => no corner cases
• NIAC sits on the bus and asserts the “ignore” signal for requests that global coherence must attend to
  • NIAC intervenes if a block's local state is insufficient or the block resides in remote memory
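To make the protocol concrete, here is a hedged sketch of what the home node's directory might do for a remote read miss. The MOSI states, home-node address encoding, and blocking behavior follow the slide, but all field names, constants, and helpers are illustrative, not WildFire's actual design:

```c
/* Hypothetical sketch of the home node's directory action on a
 * remote read miss in a 4-node MOSI protocol. Illustrative only. */

#include <stdint.h>

#define NODE_SHIFT 36  /* illustrative: home node in high-order bits */
#define NODE_MASK  0x3

enum mosi { INVALID, SHARED, OWNED, MODIFIED };

struct dir_entry {      /* directory cache (SRAM), backed by memory   */
    enum mosi state;
    uint8_t   sharers;  /* bitmask: one bit per node                  */
    uint8_t   owner;
    uint8_t   busy;     /* blocking directory: one transaction/block  */
};

/* Requesters route to home_node(ga); the handler below runs there. */
static inline int home_node(uint64_t ga)
{
    return (int)((ga >> NODE_SHIFT) & NODE_MASK);
}

extern struct dir_entry *dir_lookup(uint64_t ga);
extern void send_data_from_memory(uint64_t ga, int requester);
extern void forward_to_owner(uint64_t ga, int owner, int requester);
extern void nack_or_queue(uint64_t ga, int requester);

void home_handle_read(uint64_t ga, int requester)
{
    struct dir_entry *e = dir_lookup(ga);

    if (e->busy) {                     /* blocking directory: racing   */
        nack_or_queue(ga, requester);  /* requests wait, so there are  */
        return;                        /* no protocol corner cases     */
    }
    e->busy = 1;  /* cleared when the requester's completion ack arrives
                     (handler not shown) */

    switch (e->state) {
    case INVALID:
    case SHARED:       /* memory holds the latest copy: supply it */
        send_data_from_memory(ga, requester);
        e->state = SHARED;
        break;
    case OWNED:
    case MODIFIED:     /* a remote cache owns the latest copy: forward */
        forward_to_owner(ga, e->owner, requester);
        e->state = OWNED;  /* owner keeps dirty data, now also shared */
        break;
    }
    e->sharers |= (uint8_t)(1u << requester);
}
```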
WildFire Cache Coherence
• Coherent Memory Replication complicates matters
  • A local shadow page has a different physical address from its corresponding remote page
• If a block's state is insufficient, must look up global address in order for WFI to issue remote request
  • Stored in LPA2GA SRAM
  • Also cache the reverse lookup (GA2LPA)
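A hedged sketch of that LPA-to-GA step, showing why the translation must happen before the remote request can leave the node. The table layout, page size, and helper names are hypothetical:

```c
/* Hypothetical sketch of the WFI's LPA->GA translation on a miss to
 * a local shadow page: the local physical address names the shadow
 * copy, so the WFI must recover the block's global address before
 * sending a request to its home node. Illustrative only. */

#include <stdint.h>

#define PAGE_SHIFT 13  /* 8 KB pages, illustrative */
#define PAGE_MASK  ((1ull << PAGE_SHIFT) - 1)

extern uint64_t lpa2ga_sram_lookup(uint64_t lpa_page);  /* per-page entry */
extern void     issue_remote_request(uint64_t ga);

void wfi_miss_on_shadow_page(uint64_t lpa)
{
    uint64_t ga_page = lpa2ga_sram_lookup(lpa >> PAGE_SHIFT);
    uint64_t ga      = (ga_page << PAGE_SHIFT) | (lpa & PAGE_MASK);
    issue_remote_request(ga);  /* the reply is installed under the LPA,
                                  found again via the GA2LPA cache */
}
```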
WildFire Memory Latency
• WildFire compared to SGI Origin (2x R10K per node) and Sequent NUMA-Q (4x Xeon per node)
• WF's remote mem. latency mediocre (2.5x Origin, similar to NUMA-Q), but less relevant because remote accesses less frequent (1/14 as many as Origin, 1/7 as many as NUMA-Q)
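A back-of-envelope check of why this trade-off wins, using only the relative factors quoted above and the crude assumption that total remote stall time ≈ (number of remote accesses) × (per-access latency):

```latex
\frac{\text{WF remote stall time}}{\text{Origin remote stall time}}
  \approx \underbrace{2.5}_{\text{latency ratio}}
  \times \underbrace{\tfrac{1}{14}}_{\text{access-count ratio}}
  \approx 0.18
```

So despite each remote access being roughly 2.5x slower, under this simple model WildFire spends only about a fifth as much time stalled on remote memory as Origin.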
Evaluating WildFire
• Clever performance evaluation methodology: isolate WildFire's NUMA effects by comparing a single-node 16-CPU system with a two-node, 8-CPU/node system
  • Pure SMP vs. WildFire
• Also compare with NUMA-fat
  • Basically WF with no OS support, i.e. no CMR, no HAS, no locality-aware memory allocation, no kernel replication
• And compare with NUMA-thin
  • NUMA-fat but with small (2-CPU) nodes
• Finally, turn off HAS and CMR to evaluate their contribution to WF's performance
Evaluating WildFire
• WF with HAS+CMR comes within 13% of pure SMP
• Speedup(HAS+CMR) >> Speedup(HAS) * Speedup(CMR)
  • Locality-aware allocation and large nodes are important
Evaluating WildFire
• Performance trends correlate with locality of reference
• HAS + CMR + Kernel Replication + Initial Allocation improve locality of access from 50% (i.e. uniform distribution between two nodes) to 87%
Summary
• WildFire = a few large SMP nodes + directory-based coherence between nodes + fast point-to-point interconnect + clever scheduling and replication techniques
• Pretty good performance (unfortunately, no numbers for 112 CPUs)
• Good idea? I think so, but I doubt there's much room for scalability
  • Then again, that wasn't the point
• Criticisms?
  • Authors are very proud of their slow directory protocol
  • Kernel modifications may not be so slight
Questions?