Challenges and Opportunities in Designing Energy-Efficient High-Performance Computing Platforms
Chita R. Das
High Performance Computing Laboratory
Department of Computer Science & Engineering
The Pennsylvania State University
EEHiPC, December 19, 2010
Talk Outline
• Technology Scaling Challenges
• State-of-the-Art Design Challenges
• Opportunity: Heterogeneous Architectures
  • Technology – 3D, TFET, optics, STT-RAM
  • Processor – new devices, core heterogeneity
  • Memory – STT-RAM, PCM, etc.
  • Interconnect – network heterogeneity
• Conclusions
Computing Walls: Moore's Law
(Data from ITRS 2008)
Computing Walls: Power Wall
• High-performance MOS started out with a 12 V supply; current high-performance microprocessors run at 1 V, a (12/1)² = 144x power saving over 28 years.
• Only (1/0.6)² ≈ 2.8x is left in the next 12 years.
• Dynamic power P ≈ CV²f: lowering V reduces P, but speed also decreases with V, giving a utilization and power wall.
(Data from ITRS 2008)
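The voltage-headroom arithmetic above is easy to check with a short sketch (the 0.6 V endpoint is the slide's projected supply floor for the next 12 years):

```python
# Dynamic power scales as P ~ C * V^2 * f, so at fixed C and f the power
# saving from lowering the supply voltage v_old -> v_new is (v_old/v_new)^2.
def voltage_power_saving(v_old, v_new):
    return (v_old / v_new) ** 2

# 12 V (early high-performance MOS) down to 1 V (circa 2010):
past = voltage_power_saving(12.0, 1.0)    # 144x over ~28 years

# 1 V down to a projected ~0.6 V floor:
future = voltage_power_saving(1.0, 0.6)   # only ~2.8x left

print(f"{past:.0f}x achieved, {future:.2f}x remaining")
```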
Computing Walls: Memory Bandwidth Wall
• Pin count increases only 4x, compared to a 25x increase in core count.
(Data from ITRS 2009)
Computing Walls: Reliability Wall
• Failure rate per transistor must decrease exponentially as we go deeper into the nanometer regime.
(Data from ITRS 2007)
Computing Walls: Wire Delay
• Global wires no longer scale.
State-of-the-Art in Architecture Design
Strawman multi-core processor in 16 nm technology: 25K 64-bit FPUs, 37.5 TFLOPS, 150 W (compute only)
• 64b FPU: 0.015 mm², 4 pJ/op at 3 GHz
• On-chip channel: 2 pJ/word per 1 mm (20 pJ over 10 mm, 4 cycles); die is 20 mm on a side
• Off-chip channel: 64 pJ/word
• The energy required to move a 64-bit word across the die is equivalent to the energy for 10 FLOPs.
• Traditional designs have approximately 75% of energy consumed by overhead.
Performance = parallelism. Efficiency = locality.
(Bill Harrod, DARPA IPTO, 2009)
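The locality argument on this slide is simple arithmetic; a minimal sketch using the slide's per-operation energies (2 pJ/word per mm of on-chip channel, 4 pJ per 64-bit FLOP, a 20 mm die edge):

```python
FLOP_ENERGY_PJ = 4.0      # 64b FPU, ~4 pJ/op at 3 GHz (16 nm strawman)
CHANNEL_PJ_PER_MM = 2.0   # 2 pJ/word per 1 mm of on-chip channel
DIE_SIZE_MM = 20.0        # die edge length

# Energy to move one 64-bit word the full width of the die:
move_across_die = CHANNEL_PJ_PER_MM * DIE_SIZE_MM    # 40 pJ

# Expressed in FLOP-equivalents:
flops_equivalent = move_across_die / FLOP_ENERGY_PJ  # 10 FLOPs
print(f"Moving one 64b word across the die costs {flops_equivalent:.0f} FLOPs of energy")
```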
Energy Cost for Different Operations
• Energy is dominated by data and instruction movement.
• (*The "Insts" column gives the number of average instructions that can be performed for this energy.)
Conventional Architecture (90 nm)
• Energy is dominated by overhead.
• [Chart: energy per operation broken down into FPU, local, global, and off-chip communication, overhead, and DRAM; values range from roughly 1.0E-10 J to 1.4E-8 J.]
(Source: Dally)
Where is this overhead coming from?
• Complex microarchitecture: OOO execution, register renaming, branch prediction, …
• Complex memory hierarchy
• High/unnecessary data movement
• Orthogonal design style
• Limited understanding of application requirements
Both Put Together…
• Power (joules/operation) becomes the deciding factor in designing HPC systems.
• Hardware acquisition cost no longer dominates the total cost of ownership (TCO).
IT Infrastructure Optimization: The New Math
• Power & cooling costs and server management costs are becoming comparable to new server spending (Source: IDC, 1996–2010).
• 1 watt consumed at the server cascades to approximately 2.84 watts of total consumption (Source: Emerson):
  • Server component: 1.00 W
  • DC-DC conversion (0.18 W): 1.18 W cumulative
  • AC-DC conversion (0.31 W): 1.49 W cumulative
  • Power distribution (0.04 W): 1.53 W cumulative
  • UPS (0.14 W): 1.67 W cumulative
  • Cooling (1.07 W): 2.74 W cumulative
  • Building switchgear/transformer (0.10 W): 2.84 W cumulative
• Until now: minimize equipment, software/licenses, and service/management costs.
• Going forward: power and physical infrastructure costs to house the IT become equally important; become "greener" in the process.
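The Emerson cascade figure can be reproduced by summing the per-stage overheads of delivering and cooling one watt at the server:

```python
# Incremental watts added at each stage of the power-delivery and cooling
# chain, per 1 W consumed by the server component (Emerson data from the slide).
cascade = {
    "server component": 1.00,
    "DC-DC conversion": 0.18,
    "AC-DC conversion": 0.31,
    "power distribution": 0.04,
    "UPS": 0.14,
    "cooling": 1.07,
    "building switchgear/transformer": 0.10,
}

total = sum(cascade.values())
print(f"1 W at the server -> {total:.2f} W of total facility draw")  # 2.84 W
```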
A Holistic Green Strategy
• Core: servers, storage, networking
• Support: UPS, power distribution, chillers, lighting, real estate
• Facilities: operations, office spaces, factories, travel and transportation, energy sourcing, …
• Two directions: technology for greening, and greening of technology.
(Source: A. Sivasubramaniam)
Processor Power Efficiency
Based on the Extreme Scale study, Bill Dally's strawman processor architecture:
• A possible processor design methodology for achieving 28 pJ/FLOP
• Requires optimization of communication, computation, and memory components
• Conventional design: 2.5 nJ/FLOP → minimize overhead: 631 pJ/FLOP → minimize DRAM energy: 28 pJ/FLOP
(Bill Harrod, DARPA IPTO, 2009)
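The three design points on this slide imply large multiplicative gains at each step; a quick check of the ratios:

```python
# Energy per FLOP at each design point from the strawman study.
conventional = 2.5e-9   # 2.5 nJ/FLOP, conventional design
no_overhead  = 631e-12  # 631 pJ/FLOP after minimizing overhead
final        = 28e-12   # 28 pJ/FLOP after also minimizing DRAM energy

print(f"overhead reduction:    {conventional / no_overhead:.1f}x")  # ~4x
print(f"DRAM-energy reduction: {no_overhead / final:.1f}x")         # ~22.5x
print(f"overall:               {conventional / final:.1f}x")        # ~89x
```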
Opportunity: Heterogeneous Architectures
• In the multicore era, heterogeneous multicore architectures provide the most compelling architectural trajectory for mitigating these problems:
  • Hybrid memory subsystem: SRAM, TFET, STT-RAM
  • Hybrid cores: big, small, accelerators, GPUs
  • Heterogeneous interconnect
A Holistic Design Paradigm
• Heterogeneity in interconnect
• Heterogeneity in memory design
• Heterogeneity in micro-architecture
• Heterogeneity in device/circuits
Technology Heterogeneity
• CMOS-based scaling is expected to continue until 2022.
• Exploiting emerging technologies to design different cores/components is promising because it enables power/performance tradeoffs that were not possible before.
• TFETs provide higher performance than CMOS-based designs at lower voltages (see the V/F scaling of CMOS vs. TFET devices).
Processor Cores: Heterogeneous Compute Nodes
• Big cores: latency-critical workloads
• Small cores: throughput-critical workloads
• GPGPUs: bandwidth-critical workloads
• Accelerators/ASICs: latency/time-critical workloads
Memory Architecture
• Comparison of memory technologies
• Role of novel technologies in memory systems
Heterogeneous Interconnect
• Buffer and link utilization across the NoC is non-uniform, due to the non-edge-symmetric network and X-Y routing. So:
  • Why clock all routers at the same frequency? Variable-frequency routers for designing NoCs.
  • Why allocate all routers similar area/buffer/link resources? Heterogeneous routers/NoCs.
Software Support
• Compiler support
  • Thread remapping to minimize power: migrate threads to TFET cores to reduce power
  • Dynamic instruction morphing: instructions of a thread are morphed by the runtime system to match the heterogeneous hardware the thread is mapped to
• OS support
  • Heterogeneity-aware scheduling support
  • Run-time thread migration support
Current Research in HPCL: Problems with Current NoCs
• NoC power consumption is a concern today (Intel 80-core tile power profile¹).
• With technology scaling, NoC power can be as high as 40-60 W for 128 nodes².
1. "A 5-GHz Mesh Interconnect for a Teraflops Processor," Y. Hoskote, S. Vangal, A. Singh, N. Borkar, S. Borkar, IEEE Micro, 2007.
2. "Networks for Multi-core Chips: A Contrarian View," S. Borkar, special session at ISLPED 2007.
Network Performance/Power
• Observation: at low load, power consumption is low; at high load, both power consumption and congestion are high.
• The proposed approach¹:
  • At low load: optimize for performance (reduce zero-load latency and accelerate flits)
  • At high load: manage congestion and power
1. "A Case for Dynamic Frequency Tuning in On-Chip Networks," MICRO 2009.
Frequency Tuning Rationale
• Under-utilized router: frequency is boosted.
• Congested router: frequency is lowered, and the upstream router throttles depending upon its total buffer utilization.
• Otherwise: no change.
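A minimal sketch of the frequency-tuning decision rule described above. The utilization thresholds here are hypothetical placeholders for illustration, not the values used in the MICRO 2009 work:

```python
# Hypothetical thresholds: buffer utilization above HIGH marks a router as
# congested; below LOW it is under-utilized and can be boosted.
LOW, HIGH = 0.25, 0.75

def tune_frequency(buffer_utilization):
    """Return the action ('boost', 'throttle', or 'hold') for one router."""
    if buffer_utilization > HIGH:
        return "throttle"   # congested: lower frequency, back-pressure upstream
    if buffer_utilization < LOW:
        return "boost"      # low load: raise frequency, accelerate flits
    return "hold"           # moderate load: no change

print(tune_frequency(0.9))  # throttle
print(tune_frequency(0.1))  # boost
print(tune_frequency(0.5))  # hold
```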
Performance/Power Improvement with RAFT
• FreqBoost at low load (optimize performance); FreqThrtl at high load (optimize performance and power)
• FreqTune gives both power reduction and throughput improvement: 36% reduction in latency, 31% increase in throughput, and 14% power reduction across all traffic patterns
A Case for Heterogeneous NoCs
• Using the same link resources and fewer buffer resources than a homogeneous network, this work demonstrates that a carefully designed heterogeneous network can reduce average latency, improve network throughput, and reduce power.
• Explores the types, number, and placement of heterogeneous routers in the network: small vs. big routers, narrow vs. wide links.
HeteroNoC Performance-Power Envelope
• 22% throughput improvement
• 25% latency reduction
• 28% power reduction
3D Stacking = Increased Locality
• Many more neighbors within close physical reach.
Reduced Global Interconnect Length
• Delay/power reduction
• Bandwidth increase
• Smaller footprint
• Mixed-technology integration
3D Routers for 3D Networks
• One router per grid node in 2D: total area = 4L²
• Stack layers in 3D: total area = L²
• Stack router components in 3D: total area = L²
Results from "MIRA: A Multi-layered On-Chip Interconnect Router Architecture," ISCA 2008.
Conclusions
• We need a coherent approach to address deep-submicron technology problems in designing energy-efficient HPC systems.
• Heterogeneous multicores can address these problems and represent the future architectural trajectory.
• But the design of such systems is extremely complex; it needs an integrated technology-hardware-software-application approach.
HPCL Collaborators
Faculty: Vijaykrishnan Narayanan, Yuan Xie, Anand Sivasubramaniam, Mahmut Kandemir
Students: Sueng-Hwan Lim, Bikash Sharma, Adwait Jog, Asit Mishra, Reetuparna Das, Dongkook Park, Jongman Kim
Partially supported by: NSF, DARPA, DOE, Intel, IBM, HP, Google, Samsung
THANK YOU !!! Questions???