Toward a Sustainable Architecture at Extreme Scale

Zhimin Tang, CTO, tangzhm@sugon.com

Outline • Sustainable (Cost-Effective) HPC • Counter-examples in history • Current and Future Challenges • New computing forms from sensor to cloud • Silicon-based IC process approaching its physical limit • Strategy • Abandon HPC-only acceleration features • Design a sustainable architecture for HPC and other applications

Considerations of Cost Effectiveness or Sustainability • Application (Algorithm) Requirements • High performance • Technology Constraints • CMOS vs. bipolar, Moore’s Law • Commercial MPU vs. custom ASIP • Economic Feasibility • Good ecosystem • Mass production • Low energy consumption

HPC Systems in History • Vector Supercomputers • CMOS dominated, SIMD weakness

Connection Machine • SIMD PE Array • Optimal only for some algorithms • Custom chips, tiny processors

MIMD with Custom CPUs • Chip Level Integration (SoC) • nCube/2, KSR-1 (COMA), … • High NRE cost due to custom design without mass production • Low node processor performance

Why Not Cost-Effective • HPC Is a Small Market • Architectures Designed Only for HPC • Lower volume, higher cost (NRE) • Not enough resources to implement a top-level (with respect to performance) solution • Longer time-to-market, falling behind Moore’s Law • Result: COTS Solutions in the Last 20 Years • Commercial off-the-shelf • Co-design with the IT Ecosystem • From cloud computers to sensors
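The "lower volume, higher cost (NRE)" point can be made concrete with a simple amortization model: per-chip cost is the one-time NRE spread over the production volume, plus the marginal manufacturing cost. The sketch below is illustrative only; the dollar figures and the function name `unit_cost` are assumptions, not numbers from the talk.

```python
def unit_cost(nre: float, marginal_cost: float, volume: int) -> float:
    """Per-chip cost when the one-time NRE is spread evenly over `volume` units."""
    if volume <= 0:
        raise ValueError("volume must be positive")
    return nre / volume + marginal_cost


# Hypothetical figures for illustration only (not taken from the slides).
NRE_DOLLARS = 50_000_000.0   # one-time design, verification, and mask cost
MARGINAL_DOLLARS = 100.0     # manufacturing cost per chip

for volume in (10_000, 1_000_000, 100_000_000):
    cost = unit_cost(NRE_DOLLARS, MARGINAL_DOLLARS, volume)
    print(f"{volume:>11,d} units -> ${cost:,.2f} per chip")
```

Under these assumed numbers, an HPC-only chip shipping in the tens of thousands is dominated by NRE, while a mass-market part is dominated by its marginal cost.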

Ecosystem Requirements • High Performance and Low Cost • Low cost remains a must • New cost factors: energy/power, large NRE • Performance is no longer the bottleneck • for most applications • like cars, trains, and airplanes in transportation • New forms of performance • Computing: MIPS/MFLOPS • Transaction processing: TPM • Cloud applications: requests served per unit time

Energy Efficiency • Two Ends of the Computing System • Cloud: large-scale power dissipation • Terminal: limited battery life • Energy: compute < memory < communication • For each FLOP in Linpack • FPU spends 10 pJ, memory access 475 pJ • Wireless Sensor Network • RF radio consumes most of the power • What Do We Need Besides Locality?
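To see how lopsided the per-FLOP figures above are, the following sketch computes the memory share of the energy and the resulting power draw at a given sustained rate. The 10 pJ and 475 pJ values come from the slide; the 1 PFLOPS rate and the assumption that every FLOP pays the full memory-access cost are illustrative simplifications.

```python
FPU_PJ_PER_FLOP = 10.0       # from the slide: FPU energy per Linpack FLOP
MEMORY_PJ_PER_FLOP = 475.0   # from the slide: memory-access energy per FLOP


def memory_share() -> float:
    """Fraction of per-FLOP energy spent on memory access rather than the FPU."""
    return MEMORY_PJ_PER_FLOP / (FPU_PJ_PER_FLOP + MEMORY_PJ_PER_FLOP)


def power_watts(flops_per_second: float) -> float:
    """Power for compute plus memory, assuming each FLOP pays both costs."""
    picojoules_per_flop = FPU_PJ_PER_FLOP + MEMORY_PJ_PER_FLOP
    return flops_per_second * picojoules_per_flop * 1e-12


# Hypothetical sustained rate of 1 PFLOPS, chosen only for illustration.
print(f"memory share of energy: {memory_share():.1%}")
print(f"power at 1 PFLOPS: {power_watts(1e15) / 1e3:.0f} kW")
```

With these inputs, memory access accounts for roughly 98% of the energy, which is why locality alone gets so much attention.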

New Architecture Needed • Architecture Consuming Less Energy • Many-core, custom designed for applications • Flattened software stack • Architecture for New Performance Metrics • High-volume throughput computers • New Algorithms and Methodology • Complexity of computation • Complexity of memory access and communication

Constraints on Innovation • Existing Software Ecosystem • Standard or de facto interfaces • e.g., ISA: Instruction Set Architecture • Pro: software compatibility • Con: legacy is an obstacle to innovation • Huge Development Expenses • A new architecture needs new processors • NRE of chip development is increasing rapidly as the CMOS process approaches its limit • NRE: Non-Recurring Engineering

CMOS Technology • Approaching Its Limit, and No Replacement! • Moore’s law: 7nm@2024, ~30 atoms • Different from the Transition in the 1990s • Bipolar (ECL/TTL) was faster, but consumed much more power • CMOS had been developed for 20 years: not too slow, low cost, and low power • But Now, Liquid Cooling for CMOS • In the foreseeable future, still CMOS
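One plausible way to arrive at the "~30 atoms" figure is to divide the 7 nm feature size by the nearest-neighbor spacing of atoms in crystalline silicon, about 0.235 nm. This derivation is an assumption; the slide does not state how the number was obtained.

```python
SI_NEAREST_NEIGHBOR_NM = 0.235  # Si-Si bond length in crystalline silicon (approx.)


def atoms_across(feature_nm: float, spacing_nm: float = SI_NEAREST_NEIGHBOR_NM) -> float:
    """Rough count of atoms spanning a feature, treating atoms as evenly spaced."""
    return feature_nm / spacing_nm


print(f"7 nm spans about {atoms_across(7.0):.0f} silicon atoms")
```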

More and More than Moore [Figure: 2011 ITRS Executive Summary, Fig. 4]

Dark Silicon • At 8nm, more than half of the transistors must be turned off • Speedup of 4-8x over 5 process generations (ISCA’11, IEEE Micro’12, CACM’13)
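The 4-8x figure is easier to judge as a per-generation rate. The sketch below takes the geometric mean over five generations and compares it with an idealized doubling per generation; that doubling baseline is an assumption used here for comparison, not a number from the slide.

```python
GENERATIONS = 5  # from the slide


def per_generation_factor(total_speedup: float, generations: int = GENERATIONS) -> float:
    """Geometric-mean speedup per process generation."""
    return total_speedup ** (1.0 / generations)


# Assumed ideal for comparison: performance doubles with every generation.
ideal_total = 2.0 ** GENERATIONS

for total in (4.0, 8.0):
    rate = per_generation_factor(total)
    gap = ideal_total / total
    print(f"{total:.0f}x total -> {rate:.2f}x per generation, "
          f"{gap:.0f}x short of the ideal {ideal_total:.0f}x")
```

Under this baseline, the projected speedup works out to roughly 1.3x to 1.5x per generation instead of 2x.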

Economic Feasibility • Moore’s Law Provides More Transistors • But switching speed is no longer increasing • Process development at the nanometer scale increases NRE tremendously • Mass Production Is Essential • Otherwise, the chip business is not sustainable • Advantages of general-purpose processors • How about Many-core Processors? • GPU, Tilera, MIC, …

Pros and Cons of MPU • Most Advanced Process, Mass Production • Stable, reliable, low cost • Mature ecosystem and solutions • Not Optimal for Many Applications • Aim: not too bad for most applications • Over-allocation of resources • Wasted resources, higher energy consumption

MPU Not Good for Cloud • High L1-I Cache Miss Rate • Processor idle (instruction starvation) • Low ILP and MLP • Wide issue not effective • Low Efficiency of Memory Access • Large L3 takes ½ of the chip area without improving performance • High On-chip Bandwidth Goes Unused • Little data sharing among cores

Low Utilization of Resources • Only 1/3 are frequently used [Figure: chip floorplan; labels: GPU, L2 Cache (×4), OOO FPU (×4), L3 Cache]

Pros and Cons of ASIP • Optimally Designed for Some Applications • High efficiency, low resource use, low power • But There Is No Free Lunch • Much design/verification work • Stability/reliability? • May affect the time to market • How to amortize the huge NRE? • Small market means high cost

MPU + Accelerator • GPU • Pro: mass production • Con: PCIe overhead, small memory size • MIC (Xeon Phi) • Mass production possible? • FPGA • Resource utilization • Ease of programming • MPU interface, e.g., QPI or PCIe

Design of New Processors • Crossing the Gap between General and Special • Many Simple Cores • Reduce power consumption • Multiple Hardware Threads in Each Core • Massive number of threads on chip • Exploit concurrency, tolerate latency • Dynamic Scheduling of On-chip Threads • Improve performance for general apps
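How multiple hardware threads "tolerate latency" can be illustrated with a standard first-order model of a multithreaded core: each thread runs for some cycles, then stalls on a long-latency event, and other threads fill the gap until the pipeline saturates. This is a textbook-style approximation, not the design described in the talk; the function name and all cycle counts below are hypothetical.

```python
def utilization(n_threads: int, run_cycles: float, latency_cycles: float,
                switch_cycles: float = 0.0) -> float:
    """First-order core utilization with n hardware threads.

    Each thread computes for `run_cycles`, then waits `latency_cycles`;
    switching to another thread costs `switch_cycles`.
    """
    if n_threads < 1:
        raise ValueError("need at least one thread")
    # Linear region: too few threads to cover the stall latency.
    linear = n_threads * run_cycles / (run_cycles + latency_cycles + switch_cycles)
    # Saturated region: latency fully hidden, only the switch overhead remains.
    saturated = run_cycles / (run_cycles + switch_cycles)
    return min(linear, saturated)


# Hypothetical numbers: 20 cycles of work between 180-cycle memory stalls.
for n in (1, 2, 4, 8, 16):
    print(f"{n:>2d} threads -> {utilization(n, 20, 180):.0%} utilization")
```

In this model a single thread keeps the core busy 10% of the time, and about ten threads are enough to hide the stall entirely, which is the intuition behind pairing many simple cores with many threads per core.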

Combining Multithreading and Vector Pipelining [Figure: pipelined vector processing engine; labels: Vector Registers, IR, RF, ID, I$, D$/SPM; Switch to single thread: deep scalar pipeline; Switch to vector pipeline]

Thread Parallelism and Data Parallelism in Two Dimensions [Figure: deep thread parallelism and data parallelism; wide data parallelism; wide thread parallelism; labels: Vector Register File, IR, RF, ID, I$, D$/SPM]

In Conclusion • A Universal Architecture • Scalable and reconfigurable processor array • Supports thread-level and data-level parallelism • Fulfills All Requirements from Terminal to Cloud Data Center • High-performance computers • Cloud computing servers • Core network equipment • Terminals for cloud and mobile Internet
