Microservers and beyond: Pushing the boundaries of efficiency
Kevin Lim, Research Scientist, HP Labs, Intelligent Infrastructure Lab
Embracing scale-out
• Now in the scale-out computing era
• Datacenters with 100,000s of servers
• Billions of devices
• ZBs of data moved and processed
• Must always design for scale
• Microservers can provide significant efficiency benefits
• Total cost of ownership at the datacenter level
• Metrics: Performance / W, Performance / $
[Chart: New Data Center Construction Costs* — values shown: $30B, $65B]
*IDC 2011: Market Analysis Perspective: Worldwide Datacenter Trends and Strategies 2010
HP Labs and Microservers
• Have long viewed microservers as a critical building block
• Multiple research publications since 2008
• When do they make sense? How should servers be designed?
• This talk covers three main works:
• µBlades: initial microserver exploration
• Disaggregated memory: addressing capacity needs
• SoC architectures for hyperscale servers: system-level integration
• And pushes microservers to an extreme: nanostores
µBlades for the cloud
• Focus on the internet sector: the fastest-growing server market
• Google, Amazon, and MSN's billion-dollar data centers
• Extreme scale: millions of users on 100,000s of servers
• Infrastructure and power/cooling are among the largest expenses
• Needs for system architecture research:
• Cloud application benchmarking
• Whole-system cost analysis
• Holistic system designs with compelling performance/cost
Benchmarks and Metrics
• New benchmark suite:
• Websearch: unstructured data, large data sets
• Webmail: user involvement, scale-out applications
• Video sharing: rich media, large streaming data
• MapReduce: cloud computing, internet as a platform
• Key metric: sustained performance / total cost of ownership (Perf / TCO-$)
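The Perf / TCO-$ metric above can be sketched in a few lines. This is an illustrative model, not the talk's actual methodology: the cooling overhead, electricity price, lifetime, and all server numbers are assumptions chosen only to show how the metric composes.

```python
# Hypothetical sketch of the Perf / TCO-$ metric; all numbers are
# illustrative assumptions, not figures from the talk.

def tco_dollars(hw_cost, power_w, cooling_overhead=0.8,
                electricity_per_kwh=0.10, years=3):
    """Total cost of ownership: hardware plus lifetime energy.

    cooling_overhead approximates the power-delivery and cooling
    burden added on top of each watt of IT power (assumed value).
    """
    hours = years * 365 * 24
    energy_kwh = power_w * (1 + cooling_overhead) * hours / 1000.0
    return hw_cost + energy_kwh * electricity_per_kwh

def perf_per_tco(sustained_perf, hw_cost, power_w):
    """The talk's key metric: sustained performance / TCO-$."""
    return sustained_perf / tco_dollars(hw_cost, power_w)

# Compare a cost-optimized baseline server against a microserver
# (made-up performance, cost, and power figures).
baseline = perf_per_tco(sustained_perf=100.0, hw_cost=2000.0, power_w=250.0)
micro    = perf_per_tco(sustained_perf=40.0,  hw_cost=500.0,  power_w=45.0)
print(f"microserver advantage: {micro / baseline:.2f}x")
```

The point of dividing by TCO rather than hardware cost alone is that a lower-power server keeps winning even after its lower peak performance is accounted for, because energy dominates over the lifetime.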
Cost analysis of baseline servers
[Chart: cost breakdown across hardware and power & cooling]
• A holistic approach must be taken to reduce costs
µBlades: attack the inefficiencies
(1) Power-efficient, embedded-class low-cost processors
(2) Compact packaging, aggregate cooling, enclosure optimizations
(3) Remote shared memory blades and flash-based disk caching
µBlades: performance/TCO-$
• 2.0x perf/TCO relative to the cost-optimized baseline
Microservers and the memory capacity wall
• Trend: 30% less GB/core every 2 years
• Very small physical space
• Making things worse:
• DRAM scaling slowing
• Fewer modules close to cores
• Non-linear $/GB curve [chart: $/GB, June 2008]
• Growing need for memory:
• Many-core scaling
• Workload consolidation
• In-memory DB/BI, DISC
• Interactive web-scale workloads
• Performance/cost impact
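The "30% less GB/core every 2 years" trend compounds quickly; a tiny sketch makes the decline concrete. The 4 GB/core starting point is an assumption for illustration only.

```python
# Compounding the slide's trend: 30% less GB/core every two years.
# The starting value of 4 GB/core is an assumed illustration.
gb_per_core = 4.0
for year in range(0, 12, 2):
    print(f"year {year:2d}: {gb_per_core:.2f} GB/core")
    gb_per_core *= 0.70  # 30% less each 2-year step
```

By year 8 the ratio is 0.7^4 ≈ 0.24 of the starting point, i.e. under a quarter of today's memory per core within a decade, which is why capacity expansion becomes a first-order design problem.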
Opportunity: optimizing for the ensemble
[Charts: intra-server variation over time (TPC-H, log scale); inter-server variation (rendering farm)]
• Use the same concepts as microblades/servers!
• Dynamic provisioning across the ensemble enables cost & power savings
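The savings from ensemble-level provisioning come from servers peaking at different times: shared capacity only needs to cover the peak of the aggregate demand, not the sum of every server's individual peak. A toy example with made-up demand traces shows the difference.

```python
# Toy illustration of ensemble provisioning; the demand traces are
# invented, not measured.

traces = [
    [8, 2, 1, 2],   # server A: memory demand (GB) over four intervals
    [1, 7, 2, 1],   # server B
    [2, 1, 9, 2],   # server C
]

# Per-server provisioning: each server sized for its own peak.
sum_of_peaks = sum(max(t) for t in traces)

# Ensemble provisioning: shared capacity sized for the peak of the
# aggregate demand across servers.
aggregate = [sum(col) for col in zip(*traces)]
peak_of_sum = max(aggregate)

print(f"per-server provisioning: {sum_of_peaks} GB")  # 24 GB
print(f"ensemble provisioning:   {peak_of_sum} GB")   # 12 GB
```

Because the three peaks are uncorrelated, shared provisioning halves the capacity here; the same statistical-multiplexing argument underlies the disaggregated-memory savings on the following slides.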
Disaggregated memory contributions Goal: Expand capacity & provision for typical usage • New architectural building block: memory blade • Breaks traditional compute-memory co-location • Architectures for transparent memory expansion • Capacity expansion: • 8x performance over provisioning for median usage • Higher consolidation • Capacity sharing: • Lower power and costs • Better performance / dollar
General memory blade design
[Diagram: compute blades (CPUs) connected over the backplane to a memory blade containing a protocol engine, memory controller, address-mapping logic, and arrays of DIMMs]
• Performance: accessed as memory, not swap space
• Commodity: connected via PCIe or HyperTransport
• Cost: handles dynamic memory partitioning; leverages the sweet spot of RAM pricing
• Transparency: enforces allocation, isolation, and mapping
• Other optimizations
• Breaks CPU-memory co-location; leverages fast, shared communication fabrics
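The address-mapping and isolation roles of the memory blade can be sketched in miniature. This is a hypothetical model, not HP's actual design: the class, page size, and method names are all invented to illustrate how dynamic partitioning and per-client isolation fit together.

```python
# Minimal sketch (hypothetical, not HP's design) of a memory blade's
# address-mapping logic: translate a client blade's remote page to a
# local (DIMM, offset), enforcing allocation and isolation per client.

PAGE = 4096

class MemoryBlade:
    def __init__(self, num_dimms=8, pages_per_dimm=1024):
        # Free list of (dimm, page) slots across all DIMMs.
        self.free = [(d, p) for d in range(num_dimms)
                            for p in range(pages_per_dimm)]
        # Per-client map: remote page number -> (dimm, page).
        self.maps = {}

    def allocate(self, client, remote_page):
        """Dynamic partitioning: grant `client` one page of capacity."""
        slot = self.free.pop()
        self.maps.setdefault(client, {})[remote_page] = slot
        return slot

    def translate(self, client, remote_addr):
        """Address mapping with isolation: a client can only reach
        pages in its own allocation map."""
        entry = self.maps.get(client, {}).get(remote_addr // PAGE)
        if entry is None:
            raise PermissionError("access outside client's allocation")
        dimm, page = entry
        return dimm, page * PAGE + remote_addr % PAGE

blade = MemoryBlade()
blade.allocate("server-0", remote_page=0)
print(blade.translate("server-0", 0x123))
```

Keeping the map on the blade is what makes expansion transparent to the compute blade: the client issues ordinary remote-memory accesses over PCIe or HyperTransport, and the blade decides where they land.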
Disaggregated memory results
• Baseline: Mem_median
• Trace-based simulation (TPC, search/index, SpecJBB, SpecCPU)
• Performance: 8x for memory-limited workloads, slightly worse than ideal
• Consolidation: 60% server reduction (web 2.0 company traces)
Case for integration: server inefficiencies today
[Diagram: a typical 2-socket blade server with Xeon CPUs, DIMMs, northbridge/southbridge chipsets, NIC, disk, and SerDes links]
• Three-chip chipset (PCIe: 27W, I/O: 7W)
• CPU + chipset: > 200W (and ~$2k)
• General-purpose, un-optimized parts
Why slow adoption of server SoCs?
• Aggressive SoCs are largely absent in general-purpose CPUs
• Mix & match one CPU chip with different peripherals
• Different development timescales for CPUs & peripherals
• Integration trend: slower than embedded, but steady over time (and accelerating)
• Today: mounting cost & energy pressure in datacenters demands technologies that first and foremost decrease TCO
• Energy: fewer pin crossings, more efficient (heterogeneous) blocks
• Cost: fewer sockets, lower BoM, etc.
• Creating a stronger case for SoC-based servers
System-level integration for hyperscale servers
• Goal: quantify system-level integration benefits
• Energy reduction: remove expensive pin crossings to the I/O subsystem
• Cost reduction: use of silicon density, riding the mobile volume commodity curve
• For example, a 100 mm² SoC at 16nm vs. the best non-SoC:
• Reduces total silicon area by ~40%, dynamic chip power by ~30%
• Reduces datacenter-level cost by ~20%, energy by ~35%, TCO by ~25%
• TCO reduction range: 15-40%, >50% with network port aggregation
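The slide's ~20% cost and ~35% energy reductions can plausibly compose into the quoted ~25% TCO reduction if TCO is split between capital and energy. The split below is an assumption chosen to show the arithmetic, not a figure from the talk.

```python
# Hedged sketch of how the slide's numbers might compose into TCO.
# The capex/energy split is an assumed weighting, not from the talk.

capex_share  = 0.65          # assumed: fraction of TCO that is cost
energy_share = 1.0 - capex_share

cost_reduction   = 0.20      # from the slide: datacenter-level cost
energy_reduction = 0.35      # from the slide: datacenter-level energy

tco_reduction = (capex_share * cost_reduction
                 + energy_share * energy_reduction)
print(f"estimated TCO reduction: {tco_reduction:.0%}")  # prints "estimated TCO reduction: 25%"
```

The 15-40% range on the slide then corresponds naturally to varying this weighting (and the per-component reductions) across workloads and facility cost structures.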
Design space exploration complexity
• Huge design space with multilevel design points:
• Chip level: architecture, core, SoC components, resource provisioning, chip cost
• Board and server level: socket planning, inter-chip aggregation
• Datacenter level: intra-rack aggregation
• Workload variability: computation-intensive vs. I/O-intensive (memory, disk, network)
• Significant development of tools:
• Extended McPAT, paired with TCO models
[Diagram: partially integrated SoC]
SoC scaling trends
[Chart: core vs. periphery scaling]
• On-chip components scale differently
Chip size TCO analysis • Analysis using small cores
Continuing microserver challenges • How to apply to more workloads? • Perhaps accelerators or heterogeneity • How to minimize unnecessary duplication at scale? • Perhaps content-based page sharing, common boot images • How to ensure balanced systems? • Perhaps further disaggregation/composition of resources • How to reduce # of network ports? • Perhaps more intelligent local routing, aggregation
Thank you! kevin.lim@hp.com • Sheng Li • John Byrne • Laura Ramirez • Alvin AuYoung • Naveen Muralimanohar • Chandrakant Patel • Special thanks to: • Partha Ranganathan • Norm Jouppi • Jichuan Chang • Paolo Faraboschi • Yoshio Turner • Jose Renato Santos