System Architecture for Web-Scale Applications Using Lightweight CPUs and Virtualized I/O Kshitij Sudan*, Saisanthosh Balakrishnan§, Sean Lie§, Min Xu§, Dhiraj Mallick§, Gary Lauterbach§, Rajeev Balasubramonian*
Exec Summary • Focus on web-scale applications • Contribution 1: use of simple cores • This amplifies the power/cost contribution of the I/O subsystem • Contribution 2: virtualize I/O, e.g., a single disk shared by many cores • Contribution 3: software-stack optimizations • Contribution 4: evaluations on a production-quality, real design HPCA-2013
Web-Scale Applications • Targeting datacenter platforms • Focus on power and cost (OpEx and CapEx) • Web-scale applications have large datasets, high concurrency, heavy communication, and heavy I/O – e.g., MapReduce • Typically, performance increases as cluster size grows, but so do power and cost HPCA-2013
Energy-Efficient CPUs • For embarrassingly parallel workloads, energy per instruction (EPI) is the key metric • For a given power/energy budget, many low-EPI cores can yield higher throughput than a few high-EPI cores, as the sketch below illustrates • Hence, use many light-weight, energy-efficient CPUs (Atom CPU at 8.5 W) HPCA-2013
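A minimal Python sketch of the EPI argument, assuming a hypothetical 200 W budget and illustrative per-core performance numbers; only the 8.5 W Atom figure comes from the slide.

```python
# Back-of-envelope: throughput under a fixed power budget.
# All numbers are illustrative assumptions except the 8.5 W Atom TDP.

POWER_BUDGET_W = 200.0  # hypothetical rack-slot budget

def cores_and_throughput(core_power_w, perf_per_core):
    """How many cores fit in the budget, and the aggregate throughput."""
    n_cores = int(POWER_BUDGET_W // core_power_w)
    return n_cores, n_cores * perf_per_core

# A low-EPI core is slower per core but far cheaper in watts.
light = cores_and_throughput(core_power_w=8.5, perf_per_core=1.0)   # e.g., Atom
heavy = cores_and_throughput(core_power_w=65.0, perf_per_core=3.0)  # e.g., big OoO core

print(f"light-weight cores: {light[0]} cores, throughput {light[1]:.0f}")
print(f"heavy-weight cores: {heavy[0]} cores, throughput {heavy[1]:.0f}")
# For embarrassingly parallel work, 23 light cores out-throughput 3 heavy ones.
```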
Contribution of the I/O Sub-System • With light-weight cores, the energy and cost contributions of "other" components grow • Intel Atom CPU + chipset = 11 W • Typical disk or Ethernet card = 5-25 W • Fans, power supplies, etc. • The application uses only 20-60 MB/s of disk bandwidth, while the disk has a peak read bandwidth of 120 MB/s HPCA-2013
Cluster-in-a-Box with Virtualized I/O • Use energy-efficient CPUs • ~10x more CPUs in the same power budget than with typical server-class CPUs • Virtualize the I/O devices – disk and Ethernet • Balanced resource provisioning and lower cost/power (see the provisioning sketch below) • Amortize fixed server overheads by sharing components • Fans, power supplies, etc. HPCA-2013
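The provisioning argument can be made concrete with the bandwidth figures from the previous slide; a minimal sketch, where only the 20-60 MB/s demand range and the 120 MB/s peak are from the slides.

```python
# Provisioning arithmetic for virtualized disks.

DISK_PEAK_MBPS = 120.0   # peak sequential read bandwidth of one disk
DEMAND_LO_MBPS = 20.0    # observed per-node disk demand, low end
DEMAND_HI_MBPS = 60.0    # observed per-node disk demand, high end

# If each node only needs 20-60 MB/s, one 120 MB/s disk can be shared by:
print(f"nodes per disk: {DISK_PEAK_MBPS / DEMAND_HI_MBPS:.0f} (heavy demand) "
      f"to {DISK_PEAK_MBPS / DEMAND_LO_MBPS:.0f} (light demand)")
# The system's 64 disks for 384 nodes (6 nodes per disk) matches the
# light-demand end of this range; virtualized I/O smooths bursty access.
```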
Compute Cards • Each compute card: 6 CPUs sharing 4 ASICs over PCIe; the ASICs implement the fabric; 4 GB of DDR2 memory per CPU on the back of the card HPCA-2013
Logical Organization • Compute cards: CPU + chipset pairs attached to fabric ASICs • S-Cards: storage FPGAs (up to 8 per system, each with 8x SATA HDD/SSD) • E-Cards: Ethernet FPGAs (up to 8 per system, each with 8x 1 GbE or 2x 10 GbE) • 3D-torus interconnect formed by the ASICs (see the hop-count sketch below) HPCA-2013
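A minimal sketch of why a 3D torus keeps fabric hop counts low. The 8x8x12 shape is an assumption chosen so that 8*8*12 = 768 matches the core count; the slides do not give the actual torus dimensions.

```python
# Hop distance in a wrap-around 3D torus (the topology the fabric ASICs form).

def torus_hops(a, b, dims=(8, 8, 12)):
    """Minimal hop count between coordinates a and b in a 3D torus."""
    hops = 0
    for x, y, d in zip(a, b, dims):
        delta = abs(x - y)
        hops += min(delta, d - delta)  # torus links wrap, so take the short way
    return hops

print(torus_hops((0, 0, 0), (7, 7, 11)))  # -> 3: wrap links make "corners" near
print(torus_hops((0, 0, 0), (4, 4, 6)))   # -> 14: the true worst case
# A plain mesh without wrap-around would need 7+7+11 = 25 hops corner to corner;
# the torus halves the worst-case distance in every dimension.
```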
Physical Organization • [Chassis diagram: midplane interconnect, E-Card, S-Card, compute cards, HDD/SSDs] HPCA-2013
Cluster-in-a-Box Summary • 768 CPU cores interconnected by a high-bandwidth fabric in a 3D-torus topology • Low-latency distributed fabric architecture based on low-power ASICs • FPGAs implement the disk and Ethernet controllers • The fabric and FPGAs together implement I/O virtualization • Up to 64 disks shared by 384 server nodes • Server nodes don't require a top-of-rack switch to communicate • All intra-cluster communication travels over the fabric • The entire cluster consumes < 3.5 kW under full load (per-node arithmetic below) HPCA-2013
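A quick back-of-envelope on the summary numbers above; the uniform division of power across nodes is an assumption, since the slides report only the cluster total.

```python
# All inputs are taken directly from the summary slide.
TOTAL_POWER_W = 3500.0   # entire cluster under full load (< 3.5 kW)
SERVER_NODES = 384
CPU_CORES = 768
DISKS = 64

print(f"power per node: {TOTAL_POWER_W / SERVER_NODES:.1f} W")  # ~9.1 W
print(f"power per core: {TOTAL_POWER_W / CPU_CORES:.1f} W")     # ~4.6 W
print(f"nodes per shared disk: {SERVER_NODES // DISKS}")        # 6
# Sharing fans, power supplies, and the fabric keeps the whole per-node
# budget close to the bare CPU+chipset power of a stand-alone board.
```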
System Software Improvements • Implement large SATA packet sizes to reduce disk-seek overheads • Other OS/Ethernet configuration knobs: avoid journaling in the filesystem, jumbo TCP/IP frames, interrupt coalescing (sketched below) • MapReduce configuration: designate the few nodes near the S-Cards as DataNodes HPCA-2013
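A hedged sketch of the stock Linux knobs behind the second bullet, driven from Python for consistency with the other sketches. Device names (/dev/sdb, eth0) and values are illustrative assumptions; the large-SATA-packet change lives inside the system's own virtualized-disk stack rather than in a standard tool.

```python
# Standard Linux commands for the OS/Ethernet knobs the slide lists.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Avoid filesystem journaling: create ext4 without a journal.
run(["mkfs.ext4", "-O", "^has_journal", "/dev/sdb"])
# Jumbo TCP/IP frames on the fabric-facing interface.
run(["ip", "link", "set", "dev", "eth0", "mtu", "9000"])
# Interrupt coalescing: batch RX interrupts to cut per-packet overhead.
run(["ethtool", "-C", "eth0", "rx-usecs", "100"])
```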
Methodology • Compare two cluster designs with the same power envelope to evaluate TCO and power • A 17-node Core i7 cluster (baseline) and a 384-node Atom cluster-in-a-box • 4 kW Core i7 cluster; 3.5 kW Atom cluster-in-a-box • Four Apache Hadoop benchmarks • TCO calculations based on Hamilton's model (a simplified sketch follows) HPCA-2013
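Hamilton's model combines amortized CapEx with power-driven OpEx; a simplified sketch, with all dollar figures, lifetimes, and the PUE as illustrative assumptions rather than the paper's inputs.

```python
# Simplified Hamilton-style monthly TCO: amortized hardware cost plus
# electricity, with facility overhead folded in via the PUE multiplier.

def monthly_tco(capex_dollars, lifetime_months, power_kw,
                pue=1.5, dollars_per_kwh=0.07):
    amortized_capex = capex_dollars / lifetime_months
    hours_per_month = 24 * 30
    power_cost = power_kw * pue * hours_per_month * dollars_per_kwh
    return amortized_capex + power_cost

baseline = monthly_tco(capex_dollars=25_000, lifetime_months=36, power_kw=4.0)
cluster = monthly_tco(capex_dollars=60_000, lifetime_months=36, power_kw=3.5)
print(f"baseline:       ${baseline:,.0f}/month")
print(f"cluster-in-box: ${cluster:,.0f}/month")
# Performance/TCO, the paper's figure of merit, divides benchmark
# throughput by this monthly cost.
```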
Improvement in EDP HPCA-2013
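For readers of this figure: EDP (energy-delay product) is energy consumed times execution time, so it rewards designs that are both faster and more frugal. A minimal sketch with made-up runtimes, not the measured results.

```python
# EDP = energy (J) x delay (s); lower is better.

def edp(power_w, runtime_s):
    energy_j = power_w * runtime_s
    return energy_j * runtime_s  # joule-seconds

baseline_edp = edp(power_w=4000.0, runtime_s=1000.0)  # 4 kW Core i7 cluster
cluster_edp = edp(power_w=3500.0, runtime_s=450.0)    # illustrative runtime
print(f"EDP improvement: {baseline_edp / cluster_edp:.1f}x")
# -> 5.6x with these illustrative runtimes; the paper reports up to 6x.
```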
Performance/TCO vs. Number of Disks and Number of Cores HPCA-2013
Conclusions • Datacenter power and cost are limiting factors when scaling web-scale apps • Build clusters from light-weight, low-power CPUs • Balanced resource provisioning improves utilization, cost, and power • Virtualize I/O (disk and Ethernet) • Amortize the overheads of fans, power supplies, etc. • The cluster-in-a-box system yields up to a 6x improvement in EDP, relative to a traditional cluster HPCA-2013
Questions? Thank You
CPU and Disk Utilization • [Figure: utilization for a 768-CPU, 64-disk configuration vs. a 64-CPU, 32-disk configuration] HPCA-2013