Scaling and Packing on a Chip Multiprocessor
Vincent W. Freeh, Tyler K. Bletsch, Freeman L. Rawson, III
Austin Research Laboratory
Introduction (1)
• Want to save power without a performance hit
• Dynamic Voltage and Frequency Scaling (DVFS)
  • Slow down the CPU
  • Linear speed loss, quadratic CPU power drop (power ∝ frequency × voltage²; toy model below)
  • Efficient, but limited range
[Figure: performance and power as functions of frequency/voltage]
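As a rough illustration of the DVFS tradeoff above, the sketch below compares relative performance (assumed linear in frequency) with relative dynamic power (frequency × voltage²). The frequency/voltage pairs are hypothetical, not measurements from this cluster.

```python
# Toy DVFS model: performance falls ~linearly with frequency, while dynamic
# CPU power falls with frequency * voltage^2. All values are illustrative.

def relative_perf(freq_hz, base_freq_hz):
    """Performance relative to the top P-state (assumed linear in frequency)."""
    return freq_hz / base_freq_hz

def relative_power(freq_hz, volts, base_freq_hz, base_volts):
    """Dynamic power relative to the top P-state (proportional to f * V^2)."""
    return (freq_hz * volts ** 2) / (base_freq_hz * base_volts ** 2)

BASE_F, BASE_V = 1.8e9, 1.40  # hypothetical top P-state: 1.8 GHz at 1.40 V
for f, v in [(1.8e9, 1.40), (1.4e9, 1.25), (1.0e9, 1.10)]:
    print(f"{f / 1e9:.1f} GHz: perf {relative_perf(f, BASE_F):.2f}x, "
          f"power {relative_power(f, v, BASE_F, BASE_V):.2f}x")
```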
Introduction (2)
• CPU Packing
  • Run a workload on fewer CPU cores
  • Linear speed loss, linear power drop
  • Less efficient, but greater range
• What about using scaling and packing together?
[Figure: all four cores active without packing vs. two cores deactivated with packing]
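A minimal sketch of what packing looks like in practice on Linux, assuming the workload can simply be pinned to a subset of cores with `os.sched_setaffinity`; the core numbers and command are placeholders, not the paper's tooling.

```python
import os
import subprocess

# "Pack" a workload onto cores 0 and 1 (an x2-style configuration), leaving
# the remaining cores idle. Core numbering and the command are placeholders.
PACKED_CORES = {0, 1}

def run_packed(cmd, cores=PACKED_CORES):
    """Run a command with its CPU affinity restricted to `cores` (Linux-only)."""
    return subprocess.call(
        cmd,
        preexec_fn=lambda: os.sched_setaffinity(0, cores),  # pin the child before exec
    )

# Hypothetical usage:
# run_packed(["./some_benchmark"])
```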
Hardware architecture
• Memory hierarchy:
  • L1 & L2 cache: per-core
  • Local memory: per-socket
  • Remote memory: accessible via the HyperTransport bus
[Figure: cluster of 8 nodes; each node has two dual-core AMD64 sockets with per-core L1/L2 caches and 1 GB of memory per socket, connected by HyperTransport]
P-states and configurations
• Scaling:
  • Entire socket must scale together
  • 5 P-states: every 200 MHz from 1.8 GHz down to 1.0 GHz (sysfs sketch below)
• Packing: 5 configurations:
  • All four cores: ×4
  • Three cores: ×3
  • Cores 0 and 2: ×2*
  • Cores 0 and 1: ×2
  • One core: ×1
• For multi-node tests, prepend the number of nodes
  • 4×2: 4 nodes, 4 sockets, 8 total cores (cores 0 and 1 active on each node)
[Figure: two-socket node diagram illustrating which cores are active in each configuration]
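As an illustration, one way to select a P-state on Linux is through the cpufreq sysfs interface with the userspace governor; exact paths, governors, and available frequencies depend on the kernel and driver, so this is only a sketch. Since the whole socket scales together, the same frequency is written for every core on the socket.

```python
# Sketch: pin cores to one P-state via the Linux cpufreq sysfs interface.
# Requires root and the "userspace" governor; paths and available frequencies
# are driver-dependent. Frequencies are given in kHz.
SYSFS = "/sys/devices/system/cpu/cpu{core}/cpufreq/{attr}"

def set_pstate(cores, freq_khz):
    """Put every core in `cores` at `freq_khz` (the whole socket must match)."""
    for core in cores:
        with open(SYSFS.format(core=core, attr="scaling_governor"), "w") as f:
            f.write("userspace\n")
        with open(SYSFS.format(core=core, attr="scaling_setspeed"), "w") as f:
            f.write(f"{freq_khz}\n")

# e.g. run a node's four cores at 1.4 GHz:
# set_pstate(range(4), 1_400_000)
```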
Metrics
• Performance
  • Execution time (s)
  • Throughput (work/s)
• Energy (J)
  • Measured by a WattsUp meter
  • "Amount of coal burned"
• Power (W)
  • Energy per unit time (J/s)
  • AvgPower = Energy / ExecutionTime
• Energy-Delay Product (EDP)
  • Energy × ExecutionTime
  • Balances the tradeoff between energy and time (helper below)
[Figure: power over time, with energy and average power marked]
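The derived metrics on this slide are simple to compute from a WattsUp-style total energy reading and the measured execution time; the numbers in the example are made up.

```python
# Derived metrics from a measured energy total and execution time.
# The example values below are invented for illustration.

def derived_metrics(energy_j, exec_time_s):
    avg_power_w = energy_j / exec_time_s   # average power: J / s = W
    edp = energy_j * exec_time_s           # energy-delay product: J * s
    return avg_power_w, edp

avg_power, edp = derived_metrics(energy_j=12_000.0, exec_time_s=60.0)
print(f"Average power: {avg_power:.1f} W   EDP: {edp:.0f} J*s")
```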
Three application classes
• CPU-bound
  • No communication, fits in cache
  • Near 100% CPU utilization
• High-Performance Computing (HPC)
  • Inter-node communication
  • Significant memory usage
  • Performance = execution time
• Commercial
  • Constant servicing of remote requests
  • Possibly significant memory usage
  • Performance = throughput
(1) CPU-bound workloads
• Workload: DAXPY, a small linear algebra kernel (sketch below)
  • Run repeatedly on a single core with an in-cache working set
  • Representative of the entire class
• Scaling: linear slowdown, quadratic power cut
• Packing:
  • ×4 is most efficient
  • ×2* is no good here; ×3 is right out
  • ×1 and ×2 save power but kill performance
[Figure: results for each packing configuration across the different P-states]
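For reference, DAXPY computes y = a*x + y over double-precision vectors. Below is a minimal sketch of the kernel run repeatedly over a small in-cache array, as the slide describes; the array size and iteration count are arbitrary.

```python
import array

def daxpy(a, x, y):
    """The DAXPY kernel: y <- a*x + y for double-precision vectors."""
    for i in range(len(x)):
        y[i] = a * x[i] + y[i]

# Keep the working set small so it stays in cache; sizes are arbitrary.
n = 4096
x = array.array("d", (1.0 for _ in range(n)))
y = array.array("d", (2.0 for _ in range(n)))

for _ in range(1000):   # repeat to keep one core near 100% busy
    daxpy(3.0, x, y)
```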
(2) HPC workloads
• Packing with a fixed number of nodes (NAS Parallel Benchmarks over MPI)
  • LU speeds up under ×2*; CG slows down and its CPU utilization falls
• Before: the chip was physically removed from the socket
• Now: chip removal is simulated by subtracting 20 W
[Figure: LU and CG results under the packing configurations]
(2) HPC workloads
• Packing with a fixed number of CPU cores
  • LU again speeds up under ×2*
[Figure: results with the total core count held fixed]
(3) Commercial workloads
• Apache with a PHP workload, driven by an httperf client
Conclusions
• Packing is less efficient than scaling
  • Therefore: scale first, then pack
• Nothing can help CPU-bound apps
• Memory/IO-bound workloads are scalable
• Commercial workloads can benefit from scaling and packing
  • Especially at low utilization levels
• Resource utilization affects (predicts?) the effectiveness of scaling and packing
Future work
• How does resource utilization influence the effectiveness of scaling/packing?
  • A predictive model based on resource usage?
  • A power management engine based on resource usage?
• Dynamic packing
  • Move processes between cores? Yes: "CPU affinity", "CPU sets", etc. (sketch below)
  • Move processes between nodes? Yes: virtualization allows live domain migration
  • Packing on the fly
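Dynamic packing at the process level amounts to changing CPU affinity at runtime; a hedged sketch, assuming Linux and permission to modify the target processes. The PIDs and core sets are placeholders.

```python
import os

def repack(pids, cores):
    """Move already-running processes onto the core set `cores` (Linux-only)."""
    for pid in pids:
        os.sched_setaffinity(pid, cores)  # the kernel migrates each process as needed

# e.g. pack two hypothetical workers onto cores 0 and 2 (an x2*-style layout):
# repack([1234, 1235], {0, 2})
```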