
Scaling and Packing on a Chip Multiprocessor


Presentation Transcript


  1. Austin Research Laboratory. Scaling and Packing on a Chip Multiprocessor. Vincent W. Freeh, Tyler K. Bletsch, Freeman L. Rawson, III

  2. Introduction (1)
• Want to save power without a performance hit
• Dynamic Voltage and Frequency Scaling (DVFS)
• Slow down the CPU
• Linear speed loss, quadratic CPU power drop
• Efficient, but limited range
• power ∝ frequency × voltage²
[Figure: performance and power as functions of frequency/voltage]
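To make the tradeoff concrete, here is a minimal sketch of the first-order model on this slide; the (frequency, voltage) operating points are illustrative assumptions, not values measured in this work.

```python
# Toy DVFS model: speed falls linearly with frequency, while dynamic
# CPU power follows power ∝ frequency × voltage².
# The (GHz, volts) pairs below are hypothetical.
states = [(1.8, 1.30), (1.4, 1.20), (1.0, 1.10)]

f_max, v_max = states[0]
for f, v in states:
    rel_speed = f / f_max                         # linear speed loss
    rel_power = (f * v**2) / (f_max * v_max**2)   # superlinear power drop
    print(f"{f:.1f} GHz: {rel_speed:.0%} speed, {rel_power:.0%} CPU power")
```

Because voltage can fall along with frequency, power drops faster than performance, which is what makes DVFS efficient within its range.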

  3. Introduction (2)
• CPU Packing
• Run a workload on fewer CPU cores
• Linear speed loss, linear power drop
• Less efficient, greater range
• Using scaling and packing together? (see the sketch below)
[Figure: without packing, all 4 CPU cores in use; with packing, 2 cores deactivated]
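The sketch referenced above: a back-of-envelope comparison of the two mechanisms, assuming a perfectly parallel workload and the linear/quadratic relationships stated on these two slides.

```python
# Relative energy (power × time) under each mechanism, normalized to
# running full-speed on all cores. Assumes perfect parallel scaling.

def scaling_energy(rel_freq):
    time = 1.0 / rel_freq      # linear speed loss
    power = rel_freq ** 2      # quadratic CPU power drop
    return power * time        # energy falls linearly with frequency

def packing_energy(active_cores, total_cores=4):
    time = total_cores / active_cores    # linear speed loss
    power = active_cores / total_cores   # linear power drop
    return power * time                  # energy stays flat

print(scaling_energy(0.5))   # 0.5: halving frequency halves energy
print(packing_energy(2))     # 1.0: halving cores leaves energy unchanged
```

In this idealized model packing trims power but not energy, while scaling trims both, matching the "less efficient, greater range" characterization.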

  4. Hardware architecture
• Memory hierarchy:
• L1 & L2 cache: per-core
• Local memory: per-socket
• Remote memory: accessible via HyperTransport bus
[Figure: a cluster of 8 nodes joined by HyperTransport; within a node, AMD64 cores 0-3 sit across sockets 0 and 1, each core with its own L1 data/instruction caches and L2, and 1GB of memory attached to each socket]

  5. P-states and configurations
• Scaling:
• Entire socket must scale together
• 5 P-states: every 200MHz from 1.8GHz down to 1.0GHz
• Packing: 5 configurations:
• All four cores: ×4
• Three cores: ×3
• Cores 0 and 2: ×2*
• Cores 0 and 1: ×2
• One core: ×1
• For multi-node tests, prepend the number of nodes
• 4×2: 4 nodes, 4 sockets, 8 total cores (cores 0 and 1 active)
[Figure: sockets, cores 0-3, and per-socket memory joined by HyperTransport]
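As a concrete aside, on a current Linux system a fixed P-state like these can be requested through the cpufreq sysfs interface. This is a minimal sketch, assuming root privileges and a driver that exposes the "userspace" governor; paths and available frequencies vary by kernel and CPU.

```python
# Pin CPU 0 to a fixed frequency via the Linux cpufreq sysfs interface.
# Assumes root and the 'userspace' governor; setspeed values are in kHz.
cpu = 0
base = f"/sys/devices/system/cpu/cpu{cpu}/cpufreq"

with open(f"{base}/scaling_governor", "w") as f:
    f.write("userspace")        # let userspace choose the frequency
with open(f"{base}/scaling_setspeed", "w") as f:
    f.write("1000000")          # 1.0 GHz, the lowest P-state listed above
```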

  6. Metrics
• Performance
• Execution time (s)
• Throughput (work/s)
• Energy (J)
• Measured by a WattsUp meter
• "Amount of coal burned"
• Power (W)
• Energy per unit time (J/s)
• AvgPower = Energy / ExecutionTime
• Energy-Delay Product (EDP)
• EDP = Energy × ExecutionTime
• Balances the tradeoff between energy and time
[Figure: power vs. time, showing average power and energy]
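The derived metrics follow directly from one meter reading; a small helper, with hypothetical numbers:

```python
def derived_metrics(energy_j: float, exec_time_s: float) -> dict:
    """Derive the slide's metrics from measured energy and execution time."""
    return {
        "avg_power_W": energy_j / exec_time_s,   # AvgPower = Energy / Time
        "EDP_Js": energy_j * exec_time_s,        # Energy-Delay Product
    }

# Hypothetical run: 12,000 J consumed over 60 s.
print(derived_metrics(12000.0, 60.0))
# {'avg_power_W': 200.0, 'EDP_Js': 720000.0}
```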

  7. Three application classes
• CPU-bound
• No communication, fits in cache
• Near 100% CPU utilization
• High-Performance Computing (HPC)
• Inter-node communication
• Significant memory usage
• Performance = execution time
• Commercial
• Constant servicing of remote requests
• Possibly significant memory usage
• Performance = throughput

  8. (1) CPU-bound workloads
• Workload: DAXPY, a small linear algebra kernel (sketched below)
• Run repeatedly on a single core with an in-cache working set
• Representative of the entire class
• Scaling: linear slowdown, quadratic power cut
• Packing:
• ×4 is most efficient
• ×2* is no good here; ×3 is right out
• ×1 and ×2 save power but kill performance
[Chart: results at the different P-states]
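The sketch referenced above: a minimal NumPy rendering of a DAXPY-style loop in the spirit of this benchmark. The vector length and iteration count are illustrative stand-ins, not the paper's parameters.

```python
import numpy as np

# DAXPY: y = a*x + y, sized to stay in cache and repeated so a single
# core runs at ~100% utilization with no communication or memory traffic.
n = 4096          # illustrative: small enough to fit in L2
a = 2.0
x = np.ones(n)
y = np.zeros(n)
for _ in range(100_000):   # illustrative repeat count
    y += a * x             # in-place update keeps the working set cached
```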

  9. (2) HPC workloads
• Packing with a fixed number of nodes (NPBs with MPI)
• LU: speedup under ×2*
• CG: slowdown as its CPU utilization falls
• Before: the chip was physically removed from the socket
• Now: simulate chip removal by subtracting 20W

  10. (2) HPC workloads
• Packing with a fixed number of CPU cores
• LU: ×2* speedup again

  11. (3) Commercial workloads
• Apache with a PHP workload, driven by an httperf client (a stand-in harness is sketched below)
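The stand-in harness referenced above: a crude Python substitute for httperf that measures request throughput against a server. The URL and duration are hypothetical placeholders.

```python
import time
import urllib.request

URL = "http://server.example/test.php"   # hypothetical Apache+PHP target
DURATION_S = 10.0

done = 0
deadline = time.monotonic() + DURATION_S
while time.monotonic() < deadline:
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    done += 1

print(f"{done / DURATION_S:.1f} requests/s")   # throughput = work/s
```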

  12. Conclusions
• Packing is less efficient than scaling
• Therefore: scale first, then pack
• Nothing can help CPU-bound apps
• Memory/IO-bound workloads are scalable
• Commercial workloads can benefit from scaling/packing
• Especially at low utilization levels
• Resource utilization affects (predicts?) the effectiveness of scaling and packing

  13. Future work
• How does resource utilization influence the effectiveness of scaling/packing?
• A predictive model based on resource usage?
• A power management engine based on resource usage?
• Dynamic packing
• Move processes between cores? Yes: "CPU affinity", "CPU sets", etc. (see the sketch below)
• ...between nodes? Yes: virtualization allows live domain migration
• Packing on the fly
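The sketch referenced in the list above: on Linux, a process can be repacked at runtime by changing its CPU affinity. A minimal sketch using Python's binding of sched_setaffinity (Linux-only); run_packed_phase is a hypothetical stand-in for the workload.

```python
import os

def run_packed_phase():
    """Hypothetical stand-in for the workload's packed phase."""

# "Packing on the fly": restrict this process to cores 0 and 1 (the ×2
# configuration), run, then widen back to all four cores (×4).
os.sched_setaffinity(0, {0, 1})           # pid 0 = the calling process
run_packed_phase()
os.sched_setaffinity(0, {0, 1, 2, 3})     # restore the full ×4 configuration
```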
