
PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS

ANIL KRISHNA · Advisor: Dr. YAN SOLIHIN · PhD Defense Examination, August 6th 2013. Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu


Presentation Transcript


  1. PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS ANIL KRISHNA Advisor: Dr. YAN SOLIHIN PhD Defense Examination, August 6th 2013 Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu

  2. Good Morning!

  3. PERFORMANCE MODELING AND CHARACTERIZATION OF MULTICORE COMPUTING SYSTEMS ANIL KRISHNA Advisor: Dr. YAN SOLIHIN PhD Defense Examination, August 6th 2013 Image Source: http://en.kioskea.net/faq/372-choosing-the-right-cpu

  4. AGENDA • RESEARCH OVERVIEW – Questions I have been researching all these years; how this talk is organized • SUMMARY – Motivation, Problem, Contribution; a quick overview of my latest research • DETAILS of ReSHAPE – A performance estimation tool • VALIDATION – Does this tool work? • USE CASES – Where can it be used? • CONCLUSIONS and FUTURE DIRECTION – Where are we? Where to next?

  5. RESEARCH OVERVIEW In the context of processor chip design trends: Single Core → Multi Core (core + cache). Scaling the bandwidth wall: challenges in and avenues for CMP scaling. Brian Rogers, Anil Krishna, Gordon Bell, Ken Vu, Xiaowei Jiang, Yan Solihin. International Symposium on Computer Architecture, ISCA 2009 • Motivation – Off-chip bandwidth is pin limited, pins are area limited, and area is not growing • Problem Statement – To what extent does the bandwidth wall restrict future multi-core scaling? – To what extent can bandwidth conservation techniques help? • Contributions and Findings – Developed a simple but effective analytical performance model – Core-to-cache area ratio changes from 50:50 to 10:90 in 4 generations – Core scaling is only 3x vs. 16x in 4 generations – Different bandwidth conservation techniques have different benefits – Combining techniques can delay this problem significantly – 3D-stacked DRAM caches + link and cache compression gives >16x scaling

  6. RESEARCH OVERVIEW In the context of processor chip design trends: Single Core → Multi Core (core + cache). Data sharing in multi-threaded applications and its impact on chip design. Anil Krishna, Ahmad Samih, Yan Solihin. Intl. Symp. on Performance Analysis of Systems and Software, ISPASS 2012 • Motivation – Parallel applications are moving from SMP to a single chip, but with no change in chip design – No analytical models exist that can capture the effect of data sharing • Problem Statement – What is the right way to quantify the impact of data sharing on miss rates? – How can this be incorporated into an analytical performance model? – Does data sharing impact optimal on-chip core vs. cache ratios? • Contributions and Findings – Developed a novel approach to quantifying the true impact of data sharing – Developed an analytical performance model that incorporates data sharing – Showed that core area increases 33% to 49%; throughput increases 58% – The presence of data sharing encourages larger cores over smaller ones

  7. RESEARCH OVERVIEW In the context of processor chip design trends: Single Core → Multi Core → Hybrid Homogeneous Multi Core. Hardware acceleration in the IBM PowerEN processor: architecture and performance. Anil Krishna, Timothy Heil, Nicholas Lindberg, Farnaz Toussi, Steven VanderWiel. International conference on Parallel Architectures and Compilation Techniques, PACT 2012 • Motivation – Understand driving forces, architectural tradeoffs and performance advantages of hardware accelerators via a detailed case study • Problem Statement – How were the hardware accelerators in IBM's PowerEN selected and designed? How well do they perform? – How did the presence of hardware accelerators impact the architecture of the rest of the chip? • Contributions and Findings – Analyzed the design and performance of each hardware accelerator in PowerEN (Crypto, XML, Compression, RegX, HEA) in detail – Identified tradeoffs in what to accelerate (vs. execute on a general-purpose core) and when to accelerate (large vs. small packets) – Found that reducing communication overhead and easing programmability requires supporting many new features: a shared memory model between cores and accelerators, direct cache injection of data from accelerators, ISA extensions

  8. RESEARCH OVERVIEW In the context of processor chip design trends: Single Core → Multi Core → Hybrid Homogeneous Multi Core → Heterogeneous Multi Core • Large design space – How many cores and core types? What cache hierarchy? Heterogeneity in caches too? • Large configuration space – How to schedule applications? What DVFS settings to use? What cores and caches to power-gate? ReSHAPE: Resource Sharing and Heterogeneity-aware Analytical Performance Estimator. Anil Krishna, Ahmad Samih, Yan Solihin. Being submitted to Intl. Symposium on High Performance Computer Architecture, HPCA 2013

  9. SUMMARY – Motivation • Design and configuration space explosion with multi-core chips – As the number and types of cores grow, more designs need to be evaluated – n! static schedules for a single design with n core types – Very large configuration space with per-core DVFS, even in a single design with a single core type • Detailed simulation too slow – Be it trace or execution driven, be it cycle-by-cycle simulation or discrete-event simulation • Analytical models fast, but existing models lacking – Too abstract and lacking sufficient fidelity – Not flexible enough to handle shared caches, heterogeneity across cores, multi-program mixes

  10. SUMMARY – Problem, Contribution • Problem: Need a tool for early design space exploration – Fast: at least 1000x faster than detailed simulation – Accurate: <20% error in performance projection – Flexible: able to model shared cache hierarchies, shared memory bandwidth, heterogeneity across cores and caches on chip, and multi-programmed workload mixes • Contribution: ReSHAPE (Resource Sharing and Heterogeneity-aware Analytical Performance Estimator) – Hybrid tool: detailed simulation for key statistics + analytical model + iterative solver – Flexible – Typically runs in under a second (10,000x faster than detailed simulation) – Accuracy is promising: IPC error <5% and cache miss rate error <15% (validated up to 4 cores)

  11. ReSHAPE – Inputs and Outputs. Resource Sharing and Heterogeneity-aware Analytical Performance Estimator • Inputs – App-Core pair profile: base IPC, cache accesses per instruction, hit rate profiles – Chip configuration: core counts, core types, frequencies, cache hierarchy, memory bandwidth, application schedule • ReSHAPE: iterative solver of an underlying analytical model • Output: throughput (instructions per second) [Slide diagram: example chips with private L1I/L1D caches and shared L2/L3/L4 levels]

  12. ReSHAPE – The Analytical Component. Resource Sharing and Heterogeneity-aware Analytical Performance Estimator • Chip configuration – core counts, core types, frequencies, cache hierarchy (sizes, latencies), memory bandwidth, application schedule • App-Core pair profile (one per application-core pair) – base IPC, cache accesses per instruction, hit rate profiles

  13.–16. ReSHAPE – The Analytical Component (animation builds). The same inputs — the chip configuration (core counts, core types, frequencies, cache hierarchy sizes and latencies, memory bandwidth, application schedule) and the app-core pair profile (base IPC, cache accesses per instruction, hit rate profiles) — are composed level by level: Core 0 with its L1I/L1D caches, then the L2, then the L3.
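The level-by-level composition above can be sketched as a simple CPI-stack calculation. This is an illustrative sketch only — the slides do not reproduce ReSHAPE's actual equations, so the function name and the additive miss-penalty structure are assumptions:

```python
def effective_ipc(base_ipc, levels):
    """Estimate effective IPC from a base IPC and a list of cache levels.

    Each level is (accesses_per_inst, miss_rate, miss_penalty_cycles),
    where the penalty is the extra latency exposed when an access misses
    this level and must go to the next one.
    """
    cpi = 1.0 / base_ipc
    for accesses_per_inst, miss_rate, penalty in levels:
        cpi += accesses_per_inst * miss_rate * penalty
    return 1.0 / cpi

# Example: base IPC 2.0; L1D sees 0.3 acc/inst with 5% misses and a
# 10-cycle penalty; the L2 sees the L1 misses (0.3 * 0.05 acc/inst)
# with 40% misses and a 200-cycle penalty to memory.
ipc = effective_ipc(2.0, [(0.3, 0.05, 10), (0.3 * 0.05, 0.40, 200)])
```

With these numbers the CPI stack is 0.5 + 0.15 + 1.2 = 1.85, i.e. the memory term dominates — which is exactly why the shared-cache hit-rate profiles matter so much in the later slides.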

  17. ReSHAPE's Novelty. Resource Sharing and Heterogeneity-aware Analytical Performance Estimator • Novelty 1 – Separate the chip into vertical silos: each core with its private L1I/L1D caches and its partition of each shared cache level, as assigned by ReSHAPE's partition optimizer

  18. ReSHAPE's Novelty. Resource Sharing and Heterogeneity-aware Analytical Performance Estimator • Novelty 1 – Separate the chip into vertical silos • Novelty 2 – Use the newly computed IPC as the base IPC, re-evaluate traffic and partitions, and iterate until convergence (IPC change <1%) • After convergence – Use the final IPCs to compute throughput
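The convergence loop on this slide is a fixed-point iteration and can be sketched as follows. The `update_model` callback is a stand-in for ReSHAPE's traffic and cache-partition re-evaluation, which the slides do not spell out:

```python
def solve_to_convergence(base_ipcs, update_model, tol=0.01, max_iters=100):
    """Iterate an analytical model until every core's IPC changes by <1%.

    `update_model(ipcs)` returns the new per-core IPC list given the
    current one (in ReSHAPE this step recomputes traffic and cache
    partitions from the current IPCs).
    """
    ipcs = list(base_ipcs)
    for _ in range(max_iters):
        new_ipcs = update_model(ipcs)
        if all(abs(n - o) / o < tol for n, o in zip(new_ipcs, ipcs)):
            return new_ipcs
        ipcs = new_ipcs
    return ipcs  # fall back to the last estimate if not converged

# Toy model: each core's IPC is pulled halfway toward a fixed point at 1.5
converged = solve_to_convergence([1.0, 2.0],
                                 lambda v: [(x + 1.5) / 2 for x in v])
```

The per-core relative-change test mirrors the slide's "<1% IPC change" criterion; after convergence the final IPCs would be summed into throughput.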

  19. ReSHAPE's Cache Partitioning Strategy. Resource Sharing and Heterogeneity-aware Analytical Performance Estimator. How should a shared L3 be split among sharers, given each sharer's hits-per-second vs. cache-size profile? • Greedy Approach – O(n·k) for n cache slices and k sharers – May be sub-optimal, but does quite well in practice
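The O(n·k) greedy allocation can be sketched as repeatedly handing the next cache slice to whichever sharer's hit-rate profile gains most from it. This is a hedged reconstruction from the slide's complexity bound; the profile representation (hits/sec indexed by allocated slices) is an assumption:

```python
def greedy_partition(hit_profiles, n_slices):
    """Greedily assign cache slices to the sharer with the largest
    marginal gain in hits/sec from one extra slice.

    hit_profiles[s][i] = hits/sec for sharer s when holding i slices
    (index 0 = no slices; each profile needs n_slices + 1 entries).
    O(n*k) for n slices and k sharers.
    """
    k = len(hit_profiles)
    alloc = [0] * k
    for _ in range(n_slices):
        gains = [hit_profiles[s][alloc[s] + 1] - hit_profiles[s][alloc[s]]
                 for s in range(k)]
        winner = max(range(k), key=lambda s: gains[s])
        alloc[winner] += 1
    return alloc

# Two sharers over 4 slices: sharer 0 has steep early gains, sharer 1 is flat
profiles = [[0, 50, 80, 90, 95], [0, 10, 20, 30, 40]]
print(greedy_partition(profiles, 4))  # → [3, 1]
```

Because it only looks one slice ahead, this greedy pass can miss allocations where a sharer's profile has a delayed "knee" — which is why the slide notes it may be sub-optimal.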

  20. ReSHAPE's Cache Partitioning Strategy. Resource Sharing and Heterogeneity-aware Analytical Performance Estimator • Minimize Misses Strategy – O(log2(n)·2^k) for n cache slices and k sharers – May be too slow for large k – We use this strategy for all evaluations presented here

  21. VALIDATION Comparing ReSHAPE's projections against the SIMICS full system simulator. Step 1: Analyze benchmark applications — classify them by cache locality (loose, medium, tight).

  22. VALIDATION Comparing ReSHAPE's projections against the SIMICS full system simulator. Step 1: Analyze benchmark applications. Step 2: Construct workload mixes — 12 mixes for 2-core, 12 mixes for 4-core, and 7 mixes for 9-core configurations.

  23. VALIDATION Comparing ReSHAPE's projections against the SIMICS full system simulator. Step 1: Analyze benchmark applications. Step 2: Construct workload mixes. Step 3: Construct configurations to be validated — chips with 32KB L1 caches; shared and private lower-level caches ranging over 128KB, 256KB, 512KB, 1MB and 2MB; memory bandwidths of 10MB/s, 100Mb/s, 1Gb/s and 10Gb/s.

  24. VALIDATION Comparing ReSHAPE's projections against the SIMICS full system simulator. Step 1: Analyze benchmark applications. Step 2: Construct workload mixes. Step 3: Construct configurations to be validated. Step 4: Set up identical configurations in SIMICS and ReSHAPE — each mix is checkpointed (under SIMICS) after running for 100 billion instructions per application; at least 1 billion instructions beyond this are used for the validation run. Step 5: Compare projections from SIMICS and ReSHAPE.
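Step 5's comparison boils down to per-mix relative errors, summarized by a mean and standard deviation as on the following slides. A minimal sketch (the exact error definition used in the thesis is not shown on the slides, so absolute relative error and population std. dev. are assumptions):

```python
def projection_error(simulated, projected):
    """Mean and population std. dev. (in %) of the per-mix relative
    error between simulator results and analytical projections."""
    errors = [abs(p - s) / s * 100 for s, p in zip(simulated, projected)]
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, var ** 0.5

# Three hypothetical mixes: SIMICS IPCs vs. ReSHAPE-projected IPCs
mean_err, std_err = projection_error([1.00, 0.80, 1.20], [1.02, 0.79, 1.15])
```

Applied per metric (IPC, miss rate, partition size), this yields summary lines like "Average 2-core IPC error: 2.7% (std. dev. = 2.1%)".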

  25. VALIDATION Comparing ReSHAPE's projections against the SIMICS full system simulator — 1-core configurations (32KB L1 caches, 256KB L2, 10Gb/s). Average 1-core IPC error: 1.5% (std. dev. = 1.4%)

  26. VALIDATION — 2-core configuration (32KB L1 caches, shared 1MB cache, 10Gb/s). Average 2-core IPC error: 2.7% (std. dev. = 2.1%)

  27. VALIDATION — same 2-core configuration. Average miss rate projection error: 13.4% (std. dev. = 12.6%)

  28. VALIDATION — same 2-core configuration. Average partition size projection error: 3.7% (std. dev. = 4.5%)

  29. VALIDATION — 4-core configuration (32KB L1 caches, shared 2MB cache, 10Gb/s). Average 4-core IPC error: 2.5% (std. dev. = 1.8%)

  30. VALIDATION — same 4-core configuration. Average miss rate projection error: 12.8% (std. dev. = 13.1%)

  31. VALIDATION — same 4-core configuration. Average partition size projection error: 20.9% (std. dev. = 12.8%)

  32. VALIDATION — same 4-core configuration, sweeping memory bandwidth (10Gb/s, 1Gb/s, 0.1Gb/s, 0.01Gb/s). Average IPC error: 17.3% (std. dev. = 5.4%)

  33. VALIDATION — 4-core configuration with private caches (32KB L1 caches; private 128KB and 2MB caches, 10Gb/s). Private caches: average 4-core IPC error: 3.1% (std. dev. = 1.6%)

  34. VALIDATION — same private-cache configuration. Average miss rate projection error: 7.5% (std. dev. = 7.1%)

  35. USE CASES Putting ReSHAPE to use. Four design classes: Homogeneous; Heterogeneous Caches; Heterogeneous Cores; Heterogeneous Both. Does increasing the sources of heterogeneity buy us performance?

  36. USE CASES Putting ReSHAPE to use. Up to 4! unique schedules for a 4-application workload mix (applications A–D assigned to cores C0–C3 in every permutation). Weighted speedup normalized to the Homogeneous design, reporting min, mean and max over schedules. What one might expect to see: • Small improvement with heterogeneous caches; some loss for bad schedules • Larger improvement with heterogeneous cores • Even larger improvement with heterogeneous cores + heterogeneous caches
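The n! schedule sweep on this slide can be sketched with `itertools.permutations`. The evaluator below is a toy stand-in for a ReSHAPE run — in the actual study each assignment would be scored by the model's weighted speedup:

```python
from itertools import permutations

def best_schedule(apps, cores, evaluate):
    """Enumerate all n! static schedules of n apps onto n cores and
    return (best_score, best_assignment).

    `evaluate(assignment)` scores mapping apps[i] -> assignment[i]
    (a stand-in for one ReSHAPE evaluation per schedule).
    """
    best = max(permutations(cores), key=evaluate)
    return evaluate(best), dict(zip(apps, best))

# Toy evaluator: each (app, core) pair contributes a known speedup term
speedups = {('A', 0): 1.0, ('A', 1): 0.5, ('B', 0): 0.4, ('B', 1): 0.9}
score, sched = best_schedule(
    ['A', 'B'], [0, 1],
    lambda asn: sum(speedups[(app, c)] for app, c in zip(['A', 'B'], asn)))
```

Even this tiny example shows why the schedule matters: swapping the two apps drops the combined speedup from 1.9 to 0.9, and with 4 apps the sweep already covers 24 schedules per design.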

  37. USE CASES Putting ReSHAPE to use — Homogeneous vs. Heterogeneous Caches vs. Heterogeneous Cores vs. Heterogeneous Both (4-core designs) • Smaller cores hurt more than the larger cores help • Heterogeneous caches are better than heterogeneous cores in this case

  38. USE CASES Putting ReSHAPE to use — 9-core designs (the chart represents >10 million ReSHAPE simulations; >350,000 ReSHAPE simulations per design class) • As core count scales (4→9), the benefit of heterogeneity increases significantly • Heterogeneous cores are better than heterogeneous caches in this case, but the schedule is still crucial

  39. USE CASES Putting ReSHAPE to use — 9-core designs with 3 core/cache types (Homogeneous, Heterogeneous Caches, Heterogeneous Cores, Heterogeneous Both) • How much and what form of heterogeneity is needed requires careful analysis for the design being evaluated • 3 core types and 3 cache sizes do not buy any more performance

  40. USE CASES Putting ReSHAPE to use — per-core DVFS on a 4-core chip (32KB L1 caches, shared 2MB cache, 10Gb/s) with three operating points: 4GHz/16W, 1GHz/2W, 250MHz/0.5W • Different settings win for different workload mixes — and not always the fastest setting! • Not always the slowest setting when optimizing performance/watt • Somewhere in between when optimizing the Energy × Delay product
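The metric-dependent choice on this slide can be sketched as follows. The operating points (GHz/watts) come from the slide; the memory-bound throughput model and function names are illustrative assumptions, not ReSHAPE's actual power model:

```python
# DVFS operating points from the slide: (label, GHz, watts)
SETTINGS = [("fast", 4.0, 16.0), ("mid", 1.0, 2.0), ("slow", 0.25, 0.5)]

def pick_setting(perf_at, objective):
    """Pick the DVFS point optimizing a given objective.

    perf_at(ghz) -> instructions/sec for this workload (assumed model);
    objective is 'perf', 'perf_per_watt', or 'energy_delay'.
    """
    def score(point):
        _, ghz, watts = point
        perf = perf_at(ghz)
        if objective == "perf":
            return perf
        if objective == "perf_per_watt":
            return perf / watts
        # energy * delay is proportional to watts / perf^2; negate so
        # that max() picks the lowest-E*D point
        return -watts / (perf * perf)
    return max(SETTINGS, key=score)[0]

# Memory-bound toy workload: performance saturates above 1 GHz
memory_bound = lambda ghz: min(ghz, 1.0) * 1e9
```

For this memory-bound workload the 4GHz point wins on raw performance only by a tie, while performance/watt and Energy × Delay both prefer a slower point — matching the slide's observation that the fastest setting is not always the right one.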

  41. CONCLUSIONS + FUTURE DIRECTION • Rich design/configuration space for multi-core chips • Analytical modeling can be a promising approach to tackling these large search spaces • ReSHAPE extends the classical analytical performance model in novel ways • Accuracy + speed make ReSHAPE a useful tool for early exploration • Future direction – extend ReSHAPE – Validate across unique microarchitectures – Extend key parameters and model: memory-level parallelism, writeback traffic, prefetching – Explore the rich constrained optimization problem of cache partitioning – Evaluate more use cases: best power-gating strategy based on workload mix; dynamic schedules based on per-phase application statistics

  42. Thank you!

  43. RELATED WORK • Analytical modeling of multi-core chips – Wentzlaff et al. (MIT Tech Report 2010), Li et al. (ISPASS 2005), and Yavits et al. (CAL 2013) all tackle different aspects of multicore chip design, but only consider homogeneous cores – Wu et al. (ISCA 2013) use locality profiles to identify how an application's cache locality degrades as the application is spread across more threads; they consider multi-threaded applications • Heterogeneous design/scheduling – Navada et al. (PACT 2010, PACT 2013) consider simulation-based, criticality-driven design space exploration and mechanisms for selecting the best way to schedule a single application across multiple cores – Kumar et al. (Micro 2003, PACT 2006, ISCA 2004) did most of the seminal work in the area of heterogeneous multi-core; however, they have typically relied on detailed simulations, private cache hierarchies and single-application scheduling

  44. VALIDATION Comparing ReSHAPE's projections against the SIMICS full system simulator — 1-core configurations (32KB L1 caches, 256KB L2, 10Gb/s). Average miss rate projection error: 7.6% (std. dev. = 12.4%)
