Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach
Ramazan Bitirgen, Engin Ipek, and Jose F. Martinez
MICRO'08
Presented by PAK, EUNJI
Introduction
• Resource sharing problem in CMPs
  • Increasing pressure on shared system resources
  • Efficient sharing is necessary for high utilization and performance
• Multiple interacting resources
  • Cache space, DRAM bandwidth, and power budget
  • Allocating one resource affects the demand for the others
• Proposed resource allocation framework
  • At runtime, monitors the execution of each application and learns a predictive model of performance as a function of resource allocation decisions
  • Periodically allocates resources to each core using the model
Resource Allocation Framework
• Per-application HW performance model
  • Uses Artificial Neural Networks (ANNs)
  • Predicts each app's performance as a function of the resources allocated to it
• Global resource manager
  • At every interval, searches the space of possible resource allocations by querying the application performance models
How to Predict Performance? (Artificial Neural Networks)
• Use ANNs
  • Input units, hidden units, and an output unit connected via a set of weighted edges
  • Each hidden unit computes a weighted sum of the input values using the edge weights; the output unit does the same over the hidden values
  • Edge weights are trained with training examples (data sets)
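The forward pass described on this slide can be sketched as follows. This is only an illustration: the function names, the omission of bias terms, and the use of a sigmoid on the output unit are assumptions, not details taken from the slides.

```python
import math

def sigmoid(x: float) -> float:
    """Squash a weighted sum into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def predict(inputs, hidden_weights, output_weights):
    """Forward pass of a network with one hidden layer and one output unit.

    hidden_weights[j][i] is the weight on the edge from input i to hidden unit j.
    output_weights[j] is the weight on the edge from hidden unit j to the output.
    Bias terms are left out for brevity (an assumption of this sketch).
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, inputs)))
        for row in hidden_weights
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))
```

Training would adjust `hidden_weights` and `output_weights` so that the prediction matches the observed performance on the training examples; that step is not shown here.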
How to Predict Performance? (Adapting the ANN to a Per-Application Performance Model)
• Input units
  • L2 cache space, off-chip bandwidth, and power budget
  • Number of read hits, read misses, write hits, and write misses over the last 20K instructions and over the last 1.5M instructions
  • Fraction of cache ways that are dirty (indicates the amount of write-back traffic)
• Activation function
  • Sigmoid (maps its input to a value in [0, 1])
• Models performance as a function of the application's allocated resources and recent behavior
• Trained during the first 1.2 billion cycles with randomly allocated resources
• Always keeps a training set of 300 points
• Retrained every 2,500,000 cycles
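One way to picture the fixed-size training set and the periodic retraining is the sketch below. The class name, the first-in-first-out eviction of old points, and the use of IPC as the recorded performance value are assumptions made for illustration; the slide only gives the set size and the retraining period.

```python
from collections import deque

TRAINING_SET_SIZE = 300            # from the slide
RETRAIN_PERIOD_CYCLES = 2_500_000  # from the slide

class TrainingSet:
    """Bounded set of (features, observed performance) samples."""

    def __init__(self, capacity=TRAINING_SET_SIZE):
        # A deque with maxlen drops the oldest sample when full
        # (FIFO eviction is an assumption of this sketch).
        self.points = deque(maxlen=capacity)
        self.last_retrain_cycle = 0

    def add(self, features, observed_ipc):
        self.points.append((tuple(features), observed_ipc))

    def retrain_due(self, cycle):
        """True once a full retraining period has elapsed."""
        return cycle - self.last_retrain_cycle >= RETRAIN_PERIOD_CYCLES

    def mark_retrained(self, cycle):
        self.last_retrain_cycle = cycle
```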
How to Predict Performance? (Adapting the ANN to a Per-Application Performance Model)
• Optimization
  • Prevent the model from memorizing outliers in the sample data
• Cross validation
  • The data set is divided into N equal-sized folds (N-1 training folds and 1 test fold)
  • The ensemble consists of N ANN models
  • Performance is predicted by averaging the predictions of all ANNs in the ensemble
  • Prediction error is estimated as a function of the coefficient of variation (CoV) of the predictions made by each ANN in the ensemble (used later for resource allocation)
[Figure omitted: folds labeled Training and Test]
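The fold split and the ensemble statistics can be sketched as below. The helper names are hypothetical, the interleaved fold assignment is an assumption, and the function returns the raw CoV; the slides do not say how the CoV is mapped to an error estimate.

```python
import statistics

def make_folds(samples, n_folds):
    """Split samples into n_folds interleaved folds (equal-sized when
    len(samples) is a multiple of n_folds)."""
    return [samples[k::n_folds] for k in range(n_folds)]

def train_test_split(folds, k):
    """Model k trains on every fold except fold k, which is its test set."""
    train = [s for i, fold in enumerate(folds) if i != k for s in fold]
    return train, folds[k]

def ensemble_predict(predictions):
    """Return (mean prediction, coefficient of variation) over the
    per-model predictions of the ensemble."""
    mean = statistics.fmean(predictions)
    spread = statistics.pstdev(predictions)
    cov = spread / mean if mean else float("inf")
    return mean, cov
```

A large CoV means the ensemble members disagree, which is the signal the allocator can use to distrust the prediction.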
Resource Allocation
• Make a resource allocation decision every 500,000 cycles using the trained per-application performance models
• Discard queries involving an app with a high error estimate
  • Fairly distribute resources to the running applications
  • Predict the performance and compute the prediction error
  • If the prediction is estimated to be inaccurate (error > 9%), the app is excluded from global resource allocation
• Search the allocation space with stochastic hill climbing
  • Starts with a random solution and iteratively makes small changes to it, keeping each change that improves the solution
  • Terminates when it can no longer find an improvement
  • 2,000 trials produce the best tradeoff between search performance and overhead
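A minimal sketch of stochastic hill climbing over a single divisible resource is shown below. It is not the paper's search: the `objective` callable stands in for queries to the performance models, only one resource is searched, and a fixed trial budget replaces the "no further improvement" stopping rule. All names are hypothetical.

```python
import random

def random_allocation(n_apps, units, rng):
    """Hand out resource units to applications uniformly at random."""
    alloc = [0] * n_apps
    for _ in range(units):
        alloc[rng.randrange(n_apps)] += 1
    return alloc

def hill_climb(objective, n_apps, units, trials=2000, seed=0):
    """Move one unit at a time between two random apps and keep the
    move only if the predicted objective improves."""
    rng = random.Random(seed)
    best = random_allocation(n_apps, units, rng)
    best_score = objective(best)
    for _ in range(trials):
        src, dst = rng.sample(range(n_apps), 2)
        if best[src] == 0:
            continue  # nothing to take away from this app
        candidate = list(best)
        candidate[src] -= 1
        candidate[dst] += 1
        score = objective(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score
```

For example, `hill_climb(lambda a: sum(x ** 0.5 for x in a), n_apps=4, units=12)` searches for a split of 12 units that maximizes a toy objective with diminishing returns.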
Implementation & Overhead
• HW implementation
  • A single HW ANN multiplexes edge weights on the fly to provide 16 "virtual" ANNs
  • 12 × 4 + 4 = 52 multipliers, one per weighted edge
  • Quantized sigmoid function based on a 50-entry table
  • Computation is pipelined
  • A prediction (search query) takes 16 cycles for the 16 virtual ANNs
• Area, power, and delay
  • 3% of the chip's area
  • 3W power consumption
  • 2,000 queries fit within 5% of the interval
• OS interface
  • The training set and the ANN weights are embedded in the process state
  • The OS communicates the desired objective function through a control register (CR)
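A table-based quantized sigmoid of the kind mentioned above could look like the sketch below. Only the 50-entry size comes from the slide; the input range, the uniform bucket spacing, and the use of bucket midpoints are assumptions.

```python
import math

TABLE_ENTRIES = 50           # from the slide
X_MIN, X_MAX = -6.0, 6.0     # assumed input range; not given in the slides

STEP = (X_MAX - X_MIN) / TABLE_ENTRIES

# Precompute the sigmoid at the midpoint of each input bucket.
SIGMOID_TABLE = [
    1.0 / (1.0 + math.exp(-(X_MIN + (i + 0.5) * STEP)))
    for i in range(TABLE_ENTRIES)
]

def quantized_sigmoid(x):
    """Look up the sigmoid value for x, clamping out-of-range inputs
    to the first or last table entry."""
    index = int((x - X_MIN) / STEP)
    index = max(0, min(TABLE_ENTRIES - 1, index))
    return SIGMOID_TABLE[index]
```

The lookup trades a small quantization error for avoiding an exponential and a divide, which is the usual motivation for a table in hardware.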
Experimental Setup
• Tools & architecture
  • Heavily modified version of SESC
  • With Wattch (power) and HotSpot (temperature)
  • Baseline: Intel's Core2Quad, DDR2-800
  • 4-core CMP, frequency range 0.9 GHz to 4.0 GHz in 0.1 GHz steps
  • 4MB, 16-way shared L2 cache
• A 60W power budget is distributed among the 4 apps via per-core DVFS
  • Outs is limited to 57W
  • Statically allocate 5W
• L2 cache space is partitioned at the granularity of cache ways
  • One way is allocated to each app
  • The remaining 12 ways are distributed
• Each app is statically allocated 800MB/s of off-chip DRAM bandwidth, and the remaining 3.2GB/s is distributed
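The static-versus-distributed split for cache ways and bandwidth is simple arithmetic, sketched below. The helper name is hypothetical, and the 6,400 MB/s bandwidth total is inferred from the slide's numbers (4 × 800 MB/s plus the 3.2GB/s pool) rather than stated on it. Power is left out because the slide's figures for it are unclear.

```python
def split_pool(total, n_apps, static_share):
    """Return (static_total, dynamic_pool) when every app is guaranteed
    a fixed minimum share and the rest is distributed at runtime."""
    static_total = n_apps * static_share
    if static_total > total:
        raise ValueError("static shares exceed the total")
    return static_total, total - static_total

N_APPS = 4

# L2 cache: 16 ways, one way reserved per application.
ways_static, ways_pool = split_pool(16, N_APPS, 1)

# Off-chip bandwidth in MB/s: 800 MB/s reserved per application.
# The 6,400 MB/s total is inferred, not quoted from the slide.
bw_static, bw_pool = split_pool(6400, N_APPS, 800)
```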
Experimental Setup
• Metrics
  • Weighted speedup
  • Sum of IPCs
  • Harmonic mean of normalized IPCs
  • Weighted sum of IPCs
• Workload
  • 9 quad-core multiprogrammed workloads from the SPEC2000 and NAS suites
  • Applications are classified into 3 categories
    • CPU-bound
    • Memory-bound
    • Cache-sensitive
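The four metrics can be written out as below, using their commonly cited definitions. That the slides normalize each IPC to the application's stand-alone IPC is an assumption, as are the function and parameter names.

```python
def weighted_speedup(shared_ipc, alone_ipc):
    """Sum of each app's IPC when sharing, normalized to its IPC alone."""
    return sum(s / a for s, a in zip(shared_ipc, alone_ipc))

def sum_of_ipcs(shared_ipc):
    """Raw throughput: total instructions per cycle across all apps."""
    return sum(shared_ipc)

def harmonic_mean_normalized_ipc(shared_ipc, alone_ipc):
    """Harmonic mean of normalized IPCs; penalizes starving any one app."""
    return len(shared_ipc) / sum(a / s for s, a in zip(shared_ipc, alone_ipc))

def weighted_sum_of_ipcs(shared_ipc, weights):
    """Sum of IPCs with a per-application weight (for example, a priority)."""
    return sum(w * s for w, s in zip(weights, shared_ipc))
```

For two applications that each run at half of their stand-alone IPC, the weighted speedup is 1.0 and the harmonic mean of normalized IPCs is 0.5.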
Experimental Setup
• Configurations
  • Unmanaged
  • Isolated Cache Management (Cache)
    • "Utility-based cache partitioning", MICRO'06
    • Distributes L2 cache ways to minimize miss rate
  • Isolated Power Management (Power)
    • "An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget", MICRO'06
  • Isolated Bandwidth Management (BW)
    • "Fair Queuing Memory System", MICRO'06
  • Uncoordinated Cache + Power, Cache + BW, Power + BW, Cache + Power + BW
  • Continuous Stochastic Hill-Climbing (Coordinated-HC)
    • "Learning based SMT processor resource distribution" (issue queue, ROB, and register file), ISCA'06
  • Fair-Share
  • Proposed scheme (Coordinated-ANN)
    • ANN-based models of the applications' IPC response to resource allocation are used to guide a stochastic hill-climbing search
Evaluation Results
• Performance
  • Results are normalized to Fair-Share
  • 14% average speedup over Fair-Share
  • Similar for other metrics
[Figure omitted; x-axis workload mixes: P,C,P,M M,C,P,M C,C,C,C P,C,M,C C,M,C,C C,P,C,M C,M,M,C P,C,P,M P,C,P,P]
Evaluation Results
• Sensitivity to confidence threshold
  • Results are normalized to Fair-Share
[Figure omitted: results per workload mix]
Evaluation Results
• Confidence estimation mechanism
  • Fraction of the total execution time during which the ANN could predict the resource allocation optimization for each application
[Figure omitted: results per workload mix]
Conclusions
• Proposed a resource allocation framework that manages multiple shared CMP resources in a coordinated fashion through ANNs and a periodic resource allocation scheme
• A coordinated approach to managing multiple resources is key to delivering high performance in multiprogrammed workloads
Extras
[Figures omitted: results per workload mix]