Exploring the Design Space of Future CMPs Authors – Jaehyuk Huh, Doug Burger, and Stephen W. Keckler Presenter – Sushma Myneni
Agenda • Motivation & Goals • Brief background about Multicore / CMPs • Technical details presented in the paper • Key results and contributions • Conclusions • Drawbacks • How paper relates to class and my project • Q&A
Motivation & Goals • Motivation • The superscalar paradigm is reaching diminishing returns • Wire delays will limit the area of the chip that is useful for a single conventional processing core • Goals • Compare area and performance trade-offs across CMP implementations to determine how many processing cores future server CMPs should have, whether the cores should use in-order or out-of-order issue, and how large the per-processor on-chip cache should be • Related Work • Compaq Piranha
Brief background on CMPs • Metrics to evaluate CMPs • Maximize total chip performance = maximize job throughput • Maximizing job throughput involves comparing • Processor organization: out-of-order or smaller in-order issue • Cache hierarchy: amount of cache memory per processor • Off-chip bandwidth: finite bandwidth limits the number of cores that can be placed on the chip • Application characteristics: applications with different memory access patterns require different CMP designs to attain maximum throughput
Brief background on CMPs • Chip multiprocessor model • L1 and L2 cache per processor • Each L2 cache is directly connected to off-chip DRAM through a set of distributed memory channels • Shared L2 cache • Large cache bandwidth requirements vs. slow global wires
Technical Details • Area models • The model expresses all area in terms of CBEs (the unit area for one byte of cache) • Both in-order and out-of-order issue processors were considered, taking cache sizes into account • Performance per unit area is compared for 2-way in-order (PIN) and 4-way out-of-order (POUT) cores
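The CBE accounting above can be sketched in a few lines. The constants here are hypothetical placeholders (the paper derives its own PIN/POUT core areas), chosen only to show how core logic and caches compete for the same area budget:

```python
# Sketch of the paper's CBE-based area accounting: one CBE is the area
# of one byte of cache, so cache bytes count directly as CBEs. The core
# logic cost below is a hypothetical placeholder, not the paper's value.

def core_area_cbe(core_logic_cbe, l1_bytes, l2_bytes):
    """Area of one core plus its private L1 and L2 caches, in CBEs."""
    return core_logic_cbe + l1_bytes + l2_bytes

# Example: core logic costing as much area as 64 KB of cache,
# with a 32 KB L1 and a 512 KB L2.
area = core_area_cbe(core_logic_cbe=64 * 1024,
                     l1_bytes=32 * 1024,
                     l2_bytes=512 * 1024)
```

Expressing everything in cache-byte equivalents makes the core-versus-cache trade-off a simple sum, which is exactly what lets the paper sweep cache sizes against core counts.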
Technical Details • I/O Pin Bandwidth • The number of I/O pins that can be built on a single chip is limited by physical technology and does not scale with transistor counts • The number of pins per transistor therefore decreases as technology advances • I/O pin speeds have not increased at the same rate as processor clock rates
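A minimal sketch of the pin-bandwidth ceiling, assuming illustrative numbers (the pin count, per-pin rate, and per-core demand below are all made up, not from the paper): total off-chip bandwidth is pins × per-pin rate, and it caps how many cores can be fed.

```python
# Why pin bandwidth caps core count: off-chip bandwidth is
# data_pins x per-pin signaling rate, and each core consumes some
# average bandwidth. All numbers here are illustrative assumptions.

def max_cores_by_bandwidth(data_pins, gbps_per_pin, gbps_per_core):
    """Largest core count the chip's pin bandwidth can sustain."""
    total_gbps = data_pins * gbps_per_pin
    return int(total_gbps // gbps_per_core)

# 256 data pins at 2 Gb/s each, cores averaging 16 Gb/s of demand:
max_cores_by_bandwidth(data_pins=256, gbps_per_pin=2.0, gbps_per_core=16.0)  # 32
```

Since pins grow slower than transistors, this ceiling tightens with each technology generation even as more cores would otherwise fit on the die.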
Technical Details • Maximizing throughput • Performance on server workloads can be defined as the aggregate performance of all the cores on the chip • Given the number of cores (Nc) and the performance of each core (Pi), the peak performance (Pcmp) of a server CMP is • Pcmp = Σ Pi, for i = 1 to Nc • The performance of an individual core in a CMP depends on application characteristics such as available instruction-level parallelism, cache behavior, and communication overhead among threads
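The Pcmp formula above is just a sum over per-core performance; a one-line sketch:

```python
def cmp_peak_performance(per_core_perf):
    """Pcmp = sum of Pi for i = 1..Nc (peak throughput, ignoring contention)."""
    return sum(per_core_perf)

# Four identical cores, each delivering 1.0 units of throughput:
cmp_peak_performance([1.0, 1.0, 1.0, 1.0])  # 4.0
```

In practice each Pi is itself a function of cache capacity and available bandwidth, which is why the per-core terms shrink as more cores share the chip.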
Technical Details • Application characteristics • Ten SPEC benchmarks were chosen – mesa, mgrid, equake, gcc, ammp, vpr, parser, perlbmk, art, and mcf • Taxonomy of applications • Processor-bound – applications whose working set can be captured easily in the L2 cache (mesa, mgrid, equake) • Cache-sensitive – applications whose performance is limited by L2 cache capacity (gcc, ammp, vpr, parser, and perlbmk) • Bandwidth-bound – applications whose performance is limited strictly by the rate at which data can be moved between processor and DRAM (art, mcf, and sphinx) • Applications are not bound to one class or another; they move among these three domains as processor, cache, and bandwidth capacities are modulated
Technical Details • Experimental methodology • The SimpleScalar tool set was used to model both the in-order (PIN) and out-of-order (POUT) processors
Maximizing CMP Throughput • Combine the area analysis and performance simulations to find which CMP configuration will be most area-efficient in future technologies • Fixed chip area – 400 mm² • Calculate the number of cores and cores per channel based on the chip area with different cache sizes
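The fixed-area trade-off can be sketched as follows; the per-core and per-megabyte area figures are made-up placeholders (not the paper's model), used only to show that growing the per-core L2 shrinks the core count under a 400 mm² budget.

```python
# Fixed-area trade-off: under a fixed chip budget, larger per-core L2
# caches leave room for fewer cores. Area figures are hypothetical.
CHIP_AREA_MM2 = 400.0

def cores_that_fit(core_mm2, l2_mm2_per_mb, l2_mb):
    """Number of (core + private L2) tiles that fit in the chip budget."""
    per_core_area = core_mm2 + l2_mm2_per_mb * l2_mb
    return int(CHIP_AREA_MM2 // per_core_area)

# Sweep L2 size per core, with 10 mm^2 cores and 20 mm^2 per MB of cache:
for l2_mb in (0.125, 0.5, 1.0, 2.0):
    print(l2_mb, cores_that_fit(core_mm2=10.0, l2_mm2_per_mb=20.0, l2_mb=l2_mb))
```

Each point in such a sweep is then paired with simulated per-core performance at that cache size, and the configuration maximizing aggregate throughput wins.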
CMPs for Server Applications • Most commonly used server workloads • OLTP (online transaction processing) • DSS (decision support systems) • DSS workloads – cache-sensitive applications (1 MB / 2 MB L2 cache) • OLTP workloads – bandwidth-bound applications
Conclusions • Transistor counts are projected to increase faster than pin counts, which limits the number of cores that can be used in future technologies • Out-of-order issue cores are more efficient than in-order issue cores • Across workloads, the impact of insufficient bandwidth causes throughput-optimal L2 cache sizes to grow from 128 KB at 100 nm to 1 MB at 50 and 35 nm • As technology advances, wire delays may be too high to add more cache to each processor
Drawbacks • SPEC benchmarks were used, which are not representative of server workloads • Power consumption was not considered while maximizing performance per unit area • The evaluation assumed signaling speeds increase linearly at 1.5 times the processor clock rate; technology advances in this area may permit a larger number of processors than predicted
Paper Related to Class and Project • Relation to class – • Throughout this semester we have been studying multi-core architecture. This paper presents how to design a CMP architecture based on the application, taking current technology into consideration. • Relation to project – • My project studies CMP architecture in relation to Mobile Edge Computing devices.