Pushing Performance, Efficiency and Scalability of Microprocessors CERCS IAB Meeting, Fall 2006 Gabriel Loh
Research Overview • Funding from state of GA, Intel, MARCO • Currently 2 PhD students, 2 MS • Active undergrad research as well • Collaborations • Universities: PSU, UO, Rutgers • Industry: Intel, IBM
Research Focus • “Near-term” microprocessor design issues • ~ 5-year time scale • Power/performance/complexity • Traditional uniprocessor performance • Multi-core performance • “Longer-term” • Keeping Moore’s Law alive for the longer term • Primarily, 3D integration for now
Scaling Performance and Efficiency • Multi-cores are here, but single-thread performance still matters • Intel Core 2 Duo is multi-core, but… • Each core is more aggressively out-of-order (OOO) than ever • Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipeline and decoders • But power also really matters • Lower clock speeds, transistors with different channel lengths, more uop fusion, …
Research Focus • Maximum performance within bounds • Bounds = power, area, TDP, … • Single-core performance helps multi-core performance, too • For future multi-core systems, we need to strike a good balance between single-thread (1T) and multi-threaded (MT) performance • Most of our research is at the uarch level • Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.
Highlight: Traditional Caching [MICRO’06] • Well known that different apps respond differently to different replacement policies • Previous work in the OS domain has described adaptive replacement with provable bounds on performance • Adapted techniques for on-chip caches
Adaptive Cache Implementation • Theoretical Guarantees • Miss rate provably bounded to be within a factor of two of the better algorithm • In practice, it’s much better (a sketch of one possible flavor follows below)
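To make the idea concrete, here is a minimal software sketch of one possible flavor of adaptive replacement, assuming a “follow the better policy” scheme that keeps shadow tag state for LRU and LFU and lets the real set evict according to whichever policy has accumulated fewer misses. The AdaptiveSet class, the LRU/LFU policy pair, and the structure are illustrative assumptions, not the exact MICRO’06 hardware.

from collections import OrderedDict, defaultdict

class AdaptiveSet:
    """One set of a set-associative cache that adaptively follows LRU or LFU."""

    def __init__(self, ways=4):
        self.ways = ways
        self.recency = OrderedDict()      # real blocks, oldest first (LRU order)
        self.freq = defaultdict(int)      # access counts for real blocks (LFU order)
        self.shadow_lru = OrderedDict()   # tags a pure-LRU set would hold
        self.shadow_lfu = {}              # tag -> count for a pure-LFU set
        self.misses = {"lru": 0, "lfu": 0}

    def _touch_shadows(self, tag):
        # Simulate a pure-LRU set on tags only and count its misses.
        if tag in self.shadow_lru:
            self.shadow_lru.move_to_end(tag)
        else:
            self.misses["lru"] += 1
            if len(self.shadow_lru) >= self.ways:
                self.shadow_lru.popitem(last=False)
            self.shadow_lru[tag] = True
        # Simulate a pure-LFU set on tags only and count its misses.
        if tag in self.shadow_lfu:
            self.shadow_lfu[tag] += 1
        else:
            self.misses["lfu"] += 1
            if len(self.shadow_lfu) >= self.ways:
                victim = min(self.shadow_lfu, key=self.shadow_lfu.get)
                del self.shadow_lfu[victim]
            self.shadow_lfu[tag] = 1

    def access(self, tag):
        """Returns True on a hit in the real set."""
        self._touch_shadows(tag)
        if tag in self.recency:
            self.recency.move_to_end(tag)
            self.freq[tag] += 1
            return True
        if len(self.recency) >= self.ways:
            # Evict according to whichever policy currently has fewer misses.
            if self.misses["lru"] <= self.misses["lfu"]:
                victim = next(iter(self.recency))               # LRU victim
            else:
                victim = min(self.recency, key=self.freq.get)   # LFU victim
            del self.recency[victim]
            del self.freq[victim]
        self.recency[tag] = True
        self.freq[tag] = 1
        return False

s = AdaptiveSet(ways=2)
print([s.access(t) for t in [1, 2, 1, 3, 1, 2]], s.misses)

In hardware the shadow state would be tag-only sampler sets rather than full dictionaries; the factor-of-two bound quoted above is the kind of guarantee such policy-combining schemes inherit from the online-algorithms literature.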
Current Research • Working on multi-core generalizations of adaptive caching and other ways to manage shared resources • Uniprocessor microarchitecture • Scalable memory scheduling [MICRO’06] • Memory dependence prediction [HPCA’06] • Branch prediction […] • And more…
Longer-Term Processor Scaling • Limitations/Obstacles • Wire scaling • Latency/performance • Power • Feature size • Lithography, parametric variations • Off-chip communication
3D Integration • Wire • Power/perf. • Off-chip • Feature size: limitations, variations • Die/wafer stacking: less RC → faster, lower-power [Figure: stacked dies — Active Layer 1, Metal Layers 1, die-to-die vias, Metal Layers 2, Active Layer 2] (a back-of-the-envelope wire-delay sketch follows below)
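As a back-of-the-envelope illustration of the “less RC” point (my sketch, not a result from the talk), model a route as a distributed-RC line with hypothetical per-unit-length resistance r, per-unit-length capacitance c, length L, and supply voltage V_dd:

\[
t_{\text{wire}} \approx \tfrac{1}{2}\, r\, c\, L^{2},
\qquad
E_{\text{wire}} \propto c\, L\, V_{dd}^{2}
\quad\Longrightarrow\quad
L \to \tfrac{L}{2} \;\text{ gives }\;
t_{\text{wire}} \to \tfrac{t_{\text{wire}}}{4},
\;\;
E_{\text{wire}} \to \tfrac{E_{\text{wire}}}{2}
\]

So a route folded across two stacked dies to roughly half its length sees roughly a 4x Elmore-delay reduction and a 2x switching-energy reduction, which is the intuition behind “less RC → faster, lower-power.”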
Example: Caches [Figure: simplified 2D SRAM array folded into two dies] • 3D wordline stacking: wordline length halved (in our studies, WL was critical for latency) • 3D bitline stacking: bitline length halved; BL reduction has a greater impact on power savings; split decoder → no stacking of switching activity • We’ve studied a wide variety of other CPU building blocks (an illustrative delay/energy sketch follows below)
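A tiny back-of-the-envelope script showing the flavor of the argument (the wire_delay and bitline_energy helpers and all numbers are illustrative placeholders, not data from our studies): halving a line’s length quarters its distributed-RC delay, which is why wordline folding helps latency, and halves its capacitance and hence its swing energy, which is why bitline folding helps power.

def wire_delay(r_per_um, c_per_um, length_um):
    """Distributed-RC (Elmore) delay of a line: ~0.5 * r * c * L^2."""
    return 0.5 * r_per_um * c_per_um * length_um ** 2

def bitline_energy(c_per_um, length_um, v_swing):
    """Dynamic energy of one bitline swing: C * V^2, with C linear in length."""
    return c_per_um * length_um * v_swing ** 2

r, c, L = 1.0, 0.2e-15, 500.0   # ohm/um, F/um, um -- hypothetical values

print(wire_delay(r, c, L) / wire_delay(r, c, L / 2))              # 4.0: folded wordline -> ~4x lower RC delay
print(bitline_energy(c, L, 0.1) / bitline_energy(c, L / 2, 0.1))  # 2.0: folded bitline -> ~2x lower swing energy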
Uarch-level 3D design • Example: 4-die significance-partitioned datapath • Smaller footprint → faster and lower-power • Width-based gating → even lower power, close to the original power density • Use a uarch prediction mechanism for early determination of operand width • Overall: 47% performance gain at only a 2-degree temperature increase (a sketch of one possible width predictor follows below)
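One plausible flavor of the early width-determination mechanism, sketched in software and not the exact predictor from our papers: a small per-PC table of 2-bit saturating counters predicts whether an instruction’s result fits in the low-order slice, so the upper dies can be gated. NARROW_BITS, TABLE_SIZE, and the reset-on-wide update policy are assumptions for illustration.

NARROW_BITS = 16        # assumed width of the low-order die's bit slice
TABLE_SIZE = 1024       # assumed number of predictor entries

class WidthPredictor:
    """Per-PC 2-bit saturating counters: a count >= 2 means 'predict narrow'."""

    def __init__(self, size=TABLE_SIZE):
        self.counters = [0] * size

    def _index(self, pc):
        return (pc >> 2) % len(self.counters)

    def predict_narrow(self, pc):
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, result):
        # Train on the actual result width observed at execute.
        was_narrow = -(1 << (NARROW_BITS - 1)) <= result < (1 << (NARROW_BITS - 1))
        i = self._index(pc)
        if was_narrow:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = 0   # reset aggressively: a wide result forces a replay

wp = WidthPredictor()
for _ in range(2):
    wp.update(0x400123, 42)        # two narrow results at a hypothetical PC
print(wp.predict_narrow(0x400123)) # True: the upper dies could be gated next time

On a predicted-narrow instruction only the low-order die would be clocked; a wide result detected at execute would force a replay on the full-width datapath, which is why the counters reset on any wide outcome.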
3D Research Summary • Circuit-level [ICCD’05, ISVLSI’06, ISCAS’06, GLSVLSI’06] • Uarch-level [MICRO’06 (w/ ), HPCA’07] • Tutorial papers [JETC’06] • Tutorial [MICRO’06] • Tools [DATE’06, TCAD’07] w/ GTCAD & • Parametric variations w/ Jim Meindl • Funding, equipment from ,
Summary • loh@cc • http://www.cc.gatech.edu/~loh • Lots of exciting work going on here