What to do and why in microprocessor research
Mario Nemirovsky
University of California, Santa Cruz
XStream Logic, Inc.
mario@ieee.org
Agenda • Microcontroller era • PC era • Post-PC era • Directions and challenges
Microcontroller era • Intel • 4004 => 8008 => 8080 => 8085 • Motorola • 6800 => 68000
Real-time systems • First microprocessors used as controllers • Late 70's to early 80's • Delco Electronics (General Motors): #1 microprocessor user and producer • TIO: first real-time multithreaded microprocessor • Motorola dominated the market • CISC needed, why?
The PC era • 68K used for workstations • Apollo and the MMU issue • Apple and the A-trap • Intel 80x86 • IBM introduces the PC
RISC vs. CISC • RISC values • simplicity • fast design cycles • small area • CISC values • small footprint • fewer instruction fetches • small register file?
General purpose microprocessors • Performance • For the past 10 years, annual performance growth has averaged 1.59×! • Architectural directions • Exploiting instruction level parallelism • Memory hierarchies • Special purpose micros not needed!
Today's Uniprocessor • Hardware techniques • Pipelining • Dynamic issue (i.e. superscalar) • Dynamic multistreaming (SMT) • Dynamic scheduling • Dynamic branch prediction • Dynamic disambiguation • Dynamic "super"speculation • Dynamic recompilation • Software techniques • Static scheduling • Static issue (i.e. VLIW) • Static branch prediction • Alias/pointer analysis • Static speculation
Limit of ILP
IPC on a real machine
Multistreamed Superscalar Processor • Exploit thread level parallelism • Interleaved execution of instructions from distinct threads • Multiple hardware contexts (streams) • Improve performance by making better use of processor resources
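A toy sketch of the idea on this slide: an N-wide issue stage fills its slots from whichever streams have ready instructions, so one thread's stall does not leave slots empty. The function name and the greedy round-robin fill are illustrative assumptions, not the machine described later in the talk.

```python
def fill_issue_slots(ready_per_stream, width):
    """ready_per_stream: count of ready instructions per hardware context.
    Greedily fills `width` issue slots round-robin across the streams;
    returns how many instructions each stream issues this cycle."""
    issued = [0] * len(ready_per_stream)
    slots = width
    while slots:
        progressed = False
        for s, ready in enumerate(ready_per_stream):
            if slots and issued[s] < ready:
                issued[s] += 1
                slots -= 1
                progressed = True
        if not progressed:          # every stream exhausted its ready work
            break
    return issued

# 4-wide machine: one stream alone offers only 2 ready instructions,
# but two extra streams soak up the otherwise-idle slots.
print(fill_issue_slots([2, 3, 1], 4))  # [2, 1, 1]
```

With a single stream the same call wastes half the slots (`fill_issue_slots([2], 4)` issues only 2), which is the resource-utilization argument the slide makes.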
Multistreaming Work • The beginning: the CDC6600 - J.E.Thornton • Early 80's: the HEP - B.Smith • Mid 80's: Delco TIO - M.Nemirovsky • Late 80's: UCSB DISC - M.Nemirovsky • Early 90's: UCSB MSP (SMT) - M.Nemirovsky & M.Serrano • In ISCA91 "Simultaneous Instruction Issuing" - H.Hirata • In HICSS94 "Performance Estimation of Multistreamed, Superscalar Processors" - W.Yamamoto & M.Nemirovsky et al. • In ISCA95 "Simultaneous Multithreading" - D.Tullsen • In PACT95 "Increasing superscalar performance through multistreaming" - W.Yamamoto & M.Nemirovsky
Multistreamed, Superscalar Processor (PACT'95, Yamamoto & Nemirovsky)
Performance Regions • Linear • Performance limited by workload parallelism • Saturation • Performance limited by machine parallelism
Limits on Performance • Machine Parallelism (mp) • Determined by the functional unit configuration and the dynamic instruction mix • Example: 2 integer, 60%; 1 memory, 40% • Workload Parallelism • Characteristic of a program • Compiler dependence
Functional Unit Effect on Performance (Ph.D. Dissertation (UCSB), March'94, M.Serrano)
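The slide's example can be worked through numerically. Each functional-unit class caps sustainable IPC at (number of units) / (fraction of the dynamic mix it serves), and the tightest class sets the machine-parallelism bound. This is a minimal sketch of that arithmetic, assuming the simple steady-state model the example implies:

```python
def machine_parallelism(units):
    """units: list of (num_units, fraction_of_dynamic_mix) per class.
    A class with n units serving a fraction f of instructions caps
    IPC at n / f; the minimum over classes is the mp bound."""
    return min(n / frac for n, frac in units)

# Slide example: 2 integer units (60% of mix), 1 memory unit (40%).
# Integer caps IPC at 2/0.6 = 3.33; memory at 1/0.4 = 2.5.
print(machine_parallelism([(2, 0.60), (1, 0.40)]))  # 2.5
```

So in the slide's configuration the single memory unit, not the integer units, limits the machine to 2.5 IPC.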
Execution Profiles: 1 stream, 2 streams (PACT'95, Yamamoto & Nemirovsky)
Execution Profiles: 3 streams, 4 streams (PACT'95, Yamamoto & Nemirovsky)
Caches • Caches are shared among the streams • Miss rate increases due to interstream conflicts • Individual thread performance decreases • Overall performance increases • Bus utilization increases • Increase is the product of the speedup and the miss rate increase • Design to maximize speedup while minimizing miss rate increase
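The slide's rule of thumb is simple enough to state as code: relative bus traffic grows with both the throughput gain (more instructions per cycle) and the extra misses per instruction from interstream conflicts. The 1.5× speedup below is a hypothetical number for illustration; the 18% miss-rate increase is the 1-to-2-stream figure reported later in the talk.

```python
def bus_utilization_factor(speedup, miss_rate_increase):
    """Relative bus traffic vs. a single stream: instructions/cycle
    scale by `speedup`, misses/instruction by `miss_rate_increase`."""
    return speedup * miss_rate_increase

# Hypothetical 1.5x speedup, misses up 18% (factor 1.18):
print(bus_utilization_factor(1.5, 1.18))  # ~1.77x the bus traffic
```

This is why the design goal on the slide is stated as maximizing speedup while minimizing the miss-rate increase: the bus pays for both at once.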
Extrinsic Misses • Extrinsic misses make up a significant portion of the miss rate (direct-mapped, 16-byte line) (MTEAC'98, Nemirovsky & Yamamoto)
The new era • Even if the large performance gains of the last 15 years can be sustained (which may be very hard), there are new applications growing even faster • New applications other than PC-centric • Larger diversity of requirements
"Post-Desktop Era"? • Information appliances • Multiple computers per person • Internet and web centric • Access to services is "one of" the killer apps • 3-D is "one of" the killer apps, …
Applications Fueling the Growth of the Internet (chart: throughput in MB/s, log scale 1-10000, vs. year, 1990-2003) • Text: e-mail, ftp, news (low-speed connections) • Graphics: web browsing (direct connections at work) • Transactions: e-commerce (v.90 access at home) • Telephony: voice over IP (DSL) • Streaming Video: video on demand
Future • Larger growth outside desktop PC • New performance metrics • "DoomMarks" vs. SPECmarks, MPPs vs. MFLOPs • Wider spectrum of requirements • Performance • Power • TTM • Reliability • Real Time • Cost
Opportunities • Application specific processors vs. GP • "Multiple" high-end CPU designs • Low-power architectures • Better CAD support • Fault-tolerant systems • Real Time architectures • Integration - System on a chip
Conclusions • Processors will have new constraints • "Multiple" general-purpose processors • Stream data • Light threads • New interfaces • Cache friendliness • Internet and communication will dominate • Reliability
Multithreading Work in 87 • Multiprocessor Systems • Fine grained instruction interleaving (HEP) • Coarse grained instruction interleaving (Sparcle) • Embedded Real Time Control • GM engine controller: the TIO has up to 33 streams active simultaneously; each stream controls spark, fuel, and other functions per cylinder
Multistreaming Work in 90 • Multiprocessor Systems • Fine grained instruction interleaving (TERA) • Coarse grained instruction interleaving (Sparcle) • Embedded Real Time Control • GM engine controller • Fine grained, dynamic instruction interleaving (DISC) DISC uses dynamic interleaving where the instruction dispatch algorithm dynamically reallocates throughput to the unblocked streams. This algorithm eliminates data and control hazards without degrading single stream latency.
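The dispatch policy described for DISC can be sketched in a few lines: each cycle, pick the next unblocked stream, so blocked streams donate their slots instead of stalling the pipeline. This is a minimal illustrative model, not the actual DISC hardware; the function name and the per-cycle blocked-set input are assumptions made for the sketch.

```python
def dispatch_trace(blocked_by_cycle, n_streams):
    """blocked_by_cycle: one set of blocked stream ids per cycle.
    Each cycle, rotate from the last start point and dispatch the
    first unblocked stream; returns the chosen stream per cycle
    (None if every stream is blocked that cycle)."""
    trace, nxt = [], 0
    for blocked in blocked_by_cycle:
        chosen = None
        for i in range(n_streams):
            s = (nxt + i) % n_streams
            if s not in blocked:
                chosen = s
                nxt = (s + 1) % n_streams  # reallocate the rotation fairly
                break
        trace.append(chosen)
    return trace

# Stream 0 blocks (e.g. on a hazard) in cycles 1-2; streams 1 and 2
# absorb those slots, so no cycle is wasted.
print(dispatch_trace([set(), {0}, {0}, set()], 3))  # [0, 1, 2, 0]
```

Note how this matches the slide's claim: a blocked stream's hazard never injects a bubble, and a lone unblocked stream still dispatches every cycle, so single-stream latency is not degraded.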
Multistreaming Work in 92 • Multiprocessor Systems • Fine grained instruction interleaving (TERA) • Coarse grained instruction interleaving (Sparcle) • Embedded Real Time Control • GM engine controller • Fine grained, dynamic instruction interleaving (DISC) • Multistreamed, Superscalar Processors • Fine grained, dynamic instruction interleaving • Each stream is a logical superscalar processor • Multiple functional unit design
Multistream Performance: 1 stream
Multistream Performance: 2 streams
Multistream Performance: 3 streams
Multistream Performance: 4 streams
Multistream Performance • Performance Bounds • Workload parallelism: 1-2 streams • Machine parallelism: 3-4 streams • Data cache miss rate increased by 18% when moving from a single stream to 2 streams
Interference • Associativity reduces interference • Increasing capacity reduces interference for large associative caches
Interference • Increasing the line size increases interference
Interference • Increasing the number of streams increases interference (2-way set associativity)
Overall Miss Rate • Increasing the line size: • decreases the miss rate for large caches • increases the miss rate for small caches • Multistreaming favors smaller line sizes
Individual Thread Performance • Round Robin Scheduling • Streams share the throughput equally • Individual thread execution time increased by 13% for 2 streams • Priority Scheduling • Streams are assigned a priority • Individual thread execution time increased by 2% for 2 streams • Lower priority stream executed at 73% of single stream performance
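The numbers on this slide are related by one identity: since time = work / rate, a thread running at fraction f of its single-stream rate takes 1/f as long. A small sketch tying the slide's figures together (the mapping of each figure through this identity is my reading, not stated on the slide):

```python
def time_increase(perf_fraction):
    """Execution-time growth factor for a thread running at
    `perf_fraction` of its single-stream performance (time = work/rate)."""
    return 1.0 / perf_fraction

# Round robin, 2 streams: +13% time <=> ~88.5% of single-stream rate.
print(round(time_increase(1 / 1.13), 2))  # 1.13
# Priority, high-priority stream: +2% time.
print(round(time_increase(1 / 1.02), 2))  # 1.02
# Priority, low-priority stream at 73% of single-stream rate:
print(round(time_increase(0.73), 2))      # 1.37 -> +37% time
```

The trade is visible directly: priority scheduling shifts almost the whole multistreaming cost onto the low-priority thread (+37% time) to keep the high-priority thread within 2% of running alone.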
Better ways to exploit parallelism? • Key to improving architectural gain/transistor • More SW & algorithmic involvement may be required! • Think about high-level forms of parallelism • More explicit, but a gentle slope is crucial • Can speculative multithreading help? • More evolutionary: microarchitecture level • Reduce importance of binary compatibility? • Multi-purpose ISAs rather than general-purpose? • Single architecture adapts to different applications • Possible directions • More static pipeline structures (LIW, VLIW) • Easier adoption of multiprocessing? • "Configurable" architectures (multi-use vs. g.p.)