Opportunities for Hardware Multithreading in Microprocessors and Microcontrollers Theo Ungerer Systems and Networking University of Augsburg ungerer@informatik.uni-augsburg.de http://www.informatik.uni-augsburg.de/sik/
Basic Principle of Multithreading [Figure: each hardware thread (thread 1, thread 2, thread 3, thread 4, ...) has its own register set, PC, and PSR; a thread pointer marks one of them, thread 3 in the figure.]
Multithreading in High Performance Processors Hardware multithreading is the ability to pursue more than one thread within a processor pipeline. Typical features: multiple register sets, fast context switching Main objective: performance gain by latency hiding for multithreaded workloads Multithreading in high-performance microprocessors • IBM RS64 IV (SStar) • Sun UltraSPARC V • Intel Xeon™
Outline of the Presentation • Motivation • State-of-the-art • Multithreading • Multithreading for throughput increase • Multithreading for power reduction • Multithreading for embedded real-time systems • Conclusions & Research Opportunities
Today's Multiple-issue Processors Utilization of instruction-level parallelism by a long instruction pipeline and by the superscalar or the VLIW/EPIC technique.
Problem: Low Resource Utilization by Sequential Programs [Figure: issue slots of the processor plotted against cycles; partly filled cycles show horizontal losses of 1, 2, and 3 slots, and completely empty cycles show vertical losses of 4 slots each.] Losses by empty issue slots
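The two kinds of loss can be counted directly from a per-cycle issue trace. The sketch below is illustrative only: the function names and the example trace are assumptions, chosen so that a 4-wide processor reproduces the losses named on the slide (horizontal losses of 1, 2, and 3, and two vertical losses of 4).

```python
def slot_losses(issued_per_cycle, issue_width):
    """Return (horizontal_loss, vertical_loss) counted in issue slots.

    A cycle that issues nothing wastes the whole width (vertical loss).
    A cycle that issues at least one instruction wastes only the unused
    slots of that cycle (horizontal loss).
    """
    horizontal = 0
    vertical = 0
    for issued in issued_per_cycle:
        if not 0 <= issued <= issue_width:
            raise ValueError("issued count must be between 0 and issue_width")
        if issued == 0:
            vertical += issue_width
        else:
            horizontal += issue_width - issued
    return horizontal, vertical


def utilization(issued_per_cycle, issue_width):
    """Fraction of all issue slots that were actually used."""
    total_slots = len(issued_per_cycle) * issue_width
    return sum(issued_per_cycle) / total_slots if total_slots else 0.0


# Hypothetical trace for a 4-issue processor over five cycles.
trace = [3, 2, 0, 0, 1]
print(slot_losses(trace, 4))   # (6, 8)
print(utilization(trace, 4))
```

In this trace only 6 of 20 slots are used, which is the kind of low utilization that multithreading tries to recover.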
Multithreading • Two basic multithreading techniques • Interleaved Multithreading • Block Multithreading • Simultaneous multithreading (SMT) • combines wide issue superscalar with multithreading, • issues instructions from several threads simultaneously.
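The difference between the two basic techniques can be sketched as a thread-selection rule. This is a simplified model rather than a description of any particular processor: interleaved multithreading is modelled as a round-robin switch on every cycle, block multithreading as staying on one thread until a long-latency event occurs. The `stall_cycles` parameter is a hypothetical input naming the cycles at which such an event (for example a cache miss) hits the running thread, and the model ignores whether the next thread is itself ready.

```python
def interleaved_schedule(num_threads, cycles):
    """Interleaved MT: a different thread is selected on every cycle."""
    return [cycle % num_threads for cycle in range(cycles)]


def block_schedule(num_threads, stall_cycles, cycles):
    """Block MT: keep running one thread until it hits a long-latency event.

    stall_cycles is a set of cycle numbers at which the running thread
    encounters such an event; the switch takes effect on the next cycle.
    """
    current = 0
    order = []
    for cycle in range(cycles):
        order.append(current)
        if cycle in stall_cycles:
            current = (current + 1) % num_threads
    return order


print(interleaved_schedule(3, 6))      # [0, 1, 2, 0, 1, 2]
print(block_schedule(3, {2, 4}, 6))    # [0, 0, 0, 1, 1, 2]
```

SMT differs from both in that instructions from several threads can be issued in the same cycle, so a single thread index per cycle no longer describes it.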
Basic Multithreading Techniques Single thread Interleaved MT Block MT
SMT vs. CMP SMT CMP
Characteristics of Multithreading • Latency Utilization • The latencies that arise in the computation of a single instruction stream are filled by computations of another thread. Throughput of multithreaded workloads is increased • Power Reduction • Using less speculation • Rapid Context Switching • appropriate for real-time applications
Multithreading for Throughput Increase • Lots of research results with simulated SMT since 1995 • Some of our own research results • Performance estimation of SMT multimedia • Regard transistor count and chip-space estimation of the models.
Relevant Attributes for Rating Microprocessors Performance Resource Requirement Clock Speed Power Consumption • Two tools • Performance estimation tool • Transistor count and chip-space estimation tool
Transistor Count and Chip-space Estimator • Vision: • The resources of the baseline model should be adjusted such that the same chip space or the same transistor count is covered as in the new microarchitecture models. • We use an analytical method for memory-based structures like register files or internal queues and • an empirical method for logic blocks like control logic and functional units. • half-feature size λ as measure of length of basic cell • Estimator tool is available (also for SimpleScalar) at: http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/complexity/
Execution-based Simulator: Baseline SMT Multimedia Processor Model
Results of Performance and Hardware Cost Estimation • Demonstrated by two sets of models: 1. "Maximum" processor models with an abundance of resources 2. Small processor models • Workload is an MPEG-2 decoder made multithreaded
Simulation Parameters • Fixed parameters: • 1024-entry BTAC, gshare branch predictor (2 K 2-bit counters, 8 bit history, mispred. pen. 5 cycles) • 4-way set-associative D- and I-caches with 32 byte cache lines • 32 KB local on-chip RAM • 64-bit system bus, 4 MB main memory • Varied parameters: • 8-12 execution units • 256- and 32-entry reservation stations • 10 to 4 result buses • different D-cache sizes, D- and I-caches of 4 MB and 64 KB • Parameters Varied with Number of Threads: • 32 32-bit general-purpose registers and 40 rename registers (per thread), • 32- and 16-entry issue and retirement buffers (per thread) • Fetch and decode bandwidth is scaled with issue bandwidth and number of threads: 1x1 – 8x8
1 Performance vs. Hardware Cost Estimation: Maximum Processor Models 4 MB I- and D-caches, 6 integer/mm units, 2 local load/store units
1 Transistor Count and Chip Space Estimation of Maximum Processor Models
2 Small Processor Models 64 KB I- and D-caches, 3 integer/mm units, 1 local load/store unit, 32-entry reservation stations, 16-entry issue and retirement buffers, 4 result buses, 2x4 fetch and decode bandwidth fixed
2 Transistor Count and Chip Space Estimation of Small Processor Models
Results
• 4-threaded 8-issue SMT over a single-threaded 8-issue:
  Model            Speedup   Transistor increase   Chip space increase
  maximum model    3         2%                    9%
  small model      1.5       9%                    27%
• Commercial multithreaded processors:
• Tera, MAJC, Alpha 21464, IBM Blue Gene, Sun UltraSPARC V
• Network processors (Intel IXP, IBM PowerNP, Vitesse IQ2x00, Lextra, ...)
• IBM RS64 IV: two-threaded block MT, reported 5% overhead
• Intel Xeon™ (hyperthreading): two-threaded SMT, reported 5% overhead
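One way to read the speedup and cost figures together is to divide the speedup by the relative hardware cost. The metric below is an assumption introduced for illustration, not one used in the slides; the input numbers are the ones reported for the maximum and small models.

```python
def gain_per_cost(speedup, cost_increase_percent):
    """Speedup divided by relative hardware cost (1.0 = baseline cost)."""
    return speedup / (1.0 + cost_increase_percent / 100.0)


# (speedup, transistor increase %, chip space increase %) from the results slide.
models = {
    "maximum model": (3.0, 2.0, 9.0),
    "small model": (1.5, 9.0, 27.0),
}

for name, (speedup, transistors, area) in models.items():
    per_transistor = gain_per_cost(speedup, transistors)
    per_area = gain_per_cost(speedup, area)
    print(f"{name}: {per_transistor:.2f} per transistor budget, "
          f"{per_area:.2f} per chip area")
```

Under this reading both models still gain after the extra hardware is accounted for, with the small model gaining least when normalized to chip space.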
SMT for Reduction of Power Consumption • Observation: mispredictions cost energy • Today's superscalars: ~60% of the fetched and ~30% of the executed instructions are squashed • Idea: fill issue slots with less speculative instructions of other threads • Simulations of Seng et al. 2000 show that ~22% less energy is consumed by using a power-aware scheduler
Multithreading in Embedded Real-time Systems: The Komodo Approach • Observation: multithreading allows a context switching overhead of zero cycles • Idea: harness multithreading for embedded real-time systems • Komodo Project: Real-time Java Based on a Multithreaded Java-microcontroller http://www.informatik.uni-augsburg.de/lehrstuehle/info3/research/komodo/indexEng.html
Real-time Requirements • run-time predictability • isolation of the threads • programmability • real-time scheduling support • fast context switching Hard real-time: a deadline may never be missed Soft real-time: a deadline may occasionally be missed
Komodo Solutions • Extremely fast context switching by hardware multithreading • Real-time scheduling in hardware • Based on a Java processor core • Predictability of all instruction executions by a careful hardware design
Hardware Real-time Scheduling • Real-time scheduler is realized in hardware (by the priority manager) • Scheduling decision every clock cycle • Four different scheduling algorithms implemented: • Fixed Priority Preemptive (FPP) • Earliest Deadline First (EDF) • Least Laxity First (LLF) • Guaranteed Percentage (GP)
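The first three policies differ only in the key used to pick the next thread. The software sketch below illustrates the selection rules and is not the Komodo priority manager: the `Thread` fields, the convention that a larger priority value is more urgent, and the example values are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    priority: int    # larger value = more urgent (assumed convention)
    deadline: int    # absolute deadline, in cycles
    remaining: int   # cycles of work still to execute


def pick_fpp(ready):
    """Fixed Priority Preemptive: highest static priority wins."""
    return max(ready, key=lambda t: t.priority)


def pick_edf(ready):
    """Earliest Deadline First: the closest deadline wins."""
    return min(ready, key=lambda t: t.deadline)


def pick_llf(ready, now):
    """Least Laxity First: smallest slack (deadline - now - remaining) wins."""
    return min(ready, key=lambda t: t.deadline - now - t.remaining)


ready = [
    Thread("a", priority=3, deadline=100, remaining=10),
    Thread("b", priority=1, deadline=40, remaining=5),
    Thread("c", priority=2, deadline=50, remaining=45),
]
print(pick_fpp(ready).name)         # a
print(pick_edf(ready).name)         # b
print(pick_llf(ready, now=0).name)  # c
```

The same ready set yields three different choices, which shows why the policy matters when a decision is taken every clock cycle.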
Guaranteed Percentage Scheme [Figure: timeline of event A (20%), event B (40%), and event C (30%), each with a start and a deadline; on a conventional processor the timeline shows context switches and a deadline violation, on a multithreaded processor it shows a surplus.]
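The guaranteed percentage idea can be sketched as a per-interval cycle budget. The sketch rests on assumptions not stated on the slide: an interval length of 100 cycles, a largest-remaining-budget selection rule, and idle cycles standing in for the surplus. The shares are the 20%, 40%, and 30% shown in the figure.

```python
def gp_interval(shares, interval=100):
    """Distribute the cycles of one interval according to guaranteed shares.

    shares maps a thread name to its guaranteed percentage. Returns the
    per-cycle timeline; None marks a surplus cycle nobody has reserved.
    """
    if sum(shares.values()) > 100:
        raise ValueError("guaranteed percentages exceed 100%")
    budget = {name: interval * pct // 100 for name, pct in shares.items()}
    timeline = []
    for _ in range(interval):
        runnable = [name for name, left in budget.items() if left > 0]
        if not runnable:
            timeline.append(None)       # surplus cycle
            continue
        # Zero-cycle context switches make a per-cycle choice affordable.
        chosen = max(runnable, key=lambda name: budget[name])
        budget[chosen] -= 1
        timeline.append(chosen)
    return timeline


timeline = gp_interval({"A": 20, "B": 40, "C": 30})
print({name: timeline.count(name) for name in ("A", "B", "C", None)})
# {'A': 20, 'B': 40, 'C': 30, None: 10}
```

Because each thread receives its share regardless of what the others do, the threads are isolated from one another in time.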
Simulation Results thread mix (IC, PID, and FFT) applied
Technical Data of the Komodo Prototype
• Implementation of Komodo core pipeline on a Xilinx XCV800 with 800k gates
• ASIC synthesis of whole microcontroller (0.18 µm technology): 340 MHz, 3 mm² chip
  data bit width             32 bit
  address space              19 bit
  number of threads          4
  instruction window size    8 bytes
  stack size                 128 entries
  external frequency         33 MHz
  internal frequency         8.25 MHz
  CLBs                       9 200
  number of gates            133 000
Reducing Power Consumption Using Real-time Scheduling in Hardware Current work: Idea: Use information about the thread states and configurations available within the priority manager for a "fine-grained" adaptation of power consumption and performance. • Frequency and voltage adjustments in short time intervals done by hardware
State of the Komodo Project • Software simulator • FPGA prototype • Real-time Java system • ASIC • Middleware for distributed embedded systems
Conclusions on Multithreading in Real-time Environments Multithreaded processor cores: • Performance gain due to fast context switching (for hard real-time) and latency hiding (for soft and non real-time) • More efficient event handling by ISTs • Helper threads possible (garbage collection, debugging) Real-time scheduling in hardware: • Software overhead for real-time scheduling removed • More efficient power saving mechanisms possible • Better predictability by isolation of threads (GP scheduling)
Conclusions & Research Opportunities • Multithreading proves advantageous: • Latency hiding: speed-ups of 2-3 for SMT, lots of research done, next generation of microprocessors • Power reduction: 22% savings reported, not much research up to now • Fast context switching utilized by microcontroller for real-time systems, not much research up to now • Research opportunities: • Scheduling in SMT, network processors and multithreaded real-time systems • Thread speculation: how to speed up single-threaded programs? • Multithreading and power consumption • Multithreading in other communities: microcontrollers, SoCs • System software based on helper threads
Acknowledgements • SMT Multimedia research group • Uli Sigmund and Heiko Oehring • Complexity estimation group • Marc Steinhaus, Reiner Kolla, Josep L. Larriba-Pey, Mateo Valero • Komodo project group • Jochen Kreuzinger, Matthias Pfeffer, Sascha Uhrig, Uwe Brinkschulte, Florentin Picioroaga, Etienne Schneider
Microprocessors: Technology Prognosis up to 2012 • SIA (Semiconductor Industry Association) forecast of 1997:
Research Directions? • Increase performance of a single thread of control by • more instruction-level speculation • Better branch prediction, • Trace cache and next trace prediction, • Data dependence and value prediction • Increase throughput of a workload of multiple threads • Utilize thread-level and instruction-level parallelism • Chip-Multiprocessors • Multithreading (hardware thread = thread or process) • Thread speculation