Lecture 10: Hyper-Threading

Intel's Hyper-Threading Technology: Overview
• Intel's Hyper-Threading Technology brings the concept of simultaneous multi-threading to the Intel Architecture.
• Hyper-Threading Technology makes a single physical processor appear as two logical processors.
• The physical execution resources are shared, and the architecture state is duplicated for the two logical processors.
• From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on multiple physical processors.
• From a microarchitecture perspective, this means that instructions from both logical processors will persist and execute simultaneously on shared execution resources.

Thread Level: Amdahl’s Law
• Maximum efficiency
• The parallel fraction limits scalability
• Key: parallelize everything significant
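
Amdahl's Law can be written as S(n) = 1 / ((1 - p) + p / n), where p is the fraction of run time that can be parallelized and n is the number of processors. The sketch below evaluates that formula; the function names and the symbols p and n are chosen here for illustration and do not come from the slides.

```c
#include <assert.h>

/* Amdahl's Law: speedup on n processors when a fraction p of the
   run time can be parallelized (0 <= p <= 1, n >= 1). */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Upper bound on the speedup as n grows without limit: 1 / (1 - p).
   Requires p < 1, since a fully parallel program has no finite bound. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, amdahl_speedup(0.9, 2) is about 1.82, and amdahl_limit(0.9) is about 10: even with 90% of the work parallelized, the serial 10% caps the achievable speedup. The formula assumes n full processors; two logical processors that share execution resources will typically gain less.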

Brief Introduction to Threads: Multiprocessing
• Run threads using multiple processors
• Multithreading + multiple processors = improved performance
[Figure: CPU 1, CPU 2, CPU 3]

Brief Introduction to Threads: Functional Parallelism
• Apply different operations to different data elements
[Figure: concurrent tasks, labeled Open DB’s, Address Book, InBox, Calendar]

Brief Introduction to Threads: Data Parallelism
• Apply the same operation to different data elements
[Figure: Open File, Edit, Spell Check]
    function SpellCheck {
        loop (word = 1, words_in_file)
            compare_to_dictionary (word);
    }
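
The spell-check loop is a natural fit for data parallelism: each thread runs the same comparison over its own slice of the words. The following is a minimal sketch under stated assumptions. It uses POSIX threads rather than the Win32 API so that it builds on common toolchains, and the dictionary contents, NUM_THREADS, and all function names are hypothetical.

```c
#include <pthread.h>
#include <string.h>

#define NUM_THREADS 2   /* assumption: one worker per logical processor */

/* Hypothetical tiny dictionary, for illustration only. */
static const char *const DICTIONARY[] = { "the", "quick", "brown", "fox" };
#define DICT_SIZE (sizeof DICTIONARY / sizeof DICTIONARY[0])

struct slice {
    const char *const *words;  /* first word of this thread's slice */
    size_t count;              /* number of words in the slice */
    size_t misspelled;         /* result written by the worker */
};

static int in_dictionary(const char *word)
{
    for (size_t i = 0; i < DICT_SIZE; i++)
        if (strcmp(word, DICTIONARY[i]) == 0)
            return 1;
    return 0;
}

/* Same operation applied to every word of one slice. */
static void *check_slice(void *arg)
{
    struct slice *s = arg;
    s->misspelled = 0;
    for (size_t i = 0; i < s->count; i++)
        if (!in_dictionary(s->words[i]))
            s->misspelled++;
    return NULL;
}

/* Splits the words into NUM_THREADS contiguous slices, checks them
   concurrently, and returns the number of words not in the dictionary. */
size_t spell_check(const char *const *words, size_t n)
{
    pthread_t tid[NUM_THREADS];
    struct slice part[NUM_THREADS];
    int joinable[NUM_THREADS];
    size_t chunk = (n + NUM_THREADS - 1) / NUM_THREADS;
    size_t total = 0;

    for (int t = 0; t < NUM_THREADS; t++) {
        size_t begin = (size_t)t * chunk;
        if (begin > n)
            begin = n;
        size_t end = (begin + chunk > n) ? n : begin + chunk;
        part[t].words = words + begin;
        part[t].count = end - begin;
        part[t].misspelled = 0;
        joinable[t] = (pthread_create(&tid[t], NULL, check_slice, &part[t]) == 0);
        if (!joinable[t])
            check_slice(&part[t]);   /* fall back to checking inline */
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        if (joinable[t])
            pthread_join(tid[t], NULL);
        total += part[t].misspelled;
    }
    return total;
}
```

Each worker writes only to its own slice record, so no locking is needed; the per-slice counts are summed after the joins.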

Brief Introduction to Threads: Thread Libraries, Win32 API
• C language interfaces
• Threads exist within a single process
• Good for asynchronous concurrency
• All threads are peers
  • No explicit parent-child model
  • Exception: main() thread
• Creating Win32 threads
    HANDLE CreateThread(
        LPSECURITY_ATTRIBUTES  ThreadAttributes,
        DWORD                  StackSize,
        LPTHREAD_START_ROUTINE StartAddress,
        LPVOID                 Parameter,
        DWORD                  CreationFlags,
        LPDWORD                ThreadId );
• Functions are explicitly mapped to threads
• Thread handle is a synchronization object
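
As a rough illustration of how CreateThread maps a function to a thread and how the returned handle is used for synchronization, here is a small sketch. The struct job type and the functions square, worker, and run_on_thread are hypothetical names introduced for this example. The non-Windows branch is an added POSIX fallback so the sketch can be compiled elsewhere; it is not part of the Win32 API.

```c
#ifdef _WIN32
#include <windows.h>
#else
#include <pthread.h>   /* assumption: POSIX fallback for non-Windows builds */
#endif

/* Work item handed to the thread through its single pointer parameter. */
struct job {
    int input;
    int result;
};

static void square(struct job *j)
{
    j->result = j->input * j->input;
}

#ifdef _WIN32
/* Thread entry point with the signature CreateThread expects. */
static DWORD WINAPI worker(LPVOID param)
{
    square((struct job *)param);
    return 0;
}
#else
static void *worker(void *param)
{
    square((struct job *)param);
    return NULL;
}
#endif

/* Runs square() on its own thread and waits for it to finish.
   Returns 0 on success, -1 if the thread could not be created. */
int run_on_thread(struct job *j)
{
#ifdef _WIN32
    DWORD thread_id;
    HANDLE h = CreateThread(NULL,        /* default security attributes */
                            0,           /* default stack size */
                            worker,      /* StartAddress */
                            j,           /* Parameter */
                            0,           /* CreationFlags: run immediately */
                            &thread_id); /* receives the thread ID */
    if (h == NULL)
        return -1;
    /* The handle is a synchronization object: it is signaled on thread exit. */
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
    return 0;
#else
    pthread_t t;
    if (pthread_create(&t, NULL, worker, j) != 0)
        return -1;
    pthread_join(t, NULL);
    return 0;
#endif
}
```

A caller fills in a struct job, passes its address to run_on_thread, and reads the result field once the call returns.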

Pentium 4 Block diagram

Execution pipeline
• A high-level view of the microarchitecture pipeline.
• Buffering queues separate major pipeline logic blocks.
• The buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block.

Fetch and deliver
• Alternate between logical processors
• Execution trace cache / Microcode ROM
• Fetch and decode instructions
• Register rename and allocation

Execution
• The out-of-order execution engine consists of the allocation, register renaming, scheduling, and execution functions
• Logical processors execute simultaneously
• Compete for schedulers, ALUs
• Schedulers map independent instructions to available execution resources

The memory subsystem
• The memory subsystem includes:
  • The DTLB, which translates linear (virtual) addresses to physical addresses. It has 64 fully associative entries; each entry can map either a 4 KB or a 4 MB page.
    • Although the DTLB is shared between the two logical processors, each entry includes a logical processor ID tag.
    • Each logical processor also has a reservation register to ensure fairness and forward progress in processing DTLB misses.
  • The low-latency Level 1 (L1) data cache
  • The Level 2 (L2) unified cache
  • The Level 3 (L3) unified cache (available only on the Intel® Xeon™ processor MP)
• Access to the memory subsystem is also largely oblivious to logical processors.
• The schedulers send load or store uops without regard to logical processors, and the memory subsystem handles them as they come.
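
To make the idea of a shared but tagged DTLB concrete, the toy model below stores a logical processor ID in every entry and reports a hit only when both the page number and the tag match. This is a software sketch for intuition, not a description of the actual hardware: it handles 4 KB pages only, and the round-robin replacement policy and all type and function names are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

#define DTLB_ENTRIES 64
#define PAGE_SHIFT   12            /* 4 KB pages only, for brevity */

struct dtlb_entry {
    bool     valid;
    uint8_t  lp_id;                /* logical processor ID tag: 0 or 1 */
    uint32_t vpn;                  /* virtual page number */
    uint32_t pfn;                  /* physical frame number */
};

struct dtlb {
    struct dtlb_entry e[DTLB_ENTRIES];
    unsigned next;                 /* hypothetical round-robin victim pointer */
};

/* Installs a translation for one logical processor. */
void dtlb_insert(struct dtlb *t, uint8_t lp_id, uint32_t vaddr, uint32_t paddr)
{
    struct dtlb_entry *slot = &t->e[t->next];
    t->next = (t->next + 1) % DTLB_ENTRIES;
    slot->valid = true;
    slot->lp_id = lp_id;
    slot->vpn = vaddr >> PAGE_SHIFT;
    slot->pfn = paddr >> PAGE_SHIFT;
}

/* Fully associative lookup: every entry is compared, and an entry hits
   only when both the page number and the logical processor tag match. */
bool dtlb_lookup(const struct dtlb *t, uint8_t lp_id, uint32_t vaddr,
                 uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (unsigned i = 0; i < DTLB_ENTRIES; i++) {
        const struct dtlb_entry *x = &t->e[i];
        if (x->valid && x->lp_id == lp_id && x->vpn == vpn) {
            *paddr = (x->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
            return true;
        }
    }
    return false;
}
```

With the tag in place, a translation installed by logical processor 0 is invisible to logical processor 1 even though both share the same 64 entries.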

Instruction retirement
• Alternate between logical processors
• Commit state in program order

Hyper-Threading implementation
• Two logical processors for very small additional die area
• Alternate between logical processors
  • Fetch and deliver
  • Reorder and retire
• Competitive sharing between logical processors
  • Rapid execution engine
  • Caches

OS support
• From the OS point of view, Hyper-Threading looks just like multiprocessing
• Needs BIOS support for initialization
• HLT instruction (for idle)
  • The HLT instruction stops instruction execution and places the processor in the HALT state. An enabled interrupt, NMI, or reset resumes execution, which continues at the instruction following HLT.
• There are two modes of operation, referred to as single-task (ST) and multi-task (MT).
• On a processor with Hyper-Threading Technology, executing HLT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HLT.
• Spin loops:
  • Use the new PAUSE instruction
  • For long waits, use an OS call
  • Wait on an object
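
The spin-loop advice can be sketched as a bounded spin that issues PAUSE on each iteration and then tells the caller to fall back to a blocking OS wait. This is a minimal sketch, assuming C11 atomics and the _mm_pause intrinsic on x86; SPIN_LIMIT, cpu_relax, and spin_wait are hypothetical names, and the budget value is arbitrary.

```c
#include <stdatomic.h>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define cpu_relax() _mm_pause()    /* emits the PAUSE instruction */
#else
#define cpu_relax() ((void)0)      /* assumption: no-op on non-x86 targets */
#endif

#define SPIN_LIMIT 4000            /* hypothetical spin budget */

/* Spins until *flag becomes nonzero or the budget runs out.
   Returns 1 if the flag was observed set, or 0 if the caller should
   stop spinning and block in an OS wait instead. */
int spin_wait(atomic_int *flag)
{
    for (int i = 0; i < SPIN_LIMIT; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) != 0)
            return 1;
        cpu_relax();   /* yield shared execution resources to the other logical processor */
    }
    return 0;
}
```

When spin_wait returns 0, the caller would block on an OS synchronization object (for example, WaitForSingleObject on a Win32 event) rather than keep consuming execution resources that the other logical processor could use.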
