A Heterogeneous Lightweight Multithreaded Architecture. Sheng Li, Amit Kashyap, Shannon Kuntz, Jay Brockman, Peter Kogge, Paul Springer, and Gary Block. University of Notre Dame. MTAAP 2007, CA.
Outline • Heterogeneous Lightweight Multithreaded Architecture • Simulation environments, benchmarks and results • Conclusions and future work
Architecture Highlights • Processing-In-Memory (PIM) based • Effectively attacks the memory wall problem • Highly multithreaded • Hides large latencies and contention • Heterogeneous; supports Extended Memory Semantics (EMS) • Extremely low overhead for context switching and synchronization
Multithreaded Processors • Multithreading reduces processor idle time • Thread context is part of the processor • Multithreading machines by decade: 1960s CDC 6600; 1970s I/O Processor for the Space Shuttle; 1980s Denelcor HEP; 1990s Cray/Tera MTA; 2000+ Cray Eldorado, Intel Xeon, Sun Niagara • [Figure: single-threaded vs. multithreaded execution]
Lightweight Threads • Thread context (frame) is 32 double words (256 bytes) • Two double words are reserved for the thread status; the other 30 are general-purpose registers • No other per-thread state, which makes multithreading easy • Frames are stored in memory (no register file) • Registers are aliases for memory locations
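The frame arithmetic above (2 status dwords + 30 registers = 32 dwords = 256 bytes) can be sketched in C. This is only an illustration: the slide gives sizes, not field names or ordering, so `lw_frame`, `frame_reg`, `frame_reg_offset` and the placement of the status words first are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the source states only the sizes, not names or order. */
enum { FRAME_DWORDS = 32, STATUS_DWORDS = 2, GPR_COUNT = 30 };

typedef struct {
    uint64_t status[STATUS_DWORDS]; /* two double words reserved for thread status */
    uint64_t gpr[GPR_COUNT];        /* 30 general-purpose registers */
} lw_frame;

/* Compile-time check: a frame is exactly 32 double words (256 bytes). */
_Static_assert(sizeof(lw_frame) == FRAME_DWORDS * sizeof(uint64_t),
               "frame must be 32 double words");

/* "Registers are aliases for memory locations": register r of a frame
 * is simply an address inside that frame. */
static uint64_t *frame_reg(lw_frame *f, unsigned r) {
    return &f->gpr[r];
}

/* Byte offset of register r from the start of its frame. */
static size_t frame_reg_offset(unsigned r) {
    return offsetof(lw_frame, gpr) + (size_t)r * sizeof(uint64_t);
}
```

Because a frame is plain memory, switching threads in this model amounts to switching a frame pointer, with no register file to save or restore.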
Lightweight Multithreading • Thread creation is fast and inexpensive: a single instruction • Contrast with pthread creation, which needs kernel intervention and as many as tens of thousands of instructions • Unbounded Multithreading • Threads are part of the memory system rather than the processor state • An “unlimited number” of threads per processor • Many opportunities for issuing an instruction • Ultra-lightweight Processing • Unbounded multithreading requires low-overhead thread management and synchronization • Processing at the memory bank gives greater data bandwidth and low overhead
Heterogeneous Architecture • Issues instructions from ready threads on each clock cycle • Architectural support for low-overhead thread management • [Figure: heterogeneous architecture of the Lightweight Processor Chip (LPC)]
Extended Memory Semantics (EMS) • Memory subsystem is built from 65-bit dwords • 64 bits of data or metadata • 1 extension bit: 1 means the dword is full, 0 means it is empty • Extends the Cray MTA E/F bits • Full/Empty: whether the dword contains data • Extra states: metadata can contain a frame pointer • The same semantics apply to thread registers
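A minimal software model of one such dword, assuming a struct is an acceptable stand-in for the 65-bit hardware word. The names `ems_dword`, `ems_fill` and `ems_take` are hypothetical and not taken from the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Software stand-in for one 65-bit dword. In hardware the extension bit
 * travels with the 64 data bits; here it occupies a separate field. */
typedef struct {
    uint64_t bits; /* 64 bits of data (when full) or metadata (when empty) */
    bool     full; /* extension bit: true = full, false = empty */
} ems_dword;

/* Write a value and mark the word full. */
static void ems_fill(ems_dword *w, uint64_t value) {
    w->bits = value;
    w->full = true;
}

/* If the word is full, copy its value out and leave it empty.
 * Returns false, leaving *out untouched, when the word is empty. */
static bool ems_take(ems_dword *w, uint64_t *out) {
    if (!w->full) {
        return false;
    }
    *out = w->bits;
    w->full = false;
    return true;
}
```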
Single Producer/Consumer on EMS • LWP behavior for load_fe with A empty • Location A changes state to “FVE: forward value, leave empty” • The content of A becomes the target address of the forward operation (all registers also have a memory address)
Completing the Load • How does the LWP complete the load_fe? • A store_ef arrives at A • The data carried by the store is returned to the forward target, T2:R2, which completes the load_fe • Location A changes to the empty state
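The load_fe/store_ef handshake on these two slides can be sketched as a small state machine. This is a software model under stated assumptions, not the hardware design: `ems_cell`, `ems_result` and the separate `forward` pointer field are hypothetical (the slide says the forward address is held in A's own contents), and any case beyond one producer and one consumer is only reported, not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { EMS_EMPTY, EMS_FULL, EMS_FVE } ems_state; /* FVE: forward value, leave empty */
typedef enum { EMS_DONE, EMS_PARKED, EMS_NEEDS_HANDLER } ems_result;

typedef struct {
    ems_state state;
    uint64_t  data;    /* value held while FULL */
    uint64_t *forward; /* target register address while FVE (assumed separate field) */
} ems_cell;

/* load_fe: load when full and leave empty. If A is empty, park the request
 * by recording where the value must be forwarded. */
static ems_result load_fe(ems_cell *a, uint64_t *target) {
    switch (a->state) {
    case EMS_FULL:
        *target = a->data;
        a->state = EMS_EMPTY;
        return EMS_DONE;
    case EMS_EMPTY:
        a->forward = target;
        a->state = EMS_FVE;
        return EMS_PARKED;
    default:
        return EMS_NEEDS_HANDLER; /* a second consumer is already waiting */
    }
}

/* store_ef: store when empty and leave full. If a load is parked at A,
 * forward the value to its target register and leave A empty. */
static ems_result store_ef(ems_cell *a, uint64_t value) {
    switch (a->state) {
    case EMS_FVE:
        *a->forward = value;
        a->forward = NULL;
        a->state = EMS_EMPTY;
        return EMS_DONE;
    case EMS_EMPTY:
        a->data = value;
        a->state = EMS_FULL;
        return EMS_DONE;
    default:
        return EMS_NEEDS_HANDLER; /* A is already full: a second producer */
    }
}
```

In this model a consumer that arrives first is parked at A, and the later store completes it without A ever holding the value, which matches the slide's statement that A ends in the empty state.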
A More Complex Situation • Consider a multiple producer/consumer problem such as a lock • Multiple threads (more than 3) all attempt to acquire the lock • Memory requests are queued at the target location • An EMS handler thread is needed to do the bookkeeping
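The bookkeeping such a handler has to do can be illustrated with a FIFO of waiting requests attached to the lock. The source does not specify the queue discipline or data structure, so the ring buffer, its capacity and the names `ems_lock`, `lock_acquire` and `lock_release` are assumptions made for this sketch.

```c
#include <assert.h>

enum { MAX_WAITERS = 8 }; /* hypothetical queue capacity */
enum { NO_THREAD = -1 };

typedef enum { LOCK_ACQUIRED, LOCK_QUEUED, LOCK_QUEUE_FULL } lock_result;

typedef struct {
    int owner;                /* thread holding the lock, or NO_THREAD */
    int waiters[MAX_WAITERS]; /* ring buffer of queued thread ids */
    int head;                 /* index of the oldest waiter */
    int count;                /* number of queued waiters */
} ems_lock;

static void lock_init(ems_lock *l) {
    l->owner = NO_THREAD;
    l->head = 0;
    l->count = 0;
}

/* Grant the lock if it is free, otherwise queue the request. */
static lock_result lock_acquire(ems_lock *l, int thread_id) {
    if (l->owner == NO_THREAD) {
        l->owner = thread_id;
        return LOCK_ACQUIRED;
    }
    if (l->count == MAX_WAITERS) {
        return LOCK_QUEUE_FULL;
    }
    l->waiters[(l->head + l->count) % MAX_WAITERS] = thread_id;
    l->count++;
    return LOCK_QUEUED;
}

/* Release the lock and hand it to the oldest waiter, if any.
 * Returns the new owner, or NO_THREAD when nobody was waiting. */
static int lock_release(ems_lock *l) {
    if (l->count == 0) {
        l->owner = NO_THREAD;
        return NO_THREAD;
    }
    l->owner = l->waiters[l->head];
    l->head = (l->head + 1) % MAX_WAITERS;
    l->count--;
    return l->owner;
}
```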
EMS Handler Overhead • Invoking an EMS handler • Needed for synchronized memory operations beyond the hardware-supported single producer/consumer scenario • Overhead • Creating the handler threads • To queue memory requests, handlers must spin on the target memory address to get exclusive access • Significant overhead in LWP CPU time, NoC traffic and memory bandwidth • How can this overhead be alleviated?
Ultra-Lightweight Processor (ULWP) • Relieves the LWP of this burden • Handles thread synchronization and management, and complex atomic memory operations • Simple design, minimal circuitry • Located at the memory bank: greatest data bandwidth (wide-word), no NoC traffic when accessing memory • Multithreaded
[Figure: large-scale system]
Outline • Heterogeneous Lightweight Multithreaded Architecture • Simulation environments, benchmarks and results • Conclusions and future work
Simulation Environment • DimC (Diminished C) - An extension of ANSI C - Exposes low-level architectural features - Supports lightweight multithreading • SALT (Simulator for the Analysis of LWP Timings) - Contains LWPs, ULWPs, NoC and memory subsystems
Benchmark Suite • Two categories of irregular problems • Category 1: complicated control structures such as recursion • Such programs can achieve decent performance on conventional architectures, but only with great effort • Do not necessarily invoke the EMS handler or ULWP • N-Queens, Fibonacci • Category 2: complicated control structures and dynamic data structures • Very hard to parallelize effectively on conventional SMPs • EMS handler or ULWP support is necessary • Competing agents, SAT solver kernel
N-Queens • Find all solutions to the problem of placing N queens on an N×N chessboard such that no queen can attack another • An irregular problem with dynamic parallel recursion • Thread behavior is hard to predict
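For reference, a sequential solution counter shows the recursion shape that makes this benchmark irregular. This is a standard bitmask formulation written for illustration, not the DimC benchmark code. Each recursive call inside the loop is an independent subproblem, and the number of such calls per level depends on the board state, which is why thread behavior is hard to predict when each call becomes a thread.

```c
#include <assert.h>

/* Count the ways to complete the board, one row per recursion level.
 * cols: columns already occupied; ld / rd: squares of the current row
 * attacked along the two diagonal directions; all: mask of n valid columns. */
static unsigned long place(unsigned all, unsigned cols, unsigned ld, unsigned rd) {
    if (cols == all) {
        return 1; /* every column holds a queen, so all n rows are filled */
    }
    unsigned long count = 0;
    unsigned open = all & ~(cols | ld | rd); /* safe squares in this row */
    while (open != 0) {
        unsigned bit = open & (0u - open);   /* lowest safe square */
        open &= ~bit;
        /* Independent subproblem: a natural unit of parallel work. */
        count += place(all, cols | bit, (ld | bit) << 1, (rd | bit) >> 1);
    }
    return count;
}

/* Number of solutions for an n-by-n board. The range limit is a choice
 * made for this sketch, not a property of the problem. */
static unsigned long n_queens(unsigned n) {
    if (n == 0 || n > 16) {
        return 0;
    }
    return place((1u << n) - 1u, 0u, 0u, 0u);
}
```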
Competing Agents • Multiple agents attempt to update a shared memory location simultaneously • Each agent is implemented by a single thread; all threads are evenly distributed over four LWPs inside a single LPC • Complicated control structures and dynamic data structures • Uses separate synchronized loads and stores • Purpose: characterize the effectiveness of the ULWP in reducing the cost of synchronization
SAT Solver/zChaff • SAT: the Boolean satisfiability problem (from propositional logic) • Fundamental to many problems in automated reasoning, CAD, CAM, machine vision, databases, robotics, IC design, computer architecture, and network design • Given a Boolean formula (usually in CNF), check whether there exists an assignment of truth values to its variables such that the formula evaluates to true • Example: for the CNF formula [formula shown on the slide], if x1 is true and x3 is false, then all three clauses are satisfied regardless of the value of x2 • zChaff, a modern variant of the DPLL algorithm, is used to implement the SAT solver
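To make the satisfiability check concrete, here is a small CNF evaluator with a brute-force search. The three-clause formula in `example_formula` is hypothetical: the slide's own formula is not available, so this one was merely chosen to share its stated property (x1 true and x3 false satisfy every clause regardless of x2). The exhaustive search is for illustration only; it is not DPLL and not how zChaff works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_LITS = 3 };

/* A clause is a disjunction of literals: +v means x_v, -v means NOT x_v. */
typedef struct {
    int    lits[MAX_LITS];
    size_t len;
} clause;

/* Hypothetical example: (x1 OR x2) AND (NOT x3 OR x2) AND (x1 OR NOT x3). */
static const clause example_formula[3] = {
    {{1, 2}, 2}, {{-3, 2}, 2}, {{1, -3}, 2}
};

/* assignment[v] is the truth value of x_v; index 0 is unused. */
static bool clause_satisfied(const clause *c, const bool *assignment) {
    for (size_t i = 0; i < c->len; i++) {
        int lit = c->lits[i];
        bool value = assignment[lit > 0 ? lit : -lit];
        if ((lit > 0) == value) {
            return true; /* one true literal satisfies the clause */
        }
    }
    return false;
}

/* A CNF formula is true only if every clause is satisfied. */
static bool cnf_satisfied(const clause *f, size_t nclauses, const bool *assignment) {
    for (size_t i = 0; i < nclauses; i++) {
        if (!clause_satisfied(&f[i], assignment)) {
            return false;
        }
    }
    return true;
}

/* Try all 2^nvars assignments (small nvars only). On success the satisfying
 * assignment is left in assignment[1..nvars]; it needs nvars + 1 entries. */
static bool cnf_satisfiable(const clause *f, size_t nclauses,
                            unsigned nvars, bool *assignment) {
    for (unsigned long mask = 0; mask < (1UL << nvars); mask++) {
        for (unsigned v = 1; v <= nvars; v++) {
            assignment[v] = ((mask >> (v - 1)) & 1UL) != 0;
        }
        if (cnf_satisfied(f, nclauses, assignment)) {
            return true;
        }
    }
    return false;
}
```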
N-Queens • Successfully exploits all the parallelism • Completely dynamic, ideal speedup • Saturation is due only to the small data set • Good performance can be achieved on conventional SMPs, but only with great extra effort
Competing Agents • The EMS handler is the bottleneck under high contention • The heterogeneous architecture can achieve unbounded scalability • High contention is no longer a problem in the heterogeneous architecture
SAT Solver/zChaff on Conventional SMPs • The parallel implementation leads to performance degradation • The more processors, the worse the performance • Very hard to achieve good performance on conventional SMPs • Data from "Parallel Multithreaded Satisfiability Solver: Design and Implementation" by Yulik Feldman et al., Intel
SAT Solver/zChaff on Heterogeneous Architecture • Ideal speedup • Saturation is due only to the small data set • Successfully exploits all the parallelism • [Figure: speedup over the serial version]
Outline • Heterogeneous Lightweight Multithreaded Architecture • Simulation environments, benchmarks and results • Conclusions and future work
Conclusions • The Heterogeneous Lightweight Multithreaded Architecture • Is a good solution for irregular problems that are hard or impossible to parallelize on conventional SMPs • Has very low overhead for context switching and synchronization • Can hide latencies and contention • Can provide unbounded multithreading and scalability • Can exploit all the parallelism available in an irregular problem
Future Work • Provide standard language support • Benchmark suites • Large-scale system performance • Comparison with conventional large-scale systems
Acknowledgments • DARPA • This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under its Contract No. NBCH3039003. • University of Notre Dame • Caltech/JPL • Cray