Scenario-Oriented Design for Single Chip Heterogeneous Multiprocessors

JoAnn M. Paul, Electrical and Computer Engineering Department, Carnegie Mellon University, Pittsburgh, PA • Presented by: Mohammad Farsakh

What is this paper about? • Challenges of selection, programming, and coordination in future single chip computer designs • Consisting of processing elements (PEs) • Of heterogeneous types • Outlines the differences between • Next generation single chip system designs • Traditional designs • Focuses on the Scenario-Oriented Design (SOD) strategy • Applications, schedulers, and hardware viewed as a system • Leveraging one against the other • Reducing the modeling detail of each design domain within a system in high level simulation

Introduction • The design process for digital computation has three categories • Models • Tools • Strategies • Existing models, tools, and strategies fail to let designers of single chip computers efficiently capture the design space at hand • Tools do not capture software as part of the system model • Instruction Set Simulators (ISSs) are too detailed and too slow to capture systems with many processors • Designers are left to their own devices, which limits the effective realization of many potentially significant designs • Models, tools, and strategies for both the software and hardware of single chip heterogeneous multiprocessors are required

Introduction (cont) • Future individual, programmable processors will be like registers in a larger framework • Processor blocks will be differentiated by • The capabilities of the hardware in the processors • The way they are programmed • Their manner of interconnection • Should maximize the ratio DQ/DE for a given design to initiate the next design level • DQ = Design Quality • DE = Development Effort • A common basis for design at the modeling level is required to manipulate the design decisions that have the most impact
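
As a minimal sketch of the DQ/DE idea, the snippet below ranks candidate design decisions by their ratio of quality gain to development effort. The class name, the option names, and all numbers are hypothetical and are not taken from the paper; they only illustrate picking the decision with the most impact per unit of effort.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignOption:
    """A candidate design decision (hypothetical illustration)."""
    name: str
    quality_gain: float  # DQ, in arbitrary, assumed units
    effort: float        # DE, in arbitrary, assumed units

    @property
    def leverage(self) -> float:
        # The ratio DQ/DE that the slide says should be maximized.
        return self.quality_gain / self.effort


def best_option(options):
    """Return the option with the highest DQ/DE ratio."""
    return max(options, key=lambda o: o.leverage)


# Made-up numbers for illustration only.
options = [
    DesignOption("tune one PE microarchitecture", 4.0, 8.0),
    DesignOption("change scheduler policy", 3.0, 2.0),
    DesignOption("re-partition work across PEs", 6.0, 6.0),
]
```

With these assumed values, `best_option(options)` selects the scheduler change, since it has the largest DQ/DE even though its absolute quality gain is the smallest.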

Programmable Heterogeneous Multiprocessors (PHMs) • Collections of PEs must be considered programmable • The chip is a programmable collection of processors grouped dynamically • Heterogeneous multiprocessors on a single chip pose different design challenges, for three primary reasons: • A single chip is a finite resource, unlike wide-area networks • The design will be semi-custom • With underlying hardware more customized to the application space than in traditional programmable systems • Traditional heterogeneous multiprocessors provide transaction-like services on a diverse collection of resources • Single chip devices such as SoCs are customized to meet fixed latency requirements as a reactive system • PHMs will be semi-custom and have aspects of both design styles • Coordination of system resources is required • The large differential between on-chip and off-chip communications will force efficient utilization and management of on-chip system resources, including processing elements, memory, communications bandwidth, and chip I/O

Design Environment of Single Chip PHM • H: the single chip heterogeneous multiprocessor • Data Inputs • DP: time-stamped system inputs that are conceptually presented to the system hardware on I/O pins • DM: data values that reside in some external memory, analogous to jobs, packets, or other requests in a queue waiting to be processed by H • Programs • BC: clocked benchmark programs with fixed latency requirements, where the required latency is specified against a fixed time reference. Designed to meet the worst-case demands presented to the system by DP. These programs have fixed performance requirements • BI: programmatic input benchmarks whose performance is calculated from the internal timing of the design's processing capabilities. These run over many PEs • BX: scheduler programs that act as a means of resolving the other benchmarks to the architecture

Design Output • A single output, Q, holds the quality metric of the design, including the performance for the two classes of behaviors (BC and BI) • General form of such an environment: E = {D, B, H, Q} • In the case of E = {BC, DP, Q} • Pass/fail quality metric • Fully specified by DP and BC, with no separately performance-evaluated architecture • Hardware Description Language (HDL) • In the case of E = {BC, BX, DP, H, Q}, where H is a single processor • Pass/fail quality metric • The kind of analytical modeling typical of research in real-time operating systems (RTOSs) • RTOS • In the case of E = {BI, DM, H, Q} • H is a single processor executing at the instruction set simulator level or below • Typical of simulators such as SimpleScalar, used to model a microarchitecture or ISA • Simple ISS • Complexity of the application space • The current-day approach, ISS, cannot permit effective exploration of the design space • A complete level of detail is required in the model • It takes a long time to generate any single value of Q
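
The three special cases of E listed above can be sketched as a small lookup. This is only an illustration of the slide's taxonomy: the function name, the string labels, and the `single_processor` flag are assumptions made for the example, and anything outside the three listed cases is simply reported as "other".

```python
# Element sets for the three special cases named on the slide.
HDL_LIKE = frozenset({"BC", "DP", "Q"})
RTOS_LIKE = frozenset({"BC", "BX", "DP", "H", "Q"})
ISS_LIKE = frozenset({"BI", "DM", "H", "Q"})


def classify_environment(elements, single_processor=True):
    """Map the elements present in E to the design style the slide names.

    `elements` is any iterable of the symbols BC, BI, BX, DP, DM, H, Q.
    `single_processor` is a hypothetical flag saying whether H is one
    processor, which the RTOS and simple-ISS cases both require.
    """
    env = frozenset(elements)
    if env == HDL_LIKE:
        return "HDL"
    if env == RTOS_LIKE and single_processor:
        return "RTOS"
    if env == ISS_LIKE and single_processor:
        return "simple ISS"
    # Everything else, including the full PHM environment, falls through.
    return "other"
```

For example, an environment with every element present and a multiprocessor H, `classify_environment({"BC", "BI", "BX", "DP", "DM", "H", "Q"}, single_processor=False)`, is not one of the three traditional cases, which is the gap the slides argue existing tools leave open.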

Scenario-Oriented Design • A novel design strategy • Orients heterogeneous multiprocessor single chip design according to a blend of performance requirements • Implemented in new chip-wide programmer’s views • Leverages increased heterogeneity in the future application space • Results in greater efficiency in the design process and in resource utilization

Fixed performance (FP) • To meet the requirements of a system with DP and BC, current systems must be overdesigned, for two reasons: • System resource capacity is wasted, as is the time spent matching functionality to available processing power, to make sure that worst-case (WC) behavior is met • Irregular loading and data-dependent processing times leave processing resources underutilized except in peak, WC loading situations

Throughput performance (TP) • BI is designed to be a broadly representative set of program types used to evaluate and optimize a programmable device’s throughput performance (TP) • Optimizes a common case (CC) instead of ensuring that WC behaviors are met • Like network switches dropping packets that are presumed to be resent • Applies to caches, branch predictors, and OS scheduling strategies
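
The contrast between sizing for the worst case (FP) and sizing for the common case (TP) can be sketched numerically. The demand samples and the common-case capacity below are made-up values, and the three helper functions are hypothetical names introduced only for this illustration.

```python
def worst_case_capacity(demands):
    """FP-style sizing: provision for the largest demand ever observed."""
    return max(demands)


def utilization(demands, capacity):
    """Average fraction of the provisioned capacity that is actually used."""
    return (sum(demands) / len(demands)) / capacity


def overload_fraction(demands, capacity):
    """TP-style cost: fraction of requests exceeding capacity (e.g. dropped)."""
    return sum(1 for d in demands if d > capacity) / len(demands)


# Hypothetical per-request processing demands, in arbitrary work units.
demands = [2, 2, 4, 2, 10]

wc = worst_case_capacity(demands)   # capacity needed to never miss
cc = 4                              # assumed common-case capacity
```

With these assumed samples, worst-case sizing never overloads but uses only 40% of its capacity on average, while common-case sizing at 4 units runs at full average utilization and overloads on one request in five.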

Future vs. Current Designs • The two design strategies are worlds apart: • Worst case (WC) with fixed performance (FP) • Common case (CC) with throughput performance (TP) • Future single chip designs • Execute a mix of BC and BI to handle a mix of DP and DM • FP behaviors are met • CC behaviors are optimized • Currently, systems with FP-oriented and TP-oriented designs are • Separated into different devices • General purpose programming resides on the general purpose processor • Other processors use individual RTOSs to ensure WC behaviors are met, or WC behaviors are ensured by implementation in custom hardware

Layered, SOD approach to SoC Design • SOD can satisfy performance for FP functionality and provide a basis for a TP-optimized remainder architecture • Hardware architecture and a remainder architecture are co-designed • Maps the FP functionality across the entire chip, consuming part of the proposed architecture • Leverages the presence of both classes • Optimizes design time • Optimizes design quality • Measuring exact execution times for FP is not required at the start of design

SoC Hardware View • Different Processing Elements (PE) • Different functionality • Common communication channel

SoC With Remainder Architecture • Software partitioning • Each PE is divided into two parts • PE = {F-i, R-i} • COMM = {R-COMM, F-COMM} • Functional overlay, {F-i}: maps BC to processing resources • Remainder architecture carries BI: R = {R-i, R-COMM}
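
One way to picture the partition is as a capacity budget per resource: the functional overlay reserves a share for BC, and whatever is left forms the remainder architecture for BI. The sketch below assumes capacity can be expressed as a single integer per resource (here, percent of cycles or bandwidth), which is a simplification not stated in the slides; the function name and all numbers are hypothetical.

```python
def remainder_architecture(capacity, overlay):
    """Return the remainder share (R-i, R-COMM) left after the overlay.

    capacity: total budget per resource, e.g. {"PE0": 100, "COMM": 100}
    overlay:  share reserved by the functional overlay (F-i, F-COMM)
    Both use the same hypothetical unit (percent of cycles/bandwidth).
    """
    for name, reserved in overlay.items():
        if name not in capacity:
            raise KeyError(f"overlay names unknown resource {name!r}")
        if reserved > capacity[name]:
            raise ValueError(f"overlay exceeds capacity of {name!r}")
    # Resources the overlay does not touch are fully available to BI.
    return {name: total - overlay.get(name, 0)
            for name, total in capacity.items()}


# Made-up budgets for three PEs and the shared communication channel.
capacity = {"PE0": 100, "PE1": 100, "PE2": 100, "COMM": 100}
overlay = {"PE0": 60, "PE1": 25, "COMM": 30}   # F-i and F-COMM shares
remainder = remainder_architecture(capacity, overlay)
```

Here `remainder` leaves PE2 entirely to throughput-oriented work, while PE0 keeps only 40% for it after its fixed-performance share is reserved.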

Layered, SOD approach to SoC Design (cont) • New layer between R-i and F-i • Enlarges the boundary between performance group partitions • Reduces design time • The FP functionality mapped to the chip need not be known beforehand • Optimizes TP • SOD partitioning produces a chip-wide, horizontal view • Hardware resources in the bottom layer • Schedulers in the middle layer (permit the co-operation of …) • General software in the top layer • The last two layers could have multiple internal layers • The layering concept leverages schedulers as a basis for a soft partitioning of a hardware design

Simulation Foundation: MESH • The Modeling Environment for Software and Hardware (MESH) is the simulator used as the foundation for SOD • Provides a layered modeling basis above ISS models • Uses schedulers to model concurrent, high level software running on high level models of processor resources • Resolves timing through design layers in which unrestricted software executes on hardware models without relying on an ISS

Modeling Environment for Software and Hardware (MESH) • ThLij: one of j logical threads (software) that will execute on processor i • ThPi: a model of the i-th physical resource in the system, such as a processor • UPi: a scheduler that selects logical threads intended to execute on resource ThPi • ULi: a logical scheduler that can schedule M threads to N resources, e.g., a pthread scheduler

How MESH Works • Dynamic number of logical threads • Execution is scheduled onto a single resource • Scheduling decisions are based on the state of the threads and other system state • Resolves the logical events of the software threads to physical timing • Schedulers serve two roles: • Modeling scheduling decisions • Resolving logical computation to physical time • Complex systems have many resources (ThPi) • Two dimensions of scheduling: • Based on physical time • Based on logical state • M threads may be dynamically mapped to N resources • 2.5 times faster than an internal ISS-level simulator
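
The two scheduler roles described above (making a scheduling decision, then resolving logical computation to physical time) can be sketched with a toy model. This is not the MESH API: the class names, the round-robin policy, the notion of "work units" per logical event, and the resource rate are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class LogicalThread:
    """Stand-in for a logical thread ThL: a queue of logical events,
    each annotated with an assumed amount of work."""
    name: str
    events: deque


@dataclass
class PhysicalResource:
    """Stand-in for a physical resource ThP with an assumed speed."""
    name: str
    rate: float        # work units completed per unit of physical time
    now: float = 0.0   # physical time consumed so far on this resource


def run_round_robin(resource, threads):
    """Toy stand-in for a scheduler UP on one resource.

    Role 1: choose the next logical thread (round robin here).
    Role 2: resolve its next logical event to physical time.
    Returns (thread name, physical finish time) for every event.
    """
    ready = deque(t for t in threads if t.events)
    trace = []
    while ready:
        thread = ready.popleft()                 # scheduling decision
        work = thread.events.popleft()           # next logical event
        resource.now += work / resource.rate     # logical -> physical time
        trace.append((thread.name, resource.now))
        if thread.events:
            ready.append(thread)                 # more events: requeue
    return trace


def demo():
    pe = PhysicalResource("PE0", rate=2.0)
    a = LogicalThread("A", deque([4, 2]))
    b = LogicalThread("B", deque([6]))
    return run_round_robin(pe, [a, b])
```

In this sketch the software threads never see a clock; only the scheduler advances physical time, using the resource's rate, which mirrors the idea of resolving timing without an instruction-level model.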

Conclusion • Challenges for future designs • Performance • Power • Chip size • Future computer designs should be evaluated as a system • SOD: a strategy that results from considering applications, schedulers, and hardware as they interact to form a system • Leveraging each against the others • Reducing modeling detail

Thanks
