A Distributed Colouring Algorithm for Control Hazards in Asynchronous Pipelines Qianyi Zhang School of Computer Science, University of Birmingham (Supervisor: Dr Georgios Theodoropoulos)
Outline
• Asynchronous Hardware
• Handling the Control Hazards Problem
  • In a pipeline architecture
  • In asynchronous hardware
  • In a multistage asynchronous pipeline
• A Generic Distributed Colouring Solution
  • Multi-colour vector
  • Function of each pipeline stage and the arbitration unit
• A Constructive Proof
• Some Results
• Summary
The Problem of Synchronisation
[Figure: Sender and Receiver]
• Digital system: a collection of subsystems performing different computations and communicating to exchange information.
• Before a communication transaction the subsystems need to synchronise: they wait for a common control state to be reached, which guarantees the validity of the data exchanged.
• Synchronous: a global clock defines the points in time when communication can take place (time driven).
• Problems:
  • Clock Skew
  • Power Skew
  • Modularity
  • Performance
Asynchronous Logic
[Figure: Sutherland’s micropipeline; 2-phase handshake protocol]
• An alternative digital design philosophy: it allows each subsystem to operate at its own rate and to communicate with its peers only when it needs to exchange information.
• Synchronisation is achieved by the communication protocol: local request and acknowledge signals provide information regarding the validity of the data signals.
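The request/acknowledge exchange can be sketched in Python as a two-phase (transition-signalling) handshake, in which every toggle of a wire counts as an event. This is a minimal software model, not a circuit description: the class `TwoPhaseChannel` and its method names are hypothetical and chosen only for illustration.

```python
class TwoPhaseChannel:
    """Toy model of a two-phase handshake on one channel (illustrative only)."""

    def __init__(self):
        self.req = 0      # request wire, toggled by the sender
        self.ack = 0      # acknowledge wire, toggled by the receiver
        self.data = None  # bundled data, valid while req != ack

    def can_send(self):
        # The sender may start a new transfer once the last one is acknowledged.
        return self.req == self.ack

    def send(self, value):
        if not self.can_send():
            raise RuntimeError("previous transfer not yet acknowledged")
        self.data = value  # data must be stable before the request event
        self.req ^= 1      # any transition on req means "data valid"

    def data_valid(self):
        return self.req != self.ack

    def receive(self):
        if not self.data_valid():
            raise RuntimeError("no pending request")
        value = self.data
        self.ack ^= 1      # any transition on ack means "data consumed"
        return value


channel = TwoPhaseChannel()
received = []
for word in ["SUB", "AND", "OR"]:
    channel.send(word)                   # sender: place data, toggle req
    received.append(channel.receive())   # receiver: take data, toggle ack
print(received)
```

No clock appears anywhere in the model: the sender is held back only by the missing acknowledge, which is the sense in which synchronisation is local to the two communicating parties.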
Asynchronous Logic
• Asynchronous design techniques have been explored since the 1950s but failed to become mainstream: it is difficult to enforce specific orderings of operations and to deal with circuit hazards and dynamic states.
• The last decade has witnessed a resurgence of interest in Asynchronous Logic:
  • Solution to the clock skew problem: no clock = no skew!
  • Potential for low power: circuit components are activated only when necessary.
  • Potential for higher performance: lower power allows increased supply voltages; average-case optimisation.
  • Potential for better technology migration: modularity.
  • Better EMC: generates low, uncorrelated Electro-Magnetic Interference.
• However, it also brings new problems:
  • It may result in larger circuits: REQ and ACK signals.
  • Circuits are more difficult to design, and their behaviour is harder to understand.
Handling Control Hazards
[Figure: five-stage pipeline (IF, ID, EX, MEM, WB) with Ins Mem, Register Bank, ALU and Data Mem, holding the instruction sequence SUB $2, $1, $3; AND $3, $2, $4; OR $4, $1, $2; JR $25; SW $15, 100($2)]
• The control hazards problem in a pipelined architecture
  • Control hazards arise when an instruction such as a branch or a jump, or the occurrence of an unpredictable event such as an exception, changes the flow of control.
  • In a pipeline architecture, the prefetched instructions following a hazard must be removed from the pipeline before the new stream arrives.
  • The processor must be able to distinguish between instructions originating from the branch or exception target and instructions already prefetched.
Handling Control Hazards
[Figure: pipeline with Ins Mem, Register Bank, ALU and Data Mem, holding the instruction sequence SUB $2, $1, $3; AND $3, $2, $4; OR $4, $1, $2; JR $25; SW $15, 100($2)]
• Control hazards in synchronous vs asynchronous hardware
  • Synchronous: the depth of prefetching is defined by the clock cycles and is therefore deterministic.
  • Asynchronous: the exact number of prefetched instructions is nondeterministic and therefore unpredictable. The depth of prefetching depends on the precise point at which the branch or the exception takes place.
• Need a new strategy!
Using “Colour”
[Figure: instruction queue (New stream …; AND $3, $2, $4; OR $4, $1, $2; JR $25) with colour-bit labels]
• Technique devised for the AMULET1 processor (Manchester).
• When a control hazard occurs, the colour of the processor changes.
• Each instruction address issued to memory carries the latest operating colour of the processor, which is used to mark the corresponding fetched instruction.
• The colour bit of an instruction arriving at the datapath for execution is compared with the current colour of the processor; if they do not match, the instruction is discarded.
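The single-colour scheme described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions, not the AMULET1 implementation: the class `SingleColourProcessor`, its method names and the placeholder string `"target instruction"` are all hypothetical.

```python
class SingleColourProcessor:
    """Toy model of a processor with one colour bit (illustrative only)."""

    def __init__(self):
        self.colour = 0  # current operating colour of the processor

    def fetch(self, instruction):
        # The address issued to memory carries the current colour, and the
        # fetched instruction comes back marked with that colour.
        return (instruction, self.colour)

    def control_hazard(self):
        # A branch, jump or exception changes the colour of the processor.
        self.colour ^= 1

    def should_execute(self, tagged):
        _, colour = tagged
        return colour == self.colour


cpu = SingleColourProcessor()

# Instructions prefetched before the hazard are marked with colour 0.
prefetched = [cpu.fetch(i) for i in ["AND $3, $2, $4", "OR $4, $1, $2"]]

cpu.control_hazard()  # e.g. a jump such as JR $25 takes effect

# Instructions of the new stream are marked with the new colour.
new_stream = [cpu.fetch("target instruction")]

executed = [t[0] for t in prefetched + new_stream if cpu.should_execute(t)]
print(executed)
```

The processor never needs to know how many instructions were prefetched: any number of stale instructions is filtered out by the colour comparison alone.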
Control Hazard in Multiple Stages
• One colour bit is not enough
  • How many colours do we need?
  • How do we arbitrate if more than one stage sends a request simultaneously?
• Two basic observations
  • The state of the system is distributed.
  • Stages that are deeper in the pipeline have higher priority than the stages before them: a control transfer event that occurs at a pipeline stage renders events that may occur in earlier pipeline stages irrelevant and invalid, even if the latter precede the former in time.
A Generic Distributed Technique
[Figure: pipeline stages S1, S2, S3, S4]
• A colour vector with priorities:
  • One colour bit per stage.
  • A vector c = (c1, c2, c3, …, cn) in the set C^n, where C = {0, 1} is the set of colours, n is the number of stages in the pipeline and ci is the colour of stage i.
  • Priority of ci > priority of cj for i > j.
• Two arbitrations are made:
  • An Address Arbitration Unit (AAU): rejects invalid control hazard requests.
  • Each stage: discards the prefetched instructions following the hazard.
A Generic Distributed Technique
• The Address Arbitration Unit
  • Operates as an autonomous unit, issuing instruction addresses to memory as they arrive from the Program Counter (normal operation) or from the pipeline stages (when a control hazard occurs).
  • Keeps a record of the colour state of the processor (vector c).
  • If a new transfer address arrives from stage Sk:
    • If any higher-priority colour bit (cj where j > k) in the address differs from the corresponding colour bit of the AAU, it rejects the address.
    • Otherwise it lets the address through and updates its own copy of vector c.
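The AAU rule can be sketched in Python as follows. This is one possible reading under stated assumptions, not the Balsa model: vectors are written as lists [c1, …, cn]; "updates own copy" is taken to mean adopting the vector carried by the accepted address; and the requesting stage is assumed to have already toggled its own bit in that vector. The class `AddressArbitrationUnit`, the helper `run` and the example vectors are hypothetical.

```python
class AddressArbitrationUnit:
    """Toy model of the AAU accept/reject rule (illustrative only)."""

    def __init__(self, n_stages):
        self.colours = [0] * n_stages  # colours[i - 1] holds c_i

    def request(self, k, colours):
        """Handle a transfer address from stage S_k tagged with `colours`."""
        # Compare only the higher-priority bits c_j with j > k.
        if colours[k:] != self.colours[k:]:
            return False               # reject the address
        # Assumption: the update adopts the vector carried by the address.
        self.colours = list(colours)
        return True                    # let the address through


def run(requests):
    """Feed (stage, vector) requests to a fresh 4-stage AAU, in order."""
    aau = AddressArbitrationUnit(4)
    outcomes = [aau.request(k, vec) for k, vec in requests]
    return outcomes, aau.colours


from_s2 = (2, [0, 1, 0, 0])  # S2 signals a hazard (illustrative values)
from_s4 = (4, [0, 0, 0, 1])  # S4 signals a hazard (illustrative values)

print(run([from_s4, from_s2]))  # S4 arrives first, so S2 is rejected
print(run([from_s2, from_s4]))  # S2 arrives first, then S4 overrides it
```

In both arrival orders the AAU ends with the vector of the deeper stage, which illustrates the priority observation: the request from S4 prevails whether it precedes or follows the request from S2.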
A Generic Distributed Technique
[Figure: pipeline stages S1, S2, S3, S4]
• Each stage Sk in the pipeline
  • Keeps a record of the colour state of the processor (vector c), which it reads from the instructions as they pass through.
  • For each new instruction that arrives:
    • If any higher-priority colour bit (cj where j > k) in the instruction differs from the corresponding colour bit of the stage, it lets the instruction through and updates its own copy of vector c.
    • Otherwise:
      • If its own colour bit differs, it rejects the instruction.
      • Otherwise it executes the instruction.
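The per-stage rule can be sketched in the same way. This is a minimal Python model under stated assumptions: vectors are lists [c1, …, cn]; the class `Stage`, the outcome strings and the method `raise_hazard` are hypothetical; and the slides do not spell out what a stage does when it detects a hazard itself, so toggling its own bit is an assumption made here by analogy with the single-colour scheme.

```python
class Stage:
    """Toy model of the filtering rule at pipeline stage S_k (illustrative)."""

    def __init__(self, k, n_stages):
        self.k = k
        self.colours = [0] * n_stages  # colours[i - 1] holds c_i

    def raise_hazard(self):
        # Assumption: the stage toggles its own colour bit and sends the
        # resulting vector along with the transfer address.
        self.colours[self.k - 1] ^= 1
        return list(self.colours)

    def on_instruction(self, colours):
        k = self.k
        # A higher-priority bit (c_j, j > k) differs: let the instruction
        # through and update the local copy of the vector.
        if colours[k:] != self.colours[k:]:
            self.colours = list(colours)
            return "pass"
        # Own bit differs: the instruction belongs to a cancelled stream.
        if colours[k - 1] != self.colours[k - 1]:
            return "discard"
        return "execute"


s2 = Stage(2, 4)
log = [s2.on_instruction([0, 0, 0, 0])]     # matches the stage's vector
s2.raise_hazard()                           # S2 now holds 0100
log.append(s2.on_instruction([0, 0, 0, 0])) # stale prefetched instruction
log.append(s2.on_instruction([0, 1, 0, 0])) # first instruction of new stream
log.append(s2.on_instruction([0, 1, 0, 1])) # a deeper stage changed colour
print(log)
```

Only bits at positions k and above are examined, so a colour change made by an earlier stage does not by itself cause Sk to discard anything.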
A Constructive Proof (1)
[Figure: animation frames showing 4-bit colour vectors at the pipeline stages and the AAU; a request with vector 0100 is in flight and the AAU vector changes from 0000 to 0100.]
A Constructive Proof (2)
[Figure: animation frames showing 4-bit colour vectors at the pipeline stages and the AAU; requests with vectors 0100 and 0001 are in flight and the AAU vector starts at 0000 and ends at 0001.]
An Integrated Framework for Formal Verification and Distributed Simulation of Asynchronous Hardware
EPSRC Project No. GR/S11091/01 & GR/S11084/01
£380,000+ for 3 years starting April 2003
• Objectives:
  • Exploit compositionality of designs to enable automatic support for refinement checking, equivalence checking and deadlock detection.
  • Investigate applicability of data independence as a means to automate datapath abstraction and verification of parameterised component descriptions.
  • Investigate applicability of semi-formal techniques in the context of asynchronous hardware.
  • Develop algorithms and techniques for partitioning, load balancing, synchronisation and monitoring to support the distributed simulation.
  • Develop a prototype CSP-oriented integrated environment for the specification, distributed simulation and formal verification of asynchronous VLSI systems.
  • Develop test cases and conduct experiments to test and evaluate our approach.
Evaluation
• SAMIPS: a synthesisable asynchronous implementation of the MIPS R3000 processor core
  • Instruction set compatible with the R3000
  • 5-stage pipeline datapath
  • With precise exceptions
• Balsa:
  • A synthesis tool for asynchronous hardware, developed by the AMULET group
  • An asynchronous hardware description language based on CSP
  • A discrete event simulator at RTL level
  • A compiler to gate-level netlists
Evaluation
• The Balsa model of S1
• The Balsa model of AAU
[Tables not reproduced: Cost comparison with SAMIPS; Cost Estimation of Stages]
Summary and Future Work
• A distributed colouring algorithm for dealing with control hazards in asynchronous pipelines.
• The main advantages:
  • It provides flexibility in designing the pipeline of the processor, enabling prefetching at any depth.
  • It introduces little extra cost in terms of silicon area.
• This approach has just been integrated into SAMIPS and has proved functionally correct.
• We will evaluate the performance of this approach and the overhead it imposes in terms of time and power.