Design of an Asynchronous Reconfigurable Cell for Conformal Computing. Mariam Hoseini. Advisor: Dr. Chao You. Supervisor: Dr. Mark Pavicic. Committee members: Dr. Rajendra Katti, Dr. Subbaraya Yuvarajan, Dr. Deying Li. North Dakota State University, April 2009.
Agenda • Conformal Computing • Asynchronous circuit design • Handshake protocols • Data encodings • Signaling protocol • Asynchronous design methodologies • Asynchronous primitives • Constructing an array of cells • PCC cell design and simulations • Conclusion
Conformal Computing (1/3) • Computers are typically rigid boards or boxes with a fixed computational capability. • Available computers may have an unsuitable size or shape, or less computing capability than is needed. • The program investigates a more flexible form of computer that easily conforms to the physical and computational needs of an application. • Potential applications: • Sorting, cryptography, cellular neural nets, etc. • The computational material can be integrated with arrays of sensors and/or actuators.
Conformal Computing (2/3) • Potential problems: • Easily changing the physical shape of the computer • Adjusting the computational capability • Propagation delays, synchronization, power distribution, and heat dissipation • One approach: • Form extensible arrays of simple reconfigurable computing elements (cells) into thin, wallpaper-like sheets. • Long signal wires are eliminated. • Communications are local and synchronized with cell-to-cell pulses. • This research presents a cell design called a pulsed conformal computer cell (PCC cell).
Conformal Computing (3/3) • The PCC cell has significant similarities to cellular automata (CA): • Simple fine-grained elements • Integration of processing and storage • Local communication • CA can model the elements of digital computers, using patterns of cells to perform the functions of wires, logic, and registers. • The same model is used in the PCC cell design. • The function and connections of a PCC cell are reconfigurable, similar to FPGAs, but: • FPGAs are not as fine-grained • FPGAs are not as regular • The PCC cell array uses only short-range wires that connect adjacent cells.
Asynchronous Circuit Design • Two major styles of circuit design: synchronous and asynchronous • Advantages of asynchronous design, in terms of: • Clock skew • Speed • Metastability • Modularity • Power • Disadvantages of asynchronous design: • It is more difficult to design for hazard-free behavior and a correct ordering of operations. • Additional hardware is needed to initiate, advance, and indicate the completion of operations. • Asynchronous systems are specified by their handshake protocol, data encoding, and underlying delay model.
Handshake Protocols • Handshaking is the alternative to clocking in asynchronous systems. • Data transfer between two processes is synchronized with signals that are generated by the same processes. • Asynchronous operation can also be done without handshaking. • Handshaking is used to separate successive uses of a component. • It may not be necessary to separate the uses of a component, or the separation can be achieved by delaying the operations. • Handshaking can be done at higher levels in an asynchronous system.
Data Encodings • Bundled data: • Normal Boolean levels encode data values • Separate request and acknowledge wires are used • Dual rail: • Two wires are used to carry a single bit • The request is encoded in the dual-rail data wires • Dual-rail data encoding is used in the PCC cell design
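The dual-rail idea above can be illustrated with a minimal Python sketch. The slides only state that two wires carry one bit and that the request is folded into the data wires; the particular rail assignment used here, (1, 0) for a '1', (0, 1) for a '0', and (0, 0) for idle, is an assumed convention, and the function names are hypothetical.

```python
# Dual-rail sketch: each bit travels on two wires (rail_true, rail_false).
# The rail assignment is an assumed convention, not taken from the slides.
IDLE = (0, 0)  # neither rail asserted: no data present


def encode(bit):
    """Return the (rail_true, rail_false) pair for one data bit."""
    return (1, 0) if bit else (0, 1)


def is_valid(rails):
    """Data is present when exactly one rail is asserted.

    This validity check is what stands in for a separate request wire.
    """
    return rails in ((1, 0), (0, 1))


def decode(rails):
    """Recover the bit; reject the idle state and the illegal (1, 1) state."""
    if not is_valid(rails):
        raise ValueError(f"no valid data on rails {rails}")
    return rails[0]


if __name__ == "__main__":
    for bit in (0, 1):
        rails = encode(bit)
        print(bit, rails, is_valid(rails), decode(rails))
    print("idle carries data?", is_valid(IDLE))
```

Because validity is derived from the data wires themselves, a receiver needs no extra request line, which is the property the slide attributes to dual-rail encoding.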
Signaling Protocol • Pulse signaling: • Each request and acknowledge is a pulse • Simple, with a short cycle, like transition signaling • Deals with levels, like level signaling • Better noise immunity than single-track signaling • Potential problem: robustness of sending pulses over long wires. • Pulse signaling is used in the PCC cell design, which has no long wires, so this problem does not arise. [Figure: one pulse-signaling cycle, from start event to event done, with Request and Acknowledge pulses]
Asynchronous Design Methodologies (1/2) • Bounded delay • Simplest model • Delays of circuit elements and wires are assumed to be known or bounded. • Delay insensitive (DI) • Both gates and wires have unbounded and unknown delays. • A completion detection mechanism is needed at the receiver. • Quasi delay insensitive (QDI) • DI + isochronic forks = QDI • Isochronic forks are capable of indication • All input transitions should be indicated by an output signal transition [Figure: fork with nodes A, B, C and delays d1, d2, d3]
Asynchronous Design Methodologies (2/2) • In an asynchronous system, the interfaces and the internals of modules can be designed with different timing models. • In the PCC cell design, for timing management: • The inside of a cell is governed by a bounded delay model • Communication between cells follows a QDI model
Asynchronous Primitives (1/2) • In synchronous systems, Boolean circuits can be constructed from a primitive such as a NAND gate. • Logic gates provide only logic functionality, not timing functionality, so they are not sufficient to build asynchronous circuits. • Asynchronous systems can be built from a set of primitives. • The set of primitives must provide both universal logic and timing functionality. • Different sets of primitives have been introduced, such as Keller’s, Patra’s, and Lee’s.
Asynchronous Primitives (2/2) The set of primitives used in a PCC cell: • Wire • Transfers the output of one component to the input of another. • Fork • The output of one component is the input to several components. • Merge • Sends one of its inputs to the output. • Join • Synchronizes data from several independent components. [Figure: symbols for the four primitives, with inputs I, I1, I2 and outputs O, O1, O2]
Constructing an Array of Cells (1/2) • An array of cells each having a simple one-bit processing unit • Von Neumann neighborhood for local connections • A routing problem occurs: • A possible solution:
Constructing an Array of Cells (2/2) • Another approach is to combine every two cells into a double cell • The same routing capability with fewer neighboring connections • A further step is to group 4 cells together into a quad cell • The same routing capability with simple connections to the 4 nearest neighbors
PCC Cell Design • Logic Unit Design • Synchronization • Pulse Regenerator • Top Level Design • Configuration Circuitry • PCC Cell Simulations • One-bit full adder • Ring oscillator • Shift register • Implementing Pipelines
Logic Unit (1/3) • There is a logic unit (LU) and an output register in each quarter • Each LU has two inputs and one output
Logic Unit (2/3) • Dual-rail inputs • Dual-rail outputs • Switches must be set before the inputs arrive • 8 switches define a function • 16 functions • Pull-down resistors avoid floating nodes
Logic Unit (3/3) • AND function • D, E, F, G are “0001”
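The way four configuration bits select one of 16 two-input functions can be sketched as a lookup table in Python. The slides give only the AND setting "0001" for D, E, F, G; the mapping assumed here, with D, E, F, G as the outputs for inputs (0,0), (0,1), (1,0), (1,1), is a guess that happens to be consistent with that setting, and `make_lu` is a hypothetical name.

```python
from itertools import product


def make_lu(config):
    """Build a two-input logic function from a 4-character bit string "DEFG".

    Assumed mapping (not confirmed by the slides): D, E, F, G are the outputs
    for inputs (a, b) = (0, 0), (0, 1), (1, 0), (1, 1) respectively.
    """
    if len(config) != 4 or set(config) - {"0", "1"}:
        raise ValueError("config must be four characters, each 0 or 1")
    table = [int(c) for c in config]
    return lambda a, b: table[2 * a + b]


# The AND setting given on the slide.
and_gate = make_lu("0001")

# Four bits give 2**4 = 16 distinct two-input functions.
all_configs = ["".join(bits) for bits in product("01", repeat=4)]

if __name__ == "__main__":
    for a, b in product((0, 1), repeat=2):
        print(a, b, and_gate(a, b))
    print(len(all_configs), "configurations")
```

Under the same assumed mapping, other settings follow directly, for example "0110" for XOR and "1110" for NAND.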
Primitives (1/2) • Wire: an output pulse triggers the LU inputs of the neighboring cell in the same direction. • Merge: realized by 2:1 Muxes; pulses make right (90-degree) turns. • Fork: each turn triggers a neighboring quarter and also a neighboring cell, so a single computation forks into multiple parallel computations.
Primitives (2/2) • Join: a completion detection circuit • All participating quarters must have their LU outputs ready • Complements a fork by combining multiple parallel computations into a single computation • QDI communications
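The Join behavior described above can be sketched as a completion check over dual-rail outputs. This is a behavioral illustration only, not the cell's circuit: it assumes that a quarter's output is "ready" when exactly one of its two rails is asserted and that a per-quarter flag marks participation, and the name `join_ready` is hypothetical.

```python
def join_ready(quarter_outputs, participating):
    """Return True once every participating quarter has a valid output.

    quarter_outputs: list of (rail_true, rail_false) pairs, one per quarter.
    participating:   list of booleans, one per quarter (an assumed stand-in
                     for the Join configuration of the cell).
    With no participating quarters the result is trivially True.
    """
    def valid(rails):
        # Dual-rail data is present when exactly one rail is asserted.
        return rails in ((1, 0), (0, 1))

    return all(valid(r) for r, p in zip(quarter_outputs, participating) if p)


if __name__ == "__main__":
    mask = [True, False, True, False]
    # Only the first participating quarter has produced data: not ready.
    print(join_ready([(1, 0), (0, 0), (0, 0), (0, 0)], mask))
    # Both participating quarters have produced data: ready.
    print(join_ready([(1, 0), (0, 0), (0, 1), (0, 0)], mask))
```

Waiting for all participants before proceeding is what lets the communication tolerate arbitrary delays on the individual paths.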
Timing models of Internal Forks • Fork1 • Only when a pulse turns • LU should use only the turned pulse • Fork2 & Fork4 • No timing assumptions • Fork3 & Fork5 • Bounded delay model
Pulse Regenerator (PRG) • When a pulse travels through many cells, the width of the pulse may increase or decrease. • A pulse that is too short may not be detected at all; a pulse that is too long may run into other pulses. • A PRG produces an output pulse with a constant width, independent of the width of the input pulse. • D1 is the delay by which the input pulse is stretched • D2 determines the width of the output pulse [Figure: PRG circuit with delay elements D1 and D2 and nodes A, B, C, D, E]
Top Level Design (1/2)
Top Level Design (2/2) • In a PCC cell: (W/L)p / (W/L)n ≈ 1.6 • In an inverter: • Equivalent resistance of a MOS transistor: R ∝ L/W • To match PMOS and NMOS resistances: (W/L)p / (W/L)n = 3 ~ 3.5 • tpHL = 0.69 · Rn · CL and tpLH = 0.69 · Rp · CL, so if Rn = Rp then tpHL = tpLH • A bigger PMOS improves tpLH by increasing the charging current. • A bigger PMOS degrades tpHL by adding parasitic capacitance. • With matched resistances, tp = (tpHL + tpLH)/2 is not minimal. • The ratio for optimal speed equals √(Rp/Rn). • The device can be sped up by reducing the size of the PMOS.
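The sizing relations on this slide can be evaluated with a short Python sketch. It assumes that Rp/Rn in the square-root rule means the resistance ratio of equally sized PMOS and NMOS devices (the 3 to 3.5 range quoted for matching), which the slide does not state explicitly; the function names and the sample R and C values are hypothetical.

```python
import math


def optimal_width_ratio(r):
    """Speed-optimal (W/L)p / (W/L)n, taken as sqrt(Rp/Rn) per the slide.

    r is assumed to be Rp/Rn for equally sized PMOS and NMOS devices.
    """
    if r <= 0:
        raise ValueError("resistance ratio must be positive")
    return math.sqrt(r)


def propagation_delays(rn_ohm, rp_ohm, cl_farad):
    """Return (tpHL, tpLH, tp) using tpHL = 0.69*Rn*CL and tpLH = 0.69*Rp*CL."""
    t_phl = 0.69 * rn_ohm * cl_farad
    t_plh = 0.69 * rp_ohm * cl_farad
    return t_phl, t_plh, (t_phl + t_plh) / 2


if __name__ == "__main__":
    for r in (3.0, 3.5):
        print(f"Rp/Rn = {r}: speed-optimal ratio = {optimal_width_ratio(r):.2f}")
    # Hypothetical values, chosen only to exercise the formulas.
    print(propagation_delays(10e3, 10e3, 10e-15))
```

For resistance ratios in the quoted range the square-root rule gives a width ratio well below the matching ratio, which is the direction of the slide's argument for a smaller PMOS.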
Configuration (1/3) • Configuration bits (16 bits for LU switches, 8 bits for Merge Muxes, and 4 bits for Join, i.e. a total of 28 bits) must be loaded. • Only some parts of the array may need to be configured. • One solution is to form a long chain of shift registers from all the cells and configure all of them. • A better solution is to form the chain of shift registers only from the cells that need to be configured. • In each cell, a controller: • decides whether the cell should be configured • directs the bit flow to one of the cell's neighbors • stops the shift registers once all the intended cells are configured
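The benefit of chaining only the cells that need configuration can be shown with simple arithmetic. The sketch below counts only the 28 configuration bits per cell named above and ignores any controller or routing bits, which is a simplifying assumption; the array size and the number of cells to configure are hypothetical.

```python
# Per-cell configuration bits as listed on the slide.
LU_BITS, MERGE_BITS, JOIN_BITS = 16, 8, 4
BITS_PER_CELL = LU_BITS + MERGE_BITS + JOIN_BITS  # 28


def bits_to_shift(cells_in_chain):
    """Bits shifted in to fill a chain of this many cells.

    Counts configuration bits only; controller bits are ignored (assumption).
    """
    if cells_in_chain < 0:
        raise ValueError("cell count must be non-negative")
    return cells_in_chain * BITS_PER_CELL


if __name__ == "__main__":
    array_cells = 100    # hypothetical array size
    needed_cells = 12    # hypothetical number of cells to configure
    print("whole-array chain:", bits_to_shift(array_cells), "bits")
    print("selective chain:  ", bits_to_shift(needed_cells), "bits")
```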
Configuration (2/3) [Figure: configuration controller of a cell, with decoders and OR gates routing the clk and data lines among the N, E, S, and W neighbors using two-bit select codes 00, 01, 10, 11] • Figure annotations for the control fields: • Shows that the shift register is filled • Shows that the cell is the last one in the chain of shift registers • Determines whether the cell should be configured • Defines the neighbor to which the bits should be forwarded
Configuration (3/3)
PCC Cell Simulations (1/3) • The PCC cell was implemented in TSMC 250 nm CMOS using S-Edit. • The simulation was done in PSpice. • The supply voltage is 5 V. • Input pulse widths are 400 ps. • Propagation delay through a cell is 480 ps ~ 500 ps. • For better speed: slope ≤ gate propagation delay • The slope of the external inputs is 12 ps. • No overshoot or undershoot
PCC Cell Simulations (2/3) • For 20 pulses: • Voltage source = 5 V • Average current = 6 mA for 1.4 ns and 17 mA for 8.6 ns • Energy = (5 V · 6 mA · 1.4 ns) + (5 V · 17 mA · 8.6 ns) = 773 pJ
PCC Cell Simulations (3/3) • For 1 pulse (1 bit of operation): • Voltage source = 5 V, average current = 5 mA: Energy = 5 V · 5 mA · 1.5 ns = 37.5 pJ • Voltage source = 3.3 V, average current = 3 mA: Energy = 3.3 V · 3 mA · 1.8 ns = 17.8 pJ
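The energy figures on the two simulation slides follow from E = V · I · t, where volts times milliamps times nanoseconds gives picojoules. A minimal Python check of that arithmetic, using only the values quoted on the slides (the function name `energy_pj` is hypothetical):

```python
def energy_pj(voltage_v, current_ma, duration_ns):
    """Energy in picojoules: V * mA * ns = pJ."""
    return voltage_v * current_ma * duration_ns


# 20 pulses at 5 V: 6 mA for 1.4 ns, then 17 mA for 8.6 ns.
twenty_pulses = energy_pj(5, 6, 1.4) + energy_pj(5, 17, 8.6)

# A single pulse at 5 V and at 3.3 V.
one_pulse_5v = energy_pj(5, 5, 1.5)
one_pulse_3v3 = energy_pj(3.3, 3, 1.8)

if __name__ == "__main__":
    print(f"20 pulses at 5 V:  {twenty_pulses:.1f} pJ")
    print(f"per pulse (avg):   {twenty_pulses / 20:.2f} pJ")
    print(f"1 pulse at 5 V:    {one_pulse_5v:.1f} pJ")
    print(f"1 pulse at 3.3 V:  {one_pulse_3v3:.1f} pJ")
```

Dividing the 20-pulse total by 20 gives a per-pulse figure close to the single-pulse value at 5 V, so the two measurements are mutually consistent.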
One-Bit Adder • Sum = A ⊕ B ⊕ C; example: 1 ⊕ 1 ⊕ 1 = 1 • Carry = AB + BC + AC = AB + (A + B)C; example: 1·1 + (1 + 1)·1 = 1 • The sum and carry outputs are ready after 0.5 ns and 1.8 ns, respectively.
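The adder equations can be checked exhaustively in a few lines of Python. This is a software model of the Boolean expressions only, not of the cell-level implementation; it confirms that the factored carry AB + (A + B)C matches the majority form and that the outputs agree with ordinary addition.

```python
from itertools import product


def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry) for input bits a, b, c."""
    s = a ^ b ^ c                       # Sum = A xor B xor C
    carry = (a & b) | ((a | b) & c)     # Carry = AB + (A + B)C
    return s, carry


def majority_carry(a, b, c):
    """Carry in its unfactored form: AB + BC + AC."""
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    for a, b, c in product((0, 1), repeat=3):
        s, carry = full_adder(a, b, c)
        # Both carry forms agree, and the result matches integer addition.
        assert carry == majority_carry(a, b, c)
        assert a + b + c == s + 2 * carry
        print(a, b, c, "->", s, carry)
```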
Ring Oscillator (1/3) • Loops are important for many circuits, such as sequential circuits, iterative computations, and For, If, and While constructs. • The ring oscillator demonstrates two capabilities of the PCC cell: • A loop can be controlled externally (started and stopped) • By using a Join of pulses, communications can be QDI [Figure: loop started by an external ‘0’ start pulse; the output is always a ‘1’]
Ring Oscillator (2/3) • Ring oscillator implemented in an array of PCC cells [Figure: cell array with quarters labeled One, Pass, XOR, Nand, and WR] • ‘0’ pulses are shown in blue, ‘1’ pulses are shown in red • The input Mux is configured to receive a ‘0’ pulse only from outside the first cell and a ‘1’ pulse only from a turn.
Ring Oscillator (3/3) Simulation Results:
Shift Register • An input bit stream of “1010” is used.
Pipeline (1/2) • If handshaking is done for every component, the components can form a pipeline. • Each component should supply an Ack to indicate that it is available for reuse. • Delay(1) = 3X + (n - 2)·5X + 3X = (5n - 4)X [Figure: pipeline of LUs in which Ack signals are returned between LUs]
Pipeline (2/2) • Some cells do not handshake and are cascaded. The cascaded cells form one unit of a pipeline, so handshaking is done only at a higher level. • Delay(2) = 3X + (n - 2)·2X + 3X = (2n + 2)X • Delay(2)/Delay(1) = (2n + 2)X / (5n - 4)X, which approaches 2/5 for large n [Figure: cascaded LUs grouped into pipeline units, with Ack signals exchanged only between units]
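The two delay formulas and their ratio can be tabulated with a short Python sketch. Delays are expressed in units of X, and n is the count used in the slide formulas (taken here to be the number of stages, which is an assumption); the function names are hypothetical.

```python
from fractions import Fraction


def delay_full_handshake(n):
    """Delay(1) in units of X: 3 + (n - 2)*5 + 3 = 5n - 4."""
    return 3 + (n - 2) * 5 + 3


def delay_cascaded(n):
    """Delay(2) in units of X: 3 + (n - 2)*2 + 3 = 2n + 2."""
    return 3 + (n - 2) * 2 + 3


def delay_ratio(n):
    """Exact ratio Delay(2)/Delay(1) for a given n."""
    return Fraction(delay_cascaded(n), delay_full_handshake(n))


if __name__ == "__main__":
    for n in (2, 5, 10, 100, 1000):
        print(n, delay_full_handshake(n), delay_cascaded(n),
              f"{float(delay_ratio(n)):.4f}")
```

The ratio equals 1 at n = 2, where both formulas reduce to 6X, and moves toward 2/5 as n grows, so the 2/5 figure is the large-n limit rather than an exact value for every n.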
Conclusion (1/2) • Performance: • Speed: very good • Energy: good • Area: average
Conclusion (2/2) • Contribution: • Utilizing asynchrony, reconfigurability, and the properties of CA to make an extensible array with more regular and finer-grained cells than those of FPGAs. • Future work: • Improving the performance of the cell in terms of area and thermal management
Acknowledgment • I express my deepest gratitude to my supervisors, Dr. Mark Pavicic and Dr. Chao You. • Gratitude is also due to my graduate committee: Dr. Rajendra Katti, Dr. Subbaraya Yuvarajan, and Dr. Deying Li. • I express my love and gratitude to my beloved spouse, Hamed.
Q & A