CprE / ComS 583 Reconfigurable Computing
Prof. Joseph Zambreno
Department of Electrical and Computer Engineering
Iowa State University
Lecture #21 – HW/SW Codesign
Quick Points
• Midterm graded and returned
  • Average – 84.5
  • Median – 85.0
  • Maximum – 95.0
  • Minimum – 72.0
  • Standard Deviation – 6.65
CprE 583 – Reconfigurable Computing
HW #4 Discussion
• Problem 1 – did just a simple adder work?
• Problem 2 – how did you implement the permutation table?
• Problem 3 – did you use a counter?
AES-128E Algorithm
[Figure: flowchart. The 128-bit plaintext enters the round transformation (SubBytes, ShiftRows, MixColumns, AddRoundKey); KeyExpansion derives the round keys from the 128-bit key; round++ repeats until round = 10, then the 128-bit ciphertext is output]
Overview of AES (cont.)
• 128-bit input is copied into a two-dimensional (4x4) byte array referred to as the state
• Round transformations operate on the state array
• Final state copied back into 128-bit output
• AES makes use of a non-linear substitution function that operates on a single byte
  • Can be simplified as a look-up table (S-box)
[Figure: S-box look-up table, 16x16 hex byte values indexed by x and y]
AES-128E Modules: SubBytes
• S-box transformation performed independently on each byte of the state
[Figure: SubBytes maps each state byte S(r,c) through the S-box to S'(r,c)]
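The look-up table view of SubBytes can be modeled in software. The sketch below is a Python model, not the lecture's VHDL. It assumes the standard AES construction of the S-box (multiplicative inverse in GF(2^8) followed by the affine transform with constant 0x63), and the names gf_mul, build_sbox, and sub_bytes are illustrative only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Build the 256-entry S-box: field inverse, then affine transform."""
    inverse = [0] * 256  # 0 has no inverse and maps to 0
    for a in range(1, 256):
        for b in range(1, 256):
            if gf_mul(a, b) == 1:
                inverse[a] = b
                break
    sbox = []
    for x in range(256):
        b = inverse[x]
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_bytes(state):
    """Apply the S-box independently to every byte of a 4x4 state."""
    return [[SBOX[byte] for byte in row] for row in state]
```

Because the table is computed once and then only indexed, the same 256 values could equally be stored in a ROM, which is the hardware view taken in the lecture.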
AES-128E Modules: ShiftRows
• Bytes in the last three rows of the state are shifted cyclically over variable offsets
[Figure: row 0 is unchanged; rows 1, 2, and 3 are rotated left by 1, 2, and 3 bytes, so row 1 becomes S1,1 S1,2 S1,3 S1,0]
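A minimal Python model of the cyclic shifts, assuming the state is stored as a list of four rows; shift_rows is an illustrative name, not taken from the lecture's code.

```python
def shift_rows(state):
    """Rotate row r of the state left by r positions (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Label each cell so the movement is visible.
labels = [[f"S{r},{c}" for c in range(4)] for r in range(4)]
shifted = shift_rows(labels)
```

In hardware this step is pure wiring: no logic is needed, only a permutation of byte positions.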
AES-128E Modules: MixColumns
• Modulo polynomial-basis multiplication performed on each column of the state
• Can be simplified as a series of AND and XOR operations
[Figure: each column of state[i] is transformed into the matching column of state'[i]; the non-trivial multiplier coefficients are {02h} and {03h}]
MixColumns Implementation
    entity MixColumns is
      port (STATE_IN  : in  STATEtype;
            RNUM_IN   : in  RNUMtype;
            STATE_OUT : out STATEtype);
    end MixColumns;

    architecture behavior of MixColumns is
      signal tSTATE : STATEtype;
    begin
      process(STATE_IN)
        variable t1, t2 : std_logic_vector(7 downto 0);
      begin
        for i in 0 to 3 loop
          for j in 0 to Nb-1 loop
            -- Multiply by 2
            t1 := STATE_IN(i mod 4)(j)(6 downto 0) & '0';
            if (STATE_IN(i mod 4)(j)(7) = '1') then
              t1 := t1 xor x"1b";
            end if;
            -- Multiply by 3
            t2 := STATE_IN((i+1) mod 4)(j)(6 downto 0) & '0';
            if (STATE_IN((i+1) mod 4)(j)(7) = '1') then
              t2 := t2 xor x"1b";
            end if;
            t2 := t2 xor STATE_IN((i+1) mod 4)(j);
            -- Combine with the two unscaled bytes of the column
            tSTATE(i)(j) <= t1 xor t2 xor STATE_IN((i+2) mod 4)(j) xor STATE_IN((i+3) mod 4)(j);
          end loop;
        end loop;
      end process;
      -- Remainder of the architecture (use of RNUM_IN, driving STATE_OUT) is not shown on the slide
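The multiply-by-2 and multiply-by-3 steps can be checked with a small software model. This Python sketch follows the same structure as the VHDL loop body for a single column; xtime and mix_column are illustrative names, and the test column is a commonly used MixColumns check value rather than anything from the lecture.

```python
def xtime(b):
    """Multiply a byte by {02}: shift left, XOR with 0x1b if bit 7 was set."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mix_column(col):
    """Return the MixColumns output for one 4-byte column."""
    out = []
    for i in range(4):
        t1 = xtime(col[i])                               # {02} * s[i]
        t2 = xtime(col[(i + 1) % 4]) ^ col[(i + 1) % 4]  # {03} * s[i+1]
        out.append(t1 ^ t2 ^ col[(i + 2) % 4] ^ col[(i + 3) % 4])
    return out


result = mix_column([0xDB, 0x13, 0x53, 0x45])
```

The only operations used are a shift, a conditional XOR with a constant, and XORs between bytes, which is why the transformation maps to a small amount of FPGA logic.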
AES-128E Modules: AddRoundKey
• Words from the round-specific key are XORed into columns of the state
[Figure: round key Rkey[i] consists of words w[0] to w[3]; each word is XORed into one column of state[i] to produce state'[i]]
AddRoundKey Implementation
    entity AddRoundKey is
      port (STATE_IN  : in  STATEtype;
            KEY_IN    : in  KEYtype;
            STATE_OUT : out STATEtype);
    end AddRoundKey;

    architecture behavior of AddRoundKey is
    begin
      process(STATE_IN, KEY_IN)
      begin
        for j in 0 to (Nb-1) loop
          STATE_OUT(0)(j) <= STATE_IN(0)(j) xor KEY_IN(j)(31 downto 24);
          STATE_OUT(1)(j) <= STATE_IN(1)(j) xor KEY_IN(j)(23 downto 16);
          STATE_OUT(2)(j) <= STATE_IN(2)(j) xor KEY_IN(j)(15 downto 8);
          STATE_OUT(3)(j) <= STATE_IN(3)(j) xor KEY_IN(j)(7 downto 0);
        end loop;
      end process;
    end behavior;
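A Python model of the same byte slicing, assuming the state is a list of four rows and the round key is a list of four 32-bit words (one per column); add_round_key is an illustrative name.

```python
def add_round_key(state, key_words):
    """XOR round-key word j into column j; bits 31..24 go to row 0, bits 7..0 to row 3."""
    out = [[0] * 4 for _ in range(4)]
    for j, word in enumerate(key_words):
        for r in range(4):
            key_byte = (word >> (24 - 8 * r)) & 0xFF
            out[r][j] = state[r][j] ^ key_byte
    return out


zero_state = [[0] * 4 for _ in range(4)]
key = [0x2B7E1516, 0x28AED2A6, 0xABF71588, 0x09CF4F3C]
keyed = add_round_key(zero_state, key)
```

Since XOR is its own inverse, applying the same round key twice restores the original state.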
AES-128E Modules: KeyExpansion
• Initial 128-bit key is converted into separate keys for each of the 10 required rounds
• Consists of S-box transformations and some XORs
[Figure: words w[0] to w[3] of the 128-bit key are combined through S-box (S), rcon, and XOR stages to produce w[4] to w[7]; repeating the step yields Rkey[1] through Rkey[10]]
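A software sketch of the key schedule, assuming the standard AES-128 expansion (RotWord, SubWord, and a round constant applied to every fourth word). It is a Python model rather than the lecture's VHDL, all function names are illustrative, and the check key is the example key from the AES specification.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return product


def build_sbox():
    """Build the S-box: field inverse followed by the affine transform."""
    def rotl8(x, n):
        return ((x << n) | (x >> (8 - n))) & 0xFF
    inverse = [0] * 256
    for a in range(1, 256):
        inverse[a] = next(b for b in range(1, 256) if gf_mul(a, b) == 1)
    return [b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63
            for b in inverse]


SBOX = build_sbox()


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return ((SBOX[(w >> 24) & 0xFF] << 24) | (SBOX[(w >> 16) & 0xFF] << 16)
            | (SBOX[(w >> 8) & 0xFF] << 8) | SBOX[w & 0xFF])


def expand_key_128(key):
    """Expand a 16-byte key into 44 words: the initial key plus 10 round keys."""
    if len(key) != 16:
        raise ValueError("AES-128 key must be 16 bytes")
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)  # next round constant
        w.append(w[i - 4] ^ temp)
    return w


words = expand_key_128(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

Round key i is words[4*i : 4*i + 4]. Each word depends on the previous one, so the schedule can either be computed ahead of time (offline) or generated one round at a time alongside the data path (online).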
Design Decisions
• Online/offline key generation
• Inter-round layout decisions
  • Round unrolling
  • Round pipelining
• Intra-round layout decisions
  • Transformation pipelining
  • Transformation partitioning
• Technology mapping decisions
  • S-box synthesis as Block SelectRAM, distributed ROM primitives, or logic gates
Round Unrolling / Pipelining
• Unrolling replaces a loop body (round) with N copies of that loop body
  • AES-128E algorithm is a loop that iterates 10 times – N ∈ [1, 10]
  • N = 1 corresponds to the original looping case
  • N = 10 is a fully unrolled implementation
• Pipelining is a technique that increases the number of blocks of data that can be processed concurrently
  • Pipelining in hardware can be implemented by inserting registers
  • Unrolled rounds can be split into a certain number of pipeline stages
• These transformations increase throughput at the cost of greater area and latency
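The trade-off can be illustrated with a rough first-order model. Everything in this Python sketch is an assumption made for illustration: each round has the same combinational delay t_round_ns, each register adds t_reg_ns, area is counted in round copies, and the unrolling factor divides the round count. It is not a timing model of any particular FPGA.

```python
def round_layout_model(unroll, pipelined, t_round_ns=1.0, t_reg_ns=0.1, rounds=10):
    """Estimate clock period, latency, and throughput for one round layout."""
    if rounds % unroll != 0:
        raise ValueError("unroll factor must divide the number of rounds")
    if pipelined:
        # One register after every round copy: short clock, 'unroll' blocks in flight.
        clock_ns = t_round_ns + t_reg_ns
        cycles_per_block = rounds
        blocks_in_flight = unroll
    else:
        # All unrolled rounds form one combinational path between registers.
        clock_ns = unroll * t_round_ns + t_reg_ns
        cycles_per_block = rounds // unroll
        blocks_in_flight = 1
    latency_ns = cycles_per_block * clock_ns
    return {
        "area_rounds": unroll,
        "clock_ns": clock_ns,
        "latency_ns": latency_ns,
        "blocks_per_ns": blocks_in_flight / latency_ns,
    }


looped = round_layout_model(unroll=1, pipelined=False)
unrolled = round_layout_model(unroll=10, pipelined=False)
full_pipe = round_layout_model(unroll=10, pipelined=True)
```

Under these assumptions the fully unrolled, pipelined layout uses ten times the area of the looped design and delivers roughly ten times the throughput, while unrolling without pipelining mainly removes register overhead.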
Round Unrolling / Pipelining (cont.)
[Figure: rounds R1 to R10 between input plaintext and output ciphertext, shown for unrolling factors 1, 2, 5, and 10 with round pipelining ON]
Transformation Partitioning/Pipelining
• FPGA maximum clock frequency depends on the critical logic path
  • Inter-round transformations (unrolling, round pipelining) can't shorten the critical path within a round
• Individual transformations can be pipelined with registers, similar to the rounds
• Transformations that are part of the maximum delay path can be partitioned and pipelined as well
• Can result in large gains in throughput with only minimal area increases
Partitioning / Pipelining (cont.)
[Figure: with transformation pipelining ON, SubBytes, ShiftRows, MixColumns, and AddRoundKey are separated by registers; with transformation partitioning ON, KeyExpansion is split into KeyExpansionA, KeyExpansionB, and KeyExpansionC]
S-box Technology Mapping
• With synthesis primitives, can map the S-box lookup tables to different hardware components
• Two S-boxes can fit on a single Block SelectRAM
Sample VHDL code:
    constant SSYNROMSTYLE : string := "select_rom"; -- {logic, select_rom}

    entity Sbox is
      port (BYTE_IN  : in  std_logic_vector(7 downto 0);
            BYTE_OUT : out std_logic_vector(7 downto 0));
      attribute syn_romstyle : string;
      attribute syn_romstyle of BYTE_OUT : signal is SSYNROMSTYLE;
    end Sbox;
    ...
Recap – Retiming
Recap – Retiming (cont.)
• New weight of edge e after retiming:
  weight'(e) = weight(e) + lag(head(e)) - lag(tail(e))
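A small worked example of the retiming formula, assuming each edge runs from tail(e) to head(e) and its weight is the number of registers on it. The three-node ring and the lag values are made up for illustration, and retime is a hypothetical helper name.

```python
def retime(edges, lag):
    """Apply weight(e) + lag(head(e)) - lag(tail(e)) to every (tail, head, weight) edge."""
    retimed = []
    for tail, head, weight in edges:
        new_weight = weight + lag[head] - lag[tail]
        if new_weight < 0:
            raise ValueError(f"illegal retiming: edge {tail}->{head} gets {new_weight} registers")
        retimed.append((tail, head, new_weight))
    return retimed


# Ring a -> b -> c -> a with both registers initially on the edge c -> a.
ring = [("a", "b", 0), ("b", "c", 0), ("c", "a", 2)]
lags = {"a": 0, "b": 1, "c": 1}
balanced = retime(ring, lags)
```

One register moves from c -> a onto a -> b. The register count around the cycle stays at 2, because the lag terms cancel along any closed path.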
Retiming and Pipelining
• Can use this retiming to pipeline
• Assume an unlimited supply of registers at the edge of the circuit
• Retime them into the circuit
• See [WeaMar03A] for details
Recap – Retiming and Covering
Outline
• HW #4 Discussion
• Recap
• HW/SW Codesign
  • Motivation
  • Specification
  • Partitioning
  • Automation
Hardware/Software Codesign
• Definition 1 – the concurrent and cooperative design of hardware and software components of an embedded system
• Definition 2 – a design methodology supporting the cooperative and concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system [MicGup97A]
Motivation
• Not possible to put everything in hardware due to limited resources
• Some code is more appropriate for sequential implementation
• Desirable to allow for parallelization, serialization
• Possible to modify existing compilers to perform the task
Why put CPUs on FPGAs?
• Shrink a board to a chip
• What CPUs do best:
  • Irregular code
  • Code that takes advantage of a highly optimized datapath
• What FPGAs do best:
  • Data-oriented computations
  • Computations with local control
Computational Model
• Most recent work addressing this problem assumes a relatively slow bus interface
• FPGA has a direct interface to memory in this model
[Figure: general-purpose processor, FPGA, and memory connected by the memory bus]
Hardware/Software Partitioning
    if (foo < 8) {
      for (i=0; i<N; i++)
        x[i] = y[i]*z[i];
    }
[Figure: the code fragment is divided between the CPU and a HW accelerator]
Methodology
• Separation between function and communication
• Unified refinable formal specification model
  • Facilitates system specification
  • Implementation independent
  • Eases HW/SW trade-off evaluation and partitioning
• From a more practical perspective:
  • Measure the application
  • Identify what to put onto the accelerator
  • Build interfaces
System-Level Methodology
[Figure: design flow with the stages Informal Specification and Constraints; Component profiling; System model; Performance evaluation; Architecture design; HW/SW implementation; Prototype; Test (Fail / Success); Implementation]
Concurrency
• Concurrent applications provide the most speedup
[Figure: with no data dependencies, the CPU runs "if (a > b) ..." while the accelerator runs "x[i] = y[i] * z[i]"]
Partitioning
• Can divide the application into several processes that run concurrently
• Process partitioning exposes opportunities for parallelism
[Figure: Process 1: if (i>b) …; Process 2: for (i=0; i<N; i++) …; Process 3: for (j=0; j<N; j++) ...]
Automating System Partitioning
• Good partitioning mechanism:
  • Minimizes communication across the bus
  • Allows parallelism, with both hardware (FPGA) and processor operating concurrently
  • Keeps the processor near peak utilization at all times (performing useful work)
[Figure: flow with the blocks Specification, Capture, Model, Partition, Synthesize, FPGA, Interface, and Processor. Specification example: process (a, b, c) in port a, b; out port c; { read(a); … write(c); }. Code example: Line () { a = … … detach }]
Partitioning Algorithms
• Assume everything initially in software
• Select task for swapping
• Migrate to hardware and evaluate cost
  • Timing, hardware resources, program and data storage, synchronization overhead
• Cost evaluation and move evaluation similar to what we've seen regarding mincut and simulated annealing
[Figure: a task moves from the software list of tasks to the hardware list of tasks]
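The migrate-and-evaluate loop can be sketched as a simple greedy pass. This is a deliberately simplified stand-in, not the mincut or simulated-annealing style search mentioned above: it assumes tasks run one after another, that cost is total execution time, and that hardware area is the only resource limit. The Task fields and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: float        # execution time in software
    hw_time: float        # execution time on the accelerator
    hw_area: int          # hardware resources needed
    sync_overhead: float  # extra time to hand data across the bus


def total_time(tasks, in_hw):
    """Total time when the tasks named in in_hw run in hardware."""
    return sum(t.hw_time + t.sync_overhead if t.name in in_hw else t.sw_time
               for t in tasks)


def greedy_partition(tasks, area_budget):
    """Start all-software, then repeatedly migrate the task with the largest gain."""
    in_hw, area_used = set(), 0
    while True:
        best, best_gain = None, 0.0
        for t in tasks:
            if t.name in in_hw or area_used + t.hw_area > area_budget:
                continue
            gain = t.sw_time - (t.hw_time + t.sync_overhead)
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:
            return in_hw
        in_hw.add(best.name)
        area_used += best.hw_area


tasks = [
    Task("fir", sw_time=100.0, hw_time=10.0, hw_area=60, sync_overhead=5.0),
    Task("fft", sw_time=80.0, hw_time=20.0, hw_area=50, sync_overhead=5.0),
    Task("ctrl", sw_time=10.0, hw_time=9.0, hw_area=30, sync_overhead=5.0),
    Task("crc", sw_time=30.0, hw_time=5.0, hw_area=30, sync_overhead=2.0),
]
chosen = greedy_partition(tasks, area_budget=100)
```

Here "fir" and "crc" migrate, "fft" no longer fits once "fir" is placed, and "ctrl" stays in software because its synchronization overhead outweighs the speedup. A greedy pass can get stuck in such local choices, which is one reason annealing-style move evaluation is attractive.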
Multi-threaded Systems
• Single thread:
• Multi-thread:
Performance Analysis
• Single threaded: find the longest possible execution path
• Multi-threaded with no synchronization: find the longest of several execution paths
• Multi-threaded with synchronization: find the worst-case synchronization conditions
Multi-threaded Performance Analysis
• Synchronization causes the delay along one path to affect the delay along another
[Figure: paths with delays ta and tb meet at a synchronization point, followed by paths with delays tc and td]
Delay = max(ta, tb) + td
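A worked instance of the delay formula in Python. The numeric delays are made up for illustration, and synchronized_delay is a hypothetical helper name.

```python
def synchronized_delay(t_a, t_b, t_after):
    """Delay of a path that waits at a synchronization point for both predecessors."""
    return max(t_a, t_b) + t_after


# Hypothetical delays in cycles.
t_a, t_b, t_c, t_d = 4, 7, 3, 2
delay_to_d = synchronized_delay(t_a, t_b, t_d)
delay_to_c = synchronized_delay(t_a, t_b, t_c)
```

Even if the path through ta is fast, the path ending in td cannot finish before the slower tb reaches the synchronization point, so shortening ta alone does not improve the result.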
Control
• Need to signal between CPU and accelerator
  • Data ready
  • Complete
• Implementations:
  • Shared memory
  • Handshake
• If computation time is very predictable, a simpler communication scheme may be possible
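The data-ready / complete handshake over shared memory can be mimicked in software. In this Python sketch a thread stands in for the accelerator, a dictionary stands in for shared memory, and two threading.Event objects play the roles of the two control signals; the class and function names are illustrative only.

```python
import threading


class AcceleratorModel:
    """A thread that waits for data_ready, computes, then raises complete."""

    def __init__(self):
        self.shared = {}                     # stands in for shared memory
        self.data_ready = threading.Event()  # CPU -> accelerator
        self.complete = threading.Event()    # accelerator -> CPU
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        self.data_ready.wait()
        y, z = self.shared["y"], self.shared["z"]
        self.shared["x"] = [a * b for a, b in zip(y, z)]
        self.complete.set()


def offload_multiply(y, z, timeout_s=5.0):
    """CPU side: write operands, signal data ready, wait for complete."""
    acc = AcceleratorModel()
    acc.shared["y"] = y
    acc.shared["z"] = z
    acc.data_ready.set()
    if not acc.complete.wait(timeout=timeout_s):
        raise TimeoutError("accelerator did not signal completion")
    return acc.shared["x"]


product = offload_multiply([1, 2, 3], [4, 5, 6])
```

If the computation time were known exactly, the CPU could skip the complete signal and simply read the result after a fixed delay, which is the simpler scheme the last bullet refers to.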
Communication Levels
• Easier to program at the application level (send, receive, wait), but timing is difficult to predict
• More difficult to specify at a low level
  • Difficult to extract from the program, but timing and resources are easier to predict
[Figure: software stack of application program, operating system, I/O driver, and I/O bus facing custom application hardware; the communication levels shown are send/receive/wait, register reads/writes and interrupt service, and bus transactions and interrupts]
Other Interface Models
• Synchronization through a FIFO
• FIFO can be implemented either in hardware or in software
• Effectively reconfigure hardware (FPGA) to allocate buffer space as needed
• Interrupts used for software version of FIFO
[Figure: processes p1 to p3 exchange control/data items (r2, r3, d1 to d3) with the FPGA through a FIFO]
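A software FIFO between a producer and a consumer can be sketched with Python's standard queue.Queue. The consumer thread stands in for the FPGA side, the squaring operation is a placeholder for the real computation, and depth is an assumed buffer size.

```python
import queue
import threading


def run_fifo_pipeline(samples, depth=2):
    """Push samples through a bounded FIFO to a consumer thread and collect results."""
    fifo = queue.Queue(maxsize=depth)
    results = []

    def consumer():
        while True:
            item = fifo.get()        # blocks while the FIFO is empty
            if item is None:         # sentinel: producer is done
                break
            results.append(item * item)

    worker = threading.Thread(target=consumer)
    worker.start()
    for sample in samples:
        fifo.put(sample)             # blocks while the FIFO is full
    fifo.put(None)
    worker.join()
    return results


squares = run_fifo_pipeline([1, 2, 3, 4])
```

The blocking put and get calls provide the synchronization: neither side needs to know the other's timing, only that the buffer is not full or not empty.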
Debugging
• Hard to test a CPU/accelerator system:
  • Hard to control and observe the accelerator without the CPU
  • Software on CPU may have bugs
• Build separate test benches for CPU code, accelerator
• Test integrated system after components have been tested
POLIS Codesign Methodology
[Figure: POLIS flow. Inputs such as graphical EFSM and ESTEREL pass through compilers into CFSMs; the CFSMs feed formal verification, simulation, and partitioning; partitioning leads to SW synthesis, HW synthesis, and interface + RTOS synthesis, producing SW code + RTOS and a logic netlist for rapid prototyping]
Codesign Finite State Machines
• POLIS uses an FSM model for uncommitted, synthesizable, verifiable control-dominated HW/SW specification
• Translators from state diagrams, Esterel, ECL, ReactiveJava, and HDLs into a single FSM-based language
CFSM behavior
• Four-phase cycle:
  • Idle
  • Detect input events
  • Execute one transition
  • Emit output events
• Software response could take a long time:
  • Unbounded delay assumption
• Need efficient hw/sw communication primitive:
  • Event-based point-to-point communication
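The four-phase cycle can be sketched as a small Python class. This is a simplified illustration, not the POLIS implementation: events carry no values, each input event is buffered at most once, all detected events are consumed in one step, and the class and event names are hypothetical.

```python
class CFSM:
    """Toy codesign FSM: idle, detect input events, execute one transition, emit outputs."""

    def __init__(self, name, initial_state, transitions):
        # transitions maps (state, event) -> (next_state, [output events])
        self.name = name
        self.state = initial_state
        self.transitions = transitions
        self.pending = set()   # buffered input events
        self.listeners = {}    # output event -> receiving CFSMs

    def connect(self, event, receiver):
        """Point-to-point link: deliver 'event' emitted here to 'receiver'."""
        self.listeners.setdefault(event, []).append(receiver)

    def post(self, event):
        self.pending.add(event)

    def step(self):
        """Run one cycle; return True if a transition fired."""
        if not self.pending:                      # phase 1: idle
            return False
        detected = sorted(self.pending)           # phase 2: detect input events
        self.pending.clear()
        for event in detected:                    # phase 3: execute one transition
            key = (self.state, event)
            if key in self.transitions:
                self.state, outputs = self.transitions[key]
                for out in outputs:               # phase 4: emit output events
                    for receiver in self.listeners.get(out, []):
                        receiver.post(out)
                return True
        return False


cfsm1 = CFSM("CFSM1", "wait_b", {("wait_b", "B"): ("sent_c", ["C"])})
cfsm2 = CFSM("CFSM2", "wait_c", {("wait_c", "C"): ("got_c", ["F"])})
cfsm1.connect("C", cfsm2)

cfsm1.post("B")
fired1 = cfsm1.step()   # CFSM1 reacts to B and emits C
fired2 = cfsm2.step()   # CFSM2 reacts to C whenever it is next scheduled
idle2 = cfsm2.step()    # nothing pending, so CFSM2 stays idle
```

Because each machine only reacts when its own step is called, the receiver may respond arbitrarily later than the sender emits, which mirrors the unbounded delay assumption for software.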
Network of CFSMs
• Globally Asynchronous, Locally Synchronous (GALS) model
[Figure: CFSM1, CFSM2, and CFSM3 exchange events A, B, C, F, and G; transitions are labeled with conditions such as B=>C, C=>F, C=>G, C=>A, C=>B, and (A==0)=>B]
Summary
• Hardware/software codesign is complicated and limited by performance estimates
• Algorithms are generally not as good as human partitioning
• Other interesting issues include dual processors, special memory interfaces
• Will likely evolve at a faster rate as compilers evolve