Reconfigurable Computing Lecture 21 - HW/SW Codesign Discussion

This lecture covers the midterm grading results and a discussion of the HW #4 problems, which involve a simple adder, a permutation table implementation, and the use of a counter. It also introduces the AES-128E algorithm and its round transformations.

Presentation Transcript


  1. CprE / ComS 583 Reconfigurable Computing
     Prof. Joseph Zambreno
     Department of Electrical and Computer Engineering
     Iowa State University
     Lecture #21 – HW/SW Codesign

  2. Quick Points
     • Midterm graded and returned
       • Average – 84.5
       • Median – 85.0
       • Maximum – 95.0
       • Minimum – 72.0
       • Standard Deviation – 6.65
     CprE 583 – Reconfigurable Computing

  3. HW #4 Discussion
     • Problem 1 – did just a simple adder work?
     • Problem 2 – how did you implement the permutation table?
     • Problem 3 – did you use a counter?

  4. AES-128E Algorithm
     [Figure: flowchart. The 128-bit plaintext enters the round transformation (SubBytes, ShiftRows, MixColumns, AddRoundKey); KeyExpansion derives round keys from the 128-bit key; round++ repeats until round = 10, after which the 128-bit ciphertext is output.]

  5. Overview of AES (cont.)
     • 128-bit input is copied into a two-dimensional (4x4) byte array referred to as the state
     • Round transformations operate on the state array
     • Final state copied back into 128-bit output
     • AES makes use of a non-linear substitution function that operates on a single byte
       • Can be simplified as a look-up table (S-box)
     [Figure: S-box look-up table of byte values, indexed by hex digits x and y]
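
The look-up table mentioned above can be regenerated rather than typed in. The sketch below is a software model, not the course's implementation: it assumes the S-box definition from the AES standard (multiplicative inverse in GF(2^8) followed by an affine transformation), which the slide does not spell out. The names `gf_mul`, `gf_inv`, `build_sbox`, and `SBOX` are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B  # reduce by the AES field polynomial
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8); 0 maps to 0 by convention."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):  # a^254 equals a^-1 in a field with 256 elements
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox() -> list:
    """Build the 256-entry substitution table (inverse, then affine map)."""
    table = []
    for value in range(256):
        inv = gf_inv(value)
        table.append(
            inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63
        )
    return table


SBOX = build_sbox()
```

Once built, substituting a byte is a single index operation, `SBOX[b]`, which is why the function maps naturally onto a ROM or block RAM in hardware.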

  6. AES-128E Modules: SubBytes
     • S-box transformation performed independently on each byte of the state
     [Figure: each byte S(r,c) of state[i] passes through the S-box to give byte S'(r,c) of state'[i]]

  7. AES-128E Modules: ShiftRows
     • Bytes in the last three rows of the state are shifted cyclically over variable offsets
     state[i]                     state'[i]
     S0,0 S0,1 S0,2 S0,3    →    S'0,0 S'0,1 S'0,2 S'0,3
     S1,0 S1,1 S1,2 S1,3    →    S'1,1 S'1,2 S'1,3 S'1,0
     S2,0 S2,1 S2,2 S2,3    →    S'2,2 S'2,3 S'2,0 S'2,1
     S3,0 S3,1 S3,2 S3,3    →    S'3,3 S'3,0 S'3,1 S'3,2
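
As a small software model of the row rotation shown above (a sketch, assuming the state is held as a list of four rows; `shift_rows` is an illustrative name):

```python
def shift_rows(state):
    """Rotate row r of a 4x4 state left by r positions (row 0 is unchanged)."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Label each cell so the movement of bytes is easy to follow.
labelled = [[f"S{r},{c}" for c in range(4)] for r in range(4)]
shifted = shift_rows(labelled)
```

In hardware this step costs no logic at all, since it is only a fixed rewiring of byte positions.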

  8. AES-128E Modules: MixColumns
     • Modulo polynomial-basis multiplication performed on each column of the state
     • Can be simplified as a series of AND and XOR operations
     [Figure: each column of state[i] is multiplied, using the constants {02h} and {03h}, to produce the corresponding column of state'[i]]

  9. MixColumns Implementation
     entity MixColumns is
       port (STATE_IN  : in  STATEtype;
             RNUM_IN   : in  RNUMtype;
             STATE_OUT : out STATEtype);
     end MixColumns;

     architecture behavior of MixColumns is
       signal tSTATE : STATEtype;
     begin
       process(STATE_IN)
         variable t1, t2 : std_logic_vector(7 downto 0);
       begin
         for i in 0 to 3 loop
           for j in 0 to Nb-1 loop
             -- Multiply by 2
             t1 := STATE_IN(i mod 4)(j)(6 downto 0) & '0';
             if (STATE_IN(i mod 4)(j)(7) = '1') then
               t1 := t1 xor x"1b";
             end if;
             -- Multiply by 3
             t2 := STATE_IN((i+1) mod 4)(j)(6 downto 0) & '0';
             if (STATE_IN((i+1) mod 4)(j)(7) = '1') then
               t2 := t2 xor x"1b";
             end if;
             t2 := t2 xor STATE_IN((i+1) mod 4)(j);
             tSTATE(i)(j) <= t1 xor t2 xor STATE_IN((i+2) mod 4)(j)
                             xor STATE_IN((i+3) mod 4)(j);
           end loop;
         end loop;
       end process;
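
The same arithmetic can be checked in software before committing it to hardware. This is a sketch that mirrors the VHDL loop body for a single column; `xtime` and `mix_column` are illustrative names, and the sample column is a commonly quoted MixColumns test vector.

```python
def xtime(b: int) -> int:
    """Multiply a byte by 2 in GF(2^8): shift left, XOR 0x1b if bit 7 was set."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mix_column(col):
    """Apply MixColumns to one 4-byte column, following the VHDL structure."""
    out = []
    for i in range(4):
        t1 = xtime(col[i])                                # multiply by 2
        t2 = xtime(col[(i + 1) % 4]) ^ col[(i + 1) % 4]   # multiply by 3
        out.append(t1 ^ t2 ^ col[(i + 2) % 4] ^ col[(i + 3) % 4])
    return out


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Multiplication by 3 is computed as (2·x) XOR x, which is why only a shift, a conditional XOR with 0x1b, and one more XOR are needed.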

  10. AES-128E Modules: AddRoundKey
      • Words from the round-specific key are XORed into columns of the state
      [Figure: the four columns of state[i] are XORed with the words w[0], w[1], w[2], w[3] of Rkey[i] to give state'[i]]

  11. AddRoundKey Implementation
      entity AddRoundKey is
        port (STATE_IN  : in  STATEtype;
              KEY_IN    : in  KEYtype;
              STATE_OUT : out STATEtype);
      end AddRoundKey;

      architecture behavior of AddRoundKey is
      begin
        process(STATE_IN, KEY_IN)
        begin
          for j in 0 to (Nb-1) loop
            STATE_OUT(0)(j) <= STATE_IN(0)(j) xor KEY_IN(j)(31 downto 24);
            STATE_OUT(1)(j) <= STATE_IN(1)(j) xor KEY_IN(j)(23 downto 16);
            STATE_OUT(2)(j) <= STATE_IN(2)(j) xor KEY_IN(j)(15 downto 8);
            STATE_OUT(3)(j) <= STATE_IN(3)(j) xor KEY_IN(j)(7 downto 0);
          end loop;
        end process;
      end behavior;
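
A software model of the same operation, as a sketch: it assumes the state is a list of four rows and the round key is a list of four 32-bit words, one per column, with the most significant byte going to row 0 as in the VHDL slices. `add_round_key` is an illustrative name.

```python
def add_round_key(state, key_words):
    """XOR each state column j with the four bytes of key word j."""
    out = [[0] * 4 for _ in range(4)]
    for j in range(4):
        word = key_words[j]
        for r in range(4):
            # Row 0 takes bits 31..24, row 1 bits 23..16, and so on.
            key_byte = (word >> (24 - 8 * r)) & 0xFF
            out[r][j] = state[r][j] ^ key_byte
    return out


zero_state = [[0] * 4 for _ in range(4)]
round_key = [0x00010203, 0x04050607, 0x08090A0B, 0x0C0D0E0F]
keyed = add_round_key(zero_state, round_key)
```

Because XOR is its own inverse, applying the same round key twice restores the original state, which is a convenient self-check.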

  12. AES-128E Modules: KeyExpansion
      • Initial 128-bit key is converted into separate keys for each of the 10 required rounds
      • Consists of S-box transformations and some XORs
      [Figure: key words w[0]–w[3] of the 128-bit key, S-box (S) blocks, and rcon combine to produce w[4]–w[7]; round keys Rkey[1]–Rkey[10]]

  13. Design Decisions
      • Online/offline key generation
      • Inter-round layout decisions
        • Round unrolling
        • Round pipelining
      • Intra-round layout decisions
        • Transformation pipelining
        • Transformation partitioning
      • Technology mapping decisions
        • S-box synthesis as Block SelectRAM, distributed ROM primitives, or logic gates

  14. Round Unrolling / Pipelining
      • Unrolling replaces a loop body (round) with N copies of that loop body
        • AES-128E algorithm is a loop that iterates 10 times – N ∈ [1, 10]
        • N = 1 corresponds to original looping case
        • N = 10 is a fully unrolled implementation
      • Pipelining is a technique that increases the number of blocks of data that can be processed concurrently
        • Pipelining in hardware can be implemented by inserting registers
        • Unrolled rounds can be split into a certain number of pipeline stages
      • These transformations will increase throughput but also increase area and latency
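
Unrolling can be illustrated with a software analogy. The sketch below is not AES: `toy_round` is a made-up stand-in for a round, and the restriction to factors that divide 10 is an assumption made here to keep the loop simple. It shows that the result is independent of the unrolling factor while the number of loop iterations drops to 10/N.

```python
ROUNDS = 10


def toy_round(state: int, round_index: int) -> int:
    """Hypothetical stand-in for one round; NOT a real AES round."""
    return (state * 31 + round_index) & 0xFFFFFFFF


def run_unrolled(state: int, factor: int):
    """Run ROUNDS rounds with `factor` copies of the round body per iteration."""
    if ROUNDS % factor != 0:
        raise ValueError("this sketch only handles factors that divide 10")
    iterations = 0
    round_index = 1
    while round_index <= ROUNDS:
        # In hardware, each pass of this inner loop is a separate round instance.
        for _ in range(factor):
            state = toy_round(state, round_index)
            round_index += 1
        iterations += 1
    return state, iterations


results = {n: run_unrolled(0x12345678, n) for n in (1, 2, 5, 10)}
```

In hardware the N copies occupy N times the round area; placing registers between the copies is what lets several blocks be in flight at once.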

  15. Round Unrolling / Pipelining (cont.)
      [Figure: round layouts for unrolling factor = 1, 2, 5, and 10, with round pipelining = ON; the input plaintext passes through rounds R1–R10 to the output ciphertext]

  16. Transformation Partitioning/Pipelining
      • FPGA maximum clock frequency depends on critical logic path
        • Inter-round transformations can’t improve critical path
      • Individual transformations can be pipelined with registers similar to the rounds
      • Transformations that are part of the maximum delay path can be partitioned and pipelined as well
      • Can result in large gains in throughput with only minimal area increases

  17. Partitioning / Pipelining (cont.)
      [Figure: with transformation pipelining = ON, SubBytes, ShiftRows, MixColumns, AddRoundKey, and KeyExpansion are shown as separate stages; with transformation partitioning = ON, KeyExpansion is split into KeyExpansionA, KeyExpansionB, and KeyExpansionC]

  18. S-box Technology Mapping
      • With synthesis primitives, can map the S-box lookup tables to different hardware components
      • Two S-boxes can fit on a single Block SelectRAM
      Sample VHDL code:
      constant SSYNROMSTYLE : string := "select_rom"; -- {logic, select_rom}

      entity Sbox is
        port (BYTE_IN  : in  std_logic_vector(7 downto 0);
              BYTE_OUT : out std_logic_vector(7 downto 0));
        attribute syn_romstyle : string;
        attribute syn_romstyle of BYTE_OUT : signal is SSYNROMSTYLE;
      end Sbox;
      ...

  19. Recap – Retiming

  20. Recap – Retiming (cont.)
      weight'(e) = weight(e) + lag(head(e)) - lag(tail(e))
      (weight(e) is the number of registers on edge e before retiming and weight'(e) the number after)
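
The retiming rule can be applied mechanically to a small graph. This is a sketch under stated assumptions: edges are directed from tail to head, edge weights are register counts, and the three-node cycle and lag values are invented for illustration. `retime` is an illustrative name.

```python
def retime(edges, lag):
    """Return new edge weights: w'(e) = w(e) + lag(head(e)) - lag(tail(e))."""
    retimed = {}
    for (tail, head), weight in edges.items():
        new_weight = weight + lag[head] - lag[tail]
        if new_weight < 0:
            raise ValueError(f"illegal retiming: edge {tail}->{head} goes negative")
        retimed[(tail, head)] = new_weight
    return retimed


# Hypothetical circuit: a -> b -> c -> a, with both registers on the c -> a edge.
edges = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}
lag = {"a": 0, "b": 1, "c": 1}
retimed = retime(edges, lag)
```

The lag terms cancel around any cycle, so the total register count on a loop is unchanged; retiming only redistributes the registers.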

  21. Retiming and Pipelining
      • Can use this retiming to pipeline
      • Assume an infinite supply of registers at the edge of the circuit
      • Retime them into the circuit
      • See [WeaMar03A] for details

  22. Recap – Retiming and Covering

  23. Outline
      • HW #4 Discussion
      • Recap
      • HW/SW Codesign
        • Motivation
        • Specification
        • Partitioning
        • Automation

  24. Hardware/Software Codesign
      • Definition 1 – the concurrent and co-operative design of hardware and software components of an embedded system
      • Definition 2 – A design methodology supporting the cooperative and concurrent development of hardware and software (co-specification, co-development, and co-verification) in order to achieve shared functionality and performance goals for a combined system [MicGup97A]

  25. Motivation
      • Not possible to put everything in hardware due to limited resources
      • Some code more appropriate for sequential implementation
      • Desirable to allow for parallelization, serialization
      • Possible to modify existing compilers to perform the task

  26. Why put CPUs on FPGAs?
      • Shrink a board to a chip
      • What CPUs do best:
        • Irregular code
        • Code that takes advantage of a highly optimized datapath
      • What FPGAs do best:
        • Data-oriented computations
        • Computations with local control

  27. Computational Model
      • Most recent work addressing this problem assumes relatively slow bus interface
      • FPGA has direct interface to memory in this model
      [Figure: general-purpose processor, FPGA, and memory connected by a memory bus]

  28. Hardware/Software Partitioning
      if (foo < 8) {
        for (i=0; i<N; i++)
          x[i] = y[i]*z[i];
      }
      [Figure: the code above is divided between a CPU and a HW accelerator]

  29. Methodology
      • Separation between function and communication
      • Unified refinable formal specification model
        • Facilitates system specification
        • Implementation independent
        • Eases HW/SW trade-off evaluation and partitioning
      • From a more practical perspective:
        • Measure the application
        • Identify what to put onto the accelerator
        • Build interfaces

  30. System-Level Methodology
      [Figure: flowchart with the stages Informal Specification/Constraints, Component profiling, System model, Performance evaluation, Architecture design, HW/SW implementation, Prototype, and Test; Test leads to Implementation on success and loops back on fail]

  31. Concurrency
      • Concurrent applications provide the most speedup
      [Figure: with no data dependencies, the CPU runs "if (a > b) ..." while the accelerator runs "x[i] = y[i] * z[i]"]

  32. Partitioning
      • Can divide the application into several processes that run concurrently
      • Process partitioning exposes opportunities for parallelism
      [Figure: Process 1: if (i>b) …; Process 2: for (i=0; i<N; i++) …; Process 3: for (j=0; j<N; j++) ...]

  33. Automating System Partitioning
      • Good partitioning mechanism:
        • Minimize communication across bus
        • Allows parallelism: both hardware (FPGA) and processor operating concurrently
        • Near peak processor utilization at all times (performing useful work)
      [Figure: partitioning flow with the steps Specification, Capture, Model, Partition, and Synthesize, targeting the FPGA and the processor through an interface]
      Example specification: process (a, b, c) in port a, b; out port c; { read(a); … write(c); }
      Example code: Line () { a = … … detach }

  34. Partitioning Algorithms
      • Assume everything initially in software
      • Select task for swapping
      • Migrate to hardware and evaluate cost
        • Timing, hardware resources, program and data storage, synchronization overhead
      • Cost evaluation and move evaluation similar to what we’ve seen regarding mincut and simulated annealing
      [Figure: a task moving from the software list of tasks to the hardware list of tasks]
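
The migrate-and-evaluate loop can be sketched as follows. This is deliberately a plain greedy heuristic, not mincut or simulated annealing, and every task name, timing, and area figure is invented for illustration. The cost model (execution time plus a per-task synchronization overhead, under a hardware area budget) is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    sw_time: float    # execution time if left in software
    hw_time: float    # execution time if migrated to hardware
    hw_area: int      # hardware resources needed
    comm_time: float  # synchronization overhead when in hardware


def total_time(tasks, in_hw):
    """Total time for a partition, counting overhead for hardware tasks."""
    return sum(
        (t.hw_time + t.comm_time) if t.name in in_hw else t.sw_time for t in tasks
    )


def greedy_partition(tasks, area_budget):
    """Start all-software; repeatedly migrate the task with the best gain."""
    in_hw, used_area = set(), 0
    while True:
        best, best_gain = None, 0.0
        for t in tasks:
            if t.name in in_hw or used_area + t.hw_area > area_budget:
                continue
            gain = t.sw_time - (t.hw_time + t.comm_time)
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:
            return in_hw
        in_hw.add(best.name)
        used_area += best.hw_area


tasks = [
    Task("fir", sw_time=100, hw_time=10, hw_area=60, comm_time=5),
    Task("fft", sw_time=80, hw_time=20, hw_area=50, comm_time=10),
    Task("ctrl", sw_time=5, hw_time=4, hw_area=10, comm_time=3),
]
chosen = greedy_partition(tasks, area_budget=100)
```

Here the small control task stays in software because its synchronization overhead outweighs the speedup, which is the kind of trade-off the cost function is meant to expose. A greedy pass can get stuck in local optima, which is one reason annealing-style moves are used in practice.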

  35. Multi-threaded Systems
      • Single thread:
      • Multi-thread:

  36. Performance Analysis
      • Single threaded: Find longest possible execution path
      • Multi-threaded with no synchronization: Find the longest of several execution paths
      • Multi-threaded with synchronization: Find the worst-case synchronization conditions

  37. Multi-threaded Performance Analysis
      • Synchronization causes the delay along one path to affect the delay along another
      [Figure: two paths with delays ta and tb meet at a synchronization point, followed by segments with delays tc and td]
      Delay = max(ta, tb) + td
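
The delay expression is easy to evaluate for concrete numbers. The values below are invented; the sketch only shows that slowing the other thread (tb) lengthens the path through td even though ta is unchanged.

```python
def synchronized_delay(ta: float, tb: float, td: float) -> float:
    """Delay of the path ending in td when it must wait for both ta and tb."""
    return max(ta, tb) + td


# Hypothetical segment delays, in arbitrary time units.
fast_partner = synchronized_delay(ta=3, tb=2, td=4)
slow_partner = synchronized_delay(ta=3, tb=9, td=4)
```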

  38. Control
      • Need to signal between CPU and accelerator
        • Data ready
        • Complete
      • Implementations:
        • Shared memory
        • Handshake
      • If computation time is very predictable, a simpler communication scheme may be possible

  39. Communication Levels
      • Easier to program at application level
        • (send, receive, wait) but difficult to predict
      • More difficult to specify at low level
        • Difficult to extract from program but timing and resources easier to predict
      [Figure: communication stack. Software side: application program, operating system, I/O driver, I/O bus. Hardware side: custom application hardware, I/O driver, I/O bus. Communication levels: send/receive/wait; register reads/writes and interrupt service; bus transactions and interrupts.]

  40. Other Interface Models
      • Synchronization through a FIFO
      • FIFO can be implemented either in hardware or in software
      • Effectively reconfigure hardware (FPGA) to allocate buffer space as needed
      • Interrupts used for software version of FIFO
      [Figure: processes p1–p3 exchange data items d1–d3 and requests r2, r3 with the FPGA through a control/data FIFO]
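
FIFO synchronization can be modelled in software with a bounded queue. In this sketch two threads stand in for the CPU and the accelerator, and doubling each item stands in for the accelerated computation; the depth, names, and workload are all assumptions, and a software queue only approximates a hardware FIFO.

```python
import queue
import threading


def run_fifo_demo(items, depth=2):
    """Pass items from a producer to a consumer through a bounded FIFO."""
    fifo = queue.Queue(maxsize=depth)
    received = []
    end_marker = object()  # unique sentinel marking the end of the stream

    def producer():
        for item in items:
            fifo.put(item)  # blocks while the FIFO is full
        fifo.put(end_marker)

    def consumer():
        while True:
            item = fifo.get()  # blocks while the FIFO is empty
            if item is end_marker:
                break
            received.append(item * 2)  # stand-in for the accelerator's work

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return received


processed = run_fifo_demo([1, 2, 3, 4])
```

The blocking `put` and `get` calls play the role of the full and empty flags: neither side needs to know the other's timing, only the FIFO state.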

  41. Debugging
      • Hard to test a CPU/accelerator system:
        • Hard to control and observe the accelerator without the CPU
        • Software on CPU may have bugs
      • Build separate test benches for CPU code, accelerator
      • Test integrated system after components have been tested

  42. POLIS Codesign Methodology
      [Figure: POLIS design flow. Front ends such as Graphical EFSM and ESTEREL pass through compilers to CFSMs; the CFSMs feed formal verification, simulation, and partitioning; partitioning leads to SW synthesis (SW code + RTOS), HW synthesis (logic netlist), and interface + RTOS synthesis, followed by rapid prototyping]

  43. Codesign Finite State Machines
      • POLIS uses an FSM model for uncommitted, synthesizable, verifiable control-dominated HW/SW specification
      • Translators from
        • State diagrams,
        • Esterel, ECL, ReactiveJava
        • HDLs
        into a single FSM-based language

  44. CFSM behavior
      • Four-phase cycle:
        • Idle
        • Detect input events
        • Execute one transition
        • Emit output events
      • Software response could take a long time:
        • Unbounded delay assumption
      • Need efficient hw/sw communication primitive:
        • Event-based point-to-point communication
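
The four-phase cycle can be sketched as a tiny event-driven state machine. This is a simplified illustration, not the POLIS semantics: the class name `Cfsm`, the list-based event buffer, and the example transition table are all assumptions made for the sketch.

```python
class Cfsm:
    """Toy state machine that follows the idle/detect/execute/emit cycle."""

    def __init__(self, transitions, state):
        # transitions maps (state, input_event) -> (next_state, output_event)
        self.transitions = transitions
        self.state = state
        self.pending = []  # input events waiting to be detected

    def post(self, event):
        """Deliver an input event; it is consumed on a later cycle."""
        self.pending.append(event)

    def cycle(self):
        """Run one four-phase cycle and return the list of emitted events."""
        if not self.pending:           # phase 1: idle, nothing to react to
            return []
        event = self.pending.pop(0)    # phase 2: detect an input event
        key = (self.state, event)
        if key not in self.transitions:
            return []                  # no enabled transition for this event
        self.state, output = self.transitions[key]  # phase 3: one transition
        return [output] if output is not None else []  # phase 4: emit


machine = Cfsm(
    {("idle", "start"): ("busy", "ack"), ("busy", "done"): ("idle", None)},
    state="idle",
)
```

Because an event simply waits in the buffer until the next cycle runs, the model makes no assumption about how long the receiver takes to react, in the spirit of the unbounded delay assumption above.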

  45. Network of CFSMs
      • Globally Asynchronous, Locally Synchronous (GALS) model
      [Figure: a network of three machines (CFSM1, CFSM2, CFSM3) exchanging events A, B, C, F, and G, with transitions labeled B=>C, C=>F, C=>G, F^(G==1), C=>A, C=>B, and (A==0)=>B]

  46. Summary
      • Hardware/software codesign is complicated and limited by performance estimates
      • Algorithms are not generally as good as human partitioning
      • Other interesting issues include dual processors, special memory interfaces
      • Will likely evolve at a faster rate as compilers evolve
