Presentation 1 • MAD MAC 525 • Farhan Mohamed Ali (W2-1), Jigar Vora (W2-2), Sonali Kapoor (W2-3), Avni Jhunjhunwala (W2-4), Shiven Seth (W2-5) • W2 • 1st February, 2006 • Architecture Proposal • Project Objective: Design a crucial part of a GPU called the Multiply Accumulate Unit (MAC) which will revolutionize graphics.
MAD MAC 525 Status: • Project chosen • Specifications defined • Architecture • Design • Behavioral Verilog • Testbenches • To be done: • Verilog: gate-level design • Schematic • Floor plan • Layout • Extraction, LVS, post-layout simulation
Overview - MAD MAC 525 • Multiply Accumulate unit (MAC) • Executes the function AB+C on 16-bit floating-point inputs • Multiply and add in parallel to greatly speed up operation • Rounding is performed only once, so the result is more accurate than a separate multiply followed by a separate add • MAD MAC accelerates FP16 blending to enable true HDR graphics • Bright things can be really bright • Dark things can be really dark • And the details can be seen in both
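The accuracy claim above (one rounding instead of two) can be illustrated with a small software model. This is a minimal sketch, not the team's design: it assumes IEEE-style half precision with round-to-nearest-even as provided by Python's `struct` format `"e"`, and it uses double precision as the wide intermediate, which is exact for the example values but not for every possible fp16 input. The function names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest fp16 value (assumed round-half-to-even)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def separate_mul_add(a, b, c):
    """Multiply, round to fp16, then add and round again (two roundings)."""
    product = to_fp16(a * b)
    return to_fp16(product + c)

def fused_mac(a, b, c):
    """Compute a*b + c in wider precision and round to fp16 once."""
    return to_fp16(a * b + c)

# Example inputs chosen so that all three are exactly representable in fp16.
a = 1 + 2**-10
b = 1 + 2**-9
c = -1.0

print("separate:", separate_mul_add(a, b, c))
print("fused:   ", fused_mac(a, b, c))
```

For these inputs the separate version loses the low-order bit of the product at the first rounding, while the single-rounding version keeps it.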
Quick Overview of FP • A = 1.11010 x 2^2 • B = 1.01110 x 2^5 • C = 1.11000 x 2^8 • Step 1: A*B • Multiply the significands: 1.11010 * 1.01110 = 10.1001101100 • Exponent of result is expA + expB = 7 • A*B = 10.1001101100 x 2^7 • Step 2: Align C • To add two FP numbers, their exponents must be the same • Shift by expA + expB – expC = 2 + 5 – 8 = -1 • The shift is negative, so shift the significand of C left by 1 • 1.11000 x 2^8 -> 11.1000 x 2^7
Quick Overview of FP (contd.) • Step 3: Depending on the signs of A*B and C, add or subtract the two • Suppose A, B, and C are all positive • A*B + C = 10.1001101100 + 11.1000 = 110.0001101100 • Step 4: Normalize the result • Currently the significand is 110.0001101100 and the exponent is expA + expB = 7 • Normalized to 1.100001101100 x 2^9 (= 781.5, which matches 7.25 x 46 + 448) • Step 5: Round the result • The significand must fit in 10 bits • Based on bits 11 through 13, the significand is rounded to fit in 10 bits
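The five steps can be replayed with integer arithmetic on the significands, which makes it easy to check a hand calculation. This is a rough sketch under stated assumptions: all operands are positive, significands are passed as integers with 5 fractional bits as in the example, and rounding is round-half-to-even (the slides only say that bits 11 through 13 decide the rounding). The function name `mac_steps` is hypothetical.

```python
def mac_steps(sig_a, exp_a, sig_b, exp_b, sig_c, exp_c, frac_bits=5, out_frac_bits=10):
    """Return (significand, exponent) of A*B + C for positive operands.

    sig_* are integers encoding 1.fffff with `frac_bits` fractional bits.
    The result significand has `out_frac_bits` fractional bits.
    """
    # Step 1: multiply significands, add exponents.
    prod = sig_a * sig_b                 # has 2 * frac_bits fractional bits
    exp = exp_a + exp_b
    frac = 2 * frac_bits

    # Step 2: align C to the exponent of the product.
    c = sig_c << frac_bits               # same fractional width as the product
    shift = exp - exp_c
    if shift <= 0:
        c <<= -shift                     # C is larger: shift it left
    else:
        prod <<= shift                   # C is smaller: keep its low bits by
        frac += shift                    # widening the fraction instead of dropping them

    # Step 3: add (all operands assumed positive).
    total = prod + c

    # Step 4: normalize so that exactly one bit sits left of the binary point.
    msb = total.bit_length() - 1
    exp_result = exp + (msb - frac)

    # Step 5: round to out_frac_bits fractional bits (assumed round-half-to-even).
    drop = msb - out_frac_bits
    if drop > 0:
        q, r = divmod(total, 1 << drop)
        half = 1 << (drop - 1)
        if r > half or (r == half and q & 1):
            q += 1
        if q.bit_length() > out_frac_bits + 1:   # rounding carried out of the top bit
            q >>= 1
            exp_result += 1
    else:
        q = total << -drop
    return q, exp_result

sig, exp = mac_steps(0b111010, 2, 0b101110, 5, 0b111000, 8)
print(bin(sig), exp, sig / 2**10 * 2**exp)
```

The printed value can be compared against the decimal evaluation of A*B + C for the example operands.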
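Block Diagram • [Figure: block diagram of the MAC datapath. Three 16-bit inputs feed RegArray A, RegArray B and RegArray C. Blocks shown: Multiplier, Exp Calc, Align, Control Logic & Sign Determine, Leading 0 Anticipator, Adder/Subtractor, Normalize, Round, and Reg Y, followed by the 16-bit output. Signal widths are labelled on the figure, but their placement is not recoverable from the extracted text.]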
Design Decisions (Week 2): • Implementing a 16 bit (fp16) format • 1 bit sign, 10 bit significand and 5 bit exponent • Compatible with OpenEXR format used in latest games • Enable Ultra-Threading • Implements high speed register arrays and fast thread switching logic to instantaneously switch to another available thread if the executing thread runs out of data • Implementation: High speed register-arrays for each input
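The 1/5/10 bit split can be made concrete by pulling the three fields out of a half-precision value. This is a sketch that assumes the usual half-precision layout (sign in bit 15, exponent in bits 14 to 10 with a bias of 15, fraction in bits 9 to 0); the slides give the field widths but not the bit order. The helper name `fp16_fields` is hypothetical.

```python
import struct

def fp16_fields(x):
    """Split a value, rounded to fp16, into (sign, biased exponent, fraction)."""
    # Pack as half precision, then reinterpret the same two bytes as an unsigned int.
    (bits,) = struct.unpack("<H", struct.pack("<e", x))
    sign = bits >> 15                # 1 bit
    exponent = (bits >> 10) & 0x1F   # 5 bits, stored with a bias of 15
    fraction = bits & 0x3FF          # 10 bits, the leading 1 is implicit
    return sign, exponent, fraction

for value in (1.0, 1.5, -2.0):
    print(value, fp16_fields(value))
```

The implicit leading 1 is why the multiplier works on 11-bit significands even though only 10 bits are stored.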
Design Decisions (contd.): • Multiplier Implementation • 11 x 11 Carry-Save Multiplier • Reasons: • Fast because it avoids having ripple carry in every stage • Enables Compact Layout
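The reason a carry-save structure avoids a ripple carry in every stage can be shown arithmetically. The sketch below models only the idea, not the gate-level array: each partial product is folded into a (sum, carry) pair with a 3:2 compressor, and a single carry-propagating addition happens at the very end. The function names and the default `width=11` are illustrative assumptions.

```python
def carry_save_add(x, y, z):
    """3:2 compressor: reduce three numbers to a sum word and a carry word.

    No carry propagates between bit positions; x + y + z == s + c.
    """
    s = x ^ y ^ z
    c = ((x & y) | (y & z) | (x & z)) << 1
    return s, c

def carry_save_multiply(a, b, width=11):
    """Multiply two unsigned `width`-bit numbers using carry-save accumulation."""
    s, c = 0, 0
    for i in range(width):
        partial = (a << i) if (b >> i) & 1 else 0
        s, c = carry_save_add(s, c, partial)   # no ripple carry in this stage
    return s + c                               # one final carry-propagate add

print(carry_save_multiply(0b11101000000, 0b10111000000))
```

In hardware the final `s + c` is the only place where a full-width carry chain is needed.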
Design Decisions (contd.): • 2’s Complement Adder/Subtractor • Variable Length Carry-Select Adder • Reason: Reduces delay through Muxes • Use the signs of the inputs to determine addition or subtraction • Output: 35-bits from Align + 1 Carry Out = 36 bits
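A behavioral model of a variable-length carry-select adder helps show where the muxes sit. This is a sketch under assumptions: the block sizes below are hypothetical (the slides do not list them) and were picked only so that they total the 35 input bits; the result includes the carry out, giving 36 bits.

```python
def carry_select_add(a, b, block_sizes=(4, 5, 6, 7, 7, 6)):
    """Add two unsigned numbers block by block, carry-select style.

    Each block computes its sum for carry-in 0 and for carry-in 1;
    the real carry from the previous block then selects one of them.
    """
    result = 0
    carry = 0
    pos = 0
    for size in block_sizes:
        mask = (1 << size) - 1
        x = (a >> pos) & mask
        y = (b >> pos) & mask
        sum_if_0 = x + y          # precomputed assuming carry-in = 0
        sum_if_1 = x + y + 1      # precomputed assuming carry-in = 1
        chosen = sum_if_1 if carry else sum_if_0   # the mux
        result |= (chosen & mask) << pos
        carry = chosen >> size
        pos += size
    return result | (carry << pos)   # carry out becomes the top bit

print(carry_select_add(123456789, 987654321))
```

Using growing block sizes lets the later blocks finish their two candidate sums at about the time the selecting carry arrives.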
Design Decisions (contd.): • Leading Zero Counter • Carry-Save Adder to count the leading zeroes of C • Reason: To pre-compute the shift amount needed to normalize the result of A*B+C • This speeds up our design because the Leading Zero Counter is not in the critical path (which runs through our multiplier)
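What a leading-zero count delivers to the normalizer can be stated in a few lines. This is a behavioral sketch only: it counts zeros in a word after the fact, whereas the slides describe hardware that works the count out ahead of time. The 36-bit default width is taken from the adder output; the function name is hypothetical.

```python
def leading_zeros(value, width=36):
    """Number of zero bits above the most significant 1 in a `width`-bit word."""
    if value < 0 or value >= (1 << width):
        raise ValueError("value does not fit in the given width")
    return width - value.bit_length()

word = 0b101111010110
shift = leading_zeros(word)
# Shifting left by the count moves the leading 1 to the top bit of the word.
normalized = word << shift
print(shift, bin(normalized))
```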
Design Decisions (contd.): • Align Exponent • Always align the exponent of C to expA + expB • Shift the significand of C by (expA + expB – expC) • If negative, shift left because C is bigger than A*B • If positive, shift right because C is smaller than A*B • Implementation: n-Pass Shifter • Normalize • Format the result of A*B + C to IEEE Format (i.e. change the significand from 101.011… to 1.01011…) • Align the exponent of the result as necessary • n-Pass Shifter to shift the result of the adder by the amount given by the Leading Zero Counter • Round • The result needs to be fit into 16 bits • To preserve precision, we round the result based on the last 3 bits • Implementation: Incrementer and Shifter
Updated Estimated Transistor Count • Registers (input, output, pipelining): 2500 • Threading Logic: 3000 • Carry-Save Multiplier: 5000 • Carry-Select Adder: 2000 • Alignment Shifter: 1500 • Leading 0 Anticipator: 700 • Normalize: 2000 • Rounding: 1500 • Special Cases and Control Logic: 2000 • Total: 20200
Problems and Questions? • Difficulty finding a high-level simulator to exhaustively test our behavioral Verilog, because both MATLAB and C use the IEEE 32-bit format. Currently we are testing our behavioral Verilog thoroughly with test cases worked out by hand. • Suggested solutions: • - Make a scalable 32-bit version of our behavioral Verilog and test it against C • - Find code written for software simulation on the VAX and PDP processors.