CPRE 583 Reconfigurable Computing
Lecture 23: Wed 11/16/2011 (High-Level Acceleration Approaches)
Instructor: Dr. Phillip Jones (phjones@iastate.edu)
Reconfigurable Computing Laboratory, Iowa State University, Ames, Iowa, USA
http://class.ee.iastate.edu/cpre583/
Announcements/Reminders
• HW3 will be assigned as extra credit
• Exam 2: reminder that it has been pushed back to the Friday after Thanksgiving week
• Weekly project updates due Fridays (midnight)
Project Grading Breakdown
• 50% Final project demo
• 30% Final project report
  • 20% of your project report grade will come from your 5-6 weekly project updates (due Fridays at midnight)
• 20% Final project presentation
Project Ideas: Relevant conferences
• Micro • Supercomputing (SC) • HPCA • IPDPS • FPL • FPT • FCCM • FPGA • DAC • ICCAD • Reconfig • RTSS • RTAS • ISCA
Projects: Target Timeline
• Teams formed and topic chosen: Mon 10/10
  • Project idea in PowerPoint, 3-5 slides
    • Motivation (why is this interesting/useful)
    • What the end result will be
    • High-level picture of the final product
  • Project team list: name, responsibility
• High-level plan/proposal: Fri 10/14
  • PowerPoint, 5-10 slides (presented to class Wed 10/19)
    • System block diagrams
    • High-level algorithms (if any)
    • Concerns: implementation, conceptual
    • Related research papers (if any)
Projects: Target Timeline
• Work on projects: 10/19 - 12/9
  • Weekly update reports (more information on updates will be given)
• Presentations: finals week
  • Present/demo what is done at that point
  • 15-20 minutes (depends on the number of projects)
• Final write-up and software/hardware turned in: day of final (TBD)
Initial Project Proposal Slides (5-10 slides)
• Project team list: name, responsibility (who is the project leader)
  • Team size: 3-4 (5 on a case-by-case basis)
• Project idea
  • Motivation (why is this interesting/useful)
  • What the end result will be
  • High-level picture of the final product
• High-level plan
  • Break the project into milestones
  • Provide an initial schedule: schedule aggressively to have the project complete by Thanksgiving; issues will pop up that cause the schedule to slip
  • System block diagrams
  • High-level algorithms (if any)
  • Concerns: implementation, conceptual
  • Research papers related to your project idea
Weekly Project Updates
• The current state of your project write-up
  • Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation sections
• The current state of your final presentation
  • Your initial project proposal presentation (due Wed 10/19) should make a good starting point for your final presentation
• What things are working and not working
• What roadblocks you are running into
Overview • Discuss some high-level approaches for accelerating applications.
What you should learn • Start to get a feel for approaches for accelerating applications.
Profiling Applications
• Finding bottlenecks
• Profiling tools (a small example follows below)
  • gprof: http://www.cs.nyu.edu/~argyle/tutorial.html
  • Valgrind
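As a hedged illustration (not part of the original slides), the sketch below is a toy C program with a deliberately expensive hot spot; compiling with -pg and running gprof, or running it under Valgrind's callgrind tool, should point to the quadratic routine as the bottleneck. The file and function names are made up for the example.

```c
/* profile_demo.c - toy program for trying out gprof/Valgrind.
 * Typical usage (may vary by platform):
 *   gcc -pg -O0 -o profile_demo profile_demo.c
 *   ./profile_demo                 (writes gmon.out)
 *   gprof ./profile_demo gmon.out
 * or:
 *   valgrind --tool=callgrind ./profile_demo
 */
#include <stdio.h>

/* Intentionally slow: O(n^2) accumulation, should dominate the profile. */
static long long slow_sum(int n) {
    long long total = 0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            total += (long long)i * j;
    return total;
}

/* Cheap helper: should barely show up in the profile. */
static long long fast_sum(int n) {
    long long total = 0;
    for (int i = 0; i < n; i++)
        total += i;
    return total;
}

int main(void) {
    printf("slow: %lld\n", slow_sum(5000));
    printf("fast: %lld\n", fast_sum(5000));
    return 0;
}
```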
Pipelining
[Figure: an input vector <A,B,C,D> flows through a chain of four 4-LUTs; in the pipelined version a DFF follows each 4-LUT, forming a 4-stage pipeline.]
• Unpipelined: how many ns to process 100 input vectors, assuming each LUT has a 1 ns delay?
• Pipelined: how many ns to process 100 input vectors, assuming a 1 ns clock and 1 DFF delay per output?
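A hedged worked answer (not part of the original slides), assuming the unpipelined chain must finish all four LUT delays for one vector before starting the next, and the pipelined version accepts one new vector per 1 ns clock after a 4-cycle fill:

$$T_{\text{unpipelined}} = 100 \times 4\,\text{ns} = 400\,\text{ns}, \qquad T_{\text{pipelined}} = (4 + 99) \times 1\,\text{ns} = 103\,\text{ns}$$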
Pipelining (Systolic Arrays): Dynamic Programming
1. Start with the base case (lower-left corner).
2. Apply the formula for computing/numbering cells.
3. The final result is in the upper-right corner.
[Figure: animation of a 3x3 grid filling cell by cell from the lower-left corner; the values build up to 1, 1, 1 on the bottom row, 1, 2, 3 on the middle row, and 1, 3, 6 on the top row, with 6 in the upper-right corner.]
• How many ns to process if a CPU can process one cell per clock (1 ns clock)?
• How many ns to process if an FPGA can obtain maximum parallelism each clock (1 ns clock)?
• What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix; a worked sketch follows below.)
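A hedged sketch of the arithmetic the hint asks for (not from the slides), assuming the CPU fills one cell per 1 ns clock and the FPGA fills a whole anti-diagonal of cells in parallel each clock:

$$T_{\text{CPU}} = N^2\,\text{ns}, \qquad T_{\text{FPGA}} = (2N-1)\,\text{ns}, \qquad \text{speedup} = \frac{N^2}{2N-1} \approx \frac{N}{2}; \quad N = 100: \ \frac{10000}{199} \approx 50\times$$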
Example RNA Model (example courtesy of Dr. James Moscola)
[Figure: an example RNA model for the sequence g a c c a g — states grouped into nodes ROOT0 (S0, IL1, IR2), MATP1 (MP3, ML4, MR5, D6, IL7, IR8), MATL2 (ML9, D10, IL11), and END3 (E12).]
Baseline Architecture Pipeline
[Figure: a residue pipeline (u g g c g a c a c c c) streams residues through a chain of processing elements, one per model state: S0, IL1, IR2 (ROOT0); MP3, ML4, MR5, D6, IL7, IR8 (MATP1); ML9, D10, IL11 (MATL2); E12 (END3).]
Processing Elements
[Figure: datapath of one processing element. For the example cell ML4,3,3 the PE combines its child-state scores at (j=3, d=2) — IL7,3,2 + ML4_t(7), IR8,3,2 + ML4_t(8), ML9,3,2 + ML4_t(9), D10,3,2 + ML4_t(10) — with the emission score ML4_e(A/C/G/U) selected by the input residue x_i, producing ML4,3,3 = .22. A small score table for ML4 over d = 0..3 (entries -INF, .40, .44, .30, .72, .22, ...) is shown alongside.]
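A hedged C sketch of the per-cell update a PE like this might implement (not from the slides; the max-of-children formulation and all names are assumptions for illustration):

```c
#include <float.h>

#define NUM_CHILDREN 4   /* e.g., IL7, IR8, ML9, D10 for state ML4 */

/* One dynamic-programming cell update for a match-left style state:
 * combine each child's score at (j, d-1) with its transition score,
 * keep the best, then add the emission score for the input residue.
 * (Assumed max-of-children form; all names are illustrative.) */
float ml_cell_score(const float child_score[NUM_CHILDREN],  /* child scores at (j, d-1) */
                    const float transition[NUM_CHILDREN],   /* ML4_t(.) in the slide    */
                    const float emission[4],                /* ML4_e(A/C/G/U)           */
                    int residue)                            /* input residue x_i: 0..3  */
{
    float best = -FLT_MAX;  /* plays the role of -INF in the score table */
    for (int k = 0; k < NUM_CHILDREN; k++) {
        float s = child_score[k] + transition[k];
        if (s > best)
            best = s;
    }
    return best + emission[residue];
}
```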
Baseline Results for Example Model
• Comparison to the Infernal software
  • Infernal run on an Intel Xeon 2.8 GHz
  • Baseline architecture run on a Xilinx Virtex-II 4000
    • Occupied 88% of logic resources
    • Ran at 100 MHz
• Input database of 100 million residues
• Bulk of the time was spent on I/O (41.434 s)
Expected Speedup on Larger Models
• Speedup estimated:
  • using a 100 MHz clock
  • for processing a database of 100 million residues
• Speedups range from 500x to over 13,000x
  • Larger models with more parallelism exhibit greater speedups
Distributed Memory
[Figure: an ALU backed by a single cache, contrasted with a processing element (PE) connected directly to several independent BRAMs.]
Short Overview of a Good Reference
• "Achieving High Performance with FPGA-Based Computing" (Reading #11)
• Martin C. Herbordt, 2007
Next Class • Evolvable Hardware
Questions/Comments/Concerns
• Write down:
  • The main point of the lecture
  • One thing that's still not quite clear, OR
  • If everything is clear, an example of how to apply something from the lecture