CPRE 583 Reconfigurable Computing Lecture 10: Wed 9/24/2010 (High-level Acceleration Approaches)

CPRE 583 Reconfigurable Computing Lecture 10: Wed 9/24/2010 (High-level Acceleration Approaches) • Instructor: Dr. Phillip Jones (phjones@iastate.edu) • Reconfigurable Computing Laboratory • Iowa State University • Ames, Iowa, USA • http://class.ee.iastate.edu/cpre583/

Announcements/Reminders • HW2: Due Wed 10/6 • Problem 2 will have a separate deadline (to be announced) • MP2: Due Fri 10/1 (you can work in pairs) • Make sure to read the README file in the MP2 distribution • Contains info on how to fix a Gigabit core licensing issue ISE has • Start thinking of class projects and forming teams • Submit teams and project ideas: Mon 10/11 midnight • Project proposal presentations: Wed 10/20

Projects • Expectations • Working system • Write-up that can potentially be submitted to a conference • Will use the DAC format as the write-up guideline • 15-20 minute PowerPoint presentation • DAC (Design Automation Conference) • http://www2.dac.com/ • Conference papers • Due date: 5pm (MT) Thu 11/18/2010 • Student Design Contest • Due date: 5pm (MT) Wed 11/24/2010, cash prizes!

Project Ideas: Relevant conferences • Micro • Super Computing • HPCA • IPDPS • FPL • FPT • FCCM • FPGA • DAC • ICCAD • Reconfig • RTSS • RTAS • ISCA

Initial Project Proposal Slides (5-10 slides) • Project team list: Name, Responsibility (who is project leader) • Project idea • Motivation (why is this interesting, useful) • What will be the end result • High-level picture of final product • High-level plan • Break project into milestones • Provide initial schedule: I would initially schedule aggressively to have the project complete by Thanksgiving. Issues will pop up and cause the schedule to slip. • System block diagrams • High-level algorithms (if any) • Concerns • Implementation • Conceptual • Research papers related to your project idea

Weekly Project Updates • The current state of your project write-up • Even in the early stages of the project you should be able to write a rough draft of the Introduction and Motivation sections • The current state of your final presentation • Your initial project proposal presentation (due Wed 10/20) should make a good starting point for your final presentation • What is working and what is not working • What roadblocks you are running into

Projects: Target Timeline • Teams Formed and Idea: Mon 10/11 • Project idea in PowerPoint, 3-5 slides • Motivation (why is this interesting, useful) • What will be the end result • High-level picture of final product • Project team list: Name, Responsibility • High-level Plan/Proposal: Wed 10/20 • PowerPoint, 5-10 slides • System block diagrams • High-level algorithms (if any) • Concerns • Implementation • Conceptual • Related research papers (if any)

Projects: Target Timeline • Work on projects: 10/22 - 12/8 • Weekly update reports • More information on updates will be given • Presentations: Last Wed/Fri of class • Present / Demo what is done at this point • 15-20 minutes (depends on number of projects) • Final write up and Software/Hardware turned in: Day of final (TBD)

Common Questions

Overview • First 15 minutes of Google FPGA lecture • How to run gprof • Discuss some high-level approaches for accelerating applications

What you should learn • Start to get a feel for approaches for accelerating applications.

Why use Customized Hardware? • Great talk about the benefits of Heterogeneous Computing • http://video.google.com/videoplay?docid=-4969729965240981475#

Profiling Applications • Finding bottlenecks • Profiling tools • gprof: http://www.cs.nyu.edu/~argyle/tutorial.html • Valgrind
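
The idea of profiling to find a bottleneck can be sketched in a few lines. This is not gprof itself: as a stand-in it uses Python's standard `cProfile` and `pstats` modules, and the functions `hot_loop`, `cheap_setup`, and `workload` are hypothetical names made up for illustration.

```python
import cProfile
import pstats


def hot_loop(n):
    # Hypothetical bottleneck: does almost all of the work.
    total = 0
    for i in range(n):
        total += i * i
    return total


def cheap_setup(n):
    # Hypothetical cheap function: should barely show up in the profile.
    return list(range(n))


def workload():
    cheap_setup(10)
    return hot_loop(200_000)


def profile_tottime(func):
    """Run func under the profiler; return {function name: self time in s}."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tottime, cumtime, callers)
    return {key[2]: value[2] for key, value in stats.stats.items()}


if __name__ == "__main__":
    times = profile_tottime(workload)
    for name in sorted(times, key=times.get, reverse=True)[:3]:
        print(f"{name}: {times[name]:.6f} s")
```

The function with the largest self time is the first candidate for acceleration; gprof's flat profile presents the same kind of ranking for compiled programs.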

Pipelining • Input vector <A,B,C,D> • [Figure: four 4-LUTs and four DFFs, with inputs A, B, C, D and one output] • Question 1: How many ns to process 100 input vectors, assuming each LUT has a 1 ns delay? • Question 2: How many ns to process 100 input vectors, assuming a 1 ns clock (1 DFF delay per output)?
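
One way to reason about the two questions above is sketched below. The wiring of the figure is not recoverable from the slide text, so this assumes (hypothetically) that the four 4-LUTs form a chain of four stages in series, and that in the pipelined case a DFF sits after each stage. The function names are made up for illustration.

```python
def unpipelined_ns(num_vectors, stages=4, lut_delay_ns=1):
    # Assumption: a new vector cannot enter until the previous one has
    # passed through every LUT stage, so each vector costs stages * delay.
    return num_vectors * stages * lut_delay_ns


def pipelined_ns(num_vectors, stages=4, clock_ns=1):
    # Assumption: one register per stage. The first result appears after
    # `stages` clocks; every later result follows one clock apart.
    return (stages + num_vectors - 1) * clock_ns


if __name__ == "__main__":
    print(unpipelined_ns(100))  # 400
    print(pipelined_ns(100))    # 103
```

Under these assumptions, 100 vectors take 400 ns without pipelining and 103 ns with it. The gain approaches the number of stages as the number of vectors grows.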

Pipelining (Systolic Arrays): Dynamic Programming • 1. Start with the base case: lower left corner • 2. Formula for computing the number in each cell • 3. Final result in upper right corner • [Figure: 3x3 grid filled in one anti-diagonal per slide, starting from the lower left corner; final values, top row to bottom row: 1 3 6 / 1 2 3 / 1 1 1] • How many ns to process if a CPU can process one cell per clock (1 ns clock)? • How many ns to process if an FPGA can obtain maximum parallelism each clock (1 ns clock)? • What speedup would an FPGA obtain (assuming maximum parallelism) for a 100x100 matrix? (Hint: find a formula for an NxN matrix)
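
The three questions can be explored with a small sketch. The slides do not state the cell formula, so the rule used here (each cell is the sum of the cell to its left and the cell below it, with the lower left corner set to 1) is an assumption inferred from the grid values 1 1 1 / 1 2 3 / 1 3 6. "Maximum parallelism" is taken to mean that every cell on one anti-diagonal is computed in the same clock. All function names are hypothetical.

```python
def fill_grid(n):
    # Row 0 is the bottom row; grid[0][0] is the lower left base case.
    grid = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if r == 0 and c == 0:
                grid[r][c] = 1  # base case
            else:
                below = grid[r - 1][c] if r > 0 else 0
                left = grid[r][c - 1] if c > 0 else 0
                grid[r][c] = below + left  # assumed formula
    return grid


def sequential_ns(n, clock_ns=1):
    # CPU model: one cell per clock.
    return n * n * clock_ns


def wavefront_ns(n, clock_ns=1):
    # FPGA model: cells with the same r + c depend only on the previous
    # anti-diagonal, so each anti-diagonal takes one clock.
    diagonals = {r + c for r in range(n) for c in range(n)}
    return len(diagonals) * clock_ns


def speedup(n):
    return sequential_ns(n) / wavefront_ns(n)


if __name__ == "__main__":
    print(fill_grid(3)[2][2])                  # upper right corner
    print(sequential_ns(3), wavefront_ns(3))   # CPU vs. FPGA for 3x3
    print(speedup(100))                        # 100x100 case
```

Under this model a 3x3 grid takes 9 ns on the CPU and 5 ns on the FPGA. For an NxN grid the speedup is N^2 / (2N - 1), which is 10000 / 199, or about 50x, for N = 100.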

Dr. James Moscola (Example) • [Figure: example RNA model for the sequence g a c c a g, with nodes ROOT0, MATP1, MATL2, END3 and states S0, IL1, IR2, MP3, ML4, MR5, D6, IL7, IR8, ML9, D10, IL11, E12]

Example RNA Model • [Figure: RNA model for the sequence g a c c a g, with nodes ROOT0, MATP1, MATL2, END3 and states S0, IL1, IR2, MP3, ML4, MR5, D6, IL7, IR8, ML9, D10, IL11, E12]

Baseline Architecture • [Figure: pipeline with nodes END3, MATL2, MATP1, ROOT0 and states E12, IL11, D10, ML9, IR8, IL7, D6, MR5, ML4, MP3, IR2, IL1, S0; a residue pipeline carries the input residues u g g c g a c a c c c]

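Processing Elements • [Figure: processing element for state ML4, computing cell ML4,3,3 = .22 of the ML4 score table (indexed by j and d, with entries -INF, .40, .44, .30, .72, .22). Adders form IL7,3,2 + ML4_t(7), IR8,3,2 + ML4_t(8), ML9,3,2 + ML4_t(9), and D10,3,2 + ML4_t(10); the input residue x_i selects among the emission scores ML4_e(A), ML4_e(C), ML4_e(G), ML4_e(U)]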
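
A rough software model of one processing element is sketched below. The slide shows the sums of child scores and transition scores (the ML4_t terms) and the emission scores (the ML4_e terms), but how they are combined is not recoverable from the text. The sketch assumes the usual dynamic-programming choice of taking the maximum of the sums and then adding the emission score for the input residue. All numeric values are hypothetical and are not taken from the slide.

```python
NEG_INF = float("-inf")


def pe_score(child_scores, transition_scores, emission_scores, residue):
    # Assumed combination rule: best (child + transition) sum, plus emission.
    best = NEG_INF
    for child, score in child_scores.items():
        best = max(best, score + transition_scores[child])
    return best + emission_scores[residue]


if __name__ == "__main__":
    # Hypothetical inputs for a cell of state ML4 with four child states.
    children = {"IL7": NEG_INF, "IR8": -2.0, "ML9": -0.5, "D10": -3.0}
    transitions = {"IL7": -1.0, "IR8": -0.5, "ML9": -0.25, "D10": -1.5}
    emissions = {"A": 0.5, "C": -1.0, "G": -1.0, "U": -1.0}
    print(pe_score(children, transitions, emissions, "A"))
```

In hardware the four additions can run in parallel, which is part of why one such element per state can keep up with a residue pipeline.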

Baseline Results for Example Model • Comparison to Infernal software • Infernal run on Intel Xeon 2.8 GHz • Baseline architecture run on Xilinx Virtex-II 4000 • Occupied 88% of logic resources • Run at 100 MHz • Input database of 100 million residues • Bulk of time spent on I/O (41.434 s)
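
A back-of-the-envelope estimate helps show why I/O can dominate. The sketch below assumes, which the slides do not state, that the pipeline accepts one residue per clock cycle and that pipeline fill time is negligible. The function name and the `residues_per_clock` parameter are hypothetical.

```python
def compute_seconds(num_residues, clock_hz, residues_per_clock=1):
    # Assumption: steady-state throughput of residues_per_clock per cycle.
    cycles = num_residues / residues_per_clock
    return cycles / clock_hz


if __name__ == "__main__":
    # 100 million residues at 100 MHz
    print(compute_seconds(100_000_000, 100e6))
```

Under that assumption the computation itself needs about 1 second for 100 million residues at 100 MHz, so moving the data on and off the device is the larger cost.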

Expected Speedup on Larger Models • Speedup estimated ... • using 100 MHz clock • for processing database of 100 Million residues • Speedups range from 500x to over 13,000x • larger models with more parallelism exhibit greater speedups

Distributed Memory • [Figure: an ALU with a Cache; a PE with four BRAMs]

Next Class • Models of Computation (Design Patterns)

Questions/Comments/Concerns • Write down • Main point of lecture • One thing that's still not quite clear, OR • If everything is clear, then give an example of how to apply something from lecture
