1 / 58

Reconfigurable Architectures

Reconfigurable Architectures. Forces that drive a Reconfigurable Architecture Price Mass production 100K to millions Experimental 1 to 10’s Granularity of reconfiguration Fine grain Course Grain Degree of system integration/coupling Tightly Loosely.

ima
Download Presentation

Reconfigurable Architectures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Reconfigurable Architectures • Forces that drive a Reconfigurable Architecture • Price • Mass production 100K to millions • Experimental 1 to 10’s • Granularity of reconfiguration • Fine grain • Course Grain • Degree of system integration/coupling • Tightly • Loosely All are a function of the application that will run on the Architecture

  2. Example Points in (Price,Granularity,Coupling) Space $1M’s Exec Int Intel / AMD Decode Store float RFU Processor Price Coupling Tight $100’s Loose Coarse PC Ethernet Granularity ML507 Fine

  3. What’s the point of a Reconfigurable Architecture • Performance metrics • Computational • Throughput • Latency • Power • Total power dissipation • Thermal • Reliability • Recovery from faults Increase application performance!

  4. Typical Approach for Increasing Performance • Application/algorithm implemented in software • Often easier to write an application in software • Profile application (e.g. gprof) • Determine where the application is spending its time • Identify kernels of interest • e.g. application spends 90% of its time in function matrix_multiply() • Design custom hardware/instruction to accelerate kernel(s) • Analysis to kernel to determine how to extract fine/coarse grain parallelism (does any parallelism even exist?) Amdahl’s Law!

  5. Granularity

  6. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects Example ALU ALU ALU ALU ALU ALU ALU ALU ALU

  7. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects Example ALU ALU ALU ALU ALU ALU ALU ALU ALU

  8. Granularity: Coarse Grain • rDPA: reconfigurable Data Path Array • Function Units with programmable interconnects Example ALU ALU ALU ALU ALU ALU ALU ALU ALU

  9. Granularity: Fine Grain CLB CLB CLB CLB CLB CLB CLB CLB Configurable Logic Block CLB CLB CLB CLB CLB CLB CLB CLB • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates

  10. Granularity: Fine Grain Configurable Logic Block • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB CLB

  11. Granularity: Fine Grain Configurable Logic Block • FPGA: Field Programmable Gate Array • Sea of general purpose logic gates CLB CLB CLB CLB CLB CLB CLB CLB

  12. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 10-LUT Microprocessor 1024-bits

  13. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 op 3 A 3 10-LUT Microprocessor 3 B 1024-bits

  14. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 op 3 A 3 10-LUT Microprocessor 3 B 4 op 1024-bits 3 A 3 B 3 4 op 3 3 A B 3

  15. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 op A 3 10-LUT Microprocessor 3 3 B 1024-bits op A 3 3 3 B 4 op 3 A 3 B 3

  16. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 4 4 op op A A 3 3 10-LUT Microprocessor 3 3 3 3 B B 1024-bits 4 op 3 A 3 3 B

  17. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 10-LUT Bit logic and constants 1024-bits

  18. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT 10-LUT Bit logic and constants 1024-bits (A and “1100”) or (B or “1000”)

  19. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT A 10-LUT B Bit logic and constants 1024-bits (A and “1100”) or (B or “1000”)

  20. Granularity: Trade-offs Trade-offs associated with LUT size Example: 2-LUT (4=2x2 bits) vs. 10-LUT (1024=32x32 bits) 1024-bits 2-LUT AND 4 A 10-LUT 1 Bit logic and constants 1024-bits OR Area that was required using 2-LUTS (A and “1100”) or (B or “1000”) 0 OR 4 B It’s much worse, each 10-LUT only has one output

  21. Granularity: Example Architectures • Fine grain: GARP • Course grain: PipeRench

  22. Granularity: GARP Memory D-cache I-cache CPU RFU Config cache Garp chip

  23. Granularity: GARP Memory RFU Execution (16, 2-bit) control (1) D-cache I-cache CPU RFU N Config cache PE (Processing Element) Garp chip

  24. Granularity: GARP Memory RFU Execution (16, 2-bit) control (1) D-cache I-cache CPU RFU N Config cache PE (Processing Element) Garp chip Example computations in one cycle A<<10 | (b&c) (A-2*b+c)

  25. Granularity: GARP Memory • Impact of configuration size • 1 GHz bus frequency • 128-bit memory bus • 512Kbits of configuration size D-cache I-cache On a RFU context switch how long to load a new full configuration? CPU RFU 4 microseconds An estimate of amount of time for the CPU perform a context switch is ~5 microseconds Config cache Garp chip ~2x increase context switch latency!!

  26. Granularity: GARP Memory RFU Execution (16, 2-bit) control (1) D-cache I-cache CPU RFU N Config cache PE (Processing Element) Garp chip • “The Garp Architecture and C Compiler” • http://www.cs.cmu.edu/~tcal/IEEE-Computer-Garp.pdf

  27. Granularity: PipeRench • Coarse granularity • Higher (higher) level programming • Reference papers • PipeRench: A Coprocessor for Streaming Multimedia Acceleration (ISCA 1999): http://www.cs.cmu.edu/~mihaib/research/isca99.pdf • PipeRench Implementation of the Instruction Path Coprocessor (Micro 2000): http://class.ee.iastate.edu/cpre583/papers/piperench_Micro_2000.pdf

  28. Granularity: PipeRench PE PE PE PE PE PE PE PE PE 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU 8-bit ALU Reg file Reg file Reg file Reg file Reg file Reg file Reg file Reg file Reg file Interconnect Global bus Interconnect

  29. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 PE PE PE PE PE PE PE PE PE PE PE PE

  30. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 PE PE PE PE PE PE PE PE PE PE PE PE

  31. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 PE PE PE PE 1 PE PE PE PE PE PE PE PE

  32. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 PE PE PE PE 1 1 2 PE PE PE PE PE PE PE PE

  33. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 PE PE PE PE 1 1 1 2 2 PE PE PE PE 3 PE PE PE PE

  34. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 4 PE PE PE PE

  35. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE

  36. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2

  37. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0

  38. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 1

  39. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 1 1 2

  40. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 3 1 1 1 2 2

  41. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 3 3 1 1 1 4 2 2 2

  42. Granularity: PipeRench Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 3 4 0 0 0 0 PE PE PE PE 1 1 1 2 2 2 PE PE PE PE 3 3 3 4 4 PE PE PE PE Cycle 1 2 3 4 5 6 Pipeline stage 0 1 2 0 0 0 3 3 3 1 1 1 4 4 2 2 2 0

  43. Degree of Integration/Coupling • Independent Reconfigurable Coprocessor • Reconfigurable Fabric does not have direct communication with the CPU • Processor + Reconfigurable Processing Fabric • Loosely coupled on the same chip • Tightly coupled on the same chip

  44. Degree of Integration/Coupling CPU Execute Write Back Memory Decode ALU Fetch FPU DMA Controller L1 Cache Main Memory Memory Controller L2 Cache I/O Controller USB PCI PCI-Express SATA Hard Drive NIC

  45. Degree of Integration/Coupling CPU Execute Write Back Memory Decode ALU Fetch FPU DMA Controller L1 Cache Main Memory Memory Controller L2 Cache I/O Controller USB PCI PCI-Express SATA RPF Hard Drive NIC

  46. Degree of Integration/Coupling CPU Execute Write Back RPF Memory Decode ALU Fetch FPU DMA Controller L1 Cache Main Memory Memory Controller L2 Cache I/O Controller USB PCI PCI-Express SATA Hard Drive NIC

  47. Degree of Integration/Coupling CPU Execute Write Back Memory Decode ALU Fetch FPU DMA Controller L1 Cache Config I/F Main Memory Memory Controller L2 Cache I/O Controller RPF USB PCI PCI-Express SATA Hard Drive NIC

  48. Degree of Integration/Coupling CPU Execute Write Back Memory Decode ALU Fetch FPU DMA Controller L1 Cache Config I/F Main Memory Memory Controller L2 Cache I/O Controller RPF USB PCI PCI-Express SATA Hard Drive NIC

  49. Degree of Integration/Coupling CPU Execute Write Back Memory Decode ALU Fetch FPU DMA Controller L1 Cache Config I/F Main Memory Memory Controller L2 Cache I/O I/O Controller RPF USB PCI PCI-Express SATA Hard Drive NIC

  50. Degree of Integration/Coupling CPU Execute Write Back Memory Decode ALU Fetch FPU RFU DMA Controller L1 Cache Main Memory Memory Controller L2 Cache I/O Controller USB PCI PCI-Express SATA Hard Drive NIC

More Related