
AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS

AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS. MAPLD-99 Laurel, MD September 29, 1999 Senthil Natarajan, Ben Levine, Chandra Tan, Danny Newport and Don Bouldin Electrical & Computer Engineering University of Tennessee Knoxville, TN 37996-2100


Presentation Transcript


  1. AUTOMATIC MAPPING OF KHOROS-BASED APPLICATIONS TO ADAPTIVE COMPUTING SYSTEMS MAPLD-99 Laurel, MD September 29, 1999 Senthil Natarajan, Ben Levine, Chandra Tan, Danny Newport and Don Bouldin Electrical & Computer Engineering University of Tennessee Knoxville, TN 37996-2100 TEL: (423)-974-5444 FAX: (423)-974-8245 dbouldin@utk.edu

  2. INITIAL DESIGN CAPTURE AND ALGORITHM VERIFICATION USING KHOROS. Processing blocks shown: Low Pass Filter, Target Recon., Region Merge, Stat. Calc., Mask Convol. Result: from the Input Image, the Output Image correctly detected 3 tanks but also marked 1 false target (a truck).

  3. KHOROS/CANTATA IS A VISUAL PROGRAMMING LANGUAGE FOR PROTOTYPING ALGORITHMS

  4. ADAPTIVE COMPUTING SYSTEMS CONSIST OF ACCELERATOR BOARDS OF FPGAS. Multiple FPGAs on a printed circuit board can be tightly coupled with a host CPU to accelerate low-level computations. Our Wildforce board has five Xilinx FPGAs (13K to 85K gates), each with 512K x 32-bit RAM.

  5. CURRENT STATE-OF-THE-ART. [Figure labels: KHOROS; MISSING LINK; PARTITIONING ONTO MULTIPLE FPGAS; MISSING LINK.]

  6. CHAMPION WILL AUTOMATICALLY MAP KHOROS DESIGNS ONTO ADAPTIVE COMPUTING SYSTEMS

  7. CHAMPION WILL IMPROVE PRODUCTIVITY
     • GOAL: Automate the mapping of Khoros-based applications onto adaptive computing systems to improve designer productivity by 100x.
     • IMPACT:
       • More application designers will be able to achieve higher quality implementations in less time.
       • Adaptive computing systems will be utilized more effectively and by a wider audience.
     [Figure: two timelines on a TIME (WEEKS) axis. First: "Manual Mapping Onto An Adaptive Computing System", from KHOROS to ACS. Second: "Champion Will Improve Productivity By Using Estimation and Automatic Mapping of Precompiled Library Primitives", from KHOROS through ESTIMATION to ACS.]

  8. OUTLINE OF THIS PRESENTATION
     • Application Development Flow
     • Library Development and Verification
     • Manual Implementation
     • ATR Executions on ACS
     • Automated Partitioning Algorithms
     • Lessons Learned and Future Plans

  9. APPLICATION DEVELOPMENT FLOW: APPLICATION -> KHOROS/CANTATA -> DATA WIDTH MATCHING & SYNCHRONIZATION -> PARTITIONING -> SYNTHESIS & PLACEMENT/ROUTING -> ADAPTIVE COMPUTING SYSTEM. Additional inputs shown on the flow: Precompiled Libraries; Destination Hardware Architecture.

  10. KHOROS/CANTATA IMPLEMENTATION--> TOP LEVEL

  11. KHOROS/CANTATA IMPLEMENTATION--> FIND TARGETS

  12. KHOROS/CANTATA IMPLEMENTATION--> MARK FRAME PIXELS

  13. ALGORITHM STRUCTURE: Find Targets and Label Image. The target pixel map is then used to identify square regions that are considered to contain targets. These target regions are then masked off (it is assumed that there is only one target per region). The target region location is then used to draw a frame that identifies the target in the output image. This is repeated six times.
     [Figure labels: Input FLIR Image; Target Pixel Map; Identify Target Region; Insert Target Frame; Frame Map; REPEAT 6 TIMES; Merge Images; Output Image.]
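
The find, mask and frame loop described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Khoros or hardware implementation: the raster-scan search order, the region being centered on the first pixel found, the `half` region size, and all function names are hypothetical choices made for the example.

```python
def find_first_target_pixel(pixel_map):
    """Return (row, col) of the first set pixel in raster order, or None."""
    for r, row in enumerate(pixel_map):
        for c, value in enumerate(row):
            if value:
                return r, c
    return None


def region_bounds(center, half, height, width):
    """Square region of side 2*half+1 around center, clamped to the image."""
    r, c = center
    return (max(r - half, 0), min(r + half, height - 1),
            max(c - half, 0), min(c + half, width - 1))


def label_targets(pixel_map, half=2, max_targets=6):
    """Find up to max_targets regions; return (regions, frame_map)."""
    height, width = len(pixel_map), len(pixel_map[0])
    work = [row[:] for row in pixel_map]      # leave the caller's map untouched
    frame_map = [[0] * width for _ in range(height)]
    regions = []
    for _ in range(max_targets):              # "repeated six times" on the slide
        hit = find_first_target_pixel(work)
        if hit is None:                       # assumption: stop early if nothing is left
            break
        top, bottom, left, right = region_bounds(hit, half, height, width)
        regions.append((top, bottom, left, right))
        for r in range(top, bottom + 1):
            for c in range(left, right + 1):
                work[r][c] = 0                # mask off: one target per region
                if r in (top, bottom) or c in (left, right):
                    frame_map[r][c] = 1       # border pixels form the frame
    return regions, frame_map
```

Unlike this sketch, a streaming hardware version would presumably run all six passes unconditionally, since each pass is a fixed pipeline stage.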

  14. DEVELOP AND PRECOMPILE LIBRARY CELLS. Each library primitive will be developed at each level, verified, and characterized. Levels shown: KHOROS--C Floating Point; KHOROS--C Fixed Point; VHDL; FPGA. Each level is exercised with Test Inputs and its Responses are checked.

  15. KHOROS AND CHAMPION LIBRARY CELLS
     • Khoros Traditional Cells:
       • Some Khoros cells have multiple functions for the user to select.
       • A single cell can handle all input dimension sizes.
       • Cells can handle inputs of any data type.
       • Data between cells are stored on the host CPU system as temp files.
       • Khoros handles data movement between cells. Each cell begins its execution only after all its inputs have been written onto the host file system.
     • Champion Cells:
       • Each hardware cell has only one specific function and one data type.
       • Hardware cells are parametrized to correspond to the desired data bit widths.
       • Data is transferred between hardware cells sequentially, one pixel per clock cycle.
       • Data arrival at hardware cells must be synchronized, which Champion does by inserting delay elements.

  16. DATA BIT-WIDTHS MUST BE MATCHED. [Figure: adder network from IN to OUT. Cells shown: CONVSTREAM_8_256_256; ADD_8 (four instances); ADD_9 (two instances); ADD_10; ADD; RIGHT SHIFT 3; CLIP HIGH 255. Edges are labeled with data bit-widths from 8 to 11.]
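
The width growth behind this slide can be checked with a little arithmetic. The sketch below assumes unsigned data and assumes that the network sums nine 8-bit values (a reading of the figure suggested by the four ADD_8, two ADD_9 and one ADD_10 cells, plus the PAD_HIGH_8_11 and ADD_11 cells named on the following slide); the helper names are made up for the example.

```python
def add_width(a_bits, b_bits):
    """An unsigned a + b needs one bit more than the wider operand."""
    return max(a_bits, b_bits) + 1


def max_unsigned(bits):
    """Largest value representable in the given number of bits."""
    return (1 << bits) - 1


def adder_tree_widths(n_inputs=9, in_bits=8):
    """Bit-widths after each stage of a pairwise adder tree."""
    widths = [in_bits] * n_inputs
    stages = [widths[:]]
    while len(widths) > 1:
        nxt = [add_width(widths[i], widths[i + 1])
               for i in range(0, len(widths) - 1, 2)]
        if len(widths) % 2:
            nxt.append(widths[-1])   # odd operand waits; it must be padded later
        widths = nxt
        stages.append(widths[:])
    return stages


stages = adder_tree_widths()
# stages: [8]*9 -> [9, 9, 9, 9, 8] -> [10, 10, 8] -> [11, 8] -> [12]
worst_sum = 9 * max_unsigned(8)      # 2295, fits in 12 bits
after_shift = worst_sum >> 3         # 286, still above 255
needs_clip = after_shift > max_unsigned(8)
```

Under these assumptions the leftover 8-bit operand has to be padded to 11 bits before the last addition, and the shifted result can still exceed 255, which is why a clip stage precedes the final 8-bit output.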

  17. DATA MUST BE SYNCHRONIZED DUE TO DIFFERENT PATH DELAYS. Data synchronization error! Input times are not equal. [Figure: bit-width-matched adder network from IN to OUT, annotated with cell latency L and arrival time T. Cells shown: CONVSTREAM_8_256_256 (L = 257); PAD_HIGH_8_11 (L = 0); ADD_8 (four instances, L = 1); ADD_9 (two instances, L = 1); ADD_10 (L = 1); ADD_11; RIGHT_SHIFT_12_3; CLIP_HIGH_12_255; TRUNCATE_HIGH_12_8. Arrival-time labels: T=0 at IN, then T=257, T=258, T=259 and T=260. Edges are labeled with bit-widths from 8 to 12.]
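
One way to compute the delay elements that fix this kind of error is a longest-path pass over the task graph. The sketch below is an illustration, not Champion's actual code: the graph wiring is a reading of the figure, the latency of 1 for ADD_11 is an assumption (the slide does not show it), and the instance suffixes `_a` to `_d` are invented to tell identical cells apart.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def synchronize(latency, preds):
    """Return (arrival, delays).

    arrival[cell] is the time the cell's output is valid; delays[(src, dst)]
    is the number of delay cycles to insert on that edge so that all inputs
    of dst arrive together.
    """
    arrival, delays = {}, {}
    for cell in TopologicalSorter(preds).static_order():
        inputs = preds.get(cell, ())
        start = max((arrival[p] for p in inputs), default=0)
        for p in inputs:
            if arrival[p] < start:
                delays[(p, cell)] = start - arrival[p]
        arrival[cell] = start + latency[cell]
    return arrival, delays


latency = {"IN": 0, "CONVSTREAM": 257, "PAD_HIGH_8_11": 0,
           "ADD_8_a": 1, "ADD_8_b": 1, "ADD_8_c": 1, "ADD_8_d": 1,
           "ADD_9_a": 1, "ADD_9_b": 1, "ADD_10": 1,
           "ADD_11": 1}  # ADD_11 latency is assumed
preds = {
    "CONVSTREAM": ["IN"],
    "PAD_HIGH_8_11": ["CONVSTREAM"],
    "ADD_8_a": ["CONVSTREAM"], "ADD_8_b": ["CONVSTREAM"],
    "ADD_8_c": ["CONVSTREAM"], "ADD_8_d": ["CONVSTREAM"],
    "ADD_9_a": ["ADD_8_a", "ADD_8_b"],
    "ADD_9_b": ["ADD_8_c", "ADD_8_d"],
    "ADD_10": ["ADD_9_a", "ADD_9_b"],
    "ADD_11": ["ADD_10", "PAD_HIGH_8_11"],
}
arrival, delays = synchronize(latency, preds)
```

With this wiring the adder chain delivers its sum at T=260 while the padded operand is ready at T=257, so the only edge needing a delay element is the one from PAD_HIGH_8_11 into ADD_11, with three cycles.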

  18. ORIGINAL KHOROS TASK GRAPH. [Figure: task graph whose nodes carry S and D annotations and whose edges are labeled with bit-widths (1, 4, 8, 9, 10, 11). Nodes shown, as name (S, D): RAM_Read_pf4_var_8 (32, -); Sobel_8_8_256_256 (404, 262); Lowpass_8_8_256_256 (346, 262); START_Mean_SD (354, S+14), two instances; shift_left_9_1 (0, 0); shift_left_10_2 (0, 0); add_9 (12, 1); add_10 (13, 1); and_1 (4, 1); Lowpass_1_4_256_256 (168, 262), two instances; gte_4_4 (9, 1), two instances; gte_8 (11, 1), two instances; MITR (63, 5). Some nodes are also tagged R or M.]

  19. HARDWARE TASK GRAPH WITH DATA BIT-WIDTH MATCHED AND SYNCHRONIZED. [Figure: the task graph of the previous slide with matching and buffering cells added. Nodes shown, as name (S, D): RAM_Read_pf4_var_8 (32, -); Sobel_8_8_256_256 (404, 262); Lowpass_8_8_256_256 (346, 262); START_Mean_SD (354, S+14), two instances; RAM_buffer_pf4_8 (56, 16+S), two instances; pad_8_9 (0, 0), two instances; pad_8_10 (0, 0), two instances; shift_left_9_1 (0, 0); shift_left_10_2 (0, 0); add_9 (12, 1); add_10 (13, 1); and_1 (4, 1); clip_high_10_8 (11, 1); clip_high_11_8 (11, 1); trunc_high_10_8 (0, 0); trunc_high_11_8 (0, 0); Lowpass_1_4_256_256 (168, 262), two instances; gte_4_4 (9, 1), two instances; gte_8 (11, 1), two instances; MITR (63, 5). Some nodes are also tagged R or M.]

  20. OUR WILDFORCE ACS USED AS A LINEAR ARRAY. [Figure: five processing elements, PE0 through PE4, consisting of one Xilinx 4036XL Series FPGA and four Xilinx 4013XL Series FPGAs, each with Local RAM. Other labels: PCI Interface; Local Bus; Crossbar; "32 = 36-bit Data Path".]

  21. PARTITION EARLY INSTEAD OF LATE TO SHORTEN THE HARDWARE MAPPING TIME
     EARLY. Stages shown: Design Input in Khoros; Workspace to Netlist; Precompiled Library Cells; K-way partitioning + Global Place & Route; Merge and Place & Route for each partition P1, P2, P3, ..., Pk; SUCCESS; Hardware Configuration.
     • Coarser granularity -> smaller netlist.
     • Hierarchical and functional flow information are preserved.
     • Timing synchronization is greatly facilitated.
     • Less resource utilization.
     LATE. Stages shown: Design Input in Khoros; VHDL; Synthesis; Technology Mapping; Optimizer; Flatten; K-way partitioning + Global Place & Route; Place & Route for each partition P1, P2, P3, ..., Pk; SUCCESS; Hardware Configuration.
     • Finer granularity -> larger netlist.
     • Functional and algorithmic flow of the design are lost.
     • Timing synchronization can be a problem.
     • More resource utilization.
     • The resulting subcircuits are more likely to be placeable and routable.

  22. MULTI-FPGA PARTITIONING. [Figure: the NETLIST N is partitioned step by step. P0 is taken from N, P1 from the remainder N - P0, P2 from N - P0 - P1, and so on, giving partitions P0 through P4.]
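
The peel-off pattern in this figure (take P0 from N, then P1 from N - P0, and so on) can be illustrated with a simple greedy pass. This is a stand-in sketch only and should not be read as Champion's partitioning algorithm: it assumes the cells are already in pipeline order, treats the S values from the task-graph slide as cell sizes (an assumption), and uses a made-up capacity of 500 per FPGA.

```python
def peel_partitions(cells, capacities):
    """Greedily fill one FPGA at a time from the front of the cell list.

    cells: list of (name, size) in pipeline order.
    capacities: per-FPGA capacity, in the same units as size.
    Returns one list of cell names per FPGA used.
    """
    remaining = list(cells)                 # this is N
    partitions = []
    for capacity in capacities:
        part, used = [], 0
        while remaining and used + remaining[0][1] <= capacity:
            name, size = remaining.pop(0)
            part.append(name)
            used += size
        partitions.append(part)             # remaining is now N - P0 - ... - Pi
        if not remaining:
            break
    if remaining:
        raise ValueError("netlist does not fit: %d cells left" % len(remaining))
    return partitions


# Hypothetical example: names and sizes borrowed from the task-graph slide.
cells = [("Sobel", 404), ("Lowpass", 346),
         ("Mean_SD_a", 354), ("Mean_SD_b", 354),
         ("Lowpass_1_4_a", 168), ("Lowpass_1_4_b", 168), ("MITR", 63)]
capacities = [500] * 5                      # illustrative, not a real device figure
partitions = peel_partitions(cells, capacities)
```

A real partitioner would also have to weigh inter-FPGA pin counts and RAM access conflicts, which this sketch ignores.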

  23. TIMING RESULTS FOR ATR ON OUR WILDFORCE
     • OUR WILDFORCE ACS IS 156X FASTER THAN KHOROS/CPU NOW.
     • IF WE HAVE SUFFICIENT LOGIC AND MEMORY SUCH THAT NO RECONFIGURATIONS ARE NEEDED, THE ACS COULD BE 667X FASTER.
     • IF FULLY PIPELINED, THE ACS COULD BE 32,000X FASTER.
     [Bar chart, axis 0 to 6000, units not given in the transcript: Data Processing 33; Data Transfer 34; Host Code 1544; Reconfiguration 5159.]
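
The chart values can be related to the quoted speedups with a short calculation. The model below is an assumption made for illustration: it supposes that the four bars add up to the total ACS run time, that removing reconfiguration simply drops that bar, and that a fully pipelined design leaves only the data-processing time. The slide does not state how its projections were derived.

```python
# Bar-chart values from the slide (units not given in the transcript).
breakdown = {
    "Data Processing": 33,
    "Data Transfer": 34,
    "Host Code": 1544,
    "Reconfiguration": 5159,
}
total = sum(breakdown.values())                       # 6770
shares = {name: value / total for name, value in breakdown.items()}

CURRENT_SPEEDUP = 156                                 # ACS vs. Khoros/CPU, from the slide


def projected_speedup(remaining_time):
    """Scale the measured speedup by the run time that would be removed."""
    return CURRENT_SPEEDUP * total / remaining_time


no_reconfig = projected_speedup(total - breakdown["Reconfiguration"])
fully_pipelined = projected_speedup(breakdown["Data Processing"])
```

Under this model reconfiguration accounts for about 76% of the run time, the fully pipelined projection comes out at roughly 32,000x as on the slide, and the no-reconfiguration projection comes out near 656x, close to but not identical to the quoted 667x, so the authors presumably used a slightly different accounting.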

  24. PARTITIONING - 1st BOARD CONFIGURATION PHASE. [Figure: blocks mapped onto CPE0 and PE1 through PE4 with local RAM. Blocks shown: Input Image; Low-Pass Filter; Sobel Filter; Compute Edge Stats; Compute Intensity Stats; Check Edge Stats; Check Intensity Stats; AND; Low-Pass Filter with Check >= 4 (two instances); Mask Invalid Target Region; Blank Frame Map; Find First Target Pixel; Mask Target Pixels; Mark Frame Pixels; Write to RAM - A. Numeric labels, in transcript order: 11, 11, 4, 4, 4, 554, 500, 11, 11, 11, 1296, 72, 4, 548.]

  25. PARTITIONING - 2nd BOARD CONFIGURATION PHASE. [Figure: blocks mapped onto CPE0 and PE1 through PE4 with local RAM. Blocks shown: Read from RAM - A; Find First Target Pixel, Mask Target Pixels and Mark Frame Pixels (three instances of each); Write to RAM - B. Numeric labels, in transcript order: 4, 4, 4, 4, 500, 500, 5, 53, 4, 4, 72, 500.]

  26. PARTITIONING - 3rd BOARD CONFIGURATION PHASE. [Figure: blocks mapped onto CPE0 and PE1 through PE4 with local RAM. Blocks shown: Read from RAM - B; Find First Target Pixel, Mask Target Pixels and Mark Frame Pixels (two instances of each); Write to RAM - C. Numeric labels, in transcript order: 4, 4, 0, 500, 5, 53, 4, 4, 72, 500.]

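  27. PARTITIONING - 4th BOARD CONFIGURATION PHASE. [Figure: blocks mapped onto CPE0 and PE1 through PE4 with local RAM. Blocks shown: Read from RAM - C; Find Max Intensity; Combine Image and Frames; Input Image; Output Image. Numeric labels, in transcript order: 4, 11, 11, 11, 119, 75, 11, 53, 4, 11, 11, 72, 90.]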

  28. PRODUCTIVITY IMPROVEMENT IS 100X (250 hours manually vs. 2.5 hours automatically). [Figure: time comparison of the Manual and Automatic flows from Application to ACS. Stages shown: Khoros; Partitioning Suite; Data Matching; Data Synchronization; WSP2NETLIST; NETLIST2STV; Synthesis/Place & Route.]

  29. LESSONS LEARNED
     • Learned that the translation from KHOROS to hardware is complicated by several factors, including:
       • Differences in the way blocks of data are passed from operator to operator.
       • Parameters for data bit-widths must be specified for each cell.
       • The difference between data-driven KHOROS cells and clock-driven hardware cells creates a need for data synchronization.
     • Determined that reconfiguration time was the major obstacle to achieving high performance, and that RAM access conflicts required more reconfigurations than would otherwise be necessary.
     • Learned that manual implementation of KHOROS applications on WildForce is very time-consuming and tedious (250 hours).
     • Thus, great potential exists for making a significant (100x) improvement in productivity via automation.

  30. SCHEDULE AND MILESTONES
     • May 98: Demonstrated the manual mapping of a simple KHOROS network on a Xilinx-based ACS (EVC-1). We also validated our method for library development at the KHOROS, VHDL and FPGA levels.
     • Sep 98: Demonstrated the manual mapping of a more complex KHOROS network on a Xilinx-based ACS (Annapolis Microsystems Wildforce).
     • Mar 99: Demonstrated the manual mapping of a complex KHOROS network with some automated FPGA partitioning on the Wildforce.
     • Sep 99: Automated additional portions of the application development flow.
     • Jan 00: Will demonstrate the Army Night Vision Lab challenge problem with automatic mapping onto the Wildforce.
     • Mar 00: Will demonstrate two additional challenge problems (e.g. Face Detection and Image Backprojection) on the Wildforce.
     • Sep 00: Will demonstrate all three challenge problems on two additional ACS platforms (e.g. Altera-based ACS and latest Xilinx-Virtex ACS).

  31. CHAMPION: A SOFTWARE DESIGN ENVIRONMENT FOR ADAPTIVE COMPUTING SYSTEMS
     KEY IDEAS
     • Khoros-based designs will be automatically linked to Adaptive Computing Systems.
     • Mapping onto ACS will be accelerated using precompiled library primitives (semicustom approach).
     • Metrics and visualization will reveal low-level details to the application designer as warranted.
     • The application designer will guide the partitioning of tasks between hardware and software.
