
Allen Michalski CSE Department – Reconfigurable Computing Lab University of South Carolina

Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer


Presentation Transcript


  1. Allen Michalski CSE Department – Reconfigurable Computing Lab University of South Carolina Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer

  2. Outline • Reconfigurable Computing – Introduction • SRC-6e architecture, programming model • Sorting Algorithms • Design guidelines • Testing Procedures, Results • Conclusions, Future Work • Lessons learned

  3. What is a Reconfigurable Computer? • Combination of: • Microprocessor workstation for frontend processing • FPGA backend for specialized coprocessing • Typical PC bus for communications

  4. What is a Reconfigurable Computer? • PC Characteristics • High clock speed • Superscalar, pipelined • Out-of-order issue • Speculative execution • Programmed in a high-level language • FPGA Characteristics • Low clock speed • Large number of configurable elements • LUTs, Block RAMs, CPAs • Multipliers • Programmed in a hardware description language (HDL)

  5. What is the SRC-6e? • SRC = Seymour R. Cray • RC with high-throughput memory interface • 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads • PCI-X (1.0) = 1.064 GB/s
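To put the quoted SNAP rates in perspective, the sketch below estimates how long it would take to move the benchmark's operand list (500,000 values of 128 bits, per the testing slide) at those rates. It assumes 1 MB = 10^6 bytes and that the link sustains its quoted rate for the whole transfer; the function name `transfer_ms` is illustrative and not part of any SRC library.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal transfer time in milliseconds for `count` operands of
 * `bytes_each` bytes at `mb_per_s` megabytes per second.
 * Assumption: 1 MB = 10^6 bytes and the quoted rate is sustained. */
double transfer_ms(uint64_t count, uint64_t bytes_each, double mb_per_s)
{
    assert(mb_per_s > 0.0);
    double bytes = (double)count * (double)bytes_each;
    return bytes / (mb_per_s * 1e6) * 1e3;
}
```

Under these assumptions, `transfer_ms(500000, 16, 1415.0)` comes to roughly 5.7 ms for the write direction and `transfer_ms(500000, 16, 1280.0)` to 6.25 ms for reads, both small next to the multi-second sort times reported later in the deck.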

  6. SRC-6e Development • Programming does not require knowledge of HW design • C code can compile to hardware

  7. SRC Design Objectives • FPGA Considerations • Superscalar design • Parallel, pipelined execution • SRC Considerations • High overall data throughput • Streaming versus non-streaming data transfer? • Reduction of FPGA data processing stalls due to data dependencies, data read/write delays • FPGA Block RAM versus SRC OnBoard Memory? • Evaluate software/hardware partitioning • Algorithm partitioning • Data size partitioning

  8. Sorting Algorithms • Traditional Algorithms • Comparison Sorts: Θ(n lg n) at best in the worst case • Insertion sort • Merge sort • Heapsort • Quicksort • Counting Sorts • Radix sort: Θ(d(n+k)) for n keys of d digits, each digit taking one of k values • HPCS FORTRAN code baseline • Radix sort in combination with heapsort • This research focuses on 128-bit operands • SRC simplified data transfer, management
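As an illustration of the Θ(d(n+k)) radix sort bound on 128-bit operands, here is a minimal software sketch. It is not the HPCS FORTRAN baseline: it assumes 8-bit digits (k = 256, so d = 16 passes), and the `key128` type and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A 128-bit operand stored as two 64-bit halves, mirroring the
 * high/low split used in the deck. The type name is illustrative. */
typedef struct { uint64_t hi, lo; } key128;

/* Extract the 8-bit digit at position d (0 = least significant). */
static unsigned digit(const key128 *k, int d)
{
    uint64_t word = (d < 8) ? k->lo : k->hi;
    return (unsigned)((word >> (8 * (d % 8))) & 0xFF);
}

/* LSD radix sort: 16 passes of a stable counting sort with 256
 * buckets. Returns 0 on success, -1 if the scratch buffer cannot
 * be allocated. */
int radix_sort128(key128 *a, size_t n)
{
    assert(a != NULL || n == 0);
    if (n < 2) return 0;
    key128 *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return -1;

    for (int d = 0; d < 16; d++) {
        size_t count[257] = {0};
        for (size_t i = 0; i < n; i++)      /* histogram of digit d */
            count[digit(&a[i], d) + 1]++;
        for (int b = 0; b < 256; b++)       /* prefix sums give bucket starts */
            count[b + 1] += count[b];
        for (size_t i = 0; i < n; i++)      /* stable scatter into tmp */
            tmp[count[digit(&a[i], d)]++] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
    return 0;
}
```

Each pass touches all n keys and all k buckets once, which is where the d(n+k) term comes from; a wider digit lowers d but raises k, the trade-off the deck later evaluates for radix 4, 8 and 16.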

  9. Sorting – SRC FPGA Implementation • Memory Constraints • SRC onboard memory • 6 banks x 4 MB • Pipelined read or write access • 5 clock latency • FPGA BRAM memory • 144 blocks, 18 Kbit each • 1 clock read and write latency • Initial Choices • Parallel Insertion Sort (BubbleSort) • Produces sorted blocks • Use of onboard memory pipelined processing • Minimize data access stalls • Parallel Heapsort • Random access merge of sorted lists • Use of BRAM for low latency access • Good for random data access

  10. Parallel Insertion Sort (BubbleSort) • Systolic array of cells • Pipelined SRC processing from OnBoard Memory • Keeps highest value, passes other values • Latency 2x number of cells

  11. Parallel Insertion Sort (BubbleSort) • Systolic array of cells • Results passed out in reverse order of comparison • N = # comparator cells • Sorts a list completely in Θ(L²) • Limit sort size to some number a < L (list size) • Create multiple sorted lists • Each list sorted in Θ(a)
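The cell behavior described on the two slides above (keep the highest value, pass the others along) can be simulated in software. The sketch below is a sequential model only, not the MAP hardware design: it uses 64-bit values instead of 128-bit operands, the function names are hypothetical, and the drain order (last cell first) is an assumption. In hardware all cells compare in the same clock, which is what brings a block of a values down to Θ(a) time; the sequential loop here does the same comparisons one after another.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feed one value through the chain. Each occupied cell keeps the larger
 * of its stored value and the incoming one and passes the smaller on.
 * `filled` is the number of cells that already hold a value. */
static void systolic_push(uint64_t *cell, size_t filled, uint64_t v)
{
    for (size_t c = 0; c < filled; c++) {
        if (v > cell[c]) {          /* keep the higher value ... */
            uint64_t t = cell[c];
            cell[c] = v;
            v = t;                  /* ... and pass the other one on */
        }
    }
    cell[filled] = v;               /* first empty cell latches the remainder */
}

/* Sort n values ascending using n of the ncells simulated cells.
 * After loading, the cells hold the values in descending order, so
 * reading them from the last occupied cell back to the first yields
 * ascending output. */
void systolic_sort(const uint64_t *in, uint64_t *out, uint64_t *cell,
                   size_t n, size_t ncells)
{
    assert(n <= ncells);
    for (size_t i = 0; i < n; i++)
        systolic_push(cell, i, in[i]);
    for (size_t i = 0; i < n; i++)
        out[i] = cell[n - 1 - i];
}
```

Because the block size cannot exceed the number of cells, a longer list has to be cut into several blocks, which is why the slides limit the sort size to a < L and produce multiple sorted lists.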

  12. Parallel Insertion Sort (BubbleSort) • MAP C listing (excerpt)

      #include <libmap.h>

      void parsort_test(int arraysize, int sortsize, int transfer,
                        uint64_t datahigh_in[], uint64_t datalow_in[],
                        uint64_t datahigh_out[], uint64_t datalow_out[],
                        int64_t *start_transferin, int64_t *start_loop,
                        int64_t *start_transferout, int64_t *end_transfer,
                        int mapno)
      {
          /* Arrays held in OnBoard Memory banks A through D */
          OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
          OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
          OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
          OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

          /* Copy the high 64-bit halves from common memory into bank A */
          DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1, arraysize*8, 0);
          wait_DMA(0);
          ….
          while (arrayindex < arraysize) {
              /* Clamp the current block to the end of the array */
              endarrayindex = arrayindex + sortsize - 1;
              if (endarrayindex > arraysize - 1)
                  endarrayindex = arraysize - 1;
              while (arrayindex < endarrayindex) {
                  /* Stream one 128-bit operand per iteration; the first
                     argument is true on the last element of the block */
                  for (i=arrayindex; i<=endarrayindex; i++) {
                      data_high_in = a[i];
                      data_low_in = b[i];
                      parsort(i==endarrayindex, data_high_in, data_low_in,
                              &data_high_out, &data_low_out);
                      c[i] = data_high_out;
                      d[i] = data_low_out;

  13. Parallel Heapsort • Tree structure of cells • Asynchronous operation • Acknowledged data transfer • Merges sorted lists in Θ(n lg n) • Designed for Independent BRAM block accesses
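The merge step can be pictured with an ordinary software min-heap that holds one candidate per sorted list. This is a minimal sequential sketch, not the asynchronous tree of hardware cells on the slide: it uses 64-bit keys, assumes equal-length lists stored back to back, and `MAX_LISTS`, `heap_item` and `heap_merge` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_LISTS 64   /* illustrative limit, not an SRC constraint */

typedef struct { uint64_t key; size_t list; } heap_item;

/* Restore the min-heap property below index i. */
static void sift_down(heap_item *h, size_t n, size_t i)
{
    for (;;) {
        size_t l = 2 * i + 1, r = l + 1, m = i;
        if (l < n && h[l].key < h[m].key) m = l;
        if (r < n && h[r].key < h[m].key) m = r;
        if (m == i) return;
        heap_item t = h[i]; h[i] = h[m]; h[m] = t;
        i = m;
    }
}

/* Merge k sorted blocks of len values each, stored back to back in
 * `in`, into `out`. Every output value costs one O(lg k) sift. */
void heap_merge(const uint64_t *in, uint64_t *out, size_t k, size_t len)
{
    assert(k >= 1 && k <= MAX_LISTS && len >= 1);
    heap_item heap[MAX_LISTS];
    size_t pos[MAX_LISTS];          /* next unread index within each block */
    size_t n = k;

    for (size_t j = 0; j < k; j++) {
        heap[j].key = in[j * len];
        heap[j].list = j;
        pos[j] = 1;
    }
    for (size_t i = k / 2; i-- > 0; )   /* build the initial heap */
        sift_down(heap, n, i);

    for (size_t o = 0; n > 0; o++) {
        size_t j = heap[0].list;
        out[o] = heap[0].key;           /* emit the smallest candidate */
        if (pos[j] < len)
            heap[0].key = in[j * len + pos[j]++];   /* refill from same list */
        else
            heap[0] = heap[--n];        /* list exhausted: shrink the heap */
        sift_down(heap, n, 0);
    }
}
```

The refill step reads from whichever list just produced the minimum, so the access pattern across lists is data dependent. That is the random access the slides try to serve from low-latency BRAM rather than pipelined OnBoard Memory.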

  14. Parallel Heapsort • BRAM Limitations • 144 Block RAMs × 512 32-bit values = 73,728 words, or only 18,432 128-bit values • OnBoard Memory • SRC constraint – Up to 64 reads and 8 writes in one MAP C file • Cascading clock delays as the number of reads increases • Explore the use of multiplexed (MUXed) access: search and update only 6 of 48 leaf nodes at a time in round-robin fashion

  15. FPGA Initial Results • Baseline: One Virtex-II 6000 (XC2V6000) • PAR options: -ol high -t 1 • Bubblesort Results – 100 Cells • 29,354 Slices (86%) • 37,131 LUTs (54%) • 13.608 ns = 73 MHz (verified operational at 100 MHz) • Heapsort Results – 95 Cells (48 Leaves) • 21,011 Slices (62%) • 24,467 LUTs (36%) • 11.770 ns = 85 MHz (verified operational at 100 MHz)
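The clock and utilization figures above follow from two one-line formulas, sketched here as a cross-check. The device totals are not given in the slides; the usage note assumes the part is a Xilinx Virtex-II XC2V6000 with 33,792 slices and 67,584 LUTs, and the function names are illustrative.

```c
#include <assert.h>

/* Maximum clock frequency in MHz for a critical path of period_ns. */
double fmax_mhz(double period_ns)
{
    assert(period_ns > 0.0);
    return 1000.0 / period_ns;
}

/* Resource utilization as a percentage of the device total. */
double utilization_pct(unsigned used, unsigned total)
{
    assert(total > 0);
    return 100.0 * (double)used / (double)total;
}
```

With those assumed totals, `fmax_mhz(13.608)` is about 73.5 MHz and `fmax_mhz(11.770)` about 85.0 MHz, while `utilization_pct(29354, 33792)` is about 86.9% and `utilization_pct(37131, 67584)` about 54.9%, in line with the slide once the percentages are truncated to whole numbers.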

  16. Testing Procedures • All tests utilize one chip for baseline results • Evaluate fastest software radix of operation • Hardware/Software Partitioning • Five cases - Case 5 utilizes FPGA reconfiguration • Data size partitioning – 100, 500, 1000, 5000, 10000 • 10 runs for each test case/data partitioning combination • List size 500000 values
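Data size partitioning determines how many sorted blocks the first sorting stage hands to the merge stage. The helpers below sketch that bookkeeping; the names are hypothetical, and the clamping of the last block follows the same pattern as the `endarrayindex` computation in the MAP C listing on slide 12.

```c
#include <assert.h>
#include <stddef.h>

/* Number of sorted blocks produced when a list of n values is cut
 * into partitions of at most `part` values (the last may be short). */
size_t block_count(size_t n, size_t part)
{
    assert(part > 0);
    return (n + part - 1) / part;
}

/* Inclusive end index of the block that starts at `start`, clamped
 * to the last valid index of the list. */
size_t block_end(size_t start, size_t part, size_t n)
{
    assert(part > 0 && start < n);
    size_t end = start + part - 1;
    return (end > n - 1) ? n - 1 : end;
}
```

For the 500,000-value list, the five partition sizes give 5,000, 1,000, 500, 100 and 50 blocks respectively, so smaller partitions shift work from the block sorter to the merge stage.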

  17. Results • Fastest Software Operations (Baseline) • Comparison of Radixsort and Heapsort Combinations • Radix 4, 8 and 16 evaluated • Minimum Time: Radix-8 Radixsort + Heapsort (Size = 5000 or 10000) • Radix-16 has too many buckets for sort size partitions evaluated • Heapsort comparisons faster than radixsort index updates

  18. Results • Fastest SW-only Time = 3.41 sec. • Fastest time including HW = 3.89 sec. • Bubblesort (HW), Heapsort (SW) • Partition Listsize of 1000 • Heapsort times… • Dominated by data access • Significantly slower than software

  19. Results – Bubblesort vs. Radixsort • Some cases where HW faster than SW • List sizes < 5000 • SRC data pipelined access • Fastest SW case was for list size = 10000 • MAP data transfer time less significant than data processing time • For size = 1000: Input (11.3%), Analyze (76.9%), Output (11.5%)

  20. Results – Limitations • Heapsort is limited by overhead of input servicing • Random accesses of OBM not ideal • Overhead of loop search, sequentially dependent processing • Bubblesort limited by number of cells • Can increase by approximately 13 cells • Two-chip streaming • Reconfiguration time assumed to be one-time setup factor • Reconfiguration case exception – Solve by having a core per Virtex-II 6000 (XC2V6000)

  21. Conclusions • Pipelined, systolic designs are needed to overcome the speed advantage of the microprocessor • Bubblesort works well on small data sets • Heapsort’s random data access cannot exploit SRC benefits • SRC high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs

  22. Future Work • Heapsort’s random data access cannot exploit SRC benefits • Look for possible speedups using BRAM? • Unroll leaf memory access • Exploit SRC “periodic macro” paradigm • Currently evaluating radix sort in hardware • This works better than bubblesort for larger sort sizes • Compare MAP-C to VHDL when baseline VHDL is faster than SW
