Microprocessors with FPGAs: Implementation and Workload Partitioning of the DARPA HPCS Integer Sort Benchmark within the SRC-6e Reconfigurable Computer
Allen Michalski
CSE Department – Reconfigurable Computing Lab, University of South Carolina
Outline • Reconfigurable Computing – Introduction • SRC-6e architecture, programming model • Sorting Algorithms • Design guidelines • Testing Procedures, Results • Conclusions, Future Work • Lessons learned
What is a Reconfigurable Computer? • Combination of: • Microprocessor workstation for front-end processing • FPGA back-end for specialized coprocessing • Typical PC bus for communications
What is a Reconfigurable Computer? • PC Characteristics • High clock speed • Superscalar, pipelined • Out-of-order issue • Speculative execution • Programmed in high-level languages • FPGA Characteristics • Low clock speed • Large number of configurable elements • LUTs, Block RAMs, carry-propagate adders (CPAs) • Multipliers • Programmed in HDLs
What is the SRC-6e? • SRC = Seymour R. Cray • Reconfigurable computer with a high-throughput memory interface • 1,415 MB/s for SNAP writes, 1,280 MB/s for SNAP reads • For comparison, PCI-X (1.0) = 1.064 GB/s
SRC-6e Development • Programming does not require knowledge of HW design • C code can compile to hardware
SRC Design Objectives • FPGA Considerations • Superscalar design • Parallel, pipelined execution • SRC Considerations • High overall data throughput • Streaming versus non-streaming data transfer? • Reduction of FPGA data processing stalls due to data dependencies, data read/write delays • FPGA Block RAM versus SRC OnBoard Memory? • Evaluate software/hardware partitioning • Algorithm partitioning • Data size partitioning
Sorting Algorithms • Traditional Algorithms • Comparison sorts: Ω(n lg n) lower bound • Insertion sort • Merge sort • Heapsort • Quicksort • Counting Sorts • Radix sort: Θ(d(n+k)) • HPCS FORTRAN code baseline • Radix sort in combination with heapsort • This research focuses on 128-bit operands • SRC simplifies data transfer and management
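For reference, a minimal C sketch of an LSD radix sort over 32-bit keys (the benchmark's 128-bit operands would extend the digit loop), with RADIX_BITS standing in for the radix parameter varied in the tests later. All names are illustrative, not taken from the HPCS baseline:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RADIX_BITS 8                     /* digit width; 2^8 = 256 buckets */
#define BUCKETS    (1u << RADIX_BITS)
#define MASK       (BUCKETS - 1)

/* LSD radix sort: one counting-sort pass per digit, Θ(d(n+k)) overall,
   where d = digits per key and k = number of buckets. */
void radix_sort32(uint32_t *a, size_t n)
{
    uint32_t *tmp = malloc(n * sizeof *a);
    size_t count[BUCKETS];

    for (unsigned shift = 0; shift < 32; shift += RADIX_BITS) {
        memset(count, 0, sizeof count);
        for (size_t i = 0; i < n; i++)           /* histogram this digit */
            count[(a[i] >> shift) & MASK]++;
        for (size_t b = 1; b < BUCKETS; b++)     /* prefix sums -> offsets */
            count[b] += count[b - 1];
        for (size_t i = n; i-- > 0; )            /* stable reverse scatter */
            tmp[--count[(a[i] >> shift) & MASK]] = a[i];
        memcpy(a, tmp, n * sizeof *a);
    }
    free(tmp);
}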
Sorting – SRC FPGA Implementation • Memory Constraints • SRC onboard memory • 6 banks x 4 MB • Pipelined read or write access • 5-clock latency • FPGA BRAM memory • 144 blocks, 18 Kbit each • 1-clock read and write latency • Initial Choices • Parallel Insertion Sort (BubbleSort) • Produces sorted blocks • Uses pipelined processing from onboard memory • Minimizes data access stalls • Parallel Heapsort • Random-access merge of sorted lists • Uses BRAM for low-latency access • Good for random data access
Parallel Insertion Sort (BubbleSort) • Systolic array of cells • Pipelined SRC processing from OnBoard Memory • Each cell keeps the highest value and passes the other values along • Latency ≈ 2× the number of cells
Parallel Insertion Sort (BubbleSort) • Systolic array of cells • Results passed out in reverse order of comparison • N = # comparator cells • Sorts a list completely in Θ(L²) • Limit sort size to some number a < L (list size) • Create multiple sorted lists • Each list sorted in Θ(a)
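A minimal software model of the systolic array, assuming the keep-max/pass-min cell policy described above; the function name and NCELLS are illustrative (NCELLS mirrors the 100-cell design reported later). In hardware the inner ripple advances one cell per clock, which is where the 2× latency figure comes from:

#include <stdint.h>

#define NCELLS 100                  /* mirrors the 100-cell FPGA design */

/* Each incoming value ripples through the cells: every cell keeps the
   larger of (held, incoming) and passes the smaller one downstream. */
void systolic_sort(const uint32_t *in, uint32_t *out, int n)
{
    uint32_t held[NCELLS];
    int used = 0;                   /* cells occupied so far; n <= NCELLS */

    for (int i = 0; i < n; i++) {
        uint32_t v = in[i];
        for (int c = 0; c < used; c++) {         /* keep max, pass min */
            if (v > held[c]) { uint32_t t = held[c]; held[c] = v; v = t; }
        }
        held[used++] = v;           /* first empty cell captures the rest */
    }
    /* Cells now hold the block in descending order (cell 0 = max), so
       reading back in reverse order of comparison yields ascending. */
    for (int c = 0; c < n; c++)
        out[c] = held[n - 1 - c];
}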
Parallel Insertion Sort (BubbleSort)

#include <libmap.h>

void parsort_test(int arraysize, int sortsize, int transfer,
                  uint64_t datahigh_in[], uint64_t datalow_in[],
                  uint64_t datahigh_out[], uint64_t datalow_out[],
                  int64_t *start_transferin, int64_t *start_loop,
                  int64_t *start_transferout, int64_t *end_transfer,
                  int mapno)
{
    /* OnBoard Memory banks: A/B hold the 128-bit input pairs
       (high/low halves), C/D receive the sorted results */
    OBM_BANK_A (a, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_B (b, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_C (c, uint64_t, MAX_OBM_SIZE)
    OBM_BANK_D (d, uint64_t, MAX_OBM_SIZE)

    /* DMA the high words from common memory into bank A; the bank-B
       transfer and local declarations are in the elided portion */
    DMA_CPU(CM2OBM, a, MAP_OBM_stripe(1, "A"), datahigh_in, 1,
            arraysize*8, 0);
    wait_DMA(0);
    ….

    /* Process the array one block of sortsize values at a time */
    while (arrayindex < arraysize) {
        endarrayindex = arrayindex + sortsize - 1;
        if (endarrayindex > arraysize - 1)
            endarrayindex = arraysize - 1;
        while (arrayindex < endarrayindex) {
            /* Stream the block through the systolic comparator array */
            for (i = arrayindex; i <= endarrayindex; i++) {
                data_high_in = a[i];
                data_low_in  = b[i];
                parsort(i == endarrayindex, data_high_in, data_low_in,
                        &data_high_out, &data_low_out);
                c[i] = data_high_out;
                d[i] = data_low_out;
Parallel Heapsort • Tree structure of cells • Asynchronous operation • Acknowledged data transfer • Merges sorted lists in Θ(n lg n) • Designed for independent BRAM block accesses
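A flattened C model of the merge the cell tree performs, assuming each leaf feeds one sorted sublist (e.g. a block produced by the bubblesort stage). The hardware resolves the comparisons concurrently via pairwise cells with acknowledged transfers; this sketch scans the leaves sequentially instead, and its types and names are illustrative:

#include <stdint.h>
#include <stddef.h>

/* One leaf per sorted sublist to be merged */
typedef struct { const uint32_t *data; size_t len, pos; } leaf_t;

/* Extract elements smallest-first; total = sum of all leaf lengths.
   Each extraction selects the minimum head value among the leaves,
   which the cell tree would resolve with lg(k) pairwise comparisons. */
void tree_merge(leaf_t *leaves, int k, uint32_t *out, size_t total)
{
    for (size_t i = 0; i < total; i++) {
        int best = -1;
        for (int j = 0; j < k; j++) {
            if (leaves[j].pos < leaves[j].len &&
                (best < 0 ||
                 leaves[j].data[leaves[j].pos] <
                 leaves[best].data[leaves[best].pos]))
                best = j;
        }
        out[i] = leaves[best].data[leaves[best].pos++];
    }
}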
Parallel Heapsort • BRAM Limitations • 144 Block RAMs @ 512 32-bit values each = only about 18K 128-bit values in total • OnBoard Memory • SRC constraint – up to 64 reads and 8 writes in one MAP C file • Cascading clock delays as the number of reads increases • Explore the use of MUXed access: search and update only 6 of 48 leaf nodes at a time in round-robin fashion
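A schematic sketch of the round-robin MUXed access idea, assuming the 48 leaves are serviced in 8 slots of 6; the leaf read/update itself is left as a placeholder, since the real version goes through the MAP C OnBoard Memory interface:

#define NLEAVES 48   /* leaf nodes in the heapsort tree */
#define GROUP    6   /* leaves serviced per round-robin slot */

/* Servicing only GROUP leaves per pass bounds the number of concurrent
   OBM reads (SRC caps reads per MAP C file) and avoids the cascading
   clock delays of multiplexing across all 48 leaves at once. */
void service_leaves(int pass)
{
    int slot = pass % (NLEAVES / GROUP);       /* 8 slots of 6 leaves */
    for (int g = 0; g < GROUP; g++) {
        int leaf = slot * GROUP + g;
        /* ... read this leaf's next element from OnBoard Memory and
           update its cell (placeholder) ... */
        (void)leaf;
    }
}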
FPGA Initial Results • Baseline: one Virtex-II XC2V6000 • PAR options: -ol high -t 1 • Bubblesort Results – 100 Cells • 29,354 Slices (86%) • 37,131 LUTs (54%) • 13.608 ns = 73 MHz (verified operational at 100 MHz) • Heapsort Results – 95 Cells (48 Leaves) • 21,011 Slices (62%) • 24,467 LUTs (36%) • 11.770 ns = 85 MHz (verified operational at 100 MHz)
Testing Procedures • All tests utilize one chip for baseline results • Evaluate the fastest radix for software operation • Hardware/Software Partitioning • Five cases – Case 5 utilizes FPGA reconfiguration • Data size partitioning – 100, 500, 1000, 5000, 10000 • 10 runs for each test case/data partitioning combination • List size: 500,000 values
Results • Fastest Software Operations (Baseline) • Comparison of Radixsort and Heapsort Combinations • Radix-4, -8, and -16 evaluated • Minimum Time: Radix-8 Radixsort + Heapsort (Size = 5000 or 10000) • Radix-16 has too many buckets for the sort-size partitions evaluated • Heapsort comparisons are faster than radixsort index updates
Results • Fastest SW-only time = 3.41 sec • Fastest time including HW = 3.89 sec • Bubblesort (HW), Heapsort (SW) • Partition list size of 1000 • Hardware heapsort times • Dominated by data access • Significantly slower than software
Results – Bubblesort vs. Radixsort • Some cases where HW is faster than SW • List sizes < 5000 • SRC pipelined data access • Fastest SW case was for list size = 10000 • MAP data transfer time is less significant than data processing time • For size = 1000: Input (11.3%), Analyze (76.9%), Output (11.5%)
Results – Limitations • Heapsort is limited by the overhead of input servicing • Random accesses to OBM are not ideal • Overhead of loop search and sequentially dependent processing • Bubblesort is limited by the number of cells • Can increase by approximately 13 cells • Two-chip streaming • Reconfiguration time assumed to be a one-time setup factor • Exception: the reconfiguration case – solved by having a core per XC2V6000
Conclusions • Pipelined, systolic designs are needed to overcome the microprocessor's speed advantage • Bubblesort works well on small data sets • Heapsort's random data access cannot exploit the SRC's benefits • The SRC's high-throughput data transfer and high-level data abstraction provide a good framework for implementing systolic designs
Future Work • Heapsort's random data access cannot exploit the SRC's benefits • Look for possible speedups using BRAM? • Unroll leaf memory access • Exploit the SRC "periodic macro" paradigm • Currently evaluating radix sort in hardware • Works better than bubblesort for larger sort sizes • Compare MAP C to VHDL for cases where a baseline VHDL design is faster than SW