200 likes | 404 Views
Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer. Russ Duren Baylor University, Waco, Texas Douglas Fouts and Dan Zulaica Naval Postgraduate School, Monterey, California . SRC-6E RECONFIGURABLE COMPUTER.
E N D
Performance Comparison of CORDIC Implementations on the SRC-6E Reconfigurable Computer Russ Duren Baylor University, Waco, Texas Douglas Fouts and Dan Zulaica Naval Postgraduate School, Monterey, California 1
SRC-6E RECONFIGURABLE COMPUTER • HARDWARE ARCHITECTURE CONSISTS OF 2 IDENTICAL HALVES • EACH SIDE HAS • 2 INTEL PENTIUM® PROCESSORS • 2 VIRTEX II SIX MILLION GATE FPGAS • MEMORY-TO-MEMORY CONNECTION • PROGRAMS CAN BE WRITTEN IN C OR FORTRAN • PORTIONS OF THE PROGRAM ARE CONVERTED TO CIRCUITRY IN THE FPGAS • GREATLY INCREASES EXECUTION SPEED 2
SRC-6E HARDWARE ARCHITECTURE (1/2) μP Board MAP Intel® μP Intel® μP 315/195 MB/s (peak) Controller L2 L2 6x 800 MB/s MIOC On-Board Memory (24 MB) 6x 800 MB/s SNAP PCI Common Memory Chain Port 800 MB/s Chain Port 800 MB/s FPGA FPGA 3
SRC-6E PROGRAMMING ENVIRONMENT • PENTIUM PROCESSORS • LINUX OPERATING SYSTEM IS THE MAIN USER INTERFACE • C AND FORTRAN COMPILERS FOR CODE DEVELOPMENT • INTEL AND GNU COMPILERS SUPPORTED • FPGAS • SRC-6E CUSTOM COMPILER CONVERTS HLL TO FPGA CIRCUITRY • FPGA ROUTINES WRITTEN IN C OR FORTRAN • SOFTWARE GENERATES ONE EXECUTABLE RESULT 4
HARDWARE COMPILER • SRC COMPILER TRANSLATES THE C SOURCE INTO FPGA CIRCUITRY • CIRCUITRY IS DEEPLY PIPELINED • FPGAS ARE CLOCKED AT 100 MHz • ONCE THE PIPELINE IS FILLED (LATENCY) RESULTS ARE PRODUCED EVERY 10 NANOSECONDS 5
LOWER LEVEL “PROGRAMMING” • SUPPORTS DEVELOPMENT OF FPGA “ROUTINES” USING TRADITIONAL TOOLS: • VERILOG AND VHDL • SCHEMATIC CAPTURE • INTELLECTUAL PROPERTY (IP) CORES • THESE METHODS ARE ANALOGOUS TO PROGRAMMING IN ASSEMBLY LANGUAGE FOR A NORMAL PROCESSOR • IP CORES ARE ANALOGOUS TO A LIBRARY OF ASSEMBLY LANGUAGE ROUTINES 6
EVALUATING THE SRC C HARDWARE COMPILER • INTENDED TO ENABLE C PROGRAMMERS TO TAKE ADVANTAGE OF THE SRC-6E WITHOUT REQUIRING A KNOWLEDGE OF DIGITAL CIRCUIT DESIGN • HOW DO YOU EVALUATE THE COMPILER? • EASE OF USE – OBJECTIVE MEASURES? • SCOPE OF MODIFICATIONS REQUIRED TO USE THE FPGAS • LIMITATIONS IMPOSED ON HLL ROUTINES • TYPICAL MEASURES OF COMPILER PERFORMANCE • OBJECT CODE SIZE - TRANSLATES TO FPGA CIRCUIT AREA • CODE EXECUTION SPEED 7
MODIFICATIONS REQUIRED TO USE THE FPGAS • CODE FOR FPGAS IS ISOLATED IN TO AN EXTERNAL FUNCTION • CALLS ARE INSERTED IN THE MAIN PROGRAM TO • INITIALIZE THE SRC • CALL THE EXTERNAL FUNCTION • RELEASE THE SRC (OPTIONAL) • EXTERNAL FUNCTION INCLUDES CALLS TO • TRANSFER INPUT DATA FROM SYSTEM COMMON MEMORY TO ON-BOARD MEMORY • TRANSFER OUTPUT DATA FROM ON-BOARD MEMORY TO SYSTEM COMMON MEMORY 8
LIMITATIONS IMPOSED ON HLL ROUTINES • VERSION 1.3 SUPPORTS: • DATA TYPES: 32-BIT (INT) AND 64-BIT (LONG LONG) • ADD, SUBTRACT, MULTIPLY • DIVISION (32-BIT ONLY) • RELATIONAL OPERATORS (==, !=, <, >, <=, >=) • BITWISE OPERATORS AND SHIFTS (&, |, !, <<, >>) • LOGICAL OPERATORS (&&, !, ||) • SQRT() (32BIT ONLY) • SOME RESTRICTIONS ON VARIABLE ACCESS AND CONTROL STATEMENTS • NOT PARTICULARLY LIMITING, BUT MUST BE CONSIDERED • EXAMPLE: NUMBER OF DATA WORDS TRANSFERRED MUST BE MULTIPLE OF 4 • FUTURE VERSIONS OF THE COMPILER WILL SUPPORT MORE OPERATIONS AND DATA TYPES AND RELAX RESTRICTIONS 9
TESTING HARDWARE COMPILER PERFORMANCE • WHAT BENCHMARK DO YOU USE FOR CIRCUIT AREA AND EXECUTION SPEED? • TYPICAL COMPILER EVALUATIONS COMPARE COMPILED CODE TO HAND-CODED ASSEMBLY LANGUAGE ROUTINES. • COMPARED COMPILER OUTPUT TO RESULTS OBTAINED USING INTELLECTUAL PROPERTY CORES • SELECTED CORDIC ALGORITHM AS BASIS OF COMPARISON • USES SHIFTS AND ADDS TO CALCULATE ARCTANGENT OF TWO NUMBERS • WELL SUITED TO IMPLEMENTATION ON AN FPGA • IP CORE AVAILABLE AT NO COST • USED XILINX CORE GENERATOR PROGRAM TO DEVELOP CORES 10
TESTING COMPILER EFFICIENCY • CORES VARIED IN INTERNAL PRECISION AND LATENCY • ALTHOUGH INPUTS TO CORES WERE 8 BITS WIDE, THE INTERFACE TO THE CALLING ROUTINE USED 32-BIT INTEGERS • DEVELOPED C ROUTINE TO CALCULATE CORDIC • MOST CLOSELY MATCHED CORE NUMBER 2 • 32-BIT INTEGER VARIABLES • 11 ITERATIONS • WROTE COMMON “MAIN” PROGRAM • CALLED CORDIC ROUTINE 256 TIMES • EACH CALL CALCULATED 249984 ARCTANGENTS 12
RESULTS *33,792 SLICES AVAILABLE 13
CIRCUIT AREA COMPARISON • C CODE REQUIRED 2 – 5 TIMES MORE CIRCUIT AREA • WHERE DOES THE FACTOR OF 2 COME FROM? • TOTAL CIRCUIT AREA • 6710 SLICES COMPARED TO 3555 FOR MOST SIMILAR CORE • WHERE DOES THE FACTOR OF 5 COME FROM? • AREA OF CORDIC ROUTINE ITSELF • APPROXIMATELY 2800 SLICES ADDED TO IP CORE FOR INTERFACE • SUBTRACTING 2800 FROM 6710 SLICES SAYS THE C CORDIC REQUIRED ABOUT 3900 SLICES • 3900 COMPARED TO 777 IS A RATIO OF 5 TO 1 14
CIRCUIT AREA IMPLICATIONS • CIRCUIT AREA EFFICIENCY LIMITS PROGRAM SIZE • EXTRA CIRCUIT AREA CAN BE USED FOR LOOP UNROLLING AND PARALLELIZATION TO INCREASE EXECUTION SPEED • MAY REQUIRE USE OF IP CORES OR CUSTOM CIRCUIT DESIGN • EXAMPLE: USED 8 COPIES OF CORE 1 IN PARALLEL. • RESULTING CIRCUIT REQUIRED ONLY 6814 SLICES • APPROXIMATELY SAME AS ONE CORDIC WRITTEN IN C • IF THE PROGRAM FITS, THEN THIS DOESN’T MATTER • FPGA DENSITY IS EXCEEDING MOORE’S LAW 15
EXECUTION SPEED COMPARISON • TOTAL EXECUTION TIME IS APPROXIMATELY IDENTICAL AT 7.4 SEC • LATENCY FOR C CODE IS ABOUT TWICE THAT OF IP CORES • LATENCY IS INSIGNIFICANT WITH LARGE DATA BLOCKS Time per loop = (Latency – 1 + Number of Calculations) * 10 ns (112 – 1 + 249984) * 10 ns = 2.5 msec/loop • CALCULATION TIME IS SMALL COMPARED TO EXECUTION TIME OF 7.4 SEC 256 loops * 2.5 msec = 0.64 seconds • MEMORY TRANSFER TIME DOMINATES EXECUTION TIME • WITH PEAK TRANSFER RATE MEMORY TRANSFERS REQUIRE 4 SECONDS 16
COMPUTATIONAL THROUGHPUT SUMMARY • COMPUTATIONAL THROUGHPUT FOR C ROUTINES EQUALS IP CORES FOR MANY PROBLEMS • WHEN HLL ROUTINE CAN BE FULLY PIPELINED AND • NUMBER OF CALCULATIONS >> LATENCY OR • I/O TIME >> CALCULATION TIME • OFTEN NO SIGNIFICANT TIMING IMPACT FOR USE OF C 17
ADVANTAGES OF HIGH LEVEL LANGUAGES • DOES NOT REQUIRE AN ADDITIONAL SKILL SET • NO HARDWARE KNOWLEDGE REQUIRED • INCREASED DESIGN FLEXIBILITY • PROGRAMMER IMPLEMENTS ALGORITHM AS DESIRED • LIMITED TO DATA WIDTHS RECOGNIZED BY THE COMPILER • LOWEST COST METHOD FOR APPLICATIONS WHERE IP CORES DO NOT EXIST • AVOIDS PURCHASE OR ROYALTY COST OF IP CORE WHEN ONE IS AVAILABLE 18
ADVANTAGES OF IP CORES • EASE OF DEVELOPMENT • VERY LOW DEVELOPMENT EFFORT • PREVIOUSLY VALIDATED (HOPEFULLY) • EASILY RECONFIGURED • LOWER LATENCY • OFTEN ONLY A MINOR ADVANTAGE • IN SOME CASES IT MAY BE CRITICAL • MORE COMPACT CIRCUIT REALIZATION • 2 TO 5 TIMES LESS FPGA CIRCUITRY NEEDED • ALLOWS SOLUTION OF MORE COMPLEX PROBLEMS • ENABLES INCREASED PARALLELISM OR LOOP UNROLLING 19
CONCLUSIONS • SRC-6E COMPILER ALLOWS C AND FORTRAN PROGRAMMERS TO ACCELERATE PROGRAMS WITHOUT BEING CIRCUIT DESIGNERS • HARDWARE COMPILER PRODUCES CODE THAT COMPARES WELL TO CUSTOM IP CORES • TYPICALLY SAME EXECUTION SPEED • REQUIRES MORE FPGA CIRCUIT AREA • PROVIDES CAPABILITY TO INTEGRATE IP CORES OR CUSTOM CIRCUIT DESIGNS 20