TigerSHARC CLU Exploration of XCORRS for Take-Home Quiz 4 BIAWPQHI -- 13 April – start of class

M. Smith, University of Calgary, Canada smithmr@ucalgary.ca

Ideal -- Take-Home Quiz
• Develop tests for complex correlation
  • Time and functionality
• Evaluate on
  • “C++” – in default and optimized mode (especially optimized)
  • Your optimized complex assembly code in complex correlation in SISD and SIMD modes
  • XCORRS in complex correlation in SISD and SIMD modes

Reasonable -- Take-Home Quiz: code and report
• Develop Functionality and Time tests for real FIR -- based on Lab. 3
  • Use on optimized C++ and your SISD and SIMD FIR
• Develop Functionality and Time tests for real correlation -- based on Lab. 3 / 4
  • Use on optimized C++ and your SISD and SIMD correlation
• Work out (in theory) the speed changes expected on your SISD and SIMD code if you went to complex data. Use this as a template for the expected changes in optimized C++
• Develop Functionality and Time tests for complex FIR
  • Use on optimized C++
• Develop Functionality and Time tests for complex correlation
  • Use on optimized C++ and your SISD and SIMD XCORRS only
• Report on whether the changes in C++ code speed work the way you expect
  • Use these figures to scale FIR and correlation to complex data
• Report on relative speeds
  • “C++” – in default and optimized mode (especially optimized)
  • Your optimized complex assembly code in complex correlation in SISD and SIMD modes
  • XCORRS in complex correlation in SISD and SIMD modes

Mark assignment
• My tests and C++ are available on the web
• If you use my tests, then you must say so, and 10% of marks are deducted
• If you use my C++ code, then you must say so, and 10% of marks are deducted
• If you use my C++ code and my tests, then you must say so, and 20% of marks are deducted

Speed comparison – Part 1
Real FIR
  float / int values[ ], params[ ]
  Loop: sum = sum + values * params
  2 memory fetches, 1 add and 1 mult per loop cycle – done in ½ cycle in theory
  Time N / 2 + overhead
  Determine overhead by measuring with and without the loop-sum
Complex FIR
  CMPX float / int values[ ], params[ ]
  Loop: sum = sum + values * params (many common factors with FFT – Hint for final?)
    Real sum = v.re * p.re – v.im * p.im
    Imag sum = v.re * p.im + v.im * p.re
  8 memory fetches, 3 add / sub and 4 mult per loop
  Time ??? + overhead
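One way to apply the "measure with and without the loop-sum" idea is sketched below. This is a minimal sketch, not the course's timing code: `netLoopCycles` and the `readCycles` callback are hypothetical names, and the function under test is assumed to take the same `bool overhead` flag as the prototypes in these slides.

```cpp
#include <cstdint>

// Hypothetical helper: estimates the cycles spent in the loop itself by
// timing one call with overhead == true (loop skipped) and one call with
// overhead == false (loop executed), then subtracting the two.
// readCycles is an assumed callback returning a free-running cycle count.
template <typename ReadCycles, typename Fn>
std::uint64_t netLoopCycles(ReadCycles readCycles, Fn fn) {
    const std::uint64_t t0 = readCycles();
    fn(true);                         // call/return and setup cost only
    const std::uint64_t t1 = readCycles();
    fn(false);                        // overhead plus the loop
    const std::uint64_t t2 = readCycles();

    const std::uint64_t overheadOnly = t1 - t0;
    const std::uint64_t withLoop = t2 - t1;
    // Guard against measurement noise making the loop run look faster.
    return (withLoop > overheadOnly) ? (withLoop - overheadOnly) : 0;
}
```

On real hardware the two calls would normally be repeated several times, since cache state can differ between the first and later calls.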

Speed comparison – Part 2
Speed in theory without doing anything special
  Any special way to store complex values to speed up memory access?
  Do we need to do 8 memory fetches?
    On the Blackfin?
    In the TigerSHARC?
  Expected optimal speed? Time ??? + overhead
Complex FIR
  CMPX float / int values[ ], params[ ]
  Loop: sum = sum + values * params (many common factors with FFT – Hint for final?)
    Real sum = v.re * p.re – v.im * p.im
    Imag sum = v.re * p.im + v.im * p.re
  8 memory fetches, 3 add / sub and 4 mult per loop
  Time ??? + overhead
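The storage question can be explored on a host compiler before touching assembly. The sketch below assumes an interleaved layout (real and imaginary parts adjacent in one struct), which is only one of the options raised here; `Cmplx` and `complexFirInterleaved` are hypothetical names. Each operand is loaded once per tap and reused in both the real and imaginary sums, so the loop body needs 4 scalar loads rather than 8. Whether the hardware can fetch a re/im pair in a single access is the question the slide leaves open.

```cpp
#include <cstddef>

// Hypothetical interleaved complex type: re and im sit next to each other.
struct Cmplx {
    float re;
    float im;
};

// Complex multiply-accumulate over 'size' taps (no conjugation, matching
// the Real sum / Imag sum expressions on the slide).
Cmplx complexFirInterleaved(const Cmplx* vals, const Cmplx* params,
                            std::size_t size) {
    Cmplx sum{0.0f, 0.0f};
    for (std::size_t i = 0; i < size; ++i) {
        // Load each of the four operands once and reuse it.
        const float vr = vals[i].re;
        const float vi = vals[i].im;
        const float pr = params[i].re;
        const float pi = params[i].im;

        sum.re += vr * pr - vi * pi;   // real part of v * p
        sum.im += vr * pi + vi * pr;   // imaginary part of v * p
    }
    return sum;
}
```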

Speed comparison – Part 3?
• Do these speed calculations scale the same way for complex correlation as for complex FIR?
• Do a theory calculation, then compare it with the results for debug and optimized C++ code to validate it – agreement within 25% of the predicted change is probably more than reasonable for a back-of-envelope calculation
• Use the scaling factor on your real FIR and correlation functions
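The 25% criterion can be written down as a one-line check. This is a sketch only: `withinTolerance` is a hypothetical helper, the tolerance is taken relative to the predicted value, and any numbers fed to it would come from your own theory calculation and measurements.

```cpp
#include <cmath>

// Hypothetical helper: true when the measured speed change lies within
// 'tol' (default 25%) of the predicted change, relative to the prediction.
bool withinTolerance(double predicted, double measured, double tol = 0.25) {
    return std::fabs(measured - predicted) <= tol * std::fabs(predicted);
}
```

For example, with a made-up predicted slowdown factor of 4.0, a measured factor of 4.8 passes and 5.5 does not.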

Tests for following functions needed
When to convert from float to int?
void ConvertReal2Complex(float *, CMPX32 *, int size)
    Make Complex = Real + j0
bool ConvertC32_2_C8(CMPX32 *, CMPX8 *, int size)
    Take bottom 8 bits of complex 32
    Return false if overflows
    Complex 8 is packed 2 complex values into 32 bits --- int format
bool ConvertC32_2_C1(CMPX32 *, CMPX1 *, int size)
    Take bottom 1 bit of complex 32
    Return false if overflows, or if not in +-1 +-j1 format
    Complex 1 is packed 16 complex values into 32 bits --- int format
void ConvertC8_2_C32(CMPX8 *, CMPX32 *, int size)    needed? YES
void ConvertC1_2_C32(CMPX1 *, CMPX32 *, int size)    needed?

Tests for following functions needed
float RealFIR(float *vals, float *params, int size, bool overhead);
CMPLX ComplexFIR(CMPLX *vals, CMPLX *params, int size, bool overhead);
    vals in dm and params in pm
void RealCorrs(float *vals, int size1, float *params, int size2, float *result, int *size3, bool overhead);
void ComplexCorrs(CMPLX *vals, int size1, CMPLX *params, int size2, CMPLX *result, int *size3, bool overhead);
void XCORRS(CMPLX *vals, int size1, CMPLX *params, int size2, CMPLX *result, int *size3, bool overhead, int version);
    version: 0 – works, 1 – SISD, 2 – SIMD

Some hints
void XCORRS(CMPLX *vals, int size1, CMPLX *params, int size2, CMPLX *result, int *size3, bool overhead, version) {
    bool ConvertC32_2_C8(CMPX32 *, dm CMPX8 *, int size1)
    bool ConvertC32_2_C1(CMPX32 *, pm CMPX1 *, int size2)
    *size3 = size1 – size2
    for result = 1 to size3
        result[ ] = 0;
    if (!overhead)
        XCORRS(dm CMPX8 *, pm CMPX1 *, dm? Result, size1, size2, size3, whichversion

Some Hints
void ComplexCorrs(CMPLX *vals, int size1, CMPLX *params, int size2, CMPLX *result, int *size3, bool overhead) {
    if (overhead) return;
    *size3 = size1 – size2;
    for loop to size3
        result[loop] = ComplexFIR(vals, CMPLX params, int size, bool overhead);
        vals++;
    end loop;
}

Some decisions
• Complex 32 – first decision
  • Store real in dm space and imaginary in pm space?
• Complex8 in dm space, Complex1 in pm space
• Doing everything with static pm variables
• Using dm variables on stack, in an attempt to avoid running out of memory
• Try with satellite data of size 2048 and PRN data of size 1024, but I suspect there may not be enough room when doing this with Complex 32, so may have to test on smaller sizes for comparison
• I ended up generating the same data as for the xcorrs( ) shown last Friday – size 48 = 16 * 3. Decided that if I could handle that (3 times round the xcorrs loop) then that was a fair enough test

Some Tests developed 1
TEST(ConvertReal2CMPLX32, D_TEST) {
    TEST_LEVEL(1);
#define TEST_SIZE 8
    float values[TEST_SIZE] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
    float zeros[TEST_SIZE] = {0, 0, 0, 0, 0, 0, 0, 0};
    ConvertReal2Complex(values, C32Real, C32Imag, TEST_SIZE);
    ARRAYS_EQUAL(values, C32Real, TEST_SIZE);
    ARRAYS_EQUAL(zeros, C32Imag, TEST_SIZE);
}
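A function that would satisfy this test could look like the sketch below. It assumes the split real/imaginary storage used in the test call (separate `C32Real` and `C32Imag` arrays) and drops the `pm` qualifier so that it compiles on a host compiler; the lower-case name marks it as a hypothetical stand-in rather than the course code.

```cpp
// Hypothetical host-side stand-in for ConvertReal2Complex:
// Complex = Real + j0, with real and imaginary parts in separate arrays.
void convertReal2Complex(const float* in, float* outR, float* outI, int size) {
    for (int i = 0; i < size; ++i) {
        outR[i] = in[i];   // real part is copied unchanged
        outI[i] = 0.0f;    // imaginary part is zero
    }
}
```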

Test for packed data – C8 format
#define TEST_SIZE 8
pm float imag1[TEST_SIZE] = {0x04, 0x14, -0x8, -0x18, 0x24, 0x34, 0x44, 0x54};
float real1[TEST_SIZE] = {0x08, 0x18, -1, -2, 0x28, 0x38, 0x48, 0x58};
TEST(ConvertToCMPLX8, D_TEST) {
    TEST_LEVEL(1);
    unsigned int result[4] = {0x14180408, 0xE8FEF8FF, 0x34382428, 0x54584448};
    CHECK(!ConvertC32_2_C8(real1, imag1, DATAC8, 1));           // odd size is rejected
    CHECK(ConvertC32_2_C8(real1, imag1, DATAC8, TEST_SIZE));
    ARRAYS_EQUAL(DATAC8, result, TEST_SIZE / 2);
}
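The expected words in `result[]` can be checked by hand. The sketch below packs one pair of complex values with the byte order that the `ConvertC32_2_C8` listing in these slides uses (first real in the lowest byte, then first imaginary, second real, second imaginary); `packC8Pair` is a hypothetical helper name used only for this check.

```cpp
#include <cstdint>

// Keeps the bottom 8 bits of a float that has already been range-checked
// to lie in [-128, 127]; negative values become two's-complement bytes.
std::uint32_t lowByte(float x) {
    return static_cast<std::uint32_t>(static_cast<std::int32_t>(x)) & 0xFFu;
}

// Hypothetical helper: packs two complex values into one 32-bit word as
// [imag1 | real1 | imag0 | real0], most significant byte first.
std::uint32_t packC8Pair(float real0, float imag0, float real1, float imag1) {
    return (lowByte(imag1) << 24) | (lowByte(real1) << 16) |
           (lowByte(imag0) << 8) | lowByte(real0);
}
```

With the test data, the first pair (0x08 + j0x04, 0x18 + j0x14) gives 0x14180408, and the second pair (-1 - j8, -2 - j24) gives 0xE8FEF8FF.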

Test for packed data – C1 format
#define LONGER_SIZE 32
pm float imag2[LONGER_SIZE] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ……..
float real2[LONGER_SIZE] = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ………..
pm float imag4[LONGER_SIZE];
float real4[LONGER_SIZE];
TEST(ConvertCMPLX1, D_TEST) {
    TEST_LEVEL(1);
    unsigned int result1[2] = {0x00000000, 0x00000000};
    unsigned int result2[2] = {0xFFFFFFFF, 0xFFFFFFFF};
    CHECK(!ConvertC32_2_C1(real1, imag1, PRNC1, 1));
    CHECK(!ConvertC32_2_C1(real1, imag1, PRNC1, TEST_SIZE));
    CHECK(!ConvertC32_2_C1(real2, imag2, PRNC1, 1));
    CHECK(ConvertC32_2_C1(real2, imag2, PRNC1, LONGER_SIZE));
    ARRAYS_EQUAL(PRNC1, result1, LONGER_SIZE / 16);
    for (int i = 0; i < LONGER_SIZE; i++) {
        real4[i] = -1 * real2[i];
        imag4[i] = -1 * imag2[i];
    }
    CHECK(ConvertC32_2_C1(real4, imag4, PRNC1, LONGER_SIZE));
    ARRAYS_EQUAL(PRNC1, result2, LONGER_SIZE / 16);
}
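The slides do not show `ConvertC32_2_C1` itself, so the following is only a sketch that is consistent with the test above. The bit layout is an assumption: +1 maps to bit 0 and -1 to bit 1, each complex value takes two adjacent bits (real in the lower one), and the first value sits in the least significant bits. The test only pins down the all-+1 and all--1 cases, so check the ordering against the XCORRS documentation before relying on it. The sketch also assumes sizes that are not a multiple of 16 are rejected, which matches the size-1 check.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for ConvertC32_2_C1 (no pm qualifiers).
bool convertC32ToC1(const float* inR, const float* inI,
                    std::uint32_t* c1, std::size_t size) {
    // Assumed: only whole 32-bit words (16 complex values each) are packed.
    if (size == 0 || size % 16 != 0) return false;

    // Every component must be exactly +1 or -1.
    for (std::size_t i = 0; i < size; ++i) {
        const bool realOk = (inR[i] == 1.0f) || (inR[i] == -1.0f);
        const bool imagOk = (inI[i] == 1.0f) || (inI[i] == -1.0f);
        if (!realOk || !imagOk) return false;
    }

    for (std::size_t word = 0; word < size / 16; ++word) {
        std::uint32_t packed = 0;
        for (std::size_t k = 0; k < 16; ++k) {
            const std::size_t idx = word * 16 + k;
            const std::uint32_t reBit = (inR[idx] < 0.0f) ? 1u : 0u;
            const std::uint32_t imBit = (inI[idx] < 0.0f) ? 1u : 0u;
            // Assumed layout: real in bit 2k, imaginary in bit 2k + 1.
            packed |= (reBit << (2 * k)) | (imBit << (2 * k + 1));
        }
        c1[word] = packed;
    }
    return true;
}
```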

RealFIR
#define TEST_SIZE 8
pm float params[TEST_SIZE] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0};
TEST(RealFIR, D_TEST) {
    TEST_LEVEL(1);
    float impulse[TEST_SIZE];
    float results[TEST_SIZE];
    for (int i = 0; i < TEST_SIZE; i++) {
        for (int j = 0; j < TEST_SIZE; j++)    // Set to zero
            impulse[j] = 0;
        impulse[i] = 1;                        // Impulse at position i picks out params[i]
        results[i] = RealFIR(impulse, params, TEST_SIZE, false);
    }
    ARRAYS_EQUAL(results, params, TEST_SIZE);
}

Complex FIR tests (3 of them)
To see if I got both Real and Imag correct
pm float resultsI[TEST_SIZE];
TEST(ComplexFIR, D_TEST) {
    TEST_LEVEL(1);
    float impulse[TEST_SIZE];
    float resultsR[TEST_SIZE];
    float zeros[TEST_SIZE] = {0, 0, 0, 0, 0, 0, 0, 0};
    for (int i = 0; i < TEST_SIZE; i++) {
        for (int j = 0; j < TEST_SIZE; j++)    // Set to zero
            impulse[j] = 0;
        impulse[i] = 1;
        for (int j = 0; j < TEST_SIZE; j++) {
            C32Real[j] = impulse[j];
            C32Imag[j] = 0;
            C32Real1[j] = params[j];
            C32Imag1[j] = 0;
        }
        ComplexFIR(C32Real, C32Imag, C32Real1, C32Imag1, &resultsR[i], &resultsI[i], TEST_SIZE, false);
    }
    ARRAYS_EQUAL(resultsR, params, TEST_SIZE);
    ARRAYS_EQUAL(resultsI, zeros, TEST_SIZE);
}

Real Correlation
pm float PRN32I[TEST_SIZE] = {1, -1, 1, -1, 1, 0, 0, 0};
TEST(RealCorrelation, D_TEST) {
    TEST_LEVEL(1);
    float data[TEST_SIZE * 2] = {0, 0, 0, 0, 1, -1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0};
    float result[TEST_SIZE];
    int Iresult[TEST_SIZE];
    int size3;
    RealCorrs(data, 2 * TEST_SIZE, PRN32I, TEST_SIZE, result, &size3, false);
    CHECK(size3 == TEST_SIZE);
    for (int j = 0; j < TEST_SIZE; j++)
        Iresult[j] = result[j];
    CHECK(MaximumLocation(Iresult, TEST_SIZE) == 4);    // PRN pattern starts at data[4]
}
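`MaximumLocation` is used by this test but is not listed in the slides. The sketch below is a host-side version of the whole check, assuming `MaximumLocation` returns the index of the first largest element; `realFir`, `realCorrs` and `maximumLocation` are hypothetical lower-case stand-ins without the `pm` qualifier and without the `overhead` flag.

```cpp
#include <cstddef>
#include <vector>

// Dot product of 'size' values with 'size' params.
float realFir(const float* vals, const float* params, int size) {
    float sum = 0.0f;
    for (int i = 0; i < size; ++i) sum += vals[i] * params[i];
    return sum;
}

// One output per lag, size1 - size2 lags in total (assumes size1 >= size2).
std::vector<float> realCorrs(const float* vals, int size1,
                             const float* params, int size2) {
    std::vector<float> result(static_cast<std::size_t>(size1 - size2));
    for (int lag = 0; lag < size1 - size2; ++lag)
        result[static_cast<std::size_t>(lag)] = realFir(vals + lag, params, size2);
    return result;
}

// Assumed behaviour: index of the first largest element, -1 if empty.
int maximumLocation(const int* values, int size) {
    if (size <= 0) return -1;
    int best = 0;
    for (int i = 1; i < size; ++i)
        if (values[i] > values[best]) best = i;
    return best;
}
```

With the test data, the correlation value at lag 4 is 5 (all five non-zero PRN taps line up) and the neighbouring lags 3 and 5 both give -4, so the peak location is 4.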

Complex Correlation -- Simple Test
pm float dataI[TEST_SIZE * 2] = {0, 0, 0, 0, 1.0, -1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0};
pm float resI[TEST_SIZE];
TEST(ComplexCorrelation, D_TEST) {
    TEST_LEVEL(1);
    float dataR[TEST_SIZE * 2] = {0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
    float resR[TEST_SIZE];
    int Iresult[TEST_SIZE];
    float parR[TEST_SIZE] = {0, 0, 0, 0, 0, 0, 0, 0};
    int size3;
    ComplexCorrs(dataR, dataI, TEST_SIZE * 2, parR, PRN32I, TEST_SIZE, resR, resI, &size3, false);
    CHECK(size3 == TEST_SIZE);
    for (int j = 0; j < TEST_SIZE; j++) {
        Iresult[j] = abs(resR[j]);
    }
    CHECK(MaximumLocation(Iresult, TEST_SIZE) == 4);
}

Complex Correlation – related to results from last lecture
for (int i = 0; i < 96; i += 3) {
    satXCORRSR[i] = -1; satXCORRSR[i+1] = 1; satXCORRSR[i+2] = 1;
    satXCORRSI[i] = 0;  satXCORRSI[i+1] = 0; satXCORRSI[i+2] = 0;
}
for (int i = 0; i < 48; i += 3) {
    prnXCORRSR[i] = -1; prnXCORRSR[i+1] = 1; prnXCORRSR[i+2] = 1;
    prnXCORRSI[i] = -1; prnXCORRSI[i+1] = 1; prnXCORRSI[i+2] = 1;
}
ComplexCorrs(satXCORRSR, satXCORRSI, 96, prnXCORRSR, prnXCORRSI, 48, resXCORRSR, resXCORRSI, &size3, false);
CHECK(size3 == 48);
for (int j = 0; j < 48; j++) {
    Iresult[j] = abs(resXCORRSR[j]);
}
for (int j = 1; j < 45; j += 3) {
    CHECK(resXCORRSR[j-1] == 48);
    CHECK(resXCORRSR[j] == -16);
    CHECK(resXCORRSR[j+1] == -16);
    CHECK(MaximumLocation(Iresult + j, 48 - j) == 2);
}

Complex Correlation ASM – related to results from last lecture
for (int i = 0; i < 96; i += 3) {
    satXCORRSR[i] = -1; satXCORRSR[i+1] = 1; satXCORRSR[i+2] = 1;
    satXCORRSI[i] = 0;  satXCORRSI[i+1] = 0; satXCORRSI[i+2] = 0;
}
for (int i = 0; i < 48; i += 3) {
    prnXCORRSR[i] = -1; prnXCORRSR[i+1] = 1; prnXCORRSR[i+2] = 1;
    prnXCORRSI[i] = -1; prnXCORRSI[i+1] = 1; prnXCORRSI[i+2] = 1;
}
ComplexCorrsASM(satXCORRSR, satXCORRSI, 96, prnXCORRSR, prnXCORRSI, 48, resXCORRSR, resXCORRSI, &size3, false);
CHECK(size3 == 48);
for (int j = 0; j < 48; j++) {
    Iresult[j] = abs(resXCORRSR[j]);
}
for (int j = 1; j < 45; j += 3) {
    CHECK(resXCORRSR[j-1] == 48);
    CHECK(resXCORRSR[j] == -16);
    CHECK(resXCORRSR[j+1] == -16);
    CHECK(MaximumLocation(Iresult + j, 48 - j) == 2);
}
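The expected values 48, -16, -16 can be reproduced on a host compiler. The sketch below uses `std::complex<float>` as storage instead of the split dm / pm arrays, which is an assumption made purely for convenience; `makePattern` and `complexCorrs` are hypothetical names. The product is formed without conjugation, following the Real sum / Imag sum expressions used for the complex FIR in these slides.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Cf = std::complex<float>;

// Repeating -1, +1, +1 pattern; the imaginary part is imagScale * real part.
std::vector<Cf> makePattern(std::size_t n, float imagScale) {
    static const float pattern[3] = {-1.0f, 1.0f, 1.0f};
    std::vector<Cf> out(n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = Cf(pattern[i % 3], imagScale * pattern[i % 3]);
    return out;
}

// One complex FIR per lag; assumes vals.size() >= params.size().
std::vector<Cf> complexCorrs(const std::vector<Cf>& vals,
                             const std::vector<Cf>& params) {
    const std::size_t lags = vals.size() - params.size();
    std::vector<Cf> result(lags);
    for (std::size_t lag = 0; lag < lags; ++lag) {
        float sumR = 0.0f;
        float sumI = 0.0f;
        for (std::size_t k = 0; k < params.size(); ++k) {
            const Cf v = vals[lag + k];
            const Cf p = params[k];
            sumR += v.real() * p.real() - v.imag() * p.imag();
            sumI += v.real() * p.imag() + v.imag() * p.real();
        }
        result[lag] = Cf(sumR, sumI);
    }
    return result;
}
```

At lag 0 all 48 real products are +1, giving 48. Shifting by one or two samples gives -1 per group of three, so -16 over 16 groups, and the pattern then repeats every three lags.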

bool ConvertC32_2_C8(float *inR, pm float *inI, unsigned int *C8, int size) {
    float *holdR = inR;
    pm float *holdI = inI;
    // Reject any value that does not fit in a signed 8-bit field
    for (int i = 0; i < size; i++) {
        if ((*inR > 127) || (*inR < -128)) return false;
        if ((*inI > 127) || (*inI < -128)) return false;
        inR++; inI++;
    }
    // Not going to bother with things that don't fit
    if (size & 1) return false;
    inR = holdR;
    inI = holdI;
    // Pack two complex values per 32-bit word, first real in the lowest byte
    for (int half = 0; half < size; half += 2) {
        unsigned int first  = ((int) *inR++) & 0xFF;
        unsigned int second = ((int) *inI++) & 0xFF;
        unsigned int third  = ((int) *inR++) & 0xFF;
        unsigned int fourth = ((int) *inI++) & 0xFF;
        *C8++ = ((((((fourth << 8) + third) << 8) + second) << 8) + first);
    }
    return true;
}

C8 → C32 and C16 → C32
float UINT8ToFloat(unsigned int value) {
    if (value & 0x80) {
        value = value | 0xFFFFFF00;    // Sign-extend a negative 8-bit value
        return ((int) value);
    }
    else
        return value;
}
void ConvertC8_2_C32(unsigned int *C8, float *inR, pm float *inI, int size) {
    for (int i = 0; i < size; i += 2) {
        unsigned int value = *C8++;
        *inR++ = UINT8ToFloat(value & 0xFF);
        value >>= 8;
        *inI++ = UINT8ToFloat(value & 0xFF);
        value >>= 8;
        *inR++ = UINT8ToFloat(value & 0xFF);
        value >>= 8;
        *inI++ = UINT8ToFloat(value & 0xFF);
    }
}
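The C16 → C32 direction is named in the slide title and `ConvertC16_2_C32` is called from the XCORRS wrapper, but its body is not listed. The sketch below shows one way it could look. The layout is an assumption: one complex value per 32-bit word, with the signed 16-bit real part in the lower half and the imaginary part in the upper half. Check this against how the TR registers are written out before using it. The lower-case names are hypothetical and the `pm` qualifier is dropped so that the code compiles on a host.

```cpp
#include <cstdint>

// Sign-extends the bottom 16 bits of 'value' and converts to float.
float int16FieldToFloat(std::uint32_t value) {
    const std::int32_t low = static_cast<std::int32_t>(value & 0xFFFFu);
    return static_cast<float>((low & 0x8000) ? (low - 0x10000) : low);
}

// Hypothetical host-side stand-in for ConvertC16_2_C32.
// Assumed layout: real in bits 15..0, imaginary in bits 31..16.
void convertC16ToC32(const std::uint32_t* c16, float* outR, float* outI,
                     int size) {
    for (int i = 0; i < size; ++i) {
        const std::uint32_t word = c16[i];
        outR[i] = int16FieldToFloat(word);
        outI[i] = int16FieldToFloat(word >> 16);
    }
}
```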

FIR filters
float RealFIR(float *values, pm float *params, int size, bool overhead) {
    if (overhead) return 0.0;
    float sum = 0;
    for (int i = 0; i < size; i++)
        sum += *values++ * *params++;
    return sum;
}
pm float sumI = 0;
void ComplexFIR(float *valR, pm float *valI, float *parR, pm float *parI, float *resultR, pm float *resultI, int size, bool overhead) {
    if (overhead) { *resultR = *resultI = 0; return; }
    float sumR = 0;
    sumI = 0;    // Was a static variable
    for (int i = 0; i < size; i++) {
        sumR += *valR * *parR - *valI * *parI;
        sumI += *valR * *parI + *valI * *parR;
        valR++; valI++; parR++; parI++;
    }
    *resultR = sumR;
    *resultI = sumI;
    return;
}

Correlation
void RealCorrs(float *vals, int size1, pm float *params, int size2, float *result, int *size3, bool overhead) {
    if (overhead) return;
    *size3 = size1 - size2;
    // One output per lag; *size3 equals size2 in the tests, where size1 == 2 * size2
    for (int j = 0; j < *size3; j++)
        *result++ = RealFIR(vals++, params, size2, overhead);
}
void ComplexCorrs(float *valR, pm float *valI, int size1, float *parR, pm float *parI, int size2, float *resR, pm float *resI, int *size3, bool overhead) {
    if (overhead) return;
    *size3 = size1 - size2;
    // One output per lag; *size3 equals size2 in the tests, where size1 == 2 * size2
    for (int j = 0; j < *size3; j++)
        ComplexFIR(valR++, valI++, parR, parI, &resR[j], &resI[j], size2, false);
}

Correlation XCORRS
extern "C" void xcorrsfunc(unsigned int *C8, pm unsigned int *C1, unsigned int *C16, int size);
void ComplexXCORRS(float *valR, pm float *valI, int size1, float *parR, pm float *parI, int size2, float *resR, pm float *resI, int *size3, bool overhead) {
    ConvertC32_2_C8(valR, valI, DATAC8, size1);
    *PRNC1 = 0x0;    // Need to shift the PRN to location C15
    ConvertC32_2_C1(parR, parI, PRNC1 + 1, size2);
    *size3 = size1 - size2;
    if (!overhead)
        xcorrsfunc(DATAC8, PRNC1, RESULTC16, *size3);
    ConvertC16_2_C32(RESULTC16, resR, resI, *size3);
}

XCORRS – same code as before, except need to transfer results out
    // Shift out the values in TR registers into results
    xR3:0 = TR3:0;;
    Q[J6 += 4] = xR3:0;;
    xR3:0 = TR7:4;;
    Q[J6 += 4] = xR3:0;;
    xR3:0 = TR11:8;;
    Q[J6 += 4] = xR3:0;;
    xR3:0 = TR15:12;;
    Q[J6 += 4] = xR3:0;;
    IF NLC0E, JUMP OUTERLOOP;;

Need to get inpars and go round more than 16 times
    J0 = zeros;;                 // Clear the THR registers the hard way
    R3:0 = Q[J0 += 4];;
    THR3:0 = R3:0;;
    R7:4 = R3:0;;
    // K0 = prn;; J2 = J4;; // satellite_data;;
    LC0 = 3;;
OUTERLOOP:
    K0 = J5;;
    J2 = J4;;
    J4 = J4 + 8;                 // Increment by 8 and not 16
    REST OF CODE UNCHANGED
    // Load THR with PRN code
    R1:0 = L[K0 += 2];;
    THR1:0 = R1:0;;
    R1:0 = L[K0 += 2];;
    THR3:2 = R1:0;;

Test results
