MODULE 2
Syllabus
• Fixed- and floating-point formats
• Code improvement
• Constraints
• TMS320C64x CPU
• Simple programming examples using C/assembly
Fixed-point numbers
• In a fixed-point processor, numbers are represented in integer format.
• Fixed-point numbers and their data types are characterized by their word size in bits, the position of the binary point, and whether they are signed or unsigned.
• Fast and inexpensive implementation
• Limited in the range of numbers
• Susceptible to problems of overflow
• The dynamic range of an N-bit number in 2's-complement representation is from $-2^{N-1}$ to $2^{N-1} - 1$, that is, from -32,768 to 32,767 for a 16-bit system.
• When the dynamic range is normalized to lie between -1 and 1, it is divided into $2^{N}$ sections, each of size $2^{-(N-1)}$, starting at -1 and running up to $1 - 2^{-(N-1)}$.
• For a 4-bit system, there are 16 sections, each of size 1/8, from -1 to 7/8.
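The range and section-size formulas above can be checked numerically. A minimal C sketch; the helper names (`min_int`, `max_int`, `num_sections`, `step_size`) are illustrative and not part of the original slides.

```c
#include <stdint.h>

/* Smallest N-bit two's-complement value: -2^(N-1).
   Illustrative helper, valid for 1 <= n_bits <= 32. */
int64_t min_int(unsigned n_bits)
{
    return -((int64_t)1 << (n_bits - 1));
}

/* Largest N-bit two's-complement value: 2^(N-1) - 1. */
int64_t max_int(unsigned n_bits)
{
    return ((int64_t)1 << (n_bits - 1)) - 1;
}

/* Number of sections in the normalized range: 2^N. */
int64_t num_sections(unsigned n_bits)
{
    return (int64_t)1 << n_bits;
}

/* Size of one section when the range is normalized to [-1, 1): 2^-(N-1). */
double step_size(unsigned n_bits)
{
    return 1.0 / (double)((int64_t)1 << (n_bits - 1));
}
```

For N = 16 the helpers give the -32,768 to 32,767 range quoted above, and for N = 4 they give 16 sections of size 1/8.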
• In 16-bit unsigned integer format, the stored number can take any integer value from 0 to 65,535.
• Signed integer format uses two's complement to allow negative numbers; it ranges from -32,768 to 32,767.
• In unsigned fraction notation, the 65,536 levels are spread uniformly between 0 and 1.
• The signed fraction format allows negative numbers, equally spaced between -1 and 1.
[Figure: number wheels. Examples shown: 15 + 1 = 0 (4-bit unsigned wheel) and 6 + (-2) = 4 (4-bit signed wheel).]
• The 4-bit unsigned numbers represent a modulo (mod) 16 system.
• If 1 is added to the largest number (15), the operation wraps around to give 0 as the answer.
• A number wheel graphically demonstrates the addition properties of a finite-bit system.
Addition procedure
1. Find the first number x on the wheel.
2. Step off y units in the clockwise direction, which brings you to the answer.
Carry and Overflow
• Carry applies to unsigned numbers: when an addition or subtraction produces a carry, the result is incorrect.
• Overflow applies to signed numbers: when an addition or subtraction overflows, the result is incorrect.
Examples:
Overflow (signed)             Carry (unsigned)
   01111                          100
 + 00111                        + 111
 -------                        -----
   10110                         1011
   ^ sign bit                    ^ carry
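The wrap-around, carry, and overflow cases above can be reproduced in C. This is a sketch under stated assumptions: the function names `add_unsigned` and `add_signed` are hypothetical, operands are passed as raw n-bit patterns, and the word size must satisfy 1 <= n_bits <= 31.

```c
#include <stdbool.h>
#include <stdint.h>

/* Add two unsigned n-bit values modulo 2^n.
   *carry is set when the true sum does not fit in n bits. */
uint32_t add_unsigned(uint32_t x, uint32_t y, unsigned n_bits, bool *carry)
{
    uint32_t mask = (1u << n_bits) - 1u;
    uint32_t sum = (x & mask) + (y & mask);
    *carry = (sum >> n_bits) != 0;
    return sum & mask;
}

/* Add two n-bit two's-complement bit patterns.
   *overflow is set when both operands have the same sign bit
   and the sign bit of the result differs from it. */
uint32_t add_signed(uint32_t x, uint32_t y, unsigned n_bits, bool *overflow)
{
    uint32_t mask = (1u << n_bits) - 1u;
    uint32_t sign = 1u << (n_bits - 1);
    uint32_t sum = ((x & mask) + (y & mask)) & mask;
    *overflow = (~(x ^ y) & (x ^ sum) & sign) != 0;
    return sum;
}
```

With n_bits = 4, `add_unsigned(15, 1, ...)` wraps to 0 with the carry flag set, matching the number wheel. With n_bits = 5, `add_signed(0x0F, 0x07, ...)` returns the pattern 10110 and reports overflow, as in the example.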
Rather than using the integer values just discussed, a fractional fixed-point number with values between -1 and +0.99... can be used.
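A 16-bit signed fraction of this kind is commonly called Q15 (a name not used in the slides). The following C sketch shows conversion and multiplication under that assumption; the function names are illustrative, and the multiply truncates toward zero rather than rounding.

```c
#include <stdint.h>

/* Convert a real value to Q15 (1 sign bit, 15 fraction bits),
   rounding to nearest and saturating outside [-1, 1). */
int16_t float_to_q15(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return INT16_MAX;
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)(scaled < 0 ? scaled - 0.5 : scaled + 0.5);
}

/* Convert a Q15 value back to a real number in [-1, 1). */
double q15_to_float(int16_t q)
{
    return (double)q / 32768.0;
}

/* Multiply two Q15 values. The 32-bit product has 30 fraction bits,
   so it is divided by 2^15 to return to Q15. Only (-1) * (-1)
   exceeds the range, and it is saturated to the largest value. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = ((int32_t)a * (int32_t)b) / 32768;
    if (p > INT16_MAX) p = INT16_MAX;
    return (int16_t)p;
}
```

For example, 0.5 maps to 16384, and multiplying 0.5 by 0.5 gives 8192, which is 0.25 in Q15.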
Data types
1. short: 16 bits, represented as 2's complement, with a range from $-2^{15}$ to $2^{15} - 1$
2. int or signed int: 32 bits, represented as 2's complement, with a range from $-2^{31}$ to $2^{31} - 1$
3. float: 32 bits, represented as IEEE 32-bit, with a range from $2^{-126}$ ($1.175494 \times 10^{-38}$) to $2^{+128}$ ($3.40282346 \times 10^{38}$)
4. double: 64 bits, represented as IEEE 64-bit, with a range from $2^{-1022}$ ($2.22507385 \times 10^{-308}$) to $2^{+1024}$ ($1.79769313 \times 10^{308}$)
Floating-point representation
• The advantage over fixed-point representation is that it can support a much wider range of values.
• The floating-point format needs slightly more storage.
• The speed of floating-point operations is measured in FLOPS (floating-point operations per second).
General format of a floating-point number: $X = M \cdot b^{e}$
where M is the value of the significand (mantissa), b is the base, and e is the exponent.
• The mantissa determines the accuracy of the number.
• The exponent determines the range of numbers that can be represented.
Floating-point numbers can be represented as:
Single precision:
• called "float" in the C language family
• a binary format that occupies 32 bits
• its significand has a precision of 24 bits
Double precision:
• called "double" in the C language family
• a binary format that occupies 64 bits
• its significand has a precision of 53 bits
Single precision (SP):
Bit layout: s (bit 31) | e (bits 30-23) | f (bits 22-0)
• Bit 31 is the sign bit.
• Bits 23 to 30 are the exponent bits.
• Bits 0 to 22 are the fraction bits.
• Numbers as small as $10^{-38}$ and as large as $10^{38}$ can be represented.
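The single-precision field layout can be inspected directly in C. A minimal sketch, assuming the host's `float` is IEEE 754 binary32; the type `sp_fields` and the function `sp_decode` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a single-precision value (illustrative type). */
typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30..23, stored with a bias of 127 */
    uint32_t fraction;  /* bits 22..0 */
} sp_fields;

/* Split a float into sign, exponent, and fraction bits.
   Assumes float is IEEE 754 binary32 on the host. */
sp_fields sp_decode(float x)
{
    uint32_t bits;
    sp_fields f;
    memcpy(&bits, &x, sizeof bits);  /* copy the raw bit pattern */
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.fraction = bits & 0x7FFFFFu;
    return f;
}
```

For -1.5f this gives sign 1, stored exponent 127, and fraction 0x400000 (only the top fraction bit set).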
Double precision (DP):
Bit layout: second register: s (bit 31) | e (bits 30-20) | f (bits 19-0); first register: f (bits 31-0)
• With 64 bits, more exponent and fraction bits are available.
• A pair of registers is used.
• Bits 0 to 31 of the first register are fraction bits.
• Bits 0 to 19 of the second register are also fraction bits.
• Bits 20 to 30 of the second register are the exponent bits.
• Bit 31 of the second register is the sign bit.
• Numbers as small as $10^{-308}$ and as large as $10^{+308}$ can be represented.
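The split of a double-precision value across two 32-bit words can be sketched in portable C. This assumes the host's `double` is IEEE 754 binary64; the `dp_*` names are hypothetical helpers, not TI functions.

```c
#include <stdint.h>
#include <string.h>

/* Raw 64-bit pattern of a double (assumes IEEE 754 binary64). */
static uint64_t dp_bits(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return bits;
}

/* Low word: fraction bits 31..0 (the "first register"). */
uint32_t dp_lo(double x) { return (uint32_t)(dp_bits(x) & 0xFFFFFFFFu); }

/* High word: sign, exponent, and the upper 20 fraction bits. */
uint32_t dp_hi(double x) { return (uint32_t)(dp_bits(x) >> 32); }

/* Fields taken from the high word. */
uint32_t dp_sign(double x)        { return dp_hi(x) >> 31; }
uint32_t dp_exponent(double x)    { return (dp_hi(x) >> 20) & 0x7FFu; }
uint32_t dp_fraction_hi(double x) { return dp_hi(x) & 0xFFFFFu; }
```

For 1.0 the high word is 0x3FF00000 and the low word is 0, so the whole fraction is zero and the stored exponent is 1023.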
• Instructions ending in SP or DP operate on single-precision and double-precision values, respectively.
• Some floating-point instructions have longer latencies than fixed-point instructions. E.g., MPY requires one delay slot, MPYSP three, and MPYDP nine.
• A single-precision floating-point value can be loaded into a single register, whereas a double-precision value needs a pair of registers: A1:A0, A3:A2, ... or B1:B0, B3:B2, ...
• The C6711 processor has a single-precision reciprocal instruction, RCPSP, for performing division.
Code Optimization
• Code optimization is used to drastically reduce the execution time of the code.
• There are several techniques: (i) use of instructions in parallel, (ii) word-wide data, (iii) intrinsic functions, (iv) software pipelining.
• Optimized assembly (ASM) code runs faster than C code and requires less memory space.
Comparison of Programming Techniques
Source        Tool                   Efficiency*    Effort
C / C++       Optimising Compiler    80 - 100%      Low
Linear ASM    Assembly Optimiser     95 - 100%      Med
ASM           Hand Optimised         100%           High
* Typical efficiency vs. hand-optimized assembly.
Linear Assembly
• The assembly-coded program produced by the assembly optimizer is typically more efficient than one produced by the C compiler optimizer.
• Linear assembly programming provides a compromise between coding effort and coding efficiency.
Optimization Steps
1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate, as well as the various optimization levels.
3. Use the profiler to identify the functions that may need to be optimized further. Then convert these functions to linear ASM.
4. Optimize the code in ASM.
Profiler • The profiler analyzes program execution and shows you where your program is spending its time. • A profile analysis can report how many cycles a particular function takes to execute and how often it is called. • Profiling helps you to direct valuable development time toward optimizing the sections of code that most dramatically affect program performance.
Compiler options:
• A C-coded program is first passed through a parser, which performs preprocessing functions and generates an intermediate file (.if) that becomes the input to an optimizer.
• The optimizer generates a (.opt) file, which becomes the input to a code generator that performs further optimization and generates an ASM file.
Flow: C code -> Parser -> .if -> Optimizer -> .opt -> Code generator -> ASM
The options for optimization levels:
1. -O0 optimizes the use of registers.
2. -O1 performs local optimization in addition to the optimization done by -O0.
3. -O2 performs global optimization in addition to the optimizations done by -O0 and -O1.
4. -O3 performs file-level optimization in addition to the optimizations done by -O0, -O1, and -O2.
-O2 and -O3 attempt to do software optimizations.
Intrinsic C functions:
• Similar to run-time support library functions
• C intrinsic functions are used to increase the efficiency of the code.
1. int _mpy() has an equivalent ASM instruction, MPY, which multiplies the 16 LSBs of a number by the 16 LSBs of another number.
2. int _mpyh() has an equivalent ASM instruction, MPYH, which multiplies the 16 MSBs of a number by the 16 MSBs of another number.
3. int _mpylh() has an equivalent ASM instruction, MPYLH, which multiplies the 16 LSBs of a number by the 16 MSBs of another.
4. int _mpyhl() has an equivalent ASM instruction, MPYHL, which multiplies the 16 MSBs of a number by the 16 LSBs of another.
5. void _nassert(int) generates no code. It tells the compiler that the expression declared with the assert function is true.
6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits of a double word.
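The four multiply intrinsics are only available with the TI compiler, but their arithmetic can be illustrated in portable C. The `emu_*` functions below are hypothetical stand-ins written for this sketch, not TI APIs; they treat each 16-bit half as a signed value, as the signed MPY family does.

```c
#include <stdint.h>

/* Sign-extend the low 16 bits of u to a 32-bit signed value. */
static int32_t sext16(uint32_t u)
{
    u &= 0xFFFFu;
    return (u & 0x8000u) ? (int32_t)u - 65536 : (int32_t)u;
}

/* Signed lower and upper 16-bit halves of a 32-bit word. */
static int32_t lo16(int32_t x) { return sext16((uint32_t)x); }
static int32_t hi16(int32_t x) { return sext16((uint32_t)x >> 16); }

/* Emulations of the arithmetic described for MPY, MPYH, MPYLH, MPYHL. */
int32_t emu_mpy(int32_t a, int32_t b)   { return lo16(a) * lo16(b); }
int32_t emu_mpyh(int32_t a, int32_t b)  { return hi16(a) * hi16(b); }
int32_t emu_mpylh(int32_t a, int32_t b) { return lo16(a) * hi16(b); }
int32_t emu_mpyhl(int32_t a, int32_t b) { return hi16(a) * lo16(b); }
```

With a = 0x00030002 (halves 3 and 2) and b = 0x00050004 (halves 5 and 4), the four functions return 8, 15, 10, and 12, respectively.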
Trip directive for loop count:
• The linear assembly directive .trip is used to specify the number of times a loop iterates.
• If the exact number is known and specified, redundant loops are not generated, which can improve both code size and execution time.
Cross-Paths • Data and address cross-path instructions are used to increase code efficiency. • MPY .M1x A2,B2,A4 • MPY .M2x A2,B2,B4
Software pipelining
• Software pipelining is a scheme that uses the available resources to obtain efficient pipelined code.
• The aim is to use all eight functional units within one cycle.
There are three stages:
1. Prolog (warm-up): contains the instructions needed to build up to the loop kernel cycle.
2. Loop kernel (cycle): within this loop, all instructions are executed in parallel. The entire loop kernel is executed in one cycle.
3. Epilog (cool-off): contains the instructions necessary to complete all iterations.
Procedure for software pipelining:
1. Draw the dependency graph.
2. Set up a scheduling table.
3. Obtain code from the scheduling table.
Dependency graph (procedure):
1. Draw the nodes and paths.
2. Write the number of cycles needed to complete each instruction.
3. Assign the functional unit associated with each node.
4. Separate the data paths so that the maximum number of units is utilized.
A node has one or more data paths going in and/or out of the node. • The numbers next to each node represent the number of cycles required to complete the associated instruction. • A parent node contains an instruction that writes to a variable; whereas a child node contains an instruction that reads a variable written by the parent.
• LDH -> parent of MPY
• MPY -> parent of ADD
• The result of the ADD instruction is fed back as an input for the next iteration; the same holds for the SUB instruction.
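Dependency graph (e.g., two sums of products per iteration):
Side A                              Side B
LDW   ai      .D1    5 cycles       LDW   bi      .D2    5 cycles
MPY   prodl   .M1x   2 cycles       MPYH  prodh   .M2x   2 cycles
ADD   suml    .L1    1 cycle        ADD   sumh    .L2    1 cycle
SUB   count   .S1    1 cycle        B     loop    .S2    1 cycle
Each LDW feeds both MPY and MPYH; each multiply feeds the ADD on its own side.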
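The loop behind this dependency graph computes two sums of products per iteration. A plain C sketch of the same computation follows; the function name `dotp_two_sums` is illustrative, and it assumes that the even-indexed samples correspond to the lower halves of the words loaded by LDW and that n is even.

```c
#include <stdint.h>

/* Sum of products with two partial sums per iteration.
   Each iteration handles one pair of 16-bit samples from a and b,
   mirroring LDW/LDW, MPY/MPYH, and ADD/ADD in the graph.
   n must be even (illustrative assumption). */
int32_t dotp_two_sums(const int16_t *a, const int16_t *b, int n)
{
    int32_t sum_l = 0;  /* accumulates products of the lower halves (MPY) */
    int32_t sum_h = 0;  /* accumulates products of the upper halves (MPYH) */

    for (int i = 0; i < n; i += 2) {
        sum_l += (int32_t)a[i] * b[i];
        sum_h += (int32_t)a[i + 1] * b[i + 1];
    }
    return sum_l + sum_h;  /* the two accumulators are combined at the end */
}
```

Keeping two independent accumulators is what lets the two ADD instructions run in parallel on .L1 and .L2.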
Scheduling table:
1. LDW starts in cycle 1.
2. MPY and MPYH must start five cycles after LDW because of the four delay slots of LDW. Therefore MPY/MPYH start in cycle 6.
3. ADD must start two cycles after MPY/MPYH because of the one delay slot of MPY/MPYH. Therefore ADD starts in cycle 8.
4. B has five delay slots and starts in cycle 3, so that branching occurs in cycle 9, after the ADD instructions.
5. SUB must start one cycle before the branch instruction, since the loop count is decremented before branching occurs. Therefore SUB starts in cycle 2.
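The start cycles above follow mechanically from the delay slots. A small C sketch of that arithmetic; the type `schedule` and the function names are illustrative, and the delay-slot counts are the ones stated in the list.

```c
/* Delay slots as stated in the scheduling rules above. */
enum { LDW_DELAY = 4, MPY_DELAY = 1, B_DELAY = 5 };

/* A consumer can start (delay slots + 1) cycles after its producer. */
int consumer_start(int producer_start, int delay_slots)
{
    return producer_start + delay_slots + 1;
}

/* Start cycle of each instruction in the non-pipelined loop. */
typedef struct {
    int ldw, mpy, add, b, sub;
} schedule;

schedule build_schedule(void)
{
    schedule s;
    s.ldw = 1;
    s.mpy = consumer_start(s.ldw, LDW_DELAY);  /* MPY and MPYH */
    s.add = consumer_start(s.mpy, MPY_DELAY);
    /* The branch must take effect in the cycle after ADD, so it is
       issued (B_DELAY + 1) cycles before that point. */
    s.b = (s.add + 1) - (B_DELAY + 1);
    s.sub = s.b - 1;  /* the count is decremented one cycle before B */
    return s;
}
```

Evaluating `build_schedule()` reproduces the start cycles 1, 6, 8, 3, and 2 for LDW, MPY/MPYH, ADD, B, and SUB.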
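Schedule table before software pipelining (rows: units; columns: cycles; "-" marks an idle unit):
Unit  1,9,17..   2,10,18..  3,11,..    4,12,..    5,13,..    6,14,..    7,15,..    8,16,..
.D1   LDW        -          -          -          -          -          -          -
.D2   LDW        -          -          -          -          -          -          -
.M1   -          -          -          -          -          MPY        -          -
.M2   -          -          -          -          -          MPYH       -          -
.L1   -          -          -          -          -          -          -          ADD
.L2   -          -          -          -          -          -          -          ADD
.S1   -          SUB        -          -          -          -          -          -
.S2   -          -          B          -          -          -          -          -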
Schedule table after software pipelining (rows: units; columns: cycles; "-" marks an idle unit):
Unit  1,9,17..   2,10,18..  3,11,..    4,12,..    5,13,..    6,14,..    7,15,..    8,16,..
.D1   LDW        LDW        LDW        LDW        LDW        LDW        LDW        LDW
.D2   LDW        LDW        LDW        LDW        LDW        LDW        LDW        LDW
.M1   -          -          -          -          -          MPY        MPY        MPY
.M2   -          -          -          -          -          MPYH       MPYH       MPYH
.L1   -          -          -          -          -          -          -          ADD
.L2   -          -          -          -          -          -          -          ADD
.S1   -          SUB        SUB        SUB        SUB        SUB        SUB        SUB
.S2   -          -          B          B          B          B          B          B
Instructions within prolog stage (cycles 1-7) are repeated until and including loop kernel (cycle 8). • Instructions in the epilog stage (cycles 9,10…) are to complete the functionality of the code.
Loop Kernel
• Within the loop kernel (cycle 8), multiple iterations of the loop execute in parallel; that is, different iterations are processed at the same time. E.g.:
  ADDs add data for iteration 1
  MPY/MPYH multiply data for iteration 3
  LDW loads data for iteration 8
  SUB decrements the counter for iteration 7
  B branches for iteration 6
• That is, the values being multiplied are loaded into registers five cycles before the cycle in which they are actually multiplied. Before the first multiplication occurs, the fifth load has just completed.
• This software pipeline is 8 iterations deep.
If the loop count is 100 (200 numbers):
Cycle 1: LDW, LDW (also initialization of count and of the accumulators A7 and B7)
Cycle 2: LDW, LDW, SUB
Cycles 3-5: LDW, LDW, SUB, B
Cycles 6-7: LDW, LDW, MPY, MPYH, SUB, B
Cycles 8-107: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B
Cycle 108: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B
• The prolog section is within cycles 1-7.
• The loop kernel is in cycle 8.
• The epilog section is in cycle 108.
Execution Cycles:
Number of cycles (with software pipelining):
Fixed point = 7 + (N/2) + 1, e.g., N = 200: 7 + 100 + 1 = 108
Floating point = 9 + (N/2) + 15

Optimization                  Fixed point                  Floating point
No optimization               2 + (16 x 200) = 3202        2 + (18 x 200) = 3602
With parallel instructions    1 + (8 x 200) = 1601         1 + (10 x 200) = 2001
Two sums per iteration        1 + (8 x 100) = 801          1 + (10 x 100) + 7 = 1008
With S/W pipelining           7 + (200/2) + 1 = 108        9 + (200/2) + 15 = 124
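The cycle counts in this comparison all have the form setup + (cycles per iteration x iterations) + tail, which can be checked with a few lines of C. The function names below are illustrative; the constants are taken from the figures above.

```c
/* Generic loop cost: setup + cycles_per_iter * iterations + tail. */
long loop_cycles(long setup, long cycles_per_iter, long iterations, long tail)
{
    return setup + cycles_per_iter * iterations + tail;
}

/* Software-pipelined cost for n samples, two samples per iteration.
   Fixed point:    7 + n/2 + 1
   Floating point: 9 + n/2 + 15 */
long pipelined_fixed(long n) { return loop_cycles(7, 1, n / 2, 1); }
long pipelined_float(long n) { return loop_cycles(9, 1, n / 2, 15); }
```

For n = 200, `pipelined_fixed` gives 108 and `pipelined_float` gives 124, while `loop_cycles(2, 16, 200, 0)` gives the 3202 cycles of the unoptimized fixed-point loop.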
Memory Constraints:
• Internal memory is arranged in several banks so that loads and stores can occur simultaneously.
• Since the banks are single ported, only one access to each bank can be performed per cycle.
• Two memory accesses per cycle can be performed if they do not access the same bank.
• If multiple accesses are made to the same bank, the pipeline stalls.
Cross Path Constraints:
• Since there is one cross path on each side of the two data paths, at most two instructions per cycle can use a cross path.
E.g., valid code segment (both available cross paths are utilized):
    ADD .L1X A1, B1, A0
 || MPY .M2X A2, B2, B3
E.g., not valid (one cross path is used for both instructions):
    ADD .L1X A1, B1, A0
 || MPY .M1X A2, B2, A3
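The rule as stated on this slide can be expressed as a simple check over the instructions issued in one cycle. This is a hypothetical model written for illustration, not a TI tool or API: each instruction is reduced to the side of its functional unit and whether the unit name carries the X suffix.

```c
#include <stdbool.h>

/* One instruction in an execute packet (illustrative model). */
typedef struct {
    int  side;        /* 1 for A-side units (.L1, .M1, ...), 2 for B-side */
    bool uses_cross;  /* true when the unit has the X suffix, e.g. .M1X */
} slot;

/* Returns true when each cross path is used by at most one
   instruction in the packet, which is the rule stated above. */
bool cross_paths_ok(const slot *packet, int n)
{
    int used[2] = {0, 0};

    for (int i = 0; i < n; i++) {
        if (!packet[i].uses_cross)
            continue;
        if (packet[i].side != 1 && packet[i].side != 2)
            return false;  /* malformed entry */
        if (++used[packet[i].side - 1] > 1)
            return false;  /* same cross path requested twice */
    }
    return true;
}
```

The valid example (.L1X with .M2X) passes the check because the two instructions use different cross paths; the invalid example (.L1X with .M1X) fails because both request the same one.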