Ch. 11 Digital Signal Processing Using General-Purpose Processors Kathy Grimes
Signals
• Signals
  • Electrical
  • Mechanical
  • Acoustic
• Most real-world signals are analog: they vary continuously over time
• Many limitations with analog
  • Repeatability
  • Tolerances
  • Difficulty storing information or implementing certain operations
• This leads us to DSP…
Digital Signal Processing (DSP)
• Represent signals by sequences of numbers
• Pros
  • Repeatable
  • Accuracy can be controlled
  • Time-varying operations are easier to implement
• Cons
  • Sampling causes loss of information
  • Round-off errors
  • Requires A/D and D/A mixed-signal hardware
Digital Signal Processing (DSP)
• Analog-to-digital converter
  • Converts a continuous-time signal to a discrete-time signal
  • Figure 11.1 shows the sampling of a signal
• Common signals
  • Step discontinuity (Figure 11.2)
  • Impulse (Figure 11.3)
FIGURE 11.1 Discrete Time Signals.
FIGURE 11.2 Step Function.
FIGURE 11.3 Impulse Function.
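The step and impulse signals named above can be written directly as discrete sequences. The following is a minimal sketch; the function names `unit_step`, `unit_impulse`, and `fill_sequence` are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Unit step u[n]: 0 for n < 0, 1 for n >= 0. */
int unit_step(int n) { return n >= 0 ? 1 : 0; }

/* Unit impulse d[n]: 1 at n == 0, 0 everywhere else. */
int unit_impulse(int n) { return n == 0 ? 1 : 0; }

/* Fill out[0..len-1] with f(start), f(start + 1), ... */
void fill_sequence(int (*f)(int), int start, int *out, size_t len) {
    assert(f != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        out[i] = f(start + (int)i);
    }
}
```

One property worth noting: the impulse is the first difference of the step, that is, d[n] = u[n] - u[n-1] for every n.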
DSP Building Blocks
• Based on three basic functions:
  • Delay
  • Add
  • Multiply
• Raw performance of a DSP algorithm is usually measured by the number of operations needed to execute it
FIGURE 11.4 Add Function.
FIGURE 11.5 Multiply Function.
FIGURE 11.6 Delay Function.
DSP Building Blocks
• Feedforward and feedback systems (Figures 11.7 and 11.8) can be combined to develop any discrete difference equation
FIGURE 11.7 Feedforward System.
FIGURE 11.8 Feedback System.
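As an illustration of how delay, add, and multiply combine, here is a sketch of a first-order difference equation, y[n] = b0·x[n] + b1·x[n-1] + a1·y[n-1]. The term b1·x[n-1] is the feedforward path and a1·y[n-1] is the feedback path. The struct layout and the names `first_order_t` and `first_order_step` are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Coefficients plus the two delay elements (previous input and output). */
typedef struct {
    double b0, b1;  /* feedforward gains (multiply blocks) */
    double a1;      /* feedback gain (multiply block) */
    double x1;      /* delayed input  x[n-1] */
    double y1;      /* delayed output y[n-1] */
} first_order_t;

/* Process one sample: two adds, three multiplies, two delay updates. */
double first_order_step(first_order_t *s, double x) {
    assert(s != NULL);
    double y = s->b0 * x + s->b1 * s->x1 + s->a1 * s->y1;
    s->x1 = x;  /* update the input delay */
    s->y1 = y;  /* update the output delay */
    return y;
}
```

With b0 = 1, b1 = 0, and a1 = 0.5, feeding in an impulse yields 1, 0.5, 0.25, and so on: the feedback path alone produces an infinitely long, decaying response.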
Fixed-Point and Floating-Point Implementations
• Floating-point DSPs perform integer and floating-point operations
  • Dynamic operating range
• Fixed-point DSPs perform integer operations
  • Fixed range: 16 bits = 65,536 distinct values
• Real-world analog signals = infinite precision
  • Floating-point mimics the “infinite” range better
  • Easier to implement; largely avoids scaling, rounding, and overflow problems
• Why not always use floating-point?
  • Cost, availability, and performance
  • Precision: with the same number of bits, floating-point is precise for small values but less precise for large values
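To make the fixed-range concern concrete, the sketch below shows 16-bit fixed-point arithmetic in the common Q15 format (values in [-1, 1) scaled by 32768). The Q15 choice, the round-to-nearest step, and the names `sat16`, `q15_mul`, and `q15_add` are assumptions for illustration; the slides do not specify a format.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp a 32-bit intermediate into the 16-bit range instead of wrapping. */
int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Q15 multiply: the 32-bit product is Q30, so shift right by 15.
 * Note: >> on a negative value is implementation-defined in C; this
 * sketch assumes the usual arithmetic shift. */
int16_t q15_mul(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;  /* Q30 product */
    p += 1 << 14;                         /* round to nearest */
    return sat16(p >> 15);
}

/* Q15 add with saturation: overflow must be handled explicitly. */
int16_t q15_add(int16_t a, int16_t b) {
    return sat16((int32_t)a + (int32_t)b);
}
```

For example, 0.5 × 0.5 is `q15_mul(16384, 16384)`, which gives 8192 (0.25 in Q15), while `q15_add(30000, 10000)` saturates at 32767. A floating-point implementation would need neither the shift nor the clamp.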
Single Instruction Multiple Data
• SIMD microarchitecture and instructions
• One instruction operates on four data values in a single clock cycle
• Improves performance of low-level DSP functions such as multiply-accumulate (MAC)
FIGURE 11.10 SIMD Instruction.
Microarchitecture Considerations
• Processor clock speed
• Cache size
• DSP architectures usually partition the memory space manually to reduce the number of accesses to external memory
  • Latency is costly in terms of time and resources
• Intel architectures have large caches that can hide the gap between fast and slow memory; however, all data starts in “far” caches
• Output data should be generated sequentially
• Accessing memory in a scattered pattern (while using threads) should be avoided
Implementation Options for Intel
• Intrinsics
• Vectorization
• Intel Performance Primitives
Intrinsics and Data Types
• C code that calls special built-in compiler capabilities that map closely to the underlying SSE instruction set
• Added data types
  • __m64, __m128, __m128d, __m128i
• Intrinsic operation types
  • Arithmetic (fixed- and floating-point)
  • Shift
  • Logical
  • Compare
  • Set
  • Shuffle
  • Concatenation
• Example: an add intrinsic takes four FP values packed into each of a and b and performs four additions in one instruction
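The packed addition described above can be sketched with the SSE intrinsics `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` from `<xmmintrin.h>`. The wrapper name `add4` and the scalar fallback for builds without SSE are assumptions added for this example; the original slide's code is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics and the __m128 type */
#endif

/* out[i] = a[i] + b[i] for i = 0..3. */
void add4(const float *a, const float *b, float *out) {
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);             /* load 4 floats (unaligned ok) */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 additions, 1 instruction */
#else
    /* Portable fallback when the build does not target SSE. */
    for (int i = 0; i < 4; i++) {
        out[i] = a[i] + b[i];
    }
#endif
}
```

The unaligned load and store variants are used so the sketch works with any pointer; when the data is known to be 16-byte aligned, `_mm_load_ps` and `_mm_store_ps` could be used instead.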
Vectorization
• Use the compiler to apply vectorization techniques to loops in data-processing code
  • The compiler looks for opportunities to convert loops from a scalar to a vector-based implementation (so that multiple operands can be operated on at the same time)
• Compilers such as GCC align the generated code with the SIMD instruction set
• Use #pragma directives to guide the compiler and avoid overheads such as assumed data dependences
Listing 11.4 Explicitly Don’t Vectorize Loop.
Listing 11.7 Memory Alignment Property and Discarding Assumed Data Dependences.
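Listings 11.4 and 11.7 are not reproduced in this transcript, so the following is only a sketch of the kind of hint they describe. It uses C99 `restrict` to promise that the arrays do not overlap, plus `#pragma ivdep` (Intel compiler) or `#pragma GCC ivdep` (GCC) to discard assumed loop-carried dependences. The function name `scale_add` and the compiler-detection guards are assumptions. Alignment hints are deliberately left out, because they would be unsafe for arbitrary pointers.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] += gain * src[i]; the caller promises dst and src do not overlap. */
void scale_add(float *restrict dst, const float *restrict src,
               float gain, size_t n) {
    assert(dst != NULL && src != NULL);
    /* Tell the compiler to ignore assumed data dependences in the loop.
     * The guards keep other compilers from seeing an unknown pragma. */
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (size_t i = 0; i < n; i++) {
        dst[i] += gain * src[i];
    }
}
```

Whether the loop is actually vectorized depends on the compiler and optimization flags; a vectorization report (for example GCC's `-fopt-info-vec`) is the usual way to check.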
Vectorization
• Comparisons on performance
• This performance would be vastly different if the memory were not already aligned
Performance Primitives
• Intel libraries: highly optimized implementations for many different applications (including audio codecs, image processing, data compression, etc.)
• Libraries take full advantage of the CPU and SIMD (and most are written for performance)
• Libraries are threaded and can obtain performance gains by parallelizing the algorithm
• Library domains include:
  • Signal processing: convolution and correlation, finite impulse response (FIR) filter, FIR coefficient generation functions, infinite impulse response (IIR) filter, transforms
  • Image processing
  • Small matrices and realistic rendering
  • Cryptography
Finite Impulse Response Filter
• FIR filter equation
  • y[n] = a·x[n] + b·x[n-1] + c·x[n-2]
Listing 11.8 FIR Filter C Code Example.
Listing 11.9 FIR Using Intel Performance Primitives.
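Listing 11.8 is not included in this transcript. As a stand-in, here is a minimal scalar sketch of the three-tap equation above. The name `fir3` and the choice to treat samples before the start of the buffer as zero are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* y[n] = a*x[n] + b*x[n-1] + c*x[n-2], with x[-1] = x[-2] = 0. */
void fir3(const float *x, float *y, size_t n, float a, float b, float c) {
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; i++) {
        float x1 = (i >= 1) ? x[i - 1] : 0.0f;  /* one-sample delay */
        float x2 = (i >= 2) ? x[i - 2] : 0.0f;  /* two-sample delay */
        y[i] = a * x[i] + b * x1 + c * x2;
    }
}
```

Feeding in an impulse (1, 0, 0, 0) returns the coefficients themselves (a, b, c, 0), which is why the response is called finite.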
FIR Ex: Intel SSE
• Loop unrolling to get rid of data dependences
• By changing the data elements, we can reduce the number of times we need to read data
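The SSE listing for this slide is not available here, so the sketch below shows the two ideas in plain C rather than with intrinsics: the loop is unrolled by four, and each input sample is read from memory once and then reused from local variables. The name `fir3_unrolled` and the unroll factor are assumptions; it computes the same three-tap filter y[n] = a·x[n] + b·x[n-1] + c·x[n-2] with zero initial history.

```c
#include <assert.h>
#include <stddef.h>

void fir3_unrolled(const float *x, float *y, size_t n,
                   float a, float b, float c) {
    assert(x != NULL && y != NULL);
    float x1 = 0.0f;  /* x[i-1], carried between iterations */
    float x2 = 0.0f;  /* x[i-2], carried between iterations */
    size_t i = 0;
    /* Main loop: four outputs per iteration, each input loaded once. */
    for (; i + 4 <= n; i += 4) {
        float s0 = x[i], s1 = x[i + 1], s2 = x[i + 2], s3 = x[i + 3];
        y[i]     = a * s0 + b * x1 + c * x2;
        y[i + 1] = a * s1 + b * s0 + c * x1;
        y[i + 2] = a * s2 + b * s1 + c * s0;
        y[i + 3] = a * s3 + b * s2 + c * s1;
        x2 = s2;  /* history for the next block */
        x1 = s3;
    }
    /* Tail loop for the remaining 0 to 3 samples. */
    for (; i < n; i++) {
        float s0 = x[i];
        y[i] = a * s0 + b * x1 + c * x2;
        x2 = x1;
        x1 = s0;
    }
}
```

The four output statements inside the main loop are independent of each other, which is the property a SIMD version relies on to compute them in parallel.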
Medical Ultrasound Imaging
• Computationally intensive
• Needs a significant amount of embedded computational performance
• Same basic algorithmic pattern even though physical configurations, parameters, and functionality differ
  • Beam forming
  • Envelope extraction
  • Polar-to-Cartesian coordinate translation
FIGURE 11.12 Block Diagram of a Typical Ultrasound Imaging Application.
Envelope Detector FIGURE 11.15 Block Diagram of the Envelope Detector.
Envelope Detector FIGURE 11.16 Polar-to-Cartesian Conversion of a Hypothetically Scanned Rectangular Object. Listing 11.11 Code Sample for Envelope Detector.
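Listing 11.11 and the block diagram in Figure 11.15 are not reproduced in this transcript, so the book's exact envelope-detection method cannot be shown. As a generic stand-in, and not necessarily the method in the figure, the sketch below uses a classic envelope follower: full-wave rectification followed by a first-order low-pass filter. The name `envelope_follow` and the smoothing parameter `alpha` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* env[n] = env[n-1] + alpha * (|x[n]| - env[n-1]), with env[-1] = 0.
 * alpha in (0, 1]: larger values track faster, smaller values smooth more. */
void envelope_follow(const float *x, float *env, size_t n, float alpha) {
    assert(x != NULL && env != NULL);
    assert(alpha > 0.0f && alpha <= 1.0f);
    float state = 0.0f;  /* previous envelope value */
    for (size_t i = 0; i < n; i++) {
        float mag = (x[i] < 0.0f) ? -x[i] : x[i];  /* full-wave rectify */
        state += alpha * (mag - state);            /* one-pole low-pass */
        env[i] = state;
    }
}
```

For an input alternating between +1 and -1 with `alpha = 0.5`, the output rises as 0.5, 0.75, 0.875 toward the signal's amplitude of 1, regardless of the sign changes.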
Performance Results • Why such a large difference?
Summary
• Digital signal processing on general-purpose processors
  • Extends processing capabilities
  • Simplifies the overall application when platforms require control, communications, and general-purpose processing with DSP
• Many ways to improve an Intel system: special C code (intrinsics), vectorization, and specific libraries
• Performance is greatly enhanced when DSP is implemented properly