A Programmable Single Chip Digital Signal Processing Engine
Paul Chiang, MathStar Inc.
Pius Ng, Apache Design Solutions
MAPLD 2005

Presentation Outline
• Space-borne signal processing tasks
• FPOA architecture highlights
• Programmability and expandability
• System partition on FPOA device
• Spatial processing - 5x5 filter solution
• Temporal processing - motion estimation
• Internal bus and I/O throughput
• Resource utilization and future expansion

A System of Digital Signal Processing
Stages: Input Data, Data Extraction, Feature Extraction, Characterization
• Data Extraction: Frequency or Time Domain Processing
  • mux/de-mux
  • average filter
  • min/max select
  • time domain low/high/bandpass filter
  • frequency transformation
  • frequency domain low/high/bandpass filter
• Feature Extraction: Spatial or Temporal Processing
  • spatial edge filter
  • temporal difference filter
  • apply equation that defines feature
  • checking threshold
• Characterization
  • analyze and characterize signals

Processing Requirements
• High computation requirement on the basic operations add/sub and mul/mac
• Mixed control functions such as loop control and decision making
• High I/O bandwidth to balance processing against data input/output
• Large and fast temporary memory space to facilitate real-time processing
• Fast, programmable, direct data transfer to enable massively parallel processing

FPOA Architecture Summary
• Heterogeneous Array of 16-bit Silicon Objects
  • MAC, ALU, Truth Tables, Register File, Internal RAM
• Single Clock Cycle Execution for All Objects
• Homogeneous 2-Layer Programmable Interconnect Mesh
• Tightly Integrated Data and Control Flow
• Integrated DDRII RLDRAM & SRAM Controllers
• High Speed I/O at Device Boundaries: SerDes, LVDS, HSTL

Reconfigurable Interconnect Network
• Each link consists of 16 data bits, 1 valid bit, and 4 separate control bits
• Nearest Neighbors
  • Range = 1 (N/E/S/W + diagonal)
• Party Lines
  • Single-cycle range = hop to 3 (skip 2) @ 1 GHz
  • Extra clock cycles for digital retiming
    • 1 extra cycle: 25-object neighborhood
    • More clock cycles: entire chip

FPOA Solution
• Four GPIO ports with 44-bit I/O at 100 MHz, that is, 17.6 Gbit/s
• Two 250 MHz DDR 32-bit external memories with 32 Gbit/s bandwidth
• 400 Silicon Objects running at 1 GHz
  • ALU: add/sub and combinational logic
  • MAC: mul/mac
  • Register File (RF): fast distributed data storage
  • Internal RAM (IRAM): intermediate data storage
• Party lines and muxes to support a flexible internal bus as well as dedicated connections
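
The two bandwidth figures above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper name `bandwidth_gbps` and its parameters are made up for illustration, and it assumes the GPIO ports transfer one word per clock while the DDR memories transfer two.

```python
def bandwidth_gbps(ports, width_bits, clock_mhz, transfers_per_clock=1):
    """Aggregate bandwidth in Gbit/s (hypothetical helper, not a MathStar API)."""
    return ports * width_bits * clock_mhz * transfers_per_clock / 1000.0

# Four 44-bit GPIO ports at 100 MHz, assumed single data rate.
gpio_gbps = bandwidth_gbps(ports=4, width_bits=44, clock_mhz=100)

# Two 32-bit external memories at 250 MHz, double data rate (2 transfers per clock).
mem_gbps = bandwidth_gbps(ports=2, width_bits=32, clock_mhz=250, transfers_per_clock=2)

print(f"GPIO: {gpio_gbps} Gbit/s, external memory: {mem_gbps} Gbit/s")
```

Under these assumptions the external memory offers a little under twice the GPIO bandwidth.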

5x5 Convolution Filter
• Apply the filter operation to a 2D data array, D[0:m-1, 0:n-1], with a 5x5 2D mask, W[0:4, 0:4]

    for i = 2; i < m - 2; i++
      for j = 2; j < n - 2; j++
        temp = 0;
        for k = -2; k < 3; k++
          for l = -2; l < 3; l++
            temp = D[i+k, j+l] * W[k+2, l+2] + temp
          end_of_l
        end_of_k
        Y[i, j] = temp;
      end_of_j
    end_of_i
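
For reference, here is a plain software version of the same filter in Python. It is a sketch for checking results, not the FPOA mapping. The function name `filter_5x5` is an assumption, and the loops cover every position where the whole mask fits inside the array (i from 2 to m-3 and j from 2 to n-3, inclusive). Border outputs are left at zero.

```python
def filter_5x5(D, W):
    """Apply a 5x5 mask W to the 2D list D; border outputs stay 0."""
    m, n = len(D), len(D[0])
    Y = [[0] * n for _ in range(m)]
    for i in range(2, m - 2):          # rows where the mask fits
        for j in range(2, n - 2):      # columns where the mask fits
            acc = 0
            for k in range(-2, 3):
                for l in range(-2, 3):
                    acc += D[i + k][j + l] * W[k + 2][l + 2]  # one MAC
            Y[i][j] = acc
    return Y

# Usage: a 6x6 array of ones with an all-ones mask.
D = [[1] * 6 for _ in range(6)]
W = [[1] * 5 for _ in range(5)]
Y = filter_5x5(D, W)
print(Y[2][2])  # each interior output sums 25 products of 1 * 1 -> 25
```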

Computation Requirements
• Assuming an m by n 2D data array and a 5x5 mask, there are 25 multiply-accumulate (MAC) operations for each filtered sample
• The whole convolution filter operation requires about 25 * m * n MAC operations (ignoring border samples)
• With standard 720x480 image data at 30 frames per second, the convolution filter operation requires 259 MMAC per second
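
The 259 MMAC per second figure follows directly from the frame size, frame rate, and mask size. A minimal sketch of that arithmetic, with a hypothetical helper name and border samples counted as if they were filtered:

```python
def mmac_per_second(width, height, fps, mask_rows=5, mask_cols=5):
    """Millions of MAC operations per second for a full-frame 2D mask filter."""
    return width * height * fps * mask_rows * mask_cols / 1e6

rate = mmac_per_second(720, 480, 30)
print(f"{rate:.1f} MMAC/s")  # 720 * 480 * 30 * 25 = 259,200,000 MAC/s
```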

Data Storage
• 2D data is stored in a 1D linear memory where four 16-bit words can be accessed concurrently
• Example of an 8x8 2D matrix stored in a 1D memory
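
The storage figure for this slide is not reproduced here, so the exact layout is unknown. The sketch below assumes a simple row-major layout in which each memory access returns four consecutive 16-bit words. The function names and the row-major choice are assumptions for illustration only.

```python
WORDS_PER_ACCESS = 4  # four 16-bit words are read together

def linear_address(row, col, n_cols):
    """Word address of element (row, col), assuming row-major storage."""
    return row * n_cols + col

def access_and_lane(row, col, n_cols):
    """Return (access index, lane within the 4-word access) for an element."""
    addr = linear_address(row, col, n_cols)
    return addr // WORDS_PER_ACCESS, addr % WORDS_PER_ACCESS

# Usage: in an 8x8 matrix each row spans two 4-word accesses.
print(access_and_lane(1, 5, n_cols=8))  # word 13 -> access 3, lane 1
```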

Data Access Analysis
• Samples are stored in the external memory, which has slower access speed
• Maximize data bandwidth by accessing 4 words at a time
• Use Register Files to store weights and sample data so that they can be reused without going out to external memory
• Perform calculation on 4 pixels concurrently, rotating coefficients and samples so that together they form the convolution operation

Data Processing Analysis
Note 1: with a 5x5 filter the first two rows and columns are skipped
Note 2: the sequence pattern of samples and coefficients is for the concurrent calculation of Y22, Y32, Y42, and Y52

FPOA Solution
• Temporary data storage
  • 5 RFs, 3 ALUs
• Data access control
  • 3 ALUs
• Multiplier
  • 4 MACs
• Adder Tree
  • 9 ALUs
• Temporary Results
  • 2 RFs, 1 IRAM, 2 ALUs

5x5 Convolution Filter Performance
• FPOA Resources
  • ALU: 17
  • RF: 7
  • MAC: 4
  • IRAM: 1
  • Total: 28 SOs + 1 IRAM
• Data throughput
  • 20 results every 125 cycles
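
The throughput figure can be cross-checked against the MAC count. Twenty results need 20 * 25 = 500 MAC operations, which is what 4 MAC objects deliver in 125 cycles at one operation per cycle. The sketch below does that check and converts the throughput to results per second. The 1 GHz clock comes from the earlier FPOA Solution slide and is a parameter here, and the helper names are assumptions.

```python
MACS_PER_RESULT = 25  # 5x5 mask

def mac_utilization(results, cycles, mac_units):
    """Fraction of available MAC slots used, assuming 1 MAC op per unit per cycle."""
    return results * MACS_PER_RESULT / (cycles * mac_units)

def results_per_second(results, cycles, clock_hz):
    """Filtered samples produced per second at the given clock."""
    return results * clock_hz / cycles

util = mac_utilization(results=20, cycles=125, mac_units=4)
rate = results_per_second(results=20, cycles=125, clock_hz=1e9)
required = 720 * 480 * 30  # samples per second for 720x480 at 30 frames/s

print(f"MAC utilization: {util:.0%}")
print(f"Throughput: {rate / 1e6:.0f} M results/s, required: {required / 1e6:.3f} M/s")
```

Under these assumptions the four MAC objects are fully occupied, and one filter instance has headroom over the 720x480, 30 frames per second requirement.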

Motion Estimation
• Identify the movement of a similar pattern over time
• The main computation involves calculating the sum of absolute differences (SAD) between two 8x8 blocks, i.e., X[0:7, 0:7] and Y[0:7, 0:7]

    sum = 0;
    for i = 0 to 7
      for j = 0 to 7
        temp = X[i, j] - Y[i, j]
        sum = sum + abs(temp)
      end_of_j
    end_of_i
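
A direct Python rendering of the SAD loop is shown below as a software reference. The function name `sad_8x8` and the test data are assumptions for illustration.

```python
def sad_8x8(X, Y):
    """Sum of absolute differences between two 8x8 blocks (lists of lists)."""
    total = 0
    for i in range(8):
        for j in range(8):
            total += abs(X[i][j] - Y[i][j])
    return total

# Usage: compare a ramp block against an all-zero block.
X = [[i * 8 + j for j in range(8)] for i in range(8)]
Z = [[0] * 8 for _ in range(8)]
print(sad_8x8(X, Z))  # sum of 0..63 = 2016
```

In motion estimation this value is computed for each candidate block position, and the position with the smallest SAD is taken as the best match.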

SAD Computation Dataflow
• 3 cycles throughput
• Generates two partial sums of positive differences
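
One way to read "two partial sums of positive differences" is that the dataflow avoids an explicit absolute-value step: it accumulates x - y where x is larger and y - x where y is larger, and the SAD is the sum of the two accumulators. The sketch below illustrates that interpretation only. The slide does not show how the objects are wired, and the function name is hypothetical.

```python
def sad_partial_sums(X, Y):
    """Return (sum of x - y where x > y, sum of y - x where y > x)."""
    pos_xy = 0  # accumulates positive values of x - y
    pos_yx = 0  # accumulates positive values of y - x
    for x_row, y_row in zip(X, Y):
        for x, y in zip(x_row, y_row):
            if x > y:
                pos_xy += x - y
            else:
                pos_yx += y - x
    return pos_xy, pos_yx

# Usage on a small 2x2 example (the same logic applies to 8x8 blocks).
A = [[5, 1], [3, 3]]
B = [[2, 4], [3, 0]]
p, q = sad_partial_sums(A, B)
print(p, q, p + q)  # differences are 3, -3, 0, 3 -> partial sums 6 and 3, SAD 9
```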

SAD Performance
• FPOA Resources
  • ALU: 35
  • RF: 1
  • Total: 36 SOs
• Data throughput
  • 24 cycles per 8x8 block
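
For a sense of scale, the 24-cycle figure can be converted into blocks per second and pixels per cycle. The clock rate is a parameter because the deck quotes 1 GHz for the device and 400 MHz for the example application. The helper names are assumptions.

```python
def blocks_per_second(clock_hz, cycles_per_block=24):
    """8x8 SAD blocks completed per second by one SAD module."""
    return clock_hz / cycles_per_block

def pixels_per_cycle(block_pixels=64, cycles_per_block=24):
    """Average pixel differences consumed per clock cycle."""
    return block_pixels / cycles_per_block

for clock_hz in (1e9, 400e6):
    print(f"{clock_hz / 1e6:.0f} MHz: {blocks_per_second(clock_hz) / 1e6:.2f} M blocks/s")
print(f"{pixels_per_cycle():.2f} pixel differences per cycle")
```

The 24 cycles would be consistent with eight rows at the 3-cycle throughput quoted for the dataflow, though the slides do not state that relationship explicitly.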

Internal System Bus
• Links all processing modules and the external host to the external system memory for data accesses
• Host-controlled round-robin access from module to module
• User-defined packet format to utilize the 16-bit party line and minimize the access overhead

System Bus Performance
• FPOA Resources
  • ALU: 20
• Cycles
  • XRAM read: 4 cycles
  • XRAM write: 4 cycles
  • Module switch: 10 cycles
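
The cycle counts above suggest a simple cost model for one round-robin pass over the modules. The sketch below assumes one module switch per module per pass, that each read or write is a single XRAM access, and that none of these costs overlap. These assumptions and the function name are illustrative and are not taken from the slides.

```python
READ_CYCLES = 4     # XRAM read
WRITE_CYCLES = 4    # XRAM write
SWITCH_CYCLES = 10  # switching the bus to the next module

def round_robin_cycles(requests):
    """Total cycles for one pass; requests is a list of (reads, writes) per module."""
    total = 0
    for reads, writes in requests:
        total += SWITCH_CYCLES + reads * READ_CYCLES + writes * WRITE_CYCLES
    return total

# Usage: three modules with different access mixes.
requests = [(4, 1), (8, 0), (0, 2)]
total = round_robin_cycles(requests)
overhead = SWITCH_CYCLES * len(requests) / total
print(total, f"{overhead:.0%} spent on module switches")  # 30 + 42 + 18 = 90 cycles
```

In this model the switch cost is amortized best when each module transfers many words per turn.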

Performance of an Example Space Satellite Application
• Processing Throughput
  • About 10 million samples per second
• FPOA Resources (% of a device with 400 SOs running at 400 MHz)
  • Cycle utilization: 21%
  • SO utilization: 51%
  • IRAM utilization: 25%
  • XRAM bandwidth: 49% (100 MHz DDR RLDRAM)