‘Stream’-based wireless computing
Sridhar Rajagopal
Research group meeting, December 17, 2002
The figures used in the slides are borrowed from papers at VT and Stanford.
Motivation • ‘Stream’-based computing • what does it mean? • Not a well-defined term • ‘computation’ that uses flow of self-guided info. • ‘sequence of data’ • Related to flow of data through architecture • Application to implementing wireless algorithms
Outline • Stallion • reconfigurable computing at Virginia Tech • ‘stream’-based computing #1 • Custom Configurable Machines (CCM) • Imagine • media processing at Stanford • ‘stream’-based computing #2 • programmable architectures
Stallion at VT • Wormhole Run-Time Reconfiguration (RTR) • coarse-grained structure • reconfiguration using ‘streams’
‘Stream’ packets [Figures: a stream packet; stream flow through the architecture]
Stream module description • 4 states: • IDLE – reconf. in progress • BUSY – doing work • PROGRAM – load reconf. data • PASS – meant for next module • Need to output one packet per cycle • VALID – maintain sync. • set INVALID instead of wait states • strip information off stack
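The module behaviour listed on this slide can be sketched as a small state machine. This is only an illustration under stated assumptions, not the Stallion hardware design: the packet layout (a stack of `(module_id, kind)` headers plus a payload), the names `Packet`, `StreamModule`, and `CONFIGS`, and the choice to drop data that arrives before configuration are all invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    IDLE = auto()      # not (yet) configured / reconfiguration in progress
    BUSY = auto()      # doing work on data
    PROGRAM = auto()   # loading reconfiguration data
    PASS = auto()      # packet is meant for a later module


# Hypothetical configurations that a PROGRAM packet can select by name.
CONFIGS = {"double": lambda x: 2 * x, "negate": lambda x: -x}


@dataclass
class Packet:
    # Assumed layout: a stack of (module_id, kind) headers, top of stack first.
    headers: list = field(default_factory=list)
    payload: object = None
    valid: bool = True


class StreamModule:
    def __init__(self, module_id):
        self.module_id = module_id
        self.state = State.IDLE
        self.func = None  # installed by a PROGRAM packet

    def step(self, pkt):
        """Consume one packet and emit exactly one packet (one per cycle)."""
        addressed_here = (
            pkt.valid and pkt.headers and pkt.headers[0][0] == self.module_id
        )
        if not addressed_here:
            self.state = State.PASS
            return pkt  # forward unchanged to the next module
        kind = pkt.headers[0][1]
        rest = pkt.headers[1:]  # strip this module's header off the stack
        if kind == "program":
            self.state = State.PROGRAM
            self.func = CONFIGS[pkt.payload]
            return Packet(valid=False)  # nothing to emit: send INVALID, do not stall
        if self.func is None:
            self.state = State.IDLE
            return Packet(valid=False)  # simplification: unconfigured data is dropped
        self.state = State.BUSY
        return Packet(headers=rest, payload=self.func(pkt.payload))
```

Every call to `step` returns a packet, so downstream modules never see a wait state; cycles with nothing useful to send carry a packet with `valid=False` instead.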
Processing layer • Static section • configures the reconf. section • buffers data during reconf. & sends ‘IDLE’ packets • Reconf. Section • processing of the data done here • Higher layers convert algorithm to data and configuration patterns
Cart before the horse: Colt before the Stallion • Colt architecture (also at VT) • IFU Mesh – mesh of interconnected functional units
Stallion chip [Figure: Stallion chip diagram; labels include 16-bit data and 4-control]
IFU mesh in Stallion • Dashed lines – skip buses • Can send operands over one or more IFUs
IFU details • Only left input can do barrel shifting • ALU based on LUT • Control register – stores control information for reconfiguration • Optional delay register – provides latency to synchronize path lengths of different pipeline streams • Cond. unit • Output control unit
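A rough software model of one IFU may help make these parts concrete. It is a sketch under assumptions, not the Stallion design: the barrel shifter is modelled as a 16-bit rotate, the LUT-based ALU as a 4-entry truth table applied bit by bit, and the control register as a plain dictionary. The conditional unit and output control unit are left out, and all names (`IFU`, `lut_alu`, `rotate_left`) are hypothetical.

```python
MASK = 0xFFFF  # 16-bit data path


def rotate_left(x, n):
    """Barrel shift, modelled here as a 16-bit rotate (an assumption)."""
    x &= MASK
    n %= 16
    return ((x << n) | (x >> (16 - n))) & MASK


def lut_alu(lut, a, b):
    """Bitwise ALU: result bit i is lut[(a_i << 1) | b_i]."""
    out = 0
    for i in range(16):
        index = (((a >> i) & 1) << 1) | ((b >> i) & 1)
        out |= lut[index] << i
    return out


XOR_LUT = (0, 1, 1, 0)
AND_LUT = (0, 0, 0, 1)


class IFU:
    def __init__(self, lut, shift=0, delay=False):
        # Control register: holds the configuration of this unit.
        self.control = {"lut": tuple(lut), "shift": shift, "delay": delay}
        self._delay_reg = 0

    def step(self, left, right):
        """One cycle: only the left operand passes through the shifter."""
        c = self.control
        shifted = rotate_left(left, c["shift"])
        result = lut_alu(c["lut"], shifted, right & MASK)
        if not c["delay"]:
            return result
        # Optional delay register: output lags by one cycle to balance paths.
        out, self._delay_reg = self._delay_reg, result
        return out
```

For example, `IFU(XOR_LUT).step(0x00FF, 0x0F0F)` combines the operands in the same cycle, while `IFU(AND_LUT, delay=True)` returns each result one call later.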
Radio testbed at VT [Figure: radio testbed, with the Stallion chip labelled]
Worm-hole routing • stream = worm, architecture = holes • multiple, independent streams can wind their way through the chip simultaneously • parts of the system can be processing while other parts are reconfiguring • GOAL: Layered Software Radio Architecture
‘Stream’ processing at Stanford • Speeding up media applications • Need lots of computations per memory reference • Lots of data and sub-word parallelism • Current GPP architectures do not have enough ALUs • ‘Stream’ processors to the rescue
Special-purpose processors • Lots (100s) of ALUs • Fed by dedicated wires/memories
Care and feeding of ALUs • ‘Feeding’ structure dwarfs ALU [Figure labels: Instr. Cache, IP, IR, Regs, Instruction Bandwidth, Data Bandwidth]
Architecture implications • Tremendous opportunities • media problems have lots of parallelism and locality • VLSI technology enables 100s of ALUs/chip (1000s soon) • (in 0.18 µm: 0.1 mm² per integer adder, 0.5 mm² per FP adder) • Challenging problems • locality - global structures won’t work • explicit parallelism - ILP won’t keep 100 ALUs busy • memory - streaming applications don’t cache well • It’s time to try some new approaches
Register file organization • Register file functions: • short term storage for intermediate results • communication between multiple function units • Global register files don’t scale with #ALUs • need more registers to hold more results (grows with #ALUs) • need more ports to connect all of the units (grows with #ALUs²)
Distributed register files • Distributed register files mean: • not all functional units can access all data • each functional unit input/output no longer has a dedicated route from/to all register files
Stream processing • Little data reuse (pixels never revisited) • Highly data parallel (output pixels not dependent on other output pixels) • Compute intensive (60 operations per memory reference) [Figure: kernels connected by streams. Input data Image 0 and Image 1 each pass through two convolve kernels; a SAD kernel then produces the output Depth Map]
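The kernel chain named on this slide (convolve, convolve, then SAD) can be imitated with Python generators, where every element is read once and handed straight to the next kernel. This is a one-dimensional toy under assumptions, not the Stanford application: the filter taps, the block-wise SAD, and the function names are made up for illustration, and the disparity search that a real depth extractor needs is omitted.

```python
from collections import deque


def convolve(stream, taps):
    """Kernel: sliding-window filter; each input element is read exactly once.

    Taps are applied oldest-sample-first, which matches true convolution
    only for symmetric taps (an accepted simplification here).
    """
    window = deque(maxlen=len(taps))
    for x in stream:
        window.append(x)
        if len(window) == len(taps):
            yield sum(t * w for t, w in zip(taps, window))


def sad(stream_a, stream_b, width):
    """Kernel: sum of absolute differences over consecutive blocks of `width`."""
    block = []
    for a, b in zip(stream_a, stream_b):
        block.append(abs(a - b))
        if len(block) == width:
            yield sum(block)
            block = []


def pipeline(image0, image1, taps=(1, 1), width=2):
    """Chain the kernels: two convolves per image, then SAD across both."""
    left = convolve(convolve(iter(image0), taps), taps)
    right = convolve(convolve(iter(image1), taps), taps)
    return list(sad(left, right, width))
```

Because the kernels are chained generators, no intermediate image is ever stored in full, which mirrors the "pixels never revisited" property.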
Stream programming
• Streams • Communication

void main() {
    Stream<int> a(256);
    Stream<int> b(256);
    Stream<int> c(256);
    Stream<int> d(1024);
    ...
    example1(a, b, c);
    example2(c, d);
    ...
}

• Kernels • Computation

KERNEL example1(istream<int> a, istream<int> b, ostream<int> c) {
    loop_stream(a) {
        int ai, bi, ci;
        a >> ai;
        b >> bi;
        ci = ai * 2 + bi * 3;
        c << ci;
    }
}
Stream Processor • Instructions are Load, Store, and Operate • operands are streams • Operate performs a compound stream operation • read elements from input streams • perform a local computation • append elements to output streams • repeat until input stream is consumed • (e.g., triangle transform)
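The three instruction types can be mimicked in a few lines of Python. This is a sketch under assumptions rather than the Imagine instruction set: memory and the stream register file are plain dictionaries, the function names `load`, `store`, `operate`, and `run` are invented, and the kernel reuses the arithmetic of `example1` from the stream programming slide.

```python
def load(memory, srf, name):
    """Load: bring a whole stream from memory into the stream register file."""
    srf[name] = list(memory[name])


def store(memory, srf, name):
    """Store: write a whole stream from the stream register file to memory."""
    memory[name] = list(srf[name])


def operate(srf, kernel, inputs, output):
    """Operate: read one element from each input stream, compute locally,
    append to the output stream, and repeat until the inputs are consumed."""
    out = []
    for elements in zip(*(srf[name] for name in inputs)):
        out.append(kernel(*elements))
    srf[output] = out


def example1(ai, bi):
    # Same arithmetic as the example1 kernel: ci = ai * 2 + bi * 3.
    return ai * 2 + bi * 3


def run(memory):
    srf = {}  # stand-in for the stream register file
    load(memory, srf, "a")
    load(memory, srf, "b")
    operate(srf, example1, ["a", "b"], "c")
    store(memory, srf, "c")
    return memory
```

Note that every operand is a whole stream; individual elements are only visible inside the kernel.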
Imagine Stream Processor [Figure: Imagine block diagram. Labels: Host Processor, Stream Controller, Microcontroller, ALU Clusters 0–7, Stream Register File, Streaming Memory System with four SDRAMs, Network Interface, Network]
Arithmetic clusters [Figure: one arithmetic cluster. Labels: Intercluster Network, Local Register File, functional units (+, *, *, +, +, /), CU, Cross Point, To SRF, From SRF]
Bandwidth hierarchy • VLIW clusters with shared control • 41.2 32-bit operations per word of memory bandwidth [Figure: SDRAM to Stream Register File at 2GB/s; Stream Register File to ALU clusters at 32GB/s; 544GB/s within the ALU clusters]
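The numbers on this slide can be related with a little arithmetic. The sketch below assumes that the three figures belong to SDRAM, the stream register file, and the clusters in increasing order, that 1 GB is 10^9 bytes, and that a word is 32 bits; the function names are hypothetical. The second function gives only a rough ceiling for the case where memory bandwidth is the sole limit.

```python
BYTES_PER_WORD = 4  # 32-bit words

# Assumed mapping of the slide's three figures to the hierarchy levels.
BANDWIDTH_GB_S = {"sdram": 2, "srf": 32, "clusters": 544}


def bandwidth_ratios(levels):
    """Return each level's bandwidth relative to the slowest level."""
    base = min(levels.values())
    return {name: bw / base for name, bw in levels.items()}


def memory_bound_ops_per_second(memory_gb_s, ops_per_word):
    """Operations per second sustainable if memory were the only bottleneck."""
    words_per_second = memory_gb_s * 1e9 / BYTES_PER_WORD
    return words_per_second * ops_per_word
```

Under these assumptions the register file offers 16 times and the clusters 272 times the memory bandwidth, and 41.2 operations per memory word at 2GB/s corresponds to roughly 20.6 billion operations per second.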
Conclusions • ‘Streams’ shown to be promising for reconfigurable computing • wireless may need reconfigurability • ‘Streams’ shown to be promising for media processing • wireless may have similar workloads • Important to understand pros and cons of different methodologies for good wireless architectures • Important to have the right tools