This talk outlines the mechanisms for stream processing in various architectures, including insights on stream caches, streaming architectures, and the concept of a streaming computer. It delves into the impact of caching, pipelining, and prefetching on processing speed. The presentation covers topics like prefetching strategies, memory hierarchy, and window buffering in stream processing models.
Stream Caching: Mechanisms for General Purpose Stream Processing
Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)
Talk Outline
• Objective: reconcile current practices of CPU design with stream processing theory
• Part 1: Streaming ideas in current architectures
  • Latency and die space
  • Processor types and tricks
• Part 2: Insights about stream caches
  • Could window-based streaming be the next step in computer architecture?
Streaming Architectures
• Graphics processors
• Signal processors
• Network processors
• Scalar/superscalar processors
• Data stream processors?
• Software architectures?
What is a Streaming Computer?
• Two [overlapping] ideas:
  • A system that executes strict-streaming algorithms [unbounded N, small M]
  • A general-purpose system geared toward general computation, but best at the streaming case
• Big motivator: ALU-bound computation!
• To what extent do present computer architectures serve these two views of a streaming computer?
[Super]scalar Architectures
• Goal: keep memory latency from limiting computation speed
• Solutions:
  • Caches
  • Pipelining
  • Prefetching
  • Eager execution / branch prediction [the "super" in superscalar]
• These are heuristics to locate streaming patterns in unstructured program behavior
By the Numbers: Data
• Optimized using caches, pipelines, and eager execution:
  • Random: 182 MB/s
  • Sequential: 315 MB/s
• Optimized with prefetching:
  • Random: 490 MB/s
  • Sequential: 516 MB/s
• Theoretical maximum: 533 MB/s
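The measurements above can be put in perspective as fractions of the 533 MB/s theoretical peak; the short script below (not part of the original benchmark) just does that arithmetic on the quoted numbers.

```python
# Fraction of theoretical peak bandwidth achieved under each access
# pattern, using the measurements quoted on this slide.
peak = 533.0  # MB/s, theoretical maximum

measured = {
    "random, no prefetch": 182.0,
    "sequential, no prefetch": 315.0,
    "random, prefetch": 490.0,
    "sequential, prefetch": 516.0,
}

for name, mbps in measured.items():
    print(f"{name}: {mbps / peak:.0%} of peak")
```

Note how prefetching lifts even random access to roughly 92% of peak, while sequential access with prefetching comes within a few percent of the theoretical maximum.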
By the Numbers: Observations
• Achieving full throughput on a scalar CPU requires either
  • (a) prefetching [requires advance knowledge], or
  • (b) sequential access [no advance knowledge required]
• Vector architectures hide latency in their instruction set using implicit prefetching
• Dataflow machines solve latency using automatic prefetching
• Rule 1: Sequential I/O simplifies control and access to memory
Superscalar (e.g. P4)
[Diagram: local memory hierarchy feeding cache and prefetch logic]
• The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating-point ALU.
Pure Streaming (e.g. Imagine)
[Diagram: in streams → processor → out streams, with no memory hierarchy]
Can We Build This Machine?
[Diagram: in streams → processor → out streams, with a local memory hierarchy]
• Rule 2: A small memory footprint allows more room for ALU --> more throughput
Part II: Chromium
• Pure stream processing model
• Operates on the OpenGL command stream:
  • Begin(Triangles); Vertex, Vertex, Vertex; End;
• Record splits are supported; joins are not
• Useful computation in Chromium comes from joining stream processors into a DAG
• Note: the DAG is constructed across multiple processors (unlike dataflow)
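The model above can be sketched in a few lines. This is an illustrative toy, not Chromium's actual API: each stream processing unit applies a per-record transform and forwards the result downstream; giving a node more than one child models a record split, and the graph stays a DAG.

```python
# Toy model of a stream-processor DAG (illustrative names, not
# Chromium's real interfaces). Splits = multiple children per node.

class SPU:
    def __init__(self, transform):
        self.transform = transform   # per-record function
        self.children = []           # downstream SPUs; >1 child = split

    def feed(self, record):
        out = self.transform(record)
        for child in self.children:
            child.feed(out)

# A tiny OpenGL-like command stream: Begin / Vertex / End records.
sink = []
root = SPU(lambda r: r)                       # pass-through node
logger = SPU(lambda r: sink.append(r) or r)   # records everything it sees
root.children.append(logger)

for cmd in ["Begin(Triangles)", "Vertex", "Vertex", "Vertex", "End"]:
    root.feed(cmd)

print(sink)  # the full command stream, unchanged
```

Because `feed` only pushes records downstream, there is no way for two nodes to merge their outputs, which is exactly the missing join capability the next slide addresses.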
Chromium w/ Stream Caches
• We added join capability to Chromium to collapse multiple records into one
• Incidentally, this allows windowed computations
• Thought: there seems to be a direct connection between streaming joins and sliding windows
• Because we're in software, the windows can grow quite large without much hassle
• What if we move to hardware?
Windowed Streaming
[Diagram: in streams → window buffer → out streams]
Uses for a window buffer of size M:
• Store program structures of up to size M
• Cache M input records, where M << N
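The second use above can be sketched directly: a buffer of at most M records slides over an unbounded input of N records (M << N). The per-window computation here is a simple sum purely for illustration; the names are my own, not from the talk.

```python
from collections import deque

# Sliding-window stream processor sketch: the window buffer holds at
# most M records while the input stream may be unbounded (M << N).

def windowed(stream, M):
    window = deque(maxlen=M)   # the on-chip window buffer
    for record in stream:
        window.append(record)  # oldest record falls off automatically
        yield sum(window)      # any computation over <= M records

out = list(windowed(range(1, 6), M=3))
print(out)  # [1, 3, 6, 9, 12]
```

Only M records are ever resident, which is what lets the window buffer stay small enough to live on chip.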
Windowed Streaming
[Diagram: in streams → window buffer → out streams]
• Realistic values of M if you stay exclusively on chip: 128 KB ... 256 KB ... 2 MB
• [DRAM-on-chip technology is promising]
Impact of Window Size
[Diagram: in streams → window buffer → out streams]
• Insight: as M increases, this starts to resemble a superscalar computer
The Continuum Architecture
[Diagram: in streams → memory hierarchy → out streams]
• For too large a value of M:
  • Non-sequential I/O --> caches
  • Caches --> less room for ALU (etc.)
Windowed Streaming
[Diagram: in streams → window buffer → out streams, with loopback streams]
• Thought: can we get past the window-buffer size limit with a loopback feature?
Windowed Streaming
[Diagram: in streams → window buffer → out streams, loopback streams routed through memory]
• Thought: what do we gain by allowing a finite delay in the loopback stream?
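One way to see what a loopback stream buys: records the window cannot finish with are re-emitted onto the input side, so a window of fixed size M can complete a reduction over far more than M records. The sketch below is a software analogy of that idea (a pairwise reduction with M = 2), not the hardware design proposed in the talk.

```python
from collections import deque

# Loopback sketch: partial results re-enter the input queue, letting a
# size-2 window reduce an arbitrarily long stream. Illustrative only.

def loopback_reduce(records, combine):
    q = deque(records)                   # in-stream plus loopback stream
    while len(q) > 1:
        a, b = q.popleft(), q.popleft()  # window holds M = 2 records
        q.append(combine(a, b))          # loopback: result re-enters
    return q.popleft()

print(loopback_reduce(range(8), lambda a, b: a + b))  # 28
```

A finite delay on the loopback path corresponds here to partial results waiting in the queue behind fresh input before they are combined again.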
Streaming Networks: 1:N [Hanrahan model]
Versatility of Streaming Networks?
• Question: what algorithms can we support here? How?
  • From both a theoretical and a practical view
• We have experimented with graphics problems only:
  • Stream compression, visibility & culling, level of detail
New Concepts with Streaming Networks
• An individual processor's cost is small
• Highly flexible: uses high-level ideas from dataflow
  • Multiple streams in and out
  • Interleaved or non-interleaved
  • Scalable window size
• Open to entirely new concepts
  • E.g., how do you add more memory to this system?
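The interleaved vs. non-interleaved distinction above can be shown with two input streams; this is a generic illustration of the two presentation orders, with made-up record names.

```python
from itertools import chain

# Two ways to present multiple input streams to one processor:
# interleaved (records alternate round-robin) vs. non-interleaved
# (one stream is drained before the next).

a = ["a0", "a1", "a2"]
b = ["b0", "b1", "b2"]

interleaved = [r for pair in zip(a, b) for r in pair]
non_interleaved = list(chain(a, b))

print(interleaved)      # ['a0', 'b0', 'a1', 'b1', 'a2', 'b2']
print(non_interleaved)  # ['a0', 'a1', 'a2', 'b0', 'b1', 'b2']
```

Interleaving keeps both streams "live" in a small window at once; the non-interleaved order requires buffering one whole stream if records must be paired.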
Summary
• Systems are easily built on streaming I/O and memory models
• By design, such systems make maximum use of hardware: very efficient
• Continuum of architectures: pure streaming to superscalar
• Stream processors are trivially chained, even in cycles
• Such a chained architecture may be highly flexible:
  • Experimental evidence & systems work
  • Dataflow literature
  • Streaming literature