Stream Caching: Mechanisms for General Purpose Stream Processing • Nat Duca, Jonathan Cohen (Johns Hopkins University) • Peter Kirchner (IBM Research)
Talk Outline • Objective: reconcile current practices of CPU design with stream processing theory • Part 1: Streaming Ideas in current architectures • Latency and Die-Space • Processor types and tricks • Part 2: Insights about Stream Caches • Could window-based streaming be the next step in computer architecture?
Streaming Architectures • Graphics processors • Signal processors • Network processors • Scalar/Superscalar processors • Data stream processors? • Software architectures?
What is a Streaming Computer? • Two [overlapping] ideas • A system that executes strict-streaming algorithms [unbounded input size N, small working state M]; a minimal sketch of this style follows below • A general-purpose system geared toward arbitrary computation, but at its best on streaming workloads • Big motivator: ALU-bound computation! • To what extent do present computer architectures serve these two views of a streaming computer?
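As a concrete illustration of the first idea, here is a minimal sketch (not from the talk) of a strict-streaming computation: the input size N is unbounded, each record is touched once, and the working state M is a small constant. The names and the particular statistics computed are illustrative.

```cpp
#include <cstdint>
#include <iostream>

// Strict streaming: O(1) state regardless of how many records arrive.
struct RunningStats {
    uint64_t count = 0;   // records seen so far
    double   sum   = 0;   // running sum
    double   max   = 0;   // running maximum

    // Each record is consumed exactly once and then discarded.
    void consume(double x) {
        if (count == 0 || x > max) max = x;
        sum += x;
        ++count;
    }
    double mean() const { return count ? sum / count : 0.0; }
};

int main() {
    RunningStats stats;
    double x;
    while (std::cin >> x)     // N can be arbitrarily large...
        stats.consume(x);     // ...but the state stays O(1).
    std::cout << "mean=" << stats.mean() << " max=" << stats.max << "\n";
}
```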
[Super]scalar Architectures • Keep memory latency from limiting computation speed • Solutions: • Caches • Pipelining • Prefetching • Eager execution / branch prediction [the "super" in superscalar] • These are heuristics to locate streaming patterns in unstructured program behavior
By the Numbers, Data • With caches, pipelines, and eager execution • Random: 182 MB/s • Sequential: 315 MB/s • With prefetching added • Random: 490 MB/s • Sequential: 516 MB/s • Theoretical maximum: 533 MB/s • (a microbenchmark sketch follows below)
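The slide's figures come from the authors' own measurements on a P4-class machine; the sketch below only shows the kind of microbenchmark that produces such a comparison, streaming through a buffer sequentially versus in a random order and reporting effective bandwidth. Buffer size and access orders are illustrative assumptions.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Sum the buffer in the given visit order and report MB/s.
static double bandwidth_mb_s(const std::vector<uint32_t>& data,
                             const std::vector<size_t>& order) {
    auto t0 = std::chrono::steady_clock::now();
    volatile uint64_t sink = 0;                      // keep the loads alive
    for (size_t i : order) sink = sink + data[i];
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    return data.size() * sizeof(uint32_t) / secs / (1 << 20);
}

int main() {
    const size_t n = 1u << 22;                       // 16 MB of uint32_t, larger than on-chip caches
    std::vector<uint32_t> data(n, 1);

    std::vector<size_t> seq(n);
    std::iota(seq.begin(), seq.end(), 0);            // sequential order
    std::vector<size_t> rnd = seq;
    std::shuffle(rnd.begin(), rnd.end(), std::mt19937{42});  // random order

    std::cout << "sequential: " << bandwidth_mb_s(data, seq) << " MB/s\n";
    std::cout << "random:     " << bandwidth_mb_s(data, rnd) << " MB/s\n";
}
```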
By the Numbers, Observations • Achieving full throughput on a scalar CPU requires either • (a) prefetching [requires advance knowledge of the access pattern; see the sketch below] • (b) sequential access [no advance knowledge required] • Vector architectures hide latency in their instruction set through implicit prefetching • Dataflow machines address latency with automatic prefetching • Rule 1: Sequential I/O simplifies control logic and memory access
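Observation (a) can be made concrete with a software prefetch hint: when the upcoming addresses are known in advance (here, through an index array), the memory system can be asked to start fetching them early. This is a sketch only; `__builtin_prefetch` is a GCC/Clang builtin, and the prefetch distance is an illustrative tuning parameter.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Gather-sum over an index array, prefetching a fixed distance ahead.
uint64_t gather_sum(const std::vector<uint32_t>& data,
                    const std::vector<size_t>& index) {
    constexpr size_t kPrefetchDistance = 16;   // tune per machine
    uint64_t sum = 0;
    for (size_t i = 0; i < index.size(); ++i) {
        // We know which record is needed kPrefetchDistance iterations ahead,
        // so request it from the memory system now.
        if (i + kPrefetchDistance < index.size())
            __builtin_prefetch(&data[index[i + kPrefetchDistance]]);
        sum += data[index[i]];
    }
    return sum;
}
```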
Superscalar (e.g. P4) • [Diagram: processor fed by a local memory hierarchy through cache and prefetch logic] • The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic; the remaining area is primarily the floating-point ALU.
Pure Streaming (e.g. Imagine) • [Diagram: in streams and out streams flowing through the processor]
Can We Build This Machine? • [Diagram: in streams and out streams with a local memory hierarchy] • Rule 2: A small memory footprint leaves more room for ALU --> more throughput
Part II: Chromium • Pure stream processing model • Operates on the OpenGL command stream • Begin(Triangles); Vertex, Vertex, Vertex; End; • Record splits are supported; record joins are not • Useful computation is performed by composing Stream Processors into a DAG (a schematic sketch follows below) • Note: the DAG is constructed across multiple processors (unlike dataflow)
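The sketch below is a schematic of this model, not Chromium's actual SPU interface: stream processors consume a GL command stream record by record and forward it downstream, and a split node fans one input stream out to several children, forming a DAG. All type and class names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// A GL-command-stream record in miniature.
enum class Op { Begin, Vertex, End };
struct Record { Op op; float x = 0, y = 0, z = 0; };

// Every node in the DAG consumes records one at a time.
struct StreamProcessor {
    virtual ~StreamProcessor() = default;
    virtual void consume(const Record& r) = 0;
};

// Pass-through filter: counts primitives, then forwards each record.
struct PrimitiveCounter : StreamProcessor {
    StreamProcessor* next;
    size_t begins = 0;
    explicit PrimitiveCounter(StreamProcessor* n) : next(n) {}
    void consume(const Record& r) override {
        if (r.op == Op::Begin) ++begins;
        next->consume(r);                 // records flow on unchanged
    }
};

// Record split (supported in Chromium): one input, many outputs.
struct Split : StreamProcessor {
    std::vector<StreamProcessor*> children;
    void consume(const Record& r) override {
        for (auto* c : children) c->consume(r);
    }
};
```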
Chromium w/ Stream Caches • We added a join capability to Chromium to collapse multiple records into one (sketched below) • Incidentally, this enables windowed computations • Thought: there seems to be a direct connection between streaming joins and sliding windows • Because we are in software, the windows can grow quite large without much hassle • What if we move to hardware?
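A minimal sketch of the join idea, under illustrative assumptions: buffer up to M input records in a window, then collapse them into a single output record (here, the bounding box of the buffered vertices). The window size M and the collapse rule are placeholders, not the talk's actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Vertex { float x, y, z; };
struct BoundingBox { Vertex lo, hi; };

// Collapses every M consecutive vertices into one bounding-box record.
class WindowedJoin {
public:
    explicit WindowedJoin(size_t window_size) : m_(window_size) {}

    // Returns true and writes a collapsed record each time the window fills.
    bool consume(const Vertex& v, BoundingBox* out) {
        window_.push_back(v);
        if (window_.size() < m_) return false;
        *out = collapse();
        window_.clear();                  // move on to the next window
        return true;
    }

private:
    BoundingBox collapse() const {
        BoundingBox b{window_.front(), window_.front()};
        for (const Vertex& v : window_) {
            b.lo = {std::min(b.lo.x, v.x), std::min(b.lo.y, v.y), std::min(b.lo.z, v.z)};
            b.hi = {std::max(b.hi.x, v.x), std::max(b.hi.y, v.y), std::max(b.hi.z, v.z)};
        }
        return b;
    }

    size_t m_;
    std::vector<Vertex> window_;
};
```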
Windowed Streaming • [Diagram: in streams and out streams through a processor with a window buffer] • Uses for a window buffer of size M: • Store program structures of up to size M • Cache M input records, where M << N
Windowed Streaming • [Diagram: window buffer between in streams and out streams] • Realistic values of M if you stay exclusively on chip: 128 KB ... 256 KB ... 2 MB [DRAM-on-chip technology is promising]
Impact of Window Size • [Diagram: window buffer between in streams and out streams] • Insight: as M increases, this starts to resemble a superscalar computer
The Continuum Architecture • [Diagram: in streams and out streams with a full memory hierarchy] • If M grows too large: • Non-sequential I/O --> caches • Caches --> less room for ALU (etc.)
Windowed Streaming • [Diagram: window buffer with loopback streams added alongside the in and out streams] • Thought: can we work around the window-buffer size limit with a loopback feature?
Windowed Streaming • [Diagram: window buffer with loopback streams routed through memory] • Thought: what do we gain by allowing a finite delay in the loopback stream? (a sketch of this model follows below)
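One way to make the thought experiment concrete, purely as an assumption of ours: model the loopback as a bounded FIFO with a fixed delay D, so that a record emitted on the loopback port re-enters the processor D steps later. This lets a computation revisit data without holding it all in the window buffer. Names and semantics below are illustrative only.

```cpp
#include <cstddef>
#include <deque>
#include <optional>
#include <utility>

// A loopback port with a fixed re-entry delay, measured in steps.
template <typename Record>
class LoopbackStream {
public:
    explicit LoopbackStream(size_t delay) : delay_(delay) {}

    // Processor writes to its loopback output port at the current step.
    void emit(const Record& r) { fifo_.push_back({step_ + delay_, r}); }

    // Advance one step; a record re-enters only after its delay expires.
    std::optional<Record> step() {
        ++step_;
        if (!fifo_.empty() && fifo_.front().first <= step_) {
            Record r = fifo_.front().second;
            fifo_.pop_front();
            return r;
        }
        return std::nullopt;
    }

private:
    size_t delay_;
    size_t step_ = 0;
    std::deque<std::pair<size_t, Record>> fifo_;  // (release step, record)
};
```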
Streaming Networks: 1:N [Hanrahan model]
Versatility of Streaming Networks? • Question: what algorithms can we support here, and how? • Both from a theoretical and a practical view • We have experimented with graphics problems only: • Stream compression, visibility & culling, level of detail
New Concepts with Streaming Networks • An individual processor's cost is small • Highly flexible: borrows high-level ideas from dataflow • Multiple streams in and out • Interleaved or non-interleaved • Scalable window size • Open to entirely new concepts • E.g., how do you add more memory to this system?
Summary • Systems are easily built on streaming I/O and memory models • By design, such systems make maximum use of the hardware: very efficient • Continuum of architectures: pure streaming to superscalar • Stream processors are trivially chained, even in cycles • Such a chained architecture may be highly flexible, supported by: • Experimental evidence & systems work • Dataflow literature • Streaming literature