
Exploring Stream Caching in General Purpose Stream Processing Architectures

This talk outlines the mechanisms for stream processing in various architectures, including insights on stream caches, streaming architectures, and the concept of a streaming computer. It delves into the impact of caching, pipelining, and prefetching on processing speed. The presentation covers topics like prefetching strategies, memory hierarchy, and window buffering in stream processing models.



Presentation Transcript


  1. Stream Caching: Mechanisms for General Purpose Stream Processing • Nat Duca, Jonathan Cohen (Johns Hopkins University) • Peter Kirchner (IBM Research)

  2. Talk Outline • Objective: reconcile current practices of CPU design with stream processing theory • Part 1: Streaming Ideas in current architectures • Latency and Die-Space • Processor types and tricks • Part 2: Insights about Stream Caches • Could window-based streaming be the next step in computer architecture?

  3. Streaming Architectures • Graphics processors • Signal processors • Network processors • Scalar/Superscalar processors • Data stream processors? • Software architectures?

  4. What is a Streaming Computer? • Two [overlapping] ideas • A system that executes strict-streaming algorithms [unbounded N, small M] • A general-purpose system that handles arbitrary computation, but performs best in the streaming case • Big motivator: ALU-bound computation! • To what extent do present computer architectures serve these two views of a streaming computer?

  5. [Super]scalar Architectures • Keep memory latency from limiting computation speed • Solutions: • Caches • Pipelining • Prefetching • Eager execution / branch prediction [the "super" in superscalar] • These are heuristics to locate streaming patterns in unstructured program behavior

  6. By the Numbers, Data • Optimized using caches, pipelines, and eager-execution • Random: 182MB/s • Sequential: 315MB/s • Optimizing with prefetching • Random: 490MB/s • Sequential: 516MB/s • Theoretical Maximum: 533MB/s
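The gap between random and sequential access on the slide comes from native hardware benchmarks; the quoted MB/s figures cannot be reproduced in an interpreted language. Still, the two access patterns themselves are easy to sketch. A minimal Python illustration (all names are mine, not from the talk) that touches the same records once each, in sequential versus shuffled order:

```python
import array
import random
import time

# Build ~8 MB of doubles -- large enough to spill out of typical caches.
N = 1_000_000
data = array.array("d", (float(i) for i in range(N)))

order = list(range(N))
random.shuffle(order)                 # cache-hostile visitation order

def sum_in_order(a, idx):
    """Touch every record exactly once, in the order given by idx."""
    total = 0.0
    for i in idx:
        total += a[i]
    return total

t0 = time.perf_counter()
s_seq = sum_in_order(data, range(N))  # sequential: prefetch-friendly
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
s_rnd = sum_in_order(data, order)     # random: defeats cache and prefetch
t_rnd = time.perf_counter() - t0
```

Both passes do identical work per record; any timing difference is purely the memory system rewarding the sequential pattern, which is the slide's point.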

  7. By the Numbers, Observations • Achieving full throughput on a scalar CPU requires either • (a) prefetching [requires advance knowledge] • (b) sequential access [no advance knowledge required] • Vector architectures hide latency in their instruction set using implicit prefetching • Dataflow machines solve latency using automatic prefetching • Rule 1: Sequential I/O simplifies control and access to memory, etc.

  8. Superscalar (e.g. P4) Local Memory Hierarchy Prefetch Cache

  9. Superscalar (e.g. P4) Local Memory Hierarchy Cache Prefetch • The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating point ALU.

  10. Pure Streaming (e.g. Imagine) Out Streams In Streams

  11. Can We Build This Machine? Local Memory Hierarchy • Rule 2: Small memory footprint allows more room for ALU --> more throughput Out Streams In Streams

  12. Part II: Chromium • Pure stream processing model • Deals with OpenGL command stream • Begin(Triangles); Vertex, Vertex, Vertex; End; • Record splits are supported, joins are not • You perform useful computation in Chromium by composing Stream Processors into a DAG • Note: DAG is constructed across multiple processors (unlike dataflow)
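The Chromium model on this slide — stream processors consuming and emitting a record stream, with splits (fan-out) supported — can be sketched with Python generators. This is a toy illustration of the model, not Chromium's API; `count_triangles` and `split` are hypothetical names:

```python
def count_triangles(stream):
    """A toy stream processor: pass OpenGL-style records through
    unchanged, appending a triangle count once the stream ends."""
    verts = 0
    for record in stream:
        if record[0] == "Vertex":
            verts += 1
        yield record
    yield ("TriangleCount", verts // 3)

def split(stream, n):
    """Record split (fan-out): feed one stream to n consumers.
    Materialized here for simplicity; a real system forwards lazily."""
    buffered = list(stream)
    return [iter(buffered) for _ in range(n)]

# The command stream from the slide: Begin(Triangles); Vertex x3; End.
gl_stream = [("Begin", "Triangles"),
             ("Vertex", (0.0, 0.0)), ("Vertex", (1.0, 0.0)),
             ("Vertex", (0.0, 1.0)),
             ("End",)]

left, right = split(iter(gl_stream), 2)   # supported: split (fan-out)
counted = list(count_triangles(left))     # one node of the DAG
```

Each processor sees records strictly in arrival order and holds only constant state, which is what makes chaining such nodes into a DAG cheap; the missing primitive, per the slide, is the join.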

  13. Chromium w/ Stream Caches • We added join capability to Chromium for the purpose of collapsing multiple records to one • Incidentally: this allows windowed computations • Thought: there seems to be a direct connection between streaming-joins and sliding-windows • Because we're in software, the windows can become quite big without too much hassle • What if we move to hardware?

  14. Windowed Streaming Window Buffer Out Streams In Streams Uses for Window Buffer of size M: • Store program structures of up to size M • Cache M input records, where M << N
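The "cache M input records, where M << N" use of the window buffer is the classic sliding window. A minimal Python sketch (my illustration, not the authors' implementation), where joining the up-to-M records in the window into one output record mirrors the streaming-join / sliding-window connection from the previous slide:

```python
from collections import deque

def windowed(stream, M, join):
    """Slide a window of at most M records over an unbounded input
    stream (M << N), emitting one joined record per input record."""
    window = deque(maxlen=M)      # stands in for the on-chip window buffer
    for record in stream:
        window.append(record)     # the oldest record falls out automatically
        yield join(window)

# Join the <= M records in the window into one record: a moving average.
out = list(windowed(iter([1, 2, 3, 4, 5, 6]), M=3,
                    join=lambda w: sum(w) / len(w)))
```

Only M records are ever resident, so the memory footprint is fixed regardless of stream length N — the property the slide is trading against ALU area.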

  15. Windowed Streaming Window Buffer In Streams Out Streams Realistic values of M if you stay exclusively on chip: 128 KB ... 256 KB ... 2 MB [DRAM-on-chip tech is promising]

  16. Impact of Window Size Window Buffer Out Streams In Streams Insight: As M increases, this starts to resemble a superscalar computer

  17. The Continuum Architecture Memory Hierarchy • For too large a value of M: • Non-Sequential I/O --> caches • Caches --> less room for ALU (etc) Out Streams In Streams

  18. Windowed Streaming Window Buffer In Streams Out Streams Loopback streams Thought: Can we work around the window-buffer size limit with a loopback feature?

  19. Windowed Streaming Window Buffer In Streams Out Streams Loopback streams Memory Thought: What do we gain by allowing a finite delay in the loopback stream?
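One way to read the loopback idea on these two slides: with the output stream fed back to the input, a processor whose window holds only a single record can still compute a global result, spread over several passes. A hypothetical sketch (the bubble-pass sorter is my illustration, not from the talk) of a finite-delay loopback:

```python
def bubble_pass(stream):
    """A streaming processor with one record of state: for each
    adjacent pair, emit the smaller and hold the larger -- one
    bubble-sort pass per trip through the loop."""
    it = iter(stream)
    try:
        held = next(it)
    except StopIteration:
        return                      # empty stream: nothing to emit
    for x in it:
        lo, hi = (x, held) if x < held else (held, x)
        yield lo
        held = hi
    yield held

def loopback(processor, seed, rounds):
    """Feed the processor's output stream back to its input a fixed
    number of times: a loopback with a finite delay."""
    stream = iter(seed)
    for _ in range(rounds):
        stream = processor(stream)
    return list(stream)

# N-1 trips around the loop fully sort N records.
out = loopback(bubble_pass, [4, 1, 3, 2], rounds=3)
```

Each round is pure sequential I/O, so the loop buys global computation (here, a sort) without ever growing the window beyond one record — the trade is extra passes, i.e. delay, rather than extra buffer.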

  20. Streaming Networks: Primitive

  21. Streaming Networks: 1:N [Hanrahan model]

  22. Streaming Networks: N:1

  23. Streaming Networks: The Ugly

  24. Versatility of Streaming Networks? • Question: What algorithms can we support here? How? • Both from a Theoretical and Practical view • We have experimented with graphics problems only: • Stream compression, visibility & culling, level of detail

  25. New Concepts with Streaming Networks • An individual processor's cost is small • Highly flexible: use high level ideas of Dataflow • Multiple streams in and out • Interleaving or non-interleaved • Scalable window size • Open to entirely new concepts • E.g. How do you add more memory in this system?

  26. Summary • Systems are easily built on the basis of streaming I/O and memory models • By design, such a system makes maximum use of its hardware: very efficient • Continuum of Architectures: Pure Streaming to Superscalar • Stream processors are trivially chained, even in cycles • Such a chained architecture may be highly flexible: • Experimental evidence & systems work • Dataflow literature • Streaming literature
