Exploring Stream Caching in General Purpose Stream Processing Architectures

Stream Caching: Mechanisms for General Purpose Stream Processing
Nat Duca, Jonathan Cohen (Johns Hopkins University)
Peter Kirchner (IBM Research)

Talk Outline
• Objective: reconcile current practices of CPU design with stream processing theory
• Part 1: Streaming Ideas in current architectures
  • Latency and Die-Space
  • Processor types and tricks
• Part 2: Insights about Stream Caches
  • Could window-based streaming be the next step in computer architecture?

Streaming Architectures
• Graphics processors
• Signal processors
• Network processors
• Scalar/Superscalar processors
• Data stream processors?
• Software architectures?

What is a Streaming Computer?
• Two [overlapping] ideas
  • A system that executes strict-streaming algorithms [unbounded N, small M]
  • A general purpose system that is geared toward general computation, but is best for the streaming case
• Big motivator: ALU-bound computation!
• To what extent do present computer architectures serve these two views of a streaming computer?
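A strict-streaming algorithm in the sense above reads each of the N input records once and keeps only a small state of size M. The following is a minimal Python sketch of that idea, not code from the talk; `running_mean` is a hypothetical name chosen for illustration.

```python
from typing import Iterable, Iterator


def running_mean(records: Iterable[float]) -> Iterator[float]:
    """Yield the mean of all records seen so far.

    The state is two numbers (count and mean), so M stays constant
    no matter how large N grows. Illustrative only.
    """
    count = 0
    mean = 0.0
    for x in records:
        count += 1
        mean += (x - mean) / count  # incremental update, no history kept
        yield mean
```

Because the function is a generator, it can consume an unbounded source and still produce an output record for every input record.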

[Super]scalar Architectures
• Keep memory latency from limiting computation speed
• Solutions:
  • Caches
  • Pipelining
  • Prefetching
  • Eager execution / branch prediction [the super in superscalar]
• These are heuristics to locate streaming patterns in unstructured program behavior

By the Numbers, Data
• Optimized using caches, pipelines, and eager execution
  • Random: 182MB/s
  • Sequential: 315MB/s
• Optimizing with prefetching
  • Random: 490MB/s
  • Sequential: 516MB/s
• Theoretical Maximum: 533MB/s
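The figures above are easier to compare as fractions of the 533MB/s theoretical maximum. The short Python calculation below does only that arithmetic on the numbers from the slide; the dictionary labels and the name `fraction_of_peak` are chosen here for illustration.

```python
THEORETICAL_MAX_MBPS = 533  # from the slide

# Measured throughput in MB/s, keyed by (configuration, access pattern).
measured_mbps = {
    ("caches/pipelines/eager", "random"): 182,
    ("caches/pipelines/eager", "sequential"): 315,
    ("with prefetching", "random"): 490,
    ("with prefetching", "sequential"): 516,
}


def fraction_of_peak(mbps: float) -> float:
    """Return measured throughput as a fraction of the theoretical maximum."""
    return mbps / THEORETICAL_MAX_MBPS


for (config, pattern), mbps in measured_mbps.items():
    print(f"{config:24s} {pattern:10s} {fraction_of_peak(mbps):.0%}")
```

This works out to roughly 34% and 59% of peak without prefetching, and roughly 92% and 97% with it.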

By the Numbers, Observations
• Achieving full throughput on a scalar CPU requires either
  • (a) prefetching [requires advance knowledge of the access pattern]
  • (b) sequential access [no advance knowledge required]
• Vector architectures hide latency in their instruction set using implicit prefetching
• Dataflow machines address latency using automatic prefetching
• Rule 1: Sequential I/O simplifies control and access to memory

Superscalar (e.g. P4)
[Diagram labels: Local Memory Hierarchy, Prefetch, Cache]

Superscalar (e.g. P4)
[Diagram labels: Local Memory Hierarchy, Cache, Prefetch]
The P4, by surface area, is about 95% cache, prefetch, and branch-prediction logic. The remaining area is primarily the floating point ALU.

Pure Streaming (e.g. Imagine)
[Diagram labels: In Streams, Out Streams]

Can We Build This Machine?
[Diagram labels: Local Memory Hierarchy, In Streams, Out Streams]
• Rule 2: Small memory footprint allows more room for ALU --> more throughput

Part II: Chromium
• Pure stream processing model
• Deals with OpenGL command stream
  • Begin(Triangles); Vertex, Vertex, Vertex; End;
• Record splits are supported, joins are not
• You perform useful computation in Chromium by joining together Stream Processors into a DAG
• Note: DAG is constructed across multiple processors (unlike dataflow)
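The model described above, stream processors chained into a graph with record splits allowed, can be sketched with Python generators. This is not Chromium's API: the tuple-based record format and the names `split_triangle`, `scale_vertices`, and `pipeline` are assumptions made up for this illustration, and the graph shown is the simplest possible DAG (a linear chain).

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple  # hypothetical format, e.g. ("Begin", "Triangles"), ("Vertex", x, y, z), ("End",)


def split_triangle(records: Iterable[Record]) -> Iterator[Record]:
    """Record split: expand one ("Triangle", v0, v1, v2) record into a Begin/Vertex/End run."""
    for rec in records:
        if rec[0] == "Triangle":
            yield ("Begin", "Triangles")
            for v in rec[1:]:
                yield ("Vertex",) + tuple(v)
            yield ("End",)
        else:
            yield rec  # pass every other record through unchanged


def scale_vertices(records: Iterable[Record], factor: float) -> Iterator[Record]:
    """One-in, one-out processor: scale the coordinates of each Vertex record."""
    for rec in records:
        if rec[0] == "Vertex":
            yield ("Vertex",) + tuple(factor * c for c in rec[1:])
        else:
            yield rec


def pipeline(source: Iterable[Record]) -> Iterator[Record]:
    """Chain two processors: source -> split_triangle -> scale_vertices."""
    return scale_vertices(split_triangle(source), 2.0)
```

Neither processor keeps any state between records, which is what makes a split easy to support in a pure streaming model.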

Chromium w/ Stream Caches
• We added join capability to Chromium for the purpose of collapsing multiple records into one
• Incidentally: this allows windowed computations
• Thought: there seems to be a direct connection between streaming joins and sliding windows
• Because we're in software, the windows can become quite big without too much hassle
• What if we move to hardware?
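To see why a join implies a window, consider a processor that collapses a Begin ... End run into a single record: it has to hold the run's records until the End arrives. The Python sketch below illustrates this under assumptions; the tuple record format and the name `join_triangles` are hypothetical and do not reflect the authors' implementation.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Record = Tuple  # hypothetical format, e.g. ("Begin", "Triangles"), ("Vertex", x, y, z), ("End",)


def join_triangles(records: Iterable[Record]) -> Iterator[Record]:
    """Join: collapse each Begin ... End run into one ("Triangles", ...) record."""
    buffered: Optional[List[Record]] = None  # None while outside a Begin/End run
    for rec in records:
        if rec[0] == "Begin":
            buffered = []  # start holding records: this buffer is the window
        elif rec[0] == "End" and buffered is not None:
            yield ("Triangles",) + tuple(buffered)
            buffered = None
        elif buffered is not None:
            buffered.append(rec)
        else:
            yield rec  # records outside a run pass through unchanged
```

The buffer grows with the length of the run, so in this sketch the largest run the join can handle is bounded by the memory available for the window.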

Windowed Streaming
[Diagram labels: Window Buffer, In Streams, Out Streams]
Uses for Window Buffer of size M:
• Store program structures of up to size M
• Cache M input records, where M << N
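One way to read "cache M input records" is a fixed-size window that remembers recent records and flags repeats. The Python sketch below illustrates that reading; the least-recently-used eviction policy and the name `stream_cache` are assumptions, since the slides do not specify how the window is managed.

```python
from collections import OrderedDict
from typing import Hashable, Iterable, Iterator, Tuple


def stream_cache(records: Iterable[Hashable], m: int) -> Iterator[Tuple[str, Hashable]]:
    """Tag each record as a "hit" or "miss" against a window of at most m records.

    Eviction is least-recently-used, an assumption made for this sketch.
    """
    if m < 1:
        raise ValueError("window size m must be at least 1")
    window: "OrderedDict[Hashable, bool]" = OrderedDict()
    for rec in records:
        if rec in window:
            window.move_to_end(rec)  # mark as most recently used
            yield ("hit", rec)
        else:
            window[rec] = True
            if len(window) > m:
                window.popitem(last=False)  # evict the least recently used record
            yield ("miss", rec)
```

Memory use is bounded by m regardless of the stream length N, which is the property the window buffer is meant to provide.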

Windowed Streaming
[Diagram labels: Window Buffer, In Streams, Out Streams]
Realistic values of M if you stay exclusively on chip: 128KB ... 256KB ... 2MB [DRAM-on-chip tech is promising]

Impact on Window Size
[Diagram labels: Window Buffer, In Streams, Out Streams]
Insight: As M increases, this starts to resemble a superscalar computer

The Continuum Architecture
[Diagram labels: Memory Hierarchy, In Streams, Out Streams]
• For too large a value of M:
  • Non-Sequential I/O --> caches
  • Caches --> less room for ALU (etc)

Windowed Streaming
[Diagram labels: Window Buffer, In Streams, Out Streams, Loopback streams]
Thought: Can we extend the window-buffer limit with a loopback feature?

Windowed Streaming
[Diagram labels: Window Buffer, In Streams, Out Streams, Loopback streams, Memory]
Thought: What do we gain by allowing a finite delay in the loopback stream?
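The slide leaves this as an open question. As one possible reading, a loopback stream with a finite delay behaves like a delay line: each output record re-enters the processor a fixed number of steps later, so the records in flight act as storage outside the window buffer. The Python sketch below illustrates only that reading; the name `loopback`, the `gain` parameter, and the additive feedback rule are all assumptions.

```python
from collections import deque
from typing import Iterable, Iterator


def loopback(records: Iterable[float], delay: int, gain: float = 0.5) -> Iterator[float]:
    """Compute y[t] = x[t] + gain * y[t - delay], feeding outputs back as a stream."""
    if delay < 1:
        raise ValueError("delay must be at least 1")
    # Records currently in flight on the loopback stream, oldest first.
    in_flight = deque([0.0] * delay, maxlen=delay)
    for x in records:
        y = x + gain * in_flight[0]  # the oldest looped-back output arrives now
        in_flight.append(y)          # maxlen drops the record just consumed
        yield y
```

In this sketch the processor itself holds no history; the `delay` records travelling on the loopback stream carry all of the state.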

Streaming Networks: Primitive

Streaming Networks: 1:N [Hanrahan model]

Streaming Networks: N:1
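The 1:N and N:1 topologies can be sketched as a fan-out and a fan-in over Python iterators. The round-robin distribution and interleaving policy, and the names `fan_out` and `fan_in`, are assumptions made for illustration; the slides do not say how records are routed.

```python
from itertools import islice, tee, zip_longest
from typing import Iterable, Iterator, List

_MISSING = object()  # sentinel marking an exhausted stream


def fan_out(records: Iterable, n: int) -> List[Iterator]:
    """1:N: distribute one stream over n streams, round-robin."""
    copies = tee(records, n)
    # Stream i takes records i, i + n, i + 2n, ... from its own copy.
    return [islice(copy, i, None, n) for i, copy in enumerate(copies)]


def fan_in(streams: Iterable[Iterable]) -> Iterator:
    """N:1: interleave several streams into one, round-robin."""
    for group in zip_longest(*streams, fillvalue=_MISSING):
        for rec in group:
            if rec is not _MISSING:
                yield rec
```

With these policies, `fan_in(fan_out(s, n))` reproduces the original order of `s`, so independent processors can be placed on the n middle streams without reordering records.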

Streaming Networks: The Ugly

Versatility of Streaming Networks?
• Question: What algorithms can we support here? How?
  • Both from a Theoretical and Practical view
• We have experimented with graphics problems only:
  • Stream compression, visibility & culling, level of detail

New Concepts with Streaming Networks
• An individual processor's cost is small
• Highly flexible: use high level ideas of Dataflow
  • Multiple streams in and out
  • Interleaving or non-interleaved
  • Scalable window size
• Open to entirely new concepts
  • E.g. How do you add more memory in this system?

Summary
• Systems are easily built on the basis of streaming I/O and memory models
• By design, they make maximum use of hardware: very efficient
• Continuum of Architectures: Pure Streaming to Superscalar
• Stream processors are trivially chained, even in cycles
• Such a chained architecture may be highly flexible:
  • Experimental evidence & systems work
  • Dataflow literature
  • Streaming literature
