MacroSS: Macro-SIMDization of Streaming Applications

Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke* • * Advanced Computer Arch. Lab., University of Michigan • ‡ IBM T.J. Watson Research Center • † Nvidia Corp.

Importance of SIMD • Energy- and area-efficient way to exploit data-level parallelism • Performance in multimedia and communication apps • Ubiquitous in modern processors • Intel: SSE, Larrabee • IBM: AltiVec, Cell SPE • ARM: NEON [Figure: three blocks, each with a Control Unit, Functional Units, and Cache]

Stream Computing • Prevalent in embedded, desktop, and server systems • Many optimizations exist for mapping and scheduling applications onto parallel architectures • Retargetability is a big plus in streaming languages • Task, pipeline, and data-level parallelism are mapped onto core-level parallelism • Data-level parallelism on SIMD engines is not utilized

Traditional Vectorization on Streaming Applications

Why Are SIMD Engines Under-Utilized? • Finding data-level parallelism suitable for SIMD engines is hard • Data must be properly aligned • Compiler optimizations and transformations are complicated • SIMD standards vary widely

In this work… • Macro-level SIMDization techniques for streaming languages • MacroSS compiler for the StreamIt language • Hardware-based buffer optimizations for packing/unpacking operations • Evaluation of MacroSS on Intel Core i7

StreamIt • Main constructs: • Filter: encapsulates computation • Stateful • Stateless • Pipeline: expresses pipeline parallelism • Splitjoin: expresses task/data-level parallelism • Exposes different types of parallelism • Scheduling and rate-matching are needed [Figure: a filter, a pipeline, and a splitjoin]
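
As a rough illustration of the rate-matching bullet, the sketch below computes how many times each filter in a linear pipeline must fire per steady-state iteration so that every buffer is filled and drained by the same amount. The function name and the example rates are hypothetical, and the calculation is the standard synchronous-dataflow balance rule rather than code taken from the StreamIt compiler. It assumes every interior pop rate is positive.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def steady_state_repetitions(rates):
    """rates: list of (pop, push) pairs, one per filter, in pipeline order.

    Returns the smallest positive integer firing counts such that for each
    adjacent pair: producer_firings * push == consumer_firings * pop.
    """
    # Start with one firing of the first filter and propagate the ratio.
    reps = [Fraction(1)]
    for (_, push), (pop, _) in zip(rates, rates[1:]):
        reps.append(reps[-1] * push / pop)

    # Scale by the least common multiple of the denominators to get integers.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps), 1)
    counts = [int(r * scale) for r in reps]

    # Divide out any common factor so the schedule is minimal.
    common = reduce(gcd, counts)
    return [c // common for c in counts]

# Hypothetical pipeline: A pushes 2; B pops 3 and pushes 1; C pops 2.
example = steady_state_repetitions([(0, 2), (3, 1), (2, 0)])
```

With these example rates, A fires three times (producing 6 items), B fires twice (consuming 6, producing 2), and C fires once (consuming 2).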

Macro SIMDization • SIMDization at graph level • Tunes the graph based on the target system • SIMD standards • Wide/Narrow SIMD • Actor SIMDization: • Single-Actor • Vertical • Horizontal

Single-Actor SIMDization Overview [Figure: four panels showing actor E(8) under Serial Execution, Execution Reordering, Realistic Vectorization, and Ideal Vectorization]

Single-Actor SIMDization • Applies only to stateless actors • Buffer accesses remain scalar • Pushes and pops become strided
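
The idea on this slide can be sketched as follows: several firings of a stateless actor run in lockstep, one firing per SIMD lane, so each lane's pops and pushes are spaced by the actor's pop and push rates. Everything named here (the `work` function, the rates, the width of four lanes) is an assumption made for illustration, and Python lists stand in for vector registers; this is not MacroSS output.

```python
SIMD_WIDTH = 4      # assumption: four 32-bit lanes in a 128-bit register
POP, PUSH = 3, 2    # hypothetical rates of the example actor

def work(a, b, c):
    """Hypothetical stateless work function: pops 3 items, pushes 2."""
    return a + b, b * c

def run_scalar(inp):
    """Reference: fire the actor once per POP input items."""
    out = []
    for f in range(len(inp) // POP):
        a, b, c = inp[f * POP:(f + 1) * POP]
        out.extend(work(a, b, c))
    return out

def run_simdized(inp):
    """Fire SIMD_WIDTH instances in lockstep, one firing per lane."""
    firings = len(inp) // POP
    assert firings % SIMD_WIDTH == 0, "sketch assumes no leftover firings"
    out = [0] * (firings * PUSH)
    for g in range(0, firings, SIMD_WIDTH):
        # Strided pops: lane l reads its j-th input with stride POP.
        va, vb, vc = ([inp[(g + l) * POP + j] for l in range(SIMD_WIDTH)]
                      for j in range(POP))
        # Lane-wise arithmetic, mirroring the body of work().
        v0 = [a + b for a, b in zip(va, vb)]
        v1 = [b * c for b, c in zip(vb, vc)]
        # Strided pushes: lane l writes its k-th output with stride PUSH.
        for k, v in enumerate((v0, v1)):
            for l in range(SIMD_WIDTH):
                out[(g + l) * PUSH + k] = v[l]
    return out

data = list(range(24))
```

Because the actor keeps no state between firings, reordering the firings into lanes does not change the result: `run_simdized(data)` matches `run_scalar(data)`. The per-lane gathers and scatters are the packing and unpacking cost that the scalar buffer accesses imply.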

Why Scalar Buffers? [Figure: a buffer holding elements 0 to 23, four elements per 128-bit row, with a question mark over which elements a vector access should fetch]

Vertical SIMDization

Horizontal SIMDization • Find isomorphic actors in split/join structures • The isomorphic actors are merged into one vectorized actor • Actors can be either stateful or stateless [Figure: Source, then a Splitter feeding parallel branches A1…An, B1…Bn, C1…Cn, then a Joiner and Sink]
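
A minimal sketch of the merging step, under the assumption that "isomorphic" means the branch actors run the same computation and differ only in constants and state. The actor classes, the gains, and the round-robin splitter with weight 1 are all hypothetical choices for this example; lists again stand in for vector registers.

```python
class ScaleAccumulate:
    """Hypothetical stateful actor: running sum of gain * input."""
    def __init__(self, gain):
        self.gain = gain
        self.total = 0

    def fire(self, x):
        self.total += self.gain * x
        return self.total

def run_splitjoin(inp, gains):
    """Reference: round-robin splitter and joiner around n scalar actors."""
    actors = [ScaleAccumulate(g) for g in gains]
    # Item i is routed to branch i % n and joined back in the same order.
    return [actors[i % len(actors)].fire(x) for i, x in enumerate(inp)]

class VectorScaleAccumulate:
    """The isomorphic actors merged into one: constants and state are vectors."""
    def __init__(self, gains):
        self.gain = list(gains)
        self.total = [0] * len(gains)

    def fire(self, xs):
        # One lane per original branch; each lane keeps its own state.
        self.total = [t + g * x for t, g, x in zip(self.total, self.gain, xs)]
        return list(self.total)

def run_horizontal(inp, gains):
    n = len(gains)
    assert len(inp) % n == 0, "sketch assumes whole rounds of the splitter"
    merged = VectorScaleAccumulate(gains)
    out = []
    for i in range(0, len(inp), n):
        # With a weight-1 round-robin splitter, n consecutive items
        # map directly onto the n lanes.
        out.extend(merged.fire(inp[i:i + n]))
    return out

gains = [1, 2, 3, 4]
samples = list(range(1, 13))
```

State is not an obstacle here because each lane carries the state of exactly one original actor, which is why this transformation, unlike the single-actor one, can handle stateful actors.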

Streaming Address Generation • Area overhead less than 1% on Core i7. • Critical path: two 16-bit adds and one 64-bit add.
Scalar Buffer    Vector Buffer
20 21 22 23      14 17 20 23
16 17 18 19      13 16 19 22
12 13 14 15      12 15 18 21
 8  9 10 11       2  5  8 11
 4  5  6  7       1  4  7 10
 0  1  2  3       0  3  6  9
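
The scalar and vector buffer layouts on this slide can be related by a small index calculation. The sketch below reproduces the vector layout read bottom-up from the figure, assuming a pop rate of 3 and a SIMD width of 4 (both inferred from the numbers shown, not stated on the slide). The function names are hypothetical, and this models only the address mapping in software, not the hardware unit or its adder structure.

```python
def vector_buffer_index(i, rate=3, width=4):
    """Map scalar-buffer index i to its slot in the vector buffer.

    Each block of rate * width items is stored transposed, so that row r
    of the block holds the r-th item of `width` consecutive firings.
    """
    block = rate * width
    group, within = divmod(i, block)   # which block, and offset inside it
    lane, row = divmod(within, rate)   # which firing (lane), which pop (row)
    return group * block + row * width + lane

def to_vector_buffer(scalar, rate=3, width=4):
    """Rearrange a scalar buffer into the vector layout."""
    out = [None] * len(scalar)
    for i, x in enumerate(scalar):
        out[vector_buffer_index(i, rate, width)] = x
    return out

vb = to_vector_buffer(list(range(24)))
rows = [vb[r:r + 4] for r in range(0, 24, 4)]
```

Here `rows[0]` is `[0, 3, 6, 9]` and `rows[5]` is `[14, 17, 20, 23]`, matching the bottom and top rows of the vector buffer in the figure. With this layout a single aligned vector load fetches the same pop of four consecutive firings.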

Traditional vs. Macro SIMDization

Experimental Setup • Frontend: MIT StreamIt compiler • Backend: MacroSS • ICC 11.1 compiles the C/C++ code • Core i7 with SSE4 [Figure: Streaming Program → Frontend Compiler → Backend Compiler → C Code → Host Compiler → Intel Core i7]

Macro-SIMDization vs. Traditional

Benefits of SAGU

Conclusion • Streaming is prevalent in all computing domains. • Applying traditional SIMDization to streaming applications fails to utilize SIMD engines. • Macro-SIMDization is done at a higher level, on the stream graph. • MacroSS outperforms traditional SIMDization techniques by 54%.

Questions and Comments

Macro-SIMDization vs. Traditional

SAGU Implementation • Area overhead less than 1% on Core i7. • Critical path: two 16-bit adds and one 64-bit add. • Minor ISA modifications are needed.

SIMD + Multi-core Scheduling • How to schedule for a heterogeneous SIMD system? • SIMDization reduces memory/bus traffic. • Exploit SIMD parallelism before core-level parallelism. • Is this the best we can do?

Multicore + Macro-SIMDization
