Potential of Dynamic Binary Parallelization

Jing Yang, Kevin Skadron, Mary Lou Soffa, and Kamin Whitehouse • Department of Computer Science, University of Virginia • UCAS 7, February 26, New Orleans, Louisiana

Why Automatic Parallelization? • Bridge the gap between parallel hardware and sequential software • Manual parallelization • Typically yields the best speedups • Time-consuming • Error-prone: data races and memory consistency complexities • Difficult to understand or refactor for parallelization

Why Dynamic Binary Parallelization? • Source code is sometimes unavailable • Legacy software • Third-party software • Y2K crisis: up to 60% of source code was missing • Programs are assembled and defined at run time • Shared libraries, virtual functions, plugins, and dynamically generated code • Components written in different languages • Exploit runtime information

Trace-Based Dynamic Binary Parallelization • State of the art • Distributed superscalar design • Dynamic CFG transformation • Instruction window size vs. spurious dependencies • Combine the best of two worlds • Long traces: large instruction window • Atomic execution: no control dependencies • High speculation accuracy: low rollback overhead • High execution coverage: Amdahl’s Law

Conceptual Overview of T-DBP • Figure (not reproduced): timeline comparing sequential execution with T-DBP • Core 1: sequential execution, with repeated Predict and Dispatch steps • Cores 2-7: parallelized candidate traces • Outcome labels on dispatched traces: Success, Abort • Labels on the sequential side: Skip, Continue

Evaluation of T-DBP Prototype • Is there room for further improvement? • How does runtime information help? • Cross boundaries between application and library code! • Only respect dependencies on the actual execution path!

Limit Study Setup • SPEC CPU2000: test input • Unlimited number of cores • Perfect speculation accuracy • Always identify the most frequently repeating patterns of instructions

Limit Study Process • Record execution sequences • Analyze execution sequences → traces • Parallelize execution sequences • Model parallel execution time • Verify parallel execution sequences

Record Execution Sequences • Dynamic binary instrumentation • Basic block: execution sequence • Effective address of loads and stores: memory disambiguation • Values of loads: deterministic replay • Reduce overhead • Double buffering: time • VPC3 compression algorithm: disk space

Analyze Execution Sequences • Offline dictionary-based algorithm • How to emulate the handicap of static parallelization? • Only combine adjacent basic blocks if both belong to application code or both belong to library code!
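The boundary constraint on this slide can be illustrated with a minimal sketch. It does not reproduce the dictionary-based algorithm itself, which the slides do not describe; it only shows how a recorded sequence of basic blocks could be cut wherever execution moves between application and library code. The block names and the "app"/"lib" tags are hypothetical.

```python
from itertools import groupby

def split_at_boundaries(blocks):
    """Group adjacent basic blocks into runs that never mix origins.

    `blocks` is a list of (block_id, origin) pairs in execution order,
    where origin is "app" or "lib" (hypothetical tags for this sketch).
    """
    return [
        [block_id for block_id, _ in run]
        for _, run in groupby(blocks, key=lambda block: block[1])
    ]

# Hypothetical recorded sequence: two application blocks, a library call, then more application code.
sequence = [("B1", "app"), ("B2", "app"), ("L1", "lib"), ("L2", "lib"), ("B3", "app")]
print(split_at_boundaries(sequence))  # [['B1', 'B2'], ['L1', 'L2'], ['B3']]
```

Under this constraint, a trace candidate can only be formed inside one of the returned runs, whereas the unconstrained analysis could treat the whole sequence as one candidate.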

Parallelize Execution Sequences • Dynamic critical path scheduling algorithm • Build the dependency graph • Pick the next ready instruction with the smallest value of ALST - AEST (absolute latest start time minus absolute earliest start time) • Schedule the instruction so that it does not delay the ALST of any already scheduled instruction • Repeat until all instructions are scheduled
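As a rough illustration of the selection step, the sketch below computes AEST and ALST on a small dependency graph and picks the ready instruction with the least slack. It is a simplification under stated assumptions: every instruction takes one cycle, communication cost is ignored, and the values are computed once rather than updated as instructions are placed on cores. The function names and the example graph are hypothetical.

```python
from collections import defaultdict

def aest_alst(n, edges, latency=1):
    """Return (aest, alst) for n instructions numbered in program order.

    `edges` holds (producer, consumer) pairs with producer < consumer,
    so the numbering is already a topological order.
    """
    preds, succs = defaultdict(list), defaultdict(list)
    for u, v in edges:
        assert u < v, "edges must follow program order"
        preds[v].append(u)
        succs[u].append(v)

    # Forward pass: earliest start is after the slowest producer finishes.
    aest = [0] * n
    for v in range(n):
        aest[v] = max((aest[u] + latency for u in preds[v]), default=0)

    # Backward pass: latest start that does not lengthen the critical path.
    critical_path = max(aest) + latency
    alst = [0] * n
    for v in reversed(range(n)):
        alst[v] = min((alst[w] for w in succs[v]), default=critical_path) - latency
    return aest, alst

def pick_next(ready, aest, alst):
    """Choose the ready instruction with the smallest ALST - AEST (ties: smaller AEST)."""
    return min(ready, key=lambda v: (alst[v] - aest[v], aest[v]))

# Hypothetical graph: 0 -> 1 -> 3 and 2 -> 3.
edges = [(0, 1), (1, 3), (2, 3)]
aest, alst = aest_alst(4, edges)
print([late - early for early, late in zip(aest, alst)])  # [0, 0, 1, 0]
print(pick_next({0, 2}, aest, alst))                      # 0
```

Instructions with zero slack lie on the critical path, which is why they are scheduled first; instruction 2 can start one cycle late without lengthening the schedule.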

How to Emulate the Handicap of Static Parallelization? • Figure (not reproduced), using the instructions I1: R1 = R4, I2: R0 = R1, I3: R0 = R2, I4: R3 = R0, I5: R2 = 2 • (a) A simple CFG • (b) Parallelization on the CFG: 3 clock cycles • (c) Parallelization on the trace: 2 clock cycles

Model Parallel Execution Time • Instruction: one clock cycle • Pipelining • Inter-core synchronization: one clock cycle • Operand network • Synchronization array • Execution time of a parallelized trace • Maximum AEST of all instructions + one
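The timing model on this slide can be sketched as follows, assuming in-order issue on each core, one cycle per instruction, and one extra cycle whenever a value crosses cores. The function `parallel_time`, the instruction names, and the core assignments are hypothetical; the slides do not specify how the study's model was implemented.

```python
def parallel_time(order, deps, core_of, sync=1):
    """Estimate cycles for a parallelized trace.

    order:   instruction ids in program order (a valid topological order)
    deps:    dict mapping an id to the ids of the instructions it depends on
    core_of: dict mapping an id to the index of the core that executes it
    sync:    extra cycles when producer and consumer sit on different cores
    """
    start = {}       # start cycle of each instruction (its AEST in this model)
    core_free = {}   # first cycle at which each core can issue again
    for inst in order:
        ready = 0
        for producer in deps.get(inst, []):
            available = start[producer] + 1          # producer takes one cycle
            if core_of[producer] != core_of[inst]:
                available += sync                     # inter-core synchronization
            ready = max(ready, available)
        begin = max(ready, core_free.get(core_of[inst], 0))
        start[inst] = begin
        core_free[core_of[inst]] = begin + 1
    return max(start.values()) + 1                    # maximum AEST + one

# Hypothetical trace with two independent chains: a -> b and c -> d.
order = ["a", "b", "c", "d"]
deps = {"b": ["a"], "d": ["c"]}

print(parallel_time(order, deps, {"a": 0, "b": 0, "c": 0, "d": 0}))  # 4
print(parallel_time(order, deps, {"a": 0, "b": 0, "c": 1, "d": 1}))  # 2
print(parallel_time(order, deps, {"a": 0, "b": 1, "c": 0, "d": 1}))  # 4
```

The third assignment splits each chain across cores, so every dependency pays the synchronization cycle and the two-core schedule is no faster than the single-core one.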

Verify Parallel Execution Sequences • Link into a single executable • Basic blocks • Traces: one possibility of linearization • Load into the original address space • Replay on a real machine

Experimental Configurations • T-DBP: unconstrained • T-DBP – 1: does not cross boundaries between application and library code • T-DBP – 2: does not cross boundaries between application and library code; respects all true dependencies in the CFG

Results of Integer Benchmarks • Figure (not reproduced); values labeled on the chart: 9.19, 6.56, 4.52

Results of Floating Point Benchmarks • Figure (not reproduced); values labeled on the chart: 22.35, 17.12, 9.36

Conclusion • There is much room for further improvement • Runtime information helps a lot
