Potential of Dynamic Binary Parallelization
Jing Yang, Kevin Skadron, Mary Lou Soffa, and Kamin Whitehouse
Department of Computer Science, University of Virginia
UCAS 7, February 26, New Orleans, Louisiana
Why Automatic Parallelization?
• Bridge the gap between parallel hardware and sequential software
• Manual parallelization
  • Typically yields the best speedups
  • Time-consuming
  • Error-prone: data races and memory consistency complexities
  • Code is difficult to understand or refactor for parallelization
Why Dynamic Binary Parallelization?
• Source code is sometimes unavailable
  • Legacy software
  • Third-party software
  • Y2K crisis: up to 60% of source code was missing
• Software is assembled and defined at run time
  • Shared libraries, virtual functions, plugins, and dynamically generated code
  • Components written in different languages
• Exploit runtime information
Trace-Based Dynamic Binary Parallelization
• State of the art
  • Distributed superscalar design
  • Dynamic CFG transformation
  • Instruction window size vs. spurious dependencies
• Combine the best of two worlds
  • Long traces: large instruction window
  • Atomic execution: no control dependencies
  • High speculation accuracy: low rollback overhead
  • High execution coverage: Amdahl's Law
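The coverage bullet can be made concrete with Amdahl's Law. The sketch below is only an illustration: the function name `amdahl_speedup` and the sample numbers are hypothetical, not values from the talk.

```python
def amdahl_speedup(coverage, trace_speedup):
    """Overall speedup when a fraction `coverage` of the sequential
    execution time runs `trace_speedup` times faster and the rest
    runs at its original speed (Amdahl's Law)."""
    return 1.0 / ((1.0 - coverage) + coverage / trace_speedup)


# Hypothetical numbers: traces run 10x faster once parallelized.
half_covered = amdahl_speedup(0.5, 10.0)  # about 1.82
most_covered = amdahl_speedup(0.9, 10.0)  # about 5.26
```

With only half of the execution covered by traces, the overall speedup stays below 2 no matter how well each trace is parallelized, which is why coverage appears next to trace length and speculation accuracy.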
Conceptual Overview of T-DBP
[Figure: sequential execution compared with T-DBP execution of parallelized candidate traces, shown for Core 1 and Cores 2-7. Step labels: Predict, Dispatch, Skip, Continue; outcome labels: Success, Abort.]
Evaluation of T-DBP Prototype
• Is there room for further improvements?
• How does runtime information help?
  • Crossing boundaries between application and library code
  • Respecting only dependencies on the actual execution path
Limit Study Setup
• SPEC CPU2000: test input
• Unlimited number of cores
• Perfect speculation accuracy
• Always identify the most frequently repeating patterns of instructions
Limit Study Process
• Record execution sequences
• Analyze execution sequences (identify traces)
• Parallelize execution sequences
• Model parallel execution time
• Verify parallel execution sequences
Record Execution Sequences
• Dynamic binary instrumentation
  • Basic block: execution sequence
  • Effective address of loads and stores: memory disambiguation
  • Values of loads: deterministic replay
• Reduce overhead
  • Double buffering: time
  • VPC3 compression algorithm: disk space
Analyze Execution Sequences
• Offline dictionary-based algorithm
• How to emulate the handicap of static parallelization?
  • Only combine adjacent basic blocks if both of them belong to application code or both of them belong to library code
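The boundary rule can be sketched as a pre-pass that cuts a recorded block sequence wherever it crosses between application and library code. This is a minimal illustration, not the dictionary-based algorithm itself; the function name `split_at_boundaries`, the block IDs, and the `"app"`/`"lib"` tags are assumptions made for the example.

```python
from itertools import groupby


def split_at_boundaries(blocks):
    """blocks: list of (block_id, origin) pairs in execution order,
    where origin is "app" or "lib" (hypothetical tags).
    Returns runs of adjacent blocks that share the same origin, so no
    run crosses an application/library boundary."""
    return [
        [block_id for block_id, _ in run]
        for _, run in groupby(blocks, key=lambda block: block[1])
    ]


# Hypothetical recorded sequence of basic blocks.
sequence = [("B1", "app"), ("B2", "app"), ("B3", "lib"),
            ("B4", "lib"), ("B5", "app")]
regions = split_at_boundaries(sequence)
# regions == [["B1", "B2"], ["B3", "B4"], ["B5"]]
```

Trace formation in the constrained configuration would then search for repeating patterns inside each region only, while the unconstrained configuration would search the whole sequence.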
Parallelize Execution Sequences
• Dynamic critical path scheduling algorithm
  • Build the dependency graph
  • Pick the next ready instruction with the smallest value of ALST - AEST
  • Schedule the instruction so that it does not delay the ALST of any already-scheduled instruction
  • Repeat until all instructions are scheduled
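The priority used in the second step can be sketched as follows. The slide does not expand the acronyms; this sketch reads AEST and ALST as the earliest and latest start times of an instruction, which is an assumption. It computes only the priorities on a hypothetical five-instruction graph and leaves out core placement, so it is not the full scheduling algorithm.

```python
def start_times(deps, latency=1):
    """deps maps each instruction to the instructions it depends on,
    listed in a topological (program) order.
    Returns (aest, alst): earliest and latest start times that keep
    the overall schedule length at its minimum."""
    order = list(deps)
    aest = {}
    for i in order:
        aest[i] = max((aest[p] + latency for p in deps[i]), default=0)
    length = max(aest.values())

    # Invert the graph so each instruction knows its consumers.
    succs = {i: [] for i in order}
    for i in order:
        for p in deps[i]:
            succs[p].append(i)

    alst = {}
    for i in reversed(order):
        alst[i] = min((alst[s] - latency for s in succs[i]), default=length)
    return aest, alst


# Hypothetical graph: a -> b -> c is the critical path; d -> e has slack.
deps = {"a": [], "b": ["a"], "c": ["b"], "d": [], "e": ["d"]}
aest, alst = start_times(deps)
slack = {i: alst[i] - aest[i] for i in deps}
priority = sorted(deps, key=lambda i: slack[i])  # smallest ALST - AEST first
critical_path = [i for i in deps if slack[i] == 0]
```

Instructions with zero slack lie on the critical path, so scheduling them first keeps the less constrained instructions from delaying the trace.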
How to Emulate the Handicap of Static Parallelization?
[Figure: three panels. (a) A simple CFG containing I1: R1 = R4, I2: R0 = R1, I3: R0 = R2, I4: R3 = R0, and I5: R2 = 2. (b) Parallelization on the CFG: 3 clock cycles. (c) Parallelization on the trace: 2 clock cycles.]
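The trace-side count in the figure can be checked with a small sketch. It assumes the executed trace is I1, I2, I3, I4 (I5 lies off the path), that only true read-after-write dependencies constrain the schedule, and that each instruction takes one cycle; the helper names `raw_deps` and `cycles` are hypothetical.

```python
def raw_deps(trace):
    """trace: list of (name, dest_register, source_registers) in
    execution order. Returns name -> list of producer instructions,
    keeping only true (read-after-write) dependencies."""
    last_writer, deps = {}, {}
    for name, dest, srcs in trace:
        deps[name] = [last_writer[r] for r in srcs if r in last_writer]
        last_writer[dest] = name
    return deps


def cycles(deps):
    """Schedule length with unlimited cores and one cycle per instruction."""
    start = {}
    for i in deps:
        start[i] = max((start[p] + 1 for p in deps[i]), default=0)
    return max(start.values()) + 1


# Assumed executed path through the CFG in the figure.
trace = [
    ("I1", "R1", ["R4"]),
    ("I2", "R0", ["R1"]),
    ("I3", "R0", ["R2"]),
    ("I4", "R3", ["R0"]),
]
trace_deps = raw_deps(trace)
trace_cycles = cycles(trace_deps)
```

On this path only I1 -> I2 and I3 -> I4 remain, so the two chains run side by side in 2 cycles, matching the count shown for the trace. A CFG-based parallelizer must also honor dependencies from instructions such as I5 that did not execute, which is where the extra cycle comes from.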
Model Parallel Execution Time
• Instruction: one clock cycle
  • Pipelining
• Inter-core synchronization: one clock cycle
  • Operand network
  • Synchronization array
• Execution time of a parallelized trace
  • Maximum AEST of all instructions + one
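A minimal sketch of this timing model, assuming each core issues one instruction per cycle in the order given and a value passed between cores costs one extra cycle. The function name `trace_time`, the core assignments, and the four-instruction graph are hypothetical.

```python
def trace_time(deps, core, sync=1):
    """deps: instruction -> list of producers, in topological order.
    core: instruction -> core id.
    Each instruction takes one cycle; a dependency that crosses cores
    adds `sync` cycles. Returns the maximum start time plus one."""
    start, core_free = {}, {}
    for i in deps:
        ready = max(
            (start[p] + 1 + (sync if core[p] != core[i] else 0)
             for p in deps[i]),
            default=0,
        )
        # A core cannot start two instructions in the same cycle.
        start[i] = max(ready, core_free.get(core[i], 0))
        core_free[core[i]] = start[i] + 1
    return max(start.values()) + 1


deps = {"I1": [], "I2": ["I1"], "I3": [], "I4": ["I3"]}
one_core = trace_time(deps, {"I1": 0, "I2": 0, "I3": 0, "I4": 0})
two_cores = trace_time(deps, {"I1": 0, "I2": 0, "I3": 1, "I4": 1})
four_cores = trace_time(deps, {"I1": 0, "I2": 1, "I3": 2, "I4": 3})
```

Keeping each dependent pair on one core gives 2 cycles, while spreading all four instructions across cores pays the synchronization cycle and takes 3, so more cores do not automatically shorten a trace.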
Verify Parallel Execution Sequences
• Link into a single executable
  • Basic blocks
  • Traces: one possibility of linearization
• Load into the original address space
• Replay on a real machine
Experimental Configurations
• T-DBP: unconstrained
• T-DBP-1: does not cross boundaries between application and library code
• T-DBP-2: does not cross boundaries between application and library code; respects all true dependencies in the CFG
Results of Integer Benchmarks
[Figure: results chart; values called out on the slide: 9.19, 6.56, 4.52]
Results of Floating Point Benchmarks
[Figure: results chart; values called out on the slide: 22.35, 17.12, 9.36]
Conclusion
• There is much room for further improvement
• Runtime information helps a lot