Relational Verification to SIMD Loop Synthesis
Mark Marron – Imdea & Microsoft Research
Sumit Gulwani – Microsoft Research
Gilles Barthe, Juan M. Crespo, Cesar Kunz – Imdea

General SIMD Compilation
• Compilers struggle to utilize SIMD operations in general-purpose code
  • Text processing, web browser, compiler, etc.
  • Standard library code (C++ STL, .NET BCL) they utilize
• Challenges
  • Data structure layouts of composite types
  • Complex data-driven control flow
  • Wide-ranging code restructuring is often needed
• We know most of the needed “tricks” but:
  • Time and implementation effort too large to identify and implement all of them
  • “An Evaluation of Vectorizing Compilers” PACT '11

Example (Exists Function)
    typedef struct { int tag; int score; } widget;

    int exists(widget* vals, int len, int t, int s) {
        for(int i = 0; i < len; ++i) {
            int tagok = vals[i].tag == t;
            int scoreok = vals[i].score > s;
            int andok = tagok & scoreok;
            if(andok)
                return 1;
        }
        return 0;
    }

SIMD Example (Exists Function)
    …
    for(; i < (len - 3); i += 4) {
        // Each 128-bit block holds two widgets, so the second load starts at widget i + 2.
        m128i blck1 = load_128(vals, i);      // [t_i, s_i, t_i+1, s_i+1]
        m128i blck2 = load_128(vals, i + 2);  // [t_i+2, s_i+2, t_i+3, s_i+3]
        m128i tagvs = shuffle_i32(blck1, blck2, ORDER(0, 2, 0, 2));    // [t_i, t_i+1, t_i+2, t_i+3]
        m128i scorevs = shuffle_i32(blck1, blck2, ORDER(1, 3, 1, 3));  // [s_i, s_i+1, s_i+2, s_i+3]
        m128i cmptag = cmpeq_i32(vectv, tagvs);      // [t_i==t ? 0xF…F : 0x0, …, t_i+3==t ? 0xF…F : 0x0]
        // Operands ordered so each lane tests score > s, as in the scalar code.
        m128i cmpscore = cmpgt_i32(scorevs, vecsv);  // [s_i>s ? 0xF…F : 0x0, …, s_i+3>s ? 0xF…F : 0x0]
        m128i cmpr = and_i128(cmptag, cmpscore);     // [cmptag0 & cmpscore0, …, cmptag3 & cmpscore3]
        int match = !allzeros(cmpr);                 // (cmpr0!=0 | cmpr1!=0 | cmpr2!=0 | cmpr3!=0)
        if (match)
            return 1;
    }
    …

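The slide's helpers (load_128, shuffle_i32, and so on) are shorthand. As a rough sketch of how the same loop might look with real SSE2 intrinsics, assuming an x86 target, a widget that is exactly two 32-bit ints with no padding, and the hypothetical function names exists_scalar and exists_sse2 (none of which come from the talk):

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides _mm_shuffle_ps and _MM_SHUFFLE */

typedef struct { int tag; int score; } widget;

/* Scalar reference with the same logic as the slide's exists(). */
int exists_scalar(const widget* vals, int len, int t, int s) {
    for (int i = 0; i < len; ++i)
        if (vals[i].tag == t && vals[i].score > s) return 1;
    return 0;
}

/* Hypothetical SSE2 version: four widgets per iteration, scalar tail. */
int exists_sse2(const widget* vals, int len, int t, int s) {
    const __m128i vect = _mm_set1_epi32(t);  /* t in all four lanes */
    const __m128i vecs = _mm_set1_epi32(s);  /* s in all four lanes */
    int i = 0;
    for (; i + 3 < len; i += 4) {
        /* Two widgets (four ints) per 128-bit load. */
        __m128 b1 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i*)(vals + i)));
        __m128 b2 = _mm_castsi128_ps(_mm_loadu_si128((const __m128i*)(vals + i + 2)));
        /* Lanes 0 and 2 of each block are tags, lanes 1 and 3 are scores. */
        __m128i tagvs   = _mm_castps_si128(_mm_shuffle_ps(b1, b2, _MM_SHUFFLE(2, 0, 2, 0)));
        __m128i scorevs = _mm_castps_si128(_mm_shuffle_ps(b1, b2, _MM_SHUFFLE(3, 1, 3, 1)));
        __m128i cmptag   = _mm_cmpeq_epi32(tagvs, vect);    /* tag == t   */
        __m128i cmpscore = _mm_cmpgt_epi32(scorevs, vecs);  /* score > s  */
        __m128i cmpr     = _mm_and_si128(cmptag, cmpscore);
        if (_mm_movemask_epi8(cmpr) != 0) return 1;         /* any lane matched */
    }
    for (; i < len; ++i)  /* leftover widgets */
        if (vals[i].tag == t && vals[i].score > s) return 1;
    return 0;
}
```

The float-typed shuffle is used only to move 32-bit lanes between two registers; no floating-point arithmetic is performed on the data.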
Overview of Approach
• Deductive Rewriting of program source to:
  • Identify high-level structures of interest
  • Rewrite to expose latent parallelism (split, unroll, etc.) and straighten hot paths
• Relational Verification techniques used to:
  • Construct the needed synthesis conditions (for code involving loops!)
  • Produce proof for semantic equivalence of input and result code
• Inductive Synthesis of SIMD program fragments to:
  • Identify the best SIMD realizations of the synthesis conditions
  • Produce proofs of correctness with respect to the synthesis conditions
• Methodology more general than just SIMD Loops!

From Verification to Synthesis Condition Generation
• Relational Verification:
  • Prove two programs equivalent under equivalence relations on states
    • y = x
    • y = x1 + x2 + x3 + x4
    • y = 5
  • Only a few standard equivalence relations needed in practice
• Prove results of two programs are equivalent by showing:
  • If the programs are synchronously executed then at synchronization points the program states are always equivalent under the relations
  • For our purposes the synchronization points are the start and end of the loop body

Relational Verification
    // "right" program: four partial sums
    int sumr = 0;
    int as0 = 0, as1 = 0, as2 = 0, as3 = 0;
    for(int i = 0; i < len; i+=4) {
        as0 = as0 + A[i];
        as1 = as1 + A[i+1];
        as2 = as2 + A[i+2];
        as3 = as3 + A[i+3];
    }
    sumr = as0 + as1 + as2 + as3;

    // "left" program: one running sum
    int suml = 0;
    for(int i = 0; i < len; i+=4) {
        suml = suml + A[i];
        suml = suml + A[i+1];
        suml = suml + A[i+2];
        suml = suml + A[i+3];
    }
Full Loop Invariant:
Relational Invariant:

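The invariant formulas did not survive on this slide. As an illustration only, the sketch below assumes the relational invariant is suml == as0 + as1 + as2 + as3 (the shape of the "y = x1 + x2 + x3 + x4" relation listed earlier) and checks it at the start and end of every loop body while running both programs in lockstep. The function name is hypothetical, and a runtime check on one input is of course not a proof; it only shows what the synchronization points look like.

```c
/* Runs the "left" and "right" sum loops in lockstep.
   Returns 1 if the ASSUMED relational invariant
       suml == as0 + as1 + as2 + as3
   holds at every synchronization point and the final results agree.
   Assumes len is a multiple of 4 and that no sum overflows int. */
int relational_invariant_holds(const int* A, int len) {
    int suml = 0;                             /* left program state  */
    int as0 = 0, as1 = 0, as2 = 0, as3 = 0;   /* right program state */

    for (int i = 0; i < len; i += 4) {
        /* Synchronization point: start of the loop body. */
        if (suml != as0 + as1 + as2 + as3) return 0;

        /* One iteration of the left program. */
        suml = suml + A[i];
        suml = suml + A[i + 1];
        suml = suml + A[i + 2];
        suml = suml + A[i + 3];

        /* One iteration of the right program. */
        as0 = as0 + A[i];
        as1 = as1 + A[i + 1];
        as2 = as2 + A[i + 2];
        as3 = as3 + A[i + 3];

        /* Synchronization point: end of the loop body. */
        if (suml != as0 + as1 + as2 + as3) return 0;
    }

    int sumr = as0 + as1 + as2 + as3;
    return suml == sumr;
}
```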
From Verification to Condition Generation
• We use “Product Programs” approach
  • “Relational verification using product programs” FM '11
  • Rename variables in “left” and “right” programs disjointly
  • Interleave the programs “appropriately”
  • Generates verification conditions on the combined program
• Key Idea:
  • Replace code in “right” program with an uninterpreted function
  • Perform Product program construction and VC generation
  • Resulting VCs for the uninterpreted function are the needed synthesis pre/post conditions

Relational Synthesis Condition
    // "left" program
    int suml = 0;
    for(int i = 0; i < len; i+=4) {
        suml = suml + A[i];
        suml = suml + A[i+1];
        suml = suml + A[i+2];
        suml = suml + A[i+3];
    }

    // "right" program: the empty loop body is the code to be synthesized
    int sumr = 0;
    m128i ac = [0, 0, 0, 0];
    for(int i = 0; i < len; i+=4) {
    }
    sumr = ac.0 + ac.1 + ac.2 + ac.3;
Relational Invariant:

Resulting Synthesis Condition
• Pre-condition:
  • ac == [v1, v2, v3, v4]
• Post-condition:
  • ac == [v1 + A[i], v2 + A[i+1], v3 + A[i+2], v4 + A[i+3]]

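To make the pre/post condition concrete, here is a small sketch of one candidate loop body that would satisfy it, checked against a single concrete pre-state. It assumes SSE2 on x86; candidate_body and satisfies_postcondition are hypothetical names, and the talk does not say this is the exact sequence the synthesizer emits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Candidate for the loop-body hole: one unaligned load and one packed add. */
static __m128i candidate_body(__m128i ac, const int* A, int i) {
    __m128i block = _mm_loadu_si128((const __m128i*)(A + i));  /* [A[i], ..., A[i+3]] */
    return _mm_add_epi32(ac, block);
}

/* Checks the post-condition for ONE concrete pre-state ac == [v[0..3]]:
   afterwards lane k must equal v[k] + A[i + k].
   Returns 1 if it holds, 0 otherwise. Assumes the sums do not overflow. */
int satisfies_postcondition(const int v[4], const int* A, int i) {
    __m128i ac = _mm_loadu_si128((const __m128i*)v);  /* establish the pre-condition */
    int out[4];
    _mm_storeu_si128((__m128i*)out, candidate_body(ac, A, i));
    for (int k = 0; k < 4; ++k)
        if (out[k] != v[k] + A[i + k]) return 0;
    return 1;
}
```

A check like this only covers the inputs it is run on; in the approach described here, the proof with respect to the synthesis condition is what establishes correctness for all inputs.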
Instruction Sequence Search
• Search space for SIMD instruction sequences is large
  • Length: frequently need 8 or more instructions
  • Branching: SSE has 200+ instructions
• Concrete state space exploration
  • Explore program states instead of instruction sequences
  • Use concrete execution to quickly exclude many candidate instruction sequences
  • Query SMT solver for a counterexample input
  • Eventually either no counterexamples remain or we give up
• Search for alternative sequences
  • Can generate multiple solutions to find best performance on varying data sizes

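The filtering role of concrete execution can be illustrated with a toy search. The sketch below is not the talk's algorithm: it brute-forces instruction sequences over a made-up four-operation instruction set on a 4-lane vector, and it omits both the state-based exploration and the SMT counterexample query. All names (vec4, apply_op, passes, search) are hypothetical. It only shows how running candidates on concrete input/output examples discards most of them cheaply.

```c
#include <string.h>

typedef struct { int lane[4]; } vec4;

/* Toy instruction set acting on an accumulator `acc` and an input `x`. */
enum { OP_ADD_X, OP_SUB_X, OP_ROT1, OP_NEG, NUM_OPS };

static vec4 apply_op(int op, vec4 acc, vec4 x) {
    vec4 r = acc;
    for (int k = 0; k < 4; ++k) {
        switch (op) {
        case OP_ADD_X: r.lane[k] = acc.lane[k] + x.lane[k]; break;
        case OP_SUB_X: r.lane[k] = acc.lane[k] - x.lane[k]; break;
        case OP_ROT1:  r.lane[k] = acc.lane[(k + 1) % 4];   break;
        case OP_NEG:   r.lane[k] = -acc.lane[k];            break;
        }
    }
    return r;
}

/* One concrete input/output pair: starting state, input, required result. */
typedef struct { vec4 acc, x, want; } example;

/* Concrete execution: does the sequence map every example to its output? */
static int passes(const int* seq, int n, const example* ex, int num_ex) {
    for (int e = 0; e < num_ex; ++e) {
        vec4 acc = ex[e].acc;
        for (int j = 0; j < n; ++j) acc = apply_op(seq[j], acc, ex[e].x);
        if (memcmp(&acc, &ex[e].want, sizeof acc) != 0) return 0;  /* excluded */
    }
    return 1;
}

/* Tries all sequences of length 0..max_len, shortest first.
   Writes the first survivor to out_seq and returns its length, or -1. */
int search(const example* ex, int num_ex, int max_len, int* out_seq) {
    for (int n = 0; n <= max_len; ++n) {
        long total = 1;
        for (int j = 0; j < n; ++j) total *= NUM_OPS;
        for (long code = 0; code < total; ++code) {
            long c = code;
            for (int j = 0; j < n; ++j) {  /* decode `code` as base-NUM_OPS digits */
                out_seq[j] = (int)(c % NUM_OPS);
                c /= NUM_OPS;
            }
            if (passes(out_seq, n, ex, num_ex)) return n;
        }
    }
    return -1;
}
```

A sequence that survives the examples is only a candidate; in the approach on the slide, a solver would then be asked for a counterexample input, and any counterexample found would become a new example for the next round.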
Optimize Search
• Cost model provides upper bound on depth of search
  • Also used to pick best operation to explore next and to pick shortest path from input to output state
• Incrementally expand available instruction set
  • Start with standard operations (and those seen in input code)
  • Add more specialized operations if desired
• Generate multiple initial input-output pairs
  • One per path in original loop body
• Stack machine construction to reduce the branching factor

Cost Model
• Do not want to compute absolute costs
  • A very hard problem
• Compute relative costs
  • Both programs run on the same data, so they see the same cache misses and the same branches taken/not taken
  • Build simple machine model to encapsulate instruction costs
  • Cost function is a polynomial in terms of loop counts and branch rates
• Use conservative static estimates for synthesis
• Can use runtime data for selection in JIT setting

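As a minimal sketch of what a relative cost function of this shape could look like, the code below models a loop's cost as a polynomial in the element count n and a branch-taken rate p, then compares a scalar and a SIMD variant. The struct, the function names, and every numeric cost used with it are hypothetical placeholders, not values from the talk.

```c
/* Per-loop cost parameters in arbitrary units (hypothetical machine model). */
typedef struct {
    double setup;        /* one-time cost outside the loop            */
    double body;         /* cost of one execution of the loop body    */
    double taken_extra;  /* extra cost when the in-loop branch is taken */
    double width;        /* elements consumed per iteration           */
} loop_cost;

/* Cost as a polynomial in the element count n and branch-taken rate p. */
double estimated_cost(loop_cost c, double n, double p) {
    double iterations = n / c.width;
    return c.setup + iterations * (c.body + p * c.taken_extra);
}

/* Relative comparison: only the ordering of the two estimates matters. */
int simd_is_cheaper(loop_cost scalar, loop_cost simd, double n, double p) {
    return estimated_cost(simd, n, p) < estimated_cost(scalar, n, p);
}
```

With, say, a scalar loop of {setup 0, body 4, width 1} and a SIMD loop of {setup 6, body 7, width 4}, the model prefers the scalar loop for n = 2 and the SIMD loop for n = 100, which is the kind of input-size-dependent selection the slides describe.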
Complete Algorithm
[Flow diagram; only its labels survive.]
• Stages: Restructure Loop, Optimistic Vectorize, Synth. Cond. Generation, Synthesize Body, Merge & Cleanup
• Programs: Input Program (… for(i I by c) { } …), Restructured Program (… for(i I by 4c) { } …), Final SIMD Program (… for(i I by 4c) { } …)
• Intermediate artifacts: Simulation Relation (Eq), Synthesis Cond., Correctness Proof
• Cost components: CPU Model, Cost Ranking Function, Cost Score

SIMD Standard Library
• Synthesize SIMD implementations of C++ STL and .NET BCL code
• Consistent performance improvements
  • Between 2x and 4x on large inputs
  • Avoid performance degradation on small inputs
• Cost model accurately predicts performance
  • Can pick best implementation based on hardware and input data

String Processing
• Synthesize standard string functions using PCMPESTRI
  • Packed Compare Explicit Length Strings, Return Index
  • Encoded semantics and provided them to synthesizer
• Synthesized range of common string functions with no other changes
  • Speedup of 3.4x for String.Equals
  • Speedup of up to 9.5x for String.IndexOfAny

Impact In Practice
483.Xalan (SPEC CPU)
• XML processing framework written in C++
• Replaced STL calls with our SIMD implementations
• Performance sensitive to input data
  • Previous work replacing these calls with set structures was +15% to -20% on different data
• Synthesized SIMD code produces consistent 2%-5% speedup
  • Indicates a 1.15x to 1.5x speedup in the STL code, which is in line with cost model predictions

Benefits of Approach
• Proof of correctness from original loop and SIMD version
• Separation of correctness and optimization
  • Transform for performant code structure
  • If a transformation is incorrect, the proof (or synthesis) will fail later
• Approach consistently produces fast SIMD code
  • Robust to details of SIMD instruction set and loop patterns
  • 2x-4x speedups obtained from synthesized SIMD code

Future Work
• Pointers and object structures
  • Scatter-Gather support will help
  • Compact object graphs into arrays (current work)
  • Can we do local data structure transformations?
• Apply technique to larger structures and more generally
  • What about loops with small inner-loops (HashTable lookup)?
  • Can we use synthesis as part of general code-gen?

Big Picture Conclusions
• Big challenges and big benefits using specialized hardware
  • Both performance and power!
• Synthesis complements compilation
  • Small step vs. big step code generation
  • Verification structures synthesis (and eliminates compilation bugs)
  • Can we apply ideas to other compiler actions? Target other hardware?
• Idea more general than just compilers or SIMD synthesis
  • Expert-provided deductive structure
  • Inductive synthesis driven by underlying semantics
  • A powerful combination for approaching problems