Commutativity Analysis for Software Parallelization: Letting Program Transformations See the Big Picture Farhana Aleen, Nate Clark Georgia Institute of Technology Modified by Michelle Goodstein LBA Reading Group 6/4/09
Motivation

Extracting performance from multi-core is hard: programmers must write parallel programs. Automatic compiler-based parallelization helps.
Source Of Parallelism: Commutativity

Two calls commute when either execution order produces the same result: sum(5); sum(10) = 15, and sum(10); sum(5) = 15. At the application level, foo(a); foo(b) and foo(b); foo(a) produce the same output.
Existing Approach Of Detecting Commutativity

Execute the function in two different orders, then check equivalence of memory: sum(x); sum(y) leaves x+y, and sum(y); sum(x) leaves y+x. The memory states are identical, so the calls commute.
Opportunities Missed By Existing Approach

Insertion of elements into a hash set (vector<linked_list>): insert(2); insert(6) leaves the bucket's list as [6, 2], while insert(6); insert(2) leaves it as [2, 6]. The memory layouts differ, so the existing approach rejects the pair, even though the set contents are the same.
The Idea

class hash_set {
    vector<linked_list> set;
    insert();
    remove();
    is_member();
};

After insert(2); insert(6) or insert(6); insert(2), the bucket layouts differ, yet remove(6) followed by is_member(2) answers Yes! in both cases.
• Identical memory does not matter
• Final output matters
Our Approach: Step 1

Symbolically execute the candidate function in two different orders from memory state M: one order yields M1 and then M1,2; the other yields M2 and then M2,1. Check whether the memory layouts M1,2 and M2,1 are identical. If they are not similar, check the reader functions.
Step 2: Checking Reader Functions

Apply each reader of the candidate function's output to both memory states. For candidate insert(), the readers are is_member() and remove(): check is_member(M1,2) against is_member(M2,1), and remove(M1,2) = M'1,2 against remove(M2,1) = M'2,1. Then repeat for the readers of the readers' output, and so on.
Pros/Cons Of Our Approach

Pros:
• Identifies more commutativity
• Finds more parallelism
Cons:
• More equivalence checking
Equivalence Checking Options

                        Speed    Accuracy
Random Testing            yes        no
Symbolic Execution        no         yes
Random Interpretation     yes        yes
Random Interpretation: Example

Program: input(x, y); a = x + y; if (x != y) ... b = 2x (taken) ... b = a (fall-through); assert(b = 2x)
• Choose random values for the input variables: x = 2, y = 3, so a = x + y = 5
• Execute the taken branch of the condition: b = 2x = 4
• Execute the fall-through branch on a replicated copy of the initial memory state, adjusting values so the condition is false: x = 3, y = 3, a = 6, b = a = 6
• Combine the two states with an affine join of v1 and v2 w.r.t. a random weight w = 3: w(v1, v2) = w·v1 + (1 − w)·v2
• Joined state: x = 5, y = 3, a = 8, b = 10, and assert(b = 2x) holds (10 = 2·5)
Random Interpretation In Equivalence Checking

Start both orders from the same initial memory: run foo(x); foo(y) and foo(y); foo(x), then compare the two modified memory states.
Why Random Interpretation Works

Avoids the scalability problem: the affine join superposes all execution paths, and linear relationships are the same before and after the join. The error probability is very low, at most (# joins)/2^64 per run, and repeating runs decreases the error probability exponentially.
(Added Slide) Probability details
• Low error probability:
• In general, at most 1 bad random value per join in the program
• Prob(error) ≤ (# joins)/2^64
• Empirically (prior work): the number of joins increases linearly with the number of program statements
• Coefficient of 0.5 to 5.2
• Assume a 1000-statement commutative function:
• Prob(error) ≤ (5.2 × 1000)/2^64 ≈ 2.8 × 10^-16
• To decrease the error probability further, increase the number of runs
Experimental Methodology
• Trimaran compiler
• Scheduled the transformed programs
• Infinite-issue machine
• Perfect memory system
• Pointer analysis: stack- and heap-sensitive
• Tested on SPECint2000 and MediaBench
(Added) Experimental Methodology
• In some ways, an "upper bound" on commutativity
• Can issue as many instructions as are commutative
• Memory is perfect
• Not a true upper bound, though: random interpretation will sometimes fail or give up
(Added) Suggested Parallelism
• Suppose a sorting algorithm prints to stderr when a debug flag is set
• It cannot be parallelized automatically, because of dependences between the writes
• A human can differentiate: the compiler identifies functions that are almost parallel
• If the human states that the semantic changes (e.g., printf ordering) do not matter, parallelize
• Otherwise, ignore
Summary
• Commutativity is a significant source of parallelism
• Identical memory does not matter for identifying commutative functions
• Our technique detected 13% more commutative functions and uncovered 28% more parallelism