Compiler Optimization of Scalar and Memory-Resident Values Between Speculative Threads. Antonia Zhai et al.
Improving Performance of a Single Application

[Figure: four processor cores (P) with caches (C) on a single chip; threads T1–T4 run on them, with a dependence from a Store to 0x86 in one thread to a Load from 0x86 in a later thread, all within the chip boundary.]

Finding parallel threads is difficult.
Compiler Makes Conservative Decisions

for (i = 0; i < N; i++) {
  …
  *q
  *p
  …
}

[Figure: four iterations of the loop access addresses such as 0x80/0x86, 0x60/0x66, 0x90/0x96, and 0x50/0x56 through *p and *q.]

Pros
• Examines the entire program
Cons
• Makes conservative decisions for ambiguous dependences, caused by:
  • Complex control flow
  • Pointer and indirect references
  • Runtime input
Using Hardware to Exploit Parallelism

[Figure: an instruction window containing, e.g., Load 0x88, Store 0x66, Add a1, a2.]

Pros
• Disambiguates dependences accurately at runtime
• Searches for independent instructions within a window
Cons
• Exploits parallelism only among the small number of instructions in the window

How can we exploit parallelism with both hardware and the compiler?
Thread-Level Speculation (TLS)
• Compiler: creates parallel threads
• Hardware: disambiguates dependences

[Figure: a sequential execution of Store *p followed by Load *q becomes a speculatively parallel execution; when both happen to access 0x86, the hardware detects the dependence between Store 0x86 and Load 0x86.]

My goal: speed up programs with the help of TLS.
Single Chip Multiprocessor (CMP)
• Replicated processor cores
  • Reduce design cost
• Scalable and decentralized design
  • Localize wires
  • Eliminate centralized structures
• Infrastructure transparency
  • Handles legacy code

[Figure: processors (P) with private caches (C) connected through an interconnect to memory, all within the chip boundary.]

Our support and optimization for TLS:
• Built upon a CMP
• Applicable to other architectures that support multiple threads
Supporting Thread-Level Speculation
• Recover from failed speculation
  • Buffer speculative writes from memory
• Track data dependences
  • Detect data dependence violations

[Figure: speculatively parallel execution of Store *p and Load *q with a potential dependence between them.]
Buffering Speculative Writes from Memory

[Figure: four processors with caches on a chip, connected through an interconnect to memory; each cache holds data contents plus speculative state, and a directory maintains cache coherence.]

We extend the cache coherence protocol to support TLS.
Detecting Dependence Violations
• The producer forwards all store addresses
• The consumer detects data dependence violations

[Figure: producer and consumer processors exchange store addresses over the interconnect; the violation is detected at the consumer.]

We extend the cache coherence protocol [ISCA'00].
Synchronizing Scalars

[Figure: producer and consumer both read (… = a) and write (a = …) scalar a over time, creating a cross-thread dependence.]

Identifying the scalars that cause dependences is straightforward.
Synchronizing Scalars

[Figure: the consumer executes wait(a) before its first use of a and uses the forwarded value; the producer executes signal(a) after its last definition of a.]

Dependent scalars should be synchronized.
Reducing the Critical Forwarding Path

[Figure: the critical path runs from wait(a) through the definition a = … to signal(a); moving the signal earlier turns a long critical path into a short one.]

Instruction scheduling can reduce the critical forwarding path.
Compiler Infrastructure
• Loops were targeted. Loops were discarded if they had:
  • Low coverage (<0.1% of execution time)
  • Fewer than 30 instructions per iteration
  • More than 16384 instructions per iteration
• The remaining loops were profiled and simulated to measure an optimistic upper bound on speedup; good loops were chosen based on the results.
• The compiler inserts instructions to create and manage epochs.
• The compiler allocates forwarded variables on the stack (the forwarding frame).
• The compiler inserts wait and signal instructions.
Synchronization
• Constraints
  • Wait before the first use
  • Signal after the last definition
  • Signal on every possible path
• Wait should be as late as possible
• Signal should be as early as possible
Dataflow Analysis for Synchronization
• The set of forwarded scalars is the intersection of the scalars with a downward-exposed definition and those with an upward-exposed use; scalars that are live outside the loop are also included.
• The CFG is modeled as a graph with basic blocks as nodes.
Instruction Scheduling
• Conservative scheduling
• Aggressive scheduling
Both must respect:
• Control dependences
• Data dependences
Instruction Scheduling

[Figure, three stages for the code wait(a); … = a; a = …; signal(a):
1. Initial synchronization: wait(a) precedes the use of a, and signal(a) follows the definition.
2. Scheduling instructions: the definition a = … and signal(a) are moved up past independent instructions, shortening the critical forwarding path.
3. Speculatively scheduling instructions: a = … and signal(a) are moved even above a potentially conflicting store *q = …; the hardware detects whether the speculation was safe.]
Instruction Scheduling via Dataflow Analysis
Dataflow analysis handles complex control flow. We define two dataflow analyses:
• The "stack" analysis finds the instructions needed to compute the forwarded value.
• The "earliest" analysis finds the earliest node at which the forwarded value can be computed.
Computation Stack
A computation stack stores the instructions needed to compute a forwarded value (e.g., a = a*11; signal a). A stack is associated with every node for every forwarded scalar, and each node is in one of three states: we know how to compute the forwarded value, we do not know how to compute it, or the node has not yet been evaluated.
A Simplified Example from GCC

do {
  …
} while (p);

[Figure, CFG of the loop body: after start, counter = 0 and wait(p); if p->jmp, then p = p->jmp; then q = p; if p->real, then counter++ and q = q->next is repeated while q is non-null; p = p->next occurs on two separate paths before the end of the iteration.]
Stack Analysis

[Figure sequence: the stack analysis walks the CFG of the GCC example backward from the end of the iteration. The stack for p starts at the node p = p->next, giving the stack (p = p->next; signal p), and is propagated backward through the CFG. At merge points the stacks arriving from the successors must agree; where they differ, the affected predecessors are marked as nodes to revisit. Along the p->jmp path the stack grows to (p = p->jmp; p = p->next; signal p). Nodes are revisited until no stack changes; at that point the solution is consistent.]
The Earliest Analysis

[Figure sequence: each node is marked earliest, not earliest, or not yet evaluated. Using the stacks computed above, the analysis propagates through the CFG until it identifies the earliest nodes at which the forwarded value of p can be computed: on the p->jmp path, immediately after p = p->jmp; on the fall-through path, immediately after the p->jmp test.]
Code Transformation

[Figure: the transformed CFG. On the p->jmp path the compiler inserts p2 = p->jmp; p1 = p2->next; Signal(p1); on the fall-through path it inserts p1 = p->next; Signal(p1). The rest of the iteration (q = p, the p->real test, the q walk with counter++, and the original p = p->next updates) executes after the signal.]
Aggressive Scheduling
• Optimize the common case by moving the signal up.
• On the wrong path, send violate_epoch to the next thread and then forward the correct value.
• If instructions are scheduled past branches, then when an exception occurs a violation must be sent and the non-speculative code executed.
• Changes to the analyses:
  • Modify the meet operator to compute stacks only for frequently executed paths.
  • Add a new node for each infrequent path and insert violate_epoch signals there; earliest is set to true for these nodes.
Aggressive Scheduling
• Two new instructions:
  • mark_load tells the hardware to remember the address of a location; it is placed where a load has been moved above a store.
  • unmark_load clears the mark; it is placed at the original position of the load instruction.
• In the meet operation, conflicting load marks are merged using logical OR.