Compiler Optimization of Scalar and Memory-Resident Values Between Speculative Threads. Antonia Zhai et al.
Improving Performance of a Single Application

[Figure: four processor cores (P) with caches (C) on a single chip; threads T1–T4 run on them, with a dependence from a Store to 0x86 in one thread to a Load from 0x86 in a later thread, all within the chip boundary.]

Finding parallel threads is difficult.
Compiler Makes Conservative Decisions

for (i = 0; i < N; i++) {
  …
  *q
  *p
  …
}

[Figure: four iterations of the loop access addresses such as 0x80/0x86, 0x60/0x66, 0x90/0x96, and 0x50/0x56 through *p and *q.]

Pros
• Examines the entire program
Cons
• Makes conservative decisions for ambiguous dependences, caused by:
  • Complex control flow
  • Pointer and indirect references
  • Runtime input
Using Hardware to Exploit Parallelism

[Figure: an instruction window containing, e.g., Load 0x88, Store 0x66, Add a1, a2.]

Pros
• Disambiguates dependences accurately at runtime
• Searches for independent instructions within a window
Cons
• Exploits parallelism only among the small number of instructions in the window

How can we exploit parallelism with both hardware and the compiler?
Thread-Level Speculation (TLS)
• Compiler: creates parallel threads
• Hardware: disambiguates dependences

[Figure: a sequential execution of Store *p followed by Load *q becomes a speculatively parallel execution; when both happen to access 0x86, the hardware detects the dependence between Store 0x86 and Load 0x86.]

My goal: speed up programs with the help of TLS.
Single Chip Multiprocessor (CMP)
• Replicated processor cores
  • Reduce design cost
• Scalable and decentralized design
  • Localize wires
  • Eliminate centralized structures
• Infrastructure transparency
  • Handles legacy code

[Figure: processors (P) with private caches (C) connected through an interconnect to memory, all within the chip boundary.]

Our support and optimization for TLS:
• Built upon a CMP
• Applicable to other architectures that support multiple threads
Supporting Thread-Level Speculation
• Recover from failed speculation
  • Buffer speculative writes from memory
• Track data dependences
  • Detect data dependence violations

[Figure: speculatively parallel execution of Store *p and Load *q with a potential dependence between them.]
Buffering Speculative Writes from Memory

[Figure: four processors with caches on a chip, connected through an interconnect to memory; each cache holds data contents plus speculative state, and a directory maintains cache coherence.]

We extend the cache coherence protocol to support TLS.
Detecting Dependence Violations
• The producer forwards all store addresses
• The consumer detects data dependence violations

[Figure: producer and consumer processors exchange store addresses over the interconnect; the violation is detected at the consumer.]

We extend the cache coherence protocol [ISCA'00].
Synchronizing Scalars

[Figure: producer and consumer both read (… = a) and write (a = …) scalar a over time, creating a cross-thread dependence.]

Identifying the scalars that cause dependences is straightforward.
Synchronizing Scalars

[Figure: the consumer executes wait(a) before its first use of a and uses the forwarded value; the producer executes signal(a) after its last definition of a.]

Dependent scalars should be synchronized.
Reducing the Critical Forwarding Path

[Figure: the critical path runs from wait(a) through the definition a = … to signal(a); moving the signal earlier turns a long critical path into a short one.]

Instruction scheduling can reduce the critical forwarding path.
Compiler Infrastructure
• Loops were targeted. Loops were discarded if they had:
  • Low coverage (<0.1% of execution time)
  • Fewer than 30 instructions per iteration
  • More than 16384 instructions per iteration
• The remaining loops were profiled and simulated to measure an optimistic upper bound on speedup; good loops were chosen based on the results.
• The compiler inserts instructions to create and manage epochs.
• The compiler allocates forwarded variables on the stack (the forwarding frame).
• The compiler inserts wait and signal instructions.
Synchronization
• Constraints
  • Wait before the first use
  • Signal after the last definition
  • Signal on every possible path
• Wait should be as late as possible
• Signal should be as early as possible
Dataflow Analysis for Synchronization
• The set of forwarded scalars is the intersection of the scalars with a downward-exposed definition and those with an upward-exposed use; scalars that are live outside the loop are also included.
• The CFG is modeled as a graph with basic blocks as nodes.
Instruction Scheduling
• Conservative scheduling
• Aggressive scheduling
Both must respect:
• Control dependences
• Data dependences
Instruction Scheduling

[Figure, three stages for the code wait(a); … = a; a = …; signal(a):
1. Initial synchronization: wait(a) precedes the use of a, and signal(a) follows the definition.
2. Scheduling instructions: the definition a = … and signal(a) are moved up past independent instructions, shortening the critical forwarding path.
3. Speculatively scheduling instructions: a = … and signal(a) are moved even above a potentially conflicting store *q = …; the hardware detects whether the speculation was safe.]
Instruction Scheduling via Dataflow Analysis
Dataflow analysis handles complex control flow. We define two dataflow analyses:
• The "stack" analysis finds the instructions needed to compute the forwarded value.
• The "earliest" analysis finds the earliest node at which the forwarded value can be computed.
Computation Stack
A computation stack stores the instructions needed to compute a forwarded value (e.g., a = a*11; signal a). A stack is associated with every node for every forwarded scalar, and each node is in one of three states: we know how to compute the forwarded value, we do not know how to compute it, or the node has not yet been evaluated.
A Simplified Example from GCC

do {
  …
} while (p);

[Figure, CFG of the loop body: after start, counter = 0 and wait(p); if p->jmp, then p = p->jmp; then q = p; if p->real, then counter++ and q = q->next is repeated while q is non-null; p = p->next occurs on two separate paths before the end of the iteration.]
Stack Analysis

[Figure sequence: the stack analysis walks the CFG of the GCC example backward from the end of the iteration. The stack for p starts at the node p = p->next, giving the stack (p = p->next; signal p), and is propagated backward through the CFG. At merge points the stacks arriving from the successors must agree; where they differ, the affected predecessors are marked as nodes to revisit. Along the p->jmp path the stack grows to (p = p->jmp; p = p->next; signal p). Nodes are revisited until no stack changes; at that point the solution is consistent.]
The Earliest Analysis

[Figure sequence: each node is marked earliest, not earliest, or not yet evaluated. Using the stacks computed above, the analysis propagates through the CFG until it identifies the earliest nodes at which the forwarded value of p can be computed: on the p->jmp path, immediately after p = p->jmp; on the fall-through path, immediately after the p->jmp test.]
Code Transformation

[Figure: the transformed CFG. On the p->jmp path the compiler inserts p2 = p->jmp; p1 = p2->next; Signal(p1); on the fall-through path it inserts p1 = p->next; Signal(p1). The rest of the iteration (q = p, the p->real test, the q walk with counter++, and the original p = p->next updates) executes after the signal.]
Aggressive Scheduling
• Optimize the common case by moving the signal up.
• On the wrong path, send violate_epoch to the next thread and then forward the correct value.
• If instructions are scheduled past branches, then when an exception occurs a violation must be sent and the non-speculative code executed.
• Changes to the analyses:
  • Modify the meet operator to compute stacks only for frequently executed paths.
  • Add a new node for each infrequent path and insert violate_epoch signals there; earliest is set to true for these nodes.
Aggressive Scheduling
• Two new instructions:
  • mark_load tells the hardware to remember the address of a location; it is placed where a load has been moved above a store.
  • unmark_load clears the mark; it is placed at the original position of the load instruction.
• In the meet operation, conflicting load marks are merged using logical OR.