
Compiler Optimization of Scalar and Memory-Resident Values Between Speculative Threads






Presentation Transcript


  1. Compiler Optimization of Scalar and Memory-Resident Values Between Speculative Threads. Antonia Zhai et al.

  2. Improving Performance of a Single Application. [Figure: a chip containing four processors (P), each with a cache (C); threads T1–T4 run in parallel, with a cross-thread dependence from a Store to address 0x86 in one thread to a Load from 0x86 in a later thread.] Finding parallel threads is difficult.

  3. Compiler Makes Conservative Decisions. for (i = 0; i < N; i++) { … = *q; *p = …; } [Figure: iterations #1–#4 access distinct address pairs (0x80/0x86, 0x60/0x66, 0x90/0x96, 0x50/0x56), so *p and *q never actually conflict at runtime.] Pros: • Examines the entire program. Cons: • Makes conservative decisions for ambiguous dependences • Complex control flow • Pointer and indirect references • Runtime input.
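The conservative-decision problem above can be sketched in C (the function and array names are illustrative, not from the talk): the compiler cannot prove that the store through p and the load through q never touch the same location, so it must assume a loop-carried dependence and keep the iterations sequential, even when, as in the figure, the addresses never actually collide.

```c
#include <stddef.h>

/* Hypothetical kernel: the compiler cannot prove that data[p[i]] and
 * data[q[j]] never overlap, so it must assume a loop-carried
 * dependence between the store and later loads, serializing the
 * iterations -- even if, at runtime, the two index streams are
 * disjoint. */
long accumulate_indirect(long *data, const size_t *p,
                         const size_t *q, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum += data[q[i]];   /* load  *q : may alias the store below */
        data[p[i]] = sum;    /* store *p : ambiguous dependence      */
    }
    return sum;
}
```

With disjoint index streams (as in the slide's figure), every iteration is in fact independent of the previous iteration's store; only a runtime mechanism can exploit that safely.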

  4. Using Hardware to Exploit Parallelism. [Figure: an instruction window containing Load 0x88, Store 0x66, Add a1, a2.] Pros: • Disambiguates dependences accurately at runtime • Searches for independent instructions within a window. Cons: • Exploits parallelism only among a small number of instructions. How can we exploit parallelism with both hardware and the compiler?

  5. Thread-Level Speculation (TLS): the compiler creates parallel threads; the hardware disambiguates dependences. [Figure: a sequential execution with Load *q following Store *p becomes speculatively parallel threads; a runtime conflict between Store 0x86 in one thread and Load 0x86 in another is detected as a dependence violation.] My goal: speed up programs with the help of TLS.

  6. Single-Chip Multiprocessor (CMP). [Figure: processors P with caches C connected by an interconnect to memory, all within the chip boundary.] • Replicated processor cores: reduce design cost; scalable and decentralized design. • Localized wires: eliminate centralized structures. • Infrastructure transparency: handles legacy code. Our support and optimization for TLS is built upon a CMP, but can be applied to other architectures that support multiple threads.

  7. Thread-Level Speculation (TLS). [Figure: speculatively parallel threads with a dependence from Store *p in one thread to Load *q in a later thread.] To support thread-level speculation, the hardware must: • Recover from failed speculation, by buffering speculative writes from the memory. • Track data dependences, by detecting data dependence violations.

  8. Buffering Speculative Writes from the Memory. [Figure: CMP with per-processor caches holding both data contents and speculative state; a directory maintains cache coherence over the interconnect.] The cache coherence protocol is extended to support TLS.

  9. Detecting Dependence Violations. [Figure: the producer forwards all store addresses over the interconnect; the consumer detects a data dependence violation when a forwarded store address matches one of its earlier speculative loads.] This again extends the cache coherence protocol [ISCA'00].

  10. Synchronizing Scalars. [Figure: a producer thread defines scalar a (a = …) while a later consumer thread uses it (… = a), creating a cross-thread dependence.] Identifying the scalars that cause dependences is straightforward.

  11. Synchronizing Scalars. [Figure: the consumer executes wait(a) before its first use of a and then uses the forwarded value; the producer executes signal(a) after its last definition.] Dependent scalars should be synchronized.

  12. Reducing the Critical Forwarding Path. [Figure: the critical path runs from wait(a) through a = … to signal(a); scheduling the definition and its signal earlier in the thread shortens this path.] Instruction scheduling can reduce the critical forwarding path.
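As a sequential sketch of this code motion (names invented; the real transformation also inserts wait/signal instructions, which are represented here only by comments), compare a loop that defines the forwarded scalar a at the end of its body with one where the compiler has rotated the update to the top, so a signal could fire almost immediately:

```c
/* Baseline: the forwarded scalar `a` is updated at the *end* of the
 * iteration, so a consumer thread would stall for nearly a whole
 * iteration waiting for signal(a). */
long run_baseline(long a, int iters, long *work_out) {
    long work = 0;
    for (int i = 0; i < iters; i++) {
        work += a;      /* long body that only *reads* a        */
        a = a * 11;     /* definition of a; signal(a) goes here */
    }
    *work_out = work;
    return a;
}

/* Scheduled: the update of `a` (and thus its signal) is moved above
 * the independent work, shortening the critical forwarding path.
 * Each iteration still consumes the value of `a` from before its
 * own update, so the results are identical. */
long run_scheduled(long a, int iters, long *work_out) {
    long work = 0;
    for (int i = 0; i < iters; i++) {
        long a_prev = a;
        a = a * 11;     /* signal(a) can fire here, early        */
        work += a_prev; /* independent work runs after the signal */
    }
    *work_out = work;
    return a;
}
```

Both versions compute the same final value of a and the same accumulated work; only the position of the definition (and therefore of the signal) within the iteration changes.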

  13. Potential

  14. Compiler Infrastructure. • Loops were targeted. Loops were discarded if they had low coverage (< 0.1% of execution), fewer than 30 instructions per iteration, or more than 16,384 instructions per iteration. • The remaining loops were profiled and simulated to measure an optimistic upper bound on speedup; good loops were chosen based on the results. • The compiler inserts instructions to create and manage epochs, allocates forwarded variables on the stack (the forwarding frame), and inserts wait and signal instructions.

  15. Synchronization Constraints. • Wait before the first use. • Signal after the last definition. • Signal on every possible path. • The wait should be as late as possible; the signal should be as early as possible.

  16. Dataflow Analysis for Synchronization. • The set of forwarded scalars is defined as the intersection of the set of scalars with downwards-exposed definitions and the set with upwards-exposed uses; scalars that are live outside the loop are also included. • The CFG is modeled as a graph with basic blocks as nodes.
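The set computation described above can be sketched with bitsets, one bit per scalar (a toy model to make the definition concrete, not the paper's implementation):

```c
#include <stdint.h>

/* One bit per scalar.  A scalar must be forwarded between threads if
 * it has a downwards-exposed definition in the loop body AND either
 * an upwards-exposed use or liveness beyond the loop: in both cases
 * the next iteration (or the code after the loop) needs the value
 * this iteration produces. */
uint32_t forwarding_set(uint32_t down_exposed_defs,
                        uint32_t up_exposed_uses,
                        uint32_t live_out_of_loop) {
    return down_exposed_defs & (up_exposed_uses | live_out_of_loop);
}
```

For example, a scalar that is defined in the loop and used at the top of the next iteration is forwarded; one that is only read, or defined but dead after the iteration, is not.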

  17. Instruction Scheduling. • Conservative scheduling. • Aggressive scheduling, which speculates past: • Control dependences • Data dependences.

  18. Instruction Scheduling. [Figure: initial synchronization places wait(a) before the use … = a and signal(a) after the definition a = …; the following slides show the effect of scheduling and then speculatively scheduling these instructions.]

  19. Instruction Scheduling. [Figure: scheduling moves the pair a = …; signal(a) earlier along each path of the control flow graph, duplicating it onto multiple paths where necessary.]

  20. Instruction Scheduling. [Figure: speculative scheduling moves a = …; signal(a) above potentially conflicting stores (*q = …), relying on the hardware to detect a violation if the store conflicts with the hoisted computation.]

  21. Instruction Scheduling. Dataflow analysis handles complex control flow. We define two dataflow analyses: • The "stack" analysis finds the instructions needed to compute the forwarded value. • The "earliest" analysis finds the earliest node at which the forwarded value can be computed.

  22. Computation Stack. A computation stack stores the instructions needed to compute a forwarded value (e.g., a = a*11; signal a). A stack is associated with every node for every forwarded scalar, and each stack is in one of three states: we know how to compute the forwarded value, we do not know how to compute it, or the node has not yet been evaluated.

  23. A Simplified Example from GCC. [Figure: the CFG of a do { … } while (p) loop: counter = 0; wait(p); a loop following p = p->jmp while p->jmp is non-null; q = p; if p->real, a loop executing counter++; q = q->next while q is non-null; finally p = p->next.]
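A plausible, runnable C reconstruction of this loop follows; the exact branch structure is inferred from the CFG in the figure, so treat it as a sketch (the node fields jmp, next, and real come from the slide; everything else is invented):

```c
#include <stddef.h>

struct node { struct node *jmp, *next; int real; };

/* Sketch of the GCC loop from the slide.  `p` is the forwarded
 * scalar: its value for the next iteration is computed by the very
 * last statement of the body, which is why the stack and earliest
 * analyses try to hoist `p = p->next; signal(p)`. */
int count_real(struct node *p) {
    int counter = 0;
    do {
        /* wait(p) would be here */
        while (p->jmp)          /* follow the jump chain */
            p = p->jmp;
        struct node *q = p;
        if (p->real) {
            do { counter++; q = q->next; } while (q);
        }
        p = p->next;            /* definition of p; signal(p) after this */
    } while (p);
    return counter;
}
```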

  24–37. Stack Analysis (worked example). [Figure sequence: the stack analysis propagates computation stacks backward through the CFG of the GCC example. Starting from the definition p = p->next, each node's stack accumulates the instructions needed to compute the forwarded value (p = p->next; signal p; and, along the jump path, additionally p = p->jmp). Nodes whose inputs change are marked "node to revisit" and re-evaluated; the iteration terminates when the solution is consistent (slide 37).]

  38. Scheduling Instructions (recap). Dataflow analysis handles complex control flow. The "stack" analysis finds the instructions needed to compute the forwarded value; the "earliest" analysis finds the earliest node at which the forwarded value can be computed.

  39–43. The Earliest Analysis (worked example). [Figure sequence: over the same CFG, each node is marked earliest, not earliest, or not yet evaluated. The analysis iterates until it converges on the earliest nodes at which the forwarded value of p can be computed: the nodes just after the p->jmp test, on both outcomes (slide 43 shows the converged result).]

  44. Code Transformation. [Figure: the transformed CFG. At the earliest nodes, the forwarded value is computed into a new name and signalled immediately: on the taken path of p->jmp?, p2 = p->jmp; p1 = p2->next; signal(p1); on the fall-through path, p1 = p->next; signal(p1). The rest of the loop body (q = p; the p->real and q loops; p = p->next) executes after the signal.]
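The transformed code can be sketched as follows, assuming (as the profile-driven analysis does) that at most one jmp link is followed on the hot path; a longer chain would be a mis-speculation that the hardware recovers from. The names p1 and p2 come from the slide; the struct and function are invented:

```c
#include <stddef.h>

struct node2 { struct node2 *jmp, *next; };

/* Sketch of the transformation: the value of p for the NEXT
 * iteration is computed at the top of the body so signal(p1) can
 * fire immediately, instead of only after the whole body runs.
 * Speculates that the jmp chain has length at most one. */
struct node2 *next_p_early(struct node2 *p) {
    struct node2 *p1;
    if (p->jmp) {                 /* common case from the profile */
        struct node2 *p2 = p->jmp;
        p1 = p2->next;            /* signal(p1) would fire here */
    } else {
        p1 = p->next;             /* signal(p1) would fire here */
    }
    /* ... the rest of the (long) loop body runs after the signal ... */
    return p1;
}
```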

  45. Instruction Scheduling (summary). [Figure: the three stages side by side: initial synchronization, scheduled instructions, and speculatively scheduled instructions moved above the stores *q = …, with wait(a) and signal(a) progressively closer together.]

  46. Aggressive Scheduling. • Optimize the common case by moving the signal up. • On the wrong path, send a violate_epoch to the next thread and then forward the correct value. • If instructions are scheduled past branches, then when an exception occurs a violation must be sent and the non-speculative code executed. • Changes to the analysis: • Modify the meet operator to compute the stack only for frequently executed paths. • Add a new node for infrequent paths and insert violate_epoch signals there; "earliest" is set to true for these nodes.

  47. Aggressive Scheduling. • Two new instructions: • mark_load tells the hardware to remember the address of a load; it is placed where a load is moved above a potentially conflicting store. • unmark_load clears the mark; it is placed at the original position of the load instruction. • In the meet operation, conflicting load marks are merged using logical OR.
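A toy software model of these two instructions may make the mechanism concrete (the real design is a hardware structure; the table size and function names here are invented):

```c
#include <stddef.h>
#include <stdbool.h>

/* Toy model of mark_load / unmark_load.  When the compiler
 * speculatively hoists a load above a possibly conflicting store,
 * it marks the load's address; a later store to a marked address
 * means the speculation was wrong and the epoch must be violated. */
#define MARK_SLOTS 8
static const void *marks[MARK_SLOTS];

void mark_load(const void *addr) {
    for (int i = 0; i < MARK_SLOTS; i++)
        if (!marks[i]) { marks[i] = addr; return; }
}

void unmark_load(const void *addr) {
    for (int i = 0; i < MARK_SLOTS; i++)
        if (marks[i] == addr) marks[i] = NULL;
}

/* Returns true if a store to addr conflicts with a marked
 * (speculatively hoisted) load. */
bool store_conflicts(const void *addr) {
    for (int i = 0; i < MARK_SLOTS; i++)
        if (marks[i] == addr) return true;
    return false;
}
```

In hardware the check happens on every store for free; the compiler's only job is to place the mark where the load was hoisted to and the unmark where the load originally stood.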
