Be-Nice Scheduling for embedded SMT processors

Be-Nice Scheduling for embedded SMT processors. Handong Ye. Apr 6th, 2008, Boston. Outline: ITS (Inter-Thread Stall) introduction (ITS in out-of-order and in-order processors), Be-Nice Scheduling, some experimental results.

Presentation Transcript


  1. Be-Nice Scheduling for embedded SMT processors • Handong Ye • Apr 6th, 2008 • Boston

  2. Be-Nice Scheduling • ITS (Inter-Thread Stall) Introduction • Be-Nice Scheduling • Some experimental results

  3. Be-Nice Scheduling • ITS Introduction • ITS in Out-Of-Order processor • ITS in In-Order processor • Be-Nice Scheduling • Some experimental results

  4. Be-Nice Scheduling • ITS Introduction • ITS in Out-Of-Order machine • A thread holds (or fills up) shared resources for too long, e.g., the instruction queue, reservation stations, ..., and blocks other threads • Flush, … • ITS in In-Order machine • A thread holds Functional Units, blocking other threads • 2 examples • What can the compiler do?

  5. Be-Nice Scheduling • ITS Introduction • ITS in In-Order machine • Examples, assume: • SMT, 2 threads • Embedded • 2 LS units and 2 ALUs • Separate dispatch buffers

  6. Be-Nice Scheduling • ITS Introduction • ITS In In-Order machine • Example – 1 (Same FU ITS) • A missed load can block other threads which are using the same LS unit

  7. [Figure] Example 1: same-FU block. Pipeline diagram with the Thread-A and Thread-B dispatch buffers (add and ld instructions), the functional units LS1, LS2, ALU1, ALU2, and the stages EXE, MEM (MISS), WB.

  8. Be-Nice Scheduling • ITS Introduction • ITS In In-Order machine • Example – 2 (Cross FU ITS) • A missed load can block other threads which are using non-LS Functional Units, e.g., ALU

  9. [Figure] Example 2: cross-FU block. Pipeline diagram with the Thread-A and Thread-B dispatch buffers (add and ld instructions), the functional units LS1, LS2, ALU1, ALU2, and the stages EXE, MEM (MISS), WB.

  10. Be-Nice Scheduling • ITS Introduction • ITS in In-Order machine • Assume: • 1. Thread-A cache miss rate is around 1%~2% • 2. Thread-B always hits • Results: • 1. Half of the idle cycles are due to ITS • 2. Almost 1/3 of the cycles are idle • [Figure] The effect of ITS, from thread-A to thread-B

  11. Be-Nice Scheduling • ITS Introduction • ITS in In-Order machine • What can the compiler do? • Focused on in-order embedded processors • Needs some simple HW support • Uses Open64, in Instruction Scheduling

  12. Be-Nice Scheduling • ITS (Inter-Thread Stall) Introduction • Be-Nice Scheduling • Some experimental results

  13. Be-Nice Scheduling • Be-Nice Scheduling • Intuitive thinking • Prefetch: unacceptable for embedded systems • Reduce Cross-FU ITS: reduce the number of FUs held by thread-A • Reduce Same-FU ITS: avoid issuing instructions from other threads into the blocked FUs

  14. [Figure] Thread-A's original instruction order (ld, add, add, ld) compared with the rescheduled order ("sched"), shown next to Thread-B in the dispatch buffers that feed LS1, LS2, ALU1, ALU2 through the stages EXE, MEM, WB.
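The intuition of reducing the number of FUs held by thread-A can be illustrated with a toy occupancy model. This is only a sketch under a loud assumption: a load that misses keeps the LS unit it was issued to busy until the miss resolves. The function names and the round-robin default are hypothetical and do not come from the slides.

```python
# Toy model (an assumption, not the simulator behind the slides): a load
# that misses keeps the LS unit it was issued to busy until the miss resolves.

LS_UNITS = ("LS1", "LS2")

def blocked_ls_units(num_missing_loads, steer_to=None):
    """Return the set of LS units held by thread-A's missing loads.

    steer_to=None  : loads are spread round-robin over all LS units.
    steer_to="LS1" : every load is sent to that single unit.
    """
    blocked = set()
    for i in range(num_missing_loads):
        unit = steer_to if steer_to is not None else LS_UNITS[i % len(LS_UNITS)]
        blocked.add(unit)
    return blocked

def free_ls_units(blocked):
    """LS units that remain available to the other thread."""
    return [u for u in LS_UNITS if u not in blocked]

print(free_ls_units(blocked_ls_units(2)))                   # loads spread out
print(free_ls_units(blocked_ls_units(2, steer_to="LS1")))   # loads steered to LS1
```

In this model, spreading two missing loads over both units leaves no LS unit for thread-B, while steering both to one unit keeps the other unit free.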

  15. Be-Nice Scheduling • Be-Nice Scheduling • Objective • Schedule n (>=2) loads back-to-back • Issue the n loads to the same FU • Compiler + HW solution • HW side • Add an extra load instruction, ld.n (n=1,2), which sends the load only to the nth LS unit • Each thread has its preferred LS unit • Compiler side • Profile to find the loads that are highly likely to miss, say ‘load_a’ • Schedule another load, say ‘load_b’, right behind ‘load_a’, and glue them together as a pseudo OP • Change ‘load_a’ and ‘load_b’ to use the thread’s preferred LS unit, e.g., both are changed to ‘ld.1’
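The profiling step on the compiler side could look roughly like the sketch below. The profile format (a mapping from a load label to miss and execution counts), the function name and the 5% threshold are all assumptions made for illustration; the slides do not say how "highly likely to miss" is decided.

```python
def likely_missing_loads(profile, threshold=0.05):
    """Pick the loads whose profiled miss rate reaches the threshold.

    profile maps a load label to (misses, executions). Both the format
    and the default threshold are hypothetical.
    """
    selected = []
    for label, (misses, executions) in profile.items():
        # Skip loads that never executed during profiling.
        if executions > 0 and misses / executions >= threshold:
            selected.append(label)
    return sorted(selected)

# Hypothetical profile data, for illustration only.
profile = {"load_a": (40, 200), "load_b": (1, 200), "load_c": (0, 0)}
print(likely_missing_loads(profile))
```

With this made-up profile only `load_a` is selected, so it would be the load that gets a partner load scheduled right behind it.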

  16. Be-Nice Scheduling • Be-Nice Scheduling • A Compiler + HW solution • Identified to miss • BB1: $r1 = ld $r2; $r2 = $r2 + 4; $r3 = ld $r4; $r3 = $r3 + 4; $r5 = $r1 + $r3 • BB1: $r1 = ld $r2; $r2 = $r2 + 4; $r3 = ld $r4; $r3 = $r3 + 4; $r5 = $r1 + $r3 • BB1: $r1 = ld $r2; $r3 = ld $r4; $r2 = $r2 + 4; $r3 = $r3 + 4; $r5 = $r1 + $r3 • BB1: $r1 = ld.1 $r2; $r3 = ld.1 $r4; $r2 = $r2 + 4; $r3 = $r3 + 4; $r5 = $r1 + $r3
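The BB1 transformation on this slide can be reproduced with a small sketch. It assumes that the first load is the one identified to miss, that only register dependences matter (the block contains no stores), and that the thread's preferred unit is LS unit 1. The `Instr` class and the `be_nice_pair` helper are hypothetical names, not Open64 code.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple

@dataclass(frozen=True)
class Instr:
    dest: str
    op: str                    # "ld", "ld.1", "ld.2" or "add"
    srcs: Tuple[str, ...]      # source registers only
    imm: Optional[int] = None  # optional immediate operand of an add

    def __str__(self):
        if self.op.startswith("ld"):
            return f"{self.dest} = {self.op} {self.srcs[0]}"
        operands = list(self.srcs) + ([str(self.imm)] if self.imm is not None else [])
        return f"{self.dest} = " + " + ".join(operands)

def conflicts(a, b):
    """True if b may not be hoisted above a (RAW, WAR or WAW on registers)."""
    return a.dest in b.srcs or b.dest in a.srcs or a.dest == b.dest

def be_nice_pair(block, miss_idx, unit=1):
    """Hoist the next load right behind block[miss_idx] and steer both to one LS unit."""
    block = list(block)
    partner = next((i for i in range(miss_idx + 1, len(block))
                    if block[i].op == "ld"), None)
    if partner is None:
        return block  # no second load to pair with
    if any(conflicts(x, block[partner]) for x in block[miss_idx + 1:partner]):
        return block  # hoisting would break a register dependence
    block.insert(miss_idx + 1, block.pop(partner))
    for i in (miss_idx, miss_idx + 1):
        block[i] = replace(block[i], op=f"ld.{unit}")
    return block

bb1 = [
    Instr("$r1", "ld", ("$r2",)),
    Instr("$r2", "add", ("$r2",), 4),
    Instr("$r3", "ld", ("$r4",)),
    Instr("$r3", "add", ("$r3",), 4),
    Instr("$r5", "add", ("$r1", "$r3")),
]
for ins in be_nice_pair(bb1, miss_idx=0):
    print(ins)
```

The sketch leaves the block unchanged when no partner load exists or when hoisting would violate a dependence, which is a conservative choice made here for simplicity.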

  17. [Figure] Open64 code generator flow, with labels as extracted from the diagram: WHIRL • CG-expand • CGIR • Extended block optimizer • Software pipelining • Loop unrolling • Be-Nice Scheduling • Scheduling pre-pass (GCM here) • Global register alloc • Local register alloc • Control flow opt. • Scheduling post-pass • If-conversion • Loop optimizations • Prolog and Epilog • Code emission • .s

  18. Be-Nice Scheduling • Be-Nice Scheduling (in Open64 GCM and LIS) • The key points during code motion • Use GCM to find candidate <ld.1, ld.1> pairs • Move the pair as a single ‘pseudo’ instruction
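Moving the pair as a single pseudo instruction can be pictured as scheduling over units instead of individual instructions. The sketch below is an illustration only: `glue`, `move_unit` and `flatten` are hypothetical helpers, the instruction strings are made up, and no dependence checking is done. It is not how Open64's GCM or LIS is implemented.

```python
def glue(block, first):
    """Group block[first] and block[first + 1] into one unit; all others stay single."""
    if not 0 <= first < len(block) - 1:
        raise ValueError("first must index an instruction that has a successor")
    units = []
    i = 0
    while i < len(block):
        if i == first:
            units.append((block[i], block[i + 1]))
            i += 2
        else:
            units.append((block[i],))
            i += 1
    return units

def move_unit(units, src, dst):
    """Move one unit to a new position; a glued pair travels together."""
    units = list(units)
    units.insert(dst, units.pop(src))
    return units

def flatten(units):
    """Expand the units back into a plain instruction list."""
    return [ins for unit in units for ins in unit]

# Hypothetical instruction strings, for illustration only.
block = ["ld.1 a", "ld.1 b", "add x", "add y"]
units = glue(block, 0)
print(flatten(move_unit(units, 0, 1)))
```

Because the scheduler only ever reorders whole units, the two loads stay back-to-back wherever the pair ends up.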

  19. Be-Nice Scheduling • Some experimental results • Be-Nice Schedule on Thread-A • Performance difference on Thread-B

  20. Be-Nice Scheduling • Some experimental results The Number of ITS Cycles in thread-B: w/ Be-Nice vs. w/o Be-Nice

  21. Be-Nice Scheduling • Some experimental results IPC Improvement of thread-B with Be-Nice Instruction Scheduling