Be-Nice Scheduling for embedded SMT processors
Handong Ye
Apr 6th, 2008, Boston

Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results

Be-Nice Scheduling
• ITS Introduction
  • ITS in Out-Of-Order processor
  • ITS in In-Order processor
• Be-Nice Scheduling
• Some experimental results

Be-Nice Scheduling
• ITS Introduction
  • ITS in Out-Of-Order machine
    • A thread holds (or fills up) shared resources for too long, e.g., instruction queue, reservation station, ..., and blocks the other threads
    • Flush, …
  • ITS in In-Order machine
    • A thread holds Functional Units (FUs), blocking the other threads
    • 2 examples
  • What can the compiler do?

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Examples, assuming:
      • SMT, 2 threads
      • Embedded
      • 2 LS units and 2 ALUs
      • Separate dispatch buffers

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Example 1 (Same-FU ITS)
      • A load that misses can block other threads that are using the same LS unit

Figure: Example 1, same-FU block. Diagram labels: Thread-A and Thread-B dispatch buffers holding ld and add instructions; functional units LS1, LS2, ALU1, ALU2; pipeline stages EXE, MEM (MISS), WB.

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Example 2 (Cross-FU ITS)
      • A load that misses can block other threads that are using non-LS Functional Units, e.g., an ALU

Figure: Example 2, cross-FU block. Diagram labels: Thread-A and Thread-B dispatch buffers holding ld and add instructions; functional units LS1, LS2, ALU1, ALU2; pipeline stages EXE, MEM (MISS), WB.

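The two examples can be told apart with a small helper. This is only a sketch of one reading of the slides: the function name `classify_its`, the unit names as strings, and the idea of passing in the set of FUs that thread-A holds while its load is outstanding are all assumptions made for illustration, not part of the presented design.

```python
from typing import Set


def classify_its(miss_unit: str, held_by_a: Set[str], wanted_by_b: str) -> str:
    """Classify the stall thread-B sees while a thread-A load is missing.

    miss_unit   -- LS unit occupied by thread-A's missed load
    held_by_a   -- every FU thread-A occupies while the miss is outstanding
    wanted_by_b -- FU that thread-B's next instruction needs
    """
    blocked = set(held_by_a) | {miss_unit}
    if wanted_by_b == miss_unit:
        return "same-FU ITS"    # Example 1: B needs the LS unit with the miss
    if wanted_by_b in blocked:
        return "cross-FU ITS"   # Example 2: B needs another FU that A holds
    return "no ITS"


# Example 1: thread-B's ld wants LS1, where thread-A's load missed.
print(classify_its("LS1", {"LS1"}, "LS1"))
# Example 2: thread-A's stalled add sits in ALU1, which thread-B's add wants.
print(classify_its("LS1", {"LS1", "ALU1"}, "ALU1"))
```

Under this reading, cross-FU ITS shrinks when thread-A holds fewer FUs during a miss, and same-FU ITS shrinks when other threads are kept away from the blocked LS unit.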
Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Assume:
      1. Thread-A has a cache miss rate of around 1% to 2%
      2. Thread-B always hits
    • Results:
      1. Half of the idle cycles are due to ITS
      2. Almost 1/3 of the cycles are idle
Figure: The effect of ITS, from thread-A to thread-B

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • What can the compiler do?
      • Focus on in-order embedded processors
      • Needs a few simple pieces of HW support
      • Implemented in Open64, in Instruction Scheduling

Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results

Be-Nice Scheduling
• Be-Nice Scheduling
  • Intuitive thinking
    • Prefetch: unacceptable for embedded systems
    • Reduce Cross-FU ITS: reduce the number of FUs held by thread-A
    • Reduce Same-FU ITS: avoid issuing instructions from other threads into the blocked FUs

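Figure: pipeline diagram comparing Thread-A's original instruction order ("Original") with the scheduled order ("sched"). Diagram labels: Thread-A and Thread-B dispatch buffers holding ld and add instructions; functional units LS1, LS2, ALU1, ALU2; pipeline stages EXE, MEM, WB.
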
Be-Nice Scheduling
• Be-Nice Scheduling
  • Objective
    • Schedule n (>= 2) loads back-to-back
    • Issue the n loads to the same FU
  • Compiler + HW solution
    • HW side
      • Add an extra load instruction, ld.n (n = 1, 2), which sends the load only to the nth LS unit
      • Each thread has its own preferred LS unit
    • Compiler side
      • Profile to find the loads that are highly likely to miss, say ‘load_a’
      • Schedule another load, say ‘load_b’, right behind ‘load_a’, and glue the two together as a pseudo-op
      • Change ‘load_a’ and ‘load_b’ to use the thread’s preferred LS unit, e.g., both are changed to ‘ld.1’

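The HW-side rule can be illustrated with a toy issue check. This is a minimal sketch, not the presented hardware: the function `pick_ls_unit`, the `ls_busy` dictionary, and the policy that a plain `ld` takes the lowest-numbered free unit are assumptions. Only the `ld.n` behavior (issue to the nth LS unit and nowhere else) comes from the slide.

```python
from typing import Dict, Optional


def pick_ls_unit(op: str, ls_busy: Dict[int, bool]) -> Optional[int]:
    """Return the LS unit a load may issue to this cycle, or None if it must wait."""
    if op.startswith("ld."):
        # ld.n: only the nth LS unit is allowed, even if another one is free.
        unit = int(op.split(".")[1])
        return None if ls_busy[unit] else unit
    # Plain ld (assumed policy): take the lowest-numbered free LS unit.
    for unit in sorted(ls_busy):
        if not ls_busy[unit]:
            return unit
    return None


# Thread-A's first load missed and is occupying LS1; LS2 is free.
ls_busy = {1: True, 2: False}

print(pick_ls_unit("ld", ls_busy))    # a plain second load would spill into LS2
print(pick_ls_unit("ld.1", ls_busy))  # ld.1 waits behind the miss instead
print(pick_ls_unit("ld.2", ls_busy))  # thread-B, preferring LS2, can still issue
```

In this model the second load of a back-to-back pair queues up behind the miss on the thread's own LS unit, so the other LS unit stays available to the other thread.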
Be-Nice Scheduling
• Be-Nice Scheduling
  • A Compiler + HW solution

Identified to miss

BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld $r2
    $r3 = ld $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

BB1:
    $r1 = ld.1 $r2
    $r3 = ld.1 $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

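The BB1 transformation can be reproduced with a short script. This is a sketch under stated assumptions, not the Open64 implementation: the `Instr` class, `blocks_hoist`, and `be_nice_pair` are hypothetical names, only register dependences are checked (the example has no stores, so memory dependences are ignored), and the load that is likely to miss is passed in by index rather than found by profiling.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple


@dataclass(frozen=True)
class Instr:
    dest: str
    op: str                  # "ld", "ld.1", "ld.2", or "add"
    srcs: Tuple[str, ...]

    def is_load(self) -> bool:
        return self.op.startswith("ld")

    def __str__(self) -> str:
        if self.is_load():
            return f"{self.dest} = {self.op} {self.srcs[0]}"
        return f"{self.dest} = {self.srcs[0]} + {self.srcs[1]}"


def blocks_hoist(moved: Instr, other: Instr) -> bool:
    """True if `moved` may not be hoisted above `other` (register deps only)."""
    raw = other.dest in moved.srcs   # moved reads a value that other writes
    war = moved.dest in other.srcs   # other reads a value that moved overwrites
    waw = moved.dest == other.dest   # both write the same register
    return raw or war or waw


def be_nice_pair(block: List[Instr], miss_idx: int, preferred_ls: int) -> List[Instr]:
    """Place a second load right behind block[miss_idx]; steer both to one LS unit."""
    for j in range(miss_idx + 1, len(block)):
        cand = block[j]
        if not cand.is_load():
            continue
        between = block[miss_idx + 1:j]
        if any(blocks_hoist(cand, other) for other in between):
            continue                 # this load cannot move up; try a later one
        op = f"ld.{preferred_ls}"
        pair = [replace(block[miss_idx], op=op), replace(cand, op=op)]
        return block[:miss_idx] + pair + between + block[j + 1:]
    return list(block)               # no legal partner: leave the block unchanged


bb1 = [
    Instr("$r1", "ld", ("$r2",)),            # identified to miss (load_a)
    Instr("$r2", "add", ("$r2", "4")),
    Instr("$r3", "ld", ("$r4",)),            # becomes load_b
    Instr("$r3", "add", ("$r3", "4")),
    Instr("$r5", "add", ("$r1", "$r3")),
]
scheduled = be_nice_pair(bb1, miss_idx=0, preferred_ls=1)
for ins in scheduled:
    print(ins)
```

The second load can move above `$r2 = $r2 + 4` because it neither reads `$r2` nor writes a register that the add uses, which is what makes the reordering in the third BB1 listing legal.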
Figure: Be-Nice Scheduling within the Open64 code generation flow, from WHIRL to the emitted .s file. Phase labels in the diagram: CG-expand, CGIR, Extended block optimizer, Software pipelining, Loop unrolling, Scheduling pre-pass (GCM here), Global register alloc, Local register alloc, Control flow opt., Scheduling post-pass, If-conversion, Loop optimizations, Prolog and Epilog, Code emission.

Be-Nice Scheduling
• Be-Nice Scheduling (in Open64 GCM and LIS)
  • The key points during code motion
    • Use GCM to find candidates for a <ld.1, ld.1> pair
    • Move the pair as a single ‘pseudo’ instruction

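Moving the pair as one unit can be sketched as follows. This is an illustration only and does not use any Open64 API: the tuple layout of `Instr`, and the helpers `glue`, `can_cross`, and `hoist`, are hypothetical, and the sketch moves the pair only within one straight-line block, whereas GCM also moves code across blocks.

```python
from typing import List, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]   # (dest, op, sources)


def glue(load_a: Instr, load_b: Instr) -> dict:
    """Bundle two loads into one pseudo-op with merged def and use sets."""
    return {
        "members": [load_a, load_b],
        "defs": {load_a[0], load_b[0]},
        "uses": set(load_a[2]) | set(load_b[2]),
    }


def can_cross(pseudo: dict, other: Instr) -> bool:
    """True if the whole pair may be moved above `other` (register deps only)."""
    dest, _, srcs = other
    raw = dest in pseudo["uses"]              # pair reads what other writes
    war = bool(pseudo["defs"] & set(srcs))    # other reads what the pair writes
    waw = dest in pseudo["defs"]              # both write the same register
    return not (raw or war or waw)


def hoist(block: List[Instr], pair_pos: int) -> List[Instr]:
    """Move the pair at block[pair_pos:pair_pos + 2] upward as far as is legal."""
    pseudo = glue(block[pair_pos], block[pair_pos + 1])
    new_pos = pair_pos
    while new_pos > 0 and can_cross(pseudo, block[new_pos - 1]):
        new_pos -= 1
    rest = block[:pair_pos] + block[pair_pos + 2:]
    return rest[:new_pos] + pseudo["members"] + rest[new_pos:]


block = [
    ("$r2", "add", ("$r2", "4")),    # writes $r2, which the pair reads
    ("$r6", "add", ("$r7", "$r8")),  # independent of the pair
    ("$r1", "ld.1", ("$r2",)),
    ("$r3", "ld.1", ("$r4",)),
]
moved = hoist(block, pair_pos=2)
for ins in moved:
    print(ins)
```

Because the legality check uses the merged def and use sets, the two loads either move together or not at all, so they stay back-to-back after code motion.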
Be-Nice Scheduling
• Some experimental results
  • Be-Nice Scheduling applied to Thread-A
  • Performance difference on Thread-B

Be-Nice Scheduling
• Some experimental results
Figure: The number of ITS cycles in thread-B, with Be-Nice vs. without Be-Nice

Be-Nice Scheduling
• Some experimental results
Figure: IPC improvement of thread-B with Be-Nice Instruction Scheduling