Be-Nice Scheduling for embedded SMT processors

Be-Nice Scheduling for embedded SMT processors
Handong Ye
Apr 6th, 2008, Boston

Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results

Be-Nice Scheduling
• ITS Introduction
  • ITS in Out-Of-Order processor
  • ITS in In-Order processor
• Be-Nice Scheduling
• Some experimental results

Be-Nice Scheduling
• ITS Introduction
  • ITS in Out-Of-Order machine
    • A thread holds (or fills up) shared resources for too long, e.g., instruction queue, reservation stations, ..., and blocks the other threads
    • Flush, …
  • ITS in In-Order machine
    • A thread holds Functional Units (FUs), blocking the other threads
    • 2 examples
  • What can the compiler do?

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Examples, assume:
      • SMT, 2 threads
      • Embedded
      • 2 LS (load/store) units and 2 ALUs
      • Separate dispatch buffers

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Example 1 (same-FU ITS)
      • A load that misses can block other threads that are using the same LS unit

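[Figure: Example 1, same-FU block. The dispatch buffers of Thread-A and Thread-B (holding ld and add instructions) feed the functional units LS1, LS2, ALU1, and ALU2 through the EXE, MEM, and WB stages, with a MISS marked in the pipeline.]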

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Example 2 (cross-FU ITS)
      • A load that misses can block other threads that are using non-LS Functional Units, e.g., an ALU

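[Figure: Example 2, cross-FU block. The dispatch buffers of Thread-A and Thread-B (holding ld and add instructions) feed the functional units LS1, LS2, ALU1, and ALU2 through the EXE, MEM, and WB stages, with a MISS marked in the pipeline.]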
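The two examples can be captured in a toy model. The Python sketch below is not from the slides: it assumes that a functional-unit pipeline stalls whenever it holds any in-flight instruction from the thread whose load missed, and that instructions from other threads sitting in a stalled unit are the ITS victims. The names `InFlight`, `stalled_fus`, and `its_victims` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InFlight:
    thread: str  # "A" or "B"
    op: str      # "ld" or "add"
    fu: str      # "LS1", "LS2", "ALU1", or "ALU2"


def stalled_fus(in_flight, missing_thread):
    """FUs assumed stalled: every unit holding an instruction of the missing thread."""
    return {i.fu for i in in_flight if i.thread == missing_thread}


def its_victims(in_flight, missing_thread):
    """Instructions of other threads that sit in a stalled FU."""
    blocked = stalled_fus(in_flight, missing_thread)
    return [i for i in in_flight if i.thread != missing_thread and i.fu in blocked]


# Example 1 (same-FU): thread-A's missed load and thread-B's load share LS1.
example_1 = [
    InFlight("A", "ld", "LS1"),
    InFlight("B", "ld", "LS1"),
    InFlight("B", "add", "ALU1"),
]

# Example 2 (cross-FU): thread-A also holds ALU1 with an add behind its missed load.
example_2 = [
    InFlight("A", "ld", "LS1"),
    InFlight("A", "add", "ALU1"),
    InFlight("B", "add", "ALU1"),
    InFlight("B", "add", "ALU2"),
]

print(its_victims(example_1, "A"))  # thread-B's ld on LS1
print(its_victims(example_2, "A"))  # thread-B's add on ALU1
```

In this model, the only way to shorten thread-B's victim list is to shrink the set of units that thread-A holds while its load is outstanding.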

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • Assume:
      1. Thread-A has a cache miss rate of around 1%~2%
      2. Thread-B always hits
    • Results:
      1. Half of the idle cycles are due to ITS
      2. Almost 1/3 of the cycles are idle
[Figure: The effect of ITS, from thread-A to thread-B]

Be-Nice Scheduling
• ITS Introduction
  • ITS in In-Order machine
    • What can the compiler do?
      • Focused on in-order embedded processors
      • Needs a few simple HW supports
      • Uses Open64, in instruction scheduling

Be-Nice Scheduling
• ITS (Inter-Thread Stall) Introduction
• Be-Nice Scheduling
• Some experimental results

Be-Nice Scheduling
• Be-Nice Scheduling
  • Intuitive thinking
    • Prefetch: unacceptable for embedded systems
    • Reduce cross-FU ITS: reduce the number of FUs held by thread-A
    • Reduce same-FU ITS: avoid issuing instructions from other threads into the blocked FUs

[Figure: Thread-A's original instruction order (ld, add, add, ld) compared with its scheduled order, shown next to Thread-B in the dispatch buffers that feed LS1, LS2, ALU1, and ALU2 through the EXE, MEM, and WB stages.]

Be-Nice Scheduling
• Be-Nice Scheduling
  • Objective
    • Schedule n (>=2) loads back-to-back
    • Issue the n loads to the same FU
  • Compiler + HW solution
    • HW side
      • Add an extra load instruction, ld.n (n=1,2), which sends the load only to the nth LS unit
      • Each thread has its own preferred LS unit
    • Compiler side
      • Profile to find the loads that are highly likely to miss, say ‘load_a’
      • Schedule another load, say ‘load_b’, right behind ‘load_a’, and glue the two together as a pseudo OP
      • Change ‘load_a’ and ‘load_b’ to use the thread’s preferred LS unit, e.g., both are changed to ‘ld.1’
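The HW-side rule can be sketched as a small dispatch filter. This Python sketch is illustrative only: the slide states that `ld.n` goes only to the nth LS unit, while letting a plain `ld` go to either LS unit is an assumption made here, and the function names are hypothetical.

```python
LS_UNITS = ("LS1", "LS2")


def allowed_ls_units(opcode):
    """LS units a load opcode may be dispatched to.

    'ld.n' is pinned to the nth LS unit, as on the slide. Letting a plain
    'ld' go to any LS unit is an assumption of this sketch.
    """
    if opcode == "ld":
        return set(LS_UNITS)
    if opcode.startswith("ld."):
        n = int(opcode[3:])
        if not 1 <= n <= len(LS_UNITS):
            raise ValueError(f"no LS unit {n}")
        return {LS_UNITS[n - 1]}
    raise ValueError(f"not a load: {opcode}")


def ls_units_possibly_held(load_opcodes):
    """Worst-case set of LS units a thread's in-flight loads can occupy."""
    held = set()
    for opcode in load_opcodes:
        held |= allowed_ls_units(opcode)
    return held


print(sorted(ls_units_possibly_held(["ld", "ld"])))      # ['LS1', 'LS2']
print(sorted(ls_units_possibly_held(["ld.1", "ld.1"])))  # ['LS1']
```

Under these assumptions, two unpinned loads from thread-A can tie up both LS units when one of them misses, while two `ld.1` loads leave LS2 free for the other thread.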

Be-Nice Scheduling
• Be-Nice Scheduling
  • A Compiler + HW solution

Original block, with a load identified to miss:
  BB1:
    $r1 = ld $r2
    $r2 = $r2 + 4
    $r3 = ld $r4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

Loads scheduled back-to-back:
  BB1:
    $r1 = ld $r2
    $r3 = ld $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3

Both loads changed to ld.1:
  BB1:
    $r1 = ld.1 $r2
    $r3 = ld.1 $r4
    $r2 = $r2 + 4
    $r3 = $r3 + 4
    $r5 = $r1 + $r3
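The BB1 transformation can be reproduced with a short Python sketch. It is not the Open64 implementation: `Instr`, `can_swap`, and `be_nice` are hypothetical names, the likely-to-miss load is passed in by index instead of coming from a profile, only register dependences are checked, and the block is assumed to contain no stores.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Instr:
    dest: str    # destination register, e.g. "$r1"
    op: str      # "ld", "ld.1", "ld.2", or "add"
    srcs: tuple  # source operands: registers or immediates

    def reads(self):
        return {s for s in self.srcs if s.startswith("$")}

    def __str__(self):
        if self.op.startswith("ld"):
            return f"{self.dest} = {self.op} {self.srcs[0]}"
        return f"{self.dest} = {' + '.join(self.srcs)}"


def can_swap(a, b):
    """True if a and b have no RAW, WAR, or WAW register dependence."""
    return (a.dest not in b.reads()
            and b.dest not in a.reads()
            and a.dest != b.dest)


def be_nice(block, miss_idx, ls_unit=1):
    """Hoist a later plain load to sit right behind block[miss_idx], then pin both."""
    load_a = block[miss_idx]
    if load_a.op != "ld":
        raise ValueError("block[miss_idx] must be a plain load")
    for j in range(miss_idx + 1, len(block)):
        if block[j].op != "ld":
            continue
        load_b, between = block[j], block[miss_idx + 1:j]
        # load_b may move up only if it is independent of everything it passes.
        if all(can_swap(load_b, mid) for mid in between):
            pinned = f"ld.{ls_unit}"
            pair = [replace(load_a, op=pinned), replace(load_b, op=pinned)]
            return block[:miss_idx] + pair + between + block[j + 1:]
    return list(block)  # no legal partner found: leave the block unchanged


BB1 = [
    Instr("$r1", "ld", ("$r2",)),
    Instr("$r2", "add", ("$r2", "4")),
    Instr("$r3", "ld", ("$r4",)),
    Instr("$r3", "add", ("$r3", "4")),
    Instr("$r5", "add", ("$r1", "$r3")),
]

for instr in be_nice(BB1, 0):
    print(instr)
```

Run on BB1 with the first load marked as likely to miss, the sketch prints the final block from the slide. The two pinned loads are returned adjacent to each other, which stands in for the glued pseudo OP.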

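[Figure: Open64 code generator flow from WHIRL to the emitted .s file. Phases shown: CG-expand (WHIRL to CGIR), extended block optimizer, software pipelining, loop unrolling, scheduling pre-pass (GCM here), global register alloc, local register alloc, control flow opt., scheduling post-pass, if-conversion, loop optimizations, prolog and epilog, code emission. Be-Nice Scheduling is marked next to the scheduling pre-pass.]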

Be-Nice Scheduling
• Be-Nice Scheduling (in Open64 GCM and LIS)
  • The key points during code motion
    • Use GCM to find candidates for a <ld.1, ld.1> pair
    • Move the pair as a single ‘pseudo’ instruction

Be-Nice Scheduling
• Some experimental results
  • Be-Nice Scheduling is applied to Thread-A
  • The performance difference is measured on Thread-B

Be-Nice Scheduling
• Some experimental results
[Figure: The number of ITS cycles in thread-B, with Be-Nice vs. without Be-Nice]

Be-Nice Scheduling
• Some experimental results
[Figure: IPC improvement of thread-B with Be-Nice instruction scheduling]
