CSE P501 – Compiler Construction
Instruction Scheduling
• Issues
• Latencies
• List scheduling
Jim Hogg - UW - CSE - P501
Instruction Scheduling is . . .
Original: a b c d e f g h  →  Schedule: b a d f c g h e
• Execute in the original order to get the correct answer
• Issue in a new order
  • e.g. a memory fetch is slow
  • e.g. a divide is slow
• Overall faster
• Still get the correct answer!
• Originally devised for super-computers
• Now used everywhere:
  • in-order processors - older ARM
  • out-of-order processors - newer x86
• Compiler does the 'heavy lifting' - reduces chip power
Jim Hogg - UW - CSE - P501
Chip Complexity, 1
The following factors make scheduling complicated:
• Different kinds of instructions take different times (in clock cycles) to complete
• Modern chips have multiple functional units
  • so they can issue several operations per cycle
  • "super-scalar"
• Loads are non-blocking
  • ~50 in-flight loads and ~50 in-flight stores
Jim Hogg - UW - CSE - P501
Typical Instruction Timings
Jim Hogg - UW - CSE - P501
Load Latencies
• Instruction issue: ~5 per cycle
• Register: 1 cycle
• L1 Cache: ~4 cycles (64 KB per core)
• L2 Cache: ~10 cycles (256 KB per core)
• L3 Cache: ~40 cycles (2-8 MB, shared)
• DRAM: ~100 ns
Super-Scalar
Jim Hogg - UW - CSE - P501
Chip Complexity, 2
• Branch costs vary (branch predictor)
• Branches on some processors have delay slots (e.g. SPARC)
• Modern processors have branch-predictor logic in hardware
  • heuristics predict whether branches are taken or not
  • keeps pipelines full
• GOAL: the scheduler should reorder instructions to
  • hide latencies
  • take advantage of multiple functional units (and delay slots)
  • help the processor effectively pipeline execution
• However, many chips schedule on-the-fly too
  • e.g. Haswell out-of-order window = 192 ops
Jim Hogg - UW - CSE - P501
Data Dependence Graph
(The slide shows a dependence graph over instructions a..i; a is a leaf, i is the root)
• read-after-write = RAW = true dependence = flow dependence
• write-after-read = WAR = anti-dependence
• write-after-write = WAW = output-dependence
The scheduler is free to reorder instructions, so long as it respects the inter-instruction dependencies (a small sketch of building such a graph follows)
Jim Hogg - UW - CSE - P501
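To make these dependence kinds concrete, here is a minimal sketch - not the lecture's code - that scans a straight-line block of (dest, sources) instructions and records every RAW, WAR, and WAW edge. It is deliberately conservative (it keeps all transitive edges), and the instruction encoding and register names are illustrative.

def build_dependence_graph(block):
    """block: list of (dest, sources); returns (earlier, later, kind) edges."""
    edges = []
    for j, (dest_j, srcs_j) in enumerate(block):
        for i in range(j):
            dest_i, srcs_i = block[i]
            if dest_i is not None and dest_i in srcs_j:
                edges.append((i, j, "RAW"))   # true (flow) dependence
            if dest_j is not None and dest_j in srcs_i:
                edges.append((i, j, "WAR"))   # anti-dependence
            if dest_i is not None and dest_i == dest_j:
                edges.append((i, j, "WAW"))   # output-dependence
    return edges

# r2 <- r1*r1 ; r3 <- r2+r1 ; r2 <- 2*r3  exhibits all three kinds
print(build_dependence_graph([("r2", ["r1", "r1"]),
                              ("r3", ["r2", "r1"]),
                              ("r2", ["r3"])]))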
Scheduling Really Works ...
Example: a = 2*a*b*c*d
1 Functional Unit; Load or Store: 3 cycles; Multiply: 2 cycles; Otherwise: 1 cycle
The slide compares the original code against a scheduled version (the two code columns are not reproduced here); the new schedule uses an extra register, r3, and preserves the (WAW) output-dependence
Jim Hogg - UW - CSE - P501
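Since the slide's code columns did not survive, the effect can be illustrated with a hedged sketch: a tiny simulator of a single-issue pipeline that stalls until source registers are ready, using the latencies above, run on a plausible original ordering and a scheduled ordering for a = 2*a*b*c*d (loads hoisted, extra register r3). Instruction names and operands are illustrative, not the slide's exact code.

LATENCY = {"load": 3, "store": 3, "mult": 2, "add": 1}

def cycles(code):
    """code: list of (op, dest, sources), executed in order on 1 functional unit."""
    ready = {}                                 # register -> first cycle it can be read
    issue_slot, finish = 1, 0
    for op, dest, srcs in code:
        issue = max([issue_slot] + [ready.get(r, 1) for r in srcs])   # stall if needed
        done = issue + LATENCY[op] - 1
        if dest is not None:
            ready[dest] = done + 1             # usable the cycle after it completes
        finish = max(finish, done)
        issue_slot = issue + 1                 # at most one issue per cycle
    return finish

original = [                                   # naive order: every mult waits on a load
    ("load", "r1", []), ("add", "r1", ["r1"]),
    ("load", "r2", []), ("mult", "r1", ["r1", "r2"]),
    ("load", "r2", []), ("mult", "r1", ["r1", "r2"]),
    ("load", "r2", []), ("mult", "r1", ["r1", "r2"]),
    ("store", None, ["r1"]),
]

scheduled = [                                  # loads hoisted; r3 hides their latency
    ("load", "r1", []), ("load", "r2", []), ("load", "r3", []),
    ("add", "r1", ["r1"]), ("mult", "r1", ["r1", "r2"]),
    ("load", "r2", []), ("mult", "r1", ["r1", "r3"]),
    ("mult", "r1", ["r1", "r2"]), ("store", None, ["r1"]),
]

print(cycles(original), cycles(scheduled))     # 20 vs 13 cycles with these latencies

The reordered code contains exactly the same operations yet finishes in roughly two-thirds of the time.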
Scheduler: Job Description
• The Job
  • Given code for some machine, and latencies for each instruction, reorder the instructions to minimize execution time
• Constraints
  • Produce correct code
  • Minimize wasted cycles
  • Avoid spilling registers
  • Don't take forever to reach an answer
Jim Hogg - UW - CSE - P501
Job Description - Part 2
• foreach instruction in the dependence graph
  • Denote the current instruction as ins
  • Denote the number of cycles it takes to execute as ins.delay
  • Denote the cycle number in which ins should start as ins.start
  • foreach instruction dep that is dependent on ins
    • Ensure ins.start + ins.delay <= dep.start
What if the scheduler makes a mistake? On-chip hardware stalls the pipeline until operands become available: slower, but still correct! (A sketch of this check appears below.)
Jim Hogg - UW - CSE - P501
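A one-function sketch of that check, assuming each instruction carries the start and delay values above and the dependences are given as (ins, dep) pairs; these names are mine, not the lecture's.

def schedule_is_legal(start, delay, edges):
    """start[i], delay[i]: per-instruction start cycle and latency; edges: (ins, dep) pairs."""
    return all(start[ins] + delay[ins] <= start[dep] for ins, dep in edges)

# e.g. a 3-cycle load at cycle 1 whose consumer starts at cycle 4 is legal
print(schedule_is_legal({"a": 1, "b": 4}, {"a": 3, "b": 1}, [("a", "b")]))   # True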
Dependence Graph + Timings
Latency-weighted path lengths: a=13, c=12, b=10, e=10, d=9, g=8, f=7, h=5, i=3
• Superscripts on the slide's graph show each node's path length to the end of the computation
• a-b-d-f-h-i is the critical path
• Leaves can be scheduled at any time - no constraints
• Since a has the longest path, schedule it first; then c; then ...
1 Functional Unit; Load or Store: 3 cycles; Multiply: 2 cycles; Otherwise: 1 cycle
Jim Hogg - UW - CSE - P501
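One way to compute those numbers, as a small sketch: walk from each node toward the end of the block and take the latency-weighted longest path. Here succs and delay are assumed maps, not the lecture's names.

def priorities(nodes, succs, delay):
    """Longest latency-weighted path from each node to the end of the block."""
    prio = {}
    def longest(n):
        if n not in prio:
            prio[n] = delay[n] + max((longest(s) for s in succs.get(n, [])), default=0)
        return prio[n]
    for n in nodes:
        longest(n)
    return prio

print(priorities("abc", {"a": ["b"], "b": ["c"]}, {"a": 3, "b": 1, "c": 3}))
# {'c': 3, 'b': 4, 'a': 7}: a 3-cycle load feeding a 1-cycle add feeding a 3-cycle store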
List Scheduling
• Build a precedence (dependence) graph D
• Compute a priority function over the nodes in D
  • typical: longest latency-weighted path to the end of the block
• Rename registers to remove WAW conflicts
• Create the schedule, one cycle at a time
  • Use a queue of operations that are Ready
  • At each cycle
    • Choose a Ready operation and schedule it
    • Update the Ready queue
Jim Hogg - UW - CSE - P501
List Scheduling Algorithm

cycle = 1                          // clock cycle number
Ready = leaves of D                // ready to be scheduled
Active = { }                       // being executed

while Ready ∪ Active ≠ { } do
  foreach ins ∈ Active do
    if ins.start + ins.delay <= cycle then
      remove ins from Active
      foreach successor suc of ins in D do
        if all of suc's operands are available then
          Ready ∪= { suc }
        endif
      endforeach
    endif
  endforeach
  if Ready ≠ { } then
    remove the highest-priority instruction, ins, from Ready
    ins.start = cycle
    Active ∪= { ins }
  endif
  cycle++
endwhile
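For reference, a minimal runnable version of the algorithm above, assuming one functional unit and that preds, succs, delay, and prio are maps built from D with each neighbour listed once; the names are mine, not the lecture's.

def list_schedule(nodes, preds, succs, delay, prio):
    """Return start[ins] for every instruction, issuing at most one per cycle."""
    start, active = {}, set()
    finished_preds = {n: 0 for n in nodes}
    ready = {n for n in nodes if not preds[n]}          # leaves of D
    cycle = 1
    while ready or active:
        for ins in list(active):                        # retire completed instructions
            if start[ins] + delay[ins] <= cycle:
                active.remove(ins)
                for suc in succs[ins]:
                    finished_preds[suc] += 1
                    if finished_preds[suc] == len(preds[suc]):
                        ready.add(suc)                  # all operands now available
        if ready:
            ins = max(ready, key=prio.get)              # highest-priority Ready op
            ready.remove(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start

Every start time it produces satisfies the ins.start + ins.delay <= dep.start constraint from the earlier slide, so a correct (if conservative) schedule falls out by construction.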
Beyond Basic Blocks
• List scheduling dominates, but moving beyond basic blocks can improve the quality of the code. Possibilities:
  • Schedule extended basic blocks (EBBs)
    • Watch for exit points - they limit reordering or require compensation code
  • Trace scheduling
    • Use profiling information to select regions for scheduling, using traces (paths) through the code
Jim Hogg - UW - CSE - P501