Profile-Based Dynamic Optimization Research for Future Computer Systems
Takanobu Baba
Department of Information Science, Utsunomiya University, Japan
http://aquila.is.utsunomiya-u.ac.jp
November 12, 2004, Seminar@UW-Madison
Brief history of ‘my’ research
• 1970’s: The MPG System, a machine-independent, efficient microprogram generator
• 1980’s: MUNAP, a two-level microprogrammed multiprocessor computer
• 1990’s: A-NET, a language-architecture integrated approach for parallel object-oriented computation
A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
• Massively parallel computation
• Each node consists of a PE and a router.
• Each PE has a language-oriented, typical CISC architecture.
• The programmable router is topology-independent.
[Figure: the A-NET multicomputer]
Current dynamic optimization projects
• Computation-oriented:
  • YAWARA: A meta-level optimizing computer system
  • HAGANE: Binary-level multithreading
• Communication-oriented:
  • Spec-All: Aggressive read/write access speculation method for DSM systems
  • Cross-Line: Adaptive router using dynamic information
YAWARA: A Meta-Level Optimizing Computer System
Background
• Moore’s Law will be sustained by semiconductor technology.
• How can we utilize this huge number of transistors to speed up program execution?
• Our idea is to use some of the chip area for dynamically and autonomously tuning the configuration of an on-chip multiprocessor.
[Figure: the meta-level processor receives a profile of control and data from the base-level processor and returns optimization results; the base-level processor exchanges instructions, data, and computation results with memory]
Design considerations
• HW vs. SW reconfiguration → SW reconfiguration
• Static vs. dynamic reconfiguration → both static and dynamic reconfiguration capability
• Homogeneous vs. heterogeneous architecture → unified homogeneous structure
Basic concepts of thread-level reconfiguration
[Figure: the application runs as computing threads at the base level, while the profiling, management, and optimization functions run as meta-level threads over a shared memory]
MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
Execution model
• Profiling-centric: the Management Thread (MT) activates a Profiling Thread (PT) and Computing Threads (CT). The PT collects the profile and sleeps; when the optimization-initiation condition is satisfied, it wakes up and activates an Optimizing Thread (OT).
• Computing-centric: the Computing Threads themselves collect the profile while the Profiling Thread sleeps, until the optimization-initiation condition is satisfied.
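The profiling-centric mode above can be sketched as a toy loop. The thread roles (MT/PT/OT/CT) follow the slide, but the hotness condition, the profile format, and the "optimize" action below are illustrative assumptions, not the actual YAWARA mechanism:

```python
# Hypothetical sketch of YAWARA's profiling-centric execution model.
# The trigger condition and profile representation are assumptions.

def run_profiling_centric(workload, hot_threshold=3):
    """MT logic: drive the CTs, let the PT count block executions,
    and activate an OT once a block crosses the hotness threshold."""
    profile = {}          # PT state: block -> execution count
    optimized = set()     # blocks already handled by an OT
    log = []
    for block in workload:                         # CTs execute blocks
        profile[block] = profile.get(block, 0) + 1 # PT collects profile
        # optimization-initiation condition satisfied -> activate OT
        if profile[block] >= hot_threshold and block not in optimized:
            optimized.add(block)                   # OT optimizes the block
            log.append(block)
    return log

hot = run_profiling_centric(["a", "b", "a", "a", "c", "b", "b"])
print(hot)  # ['a', 'b'] : the blocks that became hot
```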
Change of configurations by meta-level optimization
[Figure: successive snapshots of the thread-engine grid; as optimization proceeds, the meta-level share of OT/PT engines shrinks and the base-level share of CT engines grows]
The YAWARA System
• an implementation of the computation model
• the SW system consists of static and dynamic optimization systems
• the HW system includes uniformly structured thread engines (TE); each TE can execute base- and meta-level threads
Spirit of YAWARA: "A flexible method prevails where a rigid one fails."
Software System
[Figure: source code (C/C++, Java, Fortran, …) and the execution profile feed the SOS (Static Optimization System), which produces an executable image and code analysis info; the DOS (Dynamic Optimization System) applies dynamic feedback from the run-time profile while the thread engines (TE) execute; execution results and profiles close the static feedback loop]
Hardware System
[Figure: a grid of thread engines (TE) under feedback-directed resource control; each TE holds multiple thread contexts (thread-0 … thread-N), a thread-code cache, a thread-data cache, I$ and D$, a register file, network IN/OUT ports, INT*4 + FP*1 execution units with execution control, and a profiling buffer with a profiling controller]
Example application: compress
Speculative multithreading using a path prediction mechanism.
[Figure: a hot loop with phased behavior between hot paths #0 and #1]
• hot loop / hot path detection (PT, OT)
• speculative multithreading code generation, helper thread generation, and path predictor generation (OT)
• management thread (MT)
• speculative multithreading profiling (PT)
Conclusion -YAWARA-
• We proposed an autonomous reconfiguration mechanism based on dynamic behavior.
• We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
• We are now developing the software system and the simulator.
YAWARA@PDCS2004
Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
Occurrence ratios of the top-two paths

benchmark/function    #1 path   #2 path   other paths
compress/compress     54.5%     22.4%     23.1%
ijpeg/forward_DCT     42.1%     48.2%     9.7%
m88ksim/killtime      97.0%     3.0%      0.0%
li/sweep              80.7%     19.3%     0.0%

The top two paths occupy 80-100% of execution.
Two-level path prediction
• Introducing two-level branch prediction
• The history register keeps the sequence of #1-path executions (1: #1 path, 0: other paths).
• The counter table counts #1-path executions.
Single Path Predictor (SPP): the history register (e.g. 1101) indexes the counter table (v0 … v15); if the selected counter (here v13) >= the threshold X, predict #1, otherwise predict #2.
Another path predictor: Dual Path Predictor (DPP)
The DPP keeps separate history registers and counter tables for the #1 and #2 paths; e.g. the #1-path history 1101 selects v13 in the #1-path table, and the #2-path history 0010 selects v2 in the #2-path table. If v13 >= v2, predict #1; otherwise predict #2.
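The SPP can be sketched as follows. The 4-bit history and 16-entry counter table come from the slides; the saturating-counter width, the update rule, and the threshold value are assumptions modeled on conventional two-level branch predictors:

```python
# Hedged sketch of the Single Path Predictor (SPP); counter width and
# update policy are assumptions borrowed from two-level branch prediction.

class SinglePathPredictor:
    def __init__(self, hist_bits=4, threshold=2, max_count=3):
        self.hist = 0                          # history of #1-path outcomes
        self.hist_mask = (1 << hist_bits) - 1
        self.table = [0] * (1 << hist_bits)    # counters v0 .. v15
        self.threshold = threshold             # X in the slides
        self.max_count = max_count

    def predict(self):
        # if the indexed counter v_i >= X, predict path #1, else path #2
        return 1 if self.table[self.hist] >= self.threshold else 2

    def update(self, took_path1):
        i = self.hist
        if took_path1:
            self.table[i] = min(self.table[i] + 1, self.max_count)
        else:
            self.table[i] = max(self.table[i] - 1, 0)
        # shift the outcome (1: #1 path, 0: other paths) into the history
        self.hist = ((self.hist << 1) | (1 if took_path1 else 0)) & self.hist_mask

spp = SinglePathPredictor()
for _ in range(8):          # train on a run of #1-path executions
    spp.update(True)
print(spp.predict())        # 1 : the warmed-up predictor chooses path #1
```

The DPP would replace the fixed threshold X with a second history/counter pair for the #2 path and compare the two selected counters directly.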
Single Speculation (SS)
When a thread fails:
• abort the succeeding speculative threads (recovery process),
• execute the non-speculative thread,
• then continue speculative execution.
Speculation failure degrades performance.
Double Speculation (DS)
• Even when the first speculation fails, the secondary choice has a high hit possibility, because the top two paths are dominant:
  • compress/compress (54.5% / 22.4%): expected #2 hit = 49.2%
  • ijpeg/forward_DCT (42.1% / 48.2%): expected #2 hit = 81.3%
  • m88ksim/killtime (97.0% / 3.0%): expected #2 hit = 100%
  • li/sweep (80.7% / 19.3%): expected #2 hit = 100%
Double Speculation (DS)
When the first (#1 path) speculation fails, a secondary speculation on the #2 path is issued after the recovery process. If the secondary speculation succeeds, the performance loss is not so large, and speculative execution continues.
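One plausible reading of the "expected #2 hit" figures is a conditional probability: given that the #1 prediction missed, the chance the #2 path was taken is p2 / (1 - p1). This formula is an assumption on my part; it reproduces the compress, killtime, and sweep figures exactly and comes close for forward_DCT (about 83% versus the slide's 81.3%, a gap presumably due to rounding in the published ratios):

```python
# Plausible arithmetic behind the "expected #2 hit" figures: treat the
# path occurrence ratios as probabilities and condition on a #1 miss.

def expected_hit2(p1, p2):
    """P(#2 path taken | #1 path not taken), with p1, p2 in percent."""
    return 100.0 * p2 / (100.0 - p1)

print(round(expected_hit2(54.5, 22.4), 1))  # 49.2  (compress/compress)
print(round(expected_hit2(97.0, 3.0), 1))   # 100.0 (m88ksim/killtime)
print(round(expected_hit2(80.7, 19.3), 1))  # 100.0 (li/sweep)
```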
Evaluation flow
• hot-path detection (SIMCA)
• thread-code generation:
  • #1 path speculative thread
  • #2 path speculative thread
  • non-speculative thread
• path history acquisition (SIMCA) → path execution history
• performance estimator → speculation hit ratio, speed-up ratio
Prediction success ratio
[Figure: success ratio (%) vs. history length (1-16) for compress and forward_DCT]
Prediction success ratio
[Figure: success ratio (%) vs. history length (1-16) for killtime and sweep]
Speed-up ratio
[Figure: speed-up ratio vs. history length (1-16) for compress (scale up to 2.0) and forward_DCT (scale up to 4.0), with "S 100" and "P1 only" reference configurations]
Speed-up ratio
[Figure: speed-up ratio vs. history length (1-16) for killtime and sweep (scales up to 3.0), with "S 100" and "P1 only" reference configurations]
Conclusions -Two-Path-Limited Speculative Multithreading-
• We proposed a path prediction method with predictors, and speculation methods for path-based speculative multithreading.
• Preliminary performance estimation results were shown.
Current and future work
• Accurate and detailed evaluation on various applications (SPEC 2000, MediaBench, …)
• Integration into our dynamic optimization framework, YAWARA
HAGANE: Binary-Level Multithreading
Background
• Multithreaded programming is not so easy. → automatic multithreading system
• However, source code is not always available. → multithreading at the binary level
Binary Translator & Optimizer System
[Figure: the source binary code, execution profile, and analysis info feed the STO (Static Translator & Optimizer), which produces statically translated multithreaded binary code; the DTO (Dynamic Translator & Optimizer) translates the process memory image at run time using execution profile info; both outputs run on the multithread processor]
Thread Pipelining Model
• Loop iterations are mapped onto threads.
• Each thread i executes four stages in order: Continuation, TSAG (Target Store Address Generation), Computation, and Write-back; successive threads i+1, i+2, … overlap these stages in a pipelined fashion.
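As a rough illustration (not HAGANE's actual timing model), the pipelined overlap can be quantified with assumed stage latencies, serializing only on the Continuation stage that spawns the next thread:

```python
# Illustrative timing model for thread pipelining. The stage latencies and
# the rule "thread i+1 may start once thread i's Continuation stage is done"
# are simplifying assumptions for the sketch.

STAGES = [("Continuation", 1), ("TSAG", 2), ("Computation", 6), ("Write-back", 1)]

def pipelined_time(n_threads):
    start = 0
    finish = 0
    for _ in range(n_threads):
        t = start
        for name, latency in STAGES:
            t += latency
            if name == "Continuation":
                start = t    # successor thread may begin here
        finish = t
    return finish

serial = 4 * sum(lat for _, lat in STAGES)   # 4 iterations back to back
print(pipelined_time(4), serial)             # 13 40
```

With these assumed latencies, four iterations finish in 13 cycles instead of 40, since only the 1-cycle Continuation stage serializes successive threads.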
Example translation

Source binary code:
    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
$BB1:
    l.s   $f0,0($a0[4])
    l.s   $f2,0($a1[5])
    mul.s $f0,$f0,$f2
    addiu $v1[3],$v1[3],1
    add.s $f4,$f4,$f0
    slti  $v0[2],$v1[3],5000
    addiu $a1[5],$a1[5],4
    addiu $a0[4],$a0[4],4
    bne   $v0[2],$zero[0],$BB1
$BB2:
    mov.s $f0,$f4
    jr    $ra[31]

Translated code (Continuation, TSAG, Computation, and Write-back stages):
    mtc1  $zero[0],$f4
    addu  $v1[3],$zero[0],$zero[0]
    bstr
    slti  $v0[2],$v1[3],5000
    beq   $v0[2],$zero[0],$ST_LL0
    addu  $t0[8],$a0[4],$zero[0]
    addu  $t1[9],$a1[5],$zero[0]
    addi  $v1[3],$v1[3],1
    addi  $a0[4],$a0[4],4
    addi  $a1[5],$a1[5],4
    lfrk
    wtsagd
    addu  $t2[10],$sp[28],$zero[0]
    altsw $t2[10]
    tsagd
    l.s   $f0,0($t0[8])
    l.s   $f2,0($t1[9])
    l.s   $f4,0($t2[10])
    mul.s $f0,$f0,$f2
    add.s $f4,$f4,$f0
    sttsw $t2[10],$f4
$ST_LL0:
    estr
    mov.s $f0,$f4
    jr    $ra[31]

The translation adds thread management instructions and overhead code for multithreading.
Superthreaded Architecture
[Figure: multiple thread processing units share an L1 instruction cache and an L1 data cache; each unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit]
m88ksim (SPECint95)
• poor speedup ratios
• loop unrolling does not affect the performance
• the number of iterations is quite small
ijpeg (SPECint95)
• the thread code size is too small to hide the thread management overhead
• loop unrolling is effective to achieve good speedup ratios
• excessive loop unrolling causes performance degradation
• the number of iterations is not so large
swim (SPECfp95)
• good speedup ratios
• loop unrolling is effective to achieve linear speedup
• the number of iterations is large
Conclusion -HAGANE-
• We have evaluated binary-level multithreading using several SPEC95 benchmark programs.
• The performance evaluation results indicate:
  • the thread code size should be large enough to improve the performance;
  • loop unrolling is effective for small loop bodies;
  • excessive loop unrolling degrades performance.
HAGANE@PDCS2004
A Methodology of Binary-Level Variable Analysis for Multithreading
Background and Objective
Loop iterations are usually interrelated through memory variables, such as induction variables. However, it is difficult to analyze this kind of dependency at the binary level. A binary-level variable analysis method is therefore strongly required for binary-level multithreading.
Example binary code

Source:
    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }

Compiled binary (note the memory references -4($v1[3]) and 0($a0[4]), which access a[i-1] and a[i]):
    lw    $a1[5], 16($s8[30])
    lw    $v1[3], 16($s8[30])
    lw    $a0[4], 16($s8[30])
    sll   $v1[3], $v1[3], 0x2
    addu  $v1[3], $v1[3], $a2[6]
    lw    $v0[2], 16($s8[30])
    lw    $v1[3], -4($v1[3])
    addiu $v0[2], $v0[2], 1
    sw    $v0[2], 16($s8[30])
    lw    $v0[2], 16($s8[30])
    sll   $a1[5], $a1[5], 0x1
    sll   $a0[4], $a0[4], 0x2
    sll   $v0[2], $v1[3], 0x1
    addu  $v0[2], $v0[2], $v1[3]
    lw    $v1[3], 16($s8[30])
    addu  $a0[4], $a0[4], $a2[6]
    addu  $a1[5], $a1[5], $v0[2]
    sw    $a1[5], 0($a0[4])
    slt   $v1[3], $v1[3], $a3[7]
Binary-Level Variable Analysis
(1) Register values are analyzed using dataflow trees.
(2) When register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
(3) Using the virtual registers, steps (1) and (2) are repeated.
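Step (2) can be sketched as follows. The address representation (a normalized base expression plus a constant offset) and the grouping rule are assumptions for illustration; the real analysis compares dataflow trees:

```python
# Minimal sketch of step (2): memory references whose normalized address
# expressions match are mapped to the same virtual register.

def assign_virtual_registers(refs):
    """refs: list of (base_expression, offset) address pairs.
    Returns one virtual-register name per reference."""
    vregs = {}
    names = []
    for base, offset in refs:
        key = (base, offset)
        if key not in vregs:
            vregs[key] = f"vr{len(vregs)}"   # a new memory location found
        names.append(vregs[key])
    return names

# e.g. two references to 0($29#1) alias, a reference to 16($s8#0) does not:
refs = [("$29#1", 0), ("$29#1", 0), ("$s8#0", 16)]
print(assign_virtual_registers(refs))  # ['vr0', 'vr0', 'vr1']
```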
Construction of the dataflow tree
    addiu $29#1, $29#0, -8
    sw    $0, 0($29#1)
    addu  $5#1, $0, $0
    lw    $2#1, 0($29#1)
    addu  $3#1, $5#1, $4#0
    addiu $5#2, $5#1, 1
    addu  $2#2, $2#1, $3#1
    sw    $2#2, 0($29#1)
    slti  $2#3, $5#2, 100
    bne   $2#3, $0, L1
Example normalization
[Figure: the dataflow tree for $2#2 is normalized to the expression 14 + $4#0 * 4]
Detection of Loop Induction Variables
A loop induction variable is a register that
• has an inter-iteration dependency, and
• increases by a fixed value between iterations.
The concept of the virtual register makes it possible to detect induction variables held in memory.
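The two conditions above suggest a simple check. This sketch tests value traces rather than dataflow trees, which is an assumption made to keep the example small; the actual method works symbolically on the normalized trees:

```python
# Hedged sketch of induction-variable detection: a (virtual) register is
# an induction variable if its per-iteration values differ by a constant
# stride, i.e. it carries an inter-iteration dependency with fixed growth.

def is_induction(values):
    if len(values) < 3:
        return False        # need at least two deltas to confirm a stride
    stride = values[1] - values[0]
    return all(b - a == stride for a, b in zip(values, values[1:]))

print(is_induction([16, 20, 24, 28]))  # True : fixed stride of 4 bytes
print(is_induction([1, 2, 4, 8]))      # False: the stride is not constant
```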
Application
• 101.tomcatv of the SPECfp95 benchmark
• Fortran-to-C translator ver. 19940927
• GCC cross compiler ver. 2.7.2.3 for SIMCA
• Data set: test
• The six innermost loops (#1-#6) are selected; they have induction variables on memory.