Profile-Based Dynamic Optimization Research for Future Computer Systems
Takanobu Baba
Department of Information Science, Utsunomiya University, Japan
http://aquila.is.utsunomiya-u.ac.jp
November 12, 2004
Seminar@UW-Madison
Brief history of ‘my’ research
• 1970’s: The MPG System: A Machine-Independent Efficient Microprogram Generator
• 1980’s: MUNAP: A Two-Level Microprogrammed Multiprocessor Computer
• 1990’s: A-NET: A Language-Architecture Integrated Approach for Parallel Object-Oriented Computation
A Two-Level Microprogrammed Multiprocessor Computer: MUNAP
A 28-bit vertical microinstruction activates up to 4 nanoprograms in 4 PUs every machine cycle.
A Parallel Object-Oriented Total Architecture: A-NET (Actors-NETwork)
• Massively parallel computation
• Each node consists of a PE and a router.
• The PE has a language-oriented, typical CISC architecture.
• The programmable router is topology-independent.
[Figure: A-NET Multicomputer]
Current dynamic optimization projects
• Computation-oriented:
  • YAWARA: A meta-level optimizing computer system
  • HAGANE: Binary-level multithreading
• Communication-oriented:
  • Spec-All: Aggressive Read/Write Access Speculation Method for DSM Systems
  • Cross-Line: Adaptive Router Using Dynamic Information
YAWARA: A Meta-Level Optimizing Computer System
Background
• Moore’s Law will be maintained by semiconductor technology.
• How can we utilize the huge number of transistors to speed up program execution?
• Our idea is to use some chip area for dynamically and autonomously tuning the configuration of an on-chip multiprocessor.
[Figure: a base-level processor and a meta-level processor, each connected to memory. Labeled flows: instructions and data, results of computation, profile of control and data, results of optimization.]
Design considerations
• HW vs. SW reconfiguration → SW reconfiguration
• Static vs. dynamic reconfiguration → both static and dynamic reconfiguration capability
• Homogeneous vs. heterogeneous architecture → unified homogeneous structure
Basic concepts of thread-level reconfiguration
[Figure: meta-level threads (management, profiling, optimization) and base-level application threads, all sharing memory.]
MT: Management Thread, PT: Profiling Thread, OT: Optimizing Thread, CT: Computing Thread
Execution model
[Figure: the Management Thread (MT) activates the Profiling Thread (PT), the Computing Thread (CT), and the Optimizing Thread (OT). Two modes are shown, profiling-centric and computing-centric. Labeled transitions: sleep, wake up, collect profile, and activation of the OT when the optimization initiation condition is satisfied.]
Change of configurations by meta-level optimization
[Figure: several grids of thread roles (MT, OT, PT, CT) at the meta level and base level, ranging from OT/PT-heavy configurations to CT-heavy configurations.]
The YAWARA System
• an implementation of the computation model
• the SW system consists of static and dynamic optimization systems
• the HW system includes uniformly structured thread engines (TE); each TE can execute base- and meta-level threads
Spirit of YAWARA: "A flexible method prevails where a rigid one fails."
Software System
[Figure: source code (C/C++, Java, Fortran, …) enters the SOS (Static Optimization System), which exchanges code analysis info with the DOS (Dynamic Optimization System). An executable image runs on an array of Thread Engines (TE), which return execution results. The execution profile gives static feedback; the run-time profile gives dynamic feedback.]
Hardware System
[Figure: an array of thread engines (TE) with I$ and D$, under feedback-directed resource control. Each Thread Engine (TE) contains a register file, a thread-code cache, a thread-data cache, thread contexts (thread-0 to thread-N), execution control with INT*4 + FP*1, network IN and OUT ports to/from the network, a profiling buffer, and a profiling controller.]
Example application: compress
Speculative multithreading using a path prediction mechanism
[Figure: control flow graph of compress showing the hot loop, hot paths #0 and #1, phased behavior, and speculative threads i-1, i, i+1 with path-prediction hits and misses.]
• Base level: speculative multithreading (CT)
• Meta level:
  • hot loop / hot path detection (PT, OT)
  • speculative multithreading code generation, helper threads generation, path predictor generation (OT)
  • speculative multithreading profiling (PT)
  • management thread (MT)
Conclusion: YAWARA
• We proposed an autonomous reconfiguration mechanism based on dynamic behavior.
• We also proposed a software and hardware system, called YAWARA, that implements the reconfiguration efficiently.
• We are now developing the software system and the simulator.
YAWARA@PDCS2004
Prediction and Execution Methods of Frequently Executed Two Paths for Speculative Multithreading
Occurrence ratios of the top-two paths
[Figure: bar chart with legend entries #1 path, #2 path, and other paths.]
Benchmark/function: occurrence ratios of the two most frequent paths
• compress/compress: 54.5%, 22.4%
• ijpeg/forward_DCT: 42.1%, 48.2%
• m88ksim/killtime: 97.0%, 3.0%
• li/sweep: 80.7%, 19.3%
The top two paths occupy 80-100% of execution.
Two-level path prediction
• Introducing two-level branch prediction
• The history register keeps the sequence of #1 path executions (1: #1, 0: the other paths)
• The counter table counts #1 path executions
Single Path Predictor (SPP)
[Figure: the history register (e.g. 1101) selects an entry of the counter table (v0 ... v15), here v13. If v13 >= X (the threshold), predict #1; otherwise predict #2.]
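The SPP lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide gives the indexing (history register selects a counter) and the decision rule (counter >= X), but not the counter update policy, so the saturating increment/decrement below, the class name, and the parameter names are all hypothetical.

```python
class SinglePathPredictor:
    """Sketch of a two-level path predictor (SPP).

    The history register indexes a table of counters; the prediction is
    #1 when the selected counter reaches the threshold X, otherwise #2.
    """

    def __init__(self, history_bits=4, threshold=2, counter_max=3):
        self.history_bits = history_bits
        self.threshold = threshold        # X on the slide
        self.counter_max = counter_max    # assumption: saturating counters
        self.history = 0                  # bit = 1 for #1 path, 0 for other paths
        self.counters = [0] * (1 << history_bits)

    def predict(self):
        """Return 1 to speculate on the #1 path, 2 for the #2 path."""
        return 1 if self.counters[self.history] >= self.threshold else 2

    def update(self, took_path1):
        """Train the selected counter, then shift the outcome into the history."""
        i = self.history
        if took_path1:
            self.counters[i] = min(self.counters[i] + 1, self.counter_max)
        else:
            # Assumption: the counter decays when another path was executed.
            self.counters[i] = max(self.counters[i] - 1, 0)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(took_path1)) & mask


# Usage: with history 1101 the predictor consults counter v13.
spp = SinglePathPredictor(history_bits=4, threshold=2)
spp.history = 0b1101
spp.counters[13] = 2
print(spp.predict())
```

The history length is a parameter here because the evaluation slides vary it from 1 to 16.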
Another path predictor
Dual Path Predictor (DPP)
[Figure: separate history registers and counter tables for the #1 path and the #2 path. The #1 path history (e.g. 1101) selects v13 in the #1 path counter table; the #2 path history (e.g. 0010) selects v2 in the #2 path counter table. If v13 >= v2, predict #1; otherwise predict #2.]
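A minimal Python sketch of the DPP comparison follows. As with any reading of a slide figure, parts of it are assumptions: the slide shows two history registers, two counter tables, and the rule "predict #1 if the #1 counter >= the #2 counter", while the saturating update and all identifiers below are hypothetical.

```python
class DualPathPredictor:
    """Sketch of a dual path predictor (DPP).

    Each of the two paths has its own history register and counter table.
    The prediction compares the two selected counters instead of using a
    fixed threshold.
    """

    def __init__(self, history_bits=4, counter_max=3):
        self.mask = (1 << history_bits) - 1
        self.counter_max = counter_max          # assumption: saturating counters
        size = 1 << history_bits
        self.hist = {1: 0, 2: 0}                # per-path history registers
        self.table = {1: [0] * size, 2: [0] * size}

    def predict(self):
        """Return 1 if the #1 counter is at least the #2 counter, else 2."""
        c1 = self.table[1][self.hist[1]]
        c2 = self.table[2][self.hist[2]]
        return 1 if c1 >= c2 else 2

    def update(self, path):
        """path is 1 or 2; any other value stands for 'other paths'."""
        for p in (1, 2):
            taken = (path == p)
            i = self.hist[p]
            step = 1 if taken else -1           # assumption: decay on a miss
            new_value = self.table[p][i] + step
            self.table[p][i] = max(0, min(self.counter_max, new_value))
            self.hist[p] = ((self.hist[p] << 1) | int(taken)) & self.mask


# Usage: histories 1101 and 0010 select v13 and v2, as in the figure.
dpp = DualPathPredictor(history_bits=4)
dpp.hist[1], dpp.hist[2] = 0b1101, 0b0010
dpp.table[1][13], dpp.table[2][2] = 2, 3
print(dpp.predict())
```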
Single Speculation (SS)
When a thread fails:
• Abort succeeding threads
• Recovery process: execute a non-speculative thread
• Continue speculative execution
Speculation failure degrades performance.
[Figure: timeline of #1 path speculative threads, a recovery process with non-speculative execution, and resumed speculative execution.]
Double Speculation (DS)
• Even when the 1st speculation fails, the secondary choice has a high probability of success, because the top-two paths are dominant.
Benchmark/function: top-two path ratios; expected #2 hit
• compress/compress: 54.5%, 22.4%; expected #2 hit = 49.2%
• ijpeg/forward_DCT: 42.1%, 48.2%; expected #2 hit = 81.3%
• m88ksim/killtime: 97.0%, 3.0%; expected #2 hit = 100%
• li/sweep: 80.7%, 19.3%; expected #2 hit = 100%
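The expected #2 hit figures are consistent with a simple conditional probability: given that the most frequent path was not taken, how often is the second most frequent path taken? The slide does not state this formula, so the sketch below is an inference; it assumes the larger of the two listed ratios belongs to the #1 path, and the function and variable names are hypothetical.

```python
def expected_second_hit(ratio_a, ratio_b):
    """Percentage of #1-path misses that fall on the #2 path.

    Assumption: the larger ratio is the #1 path, the smaller is the #2 path.
    """
    p1 = max(ratio_a, ratio_b)
    p2 = min(ratio_a, ratio_b)
    return 100.0 * p2 / (100.0 - p1)


# Occurrence ratios (%) of the top-two paths, as listed on the slide.
ratios = {
    "compress/compress": (54.5, 22.4),
    "ijpeg/forward_DCT": (42.1, 48.2),
    "m88ksim/killtime": (97.0, 3.0),
    "li/sweep": (80.7, 19.3),
}

for name, (a, b) in ratios.items():
    print(f"{name}: expected #2 hit = {expected_second_hit(a, b):.1f}%")
```

Under these assumptions the four results round to 49.2%, 81.3%, 100.0%, and 100.0%, matching the values on the slide.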
Double Speculation (DS)
If the secondary speculation succeeds, the performance loss is not large.
[Figure: timeline of #1 path speculative threads; after a failure, the recovery process starts a secondary speculation on the #2 path, then speculative execution continues.]
Evaluation flow
• hot-path detection (SIMCA)
• thread-code generation → thread codes: #1 path speculative thread, #2 path speculative thread, non-speculative thread
• path history acquisition (SIMCA) → path execution history
• performance estimator → speculation hit ratio, speed-up ratio
Prediction success ratio
[Figure: success ratio (%) versus history length (1 to 16) for compress, forward_DCT, killtime, and sweep.]
Speed-up ratio
[Figure: speed-up ratio versus history length (1 to 16) for compress, forward_DCT, killtime, and sweep. The x-axis also carries the labels "S", "100", and "P1 only".]
Conclusions: Two-Path-Limited Speculative Multithreading
• We proposed
  - a path prediction method and predictors
  - speculation methods for path-based speculative multithreading
• Preliminary performance estimation results were shown.
Current and future work
• Accurate and detailed evaluation for various applications: SPEC 2000, MediaBench, …
• Integration into our dynamic optimization framework YAWARA
HAGANE: Binary-Level Multithreading
Background
• Multithread programming is not easy. → Automatic multithreading system
However…
• Source code is not always available. → Multithreading at the binary level
Binary Translator & Optimizer System
[Figure: source binary code enters the STO (Static Translator & Optimizer), which produces statically translated multithreaded binary code. The DTO (Dynamic Translator & Optimizer) produces dynamically translated multithreaded binary code. Other labeled elements: analysis info, execution profile, execution profile info, process memory image, and the multithread processor.]
Thread Pipelining Model
• Loop iterations are mapped onto threads (Thread i, Thread i+1, Thread i+2).
• Each thread passes through four stages: Continuation, TSAG, Computation, Write-back.
TSAG = Target Store Address Generation
Example translation
Source Binary Code:
    mtc1 $zero[0],$f4
    addu $v1[3],$zero[0],$zero[0]
    $BB1:
    l.s $f0,0($a0[4])
    l.s $f2,0($a1[5])
    mul.s $f0,f0,f2
    addiu $v1[3],$v1[3],1
    add.s $f4,$f4,$f0
    slti $v0[2],$v1[3],5000
    addiu $a1[5],$a1[5],4
    addiu $a0[4],$a0[4],4
    bne $v0[2],$zero[0],$BB1
    $BB2:
    mov.s $f0,$f4
    jr $ra[31]
Translated Code:
    mtc1 $zero[0],$f4
    addu $v1[3],$zero[0],$zero[0]
    bstr
    slti $v0[2],$v1[3],5000
    beq $v0[2],$zero[0],$ST_LL0
    addu $t0[8],$a0[4],$zero[0]
    addu $t1[9],$a1[5],$zero[0]
    addi $v1[3],$v1[3],1
    addi $a0[4],$a0[4],4
    addi $a1[5],$a1[5],4
    lfrk
    wtsagd
    addu $t2[10],$sp[28],$zero[0]
    altsw $t2[10]
    tsagd
    l.s $f0,0($t0[8])
    l.s $f2,0($t1[9])
    l.s $f4,0($t2[10])
    mul.s $f0,f0,f2
    add.s $f4,$f4,$f0
    sttsw $t2[10],$f4
    $ST_LL0:
    estr
    mov.s $f0,$f4
    jr $ra[31]
The slide marks the Cont., TSAG, Comp., and W.B. stages in the translated code and highlights two kinds of added code:
・ Thread management instructions
・ Overhead code for multithreading
Superthreaded Architecture
[Figure: multiple thread processing units between a shared L1 instruction cache and a shared L1 data cache. Each thread processing unit contains an execution unit, a communication unit, a memory buffer, and a write-back unit.]
m88ksim (SPECint95)
• poor speedup ratios
• loop unrolling does not affect the performance
• the number of iterations is quite small
ijpeg (SPECint95)
• the thread code size is too small to hide the thread management overhead
• loop unrolling is effective in achieving good speedup ratios
• excessive loop unrolling causes performance degradation
• the number of iterations is not so large
swim (SPECfp95)
• good speedup ratios
• loop unrolling is effective in achieving linear speedup
• the number of iterations is large
Conclusion: HAGANE
• We have evaluated binary-level multithreading using some SPEC95 benchmark programs.
• The performance evaluation results indicate:
  • the thread code size should be large enough to improve the performance.
  • loop unrolling is effective for a small loop body.
  • excessive loop unrolling degrades performance.
HAGANE@PDCS2004
A Methodology of Binary-Level Variable Analysis for Multithreading
Background and Objective
Usually, loop iterations are interrelated through memory variables, such as induction variables. However, this kind of dependency is difficult to analyze at the binary level. A binary-level variable analysis method is therefore strongly required for binary-level multithreading.
Example Binary Code
Source loop:
    for (i = 1; i < N; i++) {
        z = i * 2;
        x = a[i-1];
        y = x * 3;
        a[i] = z + y;
    }
Binary code:
    lw $a1[5], 16($s8[30])
    lw $v1[3], 16($s8[30])
    lw $a0[4], 16($s8[30])
    sll $v1[3], $v1[3], 0x2
    addu $v1[3], $v1[3], $a2[6]
    lw $v0[2], 16($s8[30])
    lw $v1[3], -4($v1[3])
    addiu $v0[2], $v0[2], 1
    sw $v0[2], 16($s8[30])
    lw $v0[2], 16($s8[30])
    sll $a1[5], $a1[5], 0x1
    sll $a0[4], $a0[4], 0x2
    sll $v0[2], $v1[3], 0x1
    addu $v0[2], $v0[2], $v1[3]
    lw $v1[3], 16($s8[30])
    addu $a0[4], $a0[4], $a2[6]
    addu $a1[5], $a1[5], $v0[2]
    sw $a1[5], 0($a0[4])
    slt $v1[3], $v1[3], $a3[7]
Highlighted memory references: -4($v1[3]) and 0($a0[4])
Binary-Level Variable Analysis
(1) Register values are analyzed using data flow trees.
(2) When the register values used for memory references are judged to be the same, the memory location is regarded as a virtual register.
(3) Using the virtual registers, steps (1) and (2) are repeated.
Construction of Dataflow Tree
    addiu $29#1, $29#0, -8
    sw $0, 0($29#1)
    addu $5#1, $0, $0
    lw $2#1, 0($29#1)
    addu $3#1, $5#1, $4#0
    addiu $5#2, $5#1, 1
    addu $2#2, $2#1, $3#1
    sw $2#2, 0($29#1)
    slti $2#3, $5#2, 100
    bne $2#3, $0, L1
Example Normalization
[Figure: normalization of a data flow tree; the recoverable node labels are $2#2, +, 14, *, $4#0, and 4.]
Detection of Loop Induction Variables
A loop induction variable is a register that
• has an inter-iteration dependency, and
• increases by a fixed value between iterations.
The concept of the virtual register makes it possible to detect induction variables in memory.
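The two conditions can be illustrated with a small symbolic evaluator. This is a simplified sketch, not the analysis used in the talk: it replaces data flow trees with linear forms, handles only additions and copies, and treats a memory location such as 16($s8) as one named virtual register. The instruction encoding and all function names are hypothetical.

```python
def linear(sym=None, const=0):
    """A linear form: ({symbol: coefficient}, constant)."""
    return ({sym: 1} if sym else {}, const)


def add(a, b):
    """Add two linear forms, dropping zero coefficients."""
    coeffs = dict(a[0])
    for sym, c in b[0].items():
        coeffs[sym] = coeffs.get(sym, 0) + c
    return ({s: c for s, c in coeffs.items() if c != 0}, a[1] + b[1])


def run_body(body, names):
    """Symbolically execute one loop iteration.

    body holds (dest, src, operand) triples meaning dest = src + operand,
    where operand is an integer immediate or another name. Loads and
    stores through a virtual register are modeled as dest = src + 0.
    """
    env = {n: linear(n) for n in names}  # values at loop entry
    for dest, src, operand in body:
        rhs = env[operand] if isinstance(operand, str) else linear(const=operand)
        env[dest] = add(env[src], rhs)
    return env


def induction_variables(body, names):
    """Names whose value after one iteration is (entry value + nonzero constant)."""
    env = run_body(body, names)
    found = {}
    for n in names:
        coeffs, const = env[n]
        if coeffs == {n: 1} and const != 0:
            found[n] = const
    return found


# Hypothetical loop counter kept in memory: load, increment, store.
example_body = [
    ("$v0", "mem[16($s8)]", 0),   # lw    $v0, 16($s8)
    ("$v0", "$v0", 1),            # addiu $v0, $v0, 1
    ("mem[16($s8)]", "$v0", 0),   # sw    $v0, 16($s8)
]
print(induction_variables(example_body, ["$v0", "mem[16($s8)]"]))
```

In this example the memory location, not the register $v0, is reported as the induction variable: $v0 is reloaded every iteration, so only the virtual register carries the inter-iteration dependency.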
Application
• 101.tomcatv of the SPECfp95 benchmark
• Fortran to C translator ver. 19940927
• GCC cross compiler ver. 2.7.2.3 for SIMCA
• Data set: test
• The six innermost loops (#1-#6) are selected
• They have induction variables in memory