Application of Binary Translation to Java Reconfigurable Architectures
Antonio Carlos S. Beck Filho (caco@inf.ufrgs.br)
Luigi Carro (carro@inf.ufrgs.br)
Instituto de Informática - GME, Universidade Federal do Rio Grande do Sul
Introduction
• The embedded system market is expanding
• More performance is required
Introduction
• Moreover:
• Design cycles are getting shorter
• The complexity of these embedded systems is increasing as well
• The devices are battery dependent
Introduction
• These embedded systems are adopting Java
• Java-enabled devices such as cellular phones and PDAs:
  • 176 million in 2001
  • 721 million in 2006 [1]
• 80% of cellular phones will support Java [2]
• By 2010, there will be 10 times more embedded system developers than general-purpose software developers [3]
[1] D. Takahashi, "Java Chips Make a Comeback", Red Herring, 2001
[2] G. Lawton, "Moving Java into Mobile Phones", Computer, vol. 35, no. 6, 2002, pp. 17-20
[3] R. W. Atherton, "Moving Java to the Factory", IEEE Spectrum, 1998, pp. 18-23
Introduction
• The Java language...
• Object oriented
  • Modeling
  • Programming
  • Validation
• Widespread
• Safe
• Small ROM footprint (CISC)
• Multiplatform
Motivation
• How can performance be increased while keeping power consumption low?
• Using a reconfigurable array!
• Special tools and compilers are needed!
• No software portability!
• And the design cycle?
Outline
• Java processors
• Using Binary Translation with reconfigurable arrays
• The reconfigurable array
• Results
  • Area
  • Performance
  • Power consumption
• Conclusions and Future Work
Femtojava Low-Power
• Five pipeline stages: Instruction Fetch, Decoder, Operand Fetch, Execution, Write Back
• Instruction Fetch: has a 9-byte instruction queue to handle variable-size instructions (example on the slide: IADD)
• Decoder: responsible for generating the microOPs and for checking data dependences (IADD is decoded into 11011…)
• Operand Fetch: has a register bank with two ports; the stack and the local variable storage are implemented in this register file, which allows comparisons with RISC machines (example on the slide: 4 and 2 are popped from the stack 4, 2, 7, 8, 3, 9)
• Execution: six functional units: multiplier, ALU, shifter, constant generator, branch and LD/ST (example on the slide: 4 + 2 = 6)
• Write Back: writes the results back to the stack or to the local variable storage (example on the slide: the stack becomes 6, 7, 8, 3, 9)
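The IADD walk-through on these slides can be mimicked with a toy operand stack. The following is only a sketch in Python, not a model of the Femtojava microarchitecture: the function name `iadd` and the convention that index 0 holds the top of stack are assumptions chosen to match the order in which the slide lists the values.

```python
def iadd(stack):
    """Toy model of IADD: read the two top operands and push their sum.

    `stack` is a list whose element 0 is the top of stack (an assumption
    made only to match the order of the values shown on the slide).
    """
    a, b = stack[0], stack[1]      # operand fetch: read the two top entries
    result = a + b                 # execution: the ALU adds them
    return [result] + stack[2:]    # write back: the sum becomes the new top


before = [4, 2, 7, 8, 3, 9]
after = iadd(before)
print(after)  # [6, 7, 8, 3, 9]
```

The three commented steps loosely correspond to the Operand Fetch, Execution and Write Back stages described above.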
VLIW Architecture
• Up to 2 instructions per VLIW packet (Instruction 1 and Instruction 2)
• The VLIW packet has a variable size: in this case, it can hold 1 or 2 instructions
• Decoder: one decoder per instruction (Decoder 1 and Decoder 2); Decoder 2 does not support method calls and returns
• Operand Fetch: each flow has its own operand stack (Register Bank 1 and Register Bank 2), and the local variable pool of the method is shared, so no mechanism is needed for communication among the flows
• Execution: the six functional units (multiplier, ALU, shifter, constant generator, branch and LD/ST) are replicated in each flow
• Write Back: the results are written back to the operand stack of each flow or to the local variable storage of the 1st register bank
Why use a reconfigurable array?
• Hypothesis: replacing a sequence of instructions with a combinational circuit saves power (at the cost of area)
• Let us look at the multiplication algorithm as an example
• TCalg = n * (TPFF + n * T + Tset)
• TCCC = n * n * T (very pessimistic)
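The two expressions above can be compared numerically. The slide does not define its symbols, so the sketch below assumes that n is the operand width in bits, TPFF a flip-flop propagation delay, T the delay of one combinational stage and Tset a flip-flop setup time. The function names and the numeric values are hypothetical and in arbitrary time units.

```python
def sequential_time(n, t_pff, t_stage, t_set):
    """TCalg = n * (TPFF + n*T + Tset), as written on the slide."""
    return n * (t_pff + n * t_stage + t_set)


def combinational_time(n, t_stage):
    """TCCC = n * n * T, the slide's pessimistic combinational estimate."""
    return n * n * t_stage


# Hypothetical values, chosen only to illustrate the comparison.
n = 16
t_pff, t_stage, t_set = 2.0, 1.0, 1.0

alg = sequential_time(n, t_pff, t_stage, t_set)
cc = combinational_time(n, t_stage)
print(alg, cc, alg - cc)  # 304.0 256.0 48.0
```

Whatever values are chosen, the difference between the two expressions is n * (TPFF + Tset), which is the register overhead paid on every iteration of the sequential version.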
The Binary Translation
• BT: takes binary code for one machine and produces binary code for a different machine
• BT advantages when used with reconfiguration:
  • Parallelism can be detected and the array reconfigured at run time
  • No need for special tools or compilers anymore!
  • The software compatibility problem is solved
The Binary Translation
• How does it work?
  • Observe the bytecodes, looking for frequently executed sequences
  • Save each such sequence in a special cache
  • When this sequence of instructions is found again, the array is reconfigured and set as the active functional unit
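The three steps above can be sketched as a small software model. This is a hedged illustration in Python, not the hardware detector described in the talk: the class name `SequenceDetector`, the `threshold` parameter and the returned strings are all hypothetical, and the slide does not say how often a sequence must be seen before it counts as frequent.

```python
from collections import Counter


class SequenceDetector:
    """Toy model: count sequences, cache the frequent ones, reuse on a hit."""

    def __init__(self, threshold=2):
        self.threshold = threshold   # hypothetical "frequently executed" cutoff
        self.counts = Counter()      # how often each sequence has been observed
        self.cache = {}              # sequence -> configuration id

    def observe(self, sequence):
        key = tuple(sequence)
        if key in self.cache:
            # Found again: the array would be reconfigured and used.
            return "array"
        self.counts[key] += 1
        if self.counts[key] >= self.threshold:
            # Frequent enough: save the sequence in the cache.
            self.cache[key] = len(self.cache)
        return "pipeline"


detector = SequenceDetector(threshold=2)
seq = ["bipush 6", "bipush 7", "imul"]
results = [detector.observe(seq) for _ in range(3)]
print(results)  # ['pipeline', 'pipeline', 'array']
```

In this model the sequence runs on the normal pipeline until it has been cached, and only later occurrences are dispatched to the array.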
Bytecodes Detection
• Considering these bytecodes:
  bipush 10
  bipush 5
  imul
  bipush 3
  bipush 4
  ishl
  iadd
  istore
  bipush 6
  bipush 7
  imul
• The instructions depend on each other!
• These two blocks are independent!
• Operand Block 1: First Sequence
• Operand Block 2: Second Sequence
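One plausible way to find independent blocks in the bytecode sequence above is to track the operand stack depth and close a block whenever the stack becomes empty again. This rule is an assumption made for illustration, since the slides do not spell out the detection criterion. The function name `split_operand_blocks` is hypothetical, and the table covers only the opcodes that appear on the slide.

```python
# Net change in operand-stack depth for each bytecode used on the slide.
STACK_EFFECT = {"bipush": +1, "imul": -1, "ishl": -1, "iadd": -1, "istore": -1}


def split_operand_blocks(bytecodes):
    """Split a bytecode list at the points where the operand stack is empty."""
    blocks, current, depth = [], [], 0
    for instr in bytecodes:
        opcode = instr.split()[0]
        depth += STACK_EFFECT[opcode]
        current.append(instr)
        if depth == 0:               # stack empty again: the block is complete
            blocks.append(current)
            current = []
    if current:                      # trailing instructions form a last block
        blocks.append(current)
    return blocks


program = ["bipush 10", "bipush 5", "imul", "bipush 3", "bipush 4",
           "ishl", "iadd", "istore", "bipush 6", "bipush 7", "imul"]
blocks = split_operand_blocks(program)
for block in blocks:
    print(block)
```

Under this rule the first block runs from `bipush 10` to `istore`, and the remaining three instructions form a second block that never reads anything the first block left on the stack.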
The Reconfigurable Array
• The array is coarse-grained
  • This allows a large number of sequences to be saved in the cache
  • The reconfiguration is fast
• It is formed by one or more basic cells
  • With one multiplier and a sequence of seven sets of basic functional units
General Overview
[Figure: block diagram showing the Detector Unit, the Reconfiguration Cache and the Array]
Power Simulator
• CACO-PS
• Cycle-Accurate COnfigurable Power Simulator
• Based on the switching activity
• Pd = α · fc · C · Vdd²
• The result is given as the number of gate capacitances that switch
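The dynamic power formula above is easy to evaluate directly. The sketch below is a plain Python transcription of Pd = α · fc · C · Vdd² and is not part of CACO-PS. The function name and all numeric values are hypothetical.

```python
def dynamic_power(alpha, f_clk, capacitance, vdd):
    """Pd = alpha * fc * C * Vdd^2 (switching activity, clock, capacitance, supply)."""
    return alpha * f_clk * capacitance * vdd ** 2


# Hypothetical operating point: 50% activity, 1 MHz clock, 1 pF, 2 V supply.
p_full = dynamic_power(0.5, 1e6, 1e-12, 2.0)
# Same circuit with the supply voltage halved.
p_half = dynamic_power(0.5, 1e6, 1e-12, 1.0)

print(p_full, p_half, p_full / p_half)
```

Because Vdd enters squared, halving the supply voltage cuts the dynamic power to a quarter, while the activity, frequency and capacitance terms scale it linearly.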
Results
• A set of algorithms was executed on the architectures:
  • Sine calculation
  • Sort: Bubble
  • Sort: Select
  • Sort: Quick (10 and 100 elements)
  • Search: Binary
  • Search: Sequential
  • IMDCT (plus three unrolled versions)
  • Floating-point sums emulation
  • Full MP3 player
Performance
[Figure: performance results chart, annotated step by step with the remarks below]
• The same number of different sequences of instructions
• Parallelism exposed by loop unrolling
• No more parallelism available!
• There is room for improvement!
• Compare these two and you can save reconfiguration memory