
Application of Binary Translation to Java Reconfigurable Architectures

Antonio Carlos S. Beck Filho (caco@inf.ufrgs.br) and Luigi Carro (carro@inf.ufrgs.br), Instituto de Informática - GME, Universidade Federal do Rio Grande do Sul.


Presentation Transcript


  1. Application of Binary Translation to Java Reconfigurable Architectures Antonio Carlos S. Beck Filho caco@inf.ufrgs.br Luigi Carro carro@inf.ufrgs.br Instituto de Informática - GME Universidade Federal do Rio Grande do Sul

  2. Introduction 1 • The embedded system market is expanding 1

  3. Introduction 1 • The embedded system market is expanding More performance is required 1

  4. Introduction 1 • Moreover… • Shorter Design cycle • The complexity of these embedded systems is increasing as well • Battery dependent 2

  5. Introduction 1 These embedded systems are adopting Java • Java-enabled devices such as cellular phones and PDAs: • 176 million in 2001 • 721 million in 2006 [1] • 80% of cellular phones will support Java [2] • 10 times more embedded system developers than general-purpose software developers by the year 2010 [3] [1] D. Takahashi, "Java Chips Make a Comeback", Red Herring, 2001. [2] G. Lawton, "Moving Java into Mobile Phones", Computer, vol. 35, no. 6, 2002, pp. 17-20. [3] R.W. Atherton, "Moving Java to the Factory", IEEE Spectrum, 1998, pp. 18-23.

  6. Introduction 1 • The Java Language... • Object Oriented • Modeling • Programming • Validation • Widely spread • Safe • Small ROM memory size (CISC) • Multiplatform 4

  7. Motivation 2 • How to increase the performance with low power consumption? 5

  8. Motivation 2 • How to increase the performance with low power consumption? • Using a reconfigurable array! 5

  9. Motivation 2 • How to increase the performance with low power consumption? • Using a reconfigurable array! Special tools and compilers are needed! 5

  10. Motivation 2 • How to increase the performance with low power consumption? • Using a reconfigurable array! Special tools and compilers are needed! No software portability! And the design cycle? 5

  12. Outline 3 • Java processors • Using Binary Translation with reconfigurable arrays • The reconfigurable array • Results • Area • Performance • Power consumption • Conclusions and Future Work 6

  13. Femtojava Low-Power 4 7

  14. Femtojava Low-Power 4 • Five stages: Instruction Fetch, Decoder, Operand Fetch, Execution, Write Back 8

  15. Femtojava Low-Power 4 IADD Instruction Fetch Operand Fetch Write Back Decoder Execution • An instruction queue 9 bytes long handles variable-size instructions 8

  16. Femtojava Low-Power 4 IADD 11011… Instruction Fetch Operand Fetch Write Back Decoder Execution • Responsible for the generation of the microOPs and for checking data dependence 8

  17. Femtojava Low-Power 4 4 4 POP Top of Stack 2 2 7 8 3 9 Instruction Fetch Operand Fetch Write Back Decoder Execution • It has a register bank with two ports • Stack and local variable storage implemented in this register file 8

  18. Femtojava Low-Power 4 4 4 POP Top of Stack 2 2 7 8 3 9 Instruction Fetch Operand Fetch Write Back Decoder Execution • It has a register bank with two ports • Stack and local variable storage implemented in this register file Allows comparisons with RISC machines! 8

  19. Femtojava Low-Power 4 4 + 2 = 6 Instruction Fetch Operand Fetch Write Back Decoder Execution • Six functional units: multiplier, ALU, shifter, constant generator, branch and LD/ST 8

  20. Femtojava Low-Power 4 6 Top of Stack 7 8 3 9 Instruction Fetch Operand Fetch Decoder Execution Write Back • Write the results back to the stack or local variable storage 8

  21. VLIW Architecture 5 • 2 instructions/VLIW packet: Instruction 2 Instruction 1 Instruction Fetch Operand Fetch Write Back Decoder Execution • The VLIW packet has a variable size • In this case, the VLIW packet can hold 1 or 2 instructions 9

  22. VLIW Architecture 5 Instruction 1 11011… Decoder 1 Instruction Fetch Operand Fetch Write Back Execution Decoder 2 Instruction 2 11011… • Decoder 2 does not support method calls and returns 9

  23. VLIW Architecture 5 Register Bank 2 4 OperandStack 2 7 Register Bank 1 OperandStack 8 6 Local Variable Pool 3 1 9 Instruction Fetch Operand Fetch Write Back Decoder Execution • Each flow has its own operand stack • The local variable pool of the method is shared No mechanism is necessary for communication among the flows! 9

  24. VLIW Architecture 5 Instruction Fetch Operand Fetch Write Back Decoder Execution • Six functional units: multiplier, ALU, shifter, constant generator, branch and LD/ST • They are replicated in each flow 9

  25. VLIW Architecture 5 Instruction Fetch Operand Fetch Decoder Execution Write Back • Write the results back to the operand stack of each flow OR to local variable storage of the 1st register bank 9
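The two-flow organization described on the VLIW slides can be sketched as a small software model. This is only an illustration under stated assumptions, not the Femtojava hardware: the `Flow` class, the `run_packets` helper, the number of local variables, and the handful of supported bytecodes are all hypothetical. It shows each flow keeping its own operand stack while both flows read and write one shared local variable pool.

```python
class Flow:
    """One VLIW flow: a private operand stack plus a reference to shared locals."""

    def __init__(self, shared_locals):
        self.stack = []               # operand stack private to this flow
        self.locals = shared_locals   # local variable pool shared by all flows

    def execute(self, instr):
        op, *args = instr.split()
        if op == "bipush":
            self.stack.append(int(args[0]))
        elif op == "iadd":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a + b)
        elif op == "imul":
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a * b)
        elif op == "istore":
            self.locals[int(args[0])] = self.stack.pop()
        elif op == "iload":
            self.stack.append(self.locals[int(args[0])])
        else:
            raise ValueError(f"unsupported bytecode: {instr}")


def run_packets(packets, num_locals=4):
    """Run packets of 1 or 2 instructions; slot i of a packet goes to flow i."""
    shared_locals = [0] * num_locals  # assumed pool size, for illustration only
    flows = [Flow(shared_locals), Flow(shared_locals)]
    for packet in packets:
        for flow, instr in zip(flows, packet):
            flow.execute(instr)
    return shared_locals, [flow.stack for flow in flows]


packets = [
    ("bipush 4", "bipush 3"),
    ("bipush 2", "bipush 5"),
    ("iadd", "imul"),
    ("istore 0", "istore 1"),
    ("iload 1",),                     # flow 1 reads a value produced by flow 2
]
locals_after, stacks_after = run_packets(packets)
```

In this model the only way a value moves from one flow to the other is through the shared local variables, which mirrors the slide's remark that no extra communication mechanism is needed.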

  26. Why use a reconfigurable array? • Hypothesis: replacing a sequence of instructions with a combinational circuit saves power (at the cost of area) • Let us look at the multiplication algorithm example • TCalg = n*(TPFF+n*T+Tset) • TCCC = n*n*T (very pessimistic)
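The comparison on this slide can be made explicit by subtracting the two expressions. The symbol meanings below are assumptions, since the transcript does not define them: n is taken as the operand width in bits, T as one gate (adder-stage) delay, TPFF as the flip-flop propagation time, and Tset as the flip-flop setup time.

```latex
\begin{align*}
T_{Calg} &= n\,(T_{PFF} + n\,T + T_{set}) \\
T_{CCC}  &= n \cdot n \cdot T \\
T_{Calg} - T_{CCC} &= n\,(T_{PFF} + T_{set}) \;\ge\; 0
\end{align*}
```

So even with the pessimistic n·n·T estimate, the combinational circuit is never slower than the sequential algorithm: it avoids the per-iteration register overhead n·(TPFF + Tset).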

  27. The Binary Translation 6 • BT: take binary code and produce another binary for a different machine • BT advantages when used with reconfiguration: • Parallelism can be detected and the array reconfigured at run time • No need for special tools or compilers anymore! • It solves the software-compatibility problem 10

  28. The Binary Translation 6 • How does it work? • Observe the bytecodes, looking for frequently executed sequences • Save each such sequence in a special cache • When the sequence is found again, the array is reconfigured and set as the active functional unit 10
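The observe / save / reuse steps above can be sketched in software as follows. This is a minimal model, not the hardware detector from the presentation: the `SequenceDetector` class, the `threshold` parameter (how many executions count as "frequent"), and the `config-N` placeholder strings are assumptions made for illustration.

```python
from collections import Counter


class SequenceDetector:
    """Hypothetical model of the detect / cache / reconfigure loop."""

    def __init__(self, threshold=2):
        self.threshold = threshold  # assumed: executions needed before caching
        self.counts = Counter()     # how often each sequence has been observed
        self.cache = {}             # sequence -> stored array configuration

    def observe(self, sequence):
        """Return which unit would execute this occurrence of the sequence."""
        key = tuple(sequence)
        if key in self.cache:
            # Sequence seen before and cached: the array is configured and used.
            return "array"
        self.counts[key] += 1
        if self.counts[key] >= self.threshold:
            # Frequent enough: save a configuration for the next occurrence.
            self.cache[key] = f"config-{len(self.cache)}"
        return "pipeline"


detector = SequenceDetector(threshold=2)
hot = ["bipush 10", "bipush 5", "imul"]
trace = [detector.observe(hot) for _ in range(3)]
```

With a threshold of 2, the first two occurrences run on the normal pipeline and the third is served by the array, since the configuration is stored at the end of the second occurrence.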

  29. Bytecodes Detection 7 Bipush 10 Bipush 5 Imul Bipush 3 Bipush 4 Ishl Iadd Istore Bipush 6 Bipush 7 imul Considering these bytecodes 11

  33. Bytecodes Detection 7 Bipush 10 Bipush 5 Imul Bipush 3 Bipush 4 Ishl Iadd Istore Bipush 6 Bipush 7 imul The instructions depend on each other! 11

  35. Bytecodes Detection 7 Bipush 10 Bipush 5 Imul Bipush 3 Bipush 4 Ishl Iadd Istore Bipush 6 Bipush 7 imul These two blocks are independent !!! 11

  36. Bytecodes Detection 7 Bipush 10 Bipush 5 Imul Bipush 3 Bipush 4 Ishl Iadd Istore Bipush 6 Bipush 7 imul Operand Block 1 – First Sequence Operand Block 2 – Second Sequence 11
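One way to reproduce the split into two operand blocks is to track the operand-stack depth: bytecodes that consume each other's results stay in one block, and a block closes when the stack is empty again. This criterion is an assumption chosen for illustration, since the slides do not state the detector's actual rule, and `split_operand_blocks` and `STACK_EFFECT` are hypothetical names.

```python
# Net operand-stack effect of each bytecode that appears on the slide.
STACK_EFFECT = {
    "bipush": +1,  # pushes one constant
    "imul": -1,    # pops two, pushes one
    "ishl": -1,    # pops two, pushes one
    "iadd": -1,    # pops two, pushes one
    "istore": -1,  # pops one
}


def split_operand_blocks(bytecodes):
    """Group bytecodes into blocks that end whenever the stack depth returns to 0."""
    blocks, current, depth = [], [], 0
    for instr in bytecodes:
        opcode = instr.split()[0].lower()
        current.append(instr)
        depth += STACK_EFFECT[opcode]
        if depth == 0:
            blocks.append(current)
            current = []
    if current:
        # Trailing instructions whose result is still on the stack.
        blocks.append(current)
    return blocks


program = [
    "Bipush 10", "Bipush 5", "Imul",
    "Bipush 3", "Bipush 4", "Ishl",
    "Iadd", "Istore",
    "Bipush 6", "Bipush 7", "imul",
]
blocks = split_operand_blocks(program)
```

For the slide's bytecodes this gives two blocks: the first eight instructions, which feed each other through the stack up to `Istore`, and the final `Bipush 6`, `Bipush 7`, `imul`, which uses none of the earlier values.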

  37. The Reconfigurable Array 8 • The array is coarse-grained • This allows a large number of sequences to be saved in the cache • The reconfiguration is fast 12

  38. The Reconfigurable Array 8 • The array is coarse-grained • This allows a large number of sequences to be saved in the cache • The reconfiguration is fast • It is formed by one or more basic cells • Each with one multiplier and a sequence of seven sets of basic functional units 13

  39. General Overview 9 Reconfiguration Cache Array . . . Detector Unit 14

  40. Power Simulator 10 • CACO-PS • Cycle-Accurate COnfigurable Power Simulator • Based on the switching activity • Pd = α · fc · C · Vdd² • The result is given as the number of gate capacitances that switch 15
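The dynamic power formula on this slide can be evaluated directly. The sketch below is not CACO-PS itself; the function name, parameter names, and the numeric values are illustrative assumptions, with α taken as the switching activity factor, fc as the clock frequency in Hz, C as the switched capacitance in farads, and Vdd as the supply voltage in volts.

```python
def dynamic_power(alpha, f_clk, capacitance, vdd):
    """Dynamic power Pd = alpha * fc * C * Vdd^2, in watts for SI inputs."""
    return alpha * f_clk * capacitance * vdd ** 2


# Illustrative values only: 50% activity, 1 MHz clock, 1 pF, 2 V supply.
p_full = dynamic_power(alpha=0.5, f_clk=1e6, capacitance=1e-12, vdd=2.0)
# Halving Vdd cuts dynamic power to a quarter, all else being equal.
p_half_vdd = dynamic_power(alpha=0.5, f_clk=1e6, capacitance=1e-12, vdd=1.0)
```

Because fc and Vdd are fixed for a given design point, counting the gate capacitances that switch (the α · C part) is enough to compare architectures, which fits the slide's statement that results are reported as a number of switched gate capacitances.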

  41. Results 11 • A set of algorithms was executed on the architectures • Sine Calculation • Sort – Bubble • Sort – Select • Sort – Quick (10 and 100 elements) • Search – Binary • Search – Sequential • IMDCT (plus three unrolled versions) • Floating Point Sums emulation • Full MP3 player 16

  42. Performance 11 17

  44. Performance 11 The same number of different sequences of instructions 17

  45. Performance 11 Parallelism exposed by loop unrolling 17

  47. Performance 11 No more parallelism available! 17

  49. Performance 11 There is room for improvement! 17

  50. Performance 11 Compare these two and you can save reconfiguration memory 17
