290 likes | 477 Views
Mohamed. M. Saad Mohamed A. Mohamedin & Prof. Binoy Ravindran. Hydra VM: Extracting Parallelization From Legacy Code Using STM . VT-MENA Program Electrical & Computer Engineering Department Virginia Polytechnic Institute and State University. Outline. Motivation & Objectives
E N D
Mohamed. M. Saad Mohamed A. Mohamedin & Prof. BinoyRavindran Hydra VM: Extracting Parallelization From Legacy Code Using STM VT-MENA Program Electrical & Computer Engineering Department Virginia Polytechnic Institute and State University
Outline • Motivation & Objectives • Background • Transactional Memory • Jikes RVM • Program Reconstruction • Architecture • Profiler, Builder & Runtime • Future Work
Motivation • Why Multicores? • Difficult to make single-core clock frequencies even higher • Deeply pipelined circuits • heat problems • speed of light problems • difficult design and verification • large design teams necessary • server farms need expensive air-conditioning
Motivation • No fast CPUs any more, just more cores! • Trend is using multi-core & hyper-threading
Motivation • At 2005, Sun Niagara (8 cores with HT run 32 HWT) • At 2010, Supermicro (48-core AMD Opteron). • Now, Sun make boxes with between 128-512 hardware threads (16 HWT/core, 8 cores/CPU) !! What About Software!!! Are we ready for this HW ?!
Objective • Many applications are designed to use few threads • Legacy systems were designed to run at a single processor • Multi-threading programming is headache for developers (race situations, concurrent access, …) • HydraVM: Java Virtual Machine Prototype based on Jikes RVM and targets utilizing large number of cores through detecting automatically possible parallel portions of code
Background • Transactional Memory • Jikes RVM (Adaptive Online Architecture)
Atomicity • Atomicity: An operation (or set of operations) appears to the rest of the system to occur instantaneously Example (Money Transfer): …… synchronized { from = from - amount to = to + amount } …… …… Example (Money Transfer): …… …… account1.lock() account2.lock() from = from - amount to = to + amount account1.unlock() account2.unlock() …… …… ≈ Y account1 account2 X
Mutual Locks “Classical Approach” • Drawbacks • Deadlock • Livelock • Starvation • Priority Inversion • Non-composable • Cost of managing the lock • Non-scalable on multiprocessors Y A B X
Transactional Memory • Simplifies parallel programming by allowing a group of load and store instructions to execute in an atomicway using additional primitives account1y account2y Y Example (Money Transfer): …… …… START-TRANSACTION from = from - amount to = to + amount END-TRANSACTION …… …… account1 account2 account1x account2x X Commit or Rollback & Retry
Transactional Memory • Each transaction has ReadSet & WriteSet • Transactions conflict if have the same variable(s) at ReadSet / WriteSet • Conflict Resolution using Contention Managerthat employs different policies (Aggressive, Polite, Back-Off, Random, …..) • Aborted code undo changes (if required) and retries again
Transactional Memory • Transactions may be nested (multiple levels) • Inner transaction share the ReadSet/WriteSet of parent • Inner transactions conflicts with each other and with other higher level transactions • Aborting parent transaction forces abort for children • Inner transactions changes are visible to parents once commit successfully, but hidden from outside world till commit of highest level
Transactional Memory • Hardware Transactional Memory Modifications in processors, cache and bus protocol ex; unbounded HTM, TCC, …. • Software Transactional Memory Software runtime library or the programming language support Minimal hardware support; CAS, LL/SC ex; RSTM, DSTM, ESTM, .. • Hybrid Transactional Memory Exploits HTM support to achieve hardware performance for transactions that do not exceed the HTM’s limitations, and STM otherwise ex; LogTM, HyTM, … • Distributed Transactional Memory Extends transaction primitives to distributed environment (network of multiple machines) ex; HyFlow, DecentSTM, GenSTM, …
Jikes RVM • Mature modular open source Java virtual machine designed for research purposes. Unlike most other JVMs it is written in Java! • Adaptive Online System
Program Reconstruction “The Main Idea” • We view program as a set of basic building blocks • Each block is a set of instructions • Block has single entry and multiple exists • Blocks may access the same memory (variables) • It is possible to reconstruct the program from these blocks by rearranging it differently with some changes to the control instructions. • It is even possible to assign each set of blocks to different thread
Example 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl 14: ifle 23 17: iinc 1, 1 20: goto 26 23: iinc 1, -1 26: iinc 2, 1 29: iload_2 30: bipush 12 32: if_icmplt 7 35: return 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl 14: ifle 23 17: iinc 1, 1 20: goto 26 23: iinc 1, -1 26: iinc 2, 1 29: iload_2 30: bipush 12 32: if_icmplt 7 35: return int counter = 0; for(inti=0; i<2; i++) if(Math.random()>0.3) counter++; else counter--;
Example public class Test{ public static void foo(){int counter = 0; for(inti=0; i<12; i++) if(Math.random()>0.3) counter++; else counter--; } public static void zoo(){System.out.println("hi"); } public static void main(String[] args){inti=6; if(i<10)foo(); else zoo(); }}
ArchitectureProfiler • Split code into Basic Block • Inject loaded classes with additional instructions to monitor: • Program Flow (Which Basic Blocks are accessed and in what order?) • Memory accessed by each Basic Block • Which Basic Block is doing I/O ?
ArchitectureProfiler 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 write J write C visit B1 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl read K write K visit B2 14: ifle 23 17: iinc 1, 1 read C write C visit B3 20: goto 26 23: iinc 1, -1 read C write C visit B4 26: iinc 2, 1 read J write J visit B5 29: iload_2 30: bipush 12 visit B6 read J 32: if_icmplt 7 35: return Example: int C = 0; for(int J=0; J<2; J++) if(Math.random()>0.3) C++; else C--; 0: iconst_0 1: istore_1 2: iconst_0 3: istore_2 4: goto 29 7: invokestatic #13; 10: ldc2_w #19; 13: dcmpl 14: ifle 23 17: iinc 1, 1 20: goto 26 23: iinc 1, -1 26: iinc 2, 1 29: iload_2 30: bipush 12 32: if_icmplt 7 35: return
ArchitectureRecompilation • Recompile the Java class bytecode into machine-code • Replace and reload class definition at memory
ArchitectureCode Execution • Running the profiled code • Collecting flow & memory access information and store it at the knowledge repository
ArchitectureBuilder • Analyze knowledge repository information and know: • Which Blocks can be grouped together • Which groups of blocks can be parallelized
ArchitectureBuilder • Program can be represented as a string (each character is a basic block). • Example: for (Integer i = 0; i < DIMx; i++) { for (Integer j = 0; j < DIMx; j++) { for (Integer k = 0; k < DIMy; k++) { C[i][j] += A[i][k] * B[k][j]; } } } abjbhcfefghcfefghijbhcfefghcfefghijk ab(jb(hcfefg)2hi)2jk
ArchitectureBuilder ab(jb(hcfefg)2hi)2k • Externalize common blocks patterns as methods • Generated methods may be nested • Reconstruct the program as producer-consumer pattern • Collector • Provides Executor with suitable blocks as Tasks to execute according to flow up-to time • Executor • Allocates core threads • Assign tasks to threads • Requests Collector for more blocks based on program flow, after all threads complete
ArchitectureBuilder • Problems • Threads may conflict when access the same variables • Threads may finish out of normal order • Collector may generate invalid tasks • Lets represents each Thread as Transaction • When two transactions conflicts abort one that has newer blocks relative to normal execution • Transaction will not commit unless its preceding one in timeline is finished • Transaction timeout if not reachable
ArchitectureCode Execution – revisit • Collects which transactions conflicts and commit rate • We can refine the constructed program • Builder re-organize generated blocks and recompile the code again
Ongoing & Future Work • Complete the implementation of HydraVM • Profiling by monitoring memory instead of generating new instructions • Automatically uses of Java NIO to handle I/O operations and generate callbacks to process it • Using thread scheduling techniques instead of TM • Formal verification of reconstructed programs matches desired semantics
Questions www.hydravm.org msaad@vt.edu