Highly Parallel, Object-Oriented Computer Architecture (also the Jikes RVM and PearColator)
Vienna University of Technology, August 2nd
Dr. Ian Rogers, Research Fellow, The University of Manchester – ian.rogers@manchester.ac.uk
Presentation Outline • The JAMAICA project • What are the problems? • Hardware design solutions • Software solutions • Where we are • The Jikes RVM – behind the scenes • PearColator – a quick overview
Problems • “The job of an engineer is to identify problems and find better ways to solve them”, James Dyson (and, I'm sure, many others) • There are many problems currently in Computer Science and more on the horizon • Problem solving is ad hoc, and many good solutions aren't successful • Let's look at the problems from the bottom up
Problems in Process • Heat and power are problems • Smaller feature sizes (<45nm) lead to problems with process variation, degradation over time and more transient errors • 3D or stacked designs have significant problems • Simulation must be repeated for all possible design and environment variations so that, statistically, a design can be shown to work
Problems in Architecture • Speed of interconnect • Die area is very large • Knowing your market: “tricks” are key to realising performance – especially in the embedded space • Move away from general-purpose design – GPUs, physics processors
Problems in Systems Software • Lag of systems software behind new hardware • How to virtualise systems with minimal cost • Lots of momentum in existing solutions • Problems with natively executable code: • needs to be run in a hardware/software sandbox • no dynamic optimisation of code with libraries and the operating system • cost of hardware to support virtualisation
Problems in Compilers • The hardware target keeps moving • The notion of stacked languages and virtual machines isn't popular • Why aren't we better at instruction selection? • embedded designs have vast amounts of assembler code targeting exotic registers and ISAs • How to parallelize for thousands of contexts • Machine learning?
Problems for Applications • Application writers have an abundance of tools and wisdom to listen to, but the wisdom often conflicts • Application concerns: • performance • maintainability (evolution?) • time to implement • elegance of solution
Problems for Consumers • Cost • Migration of systems • Legacy support
Recap • Process: lots of transistors, lots of problems • Architecture: speed of interconnect, complexity • Systems software: momentum • Compilers: stacking, using new architectures • Applications: lots of tools and wisdom, concerns • Consumers: how much does it cost? What about my current business systems and processes?
Oh dear, it's a horrible problem, what we need is a new programming language • Why? • parallel programming is done badly at the moment • we should teach CS undergraduates this language • we will then inherently get parallel software • Why not? • CSP has already been here [Hoare 1978] • clusters are already solving similar problems using existing languages • it's easy to blame the world's problems on C
Do we need another programming language? • What I think is needed are domain-specific languages, or domain-specific uses of a language: • extracting parallelism from Java implies work that isn't necessary at a mathematical level of abstraction – MatlabP, Mathematica, Fortran 90 • codecs, graphics pipelines, network processors – languages devised here should express just what's necessary to do the job • message passing to avoid use of shared memory
Virtual Machine Abstraction • We don't need another programming language, we need common abstractions • This abstraction will be inherently parallel but not inherently shared memory • Java is a reasonable choice and brings momentum
Architecture • Point-to-point asynchronous interconnect • Simultaneous Multi-Threading to hide latencies (e.g. Sun Niagara, Intel HyperThreading) • Object-Oriented – improve GC, simplify directories • Transactional – remove locks, enable speculation
Object-Oriented Hardware • Proposed in the Mushroom project from Manchester • Recent papers by Wright and Wolczko • Address the cache using object ID and offset [Figure: L1 data cache tagged by (object ID, offset)]
Object-Oriented Hardware • On a cache miss the object ID is translated to the object’s address [Figure: L1 data cache miss triggering translation of the object ID]
Object-Oriented Hardware • We can re-use the TLB • Having a map allows objects to be moved without altering references • Only object headers will contain locks [Figure: object-to-address map backed by the TLB and virtual memory]
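To make the lookup concrete, the following is a minimal Java sketch of the scheme on these three slides; every name in it (ObjectCacheSketch, fetchFromMemory) is invented, and the real mechanism is hardware, not a hash map. The cache is tagged by (object ID, offset), translation through the map happens only on a miss, and moving an object updates only the map, never the references.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of an object-addressed L1 data cache; all names invented. */
class ObjectCacheSketch {
    // Cache lines tagged by (object ID, offset) rather than by address.
    private final Map<Long, Integer> lines = new HashMap<>();
    // Object-to-address map – the structure the TLB is re-used for.
    private final Map<Long, Long> objectToAddress = new HashMap<>();

    int read(long objectId, int offset) {
        long tag = (objectId << 16) | offset;  // pack (ID, offset) into one tag
        Integer cached = lines.get(tag);
        if (cached != null) {
            return cached;                     // hit: no translation needed
        }
        // Miss: only now is the object ID translated to an address.
        long base = objectToAddress.get(objectId);
        int value = fetchFromMemory(base + offset);
        lines.put(tag, value);
        return value;
    }

    /** Moving an object updates the map; references are untouched. */
    void move(long objectId, long newAddress) {
        objectToAddress.put(objectId, newAddress);
    }

    private int fetchFromMemory(long address) {
        return 0;  // stand-in for a real memory access
    }
}
```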
Transactional Hardware • Threads reading the same object can do so regardless of their epoch • (based on [Ananian et al., 2005]) [Figure: two transactions reading the same object ID]
Transactional Hardware • When a later-epoch thread writes to an object, a clone is made [Figure: later-epoch write creating a clone of the object]
Transactional Hardware • If an earlier thread writes to an object, the later threads using that object roll back [Figure: earlier-epoch write rolling back later transactions]
Object-Oriented and Transactional Hardware • Again the TLB can remove some of the cost [Figure: transaction and object ID map alongside the TLB and virtual memory]
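A rough software model of the versioning rule from the last four slides, under invented names (VersionedObject, Txn): reads are shared across epochs, a write works on a private clone, and a write also rolls back any reader from a later epoch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-object transactional versioning; names invented. */
class VersionedObject {
    int[] fields = new int[4];
    final List<Txn> readers = new ArrayList<>();  // who has read this version

    /** Reads are shared: threads of any epoch may read the same version. */
    int read(Txn tx, int i) {
        readers.add(tx);
        return fields[i];
    }

    /** A writer works on a private clone of the object; any reader from a
     *  later epoch than the writer must roll back and re-execute. */
    VersionedObject write(Txn tx, int i, int value) {
        for (Txn r : readers) {
            if (r.epoch > tx.epoch) {
                r.rollback();
            }
        }
        VersionedObject clone = new VersionedObject();
        clone.fields = fields.clone();
        clone.fields[i] = value;
        return clone;
    }
}

class Txn {
    final int epoch;
    Txn(int epoch) { this.epoch = epoch; }
    void rollback() { /* discard speculative state and restart */ }
}
```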
Speculation • Speculative threads are created with predicted input values and are expected not to interact with non-speculative threads • The transaction can complete if we didn’t roll back and the inputs to the thread were as predicted • Can speculate at: • Method calls • Loops
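The commit condition on this slide reduces to a very small test; the sketch below is a direct encoding of it with invented names.

```java
import java.util.Arrays;

/** Sketch of the commit test for a speculative thread; names invented. */
class SpeculativeTask {
    int[] predictedInputs;   // the values the thread was forked with
    int[] actualInputs;      // the values that really reached the fork point
    boolean rolledBack;      // set if a transactional conflict occurred

    /** Commit only if no rollback happened and the prediction held. */
    boolean tryCommit() {
        return !rolledBack && Arrays.equals(predictedInputs, actualInputs);
    }
}
```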
Operating Systems • Supporting both an object-based and a virtual-memory view of a system implies extra controls in our system • Therefore, we want the whole system software stack inside the VM
Where we are • Jikes RVM and JNode based Java operating systems • Open source dynamic binary translator (arguably state-of-the-art performance) • Simulated architecture • Parallelizing JVM, working for do-all loops, new work on speculation and loop pipelining • Lots of work on the other things I've talked about
The Jikes RVM • Overview of the adaptive compilation system: • Methods recompiled based on their predicted future execution time and the time taken to compile • Some optimisation levels are skipped
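The decision can be pictured with the following sketch, assuming invented names and a simple speedup model: pick the level that minimises compile cost plus predicted remaining execution time, which is also what makes skipping levels natural.

```java
/** Sketch of the cost/benefit recompilation choice; names invented. */
class RecompilationPolicySketch {
    /**
     * Returns the best optimisation level for a method, where level 0 means
     * "leave the current code alone". Levels whose compile cost can never
     * be repaid by their speedup are skipped automatically.
     */
    int chooseLevel(double futureTime, double[] compileCost, double[] speedup) {
        int best = 0;
        double bestTotal = futureTime;  // the cost of doing nothing
        for (int level = 1; level < speedup.length; level++) {
            double total = compileCost[level] + futureTime / speedup[level];
            if (total < bestTotal) {
                bestTotal = total;
                best = level;
            }
        }
        return best;
    }
}
```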
The baseline compiler • Used to compile code the first time it’s invoked • Very simple code generation – each bytecode expands to a fixed instruction sequence:
iload_0 → Load t0, [locals+0]; Store [stack+0], t0
iload_1 → Load t0, [locals+4]; Store [stack+4], t0
iadd → Load t0, [stack+0]; Load t1, [stack+4]; Add t0, t0, t1; Store [stack+0], t0
istore_0 → Load t0, [stack+0]; Store [locals+0], t0
The baseline compiler • Pros: • Easy to port – just write emit code for each bytecode • Minimal work needed to port runtime and garbage collector • Cons: • Very slow
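The "write emit code for each bytecode" porting story might look like the sketch below; emit() and the two opcode cases are illustrative stand-ins, not the real Jikes RVM interface.

```java
/** Sketch of a per-bytecode emit loop; all details are illustrative. */
class BaselineEmitSketch {
    void compile(byte[] bytecodes) {
        for (int pc = 0; pc < bytecodes.length; pc++) {
            switch (bytecodes[pc] & 0xff) {
                case 0x1a:  // iload_0: push local 0 onto the expression stack
                    emit("Load t0, [locals+0]");
                    emit("Store [stack+0], t0");
                    break;
                case 0x60:  // iadd: pop two ints and push their sum
                    emit("Load t0, [stack+0]");
                    emit("Load t1, [stack+4]");
                    emit("Add t0, t0, t1");
                    emit("Store [stack+0], t0");
                    break;
                // ... one case per bytecode; a real baseline compiler also
                // tracks the simulated stack depth behind the [stack+n] offsets
            }
        }
    }

    private void emit(String insn) { /* append to the method's code array */ }
}
```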
The boot image • Hijack the view of memory (mapping of objects to addresses) • Compile list of primordial classes • Write view of memory to disk (the boot image) • The boot image runner loads the disk image and branches into the code block for VM.boot
The boot image • Problems: • Differing views between: • Jikes RVM • Classpath • Bootstrap JVM • Fix by writing null to some fields • Jikes RVM runtime needs to keep pace with Classpath
The runtime • M-of-N threading • Thread yields are GC points • Native code can deadlock the VM • JNI written in Java with knowledge of C layout • Classpath interface written in Java
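The second bullet can be pictured with this sketch (names invented): the compiler plants a cheap check at method entries and loop back-edges, and a taken yield is also a well-defined point at which GC can run.

```java
/** Sketch of the yield-point idea; names and placement are illustrative. */
class YieldPointSketch {
    static volatile boolean yieldRequested;  // set by the scheduler or the GC

    static void hotLoop() {
        for (int i = 0; i < 1_000_000; i++) {
            if (yieldRequested) {
                yieldPoint();  // switch Java threads; also a GC-safe point
            }
            // ... loop body ...
        }
    }

    static void yieldPoint() { /* state is well-defined here, so GC may run */ }
}
```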
The optimizing compiler • Structured as compiler phases based on the HIR, LIR and MIR levels from Muchnick • The IR object holds instructions in linked lists in a control flow graph • Each instruction is an object with: • One operator • A variable number of use operands • A variable number of def operands • Support for combined def/use operands • Some operands and operators are virtual
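Encoded directly in Java, the structure described above is roughly the following; all class names are invented.

```java
import java.util.ArrayList;
import java.util.List;

class Operator {}  // exactly one per instruction; some are virtual
class Operand {}   // registers, constants, guards, ...

/** An instruction: one operator, variable numbers of def and use operands. */
class Insn {
    Operator op;
    final List<Operand> defs = new ArrayList<>();
    final List<Operand> uses = new ArrayList<>();
    Insn next, prev;  // linked list of instructions within a basic block
}

/** A node of the control flow graph holding a linked instruction list. */
class Block {
    Insn first, last;
    final List<Block> successors = new ArrayList<>();
}
```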
The optimizing compiler • HIR: • Infinite registers • Operators correspond to bytecodes • SSA phase performed • LIR: • Load/store operators • Java specific operators expanded • GC barrier operators • SSA phase performed • MIR: • Fixed number of registers • Machine operators
The optimizing compiler • Factored control flow graph: • Don’t terminate blocks on Potentially Exceptioning Instructions (PEIs) • Bound check • Null check • Checks define guards which are used by: • Putfield, getfield, array load/store, invokevirtual • Eliminating guards requires propagating their uses
The optimizing compiler • Java – can we capture and benefit from strong type information? • Extended Array SSA: • Single assignment • Array – Fortran style – a float and an int array can’t alias • Extended – different fields and different objects can’t alias • Phi operator – for registers, heaps and exceptions • Pi operator – defines points where knowledge of a variable is exposed, e.g. after A = new int[100], later uses of A can know the array length is 100 (ABCD)
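The Pi-operator example in concrete Java terms (a minimal illustration, not Jikes RVM code): after the allocation the length of the array is a known constant, so ABCD can discharge every bound check in the loop.

```java
class BoundCheckExample {
    static int sum() {
        int[] a = new int[100];  // Pi point: a.length is known to be 100
        int total = 0;
        for (int i = 0; i < 100; i++) {
            total += a[i];  // ABCD proves 0 <= i < a.length, so the
                            // implicit bound check can be eliminated
        }
        return total;
    }
}
```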
The optimizing compiler • HIR: Simplification, tail recursion elimination, estimate execution frequencies, loop unrolling, branch optimizations, (simple) escape analysis, local copy and constant propagation, local common sub-expression elimination • SSA in HIR: load/store elimination, redundant branch elimination, global constant propagation, loop versioning • AOS framework
The optimizing compiler • LIR: Simplification, estimate execution frequencies, basic block reordering, branch optimizations, (simple) escape analysis, local copy and constant propagation, local common sub-expression elimination • SSA in LIR: global code placement, live range splitting • AOS framework
The optimizing compiler • MIR: instruction selection, register allocation, scheduling, simplification, branch optimizations • Fix-ups for runtime
Speculative Optimisations • A JVM often doesn’t have a complete picture, in particular because of dynamic class loading • On-stack replacement allows optimisation to proceed with a get-out clause • On-stack replacement is a virtual Jikes RVM instruction
Applications of on-stack replacement • Safe invalidation for speculative optimisation • Class hierarchy-based inlining • Deferred compilation • Don’t compile uncommon cases • Improves dataflow optimisation and compile time • Debug optimised code via dynamic deoptimisation • At a break-point, deoptimise the activation to recover program state • Runtime optimisation of long-running activities • Promote long-running loops to higher optimisation levels
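As a concrete picture of that last application: the method below spends all of its time in a single activation of one loop, so without on-stack replacement it would run to completion in baseline-compiled code no matter how hot it became.

```java
/** A long-running activation that only on-stack replacement can speed up. */
class OsrCandidate {
    static long spin() {
        long total = 0;
        for (long i = 0; i < 1_000_000_000L; i++) {
            total += i ^ (i >>> 3);  // enough work to trigger promotion
        }
        return total;
    }
}
```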
PearColator • Decoder: • Disassembler • Interpreter (Java threaded) • Translator • Generic components: • Loaders • System calls • Memory
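A highly simplified sketch of the decoder serving its different roles; every name here is hypothetical, and the real translator generates code rather than interpreting.

```java
/** Minimal decode-and-dispatch sketch; all names are hypothetical. */
interface Decoded {
    String disassemble();               // decoder used as a disassembler
    int interpret(int[] regs, int pc);  // decoder driving the interpreter
}

class DecoderSketch {
    void run(int[] guestCode, int[] regs) {
        int pc = 0;
        while (pc >= 0 && pc < guestCode.length) {
            Decoded insn = decode(guestCode[pc]);
            pc = insn.interpret(regs, pc);  // each instruction picks the next pc
        }
    }

    Decoded decode(int word) {
        // Stand-in: a real decoder pattern-matches the guest ISA encoding.
        return new Decoded() {
            public String disassemble() { return "nop"; }
            public int interpret(int[] regs, int pc) { return pc + 1; }
        };
    }
}
```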
Thanks and… any questions?