Titanium: A High Performance Java-Based Language
Katherine Yelick
Alex Aiken, Phillip Colella, David Gay, Susan Graham, Paul Hilfinger, Arvind Krishnamurthy, Ben Liblit, Carleton Miyamoto, Geoff Pike, Luigi Semenzato

Talk Outline
• Motivation
• Extensions for uniprocessor performance
• Extensions for parallelism
• A framework for domain-specific languages
• Status and performance

Programming Challenges on Millennium
• Large scale computations
• Optimized simulation algorithms are complex
• Use of hierarchical parallel machine
• Cost-conscious programming
[Figure labels: minimization algorithms, unstructured meshes, adaptive meshes]

Titanium Approach
• Performance is primary goal
  • High uniprocessor performance
  • Designed for shared and distributed memory
  • Parallelism constructs with programmer control
  • Optimizing compiler for caches, communication scheduling, etc.
• Expressiveness secondary goal
  • Based on safe language: Java
  • Safety simplifies programming and compiler analysis
  • Framework for domain-specific language extensions

New Language Features
• Immutable classes
• Multidimensional arrays
  • also: points and index sets as first-class values
  • multidimensional iterators
• Memory management
  • semi-automated zone-based allocation
• Scalable parallelism
  • SPMD model of execution with global address space
• Language-level synchronization
• Support for grid-based computation

Java Objects
• Primitive scalar types: boolean, double, int, etc.
  • access is fast
• Objects: user-defined and from the standard library
  • accessed through an implicit level of indirection (a pointer)
  • arrays are objects
  • all objects support equality checks and a few other operations
[Figure: the values 3 and true, and an object with fields r: 7.1 and i: 4.3]

Immutable Classes in Titanium
• For small objects, one would sometimes prefer
  • to avoid the level of indirection
  • to pass by value
  • this extends the idea of primitive values (1, 4.2, etc.) to user-defined values
• Titanium introduces immutable classes
  • all fields are (implicitly) final
  • cannot inherit from (extend) or be inherited by other classes
  • must have a 0-argument constructor, e.g., Complex()

    immutable class Complex { ... }
    Complex c = new Complex(7.1, 4.3);

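For comparison, the "no mutation, no subclassing" part of an immutable class can be approximated in plain Java. The following is a minimal sketch under stated assumptions: `ComplexValue` and its methods are hypothetical names, not part of Titanium, and instances remain ordinary heap objects behind a reference, so the indirection that Titanium removes is still present here.

```java
// Hypothetical plain-Java approximation of an immutable value class.
// Unlike a Titanium immutable class, instances are still heap objects
// reached through a reference; only the "no mutation" part is modeled.
final class ComplexValue {
    private final double re;
    private final double im;

    // Zero-argument constructor, mirroring the requirement on the slide.
    ComplexValue() {
        this(0.0, 0.0);
    }

    ComplexValue(double re, double im) {
        this.re = re;
        this.im = im;
    }

    double re() { return re; }
    double im() { return im; }

    // Operations return new values instead of mutating the receiver.
    ComplexValue plus(ComplexValue o) {
        return new ComplexValue(re + o.re, im + o.im);
    }

    ComplexValue times(ComplexValue o) {
        return new ComplexValue(re * o.re - im * o.im,
                                re * o.im + im * o.re);
    }

    // Value equality: two instances are equal when their fields are.
    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ComplexValue)) return false;
        ComplexValue c = (ComplexValue) other;
        return Double.compare(re, c.re) == 0 && Double.compare(im, c.im) == 0;
    }

    @Override
    public int hashCode() {
        return 31 * Double.hashCode(re) + Double.hashCode(im);
    }
}
```

A caller would write `new ComplexValue(7.1, 4.3).plus(other)` and receive a fresh value; nothing can change the fields of an existing instance.
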
Arrays in Java
• Arrays in Java are objects
• Only 1D arrays are directly supported
• Array bounds are checked (as in Fortran)
• Multidimensional arrays built as arrays of arrays are slow and cannot be transformed into contiguous memory

Titanium Arrays
• Fast, expressive arrays
  • multidimensional
  • lower bound, upper bound, stride
  • concise indexing: A[p] instead of A(i, j, k)
• Points
  • tuple of integers as primitive type
• Domains
  • rectangular sets of points (bounds and stride)
  • arbitrary sets of points
• Multidimensional iterators

Example: Point, RectDomain, Array

    Point<2> lb = [1, 1];
    Point<2> ub = [10, 20];
    RectDomain<2> R = [lb : ub : [2, 2]];
    double [2d] A = new double[R];
    …
    foreach (p in A.domain()) {
        A[p] = B[2 * p];
    }

• Standard optimizations:
  • strength reduction
  • common subexpression elimination
  • invariant code motion
  • removing bounds checks from body

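To make the lower bound, upper bound, and stride idea concrete, here is a rough plain-Java sketch of a 2D strided index set mapped onto one contiguous `double[]`. The class `Grid2D` and its methods are hypothetical, and the sketch assumes the upper bound is inclusive; it illustrates the indexing arithmetic only and says nothing about how the Titanium compiler actually lays out arrays.

```java
// Hypothetical sketch: a 2-D strided rectangular index set mapped onto one
// contiguous double[]; bounds are treated as inclusive (an assumption).
final class Grid2D {
    final int lo0, lo1, hi0, hi1, st0, st1;  // lower bound, upper bound, stride
    final int n0, n1;                        // number of points per dimension
    private final double[] data;             // contiguous, row-major storage

    Grid2D(int lo0, int lo1, int hi0, int hi1, int st0, int st1) {
        if (st0 <= 0 || st1 <= 0 || hi0 < lo0 || hi1 < lo1) {
            throw new IllegalArgumentException("empty or invalid domain");
        }
        this.lo0 = lo0; this.lo1 = lo1;
        this.hi0 = hi0; this.hi1 = hi1;
        this.st0 = st0; this.st1 = st1;
        this.n0 = (hi0 - lo0) / st0 + 1;
        this.n1 = (hi1 - lo1) / st1 + 1;
        this.data = new double[n0 * n1];
    }

    int size() { return n0 * n1; }

    // Map a point (i, j) of the domain to its offset in the flat array,
    // rejecting points that are out of bounds or off the stride.
    private int offset(int i, int j) {
        int di = i - lo0, dj = j - lo1;
        if (di < 0 || dj < 0 || di % st0 != 0 || dj % st1 != 0
                || di / st0 >= n0 || dj / st1 >= n1) {
            throw new IndexOutOfBoundsException("(" + i + ", " + j + ")");
        }
        return (di / st0) * n1 + (dj / st1);
    }

    double get(int i, int j) { return data[offset(i, j)]; }
    void set(int i, int j, double v) { data[offset(i, j)] = v; }

    // Analogue of: foreach (p in A.domain()) { A[p] = B[2 * p]; }
    void copyDoubledFrom(Grid2D b) {
        for (int i = lo0; i <= hi0; i += st0) {
            for (int j = lo1; j <= hi1; j += st1) {
                set(i, j, b.get(2 * i, 2 * j));
            }
        }
    }
}
```

With the bounds from the slide, `new Grid2D(1, 1, 10, 20, 2, 2)` holds 5 × 10 = 50 points under the inclusive-bound assumption. The per-access checks in `offset` are the kind of work that the listed optimizations (bounds-check removal, strength reduction) aim to move out of the loop body.
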
Memory Management
• Java implemented with garbage collection
• Distributed GC too unpredictable
• Compile-time analysis can improve performance
• Zone-based memory management
  • extends existing model
  • good performance
  • safe
  • easy to use

Zone-Based Memory Management
• Allocate objects in zones
• Release zones manually

    Zone Z1 = new Zone();
    Zone Z2 = new Zone();
    T x = new(Z1) T();
    T y = new(Z2) T();
    x.field = y;
    x = y;
    delete Z1;
    delete Z2; // error

[Figure: zones Z1 and Z2 with the references x and y]

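One way to read the example above is that a zone may be deleted only when no live reference still points into it from outside. The sketch below models that reading in plain Java with manual bookkeeping. `ZoneModel` and its methods are hypothetical, and the counting scheme is an assumption made for illustration; the slides do not say how Titanium performs this check.

```java
// Hypothetical model of zone deletion safety; not Titanium's implementation.
// Each zone counts live references that point into it from outside the zone.
final class ZoneModel {
    private final String name;
    private int externalRefs = 0;
    private boolean deleted = false;

    ZoneModel(String name) {
        this.name = name;
    }

    // Record that a variable, or an object in another zone, now points here.
    void addExternalRef() {
        checkLive();
        externalRefs++;
    }

    // Record that such a reference was overwritten or went away.
    void dropExternalRef() {
        if (externalRefs == 0) {
            throw new IllegalStateException("no reference to drop");
        }
        externalRefs--;
    }

    boolean isDeleted() { return deleted; }

    // Deleting is allowed only when nothing outside still points into the zone.
    void delete() {
        checkLive();
        if (externalRefs > 0) {
            throw new IllegalStateException(
                "zone " + name + " still has " + externalRefs + " external reference(s)");
        }
        deleted = true;
    }

    private void checkLive() {
        if (deleted) {
            throw new IllegalStateException("zone " + name + " already deleted");
        }
    }
}
```

Replaying the slide under this model: after `x = y`, nothing points into Z1, so deleting Z1 succeeds, while both `x` and `y` still point into Z2, so deleting Z2 is rejected.
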
Sequential Performance
Times in seconds (lower is better).

Model of Parallelism
• Single Program, Multiple Data
  • fixed number of processes
  • each process has own local data
  • global synchronization (barrier)
[Figure: n processes start, synchronize at successive barriers, and end]

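The SPMD pattern described above (a fixed set of processes running the same program, separated into phases by barriers) can be imitated with Java threads and `java.util.concurrent.CyclicBarrier`. This is only a sketch: `SpmdSketch.run` is a hypothetical helper, and threads in one JVM share a heap, so it does not model Titanium's distributed-memory execution.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical SPMD imitation: n threads run the same body and meet at barriers.
final class SpmdSketch {
    // Runs the same two-phase body on n threads; returns the per-process results.
    static long[] run(int n) {
        final long[] local = new long[n];    // slot p is written only by process p
        final long[] result = new long[n];
        final CyclicBarrier barrier = new CyclicBarrier(n);
        Thread[] procs = new Thread[n];
        for (int p = 0; p < n; p++) {
            final int me = p;
            procs[p] = new Thread(() -> {
                try {
                    local[me] = me + 1;      // phase 1: compute on local data
                    barrier.await();         // every phase-1 write is now complete
                    long sum = 0;            // phase 2: read everyone's data
                    for (long v : local) sum += v;
                    result[me] = sum;
                    barrier.await();         // end of phase 2
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[p].start();
        }
        try {
            for (Thread t : procs) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result;
    }
}
```

Without the first barrier, a process could read a slot of `local` before its owner has written it; the barrier is what separates the two phases.
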
Global Address Space
• Each process has its own heap
• References can span process boundaries

    class T { … }
    T gv;
    T lv = null;
    if (thisProc() == 0) {
        lv = new T();          // allocate locally
    }
    gv = broadcast lv from 0;  // distribute
    … gv.field ...

[Figure: Process 0 and the other processes, each with a local heap, holding lv and gv references]

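The broadcast step above can be imitated in plain Java, with threads standing in for processes. The sketch below is an assumption-laden illustration: `BroadcastSketch`, its nested `T`, and the value 42 are hypothetical, and because all threads share one heap it shows only the control pattern (one allocator, everyone receives the reference), not cross-machine references.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical imitation of "gv = broadcast lv from 0" using threads.
final class BroadcastSketch {
    static final class T {
        int field;
    }

    // Every "process" ends up holding a reference to the object process 0 allocated.
    static T[] run(int n) {
        final AtomicReference<T> slot = new AtomicReference<>();
        final CyclicBarrier barrier = new CyclicBarrier(n);
        final T[] gv = new T[n];             // gv[p] is process p's view
        Thread[] procs = new Thread[n];
        for (int p = 0; p < n; p++) {
            final int me = p;
            procs[p] = new Thread(() -> {
                try {
                    T lv = null;
                    if (me == 0) {
                        lv = new T();        // allocate locally on process 0
                        lv.field = 42;
                        slot.set(lv);        // publish for the others
                    }
                    barrier.await();         // wait until the value is published
                    gv[me] = slot.get();     // everyone receives the same reference
                } catch (InterruptedException | BrokenBarrierException e) {
                    throw new IllegalStateException(e);
                }
            });
            procs[p].start();
        }
        try {
            for (Thread t : procs) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return gv;
    }
}
```

After `run(n)`, every entry of the returned array refers to the single object created by process 0, which corresponds to each process reading `gv.field` through its own copy of the reference.
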
Global vs. Local References
• Global references may be slow
  • distributed memory: overhead of a few instructions when using a global reference to access a local object
  • shared memory: no performance implications
• Solution: use local qualifier
  • statically restrict references to local objects
  • example: T local lv = null;
  • use only in performance-critical sections

Global Synchronization Analysis
• In Titanium, processes must synchronize at the same textual instances of barrier()

    doThis();
    barrier();
    boolean x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();

Global Synchronization Analysis
• In Titanium, processes must synchronize at the same textual instances of barrier()
• Singleness analysis statically guarantees correctness by restricting the values of variables that control program flow

    doThis();
    barrier();
    boolean single x = someCondition();
    if (x) {
        doThat();
        barrier();
    }
    doSomeMore();
    barrier();

Support for Grid-Based Computation

    Point<2> lb = [0, 0];
    Point<2> ub = [6, 4];
    RectDomain<2> R = [lb : ub : [2, 2]];
    …
    Domain<2> red = R + (R + [1, 1]);
    foreach (p in red) {
        …
    }

[Figure: R spans (0, 0) to (6, 4); R + [1, 1] spans (1, 1) to (7, 5); red spans (0, 0) to (7, 5). Caption: Gauss-Seidel relaxation with red-black ordering]

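The construction of the red set can be checked by enumerating points in plain Java. This sketch reads `R + [1, 1]` as a translation and the outer `+` as a union, and assumes inclusive upper bounds, which matches the corner coordinates shown in the figure. `RedBlackDomains` and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical enumeration of the "red" points of a red-black ordering.
final class RedBlackDomains {
    // Points of the inclusive strided rectangle [(0, 0) : (hi0, hi1) : stride],
    // each translated by (shift, shift).
    static Set<List<Integer>> shiftedRect(int hi0, int hi1, int stride, int shift) {
        Set<List<Integer>> pts = new LinkedHashSet<>();
        for (int i = 0; i <= hi0; i += stride) {
            for (int j = 0; j <= hi1; j += stride) {
                pts.add(Arrays.asList(i + shift, j + shift));
            }
        }
        return pts;
    }

    // red = R + (R + [1, 1]): union of R with a copy of R translated by (1, 1).
    static Set<List<Integer>> red() {
        Set<List<Integer>> red = shiftedRect(6, 4, 2, 0);  // R
        red.addAll(shiftedRect(6, 4, 2, 1));               // R + [1, 1]
        return red;
    }
}
```

Every point produced this way has an even coordinate sum, so the set is one color of a checkerboard over the box from (0, 0) to (7, 5); the other color would be obtained by shifting R by (1, 0) and (0, 1) instead.
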
Implementation
• Strategy
  • compile Titanium into C (currently C++)
  • POSIX threads for SMPs (currently Solaris threads)
  • Lightweight Active Messages for communication
• Status
  • runs on Sun Enterprise 8-way SMP
  • runs on Berkeley NOW
  • trivial ports to half a dozen other architectures
  • tuning for sequential performance

Titanium Status
• Titanium language definition complete.
• Titanium compiler running.
• Compiles for uniprocessors, NOW; others soon.
• Application developments ongoing.
• Many research opportunities.

Parallel Performance
• Numbers from UltraSPARC SMP
• Parallel efficiency good
• EM3D (unstructured kernel)
• 3D AMR limited by algorithm
[Figure: speedup versus number of processors]

Future Directions
• Use of framework for domain-specific languages
  • Fluids and AMR done
  • Unstructured meshes and sparse solvers
• Better programming tools
  • debuggers, performance analysis
• Optimizations
  • analysis of parallel code and synchronization done
  • optimizations for caches on uniprocessors and SMPs underway
  • load balancing on clusters of SMPs

Conclusions
• Performance
  • sequential performance consistently close to C/FORTRAN
    • currently: 80% slower to 25% faster
  • sequential efficiency very high
• Expressiveness
  • safety of Java with small set of performance features
  • extensible to new application domains
• Portability, compatibility, etc.
  • no gratuitous departures from Java standard
  • compilation model easily supports new platforms