Performance Analysis Of Generics In Scientific Computing. Laurentiu Dragan, Stephen M. Watt. Ontario Research Centre for Computer Algebra, University of Western Ontario. SYNASC 2005.
Overview • Motivation • Parametric Polymorphism Implementation • Generalizing A Numeric Benchmark • Language Issues • Results • Potential Optimizations • Conclusion
Motivation • Increasing demand for generic code • Scientific code requires high performance, making optimizations very important • Generic code – not as fast as specialized code • No tools to measure performance of generic code • Benchmarks – tool to measure the performance • SciGMark – benchmark for generic code • Compilers – optimize the generic code – performance close to hand-specialized code
Parametric Polymorphism Implementation • Some languages with support for Generics • Aldor, C++ • Java, C# • Some types can be given as parameters • Implementations • Homogeneous: Java, C# • Share the generic code • Example: Vector<Integer> → Vector with elements of type Object • Heterogeneous: C++, C# • Specialize the generic code • Example: std::vector<int> → new specialized class
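The homogeneous case can be observed directly in Java. The following is a minimal sketch (the class name ErasureDemo is made up for illustration): two instantiations of java.util.Vector report the same runtime class, because the type argument is erased and the generic code is shared.

```java
import java.util.Vector;

class ErasureDemo {
    // With a homogeneous (erasure-based) implementation, every
    // instantiation of a generic class shares one runtime class.
    static boolean sharesRuntimeClass() {
        Vector<Integer> ints = new Vector<>();
        Vector<String> strings = new Vector<>();
        return ints.getClass() == strings.getClass();
    }

    public static void main(String[] args) {
        System.out.println(sharesRuntimeClass());
    }
}
```

A heterogeneous implementation such as C++ templates would instead generate a distinct class for each instantiation.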
Generalizing A Numeric Benchmark • SciMark 2 • Polynomial Multiplication • Implemented in Aldor, C++, C#, Java
SciMark 2 • Fast Fourier transform – 1024 • Complex arithmetic, shuffling, non-constant memory reference, trigonometric functions • Jacobi successive over-relaxation – 100x100 • Typical access patterns in finite difference applications • Monte Carlo integration • Random number generator, function inlining • Sparse matrix multiplication – 1000, 5000 non-zero • Indirection addressing, non-regular memory references • Dense LU factorization – 100x100 • Dense matrix operations
From SciMark 2 to SciGMark • SciMark – double hardcoded • Arrays are of type double • Any change – extensive modifications to the code • SciGMark – classes are parametric • Change representation – minimal code changes • Double becomes parameter R • Before: class SOR { double[] array; } • After: class SOR<R extends IRing<R>> { R[] array; } • (Diagram: IRing with methods such as R a(R o) and void ae(R o); DoubleRing wraps double)
Basic Generic Types • IRing • Provides operations for addition, subtraction, multiplication, division – mutable, non-mutable • Conversions to and from int and double • Factories to produce new elements of this type • DoubleRing – wrapper for double • Implements IRing • Complex • Implements IComplex (simple extension to IRing) • Complex<R extends IRing<R>> implements IComplex<Complex<R>,R>
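The shape of these types can be sketched as follows. This is a reduced illustration, not the SciGMark source: the method names add, sub and mul are assumptions, only the non-mutable operations are shown, and Complex implements IRing directly rather than the IComplex extension named on the slide.

```java
// Reduced, hypothetical version of the ring interface.
interface IRing<R> {
    R add(R other);
    R sub(R other);
    R mul(R other);
}

// Wrapper for double, so that double can be used as a type argument.
final class DoubleRing implements IRing<DoubleRing> {
    final double value;
    DoubleRing(double value) { this.value = value; }
    public DoubleRing add(DoubleRing o) { return new DoubleRing(value + o.value); }
    public DoubleRing sub(DoubleRing o) { return new DoubleRing(value - o.value); }
    public DoubleRing mul(DoubleRing o) { return new DoubleRing(value * o.value); }
}

// Complex numbers over any ring R; every operation goes through IRing calls.
final class Complex<R extends IRing<R>> implements IRing<Complex<R>> {
    final R re, im;
    Complex(R re, R im) { this.re = re; this.im = im; }
    public Complex<R> add(Complex<R> o) { return new Complex<>(re.add(o.re), im.add(o.im)); }
    public Complex<R> sub(Complex<R> o) { return new Complex<>(re.sub(o.re), im.sub(o.im)); }
    public Complex<R> mul(Complex<R> o) {
        return new Complex<>(re.mul(o.re).sub(im.mul(o.im)),
                             re.mul(o.im).add(im.mul(o.re)));
    }
}

class RingDemo {
    // (1 + 2i) * (3 + 4i)
    static Complex<DoubleRing> product() {
        Complex<DoubleRing> a = new Complex<>(new DoubleRing(1), new DoubleRing(2));
        Complex<DoubleRing> b = new Complex<>(new DoubleRing(3), new DoubleRing(4));
        return a.mul(b);
    }

    public static void main(String[] args) {
        Complex<DoubleRing> p = product();
        System.out.println(p.re.value + " + " + p.im.value + "i");
    }
}
```

Each arithmetic step allocates a wrapper object, which is one plausible source of the gap between the generic and specialized variants.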
Generic Tests • GenFFT • Uses R: Complex<DoubleRing> • Complex numbers – two consecutive entries in the array • Depending on the application – different representation (e.g. Hermitian matrix) • GenMat, GenLU • Use R: DoubleRing • The classes contain more methods – the whole class contains a type parameter • GenSOR, GenMonteCarlo • Use R: DoubleRing • Have single static method with a type parameter
Polynomial Multiplication • 40 coefficients • Dense representation – one-dimensional array • Regular memory access, temporary object creation (memory allocation) • Implementation • DensePolynomial • DensePolynomialG <E extends IRing<E> > implements IRing<DensePolynomialG<E> > • SmallPrimeField • Represented by an int • SmallPrimeFieldG implements IRing<SmallPrimeFieldG>
Specializing Polynomial Multiplication • The code was initially implemented using generics • Inlined all the calls to SmallPrimeField • Replaced all the instances of SmallPrimeField with int • Essentially the inverse of the operation performed to “generalize” the SciMark • No changes to the algorithm – all changes could be performed automatically
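The specialization steps above can be illustrated with a small sketch. PrimeFieldElem is a hypothetical stand-in for SmallPrimeField, and the modulus 7919 is an assumed value; both versions use the same schoolbook algorithm and differ only in the coefficient representation. Inputs are assumed to be non-empty, with int coefficients already reduced to the range 0..P-1.

```java
import java.util.Arrays;

// Hypothetical stand-in for the SmallPrimeField wrapper: one int per element.
final class PrimeFieldElem {
    static final int P = 7919; // assumed small prime modulus
    final int v;
    PrimeFieldElem(int v) { this.v = Math.floorMod(v, P); }
    PrimeFieldElem add(PrimeFieldElem o) { return new PrimeFieldElem(v + o.v); }
    PrimeFieldElem mul(PrimeFieldElem o) {
        return new PrimeFieldElem((int) ((long) v * o.v % P));
    }
}

class PolyMulDemo {
    // Wrapper-based version: every coefficient operation allocates a temporary.
    static PrimeFieldElem[] mulWrapped(PrimeFieldElem[] a, PrimeFieldElem[] b) {
        PrimeFieldElem[] c = new PrimeFieldElem[a.length + b.length - 1];
        Arrays.fill(c, new PrimeFieldElem(0)); // elements are immutable, sharing is safe
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                c[i + j] = c[i + j].add(a[i].mul(b[j]));
        return c;
    }

    // Specialized version: calls inlined by hand, wrapper replaced by int.
    static int[] mulSpecialized(int[] a, int[] b) {
        final int p = PrimeFieldElem.P;
        int[] c = new int[a.length + b.length - 1];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < b.length; j++)
                c[i + j] = (int) ((c[i + j] + (long) a[i] * b[j]) % p);
        return c;
    }

    public static void main(String[] args) {
        // (1 + 2x + 3x^2) * (4 + 5x)
        System.out.println(Arrays.toString(
            mulSpecialized(new int[] {1, 2, 3}, new int[] {4, 5})));
    }
}
```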
Language Issues • Java • No operator overloading • Homogeneous – erasure technique – subclassing • Implemented at language level – no virtual machine support – limitations – require object factory • Type inference for generics is invariant – Pass the type as argument • Complex <R extends IRing<R>> implements IComplex<Complex<R>,R> • C# • Reference types (homogeneous) – Java; primitive types (heterogeneous) – C++ • Structures instead of classes – structures in collections are boxed
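The object-factory requirement in Java follows from erasure: a type parameter R has no runtime representation, so generic code cannot write new R(). A minimal sketch of the usual workaround, assuming a hypothetical helper named zeros and using java.util.function.Supplier as the factory:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class FactoryDemo {
    // static <R> R broken() { return new R(); }  // rejected by the compiler: R is erased

    // The caller passes a factory, because the generic code cannot construct R itself.
    static <R> List<R> zeros(int n, Supplier<R> factory) {
        List<R> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(factory.get());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Double> zs = zeros(3, () -> 0.0);
        System.out.println(zs);
    }
}
```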
Language Issues • C++ • Heterogeneous • Parametric polymorphism (templates) – works like a macro processor • No bounded polymorphism • No way to test the generic class until it is instantiated • Aldor • Homogeneous • Supports dependent types • Polymorphic types constructed using domain constructing functions
SciGMark Results • Results in MFlops • Testing environment: • Pentium IV – 3.2GHz (1MB cache), 2 GB RAM • Windows XP SP2 • Cygwin/GCC 3.4.4 • Sun JDK 1.5.0_04 • Microsoft .NET v2.0.50215 • Aldor 1.0.2
SciGMark Results

Test    C++ Gen  C++ Spe  Java Gen  Java Spe  C# Gen  C# Spe  Aldor Gen  Aldor Spe  Size
FFT     59       365      23        321       7       242     1          340        1024
SOR     71       419      66        681       22      417     15         417        100x100
MC      46       65       22        26        28      62      90         203        N/A
MM      87       739      111       410       39      477     4          485        1000, 5000
LU      103      780      74        982       18      403     5          553        100x100
PM      62       365      48        227       28      321     6          156        40
Comp.   71       434      57        441       24      320     20         359        N/A
Aldor Results • Testing environment: • Pentium IV – 3.2GHz (1MB cache), 2 GB RAM • Linux Fedora Core 3 • Aldor 1.0.2 • Stanford benchmark • Aldor’s performance can be almost as good as C++
Aldor Results

Test          Aldor Time  Aldor Iterations  C++ Time  C++ Iterations
Permutations  0.43        23400             0.37      26901
Towers        0.58        17297             0.21      46924
8-Queen       0.52        19700             0.45      21987
Mat Mult      0.65        15386             0.20      49155
Puzzle        2.89        3484              2.16      4626
Quick Sort    0.79        12538             0.66      15214
Bubble Sort   0.74        13526             0.53      19089
Tree Sort     1.00        10                2.00      10
FP Mat Mult   0.69        14342             0.49      20355
Oscar FFT     0.38        26838             0.24      40719

Composite     Aldor  C++
Comp FP       1.07   1.05
Comp int      1.43   1.29
Potential Optimizations • 6-18 times performance improvement • Specialized code • Same algorithm • Generic types replaced by specialized types • Eliminate generic wrapper objects – primitive types
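The 6-18 times range appears consistent with the composite (Comp.) row of the SciGMark results table. The sketch below only redoes that arithmetic from the Gen and Spe values reported there; the class and method names are made up for illustration.

```java
class SpeedupDemo {
    // Ratio of specialized to generic throughput (both in MFlops).
    static double ratio(double specialized, double generic) {
        return specialized / generic;
    }

    public static void main(String[] args) {
        // Values copied from the Comp. row: {Gen, Spe} per language.
        String[] langs = {"C++", "Java", "C#", "Aldor"};
        double[][] comp = {{71, 434}, {57, 441}, {24, 320}, {20, 359}};
        for (int i = 0; i < langs.length; i++) {
            System.out.printf("%-5s %.1fx%n", langs[i], ratio(comp[i][1], comp[i][0]));
        }
    }
}
```

The ratios come out at roughly 6.1 for C++, 7.7 for Java, 13.3 for C# and just under 18 for Aldor.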
Test Case Aldor • Domain producing function:

    PolynomialVect(C: Ring) == add {
        Rep == Vector Polynomial C;
        (f: %) + (g: %): % == {
            res := new(#f);
            rf := rep f; rg := rep g;
            for k in 1..#f for i in rf for j in rg repeat
                res(k) := i + j;
            per res
        }
    }
    PC == PolynomialVect(Complex DoubleFloat);
    PQ == PolynomialVect(Rational);
Test Case Aldor • Specialize the domain producing function

    PC == add {
        Rep == Vector Polynomial Complex DoubleFloat;
        (f: %) + (g: %): % == {
            res := new(#f);
            rf := rep f; rg := rep g;
            for k in 1..#f for i in rf for j in rg repeat
                res(k) := i + j;    -- ‘+’ from Complex
            per res
        }
    }
Optimize Data Representation • Scalar product of vector of complex numbers

    dot(u: Vector Complex R, v: Vector Complex R): Complex R == {
        s: Complex R := 0;
        for i in 1..n repeat
            s := s + u.i*v.i;
        return s;
    }

    dot(u: Vector Complex R, v: Vector Complex R): Complex R == {
        x: R := 0; y: R := 0;
        for i in 1..n repeat {
            x := x + real(u.i)*real(v.i) - imag(u.i)*imag(v.i);
            y := y + real(u.i)*imag(v.i) + imag(u.i)*real(v.i);
        }
        return complex(x,y);
    }
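The same transformation can be sketched in Java. Cplx is a hypothetical immutable complex type over double, not a SciGMark class. The first method allocates two temporary objects per iteration, while the second accumulates the real and imaginary parts in scalars and allocates only the result.

```java
// Hypothetical immutable complex number over double.
final class Cplx {
    final double re, im;
    Cplx(double re, double im) { this.re = re; this.im = im; }
    Cplx add(Cplx o) { return new Cplx(re + o.re, im + o.im); }
    Cplx mul(Cplx o) { return new Cplx(re * o.re - im * o.im, re * o.im + im * o.re); }
}

class DotDemo {
    // Object-based accumulation: one product and one sum object per element.
    static Cplx dotObjects(Cplx[] u, Cplx[] v) {
        Cplx s = new Cplx(0, 0);
        for (int i = 0; i < u.length; i++) {
            s = s.add(u[i].mul(v[i]));
        }
        return s;
    }

    // Optimized representation: accumulate the two components as scalars.
    static Cplx dotScalars(Cplx[] u, Cplx[] v) {
        double x = 0, y = 0;
        for (int i = 0; i < u.length; i++) {
            x += u[i].re * v[i].re - u[i].im * v[i].im;
            y += u[i].re * v[i].im + u[i].im * v[i].re;
        }
        return new Cplx(x, y);
    }

    public static void main(String[] args) {
        Cplx[] u = {new Cplx(1, 2), new Cplx(3, 4)};
        Cplx[] v = {new Cplx(5, 6), new Cplx(7, 8)};
        Cplx d = dotScalars(u, v);
        System.out.println(d.re + " + " + d.im + "i");
    }
}
```

Both methods assume u and v have the same length and, like the Aldor version, compute the plain (non-conjugated) product sum.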
Conclusion • Generics important for scientific computing – rich mathematical models – easy to implement with generic code • Need a tool to measure the compiler's ability to produce efficient code • We have seen differences of 6-18 times between generic and specialized code – room for improvement in compiler capabilities • Presented some optimization ideas • http://www.orrca.on.ca/benchmarks/scigmark/1.0/