Rethinking Component Based Software Engineering
Don Batory, Bryan Marker, Rui Gonçalves, Robert van de Geijn, and Janet Siegmund
Department of Computer Science, University of Texas at Austin, Austin, Texas 78746
Introduction • Software Engineering (SE) largely aims at techniques and tools to aid masses of programmers whose code is used by hordes • these programmers need all the help they can get • There are many areas where programming tasks are so difficult that only a few expert programmers can do them – and their code is used by hordes • these experts need all the help they can get too
Our Focus is CBSE for… • Dataflow domains: • nodes are computations • edges denote node inputs and outputs • General examples: Virtual Instruments (LabVIEW), applications of streaming languages… • Our domains: • Distributed-Memory Dense Linear Algebra Kernels • Parallel Relational Query Processors • Crash Fault-Tolerant File Servers
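To make the dataflow picture concrete, here is a minimal sketch of nodes-as-computations and edges-as-inputs/outputs. The class and names are hypothetical illustrations, not the authors' tooling:

```python
# Minimal dataflow-graph sketch: nodes are computations, edges carry
# values from a producer node's output to a consumer node's inputs.
# (Hypothetical illustration; not the authors' actual tooling.)

class Node:
    def __init__(self, name, fn, inputs=()):
        self.name = name      # label shown in the dataflow diagram
        self.fn = fn          # the computation this node performs
        self.inputs = inputs  # upstream nodes whose outputs feed this node

    def evaluate(self):
        # Pull values along incoming edges, then compute.
        args = [src.evaluate() for src in self.inputs]
        return self.fn(*args)

# A tiny graph:  two --\
#                       mul --> result
#              three --/
two = Node("two", lambda: 2)
three = Node("three", lambda: 3)
mul = Node("mul", lambda a, b: a * b, inputs=(two, three))
print(mul.evaluate())  # 6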
Approach • CBSE experts produce “Big Bang” spaghetti diagrams (dataflow graphs) • We instead derive dataflow graphs from domain knowledge – Design by Transformation (DxT) • When we have proofs of each derivation step, the result is Correct by Construction • Details later…
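A toy version of this derivation idea: a dataflow graph is refined by rewrite rules that map an abstract node to an implementing subgraph; if each rule is proven correct once, every derived graph is correct by construction. All names here (the rule, the operators) are illustrative assumptions, not the DxT tool itself:

```python
# Toy DxT-style refinement. A graph is a nested tuple (op, child, ...);
# leaves are strings naming inputs. Each RULES entry is one proven
# rewrite: an abstract op is replaced by its implementing subgraph.
RULES = {
    # refinement: abstract distributed Gemm becomes redistributions
    # feeding a local gemm (illustrative, one level deep)
    "Gemm": lambda A, B: ("LocalGemm",
                          ("Redistribute", A),
                          ("Redistribute", B)),
}

def refine(graph):
    if isinstance(graph, str):        # a leaf (an input matrix)
        return graph
    op, *kids = graph
    kids = [refine(k) for k in kids]  # refine subgraphs bottom-up
    if op in RULES:                   # apply the proven rewrite rule
        return RULES[op](*kids)
    return (op, *kids)

print(refine(("Gemm", "A", "B")))
# ('LocalGemm', ('Redistribute', 'A'), ('Redistribute', 'B'))
```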
State of the Art for Distributed-Memory Dense Linear Algebra Kernels • Portability of DLA kernels is a problem: • they may not work – distributed-memory kernels don’t work on sequential machines • they may not perform well • the choice of algorithms to use may differ across machines • optimizations cannot be “undone” and others reapplied • if the hardware is different enough, kernels are coded from scratch
Why? Because Performance is Key! • Applications that make DLA kernel calls are common in scientific computing: • simulation of airflow, climate change, weather forecasting • These applications run on extraordinarily expensive machines • time on these machines = $$ • higher performance means quicker/cheaper runs or more accurate results • Application developers naturally want peak performance to justify costs
Distributed DLA Kernels • Target SPMD (Single Program, Multiple Data) architectures • the same program runs on each processor, but with different inputs • The operations to support are fixed – but come in many variants • Level 3 Basic Linear Algebra Subprograms (BLAS3) – essentially matrix-matrix operations: • general matrix-matrix multiply (Gemm) • Hermitian matrix-matrix multiply • symmetric matrix-matrix multiply • triangular matrix-matrix multiply • solving a non-singular triangular system of equations
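The BLAS3 operations listed above, expressed with NumPy so their mathematical meaning is unambiguous. This is illustration only; real kernels call tuned (and here, distributed) BLAS implementations, not NumPy:

```python
import numpy as np

m, n = 4, 3
A = np.random.rand(m, m)
B = np.random.rand(m, n)
C = np.random.rand(m, n)
alpha, beta = 1.5, 0.5

gemm = alpha * A @ B + beta * C       # general matrix-matrix multiply
S = A + A.T                           # a symmetric matrix (Hermitian, for
symm = alpha * S @ B + beta * C       #   real data) matrix-matrix multiply
L = np.tril(A)                        # a lower-triangular matrix
trmm = alpha * L @ B                  # triangular matrix-matrix multiply
trsm = alpha * np.linalg.solve(L, B)  # solve triangular system L X = alpha B
```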
12 Variants of Distributed Gemm • Gemm computes C := αAB + βC, where A is m × k, B is k × n, and C is m × n • Specialize the distributed-memory implementation based on whether m, n, or k is largest • Similar distinctions exist for the other operations
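A hedged sketch of the shape-based dispatch the experts perform: keep the largest operand stationary (it is never communicated) and redistribute the other two. The "stationary" names follow the terminology used in the Elemental literature; the selection rule here is a simplification of the real one:

```python
def choose_gemm_variant(m: int, n: int, k: int) -> str:
    # Keep the largest operand stationary; communicate the other two.
    if k <= m and k <= n:
        return "stationary-C"   # C (m x n) dominates: k is smallest
    if n <= m and n <= k:
        return "stationary-A"   # A (m x k) dominates: n is smallest
    return "stationary-B"       # B (k x n) dominates: m is smallest

print(choose_gemm_variant(m=10000, n=10000, k=64))  # stationary-C
print(choose_gemm_variant(m=10000, n=64, k=10000))  # stationary-A
```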
Further • We also want to optimize “LAPACK-level” algorithms that call DLA and BLAS3 operations: • solvers • decomposition functions (e.g., Cholesky factorization) • eigenvalue problems • High-performance algorithms must be generated for these operations too • Our work mechanizes the decisions of experts on van de Geijn’s FLAME project, in particular the Elemental library (J. Poulson) • it rests on 20 years of polishing, creating elegant, layered designs of DLA libraries and their computations
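To show what "LAPACK-level algorithms that call BLAS3" means, here is a blocked right-looking Cholesky factorization: each iteration factors a small diagonal block, then performs BLAS3-style TRSM and SYRK updates. NumPy stands in for tuned/distributed BLAS; this is a minimal sketch, not the FLAME or DxT-generated code:

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Overwrite the lower triangle of SPD matrix A with its Cholesky factor L."""
    n = A.shape[0]
    for j in range(0, n, nb):
        e = min(j + nb, n)
        # factor the small diagonal block (unblocked Cholesky)
        A[j:e, j:e] = np.linalg.cholesky(A[j:e, j:e])
        if e < n:
            L11 = A[j:e, j:e]
            # TRSM: A21 := A21 * L11^{-T}
            A[e:, j:e] = np.linalg.solve(L11, A[e:, j:e].T).T
            # SYRK: A22 := A22 - A21 * A21^T  (Schur complement update)
            A[e:, e:] -= A[e:, j:e] @ A[e:, j:e].T
    return np.tril(A)

# quick check on a well-conditioned SPD matrix
M = np.random.rand(200, 200)
A = M @ M.T + 200 * np.eye(200)
L = blocked_cholesky(A.copy(), nb=32)
assert np.allclose(L @ L.T, A)
```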
Performance Results • Target machines: • Benchmarked against ScaLAPACK • the vendors’ standard option for distributed-memory machines; auto-tuned or manually tuned • the only alternative available for the target machines, except for FLAME • DxT automatically generated & optimized BLAS3 and Cholesky FLAME algorithms
DxT is Not Limited to DLA • DLA components are stateless – but DxT does not require stateless components • DxT was originally developed for stateful crash-fault-tolerant servers • Correct by Construction, can design high-performing programs, and best of all: we can teach it to undergrads! • Gave the project to an undergraduate class of 30+ students • Had them build Gamma – a classical parallel join algorithm circa the 1990s – using the same DxT techniques we used for DLA code generation (sketched below) • We asked them to compare this with the “big bang” approach, which directly implements the spaghetti diagram (the final design)
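For context, a sketch of the kind of hash-based parallel join Gamma made famous: tuples of both relations are hash-partitioned on the join key across workers (the "split" step in the dataflow diagram), then each worker joins its bucket locally. A sequential loop stands in for Gamma's processors, and all names are illustrative, not the students' assignment code:

```python
from collections import defaultdict

def hash_partition(tuples, key, n_workers):
    parts = [[] for _ in range(n_workers)]
    for t in tuples:
        parts[hash(t[key]) % n_workers].append(t)  # split on join key
    return parts

def local_hash_join(r_part, s_part, key):
    table = defaultdict(list)
    for r in r_part:                   # build phase
        table[r[key]].append(r)
    return [(r, s) for s in s_part     # probe phase
            for r in table.get(s[key], [])]

def gamma_join(R, S, key, n_workers=4):
    R_parts = hash_partition(R, key, n_workers)
    S_parts = hash_partition(S, key, n_workers)
    out = []
    for w in range(n_workers):         # each "processor" joins its bucket
        out.extend(local_hash_join(R_parts[w], S_parts[w], key))
    return out

R = [{"id": i, "a": i * 2} for i in range(5)]
S = [{"id": i, "b": i * 3} for i in range(0, 10, 2)]
print(gamma_join(R, S, "id"))  # matches on ids 0, 2, 4
```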
Preliminary User Study Numbers • Compared to the “Big Bang” approach, 25/28 students (89%) favored DxT
They Really Loved It • “I have learned the most from this project than any other CS project I have ever done. I even made my OS group do DxT implementation on the last 2 projects due to my experience implementing Gamma.” • “Honestly, I don't believe that software engineers ever have a source (to provide a DxT explanation) in real life. If there was such a thing we would lose our jobs, because there is an explanation which even a monkey can implement.” • “It's so much easier to implement (using DxT). The big-bang makes it easy to make so many errors, because you can't test each section separately. DxT might take a bit longer, but saves you so much time debugging, and is a more natural way to build things. You won't get lost in your design trying to do too many things at once.”
What are the Secrets Behind DxT? • I’m sorry – I ran out of time… questions?