Summit On Self Adapting Software and Algorithms
August 7th and 8th, 2002
University of Tennessee, Claxton Complex, Room 233
Hosted by the Innovative Computing Laboratory
Agenda • Presentations from various groups • These presentations should give all participants the opportunity to learn about the current activities of all groups and about their future plans and visions. • Discussion toward a consensus on common goals • Formulate common goals and try to agree on ways to reach them • Discussion of possible funding opportunities
Self Adapting Numerical Software andUpdate on NetSolve University of Tennessee
Outline • Overview of the SANS-Effort • Review of ATLAS • Sparse Optimization • LAPACK For Clusters (LFC) • Review of NetSolve • Grid enabled numerical libraries
Innovative Computing Laboratory • Numerical Linear Algebra • Heterogeneous Distributed Computing • Software Repositories • Performance Evaluation Software and ideas have found their way into many areas of Computational Science Around 40 people at the moment: 16 Researchers: Research Assoc/Post-Doc/Research Prof 15 Students: Graduate and Undergraduate 8 Support staff: Secretary, Systems, Artist 1 Long-term visitor (Japan) Responsible for about $3-4M/year in research funding from NSF, DOE, DOD, etc.
Self-Adapting Numerical Software (SANS) Effort • The complexity of modern processors and clusters makes it difficult to achieve high performance • Hardware, compilers, and software have a large design space with many parameters • Kernels of computation routines • Focus on where the most time is spent • Need for quick/dynamic deployment of optimized routines • Algorithm layout and implementation • Look at the different ways to express the implementation [Diagram: the tuning system takes different algorithms, segment sizes, and data structures as input and selects the best algorithm, segment size, and data structure]
Software Generation Strategy - ATLAS BLAS • "New" model of high performance programming where critical code is machine generated using parameter optimization • Parameter study of the hardware • Generate multiple versions of code, with different values of key performance parameters • Run and measure the performance of the various versions • Pick the best and generate the library • Takes ~20 minutes to run; generates Level 1, 2, & 3 BLAS • The Level 1 cache multiply optimizes for: TLB access, L1 cache reuse, FP unit usage, memory fetch, register reuse, and loop overhead minimization • Designed for RISC architectures (superscalar); needs a reasonable C compiler • Today ATLAS is used within various ASCI and SciDAC activities and by Matlab, Mathematica, Octave, Maple, Debian, Scyld Beowulf, SuSE, …
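To make the search concrete, here is a minimal sketch in the spirit of ATLAS's empirical tuning, not the actual ATLAS generator: it times a blocked matrix multiply for several candidate block sizes and keeps the fastest. The kernel, the candidate list, and the timing harness are all illustrative assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 512  /* problem size used for the timing runs (illustrative) */

/* One candidate version: square-blocked matrix multiply with block size nb. */
static void matmul_blocked(int nb, const double *A, const double *B, double *C)
{
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb && i < N; i++)
                    for (int k = kk; k < kk + nb && k < N; k++)
                        for (int j = jj; j < jj + nb && j < N; j++)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}

int main(void)
{
    double *A = calloc(N*N, sizeof *A), *B = calloc(N*N, sizeof *B);
    double *C = calloc(N*N, sizeof *C);
    int candidates[] = {8, 16, 32, 64, 128};   /* the "design space" to sweep */
    int best_nb = 0;
    double best_time = 1e30;

    for (unsigned v = 0; v < sizeof candidates / sizeof *candidates; v++) {
        clock_t t0 = clock();
        matmul_blocked(candidates[v], A, B, C);
        double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("nb = %3d : %.3f s\n", candidates[v], t);
        if (t < best_time) { best_time = t; best_nb = candidates[v]; }
    }
    /* A real generator would now emit a library fixed to best_nb. */
    printf("best block size: %d\n", best_nb);
    free(A); free(B); free(C);
    return 0;
}
```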
ATLAS Matrix Multiply Intel Pentium 4 at 2.53 GHz, using SSE2 [Chart: DGEMM performance, P4 32-bit fl pt using SSE2 and P4 64-bit fl pt using SSE2] ~$1000 for the system => $0.25/Mflops!
Experiments with C, Fortran, and Java for ATLAS (DGEMM kernel)
Recursive Approach for Other Level 3 BLAS [Figure: recursive partitioning of the triangular matrix] Recursive TRMM • Recur down to L1 cache block size • Need kernel at bottom of recursion • Use gemm-based kernel for portability
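A minimal sketch of the recursive idea for TRMM (B := L·B with L lower triangular), not the ATLAS implementation itself: split the operands, recurse until the blocks fit in L1, and express the off-diagonal update as a GEMM so a portable gemm-based kernel can sit at the bottom. The base-case kernels here are naive triple loops standing in for the tuned L1 kernels.

```c
#define NB 64  /* assumed L1-cache block size */

/* Naive base-case TRMM: B := L*B, L lower triangular, row-major storage. */
static void trmm_kernel(int n, int m, const double *L, int ldl,
                        double *B, int ldb)
{
    for (int i = n - 1; i >= 0; i--)   /* bottom-up so rows k < i are still old */
        for (int j = 0; j < m; j++) {
            double s = 0.0;
            for (int k = 0; k <= i; k++)
                s += L[i*ldl + k] * B[k*ldb + j];
            B[i*ldb + j] = s;
        }
}

/* Naive base-case GEMM: C += A*B, where C is n1 x m and A is n1 x n2. */
static void gemm_kernel(int n1, int n2, int m, const double *A, int lda,
                        const double *B, int ldb, double *C, int ldc)
{
    for (int i = 0; i < n1; i++)
        for (int k = 0; k < n2; k++)
            for (int j = 0; j < m; j++)
                C[i*ldc + j] += A[i*lda + k] * B[k*ldb + j];
}

/* Recursive TRMM.  Partition L = [L11 0; L21 L22], B = [B1; B2]:
 *   B2 := L22*B2, then B2 += L21*B1  (B1 still holds its original value),
 *   then B1 := L11*B1. */
static void trmm_rec(int n, int m, const double *L, int ldl, double *B, int ldb)
{
    if (n <= NB) { trmm_kernel(n, m, L, ldl, B, ldb); return; }
    int n1 = n / 2, n2 = n - n1;
    trmm_rec(n2, m, L + n1*ldl + n1, ldl, B + n1*ldb, ldb);          /* B2 := L22*B2 */
    gemm_kernel(n2, n1, m, L + n1*ldl, ldl, B, ldb, B + n1*ldb, ldb); /* B2 += L21*B1 */
    trmm_rec(n1, m, L, ldl, B, ldb);                                  /* B1 := L11*B1 */
}
```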
Intel PIII 933 MHz, MKL 5.0 vs ATLAS 3.2.0, using Windows 2000 • ATLAS is faster than all other portable BLAS implementations and is comparable with machine-specific libraries provided by the vendor.
Machine-Assisted Application Development and Adaptation • Communication libraries • Optimize for the specifics of one's configuration • Algorithm layout and implementation • Look at the different ways to express the implementation
Work in Progress: ATLAS-like Approach Applied to Broadcast (PII 8-way cluster with 100 Mb/s switched network) [Chart legend: sequential, binary, binomial, and ring broadcast algorithms from the root]

Message Size   Optimal      Buffer Size
(bytes)        algorithm    (bytes)
8              binomial     8
16             binomial     16
32             binary       32
64             binomial     64
128            binomial     128
256            binomial     256
512            binomial     512
1K             sequential   1K
2K             binary       2K
4K             binary       2K
8K             binary       2K
16K            binary       4K
32K            binary       4K
64K            ring         4K
128K           ring         4K
256K           ring         4K
512K           ring         4K
1M             binary       4K
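A minimal sketch of how such a tuned table might be consulted at run time, assuming a root-0 broadcast: the binomial tree is implemented in full, while the binary, ring, and sequential variants fall back to MPI_Bcast here rather than the real tuned implementations. The thresholds mirror the table above; everything else is illustrative.

```c
#include <mpi.h>

/* Binomial-tree broadcast from rank 0 (one of the tuned variants). */
static void bcast_binomial(void *buf, int count, MPI_Datatype dt, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    for (int mask = 1; mask < size; mask <<= 1) {
        if (rank < mask) {
            if (rank + mask < size)                  /* already have the data: send on */
                MPI_Send(buf, count, dt, rank + mask, 0, comm);
        } else if (rank < 2 * mask) {                /* our turn to receive */
            MPI_Recv(buf, count, dt, rank - mask, 0, comm, MPI_STATUS_IGNORE);
        }
    }
}

/* Dispatch on message size, following the tuned table above.  The binary,
 * ring, and sequential regions are stood in for by MPI_Bcast in this sketch. */
void tuned_bcast(void *buf, int bytes, MPI_Comm comm)
{
    if (bytes <= 16 || (bytes >= 64 && bytes <= 512))
        bcast_binomial(buf, bytes, MPI_BYTE, comm);  /* binomial region */
    else
        MPI_Bcast(buf, bytes, MPI_BYTE, 0, comm);    /* binary/ring/sequential region */
}
```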
Reformulating/Rearranging/Reuse • Example is the reduction to narrow band form for the SVD • Fetch each entry of A once • Restructure and combine operations • Results in a speedup of > 30%
CG Variants by Dynamic Selection at Run Time • Variants combine inner products to reduce the communication bottleneck at the expense of more scalar ops • Same number of iterations, so no advantage on a sequential processor • With a large number of processors and a high-latency network they may be advantageous • Improvements can range from 15% to 50% depending on size
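As an illustration of the kind of rearrangement such a variant makes (a sketch, not the specific CG variant used in SANS): two inner products needed in a CG step can be accumulated in one pass and reduced with a single MPI_Allreduce, halving the number of latency-bound global synchronizations per iteration.

```c
#include <mpi.h>

/* Standard CG needs dot(r,r) and, in the rearranged variants, a second
 * inner product such as dot(r,w) in the same step.  Computing both locally
 * and reducing them together costs one global synchronization, not two. */
void fused_dots(const double *r, const double *w, int n_local,
                MPI_Comm comm, double *rr, double *rw)
{
    double local[2] = {0.0, 0.0}, global[2];
    for (int i = 0; i < n_local; i++) {
        local[0] += r[i] * r[i];
        local[1] += r[i] * w[i];
    }
    MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, comm);
    *rr = global[0];
    *rw = global[1];
}
```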
Solving Large Sparse Non-Symmetric Systems of Linear Equations Using BiCG-Stab • Example of optimizations • Combining 2 vector ops into 1 loop • Simplifying indexing • Removal of the "if test" within the loop
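A sketch of the loop-level transformations listed above, with made-up vectors rather than the actual BiCG-Stab code: two AXPY-style updates are fused into one loop, and a test that is invariant across iterations is hoisted out of the loop body.

```c
/* Before: two passes over the vectors, with a loop-invariant test inside.
 *
 *   for (i = 0; i < n; i++) x[i] += alpha * p[i];
 *   for (i = 0; i < n; i++) { if (use_shift) s[i] = r[i] - omega * v[i];
 *                             else           s[i] = r[i]; }
 *
 * After: one fused pass, branch hoisted outside the loop. */
void fused_update(int n, double alpha, double omega, int use_shift,
                  double *x, const double *p, double *s,
                  const double *r, const double *v)
{
    if (use_shift) {
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            s[i]  = r[i] - omega * v[i];
        }
    } else {
        for (int i = 0; i < n; i++) {
            x[i] += alpha * p[i];
            s[i]  = r[i];
        }
    }
}
```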
Split ADI Method • Originally from a radiation transport problem • Solution of medium-size dense blocks • Well-conditioned => iterative method • Write A = D − E = D(I − N) with N = D⁻¹E; take M⁻¹ = (I + N)D⁻¹ • Left-preconditioning: M⁻¹A = I − N² • D⁻¹ is cheap; the question is how efficient y = N²x is • Can we beat 2 matrix-vector operations?
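For the record, the algebra behind the preconditioned operator, assuming the reconstructed reading N = D⁻¹E:

```latex
\[
  A = D - E = D(I - N), \qquad N = D^{-1}E, \qquad M^{-1} = (I + N)D^{-1},
\]
\[
  M^{-1}A = (I + N)\,D^{-1}D\,(I - N) = (I + N)(I - N) = I - N^{2}.
\]
```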
Performance of the L1-Cache A²x Kernel • Avoiding 2 calls to the MV routine • Do blocking to capture peak • 10% improvement from this simple optimization
ScaLAPACK • ScaLAPACK is a portable distributed memory numerical library • Complete numerical library for dense matrix computations • Designed for distributed parallel computing (MPP & Clusters) using MPI • One of the first math software packages to do this • Numerical software that will work on a heterogeneous platform • Funding from DOE, NSF, and DARPA • In use today by IBM, HP-Convex, Fujitsu, NEC, Sun, SGI, Cray, NAG, IMSL, … • Tailor performance & provide support
To Use ScaLAPACK a User Must: • Download the package and auxiliary packages (like PBLAS, BLAS, BLACS, & MPI) to the machines. • Write an SPMD program which • Sets up the logical 2-D process grid • Places the data on the logical process grid • Calls the numerical library routine in a SPMD fashion • Collects the solution after the library routine finishes • The user must allocate the processors and decide the number of processes the application will run on • The user must start the application • “mpirun –np N user_app” • Note: the number of processors is fixed by the user before the run; if the problem size changes dynamically … • Upon completion, return the processors to the pool of resources
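To make the steps concrete, here is a minimal sketch of a ScaLAPACK SPMD driver in C, solving Ax = b with pdgesv. The grid shape, block size, and local data initialization are placeholder assumptions; a real program would distribute its actual data onto the 2-D block-cyclic layout, and would guard for processes outside the grid.

```c
#include <stdlib.h>
/* BLACS / ScaLAPACK prototypes (Fortran routines called from C) */
extern void Cblacs_pinfo(int *mypnum, int *nprocs);
extern void Cblacs_get(int ctxt, int what, int *val);
extern void Cblacs_gridinit(int *ctxt, char *order, int nprow, int npcol);
extern void Cblacs_gridinfo(int ctxt, int *nprow, int *npcol, int *myrow, int *mycol);
extern void Cblacs_gridexit(int ctxt);
extern int  numroc_(int *n, int *nb, int *iproc, int *isrcproc, int *nprocs);
extern void descinit_(int *desc, int *m, int *n, int *mb, int *nb, int *irsrc,
                      int *icsrc, int *ctxt, int *lld, int *info);
extern void pdgesv_(int *n, int *nrhs, double *a, int *ia, int *ja, int *desca,
                    int *ipiv, double *b, int *ib, int *jb, int *descb, int *info);

int main(void)
{
    int n = 1000, nrhs = 1, nb = 64;             /* problem and block size (assumed) */
    int iam, nprocs, ctxt, nprow = 2, npcol = 2; /* 2x2 grid; assumes 4 processes  */
    int myrow, mycol, izero = 0, ione = 1, info;

    /* 1. Set up the logical 2-D process grid. */
    Cblacs_pinfo(&iam, &nprocs);
    Cblacs_get(0, 0, &ctxt);
    Cblacs_gridinit(&ctxt, "Row", nprow, npcol);
    Cblacs_gridinfo(ctxt, &nprow, &npcol, &myrow, &mycol);

    /* 2. Place the data on the grid: allocate the local pieces of A and b. */
    int mloc = numroc_(&n, &nb, &myrow, &izero, &nprow);
    int nloc = numroc_(&n, &nb, &mycol, &izero, &npcol);
    int lld  = (mloc > 0) ? mloc : 1;
    double *A = malloc((size_t)mloc * nloc * sizeof *A);
    double *b = malloc((size_t)mloc * sizeof *b);
    int *ipiv = malloc((mloc + nb) * sizeof *ipiv);
    int descA[9], descB[9];
    descinit_(descA, &n, &n, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);
    descinit_(descB, &n, &nrhs, &nb, &nb, &izero, &izero, &ctxt, &lld, &info);
    /* ... fill A and b with the local parts of the user's matrix ... */

    /* 3. Call the numerical routine in SPMD fashion. */
    pdgesv_(&n, &nrhs, A, &ione, &ione, descA, ipiv, b, &ione, &ione, descB, &info);

    /* 4. Collect the solution (now distributed in b) and release the grid. */
    free(A); free(b); free(ipiv);
    Cblacs_gridexit(ctxt);
    return 0;
}
```

With the process count fixed in advance, this would be launched exactly as the slide describes, e.g. mpirun –np 4 user_app.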
ScaLAPACK Cluster Enabled • Implement a version of a ScaLAPACK library routine that runs on clusters. • Make use of resources at the user’s disposal • Provide the best time to solution • Proceed without the user’s involvement • Make as few changes as possible to the numerical software.
ScaLAPACK Performance Model • Total number of floating-point operations per processor • Total number of data items communicated per processor • Total number of messages • Time per floating point operation • Time per data item communicated • Time per message
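These six quantities combine into the familiar three-term distributed-memory time model; a sketch of its form (the notation here is assumed, not taken from the slide):

```latex
\[
  T(n,p) \;\approx\; F(n,p)\,t_f \;+\; V(n,p)\,t_v \;+\; M(n,p)\,t_m
\]
```

where F is the number of floating-point operations per processor, V the number of data items communicated per processor, M the number of messages, and t_f, t_v, t_m the times per flop, per data item, and per message.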
LAPACK For Clusters • Developing middleware which couples cluster system information with the specifics of a user problem to launch cluster-based applications on the “best” set of resources available • Using ScaLAPACK as the prototype software
Big Picture… The user has a problem to solve (e.g. Ax = b): Natural Data (A, b) → Middleware → Structured Data (A′, b′) → Application Library (e.g. LAPACK, ScaLAPACK, PETSc, …) → Structured Answer (x′) → Middleware → Natural Answer (x)
Numerical Libraries for Clusters (diagram built up over four slides) • The user stages the data (A, b) to disk • The library is reached through a middleware layer • The middleware performs resource selection by time-function minimization, using monitoring information from NWS, Autopilot, and MDS • The selected processes form the logical grid (0,0, 0,1, …) on which the routine runs • Can use Grid infrastructure, i.e. Globus/NWS, but doesn’t have to
Resource Selector • Uses information on bandwidth, latency, load, memory, and CPU performance • 2 matrices (bandwidth, latency) and 3 arrays (load, CPU, memory available) • Generated dynamically by the library routine [Plots: CPU performance, memory, bandwidth, load, and latency across the cluster]
LAPACK For Clusters (LFC) • LFC will automate many of the decisions in the cluster environment to provide the best time to solution • Adaptivity to the dynamic environment: as the complexity of clusters and the Grid increases, we need to develop strategies for self-adaptability • Handcrafted development leading to an automated design • Developing a basic infrastructure for computational science applications and software in the cluster and Grid environment; lack of tools is hampering development today • Plan to cover LU, Cholesky, QR, symmetric eigenvalue, and nonsymmetric eigenvalue
Future SANS Effort • Intelligent Component (IC) • Automates method selection based on data, algorithm, and system attributes • System component • Provides intelligent management of and access to the computational grid • History database • Records relevant info generated by the IC and maintains past performance data
NetSolve to Determine the Best Algorithm/Software (a bit in the future) • x = netsolve(‘linearsolve’,A,b) • NetSolve examines the user’s data together with the resources available (in the grid sense) and makes decisions on best time to solution dynamically • Decisions based on size of problem, hardware/software, network connection, etc. • Method and placement [Decision tree on matrix type: sparse vs. dense; direct vs. iterative; symmetric, general, or rectangular]
Decisions • What is the matrix shape? • Rectangular or square • How dense is the matrix? • Checks the percent of non-zeros • Is the matrix banded? • What is the diagonal structure? • Zeros on diagonal • Alternating signs • Are there dense blocks? • Special structure • Zero 2x2 blocks
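A sketch of the kind of inspection these decisions rest on, for a matrix in CSR format (the format and the threshold are assumptions, not NetSolve's actual code): it measures the density of non-zeros and the bandwidth, two of the properties listed above.

```c
#include <stdlib.h>

/* A sparse matrix in compressed sparse row (CSR) form. */
typedef struct {
    int n;             /* matrix dimension (square case)    */
    const int *rowptr; /* length n+1                        */
    const int *colind; /* column index of each stored entry */
} csr_t;

/* Fraction of entries that are stored (non-zero). */
double density(const csr_t *A)
{
    return (double)A->rowptr[A->n] / ((double)A->n * A->n);
}

/* Half-bandwidth: the largest distance of any entry from the diagonal. */
int bandwidth(const csr_t *A)
{
    int bw = 0;
    for (int i = 0; i < A->n; i++)
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++) {
            int d = abs(i - A->colind[k]);
            if (d > bw) bw = d;
        }
    return bw;
}

/* Example decision: treat as dense above ~25% fill (threshold assumed). */
int looks_dense(const csr_t *A) { return density(A) > 0.25; }
```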
Sparse Solvers • Direct or iterative? • Size of the problem • Fill-in • Memory available? • Numerical properties • Start iterative method and monitor progress • Estimate number of iterations
Heuristics • If sparse and symmetric • Run CG and see if breakdown occurs • Try on A + αI, where α is some fraction of what’s needed to make the matrix diagonally dominant (QMR or GMRES might work) • For the nonsymmetric case, adaptively assess the spectrum, monitor the residual and Ritz values
Structure • Domain-partitioning methods • Additive or Multiplicative Schwarz, or Schur-complement Domain Decomposition. • Algebraic MultiGrid methods can also be used if the matrix has a discernible structure. • Close enough to elliptic • For more general matrices, we can use an ILU preconditioner. • One variant that is reasonably robust is the Manteuffel ILU preconditioner
NetSolve - Grid Enabled Server • NetSolve is an example of a Grid-based hardware/software server • Based on a Remote Procedure Call model, but with … • resource discovery, dynamic problem-solving capabilities, load balancing, fault tolerance, asynchronicity, security, … • Ease of use is paramount • Other examples are NEOS from Argonne and Ninf from Japan
NetSolve: The Big Picture (diagram built up over four slides) • Clients (Matlab, Octave, Scilab, Mathematica, C, Fortran) talk to the agent(s), which keep a schedule database over servers S1-S4 and IBP depots; no knowledge of the grid required, RPC-like • The client stores its data (A, B), e.g. in an IBP depot, and receives a handle back • The client sends its request; the agent answers with a server choice (“S2!”) • The client passes the operation and the handle (OP, handle) to S2, which computes Op(C, A, B) • The answer (C) is returned to the client
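From the client side the whole exchange collapses into a single call. A sketch assuming the classic NetSolve C interface (the blocking call netsl and its non-blocking pair netslnb/netslwt); "dmatmul()" is a hypothetical problem name, and the argument list depends on the server's problem description files.

```c
#include <stdio.h>
#include "netsolve.h"   /* NetSolve client header (assumed installation path) */

int main(void)
{
    double A[100*100], B[100*100], C[100*100];
    int n = 100;
    /* ... fill A and B ... */

    /* Blocking RPC-style call: the agent picks a server, the server runs the
     * operation, and the result lands in C.  "dmatmul()" is hypothetical. */
    int status = netsl("dmatmul()", n, A, B, C);
    if (status < 0)
        netslerr(status);   /* print the NetSolve error (assumed helper) */

    /* Non-blocking variant: get a request handle, do other work, then wait. */
    int request = netslnb("dmatmul()", n, A, B, C);
    /* ... other work overlapped with the remote computation ... */
    status = netslwt(request);
    return 0;
}
```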
Hiding the Parallel Processing • The user may be unaware of the parallel processing • NetSolve takes care of starting the message-passing system, distributing the data, and returning the results.