290 likes | 317 Views
High Performance Information Retrieval and MEMS CAD on Intel Itanium. James Demmel Mathematics and EECS UC Berkeley www.cs.berkeley.edu/~demmel/Itanium_121001.ppt Joint work with David Culler, Michael Jordan, William Kahan, Katherine Yelick, Zhaojun Bai (UC Davis). Outline.
E N D
High Performance Information Retrieval and MEMS CADon Intel Itanium James Demmel Mathematics and EECS UC Berkeley www.cs.berkeley.edu/~demmel/Itanium_121001.ppt Joint work with David Culler, Michael Jordan, William Kahan, Katherine Yelick, Zhaojun Bai (UC Davis)
Outline • What’s up with Millennium • Automatic performance tuning • Applications to • SUGAR – a MEMS CAD system • Information Retrieval • Future Work
Millennium • Cluster of clusters at UC Berkeley • 309 CPU cluster in Soda Hall • Smaller clusters across campus • Made possible by Intel equipment grant • Significant other support • NSF, Sun, Microsoft, Nortel, campus • www.millennium.berkeley.edu
Millennium Usage Oct 1 – 11, 2001 100% utilization for last few days About half the jobs are parallel
Usage highlights • AMANDA • Antarctic Muon And Neutrino Detector Array • amanda.berkeley.edu • 128 scientists from 15 universities and institutes in the U.S. and Europe. • TEMPEST • EUV lithography simulations via 3D electromagnetic scattering • cuervo.eecs.berkeley.edu/Volcano/ • study the defect printability on multilayer masks • Titanium • High performance Java dialect for scientific computing • www.cs.berkeley.edu/projects/titanium • Implementation of shared address space, and use of SSE2 • Digital Library Project • Large database of images • elib.cs.berkeley.edu/ • Used to run spectral image segmentation algorithm for clustering, search on images
Usage highlights (continued) • CS 267 • Graduate class in parallel computing, 33 enrolled • www.cs.berkeley.edu/~dbindel/cs267ta • Homework • Disaster Response • Help find people after Sept 11, set up immediately afterwards • safe.millennium.berkeley.edu • 48K reports in database, linked to other survivor databases • MEMS CAD (MicroElectroMechanical Systems Computer Aided Design) • Tool to help design MEMS systems • Used this semester in EE 245, 93 enrolled • sugar.millennium.berkeley.edu • More later in talk • Information Retrieval • Development of faster information retrieval algorithms • www.cs.berkeley.edu/~jordan • More later in talk • Many applications are part of CITRIS
Performance Tuning • Motivation: performance of many applications dominated by a few kernels • MEMS CAD Nonlinear ODEs Nonlinear equations Linear equations Matrix multiply • Matrix-by-matrix or matrix-by-vector • Dense or Sparse • Information retrieval by LSI Compress term-document matrix … Sparse mat-vec multiply • Information retrieval by LDA Maximum likelihood estimation … Solve linear systems • Many other examples (not all linear algebra)
Conventional Performance Tuning • Vendor or user hand tunes kernels • Drawbacks: • Very time consuming and difficult work • Even with intimate knowledge of architecture and compiler, performance hard to predict • Must be redone for every architecture, compiler • Compiler technology often lags architecture • Not just a compiler problem: • Best algorithm may depend on input, so some tuning must occur at run-time. • Not all algorithms semantically or mathematically equivalent
Automatic Performance Tuning • Approach: for each kernel • Identify and generate a space of algorithms • Search for the fastest one, by running them • What is a space of algorithms? • Depends on kernel and input • May vary • instruction mix and order • memory access patterns • data structures • mathematical formulation • When do we search? • Once per kernel and architecture • At compile time • At run time
Tuning pays off – ATLAS (Dongarra, Whaley) Extends applicability of PHIPAC Incorporated in Matlab (with rest of LAPACK)
Other Automatic Tuning Projects • FFTs and Signal Processing • FFTW (www.fftw.org) • Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of n • Widely used, won 1999 Wilkinson Prize for Numerical Software • SPIRAL (www.ece.cmu.edu/~spiral) • Extensions to other transforms, DSPs • UHFFT • Extensions to higher dimension, parallelism • Special session at ICCS 2001 • Organized by Yelick and Demmel • www.ucalgary.ca/iccs • Proceedings available • Pointers to automatic tuning projects at www.cs.berkeley.edu/~yelick/iccs-tune
Search for optimal register tile sizes on Sun Ultra 10 16 registers, but 2-by-3 tile size fastest
Search for Optimal L0 block size in dense matmul 4% of versions exceed 60% of peak on Pentium II-300
Tuning Sparse matrix operations • Sparsity • Optimizes y = A*x for a particular sparse A • Im and Yelick • Algorithm space • Different code organization, instruction mixes • Different register blockings (change data structure and fill of A) • Different cache blocking • Different number of columns of x • Different matrix orderings • Software and papers available • www.cs.berkeley.edu/~yelick/sparsity
Tuning other sparse operations • Symmetric matrix-vector multiply A*x • Solve a triangular system of equations T-1*x • AT*A*x • Kernel of Information Retrieval via LSI • Same number of memory references as A*x • A2*x, Ak*x • Kernel of Information Retrieval used by Google • Changes calling algorithm • AT*M*A • Matrix triple product • Used in multigrid solver • …
Applications to SUGAR – a tool for MEMS CAD • Demmel, Bai, Pister, Govindjee, Agogino, Gu, … • Input: description of MicroElectroMechanical System (as netlist) • Output: • DC, steady state, modal, transient analyses to assess behavior • CIF for fabrication • Simulation capabilities • Beams and plates (linear, nonlinear, prestressed,…) • Electrostatic forces, circuits • Thermal expansion, Couette damping • Availability • Matlab • Publicly available • www-bsac.eecs.berkeley.edu/~cfm • 249 registered users, many unregistered • Web service – M & MEMS • Runs on Millennium • sugar.millennium.berkeley.edu • Now in use in EE 245 at UCB…96 users • Lots of new features being added, including interface to measurements
Micromirror (Last, Pister) • Laterally actuated torsionally suspended micromirror • Over 10K dof, 100 line netlist (using subnets) • DC and frequency analysis • All algorithms reduce to previous kernels
Information Retrieval • Jordan • Collaboration with Intel team building probabilistic graphical models • Better alternatives to LSI for document modeling and search • Latent Dirichlet Allocation (LDA) • Model documents as union of themes, each with own word distribution • Maximum likelihood fit to find themes in set of documents, classify them • Computational bottleneck is solution of enormous linear systems • One of largest Millennium users • Kernel ICA • Estimate set of sources s and mixing matrix A from samples x = A*s • New way to sample such that sources are as independent as possible • Again reduces to linear algebra kernels • Identifying influential documents • Given hyperlink patterns of documents, which are most influential? • Basis of Google (eigenvector of link matrix sparse matrix vector multiply) • Applying Markov chain and perturbation theory to assess reliability
Future Work • Exploit Itanium Architecture • 128 (82-bit) floating pointer registers • fused multiply-add instruction • predicated instructions • rotating registers for software pipelining • prefetching instructions • three levels of cache • Tune current and wider set of kernels • Incorporate into • SUGAR • Information Retrieval • Further automate performance tuning • Generation of algorithm space generators