Resolution of large symmetric eigenproblems on a world-wide grid
Laurent Choy, Serge Petiton, Mitsuhisa Sato
CNRS/LIFL; HPCS Lab., University of Tsukuba
2nd NEGST workshop, Tokyo, May 28-29, 2007
Outline
• Introduction
• Distribution of the numerical method
• Experiments
  • Experiments on world-wide grids: platforms, numerical settings
  • Experiments on Grid'5000: motivations, platforms, numerical settings
  • Results
• YML
  • Progress of YML
  • YvetteML workflow of the real symmetric eigenproblem
  • First experiments
• Conclusion
Introduction
• Huge number of nodes connected to the Internet
  • Clusters and NOWs of institutions, PCs of individual users
  • Volunteer computing
  • Constant availability of nodes, on-demand access
• HPC and large grid computing are complementary
  • We do not target the highest performance
  • We target a different community of users
• Why the real symmetric eigenproblem?
  • It requires a lot of resources on the nodes
  • Communications, synchronization points
  • A useful problem
  • Few similar studies for very large grid computing
Distribution of the numerical method (1/2)
• Real symmetric eigenproblem
  • Au = λu, A real symmetric
• Main steps (sketched below):
  • Lanczos tridiagonalization
    • T = QᵀAQ, T real symmetric tridiagonal
    • Data accessed by means of matrix-vector products (MVPs)
  • Bisection and inverse iteration
    • Tv = λv, same eigenvalues as A (Ritz values)
    • Communication-free parallelism: task-farming
  • Ritz eigenvector computation (u = Qv)
  • Accuracy tests: ‖Au - λu‖₂ < ε
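To make the pipeline concrete, here is a minimal serial numpy sketch of these steps on a small dense test matrix. It is an illustration only, not the distributed OmniRPC solver from the talk: the talk computes the eigenpairs of T by bisection and inverse iteration and farms that work out, while this sketch calls a dense eigensolver on T for brevity; all function and variable names are mine.

```python
import numpy as np

def lanczos(A, m, rng):
    """m Lanczos steps: returns Q (n x m, orthonormal) and T = Q^T A Q."""
    n = A.shape[0]
    Q = np.zeros((n, m))
    alpha, beta = np.zeros(m), np.zeros(m)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    for j in range(m):
        Q[:, j] = q
        w = A @ q                                  # one MVP per step: the dominant cost
        alpha[j] = q @ w
        w -= alpha[j] * q
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]
        w -= Q[:, :j + 1] @ (Q[:, :j + 1].T @ w)   # full reorthogonalization
        beta[j] = np.linalg.norm(w)
        q = w / beta[j]
    return Q, np.diag(alpha) + np.diag(beta[:-1], 1) + np.diag(beta[:-1], -1)

rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                   # real symmetric test matrix

Q, T = lanczos(A, m=30, rng=rng)
theta, S = np.linalg.eigh(T)        # Ritz values; the talk uses bisection + inverse iteration here
U = Q @ S                           # Ritz eigenvectors u = Qv
residuals = np.linalg.norm(A @ U - U * theta, axis=0)   # accuracy test ||Au - lambda*u||_2
print(theta[-3:], residuals[-3:])   # extremal Ritz pairs converge first
```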
Distribution of the numerical method (2/2)
• Reducing the memory usage
  • Out-of-core storage (see the sketch after this list)
  • Restarted scheme
    • Reorthogonalization
    • Bisection, inverse iteration
    • Reduces the disk usage too
• Reducing the volume of communications
  • Data persistence (A and Q stay on the nodes)
• Reducing the number of communications
  • Task-farming
• Other issue to be improved: distribution of A
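As a rough illustration of the out-of-core idea, the sketch below streams the rows of A from disk in fixed-size blocks during the MVP, so only one block is resident in memory at a time. The numpy memmap, file name, and sizes are stand-ins for the per-node block files of the real solver, which in addition keeps the blocks on the nodes across calls (data persistence).

```python
import numpy as np

n, block = 2000, 250                     # hypothetical sizes for the illustration
rng = np.random.default_rng(0)

# Write A to disk once; afterwards only row blocks of it are paged in.
A = np.memmap("A_blocks.dat", dtype=np.float64, mode="w+", shape=(n, n))
A[:] = rng.standard_normal((n, n))
A.flush()

def mvp_out_of_core(A, v, block):
    """y = A v, reading A one block of rows at a time from disk."""
    y = np.empty(len(v))
    for i in range(0, len(v), block):
        rows = np.asarray(A[i:i + block])   # load one block into memory
        y[i:i + block] = rows @ v
    return y

v = rng.standard_normal(n)
y = mvp_out_of_core(A, v, block)
print(np.allclose(y, A @ v))             # sanity check against the in-memory product
```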
World-wide grid experiments: experimental platforms, numerical settings (1/2)
• Computing and network resources
  • University of Tsukuba
    • Homogeneous dedicated clusters
    • Dual Xeon ~3 GHz, 1 to 4 GB RAM
  • University of Lille 1
    • Heterogeneous NOWs
    • Celeron 1.4 GHz to Pentium 4 3.2 GHz, 128 MB to 1 GB RAM
    • Shared with students
  • Interconnected through the Internet
World-wide grid experiments: experimental platforms, numerical settings (2/2)
• 4 platforms, all using OmniRPC
  • 2 local platforms: 29 / 58 nodes, Lille
  • 2 world-wide platforms:
    • 58 nodes (29 Lille + 29 Tsukuba, dual-proc.)
    • 116 nodes (58 Lille + 58 Tsukuba, dual-proc.)
• Matrix
  • N = 47792
  • 2.5 million elements, avg. 48 nonzeros per row
• Parameters
  • m = 10, 15, 20, 25
  • k = 1, 2, 3, 4
Grid'5000 experiments: presentation, motivations
• Up to 9 sites distributed across France
• Dedicated PCs with a reservation policy
• Fast, dedicated network: RENATER (1 Gbit/s to 10 Gbit/s)
• PCs are homogeneous (few exceptions)
• Homogeneous software environment (deployment strategy)
• For these experiments:
  • Orsay: up to 300 single-CPU nodes
  • Lille: up to 60 single-CPU nodes
  • Sophia: up to 60 dual-CPU nodes
  • Rennes: up to 70 dual-CPU nodes
Grid'5000 experiments: platforms and numerical settings (1/2)
• Step 1
  • Goal: improving the previous analysis
  • Platforms:
    • 29 Orsay, single-proc.
    • 58 Orsay, single-proc.
    • 58 Lille + Sophia, dual-proc.
    • 116 Orsay + Sophia, dual-proc. (1 core/proc.)
    • 116 Orsay + Lille + Sophia, dual-proc. (1 core/proc.)
  • 1 process per dual-processor node
  • Numerical settings
    • Matrix: N = 47792, 2.5 million elements, avg. 48 nonzeros per row
    • Parameters: m = 10, 15, 20, 25; k = 1, 2, 3, 4
Grid'5000 experiments: platforms and numerical settings (2/2)
• Step 2
  • Goal: increasing the size of the problem (in progress)
  • Matrix: N = 430128, 193 million elements
  • Platforms:
    • 7 OmniRPC relay nodes, 206 CPUs, 3 sites
    • 11 OmniRPC relay nodes, 412 CPUs, 4 sites
  • Parameters: k = 1, m = 15
World-wide grid experiments: results
[Charts of wall-clock times for the four platforms: 29 single-proc. Lille; 58 single-proc. Lille; 58 single-proc. Lille + dual-proc. Tsukuba (all processors used); 116 single-proc. Lille + dual-proc. Tsukuba (all processors used)]
Grid'5000 experiments – step 1: results
[Charts of wall-clock times for the five platforms: 29 single-proc. Orsay; 58 single-proc. Orsay; 58 single-proc. Lille + dual-proc. Sophia (all processors used); 116 single-proc. Orsay + dual-proc. Sophia (all processors used); 116 single-proc. Orsay + single-proc. Lille + dual-proc. Sophia (1 processor used)]
Grid'5000 experiments – step 2: results

Details for N = 430128, m = 15, k = 1 (wall-clock times in seconds)

                                    206 CPUs    412 CPUs
Total wall-clock time                  10962       13150
Lanczos tridiagonalization
  Send new column of Q                    22          20
  MVP                                  10106       12311
  Reorthogonalization                    129         159
Bisection + inverse iteration             <1           9
Ritz eigenvectors                          9          11
Accuracy tests ‖Au - λu‖₂ < ε            691         810

• Estimated wall-clock time of one MVP with the matrix A
  • In the tridiagonalization: 15 (m) × 5 (restarts) = 75 MVPs, i.e. 134 s (206 CPUs) and 164 s (412 CPUs) per MVP
  • In the convergence tests: 5 (one per restart) MVPs, i.e. 138 s (206 CPUs) and 162 s (412 CPUs) per MVP
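The per-MVP estimates follow directly from the table entries; a few lines of Python reproduce the arithmetic:

```python
# One restart performs m MVPs in the tridiagonalization and one MVP in
# the convergence tests; this run used 5 restarts.
m, restarts = 15, 5
for cpus, t_tridiag, t_tests in [(206, 10106, 691), (412, 12311, 810)]:
    print(cpus, "CPUs:",
          t_tridiag // (m * restarts), "s/MVP (tridiagonalization),",
          t_tests // restarts, "s/MVP (convergence tests)")
```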
Progress of YML
• YML 1.0.5
  • Stability, error reporting
  • Collections of data (out-of-core)
  • Variable lists of parameters
  • Parameters in/out of the workflow
• Mainly developed at the PRiSM laboratory, University of Versailles (Olivier Delannoy, Nahid Emad)
• http://yml.prism.uvsq.fr/
Resolution of the eigenproblem with YML
• No data persistence
  • Future work: binary cache
• Re-usability / aggregation of components
Experiments with YML & OmniRPC back-end
[Tables of wall-clock times in minutes: YML + OmniRPC back-end vs. plain OmniRPC, and the resulting overhead in %]
• Sources of overhead
  • No computation in the YvetteML workflow
  • Scheduler, (un)packing the parameters
  • Transfers of binaries
Conclusion (1/3)
• Reminder of the scope of this work
  • Large grid computing and HPC are complementary tools
  • Used by people who have no access to HPC
  • Significant computations (size of the problem)
  • We do not (and cannot) target the highest performance
    • The resources are not dedicated
    • Slow networks, heterogeneous machines, external perturbations, etc.
  • Linear algebra problems are useful for many general applications
• Differences from HPC and cluster computing
  • We must not take a "speed-up" approach to the computations
  • Recommendations to save resources on the nodes
Conclusion (2/3)
• We propose a scalable real symmetric eigensolver for large grids
  • Next expected limiting factor: disk space, for much larger or very dense matrices
• Before implementing the method, key choices must be made
  • Numerical methods and programming paradigms:
    • Bisection (task-farming)
    • Restarted scheme (memory and disk)
    • Out-of-core (memory)
    • Data persistence (communication)
• New version of YML
  • Workflow of the eigensolver and re-usable components
  • In progress
Conclusion (3/3)
• Topics of study for the eigensolver
  • Improving the distribution of A
  • Testing more matrices
    • Different kinds of matrices (e.g. sparse, dense)
    • Larger matrices
  • Scheduling level: adapting the workload balancing to the heterogeneity of the platforms
• Current and future work on YML
  • Finishing the multi back-end support
  • Binary cache